Re: [PATCH v5 0/8] DAMON based tiered memory management for CXL memory
Hi SeongJae,

On Thu, 13 Jun 2024 10:46:04 -0700 SeongJae Park wrote:
> Hi Honggyu,
>
> On Thu, 13 Jun 2024 22:20:47 +0900 Honggyu Kim wrote:
> >
> > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > posted at [1].
> >
> > It said no implementation of the demote/promote DAMOS actions had been
> > made.  This patch series implements them for the physical address space
> > so that this scheme can be applied at the system-wide level.
> >
> > Changes from RFC v4:
> > https://lore.kernel.org/20240512175447.75943-1...@kernel.org
> >   1. Add usage and design documents
> >   2. Rename alloc_demote_folio to alloc_migrate_folio
> >   3. Add evaluation results with "demotion_enabled" true
> >   4. Rebase based on v6.10-rc3
>
> I left comments on the new patches for the documentation.
>
> [...]
>
> > Evaluation Results
> > ==================
> >
> > All the result values are normalized to the DRAM-only execution time
> > because the workload cannot be faster than DRAM-only unless it hits the
> > peak bandwidth, and our redis test doesn't go beyond the bandwidth
> > limit.
> >
> > So the DRAM-only execution time is the ideal result, unaffected by the
> > performance gap between DRAM and CXL.  The NUMA node environment is as
> > follows.
> >
> >   node0 - local DRAM, 512GB with a CPU socket (fast tier)
> >   node1 - disabled
> >   node2 - CXL DRAM, 96GB, no CPU attached (slow tier)
> >
> > The following is the result of generating a zipfian distribution to
> > redis-server; the numbers are averaged over 50 executions.
> >
> > 1. YCSB zipfian distribution read only workload
> >    memory pressure with cold memory on node0 with 512GB of local DRAM.
> >
> >   +-------------+------------------------------------------------+---------+
> >   |             |    cold memory occupied by mmap and memset     |         |
> >   |             |  0G   440G  450G  460G  470G  480G  490G  500G |         |
> >   +-------------+------------------------------------------------+---------+
> >   | Execution time normalized to DRAM-only values                | GEOMEAN |
> >   +-------------+------------------------------------------------+---------+
> >   | DRAM-only   | 1.00   -     -     -     -     -     -     -   |  1.00   |
> >   | CXL-only    | 1.19   -     -     -     -     -     -     -   |  1.19   |
> >   | default     |  -    1.00  1.05  1.08  1.12  1.14  1.18  1.18 |  1.11   |
> >   | DAMON tiered|  -    1.03  1.03  1.03  1.03  1.03  1.07 *1.05 |  1.04   |
> >   | DAMON lazy  |  -    1.04  1.03  1.04  1.05  1.06  1.06 *1.06 |  1.05   |
> >   +-------------+------------------------------------------------+---------+
> >   | CXL usage of redis-server in GB                              | AVERAGE |
> >   +-------------+------------------------------------------------+---------+
> >   | DRAM-only   |  0.0   -     -     -     -     -     -     -   |   0.0   |
> >   | CXL-only    | 51.4   -     -     -     -     -     -     -   |  51.4   |
> >   | default     |  -     0.6  10.6  20.5  30.5  40.5  47.6  50.4 |  28.7   |
> >   | DAMON tiered|  -     0.6   0.5   0.4   0.7   0.8   7.1   5.6 |   2.2   |
> >   | DAMON lazy  |  -     0.5   3.0   4.5   5.4   6.4   9.4   9.1 |   5.5   |
> >   +-------------+------------------------------------------------+---------+
> >
> > Each test result is based on the exeuction environment as follows.
>
> Nit.  s/exeuction/execution/

Thanks.  Fixed it.

> [...]
>
> > In summary, the evaluation results show that DAMON memory management
> > with DAMOS_MIGRATE_{HOT,COLD} actions reduces the performance slowdown
> > compared to the "default" memory policy from 11% to 3~5% when the
> > system runs with high memory pressure on its fast tier DRAM nodes.
> >
> > Having these DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD actions can make
> > tiered memory systems run more efficiently under high memory pressure.
>
> Thank you very much for continuing this great work.
>
> Other than trivial comments on documentation patches and the above typo,
> I have no particular concern on this patchset.  I'm looking forward to
> the next version.

I have addressed all your comments and sent v6.  Please have a look:

https://lore.kernel.org/20240614030010.751-1-honggyu@sk.com

Thanks very much for your review!

Thanks,
Honggyu

> Thanks,
> SJ
> [...]
[PATCH v6 7/7] Docs/damon: document damos_migrate_{hot,cold}
This patch adds a DAMON description of the "migrate_hot" and "migrate_cold"
actions to both the usage and design documents, along with the new
"target_nid" knob that sets the migration target node.

Signed-off-by: Honggyu Kim
---
 Documentation/admin-guide/mm/damon/usage.rst | 4 ++++
 Documentation/mm/damon/design.rst            | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index e58ceb89ea2a..98804e34448b 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -300,6 +300,10 @@ from the file and their meaning are same to those of the list on
 The ``apply_interval_us`` file is for setting and getting the scheme's
 :ref:`apply_interval ` in microseconds.
 
+The ``target_nid`` file is for setting the migration target node, which is
+only meaningful when the ``action`` is either ``migrate_hot`` or
+``migrate_cold``.
+
 .. _sysfs_access_pattern:
 
 schemes/<N>/access_pattern/
diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index 3df387249937..3f12c884eb3a 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -325,6 +325,10 @@ that supports each action are as below.
   Supported by ``paddr`` operations set.
 - ``lru_deprio``: Deprioritize the region on its LRU lists.
   Supported by ``paddr`` operations set.
+- ``migrate_hot``: Migrate the regions prioritizing warmer regions.
+  Supported by ``paddr`` operations set.
+- ``migrate_cold``: Migrate the regions prioritizing colder regions.
+  Supported by ``paddr`` operations set.
 - ``stat``: Do nothing but count the statistics.
   Supported by all operations sets.
-- 
2.34.1
[PATCH v6 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion
This patch introduces the DAMOS_MIGRATE_COLD action, which is similar to
DAMOS_PAGEOUT but migrates folios to the 'target_nid' given via sysfs
instead of swapping them out.  The 'target_nid' sysfs knob informs the
migration target node ID.

Here is an example usage of the 'migrate_cold' action:

  $ cd /sys/kernel/mm/damon/admin/kdamonds/<N>
  $ cat contexts/<N>/schemes/<N>/action
  migrate_cold
  $ echo 2 > contexts/<N>/schemes/<N>/target_nid
  $ echo commit > state
  $ numactl -p 0 ./hot_cold 500M 600M &
  $ numastat -c -p hot_cold

  Per-node process memory usage (in MBs)
  PID             Node 0 Node 1 Node 2 Total
  --------------  ------ ------ ------ -----
  701 (hot_cold)     501      0    601  1101

Since there are some common routines with pageout, many functions share
similar logic between pageout and migrate cold.
damon_pa_migrate_folio_list() is a minimized version of
shrink_folio_list().

Signed-off-by: Honggyu Kim
Signed-off-by: Hyeongtak Ji
Signed-off-by: SeongJae Park
---
 include/linux/damon.h    |   2 +
 mm/damon/paddr.c         | 154 +++++++++++++++++++++++++++++++++++++++
 mm/damon/sysfs-schemes.c |   1 +
 3 files changed, 157 insertions(+)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 21d6b69a015c..56714b6eb0d7 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -105,6 +105,7 @@ struct damon_target {
  * @DAMOS_NOHUGEPAGE:	Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
  * @DAMOS_LRU_PRIO:	Prioritize the region on its LRU lists.
  * @DAMOS_LRU_DEPRIO:	Deprioritize the region on its LRU lists.
+ * @DAMOS_MIGRATE_COLD:	Migrate the regions prioritizing colder regions.
  * @DAMOS_STAT:		Do nothing but count the stat.
  * @NR_DAMOS_ACTIONS:	Total number of DAMOS actions
  *
@@ -122,6 +123,7 @@ enum damos_action {
 	DAMOS_NOHUGEPAGE,
 	DAMOS_LRU_PRIO,
 	DAMOS_LRU_DEPRIO,
+	DAMOS_MIGRATE_COLD,
 	DAMOS_STAT,		/* Do nothing but only record the stat */
 	NR_DAMOS_ACTIONS,
 };
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 18797c1b419b..882ae54af829 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -12,6 +12,9 @@
 #include <linux/pagemap.h>
 #include <linux/rmap.h>
 #include <linux/swap.h>
+#include <linux/memory-tiers.h>
+#include <linux/migrate.h>
+#include <linux/mm_inline.h>
 
 #include "../internal.h"
 #include "ops-common.h"
@@ -325,6 +328,153 @@ static unsigned long damon_pa_deactivate_pages(struct damon_region *r,
 	return damon_pa_mark_accessed_or_deactivate(r, s, false);
 }
 
+static unsigned int __damon_pa_migrate_folio_list(
+		struct list_head *migrate_folios, struct pglist_data *pgdat,
+		int target_nid)
+{
+	unsigned int nr_succeeded;
+	nodemask_t allowed_mask = NODE_MASK_NONE;
+	struct migration_target_control mtc = {
+		/*
+		 * Allocate from 'node', or fail quickly and quietly.
+		 * When this happens, 'page' will likely just be discarded
+		 * instead of migrated.
+		 */
+		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
+			__GFP_NOWARN | __GFP_NOMEMALLOC | GFP_NOWAIT,
+		.nid = target_nid,
+		.nmask = &allowed_mask
+	};
+
+	if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE)
+		return 0;
+
+	if (list_empty(migrate_folios))
+		return 0;
+
+	/* Migration ignores all cpuset and mempolicy settings */
+	migrate_pages(migrate_folios, alloc_migrate_folio, NULL,
+		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DAMON,
+		      &nr_succeeded);
+
+	return nr_succeeded;
+}
+
+static unsigned int damon_pa_migrate_folio_list(struct list_head *folio_list,
+						struct pglist_data *pgdat,
+						int target_nid)
+{
+	unsigned int nr_migrated = 0;
+	struct folio *folio;
+	LIST_HEAD(ret_folios);
+	LIST_HEAD(migrate_folios);
+
+	while (!list_empty(folio_list)) {
+		struct folio *folio;
+
+		cond_resched();
+
+		folio = lru_to_folio(folio_list);
+		list_del(&folio->lru);
+
+		if (!folio_trylock(folio))
+			goto keep;
+
+		/* Relocate its contents to another node. */
+		list_add(&folio->lru, &migrate_folios);
+		folio_unlock(folio);
+		continue;
+keep:
+		list_add(&folio->lru, &ret_folios);
+	}
+	/* 'folio_list' is always empty here */
+
+	/* Migrate folios selected for migration */
+	nr_migrated += __damon_pa_migrate_folio_list(
+			&migrate_folios, pgdat, target_nid);
+	/*
+	 * Folios that could not be migrated are still in @migrate_folios.  Add
+	 * those back on @folio_list
+	 */
+	if (!list_empty(&migrate_folios))
+		list_splice_init(&migrate_folios, folio_list);
+
+	try_to_unmap_flush();
+
[PATCH v6 6/7] mm/damon/paddr: introduce DAMOS_MIGRATE_HOT action for promotion
From: Hyeongtak Ji

This patch introduces the DAMOS_MIGRATE_HOT action, which is similar to
DAMOS_MIGRATE_COLD but prioritizes hot pages.  It migrates pages inside
the given region to the 'target_nid' NUMA node given via sysfs.

Here is an example usage of the 'migrate_hot' action:

  $ cd /sys/kernel/mm/damon/admin/kdamonds/<N>
  $ cat contexts/<N>/schemes/<N>/action
  migrate_hot
  $ echo 0 > contexts/<N>/schemes/<N>/target_nid
  $ echo commit > state
  $ numactl -p 2 ./hot_cold 500M 600M &
  $ numastat -c -p hot_cold

  Per-node process memory usage (in MBs)
  PID             Node 0 Node 1 Node 2 Total
  --------------  ------ ------ ------ -----
  701 (hot_cold)     501      0    601  1101

Signed-off-by: Hyeongtak Ji
Signed-off-by: Honggyu Kim
Signed-off-by: SeongJae Park
---
 include/linux/damon.h    | 2 ++
 mm/damon/paddr.c         | 3 +++
 mm/damon/sysfs-schemes.c | 1 +
 3 files changed, 6 insertions(+)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 56714b6eb0d7..3d62d98d6359 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -105,6 +105,7 @@ struct damon_target {
  * @DAMOS_NOHUGEPAGE:	Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
  * @DAMOS_LRU_PRIO:	Prioritize the region on its LRU lists.
  * @DAMOS_LRU_DEPRIO:	Deprioritize the region on its LRU lists.
+ * @DAMOS_MIGRATE_HOT:	Migrate the regions prioritizing warmer regions.
  * @DAMOS_MIGRATE_COLD:	Migrate the regions prioritizing colder regions.
  * @DAMOS_STAT:		Do nothing but count the stat.
  * @NR_DAMOS_ACTIONS:	Total number of DAMOS actions
@@ -123,6 +124,7 @@ enum damos_action {
 	DAMOS_NOHUGEPAGE,
 	DAMOS_LRU_PRIO,
 	DAMOS_LRU_DEPRIO,
+	DAMOS_MIGRATE_HOT,
 	DAMOS_MIGRATE_COLD,
 	DAMOS_STAT,		/* Do nothing but only record the stat */
 	NR_DAMOS_ACTIONS,
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 882ae54af829..af6aac388a43 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -486,6 +486,7 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx,
 		return damon_pa_mark_accessed(r, scheme);
 	case DAMOS_LRU_DEPRIO:
 		return damon_pa_deactivate_pages(r, scheme);
+	case DAMOS_MIGRATE_HOT:
 	case DAMOS_MIGRATE_COLD:
 		return damon_pa_migrate(r, scheme);
 	case DAMOS_STAT:
@@ -508,6 +509,8 @@ static int damon_pa_scheme_score(struct damon_ctx *context,
 		return damon_hot_score(context, r, scheme);
 	case DAMOS_LRU_DEPRIO:
 		return damon_cold_score(context, r, scheme);
+	case DAMOS_MIGRATE_HOT:
+		return damon_hot_score(context, r, scheme);
 	case DAMOS_MIGRATE_COLD:
 		return damon_cold_score(context, r, scheme);
 	default:
diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 880015d5b5ea..66fccfa776d7 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -1458,6 +1458,7 @@ static const char * const damon_sysfs_damos_action_strs[] = {
 	"nohugepage",
 	"lru_prio",
 	"lru_deprio",
+	"migrate_hot",
 	"migrate_cold",
 	"stat",
 };
-- 
2.34.1
[PATCH v6 4/7] mm/migrate: add MR_DAMON to migrate_reason
The current patch series introduces DAMON based migration across NUMA
nodes, so it'd be better to have a new migrate_reason in trace events.

Signed-off-by: Honggyu Kim
Reviewed-by: SeongJae Park
Signed-off-by: SeongJae Park
---
 include/linux/migrate_mode.h   | 1 +
 include/trace/events/migrate.h | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index f37cc03f9369..cec36b7e7ced 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -29,6 +29,7 @@ enum migrate_reason {
 	MR_CONTIG_RANGE,
 	MR_LONGTERM_PIN,
 	MR_DEMOTION,
+	MR_DAMON,
 	MR_TYPES
 };
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 0190ef725b43..cd01dd7b3640 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -22,7 +22,8 @@
 	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
 	EM( MR_CONTIG_RANGE,	"contig_range")			\
 	EM( MR_LONGTERM_PIN,	"longterm_pin")			\
-	EMe(MR_DEMOTION,	"demotion")
+	EM( MR_DEMOTION,	"demotion")			\
+	EMe(MR_DAMON,		"damon")
 
 /*
  * First define the enums in the above macros to be exported to userspace
-- 
2.34.1
[PATCH v6 2/7] mm: rename alloc_demote_folio to alloc_migrate_folio
The alloc_demote_folio can also be used for general migration including
both demotion and promotion, so it'd be better to rename it from
alloc_demote_folio to alloc_migrate_folio.

Signed-off-by: Honggyu Kim
Reviewed-by: SeongJae Park
---
 mm/internal.h | 2 +-
 mm/vmscan.c   | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index b3ca996a4efc..9f967842f636 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1052,7 +1052,7 @@ extern unsigned long  __must_check vm_mmap_pgoff(struct file *, unsigned long,
         unsigned long, unsigned long);
 
 extern void set_pageblock_order(void);
-struct folio *alloc_demote_folio(struct folio *src, unsigned long private);
+struct folio *alloc_migrate_folio(struct folio *src, unsigned long private);
 unsigned long reclaim_pages(struct list_head *folio_list);
 unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 					   struct list_head *folio_list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2f4406872f43..f5414b101909 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -916,7 +916,7 @@ static void folio_check_dirty_writeback(struct folio *folio,
 		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }
 
-struct folio *alloc_demote_folio(struct folio *src, unsigned long private)
+struct folio *alloc_migrate_folio(struct folio *src, unsigned long private)
 {
 	struct folio *dst;
 	nodemask_t *allowed_mask;
@@ -979,7 +979,7 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 	node_get_allowed_targets(pgdat, &allowed_mask);
 
 	/* Demotion ignores all cpuset and mempolicy settings */
-	migrate_pages(demote_folios, alloc_demote_folio, NULL,
+	migrate_pages(demote_folios, alloc_migrate_folio, NULL,
 		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
 		      &nr_succeeded);
 
-- 
2.34.1
[PATCH v6 3/7] mm/damon/sysfs-schemes: add target_nid on sysfs-schemes
From: Hyeongtak Ji

This patch adds target_nid under

  /sys/kernel/mm/damon/admin/kdamonds/<N>/contexts/<N>/schemes/<N>/

The 'target_nid' can be used as the destination node for DAMOS actions
such as DAMOS_MIGRATE_{HOT,COLD} in the follow up patches.

Signed-off-by: Hyeongtak Ji
Signed-off-by: Honggyu Kim
Signed-off-by: SeongJae Park
---
 include/linux/damon.h    | 11 ++++++++++-
 mm/damon/core.c          |  5 ++++-
 mm/damon/dbgfs.c         |  2 +-
 mm/damon/lru_sort.c      |  3 ++-
 mm/damon/reclaim.c       |  3 ++-
 mm/damon/sysfs-schemes.c | 33 ++++++++++++++++++++++++++++++++-
 6 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index f7da65e1ac04..21d6b69a015c 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -374,6 +374,7 @@ struct damos_access_pattern {
  * @apply_interval_us:	The time between applying the @action.
  * @quota:		Control the aggressiveness of this scheme.
  * @wmarks:		Watermarks for automated (in)activation of this scheme.
+ * @target_nid:		Destination node if @action is "migrate_{hot,cold}".
  * @filters:		Additional set of &struct damos_filter for &action.
  * @stat:		Statistics of this scheme.
  * @list:		List head for siblings.
@@ -389,6 +390,10 @@ struct damos_access_pattern {
  * monitoring context are inactive, DAMON stops monitoring either, and just
  * repeatedly checks the watermarks.
  *
+ * @target_nid is used to set the migration target node for migrate_hot or
+ * migrate_cold actions, which means it's only meaningful when @action is
+ * either "migrate_hot" or "migrate_cold".
+ *
  * Before applying the &action to a memory region, &struct damon_operations
  * implementation could check pages of the region and skip &action to respect
  * &filters
@@ -410,6 +415,9 @@ struct damos {
 /* public: */
 	struct damos_quota quota;
 	struct damos_watermarks wmarks;
+	union {
+		int target_nid;
+	};
 	struct list_head filters;
 	struct damos_stat stat;
 	struct list_head list;
@@ -726,7 +734,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
 			enum damos_action action,
 			unsigned long apply_interval_us,
 			struct damos_quota *quota,
-			struct damos_watermarks *wmarks);
+			struct damos_watermarks *wmarks,
+			int target_nid);
 void damon_add_scheme(struct damon_ctx *ctx, struct damos *s);
 void damon_destroy_scheme(struct damos *s);
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 6392f1cc97a3..c0ec5be4f56e 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -354,7 +354,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
 			enum damos_action action,
 			unsigned long apply_interval_us,
 			struct damos_quota *quota,
-			struct damos_watermarks *wmarks)
+			struct damos_watermarks *wmarks,
+			int target_nid)
 {
 	struct damos *scheme;
@@ -381,6 +382,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
 	scheme->wmarks = *wmarks;
 	scheme->wmarks.activated = true;
 
+	scheme->target_nid = target_nid;
+
 	return scheme;
 }
diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c
index 2461cfe2e968..51a6f1cac385 100644
--- a/mm/damon/dbgfs.c
+++ b/mm/damon/dbgfs.c
@@ -281,7 +281,7 @@ static struct damos **str_to_schemes(const char *str, ssize_t len,
 		pos += parsed;
 		scheme = damon_new_scheme(&pattern, action, 0, &quota,
-				&wmarks);
+				&wmarks, NUMA_NO_NODE);
 		if (!scheme)
 			goto fail;
diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c
index 3de2916a65c3..3775f0f2743d 100644
--- a/mm/damon/lru_sort.c
+++ b/mm/damon/lru_sort.c
@@ -163,7 +163,8 @@ static struct damos *damon_lru_sort_new_scheme(
 			/* under the quota. */
 			&damon_lru_sort_quota,
 			/* (De)activate this according to the watermarks. */
-			&damon_lru_sort_wmarks);
+			&damon_lru_sort_wmarks,
+			NUMA_NO_NODE);
 }
 
 /* Create a DAMON-based operation scheme for hot memory regions */
diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c
index 9bd341d62b4c..a05ccb41749b 100644
--- a/mm/damon/reclaim.c
+++ b/mm/damon/reclaim.c
@@ -177,7 +177,8 @@ static struct damos *damon_reclaim_new_scheme(void)
 			/* under the quota. */
 			&damon_reclaim_quota,
 			/* (De)activate this according to the watermarks. */
-			&damon_reclaim_wmarks);
+			&damon_reclaim_wmarks,
+			NUMA_NO_NODE);
 }
 
 static void damon_reclaim_copy_quota_status(struct
[PATCH v6 1/7] mm: make alloc_demote_folio externally invokable for migration
The alloc_demote_folio can be used out of vmscan.c, so it'd be better to
remove the static keyword from it.

Signed-off-by: Honggyu Kim
Reviewed-by: SeongJae Park
Signed-off-by: SeongJae Park
---
 mm/internal.h | 1 +
 mm/vmscan.c   | 3 +--
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index b2c75b12014e..b3ca996a4efc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1052,6 +1052,7 @@ extern unsigned long  __must_check vm_mmap_pgoff(struct file *, unsigned long,
         unsigned long, unsigned long);
 
 extern void set_pageblock_order(void);
+struct folio *alloc_demote_folio(struct folio *src, unsigned long private);
 unsigned long reclaim_pages(struct list_head *folio_list);
 unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 					   struct list_head *folio_list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e34de9cd0d4..2f4406872f43 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -916,8 +916,7 @@ static void folio_check_dirty_writeback(struct folio *folio,
 		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }
 
-static struct folio *alloc_demote_folio(struct folio *src,
-		unsigned long private)
+struct folio *alloc_demote_folio(struct folio *src, unsigned long private)
 {
 	struct folio *dst;
 	nodemask_t *allowed_mask;
-- 
2.34.1
[PATCH v6 0/7] DAMON based tiered memory management for CXL memory
There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
posted at [1].

It said no implementation of the demote/promote DAMOS actions had been
made.  This patch series implements them for the physical address space
so that this scheme can be applied at the system-wide level.

Changes from v5:
https://lore.kernel.org/20240613132056.608-1-honggyu@sk.com
  1. Remove new actions from the usage document as it's for debugfs
  2. Apply minor fixes on the cover letter

Changes from RFC v4:
https://lore.kernel.org/20240512175447.75943-1...@kernel.org
  1. Add usage and design documents
  2. Rename alloc_demote_folio to alloc_migrate_folio
  3. Add evaluation results with "demotion_enabled" true
  4. Rebase based on v6.10-rc3

Changes from RFC v3:
https://lore.kernel.org/20240405060858.2818-1-honggyu@sk.com
  0. Updated from v3 and posted by SJ on behalf of Honggyu under his
     approval.
  1. Do not reuse damon_pa_pageout() and drop 'enum migration_mode'
  2. Drop vmstat change
  3. Drop unnecessary page reference check

Changes from RFC v2:
https://lore.kernel.org/20240226140555.1615-1-honggyu@sk.com
  1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}.
  2. Create 'target_nid' to set the migration target node instead of
     depending on node distance based information.
  3. Instead of having page level access check in this patch series,
     delegate the job to a new DAMOS filter type YOUNG[2].
  4. Introduce vmstat counters "damon_migrate_{hot,cold}".
  5. Rebase from v6.7 to v6.8.

Changes from RFC:
https://lore.kernel.org/20240115045253.1775-1-honggyu@sk.com
  1. Move most of the implementation from mm/vmscan.c to mm/damon/paddr.c.
  2. Simplify some functions of vmscan.c and use them in paddr.c, but they
     need to be reviewed more in depth.
  3. Refactor most functions for common usage for both promote and demote
     actions and introduce an enum migration_mode for its control.
  4. Add "target_nid" sysfs knob as the migration destination node for
     both promote and demote actions.
  5. Move DAMOS_PROMOTE before DAMOS_DEMOTE and move them even above
     DAMOS_STAT.

Introduction
============

With the advent of CXL/PCIe attached DRAM, which will be called simply as
CXL memory in this cover letter, some systems are becoming more
heterogeneous, having memory systems with different latency and bandwidth
characteristics.  They are usually handled as different NUMA nodes in
separate memory tiers, and CXL memory is used as a slow tier because of
its protocol overhead compared to local DRAM.

In this kind of system, we need to be careful to place memory pages on
the proper NUMA nodes based on the memory access frequency.  Otherwise,
some frequently accessed pages might reside on slow tiers, which causes
unexpected performance degradation.  Moreover, the memory access patterns
can change at runtime.

To handle this problem, we need a way to monitor the memory access
patterns and migrate pages based on their access temperature.  The
DAMON (Data Access MONitor) framework and its DAMOS (DAMON-based
Operation Schemes) can be useful features for monitoring and migrating
pages.  DAMOS provides multiple actions based on DAMON monitoring
results, and it can be used for proactive reclaim, which means swapping
cold pages out with the DAMOS_PAGEOUT action, but it doesn't support
migration actions such as demotion and promotion between tiered memory
nodes.

This series supports two new DAMOS actions: DAMOS_MIGRATE_HOT for
promotion from slow tiers and DAMOS_MIGRATE_COLD for demotion from fast
tiers.  This prevents hot pages from being stuck on slow tiers, which
would degrade performance, and cold pages can be proactively demoted to
slow tiers so that the system can increase the chance to allocate more
hot pages to fast tiers.

DAMON provides various tuning knobs, but we found that the proactive
demotion of cold pages is especially useful when the system is running
out of memory on its fast tier nodes.

Our evaluation result shows that it reduces the performance slowdown
compared to the default memory policy from 11% to 3~5% when the system
runs under high memory pressure on its fast tier DRAM nodes.

DAMON configuration
===================

The specific DAMON configuration doesn't have to be in the scope of this
patch series, but some rough idea is better shared to explain the
evaluation results.

DAMON provides many knobs for fine tuning, but its configuration file is
generated by HMSDK[3].  It includes a gen_config.py script that generates
a json file with the full config of the DAMON knobs, and it creates
multiple kdamonds for each NUMA node when DAMON is enabled so that it can
run hot/cold based migration for tiered memory.

Evaluation Workload
===================

The performance evaluation is done with redis[4], which is a widely used
in-memory database, and the memory access patterns are generated via
YCSB[5].  We have measured two different workloads with zipfian and latest
Re: [PATCH v4 12/35] kmsan: Support SLAB_POISON
On Thu, 2024-06-13 at 16:30 -0700, SeongJae Park wrote:
> Hi Ilya,
> 
> On Thu, 13 Jun 2024 17:34:14 +0200 Ilya Leoshkevich wrote:
> 
> > Avoid false KMSAN negatives with SLUB_DEBUG by allowing
> > kmsan_slab_free() to poison the freed memory, and by preventing
> > init_object() from unpoisoning new allocations by using __memset().
> > 
> > There are two alternatives to this approach.  First, init_object()
> > can be marked with __no_sanitize_memory.  This annotation should be
> > used with great care, because it drops all instrumentation from the
> > function, and any shadow writes will be lost.  Even though this is
> > not a concern with the current init_object() implementation, this
> > may change in the future.
> > 
> > Second, kmsan_poison_memory() calls may be added after memset()
> > calls.  The downside is that init_object() is called from
> > free_debug_processing(), in which case poisoning will erase the
> > distinction between simply uninitialized memory and UAF.
> > 
> > Signed-off-by: Ilya Leoshkevich
> > ---
> >  mm/kmsan/hooks.c |  2 +-
> >  mm/slub.c        | 13 +++++++++----
> >  2 files changed, 10 insertions(+), 5 deletions(-)
> > 
> [...]
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1139,7 +1139,12 @@ static void init_object(struct kmem_cache *s, void *object, u8 val)
> >  	unsigned int poison_size = s->object_size;
> >  
> >  	if (s->flags & SLAB_RED_ZONE) {
> > -		memset(p - s->red_left_pad, val, s->red_left_pad);
> > +		/*
> > +		 * Use __memset() here and below in order to avoid overwriting
> > +		 * the KMSAN shadow.  Keeping the shadow makes it possible to
> > +		 * distinguish uninit-value from use-after-free.
> > +		 */
> > +		__memset(p - s->red_left_pad, val, s->red_left_pad);
> 
> I found my build test[1] fails with below error on latest mm-unstable
> branch.  'git bisect' points me to this patch.
> 
>       CC      mm/slub.o
>     /mm/slub.c: In function 'init_object':
>     /mm/slub.c:1147:17: error: implicit declaration of function
>     '__memset'; did you mean 'memset'?
>     [-Werror=implicit-function-declaration]
>      1147 |                 __memset(p - s->red_left_pad, val, s->red_left_pad);
>           |                 ^~~~
>           |                 memset
>     cc1: some warnings being treated as errors
> 
> I haven't looked in deep, but reporting first.  Do you have any idea?
> 
> [1] https://github.com/awslabs/damon-tests/blob/next/corr/tests/build_m68k.sh
> 
> 
> Thanks,
> SJ
> 
> [...]

Thanks for the report.  Apparently not all architectures have __memset().

We should probably go back to memset_no_sanitize_memory() [1], but this
time mark it with noinline __maybe_unused __no_sanitize_memory, like it's
done in, e.g., 32/35.

Alexander, what do you think?

[1] https://lore.kernel.org/lkml/20231121220155.1217090-14-...@linux.ibm.com/
[PATCH v7 1/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
From: "Steven Rostedt (Google)"

In order to allow requesting a memory region that can be used for things
like pstore on multiple machines where the memory layout is not the same,
add a new option to the kernel command line called "reserve_mem".

The format is:

  reserve_mem=nn:align:name

Where it will find nn amount of memory at the given alignment of align.
The name field is to allow another subsystem to retrieve where the memory
was found.  For example:

  reserve_mem=12M:4096:oops ramoops.mem_name=oops

Where ramoops.mem_name will tell ramoops that memory was reserved for it
via the reserve_mem option and it can find it by calling:

  if (reserve_mem_find_by_name("oops", &start, &size)) {
	// start holds the start address and size holds the size given

This is typically used for systems that do not wipe the RAM, and this
command line will try to reserve the same physical memory on soft
reboots.  Note, it is not guaranteed to be the same location.  For
example, if KASLR places the kernel at the location of the RAM
reservation from a previous boot, the new reservation will be at a
different location.  Any subsystem using this feature must add a way to
verify that the contents of the physical memory is from a previous boot,
as there may be cases where the memory will not be located at the same
location.

Not all systems may work either.  There could be bit flips if the reboot
goes through the BIOS.  Using kexec to reboot the machine is likely to
have better results in such cases.

Link: https://lore.kernel.org/all/zjjvnzux3nzig...@kernel.org/

Suggested-by: Mike Rapoport
Tested-by: Guilherme G. Piccoli
Signed-off-by: Steven Rostedt (Google)
---
 .../admin-guide/kernel-parameters.txt |  22 ++++
 include/linux/mm.h                    |   2 +
 mm/memblock.c                         | 117 ++++++++++++++++++
 3 files changed, 141 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b600df82669d..56e18b1a520d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5710,6 +5710,28 @@
 			them.  If <base> is less than 0x10000, the region
 			is assumed to be I/O ports; otherwise it is memory.
 
+	reserve_mem=	[RAM]
+			Format: nn[KMG]:<align>:<name>
+			Reserve physical memory and label it with a name that
+			other subsystems can use to access it.  This is
+			typically used for systems that do not wipe the RAM,
+			and this command line will try to reserve the same
+			physical memory on soft reboots.  Note, it is not
+			guaranteed to be the same location.  For example, if
+			anything about the system changes or if booting a
+			different kernel.  It can also fail if KASLR places
+			the kernel at the location of the RAM reservation
+			from a previous boot; the new reservation will then
+			be at a different location.
+			Any subsystem using this feature must add a way to
+			verify that the contents of the physical memory is
+			from a previous boot, as there may be cases where the
+			memory will not be located at the same location.
+
+			The format is size:align:label, for example, to
+			request 12 megabytes of 4096 alignment for ramoops:
+
+			reserve_mem=12M:4096:oops ramoops.mem_name=oops
+
 	reservetop=	[X86-32,EARLY]
 			Format: nn[KMG]
 			Reserves a hole at the top of the kernel virtual
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9849dfda44d4..077fb589b88a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4263,4 +4263,6 @@ static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
 void vma_pgtable_walk_begin(struct vm_area_struct *vma);
 void vma_pgtable_walk_end(struct vm_area_struct *vma);
 
+int reserve_mem_find_by_name(const char *name, phys_addr_t *start, phys_addr_t *size);
+
 #endif /* _LINUX_MM_H */
diff --git a/mm/memblock.c b/mm/memblock.c
index d09136e040d3..b7b0e8c3868d 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2244,6 +2244,123 @@ void __init memblock_free_all(void)
 	totalram_pages_add(pages);
 }
 
+/* Keep a table to reserve named memory */
+#define RESERVE_MEM_MAX_ENTRIES	8
+#define RESERVE_MEM_NAME_SIZE	16
+struct reserve_mem_table {
+	char			name[RESERVE_MEM_NAME_SIZE];
+	phys_addr_t		start;
+	phys_addr_t		size;
+};
+static struct reserve_mem_table reserved_mem_table[RESERVE_MEM_MAX_ENTRIES];
+static int reserved_mem_count;
+
+/* Add wildcard
[PATCH v7 2/2] pstore/ramoops: Add ramoops.mem_name= command line option
From: "Steven Rostedt (Google)" Add a method to find a region specified by reserve_mem=nn:align:name for ramoops. Adding a kernel command line parameter: reserve_mem=12M:4096:oops ramoops.mem_name=oops Will use the size and location defined by the reserve_mem parameter where it finds the memory and labels it "oops". The "oops" in the ramoops option is used to search for it. This allows for arbitrary RAM to be used for ramoops if it is known that the memory is not cleared on kernel crashes or soft reboots. Tested-by: Guilherme G. Piccoli Signed-off-by: Steven Rostedt (Google) --- Documentation/admin-guide/ramoops.rst | 13 + fs/pstore/ram.c | 14 ++ 2 files changed, 27 insertions(+) diff --git a/Documentation/admin-guide/ramoops.rst b/Documentation/admin-guide/ramoops.rst index e9f85142182d..2eabef31220d 100644 --- a/Documentation/admin-guide/ramoops.rst +++ b/Documentation/admin-guide/ramoops.rst @@ -23,6 +23,8 @@ and type of the memory area are set using three variables: * ``mem_size`` for the size. The memory size will be rounded down to a power of two. * ``mem_type`` to specify the memory type (default is pgprot_writecombine). + * ``mem_name`` to specify a memory region defined by ``reserve_mem`` command +line parameter. Typically the default value of ``mem_type=0`` should be used as that sets the pstore mapping to pgprot_writecombine. Setting ``mem_type=1`` attempts to use @@ -118,6 +120,17 @@ Setting the ramoops parameters can be done in several different manners: return ret; } + D. Using a region of memory reserved via ``reserve_mem`` command line +parameter. The address and size will be defined by the ``reserve_mem`` +parameter. Note that ``reserve_mem`` may not always allocate memory +in the same location, and cannot be relied upon. Testing will need +to be done, and it may not work on every machine, nor every kernel. +Consider this a "best effort" approach. The ``reserve_mem`` option +takes a size, alignment and name as arguments. 
The name is used +to map the memory to a label that can be retrieved by ramoops. + + reserve_mem=2M:4096:oops ramoops.mem_name=oops + You can specify either RAM memory or peripheral devices' memory. However, when specifying RAM, be sure to reserve the memory by issuing memblock_reserve() very early in the architecture code, e.g.:: diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c index b1a455f42e93..4311fcbc84f2 100644 --- a/fs/pstore/ram.c +++ b/fs/pstore/ram.c @@ -50,6 +50,10 @@ module_param_hw(mem_address, ullong, other, 0400); MODULE_PARM_DESC(mem_address, "start of reserved RAM used to store oops/panic logs"); +static char *mem_name; +module_param_named(mem_name, mem_name, charp, 0400); +MODULE_PARM_DESC(mem_name, "name of kernel param that holds addr"); + static ulong mem_size; module_param(mem_size, ulong, 0400); MODULE_PARM_DESC(mem_size, @@ -914,6 +918,16 @@ static void __init ramoops_register_dummy(void) { struct ramoops_platform_data pdata; + if (mem_name) { + phys_addr_t start; + phys_addr_t size; + + if (reserve_mem_find_by_name(mem_name, &start, &size)) { + mem_address = start; + mem_size = size; + } + } + /* * Prepare a dummy platform data structure to carry the module * parameters. If mem_size isn't set, then there are no module -- 2.43.0
[PATCH v7 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
Reserve unspecified location of physical memory from kernel command line Background: In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract dmesg output and some other information when a crash happens in the field. (This is only done when the user selects "Allow Google to collect data for improving the system".) But there are cases when there's a bug that requires more data to be retrieved to figure out what is happening. We would like to increase the pstore size, either temporarily, or maybe even permanently. The pstore on these devices is at a fixed location in RAM (as the RAM is not cleared on soft reboots or crashes). The location is chosen by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86. There's a driver that queries for this to initialize the pstore for ChromeOS: see drivers/platform/chrome/chromeos_pstore.c Problem: The problem is that, even though there's a process to change the kernel on these systems, and it is done regularly to install updates, the firmware is updated much less frequently. Choosing the place in RAM also takes special care, and it may be at a different address for different boards. Updating the size via firmware is a large effort and not something that many are willing to do for a temporary pstore size change. Requirement: Need a way to reserve memory that will be at a consistent location for every boot, if the kernel and system are the same. Does not need to work if rebooting to a different kernel, or if the system can change the memory layout between boots. The reserved memory cannot be a hard-coded address, as the same kernel / command line needs to run on several different machines. The picked memory reservation just needs to be the same for a given machine, but may be different for different machines. Solution: The solution I have come up with is to introduce a new "reserve_mem=" kernel command line parameter. 
This parameter takes the following format: reserve_mem=nn:align:label where nn is the size of memory to reserve, align is the alignment of that memory, and label is the way for other sub-systems to find that memory. This way the kernel command line could have: reserve_mem=12M:4096:oops ramoops.mem_name=oops At boot up, the kernel will search for 12 megabytes in usable memory regions with an alignment of 4096. It will start at the highest regions and work its way down (for those old devices that want access to lower address DMA). When it finds a region, it will save it off in a small table and mark it with the "oops" label. Then the pstore ramoops sub-system could ask for that memory and its location, and it will map itself there. This prototype allows for 8 different mappings (which may be overkill; 4 is probably plenty) with a 16-byte buffer to store the label. I have tested this and it works for us to solve the above problem. We can update the kernel and command line and increase the size of pstore without needing to update the firmware, or knowing every memory layout of each board. I have only tested this locally; it has not been tested in the field. Changes since v6: https://lore.kernel.org/all/20240613155506.811013...@goodmis.org/ - Fixed typo of s/reserver/reserve/ Changes since v5: https://lore.kernel.org/all/20240613003435.401549...@goodmis.org/ - Stressed more that this is a best effort use case - Updated ramoops.rst to document this new feature - Used a new variable "tmp" in reserve_mem_find_by_name() instead of using "size" and possibly corrupting it. Changes since v4: https://lore.kernel.org/all/20240611215610.548854...@goodmis.org/ - Add all checks about reserve_mem before allocation. This means reserved_mem_add() is now a void function. - Check for name duplications. 
- Fix compare of align to SMP_CACHE_BYTES ("<" instead of "<=") Changes since v3: https://lore.kernel.org/all/20240611144911.327227...@goodmis.org/ - Changed table type of start and size from unsigned long to phys_addr_t (as well as the parameters to the functions that use them) - Changed old reference to "early_reserve_mem" to "reserve_mem" - Check before reserving memory: o size is non-zero o name has text in it - If align is less than SMP_CACHE_BYTES, make it SMP_CACHE_BYTES - Remove the silly check of testing *p == '\0' after a p += strlen(p) Changes since v2: https://lore.kernel.org/all/20240606150143.876469...@goodmis.org/ - Fixed typo of "reserver" - Added EXPORT_SYMBOL_GPL() for reserve_mem_find_by_name() - Removed "built-in" from module description that was changed from v1. Changes since v1: https://lore.kernel.org/all/2024060320.801075...@goodmis.org/ - Updated the change log of the first patch as well as added an entry into kernel-parameters.txt about how reserve_mem is for soft reboots and may not be reliable. Steven Rostedt (Google) (2): mm/memblock: Add "reserve_mem" to reserved named memory at boot up pstore/ramoops: Add ramoops.mem_name= command line option
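The validation steps listed in the changelog above (size is non-zero, the name has text, no duplicate labels, at most RESERVE_MEM_MAX_ENTRIES entries) can be folded into one user-space sketch. In the actual patch these checks happen before reserved_mem_add() is called and that function is void; here the checks live inside the function and it returns 1/0 so they can be exercised directly, and the phys_addr_t typedef is a stand-in.

```c
#include <string.h>

typedef unsigned long long phys_addr_t;

#define RESERVE_MEM_MAX_ENTRIES 8
#define RESERVE_MEM_NAME_SIZE   16

struct reserve_mem_table {
	char        name[RESERVE_MEM_NAME_SIZE];
	phys_addr_t start;
	phys_addr_t size;
};

static struct reserve_mem_table reserved_mem_table[RESERVE_MEM_MAX_ENTRIES];
static int reserved_mem_count;

/* Hypothetical model: register a named region, refusing the cases the
 * kernel's command-line parsing rejects before calling reserved_mem_add(). */
int reserved_mem_add(phys_addr_t start, phys_addr_t size, const char *name)
{
	struct reserve_mem_table *map;
	int i;

	/* size must be non-zero and the label must have text and fit. */
	if (!size || !name[0] || strlen(name) >= RESERVE_MEM_NAME_SIZE)
		return 0;

	/* the table only holds RESERVE_MEM_MAX_ENTRIES mappings. */
	if (reserved_mem_count >= RESERVE_MEM_MAX_ENTRIES)
		return 0;

	/* reject duplicate labels (the v5 "name duplications" check). */
	for (i = 0; i < reserved_mem_count; i++)
		if (strcmp(reserved_mem_table[i].name, name) == 0)
			return 0;

	map = &reserved_mem_table[reserved_mem_count++];
	strcpy(map->name, name);
	map->start = start;
	map->size = size;
	return 1;
}
```

A second "oops" registration or a zero-sized request is rejected without touching the table, which is the behavior the changelog describes the kernel enforcing up front.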
Re: [PATCH v4 12/35] kmsan: Support SLAB_POISON
Hi Ilya, On Thu, 13 Jun 2024 17:34:14 +0200 Ilya Leoshkevich wrote: > Avoid false KMSAN negatives with SLUB_DEBUG by allowing > kmsan_slab_free() to poison the freed memory, and by preventing > init_object() from unpoisoning new allocations by using __memset(). > > There are two alternatives to this approach. First, init_object() > can be marked with __no_sanitize_memory. This annotation should be used > with great care, because it drops all instrumentation from the > function, and any shadow writes will be lost. Even though this is not a > concern with the current init_object() implementation, this may change > in the future. > > Second, kmsan_poison_memory() calls may be added after memset() calls. > The downside is that init_object() is called from > free_debug_processing(), in which case poisoning will erase the > distinction between simply uninitialized memory and UAF. > > Signed-off-by: Ilya Leoshkevich > --- > mm/kmsan/hooks.c | 2 +- > mm/slub.c| 13 + > 2 files changed, 10 insertions(+), 5 deletions(-) > [...] > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -1139,7 +1139,12 @@ static void init_object(struct kmem_cache *s, void > *object, u8 val) > unsigned int poison_size = s->object_size; > > if (s->flags & SLAB_RED_ZONE) { > - memset(p - s->red_left_pad, val, s->red_left_pad); > + /* > + * Use __memset() here and below in order to avoid overwriting > + * the KMSAN shadow. Keeping the shadow makes it possible to > + * distinguish uninit-value from use-after-free. > + */ > + __memset(p - s->red_left_pad, val, s->red_left_pad); I found my build test[1] fails with below error on latest mm-unstable branch. 'git bisect' points me this patch. CC mm/slub.o /mm/slub.c: In function 'init_object': /mm/slub.c:1147:17: error: implicit declaration of function '__memset'; did you mean 'memset'? 
[-Werror=implicit-function-declaration] 1147 | __memset(p - s->red_left_pad, val, s->red_left_pad); | ^~~~ | memset cc1: some warnings being treated as errors I haven't looked into it deeply, but am reporting it first. Do you have any idea? [1] https://github.com/awslabs/damon-tests/blob/next/corr/tests/build_m68k.sh Thanks, SJ [...]
Re: [PATCH 3/8] riscv: ftrace: support fastcc in Clang for WITH_ARGS
On Thu, Jun 13, 2024 at 03:11:08PM +0800, Andy Chiu wrote: > Some caller-saved registers which are not defined as function arguments > in the ABI can still be passed as arguments when the kernel is compiled > with Clang. As a result, we must save and restore those registers to > prevent ftrace from clobbering them. > > - [1]: https://reviews.llvm.org/D68559 > Reported-by: Evgenii Shatokhin > Closes: > https://lore.kernel.org/linux-riscv/7e7c7914-445d-426d-89a0-59a9199c4...@yadro.com/ > Signed-off-by: Andy Chiu Acked-by: Nathan Chancellor > --- > arch/riscv/include/asm/ftrace.h | 7 +++ > arch/riscv/kernel/asm-offsets.c | 7 +++ > arch/riscv/kernel/mcount-dyn.S | 16 ++-- > 3 files changed, 28 insertions(+), 2 deletions(-) > > diff --git a/arch/riscv/include/asm/ftrace.h b/arch/riscv/include/asm/ftrace.h > index 9eb31a7ea0aa..5f81c53dbfd9 100644 > --- a/arch/riscv/include/asm/ftrace.h > +++ b/arch/riscv/include/asm/ftrace.h > @@ -144,6 +144,13 @@ struct ftrace_regs { > unsigned long a5; > unsigned long a6; > unsigned long a7; > +#ifdef CONFIG_CC_IS_CLANG > + unsigned long t2; > + unsigned long t3; > + unsigned long t4; > + unsigned long t5; > + unsigned long t6; > +#endif > }; > }; > }; > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c > index b09ca5f944f7..db5a26fcc9ae 100644 > --- a/arch/riscv/kernel/asm-offsets.c > +++ b/arch/riscv/kernel/asm-offsets.c > @@ -497,6 +497,13 @@ void asm_offsets(void) > DEFINE(FREGS_SP,offsetof(struct ftrace_regs, sp)); > DEFINE(FREGS_S0,offsetof(struct ftrace_regs, s0)); > DEFINE(FREGS_T1,offsetof(struct ftrace_regs, t1)); > +#ifdef CONFIG_CC_IS_CLANG > + DEFINE(FREGS_T2,offsetof(struct ftrace_regs, t2)); > + DEFINE(FREGS_T3,offsetof(struct ftrace_regs, t3)); > + DEFINE(FREGS_T4,offsetof(struct ftrace_regs, t4)); > + DEFINE(FREGS_T5,offsetof(struct ftrace_regs, t5)); > + DEFINE(FREGS_T6,offsetof(struct ftrace_regs, t6)); > +#endif > DEFINE(FREGS_A0,offsetof(struct ftrace_regs, a0)); > 
DEFINE(FREGS_A1,offsetof(struct ftrace_regs, a1)); > DEFINE(FREGS_A2,offsetof(struct ftrace_regs, a2)); > diff --git a/arch/riscv/kernel/mcount-dyn.S b/arch/riscv/kernel/mcount-dyn.S > index 745dd4c4a69c..e988bd26b28b 100644 > --- a/arch/riscv/kernel/mcount-dyn.S > +++ b/arch/riscv/kernel/mcount-dyn.S > @@ -96,7 +96,13 @@ > REG_S x8, FREGS_S0(sp) > #endif > REG_S x6, FREGS_T1(sp) > - > +#ifdef CONFIG_CC_IS_CLANG > + REG_S x7, FREGS_T2(sp) > + REG_S x28, FREGS_T3(sp) > + REG_S x29, FREGS_T4(sp) > + REG_S x30, FREGS_T5(sp) > + REG_S x31, FREGS_T6(sp) > +#endif > // save the arguments > REG_S x10, FREGS_A0(sp) > REG_S x11, FREGS_A1(sp) > @@ -115,7 +121,13 @@ > REG_L x8, FREGS_S0(sp) > #endif > REG_L x6, FREGS_T1(sp) > - > +#ifdef CONFIG_CC_IS_CLANG > + REG_L x7, FREGS_T2(sp) > + REG_L x28, FREGS_T3(sp) > + REG_L x29, FREGS_T4(sp) > + REG_L x30, FREGS_T5(sp) > + REG_L x31, FREGS_T6(sp) > +#endif > // restore the arguments > REG_L x10, FREGS_A0(sp) > REG_L x11, FREGS_A1(sp) > > -- > 2.43.0 >
Re: [PATCH 1/1] s390/virtio_ccw: fix config change notifications
On Thu, Jun 13, 2024 at 03:21:15PM +0200, Halil Pasic wrote: > On Wed, 12 Jun 2024 16:04:15 +0200 > Thomas Huth wrote: > > > On 11/06/2024 23.47, Halil Pasic wrote: > > > Commit e3e9bda38e6d ("s390/virtio_ccw: use DMA handle from DMA API") > > > broke configuration change notifications for virtio-ccw by putting the > > > DMA address of *indicatorp directly into ccw->cda disregarding the fact > > > that if !!(vcdev->is_thinint) then the function > > > virtio_ccw_register_adapter_ind() will overwrite that ccw->cda value > > > with the address of the virtio_thinint_area so it can actually set up > > > the adapter interrupts via CCW_CMD_SET_IND_ADAPTER. Thus we end up > > > pointing to the wrong object for both CCW_CMD_SET_IND if setting up the > > > adapter interrupts fails, and for CCW_CMD_SET_CONF_IND regardless > > > whether it succeeds or fails. > > > > > > To fix this, let us save away the dma address of *indicatorp in a local > > > variable, and copy it to ccw->cda after the "vcdev->is_thinint" branch. > > > > > > Reported-by: Boqiao Fu > > > Reported-by: Sebastian Mitterle > > > Fixes: e3e9bda38e6d ("s390/virtio_ccw: use DMA handle from DMA API") > > > Signed-off-by: Halil Pasic > > > --- > > > I know that checkpatch.pl complains about a missing 'Closes' tag. > > > Unfortunately I don't have an appropriate URL at hand. @Sebastian, > > > @Boqiao: do you have any suggestions? > > > > Closes: https://issues.redhat.com/browse/RHEL-39983 > > ? > > Yep! That is a public bug tracker bug. Qualifies! > @Vasily: Can you guys pick that one up when picking the patch? Sure, applied. Thanks!
Re: [PATCH v6 2/2] pstore/ramoops: Add ramoops.mem_name= command line option
On 13/06/2024 15:42, Steven Rostedt wrote: > On Thu, 13 Jun 2024 15:19:50 -0300 > "Guilherme G. Piccoli" wrote: > >>> + >>> + reserver_mem=2M:4096:oops ramoops.mem_name=oops >>> + >> >> Likely this could be fixed on merge, to avoid another version, but... >> >> s/reserver_mem/reserve_mem > > That 'r' is my nemesis! Almost every time I type "reserve" I write it > as "reserver" and have to go back and delete it. :-p LOL, I thought the same! And I even wondered if you're not a vim user that makes use of 'r' for replacing text and double typed that - I do that all the time, with r and i. Anyway, thanks for fixing that. Cheers!
Re: [PATCH] linux++: delete some forward declarations
On Thu, 13 Jun 2024 14:21:10 -0700 Andrew Morton wrote: > On Thu, 13 Jun 2024 16:10:12 -0400 Steven Rostedt wrote: > > > > And... I'm a bit surprised that forward declarations are allowed in C. > > > A billion years ago I used a C compiler which would use 16 bits for > > > an enum if the enumted values would fit in 16 bits. And it would use 32 > > > bits otherwise. So the enumerated values were *required* for the > > > compiler to be able to figure out the sizeof. But it was a billion > > > years ago. > > > > Well, I only looked at the one change in ftrace.h which has a > > "struct seq_file;" that is not used anywhere else in the file, so that > > one definitely can go. > > The risk is that something which includes ftrace.h is depending upon > the enum declaration. You mean forward struct declaration. And if so, good! it needs to be fixed ;-) -- Steve
Re: [PATCH] linux++: delete some forward declarations
On Thu, 13 Jun 2024 16:10:12 -0400 Steven Rostedt wrote: > > And... I'm a bit surprised that forward declarations are allowed in C. > > A billion years ago I used a C compiler which would use 16 bits for > > an enum if the enumted values would fit in 16 bits. And it would use 32 > > bits otherwise. So the enumerated values were *required* for the > > compiler to be able to figure out the sizeof. But it was a billion > > years ago. > > Well, I only looked at the one change in ftrace.h which has a > "struct seq_file;" that is not used anywhere else in the file, so that > one definitely can go. The risk is that something which includes ftrace.h is depending upon the enum declaration.
Re: [PATCH] ARM: dts: qcom: motorola-falcon: add accelerometer, magnetometer
On Thu, Jun 13, 2024 at 09:48:26AM +0200, Konrad Dybcio wrote: > > > On 6/9/24 13:05, Stanislav Jakubek wrote: > > Add the accelerometer and magnetometer that are present on the Motorola > > Moto G (2013) device. > > > > Signed-off-by: Stanislav Jakubek > > --- > > [...] > > > +_i2c2 { > > Consider setting a clock-frequency = <> Hi Konrad, checking downstream [1], qcom,i2c-bus-freq for I2C2 is 10, which seems to be the default. Should I specify it anyway? Though, now that I've checked, it seems that I missed the clock-frequency for I2C3, for which qcom,i2c-bus-freq is 40... The currently supported HW on it seems to work fine though ¯\_(ツ)_/¯ [1] https://github.com/LineageOS/android_kernel_motorola_msm8226/blob/cm-14.1/arch/arm/boot/dts/msm8226.dtsi#L983 Stanislav > > With that: > > Reviewed-by: Konrad Dybcio > > Konrad
Re: [PATCH] linux++: delete some forward declarations
On Thu, 13 Jun 2024 22:22:18 +0300 Alexey Dobriyan wrote: > g++ doesn't like forward enum declarations: > > error: use of enum ‘E’ without previous declaration > 64 | enum E; But we don't care about g++. Do we? I would make that a separate patch. > > Delete those which aren't used. > > Delete some unused/unnecessary forward struct declarations for a change. This is a clean up, but should have a better change log. Just something simple like: Delete unnecessary forward struct declarations. Thanks, -- Steve > > Signed-off-by: Alexey Dobriyan > --- > > fs/ramfs/inode.c |1 - > include/linux/console.h |2 -- > include/linux/device.h |3 --- > include/linux/ftrace.h |4 > include/linux/security.h |6 -- > include/linux/signal.h |2 -- > include/linux/syscalls.h |7 --- > include/linux/sysfs.h|2 -- > mm/internal.h|4 > mm/shmem.c |1 - > 10 files changed, 32 deletions(-) > > --- a/fs/ramfs/inode.c > +++ b/fs/ramfs/inode.c > @@ -51,7 +51,6 @@ struct ramfs_fs_info { > > #define RAMFS_DEFAULT_MODE 0755 > > -static const struct super_operations ramfs_ops; > static const struct inode_operations ramfs_dir_inode_operations; > > struct inode *ramfs_get_inode(struct super_block *sb, > --- a/include/linux/console.h > +++ b/include/linux/console.h > @@ -21,10 +21,8 @@ > #include > > struct vc_data; > -struct console_font_op; > struct console_font; > struct module; > -struct tty_struct; > struct notifier_block; > > enum con_scroll { > --- a/include/linux/device.h > +++ b/include/linux/device.h > @@ -36,10 +36,7 @@ > struct device; > struct device_private; > struct device_driver; > -struct driver_private; > struct module; > -struct class; > -struct subsys_private; > struct device_node; > struct fwnode_handle; > struct iommu_group; > --- a/include/linux/ftrace.h > +++ b/include/linux/ftrace.h > @@ -531,8 +531,6 @@ extern const void *ftrace_expected; > > void ftrace_bug(int err, struct dyn_ftrace *rec); > > -struct seq_file; > - > extern int ftrace_text_reserved(const void *start, const void 
*end); > > struct ftrace_ops *ftrace_ops_trampoline(unsigned long addr); > @@ -1147,8 +1145,6 @@ static inline void unpause_graph_tracing(void) { } > #endif /* CONFIG_FUNCTION_GRAPH_TRACER */ > > #ifdef CONFIG_TRACING > -enum ftrace_dump_mode; > - > #define MAX_TRACER_SIZE 100 > extern char ftrace_dump_on_oops[]; > extern int ftrace_dump_on_oops_enabled(void); > --- a/include/linux/security.h > +++ b/include/linux/security.h > @@ -41,7 +41,6 @@ struct rlimit; > struct kernel_siginfo; > struct sembuf; > struct kern_ipc_perm; > -struct audit_context; > struct super_block; > struct inode; > struct dentry; > @@ -59,8 +58,6 @@ struct xfrm_sec_ctx; > struct mm_struct; > struct fs_context; > struct fs_parameter; > -enum fs_value_type; > -struct watch; > struct watch_notification; > struct lsm_ctx; > > @@ -183,8 +180,6 @@ struct sock; > struct sockaddr; > struct socket; > struct flowi_common; > -struct dst_entry; > -struct xfrm_selector; > struct xfrm_policy; > struct xfrm_state; > struct xfrm_user_sec_ctx; > @@ -219,7 +214,6 @@ extern unsigned long dac_mmap_min_addr; > #define LSM_PRLIMIT_WRITE 2 > > /* forward declares to avoid warnings */ > -struct sched_param; > struct request_sock; > > /* bprm->unsafe reasons */ > --- a/include/linux/signal.h > +++ b/include/linux/signal.h > @@ -274,8 +274,6 @@ static inline int valid_signal(unsigned long sig) > return sig <= _NSIG ? 
1 : 0; > } > > -struct timespec; > -struct pt_regs; > enum pid_type; > > extern int next_signal(struct sigpending *pending, sigset_t *mask); > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -11,8 +11,6 @@ > > struct __aio_sigset; > struct epoll_event; > -struct iattr; > -struct inode; > struct iocb; > struct io_event; > struct iovec; > @@ -20,14 +18,12 @@ struct __kernel_old_itimerval; > struct kexec_segment; > struct linux_dirent; > struct linux_dirent64; > -struct list_head; > struct mmap_arg_struct; > struct msgbuf; > struct user_msghdr; > struct mmsghdr; > struct msqid_ds; > struct new_utsname; > -struct nfsctl_arg; > struct __old_kernel_stat; > struct oldold_utsname; > struct old_utsname; > @@ -38,7 +34,6 @@ struct rusage; > struct sched_param; > struct sched_attr; > struct sel_arg_struct; > -struct semaphore; > struct sembuf; > struct shmid_ds; > struct sockaddr; > @@ -48,14 +43,12 @@ struct statfs; > struct statfs64; > struct statx; > struct sysinfo; > -struct timespec; > struct __kernel_old_timeval; > struct __kernel_timex; > struct timezone; > struct tms; > struct utimbuf; > struct mq_attr; > -struct compat_stat; > struct old_timeval32; > struct robust_list_head; > struct futex_waitv; > --- a/include/linux/sysfs.h > +++ b/include/linux/sysfs.h > @@ -23,9 +23,7 @@ > #include >
Re: [PATCH] ACPI: NFIT: add missing MODULE_DESCRIPTION() macro
On Thu, Jun 6, 2024 at 6:10 PM Dave Jiang wrote: > > > > On 6/3/24 6:30 AM, Jeff Johnson wrote: > > make allmodconfig && make W=1 C=1 reports: > > WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/acpi/nfit/nfit.o > > > > Add the missing invocation of the MODULE_DESCRIPTION() macro. > > > > Signed-off-by: Jeff Johnson > > Reviewed-by: Dave Jiang > > --- > > drivers/acpi/nfit/core.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c > > index d4595d1985b1..e8520fb8af4f 100644 > > --- a/drivers/acpi/nfit/core.c > > +++ b/drivers/acpi/nfit/core.c > > @@ -3531,5 +3531,6 @@ static __exit void nfit_exit(void) > > > > module_init(nfit_init); > > module_exit(nfit_exit); > > +MODULE_DESCRIPTION("ACPI NVDIMM Firmware Interface Table (NFIT) module"); > > MODULE_LICENSE("GPL v2"); > > MODULE_AUTHOR("Intel Corporation"); > > > > --- Applied as 6.11 material, thanks!
Re: [PATCH] linux++: delete some forward declarations
On Thu, 13 Jun 2024 13:04:20 -0700 Andrew Morton wrote: > On Thu, 13 Jun 2024 15:34:02 -0400 Steven Rostedt wrote: > > > On Thu, 13 Jun 2024 22:22:18 +0300 > > Alexey Dobriyan wrote: > > > > > g++ doesn't like forward enum declarations: > > > > > > error: use of enum ‘E’ without previous declaration > > > 64 | enum E; > > > > But we don't care about g++. Do we? > > It appears that g++ is a useful enum declaration detector. > > I'm curious to know how even the above warning was generated. Does g++ > work at all on Linux? > > > I would make that a separate patch. > > What are you referring to here? The enum change should be separate from the struct changes. > > > > > > > Delete those which aren't used. > > > > > > Delete some unused/unnecessary forward struct declarations for a change. > > > > This is a clean up, but should have a better change log. Just something > > simple like: > > > > Delete unnecessary forward struct declarations. > > Alexey specializes in cute changelogs. eh > > I do have a concern about the patch: has it been tested with all > possible Kconfigs? No. There may be some configs in which the forward > declaration is required. > > And... I'm a bit surprised that forward declarations are allowed in C. > A billion years ago I used a C compiler which would use 16 bits for > an enum if the enumerated values would fit in 16 bits. And it would use 32 > bits otherwise. So the enumerated values were *required* for the > compiler to be able to figure out the sizeof. But it was a billion > years ago. Well, I only looked at the one change in ftrace.h which has a "struct seq_file;" that is not used anywhere else in the file, so that one definitely can go. -- Steve
Re: [PATCH] linux++: delete some forward declarations
On Thu, 13 Jun 2024 15:34:02 -0400 Steven Rostedt wrote: > On Thu, 13 Jun 2024 22:22:18 +0300 > Alexey Dobriyan wrote: > > > g++ doesn't like forward enum declarations: > > > > error: use of enum ‘E’ without previous declaration > >64 | enum E; > > But we don't care about g++. Do we? It appears that g++ is a useful enum declaration detector. I'm curious to know how even the above warning was generated. Does g++ work at all on Linux? > I would make that a separate patch. What are you referring to here? > > > > Delete those which aren't used. > > > > Delete some unused/unnecessary forward struct declarations for a change. > > This is a clean up, but should have a better change log. Just something > simple like: > > Delete unnecessary forward struct declarations. Alexey specializes in cute changelogs. I do have a concern about the patch: has it been tested with all possible Kconfigs? No. There may be some configs in which the forward declaration is required. And... I'm a bit surprised that forward declarations are allowed in C. A billion years ago I used a C compiler which would use 16 bits for an enum if the enumerated values would fit in 16 bits. And it would use 32 bits otherwise. So the enumerated values were *required* for the compiler to be able to figure out the sizeof. But it was a billion years ago.
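The reason an unused "struct seq_file;" can simply be deleted, and the reason such declarations are cheap in the first place, is that C only needs the complete type where members are accessed or the size is taken; a pointer to an incomplete type compiles fine. A minimal illustration (the type names here are made up):

```c
#include <stddef.h>

/* Forward declaration: the layout of struct opaque is unknown in this
 * translation unit, just as "struct seq_file;" leaves it unknown in a
 * header that only passes pointers around. */
struct opaque;

struct holder {
	struct opaque *p;   /* a pointer member needs no layout */
};

/* sizeof(struct holder) is computable even though struct opaque is
 * incomplete, because only the pointer's size matters. */
size_t holder_size(void)
{
	return sizeof(struct holder);
}
```

Deleting a forward declaration is therefore safe exactly when no later line in the header mentions the type, which is the check Steve describes doing for ftrace.h. An enum is different: its size can depend on its enumerators (Andrew's 16-bit-compiler story), which is why C++ rejects a bare forward `enum E;`.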
[PATCH] linux++: delete some forward declarations
g++ doesn't like forward enum declarations: error: use of enum ‘E’ without previous declaration 64 | enum E; Delete those which aren't used. Delete some unused/unnecessary forward struct declarations for a change. Signed-off-by: Alexey Dobriyan --- fs/ramfs/inode.c |1 - include/linux/console.h |2 -- include/linux/device.h |3 --- include/linux/ftrace.h |4 include/linux/security.h |6 -- include/linux/signal.h |2 -- include/linux/syscalls.h |7 --- include/linux/sysfs.h|2 -- mm/internal.h|4 mm/shmem.c |1 - 10 files changed, 32 deletions(-) --- a/fs/ramfs/inode.c +++ b/fs/ramfs/inode.c @@ -51,7 +51,6 @@ struct ramfs_fs_info { #define RAMFS_DEFAULT_MODE 0755 -static const struct super_operations ramfs_ops; static const struct inode_operations ramfs_dir_inode_operations; struct inode *ramfs_get_inode(struct super_block *sb, --- a/include/linux/console.h +++ b/include/linux/console.h @@ -21,10 +21,8 @@ #include struct vc_data; -struct console_font_op; struct console_font; struct module; -struct tty_struct; struct notifier_block; enum con_scroll { --- a/include/linux/device.h +++ b/include/linux/device.h @@ -36,10 +36,7 @@ struct device; struct device_private; struct device_driver; -struct driver_private; struct module; -struct class; -struct subsys_private; struct device_node; struct fwnode_handle; struct iommu_group; --- a/include/linux/ftrace.h +++ b/include/linux/ftrace.h @@ -531,8 +531,6 @@ extern const void *ftrace_expected; void ftrace_bug(int err, struct dyn_ftrace *rec); -struct seq_file; - extern int ftrace_text_reserved(const void *start, const void *end); struct ftrace_ops *ftrace_ops_trampoline(unsigned long addr); @@ -1147,8 +1145,6 @@ static inline void unpause_graph_tracing(void) { } #endif /* CONFIG_FUNCTION_GRAPH_TRACER */ #ifdef CONFIG_TRACING -enum ftrace_dump_mode; - #define MAX_TRACER_SIZE100 extern char ftrace_dump_on_oops[]; extern int ftrace_dump_on_oops_enabled(void); --- a/include/linux/security.h +++ b/include/linux/security.h @@ -41,7 +41,6 @@ 
struct rlimit; struct kernel_siginfo; struct sembuf; struct kern_ipc_perm; -struct audit_context; struct super_block; struct inode; struct dentry; @@ -59,8 +58,6 @@ struct xfrm_sec_ctx; struct mm_struct; struct fs_context; struct fs_parameter; -enum fs_value_type; -struct watch; struct watch_notification; struct lsm_ctx; @@ -183,8 +180,6 @@ struct sock; struct sockaddr; struct socket; struct flowi_common; -struct dst_entry; -struct xfrm_selector; struct xfrm_policy; struct xfrm_state; struct xfrm_user_sec_ctx; @@ -219,7 +214,6 @@ extern unsigned long dac_mmap_min_addr; #define LSM_PRLIMIT_WRITE 2 /* forward declares to avoid warnings */ -struct sched_param; struct request_sock; /* bprm->unsafe reasons */ --- a/include/linux/signal.h +++ b/include/linux/signal.h @@ -274,8 +274,6 @@ static inline int valid_signal(unsigned long sig) return sig <= _NSIG ? 1 : 0; } -struct timespec; -struct pt_regs; enum pid_type; extern int next_signal(struct sigpending *pending, sigset_t *mask); --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -11,8 +11,6 @@ struct __aio_sigset; struct epoll_event; -struct iattr; -struct inode; struct iocb; struct io_event; struct iovec; @@ -20,14 +18,12 @@ struct __kernel_old_itimerval; struct kexec_segment; struct linux_dirent; struct linux_dirent64; -struct list_head; struct mmap_arg_struct; struct msgbuf; struct user_msghdr; struct mmsghdr; struct msqid_ds; struct new_utsname; -struct nfsctl_arg; struct __old_kernel_stat; struct oldold_utsname; struct old_utsname; @@ -38,7 +34,6 @@ struct rusage; struct sched_param; struct sched_attr; struct sel_arg_struct; -struct semaphore; struct sembuf; struct shmid_ds; struct sockaddr; @@ -48,14 +43,12 @@ struct statfs; struct statfs64; struct statx; struct sysinfo; -struct timespec; struct __kernel_old_timeval; struct __kernel_timex; struct timezone; struct tms; struct utimbuf; struct mq_attr; -struct compat_stat; struct old_timeval32; struct robust_list_head; struct futex_waitv; --- 
a/include/linux/sysfs.h +++ b/include/linux/sysfs.h @@ -23,9 +23,7 @@ #include struct kobject; -struct module; struct bin_attribute; -enum kobj_ns_type; struct attribute { const char *name; --- a/mm/internal.h +++ b/mm/internal.h @@ -1095,10 +1095,6 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone, /* Flags that allow allocations below the min watermark. */ #define ALLOC_RESERVES (ALLOC_NON_BLOCK|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM) -enum ttu_flags; -struct tlbflush_unmap_batch; - - /* * only for MM internal work items which do not depend on * any allocations or locks which might depend on allocations --- a/mm/shmem.c +++ b/mm/shmem.c @@ -261,7 +261,6 @@
Re: [PATCH 4/8] riscv: ftrace: align patchable functions to 4 Byte boundary
Hi Andy, On Thu, Jun 13, 2024 at 03:11:09PM +0800, Andy Chiu wrote: > We are changing ftrace code patching in order to remove the dependency on > stop_machine() and enable kernel preemption. This requires us to align > function entries at a 4-byte aligned address. > > However, -falign-functions on older versions of GCC alone was not strong > enough to align all functions. In fact, cold functions are not aligned > after turning on optimizations. We consider this a bug in GCC and > turn off guess-branch-probability as a workaround to align all functions. > > GCC bug id: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345 > > The option -fmin-function-alignment is able to align all functions > properly on newer versions of gcc. So, we add a cc-option to test if > the toolchain supports it. > > Suggested-by: Evgenii Shatokhin > Signed-off-by: Andy Chiu > --- > arch/riscv/Kconfig | 1 + > arch/riscv/Makefile | 7 ++- > 2 files changed, 7 insertions(+), 1 deletion(-) > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig > index b94176e25be1..80b8d48e1e46 100644 > --- a/arch/riscv/Kconfig > +++ b/arch/riscv/Kconfig > @@ -203,6 +203,7 @@ config CLANG_SUPPORTS_DYNAMIC_FTRACE > config GCC_SUPPORTS_DYNAMIC_FTRACE > def_bool CC_IS_GCC > depends on $(cc-option,-fpatchable-function-entry=8) > + depends on $(cc-option,-fmin-function-alignment=4) || !RISCV_ISA_C Please use CC_HAS_MIN_FUNCTION_ALIGNMENT (from arch/Kconfig), which already checks for support for this option. 
> config HAVE_SHADOW_CALL_STACK > def_bool $(cc-option,-fsanitize=shadow-call-stack) > diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile > index 06de9d365088..74628ad8dcf8 100644 > --- a/arch/riscv/Makefile > +++ b/arch/riscv/Makefile > @@ -14,8 +14,13 @@ endif > ifeq ($(CONFIG_DYNAMIC_FTRACE),y) > LDFLAGS_vmlinux += --no-relax > KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY > +ifeq ($(CONFIG_CC_IS_CLANG),y) Same here, please invert this and use ifdef CONFIG_CC_HAS_MIN_FUNCTION_ALIGNMENT like the main Makefile does. > + cflags_ftrace_align := -falign-functions=4 > +else > + cflags_ftrace_align := -fmin-function-alignment=4 > +endif > ifeq ($(CONFIG_RISCV_ISA_C),y) > - CC_FLAGS_FTRACE := -fpatchable-function-entry=4 > + CC_FLAGS_FTRACE := -fpatchable-function-entry=4 $(cflags_ftrace_align) > else > CC_FLAGS_FTRACE := -fpatchable-function-entry=2 > endif > > -- > 2.43.0 > >
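For illustration, the flag selection being discussed (with the reviewer's suggested CONFIG_CC_HAS_MIN_FUNCTION_ALIGNMENT check in place of CC_IS_CLANG) can be modeled outside the Makefile. This is only a sketch; the helper name riscv_ftrace_cflags is made up for this example and is not kernel code:

```python
def riscv_ftrace_cflags(cc_has_min_function_alignment: bool, isa_c: bool) -> str:
    """Model of the proposed arch/riscv/Makefile flag selection (illustrative)."""
    # Newer compilers align every function, including cold ones, with
    # -fmin-function-alignment; older ones fall back to -falign-functions,
    # which is known not to be strong enough on its own (GCC PR 88345).
    if cc_has_min_function_alignment:
        align = "-fmin-function-alignment=4"
    else:
        align = "-falign-functions=4"
    # With compressed instructions (RISCV_ISA_C) the patchable entry needs
    # the alignment flag appended; without RVC the 2-slot entry suffices.
    if isa_c:
        return "-fpatchable-function-entry=4 " + align
    return "-fpatchable-function-entry=2"
```

Keying on the compiler capability rather than on CC_IS_CLANG, as suggested, means a sufficiently new GCC also gets -fmin-function-alignment.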
[PATCH v4 24/35] s390/cpacf: Unpoison the results of cpacf_trng()
Prevent KMSAN from complaining about buffers filled by cpacf_trng() being uninitialized. Tested-by: Alexander Gordeev Reviewed-by: Alexander Potapenko Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/include/asm/cpacf.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/s390/include/asm/cpacf.h b/arch/s390/include/asm/cpacf.h index c786538e397c..dae8843b164f 100644 --- a/arch/s390/include/asm/cpacf.h +++ b/arch/s390/include/asm/cpacf.h @@ -12,6 +12,7 @@ #define _ASM_S390_CPACF_H #include +#include /* * Instruction opcodes for the CPACF instructions @@ -542,6 +543,8 @@ static inline void cpacf_trng(u8 *ucbuf, unsigned long ucbuf_len, : [ucbuf] "+" (u.pair), [cbuf] "+" (c.pair) : [fc] "K" (CPACF_PRNO_TRNG), [opc] "i" (CPACF_PRNO) : "cc", "memory", "0"); + kmsan_unpoison_memory(ucbuf, ucbuf_len); + kmsan_unpoison_memory(cbuf, cbuf_len); } /** -- 2.45.1
[PATCH v4 35/35] kmsan: Enable on s390
Now that everything else is in place, enable KMSAN in Kconfig. Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index c59d2b54df49..3cba4993d7c7 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -158,6 +158,7 @@ config S390 select HAVE_ARCH_KASAN select HAVE_ARCH_KASAN_VMALLOC select HAVE_ARCH_KCSAN + select HAVE_ARCH_KMSAN select HAVE_ARCH_KFENCE select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET select HAVE_ARCH_SECCOMP_FILTER -- 2.45.1
[PATCH v4 29/35] s390/mm: Define KMSAN metadata for vmalloc and modules
The pages for the KMSAN metadata associated with most kernel mappings are taken from memblock by the common code. However, vmalloc and module metadata needs to be defined by the architectures. Be a little bit more careful than x86: allocate exactly MODULES_LEN for the module shadow and origins, and then take 2/3 of vmalloc for the vmalloc shadow and origins. This ensures that users passing small vmalloc= values on the command line do not cause module metadata collisions. Reviewed-by: Alexander Potapenko Acked-by: Alexander Gordeev Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/boot/startup.c| 7 +++ arch/s390/include/asm/pgtable.h | 8 2 files changed, 15 insertions(+) diff --git a/arch/s390/boot/startup.c b/arch/s390/boot/startup.c index 182aac6a0f77..93775142322d 100644 --- a/arch/s390/boot/startup.c +++ b/arch/s390/boot/startup.c @@ -301,11 +301,18 @@ static unsigned long setup_kernel_memory_layout(unsigned long kernel_size) MODULES_END = round_down(kernel_start, _SEGMENT_SIZE); MODULES_VADDR = MODULES_END - MODULES_LEN; VMALLOC_END = MODULES_VADDR; + if (IS_ENABLED(CONFIG_KMSAN)) + VMALLOC_END -= MODULES_LEN * 2; /* allow vmalloc area to occupy up to about 1/2 of the rest virtual space left */ vsize = (VMALLOC_END - FIXMAP_SIZE) / 2; vsize = round_down(vsize, _SEGMENT_SIZE); vmalloc_size = min(vmalloc_size, vsize); + if (IS_ENABLED(CONFIG_KMSAN)) { + /* take 2/3 of vmalloc area for KMSAN shadow and origins */ + vmalloc_size = round_down(vmalloc_size / 3, _SEGMENT_SIZE); + VMALLOC_END -= vmalloc_size * 2; + } VMALLOC_START = VMALLOC_END - vmalloc_size; __memcpy_real_area = round_down(VMALLOC_START - MEMCPY_REAL_SIZE, PAGE_SIZE); diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 70b6ee557eb2..2f44c23efec0 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -107,6 +107,14 @@ static inline int is_module_addr(void *addr) return 1; } +#ifdef CONFIG_KMSAN +#define 
KMSAN_VMALLOC_SIZE (VMALLOC_END - VMALLOC_START) +#define KMSAN_VMALLOC_SHADOW_START VMALLOC_END +#define KMSAN_VMALLOC_ORIGIN_START (KMSAN_VMALLOC_SHADOW_START + KMSAN_VMALLOC_SIZE) +#define KMSAN_MODULES_SHADOW_START (KMSAN_VMALLOC_ORIGIN_START + KMSAN_VMALLOC_SIZE) +#define KMSAN_MODULES_ORIGIN_START (KMSAN_MODULES_SHADOW_START + MODULES_LEN) +#endif + #ifdef CONFIG_RANDOMIZE_BASE #define KASLR_LEN (1UL << 31) #else -- 2.45.1
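The carve-out arithmetic above can be sanity-checked with a few lines of Python. This is an illustrative model with made-up sizes — SEG, MODULES_LEN, the base address and the initial vmalloc= size are assumptions, not the real s390 values, and the FIXMAP/vsize clamping step is omitted:

```python
def round_down(x, a):        # kernel-style round_down for power-of-two a
    return x & ~(a - 1)

SEG = 1 << 20                # assumed _SEGMENT_SIZE (1 MiB)
MODULES_LEN = 512 << 20      # assumed module area size
MODULES_VADDR = 1 << 40      # arbitrary demo base; VMALLOC_END starts here
vmalloc_size = 128 << 30     # assumed requested vmalloc= size

VMALLOC_END = MODULES_VADDR
VMALLOC_END -= MODULES_LEN * 2                      # module shadow + origins
vmalloc_size = round_down(vmalloc_size // 3, SEG)   # 2/3 go to KMSAN metadata
VMALLOC_END -= vmalloc_size * 2                     # vmalloc shadow + origins
VMALLOC_START = VMALLOC_END - vmalloc_size

# The macros from the patch:
KMSAN_VMALLOC_SIZE = VMALLOC_END - VMALLOC_START
KMSAN_VMALLOC_SHADOW_START = VMALLOC_END
KMSAN_VMALLOC_ORIGIN_START = KMSAN_VMALLOC_SHADOW_START + KMSAN_VMALLOC_SIZE
KMSAN_MODULES_SHADOW_START = KMSAN_VMALLOC_ORIGIN_START + KMSAN_VMALLOC_SIZE
KMSAN_MODULES_ORIGIN_START = KMSAN_MODULES_SHADOW_START + MODULES_LEN

# The metadata regions exactly fill the space carved out below MODULES_VADDR,
# so a small vmalloc= value cannot collide with the module metadata.
assert KMSAN_MODULES_ORIGIN_START + MODULES_LEN == MODULES_VADDR
```

The final assert is the point of the patch description: because exactly MODULES_LEN is allocated for each module metadata region, the module shadow/origins sit at fixed offsets regardless of how small vmalloc= is.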
Re: [PATCH v6 2/2] pstore/ramoops: Add ramoops.mem_name= command line option
On Thu, 13 Jun 2024 15:19:50 -0300 "Guilherme G. Piccoli" wrote: > > + > > + reserver_mem=2M:4096:oops ramoops.mem_name=oops > > + > > Likely this could be fixed on merge, to avoid another version, but... > > s/reserver_mem/reserve_mem That 'r' is my nemesis! Almost every time I type "reserve" I write it as "reserver" and have to go back and delete it. :-p Just to make patchwork work nicely, I'll send a v7. But I'll wait for other comments or acks and reviews. -- Steve
[PATCH v4 05/35] kmsan: Fix is_bad_asm_addr() on arches with overlapping address spaces
Comparing pointers with TASK_SIZE does not make sense when kernel and userspace overlap. Skip the comparison when this is the case. Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kmsan/instrumentation.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c index 470b0b4afcc4..8a1bbbc723ab 100644 --- a/mm/kmsan/instrumentation.c +++ b/mm/kmsan/instrumentation.c @@ -20,7 +20,8 @@ static inline bool is_bad_asm_addr(void *addr, uintptr_t size, bool is_store) { - if ((u64)addr < TASK_SIZE) + if (IS_ENABLED(CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE) && + (u64)addr < TASK_SIZE) return true; if (!kmsan_get_metadata(addr, KMSAN_META_SHADOW)) return true; -- 2.45.1
[PATCH v4 21/35] s390: Use a larger stack for KMSAN
Adjust the stack size for the KMSAN-enabled kernel like it was done for the KASAN-enabled one in commit 7fef92ccadd7 ("s390/kasan: double the stack size"). Both tools have similar requirements. Reviewed-by: Alexander Gordeev Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- arch/s390/Makefile | 2 +- arch/s390/include/asm/thread_info.h | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/s390/Makefile b/arch/s390/Makefile index f2b21c7a70ef..7fd57398221e 100644 --- a/arch/s390/Makefile +++ b/arch/s390/Makefile @@ -36,7 +36,7 @@ KBUILD_CFLAGS_DECOMPRESSOR += $(if $(CONFIG_DEBUG_INFO_DWARF4), $(call cc-option KBUILD_CFLAGS_DECOMPRESSOR += $(if $(CONFIG_CC_NO_ARRAY_BOUNDS),-Wno-array-bounds) UTS_MACHINE:= s390x -STACK_SIZE := $(if $(CONFIG_KASAN),65536,16384) +STACK_SIZE := $(if $(CONFIG_KASAN),65536,$(if $(CONFIG_KMSAN),65536,16384)) CHECKFLAGS += -D__s390__ -D__s390x__ export LD_BFD diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h index a674c7d25da5..d02a709717b8 100644 --- a/arch/s390/include/asm/thread_info.h +++ b/arch/s390/include/asm/thread_info.h @@ -16,7 +16,7 @@ /* * General size of kernel stacks */ -#ifdef CONFIG_KASAN +#if defined(CONFIG_KASAN) || defined(CONFIG_KMSAN) #define THREAD_SIZE_ORDER 4 #else #define THREAD_SIZE_ORDER 2 -- 2.45.1
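The Makefile's STACK_SIZE and thread_info's THREAD_SIZE_ORDER must agree (THREAD_SIZE is PAGE_SIZE << THREAD_SIZE_ORDER). A quick model of the two hunks above, checking their consistency for every KASAN/KMSAN combination — the function names here are invented for the sketch:

```python
PAGE_SIZE = 4096  # s390 uses 4 KiB pages

def thread_size_order(kasan: bool, kmsan: bool) -> int:
    # Mirrors: #if defined(CONFIG_KASAN) || defined(CONFIG_KMSAN) -> 4, else 2
    return 4 if (kasan or kmsan) else 2

def stack_size(kasan: bool, kmsan: bool) -> int:
    # Mirrors: $(if $(CONFIG_KASAN),65536,$(if $(CONFIG_KMSAN),65536,16384))
    return 65536 if (kasan or kmsan) else 16384

# Both definitions describe the same stack size in every configuration.
for kasan in (False, True):
    for kmsan in (False, True):
        assert PAGE_SIZE << thread_size_order(kasan, kmsan) == stack_size(kasan, kmsan)
```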
[PATCH v4 08/35] kmsan: Remove an x86-specific #include from kmsan.h
Replace the x86-specific asm/pgtable_64_types.h #include with the linux/pgtable.h one, which all architectures have. While at it, sort the headers alphabetically for the sake of consistency with other KMSAN code. Fixes: f80be4571b19 ("kmsan: add KMSAN runtime core") Suggested-by: Heiko Carstens Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kmsan/kmsan.h | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/kmsan/kmsan.h b/mm/kmsan/kmsan.h index a14744205435..adf443bcffe8 100644 --- a/mm/kmsan/kmsan.h +++ b/mm/kmsan/kmsan.h @@ -10,14 +10,14 @@ #ifndef __MM_KMSAN_KMSAN_H #define __MM_KMSAN_KMSAN_H -#include #include +#include +#include +#include +#include #include #include #include -#include -#include -#include #define KMSAN_ALLOCA_MAGIC_ORIGIN 0xabcd0100 #define KMSAN_CHAIN_MAGIC_ORIGIN 0xabcd0200 -- 2.45.1
[PATCH v2 07/14] openrisc/thread_info: Introduce TIF_NOTIFY_IPI flag
Add support for TIF_NOTIFY_IPI on OpenRISC. With TIF_NOTIFY_IPI, a sender sending an IPI to an idle CPU in TIF_POLLING mode will set the TIF_NOTIFY_IPI flag in the target's idle task's thread_info to pull the CPU out of idle, as opposed to setting TIF_NEED_RESCHED previously. This avoids spurious calls to schedule_idle() in cases where an IPI does not necessarily wake up a task on the idle CPU. Cc: Jonas Bonn Cc: Stefan Kristiansson Cc: Stafford Horne Cc: "Rafael J. Wysocki" Cc: Daniel Lezcano Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Steven Rostedt Cc: Ben Segall Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Valentin Schneider Cc: K Prateek Nayak Cc: linux-openr...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux...@vger.kernel.org Signed-off-by: K Prateek Nayak --- v1..v2: o No changes. --- arch/openrisc/include/asm/thread_info.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/openrisc/include/asm/thread_info.h b/arch/openrisc/include/asm/thread_info.h index 4af3049c34c2..6a386703bc43 100644 --- a/arch/openrisc/include/asm/thread_info.h +++ b/arch/openrisc/include/asm/thread_info.h @@ -92,6 +92,7 @@ register struct thread_info *current_thread_info_reg asm("r10"); * mode */ #define TIF_NOTIFY_SIGNAL 5 /* signal notifications exist */ +#define TIF_NOTIFY_IPI 6 /* Pending IPI on TIF_POLLING idle CPU */ #define TIF_SYSCALL_TRACEPOINT 8 /* for ftrace syscall instrumentation */ #define TIF_RESTORE_SIGMASK 9 #define TIF_POLLING_NRFLAG 16 /* true if poll_idle() is polling * TIF_NEED_RESCHED @@ -104,6 +105,7 @@ register struct thread_info *current_thread_info_reg asm("r10"); #define _TIF_NEED_RESCHED (1<
[PATCH v4 04/35] kmsan: Increase the maximum store size to 4096
The inline assembly block in s390's chsc() stores that much. Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kmsan/instrumentation.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c index cc3907a9c33a..470b0b4afcc4 100644 --- a/mm/kmsan/instrumentation.c +++ b/mm/kmsan/instrumentation.c @@ -110,11 +110,10 @@ void __msan_instrument_asm_store(void *addr, uintptr_t size) ua_flags = user_access_save(); /* -* Most of the accesses are below 32 bytes. The two exceptions so far -* are clwb() (64 bytes) and FPU state (512 bytes). -* It's unlikely that the assembly will touch more than 512 bytes. +* Most of the accesses are below 32 bytes. The exceptions so far are +* clwb() (64 bytes), FPU state (512 bytes) and chsc() (4096 bytes). */ - if (size > 512) { + if (size > 4096) { WARN_ONCE(1, "assembly store size too big: %ld\n", size); size = 8; } -- 2.45.1
Re: [PATCH v6 2/2] pstore/ramoops: Add ramoops.mem_name= command line option
Thanks Steve, re-tested in my VM and it's working fine. Just a minor below... On 13/06/2024 12:55, Steven Rostedt wrote: > [...] > + D. Using a region of memory reserved via ``reserve_mem`` command line > +parameter. The address and size will be defined by the ``reserve_mem`` > +parameter. Note, that ``reserve_mem`` may not always allocate memory > +in the same location, and cannot be relied upon. Testing will need > +to be done, and it may not work on every machine, nor every kernel. > +Consider this a "best effort" approach. The ``reserve_mem`` option > +takes a size, alignment and name as arguments. The name is used > +to map the memory to a label that can be retrieved by ramoops. > + > + reserver_mem=2M:4096:oops ramoops.mem_name=oops > + Likely this could be fixed on merge, to avoid another version, but... s/reserver_mem/reserve_mem Cheers!
[PATCH v4 16/35] mm: slub: Unpoison the memchr_inv() return value
Even though the KMSAN warnings generated by memchr_inv() are suppressed by metadata_access_enable(), its return value may still be poisoned. The reason is that the last iteration of memchr_inv() returns `*start != value ? start : NULL`, where *start is poisoned. Because of this, somewhat counterintuitively, the shadow value computed by visitSelectInst() is equal to `(uintptr_t)start`. The intention behind guarding memchr_inv() behind metadata_access_enable() is to touch poisoned metadata without triggering KMSAN, so unpoison its return value. Acked-by: Vlastimil Babka Signed-off-by: Ilya Leoshkevich --- mm/slub.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/slub.c b/mm/slub.c index a290f6c63e7b..b9101b2dc9aa 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1185,6 +1185,7 @@ static int check_bytes_and_report(struct kmem_cache *s, struct slab *slab, metadata_access_enable(); fault = memchr_inv(kasan_reset_tag(start), value, bytes); metadata_access_disable(); + kmsan_unpoison_memory(&fault, sizeof(fault)); if (!fault) return 1; @@ -1291,6 +1292,7 @@ static void slab_pad_check(struct kmem_cache *s, struct slab *slab) metadata_access_enable(); fault = memchr_inv(kasan_reset_tag(pad), POISON_INUSE, remainder); metadata_access_disable(); + kmsan_unpoison_memory(&fault, sizeof(fault)); if (!fault) return; while (end > fault && end[-1] == POISON_INUSE) -- 2.45.1
[PATCH v4 19/35] kmsan: Accept ranges starting with 0 on s390
On s390 the virtual address 0 is valid (current CPU's lowcore is mapped there), therefore KMSAN should not complain about it. Disable the respective check on s390. There doesn't seem to be a Kconfig option to describe this situation, so explicitly check for s390. Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kmsan/init.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c index 9de76ac7062c..3f8b1bbb9060 100644 --- a/mm/kmsan/init.c +++ b/mm/kmsan/init.c @@ -33,7 +33,10 @@ static void __init kmsan_record_future_shadow_range(void *start, void *end) bool merged = false; KMSAN_WARN_ON(future_index == NUM_FUTURE_RANGES); - KMSAN_WARN_ON((nstart >= nend) || !nstart || !nend); + KMSAN_WARN_ON((nstart >= nend) || + /* Virtual address 0 is valid on s390. */ + (!IS_ENABLED(CONFIG_S390) && !nstart) || + !nend); nstart = ALIGN_DOWN(nstart, PAGE_SIZE); nend = ALIGN(nend, PAGE_SIZE); -- 2.45.1
Re: [PATCH v5 0/8] DAMON based tiered memory management for CXL memory
Hi Honggyu, On Thu, 13 Jun 2024 22:20:47 +0900 Honggyu Kim wrote: > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously > posted at [1]. > > It says there is no implementation of the demote/promote DAMOS action > are made. This patch series is about its implementation for physical > address space so that this scheme can be applied in system wide level. > > Changes from RFC v4: > https://lore.kernel.org/20240512175447.75943-1...@kernel.org > 1. Add usage and design documents > 2. Rename alloc_demote_folio to alloc_migrate_folio > 3. Add evaluation results with "demotion_enabled" true > 4. Rebase based on v6.10-rc3 I left comments on the new patches for the documentation. [...] > > Evaluation Results > == > > All the result values are normalized to DRAM-only execution time because > the workload cannot be faster than DRAM-only unless the workload hits > the peak bandwidth but our redis test doesn't go beyond the bandwidth > limit. > > So the DRAM-only execution time is the ideal result without affected by > the gap between DRAM and CXL performance difference. The NUMA node > environment is as follows. > > node0 - local DRAM, 512GB with a CPU socket (fast tier) > node1 - disabled > node2 - CXL DRAM, 96GB, no CPU attached (slow tier) > > The following is the result of generating zipfian distribution to > redis-server and the numbers are averaged by 50 times of execution. > > 1. YCSB zipfian distribution read only workload > memory pressure with cold memory on node0 with 512GB of local DRAM. 
> > ++= > | cold memory occupied by mmap and memset | > | 0G 440G 450G 460G 470G 480G 490G 500G | > > ++= > Execution time normalized to DRAM-only values| > GEOMEAN > > ++- > DRAM-only | 1.00 - - - - - - - | 1.00 > CXL-only| 1.19 - - - - - - - | 1.19 > default |- 1.00 1.05 1.08 1.12 1.14 1.18 1.18 | 1.11 > DAMON tiered|- 1.03 1.03 1.03 1.03 1.03 1.07 *1.05 | 1.04 > DAMON lazy |- 1.04 1.03 1.04 1.05 1.06 1.06 *1.06 | 1.05 > > ++= > CXL usage of redis-server in GB | > AVERAGE > > ++- > DRAM-only | 0.0 - - - - - - - | 0.0 > CXL-only| 51.4 - - - - - - - | 51.4 > default |- 0.6 10.6 20.5 30.5 40.5 47.6 50.4 | 28.7 > DAMON tiered|- 0.6 0.5 0.4 0.7 0.8 7.1 5.6 | 2.2 > DAMON lazy |- 0.5 3.0 4.5 5.4 6.4 9.4 9.1 | 5.5 > > ++= > > Each test result is based on the exeuction environment as follows. Nit. s/exeuction/execution/ [...] > In summary, the evaluation results show that DAMON memory management > with DAMOS_MIGRATE_{HOT,COLD} actions reduces the performance slowdown > compared to the "default" memory policy from 11% to 3~5% when the system > runs with high memory pressure on its fast tier DRAM nodes. > > Having these DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD actions can make > tiered memory systems run more efficiently under high memory pressures. Thank you very much for continuing this great work. Other than trivial comments on documentation patches and the above typo, I have no particular concern on this patchset. I'm looking forward to the next version. Thanks, SJ [...]
Re: [PATCH v5 8/8] Docs/.../mm/damon: add more damos actions
On Thu, 13 Jun 2024 07:07:29 -0700 SeongJae Park wrote: > Hi Honggyu, > > On Thu, 13 Jun 2024 22:20:55 +0900 Honggyu Kim wrote: > > > This patch adds damon description for "migrate_hot" and "migrate_cold" > > actions for both usage and design documents, as well as a new > > "target_nid" knob to set the migration target node. [...] > Except the DAMON debugfs interface section, this patch looks good to me. One more thing. I'd prefer making the subject more specific. What about, "Docs/damon: document damos_migrate_{hot,cold}"? Thanks, SJ [...]
[PATCH v4 23/35] s390/checksum: Add a KMSAN check
Add a KMSAN check to the CKSM inline assembly, similar to how it was done for ASAN in commit e42ac7789df6 ("s390/checksum: always use cksm instruction"). Acked-by: Alexander Gordeev Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- arch/s390/include/asm/checksum.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/s390/include/asm/checksum.h b/arch/s390/include/asm/checksum.h index b89159591ca0..46f5c9660616 100644 --- a/arch/s390/include/asm/checksum.h +++ b/arch/s390/include/asm/checksum.h @@ -13,6 +13,7 @@ #define _S390_CHECKSUM_H #include +#include #include static inline __wsum cksm(const void *buff, int len, __wsum sum) @@ -23,6 +24,7 @@ static inline __wsum cksm(const void *buff, int len, __wsum sum) }; instrument_read(buff, len); + kmsan_check_memory(buff, len); asm volatile("\n" "0: cksm%[sum],%[rp]\n" " jo 0b\n" -- 2.45.1
Re: [PATCH] ARM: dts: qcom: motorola-falcon: add accelerometer, magnetometer
On 6/13/24 19:18, Stanislav Jakubek wrote: On Thu, Jun 13, 2024 at 09:48:26AM +0200, Konrad Dybcio wrote: On 6/9/24 13:05, Stanislav Jakubek wrote: Add the accelerometer and magnetometer that are present on the Motorola Moto G (2013) device. Signed-off-by: Stanislav Jakubek --- [...] +_i2c2 { Consider setting a clock-frequency = <> Hi Konrad, checking downstream [1], qcom,i2c-bus-freq for I2C2 is 10, which seems to be the default. Should I specify it anyway? Yes, at the very least to silence the dmesg warning ;) Konrad
Re: [PATCH v5 2/8] mm: rename alloc_demote_folio to alloc_migrate_folio
On Thu, 13 Jun 2024 22:20:49 +0900 Honggyu Kim wrote: > The alloc_demote_folio can also be used for general migration including > both demotion and promotion so it'd be better to rename it from > alloc_demote_folio to alloc_migrate_folio. > > Signed-off-by: Honggyu Kim Reviewed-by: SeongJae Park Thanks, SJ [...]
[PATCH v4 01/35] ftrace: Unpoison ftrace_regs in ftrace_ops_list_func()
Architectures use assembly code to initialize ftrace_regs and call ftrace_ops_list_func(). Therefore, from the KMSAN's point of view, ftrace_regs is poisoned on ftrace_ops_list_func() entry. This causes KMSAN warnings when running the ftrace testsuite. Fix by trusting the architecture-specific assembly code and always unpoisoning ftrace_regs in ftrace_ops_list_func. So far the issue was not encountered on x86_64, but only by accident: the assembly-allocated ftrace_regs was overlapping a stale partially unpoisoned stack frame. Poisoning stack frames before returns [1] makes the issue appear on x86_64 as well. [1] https://github.com/iii-i/llvm-project/commits/msan-poison-allocas-before-returning-2024-06-12/ Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- kernel/trace/ftrace.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 65208d3b5ed9..c35ad4362d71 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -7407,6 +7407,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip, void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip, struct ftrace_ops *op, struct ftrace_regs *fregs) { + kmsan_unpoison_memory(fregs, sizeof(*fregs)); __ftrace_ops_list_func(ip, parent_ip, NULL, fregs); } #else -- 2.45.1
[PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
Reserve unspecified location of physical memory from kernel command line Background: In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract dmesg output and some other information when a crash happens in the field. (This is only done when the user selects "Allow Google to collect data for improving the system"). But there are cases when there's a bug that requires more data to be retrieved to figure out what is happening. We would like to increase the pstore size, either temporarily, or maybe even permanently. The pstore on these devices is at a fixed location in RAM (as the RAM is not cleared on soft reboots nor crashes). The location is chosen by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86. There's a driver that queries for this to initialize the pstore for ChromeOS: See drivers/platform/chrome/chromeos_pstore.c Problem: The problem is that, even though there's a process to change the kernel on these systems, and it is done regularly to install updates, the firmware is updated much less frequently. Choosing the place in RAM also takes special care, and may be at a different address for different boards. Updating the size via firmware is a large effort and not something that many are willing to do for a temporary pstore size change. Requirement: Need a way to reserve memory that will be at a consistent location for every boot, if the kernel and system are the same. Does not need to work if rebooting to a different kernel, or if the system can change the memory layout between boots. The reserved memory cannot be a hard-coded address, as the same kernel / command line needs to run on several different machines. The picked memory reservation just needs to be the same for a given machine, but may be different for different machines. Solution: The solution I have come up with is to introduce a new "reserve_mem=" kernel command line parameter. 
This parameter takes the following format: reserve_mem=nn:align:label Where nn is the size of memory to reserve, the align is the alignment of that memory, and label is the way for other sub-systems to find that memory. This way the kernel command line could have: reserve_mem=12M:4096:oops ramoops.mem_name=oops At boot up, the kernel will search for 12 megabytes in usable memory regions with an alignment of 4096. It will start at the highest regions and work its way down (for those old devices that want access to lower address DMA). When it finds a region, it will save it off in a small table and mark it with the "oops" label. Then the pstore ramoops sub-system could ask for that memory and location, and it will map itself there. This prototype allows for 8 different mappings (which may be overkill, 4 is probably plenty) with 16 byte size to store the label. I have tested this and it works for us to solve the above problem. We can update the kernel and command line and increase the size of pstore without needing to update the firmware, or knowing every memory layout of each board. I only tested this locally, it has not been tested in the field. Changes since v5: https://lore.kernel.org/all/20240613003435.401549...@goodmis.org/ [ patch at bottom showing differences ] - Stressed more that this is a best effort use case - Updated ramoops.rst to document this new feature - Used a new variable "tmp" to use in reserve_mem_find_by_name() instead of using "size" and possibly corrupting it. Changes since v4: https://lore.kernel.org/all/20240611215610.548854...@goodmis.org/ - Add all checks about reserve_mem before allocation. This means reserved_mem_add() is now a void function. - Check for name duplications. 
- Fix compare of align to SMP_CACHE_BYTES ("<" instead of "<=") Changes since v3: https://lore.kernel.org/all/20240611144911.327227...@goodmis.org/ - Changed table type of start and size from unsigned long to phys_addr_t (as well as the parameters to the functions that use them) - Changed old reference to "early_reserve_mem" to "reserve_mem" - Check before reserving memory: o Size is non-zero o name has text in it - If align is less than SMP_CACHE_BYTES, make it SMP_CACHE_BYTES - Remove the silly check of testing *p == '\0' after a p += strlen(p) Changes since v2: https://lore.kernel.org/all/20240606150143.876469...@goodmis.org/ - Fixed typo of "reserver" - Added EXPORT_SYMBOL_GPL() for reserve_mem_find_by_name() - Removed "built-in" from module description that was changed from v1. Changes since v1: https://lore.kernel.org/all/2024060320.801075...@goodmis.org/ - Updated the change log of the first patch as well as added an entry into kernel-parameters.txt about how reserve_mem is for soft reboots and may not be reliable. Steven Rostedt (Google) (2): mm/memblock: Add "reserve_mem" to reserved named memory at boot up pstore/ramoops: Add ramoops.mem_name= command line option Documentation/admin-guide/kernel-parameters.txt | 22 +
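The reserve_mem=nn:align:label format described in the cover letter can be sketched as a small parser. This is illustrative only — parse_reserve_mem is an invented name, and the K/M/G suffix handling is an assumption modeled on the kernel's usual memparse() convention for command-line sizes:

```python
def parse_reserve_mem(arg: str):
    """Parse 'nn:align:label' into (size_bytes, align_bytes, label)."""
    size_s, align_s, label = arg.split(":", 2)
    mult = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

    def memparse(s: str) -> int:
        # Accept an optional K/M/G suffix, as kernel size parameters do
        if s and s[-1].upper() in mult:
            return int(s[:-1]) * mult[s[-1].upper()]
        return int(s, 0)

    return memparse(size_s), memparse(align_s), label

# The example from the cover letter: 12 MiB, 4096-byte aligned, label "oops"
assert parse_reserve_mem("12M:4096:oops") == (12 << 20, 4096, "oops")
```

A consumer such as ramoops would then look the label up (ramoops.mem_name=oops) to receive the address and size that were reserved at boot.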
Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
On Thu, 13 Jun 2024 18:54:12 +0200 Alexander Graf wrote: > > Do you have a "real" pstore on these systems that you could store > non-volatile variables in, such as persistent UEFI variables? If so, you > could create an actually persistent mapping for your trace pstore even > across kernel version updates as a general mechanism to create reserved > memblocks at fixed offsets. After implementing all this, I don't think I can use pstore for my purpose. pstore is a generic interface for persistent storage, and requires an interface to access it. From what I understand, it's not the place to just ask for an area of RAM. For this, I have a single patch that allows the tracing instance to use an area reserved by reserve_mem. reserve_mem=12M:4096:trace trace_instance=boot_mapped@trace I've already tested this on qemu and a couple of chromebooks. It works well. > > > > Requirement: > > > > Need a way to reserve memory that will be at a consistent location for > > every boot, if the kernel and system are the same. Does not need to work > > if rebooting to a different kernel, or if the system can change the > > memory layout between boots. > > > > The reserved memory can not be an hard coded address, as the same kernel / > > command line needs to run on several different machines. The picked memory > > reservation just needs to be the same for a given machine, but may be > > > With KASLR is enabled, doesn't this approach break too often to be > reliable enough for the data you want to extract? > > Picking up the idea above, with a persistent variable we could even make > KASLR avoid that reserved pstore region in its search for a viable KASLR > offset. I think I was hit by it once in all my testing. For our use case, the few times it fails to map is not going to affect what we need this for at all. -- Steve
[PATCH v4 17/35] mm: kfence: Disable KMSAN when checking the canary
KMSAN warns about check_canary() accessing the canary. The reason is that, even though set_canary() is properly instrumented and sets shadow, slub explicitly poisons the canary's address range afterwards. Unpoisoning the canary is not the right thing to do: only check_canary() is supposed to ever touch it. Instead, disable KMSAN checks around canary read accesses. Reviewed-by: Alexander Potapenko Tested-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kfence/core.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/mm/kfence/core.c b/mm/kfence/core.c index 964b8482275b..cce330d5b797 100644 --- a/mm/kfence/core.c +++ b/mm/kfence/core.c @@ -305,8 +305,14 @@ metadata_update_state(struct kfence_metadata *meta, enum kfence_object_state nex WRITE_ONCE(meta->state, next); } +#ifdef CONFIG_KMSAN +#define CHECK_CANARY_ATTRIBUTES noinline __no_kmsan_checks +#else +#define CHECK_CANARY_ATTRIBUTES inline +#endif + /* Check canary byte at @addr. */ -static inline bool check_canary_byte(u8 *addr) +static CHECK_CANARY_ATTRIBUTES bool check_canary_byte(u8 *addr) { struct kfence_metadata *meta; unsigned long flags; @@ -341,7 +347,8 @@ static inline void set_canary(const struct kfence_metadata *meta) *((u64 *)addr) = KFENCE_CANARY_PATTERN_U64; } -static inline void check_canary(const struct kfence_metadata *meta) +static CHECK_CANARY_ATTRIBUTES void +check_canary(const struct kfence_metadata *meta) { const unsigned long pageaddr = ALIGN_DOWN(meta->addr, PAGE_SIZE); unsigned long addr = pageaddr; -- 2.45.1
[PATCH v4 22/35] s390/boot: Add the KMSAN runtime stub
It should be possible to have inline functions in the s390 header files, which call kmsan_unpoison_memory(). The problem is that these header files might be included by the decompressor, which does not contain the KMSAN runtime, causing linker errors. Not compiling these calls if __SANITIZE_MEMORY__ is not defined - either by changing kmsan-checks.h or at the call sites - may cause unintended side effects, since calling these functions from uninstrumented code that is linked into the kernel is a valid use case. One might want to explicitly distinguish between the kernel and the decompressor. Checking for a decompressor-specific #define is quite heavy-handed, and will have to be done at all call sites. A more generic approach is to provide a dummy kmsan_unpoison_memory() definition. This produces some runtime overhead, but only when building with CONFIG_KMSAN. The benefit is that it does not disturb the existing KMSAN build logic and call sites don't need to be changed. Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- arch/s390/boot/Makefile | 1 + arch/s390/boot/kmsan.c | 6 ++ 2 files changed, 7 insertions(+) create mode 100644 arch/s390/boot/kmsan.c diff --git a/arch/s390/boot/Makefile b/arch/s390/boot/Makefile index 526ed20b9d31..e7658997452b 100644 --- a/arch/s390/boot/Makefile +++ b/arch/s390/boot/Makefile @@ -44,6 +44,7 @@ obj-$(findstring y, $(CONFIG_PROTECTED_VIRTUALIZATION_GUEST) $(CONFIG_PGSTE)) += obj-$(CONFIG_RANDOMIZE_BASE) += kaslr.o obj-y += $(if $(CONFIG_KERNEL_UNCOMPRESSED),,decompressor.o) info.o obj-$(CONFIG_KERNEL_ZSTD) += clz_ctz.o +obj-$(CONFIG_KMSAN) += kmsan.o obj-all := $(obj-y) piggy.o syms.o targets:= bzImage section_cmp.boot.data section_cmp.boot.preserved.data $(obj-y) diff --git a/arch/s390/boot/kmsan.c b/arch/s390/boot/kmsan.c new file mode 100644 index ..e7b3ac48143e --- /dev/null +++ b/arch/s390/boot/kmsan.c @@ -0,0 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0 +#include + +void kmsan_unpoison_memory(const void 
*address, size_t size) +{ +} -- 2.45.1
Re: [PATCH v3 1/1] dt-bindings: remoteproc: imx_rproc: add minItems for power-domain
On Mon, 10 Jun 2024 11:17:21 -0400, Frank Li wrote: > "fsl,imx8qxp-cm4" and "fsl,imx8qm-cm4" need minimum 2 power domains. Other > platform doesn't require 'power-domain'. > > Signed-off-by: Frank Li > --- > > Notes: > Change from v2 to v3 > - only imx8qxp and imx8qm need power-domain, other platform don't need it. > - update commit message. > > Change from v1 to v2 > - set minitem to 2 at top > - Add imx8qm compatible string also > - use not logic to handle difference compatible string restriction > - update commit message. > > pass dt_binding_check. > > make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -j8 dt_binding_check > DT_SCHEMA_FILES=fsl,imx-rproc.yaml > SCHEMA Documentation/devicetree/bindings/processed-schema.json > CHKDT Documentation/devicetree/bindings > LINTDocumentation/devicetree/bindings > DTEX > Documentation/devicetree/bindings/remoteproc/fsl,imx-rproc.example.dts > DTC_CHK > Documentation/devicetree/bindings/remoteproc/fsl,imx-rproc.example.dtb > > .../bindings/remoteproc/fsl,imx-rproc.yaml| 15 +++ > 1 file changed, 15 insertions(+) > Reviewed-by: Rob Herring (Arm)
Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
Hey Steve, On 13.06.24 17:55, Steven Rostedt wrote: Reserve unspecified location of physical memory from kernel command line Background: In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract dmesg output and some other information when a crash happens in the field. (This is only done when the user selects "Allow Google to collect data for improving the system"). But there are cases when there's a bug that requires more data to be retrieved to figure out what is happening. We would like to increase the pstore size, either temporarily, or maybe even permanently. The pstore on these devices are at a fixed location in RAM (as the RAM is not cleared on soft reboots nor crashes). The location is chosen by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86. There's a driver that queries for this to initialize the pstore for ChromeOS: See drivers/platform/chrome/chromeos_pstore.c Problem: The problem is that, even though there's a process to change the kernel on these systems, and is done regularly to install updates, the firmware is updated much less frequently. Choosing the place in RAM also takes special care, and may be in a different address for different boards. Updating the size via firmware is a large effort and not something that many are willing to do for a temporary pstore size change. (sorry for not commenting on earlier versions, I didn't see v1-v5 in my inbox) Do you have a "real" pstore on these systems that you could store non-volatile variables in, such as persistent UEFI variables? If so, you could create an actually persistent mapping for your trace pstore even across kernel version updates as a general mechanism to create reserved memblocks at fixed offsets. Requirement: Need a way to reserve memory that will be at a consistent location for every boot, if the kernel and system are the same. Does not need to work if rebooting to a different kernel, or if the system can change the memory layout between boots. 
The reserved memory can not be an hard coded address, as the same kernel / command line needs to run on several different machines. The picked memory reservation just needs to be the same for a given machine, but may be With KASLR is enabled, doesn't this approach break too often to be reliable enough for the data you want to extract? Picking up the idea above, with a persistent variable we could even make KASLR avoid that reserved pstore region in its search for a viable KASLR offset. Alex Amazon Web Services Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B Sitz: Berlin Ust-ID: DE 365 538 597
[PATCH v4 18/35] lib/zlib: Unpoison DFLTCC output buffers
The constraints of the DFLTCC inline assembly are not precise: they do not communicate the size of the output buffers to the compiler, so it cannot automatically instrument it. Add the manual kmsan_unpoison_memory() calls for the output buffers. The logic is the same as in [1]. [1] https://github.com/zlib-ng/zlib-ng/commit/1f5ddcc009ac3511e99fc88736a9e1a6381168c5 Reported-by: Alexander Gordeev Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- lib/zlib_dfltcc/dfltcc.h | 1 + lib/zlib_dfltcc/dfltcc_util.h | 28 2 files changed, 29 insertions(+) diff --git a/lib/zlib_dfltcc/dfltcc.h b/lib/zlib_dfltcc/dfltcc.h index b96232bdd44d..0f2a16d7a48a 100644 --- a/lib/zlib_dfltcc/dfltcc.h +++ b/lib/zlib_dfltcc/dfltcc.h @@ -80,6 +80,7 @@ struct dfltcc_param_v0 { uint8_t csb[1152]; }; +static_assert(offsetof(struct dfltcc_param_v0, csb) == 384); static_assert(sizeof(struct dfltcc_param_v0) == 1536); #define CVT_CRC32 0 diff --git a/lib/zlib_dfltcc/dfltcc_util.h b/lib/zlib_dfltcc/dfltcc_util.h index 4a46b5009f0d..10509270d822 100644 --- a/lib/zlib_dfltcc/dfltcc_util.h +++ b/lib/zlib_dfltcc/dfltcc_util.h @@ -2,6 +2,8 @@ #ifndef DFLTCC_UTIL_H #define DFLTCC_UTIL_H +#include "dfltcc.h" +#include #include /* @@ -20,6 +22,7 @@ typedef enum { #define DFLTCC_CMPR 2 #define DFLTCC_XPND 4 #define HBT_CIRCULAR (1 << 7) +#define DFLTCC_FN_MASK ((1 << 7) - 1) #define HB_BITS 15 #define HB_SIZE (1 << HB_BITS) @@ -34,6 +37,7 @@ static inline dfltcc_cc dfltcc( ) { Byte *t2 = op1 ? *op1 : NULL; +unsigned char *orig_t2 = t2; size_t t3 = len1 ? *len1 : 0; const Byte *t4 = op2 ? *op2 : NULL; size_t t5 = len2 ? *len2 : 0; @@ -59,6 +63,30 @@ static inline dfltcc_cc dfltcc( : "cc", "memory"); t2 = r2; t3 = r3; t4 = r4; t5 = r5; +/* + * Unpoison the parameter block and the output buffer. + * This is a no-op in non-KMSAN builds. 
+ */ +switch (fn & DFLTCC_FN_MASK) { +case DFLTCC_QAF: +kmsan_unpoison_memory(param, sizeof(struct dfltcc_qaf_param)); +break; +case DFLTCC_GDHT: +kmsan_unpoison_memory(param, offsetof(struct dfltcc_param_v0, csb)); +break; +case DFLTCC_CMPR: +kmsan_unpoison_memory(param, sizeof(struct dfltcc_param_v0)); +kmsan_unpoison_memory( +orig_t2, +t2 - orig_t2 + +(((struct dfltcc_param_v0 *)param)->sbb == 0 ? 0 : 1)); +break; +case DFLTCC_XPND: +kmsan_unpoison_memory(param, sizeof(struct dfltcc_param_v0)); +kmsan_unpoison_memory(orig_t2, t2 - orig_t2); +break; +} + if (op1) *op1 = t2; if (len1) -- 2.45.1
Re: [PATCH v4 01/35] ftrace: Unpoison ftrace_regs in ftrace_ops_list_func()
On Thu, 13 Jun 2024 17:34:03 +0200 Ilya Leoshkevich wrote: > Architectures use assembly code to initialize ftrace_regs and call > ftrace_ops_list_func(). Therefore, from the KMSAN's point of view, > ftrace_regs is poisoned on ftrace_ops_list_func entry(). This causes > KMSAN warnings when running the ftrace testsuite. > > Fix by trusting the architecture-specific assembly code and always > unpoisoning ftrace_regs in ftrace_ops_list_func. > > The issue was not encountered on x86_64 so far only by accident: > assembly-allocated ftrace_regs was overlapping a stale partially > unpoisoned stack frame. Poisoning stack frames before returns [1] > makes the issue appear on x86_64 as well. > > [1] > https://github.com/iii-i/llvm-project/commits/msan-poison-allocas-before-returning-2024-06-12/ > > Reviewed-by: Alexander Potapenko > Signed-off-by: Ilya Leoshkevich > --- Acked-by: Steven Rostedt (Google) -- Steve > kernel/trace/ftrace.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c > index 65208d3b5ed9..c35ad4362d71 100644 > --- a/kernel/trace/ftrace.c > +++ b/kernel/trace/ftrace.c > @@ -7407,6 +7407,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long > parent_ip, > void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip, > struct ftrace_ops *op, struct ftrace_regs *fregs) > { > + kmsan_unpoison_memory(fregs, sizeof(*fregs)); > __ftrace_ops_list_func(ip, parent_ip, NULL, fregs); > } > #else
[PATCH v4 34/35] s390: Implement the architecture-specific KMSAN functions
arch_kmsan_get_meta_or_null() finds the lowcore shadow by querying the prefix and calling kmsan_get_metadata() again. kmsan_virt_addr_valid() delegates to virt_addr_valid(). Reviewed-by: Alexander Potapenko Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/include/asm/kmsan.h | 43 +++ 1 file changed, 43 insertions(+) create mode 100644 arch/s390/include/asm/kmsan.h diff --git a/arch/s390/include/asm/kmsan.h b/arch/s390/include/asm/kmsan.h new file mode 100644 index ..e572686d340c --- /dev/null +++ b/arch/s390/include/asm/kmsan.h @@ -0,0 +1,43 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_S390_KMSAN_H +#define _ASM_S390_KMSAN_H + +#include +#include +#include +#include +#include + +#ifndef MODULE + +static inline bool is_lowcore_addr(void *addr) +{ + return addr >= (void *)_lowcore && + addr < (void *)(_lowcore + 1); +} + +static inline void *arch_kmsan_get_meta_or_null(void *addr, bool is_origin) +{ + if (is_lowcore_addr(addr)) { + /* +* Different lowcores accessed via S390_lowcore are described +* by the same struct page. Resolve the prefix manually in +* order to get a distinct struct page. +*/ + addr += (void *)lowcore_ptr[raw_smp_processor_id()] - + (void *)_lowcore; + if (WARN_ON_ONCE(is_lowcore_addr(addr))) + return NULL; + return kmsan_get_metadata(addr, is_origin); + } + return NULL; +} + +static inline bool kmsan_virt_addr_valid(void *addr) +{ + return virt_addr_valid(addr); +} + +#endif /* !MODULE */ + +#endif /* _ASM_S390_KMSAN_H */ -- 2.45.1
[PATCH v4 31/35] s390/traps: Unpoison the kernel_stack_overflow()'s pt_regs
This is normally done by the generic entry code, but the kernel_stack_overflow() flow bypasses it. Reviewed-by: Alexander Potapenko Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/kernel/traps.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/s390/kernel/traps.c b/arch/s390/kernel/traps.c index 52578b5cecbd..dde69d2a64f0 100644 --- a/arch/s390/kernel/traps.c +++ b/arch/s390/kernel/traps.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include @@ -262,6 +263,11 @@ static void monitor_event_exception(struct pt_regs *regs) void kernel_stack_overflow(struct pt_regs *regs) { + /* +* Normally regs are unpoisoned by the generic entry code, but +* kernel_stack_overflow() is a rare case that is called bypassing it. +*/ + kmsan_unpoison_entry_regs(regs); bust_spinlocks(1); printk("Kernel stack overflow.\n"); show_regs(regs); -- 2.45.1
[PATCH v4 27/35] s390/ftrace: Unpoison ftrace_regs in kprobe_ftrace_handler()
s390 uses assembly code to initialize ftrace_regs and call kprobe_ftrace_handler(). Therefore, from the KMSAN's point of view, ftrace_regs is poisoned on kprobe_ftrace_handler() entry. This causes KMSAN warnings when running the ftrace testsuite. Fix by trusting the assembly code and always unpoisoning ftrace_regs in kprobe_ftrace_handler(). Reviewed-by: Alexander Potapenko Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/kernel/ftrace.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c index ddf2ee47cb87..0bd6adc40a34 100644 --- a/arch/s390/kernel/ftrace.c +++ b/arch/s390/kernel/ftrace.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include @@ -303,6 +304,7 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip, if (bit < 0) return; + kmsan_unpoison_memory(fregs, sizeof(*fregs)); regs = ftrace_get_regs(fregs); p = get_kprobe((kprobe_opcode_t *)ip); if (!regs || unlikely(!p) || kprobe_disabled(p)) -- 2.45.1
[PATCH v6 1/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
From: "Steven Rostedt (Google)" In order to allow for requesting a memory region that can be used for things like pstore on multiple machines where the memory layout is not the same, add a new option to the kernel command line called "reserve_mem". The format is: reserve_mem=nn:align:name Where it will find nn amount of memory at the given alignment of align. The name field is to allow another subsystem to retrieve where the memory was found. For example: reserve_mem=12M:4096:oops ramoops.mem_name=oops Where ramoops.mem_name will tell ramoops that memory was reserved for it via the reserve_mem option and it can find it by calling: if (reserve_mem_find_by_name("oops", , )) { // start holds the start address and size holds the size given This is typically used for systems that do not wipe the RAM, and this command line will try to reserve the same physical memory on soft reboots. Note, it is not guaranteed to be the same location. For example, if KASLR places the kernel at the location of where the RAM reservation was from a previous boot, the new reservation will be at a different location. Any subsystem using this feature must add a way to verify that the contents of the physical memory is from a previous boot, as there may be cases where the memory will not be located at the same location. Not all systems may work either. There could be bit flips if the reboot goes through the BIOS. Using kexec to reboot the machine is likely to have better results in such cases. Link: https://lore.kernel.org/all/zjjvnzux3nzig...@kernel.org/ Suggested-by: Mike Rapoport Tested-by: Guilherme G. 
Piccoli Signed-off-by: Steven Rostedt (Google) --- .../admin-guide/kernel-parameters.txt | 22 include/linux/mm.h| 2 + mm/memblock.c | 117 ++ 3 files changed, 141 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index b600df82669d..56e18b1a520d 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5710,6 +5710,28 @@ them. If is less than 0x1, the region is assumed to be I/O ports; otherwise it is memory. + reserve_mem=[RAM] + Format: nn[KNG]:: + Reserve physical memory and label it with a name that + other subsystems can use to access it. This is typically + used for systems that do not wipe the RAM, and this command + line will try to reserve the same physical memory on + soft reboots. Note, it is not guaranteed to be the same + location. For example, if anything about the system changes + or if booting a different kernel. It can also fail if KASLR + places the kernel at the location of where the RAM reservation + was from a previous boot, the new reservation will be at a + different location. + Any subsystem using this feature must add a way to verify + that the contents of the physical memory is from a previous + boot, as there may be cases where the memory will not be + located at the same location. 
+ + The format is size:align:label for example, to request + 12 megabytes of 4096 alignment for ramoops: + + reserve_mem=12M:4096:oops ramoops.mem_name=oops + reservetop= [X86-32,EARLY] Format: nn[KMG] Reserves a hole at the top of the kernel virtual diff --git a/include/linux/mm.h b/include/linux/mm.h index 9849dfda44d4..077fb589b88a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4263,4 +4263,6 @@ static inline bool pfn_is_unaccepted_memory(unsigned long pfn) void vma_pgtable_walk_begin(struct vm_area_struct *vma); void vma_pgtable_walk_end(struct vm_area_struct *vma); +int reserve_mem_find_by_name(const char *name, phys_addr_t *start, phys_addr_t *size); + #endif /* _LINUX_MM_H */ diff --git a/mm/memblock.c b/mm/memblock.c index d09136e040d3..b7b0e8c3868d 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -2244,6 +2244,123 @@ void __init memblock_free_all(void) totalram_pages_add(pages); } +/* Keep a table to reserve named memory */ +#define RESERVE_MEM_MAX_ENTRIES8 +#define RESERVE_MEM_NAME_SIZE 16 +struct reserve_mem_table { + charname[RESERVE_MEM_NAME_SIZE]; + phys_addr_t start; + phys_addr_t size; +}; +static struct reserve_mem_table reserved_mem_table[RESERVE_MEM_MAX_ENTRIES]; +static int reserved_mem_count; + +/* Add wildcard
[PATCH v6 2/2] pstore/ramoops: Add ramoops.mem_name= command line option
From: "Steven Rostedt (Google)" Add a method to find a region specified by reserve_mem=nn:align:name for ramoops. Adding a kernel command line parameter: reserve_mem=12M:4096:oops ramoops.mem_name=oops Will use the size and location defined by the memmap parameter where it finds the memory and labels it "oops". The "oops" in the ramoops option is used to search for it. This allows for arbitrary RAM to be used for ramoops if it is known that the memory is not cleared on kernel crashes or soft reboots. Tested-by: Guilherme G. Piccoli Signed-off-by: Steven Rostedt (Google) --- Documentation/admin-guide/ramoops.rst | 13 + fs/pstore/ram.c | 14 ++ 2 files changed, 27 insertions(+) diff --git a/Documentation/admin-guide/ramoops.rst b/Documentation/admin-guide/ramoops.rst index e9f85142182d..6f534a707b2a 100644 --- a/Documentation/admin-guide/ramoops.rst +++ b/Documentation/admin-guide/ramoops.rst @@ -23,6 +23,8 @@ and type of the memory area are set using three variables: * ``mem_size`` for the size. The memory size will be rounded down to a power of two. * ``mem_type`` to specify if the memory type (default is pgprot_writecombine). + * ``mem_name`` to specify a memory region defined by ``reserve_mem`` command +line parameter. Typically the default value of ``mem_type=0`` should be used as that sets the pstore mapping to pgprot_writecombine. Setting ``mem_type=1`` attempts to use @@ -118,6 +120,17 @@ Setting the ramoops parameters can be done in several different manners: return ret; } + D. Using a region of memory reserved via ``reserve_mem`` command line +parameter. The address and size will be defined by the ``reserve_mem`` +parameter. Note, that ``reserve_mem`` may not always allocate memory +in the same location, and cannot be relied upon. Testing will need +to be done, and it may not work on every machine, nor every kernel. +Consider this a "best effort" approach. The ``reserve_mem`` option +takes a size, alignment and name as arguments. 
The name is used +to map the memory to a label that can be retrieved by ramoops. + + reserver_mem=2M:4096:oops ramoops.mem_name=oops + You can specify either RAM memory or peripheral devices' memory. However, when specifying RAM, be sure to reserve the memory by issuing memblock_reserve() very early in the architecture code, e.g.:: diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c index b1a455f42e93..4311fcbc84f2 100644 --- a/fs/pstore/ram.c +++ b/fs/pstore/ram.c @@ -50,6 +50,10 @@ module_param_hw(mem_address, ullong, other, 0400); MODULE_PARM_DESC(mem_address, "start of reserved RAM used to store oops/panic logs"); +static char *mem_name; +module_param_named(mem_name, mem_name, charp, 0400); +MODULE_PARM_DESC(mem_name, "name of kernel param that holds addr"); + static ulong mem_size; module_param(mem_size, ulong, 0400); MODULE_PARM_DESC(mem_size, @@ -914,6 +918,16 @@ static void __init ramoops_register_dummy(void) { struct ramoops_platform_data pdata; + if (mem_name) { + phys_addr_t start; + phys_addr_t size; + + if (reserve_mem_find_by_name(mem_name, , )) { + mem_address = start; + mem_size = size; + } + } + /* * Prepare a dummy platform data structure to carry the module * parameters. If mem_size isn't set, then there are no module -- 2.43.0
[PATCH v4 32/35] s390/uaccess: Add KMSAN support to put_user() and get_user()
put_user() uses inline assembly with precise constraints, so Clang is in principle capable of instrumenting it automatically. Unfortunately, one of the constraints contains a dereferenced user pointer, and Clang does not currently distinguish user and kernel pointers. Therefore KMSAN attempts to access shadow for user pointers, which is not a right thing to do. An obvious fix to add __no_sanitize_memory to __put_user_fn() does not work, since it's __always_inline. And __always_inline cannot be removed due to the __put_user_bad() trick. A different obvious fix of using the "a" instead of the "+Q" constraint degrades the code quality, which is very important here, since it's a hot path. Instead, repurpose the __put_user_asm() macro to define __put_user_{char,short,int,long}_noinstr() functions and mark them with __no_sanitize_memory. For the non-KMSAN builds make them __always_inline in order to keep the generated code quality. Also define __put_user_{char,short,int,long}() functions, which call the aforementioned ones and which *are* instrumented, because they call KMSAN hooks, which may be implemented as macros. The same applies to get_user() as well. 
Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/include/asm/uaccess.h | 111 +++- 1 file changed, 79 insertions(+), 32 deletions(-) diff --git a/arch/s390/include/asm/uaccess.h b/arch/s390/include/asm/uaccess.h index 81ae8a98e7ec..c3c26dd1fc04 100644 --- a/arch/s390/include/asm/uaccess.h +++ b/arch/s390/include/asm/uaccess.h @@ -78,13 +78,24 @@ union oac { int __noreturn __put_user_bad(void); -#define __put_user_asm(to, from, size) \ -({ \ +#ifdef CONFIG_KMSAN +#define GET_PUT_USER_NOINSTR_ATTRIBUTES \ + noinline __maybe_unused __no_sanitize_memory +#else +#define GET_PUT_USER_NOINSTR_ATTRIBUTES __always_inline +#endif + +#define DEFINE_PUT_USER(type) \ +static GET_PUT_USER_NOINSTR_ATTRIBUTES int \ +__put_user_##type##_noinstr(unsigned type __user *to, \ + unsigned type *from,\ + unsigned long size) \ +{ \ union oac __oac_spec = {\ .oac1.as = PSW_BITS_AS_SECONDARY, \ .oac1.a = 1,\ }; \ - int __rc; \ + int rc; \ \ asm volatile( \ " lr 0,%[spec]\n"\ @@ -93,12 +104,28 @@ int __noreturn __put_user_bad(void); "2:\n" \ EX_TABLE_UA_STORE(0b, 2b, %[rc])\ EX_TABLE_UA_STORE(1b, 2b, %[rc])\ - : [rc] "=" (__rc), [_to] "+Q" (*(to)) \ + : [rc] "=" (rc), [_to] "+Q" (*(to)) \ : [_size] "d" (size), [_from] "Q" (*(from)),\ [spec] "d" (__oac_spec.val) \ : "cc", "0"); \ - __rc; \ -}) + return rc; \ +} \ + \ +static __always_inline int \ +__put_user_##type(unsigned type __user *to, unsigned type *from, \ + unsigned long size) \ +{ \ + int rc; \ + \ + rc = __put_user_##type##_noinstr(to, from, size); \ + instrument_put_user(*from, to, size); \ + return rc; \ +} + +DEFINE_PUT_USER(char); +DEFINE_PUT_USER(short); +DEFINE_PUT_USER(int); +DEFINE_PUT_USER(long); static __always_inline int __put_user_fn(void *x, void __user *ptr, unsigned long size) { @@ -106,24 +133,24 @@ static __always_inline int __put_user_fn(void *x, void __user *ptr, unsigned lon
[PATCH v4 30/35] s390/string: Add KMSAN support
Add KMSAN support for the s390 implementations of the string functions. Do this similar to how it's already done for KASAN, except that the optimized memset{16,32,64}() functions need to be disabled: it's important for KMSAN to know that they initialized something. The way boot code is built with regard to string functions is problematic, since most files think it's configured with sanitizers, but boot/string.c doesn't. This creates various problems with the memset64() definitions, depending on whether the code is built with sanitizers or fortify. This should probably be streamlined, but in the meantime resolve the issues by introducing the IN_BOOT_STRING_C macro, similar to the existing IN_ARCH_STRING_C macro. Reviewed-by: Alexander Potapenko Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/boot/string.c| 16 arch/s390/include/asm/string.h | 20 +++- 2 files changed, 31 insertions(+), 5 deletions(-) diff --git a/arch/s390/boot/string.c b/arch/s390/boot/string.c index faccb33b462c..f6b9b1df48a8 100644 --- a/arch/s390/boot/string.c +++ b/arch/s390/boot/string.c @@ -1,11 +1,18 @@ // SPDX-License-Identifier: GPL-2.0 +#define IN_BOOT_STRING_C 1 #include #include #include #undef CONFIG_KASAN #undef CONFIG_KASAN_GENERIC +#undef CONFIG_KMSAN #include "../lib/string.c" +/* + * Duplicate some functions from the common lib/string.c + * instead of fully including it. 
+ */ + int strncmp(const char *cs, const char *ct, size_t count) { unsigned char c1, c2; @@ -22,6 +29,15 @@ int strncmp(const char *cs, const char *ct, size_t count) return 0; } +void *memset64(uint64_t *s, uint64_t v, size_t count) +{ + uint64_t *xs = s; + + while (count--) + *xs++ = v; + return s; +} + char *skip_spaces(const char *str) { while (isspace(*str)) diff --git a/arch/s390/include/asm/string.h b/arch/s390/include/asm/string.h index 351685de53d2..2ab868cbae6c 100644 --- a/arch/s390/include/asm/string.h +++ b/arch/s390/include/asm/string.h @@ -15,15 +15,12 @@ #define __HAVE_ARCH_MEMCPY /* gcc builtin & arch function */ #define __HAVE_ARCH_MEMMOVE/* gcc builtin & arch function */ #define __HAVE_ARCH_MEMSET /* gcc builtin & arch function */ -#define __HAVE_ARCH_MEMSET16 /* arch function */ -#define __HAVE_ARCH_MEMSET32 /* arch function */ -#define __HAVE_ARCH_MEMSET64 /* arch function */ void *memcpy(void *dest, const void *src, size_t n); void *memset(void *s, int c, size_t n); void *memmove(void *dest, const void *src, size_t n); -#ifndef CONFIG_KASAN +#if !defined(CONFIG_KASAN) && !defined(CONFIG_KMSAN) #define __HAVE_ARCH_MEMCHR /* inline & arch function */ #define __HAVE_ARCH_MEMCMP /* arch function */ #define __HAVE_ARCH_MEMSCAN/* inline & arch function */ @@ -36,6 +33,9 @@ void *memmove(void *dest, const void *src, size_t n); #define __HAVE_ARCH_STRNCPY/* arch function */ #define __HAVE_ARCH_STRNLEN/* inline & arch function */ #define __HAVE_ARCH_STRSTR /* arch function */ +#define __HAVE_ARCH_MEMSET16 /* arch function */ +#define __HAVE_ARCH_MEMSET32 /* arch function */ +#define __HAVE_ARCH_MEMSET64 /* arch function */ /* Prototypes for non-inlined arch strings functions. 
*/ int memcmp(const void *s1, const void *s2, size_t n); @@ -44,7 +44,7 @@ size_t strlcat(char *dest, const char *src, size_t n); char *strncat(char *dest, const char *src, size_t n); char *strncpy(char *dest, const char *src, size_t n); char *strstr(const char *s1, const char *s2); -#endif /* !CONFIG_KASAN */ +#endif /* !defined(CONFIG_KASAN) && !defined(CONFIG_KMSAN) */ #undef __HAVE_ARCH_STRCHR #undef __HAVE_ARCH_STRNCHR @@ -74,20 +74,30 @@ void *__memset16(uint16_t *s, uint16_t v, size_t count); void *__memset32(uint32_t *s, uint32_t v, size_t count); void *__memset64(uint64_t *s, uint64_t v, size_t count); +#ifdef __HAVE_ARCH_MEMSET16 static inline void *memset16(uint16_t *s, uint16_t v, size_t count) { return __memset16(s, v, count * sizeof(v)); } +#endif +#ifdef __HAVE_ARCH_MEMSET32 static inline void *memset32(uint32_t *s, uint32_t v, size_t count) { return __memset32(s, v, count * sizeof(v)); } +#endif +#ifdef __HAVE_ARCH_MEMSET64 +#ifdef IN_BOOT_STRING_C +void *memset64(uint64_t *s, uint64_t v, size_t count); +#else static inline void *memset64(uint64_t *s, uint64_t v, size_t count) { return __memset64(s, v, count * sizeof(v)); } +#endif +#endif #if !defined(IN_ARCH_STRING_C) && (!defined(CONFIG_FORTIFY_SOURCE) || defined(__NO_FORTIFY)) -- 2.45.1
[PATCH v4 09/35] kmsan: Expose kmsan_get_metadata()
Each s390 CPU has lowcore pages associated with it. Each CPU sees its own lowcore at virtual address 0 through a hardware mechanism called prefixing. Additionally, all lowcores are mapped to non-0 virtual addresses stored in the lowcore_ptr[] array. When lowcore is accessed through virtual address 0, one needs to resolve metadata for lowcore_ptr[raw_smp_processor_id()]. Expose kmsan_get_metadata() to make it possible to do this from the arch code. Signed-off-by: Ilya Leoshkevich --- include/linux/kmsan.h | 9 + mm/kmsan/instrumentation.c | 1 + mm/kmsan/kmsan.h | 1 - 3 files changed, 10 insertions(+), 1 deletion(-) diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h index e0c23a32cdf0..fe6c2212bdb1 100644 --- a/include/linux/kmsan.h +++ b/include/linux/kmsan.h @@ -230,6 +230,15 @@ void kmsan_handle_urb(const struct urb *urb, bool is_out); */ void kmsan_unpoison_entry_regs(const struct pt_regs *regs); +/** + * kmsan_get_metadata() - Return a pointer to KMSAN shadow or origins. + * @addr: kernel address. + * @is_origin: whether to return origins or shadow. + * + * Return NULL if metadata cannot be found. + */ +void *kmsan_get_metadata(void *addr, bool is_origin); + #else static inline void kmsan_init_shadow(void) diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c index 8a1bbbc723ab..94b49fac9d8b 100644 --- a/mm/kmsan/instrumentation.c +++ b/mm/kmsan/instrumentation.c @@ -14,6 +14,7 @@ #include "kmsan.h" #include +#include #include #include #include diff --git a/mm/kmsan/kmsan.h b/mm/kmsan/kmsan.h index adf443bcffe8..34b83c301d57 100644 --- a/mm/kmsan/kmsan.h +++ b/mm/kmsan/kmsan.h @@ -66,7 +66,6 @@ struct shadow_origin_ptr { struct shadow_origin_ptr kmsan_get_shadow_origin_ptr(void *addr, u64 size, bool store); -void *kmsan_get_metadata(void *addr, bool is_origin); void __init kmsan_init_alloc_meta_for_range(void *start, void *end); enum kmsan_bug_reason { -- 2.45.1
[PATCH v4 28/35] s390/irqflags: Do not instrument arch_local_irq_*() with KMSAN
Lockdep generates the following false positives with KMSAN on s390x: [6.063666] DEBUG_LOCKS_WARN_ON(lockdep_hardirqs_enabled()) [ ...] [6.577050] Call Trace: [6.619637] [<0690d2de>] check_flags+0x1fe/0x210 [6.665411] ([<0690d2da>] check_flags+0x1fa/0x210) [6.707478] [<006cec1a>] lock_acquire+0x2ca/0xce0 [6.749959] [<069820ea>] _raw_spin_lock_irqsave+0xea/0x190 [6.794912] [<041fc988>] __stack_depot_save+0x218/0x5b0 [6.838420] [<0197affe>] __msan_poison_alloca+0xfe/0x1a0 [6.882985] [<07c5827c>] start_kernel+0x70c/0xd50 [6.927454] [<00100036>] startup_continue+0x36/0x40 Between trace_hardirqs_on() and `stosm __mask, 3` lockdep thinks that interrupts are on, but on the CPU they are still off. KMSAN instrumentation takes spinlocks, giving lockdep a chance to see and complain about this discrepancy. KMSAN instrumentation is inserted in order to poison the __mask variable. Disable instrumentation in the respective functions. They are very small and it's easy to see that no important metadata updates are lost because of this. 
Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- arch/s390/include/asm/irqflags.h | 17 ++--- drivers/s390/char/sclp.c | 2 +- 2 files changed, 15 insertions(+), 4 deletions(-) diff --git a/arch/s390/include/asm/irqflags.h b/arch/s390/include/asm/irqflags.h index 02427b205c11..bcab456dfb80 100644 --- a/arch/s390/include/asm/irqflags.h +++ b/arch/s390/include/asm/irqflags.h @@ -37,12 +37,18 @@ static __always_inline void __arch_local_irq_ssm(unsigned long flags) asm volatile("ssm %0" : : "Q" (flags) : "memory"); } -static __always_inline unsigned long arch_local_save_flags(void) +#ifdef CONFIG_KMSAN +#define arch_local_irq_attributes noinline notrace __no_sanitize_memory __maybe_unused +#else +#define arch_local_irq_attributes __always_inline +#endif + +static arch_local_irq_attributes unsigned long arch_local_save_flags(void) { return __arch_local_irq_stnsm(0xff); } -static __always_inline unsigned long arch_local_irq_save(void) +static arch_local_irq_attributes unsigned long arch_local_irq_save(void) { return __arch_local_irq_stnsm(0xfc); } @@ -52,7 +58,12 @@ static __always_inline void arch_local_irq_disable(void) arch_local_irq_save(); } -static __always_inline void arch_local_irq_enable(void) +static arch_local_irq_attributes void arch_local_irq_enable_external(void) +{ + __arch_local_irq_stosm(0x01); +} + +static arch_local_irq_attributes void arch_local_irq_enable(void) { __arch_local_irq_stosm(0x03); } diff --git a/drivers/s390/char/sclp.c b/drivers/s390/char/sclp.c index d53ee34d398f..fb1d9949adca 100644 --- a/drivers/s390/char/sclp.c +++ b/drivers/s390/char/sclp.c @@ -736,7 +736,7 @@ sclp_sync_wait(void) cr0_sync.val = cr0.val & ~CR0_IRQ_SUBCLASS_MASK; cr0_sync.val |= 1UL << (63 - 54); local_ctl_load(0, _sync); - __arch_local_irq_stosm(0x01); + arch_local_irq_enable_external(); /* Loop until driver state indicates finished request */ while (sclp_running_state != sclp_running_state_idle) { /* Check for expired request timer */ -- 2.45.1
[PATCH v4 25/35] s390/cpumf: Unpoison STCCTM output buffer
stcctm() uses the "Q" constraint for dest, therefore KMSAN does not
understand that it fills multiple doublewords pointed to by dest, not
just one. This results in false positives.

Unpoison the whole dest manually with kmsan_unpoison_memory().

Reported-by: Alexander Gordeev
Reviewed-by: Alexander Potapenko
Acked-by: Heiko Carstens
Signed-off-by: Ilya Leoshkevich
---
 arch/s390/include/asm/cpu_mf.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/s390/include/asm/cpu_mf.h b/arch/s390/include/asm/cpu_mf.h
index a0de5b9b02ea..9e4bbc3e53f8 100644
--- a/arch/s390/include/asm/cpu_mf.h
+++ b/arch/s390/include/asm/cpu_mf.h
@@ -10,6 +10,7 @@
 #define _ASM_S390_CPU_MF_H

 #include
+#include
 #include
 #include

@@ -239,6 +240,11 @@ static __always_inline int stcctm(enum stcctm_ctr_set set, u64 range, u64 *dest)
		: "=d" (cc)
		: "Q" (*dest), "d" (range), "i" (set)
		: "cc", "memory");
+	/*
+	 * If cc == 2, less than RANGE counters are stored, but it's not easy
+	 * to tell how many. Always unpoison the whole range for simplicity.
+	 */
+	kmsan_unpoison_memory(dest, range * sizeof(u64));
	return cc;
}
-- 
2.45.1
[PATCH v4 33/35] s390/unwind: Disable KMSAN checks
The unwind code can read uninitialized frames. Furthermore, even in the
good case, KMSAN does not emit shadow for backchains. Therefore disable
it for the unwinding functions.

Reviewed-by: Alexander Potapenko
Acked-by: Heiko Carstens
Signed-off-by: Ilya Leoshkevich
---
 arch/s390/kernel/unwind_bc.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/s390/kernel/unwind_bc.c b/arch/s390/kernel/unwind_bc.c
index 0ece156fdd7c..cd44be2b6ce8 100644
--- a/arch/s390/kernel/unwind_bc.c
+++ b/arch/s390/kernel/unwind_bc.c
@@ -49,6 +49,8 @@ static inline bool is_final_pt_regs(struct unwind_state *state,
	       READ_ONCE_NOCHECK(regs->psw.mask) & PSW_MASK_PSTATE;
}

+/* Avoid KMSAN false positives from touching uninitialized frames. */
+__no_kmsan_checks
 bool unwind_next_frame(struct unwind_state *state)
 {
	struct stack_info *info = &state->stack_info;
@@ -118,6 +120,8 @@ bool unwind_next_frame(struct unwind_state *state)
 }
 EXPORT_SYMBOL_GPL(unwind_next_frame);

+/* Avoid KMSAN false positives from touching uninitialized frames. */
+__no_kmsan_checks
 void __unwind_start(struct unwind_state *state, struct task_struct *task,
		    struct pt_regs *regs, unsigned long first_frame)
 {
-- 
2.45.1
[PATCH v4 11/35] kmsan: Allow disabling KMSAN checks for the current task
Like for KASAN, it's useful to temporarily disable KMSAN checks around, e.g., redzone accesses. Introduce kmsan_disable_current() and kmsan_enable_current(), which are similar to their KASAN counterparts. Make them reentrant in order to handle memory allocations in interrupt context. Repurpose the allow_reporting field for this. Signed-off-by: Ilya Leoshkevich --- Documentation/dev-tools/kmsan.rst | 4 ++-- include/linux/kmsan.h | 24 include/linux/kmsan_types.h | 2 +- mm/kmsan/core.c | 1 - mm/kmsan/hooks.c | 18 +++--- mm/kmsan/report.c | 7 --- tools/objtool/check.c | 2 ++ 7 files changed, 48 insertions(+), 10 deletions(-) diff --git a/Documentation/dev-tools/kmsan.rst b/Documentation/dev-tools/kmsan.rst index 323eedad53cd..022a823f5f1b 100644 --- a/Documentation/dev-tools/kmsan.rst +++ b/Documentation/dev-tools/kmsan.rst @@ -338,11 +338,11 @@ Per-task KMSAN state Every task_struct has an associated KMSAN task state that holds the KMSAN -context (see above) and a per-task flag disallowing KMSAN reports:: +context (see above) and a per-task counter disallowing KMSAN reports:: struct kmsan_context { ... -bool allow_reporting; +unsigned int depth; struct kmsan_context_state cstate; ... } diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h index fe6c2212bdb1..23de1b3d6aee 100644 --- a/include/linux/kmsan.h +++ b/include/linux/kmsan.h @@ -239,6 +239,22 @@ void kmsan_unpoison_entry_regs(const struct pt_regs *regs); */ void *kmsan_get_metadata(void *addr, bool is_origin); +/* + * kmsan_enable_current(): Enable KMSAN for the current task. + * + * Each kmsan_enable_current() current call must be preceded by a + * kmsan_disable_current() call. These call pairs may be nested. + */ +void kmsan_enable_current(void); + +/* + * kmsan_disable_current(): Disable KMSAN for the current task. + * + * Each kmsan_disable_current() current call must be followed by a + * kmsan_enable_current() call. These call pairs may be nested. 
+ */ +void kmsan_disable_current(void); + #else static inline void kmsan_init_shadow(void) @@ -338,6 +354,14 @@ static inline void kmsan_unpoison_entry_regs(const struct pt_regs *regs) { } +static inline void kmsan_enable_current(void) +{ +} + +static inline void kmsan_disable_current(void) +{ +} + #endif #endif /* _LINUX_KMSAN_H */ diff --git a/include/linux/kmsan_types.h b/include/linux/kmsan_types.h index 929287981afe..dfc59918b3c0 100644 --- a/include/linux/kmsan_types.h +++ b/include/linux/kmsan_types.h @@ -31,7 +31,7 @@ struct kmsan_context_state { struct kmsan_ctx { struct kmsan_context_state cstate; int kmsan_in_runtime; - bool allow_reporting; + unsigned int depth; }; #endif /* _LINUX_KMSAN_TYPES_H */ diff --git a/mm/kmsan/core.c b/mm/kmsan/core.c index 95f859e38c53..81b0711a 100644 --- a/mm/kmsan/core.c +++ b/mm/kmsan/core.c @@ -43,7 +43,6 @@ void kmsan_internal_task_create(struct task_struct *task) struct thread_info *info = current_thread_info(); __memset(ctx, 0, sizeof(*ctx)); - ctx->allow_reporting = true; kmsan_internal_unpoison_memory(info, sizeof(*info), false); } diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c index b408714f9ba3..267d0afa2e8b 100644 --- a/mm/kmsan/hooks.c +++ b/mm/kmsan/hooks.c @@ -39,12 +39,10 @@ void kmsan_task_create(struct task_struct *task) void kmsan_task_exit(struct task_struct *task) { - struct kmsan_ctx *ctx = >kmsan_ctx; - if (!kmsan_enabled || kmsan_in_runtime()) return; - ctx->allow_reporting = false; + kmsan_disable_current(); } void kmsan_slab_alloc(struct kmem_cache *s, void *object, gfp_t flags) @@ -424,3 +422,17 @@ void kmsan_check_memory(const void *addr, size_t size) REASON_ANY); } EXPORT_SYMBOL(kmsan_check_memory); + +void kmsan_enable_current(void) +{ + KMSAN_WARN_ON(current->kmsan_ctx.depth == 0); + current->kmsan_ctx.depth--; +} +EXPORT_SYMBOL(kmsan_enable_current); + +void kmsan_disable_current(void) +{ + current->kmsan_ctx.depth++; + KMSAN_WARN_ON(current->kmsan_ctx.depth == 0); +} 
+EXPORT_SYMBOL(kmsan_disable_current); diff --git a/mm/kmsan/report.c b/mm/kmsan/report.c index c79d3b0d2d0d..92e73ec61435 100644 --- a/mm/kmsan/report.c +++ b/mm/kmsan/report.c @@ -8,6 +8,7 @@ */ #include +#include #include #include #include @@ -158,12 +159,12 @@ void kmsan_report(depot_stack_handle_t origin, void *address, int size, if (!kmsan_enabled) return; - if (!current->kmsan_ctx.allow_reporting) + if (current->kmsan_ctx.depth) return; if (!origin) return; - current->kmsan_ctx.allow_reporting = false; + kmsan_disable_current(); ua_flags =
[PATCH v4 10/35] kmsan: Export panic_on_kmsan
When building the kmsan test as a module, modpost fails with the
following error message:

ERROR: modpost: "panic_on_kmsan" [mm/kmsan/kmsan_test.ko] undefined!

Export panic_on_kmsan in order to improve the KMSAN usability for
modules.

Reviewed-by: Alexander Potapenko
Signed-off-by: Ilya Leoshkevich
---
 mm/kmsan/report.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/kmsan/report.c b/mm/kmsan/report.c
index 02736ec757f2..c79d3b0d2d0d 100644
--- a/mm/kmsan/report.c
+++ b/mm/kmsan/report.c
@@ -20,6 +20,7 @@ static DEFINE_RAW_SPINLOCK(kmsan_report_lock);
 /* Protected by kmsan_report_lock */
 static char report_local_descr[DESCR_SIZE];
 int panic_on_kmsan __read_mostly;
+EXPORT_SYMBOL_GPL(panic_on_kmsan);

 #ifdef MODULE_PARAM_PREFIX
 #undef MODULE_PARAM_PREFIX
-- 
2.45.1
[PATCH v4 26/35] s390/diag: Unpoison diag224() output buffer
Diagnose 224 stores 4k bytes, which currently cannot be deduced from the
inline assembly constraints. This leads to KMSAN false positives.

Fix the constraints by using a 4k-sized struct instead of a raw pointer.
While at it, prettify them too.

Suggested-by: Heiko Carstens
Signed-off-by: Ilya Leoshkevich
---
 arch/s390/kernel/diag.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/s390/kernel/diag.c b/arch/s390/kernel/diag.c
index 8dee9aa0ec95..8a7009618ba7 100644
--- a/arch/s390/kernel/diag.c
+++ b/arch/s390/kernel/diag.c
@@ -278,12 +278,14 @@ int diag224(void *ptr)
	int rc = -EOPNOTSUPP;

	diag_stat_inc(DIAG_STAT_X224);
-	asm volatile(
-		"	diag	%1,%2,0x224\n"
-		"0:	lhi	%0,0x0\n"
+	asm volatile("\n"
+		"	diag	%[type],%[addr],0x224\n"
+		"0:	lhi	%[rc],0\n"
		"1:\n"
		EX_TABLE(0b,1b)
-		: "+d" (rc) :"d" (0), "d" (addr) : "memory");
+		: [rc] "+d" (rc)
+		, "=m" (*(struct { char buf[PAGE_SIZE]; } *)ptr)
+		: [type] "d" (0), [addr] "d" (addr));
	return rc;
}
EXPORT_SYMBOL(diag224);
-- 
2.45.1
[PATCH v4 20/35] s390/boot: Turn off KMSAN
All other sanitizers are disabled for boot as well. While at it, add a
comment explaining why we need this.

Reviewed-by: Alexander Gordeev
Reviewed-by: Alexander Potapenko
Signed-off-by: Ilya Leoshkevich
---
 arch/s390/boot/Makefile | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/boot/Makefile b/arch/s390/boot/Makefile
index 070c9b2e905f..526ed20b9d31 100644
--- a/arch/s390/boot/Makefile
+++ b/arch/s390/boot/Makefile
@@ -3,11 +3,13 @@
 # Makefile for the linux s390-specific parts of the memory manager.
 #

+# Tooling runtimes are unavailable and cannot be linked for early boot code
 KCOV_INSTRUMENT := n
 GCOV_PROFILE := n
 UBSAN_SANITIZE := n
 KASAN_SANITIZE := n
 KCSAN_SANITIZE := n
+KMSAN_SANITIZE := n

 KBUILD_AFLAGS := $(KBUILD_AFLAGS_DECOMPRESSOR)
 KBUILD_CFLAGS := $(KBUILD_CFLAGS_DECOMPRESSOR)
-- 
2.45.1
[PATCH v4 00/35] kmsan: Enable on s390
v3: https://lore.kernel.org/lkml/20231213233605.661251-1-...@linux.ibm.com/ v3 -> v4: Rebase. Elaborate why ftrace_ops_list_func() change is needed on x64_64 (Steven). Add a comment to the DFLTCC patch (Alexander P.). Simplify diag224(); Improve __arch_local_irq_attributes style; Use IS_ENABLED(CONFIG_KMSAN) for vmalloc area (Heiko). Align vmalloc area on _SEGMENT_SIZE (Alexander G.). v2: https://lore.kernel.org/lkml/20231121220155.1217090-1-...@linux.ibm.com/ v2 -> v3: Drop kmsan_memmove_metadata() and strlcpy() patches; Remove kmsan_get_metadata() stub; Move kmsan_enable_current() and kmsan_disable_current() to include/linux/kmsan.h, explain why a counter is needed; Drop the memset_no_sanitize_memory() patch; Use __memset() in the SLAB_POISON patch; Add kmsan-checks.h to the DFLTCC patch; Add recursion check to the arch_kmsan_get_meta_or_null() patch (Alexander P.). Fix inline + __no_kmsan_checks issues. New patch for s390/irqflags, that resolves a lockdep warning. New patch for s390/diag, that resolves a false positive when running on an LPAR. New patch for STCCTM, same as above. New patch for check_bytes_and_report() that resolves a false positive that occurs even on Intel. v1: https://lore.kernel.org/lkml/20231115203401.2495875-1-...@linux.ibm.com/ v1 -> v2: Add comments, sort #includes, introduce memset_no_sanitize_memory() and use it to avoid unpoisoning of redzones, change vmalloc alignment to _REGION3_SIZE, add R-bs (Alexander P.). Fix building [PATCH 28/33] s390/string: Add KMSAN support with FORTIFY_SOURCE. Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202311170550.bsbo44ix-...@intel.com/ Hi, This series provides the minimal support for Kernel Memory Sanitizer on s390. Kernel Memory Sanitizer is clang-only instrumentation for finding accesses to uninitialized memory. The clang support for s390 has already been merged [1]. With this series, I can successfully boot s390 defconfig and debug_defconfig with kmsan.panic=1. 
The tool found one real s390-specific bug (fixed in master). Best regards, Ilya [1] https://reviews.llvm.org/D148596 Ilya Leoshkevich (35): ftrace: Unpoison ftrace_regs in ftrace_ops_list_func() kmsan: Make the tests compatible with kmsan.panic=1 kmsan: Disable KMSAN when DEFERRED_STRUCT_PAGE_INIT is enabled kmsan: Increase the maximum store size to 4096 kmsan: Fix is_bad_asm_addr() on arches with overlapping address spaces kmsan: Fix kmsan_copy_to_user() on arches with overlapping address spaces kmsan: Remove a useless assignment from kmsan_vmap_pages_range_noflush() kmsan: Remove an x86-specific #include from kmsan.h kmsan: Expose kmsan_get_metadata() kmsan: Export panic_on_kmsan kmsan: Allow disabling KMSAN checks for the current task kmsan: Support SLAB_POISON kmsan: Use ALIGN_DOWN() in kmsan_get_metadata() kmsan: Do not round up pg_data_t size mm: slub: Let KMSAN access metadata mm: slub: Unpoison the memchr_inv() return value mm: kfence: Disable KMSAN when checking the canary lib/zlib: Unpoison DFLTCC output buffers kmsan: Accept ranges starting with 0 on s390 s390/boot: Turn off KMSAN s390: Use a larger stack for KMSAN s390/boot: Add the KMSAN runtime stub s390/checksum: Add a KMSAN check s390/cpacf: Unpoison the results of cpacf_trng() s390/cpumf: Unpoison STCCTM output buffer s390/diag: Unpoison diag224() output buffer s390/ftrace: Unpoison ftrace_regs in kprobe_ftrace_handler() s390/irqflags: Do not instrument arch_local_irq_*() with KMSAN s390/mm: Define KMSAN metadata for vmalloc and modules s390/string: Add KMSAN support s390/traps: Unpoison the kernel_stack_overflow()'s pt_regs s390/uaccess: Add KMSAN support to put_user() and get_user() s390/unwind: Disable KMSAN checks s390: Implement the architecture-specific KMSAN functions kmsan: Enable on s390 Documentation/dev-tools/kmsan.rst | 4 +- arch/s390/Kconfig | 1 + arch/s390/Makefile | 2 +- arch/s390/boot/Makefile | 3 + arch/s390/boot/kmsan.c | 6 ++ arch/s390/boot/startup.c| 7 ++ arch/s390/boot/string.c 
| 16 arch/s390/include/asm/checksum.h| 2 + arch/s390/include/asm/cpacf.h | 3 + arch/s390/include/asm/cpu_mf.h | 6 ++ arch/s390/include/asm/irqflags.h| 17 - arch/s390/include/asm/kmsan.h | 43 +++ arch/s390/include/asm/pgtable.h | 8 ++ arch/s390/include/asm/string.h | 20 +++-- arch/s390/include/asm/thread_info.h | 2 +- arch/s390/include/asm/uaccess.h | 111 arch/s390/kernel/diag.c | 10
[PATCH v4 13/35] kmsan: Use ALIGN_DOWN() in kmsan_get_metadata()
Improve the readability by replacing the custom aligning logic with
ALIGN_DOWN(). Unlike other places where a similar sequence is used,
there is no size parameter that needs to be adjusted, so the standard
macro fits.

Reviewed-by: Alexander Potapenko
Signed-off-by: Ilya Leoshkevich
---
 mm/kmsan/shadow.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c
index 2d57408c78ae..9c58f081d84f 100644
--- a/mm/kmsan/shadow.c
+++ b/mm/kmsan/shadow.c
@@ -123,14 +123,12 @@ struct shadow_origin_ptr kmsan_get_shadow_origin_ptr(void *address, u64 size,
  */
 void *kmsan_get_metadata(void *address, bool is_origin)
 {
-	u64 addr = (u64)address, pad, off;
+	u64 addr = (u64)address, off;
	struct page *page;
	void *ret;

-	if (is_origin && !IS_ALIGNED(addr, KMSAN_ORIGIN_SIZE)) {
-		pad = addr % KMSAN_ORIGIN_SIZE;
-		addr -= pad;
-	}
+	if (is_origin)
+		addr = ALIGN_DOWN(addr, KMSAN_ORIGIN_SIZE);
	address = (void *)addr;
	if (kmsan_internal_is_vmalloc_addr(address) ||
	    kmsan_internal_is_module_addr(address))
-- 
2.45.1
[PATCH v4 12/35] kmsan: Support SLAB_POISON
Avoid false KMSAN negatives with SLUB_DEBUG by allowing kmsan_slab_free() to poison the freed memory, and by preventing init_object() from unpoisoning new allocations by using __memset(). There are two alternatives to this approach. First, init_object() can be marked with __no_sanitize_memory. This annotation should be used with great care, because it drops all instrumentation from the function, and any shadow writes will be lost. Even though this is not a concern with the current init_object() implementation, this may change in the future. Second, kmsan_poison_memory() calls may be added after memset() calls. The downside is that init_object() is called from free_debug_processing(), in which case poisoning will erase the distinction between simply uninitialized memory and UAF. Signed-off-by: Ilya Leoshkevich --- mm/kmsan/hooks.c | 2 +- mm/slub.c| 13 + 2 files changed, 10 insertions(+), 5 deletions(-) diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c index 267d0afa2e8b..26d86dfdc819 100644 --- a/mm/kmsan/hooks.c +++ b/mm/kmsan/hooks.c @@ -74,7 +74,7 @@ void kmsan_slab_free(struct kmem_cache *s, void *object) return; /* RCU slabs could be legally used after free within the RCU period */ - if (unlikely(s->flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON))) + if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU)) return; /* * If there's a constructor, freed memory must remain in the same state diff --git a/mm/slub.c b/mm/slub.c index 1373ac365a46..4dd55cabe701 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1139,7 +1139,12 @@ static void init_object(struct kmem_cache *s, void *object, u8 val) unsigned int poison_size = s->object_size; if (s->flags & SLAB_RED_ZONE) { - memset(p - s->red_left_pad, val, s->red_left_pad); + /* +* Use __memset() here and below in order to avoid overwriting +* the KMSAN shadow. Keeping the shadow makes it possible to +* distinguish uninit-value from use-after-free. 
+	 */
+	__memset(p - s->red_left_pad, val, s->red_left_pad);

		if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
			/*
@@ -1152,12 +1157,12 @@ static void init_object(struct kmem_cache *s, void *object, u8 val)
	}

	if (s->flags & __OBJECT_POISON) {
-		memset(p, POISON_FREE, poison_size - 1);
-		p[poison_size - 1] = POISON_END;
+		__memset(p, POISON_FREE, poison_size - 1);
+		__memset(p + poison_size - 1, POISON_END, 1);
	}

	if (s->flags & SLAB_RED_ZONE)
-		memset(p + poison_size, val, s->inuse - poison_size);
+		__memset(p + poison_size, val, s->inuse - poison_size);
 }

 static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
-- 
2.45.1
[PATCH v4 14/35] kmsan: Do not round up pg_data_t size
x86's alloc_node_data() rounds up node data size to PAGE_SIZE. It's not
explained why it's needed, but it's most likely for performance
reasons, since the padding bytes are not used anywhere. Some other
architectures do it as well, e.g., mips rounds it up to the cache line
size.

kmsan_init_shadow() initializes metadata for each node data and assumes
the x86 rounding, which does not match other architectures. This may
cause the range end to overshoot the end of available memory, in turn
causing virt_to_page_or_null() in kmsan_init_alloc_meta_for_range() to
return NULL, which leads to kernel panic shortly after.

Since the padding bytes are not used, drop the rounding.

Signed-off-by: Ilya Leoshkevich
---
 mm/kmsan/init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c
index 3ac3b8921d36..9de76ac7062c 100644
--- a/mm/kmsan/init.c
+++ b/mm/kmsan/init.c
@@ -72,7 +72,7 @@ static void __init kmsan_record_future_shadow_range(void *start, void *end)
  */
 void __init kmsan_init_shadow(void)
 {
-	const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
+	const size_t nd_size = sizeof(pg_data_t);
	phys_addr_t p_start, p_end;
	u64 loop;
	int nid;
-- 
2.45.1
[PATCH v4 15/35] mm: slub: Let KMSAN access metadata
Building the kernel with CONFIG_SLUB_DEBUG and CONFIG_KMSAN causes
KMSAN to complain about touching redzones in kfree().

Fix by extending the existing KASAN-related metadata_access_enable()
and metadata_access_disable() functions to KMSAN.

Acked-by: Vlastimil Babka
Signed-off-by: Ilya Leoshkevich
---
 mm/slub.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index 4dd55cabe701..a290f6c63e7b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -829,10 +829,12 @@ static int disable_higher_order_debug;
 static inline void metadata_access_enable(void)
 {
	kasan_disable_current();
+	kmsan_disable_current();
 }

 static inline void metadata_access_disable(void)
 {
+	kmsan_enable_current();
	kasan_enable_current();
 }
-- 
2.45.1
[PATCH v4 07/35] kmsan: Remove a useless assignment from kmsan_vmap_pages_range_noflush()
The value assigned to prot is immediately overwritten on the next line
with PAGE_KERNEL. The right hand side of the assignment has no
side-effects.

Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations")
Suggested-by: Alexander Gordeev
Reviewed-by: Alexander Potapenko
Signed-off-by: Ilya Leoshkevich
---
 mm/kmsan/shadow.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c
index b9d05aff313e..2d57408c78ae 100644
--- a/mm/kmsan/shadow.c
+++ b/mm/kmsan/shadow.c
@@ -243,7 +243,6 @@ int kmsan_vmap_pages_range_noflush(unsigned long start, unsigned long end,
		s_pages[i] = shadow_page_for(pages[i]);
		o_pages[i] = origin_page_for(pages[i]);
	}
-	prot = __pgprot(pgprot_val(prot) | _PAGE_NX);
	prot = PAGE_KERNEL;

	origin_start = vmalloc_meta((void *)start, KMSAN_META_ORIGIN);
-- 
2.45.1
[PATCH v4 03/35] kmsan: Disable KMSAN when DEFERRED_STRUCT_PAGE_INIT is enabled
KMSAN relies on memblock returning all available pages to it (see
kmsan_memblock_free_pages()). It partitions these pages into 3
categories: pages available to the buddy allocator, shadow pages and
origin pages. This partitioning is static.

If new pages appear after kmsan_init_runtime(), it is considered an
error. DEFERRED_STRUCT_PAGE_INIT causes this, so mark it as
incompatible with KMSAN.

Reviewed-by: Alexander Potapenko
Signed-off-by: Ilya Leoshkevich
---
 mm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index b4cb45255a54..9791fce5d0a7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -946,6 +946,7 @@ config DEFERRED_STRUCT_PAGE_INIT
	depends on SPARSEMEM
	depends on !NEED_PER_CPU_KM
	depends on 64BIT
+	depends on !KMSAN
	select PADATA
	help
	  Ordinarily all struct pages are initialised during early boot in a
-- 
2.45.1
[PATCH v4 06/35] kmsan: Fix kmsan_copy_to_user() on arches with overlapping address spaces
Comparing pointers with TASK_SIZE does not make sense when kernel and
userspace overlap. Assume that we are handling user memory access in
this case.

Reported-by: Alexander Gordeev
Reviewed-by: Alexander Potapenko
Signed-off-by: Ilya Leoshkevich
---
 mm/kmsan/hooks.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c
index 22e8657800ef..b408714f9ba3 100644
--- a/mm/kmsan/hooks.c
+++ b/mm/kmsan/hooks.c
@@ -267,7 +267,8 @@ void kmsan_copy_to_user(void __user *to, const void *from, size_t to_copy,
		return;

	ua_flags = user_access_save();
-	if ((u64)to < TASK_SIZE) {
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE) ||
+	    (u64)to < TASK_SIZE) {
		/* This is a user memory access, check it. */
		kmsan_internal_check_memory((void *)from, to_copy - left, to,
					    REASON_COPY_TO_USER);
-- 
2.45.1
[PATCH v4 02/35] kmsan: Make the tests compatible with kmsan.panic=1
It's useful to have both tests and kmsan.panic=1 during development,
but right now the warnings that the tests cause lead to kernel panics.
Temporarily set kmsan.panic=0 for the duration of the KMSAN testing.

Reviewed-by: Alexander Potapenko
Signed-off-by: Ilya Leoshkevich
---
 mm/kmsan/kmsan_test.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/kmsan/kmsan_test.c b/mm/kmsan/kmsan_test.c
index 07d3a3a5a9c5..9bfd11674fe3 100644
--- a/mm/kmsan/kmsan_test.c
+++ b/mm/kmsan/kmsan_test.c
@@ -659,9 +659,13 @@ static void test_exit(struct kunit *test)
 {
 }

+static int orig_panic_on_kmsan;
+
 static int kmsan_suite_init(struct kunit_suite *suite)
 {
	register_trace_console(probe_console, NULL);
+	orig_panic_on_kmsan = panic_on_kmsan;
+	panic_on_kmsan = 0;
	return 0;
 }

@@ -669,6 +673,7 @@ static void kmsan_suite_exit(struct kunit_suite *suite)
 {
	unregister_trace_console(probe_console, NULL);
	tracepoint_synchronize_unregister();
+	panic_on_kmsan = orig_panic_on_kmsan;
 }

 static struct kunit_suite kmsan_test_suite = {
-- 
2.45.1
Re: [PATCH v2] net: missing check virtio
Thu, Jun 13, 2024 at 11:54:48AM CEST, are...@swemel.ru wrote: >Two missing check in virtio_net_hdr_to_skb() allowed syzbot >to crash kernels again > >1. After the skb_segment function the buffer may become non-linear >(nr_frags != 0), but since the SKBTX_SHARED_FRAG flag is not set anywhere >the __skb_linearize function will not be executed, then the buffer will >remain non-linear. Then the condition (offset >= skb_headlen(skb)) >becomes true, which causes WARN_ON_ONCE in skb_checksum_help. > >2. The struct sk_buff and struct virtio_net_hdr members must be >mathematically related. >(gso_size) must be greater than (needed) otherwise WARN_ON_ONCE. >(remainder) must be greater than (needed) otherwise WARN_ON_ONCE. >(remainder) may be 0 if division is without remainder. > >offset+2 (4191) > skb_headlen() (1116) >WARNING: CPU: 1 PID: 5084 at net/core/dev.c:3303 skb_checksum_help+0x5e2/0x740 >net/core/dev.c:3303 >Modules linked in: >CPU: 1 PID: 5084 Comm: syz-executor336 Not tainted >6.7.0-rc3-syzkaller-00014-gdf60cee26a2e #0 >Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google >11/10/2023 >RIP: 0010:skb_checksum_help+0x5e2/0x740 net/core/dev.c:3303 >Code: 89 e8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 52 01 00 00 44 89 e2 2b >53 74 4c 89 ee 48 c7 c7 40 57 e9 8b e8 af 8f dd f8 90 <0f> 0b 90 90 e9 87 fe >ff ff e8 40 0f 6e f9 e9 4b fa ff ff 48 89 ef >RSP: 0018:c90003a9f338 EFLAGS: 00010286 >RAX: RBX: 888025125780 RCX: 814db209 >RDX: 888015393b80 RSI: 814db216 RDI: 0001 >RBP: 8880251257f4 R08: 0001 R09: >R10: R11: 0001 R12: 045c >R13: 105f R14: 8880251257f0 R15: 105d >FS: 55c24380() GS:8880b990() knlGS: >CS: 0010 DS: ES: CR0: 80050033 >CR2: 2000f000 CR3: 23151000 CR4: 003506f0 >DR0: DR1: DR2: >DR3: DR6: fffe0ff0 DR7: 0400 >Call Trace: > > ip_do_fragment+0xa1b/0x18b0 net/ipv4/ip_output.c:777 > ip_fragment.constprop.0+0x161/0x230 net/ipv4/ip_output.c:584 > ip_finish_output_gso net/ipv4/ip_output.c:286 [inline] > __ip_finish_output net/ipv4/ip_output.c:308 
[inline]
> __ip_finish_output+0x49c/0x650 net/ipv4/ip_output.c:295
> ip_finish_output+0x31/0x310 net/ipv4/ip_output.c:323
> NF_HOOK_COND include/linux/netfilter.h:303 [inline]
> ip_output+0x13b/0x2a0 net/ipv4/ip_output.c:433
> dst_output include/net/dst.h:451 [inline]
> ip_local_out+0xaf/0x1a0 net/ipv4/ip_output.c:129
> iptunnel_xmit+0x5b4/0x9b0 net/ipv4/ip_tunnel_core.c:82
> ipip6_tunnel_xmit net/ipv6/sit.c:1034 [inline]
> sit_tunnel_xmit+0xed2/0x28f0 net/ipv6/sit.c:1076
> __netdev_start_xmit include/linux/netdevice.h:4940 [inline]
> netdev_start_xmit include/linux/netdevice.h:4954 [inline]
> xmit_one net/core/dev.c:3545 [inline]
> dev_hard_start_xmit+0x13d/0x6d0 net/core/dev.c:3561
> __dev_queue_xmit+0x7c1/0x3d60 net/core/dev.c:4346
> dev_queue_xmit include/linux/netdevice.h:3134 [inline]
> packet_xmit+0x257/0x380 net/packet/af_packet.c:276
> packet_snd net/packet/af_packet.c:3087 [inline]
> packet_sendmsg+0x24ca/0x5240 net/packet/af_packet.c:3119
> sock_sendmsg_nosec net/socket.c:730 [inline]
> __sock_sendmsg+0xd5/0x180 net/socket.c:745
> __sys_sendto+0x255/0x340 net/socket.c:2190
> __do_sys_sendto net/socket.c:2202 [inline]
> __se_sys_sendto net/socket.c:2198 [inline]
> __x64_sys_sendto+0xe0/0x1b0 net/socket.c:2198
> do_syscall_x64 arch/x86/entry/common.c:51 [inline]
> do_syscall_64+0x40/0x110 arch/x86/entry/common.c:82
> entry_SYSCALL_64_after_hwframe+0x63/0x6b
>
>Found by Linux Verification Center (linuxtesting.org) with Syzkaller
>
>Signed-off-by: Denis Arefev

Could you please provide a "Fixes" tag blaming the commit which
introduced the bug?
>--- > V1 -> V2: incorrect type in argument 2 > include/linux/virtio_net.h | 11 +++ > 1 file changed, 11 insertions(+) > >diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h >index 4dfa9b69ca8d..d1d7825318c3 100644 >--- a/include/linux/virtio_net.h >+++ b/include/linux/virtio_net.h >@@ -56,6 +56,7 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb, > unsigned int thlen = 0; > unsigned int p_off = 0; > unsigned int ip_proto; >+ u64 ret, remainder, gso_size; > > if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) { > switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) { >@@ -98,6 +99,16 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb, > u32 off = __virtio16_to_cpu(little_endian, hdr->csum_offset); > u32 needed = start + max_t(u32, thlen, off + sizeof(__sum16)); > >+ if (hdr->gso_size) { >+ gso_size = __virtio16_to_cpu(little_endian, >hdr->gso_size); >+ ret = div64_u64_rem(skb->len, gso_size, ); >+
Re: [PATCH v5 8/8] Docs/.../mm/damon: add more damos actions
Hi Honggyu, On Thu, 13 Jun 2024 22:20:55 +0900 Honggyu Kim wrote: > This patch adds damon description for "migrate_hot" and "migrate_cold" > actions for both usage and design documents as long as a new > "target_nid" knob to set the migration target node. > > Signed-off-by: Honggyu Kim > --- > Documentation/admin-guide/mm/damon/usage.rst | 8 +++- > Documentation/mm/damon/design.rst| 4 > 2 files changed, 11 insertions(+), 1 deletion(-) > > diff --git a/Documentation/admin-guide/mm/damon/usage.rst > b/Documentation/admin-guide/mm/damon/usage.rst > index 7bff54963975..84d62d16c9f9 100644 > --- a/Documentation/admin-guide/mm/damon/usage.rst > +++ b/Documentation/admin-guide/mm/damon/usage.rst > @@ -300,6 +300,10 @@ from the file and their meaning are same to those of the > list on > The ``apply_interval_us`` file is for setting and getting the scheme's > :ref:`apply_interval ` in microseconds. > > +The ``target_nid`` file is for setting the migration target node, which is > +only meaningful when the ``action`` is either ``migrate_hot`` or > +``migrate_cold``. > + > .. _sysfs_access_pattern: > > schemes//access_pattern/ > @@ -759,7 +763,9 @@ list on :ref:`design doc `. > - 4: ``nohugepage`` > - 5: ``lru_prio`` > - 6: ``lru_deprio`` > - - 7: ``stat`` > + - 7: ``migrate_hot`` > + - 8: ``migrate_cold`` > + - 9: ``stat`` This section is for DAMON debugfs interface. And to my understanding, this patchset is not adding support of migrate_{hot,cold} DAMOS actions on DAMON debugfs interface. So I think this part should be removed. > > Quota > ~ > diff --git a/Documentation/mm/damon/design.rst > b/Documentation/mm/damon/design.rst > index 3df387249937..3f12c884eb3a 100644 > --- a/Documentation/mm/damon/design.rst > +++ b/Documentation/mm/damon/design.rst > @@ -325,6 +325,10 @@ that supports each action are as below. > Supported by ``paddr`` operations set. > - ``lru_deprio``: Deprioritize the region on its LRU lists. > Supported by ``paddr`` operations set. 
> + - ``migrate_hot``: Migrate the regions prioritizing warmer regions.
> +   Supported by ``paddr`` operations set.
> + - ``migrate_cold``: Migrate the regions prioritizing colder regions.
> +   Supported by ``paddr`` operations set.
>  - ``stat``: Do nothing but count the statistics.
>    Supported by all operations sets.

Except the DAMON debugfs interface section, this patch looks good to me.

Thanks,
SJ

[...]
Re: [PATCH 1/2] vdpa: support set mac address from vdpa tool
Thu, Jun 13, 2024 at 09:50:54AM CEST, m...@redhat.com wrote: >On Thu, Jun 13, 2024 at 09:21:07AM +0200, Jiri Pirko wrote: >> Thu, Jun 13, 2024 at 08:49:25AM CEST, m...@redhat.com wrote: >> >On Wed, Jun 12, 2024 at 09:22:32AM +0200, Jiri Pirko wrote: >> >> Wed, Jun 12, 2024 at 09:15:44AM CEST, m...@redhat.com wrote: >> >> >On Wed, Jun 12, 2024 at 08:29:53AM +0200, Jiri Pirko wrote: >> >> >> Wed, Jun 12, 2024 at 03:58:10AM CEST, k...@kernel.org wrote: >> >> >> >On Tue, 11 Jun 2024 13:32:32 +0800 Cindy Lu wrote: >> >> >> >> Add new UAPI to support the mac address from vdpa tool >> >> >> >> Function vdpa_nl_cmd_dev_config_set_doit() will get the >> >> >> >> MAC address from the vdpa tool and then set it to the device. >> >> >> >> >> >> >> >> The usage is: vdpa dev set name vdpa_name mac **:**:**:**:**:** >> >> >> > >> >> >> >Why don't you use devlink? >> >> >> >> >> >> Fair question. Why does vdpa-specific uapi even exist? To have >> >> >> driver-specific uapi Does not make any sense to me :/ >> >> > >> >> >I am not sure which uapi do you refer to? The one this patch proposes or >> >> >the existing one? >> >> >> >> Sure, I'm sure pointing out, that devlink should have been the answer >> >> instead of vdpa netlink introduction. That ship is sailed, >> > >> >> now we have >> >> unfortunate api duplication which leads to questions like Jakub's one. >> >> That's all :/ >> > >> > >> > >> >Yea there's no point to argue now, there were arguments this and that >> >way. I don't think we currently have a lot >> >of duplication, do we? >> >> True. I think it would be good to establish guidelines for api >> extensions in this area. >> >> > >> >-- >> >MST >> > > > >Guidelines are good, are there existing examples of such guidelines in >Linux to follow though? Specifically after reviewing this some more, I Documentation directory in general. >think what Cindy is trying to do is actually provisioning as opposed to >programming. > >-- >MST >
Re: [PATCH] function_graph: Add READ_ONCE() when accessing fgraph_array[]
On Thu, 13 Jun 2024 09:52:23 -0400 Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > In function_graph_enter() there's a loop that looks at fgraph_array[] > elements which are fgraph_ops. It first tests if it is a fgraph_stub op, > and if so skips it, as that's just there as a place holder. Then it checks > the fgraph_ops filters to see if the ops wants to trace the current > function. > > But if the compiler reloads the fgraph_array[] after the check against > fgraph_stub, it could race with the fgraph_array[] being updated with the > fgraph_stub. That would cause the stub to be processed. But the stub has a > null "func_hash" field which will cause a NULL pointer dereference. > > Add a READ_ONCE() so that the gops that is compared against the > fgraph_stub is also the gops that is processed later. > > Link: > https://lore.kernel.org/all/CA+G9fYsSVJQZH=nM=1cjtc94pgsnmf9y65bnov6xsocg_b6...@mail.gmail.com/ > > Fixes: cc60ee813b503 ("function_graph: Use static_call and branch to optimize > entry function") > Reported-by: Naresh Kamboju > Signed-off-by: Steven Rostedt (Google) Looks good to me. Acked-by: Masami Hiramatsu (Google) Thanks. > --- > kernel/trace/fgraph.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c > index 8317d1a7f43a..fc205ad167a9 100644 > --- a/kernel/trace/fgraph.c > +++ b/kernel/trace/fgraph.c > @@ -641,7 +641,7 @@ int function_graph_enter(unsigned long ret, unsigned long > func, > { > for_each_set_bit(i, &fgraph_array_bitmask, > sizeof(fgraph_array_bitmask) * > BITS_PER_BYTE) { > - struct fgraph_ops *gops = fgraph_array[i]; > + struct fgraph_ops *gops = READ_ONCE(fgraph_array[i]); > int save_curr_ret_stack; > > if (gops == &fgraph_stub) > -- > 2.43.0 > -- Masami Hiramatsu (Google)
Re: [PATCH 1/1] s390/virtio_ccw: fix config change notifications
On Tue, 2024-06-11 at 23:47 +0200, Halil Pasic wrote: > Commit e3e9bda38e6d ("s390/virtio_ccw: use DMA handle from DMA API") > broke configuration change notifications for virtio-ccw by putting > the > DMA address of *indicatorp directly into ccw->cda disregarding the > fact > that if !!(vcdev->is_thinint) then the function > virtio_ccw_register_adapter_ind() will overwrite that ccw->cda value > with the address of the virtio_thinint_area so it can actually set up > the adapter interrupts via CCW_CMD_SET_IND_ADAPTER. Thus we end up > pointing to the wrong object for both CCW_CMD_SET_IND if setting up > the > adapter interrupts fails, and for CCW_CMD_SET_CONF_IND regardless > whether it succeeds or fails. > > To fix this, let us save away the dma address of *indicatorp in a > local > variable, and copy it to ccw->cda after the "vcdev->is_thinint" > branch. > > Reported-by: Boqiao Fu > Reported-by: Sebastian Mitterle > Fixes: e3e9bda38e6d ("s390/virtio_ccw: use DMA handle from DMA API") > Signed-off-by: Halil Pasic > --- > I know that checkpatch.pl complains about a missing 'Closes' tag. > Unfortunately I don't have an appropriate URL at hand. @Sebastian, > @Boqiao: do you have any suggetions? > --- > drivers/s390/virtio/virtio_ccw.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) Reviewed-by: Eric Farman
Re: [PATCH v5 7/8] Docs/admin-guide/mm/damon/usage: add missing actions
Hi Honggyu, On Thu, 13 Jun 2024 22:20:54 +0900 Honggyu Kim wrote: > "lru_prio" and "lru_deprio" are missing in the damos action list so they > have to be added properly at damon usage document. > > Signed-off-by: Honggyu Kim > --- > Documentation/admin-guide/mm/damon/usage.rst | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/Documentation/admin-guide/mm/damon/usage.rst > b/Documentation/admin-guide/mm/damon/usage.rst > index e58ceb89ea2a..7bff54963975 100644 > --- a/Documentation/admin-guide/mm/damon/usage.rst > +++ b/Documentation/admin-guide/mm/damon/usage.rst > @@ -757,7 +757,9 @@ list on :ref:`design doc `. > - 2: ``pageout`` > - 3: ``hugepage`` > - 4: ``nohugepage`` > - - 5: ``stat`` > + - 5: ``lru_prio`` > + - 6: ``lru_deprio`` > + - 7: ``stat`` This section is for DAMON debugfs interface, which is deprecated. After the deprecation, we stopped supporting new DAMOS actions. This section is not documenting lru_[de]prio for the reason. I think this patch is not needed. Please let me know if I'm missing something, though. Thanks, SJ [...]
Re: [PATCH 0/8] DAMON based tiered memory management for CXL memory
On Thu, 13 Jun 2024 22:27:26 +0900 Honggyu Kim wrote: > On Thu, 13 Jun 2024 22:17:31 +0900 Honggyu Kim wrote: > > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously > > posted at [1]. > > > > It says there is no implementation of the demote/promote DAMOS action > > are made. This patch series is about its implementation for physical > > address space so that this scheme can be applied in system wide level. > > > > Changes from RFC v4: > > https://lore.kernel.org/20240512175447.75943-1...@kernel.org > > 1. Add usage and design documents > > 2. Rename alloc_demote_folio to alloc_migrate_folio > > 3. Add evaluation results with "demotion_enabled" true > > 4. Rebase based on v6.10-rc3 > > Sorry for causing confusion, I didn't add "PATCH v5" tag for this patch > series so please ignore and see the resent one. Thank you for clarifying this. Nonetheless, I don't mind resetting the version number of a patchset after dropping RFC. Actually, I personally rather prefer resetting the version number. Anyway, I don't care that much. Please use any way that you feel more comfortable :) Please just keep the number monotonically increasing. Thanks, SJ > > Honggyu
[PATCH] function_graph: Add READ_ONCE() when accessing fgraph_array[]
From: "Steven Rostedt (Google)" In function_graph_enter() there's a loop that looks at fgraph_array[] elements which are fgraph_ops. It first tests if it is a fgraph_stub op, and if so skips it, as that's just there as a place holder. Then it checks the fgraph_ops filters to see if the ops wants to trace the current function. But if the compiler reloads the fgraph_array[] after the check against fgraph_stub, it could race with the fgraph_array[] being updated with the fgraph_stub. That would cause the stub to be processed. But the stub has a null "func_hash" field which will cause a NULL pointer dereference. Add a READ_ONCE() so that the gops that is compared against the fgraph_stub is also the gops that is processed later. Link: https://lore.kernel.org/all/CA+G9fYsSVJQZH=nM=1cjtc94pgsnmf9y65bnov6xsocg_b6...@mail.gmail.com/ Fixes: cc60ee813b503 ("function_graph: Use static_call and branch to optimize entry function") Reported-by: Naresh Kamboju Signed-off-by: Steven Rostedt (Google) --- kernel/trace/fgraph.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c index 8317d1a7f43a..fc205ad167a9 100644 --- a/kernel/trace/fgraph.c +++ b/kernel/trace/fgraph.c @@ -641,7 +641,7 @@ int function_graph_enter(unsigned long ret, unsigned long func, { for_each_set_bit(i, &fgraph_array_bitmask, sizeof(fgraph_array_bitmask) * BITS_PER_BYTE) { - struct fgraph_ops *gops = fgraph_array[i]; + struct fgraph_ops *gops = READ_ONCE(fgraph_array[i]); int save_curr_ret_stack; if (gops == &fgraph_stub) -- 2.43.0
[PATCH 6/6] ftrace: Revert the FTRACE_MCOUNT_MAX_OFFSET workaround
After patch titled "ftrace: Skip invalid __fentry__ in ftrace_process_locs()", __fentry__ locations in overridden weak function have been checked and skipped, then all records in ftrace_pages are valid, the FTRACE_MCOUNT_MAX_OFFSET workaround can be reverted, include: 1. commit b39181f7c690 ("ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function") 2. commit 7af82ff90a2b ("powerpc/ftrace: Ignore weak functions") 3. commit f6834c8c59a8 ("powerpc/ftrace: Fix dropping weak symbols with older toolchains") Signed-off-by: Zheng Yejian --- arch/powerpc/include/asm/ftrace.h | 7 -- arch/x86/include/asm/ftrace.h | 7 -- kernel/trace/ftrace.c | 141 +- 3 files changed, 2 insertions(+), 153 deletions(-) diff --git a/arch/powerpc/include/asm/ftrace.h b/arch/powerpc/include/asm/ftrace.h index 107fc5a48456..d6ed058c8041 100644 --- a/arch/powerpc/include/asm/ftrace.h +++ b/arch/powerpc/include/asm/ftrace.h @@ -10,13 +10,6 @@ #define HAVE_FUNCTION_GRAPH_RET_ADDR_PTR -/* Ignore unused weak functions which will have larger offsets */ -#if defined(CONFIG_MPROFILE_KERNEL) || defined(CONFIG_ARCH_USING_PATCHABLE_FUNCTION_ENTRY) -#define FTRACE_MCOUNT_MAX_OFFSET 16 -#elif defined(CONFIG_PPC32) -#define FTRACE_MCOUNT_MAX_OFFSET 8 -#endif - #ifndef __ASSEMBLY__ extern void _mcount(void); diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h index 897cf02c20b1..7a147c9da08d 100644 --- a/arch/x86/include/asm/ftrace.h +++ b/arch/x86/include/asm/ftrace.h @@ -9,13 +9,6 @@ # define MCOUNT_ADDR ((unsigned long)(__fentry__)) #define MCOUNT_INSN_SIZE 5 /* sizeof mcount call */ -/* Ignore unused weak functions which will have non zero offsets */ -#ifdef CONFIG_HAVE_FENTRY -# include -/* Add offset for endbr64 if IBT enabled */ -# define FTRACE_MCOUNT_MAX_OFFSET ENDBR_INSN_SIZE -#endif - #ifdef CONFIG_DYNAMIC_FTRACE #define ARCH_SUPPORTS_FTRACE_OPS 1 #endif diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index c46c35ac9b42..1d60dc9a850b 100644 --- 
a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -49,8 +49,6 @@ #define FTRACE_NOCLEAR_FLAGS (FTRACE_FL_DISABLED | FTRACE_FL_TOUCHED | \ FTRACE_FL_MODIFIED) -#define FTRACE_INVALID_FUNCTION"__ftrace_invalid_address__" - #define FTRACE_WARN_ON(cond) \ ({ \ int ___r = cond;\ @@ -3709,105 +3707,6 @@ static void add_trampoline_func(struct seq_file *m, struct ftrace_ops *ops, seq_printf(m, " ->%pS", ptr); } -#ifdef FTRACE_MCOUNT_MAX_OFFSET -/* - * Weak functions can still have an mcount/fentry that is saved in - * the __mcount_loc section. These can be detected by having a - * symbol offset of greater than FTRACE_MCOUNT_MAX_OFFSET, as the - * symbol found by kallsyms is not the function that the mcount/fentry - * is part of. The offset is much greater in these cases. - * - * Test the record to make sure that the ip points to a valid kallsyms - * and if not, mark it disabled. - */ -static int test_for_valid_rec(struct dyn_ftrace *rec) -{ - char str[KSYM_SYMBOL_LEN]; - unsigned long offset; - const char *ret; - - ret = kallsyms_lookup(rec->ip, NULL, , NULL, str); - - /* Weak functions can cause invalid addresses */ - if (!ret || offset > FTRACE_MCOUNT_MAX_OFFSET) { - rec->flags |= FTRACE_FL_DISABLED; - return 0; - } - return 1; -} - -static struct workqueue_struct *ftrace_check_wq __initdata; -static struct work_struct ftrace_check_work __initdata; - -/* - * Scan all the mcount/fentry entries to make sure they are valid. 
- */ -static __init void ftrace_check_work_func(struct work_struct *work) -{ - struct ftrace_page *pg; - struct dyn_ftrace *rec; - - mutex_lock(_lock); - do_for_each_ftrace_rec(pg, rec) { - test_for_valid_rec(rec); - } while_for_each_ftrace_rec(); - mutex_unlock(_lock); -} - -static int __init ftrace_check_for_weak_functions(void) -{ - INIT_WORK(_check_work, ftrace_check_work_func); - - ftrace_check_wq = alloc_workqueue("ftrace_check_wq", WQ_UNBOUND, 0); - - queue_work(ftrace_check_wq, _check_work); - return 0; -} - -static int __init ftrace_check_sync(void) -{ - /* Make sure the ftrace_check updates are finished */ - if (ftrace_check_wq) - destroy_workqueue(ftrace_check_wq); - return 0; -} - -late_initcall_sync(ftrace_check_sync); -subsys_initcall(ftrace_check_for_weak_functions); - -static int print_rec(struct seq_file *m, unsigned long ip) -{ - unsigned long offset; - char str[KSYM_SYMBOL_LEN]; - char *modname; - const char *ret; - - ret = kallsyms_lookup(ip, NULL, , , str); - /* Weak functions can cause invalid addresses */ - if (!ret
[PATCH 3/6] module: kallsyms: Determine exact function size
When a weak type function is overridden, its symbol will be removed from the symbol table, but its code will not been removed. It will cause find_kallsyms_symbol() to compute a larger function size than it actually is, just because symbol of its following weak function is removed. To fix this issue, check that an given address is within the size of the function found. Signed-off-by: Zheng Yejian --- include/linux/module.h | 7 +++ kernel/module/kallsyms.c | 19 +-- 2 files changed, 24 insertions(+), 2 deletions(-) diff --git a/include/linux/module.h b/include/linux/module.h index ffa1c603163c..13518f464d3f 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -597,6 +597,13 @@ static inline unsigned long kallsyms_symbol_value(const Elf_Sym *sym) } #endif +#ifndef HAVE_ARCH_KALLSYMS_SYMBOL_TYPE +static inline unsigned int kallsyms_symbol_type(const Elf_Sym *sym) +{ + return ELF_ST_TYPE(sym->st_info); +} +#endif + /* FIXME: It'd be nice to isolate modules during init, too, so they aren't used before they (may) fail. But presently too much code (IDE & SCSI) require entry into the module during init.*/ diff --git a/kernel/module/kallsyms.c b/kernel/module/kallsyms.c index 62fb57bb9f16..092ae6f43dad 100644 --- a/kernel/module/kallsyms.c +++ b/kernel/module/kallsyms.c @@ -262,6 +262,7 @@ static const char *find_kallsyms_symbol(struct module *mod, unsigned long nextval, bestval; struct mod_kallsyms *kallsyms = rcu_dereference_sched(mod->kallsyms); struct module_memory *mod_mem; + const Elf_Sym *sym; /* At worse, next value is at end of module */ if (within_module_init(addr, mod)) @@ -278,9 +279,10 @@ static const char *find_kallsyms_symbol(struct module *mod, * starts real symbols at 1). 
*/ for (i = 1; i < kallsyms->num_symtab; i++) { - const Elf_Sym *sym = >symtab[i]; - unsigned long thisval = kallsyms_symbol_value(sym); + unsigned long thisval; + sym = >symtab[i]; + thisval = kallsyms_symbol_value(sym); if (sym->st_shndx == SHN_UNDEF) continue; @@ -292,6 +294,13 @@ static const char *find_kallsyms_symbol(struct module *mod, is_mapping_symbol(kallsyms_symbol_name(kallsyms, i))) continue; + if (kallsyms_symbol_type(sym) == STT_FUNC && + addr >= thisval && addr < thisval + sym->st_size) { + best = i; + bestval = thisval; + nextval = thisval + sym->st_size; + goto find; + } if (thisval <= addr && thisval > bestval) { best = i; bestval = thisval; @@ -303,6 +312,12 @@ static const char *find_kallsyms_symbol(struct module *mod, if (!best) return NULL; + sym = >symtab[best]; + if (kallsyms_symbol_type(sym) == STT_FUNC && sym->st_size && + addr >= kallsyms_symbol_value(sym) + sym->st_size) + return NULL; + +find: if (size) *size = nextval - bestval; if (offset) -- 2.25.1
Re: [PATCH v7 3/5] regulator: add regulators driver for Marvell 88PM886 PMIC
On Fri, 31 May 2024, Karel Balej wrote: > Support the LDO and buck regulators of the Marvell 88PM886 PMIC. > > Signed-off-by: Karel Balej > --- > > Notes: > v7: > - Address Mark's feedback: > - Drop get_current_limit op, max_uA values and thus unneeded struct > pm886_regulator and adapt the code accordingly. > v6: > - Remove all definitions (now present in the header). > v5: > - Add remaining regulators. > - Clean up includes. > - Address Mark's feedback: > - Use dedicated regmap config. > RFC v4: > - Initialize regulators regmap in the regulators driver. > - Register all regulators at once. > - Drop regulator IDs. > - Add missing '\n' to dev_err_probe message. > - Fix includes. > - Add ID table. > RFC v3: > - Do not have a variable for each regulator -- define them all in the > pm886_regulators array. > - Use new regulators regmap index name. > - Use dev_err_probe. > RFC v2: > - Drop of_compatible and related code. > - Drop unused include. > - Remove some abstraction: use only one regmap for all regulators and > only mention 88PM886 in Kconfig description. > - Reword commit message. > > drivers/regulator/88pm886-regulator.c | 392 ++ > drivers/regulator/Kconfig | 6 + > drivers/regulator/Makefile| 1 + > 3 files changed, 399 insertions(+) > create mode 100644 drivers/regulator/88pm886-regulator.c I'm fine with this set - just waiting for Mark to review. -- Lee Jones [李琼斯]
Re: [PATCH 0/8] DAMON based tiered memory management for CXL memory
On Thu, 13 Jun 2024 22:17:31 +0900 Honggyu Kim wrote: > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously > posted at [1]. > > It says there is no implementation of the demote/promote DAMOS action > are made. This patch series is about its implementation for physical > address space so that this scheme can be applied in system wide level. > > Changes from RFC v4: > https://lore.kernel.org/20240512175447.75943-1...@kernel.org > 1. Add usage and design documents > 2. Rename alloc_demote_folio to alloc_migrate_folio > 3. Add evaluation results with "demotion_enabled" true > 4. Rebase based on v6.10-rc3 Sorry for causing confusion, I didn't add "PATCH v5" tag for this patch series so please ignore and see the resent one. Honggyu
[PATCH 5/6] ftrace: Fix possible out-of-bound issue in ftrace_process_locs()
In ftrace_process_locs(), a series of pages are prepared and linked in start_pg, then fentry records are skipped or added, then unused pages are freed. However, if all records are skipped, the start_pg will currently still be in the list of ftrace_pages_start but without any record. Then in ftrace_free_mem(), indexing a record by (pg->index - 1) will be out of bounds. To fix this issue, properly handle the unused start_pg and add WARN_ON_ONCE() where the records need to be indexed. Fixes: 26efd79c4624 ("ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()") Signed-off-by: Zheng Yejian --- kernel/trace/ftrace.c | 18 +- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 0e8628e4d296..c46c35ac9b42 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -6575,10 +6575,22 @@ static int ftrace_process_locs(struct module *mod, rec->ip = addr; } - if (pg->next) { + if (pg->index == 0) { + /* No record is added on the current page, so it's unused */ + pg_unuse = pg; + } else if (pg->next) { + /* Current page has records, so it's next page is unused */ pg_unuse = pg->next; pg->next = NULL; } + /* +* Even the start_pg hasn't been used, that means, no record has +* been added, so restore state of ftrace_pages and just go out. 
+*/ + if (pg_unuse == start_pg) { + ftrace_pages->next = NULL; + goto out; + } /* Assign the last page to ftrace_pages */ ftrace_pages = pg; @@ -6794,6 +6806,8 @@ void ftrace_release_mod(struct module *mod) */ last_pg = &ftrace_pages_start; for (pg = ftrace_pages_start; pg; pg = *last_pg) { + /* The page should have at lease one record */ + WARN_ON_ONCE(!pg->index); rec = &pg->records[0]; if (within_module(rec->ip, mod)) { /* @@ -7176,6 +7190,8 @@ void ftrace_free_mem(struct module *mod, void *start_ptr, void *end_ptr) mod_map = allocate_ftrace_mod_map(mod, start, end); for (pg = ftrace_pages_start; pg; last_pg = &pg->next, pg = *last_pg) { + /* The page should have at lease one record */ + WARN_ON_ONCE(!pg->index); if (end < pg->records[0].ip || start >= (pg->records[pg->index - 1].ip + MCOUNT_INSN_SIZE)) continue; -- 2.25.1
[PATCH 1/6] kallsyms: Optimize multiple times of realloc() to one time of malloc()
Array 'table' is used to store pointers of symbols that read from in.map file, and its size depends on the number of symbols. Currently 'table' is expanded by calling realloc() every 1 symbols read. However, there generally are around 10+ symbols, which means that the expansion is generally 10+ times. As an optimization, introduce linked list 'sym_list' to associate and count all symbols, then store them into 'table' at one time. Signed-off-by: Zheng Yejian --- scripts/kallsyms.c | 33 - 1 file changed, 20 insertions(+), 13 deletions(-) diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c index 47978efe4797..6559a9802f6e 100644 --- a/scripts/kallsyms.c +++ b/scripts/kallsyms.c @@ -33,6 +33,7 @@ #define KSYM_NAME_LEN 512 struct sym_entry { + struct sym_entry *next; unsigned long long addr; unsigned int len; unsigned int seq; @@ -60,7 +61,8 @@ static struct addr_range percpu_range = { }; static struct sym_entry **table; -static unsigned int table_size, table_cnt; +static struct sym_entry *sym_list; +static unsigned int table_cnt; static int all_symbols; static int absolute_percpu; static int base_relative; @@ -273,6 +275,7 @@ static void read_map(const char *in) struct sym_entry *sym; char *buf = NULL; size_t buflen = 0; + int i; fp = fopen(in, "r"); if (!fp) { @@ -286,18 +289,22 @@ static void read_map(const char *in) continue; sym->start_pos = table_cnt; - - if (table_cnt >= table_size) { - table_size += 1; - table = realloc(table, sizeof(*table) * table_size); - if (!table) { - fprintf(stderr, "out of memory\n"); - fclose(fp); - exit (1); - } - } - - table[table_cnt++] = sym; + table_cnt++; + sym->next = sym_list; + sym_list = sym; + } + table = malloc(sizeof(*table) * table_cnt); + if (!table) { + fprintf(stderr, "unable to allocate memory for table\n"); + free(buf); + fclose(fp); + exit(EXIT_FAILURE); + } + sym = sym_list; + i = table_cnt - 1; + while (sym) { + table[i--] = sym; + sym = sym->next; } free(buf); -- 2.25.1
[PATCH 0/6] kallsyms: Emit symbol for holes in text and fix weak function issue
ftrace_location() was changed to not only return the __fentry__ location when called for the __fentry__ location, but also when called for the sym+0 location after commit aebfd12521d9 ("x86/ibt,ftrace: Search for __fentry__ location"). That is, if the sym+0 location is not __fentry__, ftrace_location() would find one over the entire size of the sym. However, there is a case where more than one __fentry__ exists in the sym range (described below), and ftrace_location() would find the wrong __fentry__ location by binary searching, which would cause its users like livepatch/kprobe/bpf to not work properly on this sym! The case is that, based on current compiler behavior, suppose: - function A is followed by weak function B1 in the same binary file; - weak function B1 is overridden by function B2; Then in the final binary file: - symbol B1 will be removed from the symbol table while its instructions are not removed; - __fentry__ of B1 will still be in the __mcount_loc table; - the function size of A is computed by subtracting the symbol address of A from its next symbol address (see kallsyms_lookup_size_offset()), but because the symbol info of B1 is removed, the next symbol of A is originally the next symbol of B1. In the following example, the function size of A will be (symbol_address_C - symbol_address_A): symbol_address_A symbol_address_B1 (Not in symbol table) symbol_address_C The weak function issue was discovered in commit b39181f7c690 ("ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function") but it didn't resolve the issue in ftrace_location(). Peter suggested using the entry size for FUNC type objects to find holes in the text and fill them with a symbol, then checking the mcount locations against the symbol table for every one that falls in a hole [1] [2]. What the patch set does is described as follows: - Patch 1: Do an optimization for scripts/kallsyms.c regarding memory allocation when reading symbols from a file. 
This patch has little to do with the above issue, but since I changed this script, so it also can be reviewed here; - Patch 2: Change scripts/kallsyms.c to emit a symbol where there is a hole in the text, the symbol name is temporarily named "__hole_symbol_X"; - Patch 3: When lookup symbols in module, use entry size info to determine the exact boundaries of a function symbol; - Patch 4: Holes in text have been found in previous patches, now check __fentry__ in mcount table and skip those locate in the holes; - Patch 5: Accidentally found a out-of-bound issue when all __fentry__ are skipped, so fix it; - Patch 6: Revert Steve's patch about the FTRACE_MCOUNT_MAX_OFFSET solution, also two related definition for powerpc. [1] https://lore.kernel.org/all/20240607150228.gr8...@noisy.programming.kicks-ass.net/ [2] https://lore.kernel.org/all/20240611092157.gu40...@noisy.programming.kicks-ass.net/ Zheng Yejian (6): kallsyms: Optimize multiple times of realloc() to one time of malloc() kallsyms: Emit symbol at the holes in the text module: kallsyms: Determine exact function size ftrace: Skip invalid __fentry__ in ftrace_process_locs() ftrace: Fix possible out-of-bound issue in ftrace_process_locs() ftrace: Revert the FTRACE_MCOUNT_MAX_OFFSET workaround arch/powerpc/include/asm/ftrace.h | 7 -- arch/x86/include/asm/ftrace.h | 7 -- include/linux/kallsyms.h | 13 +++ include/linux/module.h| 14 +++ kernel/module/kallsyms.c | 42 ++-- kernel/trace/ftrace.c | 174 ++ scripts/kallsyms.c| 134 --- scripts/link-vmlinux.sh | 4 +- scripts/mksysmap | 2 +- 9 files changed, 216 insertions(+), 181 deletions(-) -- 2.25.1
[PATCH 4/6] ftrace: Skip invalid __fentry__ in ftrace_process_locs()
ftrace_location() was changed to not only return the __fentry__ location when called for the __fentry__ location, but also when called for the sym+0 location after commit aebfd12521d9 ("x86/ibt,ftrace: Search for __fentry__ location"). That is, if sym+0 location is not __fentry__, ftrace_location() would find one over the entire size of the sym. However, there is case that more than one __fentry__ exist in the sym range (described below) and ftrace_location() would find wrong __fentry__ location by binary searching, which would cause its users like livepatch/ kprobe/bpf to not work properly on this sym! The case is that, based on current compiler behavior, suppose: - function A is followed by weak function B1 in same binary file; - weak function B1 is overridden by function B2; Then in the final binary file: - symbol B1 will be removed from symbol table while its instructions are not removed; - __fentry__ of B1 will be still in __mcount_loc table; - function size of A is computed by substracting the symbol address of A from its next symbol address (see kallsyms_lookup_size_offset()), but because symbol info of B1 is removed, the next symbol of A is originally the next symbol of B1. See following example, function sizeof A will be (symbol_address_C - symbol_address_A): symbol_address_A symbol_address_B1 (Not in symbol table) symbol_address_C The weak function issue has been discovered in commit b39181f7c690 ("ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function") but it didn't resolve the issue in ftrace_location(). To solve the issue, with Peter's suggestions, in previous patches, all holes in the text have been found and filled with specail symbols, also the same case with module weak function has been handled. Then check and skip __fentry__ that locate in the holes. Also in this patch, introduce following helper functions: 1. kallsyms_is_hole_symbol() is used to check if a function name is of a hole symbol; 2. 
module_kallsyms_find_symbol() is used to check if a __fentry__ locate in a valid function of the given module. It is needed because other symbol lookup functions like module_address_lookup() will find module of the passed address, but as ftrace_process_locs() is called, the module has not been fully loaded, so those lookup functions can not work. Fixes: aebfd12521d9 ("x86/ibt,ftrace: Search for __fentry__ location") Signed-off-by: Zheng Yejian --- include/linux/kallsyms.h | 13 + include/linux/module.h | 7 +++ kernel/module/kallsyms.c | 23 +-- kernel/trace/ftrace.c| 15 ++- 4 files changed, 51 insertions(+), 7 deletions(-) diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h index c3f075e8f60c..0bf0d595f244 100644 --- a/include/linux/kallsyms.h +++ b/include/linux/kallsyms.h @@ -88,6 +88,14 @@ const char *kallsyms_lookup(unsigned long addr, unsigned long *offset, char **modname, char *namebuf); +/* + * Check if the name is of a hole symbol + */ +static inline int kallsyms_is_hole_symbol(const char *name) +{ + return !strcmp(name, "__hole_symbol_X"); +} + /* Look up a kernel symbol and return it in a text buffer. 
*/ extern int sprint_symbol(char *buffer, unsigned long address); extern int sprint_symbol_build_id(char *buffer, unsigned long address); @@ -119,6 +127,11 @@ static inline const char *kallsyms_lookup(unsigned long addr, return NULL; } +static inline int kallsyms_is_hole_symbol(const char *name) +{ + return 0; +} + static inline int sprint_symbol(char *buffer, unsigned long addr) { *buffer = '\0'; diff --git a/include/linux/module.h b/include/linux/module.h index 13518f464d3f..a3fd077ef2a8 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -960,6 +960,8 @@ int module_get_kallsym(unsigned int symnum, unsigned long *value, char *type, unsigned long module_kallsyms_lookup_name(const char *name); unsigned long find_kallsyms_symbol_value(struct module *mod, const char *name); +int module_kallsyms_find_symbol(struct module *mod, unsigned long addr, + unsigned long *size, unsigned long *offset); #else /* CONFIG_MODULES && CONFIG_KALLSYMS */ @@ -1004,6 +1006,11 @@ static inline unsigned long find_kallsyms_symbol_value(struct module *mod, return 0; } +static inline int module_kallsyms_find_symbol(struct module *mod, unsigned long addr, + unsigned long *size, unsigned long *offset) +{ + return 0; +} #endif /* CONFIG_MODULES && CONFIG_KALLSYMS */ #endif /* _LINUX_MODULE_H */ diff --git a/kernel/module/kallsyms.c b/kernel/module/kallsyms.c index 092ae6f43dad..e9c439d81708 100644 --- a/kernel/module/kallsyms.c +++ b/kernel/module/kallsyms.c @@ -253,10 +253,10 @@ static const char *kallsyms_symbol_name(struct mod_kallsyms
[PATCH 2/6] kallsyms: Emit symbol at the holes in the text
When a weak function is overridden, its symbol is removed from the symbol table, but its code is not removed. Besides, since kallsyms carries no size information, the kernel computes a function's size by subtracting its symbol address from the next symbol address (see kallsyms_lookup_size_offset()). Together these cause the size of some functions to be computed larger than they actually are, just because the symbol of the following weak function was removed. This issue also causes multiple __fentry__ locations to be counted within one function's scope, and eventually causes ftrace_location() to find the wrong __fentry__ location. It was reported in Link: https://lore.kernel.org/all/20240607115211.734845-1-zhengyeji...@huawei.com/ Peter suggested changing scripts/kallsyms.c to emit readily identifiable symbol names for all the weak junk, e.g. __weak_junk_N. The name of this kind of symbol needs some discussion, but it is temporarily called "__hole_symbol_X" in this patch: 1. Pass size info to scripts/kallsyms (see mksysmap()); 2. Traverse the sorted function symbols; if one function's address plus its size is less than the next function's address, there is a hole, so emit a symbol "__hole_symbol_X" there with type 't'. 
After this patch, the effect is as follows: $ cat /proc/kallsyms | grep -A 3 do_one_initcall 810021e0 T do_one_initcall 8100245e t __hole_symbol_X 810024a0 t __pfx_rootfs_init_fs_context Suggested-by: Peter Zijlstra Signed-off-by: Zheng Yejian --- scripts/kallsyms.c | 101 +++- scripts/link-vmlinux.sh | 4 +- scripts/mksysmap| 2 +- 3 files changed, 102 insertions(+), 5 deletions(-) diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c index 6559a9802f6e..5c4cde864a04 100644 --- a/scripts/kallsyms.c +++ b/scripts/kallsyms.c @@ -35,6 +35,7 @@ struct sym_entry { struct sym_entry *next; unsigned long long addr; + unsigned long long size; unsigned int len; unsigned int seq; unsigned int start_pos; @@ -74,6 +75,7 @@ static int token_profit[0x1]; static unsigned char best_table[256][2]; static unsigned char best_table_len[256]; +static const char hole_symbol[] = "__hole_symbol_X"; static void usage(void) { @@ -130,8 +132,16 @@ static struct sym_entry *read_symbol(FILE *in, char **buf, size_t *buf_len) size_t len; ssize_t readlen; struct sym_entry *sym; + unsigned long long size = 0; errno = 0; + /* +* Example of expected symbol format: +* 1. symbol with size info: +*8170 01d7 T __startup_64 +* 2. 
symbol without size info: +*02a0 A text_size +*/ readlen = getline(buf, buf_len, in); if (readlen < 0) { if (errno) { @@ -145,9 +155,24 @@ static struct sym_entry *read_symbol(FILE *in, char **buf, size_t *buf_len) (*buf)[readlen - 1] = 0; addr = strtoull(*buf, , 16); + if (*buf == p || *p++ != ' ') { + fprintf(stderr, "line format error: unable to parse address\n"); + exit(EXIT_FAILURE); + } + + if (*p == '0') { + char *str = p; - if (*buf == p || *p++ != ' ' || !isascii((type = *p++)) || *p++ != ' ') { - fprintf(stderr, "line format error\n"); + size = strtoull(str, , 16); + if (str == p || *p++ != ' ') { + fprintf(stderr, "line format error: unable to parse size\n"); + exit(EXIT_FAILURE); + } + } + + type = *p++; + if (!isascii(type) || *p++ != ' ') { + fprintf(stderr, "line format error: unable to parse type\n"); exit(EXIT_FAILURE); } @@ -182,6 +207,7 @@ static struct sym_entry *read_symbol(FILE *in, char **buf, size_t *buf_len) exit(EXIT_FAILURE); } sym->addr = addr; + sym->size = size; sym->len = len; sym->sym[0] = type; strcpy(sym_name(sym), name); @@ -795,6 +821,76 @@ static void sort_symbols(void) qsort(table, table_cnt, sizeof(table[0]), compare_symbols); } +static int may_exist_hole_after_symbol(const struct sym_entry *se) +{ + char type = se->sym[0]; + + /* Only check text symbol or weak symbol */ + if (type != 't' && type != 'T' && + type != 'w' && type != 'W') + return 0; + /* Symbol without size has no hole */ + return se->size != 0; +} + +static struct sym_entry *gen_hole_symbol(unsigned long long addr) +{ + struct sym_entry *sym; + static size_t len = sizeof(hole_symbol); + + /* include type field */ + sym = malloc(sizeof(*sym) + len + 1); + if (!sym) { + fprintf(stderr, "unable to allocate memory for hole symbol\n"); +
[PATCH v5 5/8] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion
This patch introduces DAMOS_MIGRATE_COLD action, which is similar to DAMOS_PAGEOUT, but migrate folios to the given 'target_nid' in the sysfs instead of swapping them out. The 'target_nid' sysfs knob informs the migration target node ID. Here is one of the example usage of this 'migrate_cold' action. $ cd /sys/kernel/mm/damon/admin/kdamonds/ $ cat contexts//schemes//action migrate_cold $ echo 2 > contexts//schemes//target_nid $ echo commit > state $ numactl -p 0 ./hot_cold 500M 600M & $ numastat -c -p hot_cold Per-node process memory usage (in MBs) PID Node 0 Node 1 Node 2 Total -- -- -- -- - 701 (hot_cold) 501 0601 1101 Since there are some common routines with pageout, many functions have similar logics between pageout and migrate cold. damon_pa_migrate_folio_list() is a minimized version of shrink_folio_list(). Signed-off-by: Honggyu Kim Signed-off-by: Hyeongtak Ji Signed-off-by: SeongJae Park --- include/linux/damon.h| 2 + mm/damon/paddr.c | 154 +++ mm/damon/sysfs-schemes.c | 1 + 3 files changed, 157 insertions(+) diff --git a/include/linux/damon.h b/include/linux/damon.h index 21d6b69a015c..56714b6eb0d7 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -105,6 +105,7 @@ struct damon_target { * @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE. * @DAMOS_LRU_PRIO:Prioritize the region on its LRU lists. * @DAMOS_LRU_DEPRIO: Deprioritize the region on its LRU lists. + * @DAMOS_MIGRATE_COLD:Migrate the regions prioritizing colder regions. * @DAMOS_STAT:Do nothing but count the stat. 
* @NR_DAMOS_ACTIONS: Total number of DAMOS actions * @@ -122,6 +123,7 @@ enum damos_action { DAMOS_NOHUGEPAGE, DAMOS_LRU_PRIO, DAMOS_LRU_DEPRIO, + DAMOS_MIGRATE_COLD, DAMOS_STAT, /* Do nothing but only record the stat */ NR_DAMOS_ACTIONS, }; diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c index 18797c1b419b..882ae54af829 100644 --- a/mm/damon/paddr.c +++ b/mm/damon/paddr.c @@ -12,6 +12,9 @@ #include #include #include +#include +#include +#include #include "../internal.h" #include "ops-common.h" @@ -325,6 +328,153 @@ static unsigned long damon_pa_deactivate_pages(struct damon_region *r, return damon_pa_mark_accessed_or_deactivate(r, s, false); } +static unsigned int __damon_pa_migrate_folio_list( + struct list_head *migrate_folios, struct pglist_data *pgdat, + int target_nid) +{ + unsigned int nr_succeeded; + nodemask_t allowed_mask = NODE_MASK_NONE; + struct migration_target_control mtc = { + /* +* Allocate from 'node', or fail quickly and quietly. +* When this happens, 'page' will likely just be discarded +* instead of migrated. 
+*/ + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | + __GFP_NOWARN | __GFP_NOMEMALLOC | GFP_NOWAIT, + .nid = target_nid, + .nmask = _mask + }; + + if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE) + return 0; + + if (list_empty(migrate_folios)) + return 0; + + /* Migration ignores all cpuset and mempolicy settings */ + migrate_pages(migrate_folios, alloc_migrate_folio, NULL, + (unsigned long), MIGRATE_ASYNC, MR_DAMON, + _succeeded); + + return nr_succeeded; +} + +static unsigned int damon_pa_migrate_folio_list(struct list_head *folio_list, + struct pglist_data *pgdat, + int target_nid) +{ + unsigned int nr_migrated = 0; + struct folio *folio; + LIST_HEAD(ret_folios); + LIST_HEAD(migrate_folios); + + while (!list_empty(folio_list)) { + struct folio *folio; + + cond_resched(); + + folio = lru_to_folio(folio_list); + list_del(>lru); + + if (!folio_trylock(folio)) + goto keep; + + /* Relocate its contents to another node. */ + list_add(>lru, _folios); + folio_unlock(folio); + continue; +keep: + list_add(>lru, _folios); + } + /* 'folio_list' is always empty here */ + + /* Migrate folios selected for migration */ + nr_migrated += __damon_pa_migrate_folio_list( + _folios, pgdat, target_nid); + /* +* Folios that could not be migrated are still in @migrate_folios. Add +* those back on @folio_list +*/ + if (!list_empty(_folios)) + list_splice_init(_folios, folio_list); + + try_to_unmap_flush(); + +
[PATCH v5 4/8] mm/migrate: add MR_DAMON to migrate_reason
The current patch series introduces DAMON based migration across NUMA nodes so it'd be better to have a new migrate_reason in trace events. Signed-off-by: Honggyu Kim Reviewed-by: SeongJae Park Signed-off-by: SeongJae Park --- include/linux/migrate_mode.h | 1 + include/trace/events/migrate.h | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h index f37cc03f9369..cec36b7e7ced 100644 --- a/include/linux/migrate_mode.h +++ b/include/linux/migrate_mode.h @@ -29,6 +29,7 @@ enum migrate_reason { MR_CONTIG_RANGE, MR_LONGTERM_PIN, MR_DEMOTION, + MR_DAMON, MR_TYPES }; diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h index 0190ef725b43..cd01dd7b3640 100644 --- a/include/trace/events/migrate.h +++ b/include/trace/events/migrate.h @@ -22,7 +22,8 @@ EM( MR_NUMA_MISPLACED, "numa_misplaced") \ EM( MR_CONTIG_RANGE,"contig_range") \ EM( MR_LONGTERM_PIN,"longterm_pin") \ - EMe(MR_DEMOTION,"demotion") + EM( MR_DEMOTION,"demotion") \ + EMe(MR_DAMON, "damon") /* * First define the enums in the above macros to be exported to userspace -- 2.34.1
Re: [PATCH 2/8] tracing: do not trace kernel_text_address()
On Thu, 13 Jun 2024 15:11:07 +0800 Andy Chiu wrote: > kernel_text_address() and __kernel_text_address() are called in > arch_stack_walk() of riscv. This results in excess amount of un-related > traces when the kernel is compiled with CONFIG_TRACE_IRQFLAGS. The > situation worsens when function_graph is active, as it calls > local_irq_save/restore in each function's entry/exit. This patch adds > both functions to notrace, so they won't show up on the trace records. I rather not add notrace just because something is noisy. You can always just add: echo '*kernel_text_address' > /sys/kernel/tracing/set_ftrace_notrace and achieve the same result. -- Steve
[PATCH v5 6/8] mm/damon/paddr: introduce DAMOS_MIGRATE_HOT action for promotion
From: Hyeongtak Ji This patch introduces DAMOS_MIGRATE_HOT action, which is similar to DAMOS_MIGRATE_COLD, but proritizes hot pages. It migrates pages inside the given region to the 'target_nid' NUMA node in the sysfs. Here is one of the example usage of this 'migrate_hot' action. $ cd /sys/kernel/mm/damon/admin/kdamonds/ $ cat contexts//schemes//action migrate_hot $ echo 0 > contexts//schemes//target_nid $ echo commit > state $ numactl -p 2 ./hot_cold 500M 600M & $ numastat -c -p hot_cold Per-node process memory usage (in MBs) PID Node 0 Node 1 Node 2 Total -- -- -- -- - 701 (hot_cold) 501 0601 1101 Signed-off-by: Hyeongtak Ji Signed-off-by: Honggyu Kim Signed-off-by: SeongJae Park --- include/linux/damon.h| 2 ++ mm/damon/paddr.c | 3 +++ mm/damon/sysfs-schemes.c | 1 + 3 files changed, 6 insertions(+) diff --git a/include/linux/damon.h b/include/linux/damon.h index 56714b6eb0d7..3d62d98d6359 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -105,6 +105,7 @@ struct damon_target { * @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE. * @DAMOS_LRU_PRIO:Prioritize the region on its LRU lists. * @DAMOS_LRU_DEPRIO: Deprioritize the region on its LRU lists. + * @DAMOS_MIGRATE_HOT: Migrate the regions prioritizing warmer regions. * @DAMOS_MIGRATE_COLD:Migrate the regions prioritizing colder regions. * @DAMOS_STAT:Do nothing but count the stat. 
* @NR_DAMOS_ACTIONS: Total number of DAMOS actions @@ -123,6 +124,7 @@ enum damos_action { DAMOS_NOHUGEPAGE, DAMOS_LRU_PRIO, DAMOS_LRU_DEPRIO, + DAMOS_MIGRATE_HOT, DAMOS_MIGRATE_COLD, DAMOS_STAT, /* Do nothing but only record the stat */ NR_DAMOS_ACTIONS, diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c index 882ae54af829..af6aac388a43 100644 --- a/mm/damon/paddr.c +++ b/mm/damon/paddr.c @@ -486,6 +486,7 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx, return damon_pa_mark_accessed(r, scheme); case DAMOS_LRU_DEPRIO: return damon_pa_deactivate_pages(r, scheme); + case DAMOS_MIGRATE_HOT: case DAMOS_MIGRATE_COLD: return damon_pa_migrate(r, scheme); case DAMOS_STAT: @@ -508,6 +509,8 @@ static int damon_pa_scheme_score(struct damon_ctx *context, return damon_hot_score(context, r, scheme); case DAMOS_LRU_DEPRIO: return damon_cold_score(context, r, scheme); + case DAMOS_MIGRATE_HOT: + return damon_hot_score(context, r, scheme); case DAMOS_MIGRATE_COLD: return damon_cold_score(context, r, scheme); default: diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c index 880015d5b5ea..66fccfa776d7 100644 --- a/mm/damon/sysfs-schemes.c +++ b/mm/damon/sysfs-schemes.c @@ -1458,6 +1458,7 @@ static const char * const damon_sysfs_damos_action_strs[] = { "nohugepage", "lru_prio", "lru_deprio", + "migrate_hot", "migrate_cold", "stat", }; -- 2.34.1
[PATCH v5 7/8] Docs/admin-guide/mm/damon/usage: add missing actions
"lru_prio" and "lru_deprio" are missing in the damos action list so they have to be added properly at damon usage document. Signed-off-by: Honggyu Kim --- Documentation/admin-guide/mm/damon/usage.rst | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index e58ceb89ea2a..7bff54963975 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -757,7 +757,9 @@ list on :ref:`design doc `. - 2: ``pageout`` - 3: ``hugepage`` - 4: ``nohugepage`` - - 5: ``stat`` + - 5: ``lru_prio`` + - 6: ``lru_deprio`` + - 7: ``stat`` Quota ~ -- 2.34.1
[PATCH v5 0/8] DAMON based tiered memory management for CXL memory
There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously posted at [1]. It says there is no implementation of the demote/promote DAMOS action are made. This patch series is about its implementation for physical address space so that this scheme can be applied in system wide level. Changes from RFC v4: https://lore.kernel.org/20240512175447.75943-1...@kernel.org 1. Add usage and design documents 2. Rename alloc_demote_folio to alloc_migrate_folio 3. Add evaluation results with "demotion_enabled" true 4. Rebase based on v6.10-rc3 Changes from RFC v3: https://lore.kernel.org/20240405060858.2818-1-honggyu@sk.com 0. updated from v3 and posted by SJ on behalf of Honggyu under his approval. 1. Do not reuse damon_pa_pageout() and drop 'enum migration_mode' 2. Drop vmstat change 3. Drop unnecessary page reference check Changes from RFC v2: https://lore.kernel.org/20240226140555.1615-1-honggyu@sk.com 1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}. 2. Create 'target_nid' to set the migration target node instead of depending on node distance based information. 3. Instead of having page level access check in this patch series, delegate the job to a new DAMOS filter type YOUNG[2]. 4. Introduce vmstat counters "damon_migrate_{hot,cold}". 5. Rebase from v6.7 to v6.8. Changes from RFC: https://lore.kernel.org/20240115045253.1775-1-honggyu@sk.com 1. Move most of implementation from mm/vmscan.c to mm/damon/paddr.c. 2. Simplify some functions of vmscan.c and used in paddr.c, but need to be reviewed more in depth. 3. Refactor most functions for common usage for both promote and demote actions and introduce an enum migration_mode for its control. 4. Add "target_nid" sysfs knob for migration destination node for both promote and demote actions. 5. Move DAMOS_PROMOTE before DAMOS_DEMOTE and move then even above DAMOS_STAT. 
Introduction With the advent of CXL/PCIe attached DRAM, which will be called simply as CXL memory in this cover letter, some systems are becoming more heterogeneous having memory systems with different latency and bandwidth characteristics. They are usually handled as different NUMA nodes in separate memory tiers and CXL memory is used as slow tiers because of its protocol overhead compared to local DRAM. In this kind of systems, we need to be careful placing memory pages on proper NUMA nodes based on the memory access frequency. Otherwise, some frequently accessed pages might reside on slow tiers and it makes performance degradation unexpectedly. Moreover, the memory access patterns can be changed at runtime. To handle this problem, we need a way to monitor the memory access patterns and migrate pages based on their access temperature. The DAMON(Data Access MONitor) framework and its DAMOS(DAMON-based Operation Schemes) can be useful features for monitoring and migrating pages. DAMOS provides multiple actions based on DAMON monitoring results and it can be used for proactive reclaim, which means swapping cold pages out with DAMOS_PAGEOUT action, but it doesn't support migration actions such as demotion and promotion between tiered memory nodes. This series supports two new DAMOS actions; DAMOS_MIGRATE_HOT for promotion from slow tiers and DAMOS_MIGRATE_COLD for demotion from fast tiers. This prevents hot pages from being stuck on slow tiers, which makes performance degradation and cold pages can be proactively demoted to slow tiers so that the system can increase the chance to allocate more hot pages to fast tiers. The DAMON provides various tuning knobs but we found that the proactive demotion for cold pages is especially useful when the system is running out of memory on its fast tier nodes. 
Our evaluation result shows that it reduces the performance slowdown compared to the default memory policy from 17~18% to 4~5% when the system runs under high memory pressure on its fast tier DRAM nodes. DAMON configuration === The specific DAMON configuration doesn't have to be in the scope of this patch series, but some rough idea is better to be shared to explain the evaluation result. The DAMON provides many knobs for fine tuning but its configuration file is generated by HMSDK[3]. It includes gen_config.py script that generates a json file with the full config of DAMON knobs and it creates multiple kdamonds for each NUMA node when the DAMON is enabled so that it can run hot/cold based migration for tiered memory. Evaluation Workload === The performance evaluation is done with redis[4], which is a widely used in-memory database and the memory access patterns are generated via YCSB[5]. We have measured two different workloads with zipfian and latest distributions but their configs are slightly modified to make memory usage higher and execution time longer for better evaluation. The idea of evaluation using these
[PATCH v5 3/8] mm/damon/sysfs-schemes: add target_nid on sysfs-schemes
From: Hyeongtak Ji This patch adds target_nid under /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes// The 'target_nid' can be used as the destination node for DAMOS actions such as DAMOS_MIGRATE_{HOT,COLD} in the follow up patches. Signed-off-by: Hyeongtak Ji Signed-off-by: Honggyu Kim Signed-off-by: SeongJae Park --- include/linux/damon.h| 11 ++- mm/damon/core.c | 5 - mm/damon/dbgfs.c | 2 +- mm/damon/lru_sort.c | 3 ++- mm/damon/reclaim.c | 3 ++- mm/damon/sysfs-schemes.c | 33 - 6 files changed, 51 insertions(+), 6 deletions(-) diff --git a/include/linux/damon.h b/include/linux/damon.h index f7da65e1ac04..21d6b69a015c 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -374,6 +374,7 @@ struct damos_access_pattern { * @apply_interval_us: The time between applying the @action. * @quota: Control the aggressiveness of this scheme. * @wmarks:Watermarks for automated (in)activation of this scheme. + * @target_nid:Destination node if @action is "migrate_{hot,cold}". * @filters: Additional set of damos_filter for * @stat: Statistics of this scheme. * @list: List head for siblings. @@ -389,6 +390,10 @@ struct damos_access_pattern { * monitoring context are inactive, DAMON stops monitoring either, and just * repeatedly checks the watermarks. * + * @target_nid is used to set the migration target node for migrate_hot or + * migrate_cold actions, which means it's only meaningful when @action is either + * "migrate_hot" or "migrate_cold". 
+ * * Before applying the to a memory region, damon_operations * implementation could check pages of the region and skip to respect * @@ -410,6 +415,9 @@ struct damos { /* public: */ struct damos_quota quota; struct damos_watermarks wmarks; + union { + int target_nid; + }; struct list_head filters; struct damos_stat stat; struct list_head list; @@ -726,7 +734,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern, enum damos_action action, unsigned long apply_interval_us, struct damos_quota *quota, - struct damos_watermarks *wmarks); + struct damos_watermarks *wmarks, + int target_nid); void damon_add_scheme(struct damon_ctx *ctx, struct damos *s); void damon_destroy_scheme(struct damos *s); diff --git a/mm/damon/core.c b/mm/damon/core.c index 6392f1cc97a3..c0ec5be4f56e 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -354,7 +354,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern, enum damos_action action, unsigned long apply_interval_us, struct damos_quota *quota, - struct damos_watermarks *wmarks) + struct damos_watermarks *wmarks, + int target_nid) { struct damos *scheme; @@ -381,6 +382,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern, scheme->wmarks = *wmarks; scheme->wmarks.activated = true; + scheme->target_nid = target_nid; + return scheme; } diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c index 2461cfe2e968..51a6f1cac385 100644 --- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -281,7 +281,7 @@ static struct damos **str_to_schemes(const char *str, ssize_t len, pos += parsed; scheme = damon_new_scheme(, action, 0, , - ); + , NUMA_NO_NODE); if (!scheme) goto fail; diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c index 3de2916a65c3..3775f0f2743d 100644 --- a/mm/damon/lru_sort.c +++ b/mm/damon/lru_sort.c @@ -163,7 +163,8 @@ static struct damos *damon_lru_sort_new_scheme( /* under the quota. */ , /* (De)activate this according to the watermarks. 
*/ - _lru_sort_wmarks); + _lru_sort_wmarks, + NUMA_NO_NODE); } /* Create a DAMON-based operation scheme for hot memory regions */ diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c index 9bd341d62b4c..a05ccb41749b 100644 --- a/mm/damon/reclaim.c +++ b/mm/damon/reclaim.c @@ -177,7 +177,8 @@ static struct damos *damon_reclaim_new_scheme(void) /* under the quota. */ _reclaim_quota, /* (De)activate this according to the watermarks. */ - _reclaim_wmarks); + _reclaim_wmarks, + NUMA_NO_NODE); } static void damon_reclaim_copy_quota_status(struct