Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
On Wed, Jul 5, 2023 at 10:19 PM Zhi Wang wrote: > > On Tue, 4 Jul 2023 16:50:48 +0900 > David Stevens wrote: > > > From: David Stevens > > > > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows > > callers to resolve a gfn when the associated pfn has a valid struct page > > that isn't being actively refcounted (e.g. tail pages of non-compound > > higher order pages). For a caller to safely omit FOLL_GET, all usages of > > the returned pfn must be guarded by a mmu notifier. > > > > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that > > is set when the returned pfn has an associated struct page with a valid > > refcount. Callers that don't pass FOLL_GET should remember this value > > and use it to avoid places like kvm_is_ad_tracked_page that assume a > > non-zero refcount. > > > > Signed-off-by: David Stevens > > --- > > include/linux/kvm_host.h | 10 ++ > > virt/kvm/kvm_main.c | 67 +--- > > virt/kvm/pfncache.c | 2 +- > > 3 files changed, 47 insertions(+), 32 deletions(-) > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > index ef2763c2b12e..a45308c7d2d9 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct > > kvm_memory_slot *slot, gfn_t gfn, > > void kvm_release_page_clean(struct page *page); > > void kvm_release_page_dirty(struct page *page); > > > > +void kvm_set_page_accessed(struct page *page); > > +void kvm_set_page_dirty(struct page *page); > > + > > struct kvm_follow_pfn { > > const struct kvm_memory_slot *slot; > > gfn_t gfn; > > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn { > > bool atomic; > > /* Allow a read fault to create a writeable mapping. */ > > bool allow_write_mapping; > > + /* > > + * Usage of the returned pfn will be guared by a mmu notifier. Must > ^guarded > > + * be true if FOLL_GET is not set. 
> > + */ > > + bool guarded_by_mmu_notifier; > > It seems no one sets the guarded_by_mmu_notifier in this patch. Is > guarded_by_mmu_notifier always equal to !foll->FOLL_GET and set by the > caller of __kvm_follow_pfn()? Yes, this is the case. > If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page? > It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a > temporary solution. I don't think it is a good idea to play tricks with > a temporary solution, more like we are abusing the toleration. I'm not sure I understand what you're getting at. This series never calls gup without FOLL_GET. This series aims to provide kvm_follow_pfn as a unified API on top of gup+follow_pte. Since one of the major clients of this API uses an mmu notifier, it makes sense to support returning a pfn without taking a reference. And we indeed need to do that for certain types of memory. > Is a flag like guarded_by_mmu_notifier (perhaps a better name) enough to > indicate a tail page? What do you mean by "to indicate a tail page"? Do you mean to indicate that the returned pfn refers to a non-refcounted page? That's specified by is_refcounted_page. -David
Re: [PATCH] rtc: Kconfig: select REGMAP for RTC_DRV_DS1307
Le 06/07/2023 à 08:14, Benjamin Gray a écrit : > On Thu, 2023-07-06 at 05:13 +, Christophe Leroy wrote: >> >> >> Le 05/07/2023 à 02:30, Benjamin Gray a écrit : >>> The drivers/rtc/rtc-ds1307.c driver has a direct dependency on >>> struct regmap_config, which is guarded behind CONFIG_REGMAP. >>> >>> Commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not select >>> REGMAP") >>> exposed this by disabling the default pick unless KUNIT_ALL_TESTS >>> is >>> set, causing the ppc64be allnoconfig build to fail. >>> >>> Signed-off-by: Benjamin Gray >>> --- >>> drivers/rtc/Kconfig | 1 + >>> 1 file changed, 1 insertion(+) >>> >>> diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig >>> index ffca9a8bb878..7455ebd189fe 100644 >>> --- a/drivers/rtc/Kconfig >>> +++ b/drivers/rtc/Kconfig >>> @@ -246,6 +246,7 @@ config RTC_DRV_AS3722 >>> >>> config RTC_DRV_DS1307 >>> tristate "Dallas/Maxim DS1307/37/38/39/40/41, ST M41T00, >>> EPSON RX-8025, ISL12057" >>> + select REGMAP >> >> As far as I can see, REGMAP defaults to Y when REGMAP_I2C is >> selected. >> Can you explain more in details why you have to select it explicitely >> ? >> If there is something wrong with the logic, then the logic should be >> fixed instead of just adding a selection of REGMAP for that >> particular >> RTC_DRV_DS1307. Because others like RTC_DRV_ABB5ZES3 or >> RTC_DRV_ABEOZ9 >> might have the exact same problem. > > Right, yeah, I don't want to assert this patch is the correct solution, > sending it was more to offer a fix and allow discussion if it should be > resolved some other way (so thanks for replying, I appreciate it). > > In terms of why I made this patch, the way I see it, if a config option > requires another config option be set, then "selects" is the natural > way of phrasing this dependency. "default" on the REGMAP side seems > weird. If it's a default, does that mean it can be overridden? But > RTC_DRV_DS1307 *requires* REGMAP; it's not just a "would be nice". The > build breaks without it. 
I meant "why doesn't it work as is", and/or "why didn't you fix what doesn't work". Well, until commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not select REGMAP") it was not user-selectable so it couldn't be overridden. After commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not select REGMAP") it is overridable so we have an additional problem. Does RTC_DRV_DS1307 require REGMAP or does RTC_DRV_DS1307 require REGMAP_I2C and then REGMAP_I2C requires REGMAP ? I think that huge default like for REGMAP should be replaced by individual selection of REGMAP from each of those config items. For example REGMAP_I2C should select REGMAP, then I guess it would also solve your issue, wouldn't it ? > > But maybe KConfig works differently to my assumptions. Maybe the > referenced patch that causes the build failure is actually incorrect > (CC Geert). I spoke with Joel Stanley (CC) and he indicated you're not > supposed to depend on REGMAP like KUnit does? > > As for other drivers having the same problem, quite possibly, yes. But > the PPC configs don't seem to build those drivers, so I'd prefer to > leave it to anyone who does build them to report the error. I wasn't > planning to audit all the drivers, I was just trying to fix the PPC > configs I build and test. Well ok but that's probably the indication of a deeper problem which deserves a more generic fix.
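Christophe's suggestion — replacing the big `default` expression on REGMAP with individual selections from each bus variant — could look roughly like this hypothetical Kconfig fragment (heavily abbreviated, not the actual tree):

```kconfig
config REGMAP
	bool

config REGMAP_I2C
	tristate
	select REGMAP

config RTC_DRV_DS1307
	tristate "Dallas/Maxim DS1307/37/38/39/40/41, ST M41T00, EPSON RX-8025, ISL12057"
	select REGMAP_I2C
```

With REGMAP_I2C selecting REGMAP directly, the driver's existing `select REGMAP_I2C` pulls in REGMAP transitively, so no per-driver `select REGMAP` would be needed and the KUnit change to REGMAP's default would no longer break these builds.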
Re: [PATCH] rtc: Kconfig: select REGMAP for RTC_DRV_DS1307
On Thu, 2023-07-06 at 05:13 +, Christophe Leroy wrote: > > > Le 05/07/2023 à 02:30, Benjamin Gray a écrit : > > The drivers/rtc/rtc-ds1307.c driver has a direct dependency on > > struct regmap_config, which is guarded behind CONFIG_REGMAP. > > > > Commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not select > > REGMAP") > > exposed this by disabling the default pick unless KUNIT_ALL_TESTS > > is > > set, causing the ppc64be allnoconfig build to fail. > > > > Signed-off-by: Benjamin Gray > > --- > > drivers/rtc/Kconfig | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig > > index ffca9a8bb878..7455ebd189fe 100644 > > --- a/drivers/rtc/Kconfig > > +++ b/drivers/rtc/Kconfig > > @@ -246,6 +246,7 @@ config RTC_DRV_AS3722 > > > > config RTC_DRV_DS1307 > > tristate "Dallas/Maxim DS1307/37/38/39/40/41, ST M41T00, > > EPSON RX-8025, ISL12057" > > + select REGMAP > > As far as I can see, REGMAP defaults to Y when REGMAP_I2C is > selected. > Can you explain more in details why you have to select it explicitely > ? > If there is something wrong with the logic, then the logic should be > fixed instead of just adding a selection of REGMAP for that > particular > RTC_DRV_DS1307. Because others like RTC_DRV_ABB5ZES3 or > RTC_DRV_ABEOZ9 > might have the exact same problem. Right, yeah, I don't want to assert this patch is the correct solution, sending it was more to offer a fix and allow discussion if it should be resolved some other way (so thanks for replying, I appreciate it). In terms of why I made this patch, the way I see it, if a config option requires another config option be set, then "selects" is the natural way of phrasing this dependency. "default" on the REGMAP side seems weird. If it's a default, does that mean it can be overridden? But RTC_DRV_DS1307 *requires* REGMAP; it's not just a "would be nice". The build breaks without it. But maybe KConfig works differently to my assumptions. 
Maybe the referenced patch that causes the build failure is actually incorrect (CC Geert). I spoke with Joel Stanley (CC) and he indicated you're not supposed to depend on REGMAP like KUnit does? As for other drivers having the same problem, quite possibly, yes. But the PPC configs don't seem to build those drivers, so I'd prefer to leave it to anyone who does build them to report the error. I wasn't planning to audit all the drivers, I was just trying to fix the PPC configs I build and test.
[next-20230705] kernel BUG mm/memcontrol.c:3715! (ltp/madvise06)
While running LTP tests (madvise06) on IBM Power9 LPAR booted with 6.4.0-next-20230705 following crash is seen Injecting memory failure for pfn 0x3f79 at process virtual address 0x7fff9b74 Memory failure: 0x3f79: recovery action for clean LRU page: Recovered madvise06 (133636): drop_caches: 3 [ cut here ] kernel BUG at mm/memcontrol.c:3715! Oops: Exception in kernel mode, sig: 5 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=8192 NUMA pSeries Modules linked in: brd overlay exfat vfat fat xfs loop sctp ip6_udp_tunnel udp_tunnel dm_mod nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bonding ip_set tls rfkill nf_tables libcrc32c nfnetlink sunrpc pseries_rng vmx_crypto ext4 mbcache jbd2 sd_mod t10_pi crc64_rocksoft crc64 sg ibmvscsi scsi_transport_srp ibmveth fuse [last unloaded: init_module(O)] CPU: 10 PID: 133636 Comm: madvise06 Tainted: G O 6.4.0-next-20230705 #1 Hardware name: IBM,8375-42A POWER9 (raw) 0x4e0202 0xf05 of:IBM,FW950.80 (VL950_131) hv:phyp pSeries NIP: c054ea88 LR: c028b2a8 CTR: c054e8d0 REGS: c0029dd7b890 TRAP: 0700 Tainted: G O (6.4.0-next-20230705) MSR: 80029033 CR: 28008288 XER: CFAR: c054e904 IRQMASK: 0 GPR00: c028b2a8 c0029dd7bb30 c1431600 c002bc978000 GPR04: c2b3b288 00010192 0001 GPR08: c000f9abb180 0002 c002bc978580 GPR12: c054e8d0 c0001ec53f00 GPR16: GPR20: c0001b2e6578 00400cc0 7fff f000 GPR24: c0029dd7bd30 c0029dd7bd58 c0001b2e6568 GPR28: c0029dd7bde0 0001 0001 c0001b2e6540 NIP [c054ea88] mem_cgroup_read_u64+0x1b8/0x1d0 LR [c028b2a8] cgroup_seqfile_show+0xb8/0x160 Call Trace: [c0029dd7bb50] [c028b2a8] cgroup_seqfile_show+0xb8/0x160 [c0029dd7bbc0] [c0673ba4] kernfs_seq_show+0x44/0x60 [c0029dd7bbe0] [c05c4238] seq_read_iter+0x238/0x620 [c0029dd7bcb0] [c0675064] kernfs_fop_read_iter+0x1d4/0x2c0 [c0029dd7bd00] [c057fbac] vfs_read+0x26c/0x350 [c0029dd7bdc0] [c058077c] ksys_read+0x7c/0x140 [c0029dd7be10] [c0036900] 
system_call_exception+0x140/0x350 [c0029dd7be50] [c000d6a0] system_call_common+0x160/0x2e4 --- interrupt: c00 at 0x7fff9eb41484 NIP: 7fff9eb41484 LR: 10008540 CTR: REGS: c0029dd7be80 TRAP: 0c00 Tainted: G O (6.4.0-next-20230705) MSR: 8280f033 CR: 28002282 XER: IRQMASK: 0 GPR00: 0003 7fffc33de7d0 7fff9ec27300 0013 GPR04: 7fffc33e0aa0 1fff 0013 GPR08: 7fffc33e0aa0 GPR12: 7fff9ecca3a0 GPR16: 10035520 10035b90 100347a8 GPR20: 1002fb68 10063900 2000 1002fb68 GPR24: 004c 1002fa78 7fffc33e0aa0 GPR28: 0013 1fff 1fff NIP [7fff9eb41484] 0x7fff9eb41484 LR [10008540] 0x10008540 --- interrupt: c00 Code: 7fa34800 409effc4 7c0802a6 3861 f8010030 4bfffdfd e8010030 786383e4 7c0803a6 4b6c 7c0802a6 f8010030 <0fe0> 7c0802a6 f8010030 0fe0 ---[ end trace ]--- pstore: backend (nvram) writing error (-1) Kernel panic - not syncing: Fatal exception Rebooting in 10 seconds.. Git bisect points to following patch: commit 29bf1eb7d2abbdfc24c4ef7acf7a51b72dc43d2b memcg: drop kmem.limit_in_bytes Does the testcase madvise06 need an update? 90 tst_res(TINFO, "\tCached: %ld Kb", 91 SAFE_READ_MEMINFO("Cached:") - init_cached); 92 93 print_cgmem("memory.current"); 94 print_cgmem("memory.swap.current"); 95 print_cgmem("memory.kmem.usage_in_bytes”); <<== this line. 96 } If I comment line 95 from the testcase, it completes successfully. - Sachin
Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
On Wed, Jul 5, 2023 at 8:56 PM Yu Zhang wrote: > > On Tue, Jul 04, 2023 at 04:50:48PM +0900, David Stevens wrote: > > From: David Stevens > > > > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows > > callers to resolve a gfn when the associated pfn has a valid struct page > > that isn't being actively refcounted (e.g. tail pages of non-compound > > higher order pages). For a caller to safely omit FOLL_GET, all usages of > > the returned pfn must be guarded by a mmu notifier. > > > > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that > > is set when the returned pfn has an associated struct page with a valid > > refcount. Callers that don't pass FOLL_GET should remember this value > > and use it to avoid places like kvm_is_ad_tracked_page that assume a > > non-zero refcount. > > > > Signed-off-by: David Stevens > > --- > > include/linux/kvm_host.h | 10 ++ > > virt/kvm/kvm_main.c | 67 +--- > > virt/kvm/pfncache.c | 2 +- > > 3 files changed, 47 insertions(+), 32 deletions(-) > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > index ef2763c2b12e..a45308c7d2d9 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct > > kvm_memory_slot *slot, gfn_t gfn, > > void kvm_release_page_clean(struct page *page); > > void kvm_release_page_dirty(struct page *page); > > > > +void kvm_set_page_accessed(struct page *page); > > +void kvm_set_page_dirty(struct page *page); > > + > > struct kvm_follow_pfn { > > const struct kvm_memory_slot *slot; > > gfn_t gfn; > > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn { > > bool atomic; > > /* Allow a read fault to create a writeable mapping. */ > > bool allow_write_mapping; > > + /* > > + * Usage of the returned pfn will be guared by a mmu notifier. Must > > + * be true if FOLL_GET is not set. > > + */ > > + bool guarded_by_mmu_notifier; > > And how? 
Any place to check the invalidate seq? kvm_follow_pfn can't meaningfully validate the seq number, since the mmu notifier locking is handled by the caller. This is more of a sanity check that the API is being used properly, as proposed here [1]. I did deviate from the proposal with a bool instead of some type of integer, since the exact value of mmu_seq wouldn't be useful. [1] https://lore.kernel.org/all/zgvusf7lmkrnd...@google.com/#t -David
Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
On Thu, Jul 6, 2023 at 10:34 AM Isaku Yamahata wrote: > > On Tue, Jul 04, 2023 at 04:50:47PM +0900, > David Stevens wrote: > > > From: David Stevens > > > > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot. > > __kvm_follow_pfn refactors the old API's arguments into a struct and, > > where possible, combines the boolean arguments into a single flags > > argument. > > > > Signed-off-by: David Stevens > > --- > > include/linux/kvm_host.h | 16 > > virt/kvm/kvm_main.c | 171 ++- > > virt/kvm/kvm_mm.h| 3 +- > > virt/kvm/pfncache.c | 8 +- > > 4 files changed, 122 insertions(+), 76 deletions(-) > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > index 9d3ac7720da9..ef2763c2b12e 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -97,6 +97,7 @@ > > #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1) > > #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2) > > #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3) > > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4) > > > > /* > > * error pfns indicate that the gfn is in slot but faild to > > @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct > > kvm_memory_slot *slot, gfn_t gfn, > > void kvm_release_page_clean(struct page *page); > > void kvm_release_page_dirty(struct page *page); > > > > +struct kvm_follow_pfn { > > + const struct kvm_memory_slot *slot; > > + gfn_t gfn; > > + unsigned int flags; > > + bool atomic; > > + /* Allow a read fault to create a writeable mapping. */ > > + bool allow_write_mapping; > > Maybe, make them const for input arguments? Unfortunately using const isn't straightforward as long as the kernel continues to use -Wdeclaration-after-statement. If these fields were const, then they would need to be specified in the initializer when declaring the variable, but that's not necessarily always possible. -David
Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
On Wed, Jul 5, 2023 at 7:53 PM Yu Zhang wrote: > > On Wed, Jul 05, 2023 at 06:22:59PM +0900, David Stevens wrote: > > On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang wrote: > > > > > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, > > > > bool write_fault, > > > > * The slow path to get the pfn of the specified host virtual address, > > > > * 1 indicates success, -errno is returned if error is detected. > > > > */ > > > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool > > > > write_fault, > > > > -bool interruptible, bool *writable, kvm_pfn_t > > > > *pfn) > > > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > > > > { > > > > - unsigned int flags = FOLL_HWPOISON; > > > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags; > > > > struct page *page; > > > > int npages; > > > > > > > > might_sleep(); > > > > > > > > - if (writable) > > > > - *writable = write_fault; > > > > - > > > > - if (write_fault) > > > > - flags |= FOLL_WRITE; > > > > - if (async) > > > > - flags |= FOLL_NOWAIT; > > > > - if (interruptible) > > > > - flags |= FOLL_INTERRUPTIBLE; > > > > - > > > > - npages = get_user_pages_unlocked(addr, 1, &page, flags); > > > > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags); > > > > if (npages != 1) > > > > return npages; > > > > > > > > + foll->writable = (foll->flags & FOLL_WRITE) && > > > > foll->allow_write_mapping; > > > > + > > > > /* map read fault as writable if possible */ > > > > - if (unlikely(!write_fault) && writable) { > > > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) { > > > > > > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here. > > > > The two statements are logically equivalent, although I guess using > > !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more > > verbose. > > Well, as the comment says, we wanna try to map the read fault as writable > whenever possible. 
And __gfn_to_pfn_memslot() will only set the FOLL_WRITE > for write faults. So I guess using !foll->writable will not allow this. > Did I miss anything? We just set the foll->writable out parameter to be equal to ((foll->flags & FOLL_WRITE) && foll->allow_write_mapping). Taking a = foll->flags & FOLL_WRITE and b = foll->allow_write_mapping, we have !(a && b) && b -> (!a || !b) && b -> (!a && b) || (!b && b) -> !a && b. -David
Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
On Thu, Jul 6, 2023 at 11:10 AM Isaku Yamahata wrote: > > On Tue, Jul 04, 2023 at 04:50:50PM +0900, > David Stevens wrote: > > > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c > > index cf2c6426a6fc..46c681dc45e6 100644 > > --- a/arch/x86/kvm/mmu/spte.c > > +++ b/arch/x86/kvm/mmu/spte.c > > @@ -138,7 +138,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct > > kvm_mmu_page *sp, > > const struct kvm_memory_slot *slot, > > unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn, > > u64 old_spte, bool prefetch, bool can_unsync, > > -bool host_writable, u64 *new_spte) > > +bool host_writable, bool is_refcounted, u64 *new_spte) > > { > > int level = sp->role.level; > > u64 spte = SPTE_MMU_PRESENT_MASK; > > @@ -188,6 +188,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct > > kvm_mmu_page *sp, > > > > if (level > PG_LEVEL_4K) > > spte |= PT_PAGE_SIZE_MASK; > > + else if (is_refcounted) > > + spte |= SPTE_MMU_PAGE_REFCOUNTED; > > Is REFCOUNTED for 4K page only? What guarantees that large page doesn't have > FOLL_GET? or can we set the bit for large page? Oh, you're right, it should apply to >4K pages as well. This was based on stale thinking from earlier versions of this series. -David
Re: [PATCH] rtc: Kconfig: select REGMAP for RTC_DRV_DS1307
Le 05/07/2023 à 02:30, Benjamin Gray a écrit : > The drivers/rtc/rtc-ds1307.c driver has a direct dependency on > struct regmap_config, which is guarded behind CONFIG_REGMAP. > > Commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not select REGMAP") > exposed this by disabling the default pick unless KUNIT_ALL_TESTS is > set, causing the ppc64be allnoconfig build to fail. > > Signed-off-by: Benjamin Gray > --- > drivers/rtc/Kconfig | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig > index ffca9a8bb878..7455ebd189fe 100644 > --- a/drivers/rtc/Kconfig > +++ b/drivers/rtc/Kconfig > @@ -246,6 +246,7 @@ config RTC_DRV_AS3722 > > config RTC_DRV_DS1307 > tristate "Dallas/Maxim DS1307/37/38/39/40/41, ST M41T00, EPSON RX-8025, > ISL12057" > + select REGMAP As far as I can see, REGMAP defaults to Y when REGMAP_I2C is selected. Can you explain in more detail why you have to select it explicitly? If there is something wrong with the logic, then the logic should be fixed instead of just adding a selection of REGMAP for that particular RTC_DRV_DS1307. Because others like RTC_DRV_ABB5ZES3 or RTC_DRV_ABEOZ9 might have the exact same problem. > select REGMAP_I2C > select WATCHDOG_CORE if WATCHDOG > help
Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang wrote: > > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote: > > From: David Stevens > > > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map > > memory into the guest that is backed by un-refcounted struct pages - for > > example, higher order non-compound pages allocated by the amdgpu driver > > via ttm_pool_alloc_page. > > I guess you mean the tail pages of the higher order non-compound pages? > And as to the head page, it is said to be set to one coincidentally[*], > and shall not be considered as refcounted. IIUC, refcount of this head > page will be increased and decreased soon in hva_to_pfn_remapped(), so > this may not be a problem(?). But treating this head page differently, > as a refcounted one (e.g., to set the A/D flags), is weird. > > Or maybe I missed some context, e.g., can the head page be allocated to > the guest at all? Yes, this is to allow mapping the tail pages of higher order non-compound pages - I should have been more precise in my wording. The head pages can already be mapped into the guest. Treating the head and tail pages differently would require changing how KVM behaves in a situation it supports today (rather than just adding support for an unsupported situation). Currently, without this series, KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the guest. When that happens, KVM sets the A/D flags. I'm not sure whether that's actually valid behavior, nor do I know whether anyone actually cares about it. But it's what KVM does today, and I would shy away from modifying that behavior without good reason. > > > > The bulk of this change is tracking the is_refcounted_page flag so that > > non-refcounted pages don't trigger page_count() == 0 warnings. This is > > done by storing the flag in an unused bit in the sptes. > > Also, maybe we should mention this only works on x86-64.
> > > > > Signed-off-by: David Stevens > > --- > > arch/x86/kvm/mmu/mmu.c | 44 + > > arch/x86/kvm/mmu/mmu_internal.h | 1 + > > arch/x86/kvm/mmu/paging_tmpl.h | 9 --- > > arch/x86/kvm/mmu/spte.c | 4 ++- > > arch/x86/kvm/mmu/spte.h | 12 - > > arch/x86/kvm/mmu/tdp_mmu.c | 22 ++--- > > 6 files changed, 62 insertions(+), 30 deletions(-) > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > > index e44ab512c3a1..b1607e314497 100644 > > --- a/arch/x86/kvm/mmu/mmu.c > > +++ b/arch/x86/kvm/mmu/mmu.c > > ... > > > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct > > kvm_memory_slot *slot, > > bool host_writable = !fault || fault->map_writable; > > bool prefetch = !fault || fault->prefetch; > > bool write_fault = fault && fault->write; > > + bool is_refcounted = !fault || fault->is_refcounted_page; > > Just wonder, what if a non-refcounted page is prefetched? Or is it possible > in > practice? Prefetching is still done via gfn_to_page_many_atomic, which sets FOLL_GET. That's fixable, but it's not something this series currently does. > ... > > > > @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, > > struct kvm_mmu *mmu, > > */ > > static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page > > *sp, int i) > > { > > - bool host_writable; > > + bool host_writable, is_refcounted; > > gpa_t first_pte_gpa; > > u64 *sptep, spte; > > struct kvm_memory_slot *slot; > > @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, > > struct kvm_mmu_page *sp, int > > sptep = &sp->spt[i]; > > spte = *sptep; > > host_writable = spte & shadow_host_writable_mask; > > + // TODO: is this correct? 
> > + is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED; > > slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn); > > make_spte(vcpu, sp, slot, pte_access, gfn, > > spte_to_pfn(spte), spte, true, false, > > - host_writable, &spte); > > + host_writable, is_refcounted, &spte); > > Could we restrict that a non-refcounted page shall not be used as shadow page? I'm not very familiar with the shadow mmu, so my response might not make sense. But do you mean not allowing non-refcounted pages as the guest page tables shadowed by a kvm_mmu_page? It would probably be possible to do that, and I doubt anyone would care about the restriction. But as far as I can tell, the guest page table is only accessed via kvm_vcpu_read_guest_atomic, which handles non-refcounted pages just fine. -David
Re: [PATCH] powerpc/powermac: Use early_* IO variants in via_calibrate_decr
Le 06/07/2023 à 03:08, Benjamin Gray a écrit : > On a powermac platform, under the call path > >start_kernel > time_init > ppc_md.calibrate_decr (pmac_calibrate_decr) > via_calibrate_decr > > we run ioremap and iounmap. The unmap can enable interrupts > unexpectedly (cond_resched in vunmap_pmd_range), which is warned about > later in the boot sequence in start_kernel. > > Use the early_* variants of these IO functions to prevent this. > > The issue is pre-existing, but is surfaced by commit 721255b9826b > ("genirq: Use a maple tree for interrupt descriptor management"). > It's not clear to me why this causes it to surface. > > Signed-off-by: Benjamin Gray Reviewed-by: Christophe Leroy > --- > arch/powerpc/platforms/powermac/time.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/arch/powerpc/platforms/powermac/time.c > b/arch/powerpc/platforms/powermac/time.c > index 4c5790aff1b5..8633891b7aa5 100644 > --- a/arch/powerpc/platforms/powermac/time.c > +++ b/arch/powerpc/platforms/powermac/time.c > @@ -26,8 +26,8 @@ > #include > #include > > +#include > #include > -#include > #include > #include > #include > @@ -182,7 +182,7 @@ static int __init via_calibrate_decr(void) > return 0; > } > of_node_put(vias); > - via = ioremap(rsrc.start, resource_size(&rsrc)); > + via = early_ioremap(rsrc.start, resource_size(&rsrc)); > if (via == NULL) { > printk(KERN_ERR "Failed to map VIA for timer calibration !\n"); > return 0; > @@ -207,7 +207,7 @@ static int __init via_calibrate_decr(void) > > ppc_tb_freq = (dstart - dend) * 100 / 6; > > - iounmap(via); > + early_iounmap((void *)via, resource_size(&rsrc)); > > return 1; > }
Re: [PATCH] KVM: ppc64: Enable ring-based dirty memory tracking
On 8/6/23 10:34 pm, Kautuk Consul wrote: Need at least a little context in the commit message itself: "Enable ring-based dirty memory tracking on ppc64:" - Enable CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL as ppc64 is weakly ordered. - Enable CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP because the kvmppc_xive_native_set_attr is called in the context of an ioctl syscall and will call kvmppc_xive_native_eq_sync for setting the KVM_DEV_XIVE_EQ_SYNC attribute which will call mark_dirty_page() when there isn't a running vcpu. Implemented the kvm_arch_allow_write_without_running_vcpu to always return true to allow mark_page_dirty_in_slot to mark the page dirty in the memslot->dirty_bitmap in this case. Should kvm_arch_allow_write_without_running_vcpu() only return true in the context of kvmppc_xive_native_eq_sync()? - Set KVM_DIRTY_LOG_PAGE_OFFSET for the ring buffer's physical page offset. - Implement the kvm_arch_mmu_enable_log_dirty_pt_masked function required for the generic KVM code to call. - Add a check to kvmppc_vcpu_run_hv for checking whether the dirty ring is soft full. - Implement the kvm_arch_flush_remote_tlbs_memslot function to support the CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT config option. On testing with live migration it was found that there is around 150-180 ms improvement in overall migration time with this patch. Signed-off-by: Kautuk Consul --- Documentation/virt/kvm/api.rst | 2 +- arch/powerpc/include/uapi/asm/kvm.h | 2 ++ arch/powerpc/kvm/Kconfig| 2 ++ arch/powerpc/kvm/book3s_64_mmu_hv.c | 42 + arch/powerpc/kvm/book3s_hv.c| 3 +++ include/linux/kvm_dirty_ring.h | 5 6 files changed, 55 insertions(+), 1 deletion(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index add067793b90..ce1ebc513bae 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -8114,7 +8114,7 @@ regardless of what has actually been exposed through the CPUID leaf.
8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL -- -:Architectures: x86, arm64 +:Architectures: x86, arm64, ppc64 :Parameters: args[0] - size of the dirty log ring KVM is capable of tracking dirty memory using ring buffers that are diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 9f18fa090f1f..f722309ed7fb 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -33,6 +33,8 @@ /* Not always available, but if it is, this is the correct offset. */ #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 +#define KVM_DIRTY_LOG_PAGE_OFFSET 64 + struct kvm_regs { __u64 pc; __u64 cr; diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig index 902611954200..c93354ec3bd5 100644 --- a/arch/powerpc/kvm/Kconfig +++ b/arch/powerpc/kvm/Kconfig @@ -26,6 +26,8 @@ config KVM select IRQ_BYPASS_MANAGER select HAVE_KVM_IRQ_BYPASS select INTERVAL_TREE + select HAVE_KVM_DIRTY_RING_ACQ_REL + select NEED_KVM_DIRTY_RING_WITH_BITMAP config KVM_BOOK3S_HANDLER bool diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 7f765d5ad436..c92e8022e017 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -2147,3 +2147,45 @@ void kvmppc_mmu_book3s_hv_init(struct kvm_vcpu *vcpu) vcpu->arch.hflags |= BOOK3S_HFLAG_SLB; } + +/* + * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected + * dirty pages. + * + * It write protects selected pages to enable dirty logging for them. 
+ */ +void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm, +struct kvm_memory_slot *slot, +gfn_t gfn_offset, +unsigned long mask) +{ + phys_addr_t base_gfn = slot->base_gfn + gfn_offset; + phys_addr_t start = (base_gfn + __ffs(mask)) << PAGE_SHIFT; + phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT; + + while (start < end) { + pte_t *ptep; + unsigned int shift; + + ptep = find_kvm_secondary_pte(kvm, start, &shift); + + *ptep = __pte(pte_val(*ptep) & ~(_PAGE_WRITE)); On rpt I think you'd need to use kvmppc_radix_update_pte()? + + start += PAGE_SIZE > + } +} + +#ifdef CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP +bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm) +{ + return true; +} +#endif + +#ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT +void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm, + const struct kvm_memory_slot *memslot) +{ + kvm_flush_remote_tlbs(kvm); +} +#endif diff --git a/arch/powerpc
Re: [PATCH] powerpc/mm/book3s64/hash/4k: Add pmd_same callback for 4K page size
On Thu, 6 Jul 2023, Aneesh Kumar K.V wrote: > With commit 0d940a9b270b ("mm/pgtable: allow pte_offset_map[_lock]() to > fail") the kernel is now using pmd_same to compare pmd values that are > pointing to a level 4 page table page. Move the functions out of #ifdef > CONFIG_TRANSPARENT_HUGEPAGE and add a variant that can work with both 4K > and 64K page size. > > kernel BUG at arch/powerpc/include/asm/book3s/64/hash-4k.h:141! > Oops: Exception in kernel mode, sig: 5 [#1] > LE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries > . > NIP [c048aee0] __pte_offset_map_lock+0xf0/0x164 > LR [c048ae78] __pte_offset_map_lock+0x88/0x164 > Call Trace: > 0xc0003f09a340 (unreliable) > __handle_mm_fault+0x1340/0x1980 > handle_mm_fault+0xbc/0x380 > __get_user_pages+0x320/0x550 > get_user_pages_remote+0x13c/0x520 > get_arg_page+0x80/0x1d0 > copy_string_kernel+0xc8/0x250 > kernel_execve+0x11c/0x270 > run_init_process+0xe4/0x10c > kernel_init+0xbc/0x1a0 > ret_from_kernel_user_thread+0x14/0x1c > > Cc: Hugh Dickins > Reported-by: Michael Ellerman > Signed-off-by: Aneesh Kumar K.V Acked-by: Hugh Dickins Thanks for rescuing us so quickly! 
> --- > arch/powerpc/include/asm/book3s/64/hash-4k.h | 6 -- > arch/powerpc/include/asm/book3s/64/hash-64k.h | 5 - > arch/powerpc/include/asm/book3s/64/hash.h | 5 + > 3 files changed, 5 insertions(+), 11 deletions(-) > > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h > b/arch/powerpc/include/asm/book3s/64/hash-4k.h > index b6ac4f86c87b..6472b08fa1b0 100644 > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h > @@ -136,12 +136,6 @@ static inline int hash__pmd_trans_huge(pmd_t pmd) > return 0; > } > > -static inline int hash__pmd_same(pmd_t pmd_a, pmd_t pmd_b) > -{ > - BUG(); > - return 0; > -} > - > static inline pmd_t hash__pmd_mkhuge(pmd_t pmd) > { > BUG(); > diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h > b/arch/powerpc/include/asm/book3s/64/hash-64k.h > index 338e62fbea0b..0bf6fd0bf42a 100644 > --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h > +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h > @@ -263,11 +263,6 @@ static inline int hash__pmd_trans_huge(pmd_t pmd) > (_PAGE_PTE | H_PAGE_THP_HUGE)); > } > > -static inline int hash__pmd_same(pmd_t pmd_a, pmd_t pmd_b) > -{ > - return (((pmd_raw(pmd_a) ^ pmd_raw(pmd_b)) & > ~cpu_to_be64(_PAGE_HPTEFLAGS)) == 0); > -} > - > static inline pmd_t hash__pmd_mkhuge(pmd_t pmd) > { > return __pmd(pmd_val(pmd) | (_PAGE_PTE | H_PAGE_THP_HUGE)); > diff --git a/arch/powerpc/include/asm/book3s/64/hash.h > b/arch/powerpc/include/asm/book3s/64/hash.h > index 17e7a778c856..d4a19e6547ac 100644 > --- a/arch/powerpc/include/asm/book3s/64/hash.h > +++ b/arch/powerpc/include/asm/book3s/64/hash.h > @@ -132,6 +132,11 @@ static inline int get_region_id(unsigned long ea) > return region_id; > } > > +static inline int hash__pmd_same(pmd_t pmd_a, pmd_t pmd_b) > +{ > + return (((pmd_raw(pmd_a) ^ pmd_raw(pmd_b)) & > ~cpu_to_be64(_PAGE_HPTEFLAGS)) == 0); > +} > + > #define hash__pmd_bad(pmd) (pmd_val(pmd) & H_PMD_BAD_BITS) > #define hash__pud_bad(pud) (pud_val(pud) & 
H_PUD_BAD_BITS) > static inline int hash__p4d_bad(p4d_t p4d) > -- > 2.41.0 > >
[PATCH] powerpc/mm/book3s64/hash/4k: Add pmd_same callback for 4K page size
With commit 0d940a9b270b ("mm/pgtable: allow pte_offset_map[_lock]() to fail") the kernel is now using pmd_same to compare pmd values that are pointing to a level 4 page table page. Move the functions out of #ifdef CONFIG_TRANSPARENT_HUGEPAGE and add a variant that can work with both 4K and 64K page size. kernel BUG at arch/powerpc/include/asm/book3s/64/hash-4k.h:141! Oops: Exception in kernel mode, sig: 5 [#1] LE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries . NIP [c048aee0] __pte_offset_map_lock+0xf0/0x164 LR [c048ae78] __pte_offset_map_lock+0x88/0x164 Call Trace: 0xc0003f09a340 (unreliable) __handle_mm_fault+0x1340/0x1980 handle_mm_fault+0xbc/0x380 __get_user_pages+0x320/0x550 get_user_pages_remote+0x13c/0x520 get_arg_page+0x80/0x1d0 copy_string_kernel+0xc8/0x250 kernel_execve+0x11c/0x270 run_init_process+0xe4/0x10c kernel_init+0xbc/0x1a0 ret_from_kernel_user_thread+0x14/0x1c Cc: Hugh Dickins Reported-by: Michael Ellerman Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/hash-4k.h | 6 -- arch/powerpc/include/asm/book3s/64/hash-64k.h | 5 - arch/powerpc/include/asm/book3s/64/hash.h | 5 + 3 files changed, 5 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h index b6ac4f86c87b..6472b08fa1b0 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h @@ -136,12 +136,6 @@ static inline int hash__pmd_trans_huge(pmd_t pmd) return 0; } -static inline int hash__pmd_same(pmd_t pmd_a, pmd_t pmd_b) -{ - BUG(); - return 0; -} - static inline pmd_t hash__pmd_mkhuge(pmd_t pmd) { BUG(); diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h index 338e62fbea0b..0bf6fd0bf42a 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h @@ -263,11 +263,6 @@ static inline int hash__pmd_trans_huge(pmd_t pmd) (_PAGE_PTE | 
H_PAGE_THP_HUGE)); } -static inline int hash__pmd_same(pmd_t pmd_a, pmd_t pmd_b) -{ - return (((pmd_raw(pmd_a) ^ pmd_raw(pmd_b)) & ~cpu_to_be64(_PAGE_HPTEFLAGS)) == 0); -} - static inline pmd_t hash__pmd_mkhuge(pmd_t pmd) { return __pmd(pmd_val(pmd) | (_PAGE_PTE | H_PAGE_THP_HUGE)); } diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h index 17e7a778c856..d4a19e6547ac 100644 --- a/arch/powerpc/include/asm/book3s/64/hash.h +++ b/arch/powerpc/include/asm/book3s/64/hash.h @@ -132,6 +132,11 @@ static inline int get_region_id(unsigned long ea) return region_id; } +static inline int hash__pmd_same(pmd_t pmd_a, pmd_t pmd_b) +{ + return (((pmd_raw(pmd_a) ^ pmd_raw(pmd_b)) & ~cpu_to_be64(_PAGE_HPTEFLAGS)) == 0); +} + #define hash__pmd_bad(pmd) (pmd_val(pmd) & H_PMD_BAD_BITS) #define hash__pud_bad(pud) (pud_val(pud) & H_PUD_BAD_BITS) static inline int hash__p4d_bad(p4d_t p4d) -- 2.41.0
Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote: > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c > index cf2c6426a6fc..46c681dc45e6 100644 > --- a/arch/x86/kvm/mmu/spte.c > +++ b/arch/x86/kvm/mmu/spte.c > @@ -138,7 +138,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page > *sp, > const struct kvm_memory_slot *slot, > unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn, > u64 old_spte, bool prefetch, bool can_unsync, > -bool host_writable, u64 *new_spte) > +bool host_writable, bool is_refcounted, u64 *new_spte) > { > int level = sp->role.level; > u64 spte = SPTE_MMU_PRESENT_MASK; > @@ -188,6 +188,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page > *sp, > > if (level > PG_LEVEL_4K) > spte |= PT_PAGE_SIZE_MASK; > + else if (is_refcounted) > + spte |= SPTE_MMU_PAGE_REFCOUNTED; Is REFCOUNTED set for 4K pages only? What guarantees that a large page doesn't have FOLL_GET? Or can we set the bit for large pages? > > if (shadow_memtype_mask) > spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn, -- Isaku Yamahata
Re: [PATCH v7 4/8] KVM: x86/mmu: Migrate to __kvm_follow_pfn
On Tue, Jul 04, 2023 at 04:50:49PM +0900, David Stevens wrote: > From: David Stevens > > Migrate from __gfn_to_pfn_memslot to __kvm_follow_pfn. > > Signed-off-by: David Stevens > --- > arch/x86/kvm/mmu/mmu.c | 35 +-- > 1 file changed, 25 insertions(+), 10 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index ec169f5c7dce..e44ab512c3a1 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4296,7 +4296,12 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, > struct kvm_async_pf *work) > static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault > *fault) > { > struct kvm_memory_slot *slot = fault->slot; > - bool async; > + struct kvm_follow_pfn foll = { > + .slot = slot, > + .gfn = fault->gfn, > + .flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0), > + .allow_write_mapping = true, > + }; > > /* >* Retry the page fault if the gfn hit a memslot that is being deleted > @@ -4325,12 +4330,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, > struct kvm_page_fault *fault > return RET_PF_EMULATE; > } > > - async = false; > - fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, > &async, > - fault->write, &fault->map_writable, > - &fault->hva); > - if (!async) > - return RET_PF_CONTINUE; /* *pfn has correct page already */ > + foll.flags |= FOLL_NOWAIT; > + fault->pfn = __kvm_follow_pfn(&foll); > + > + if (!is_error_noslot_pfn(fault->pfn)) We have pfn in struct kvm_follow_pfn as output. Can we make __kvm_follow_pfn() return int instead of kvm_pfn_t? KVM_PFN_* seems widely used, though. > + goto success; > + > + if (fault->pfn != KVM_PFN_ERR_NEEDS_IO) > + return RET_PF_CONTINUE; > > if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) { > trace_kvm_try_async_get_page(fault->addr, fault->gfn); > @@ -4348,9 +4355,17 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, > struct kvm_page_fault *fault >* to wait for IO. 
Note, gup always bails if it is unable to quickly >* get a page and a fatal signal, i.e. SIGKILL, is pending. >*/ > - fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL, > - fault->write, &fault->map_writable, > - &fault->hva); > + foll.flags |= FOLL_INTERRUPTIBLE; > + foll.flags &= ~FOLL_NOWAIT; > + fault->pfn = __kvm_follow_pfn(&foll); > + > + if (!is_error_noslot_pfn(fault->pfn)) > + goto success; > + > + return RET_PF_CONTINUE; > +success: > + fault->hva = foll.hva; > + fault->map_writable = foll.writable; > return RET_PF_CONTINUE; > } > > -- > 2.41.0.255.g8b1d071c50-goog > -- Isaku Yamahata
Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
On Tue, Jul 04, 2023 at 04:50:47PM +0900, David Stevens wrote: > From: David Stevens > > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot. > __kvm_follow_pfn refactors the old API's arguments into a struct and, > where possible, combines the boolean arguments into a single flags > argument. > > Signed-off-by: David Stevens > --- > include/linux/kvm_host.h | 16 > virt/kvm/kvm_main.c | 171 ++- > virt/kvm/kvm_mm.h| 3 +- > virt/kvm/pfncache.c | 8 +- > 4 files changed, 122 insertions(+), 76 deletions(-) > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 9d3ac7720da9..ef2763c2b12e 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -97,6 +97,7 @@ > #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1) > #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2) > #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3) > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4) > > /* > * error pfns indicate that the gfn is in slot but faild to > @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct > kvm_memory_slot *slot, gfn_t gfn, > void kvm_release_page_clean(struct page *page); > void kvm_release_page_dirty(struct page *page); > > +struct kvm_follow_pfn { > + const struct kvm_memory_slot *slot; > + gfn_t gfn; > + unsigned int flags; > + bool atomic; > + /* Allow a read fault to create a writeable mapping. */ > + bool allow_write_mapping; Maybe, make them const for input arguments? 
> + > + /* Outputs of __kvm_follow_pfn */ > + hva_t hva; > + bool writable; > +}; > + > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll); > + > kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn); > kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault, > bool *writable); > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 371bd783ff2b..b13f22861d2f 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned > long addr) > * true indicates success, otherwise false is returned. It's also the > * only part that runs if we can in atomic context. > */ > -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault, > - bool *writable, kvm_pfn_t *pfn) > +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > { > struct page *page[1]; > + bool write_fault = foll->flags & FOLL_WRITE; > > /* >* Fast pin a writable pfn only if it is a write fault request >* or the caller allows to map a writable pfn for a read fault >* request. >*/ > - if (!(write_fault || writable)) > + if (!(write_fault || foll->allow_write_mapping)) > return false; > > - if (get_user_page_fast_only(addr, FOLL_WRITE, page)) { > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) { > *pfn = page_to_pfn(page[0]); > - > - if (writable) > - *writable = true; > + foll->writable = foll->allow_write_mapping; > return true; > } > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool > write_fault, > * The slow path to get the pfn of the specified host virtual address, > * 1 indicates success, -errno is returned if error is detected. 
> */ > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault, > -bool interruptible, bool *writable, kvm_pfn_t *pfn) > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > { > - unsigned int flags = FOLL_HWPOISON; > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags; Although adding FOLL_GET doesn't affect the behavior of get_user_pages_unlocked(), I wondered how this affects the next change. It's better to mention it in the commit message: get_user_pages_*() called by hva_to_pfn_{fast, slow} imply FOLL_GET, but __kvm_follow_pfn() doesn't imply FOLL_GET. > struct page *page; > int npages; > > might_sleep(); > > - if (writable) > - *writable = write_fault; > - > - if (write_fault) > - flags |= FOLL_WRITE; > - if (async) > - flags |= FOLL_NOWAIT; > - if (interruptible) > - flags |= FOLL_INTERRUPTIBLE; > - > - npages = get_user_pages_unlocked(addr, 1, &page, flags); > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags); > if (npages != 1) > return npages; > > + foll->writable = (foll->flags & FOLL_WRITE) && > foll->allow_write_mapping; > + > /* map read fault as writable if possible */ > - if (unlikely(!write_fault) && writable) { > + if (unlikely(!foll->writable) && foll->allow_write_mapping) { > struct page *wpage;
Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page
On Wed, 5 Jul 2023, Gerald Schaefer wrote: > On Tue, 4 Jul 2023 10:03:57 -0700 (PDT) > Hugh Dickins wrote: > > On Tue, 4 Jul 2023, Gerald Schaefer wrote: > > > On Sat, 1 Jul 2023 21:32:38 -0700 (PDT) > > > Hugh Dickins wrote: > > > > On Thu, 29 Jun 2023, Hugh Dickins wrote: > > ... > > > > --- a/arch/s390/mm/pgalloc.c > > > > +++ b/arch/s390/mm/pgalloc.c > > > > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page) > > > > * logic described above. Both AA bits are set to 1 to denote a > > > > 4KB-pgtable > > > > * while the PP bits are never used, nor such a page is added to or > > > > removed > > > > * from mm_context_t::pgtable_list. > > > > + * > > > > + * pte_free_defer() overrides those rules: it takes the page off > > > > pgtable_list, > > > > + * and prevents both 2K fragments from being reused. pte_free_defer() > > > > has to > > > > + * guarantee that its pgtable cannot be reused before the RCU grace > > > > period > > > > + * has elapsed (which page_table_free_rcu() does not actually > > > > guarantee). > > > > > > Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not > > > allow reuse before grace period elapsed. And I hope that it does so, by > > > setting the PP bits, which would be noticed in page_table_alloc(), in > > > case the page would be seen there. > > > > > > Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the > > > end of the list, and so they could be seen in page_table_alloc(), but they > > > should not be reused before grace period elapsed and __tlb_remove_table() > > > cleared the PP bits, as far as I understand. > > > > > > So what exactly do you mean with "which page_table_free_rcu() does not > > > actually > > > guarantee"? > > > > I'll answer without locating and re-reading what Jason explained earlier, > > perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table(): > > he may have explained it better. 
And without working out again all the > > MMU_GATHER #defines, and which of them do and do not apply to s390 here. > > > > The detail that sticks in my mind is the fallback in tlb_remove_table() > > Ah ok, I was aware of that "semi-RCU" fallback logic in tlb_remove_table(), > but that is rather a generic issue, and not s390-specific. Yes. > I thought you > meant some s390-oddity here, of which we have a lot, unfortunately... > Of course, we call tlb_remove_table() from our page_table_free_rcu(), so > I guess you could say that page_table_free_rcu() cannot guarantee what > tlb_remove_table() cannot guarantee. > > Maybe change to "which page_table_free_rcu() does not actually guarantee, > by calling tlb_remove_table()", to make it clear that this is not a problem > of page_table_free_rcu() itself. Okay - I'll rephrase slightly to avoid being sued by s390's lawyers :-) > > > in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot > > batch the tables for freeing by RCU, and resorts instead to an immediate > > TLB flush (I think: that again involves chasing definitions) followed by > > tlb_remove_table_sync_one() - which just delivers an interrupt to each CPU, > > and is commented: > > /* > > * This isn't an RCU grace period and hence the page-tables cannot be > > * assumed to be actually RCU-freed. > > * > > * It is however sufficient for software page-table walkers that rely on > > * IRQ disabling. > > */ > > > > Whether that's good for your PP pages or not, I've given no thought: > > I've just taken it on trust that what s390 has working today is good. > > Yes, we should be fine with that, current code can be trusted :-) Glad to hear it :-) Yes, I think it's not actually relying on the "rcu" implied by the function name. > > > > > If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(), > > then I would not have written "(which page_table_free_rcu() does not > > actually guarantee)". 
But it cannot use call_rcu() because it does > > not have an rcu_head to work with - it's in some generic code, and > > there is no MMU_GATHER_CAN_USE_PAGE_RCU_HEAD for architectures to set. > > > > And Jason would have much preferred us to address the issue from that > > angle; but not only would doing so destroy my sanity, I'd also destroy > > 20 architectures TLB-flushing, unbuilt and untested, in the attempt. > > Oh yes, if your changes would have allowed to get rid of that "semi RCU" > logic, that would really be a major boost in popularity, I guess. But > it probably is as it is, because it is not so easily fixed... I'm hoping that this series might help stir someone else to get into that. > > > > > ... > > > > @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, > > > > unsigned long *table) > > > > */ > > > > mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit > > > > + 24)); > > > > mask >>= 24; > > > > - if (mask & 0x03U) > > > > + if ((mask & 0
[PATCH] powerpc/powermac: Use early_* IO variants in via_calibrate_decr
On a powermac platform, under the call path start_kernel time_init ppc_md.calibrate_decr (pmac_calibrate_decr) via_calibrate_decr we run ioremap and iounmap. The unmap can enable interrupts unexpectedly (cond_resched in vunmap_pmd_range), which is warned about later in the boot sequence in start_kernel. Use the early_* variants of these IO functions to prevent this. The issue is pre-existing, but is surfaced by commit 721255b9826b ("genirq: Use a maple tree for interrupt descriptor management"). It's not clear to me why this causes it to surface. Signed-off-by: Benjamin Gray --- arch/powerpc/platforms/powermac/time.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/platforms/powermac/time.c b/arch/powerpc/platforms/powermac/time.c index 4c5790aff1b5..8633891b7aa5 100644 --- a/arch/powerpc/platforms/powermac/time.c +++ b/arch/powerpc/platforms/powermac/time.c @@ -26,8 +26,8 @@ #include #include +#include #include -#include #include #include #include @@ -182,7 +182,7 @@ static int __init via_calibrate_decr(void) return 0; } of_node_put(vias); - via = ioremap(rsrc.start, resource_size(&rsrc)); + via = early_ioremap(rsrc.start, resource_size(&rsrc)); if (via == NULL) { printk(KERN_ERR "Failed to map VIA for timer calibration !\n"); return 0; @@ -207,7 +207,7 @@ static int __init via_calibrate_decr(void) ppc_tb_freq = (dstart - dend) * 100 / 6; - iounmap(via); + early_iounmap((void *)via, resource_size(&rsrc)); return 1; } -- 2.41.0
Re: [PATCH v2 08/92] fs: new helper: simple_rename_timestamp
On Thu, 2023-07-06 at 08:19 +0900, Damien Le Moal wrote: > On 7/6/23 03:58, Jeff Layton wrote: > > A rename potentially involves updating 4 different inode timestamps. Add > > a function that handles the details sanely, and convert the libfs.c > > callers to use it. > > > > Signed-off-by: Jeff Layton > > --- > > fs/libfs.c | 36 +++- > > include/linux/fs.h | 2 ++ > > 2 files changed, 29 insertions(+), 9 deletions(-) > > > > diff --git a/fs/libfs.c b/fs/libfs.c > > index a7e56baf8bbd..9ee79668c909 100644 > > --- a/fs/libfs.c > > +++ b/fs/libfs.c > > @@ -692,6 +692,31 @@ int simple_rmdir(struct inode *dir, struct dentry > > *dentry) > > } > > EXPORT_SYMBOL(simple_rmdir); > > > > +/** > > + * simple_rename_timestamp - update the various inode timestamps for rename > > + * @old_dir: old parent directory > > + * @old_dentry: dentry that is being renamed > > + * @new_dir: new parent directory > > + * @new_dentry: target for rename > > + * > > + * POSIX mandates that the old and new parent directories have their ctime > > and > > + * mtime updated, and that inodes of @old_dentry and @new_dentry (if any), > > have > > + * their ctime updated. 
> > + */ > > +void simple_rename_timestamp(struct inode *old_dir, struct dentry > > *old_dentry, > > +struct inode *new_dir, struct dentry *new_dentry) > > +{ > > + struct inode *newino = d_inode(new_dentry); > > + > > + old_dir->i_mtime = inode_set_ctime_current(old_dir); > > + if (new_dir != old_dir) > > + new_dir->i_mtime = inode_set_ctime_current(new_dir); > > + inode_set_ctime_current(d_inode(old_dentry)); > > + if (newino) > > + inode_set_ctime_current(newino); > > +} > > +EXPORT_SYMBOL_GPL(simple_rename_timestamp); > > + > > int simple_rename_exchange(struct inode *old_dir, struct dentry > > *old_dentry, > >struct inode *new_dir, struct dentry *new_dentry) > > { > > @@ -707,11 +732,7 @@ int simple_rename_exchange(struct inode *old_dir, > > struct dentry *old_dentry, > > inc_nlink(old_dir); > > } > > } > > - old_dir->i_ctime = old_dir->i_mtime = > > - new_dir->i_ctime = new_dir->i_mtime = > > - d_inode(old_dentry)->i_ctime = > > - d_inode(new_dentry)->i_ctime = current_time(old_dir); > > - > > + simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry); > > This is somewhat changing the current behavior: before the patch, the mtime > and > ctime of old_dir, new_dir and the inodes associated with the dentries are > always > equal. But given that simple_rename_timestamp() calls > inode_set_ctime_current() > 4 times, the times could potentially be different. > > I am not sure if that is an issue, but it seems that calling > inode_set_ctime_current() once, recording the "now" time it sets and using > that > value to set all times may be more efficient and preserve the existing > behavior. > I don't believe it's an issue. I've seen nothing in the POSIX spec that mandates that timestamp updates to different inodes involved in an operation be set to the _same_ value. It just says they must be updated. It's also hard to believe that any software would depend on this either, given that it's very inconsistent across filesystems today. 
AFAICT, this was mostly done in the past just as a matter of convenience. The other problem with doing it that way is that it assumes that current_time(inode) should always return the same value when given different inodes. Is it really correct to do this? inode_set_ctime(dir, inode_set_ctime_current(inode)); "dir" and "inode" are different inodes, after all, and you're setting dir's timestamp to "inode"'s value. It's not a big deal today since they're always on the same sb, but the ultimate goal of these changes is to implement multigrain timestamps. That will mean fetching a fine-grained timestamp for an update when the existing mtime or ctime value has been queried via getattr. With that change, I think it's best that we treat updates to different inodes individually, as some of them may require updating with a fine-grained timestamp and some may not. > > return 0; > > } > > EXPORT_SYMBOL_GPL(simple_rename_exchange); > > @@ -720,7 +741,6 @@ int simple_rename(struct mnt_idmap *idmap, struct inode > > *old_dir, > > struct dentry *old_dentry, struct inode *new_dir, > > struct dentry *new_dentry, unsigned int flags) > > { > > - struct inode *inode = d_inode(old_dentry); > > int they_are_dirs = d_is_dir(old_dentry); > > > > if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE)) > > @@ -743,9 +763,7 @@ int simple_rename(struct mnt_idmap *idmap, struct inode > > *old_dir, > > inc_nlink(new_dir); > > } > > > > - old_dir->i_ctime = old_dir->i_mtime = new_dir->i_ctime = > > - new_dir->i_mtime = inode->i_ctime = current_time(old_dir); > > - > > + simple_rename_times
Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page
On Wed, 5 Jul 2023, Alexander Gordeev wrote: > On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote: > > On Thu, 29 Jun 2023, Hugh Dickins wrote: > > Hi Hugh, > > ... > > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable) > > +{ > > + struct page *page; > > If I got your and Claudio conversation right, you were going to add > here WARN_ON_ONCE() in case of mm_alloc_pgste(mm)? Well, Claudio approved, so I would have put it in, if we had stuck with that version which had "if (mm_alloc_pgste(mm)) {" in pte_free_defer(); but once that went away, it became somewhat irrelevant... to me anyway. But I don't mind adding it here, in the v3 I'll post when -rc1 is out, if it might help you guys - there is some point, since pte_free_defer() is a route which can usefully check for such a case, without confusion from harmless traffic from immediate frees of just-in-case allocations. But don't expect it to catch all such cases (if they exist): another category of s390 page_table_free()s comes from the PageAnon zap_deposited_table() in zap_huge_pmd(): those tables might or might not have been exposed to userspace at some time in the past. I'll add the WARN_ON_ONCE in pte_free_defer() (after checking that WARN_ON_ONCE is the one we want - I get confused by all the different flavours of WARN, and have to check the header file each time to be sure of the syntax and semantics): but be aware that it won't be checking all potential cases. Hugh > > > + page = virt_to_page(pgtable); > > + SetPageActive(page); > > + page_table_free(mm, (unsigned long *)pgtable); > > +} > > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > + > > /* > > * Base infrastructure required to generate basic asces, region, segment, > > * and page tables that do not make use of enhanced features like EDAT1. > > Thanks!
Re: [PATCH v2 92/92] fs: rename i_ctime field to __i_ctime
On 7/6/23 03:58, Jeff Layton wrote: > Now that everything in-tree is converted to use the accessor functions, > rename the i_ctime field in the inode to discourage direct access. > > Signed-off-by: Jeff Layton Looks OK to me. Reviewed-by: Damien Le Moal -- Damien Le Moal Western Digital Research
Re: [PATCH v2 08/92] fs: new helper: simple_rename_timestamp
On 7/6/23 03:58, Jeff Layton wrote: > A rename potentially involves updating 4 different inode timestamps. Add > a function that handles the details sanely, and convert the libfs.c > callers to use it. > > Signed-off-by: Jeff Layton > --- > fs/libfs.c | 36 +++- > include/linux/fs.h | 2 ++ > 2 files changed, 29 insertions(+), 9 deletions(-) > > diff --git a/fs/libfs.c b/fs/libfs.c > index a7e56baf8bbd..9ee79668c909 100644 > --- a/fs/libfs.c > +++ b/fs/libfs.c > @@ -692,6 +692,31 @@ int simple_rmdir(struct inode *dir, struct dentry > *dentry) > } > EXPORT_SYMBOL(simple_rmdir); > > +/** > + * simple_rename_timestamp - update the various inode timestamps for rename > + * @old_dir: old parent directory > + * @old_dentry: dentry that is being renamed > + * @new_dir: new parent directory > + * @new_dentry: target for rename > + * > + * POSIX mandates that the old and new parent directories have their ctime > and > + * mtime updated, and that inodes of @old_dentry and @new_dentry (if any), > have > + * their ctime updated. 
> + */ > +void simple_rename_timestamp(struct inode *old_dir, struct dentry > *old_dentry, > + struct inode *new_dir, struct dentry *new_dentry) > +{ > + struct inode *newino = d_inode(new_dentry); > + > + old_dir->i_mtime = inode_set_ctime_current(old_dir); > + if (new_dir != old_dir) > + new_dir->i_mtime = inode_set_ctime_current(new_dir); > + inode_set_ctime_current(d_inode(old_dentry)); > + if (newino) > + inode_set_ctime_current(newino); > +} > +EXPORT_SYMBOL_GPL(simple_rename_timestamp); > + > int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry, > struct inode *new_dir, struct dentry *new_dentry) > { > @@ -707,11 +732,7 @@ int simple_rename_exchange(struct inode *old_dir, struct > dentry *old_dentry, > inc_nlink(old_dir); > } > } > - old_dir->i_ctime = old_dir->i_mtime = > - new_dir->i_ctime = new_dir->i_mtime = > - d_inode(old_dentry)->i_ctime = > - d_inode(new_dentry)->i_ctime = current_time(old_dir); > - > + simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry); This is somewhat changing the current behavior: before the patch, the mtime and ctime of old_dir, new_dir and the inodes associated with the dentries are always equal. But given that simple_rename_timestamp() calls inode_set_ctime_current() 4 times, the times could potentially be different. I am not sure if that is an issue, but it seems that calling inode_set_ctime_current() once, recording the "now" time it sets and using that value to set all times may be more efficient and preserve the existing behavior. 
> return 0; > } > EXPORT_SYMBOL_GPL(simple_rename_exchange); > @@ -720,7 +741,6 @@ int simple_rename(struct mnt_idmap *idmap, struct inode > *old_dir, > struct dentry *old_dentry, struct inode *new_dir, > struct dentry *new_dentry, unsigned int flags) > { > - struct inode *inode = d_inode(old_dentry); > int they_are_dirs = d_is_dir(old_dentry); > > if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE)) > @@ -743,9 +763,7 @@ int simple_rename(struct mnt_idmap *idmap, struct inode > *old_dir, > inc_nlink(new_dir); > } > > - old_dir->i_ctime = old_dir->i_mtime = new_dir->i_ctime = > - new_dir->i_mtime = inode->i_ctime = current_time(old_dir); > - > + simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry); > return 0; > } > EXPORT_SYMBOL(simple_rename); > diff --git a/include/linux/fs.h b/include/linux/fs.h > index bdfbd11a5811..14e38bd900f1 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -2979,6 +2979,8 @@ extern int simple_open(struct inode *inode, struct file > *file); > extern int simple_link(struct dentry *, struct inode *, struct dentry *); > extern int simple_unlink(struct inode *, struct dentry *); > extern int simple_rmdir(struct inode *, struct dentry *); > +void simple_rename_timestamp(struct inode *old_dir, struct dentry > *old_dentry, > + struct inode *new_dir, struct dentry *new_dentry); > extern int simple_rename_exchange(struct inode *old_dir, struct dentry > *old_dentry, > struct inode *new_dir, struct dentry > *new_dentry); > extern int simple_rename(struct mnt_idmap *, struct inode *, -- Damien Le Moal Western Digital Research
Re: [PATCH v2 07/92] fs: add ctime accessors infrastructure
On 7/6/23 03:58, Jeff Layton wrote: > struct timespec64 has unused bits in the tv_nsec field that can be used > for other purposes. In future patches, we're going to change how the > inode->i_ctime is accessed in certain inodes in order to make use of > them. In order to do that safely though, we'll need to eradicate raw > accesses of the inode->i_ctime field from the kernel. > > Add new accessor functions for the ctime that we use to replace them. > > Reviewed-by: Jan Kara > Reviewed-by: Luis Chamberlain > Signed-off-by: Jeff Layton Looks OK to me. Reviewed-by: Damien Le Moal -- Damien Le Moal Western Digital Research
Re: [PATCH v2 00/89] fs: new accessors for inode->i_ctime
On Wed, 2023-07-05 at 14:58 -0400, Jeff Layton wrote: > v2: > - prepend patches to add missing ctime updates > - add simple_rename_timestamp helper function > - rename ctime accessor functions as inode_get_ctime/inode_set_ctime_* > - drop individual inode_ctime_set_{sec,nsec} helpers > > I've been working on a patchset to change how the inode->i_ctime is > accessed in order to give us conditional, high-res timestamps for the > ctime and mtime. struct timespec64 has unused bits in it that we can use > to implement this. In order to do that however, we need to wrap all > accesses of inode->i_ctime to ensure that bits used as flags are > appropriately handled. > > The patchset starts with reposts of some missing ctime updates that I > spotted in the tree. It then adds a new helper function for updating the > timestamp after a successful rename, and new ctime accessor > infrastructure. > > The bulk of the patchset is individual conversions of different > subsystems to use the new infrastructure. Finally, the patchset renames > the i_ctime field to __i_ctime to help ensure that I didn't miss > anything. > > This should apply cleanly to linux-next as of this morning. > > Most of this conversion was done via 5 different coccinelle scripts, run > in succession, with a large swath of by-hand conversions to clean up the > remainder. > A couple of other things I should note: If you sent me an Acked-by or Reviewed-by in the previous set, then I tried to keep it on the patch here, since the respun patches are mostly just renaming stuff from v1. Let me know if I've missed any. I've also pushed the pile to my tree as this tag: https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/tag/?h=ctime.20230705 In case that's easier to work with. Cheers, -- Jeff Layton
Re: [PATCH v4 12/13] s390/kexec: refactor for kernel/Kconfig.kexec
On 7/5/23 11:23, Eric DeVolder wrote: On 7/5/23 10:49, Nathan Chancellor wrote: Hi Eric, On Wed, Jul 05, 2023 at 10:20:03AM -0400, Eric DeVolder wrote: The kexec and crash kernel options are provided in the common kernel/Kconfig.kexec. Utilize the common options and provide the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the equivalent set of KEXEC and CRASH options. NOTE: The original Kconfig has a KEXEC_SIG which depends on MODULE_SIG_FORMAT. However, attempts to keep the MODULE_SIG_FORMAT dependency (using the strategy outlined in this series, and other techniques) results in 'error: recursive dependency detected' on CRYPTO. Per Alexander Gordeev : "the MODULE_SIG_FORMAT dependency was introduced with [git commit below] and in fact was not necessary, since s390 did/does not use mod_check_sig() anyway. commit c8424e776b09 ("MODSIGN: Export module signature definitions") MODULE_SIG_FORMAT is needed to select SYSTEM_DATA_VERIFICATION. But SYSTEM_DATA_VERIFICATION is also selected by FS_VERITY*, so dropping MODULE_SIG_FORMAT does not hurt." Therefore, the solution is to drop the MODULE_SIG_FORMAT dependency from KEXEC_SIG. Still results in equivalent .config files for s390. Signed-off-by: Eric DeVolder Acked-by: Alexander Gordeev --- arch/s390/Kconfig | 65 ++- 1 file changed, 19 insertions(+), 46 deletions(-) diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 5b39918b7042..5d4fbbfdd1cd 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -244,6 +244,25 @@ config PGTABLE_LEVELS source "kernel/livepatch/Kconfig" +config ARCH_DEFAULT_KEXEC + def_bool y + +config ARCH_SUPPORTS_KEXEC + def_bool y + +config ARCH_SUPPORTS_KEXEC_FILE + def_bool CRYPTO && CRYPTO_SHA256 && CRYPTO_SHA256_S390 + +config ARCH_HAS_KEXEC_PURGATORY + def_bool KEXEC_FILE + +config ARCH_SUPPORTS_CRASH_DUMP + def_bool y + help + Refer to for more details on this. + This option also enables s390 zfcpdump. 
+ See also + menu "Processor type and features" config HAVE_MARCH_Z10_FEATURES @@ -482,36 +501,6 @@ config SCHED_TOPOLOGY source "kernel/Kconfig.hz" -config KEXEC - def_bool y - select KEXEC_CORE - -config KEXEC_FILE - bool "kexec file based system call" - select KEXEC_CORE - depends on CRYPTO - depends on CRYPTO_SHA256 - depends on CRYPTO_SHA256_S390 - help - Enable the kexec file based system call. In contrast to the normal - kexec system call this system call takes file descriptors for the - kernel and initramfs as arguments. - -config ARCH_HAS_KEXEC_PURGATORY - def_bool y - depends on KEXEC_FILE - -config KEXEC_SIG - bool "Verify kernel signature during kexec_file_load() syscall" - depends on KEXEC_FILE && MODULE_SIG_FORMAT - help - This option makes kernel signature verification mandatory for - the kexec_file_load() syscall. - - In addition to that option, you need to enable signature - verification for the corresponding kernel image type being - loaded in order for this to work. - config KERNEL_NOBP def_bool n prompt "Enable modified branch prediction for the kernel by default" @@ -733,22 +722,6 @@ config VFIO_AP endmenu -menu "Dump support" - -config CRASH_DUMP - bool "kernel crash dumps" - select KEXEC - help - Generate crash dump after being started by kexec. - Crash dump kernels are loaded in the main kernel with kexec-tools - into a specially reserved region and then later executed after - a crash by kdump/kexec. - Refer to for more details on this. - This option also enables s390 zfcpdump. - See also - -endmenu - config CCW def_bool y -- 2.31.1 I just bisected the following build failure visible with 'ARCH=s390 allnoconfig' to this change as commit 842ce0e1dafa ("s390/kexec: refactor for kernel/Kconfig.kexec") in -next. 
arch/s390/kernel/machine_kexec.c:120:37: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration 120 | static bool kdump_csum_valid(struct kimage *image) | ^~ arch/s390/kernel/machine_kexec.c:188:34: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration 188 | int machine_kexec_prepare(struct kimage *image) | ^~ arch/s390/kernel/machine_kexec.c: In function 'machine_kexec_prepare': arch/s390/kernel/machine_kexec.c:192:18: error: invalid use of undefined type 'struct kimage' 192 | if (image->type == KEXEC_TYPE_CRASH) | ^~ arch/s390/kernel/machine_kexec.c:192:28: error: 'KEXEC_TYPE_CRASH' undeclared (first use in this function); did you mean 'KEXEC_ON_CRASH'? 192 | if (image->type == KEXEC_TYPE_CRASH) |
[PATCH v2 92/92] fs: rename i_ctime field to __i_ctime
Now that everything in-tree is converted to use the accessor functions, rename the i_ctime field in the inode to discourage direct access. Signed-off-by: Jeff Layton --- include/linux/fs.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 14e38bd900f1..b66442f91835 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -642,7 +642,7 @@ struct inode { loff_t i_size; struct timespec64 i_atime; struct timespec64 i_mtime; - struct timespec64 i_ctime; + struct timespec64 __i_ctime; /* use inode_*_ctime accessors! */ spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */ unsigned short i_bytes; u8 i_blkbits; @@ -1485,7 +1485,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode); */ static inline struct timespec64 inode_get_ctime(const struct inode *inode) { - return inode->i_ctime; + return inode->__i_ctime; } /** @@ -1498,7 +1498,7 @@ static inline struct timespec64 inode_get_ctime(const struct inode *inode) static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts) { - inode->i_ctime = ts; + inode->__i_ctime = ts; return ts; } -- 2.41.0
[PATCH v2 08/92] fs: new helper: simple_rename_timestamp
A rename potentially involves updating 4 different inode timestamps. Add a function that handles the details sanely, and convert the libfs.c callers to use it. Signed-off-by: Jeff Layton --- fs/libfs.c | 36 +++- include/linux/fs.h | 2 ++ 2 files changed, 29 insertions(+), 9 deletions(-) diff --git a/fs/libfs.c b/fs/libfs.c index a7e56baf8bbd..9ee79668c909 100644 --- a/fs/libfs.c +++ b/fs/libfs.c @@ -692,6 +692,31 @@ int simple_rmdir(struct inode *dir, struct dentry *dentry) } EXPORT_SYMBOL(simple_rmdir); +/** + * simple_rename_timestamp - update the various inode timestamps for rename + * @old_dir: old parent directory + * @old_dentry: dentry that is being renamed + * @new_dir: new parent directory + * @new_dentry: target for rename + * + * POSIX mandates that the old and new parent directories have their ctime and + * mtime updated, and that inodes of @old_dentry and @new_dentry (if any), have + * their ctime updated. + */ +void simple_rename_timestamp(struct inode *old_dir, struct dentry *old_dentry, +struct inode *new_dir, struct dentry *new_dentry) +{ + struct inode *newino = d_inode(new_dentry); + + old_dir->i_mtime = inode_set_ctime_current(old_dir); + if (new_dir != old_dir) + new_dir->i_mtime = inode_set_ctime_current(new_dir); + inode_set_ctime_current(d_inode(old_dentry)); + if (newino) + inode_set_ctime_current(newino); +} +EXPORT_SYMBOL_GPL(simple_rename_timestamp); + int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry) { @@ -707,11 +732,7 @@ int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry, inc_nlink(old_dir); } } - old_dir->i_ctime = old_dir->i_mtime = - new_dir->i_ctime = new_dir->i_mtime = - d_inode(old_dentry)->i_ctime = - d_inode(new_dentry)->i_ctime = current_time(old_dir); - + simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry); return 0; } EXPORT_SYMBOL_GPL(simple_rename_exchange); @@ -720,7 +741,6 @@ int simple_rename(struct 
mnt_idmap *idmap, struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry, unsigned int flags) { - struct inode *inode = d_inode(old_dentry); int they_are_dirs = d_is_dir(old_dentry); if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE)) @@ -743,9 +763,7 @@ int simple_rename(struct mnt_idmap *idmap, struct inode *old_dir, inc_nlink(new_dir); } - old_dir->i_ctime = old_dir->i_mtime = new_dir->i_ctime = - new_dir->i_mtime = inode->i_ctime = current_time(old_dir); - + simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry); return 0; } EXPORT_SYMBOL(simple_rename); diff --git a/include/linux/fs.h b/include/linux/fs.h index bdfbd11a5811..14e38bd900f1 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2979,6 +2979,8 @@ extern int simple_open(struct inode *inode, struct file *file); extern int simple_link(struct dentry *, struct inode *, struct dentry *); extern int simple_unlink(struct inode *, struct dentry *); extern int simple_rmdir(struct inode *, struct dentry *); +void simple_rename_timestamp(struct inode *old_dir, struct dentry *old_dentry, +struct inode *new_dir, struct dentry *new_dentry); extern int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry); extern int simple_rename(struct mnt_idmap *, struct inode *, -- 2.41.0
[PATCH v2 07/92] fs: add ctime accessors infrastructure
struct timespec64 has unused bits in the tv_nsec field that can be used for other purposes. In future patches, we're going to change how the inode->i_ctime is accessed in certain inodes in order to make use of them. In order to do that safely though, we'll need to eradicate raw accesses of the inode->i_ctime field from the kernel. Add new accessor functions for the ctime that we use to replace them. Reviewed-by: Jan Kara Reviewed-by: Luis Chamberlain Signed-off-by: Jeff Layton --- fs/inode.c | 16 include/linux/fs.h | 45 - 2 files changed, 60 insertions(+), 1 deletion(-) diff --git a/fs/inode.c b/fs/inode.c index d37fad91c8da..21b026d95b51 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -2499,6 +2499,22 @@ struct timespec64 current_time(struct inode *inode) } EXPORT_SYMBOL(current_time); +/** + * inode_set_ctime_current - set the ctime to current_time + * @inode: inode + * + * Set the inode->i_ctime to the current value for the inode. Returns + * the current value that was assigned to i_ctime. + */ +struct timespec64 inode_set_ctime_current(struct inode *inode) +{ + struct timespec64 now = current_time(inode); + + inode_set_ctime(inode, now.tv_sec, now.tv_nsec); + return now; +} +EXPORT_SYMBOL(inode_set_ctime_current); + /** * in_group_or_capable - check whether caller is CAP_FSETID privileged * @idmap: idmap of the mount @inode was found from diff --git a/include/linux/fs.h b/include/linux/fs.h index 824accb89a91..bdfbd11a5811 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1474,7 +1474,50 @@ static inline bool fsuidgid_has_mapping(struct super_block *sb, kgid_has_mapping(fs_userns, kgid); } -extern struct timespec64 current_time(struct inode *inode); +struct timespec64 current_time(struct inode *inode); +struct timespec64 inode_set_ctime_current(struct inode *inode); + +/** + * inode_get_ctime - fetch the current ctime from the inode + * @inode: inode from which to fetch ctime + * + * Grab the current ctime from the inode and return it. 
+ */ +static inline struct timespec64 inode_get_ctime(const struct inode *inode) +{ + return inode->i_ctime; +} + +/** + * inode_set_ctime_to_ts - set the ctime in the inode + * @inode: inode in which to set the ctime + * @ts: value to set in the ctime field + * + * Set the ctime in @inode to @ts + */ +static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode, + struct timespec64 ts) +{ + inode->i_ctime = ts; + return ts; +} + +/** + * inode_set_ctime - set the ctime in the inode + * @inode: inode in which to set the ctime + * @sec: tv_sec value to set + * @nsec: tv_nsec value to set + * + * Set the ctime in @inode to { @sec, @nsec } + */ +static inline struct timespec64 inode_set_ctime(struct inode *inode, + time64_t sec, long nsec) +{ + struct timespec64 ts = { .tv_sec = sec, +.tv_nsec = nsec }; + + return inode_set_ctime_to_ts(inode, ts); +} /* * Snapshotting support. -- 2.41.0
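The accessor pattern this patch introduces can be illustrated with a self-contained toy model (toy types and names, not the kernel's; `struct timespec` stands in for `timespec64`):

```c
#include <assert.h>
#include <time.h>

/* Toy model of the accessor pattern: the ctime field is "privatized"
 * and only touched through get/set helpers. Once all call sites go
 * through the helpers, the implementation behind them is free to
 * repurpose unused tv_nsec bits as flags. */
struct toy_inode {
	struct timespec __i_ctime; /* use the accessors! */
};

static struct timespec toy_get_ctime(const struct toy_inode *inode)
{
	return inode->__i_ctime;
}

static struct timespec toy_set_ctime(struct toy_inode *inode,
				     time_t sec, long nsec)
{
	struct timespec ts = { .tv_sec = sec, .tv_nsec = nsec };

	inode->__i_ctime = ts;
	return ts;
}
```

With these helpers in place, open-coded pairs like `inode->i_ctime.tv_sec = sec; inode->i_ctime.tv_nsec = 0;` collapse into a single `toy_set_ctime(inode, sec, 0)` call, which is exactly the shape of conversion the cover letter's coccinelle scripts automate.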
[PATCH v2 00/89] fs: new accessors for inode->i_ctime
v2: - prepend patches to add missing ctime updates - add simple_rename_timestamp helper function - rename ctime accessor functions as inode_get_ctime/inode_set_ctime_* - drop individual inode_ctime_set_{sec,nsec} helpers I've been working on a patchset to change how the inode->i_ctime is accessed in order to give us conditional, high-res timestamps for the ctime and mtime. struct timespec64 has unused bits in it that we can use to implement this. In order to do that however, we need to wrap all accesses of inode->i_ctime to ensure that bits used as flags are appropriately handled. The patchset starts with reposts of some missing ctime updates that I spotted in the tree. It then adds a new helper function for updating the timestamp after a successful rename, and new ctime accessor infrastructure. The bulk of the patchset is individual conversions of different subsystems to use the new infrastructure. Finally, the patchset renames the i_ctime field to __i_ctime to help ensure that I didn't miss anything. This should apply cleanly to linux-next as of this morning. Most of this conversion was done via 5 different coccinelle scripts, run in succession, with a large swath of by-hand conversions to clean up the remainder. The coccinelle scripts that were used are below: :: cocci/ctime1.cocci :: // convert as much to use inode_set_ctime_current as possible @@ identifier timei; struct inode *inode; expression E1, E2; @@ ( - inode->i_ctime = E1 = E2 = current_time(timei) + E1 = E2 = inode_set_ctime_current(inode) | - inode->i_ctime = E1 = current_time(timei) + E1 = inode_set_ctime_current(inode) | - E1 = inode->i_ctime = current_time(timei) + E1 = inode_set_ctime_current(inode) | - inode->i_ctime = current_time(timei) + inode_set_ctime_current(inode) ) @@ struct inode *inode; expression E1, E2, E3; @@ ( - E1 = current_time(inode) + E1 = inode_set_ctime_current(inode) | - E1 = current_time(E3) + E1 = inode_set_ctime_current(inode) ) ...
( - inode->i_ctime = E1; | - E2 = inode->i_ctime = E1; + E2 = E1; ) :: cocci/ctime2.cocci :: // get the places that set individual timespec64 fields @@ struct inode *inode; expression val, val2; @@ - inode->i_ctime.tv_sec = val + inode_set_ctime(inode, val, val2) ... - inode->i_ctime.tv_nsec = val2; // get places that just set the tv_sec @@ struct inode *inode; expression sec, E1, E2, E3; @@ ( - E3 = inode->i_ctime.tv_sec = sec + E3 = inode_set_ctime(inode, sec, 0).tv_sec | - inode->i_ctime.tv_sec = sec + inode_set_ctime(inode, sec, 0) ) <... ( - inode->i_ctime.tv_nsec = 0; | - E1 = inode->i_ctime.tv_nsec = 0 + E1 = 0 | - inode->i_ctime.tv_nsec = E1 = 0 + E1 = 0 | - inode->i_ctime.tv_nsec = E1 = E2 = 0 + E1 = E2 = 0 ) ...> :: cocci/ctime3.cocci :: // convert places that set i_ctime to a timespec64 directly @@ struct inode *inode; expression ts, E1, E2; @@ ( - inode->i_ctime = E1 = E2 = ts + E1 = E2 = inode_set_ctime_to_ts(inode, ts) | - inode->i_ctime = E1 = ts + E1 = inode_set_ctime_to_ts(inode, ts) | - inode->i_ctime = ts + inode_set_ctime_to_ts(inode, ts) ) :: cocci/ctime4.cocci :: // catch places that set the i_ctime in an inode embedded in another structure @@ expression E1, E2, E3; @@ ( - E3.i_ctime = E1 = E2 = current_time(&E3) + E1 = E2 = inode_set_ctime_current(&E3) | - E3.i_ctime = E1 = current_time(&E3) + E1 = inode_set_ctime_current(&E3) | - E1 = E3.i_ctime = current_time(&E3) + E1 = inode_set_ctime_current(&E3) | - E3.i_ctime = current_time(&E3) + inode_set_ctime_current(&E3) ) :: cocci/ctime5.cocci :: // convert the remaining i_ctime accesses @@ struct inode *inode; @@ - inode->i_ctime + inode_get_ctime(inode) Jeff Layton (92): ibmvmc: update ctime in conjunction with mtime on write bfs: update ctime in addition to mtime when adding entries efivarfs: update ctime when mtime changes on a write exfat: ensure that ctime is updated whenever the mtime is apparmor: update ctime whenever the mtime changes on an inode cifs: update the ctime on a partial page 
write fs: add ctime accessors infrastructure fs: new helper: simple_rename_timestamp btrfs: convert to simple_rename_timestamp ubifs: convert to simple_rename_timestamp shmem: convert to simple_rename_timestamp exfat: convert to simple_rename_timestamp ntfs3: convert to simple_rename_timestamp reiserfs: convert to simple_rename_timestamp spufs: convert to ctime accessor functions s390: convert to ctime accessor functions binderfs: convert to ctime accessor functions infiniband: convert to ctime accessor functions ibm: convert to ctime accessor functions usb: convert to ctime accessor functions 9p: convert to ctime accessor functions adfs: convert to ctime accessor functions affs: convert to ctime accessor functions afs: convert to ctime accessor functions fs: convert to ctime accessor functions autofs: convert to ctime accessor functions befs
Re: [PATCH v4 12/13] s390/kexec: refactor for kernel/Kconfig.kexec
On 7/5/23 10:49, Nathan Chancellor wrote: Hi Eric, On Wed, Jul 05, 2023 at 10:20:03AM -0400, Eric DeVolder wrote: The kexec and crash kernel options are provided in the common kernel/Kconfig.kexec. Utilize the common options and provide the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the equivalent set of KEXEC and CRASH options. NOTE: The original Kconfig has a KEXEC_SIG which depends on MODULE_SIG_FORMAT. However, attempts to keep the MODULE_SIG_FORMAT dependency (using the strategy outlined in this series, and other techniques) results in 'error: recursive dependency detected' on CRYPTO. Per Alexander Gordeev : "the MODULE_SIG_FORMAT dependency was introduced with [git commit below] and in fact was not necessary, since s390 did/does not use mod_check_sig() anyway. commit c8424e776b09 ("MODSIGN: Export module signature definitions") MODULE_SIG_FORMAT is needed to select SYSTEM_DATA_VERIFICATION. But SYSTEM_DATA_VERIFICATION is also selected by FS_VERITY*, so dropping MODULE_SIG_FORMAT does not hurt." Therefore, the solution is to drop the MODULE_SIG_FORMAT dependency from KEXEC_SIG. Still results in equivalent .config files for s390. Signed-off-by: Eric DeVolder Acked-by: Alexander Gordeev --- arch/s390/Kconfig | 65 ++- 1 file changed, 19 insertions(+), 46 deletions(-) diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 5b39918b7042..5d4fbbfdd1cd 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -244,6 +244,25 @@ config PGTABLE_LEVELS source "kernel/livepatch/Kconfig" +config ARCH_DEFAULT_KEXEC + def_bool y + +config ARCH_SUPPORTS_KEXEC + def_bool y + +config ARCH_SUPPORTS_KEXEC_FILE + def_bool CRYPTO && CRYPTO_SHA256 && CRYPTO_SHA256_S390 + +config ARCH_HAS_KEXEC_PURGATORY + def_bool KEXEC_FILE + +config ARCH_SUPPORTS_CRASH_DUMP + def_bool y + help + Refer to for more details on this. + This option also enables s390 zfcpdump. 
+ See also + menu "Processor type and features" config HAVE_MARCH_Z10_FEATURES @@ -482,36 +501,6 @@ config SCHED_TOPOLOGY source "kernel/Kconfig.hz" -config KEXEC - def_bool y - select KEXEC_CORE - -config KEXEC_FILE - bool "kexec file based system call" - select KEXEC_CORE - depends on CRYPTO - depends on CRYPTO_SHA256 - depends on CRYPTO_SHA256_S390 - help - Enable the kexec file based system call. In contrast to the normal - kexec system call this system call takes file descriptors for the - kernel and initramfs as arguments. - -config ARCH_HAS_KEXEC_PURGATORY - def_bool y - depends on KEXEC_FILE - -config KEXEC_SIG - bool "Verify kernel signature during kexec_file_load() syscall" - depends on KEXEC_FILE && MODULE_SIG_FORMAT - help - This option makes kernel signature verification mandatory for - the kexec_file_load() syscall. - - In addition to that option, you need to enable signature - verification for the corresponding kernel image type being - loaded in order for this to work. - config KERNEL_NOBP def_bool n prompt "Enable modified branch prediction for the kernel by default" @@ -733,22 +722,6 @@ config VFIO_AP endmenu -menu "Dump support" - -config CRASH_DUMP - bool "kernel crash dumps" - select KEXEC - help - Generate crash dump after being started by kexec. - Crash dump kernels are loaded in the main kernel with kexec-tools - into a specially reserved region and then later executed after - a crash by kdump/kexec. - Refer to for more details on this. - This option also enables s390 zfcpdump. - See also - -endmenu - config CCW def_bool y -- 2.31.1 I just bisected the following build failure visible with 'ARCH=s390 allnoconfig' to this change as commit 842ce0e1dafa ("s390/kexec: refactor for kernel/Kconfig.kexec") in -next. 
arch/s390/kernel/machine_kexec.c:120:37: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration 120 | static bool kdump_csum_valid(struct kimage *image) | ^~ arch/s390/kernel/machine_kexec.c:188:34: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration 188 | int machine_kexec_prepare(struct kimage *image) | ^~ arch/s390/kernel/machine_kexec.c: In function 'machine_kexec_prepare': arch/s390/kernel/machine_kexec.c:192:18: error: invalid use of undefined type 'struct kimage' 192 | if (image->type == KEXEC_TYPE_CRASH) | ^~ arch/s390/kernel/machine_kexec.c:192:28: error: 'KEXEC_TYPE_CRASH' undeclared (first use in this function); did you me
Re: [PATCH v4 03/13] arm/kexec: refactor for kernel/Kconfig.kexec
On Wed, Jul 5, 2023, at 17:22, Eric DeVolder wrote: > On 7/5/23 10:05, Arnd Bergmann wrote: >> On Wed, Jul 5, 2023, at 16:19, Eric DeVolder wrote: >> >> I see this is now in linux-next, and it caused a few randconfig >> build issues, these never happened in the past: > > Arnd, > Thanks for looking at this! > > I received randconfig errors from Andrew Morton's machinery. When > investigating I > found that randconfig was able to specify CRASH_DUMP without KEXEC, and > that led > to problems. I believe this situation existed prior to this series as > well. On Arm, there was definitely a bug because one could enable CRASH_DUMP without enabling KEXEC, but that had no effect at all. I only noticed the problem testing linux-next because it turned from silently broken into a build failure > Specifically CRASH_DUMP does not have a dependency on KEXEC, or select > (only s390 > has this hole closed). > > For CRASH_DUMP, this series now selects KEXEC to close this gap, which is > what a > sane config would have (i.e. both CRASH_DUMP and KEXEC). Right, that is the easier way out here. > Do you think the changes outlined below are still needed? I think only the first one still applies on arm, you need to ensure that ARCH_SUPPORTS_CRASH_DUMP has the same dependency as ARCH_SUPPORTS_KEXEC, or it just depends on ARCH_SUPPORTS_KEXEC. Arnd
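Arnd's closing suggestion — giving ARCH_SUPPORTS_CRASH_DUMP the same constraints as ARCH_SUPPORTS_KEXEC — could be expressed as a hypothetical arch/arm/Kconfig fragment like the following (a sketch of the suggestion, not the committed fix):

```
# Hypothetical sketch: CRASH_DUMP support inherits exactly the
# constraints of KEXEC support, so randconfig can no longer enable
# kdump on a configuration that cannot build the kexec machinery.
config ARCH_SUPPORTS_CRASH_DUMP
	def_bool ARCH_SUPPORTS_KEXEC
```

The alternative form Arnd mentions, `def_bool y` plus `depends on ARCH_SUPPORTS_KEXEC`, keeps the two symbols in lockstep in the same way.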
Re: [PATCH v4 12/13] s390/kexec: refactor for kernel/Kconfig.kexec
Hi Eric, On Wed, Jul 05, 2023 at 10:20:03AM -0400, Eric DeVolder wrote: > The kexec and crash kernel options are provided in the common > kernel/Kconfig.kexec. Utilize the common options and provide > the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the > equivalent set of KEXEC and CRASH options. > > NOTE: The original Kconfig has a KEXEC_SIG which depends on > MODULE_SIG_FORMAT. However, attempts to keep the MODULE_SIG_FORMAT > dependency (using the strategy outlined in this series, and other > techniques) results in 'error: recursive dependency detected' > on CRYPTO. > > Per Alexander Gordeev : "the MODULE_SIG_FORMAT > dependency was introduced with [git commit below] and in fact was not > necessary, since s390 did/does not use mod_check_sig() anyway. > > commit c8424e776b09 ("MODSIGN: Export module signature definitions") > > MODULE_SIG_FORMAT is needed to select SYSTEM_DATA_VERIFICATION. But > SYSTEM_DATA_VERIFICATION is also selected by FS_VERITY*, so dropping > MODULE_SIG_FORMAT does not hurt." > > Therefore, the solution is to drop the MODULE_SIG_FORMAT dependency > from KEXEC_SIG. Still results in equivalent .config files for s390. > > Signed-off-by: Eric DeVolder > Acked-by: Alexander Gordeev > --- > arch/s390/Kconfig | 65 ++- > 1 file changed, 19 insertions(+), 46 deletions(-) > > diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig > index 5b39918b7042..5d4fbbfdd1cd 100644 > --- a/arch/s390/Kconfig > +++ b/arch/s390/Kconfig > @@ -244,6 +244,25 @@ config PGTABLE_LEVELS > > source "kernel/livepatch/Kconfig" > > +config ARCH_DEFAULT_KEXEC > + def_bool y > + > +config ARCH_SUPPORTS_KEXEC > + def_bool y > + > +config ARCH_SUPPORTS_KEXEC_FILE > + def_bool CRYPTO && CRYPTO_SHA256 && CRYPTO_SHA256_S390 > + > +config ARCH_HAS_KEXEC_PURGATORY > + def_bool KEXEC_FILE > + > +config ARCH_SUPPORTS_CRASH_DUMP > + def_bool y > + help > + Refer to for more details on > this. > + This option also enables s390 zfcpdump. 
> + See also > + > menu "Processor type and features" > > config HAVE_MARCH_Z10_FEATURES > @@ -482,36 +501,6 @@ config SCHED_TOPOLOGY > > source "kernel/Kconfig.hz" > > -config KEXEC > - def_bool y > - select KEXEC_CORE > - > -config KEXEC_FILE > - bool "kexec file based system call" > - select KEXEC_CORE > - depends on CRYPTO > - depends on CRYPTO_SHA256 > - depends on CRYPTO_SHA256_S390 > - help > - Enable the kexec file based system call. In contrast to the normal > - kexec system call this system call takes file descriptors for the > - kernel and initramfs as arguments. > - > -config ARCH_HAS_KEXEC_PURGATORY > - def_bool y > - depends on KEXEC_FILE > - > -config KEXEC_SIG > - bool "Verify kernel signature during kexec_file_load() syscall" > - depends on KEXEC_FILE && MODULE_SIG_FORMAT > - help > - This option makes kernel signature verification mandatory for > - the kexec_file_load() syscall. > - > - In addition to that option, you need to enable signature > - verification for the corresponding kernel image type being > - loaded in order for this to work. > - > config KERNEL_NOBP > def_bool n > prompt "Enable modified branch prediction for the kernel by default" > @@ -733,22 +722,6 @@ config VFIO_AP > > endmenu > > -menu "Dump support" > - > -config CRASH_DUMP > - bool "kernel crash dumps" > - select KEXEC > - help > - Generate crash dump after being started by kexec. > - Crash dump kernels are loaded in the main kernel with kexec-tools > - into a specially reserved region and then later executed after > - a crash by kdump/kexec. > - Refer to for more details on > this. > - This option also enables s390 zfcpdump. > - See also > - > -endmenu > - > config CCW > def_bool y > > -- > 2.31.1 > I just bisected the following build failure visible with 'ARCH=s390 allnoconfig' to this change as commit 842ce0e1dafa ("s390/kexec: refactor for kernel/Kconfig.kexec") in -next. 
arch/s390/kernel/machine_kexec.c:120:37: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration 120 | static bool kdump_csum_valid(struct kimage *image) | ^~ arch/s390/kernel/machine_kexec.c:188:34: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration 188 | int machine_kexec_prepare(struct kimage *image) | ^~ arch/s390/kernel/machine_kexec.c: In function 'machine_kexec_prepare': arch/s390/kernel/machine_kexec.c:192:18: error: invalid use of undefined type 'struct kimage' 192 | if (image->type == KEXEC_TYPE_CRASH) | ^~ arch/s390/kernel/machine_kexec.c:192:
Re: [PATCH v4 03/13] arm/kexec: refactor for kernel/Kconfig.kexec
On 7/5/23 10:05, Arnd Bergmann wrote: On Wed, Jul 5, 2023, at 16:19, Eric DeVolder wrote: The kexec and crash kernel options are provided in the common kernel/Kconfig.kexec. Utilize the common options and provide the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the equivalent set of KEXEC and CRASH options. Signed-off-by: Eric DeVolder +config ARCH_SUPPORTS_KEXEC + def_bool (!SMP || PM_SLEEP_SMP) && MMU config ATAGS_PROC bool "Export atags in procfs" @@ -1668,17 +1656,8 @@ config ATAGS_PROC Should the atags used to boot the kernel be exported in an "atags" file in procfs. Useful with kexec. -config CRASH_DUMP - bool "Build kdump crash kernel (EXPERIMENTAL)" - help - Generate crash dump after being started by kexec. This should - be normally only set in special crash dump kernels which are - loaded in the main kernel with kexec-tools into a specially - reserved region and then later executed after a crash by - kdump/kexec. The crash dump kernel must be compiled to a - memory address not used by the main kernel - - For more details see Documentation/admin-guide/kdump/kdump.rst +config ARCH_SUPPORTS_CRASH_DUMP + def_bool y I see this is now in linux-next, and it caused a few randconfig build issues, these never happened in the past: Arnd, Thanks for looking at this! I received randconfig errors from Andrew Morton's machinery. When investigating I found that randconfig was able to specify CRASH_DUMP without KEXEC, and that led to problems. I believe this situation existed prior to this series as well. Specifically CRASH_DUMP does not have a dependency on KEXEC, or select (only s390 has this hole closed). For CRASH_DUMP, this series now selects KEXEC to close this gap, which is what a sane config would have (i.e. both CRASH_DUMP and KEXEC). Do you think the changes outlined below are still needed?
eric * The #ifdef CONFIG_KEXEC check in arch/arm/include/asm/kexec.h needs to be changed to CONFIG_KEXEC_CORE: include/linux/kexec.h:41:2: error: #error KEXEC_SOURCE_MEMORY_LIMIT not defined same thing on m68k * ARCH_SUPPORTS_CRASH_DUMP needs the same dependency as ARCH_SUPPORTS_KEXEC, otherwise we seem to run into an obscure assembler error building the kdump core on architectures that do not support kdump: /tmp/ccpYl6w9.s:1546: Error: selected processor does not support requested special purpose register -- `mrs r2,cpsr' * Most architectures build machine_kexec.o only when KEXEC is enabled, this also needs to be changed to KEXEC_CORE: --- a/arch/arm/kernel/Makefile +++ b/arch/arm/kernel/Makefile @@ -59,7 +59,7 @@ obj-$(CONFIG_FUNCTION_TRACER) += entry-ftrace.o obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o insn.o patch.o obj-$(CONFIG_FUNCTION_GRAPH_TRACER)+= ftrace.o insn.o patch.o obj-$(CONFIG_JUMP_LABEL) += jump_label.o insn.o patch.o -obj-$(CONFIG_KEXEC)+= machine_kexec.o relocate_kernel.o +obj-$(CONFIG_KEXEC_CORE) += machine_kexec.o relocate_kernel.o # Main staffs in KPROBES are in arch/arm/probes/ . obj-$(CONFIG_KPROBES) += patch.o insn.o obj-$(CONFIG_OABI_COMPAT) += sys_oabi-compat.o Arnd
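The dependency gap Eric describes above — randconfig enabling CRASH_DUMP without KEXEC — can be closed in the common Kconfig roughly as follows. This is an illustrative sketch of the relationship, not the exact hunk from the series; the option wording in kernel/Kconfig.kexec may differ:

```kconfig
# Sketch only: CRASH_DUMP pulls in KEXEC so a crash kernel can
# actually be loaded; the arch opt-in gate is assumed to exist.
config CRASH_DUMP
	bool "kernel crash dumps"
	depends on ARCH_SUPPORTS_CRASH_DUMP
	select KEXEC
	help
	  Generate a crash dump after being started by kexec. Selecting
	  KEXEC here prevents configurations that enable CRASH_DUMP
	  without the kexec machinery it relies on.
```

With `select KEXEC` in place, a randconfig that turns on CRASH_DUMP automatically gets KEXEC as well, matching the only sane combination.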
Re: [PATCH v4 03/13] arm/kexec: refactor for kernel/Kconfig.kexec
On Wed, Jul 5, 2023, at 16:19, Eric DeVolder wrote:
> The kexec and crash kernel options are provided in the common
> kernel/Kconfig.kexec. Utilize the common options and provide
> the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
> equivalent set of KEXEC and CRASH options.
>
> Signed-off-by: Eric DeVolder
>
> +config ARCH_SUPPORTS_KEXEC
> +	def_bool (!SMP || PM_SLEEP_SMP) && MMU
>
>  config ATAGS_PROC
>  	bool "Export atags in procfs"
> @@ -1668,17 +1656,8 @@ config ATAGS_PROC
>  	  Should the atags used to boot the kernel be exported in an "atags"
>  	  file in procfs. Useful with kexec.
>
> -config CRASH_DUMP
> -	bool "Build kdump crash kernel (EXPERIMENTAL)"
> -	help
> -	  Generate crash dump after being started by kexec. This should
> -	  be normally only set in special crash dump kernels which are
> -	  loaded in the main kernel with kexec-tools into a specially
> -	  reserved region and then later executed after a crash by
> -	  kdump/kexec. The crash dump kernel must be compiled to a
> -	  memory address not used by the main kernel
> -
> -	  For more details see Documentation/admin-guide/kdump/kdump.rst
> +config ARCH_SUPPORTS_CRASH_DUMP
> +	def_bool y

I see this is now in linux-next, and it caused a few randconfig build
issues, these never happened in the past:

* The #ifdef CONFIG_KEXEC check in arch/arm/include/asm/kexec.h
  needs to be changed to CONFIG_KEXEC_CORE:

  include/linux/kexec.h:41:2: error: #error KEXEC_SOURCE_MEMORY_LIMIT not defined

  same thing on m68k

* ARCH_SUPPORTS_CRASH_DUMP needs the same dependency as
  ARCH_SUPPORTS_KEXEC, otherwise we seem to run into an obscure
  assembler error building the kdump core on architectures that
  do not support kdump:

  /tmp/ccpYl6w9.s:1546: Error: selected processor does not support requested special purpose register -- `mrs r2,cpsr'

* Most architectures build machine_kexec.o only when KEXEC is
  enabled, this also needs to be changed to KEXEC_CORE:

--- a/arch/arm/kernel/Makefile
+++ b/arch/arm/kernel/Makefile
@@ -59,7 +59,7 @@
 obj-$(CONFIG_FUNCTION_TRACER)	+= entry-ftrace.o
 obj-$(CONFIG_DYNAMIC_FTRACE)	+= ftrace.o insn.o patch.o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER)	+= ftrace.o insn.o patch.o
 obj-$(CONFIG_JUMP_LABEL)	+= jump_label.o insn.o patch.o
-obj-$(CONFIG_KEXEC)	+= machine_kexec.o relocate_kernel.o
+obj-$(CONFIG_KEXEC_CORE)	+= machine_kexec.o relocate_kernel.o
 # Main staffs in KPROBES are in arch/arm/probes/ .
 obj-$(CONFIG_KPROBES)	+= patch.o insn.o
 obj-$(CONFIG_OABI_COMPAT)	+= sys_oabi-compat.o

Arnd
[PATCH v4 13/13] sh/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
Acked-by: John Paul Adrian Glaubitz
---
 arch/sh/Kconfig | 46 ++++++++--------------------------------------
 1 file changed, 8 insertions(+), 38 deletions(-)

diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 2b3ce4fd3956..1cf6603781c7 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -548,44 +548,14 @@ menu "Kernel features"

 source "kernel/Kconfig.hz"

-config KEXEC
-	bool "kexec system call (EXPERIMENTAL)"
-	depends on MMU
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
-
-config CRASH_DUMP
-	bool "kernel crash dumps (EXPERIMENTAL)"
-	depends on BROKEN_ON_SMP
-	help
-	  Generate crash dump after being started by kexec.
-	  This should be normally only set in special crash dump kernels
-	  which are loaded in the main kernel with kexec-tools into
-	  a specially reserved region and then later executed after
-	  a crash by kdump/kexec. The crash dump kernel must be compiled
-	  to a memory address not used by the main kernel using
-	  PHYSICAL_START.
-
-	  For more details see Documentation/admin-guide/kdump/kdump.rst
-
-config KEXEC_JUMP
-	bool "kexec jump (EXPERIMENTAL)"
-	depends on KEXEC && HIBERNATION
-	help
-	  Jump between original kernel and kexeced kernel and invoke
-	  code via KEXEC
+config ARCH_SUPPORTS_KEXEC
+	def_bool MMU
+
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool BROKEN_ON_SMP
+
+config ARCH_SUPPORTS_KEXEC_JUMP
+	def_bool y

 config PHYSICAL_START
 	hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
--
2.31.1
[PATCH v4 12/13] s390/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

NOTE: The original Kconfig has a KEXEC_SIG which depends on
MODULE_SIG_FORMAT. However, attempts to keep the MODULE_SIG_FORMAT
dependency (using the strategy outlined in this series, and other
techniques) result in 'error: recursive dependency detected' on
CRYPTO.

Per Alexander Gordeev : "the MODULE_SIG_FORMAT dependency was
introduced with [git commit below] and in fact was not necessary,
since s390 did/does not use mod_check_sig() anyway.

  commit c8424e776b09 ("MODSIGN: Export module signature definitions")

MODULE_SIG_FORMAT is needed to select SYSTEM_DATA_VERIFICATION. But
SYSTEM_DATA_VERIFICATION is also selected by FS_VERITY*, so dropping
MODULE_SIG_FORMAT does not hurt."

Therefore, the solution is to drop the MODULE_SIG_FORMAT dependency
from KEXEC_SIG. This still results in equivalent .config files for
s390.

Signed-off-by: Eric DeVolder
Acked-by: Alexander Gordeev
---
 arch/s390/Kconfig | 65 +++++++++++++++++++++------------------------------------------------
 1 file changed, 19 insertions(+), 46 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 5b39918b7042..5d4fbbfdd1cd 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -244,6 +244,25 @@ config PGTABLE_LEVELS

 source "kernel/livepatch/Kconfig"

+config ARCH_DEFAULT_KEXEC
+	def_bool y
+
+config ARCH_SUPPORTS_KEXEC
+	def_bool y
+
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool CRYPTO && CRYPTO_SHA256 && CRYPTO_SHA256_S390
+
+config ARCH_HAS_KEXEC_PURGATORY
+	def_bool KEXEC_FILE
+
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y
+	help
+	  Refer to for more details on this.
+	  This option also enables s390 zfcpdump.
+	  See also
+
 menu "Processor type and features"

 config HAVE_MARCH_Z10_FEATURES
@@ -482,36 +501,6 @@ config SCHED_TOPOLOGY

 source "kernel/Kconfig.hz"

-config KEXEC
-	def_bool y
-	select KEXEC_CORE
-
-config KEXEC_FILE
-	bool "kexec file based system call"
-	select KEXEC_CORE
-	depends on CRYPTO
-	depends on CRYPTO_SHA256
-	depends on CRYPTO_SHA256_S390
-	help
-	  Enable the kexec file based system call. In contrast to the normal
-	  kexec system call this system call takes file descriptors for the
-	  kernel and initramfs as arguments.
-
-config ARCH_HAS_KEXEC_PURGATORY
-	def_bool y
-	depends on KEXEC_FILE
-
-config KEXEC_SIG
-	bool "Verify kernel signature during kexec_file_load() syscall"
-	depends on KEXEC_FILE && MODULE_SIG_FORMAT
-	help
-	  This option makes kernel signature verification mandatory for
-	  the kexec_file_load() syscall.
-
-	  In addition to that option, you need to enable signature
-	  verification for the corresponding kernel image type being
-	  loaded in order for this to work.
-
 config KERNEL_NOBP
 	def_bool n
 	prompt "Enable modified branch prediction for the kernel by default"
@@ -733,22 +722,6 @@ config VFIO_AP

 endmenu

-menu "Dump support"
-
-config CRASH_DUMP
-	bool "kernel crash dumps"
-	select KEXEC
-	help
-	  Generate crash dump after being started by kexec.
-	  Crash dump kernels are loaded in the main kernel with kexec-tools
-	  into a specially reserved region and then later executed after
-	  a crash by kdump/kexec.
-	  Refer to for more details on this.
-	  This option also enables s390 zfcpdump.
-	  See also
-
-endmenu
-
 config CCW
 	def_bool y
--
2.31.1
[PATCH v4 11/13] riscv/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/riscv/Kconfig | 44 +++++++++++++-------------------------------
 1 file changed, 13 insertions(+), 31 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index b49793cf34eb..8a3af850597a 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -647,48 +647,30 @@ config RISCV_BOOT_SPINWAIT

 	  If unsure what to do here, say N.

-config KEXEC
-	bool "Kexec system call"
-	depends on MMU
+config ARCH_SUPPORTS_KEXEC
+	def_bool MMU
+
+config ARCH_SELECTS_KEXEC
+	def_bool y
+	depends on KEXEC
 	select HOTPLUG_CPU if SMP
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.

+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool 64BIT && MMU

-config KEXEC_FILE
-	bool "kexec file based systmem call"
-	depends on 64BIT && MMU
+config ARCH_SELECTS_KEXEC_FILE
+	def_bool y
+	depends on KEXEC_FILE
 	select HAVE_IMA_KEXEC if IMA
-	select KEXEC_CORE
 	select KEXEC_ELF
-	help
-	  This is new version of kexec system call. This system call is
-	  file based and takes file descriptors as system call argument
-	  for kernel and initramfs as opposed to list of segments as
-	  accepted by previous system call.
-
-	  If you don't know what to do here, say Y.

 config ARCH_HAS_KEXEC_PURGATORY
 	def_bool KEXEC_FILE
 	depends on CRYPTO=y
 	depends on CRYPTO_SHA256=y

-config CRASH_DUMP
-	bool "Build kdump crash kernel"
-	help
-	  Generate crash dump after being started by kexec. This should
-	  be normally only set in special crash dump kernels which are
-	  loaded in the main kernel with kexec-tools into a specially
-	  reserved region and then later executed after a crash by
-	  kdump/kexec.
-
-	  For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y

 config COMPAT
 	bool "Kernel support for 32-bit U-mode"
--
2.31.1
[PATCH v4 10/13] powerpc/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
---
 arch/powerpc/Kconfig | 55 +++++++++++++++++++--------------------------------------
 1 file changed, 17 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 0b1172cbeccb..1695a71777f0 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -589,41 +589,21 @@ config PPC64_SUPPORTS_MEMORY_FAILURE
 	default "y" if PPC_POWERNV
 	select ARCH_SUPPORTS_MEMORY_FAILURE

-config KEXEC
-	bool "kexec system call"
-	depends on PPC_BOOK3S || PPC_E500 || (44x && !SMP)
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
-
-config KEXEC_FILE
-	bool "kexec file based system call"
-	select KEXEC_CORE
-	select HAVE_IMA_KEXEC if IMA
-	select KEXEC_ELF
-	depends on PPC64
-	depends on CRYPTO=y
-	depends on CRYPTO_SHA256=y
-	help
-	  This is a new version of the kexec system call. This call is
-	  file based and takes in file descriptors as system call arguments
-	  for kernel and initramfs as opposed to a list of segments as is the
-	  case for the older kexec call.
+config ARCH_SUPPORTS_KEXEC
+	def_bool PPC_BOOK3S || PPC_E500 || (44x && !SMP)
+
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool PPC64 && CRYPTO=y && CRYPTO_SHA256=y

 config ARCH_HAS_KEXEC_PURGATORY
 	def_bool KEXEC_FILE

+config ARCH_SELECTS_KEXEC_FILE
+	def_bool y
+	depends on KEXEC_FILE
+	select KEXEC_ELF
+	select HAVE_IMA_KEXEC if IMA
+
 config PPC64_BIG_ENDIAN_ELF_ABI_V2
 	# Option is available to BFD, but LLD does not support ELFv1 so this is
 	# always true there.
@@ -683,14 +663,13 @@ config RELOCATABLE_TEST
 	  loaded at, which tends to be non-zero and therefore test the
 	  relocation code.

-config CRASH_DUMP
-	bool "Build a dump capture kernel"
-	depends on PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)
+
+config ARCH_SELECTS_CRASH_DUMP
+	def_bool y
+	depends on CRASH_DUMP
 	select RELOCATABLE if PPC64 || 44x || PPC_85xx
-	help
-	  Build a kernel suitable for use as a dump capture kernel.
-	  The same kernel binary can be used as production kernel and dump
-	  capture kernel.

 config FA_DUMP
 	bool "Firmware-assisted dump"
--
2.31.1
[PATCH v4 08/13] mips/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
Acked-by: Thomas Bogendoerfer
---
 arch/mips/Kconfig | 32 +++++---------------------------
 1 file changed, 5 insertions(+), 27 deletions(-)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index fc6fba925aea..bc8421859006 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2878,33 +2878,11 @@ config HZ

 config SCHED_HRTICK
 	def_bool HIGH_RES_TIMERS

-config KEXEC
-	bool "Kexec system call"
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
-
-config CRASH_DUMP
-	bool "Kernel crash dumps"
-	help
-	  Generate crash dump after being started by kexec.
-	  This should be normally only set in special crash dump kernels
-	  which are loaded in the main kernel with kexec-tools into
-	  a specially reserved region and then later executed after
-	  a crash by kdump/kexec. The crash dump kernel must be compiled
-	  to a memory address not used by the main kernel or firmware using
-	  PHYSICAL_START.
+config ARCH_SUPPORTS_KEXEC
+	def_bool y
+
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y

 config PHYSICAL_START
 	hex "Physical address where the kernel is loaded"
--
2.31.1
[PATCH v4 09/13] parisc/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/parisc/Kconfig | 34 +++++++++++-----------------------
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 4cb46d5c64a2..2ef6843aae60 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -339,29 +339,17 @@ config NR_CPUS
 	default "8" if 64BIT
 	default "16"

-config KEXEC
-	bool "Kexec system call"
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  shutdown, so do not be surprised if this code does not
-	  initially work for you.
-
-config KEXEC_FILE
-	bool "kexec file based system call"
-	select KEXEC_CORE
-	select KEXEC_ELF
-	help
-	  This enables the kexec_file_load() System call. This is
-	  file based and takes file descriptors as system call argument
-	  for kernel and initramfs as opposed to list of segments as
-	  accepted by previous system call.
-
 endmenu

+config ARCH_SUPPORTS_KEXEC
+	def_bool y
+
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool y
+
+config ARCH_SELECTS_KEXEC_FILE
+	def_bool y
+	depends on KEXEC_FILE
+	select KEXEC_ELF
+
 source "drivers/parisc/Kconfig"
--
2.31.1
[PATCH v4 07/13] m68k/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
Reviewed-by: Geert Uytterhoeven
Acked-by: Geert Uytterhoeven
---
 arch/m68k/Kconfig | 19 ++-----------------
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
index dc792b321f1e..3e318bf9504c 100644
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -89,23 +89,8 @@ config MMU_SUN3
 	bool
 	depends on MMU && !MMU_MOTOROLA && !MMU_COLDFIRE

-config KEXEC
-	bool "kexec system call"
-	depends on M68KCLASSIC && MMU
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
+config ARCH_SUPPORTS_KEXEC
+	def_bool M68KCLASSIC && MMU

 config BOOTINFO_PROC
 	bool "Export bootinfo in procfs"
--
2.31.1
[PATCH v4 05/13] arm64/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/arm64/Kconfig | 62 ++++++++++++--------------------------------------------------
 1 file changed, 12 insertions(+), 50 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7856c3a3e35a..bc6c9ff85a77 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1459,60 +1459,22 @@ config PARAVIRT_TIME_ACCOUNTING

 	  If in doubt, say N here.

-config KEXEC
-	depends on PM_SLEEP_SMP
-	select KEXEC_CORE
-	bool "kexec system call"
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-config KEXEC_FILE
-	bool "kexec file based system call"
-	select KEXEC_CORE
-	select HAVE_IMA_KEXEC if IMA
-	help
-	  This is new version of kexec system call. This system call is
-	  file based and takes file descriptors as system call argument
-	  for kernel and initramfs as opposed to list of segments as
-	  accepted by previous system call.
-
-config KEXEC_SIG
-	bool "Verify kernel signature during kexec_file_load() syscall"
-	depends on KEXEC_FILE
-	help
-	  Select this option to verify a signature with loaded kernel
-	  image. If configured, any attempt of loading a image without
-	  valid signature will fail.
-
-	  In addition to that option, you need to enable signature
-	  verification for the corresponding kernel image type being
-	  loaded in order for this to work.
+config ARCH_SUPPORTS_KEXEC
+	def_bool PM_SLEEP_SMP

-config KEXEC_IMAGE_VERIFY_SIG
-	bool "Enable Image signature verification support"
-	default y
-	depends on KEXEC_SIG
-	depends on EFI && SIGNED_PE_FILE_VERIFICATION
-	help
-	  Enable Image signature verification support.
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool y

-comment "Support for PE file signature verification disabled"
-	depends on KEXEC_SIG
-	depends on !EFI || !SIGNED_PE_FILE_VERIFICATION
+config ARCH_SELECTS_KEXEC_FILE
+	def_bool y
+	depends on KEXEC_FILE
+	select HAVE_IMA_KEXEC if IMA

-config CRASH_DUMP
-	bool "Build kdump crash kernel"
-	help
-	  Generate crash dump after being started by kexec. This should
-	  be normally only set in special crash dump kernels which are
-	  loaded in the main kernel with kexec-tools into a specially
-	  reserved region and then later executed after a crash by
-	  kdump/kexec.
+config ARCH_DEFAULT_KEXEC_IMAGE_VERIFY_SIG
+	def_bool y

-	  For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y

 config TRANS_TABLE
 	def_bool y
--
2.31.1
[PATCH v4 06/13] loongarch/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/loongarch/Kconfig | 26 +++++++-------------------
 1 file changed, 7 insertions(+), 19 deletions(-)

diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index e55511af4c77..397203e18800 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -537,28 +537,16 @@ config CPU_HAS_PREFETCH
 	bool
 	default y

-config KEXEC
-	bool "Kexec system call"
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
+config ARCH_SUPPORTS_KEXEC
+	def_bool y

-	  The name comes from the similarity to the exec system call.
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y

-config CRASH_DUMP
-	bool "Build kdump crash kernel"
+config ARCH_SELECTS_CRASH_DUMP
+	def_bool y
+	depends on CRASH_DUMP
 	select RELOCATABLE
-	help
-	  Generate crash dump after being started by kexec. This should
-	  be normally only set in special crash dump kernels which are
-	  loaded in the main kernel with kexec-tools into a specially
-	  reserved region and then later executed after a crash by
-	  kdump/kexec.
-
-	  For more details see Documentation/admin-guide/kdump/kdump.rst

 config RELOCATABLE
 	bool "Relocatable kernel"
--
2.31.1
[PATCH v4 02/13] x86/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/x86/Kconfig | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 13 insertions(+), 76 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7422db409770..6d2212c38046 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2040,88 +2040,25 @@ config EFI_RUNTIME_MAP

 source "kernel/Kconfig.hz"

-config KEXEC
-	bool "kexec system call"
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
-
-config KEXEC_FILE
-	bool "kexec file based system call"
-	select KEXEC_CORE
-	select HAVE_IMA_KEXEC if IMA
-	depends on X86_64
-	depends on CRYPTO=y
-	depends on CRYPTO_SHA256=y
-	help
-	  This is new version of kexec system call. This system call is
-	  file based and takes file descriptors as system call argument
-	  for kernel and initramfs as opposed to list of segments as
-	  accepted by previous system call.
+config ARCH_SUPPORTS_KEXEC
+	def_bool y

-config ARCH_HAS_KEXEC_PURGATORY
-	def_bool KEXEC_FILE
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool X86_64 && CRYPTO && CRYPTO_SHA256

-config KEXEC_SIG
-	bool "Verify kernel signature during kexec_file_load() syscall"
+config ARCH_SELECTS_KEXEC_FILE
+	def_bool y
 	depends on KEXEC_FILE
-	help
-
-	  This option makes the kexec_file_load() syscall check for a valid
-	  signature of the kernel image. The image can still be loaded without
-	  a valid signature unless you also enable KEXEC_SIG_FORCE, though if
-	  there's a signature that we can check, then it must be valid.
-
-	  In addition to this option, you need to enable signature
-	  verification for the corresponding kernel image type being
-	  loaded in order for this to work.
-
-config KEXEC_SIG_FORCE
-	bool "Require a valid signature in kexec_file_load() syscall"
-	depends on KEXEC_SIG
-	help
-	  This option makes kernel signature verification mandatory for
-	  the kexec_file_load() syscall.
+	select HAVE_IMA_KEXEC if IMA

-config KEXEC_BZIMAGE_VERIFY_SIG
-	bool "Enable bzImage signature verification support"
-	depends on KEXEC_SIG
-	depends on SIGNED_PE_FILE_VERIFICATION
-	select SYSTEM_TRUSTED_KEYRING
-	help
-	  Enable bzImage signature verification support.
+config ARCH_HAS_KEXEC_PURGATORY
+	def_bool KEXEC_FILE

-config CRASH_DUMP
-	bool "kernel crash dumps"
-	depends on X86_64 || (X86_32 && HIGHMEM)
-	help
-	  Generate crash dump after being started by kexec.
-	  This should be normally only set in special crash dump kernels
-	  which are loaded in the main kernel with kexec-tools into
-	  a specially reserved region and then later executed after
-	  a crash by kdump/kexec. The crash dump kernel must be compiled
-	  to a memory address not used by the main kernel or BIOS using
-	  PHYSICAL_START, or it must be built as a relocatable image
-	  (CONFIG_RELOCATABLE=y).
-
-	  For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_KEXEC_JUMP
+	def_bool y

-config KEXEC_JUMP
-	bool "kexec jump"
-	depends on KEXEC && HIBERNATION
-	help
-	  Jump between original kernel and kexeced kernel and invoke
-	  code in physical address mode via KEXEC
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool X86_64 || (X86_32 && HIGHMEM)

 config PHYSICAL_START
 	hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
--
2.31.1
[PATCH v4 00/13] refactor Kconfig to consolidate KEXEC and CRASH options
The Kconfig is refactored to consolidate KEXEC and CRASH options from various arch//Kconfig files into new file kernel/Kconfig.kexec. The Kconfig.kexec is now a submenu titled "Kexec and crash features" located under "General Setup". The following options are impacted: - KEXEC - KEXEC_FILE - KEXEC_SIG - KEXEC_SIG_FORCE - KEXEC_BZIMAGE_VERIFY_SIG - KEXEC_JUMP - CRASH_DUMP Over time, these options have been copied between Kconfig files and are very similar to one another, but with slight differences. The following architectures are impacted by the refactor (because of use of one or more KEXEC/CRASH options): - arm - arm64 - ia64 - loongarch - m68k - mips - parisc - powerpc - riscv - s390 - sh - x86 More information: In the patch series "crash: Kernel handling of CPU and memory hot un/plug" https://lore.kernel.org/lkml/20230503224145.7405-1-eric.devol...@oracle.com/ the new kernel feature introduces the config option CRASH_HOTPLUG. In reviewing, Thomas Gleixner requested that the new config option not be placed in x86 Kconfig. Rather the option needs a generic/common home. To Thomas' point, the KEXEC and CRASH options have largely been duplicated in the various arch//Kconfig files, with minor differences. This kind of proliferation is to be avoid/stopped. https://lore.kernel.org/lkml/875y91yv63.ffs@tglx/ To that end, I have refactored the arch Kconfigs so as to consolidate the various KEXEC and CRASH options. Generally speaking, this work has the following themes: - KEXEC and CRASH options are moved into new file kernel/Kconfig.kexec - These items from arch/Kconfig: CRASH_CORE KEXEC_CORE KEXEC_ELF HAVE_IMA_KEXEC - These items from arch/x86/Kconfig form the common options: KEXEC KEXEC_FILE KEXEC_SIG KEXEC_SIG_FORCE KEXEC_BZIMAGE_VERIFY_SIG KEXEC_JUMP CRASH_DUMP - The crash hotplug series appends CRASH_HOTPLUG to Kconfig.kexec - The Kconfig.kexec is now a submenu titled "Kexec and crash features" and is now listed in "General Setup" submenu from init/Kconfig. 
- To control the main common options, new options ARCH_SUPPORTS_KEXEC, ARCH_SUPPORTS_KEXEC_FILE and ARCH_SUPPORTS_CRASH_DUMP are introduced. NOTE: The existing ARCH_HAS_KEXEC_PURGATORY remains unchanged. - To account for the slight differences, new options ARCH_SELECTS_KEXEC, ARCH_SELECTS_KEXEC_FILE and ARCH_SELECTS_CRASH_DUMP are used to elicit the same side effects as the original arch//Kconfig files for KEXEC and CRASH options. An example, 'make menuconfig' illustrating the submenu: > General setup > Kexec and crash features [*] Enable kexec system call [*] Enable kexec file based system call [*] Verify kernel signature during kexec_file_load() syscall [ ] Require a valid signature in kexec_file_load() syscall [ ] Enable bzImage signature verification support [*] kexec jump [*] kernel crash dumps [*] Update the crash elfcorehdr on system configuration changes The three main options are KEXEC, KEXEC_FILE and CRASH_DUMP. In the process of consolidating these options, I encountered slight differences in the coding of these options in several of the architectures. As a result, I settled on the following solution: - Each of three main options has a 'depends on ARCH_SUPPORTS_' statement: ARCH_SUPPORTS_KEXEC, ARCH_SUPPORTS_KEXEC_FILE, ARCH_SUPPORTS_CRASH_DUMP. For example, the KEXEC_FILE option has a 'depends on ARCH_SUPPORTS_KEXEC_FILE' statement. - The boolean ARCH_SUPPORTS_ in effect allows the arch to determine when the feature is allowed. Archs which don't have the feature simply do not provide the corresponding ARCH_SUPPORTS_. For each arch, where there previously were KEXEC and/or CRASH options, these have been replaced with the corresponding boolean ARCH_SUPPORTS_, and an appropriate def_bool statement. For example, if the arch supports KEXEC_FILE, then the ARCH_SUPPORTS_KEXEC_FILE simply has a 'def_bool y'. This permits the KEXEC_FILE option to be available. 
If the arch has a 'depends on' statement in its original coding of the
option, then that expression becomes part of the def_bool expression.
For example, arm64 had:

  config KEXEC
    depends on PM_SLEEP_SMP

and in this solution, this converts to:

  config ARCH_SUPPORTS_KEXEC
    def_bool PM_SLEEP_SMP

- In order to account for the differences in the config coding for the
  three common options, the ARCH_SELECTS_<option> is used. This option
  has a 'depends on <option>' statement to couple it to the main option,
  and from there can insert the differences from the common option and
  the arch original coding of that option.

  For example, a few archs enable CRYPTO and CRYPTO_SHA256 for
  KEXEC_FILE. These require an ARCH_SELECTS_KEXEC_FILE with
  'select CRYPTO' and 'select CRYPTO_SHA256' statements.

Illustrating the option relationships:

For KEXEC:
  ARCH_SUPPORTS_KEXEC <- KEXEC <- ARCH_SELECTS_KEXEC

  KEXEC                 # in Kconfig.kexec
  ARCH_SUPPORTS_KEX
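Putting the pieces together, the SUPPORTS/SELECTS split sketched above looks roughly like the following Kconfig fragment. This is a composite sketch assembled from the examples in this cover letter (the KEXEC_FILE option, a 'def_bool y' gate, and the CRYPTO side effects), not a verbatim excerpt from the series:

```
# Common definition in kernel/Kconfig.kexec:
config KEXEC_FILE
	bool "Enable kexec file based system call"
	depends on ARCH_SUPPORTS_KEXEC_FILE
	select KEXEC_CORE

# In arch/<arch>/Kconfig: the gate that makes the option available...
config ARCH_SUPPORTS_KEXEC_FILE
	def_bool y

# ...and the arch-specific side effects, coupled to the main option:
config ARCH_SELECTS_KEXEC_FILE
	def_bool y
	depends on KEXEC_FILE
	select CRYPTO
	select CRYPTO_SHA256
```

The 'depends on KEXEC_FILE' in the SELECTS option is what makes the extra selects fire only when the user actually enables the feature.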
[PATCH v4 04/13] ia64/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/ia64/Kconfig | 28 +---
 1 file changed, 5 insertions(+), 23 deletions(-)

diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 2cd93e6bf0fe..88382f105301 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -361,31 +361,13 @@ config IA64_HP_AML_NFW
 	  the "force" module parameter, e.g., with the "aml_nfw.force"
 	  kernel command line option.
 
-config KEXEC
-	bool "kexec system call"
-	depends on !SMP || HOTPLUG_CPU
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
+endmenu
 
-config CRASH_DUMP
-	bool "kernel crash dumps"
-	depends on IA64_MCA_RECOVERY && (!SMP || HOTPLUG_CPU)
-	help
-	  Generate crash dump after being started by kexec.
+config ARCH_SUPPORTS_KEXEC
+	def_bool !SMP || HOTPLUG_CPU
 
-endmenu
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool IA64_MCA_RECOVERY && (!SMP || HOTPLUG_CPU)
 
 menu "Power management and ACPI options"
-- 
2.31.1
[PATCH v4 03/13] arm/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/arm/Kconfig | 29 -
 1 file changed, 4 insertions(+), 25 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 7a27550ff3c1..1a6a6eb48a15 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1645,20 +1645,8 @@ config XIP_DEFLATED_DATA
 	  copied, saving some precious ROM space. A possible drawback is a
 	  slightly longer boot delay.
 
-config KEXEC
-	bool "Kexec system call (EXPERIMENTAL)"
-	depends on (!SMP || PM_SLEEP_SMP)
-	depends on MMU
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you.
+config ARCH_SUPPORTS_KEXEC
+	def_bool (!SMP || PM_SLEEP_SMP) && MMU
 
 config ATAGS_PROC
 	bool "Export atags in procfs"
@@ -1668,17 +1656,8 @@ config ATAGS_PROC
 	  Should the atags used to boot the kernel be exported in an "atags"
 	  file in procfs. Useful with kexec.
 
-config CRASH_DUMP
-	bool "Build kdump crash kernel (EXPERIMENTAL)"
-	help
-	  Generate crash dump after being started by kexec. This should
-	  be normally only set in special crash dump kernels which are
-	  loaded in the main kernel with kexec-tools into a specially
-	  reserved region and then later executed after a crash by
-	  kdump/kexec. The crash dump kernel must be compiled to a
-	  memory address not used by the main kernel
-
-	  For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y
 
 config AUTO_ZRELADDR
 	bool "Auto calculation of the decompressed kernel image address" if !ARCH_MULTIPLATFORM
-- 
2.31.1
[PATCH v4 01/13] kexec: consolidate kexec and crash options into kernel/Kconfig.kexec
The config options for kexec and crash features are consolidated into
new file kernel/Kconfig.kexec. Under the "General Setup" submenu is a
new submenu "Kexec and crash handling". All the kexec and crash
options that were once in the arch-dependent submenu "Processor type
and features" are now consolidated in the new submenu.

The following options are impacted:
 - KEXEC
 - KEXEC_FILE
 - KEXEC_SIG
 - KEXEC_SIG_FORCE
 - KEXEC_BZIMAGE_VERIFY_SIG
 - KEXEC_JUMP
 - CRASH_DUMP

The three main options are KEXEC, KEXEC_FILE and CRASH_DUMP.

Architectures specify support of certain KEXEC and CRASH features with
similarly named new ARCH_SUPPORTS_<option> config options.

Architectures can utilize the new ARCH_SELECTS_<option> config options
to specify additional components when <option> is enabled.

To summarize, the ARCH_SUPPORTS_<option> permits the <option> to be
enabled, and the ARCH_SELECTS_<option> handles side effects (ie.
select statements).

Signed-off-by: Eric DeVolder
---
 arch/Kconfig | 13 -
 init/Kconfig | 2 +
 kernel/Kconfig.kexec | 110 +++
 3 files changed, 112 insertions(+), 13 deletions(-)
 create mode 100644 kernel/Kconfig.kexec

diff --git a/arch/Kconfig b/arch/Kconfig
index aff2746c8af2..b2872e9d3760 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -11,19 +11,6 @@ source "arch/$(SRCARCH)/Kconfig"
 
 menu "General architecture-dependent options"
 
-config CRASH_CORE
-	bool
-
-config KEXEC_CORE
-	select CRASH_CORE
-	bool
-
-config KEXEC_ELF
-	bool
-
-config HAVE_IMA_KEXEC
-	bool
-
 config ARCH_HAS_SUBPAGE_FAULTS
 	bool
 	help
diff --git a/init/Kconfig b/init/Kconfig
index f7f65af4ee12..639e8a3363c3 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1927,6 +1927,8 @@ config BINDGEN_VERSION_TEXT
 config TRACEPOINTS
 	bool
 
+source "kernel/Kconfig.kexec"
+
 endmenu		# General setup
 
 source "arch/Kconfig"
diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
new file mode 100644
index ..d82a7ce59c05
--- /dev/null
+++ b/kernel/Kconfig.kexec
@@ -0,0 +1,110 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+menu "Kexec and crash features"
+
+config CRASH_CORE
+	bool
+
+config KEXEC_CORE
+	select CRASH_CORE
+	bool
+
+config KEXEC_ELF
+	bool
+
+config HAVE_IMA_KEXEC
+	bool
+
+config KEXEC
+	bool "Enable kexec system call"
+	default ARCH_DEFAULT_KEXEC
+	depends on ARCH_SUPPORTS_KEXEC
+	select KEXEC_CORE
+	help
+	  kexec is a system call that implements the ability to shutdown your
+	  current kernel, and to start another kernel. It is like a reboot
+	  but it is independent of the system firmware. And like a reboot
+	  you can start any kernel with it, not just Linux.
+
+	  The name comes from the similarity to the exec system call.
+
+	  It is an ongoing process to be certain the hardware in a machine
+	  is properly shutdown, so do not be surprised if this code does not
+	  initially work for you. As of this writing the exact hardware
+	  interface is strongly in flux, so no good recommendation can be
+	  made.
+
+config KEXEC_FILE
+	bool "Enable kexec file based system call"
+	depends on ARCH_SUPPORTS_KEXEC_FILE
+	select KEXEC_CORE
+	help
+	  This is new version of kexec system call. This system call is
+	  file based and takes file descriptors as system call argument
+	  for kernel and initramfs as opposed to list of segments as
+	  accepted by kexec system call.
+
+config KEXEC_SIG
+	bool "Verify kernel signature during kexec_file_load() syscall"
+	depends on KEXEC_FILE
+	help
+	  This option makes the kexec_file_load() syscall check for a valid
+	  signature of the kernel image. The image can still be loaded without
+	  a valid signature unless you also enable KEXEC_SIG_FORCE, though if
+	  there's a signature that we can check, then it must be valid.
+
+	  In addition to this option, you need to enable signature
+	  verification for the corresponding kernel image type being
+	  loaded in order for this to work.
+
+config KEXEC_SIG_FORCE
+	bool "Require a valid signature in kexec_file_load() syscall"
+	depends on KEXEC_SIG
+	help
+	  This option makes kernel signature verification mandatory for
+	  the kexec_file_load() syscall.
+
+config KEXEC_IMAGE_VERIFY_SIG
+	bool "Enable Image signature verification support (ARM)"
+	default ARCH_DEFAULT_KEXEC_IMAGE_VERIFY_SIG
+	depends on KEXEC_SIG
+	depends on EFI && SIGNED_PE_FILE_VERIFICATION
+	help
+	  Enable Image signature verification support.
+
+config KEXEC_BZIMAGE_VERIFY_SIG
+	bool "Enable bzImage signature verification support"
+	depends on KEXEC_SIG
+	depends on SIGNED_PE_FILE_VERIFICATION
+	select SYSTEM_TRUSTED_KEYRING
+	help
+	  Enable b
[PATCH v2 15/92] spufs: convert to ctime accessor functions
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.

Acked-by: Jeremy Kerr
Reviewed-by: Jan Kara
Signed-off-by: Jeff Layton
---
 arch/powerpc/platforms/cell/spufs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
index ea807aa0c31a..38c5be34c895 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -86,7 +86,7 @@ spufs_new_inode(struct super_block *sb, umode_t mode)
 	inode->i_mode = mode;
 	inode->i_uid = current_fsuid();
 	inode->i_gid = current_fsgid();
-	inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+	inode->i_atime = inode->i_mtime = inode_set_ctime_current(inode);
 out:
 	return inode;
 }
-- 
2.41.0
Re: [PATCH 1/2] tools/perf/tests: perf all metrics test fails when perf_event access is restricted
On Mon, Jul 3, 2023 at 10:04 PM Athira Rajeev wrote:
>
> > On 15-Jun-2023, at 1:08 PM, Athira Rajeev wrote:
> >
> > Perf all metrics test fails as below when perf_event access
> > is restricted.
> >
> >    ./perf test -v "perf all metrics test"
> >    Metric 'Memory_RD_BW_Chip' not printed in:
> >    Error:
> >    Access to performance monitoring and observability operations is limited.
> >    Enforced MAC policy settings (SELinux) can limit access to performance
> >    —
> >    access to performance monitoring and observability operations for processes
> >    without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
> >    —
> >    test child finished with -1
> >     end
> >    perf all metrics test: FAILED!

In my system, it fails like below:

  $ ./perf test -v 101
  101: perf all metrics test :
  --- start ---
  test child forked, pid 398458
  Testing branch_misprediction_ratio
  Testing all_remote_links_outbound
  Metric 'all_remote_links_outbound' not printed in:
  Error:
  Invalid event (remote_outbound_data_controller_3:u) in per-thread mode, enable system wide with '-a'.
  Testing nps1_die_to_dram
  ...

Thanks,
Namhyung

> > Hi,
> > Looking for review comments on this patch.
> > Thanks
>
> > The perf all metrics test picks the input events from
> > "perf list --raw-dump metrics" and runs "perf stat -M "$m""
> > for each of the metrics in the list. It fails here for some
> > of the metrics which need access, since it collects system
> > wide resource details/statistics. Fix the testcase to skip
> > those metric events.
> >
> > Signed-off-by: Athira Rajeev
> > ---
> > tools/perf/tests/shell/stat_all_metrics.sh | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/perf/tests/shell/stat_all_metrics.sh b/tools/perf/tests/shell/stat_all_metrics.sh
> > index 54774525e18a..14b96484a359 100755
> > --- a/tools/perf/tests/shell/stat_all_metrics.sh
> > +++ b/tools/perf/tests/shell/stat_all_metrics.sh
> > @@ -6,7 +6,9 @@ err=0
> >  for m in $(perf list --raw-dump metrics); do
> >    echo "Testing $m"
> >    result=$(perf stat -M "$m" true 2>&1)
> > -  if [[ "$result" =~ ${m:0:50} ]] || [[ "$result" =~ "" ]]
> > +  # Skip if there is no access to perf_events monitoring
> > +  # and observability operations
> > +  if [[ "$result" =~ ${m:0:50} ]] || [[ "$result" =~ "" ]] || [[ "$result" =~ "Access to performance monitoring and observability operations is limited" ]]
> >    then
> >      continue
> >    fi
> > --
> > 2.31.1
> >
>
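The skip condition added by the patch can be exercised in isolation. A minimal sketch follows; the `result` string is a stand-in for real `perf stat` output on a restricted system, not captured from a real run, and `grep -q` is used as a portable equivalent of bash's quoted `=~` literal-substring match:

```shell
# Stand-in for: result=$(perf stat -M "$m" true 2>&1) on a restricted system
result="Error:
Access to performance monitoring and observability operations is limited."

# Same substring check the patch adds to stat_all_metrics.sh
if printf '%s' "$result" | grep -q \
     "Access to performance monitoring and observability operations is limited"; then
  status=skip   # treat the metric as skipped, not as a test failure
else
  status=fail
fi
echo "$status"   # prints "skip"
```

Because the right-hand side of bash's `=~` is quoted in the patch, it is matched literally rather than as a regex, which is exactly the substring semantics shown here.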
Re: [PATCH] tools/perf/tests: Fix objdump in Object code reading test to look for all sections
On Mon, Jul 3, 2023 at 10:04 PM Athira Rajeev wrote:
>
> Object code reading test fails intermittently with below logs:
>
>    Reading object code for memory address: 0xc00801dd34fc
>    File is: /lib/modules/6.3.0-rc7+/kernel/fs/xfs/xfs.ko
>    On file address is: 0x11359c
>    Objdump command is: objdump -z -d --start-address=0x1134fc --stop-address=0x11357c /lib/modules/6.3.0-rc7+/kernel/fs/xfs/xfs.ko
>    objdump read too few bytes: 128
>    test child finished with -1
>  end
>    Object code reading: FAILED!
>
> This issue happens randomly depending on the sample ip
> captured during the test run. In some cases, the same ip falls
> in the xfs module. The test does an objdump on the xfs.ko file and
> compares it with the dump from the dso that perf captures. But
> since the range of ip address falls in debug info section, it fails

This is strange.  Why did it fall into the debug section?

Thanks,
Namhyung

> to find the address range with objdump. Fix the objdump option so
> as to disassemble all sections to check the address range.
>
> Signed-off-by: Athira Rajeev
> ---
> tools/perf/tests/code-reading.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/perf/tests/code-reading.c b/tools/perf/tests/code-reading.c
> index ed3815163d1b..02132478856a 100644
> --- a/tools/perf/tests/code-reading.c
> +++ b/tools/perf/tests/code-reading.c
> @@ -184,7 +184,7 @@ static int read_via_objdump(const char *filename, u64 addr, void *buf,
>         FILE *f;
>         int ret;
>
> -       fmt = "%s -z -d --start-address=0x%"PRIx64" --stop-address=0x%"PRIx64" %s";
> +       fmt = "%s -z -D --start-address=0x%"PRIx64" --stop-address=0x%"PRIx64" %s";
>         ret = snprintf(cmd, sizeof(cmd), fmt, "objdump", addr, addr + len,
>                        filename);
>         if (ret <= 0 || (size_t)ret >= sizeof(cmd))
> --
> 2.35.3
>
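The one-character change swaps objdump's `-d` (disassemble only sections expected to contain instructions) for `-D` (disassemble all sections). A small shell sketch of the command the test builds after the fix; the address and module path are illustrative values lifted from the failure log above, and objdump itself is not invoked here:

```shell
addr=0xc00801dd34fc   # sample ip from the failure log (illustrative)
len=128               # number of bytes the test wants to read
ko=/lib/modules/6.3.0-rc7+/kernel/fs/xfs/xfs.ko

# Mirrors the snprintf() in read_via_objdump() after the fix: -D, not -d
cmd=$(printf '%s -z -D --start-address=0x%x --stop-address=0x%x %s' \
      objdump "$addr" "$((addr + len))" "$ko")
echo "$cmd"
```

With `-D`, objdump will produce output even when the requested range lands in a non-executable section, so the test no longer sees "read too few bytes" for such samples.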
Re: [PATCH] tools/perf/tests: Fix Basic BPF llvm compile to check for libbpf support
Hello,

On Mon, Jul 3, 2023 at 10:03 PM Athira Rajeev wrote:
>
> Basic BPF llvm compile fails in systems with libbpf
> that doesn't support BTF. Log shows below information.
>
>    libbpf: BTF is required, but is missing or corrupted.
>    Failed to parse test case 'Basic BPF llvm compile'
>    test child finished with -2
>  end
>
> Here BPF llvm compile fails due to missing BTF support.
> Fix the llvm test to skip the test in case BTF support is
> missing.
>
> Signed-off-by: Athira Rajeev
> ---
> tools/perf/tests/llvm.c | 7 ++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/tests/llvm.c b/tools/perf/tests/llvm.c
> index 0bc25a56cfef..4c73c9eab0bb 100644
> --- a/tools/perf/tests/llvm.c
> +++ b/tools/perf/tests/llvm.c
> @@ -4,6 +4,7 @@
>  #include
>  #include "tests.h"
>  #include "debug.h"
> +#include
>
>  #ifdef HAVE_LIBBPF_SUPPORT
>  #include
> @@ -14,8 +15,12 @@ static int test__bpf_parsing(void *obj_buf, size_t obj_buf_sz)
>         struct bpf_object *obj;
>
>         obj = bpf_object__open_mem(obj_buf, obj_buf_sz, NULL);
> -       if (libbpf_get_error(obj))
> +       if (libbpf_get_error(obj)) {
> +               /* Skip if there is no BTF support */
> +               if (errno == ENOENT)

Can we use the return value of libbpf_get_error() instead of using the
errno?  Also I'm curious why it requires BTF even if the BPF program
doesn't use it.  Is there an option or something for that?

Thanks,
Namhyung

> +                       return TEST_SKIP;
>                 return TEST_FAIL;
> +       }
>         bpf_object__close(obj);
>         return TEST_OK;
>  }
> --
> 2.27.0
>
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Wed, Jul 5, 2023 at 9:14 AM Suren Baghdasaryan wrote:
>
> On Wed, Jul 5, 2023 at 8:49 AM Andrew Morton wrote:
> >
> > On Wed, 5 Jul 2023 10:51:57 +0200 "Linux regression tracking (Thorsten
> > Leemhuis)" wrote:
> >
> > > >>> I'm in wait-a-few-days-mode on this. To see if we have a backportable
> > > >>> fix rather than disabling the feature in -stable.
> > >
> > > Andrew, how long will you remain in "wait-a-few-days-mode"? Given what
> > > Greg said below and that we already had three reports I know of I'd
> > > prefer if we could fix this rather sooner than later in mainline --
> > > especially as Arch Linux and openSUSE Tumbleweed likely have switched to
> > > 6.4.y already or will do so soon.
> >
> > I'll send today's 2-patch series to Linus today or tomorrow.
>
> I need to make a correction to the patch fixing the issue (the first
> one in the series) and will post it within an hour. Thanks!

Corrected patch is posted at
https://lore.kernel.org/all/20230705171213.2843068-2-sur...@google.com/.
The other patch is unchanged but I posted v3 of the entire patchset
anyway. Andrew, could you please replace the old version with this one
before sending it to Linus?
Thanks,
Suren.
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Wed, Jul 5, 2023 at 8:49 AM Andrew Morton wrote:
>
> On Wed, 5 Jul 2023 10:51:57 +0200 "Linux regression tracking (Thorsten
> Leemhuis)" wrote:
>
> > >>> I'm in wait-a-few-days-mode on this. To see if we have a backportable
> > >>> fix rather than disabling the feature in -stable.
> >
> > Andrew, how long will you remain in "wait-a-few-days-mode"? Given what
> > Greg said below and that we already had three reports I know of I'd
> > prefer if we could fix this rather sooner than later in mainline --
> > especially as Arch Linux and openSUSE Tumbleweed likely have switched to
> > 6.4.y already or will do so soon.
>
> I'll send today's 2-patch series to Linus today or tomorrow.

I need to make a correction to the patch fixing the issue (the first
one in the series) and will post it within an hour. Thanks!
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Wed, 5 Jul 2023 10:51:57 +0200 "Linux regression tracking (Thorsten
Leemhuis)" wrote:

> >>> I'm in wait-a-few-days-mode on this. To see if we have a backportable
> >>> fix rather than disabling the feature in -stable.
>
> Andrew, how long will you remain in "wait-a-few-days-mode"? Given what
> Greg said below and that we already had three reports I know of I'd
> prefer if we could fix this rather sooner than later in mainline --
> especially as Arch Linux and openSUSE Tumbleweed likely have switched to
> 6.4.y already or will do so soon.

I'll send today's 2-patch series to Linus today or tomorrow.
[PATCH v4 02/10] cpu/SMT: Move SMT prototypes into cpu_smt.h
From: Michael Ellerman

In order to export the cpuhp_smt_control enum as part of the interface
between generic and architecture code, the architecture code needs to
include asm/topology.h. But that leads to circular header dependencies.
So split the enum and related declarations into a separate header.

Signed-off-by: Michael Ellerman
[ldufour: rewording the commit's description]
Signed-off-by: Laurent Dufour
---
 arch/x86/include/asm/topology.h | 2 ++
 include/linux/cpu.h | 25 +
 include/linux/cpu_smt.h | 29 +
 kernel/cpu.c | 1 +
 4 files changed, 33 insertions(+), 24 deletions(-)
 create mode 100644 include/linux/cpu_smt.h

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index caf41c4869a0..ae49ed4417d0 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -136,6 +136,8 @@ static inline int topology_max_smt_threads(void)
 	return __max_smt_threads;
 }
 
+#include
+
 int topology_update_package_map(unsigned int apicid, unsigned int cpu);
 int topology_update_die_map(unsigned int dieid, unsigned int cpu);
 int topology_phys_to_logical_pkg(unsigned int pkg);
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 6e6e57ec69e8..6b326a9e8191 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 
 struct device;
 struct device_node;
@@ -204,30 +205,6 @@ void cpuhp_report_idle_dead(void);
 static inline void cpuhp_report_idle_dead(void) { }
 #endif /* #ifdef CONFIG_HOTPLUG_CPU */
 
-enum cpuhp_smt_control {
-	CPU_SMT_ENABLED,
-	CPU_SMT_DISABLED,
-	CPU_SMT_FORCE_DISABLED,
-	CPU_SMT_NOT_SUPPORTED,
-	CPU_SMT_NOT_IMPLEMENTED,
-};
-
-#if defined(CONFIG_SMP) && defined(CONFIG_HOTPLUG_SMT)
-extern enum cpuhp_smt_control cpu_smt_control;
-extern void cpu_smt_disable(bool force);
-extern void cpu_smt_check_topology(void);
-extern bool cpu_smt_possible(void);
-extern int cpuhp_smt_enable(void);
-extern int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval);
-#else
-# define cpu_smt_control	(CPU_SMT_NOT_IMPLEMENTED)
-static inline void cpu_smt_disable(bool force) { }
-static inline void cpu_smt_check_topology(void) { }
-static inline bool cpu_smt_possible(void) { return false; }
-static inline int cpuhp_smt_enable(void) { return 0; }
-static inline int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) { return 0; }
-#endif
-
 extern bool cpu_mitigations_off(void);
 extern bool cpu_mitigations_auto_nosmt(void);
 
diff --git a/include/linux/cpu_smt.h b/include/linux/cpu_smt.h
new file mode 100644
index ..722c2e306fef
--- /dev/null
+++ b/include/linux/cpu_smt.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CPU_SMT_H_
+#define _LINUX_CPU_SMT_H_
+
+enum cpuhp_smt_control {
+	CPU_SMT_ENABLED,
+	CPU_SMT_DISABLED,
+	CPU_SMT_FORCE_DISABLED,
+	CPU_SMT_NOT_SUPPORTED,
+	CPU_SMT_NOT_IMPLEMENTED,
+};
+
+#if defined(CONFIG_SMP) && defined(CONFIG_HOTPLUG_SMT)
+extern enum cpuhp_smt_control cpu_smt_control;
+extern void cpu_smt_disable(bool force);
+extern void cpu_smt_check_topology(void);
+extern bool cpu_smt_possible(void);
+extern int cpuhp_smt_enable(void);
+extern int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval);
+#else
+# define cpu_smt_control	(CPU_SMT_NOT_IMPLEMENTED)
+static inline void cpu_smt_disable(bool force) { }
+static inline void cpu_smt_check_topology(void) { }
+static inline bool cpu_smt_possible(void) { return false; }
+static inline int cpuhp_smt_enable(void) { return 0; }
+static inline int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) { return 0; }
+#endif
+
+#endif /* _LINUX_CPU_SMT_H_ */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 03309f2f35a4..e02204c4675a 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -592,6 +592,7 @@ static void lockdep_release_cpus_lock(void)
 void __weak arch_smt_update(void) { }
 
 #ifdef CONFIG_HOTPLUG_SMT
+
 enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED;
 
 void __init cpu_smt_disable(bool force)
-- 
2.41.0
[PATCH] powerpc: dts: add missing space before {
Add missing whitespace between node name/label and opening {.

Signed-off-by: Krzysztof Kozlowski
---
 arch/powerpc/boot/dts/fsl/c293si-post.dtsi  | 12 ++--
 arch/powerpc/boot/dts/fsl/p1022rdk.dts      | 10 +-
 arch/powerpc/boot/dts/fsl/p1022si-post.dtsi |  2 +-
 arch/powerpc/boot/dts/fsl/p3041ds.dts       |  4 ++--
 arch/powerpc/boot/dts/fsl/p5040ds.dts       |  2 +-
 arch/powerpc/boot/dts/fsl/t4240qds.dts      |  2 +-
 arch/powerpc/boot/dts/mpc5121.dtsi          |  2 +-
 arch/powerpc/boot/dts/mpc5125twr.dts        |  2 +-
 8 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/c293si-post.dtsi b/arch/powerpc/boot/dts/fsl/c293si-post.dtsi
index bec0fc36849d..f208fb8f64b3 100644
--- a/arch/powerpc/boot/dts/fsl/c293si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/c293si-post.dtsi
@@ -124,10 +124,10 @@ crypto@8 {
 		reg = <0x8 0x2>;
 		ranges = <0x0 0x8 0x2>;
 
-		jr@1000{
+		jr@1000 {
 			interrupts = <45 2 0 0>;
 		};
-		jr@2000{
+		jr@2000 {
 			interrupts = <57 2 0 0>;
 		};
 	};
@@ -140,10 +140,10 @@ crypto@a {
 		reg = <0xa 0x2>;
 		ranges = <0x0 0xa 0x2>;
 
-		jr@1000{
+		jr@1000 {
 			interrupts = <49 2 0 0>;
 		};
-		jr@2000{
+		jr@2000 {
 			interrupts = <50 2 0 0>;
 		};
 	};
@@ -156,10 +156,10 @@ crypto@c {
 		reg = <0xc 0x2>;
 		ranges = <0x0 0xc 0x2>;
 
-		jr@1000{
+		jr@1000 {
 			interrupts = <55 2 0 0>;
 		};
-		jr@2000{
+		jr@2000 {
 			interrupts = <56 2 0 0>;
 		};
 	};
diff --git a/arch/powerpc/boot/dts/fsl/p1022rdk.dts b/arch/powerpc/boot/dts/fsl/p1022rdk.dts
index 29e8af1e3711..4261c2f7e4b3 100644
--- a/arch/powerpc/boot/dts/fsl/p1022rdk.dts
+++ b/arch/powerpc/boot/dts/fsl/p1022rdk.dts
@@ -60,23 +60,23 @@ rtc@68 {
 		compatible = "st,m41t62";
 		reg = <0x68>;
 	};
-	adt7461@4c{
+	adt7461@4c {
 		compatible = "adi,adt7461";
 		reg = <0x4c>;
 	};
-	zl6100@21{
+	zl6100@21 {
 		compatible = "isil,zl6100";
 		reg = <0x21>;
 	};
-	zl6100@24{
+	zl6100@24 {
 		compatible = "isil,zl6100";
 		reg = <0x24>;
 	};
-	zl6100@26{
+	zl6100@26 {
 		compatible = "isil,zl6100";
 		reg = <0x26>;
 	};
-	zl6100@29{
+	zl6100@29 {
 		compatible = "isil,zl6100";
 		reg = <0x29>;
 	};
diff --git a/arch/powerpc/boot/dts/fsl/p1022si-post.dtsi b/arch/powerpc/boot/dts/fsl/p1022si-post.dtsi
index 5f51b7bfc064..093e4e3ed368 100644
--- a/arch/powerpc/boot/dts/fsl/p1022si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/p1022si-post.dtsi
@@ -238,7 +238,7 @@ global-utilities@e {
 		fsl,has-rstcr;
 	};
 
-	power@e0070{
+	power@e0070 {
 		compatible = "fsl,mpc8536-pmc", "fsl,mpc8548-pmc";
 		reg = <0xe0070 0x20>;
 	};
diff --git a/arch/powerpc/boot/dts/fsl/p3041ds.dts b/arch/powerpc/boot/dts/fsl/p3041ds.dts
index 6f5f7283c533..ca0e0272ac62 100644
--- a/arch/powerpc/boot/dts/fsl/p3041ds.dts
+++ b/arch/powerpc/boot/dts/fsl/p3041ds.dts
@@ -41,7 +41,7 @@ / {
 	#size-cells = <2>;
 	interrupt-parent = <&mpic>;
 
-	aliases{
+	aliases {
 		phy_rgmii_0 = &phy_rgmii_0;
 		phy_rgmii_1 = &phy_rgmii_1;
 		phy_sgmii_1c = &phy_sgmii_1c;
@@ -165,7 +165,7 @@ adt7461@4c {
 		};
 	};
 
-	fman@40{
+	fman@40 {
 		ethernet@e {
 			phy-handle = <&phy_sgmii_1c>;
 			phy-connection-type = "sgmii";
diff --git a/arch/powerpc/boot/dts/fsl/p5040ds.dts b/arch/powerpc/boot/dts/fsl/p5040ds.dts
index 30850b3228e0..5cfc689ee474 100644
--- a/arch/powerpc/boot/dts/fsl/p5040ds.dts
+++ b/arch/powerpc/boot/dts/fsl/p5040ds.dts
@@ -41,7 +41,7 @@ / {
 	#size-cells = <2>;
 	interrupt-parent = <&mpic>;
 
-	aliases{
+	aliases {
[PATCH v4 10/10] powerpc/pseries: Honour current SMT state when DLPAR onlining CPUs
From: Michael Ellerman

Integrate with the generic SMT support, so that when a CPU is DLPAR
onlined it is brought up with the correct SMT mode.

Signed-off-by: Michael Ellerman
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 8
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 61fb7cb00880..e62835a12d73 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -398,6 +398,14 @@ static int dlpar_online_cpu(struct device_node *dn)
 	for_each_present_cpu(cpu) {
 		if (get_hard_smp_processor_id(cpu) != thread)
 			continue;
+
+		if (!topology_is_primary_thread(cpu)) {
+			if (cpu_smt_control != CPU_SMT_ENABLED)
+				break;
+			if (!topology_smt_thread_allowed(cpu))
+				break;
+		}
+
 		cpu_maps_update_done();
 		find_and_update_cpu_nid(cpu);
 		rc = device_online(get_cpu_device(cpu));
-- 
2.41.0
[PATCH v4 09/10] powerpc: Add HOTPLUG_SMT support
From: Michael Ellerman

Add support for HOTPLUG_SMT, which enables the generic sysfs SMT support
files in /sys/devices/system/cpu/smt, as well as the "nosmt" boot
parameter.

Implement the recently added hooks to allow partial SMT states, allow
any number of threads per core.

Tie the config symbol to HOTPLUG_CPU, which enables it on the major
platforms that support SMT. If there are other platforms that want the
SMT support that can be tweaked in future.

Signed-off-by: Michael Ellerman
[ldufour: pass current SMT level to cpu_smt_set_num_threads]
[ldufour: remove topology_smt_supported]
[ldufour: remove topology_smt_threads_supported]
[ldufour: select CONFIG_SMT_NUM_THREADS_DYNAMIC]
[ldufour: update kernel-parameters.txt]
Signed-off-by: Laurent Dufour
---
 Documentation/admin-guide/kernel-parameters.txt |  4 ++--
 arch/powerpc/Kconfig                            |  2 ++
 arch/powerpc/include/asm/topology.h             | 15 +++
 arch/powerpc/kernel/smp.c                       |  8 +++-
 4 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 44bcaf791ce6..979f9bad59da 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3853,10 +3853,10 @@
 	nosmp		[SMP] Tells an SMP kernel to act as a UP kernel,
 			and disable the IO APIC.  legacy for "maxcpus=0".
 
-	nosmt		[KNL,MIPS,S390] Disable symmetric multithreading (SMT).
+	nosmt		[KNL,MIPS,S390, PPC] Disable symmetric multithreading (SMT).
 			Equivalent to smt=1.
 
-			[KNL,X86] Disable symmetric multithreading (SMT).
+			[KNL,X86,PPC] Disable symmetric multithreading (SMT).
 			nosmt=force: Force disable SMT, cannot be undone
 				     via the sysfs control file.
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 0b1172cbeccb..aef38d2ca542 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -272,6 +272,8 @@ config PPC
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_VIRT_CPU_ACCOUNTING
 	select HAVE_VIRT_CPU_ACCOUNTING_GEN
+	select HOTPLUG_SMT			if HOTPLUG_CPU
+	select SMT_NUM_THREADS_DYNAMIC
 	select HUGETLB_PAGE_SIZE_VARIABLE	if PPC_BOOK3S_64 && HUGETLB_PAGE
 	select IOMMU_HELPER			if PPC64
 	select IRQ_DOMAIN
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 8a4d4f4d9749..f4e6f2dd04b7 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -143,5 +143,20 @@ static inline int cpu_to_coregroup_id(int cpu)
 #endif
 #endif
 
+#ifdef CONFIG_HOTPLUG_SMT
+#include
+#include
+
+static inline bool topology_is_primary_thread(unsigned int cpu)
+{
+	return cpu == cpu_first_thread_sibling(cpu);
+}
+
+static inline bool topology_smt_thread_allowed(unsigned int cpu)
+{
+	return cpu_thread_in_core(cpu) < cpu_smt_num_threads;
+}
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_TOPOLOGY_H */
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index fbbb695bae3d..b9f0f8f11c37 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1087,7 +1087,7 @@ static int __init init_big_cores(void)
 
 void __init smp_prepare_cpus(unsigned int max_cpus)
 {
-	unsigned int cpu;
+	unsigned int cpu, num_threads;
 
 	DBG("smp_prepare_cpus\n");
 
@@ -1154,6 +1154,12 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 
 	if (smp_ops && smp_ops->probe)
 		smp_ops->probe();
+
+	// Initialise the generic SMT topology support
+	num_threads = 1;
+	if (smt_enabled_at_boot)
+		num_threads = smt_enabled_at_boot;
+	cpu_smt_set_num_threads(num_threads, threads_per_core);
 }
 
 void smp_prepare_boot_cpu(void)
-- 
2.41.0
[PATCH v4 07/10] cpu/SMT: Allow enabling partial SMT states via sysfs
From: Michael Ellerman

Add support to the /sys/devices/system/cpu/smt/control interface for
enabling a specified number of SMT threads per core, including partial
SMT states where not all threads are brought online.

The current interface accepts "on" and "off", to enable either 1 or
all SMT threads per core.

This commit allows writing an integer, between 1 and the number of SMT
threads supported by the machine. Writing 1 is a synonym for "off", 2
or more enables SMT with the specified number of threads.

When reading the file, if all threads are online "on" is returned, to
avoid changing behaviour for existing users. If some other number of
threads is online then the integer value is returned.

Architectures like x86 only supporting 1 thread or all threads should
not define CONFIG_SMT_NUM_THREADS_DYNAMIC. Architectures supporting
partial SMT states, like PowerPC, should define it.

Signed-off-by: Michael Ellerman
[ldufour: slightly reword the commit's description]
[ldufour: remove switch() in __store_smt_control()]
Reported-by: kernel test robot
Closes: https://lore.kernel.org/oe-kbuild-all/202306282340.ihqm0fla-...@intel.com/
[ldufour: fix build issue in control_show()]
Signed-off-by: Laurent Dufour
---
 .../ABI/testing/sysfs-devices-system-cpu |  1 +
 kernel/cpu.c                             | 60 ++-
 2 files changed, 45 insertions(+), 16 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index ecd585ca2d50..6dba65fb1956 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -555,6 +555,7 @@ Description:	Control Symmetric Multi Threading (SMT)
 			=
 			"on"		 SMT is enabled
 			"off"		 SMT is disabled
+			"<N>"		 SMT is enabled with N threads per core.
 			"forceoff"	 SMT is force disabled. Cannot be changed.
 			"notsupported"	 SMT is not supported by the CPU
 			"notimplemented" SMT runtime toggling is not
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 9a8d0685e055..7e8f1b044772 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2876,11 +2876,19 @@ static const struct attribute_group cpuhp_cpu_root_attr_group = {
 
 #ifdef CONFIG_HOTPLUG_SMT
 
+static bool cpu_smt_num_threads_valid(unsigned int threads)
+{
+	if (IS_ENABLED(CONFIG_SMT_NUM_THREADS_DYNAMIC))
+		return threads >= 1 && threads <= cpu_smt_max_threads;
+	return threads == 1 || threads == cpu_smt_max_threads;
+}
+
 static ssize_t
 __store_smt_control(struct device *dev, struct device_attribute *attr,
 		    const char *buf, size_t count)
 {
-	int ctrlval, ret;
+	int ctrlval, ret, num_threads, orig_threads;
+	bool force_off;
 
 	if (cpu_smt_control == CPU_SMT_FORCE_DISABLED)
 		return -EPERM;
@@ -2888,30 +2896,39 @@ __store_smt_control(struct device *dev, struct device_attribute *attr,
 	if (cpu_smt_control == CPU_SMT_NOT_SUPPORTED)
 		return -ENODEV;
 
-	if (sysfs_streq(buf, "on"))
+	if (sysfs_streq(buf, "on")) {
 		ctrlval = CPU_SMT_ENABLED;
-	else if (sysfs_streq(buf, "off"))
+		num_threads = cpu_smt_max_threads;
+	} else if (sysfs_streq(buf, "off")) {
 		ctrlval = CPU_SMT_DISABLED;
-	else if (sysfs_streq(buf, "forceoff"))
+		num_threads = 1;
+	} else if (sysfs_streq(buf, "forceoff")) {
 		ctrlval = CPU_SMT_FORCE_DISABLED;
-	else
+		num_threads = 1;
+	} else if (kstrtoint(buf, 10, &num_threads) == 0) {
+		if (num_threads == 1)
+			ctrlval = CPU_SMT_DISABLED;
+		else if (cpu_smt_num_threads_valid(num_threads))
+			ctrlval = CPU_SMT_ENABLED;
+		else
+			return -EINVAL;
+	} else {
 		return -EINVAL;
+	}
 
 	ret = lock_device_hotplug_sysfs();
 	if (ret)
 		return ret;
 
-	if (ctrlval != cpu_smt_control) {
-		switch (ctrlval) {
-		case CPU_SMT_ENABLED:
-			ret = cpuhp_smt_enable();
-			break;
-		case CPU_SMT_DISABLED:
-		case CPU_SMT_FORCE_DISABLED:
-			ret = cpuhp_smt_disable(ctrlval);
-			break;
-		}
-	}
+	orig_threads = cpu_smt_num_threads;
+	cpu_smt_num_threads = num_threads;
+
+	force_off = ctrlval != cpu_smt_control && ctrlval == CPU_SMT_FORCE_DISABLED;
+
+	if (num_threads > orig_threads)
+		ret = cpuhp_smt_enable();
+	else if (num_threads < orig_threads || fo
[PATCH v4 06/10] cpu/SMT: Create topology_smt_thread_allowed()
From: Michael Ellerman Some architectures allow partial SMT states, ie. when not all SMT threads are brought online. To support that, add an architecture helper which checks whether a given CPU is allowed to be brought online depending on how many SMT threads are currently enabled. Since this is only applicable to architectures supporting partial SMT, only these architectures should select the new configuration variable CONFIG_SMT_NUM_THREADS_DYNAMIC. For the other architectures, which do not support partial SMT states, there is no need to define topology_smt_thread_allowed(); the generic code assumes that either all the threads are allowed or only the primary ones. Call the helper from cpuhp_smt_enable(), and from cpu_smt_allowed() when SMT is enabled, to check if the particular thread should be onlined. Notably, also call it from cpuhp_smt_disable() if CPU_SMT_ENABLED, to allow offlining some threads to move from a higher to a lower number of threads online. Signed-off-by: Michael Ellerman Suggested-by: Thomas Gleixner [ldufour: slightly reword the commit's description] [ldufour: introduce CONFIG_SMT_NUM_THREADS_DYNAMIC] Signed-off-by: Laurent Dufour --- arch/Kconfig | 3 +++ kernel/cpu.c | 24 +++- 2 files changed, 26 insertions(+), 1 deletion(-) diff --git a/arch/Kconfig b/arch/Kconfig index aff2746c8af2..63c5d6a2022b 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -34,6 +34,9 @@ config ARCH_HAS_SUBPAGE_FAULTS config HOTPLUG_SMT bool +config SMT_NUM_THREADS_DYNAMIC + bool + # Selected by HOTPLUG_CORE_SYNC_DEAD or HOTPLUG_CORE_SYNC_FULL config HOTPLUG_CORE_SYNC bool diff --git a/kernel/cpu.c b/kernel/cpu.c index 70add058e77b..9a8d0685e055 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -645,9 +645,23 @@ static int __init smt_cmdline_disable(char *str) } early_param("nosmt", smt_cmdline_disable); +/* + * For architectures supporting partial SMT states, check if the thread is + * allowed. Otherwise this has already been checked through cpu_smt_max_threads + * when setting the SMT level. 
+ */ +static inline bool cpu_smt_thread_allowed(unsigned int cpu) +{ +#ifdef CONFIG_SMT_NUM_THREADS_DYNAMIC + return topology_smt_thread_allowed(cpu); +#else + return true; +#endif +} + static inline bool cpu_smt_allowed(unsigned int cpu) { - if (cpu_smt_control == CPU_SMT_ENABLED) + if (cpu_smt_control == CPU_SMT_ENABLED && cpu_smt_thread_allowed(cpu)) return true; if (topology_is_primary_thread(cpu)) @@ -2642,6 +2656,12 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) for_each_online_cpu(cpu) { if (topology_is_primary_thread(cpu)) continue; + /* +* Disable can be called with CPU_SMT_ENABLED when changing +* from a higher to lower number of SMT threads per core. +*/ + if (ctrlval == CPU_SMT_ENABLED && cpu_smt_thread_allowed(cpu)) + continue; ret = cpu_down_maps_locked(cpu, CPUHP_OFFLINE); if (ret) break; @@ -2676,6 +2696,8 @@ int cpuhp_smt_enable(void) /* Skip online CPUs and CPUs on offline nodes */ if (cpu_online(cpu) || !node_online(cpu_to_node(cpu))) continue; + if (!cpu_smt_thread_allowed(cpu)) + continue; ret = _cpu_up(cpu, 0, CPUHP_ONLINE); if (ret) break; -- 2.41.0
[PATCH v4 08/10] powerpc/pseries: Initialise CPU hotplug callbacks earlier
From: Michael Ellerman As part of the generic HOTPLUG_SMT code, there is support for disabling secondary SMT threads at boot time, by passing "nosmt" on the kernel command line. The way that is implemented is the secondary threads are brought partly online, and then taken back offline again. That is done to support x86 CPUs needing certain initialisation done on all threads. However powerpc has similar needs, see commit d70a54e2d085 ("powerpc/powernv: Ignore smt-enabled on Power8 and later"). For that to work the powerpc CPU hotplug callbacks need to be registered before secondary CPUs are brought online, otherwise __cpu_disable() fails due to smp_ops->cpu_disable being NULL. So split the basic initialisation into pseries_cpu_hotplug_init() which can be called early from setup_arch(). The DLPAR related initialisation can still be done later, because it needs to do allocations. Signed-off-by: Michael Ellerman --- arch/powerpc/platforms/pseries/hotplug-cpu.c | 22 arch/powerpc/platforms/pseries/pseries.h | 2 ++ arch/powerpc/platforms/pseries/setup.c | 2 ++ 3 files changed, 17 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index 1a3cb313976a..61fb7cb00880 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -845,15 +845,9 @@ static struct notifier_block pseries_smp_nb = { .notifier_call = pseries_smp_notifier, }; -static int __init pseries_cpu_hotplug_init(void) +void __init pseries_cpu_hotplug_init(void) { int qcss_tok; - unsigned int node; - -#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE - ppc_md.cpu_probe = dlpar_cpu_probe; - ppc_md.cpu_release = dlpar_cpu_release; -#endif /* CONFIG_ARCH_CPU_PROBE_RELEASE */ rtas_stop_self_token = rtas_function_token(RTAS_FN_STOP_SELF); qcss_tok = rtas_function_token(RTAS_FN_QUERY_CPU_STOPPED_STATE); @@ -862,12 +856,22 @@ static int __init pseries_cpu_hotplug_init(void) qcss_tok == 
RTAS_UNKNOWN_SERVICE) { printk(KERN_INFO "CPU Hotplug not supported by firmware " "- disabling.\n"); - return 0; + return; } smp_ops->cpu_offline_self = pseries_cpu_offline_self; smp_ops->cpu_disable = pseries_cpu_disable; smp_ops->cpu_die = pseries_cpu_die; +} + +static int __init pseries_dlpar_init(void) +{ + unsigned int node; + +#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE + ppc_md.cpu_probe = dlpar_cpu_probe; + ppc_md.cpu_release = dlpar_cpu_release; +#endif /* CONFIG_ARCH_CPU_PROBE_RELEASE */ /* Processors can be added/removed only on LPAR */ if (firmware_has_feature(FW_FEATURE_LPAR)) { @@ -886,4 +890,4 @@ static int __init pseries_cpu_hotplug_init(void) return 0; } -machine_arch_initcall(pseries, pseries_cpu_hotplug_init); +machine_arch_initcall(pseries, pseries_dlpar_init); diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h index f8bce40ebd0c..f8893ba46e83 100644 --- a/arch/powerpc/platforms/pseries/pseries.h +++ b/arch/powerpc/platforms/pseries/pseries.h @@ -75,11 +75,13 @@ static inline int dlpar_hp_pmem(struct pseries_hp_errorlog *hp_elog) #ifdef CONFIG_HOTPLUG_CPU int dlpar_cpu(struct pseries_hp_errorlog *hp_elog); +void pseries_cpu_hotplug_init(void); #else static inline int dlpar_cpu(struct pseries_hp_errorlog *hp_elog) { return -EOPNOTSUPP; } +static inline void pseries_cpu_hotplug_init(void) { } #endif /* PCI root bridge prepare function override for pseries */ diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index e2a57cfa6c83..41451b76c6e5 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -816,6 +816,8 @@ static void __init pSeries_setup_arch(void) /* Discover PIC type and setup ppc_md accordingly */ smp_init_pseries(); + // Setup CPU hotplug callbacks + pseries_cpu_hotplug_init(); if (radix_enabled() && !mmu_has_feature(MMU_FTR_GTSE)) if (!firmware_has_feature(FW_FEATURE_RPT_INVALIDATE)) -- 2.41.0
[PATCH v4 05/10] cpu/SMT: Remove topology_smt_supported()
Since the maximum number of threads is now passed to cpu_smt_set_num_threads(), checking that value is enough to know if SMT is supported. Cc: Michael Ellerman Suggested-by: Thomas Gleixner Signed-off-by: Laurent Dufour --- arch/x86/include/asm/topology.h | 2 -- arch/x86/kernel/smpboot.c | 8 kernel/cpu.c| 4 ++-- 3 files changed, 2 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index ae49ed4417d0..3235ba1e5b06 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -141,7 +141,6 @@ static inline int topology_max_smt_threads(void) int topology_update_package_map(unsigned int apicid, unsigned int cpu); int topology_update_die_map(unsigned int dieid, unsigned int cpu); int topology_phys_to_logical_pkg(unsigned int pkg); -bool topology_smt_supported(void); extern struct cpumask __cpu_primary_thread_mask; #define cpu_primary_thread_mask ((const struct cpumask *)&__cpu_primary_thread_mask) @@ -164,7 +163,6 @@ static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; } static inline int topology_max_die_per_package(void) { return 1; } static inline int topology_max_smt_threads(void) { return 1; } static inline bool topology_is_primary_thread(unsigned int cpu) { return true; } -static inline bool topology_smt_supported(void) { return false; } #endif /* !CONFIG_SMP */ static inline void arch_fix_phys_package_id(int num, u32 slot) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index ed2d51960a7d..f8e709fd2cd5 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -326,14 +326,6 @@ static void notrace start_secondary(void *unused) cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); } -/** - * topology_smt_supported - Check whether SMT is supported by the CPUs - */ -bool topology_smt_supported(void) -{ - return smp_num_siblings > 1; -} - /** * topology_phys_to_logical_pkg - Map a physical package id to a logical * @phys_pkg: The physical 
package id to map diff --git a/kernel/cpu.c b/kernel/cpu.c index d7dd535cb5b5..70add058e77b 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -621,7 +621,7 @@ void __init cpu_smt_set_num_threads(unsigned int num_threads, { WARN_ON(!num_threads || (num_threads > max_threads)); - if (!topology_smt_supported()) + if (max_threads == 1) cpu_smt_control = CPU_SMT_NOT_SUPPORTED; cpu_smt_max_threads = max_threads; @@ -1801,7 +1801,7 @@ early_param("cpuhp.parallel", parallel_bringup_parse_param); static inline bool cpuhp_smt_aware(void) { - return topology_smt_supported(); + return cpu_smt_max_threads > 1; } static inline const struct cpumask *cpuhp_get_primary_thread_mask(void) -- 2.41.0
[PATCH v4 03/10] cpu/SMT: Move smt/control simple exit cases earlier
From: Michael Ellerman Move the simple exit cases, ie. which don't depend on the value written, earlier in the function. That makes it clearer that regardless of the input those states can not be transitioned out of. That does have a user-visible effect, in that the error returned will now always be EPERM/ENODEV for those states, regardless of the value written. Previously writing an invalid value would return EINVAL even when in those states. Signed-off-by: Michael Ellerman --- kernel/cpu.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/kernel/cpu.c b/kernel/cpu.c index e02204c4675a..b6fe170c93e9 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -2841,6 +2841,12 @@ __store_smt_control(struct device *dev, struct device_attribute *attr, { int ctrlval, ret; + if (cpu_smt_control == CPU_SMT_FORCE_DISABLED) + return -EPERM; + + if (cpu_smt_control == CPU_SMT_NOT_SUPPORTED) + return -ENODEV; + if (sysfs_streq(buf, "on")) ctrlval = CPU_SMT_ENABLED; else if (sysfs_streq(buf, "off")) @@ -2850,12 +2856,6 @@ __store_smt_control(struct device *dev, struct device_attribute *attr, else return -EINVAL; - if (cpu_smt_control == CPU_SMT_FORCE_DISABLED) - return -EPERM; - - if (cpu_smt_control == CPU_SMT_NOT_SUPPORTED) - return -ENODEV; - ret = lock_device_hotplug_sysfs(); if (ret) return ret; -- 2.41.0
[PATCH v4 00/10] Introduce SMT level and add PowerPC support
I'm taking over the series Michael sent previously [1], which smartly reworks the initial series I sent [2]. This series addresses the comments Thomas and I sent on Michael's one. Here is a short introduction to the issue this series is addressing: When a new CPU is added, the kernel activates all its threads. This leads to weird, but functional, results when adding a CPU on an SMT 4 system for instance. Here the newly added CPU 1 has 8 threads while the other one has 4 threads active (the system has been booted with the 'smt-enabled=4' kernel option): ltcden3-lp12:~ # ppc64_cpu --info Core 0: 0* 1* 2* 3* 4 5 6 7 Core 1: 8* 9* 10* 11* 12* 13* 14* 15* This mixed SMT level may confuse end users and/or some applications. There is no SMT level recorded in the kernel (common code), nor in user space, as far as I know. Such a level is helpful when adding a new CPU or when optimizing the energy efficiency (when reactivating CPUs). When SMP and HOTPLUG_SMT are defined, this series adds a new SMT level (cpu_smt_num_threads) and a few callbacks allowing the architecture code to control this value, setting a max and an "at boot" level, and controlling whether a thread should be onlined or not. 
v4: Rebase on top of 6.5's updates Remove a dependency on the x86 symbol cpu_primary_thread_mask v3: Fix a build error in the patch 6/9 v2: As Thomas suggested, Reword some commit descriptions Remove topology_smt_supported() Remove topology_smt_threads_supported() Introduce CONFIG_SMT_NUM_THREADS_DYNAMIC Remove switch() in __store_smt_control() Update kernel-parameters.txt [1] https://lore.kernel.org/linuxppc-dev/20230524155630.794584-1-...@ellerman.id.au/ [2] https://lore.kernel.org/linuxppc-dev/20230331153905.31698-1-lduf...@linux.ibm.com/ Laurent Dufour (2): cpu/hotplug: remove dependency against cpu_primary_thread_mask cpu/SMT: Remove topology_smt_supported() Michael Ellerman (8): cpu/SMT: Move SMT prototypes into cpu_smt.h cpu/SMT: Move smt/control simple exit cases earlier cpu/SMT: Store the current/max number of threads cpu/SMT: Create topology_smt_thread_allowed() cpu/SMT: Allow enabling partial SMT states via sysfs powerpc/pseries: Initialise CPU hotplug callbacks earlier powerpc: Add HOTPLUG_SMT support powerpc/pseries: Honour current SMT state when DLPAR onlining CPUs .../ABI/testing/sysfs-devices-system-cpu | 1 + .../admin-guide/kernel-parameters.txt | 4 +- arch/Kconfig | 3 + arch/powerpc/Kconfig | 2 + arch/powerpc/include/asm/topology.h | 15 ++ arch/powerpc/kernel/smp.c | 8 +- arch/powerpc/platforms/pseries/hotplug-cpu.c | 30 ++-- arch/powerpc/platforms/pseries/pseries.h | 2 + arch/powerpc/platforms/pseries/setup.c| 2 + arch/x86/include/asm/topology.h | 4 +- arch/x86/kernel/cpu/common.c | 2 +- arch/x86/kernel/smpboot.c | 8 - include/linux/cpu.h | 25 +-- include/linux/cpu_smt.h | 33 kernel/cpu.c | 142 +- 15 files changed, 196 insertions(+), 85 deletions(-) create mode 100644 include/linux/cpu_smt.h -- 2.41.0
[PATCH v4 04/10] cpu/SMT: Store the current/max number of threads
From: Michael Ellerman Some architectures allow partial SMT states at boot time, ie. when not all SMT threads are brought online. To support that, the SMT code needs to know the maximum number of SMT threads, and also the currently configured number. The architecture code knows the max number of threads, so have the architecture code pass that value to cpu_smt_set_num_threads(). Note that although topology_max_smt_threads() exists, it is not configured early enough to be used here. As architectures like PowerPC allow the number of threads to be set via the kernel command line, also pass that value. Signed-off-by: Michael Ellerman [ldufour: slightly reword the commit message] [ldufour: rename cpu_smt_check_topology and add a num_threads argument] Signed-off-by: Laurent Dufour --- arch/x86/kernel/cpu/common.c | 2 +- include/linux/cpu_smt.h | 8 ++-- kernel/cpu.c | 21 - 3 files changed, 27 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 52683fddafaf..12a48a85da3d 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -2317,7 +2317,7 @@ void __init arch_cpu_finalize_init(void) * identify_boot_cpu() initialized SMT support information, let the * core code know. 
*/ - cpu_smt_check_topology(); + cpu_smt_set_num_threads(smp_num_siblings, smp_num_siblings); if (!IS_ENABLED(CONFIG_SMP)) { pr_info("CPU: "); diff --git a/include/linux/cpu_smt.h b/include/linux/cpu_smt.h index 722c2e306fef..0c1664294b57 100644 --- a/include/linux/cpu_smt.h +++ b/include/linux/cpu_smt.h @@ -12,15 +12,19 @@ enum cpuhp_smt_control { #if defined(CONFIG_SMP) && defined(CONFIG_HOTPLUG_SMT) extern enum cpuhp_smt_control cpu_smt_control; +extern unsigned int cpu_smt_num_threads; extern void cpu_smt_disable(bool force); -extern void cpu_smt_check_topology(void); +extern void cpu_smt_set_num_threads(unsigned int num_threads, + unsigned int max_threads); extern bool cpu_smt_possible(void); extern int cpuhp_smt_enable(void); extern int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval); #else # define cpu_smt_control (CPU_SMT_NOT_IMPLEMENTED) +# define cpu_smt_num_threads 1 static inline void cpu_smt_disable(bool force) { } -static inline void cpu_smt_check_topology(void) { } +static inline void cpu_smt_set_num_threads(unsigned int num_threads, + unsigned int max_threads) { } static inline bool cpu_smt_possible(void) { return false; } static inline int cpuhp_smt_enable(void) { return 0; } static inline int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) { return 0; } diff --git a/kernel/cpu.c b/kernel/cpu.c index b6fe170c93e9..d7dd535cb5b5 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -594,6 +594,8 @@ void __weak arch_smt_update(void) { } #ifdef CONFIG_HOTPLUG_SMT enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED; +static unsigned int cpu_smt_max_threads __ro_after_init; +unsigned int cpu_smt_num_threads __read_mostly = UINT_MAX; void __init cpu_smt_disable(bool force) { @@ -607,16 +609,33 @@ void __init cpu_smt_disable(bool force) pr_info("SMT: disabled\n"); cpu_smt_control = CPU_SMT_DISABLED; } + cpu_smt_num_threads = 1; } /* * The decision whether SMT is supported can only be done after the full * CPU identification. 
Called from architecture code. */ -void __init cpu_smt_check_topology(void) +void __init cpu_smt_set_num_threads(unsigned int num_threads, + unsigned int max_threads) { + WARN_ON(!num_threads || (num_threads > max_threads)); + if (!topology_smt_supported()) cpu_smt_control = CPU_SMT_NOT_SUPPORTED; + + cpu_smt_max_threads = max_threads; + + /* +* If SMT has been disabled via the kernel command line or SMT is +* not supported, set cpu_smt_num_threads to 1 for consistency. +* If enabled, take the architecture requested number of threads +* to bring up into account. +*/ + if (cpu_smt_control != CPU_SMT_ENABLED) + cpu_smt_num_threads = 1; + else if (num_threads < cpu_smt_num_threads) + cpu_smt_num_threads = num_threads; } static int __init smt_cmdline_disable(char *str) -- 2.41.0
[PATCH v4 01/10] cpu/hotplug: remove dependency against cpu_primary_thread_mask
Commit 18415f33e2ac ("cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE") introduced a dependency on the global variable cpu_primary_thread_mask exported by the x86 code. This variable is only used when CONFIG_HOTPLUG_PARALLEL is set. Since cpuhp_get_primary_thread_mask() and cpuhp_smt_aware() are only used when CONFIG_HOTPLUG_PARALLEL is set, don't define them when it is not set. There is no functional change introduced by this patch. Cc: Thomas Gleixner Signed-off-by: Laurent Dufour --- kernel/cpu.c | 24 ++-- 1 file changed, 10 insertions(+), 14 deletions(-) diff --git a/kernel/cpu.c b/kernel/cpu.c index 88a7ede322bd..03309f2f35a4 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -650,22 +650,8 @@ bool cpu_smt_possible(void) } EXPORT_SYMBOL_GPL(cpu_smt_possible); -static inline bool cpuhp_smt_aware(void) -{ - return topology_smt_supported(); -} - -static inline const struct cpumask *cpuhp_get_primary_thread_mask(void) -{ - return cpu_primary_thread_mask; -} #else static inline bool cpu_smt_allowed(unsigned int cpu) { return true; } -static inline bool cpuhp_smt_aware(void) { return false; } -static inline const struct cpumask *cpuhp_get_primary_thread_mask(void) -{ - return cpu_present_mask; -} #endif static inline enum cpuhp_state @@ -1793,6 +1779,16 @@ static int __init parallel_bringup_parse_param(char *arg) } early_param("cpuhp.parallel", parallel_bringup_parse_param); +static inline bool cpuhp_smt_aware(void) +{ + return topology_smt_supported(); +} + +static inline const struct cpumask *cpuhp_get_primary_thread_mask(void) +{ + return cpu_primary_thread_mask; +} + /* * On architectures which have enabled parallel bringup this invokes all BP * prepare states for each of the to be onlined APs first. The last state -- 2.41.0
Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
> > @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, > > struct kvm_mmu *mmu, > > */ > > static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page > > *sp, int i) > > { > > - bool host_writable; > > + bool host_writable, is_refcounted; > > gpa_t first_pte_gpa; > > u64 *sptep, spte; > > struct kvm_memory_slot *slot; > > @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, > > struct kvm_mmu_page *sp, int > > sptep = &sp->spt[i]; > > spte = *sptep; > > host_writable = spte & shadow_host_writable_mask; > > + // TODO: is this correct? > > + is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED; > > slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn); > > make_spte(vcpu, sp, slot, pte_access, gfn, > > spte_to_pfn(spte), spte, true, false, > > - host_writable, &spte); > > + host_writable, is_refcounted, &spte); > > Could we restrict that a non-refcounted page shall not be used as shadow page? Oh, sorry. It's not about shadow page. It's about guest page being mapped as not refcounted. Silly me... B.R. Yu
Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
On Tue, 4 Jul 2023 16:50:48 +0900 David Stevens wrote: > From: David Stevens > > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows > callers to resolve a gfn when the associated pfn has a valid struct page > that isn't being actively refcounted (e.g. tail pages of non-compound > higher order pages). For a caller to safely omit FOLL_GET, all usages of > the returned pfn must be guarded by a mmu notifier. > > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that > is set when the returned pfn has an associated struct page with a valid > refcount. Callers that don't pass FOLL_GET should remember this value > and use it to avoid places like kvm_is_ad_tracked_page that assume a > non-zero refcount. > > Signed-off-by: David Stevens > --- > include/linux/kvm_host.h | 10 ++ > virt/kvm/kvm_main.c | 67 +--- > virt/kvm/pfncache.c | 2 +- > 3 files changed, 47 insertions(+), 32 deletions(-) > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index ef2763c2b12e..a45308c7d2d9 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct > kvm_memory_slot *slot, gfn_t gfn, > void kvm_release_page_clean(struct page *page); > void kvm_release_page_dirty(struct page *page); > > +void kvm_set_page_accessed(struct page *page); > +void kvm_set_page_dirty(struct page *page); > + > struct kvm_follow_pfn { > const struct kvm_memory_slot *slot; > gfn_t gfn; > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn { > bool atomic; > /* Allow a read fault to create a writeable mapping. */ > bool allow_write_mapping; > + /* > + * Usage of the returned pfn will be guared by a mmu notifier. Must ^guarded > + * be true if FOLL_GET is not set. > + */ > + bool guarded_by_mmu_notifier; > It seems no one sets the guraded_by_mmu_notifier in this patch. Is guarded_by_mmu_notifier always equal to !foll->FOLL_GET and set by the caller of __kvm_follow_pfn()? 
If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page? It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a temporary solution. I don't think it is a good idea to play tricks with a temporary solution, more like we are abusing the toleration. Is a flag like guarded_by_mmu_notifier (perhaps a better name) enough to indicate a tail page? > /* Outputs of __kvm_follow_pfn */ > hva_t hva; > bool writable; > + /* True if the returned pfn is for a page with a valid refcount. */ > + bool is_refcounted_page; > }; > > kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll); > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index b13f22861d2f..0f7b41f220b6 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -2502,6 +2502,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn > *foll, kvm_pfn_t *pfn) > if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) { > *pfn = page_to_pfn(page[0]); > foll->writable = foll->allow_write_mapping; > + foll->is_refcounted_page = true; > + if (!(foll->flags & FOLL_GET)) > + put_page(page[0]); > return true; > } > > @@ -2525,6 +2528,7 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, > kvm_pfn_t *pfn) > return npages; > > foll->writable = (foll->flags & FOLL_WRITE) && > foll->allow_write_mapping; > + foll->is_refcounted_page = true; > > /* map read fault as writable if possible */ > if (unlikely(!foll->writable) && foll->allow_write_mapping) { > @@ -2537,6 +2541,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, > kvm_pfn_t *pfn) > } > } > *pfn = page_to_pfn(page); > + if (!(foll->flags & FOLL_GET)) > + put_page(page); > return npages; > } > > @@ -2551,16 +2557,6 @@ static bool vma_is_valid(struct vm_area_struct *vma, > bool write_fault) > return true; > } > > -static int kvm_try_get_pfn(kvm_pfn_t pfn) > -{ > - struct page *page = kvm_pfn_to_refcounted_page(pfn); > - > - if (!page) > - return 1; > - > - return get_page_unless_zero(page); > -} > - > 
static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct > kvm_follow_pfn *foll, > kvm_pfn_t *p_pfn) > { > @@ -2568,6 +2564,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct > *vma, struct kvm_follow_pfn > pte_t *ptep; > spinlock_t *ptl; > bool write_fault = foll->flags & FOLL_WRITE; > + struct page *page; > int r; > > r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl); > @@ -2599,24 +2596,27 @@ static int hva_to_pfn_remapped(struct vm_area_struct > *vma, struct kvm_follow_pfn > pfn = pte_pfn(
Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
On Tue, 4 Jul 2023 16:50:47 +0900 David Stevens wrote: > From: David Stevens > > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot. > __kvm_follow_pfn refactors the old API's arguments into a struct and, > where possible, combines the boolean arguments into a single flags > argument. > > Signed-off-by: David Stevens > --- > include/linux/kvm_host.h | 16 > virt/kvm/kvm_main.c | 171 ++- > virt/kvm/kvm_mm.h| 3 +- > virt/kvm/pfncache.c | 8 +- > 4 files changed, 122 insertions(+), 76 deletions(-) > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 9d3ac7720da9..ef2763c2b12e 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -97,6 +97,7 @@ > #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1) > #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2) > #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3) > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4) > > /* > * error pfns indicate that the gfn is in slot but faild to > @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct > kvm_memory_slot *slot, gfn_t gfn, > void kvm_release_page_clean(struct page *page); > void kvm_release_page_dirty(struct page *page); > > +struct kvm_follow_pfn { > + const struct kvm_memory_slot *slot; > + gfn_t gfn; > + unsigned int flags; > + bool atomic; > + /* Allow a read fault to create a writeable mapping. 
*/ > + bool allow_write_mapping; > + > + /* Outputs of __kvm_follow_pfn */ > + hva_t hva; > + bool writable; > +}; > + > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll); > + > kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn); > kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault, > bool *writable); > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 371bd783ff2b..b13f22861d2f 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned > long addr) > * true indicates success, otherwise false is returned. It's also the > * only part that runs if we can in atomic context. > */ > -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault, > - bool *writable, kvm_pfn_t *pfn) > +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > { > struct page *page[1]; > + bool write_fault = foll->flags & FOLL_WRITE; > > /* >* Fast pin a writable pfn only if it is a write fault request >* or the caller allows to map a writable pfn for a read fault >* request. >*/ > - if (!(write_fault || writable)) > + if (!(write_fault || foll->allow_write_mapping)) > return false; > > - if (get_user_page_fast_only(addr, FOLL_WRITE, page)) { > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) { > *pfn = page_to_pfn(page[0]); > - > - if (writable) > - *writable = true; > + foll->writable = foll->allow_write_mapping; > return true; > } > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool > write_fault, > * The slow path to get the pfn of the specified host virtual address, > * 1 indicates success, -errno is returned if error is detected. 
> */ > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault, > -bool interruptible, bool *writable, kvm_pfn_t *pfn) > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > { > - unsigned int flags = FOLL_HWPOISON; > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags; > struct page *page; > int npages; > > might_sleep(); > > - if (writable) > - *writable = write_fault; > - > - if (write_fault) > - flags |= FOLL_WRITE; > - if (async) > - flags |= FOLL_NOWAIT; > - if (interruptible) > - flags |= FOLL_INTERRUPTIBLE; > - > - npages = get_user_pages_unlocked(addr, 1, &page, flags); > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags); > if (npages != 1) > return npages; > > + foll->writable = (foll->flags & FOLL_WRITE) && > foll->allow_write_mapping; > + > /* map read fault as writable if possible */ > - if (unlikely(!write_fault) && writable) { > + if (unlikely(!foll->writable) && foll->allow_write_mapping) { > struct page *wpage; > > - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) { > - *writable = true; > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) { > + foll->writable = true; > put_page(page); > page = wpage; > } > @@ -257
Re: [06/12] arch: Declare screen_info in
Hi

On 05.07.23 at 03:21, Sui Jingfeng wrote:

Hi, Thomas

I love your patch, yet after applying it, the Linux kernel fails to compile on my LoongArch machine. screen_info is missing.

I think this should be solved with your update to patch 1.

Best regards
Thomas

```
  CC arch/loongarch/kernel/efi.o
arch/loongarch/kernel/efi.c: In function ‘init_screen_info’:
arch/loongarch/kernel/efi.c:77:54: error: invalid application of ‘sizeof’ to incomplete type ‘struct screen_info’
   77 | si = early_memremap(screen_info_table, sizeof(*si));
      |                                              ^
arch/loongarch/kernel/efi.c:82:9: error: ‘screen_info’ undeclared (first use in this function)
   82 | screen_info = *si;
      | ^~~
arch/loongarch/kernel/efi.c:82:9: note: each undeclared identifier is reported only once for each function it appears in
arch/loongarch/kernel/efi.c:82:23: error: invalid use of undefined type ‘struct screen_info’
   82 | screen_info = *si;
      |               ^
arch/loongarch/kernel/efi.c:83:29: error: invalid application of ‘sizeof’ to incomplete type ‘struct screen_info’
   83 | memset(si, 0, sizeof(*si));
      |               ^
arch/loongarch/kernel/efi.c:84:34: error: invalid application of ‘sizeof’ to incomplete type ‘struct screen_info’
   84 | early_memunmap(si, sizeof(*si));
      |                    ^
make[3]: *** [scripts/Makefile.build:252: arch/loongarch/kernel/efi.o] Error 1
make[3]: *** Waiting for unfinished jobs
make[2]: *** [scripts/Makefile.build:494: arch/loongarch/kernel] Error 2
make[1]: *** [scripts/Makefile.build:494: arch/loongarch] Error 2
make[1]: *** Waiting for unfinished jobs
make: *** [Makefile:2026: .] Error 2
```

On 2023/6/29 19:45, Thomas Zimmermann wrote: The variable screen_info does not exist on all architectures. Declare it in . All architectures that do declare it will provide it via . Add the Kconfig token ARCH_HAS_SCREEN_INFO to guard against access on architectures that don't provide screen_info.
Signed-off-by: Thomas Zimmermann Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: Matt Turner Cc: Russell King Cc: Catalin Marinas Cc: Will Deacon Cc: Guo Ren Cc: Brian Cain Cc: Huacai Chen Cc: WANG Xuerui Cc: Thomas Bogendoerfer Cc: Dinh Nguyen Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Christophe Leroy Cc: Paul Walmsley Cc: Palmer Dabbelt Cc: Albert Ou Cc: Yoshinori Sato Cc: Rich Felker Cc: John Paul Adrian Glaubitz Cc: "David S. Miller" Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: x...@kernel.org Cc: "H. Peter Anvin" Cc: Chris Zankel Cc: Max Filippov Cc: Helge Deller Cc: Arnd Bergmann Cc: Kees Cook Cc: "Paul E. McKenney" Cc: Peter Zijlstra Cc: Frederic Weisbecker Cc: Andrew Morton Cc: Ard Biesheuvel Cc: Sami Tolvanen Cc: Juerg Haefliger Cc: Geert Uytterhoeven Cc: Anshuman Khandual Cc: Niklas Schnelle Cc: "Russell King (Oracle)" Cc: Linus Walleij Cc: Sebastian Reichel Cc: "Mike Rapoport (IBM)" Cc: "Kirill A. Shutemov" Cc: Zi Yan Acked-by: WANG Xuerui # loongarch --- arch/Kconfig | 6 ++ arch/alpha/Kconfig | 1 + arch/arm/Kconfig | 1 + arch/arm64/Kconfig | 1 + arch/csky/Kconfig | 1 + arch/hexagon/Kconfig | 1 + arch/ia64/Kconfig | 1 + arch/loongarch/Kconfig | 1 + arch/mips/Kconfig | 1 + arch/nios2/Kconfig | 1 + arch/powerpc/Kconfig | 1 + arch/riscv/Kconfig | 1 + arch/sh/Kconfig | 1 + arch/sparc/Kconfig | 1 + arch/x86/Kconfig | 1 + arch/xtensa/Kconfig | 1 + drivers/video/Kconfig | 3 +++ include/asm-generic/Kbuild | 1 + include/asm-generic/screen_info.h | 12 include/linux/screen_info.h | 2 +- 20 files changed, 38 insertions(+), 1 deletion(-) create mode 100644 include/asm-generic/screen_info.h diff --git a/arch/Kconfig b/arch/Kconfig index 205fd23e0cada..2f58293fd7bcb 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -1466,6 +1466,12 @@ config ARCH_HAS_NONLEAF_PMD_YOUNG address translations. Page table walkers that clear the accessed bit may use this capability to reduce their search space. 
+config ARCH_HAS_SCREEN_INFO + bool + help + Selected by architectures that provide a global instance of + screen_info. + source "kernel/gcov/Kconfig" source "scripts/gcc-plugins/Kconfig" diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig index a5c2b1aa46b02..d749011d88b14 100644 --- a/arch/alpha/Kconfig +++ b/arch/alpha/Kconfig @@ -4,6 +4,7 @@ config ALPHA default y select ARCH_32BIT_USTAT_F_TINODE select ARCH_HAS_CURRENT_STACK_POINTER
Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page
On Tue, 4 Jul 2023 10:03:57 -0700 (PDT) Hugh Dickins wrote: > On Tue, 4 Jul 2023, Gerald Schaefer wrote: > > On Sat, 1 Jul 2023 21:32:38 -0700 (PDT) > > Hugh Dickins wrote: > > > On Thu, 29 Jun 2023, Hugh Dickins wrote: > > > > > > > > I've grown to dislike the (ab)use of pt_frag_refcount even more, to the > > > > extent that I've not even tried to verify it; but I think I do get the > > > > point now, that we need further info than just PPHHAA to know whether > > > > the page is on the list or not. But I think that if we move where the > > > > call_rcu() is done, then the page can stay on or off the list by same > > > > rules as before (but need to check HH bits along with PP when deciding > > > > whether to allocate, and whether to list_add_tail() when freeing). > > > > > > No, not quite the same rules as before: I came to realize that using > > > list_add_tail() for the HH pages would be liable to put a page on the > > > list which forever blocked reuse of PP list_add_tail() pages after it > > > (could be solved by a list_move() somewhere, but we have agreed to > > > prefer simplicity). > > > > > > I've dropped the HH bits, I'm using PageActive like we did on powerpc, > > > I've dropped most of the pte_free_*() helpers, and list_del_init() is > > > an easier way of dealing with those "is it on the list" questions. > > > I expect that we shall be close to reaching agreement on... > > > > This looks really nice, almost too good and easy to be true. I did not > > find any obvious flaw, just some comments below. It also survived LTP > > without any visible havoc, so I guess this approach is the best so far. > > Phew! I'm of course glad to hear this: thanks for your efforts on it. > > ... > > > --- a/arch/s390/mm/pgalloc.c > > > +++ b/arch/s390/mm/pgalloc.c > > > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page) > > > * logic described above. 
Both AA bits are set to 1 to denote a > > > 4KB-pgtable > > > * while the PP bits are never used, nor such a page is added to or > > > removed > > > * from mm_context_t::pgtable_list. > > > + * > > > + * pte_free_defer() overrides those rules: it takes the page off > > > pgtable_list, > > > + * and prevents both 2K fragments from being reused. pte_free_defer() > > > has to > > > + * guarantee that its pgtable cannot be reused before the RCU grace > > > period > > > + * has elapsed (which page_table_free_rcu() does not actually > > > guarantee). > > > > Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not > > allow reuse before grace period elapsed. And I hope that it does so, by > > setting the PP bits, which would be noticed in page_table_alloc(), in > > case the page would be seen there. > > > > Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the > > end of the list, and so they could be seen in page_table_alloc(), but they > > should not be reused before grace period elapsed and __tlb_remove_table() > > cleared the PP bits, as far as I understand. > > > > So what exactly do you mean with "which page_table_free_rcu() does not > > actually > > guarantee"? > > I'll answer without locating and re-reading what Jason explained earlier, > perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table(): > he may have explained it better. And without working out again all the > MMU_GATHER #defines, and which of them do and do not apply to s390 here. > > The detail that sticks in my mind is the fallback in tlb_remove_table() Ah ok, I was aware of that "semi-RCU" fallback logic in tlb_remove_table(), but that is rather a generic issue, and not s390-specific. I thought you meant some s390-oddity here, of which we have a lot, unfortunately... Of course, we call tlb_remove_table() from our page_table_free_rcu(), so I guess you could say that page_table_free_rcu() cannot guarantee what tlb_remove_table() cannot guarantee. 
Maybe change to "which page_table_free_rcu() does not actually guarantee, by calling tlb_remove_table()", to make it clear that this is not a problem of page_table_free_rcu() itself. > in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot > batch the tables for freeing by RCU, and resorts instead to an immediate > TLB flush (I think: that again involves chasing definitions) followed by > tlb_remove_table_sync_one() - which just delivers an interrupt to each CPU, > and is commented: > /* > * This isn't an RCU grace period and hence the page-tables cannot be > * assumed to be actually RCU-freed. > * > * It is however sufficient for software page-table walkers that rely on > * IRQ disabling. > */ > > Whether that's good for your PP pages or not, I've given no thought: > I've just taken it on trust that what s390 has working today is good. Yes, we should be fine with that, current code can be trusted :-) > > If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(), > then I would not have written "(which page_tab
Re: [PATCH v3 3/9] cpu/SMT: Store the current/max number of threads
On 05/07/2023 at 05:05, Zhang, Rui wrote: On Thu, 2023-06-29 at 16:31 +0200, Laurent Dufour wrote: From: Michael Ellerman Some architectures allows partial SMT states at boot time, s/allows/allow. Thanks Rui!
Re: [PATCH v3 6/9] cpu/SMT: Allow enabling partial SMT states via sysfs
On 05/07/2023 at 05:14, Zhang, Rui wrote: On Thu, 2023-06-29 at 16:31 +0200, Laurent Dufour wrote: @@ -2580,6 +2597,17 @@ static ssize_t control_show(struct device *dev, { const char *state = smt_states[cpu_smt_control]; +#ifdef CONFIG_HOTPLUG_SMT + /* + * If SMT is enabled but not all threads are enabled then show the + * number of threads. If all threads are enabled show "on". Otherwise + * show the state name. + */ + if (cpu_smt_control == CPU_SMT_ENABLED && + cpu_smt_num_threads != cpu_smt_max_threads) + return sysfs_emit(buf, "%d\n", cpu_smt_num_threads); +#endif + My understanding is that cpu_smt_control is always set to CPU_SMT_NOT_IMPLEMENTED when CONFIG_HOTPLUG_SMT is not set, so this ifdef is not necessary, right? Hi Rui, Indeed, cpu_smt_control, cpu_smt_num_threads and cpu_smt_max_threads are only defined when CONFIG_HOTPLUG_SMT is set. This is the reason for this #ifdef block. This has been reported by the kernel test robot testing v2: https://lore.kernel.org/oe-kbuild-all/202306282340.ihqm0fla-...@intel.com Cheers, Laurent.
Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
On Tue, Jul 04, 2023 at 04:50:48PM +0900, David Stevens wrote: > From: David Stevens > > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows > callers to resolve a gfn when the associated pfn has a valid struct page > that isn't being actively refcounted (e.g. tail pages of non-compound > higher order pages). For a caller to safely omit FOLL_GET, all usages of > the returned pfn must be guarded by a mmu notifier. > > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that > is set when the returned pfn has an associated struct page with a valid > refcount. Callers that don't pass FOLL_GET should remember this value > and use it to avoid places like kvm_is_ad_tracked_page that assume a > non-zero refcount. > > Signed-off-by: David Stevens > --- > include/linux/kvm_host.h | 10 ++ > virt/kvm/kvm_main.c | 67 +--- > virt/kvm/pfncache.c | 2 +- > 3 files changed, 47 insertions(+), 32 deletions(-) > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index ef2763c2b12e..a45308c7d2d9 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct > kvm_memory_slot *slot, gfn_t gfn, > void kvm_release_page_clean(struct page *page); > void kvm_release_page_dirty(struct page *page); > > +void kvm_set_page_accessed(struct page *page); > +void kvm_set_page_dirty(struct page *page); > + > struct kvm_follow_pfn { > const struct kvm_memory_slot *slot; > gfn_t gfn; > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn { > bool atomic; > /* Allow a read fault to create a writeable mapping. */ > bool allow_write_mapping; > + /* > + * Usage of the returned pfn will be guared by a mmu notifier. Must > + * be true if FOLL_GET is not set. > + */ > + bool guarded_by_mmu_notifier; And how? Any place to check the invalidate seq? B.R. Yu
Re: [PATCH v3 0/9] Introduce SMT level and add PowerPC support
On 05/07/2023 at 05:04, Zhang, Rui wrote: Hi, Laurent, I want to test this patch set and found that it does not apply on top of the latest upstream git, because of some changes in this merge window, so better rebase. Hi Rui, Thanks for your interest in this series. Thomas's latest changes came into the PowerPC next branch. I'm working on a rebase. Cheers, Laurent. thanks, rui On Thu, 2023-06-29 at 16:31 +0200, Laurent Dufour wrote: I'm taking over the series Michael sent previously [1], which smartly reworks the initial series I sent [2]. This series addresses the comments sent by Thomas and me on Michael's one. Here is a short introduction to the issue this series is addressing: When a new CPU is added, the kernel activates all its threads. This leads to a weird, but functional, result when adding a CPU on an SMT 4 system, for instance. Here the newly added CPU 1 has 8 threads while the other one has 4 threads active (the system has been booted with the 'smt-enabled=4' kernel option): ltcden3-lp12:~ # ppc64_cpu --info Core 0: 0* 1* 2* 3* 4 5 6 7 Core 1: 8* 9* 10* 11* 12* 13* 14* 15* This mixed SMT level may confuse end users and/or some applications. There is no SMT level recorded in the kernel (common code), nor in user space, as far as I know. Such a level is helpful when adding a new CPU or when optimizing energy efficiency (when reactivating CPUs). When SMP and HOTPLUG_SMT are defined, this series adds a new SMT level (cpu_smt_num_threads) and a few callbacks allowing the architecture code to finely control this value, setting a max and an "at boot" level, and controlling whether a thread should be onlined or not.
v3: Fix a build error in the patch 6/9 v2: As Thomas suggested, Reword some commit's description Remove topology_smt_supported() Remove topology_smt_threads_supported() Introduce CONFIG_SMT_NUM_THREADS_DYNAMIC Remove switch() in __store_smt_control() Update kernel-parameters.txt [1] https://lore.kernel.org/linuxppc-dev/20230524155630.794584-1-...@ellerman.id.au/ [2] https://lore.kernel.org/linuxppc-dev/20230331153905.31698-1-lduf...@linux.ibm.com/ Laurent Dufour (1): cpu/SMT: Remove topology_smt_supported() Michael Ellerman (8): cpu/SMT: Move SMT prototypes into cpu_smt.h cpu/SMT: Move smt/control simple exit cases earlier cpu/SMT: Store the current/max number of threads cpu/SMT: Create topology_smt_thread_allowed() cpu/SMT: Allow enabling partial SMT states via sysfs powerpc/pseries: Initialise CPU hotplug callbacks earlier powerpc: Add HOTPLUG_SMT support powerpc/pseries: Honour current SMT state when DLPAR onlining CPUs .../ABI/testing/sysfs-devices-system-cpu | 1 + .../admin-guide/kernel-parameters.txt | 4 +- arch/Kconfig | 3 + arch/powerpc/Kconfig | 2 + arch/powerpc/include/asm/topology.h | 15 +++ arch/powerpc/kernel/smp.c | 8 +- arch/powerpc/platforms/pseries/hotplug-cpu.c | 30 +++-- arch/powerpc/platforms/pseries/pseries.h | 2 + arch/powerpc/platforms/pseries/setup.c | 2 + arch/x86/include/asm/topology.h | 4 +- arch/x86/kernel/cpu/bugs.c | 3 +- arch/x86/kernel/smpboot.c | 8 -- include/linux/cpu.h | 25 +--- include/linux/cpu_smt.h | 33 + kernel/cpu.c | 118 ++-- -- 15 files changed, 187 insertions(+), 71 deletions(-) create mode 100644 include/linux/cpu_smt.h
Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
On Wed, Jul 05, 2023 at 06:22:59PM +0900, David Stevens wrote: > On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang wrote: > > > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, > > > bool write_fault, > > > * The slow path to get the pfn of the specified host virtual address, > > > * 1 indicates success, -errno is returned if error is detected. > > > */ > > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool > > > write_fault, > > > -bool interruptible, bool *writable, kvm_pfn_t > > > *pfn) > > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > > > { > > > - unsigned int flags = FOLL_HWPOISON; > > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags; > > > struct page *page; > > > int npages; > > > > > > might_sleep(); > > > > > > - if (writable) > > > - *writable = write_fault; > > > - > > > - if (write_fault) > > > - flags |= FOLL_WRITE; > > > - if (async) > > > - flags |= FOLL_NOWAIT; > > > - if (interruptible) > > > - flags |= FOLL_INTERRUPTIBLE; > > > - > > > - npages = get_user_pages_unlocked(addr, 1, &page, flags); > > > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags); > > > if (npages != 1) > > > return npages; > > > > > > + foll->writable = (foll->flags & FOLL_WRITE) && > > > foll->allow_write_mapping; > > > + > > > /* map read fault as writable if possible */ > > > - if (unlikely(!write_fault) && writable) { > > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) { > > > > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here. > > The two statements are logically equivalent, although I guess using > !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more > verbose. Well, as the comment says, we wanna try to map the read fault as writable whenever possible. And __gfn_to_pfn_memslot() will only set the FOLL_WRITE for write faults. So I guess using !foll->writable will not allow this. Did I miss anything? 
> > > +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t > > > gfn, > > > +bool atomic, bool interruptible, bool *async, > > > +bool write_fault, bool *writable, hva_t *hva) > > > +{ > > > + kvm_pfn_t pfn; > > > + struct kvm_follow_pfn foll = { > > > + .slot = slot, > > > + .gfn = gfn, > > > + .flags = 0, > > > + .atomic = atomic, > > > + .allow_write_mapping = !!writable, > > > + }; > > > + > > > + if (write_fault) > > > + foll.flags |= FOLL_WRITE; > > > + if (async) > > > + foll.flags |= FOLL_NOWAIT; > > > + if (interruptible) > > > + foll.flags |= FOLL_INTERRUPTIBLE; > > > + > > > + pfn = __kvm_follow_pfn(&foll); > > > + if (pfn == KVM_PFN_ERR_NEEDS_IO) { > > > > Could we just use KVM_PFN_ERR_FAULT and foll.flags here? I.e., > > if (pfn == KVM_PFN_ERR_FAULT && (foll.flags & FOLL_NOWAIT))? > > Setting pfn to KVM_PFN_ERR_NEEDS_IO just to indicate an async fault > > seems unnecessary. > > There are the cases where the fault does not fall within a vma or when > the target vma's flags don't support the fault's access permissions. > In those cases, continuing to try to resolve the fault won't cause > problems per-se, but it's wasteful and a bit confusing. Having > hva_to_pfn detect whether or not it may be possible to resolve the > fault asynchronously and return KVM_PFN_ERR_NEEDS_IO if so seems like > a good idea. It also matches what the existing code does. Got it. Sounds reasonable. And thanks! :) B.R. Yu
Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote: > From: David Stevens > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map > memory into the guest that is backed by un-refcounted struct pages - for > example, higher order non-compound pages allocated by the amdgpu driver > via ttm_pool_alloc_page. > > The bulk of this change is tracking the is_refcounted_page flag so that > non-refcounted pages don't trigger page_count() == 0 warnings. This is > done by storing the flag in an unused bit in the sptes. > > Signed-off-by: David Stevens > --- > arch/x86/kvm/mmu/mmu.c | 44 + > arch/x86/kvm/mmu/mmu_internal.h | 1 + > arch/x86/kvm/mmu/paging_tmpl.h | 9 --- > arch/x86/kvm/mmu/spte.c | 4 ++- > arch/x86/kvm/mmu/spte.h | 12 - > arch/x86/kvm/mmu/tdp_mmu.c | 22 ++--- > 6 files changed, 62 insertions(+), 30 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index e44ab512c3a1..b1607e314497 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -553,12 +553,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte) > > if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) { > flush = true; > - kvm_set_pfn_accessed(spte_to_pfn(old_spte)); > + if (is_refcounted_page_pte(old_spte)) > + > kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte))); > } > > if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) { > flush = true; > - kvm_set_pfn_dirty(spte_to_pfn(old_spte)); > + if (is_refcounted_page_pte(old_spte)) > + kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte))); > } > > return flush; > @@ -596,14 +598,18 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, > u64 *sptep) >* before they are reclaimed. Sanity check that, if the pfn is backed >* by a refcounted page, the refcount is elevated. 
>*/ > - page = kvm_pfn_to_refcounted_page(pfn); > - WARN_ON(page && !page_count(page)); > + if (is_refcounted_page_pte(old_spte)) { > + page = kvm_pfn_to_refcounted_page(pfn); > + WARN_ON(!page || !page_count(page)); > + } > > - if (is_accessed_spte(old_spte)) > - kvm_set_pfn_accessed(pfn); > + if (is_refcounted_page_pte(old_spte)) { > + if (is_accessed_spte(old_spte)) > + kvm_set_page_accessed(pfn_to_page(pfn)); > > - if (is_dirty_spte(old_spte)) > - kvm_set_pfn_dirty(pfn); > + if (is_dirty_spte(old_spte)) > + kvm_set_page_dirty(pfn_to_page(pfn)); > + } > > return old_spte; > } > @@ -639,8 +645,8 @@ static bool mmu_spte_age(u64 *sptep) >* Capture the dirty status of the page, so that it doesn't get >* lost when the SPTE is marked for access tracking. >*/ > - if (is_writable_pte(spte)) > - kvm_set_pfn_dirty(spte_to_pfn(spte)); > + if (is_writable_pte(spte) && is_refcounted_page_pte(spte)) > + kvm_set_page_dirty(pfn_to_page(spte_to_pfn(spte))); > > spte = mark_spte_for_access_track(spte); > mmu_spte_update_no_track(sptep, spte); > @@ -1278,8 +1284,8 @@ static bool spte_wrprot_for_clear_dirty(u64 *sptep) > { > bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT, > (unsigned long *)sptep); > - if (was_writable && !spte_ad_enabled(*sptep)) > - kvm_set_pfn_dirty(spte_to_pfn(*sptep)); > + if (was_writable && !spte_ad_enabled(*sptep) && > is_refcounted_page_pte(*sptep)) > + kvm_set_page_dirty(pfn_to_page(spte_to_pfn(*sptep))); > > return was_writable; > } > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct > kvm_memory_slot *slot, > bool host_writable = !fault || fault->map_writable; > bool prefetch = !fault || fault->prefetch; > bool write_fault = fault && fault->write; > + bool is_refcounted = !fault || fault->is_refcounted_page; > > pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__, >*sptep, write_fault, gfn); > @@ -2969,7 +2976,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct > kvm_memory_slot *slot, > } > > wrprot = 
make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, > prefetch, > -true, host_writable, &spte); > +true, host_writable, is_refcounted, &spte); > > if (*sptep == spte) { > ret = RET_PF_SPURIOUS; > @@ -4299,8 +4306,9 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, > struct kvm_page_fault *fault > struct kvm_follow_pfn foll = { > .slot = slot, >
Re: [PATCH] KVM: PPC: Update MAINTAINERS
Sean Christopherson writes: > On Thu, Jun 08, 2023, Nicholas Piggin wrote: >> Michael is merging KVM PPC patches via the powerpc tree and KVM topic >> branches. He doesn't necessarily have time to be across all of KVM so >> is reluctant to call himself maintainer, but for the mechanics of how >> patches flow upstream, it is maintained and does make sense to have >> some contact people in MAINTAINERS. >> >> So add Michael Ellerman as KVM PPC maintainer and myself as reviewer. >> Split out the subarchs that don't get so much attention. >> >> Signed-off-by: Nicholas Piggin >> --- > > Thanks for documenting the reality of things, much appreciated! > > Acked-by: Sean Christopherson > >> MAINTAINERS | 6 ++ >> 1 file changed, 6 insertions(+) >> >> diff --git a/MAINTAINERS b/MAINTAINERS >> index 0dab9737ec16..44417acd2936 100644 >> --- a/MAINTAINERS >> +++ b/MAINTAINERS >> @@ -11379,7 +11379,13 @@ F: arch/mips/include/uapi/asm/kvm* >> F: arch/mips/kvm/ >> >> KERNEL VIRTUAL MACHINE FOR POWERPC (KVM/powerpc) >> +M: Michael Ellerman >> +R: Nicholas Piggin >> L: linuxppc-dev@lists.ozlabs.org >> +L: k...@vger.kernel.org >> +S: Maintained (Book3S 64-bit HV) >> +S: Odd fixes (Book3S 64-bit PR) >> +S: Orphan (Book3E and 32-bit) > > Do you think there's any chance of dropping support for everything except > Book3S > 64-bit HV at some point soonish? Nick proposed disabling BookE KVM, which prompted some users to report they are still actively using it: https://lore.kernel.org/all/20221128043623.1745708-1-npig...@gmail.com/ There are also still some KVM PR users. In total I'd guess it's only some small 100s of users, but we don't really know. > There haven't been many generic KVM changes that touch PPC, but in my > experience when such series do come along, the many flavors and layers > of PPC incur quite a bit of development and testing cost, and have a > high chance of being broken compared to other architectures. Ack. cheers
Re: [RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration
On 2023/7/5 16:43, Barry Song wrote: > On Tue, Jul 4, 2023 at 10:36 PM Yicong Yang wrote: >> >> On 2023/6/30 1:26, Catalin Marinas wrote: >>> On Thu, Jun 29, 2023 at 05:31:36PM +0100, Catalin Marinas wrote: On Thu, May 18, 2023 at 02:59:34PM +0800, Yicong Yang wrote: > From: Barry Song > > on x86, batched and deferred tlb shootdown has led to 90% > performance increase on tlb shootdown. on arm64, HW can do > tlb shootdown without software IPI. But sync tlbi is still > quite expensive. [...] > .../features/vm/TLB/arch-support.txt | 2 +- > arch/arm64/Kconfig| 1 + > arch/arm64/include/asm/tlbbatch.h | 12 > arch/arm64/include/asm/tlbflush.h | 33 - > arch/arm64/mm/flush.c | 69 +++ > arch/x86/include/asm/tlbflush.h | 5 +- > include/linux/mm_types_task.h | 4 +- > mm/rmap.c | 12 ++-- First of all, this patch needs to be split into some preparatory patches introducing/renaming functions with no functional change for x86. Once done, you can add the arm64-only changes. >> >> got it. will try to split this patch as suggested. >> Now, on the implementation, I had some comments on v7 but we didn't get to a conclusion and the thread eventually died: https://lore.kernel.org/linux-mm/y7ctoj5mwd1zb...@arm.com/ I know I said a command line argument is better than Kconfig or some random number-of-CPUs heuristic, but it would be even better if we don't bother with any, just make this always on. >> >> ok, will make this always on. >> Barry had some comments around mprotect() being racy and that's why we have flush_tlb_batched_pending(), but I don't think it's needed (or, for arm64, it can be a DSB since this patch issues the TLBIs but without the DVM Sync). So we need to clarify this (see Barry's last email on the above thread) before attempting new versions of this patchset. With flush_tlb_batched_pending() removed (or a DSB), I have a suspicion such an implementation would be faster on any SoC irrespective of the number of CPUs.
>>> >>> I think I got the need for flush_tlb_batched_pending(). If >>> try_to_unmap() marks the pte !present and we have a pending TLBI, >>> change_pte_range() will skip the TLB maintenance altogether since it did >>> not change the pte. So we could be left with stale TLB entries after >>> mprotect() before TTU does the batch flushing. >>> > > Good catch. > This could also be true for MADV_DONTNEED. After try_to_unmap(), we run > MADV_DONTNEED on this area; as the pte is not present, we don't do anything > on this PTE in zap_pte_range() afterwards. > >>> We can have an arch-specific flush_tlb_batched_pending() that can be a >>> DSB only on arm64 and a full mm flush on x86. >>> >> >> We need to do a flush/dsb in flush_tlb_batched_pending() only in a race >> condition, so we first check whether there's a pending batched flush and, >> if so, do the tlb flush. The pending check is common, and the difference >> among the archs is how to flush the TLB here within >> flush_tlb_batched_pending(); >> on arm64 it should only be a dsb. >> >> As we only need to maintain the TLBs already pending in a batched flush, >> does it make sense to only handle those TLBs in flush_tlb_batched_pending()? >> Then we can use arch_tlbbatch_flush() rather than flush_tlb_mm() in >> flush_tlb_batched_pending() and no arch-specific function is needed. > > as we have issued no-sync tlbi on those pending addresses, that means > our hardware > has already "recorded" what should be flushed in the specific mm. so > a DSB alone will flush > them correctly. right? > Yes, that's right. I had just thought of something like below. arch_tlbbatch_flush() will only be a dsb on arm64, so this will match what Catalin wants. But as you said this may be incorrect on x86, so we'd better have an arch-specific implementation of flush_tlb_batched_pending() as suggested.
diff --git a/mm/rmap.c b/mm/rmap.c
index 9699c6011b0e..afa3571503a0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -717,7 +717,7 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
 	int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
 
 	if (pending != flushed) {
-		flush_tlb_mm(mm);
+		arch_tlbbatch_flush(&current->tlb_ubc.arch);
 		/*
 		 * If the new TLB flushing is pending during flushing, leave
 		 * mm->tlb_flush_batched as is, to avoid losing flushing.
Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote: > From: David Stevens > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map > memory into the guest that is backed by un-refcounted struct pages - for > example, higher order non-compound pages allocated by the amdgpu driver > via ttm_pool_alloc_page. I guess you mean the tail pages of the higher order non-compound pages? And as to the head page, it is said to be set to one coincidentally[*], and shall not be considered as refcounted. IIUC, the refcount of this head page will be increased and decreased soon in hva_to_pfn_remapped(), so this may not be a problem(?). But treating this head page differently, as a refcounted one (e.g., to set the A/D flags), is weird. Or maybe I missed some context, e.g., can the head page be allocated to the guest at all? > > The bulk of this change is tracking the is_refcounted_page flag so that > non-refcounted pages don't trigger page_count() == 0 warnings. This is > done by storing the flag in an unused bit in the sptes. Also, maybe we should mention this only works on x86-64. > > Signed-off-by: David Stevens > --- > arch/x86/kvm/mmu/mmu.c | 44 + > arch/x86/kvm/mmu/mmu_internal.h | 1 + > arch/x86/kvm/mmu/paging_tmpl.h | 9 --- > arch/x86/kvm/mmu/spte.c | 4 ++- > arch/x86/kvm/mmu/spte.h | 12 - > arch/x86/kvm/mmu/tdp_mmu.c | 22 ++--- > 6 files changed, 62 insertions(+), 30 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index e44ab512c3a1..b1607e314497 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c ... > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct > kvm_memory_slot *slot, > bool host_writable = !fault || fault->map_writable; > bool prefetch = !fault || fault->prefetch; > bool write_fault = fault && fault->write; > + bool is_refcounted = !fault || fault->is_refcounted_page; Just wonder, what if a non-refcounted page is prefetched? Or is it possible in practice? ...
> > @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, > struct kvm_mmu *mmu, > */ > static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, > int i) > { > - bool host_writable; > + bool host_writable, is_refcounted; > gpa_t first_pte_gpa; > u64 *sptep, spte; > struct kvm_memory_slot *slot; > @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, > struct kvm_mmu_page *sp, int > sptep = &sp->spt[i]; > spte = *sptep; > host_writable = spte & shadow_host_writable_mask; > + // TODO: is this correct? > + is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED; > slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn); > make_spte(vcpu, sp, slot, pte_access, gfn, > spte_to_pfn(spte), spte, true, false, > - host_writable, &spte); > + host_writable, is_refcounted, &spte); Could we restrict that a non-refcounted page shall not be used as shadow page? [*] https://lore.kernel.org/all/8caf3008-dcf3-985a-631e-e019b277c...@amd.com/ B.R. Yu
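The idea under discussion — stashing the is_refcounted_page flag in an otherwise-unused SPTE bit so teardown knows whether the backing struct page's refcount and A/D state may be touched — can be sketched as a small userspace model. This is only an illustration: the SPTE_MMU_PAGE_REFCOUNTED name comes from the quoted patch, but the bit position and pfn mask below are hypothetical stand-ins, not the kernel's actual layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software-available bit, mirroring the patch's
 * SPTE_MMU_PAGE_REFCOUNTED; the real bit is chosen from bits the
 * x86-64 hardware ignores and may differ. */
#define SPTE_MMU_PAGE_REFCOUNTED (1ULL << 59)
#define SPTE_PFN_SHIFT 12
#define SPTE_PFN_MASK  0x000ffffffffff000ULL

/* Build a toy SPTE from a pfn, recording whether the backing page
 * carries a real refcount. */
static uint64_t make_spte(uint64_t pfn, bool is_refcounted)
{
	uint64_t spte = (pfn << SPTE_PFN_SHIFT) & SPTE_PFN_MASK;

	if (is_refcounted)
		spte |= SPTE_MMU_PAGE_REFCOUNTED;
	return spte;
}

/* On teardown (or in sync_spte), only touch the struct-page refcount
 * and A/D bits if the flag says the page was refcounted. */
static bool spte_is_refcounted(uint64_t spte)
{
	return (spte & SPTE_MMU_PAGE_REFCOUNTED) != 0;
}

static uint64_t spte_to_pfn(uint64_t spte)
{
	return (spte & SPTE_PFN_MASK) >> SPTE_PFN_SHIFT;
}
```

Because the flag lives in the SPTE itself, paths like FNAME(sync_spte) that only have the old SPTE in hand can recover it without re-resolving the gfn, which is exactly what the quoted hunk does.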
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Wed, Jul 05, 2023 at 10:51:57AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote: > On 05.07.23 09:08, Greg KH wrote: > > On Tue, Jul 04, 2023 at 01:22:54PM -0700, Suren Baghdasaryan wrote: > >> On Tue, Jul 4, 2023 at 9:18 AM Andrew Morton > >> wrote: > >>> On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH > >>> wrote: > Thanks! I'll investigate this later today. After discussing with > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until > the issue is fixed. I'll post a patch shortly. > >>> > >>> Posted at: > >>> https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/ > >> > >> As that change fixes something in 6.4, why not cc: stable on it as > >> well? > > > > Sorry, I thought since per-VMA locks were introduced in 6.4 and this > > patch is fixing 6.4 I didn't need to send it to stable for older > > versions. Did I miss something? > > 6.4.y is a stable kernel tree right now, so yes, it needs to be included > there :) > >>> > >>> I'm in wait-a-few-days-mode on this. To see if we have a backportable > >>> fix rather than disabling the feature in -stable. > > Andrew, how long will you remain in "wait-a-few-days-mode"? Given what > Greg said below and that we already had three reports I know of I'd > prefer if we could fix this rather sooner than later in mainline -- > especially as Arch Linux and openSUSE Tumbleweed likely have switched to > 6.4.y already or will do so soon. Ick, yeah, and Fedora should be switching soon too, and I want to drop support for 6.3.y "any day now". Is there just a revert we can do now first to resolve the regression and then work on fixing this up "better" for 6.6-rc1? thanks, greg k-h
Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang wrote: > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, > > bool write_fault, > > * The slow path to get the pfn of the specified host virtual address, > > * 1 indicates success, -errno is returned if error is detected. > > */ > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool > > write_fault, > > -bool interruptible, bool *writable, kvm_pfn_t *pfn) > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > > { > > - unsigned int flags = FOLL_HWPOISON; > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags; > > struct page *page; > > int npages; > > > > might_sleep(); > > > > - if (writable) > > - *writable = write_fault; > > - > > - if (write_fault) > > - flags |= FOLL_WRITE; > > - if (async) > > - flags |= FOLL_NOWAIT; > > - if (interruptible) > > - flags |= FOLL_INTERRUPTIBLE; > > - > > - npages = get_user_pages_unlocked(addr, 1, &page, flags); > > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags); > > if (npages != 1) > > return npages; > > > > + foll->writable = (foll->flags & FOLL_WRITE) && > > foll->allow_write_mapping; > > + > > /* map read fault as writable if possible */ > > - if (unlikely(!write_fault) && writable) { > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) { > > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here. The two statements are logically equivalent, although I guess using !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more verbose. > > struct page *wpage; > > > > - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) { > > - *writable = true; > > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) { > > + foll->writable = true; > > put_page(page); > > page = wpage; > > } > > @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn) > > return get_page_unless_zero(page); > > } > > > ... 
> > > +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t > > gfn, > > +bool atomic, bool interruptible, bool *async, > > +bool write_fault, bool *writable, hva_t *hva) > > +{ > > + kvm_pfn_t pfn; > > + struct kvm_follow_pfn foll = { > > + .slot = slot, > > + .gfn = gfn, > > + .flags = 0, > > + .atomic = atomic, > > + .allow_write_mapping = !!writable, > > + }; > > + > > + if (write_fault) > > + foll.flags |= FOLL_WRITE; > > + if (async) > > + foll.flags |= FOLL_NOWAIT; > > + if (interruptible) > > + foll.flags |= FOLL_INTERRUPTIBLE; > > + > > + pfn = __kvm_follow_pfn(&foll); > > + if (pfn == KVM_PFN_ERR_NEEDS_IO) { > > Could we just use KVM_PFN_ERR_FAULT and foll.flags here? I.e., > if (pfn == KVM_PFN_ERR_FAULT && (foll.flags & FOLL_NOWAIT))? > Setting pfn to KVM_PFN_ERR_NEEDS_IO just to indicate an async fault > seems unnecessary. There are the cases where the fault does not fall within a vma or when the target vma's flags don't support the fault's access permissions. In those cases, continuing to try to resolve the fault won't cause problems per-se, but it's wasteful and a bit confusing. Having hva_to_pfn detect whether or not it may be possible to resolve the fault asynchronously and return KVM_PFN_ERR_NEEDS_IO if so seems like a good idea. It also matches what the existing code does. -David
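The compatibility wrapper discussed above does two things: it folds the old boolean arguments into foll.flags, and it uses a distinct KVM_PFN_ERR_NEEDS_IO return value so that, under FOLL_NOWAIT, "retrying asynchronously may succeed" is distinguishable from "this vma can never satisfy the access". A minimal userspace sketch of both, with illustrative constant values (the kernel's differ):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the gup flags and KVM error pfns. */
#define FOLL_WRITE         0x01u
#define FOLL_NOWAIT        0x02u
#define FOLL_INTERRUPTIBLE 0x04u

#define KVM_PFN_ERR_FAULT    ((uint64_t)-1)
#define KVM_PFN_ERR_NEEDS_IO ((uint64_t)-2)

/* How the quoted __gfn_to_pfn_memslot() hunk folds the legacy boolean
 * arguments into a single flags word. */
static unsigned int fold_bools_into_flags(bool write_fault, bool async,
					  bool interruptible)
{
	unsigned int flags = 0;

	if (write_fault)
		flags |= FOLL_WRITE;
	if (async)
		flags |= FOLL_NOWAIT;
	if (interruptible)
		flags |= FOLL_INTERRUPTIBLE;
	return flags;
}

/* Only KVM_PFN_ERR_NEEDS_IO signals that completing the fault
 * asynchronously may help; a plain KVM_PFN_ERR_FAULT (e.g. no vma, or
 * incompatible vma permissions) should not be retried. */
static bool fault_may_complete_async(uint64_t pfn)
{
	return pfn == KVM_PFN_ERR_NEEDS_IO;
}
```

This mirrors David's point in the reply: collapsing NEEDS_IO into FAULT plus a FOLL_NOWAIT check would make the caller retry faults that can never be resolved.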
Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
On Wed, Jul 5, 2023 at 5:47 PM Zhi Wang wrote: > > On Tue, 4 Jul 2023 16:50:47 +0900 > David Stevens wrote: > > > From: David Stevens > > > > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot. > > __kvm_follow_pfn refactors the old API's arguments into a struct and, > > where possible, combines the boolean arguments into a single flags > > argument. > > > > Signed-off-by: David Stevens > > --- > > include/linux/kvm_host.h | 16 > > virt/kvm/kvm_main.c | 171 ++- > > virt/kvm/kvm_mm.h| 3 +- > > virt/kvm/pfncache.c | 8 +- > > 4 files changed, 122 insertions(+), 76 deletions(-) > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > index 9d3ac7720da9..ef2763c2b12e 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -97,6 +97,7 @@ > > #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1) > > #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2) > > #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3) > > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4) > > > > /* > > * error pfns indicate that the gfn is in slot but faild to > > @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct > > kvm_memory_slot *slot, gfn_t gfn, > > void kvm_release_page_clean(struct page *page); > > void kvm_release_page_dirty(struct page *page); > > > > +struct kvm_follow_pfn { > > + const struct kvm_memory_slot *slot; > > + gfn_t gfn; > > + unsigned int flags; > > + bool atomic; > > + /* Allow a read fault to create a writeable mapping. 
*/ > > + bool allow_write_mapping; > > + > > + /* Outputs of __kvm_follow_pfn */ > > + hva_t hva; > > + bool writable; > > +}; > > + > > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll); > > + > > kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn); > > kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault, > > bool *writable); > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > index 371bd783ff2b..b13f22861d2f 100644 > > --- a/virt/kvm/kvm_main.c > > +++ b/virt/kvm/kvm_main.c > > @@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned > > long addr) > > * true indicates success, otherwise false is returned. It's also the > > * only part that runs if we can in atomic context. > > */ > > -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault, > > - bool *writable, kvm_pfn_t *pfn) > > +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > > { > > struct page *page[1]; > > + bool write_fault = foll->flags & FOLL_WRITE; > > > > /* > >* Fast pin a writable pfn only if it is a write fault request > >* or the caller allows to map a writable pfn for a read fault > >* request. > >*/ > > - if (!(write_fault || writable)) > > + if (!(write_fault || foll->allow_write_mapping)) > > return false; > > > > - if (get_user_page_fast_only(addr, FOLL_WRITE, page)) { > > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) { > > *pfn = page_to_pfn(page[0]); > > - > > - if (writable) > > - *writable = true; > > + foll->writable = foll->allow_write_mapping; > > return true; > > } > > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, > > bool write_fault, > > * The slow path to get the pfn of the specified host virtual address, > > * 1 indicates success, -errno is returned if error is detected. 
> > */ > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool > > write_fault, > > -bool interruptible, bool *writable, kvm_pfn_t *pfn) > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > > { > > - unsigned int flags = FOLL_HWPOISON; > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags; > > struct page *page; > > int npages; > > > > might_sleep(); > > > > - if (writable) > > - *writable = write_fault; > > - > > - if (write_fault) > > - flags |= FOLL_WRITE; > > - if (async) > > - flags |= FOLL_NOWAIT; > > - if (interruptible) > > - flags |= FOLL_INTERRUPTIBLE; > > - > > - npages = get_user_pages_unlocked(addr, 1, &page, flags); > > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags); > > if (npages != 1) > > return npages; > > > > + foll->writable = (foll->flags & FOLL_WRITE) && > > foll->allow_write_mapping; > > + > > /* map read fault as writable if possible */ > > - if (unlikely(!write_fault) && writable) { > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) { > > struct page *wpage; > > > > - if (get_user_page_fast_only(addr, FOLL_W
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On 05.07.23 09:08, Greg KH wrote: > On Tue, Jul 04, 2023 at 01:22:54PM -0700, Suren Baghdasaryan wrote: >> On Tue, Jul 4, 2023 at 9:18 AM Andrew Morton >> wrote: >>> On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH >>> wrote: Thanks! I'll investigate this later today. After discussing with Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until the issue is fixed. I'll post a patch shortly. >>> >>> Posted at: >>> https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/ >> >> As that change fixes something in 6.4, why not cc: stable on it as well? > > Sorry, I thought since per-VMA locks were introduced in 6.4 and this > patch is fixing 6.4 I didn't need to send it to stable for older > versions. Did I miss something? 6.4.y is a stable kernel tree right now, so yes, it needs to be included there :) >>> >>> I'm in wait-a-few-days-mode on this. To see if we have a backportable >>> fix rather than disabling the feature in -stable. Andrew, how long will you remain in "wait-a-few-days-mode"? Given what Greg said below and that we already had three reports I know of I'd prefer if we could fix this rather sooner than later in mainline -- especially as Arch Linux and openSUSE Tumbleweed likely have switched to 6.4.y already or will do so soon. >> Ok, I think we have a fix posted at [2] and it's cleanly applies to >> 6.4.y stable branch as well. However fork() performance might slightly >> regress, therefore disabling per-VMA locks by default for now seems to >> be preferable even with this fix (see discussion at >> https://lore.kernel.org/all/54cd9ffb-8f4b-003f-c2d6-3b6b0d2cb...@google.com/). >> IOW, both [1] and [2] should be applied to 6.4.y stable. Both apply >> cleanly and I CC'ed stable on [2]. Greg, should I send [1] separately >> to stable@vger? > > We can't do anything for stable until it lands in Linus's tree, so if > you didn't happen to have the stable@ tag in the patch, just email us > the git SHA1 and I can pick it up that way. 
> > thanks, > > greg k-h Ciao, Thorsten
Re: [RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration
On Tue, Jul 4, 2023 at 10:36 PM Yicong Yang wrote: > > On 2023/6/30 1:26, Catalin Marinas wrote: > > On Thu, Jun 29, 2023 at 05:31:36PM +0100, Catalin Marinas wrote: > >> On Thu, May 18, 2023 at 02:59:34PM +0800, Yicong Yang wrote: > >>> From: Barry Song > >>> > >>> on x86, batched and deferred tlb shootdown has lead to 90% > >>> performance increase on tlb shootdown. on arm64, HW can do > >>> tlb shootdown without software IPI. But sync tlbi is still > >>> quite expensive. > >> [...] > >>> .../features/vm/TLB/arch-support.txt | 2 +- > >>> arch/arm64/Kconfig| 1 + > >>> arch/arm64/include/asm/tlbbatch.h | 12 > >>> arch/arm64/include/asm/tlbflush.h | 33 - > >>> arch/arm64/mm/flush.c | 69 +++ > >>> arch/x86/include/asm/tlbflush.h | 5 +- > >>> include/linux/mm_types_task.h | 4 +- > >>> mm/rmap.c | 12 ++-- > >> > >> First of all, this patch needs to be split in some preparatory patches > >> introducing/renaming functions with no functional change for x86. Once > >> done, you can add the arm64-only changes. > >> > > got it. will try to split this patch as suggested. > > >> Now, on the implementation, I had some comments on v7 but we didn't get > >> to a conclusion and the thread eventually died: > >> > >> https://lore.kernel.org/linux-mm/y7ctoj5mwd1zb...@arm.com/ > >> > >> I know I said a command line argument is better than Kconfig or some > >> random number of CPUs heuristics but it would be even better if we don't > >> bother with any, just make this always on. > > ok, will make this always on. > > >> Barry had some comments > >> around mprotect() being racy and that's why we have > >> flush_tlb_batched_pending() but I don't think it's needed (or, for > >> arm64, it can be a DSB since this patch issues the TLBIs but without the > >> DVM Sync). So we need to clarify this (see Barry's last email on the > >> above thread) and before attempting new versions of this patchset. 
With > >> flush_tlb_batched_pending() removed (or DSB), I have a suspicion such > >> implementation would be faster on any SoC irrespective of the number of > >> CPUs. > > > > I think I got the need for flush_tlb_batched_pending(). If > > try_to_unmap() marks the pte !present and we have a pending TLBI, > > change_pte_range() will skip the TLB maintenance altogether since it did > > not change the pte. So we could be left with stale TLB entries after > > mprotect() before TTU does the batch flushing. > > Good catch. This could also be true for MADV_DONTNEED. After try_to_unmap, we run MADV_DONTNEED on this area; as the pte is not present, we don't do anything on this PTE in zap_pte_range afterwards. > > We can have an arch-specific flush_tlb_batched_pending() that can be a > > DSB only on arm64 and a full mm flush on x86. > > > > We need to do a flush/dsb in flush_tlb_batched_pending() only in a race > condition, so we first check whether there's a pending batched flush and > if so do the tlb flush. The pending check is common and the difference > among the archs is how to flush the TLB here within > flush_tlb_batched_pending(); > on arm64 it should only be a dsb. > > As we only need to maintain the TLBs already pending in the batched flush, > does it make sense to only handle those TLBs in flush_tlb_batched_pending()? > Then we can use arch_tlbbatch_flush() rather than flush_tlb_mm() in > flush_tlb_batched_pending() and no arch-specific function is needed. As we have issued no-sync tlbi on those pending addresses, that means our hardware has already "recorded" what should be flushed in the specific mm, so a DSB alone will flush them correctly, right? > > Thanks. > Barry
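The race Catalin describes — a batched unmap clears the PTE and issues a no-sync TLBI, then mprotect()/MADV_DONTNEED sees a non-present PTE, changes nothing, and would skip TLB maintenance, leaving a stale entry — can be captured in a toy model. This is purely illustrative state-machine code, not kernel code: the struct and functions below are hypothetical, and the "flush" stands in for what would be a dsb(ish) on arm64 or a full mm flush on x86.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one mapping plus its cached translation. */
struct mm_model {
	bool pte_present;
	bool tlb_entry_valid;   /* a stale translation is still cached */
	bool tlb_flush_batched; /* a no-sync TLBI has been issued */
};

/* try_to_unmap() path: clear the PTE, broadcast the TLBI, but do not
 * wait for completion -- the stale entry may still be live. */
static void batched_unmap(struct mm_model *mm)
{
	mm->pte_present = false;
	mm->tlb_flush_batched = true;
}

/* The fix under discussion: before relying on PTE state, complete any
 * pending batched invalidations (just a barrier on arm64). */
static void flush_tlb_batched_pending(struct mm_model *mm)
{
	if (mm->tlb_flush_batched) {
		mm->tlb_entry_valid = false; /* dsb(ish) in the real code */
		mm->tlb_flush_batched = false;
	}
}

/* change_pte_range()/zap_pte_range() analogue: a non-present PTE is
 * skipped, so without the call above the stale entry would survive. */
static void mprotect_range(struct mm_model *mm)
{
	flush_tlb_batched_pending(mm);
	if (mm->pte_present) {
		/* would modify the pte and flush here */
	}
}
```

Running the unmap followed by the mprotect path shows the pending flush closing the window that would otherwise leave a stale TLB entry.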
Re: [PATCH 07/12] arch/x86: Declare edid_info in
Hi Arnd On 30.06.23 13:53, Arnd Bergmann wrote: On Fri, Jun 30, 2023, at 09:46, Thomas Zimmermann wrote: On 29.06.23 15:21, Arnd Bergmann wrote: On Thu, Jun 29, 2023, at 15:01, Thomas Zimmermann wrote: On 29.06.23 14:35, Arnd Bergmann wrote: On Thu, Jun 29, 2023, at 13:45, Thomas Zimmermann wrote: FIRMWARE_EDID is a user-selectable feature, while ARCH_HAS_EDID_INFO announces an architecture feature. They do different things. I still have trouble seeing the difference. The idea here is that ARCH_HAS_ signals the architecture's support for the feature. Drivers set 'depends on' in their Kconfig. Another Kconfig token, VIDEO_SCREEN_INFO or FIRMWARE_EDID, would then actually enable the feature. Drivers select VIDEO_SCREEN_INFO or FIRMWARE_EDID and the architectures contain code like Fair enough. In that case, I guess FIRMWARE_EDID will just depend on ARCH_HAS_EDID_INFO, or possibly "depends on FIRMWARE_EDID || EFI" after it starts calling into an EFI specific function, right? #ifdef VIDEO_SCREEN_INFO struct screen_info screen_info = { /* set values here */ } #endif This allows us to disable code that requires screen_info/edid_info, but also disable screen_info/edid_info unless such code has been enabled in the kernel config. Some architectures currently mimic this by guarding screen_info with ifdef CONFIG_VT or similar. I'd like to make this more flexible. The cost of a few more internal Kconfig tokens seems negligible. I definitely get it for the screen_info, which needs the complexity. For ARCH_HAS_EDID_INFO I would hope that it's never selected by anything other than x86, so I would still go with just a dependency on x86 for simplicity, but I don't mind having the extra symbol if that keeps it more consistent with how the screen_info is handled. Well, I'd like to add edid_info to platforms with EFI. That would be arm/arm64 and loongarch, I guess. See below for the future plans.
I suppose you could use FIRMWARE_EDID on EFI or OF systems without the need for a global edid_info structure, but that would not share any code with the current fb_firmware_edid() function. The current code is built on top of screen_info and edid_info. I'd preferably not replace that, if possible. One way I could imagine this looking in the end would be something like struct screen_info *fb_screen_info(struct device *dev) { struct screen_info *si = NULL; if (IS_ENABLED(CONFIG_EFI)) si = efi_get_screen_info(dev); if (IS_ENABLED(CONFIG_ARCH_HAS_SCREEN_INFO) && !si) si = screen_info; return si; } corresponding to fb_firmware_edid(). With this, any driver that wants to access screen_info would call this function instead of using the global pointer, plus either a NULL pointer check or a CONFIG_ARCH_HAS_SCREEN_INFO dependency. This way we could completely eliminate the global screen_info on arm64, riscv, and loongarch but still use the efi and hyperv framebuffer/drm drivers. If possible, I'd like to remove global screen_info and edid_info entirely from fbdev and the various consoles. We currently use screen_info to set up the generic framebuffer device in drivers/firmware/sysfb.c. I'd like to use edid_info here as well, so that the generic graphics drivers can get EDID information. For the few fbdev drivers and consoles that require the global screen_info/edid_info, I'd rather provide lookup functions in sysfb (e.g., sysfb_get_screen_info(), sysfb_get_edid_info()). The global screen_info/edid_info state would then become an internal artifact of the sysfb code. Hopefully that explains some of the decisions made in this patchset. Best regards Thomas Arnd -- Thomas Zimmermann Graphics Driver Developer SUSE Software Solutions Germany GmbH Frankenstrasse 146, 90461 Nuernberg, Germany GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman HRB 36809 (AG Nuernberg)
Re: [PATCH v7 4/8] KVM: x86/mmu: Migrate to __kvm_follow_pfn
On Tue, Jul 04, 2023 at 04:50:49PM +0900, David Stevens wrote: > From: David Stevens > > Migrate from __gfn_to_pfn_memslot to __kvm_follow_pfn. > > Signed-off-by: David Stevens > --- > arch/x86/kvm/mmu/mmu.c | 35 +-- > 1 file changed, 25 insertions(+), 10 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index ec169f5c7dce..e44ab512c3a1 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4296,7 +4296,12 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, > struct kvm_async_pf *work) > static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault > *fault) > { > struct kvm_memory_slot *slot = fault->slot; > - bool async; > + struct kvm_follow_pfn foll = { > + .slot = slot, > + .gfn = fault->gfn, > + .flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0), > + .allow_write_mapping = true, > + }; > > /* >* Retry the page fault if the gfn hit a memslot that is being deleted > @@ -4325,12 +4330,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, > struct kvm_page_fault *fault > return RET_PF_EMULATE; > } > > - async = false; > - fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, > &async, > - fault->write, &fault->map_writable, > - &fault->hva); > - if (!async) > - return RET_PF_CONTINUE; /* *pfn has correct page already */ > + foll.flags |= FOLL_NOWAIT; > + fault->pfn = __kvm_follow_pfn(&foll); > + > + if (!is_error_noslot_pfn(fault->pfn)) > + goto success; > + > + if (fault->pfn != KVM_PFN_ERR_NEEDS_IO) > + return RET_PF_CONTINUE; IIUC, FOLL_NOWAIT is set only when we want an async fault. So KVM_PFN_ERR_NEEDS_IO may not be necessary? B.R. Yu
Re: [PATCH 10/12] fbdev/core: Use fb_is_primary_device() in fb_firmware_edid()
Thomas Zimmermann writes: > Detect the primary VGA device with fb_is_primary_device() in the > implementation of fb_firmware_edid(). Remove the existing code. > An explanation about why this is possible would be useful here. > Adapt the function to receive an instance of struct fb_info and > update all callers. > [...] > -const unsigned char *fb_firmware_edid(struct device *device) > +const unsigned char *fb_firmware_edid(struct fb_info *info) > { > - struct pci_dev *dev = NULL; > - struct resource *res = NULL; > unsigned char *edid = NULL; > > - if (device) > - dev = to_pci_dev(device); > - > - if (dev) > - res = &dev->resource[PCI_ROM_RESOURCE]; > - > - if (res && res->flags & IORESOURCE_ROM_SHADOW) This open codes what used to be the fb_is_primary_device() logic before commit 5ca1479cd35d ("fbdev: Simplify fb_is_primary_device for x86"). But now after that commit there is a functional change, since the ROM shadowing check would be dropped. I believe that's OK, and Sima explains in their commit message that vga_default_device() should be enough and the check is redundant. Still, I think that this change should be documented in your commit message. With that change, Reviewed-by: Javier Martinez Canillas -- Best regards, Javier Martinez Canillas Core Platforms Red Hat
Re: [01/12] efi: Do not include from EFI header
Hi On 05.07.23 03:40, Sui Jingfeng wrote: Hi, Thomas I love your patch. LoongArch also has UEFI GOP support; maybe arch/loongarch/kernel/efi.c doesn't include the '#include ' explicitly. Oh, thank you for testing. I try to build arch/ changes on all of the affected platforms, but I cannot build for all of them. I have no means of building for loongarch ATM. I'll update the patch according to your feedback. Best regards Thomas ``` diff --git a/arch/loongarch/kernel/efi.c b/arch/loongarch/kernel/efi.c index 3d448fef3af4..04f4d217aefb 100644 --- a/arch/loongarch/kernel/efi.c +++ b/arch/loongarch/kernel/efi.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include ``` On 2023/6/29 19:45, Thomas Zimmermann wrote: The header file does not need anything from . Declare struct screen_info and remove the include statements. Update a number of source files that require struct screen_info's definition. Signed-off-by: Thomas Zimmermann Cc: Ard Biesheuvel Cc: Russell King Cc: Catalin Marinas Cc: Will Deacon Reviewed-by: Javier Martinez Canillas With the above issue solved, please take my R-B if you would like.
Reviewed-by: Sui Jingfeng --- arch/arm/kernel/efi.c | 2 ++ arch/arm64/kernel/efi.c | 1 + drivers/firmware/efi/libstub/efi-stub-entry.c | 2 ++ drivers/firmware/efi/libstub/screen_info.c | 2 ++ include/linux/efi.h | 3 ++- 5 files changed, 9 insertions(+), 1 deletion(-) diff --git a/arch/arm/kernel/efi.c b/arch/arm/kernel/efi.c index e2b9d2618c672..e94655ef16bb3 100644 --- a/arch/arm/kernel/efi.c +++ b/arch/arm/kernel/efi.c @@ -5,6 +5,8 @@ #include #include +#include + #include #include #include diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c index baab8dd3ead3c..3afbe503b066f 100644 --- a/arch/arm64/kernel/efi.c +++ b/arch/arm64/kernel/efi.c @@ -9,6 +9,7 @@ #include #include +#include #include #include diff --git a/drivers/firmware/efi/libstub/efi-stub-entry.c b/drivers/firmware/efi/libstub/efi-stub-entry.c index cc4dcaea67fa6..2f1902e5d4075 100644 --- a/drivers/firmware/efi/libstub/efi-stub-entry.c +++ b/drivers/firmware/efi/libstub/efi-stub-entry.c @@ -1,6 +1,8 @@ // SPDX-License-Identifier: GPL-2.0-only #include +#include + #include #include "efistub.h" diff --git a/drivers/firmware/efi/libstub/screen_info.c b/drivers/firmware/efi/libstub/screen_info.c index 4be1c4d1f922b..a51ec201ca3cb 100644 --- a/drivers/firmware/efi/libstub/screen_info.c +++ b/drivers/firmware/efi/libstub/screen_info.c @@ -1,6 +1,8 @@ // SPDX-License-Identifier: GPL-2.0 #include +#include + #include #include "efistub.h" diff --git a/include/linux/efi.h b/include/linux/efi.h index 571d1a6e1b744..360895a5572c0 100644 --- a/include/linux/efi.h +++ b/include/linux/efi.h @@ -24,10 +24,11 @@ #include #include #include -#include #include +struct screen_info; + #define EFI_SUCCESS 0 #define EFI_LOAD_ERROR ( 1 | (1UL << (BITS_PER_LONG-1))) #define EFI_INVALID_PARAMETER ( 2 | (1UL << (BITS_PER_LONG-1))) -- Thomas Zimmermann Graphics Driver Developer SUSE Software Solutions Germany GmbH Frankenstrasse 146, 90461 Nuernberg, Germany GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien 
Moerman HRB 36809 (AG Nuernberg)
Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer
On Monday, 05 June 2023 at 10:07:16 (+0200), Tobias Huschle wrote: > On 2023-05-16 15:36, Vincent Guittot wrote: > > On Mon, 15 May 2023 at 13:46, Tobias Huschle > > wrote: > > > > > > The current load balancer implementation implies that scheduler > > > groups, > > > within the same domain, all host the same number of CPUs. This is > > > reflected in the condition, that a scheduler group, which is load > > > balancing and classified as having spare capacity, should pull work > > > from the busiest group, if the local group runs less processes than > > > the busiest one. This implies that these two groups should run the > > > same number of processes, which is problematic if the groups are not > > > of the same size. > > > > > > The assumption that scheduler groups within the same scheduler domain > > > host the same number of CPUs appears to be true for non-s390 > > > architectures. Nevertheless, s390 can have scheduler groups of unequal > > > size. > > > > > > This introduces a performance degradation in the following scenario: > > > > > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket, > > > the remaining 2 are located on another socket: > > > > > > Socket -1--2- > > > CPU 1 2 3 4 5 67 8 > > > > > > Placing some workload ( x = one task ) yields the following > > > scenarios: > > > > > > The first 5 tasks are distributed evenly across the two groups. > > > > > > Socket -1--2- > > > CPU 1 2 3 4 5 67 8 > > > x x x x x > > > > > > Adding a 6th task yields the following distribution: > > > > > > Socket -1--2- > > > CPU 1 2 3 4 5 67 8 > > > SMT1 x x x x x > > > SMT2x > > > > Your description is a bit confusing for me. What you name CPU above > > should be named Core, doesn't it? > > > > Could you share with us your scheduler topology ? > > > > You are correct, it should say core instead of CPU. > > One actual configuration from one of my test machines (lscpu -e): > [...] > > So, 6 cores / 12 CPUs in one group, 2 cores / 4 CPUs in the other.
Thanks for the details > > If I run stress-ng with 8 cpu stressors on the original code I get a > distribution > like this: > > 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15 > x x x x x x x x > > Which means that the two cores in the smaller group are running into SMT > while two > cores in the larger group are still idle. This is caused by the > prefer_sibling path > which really wants to see both groups run the same number of tasks. Yes, and it considers that there are the same number of CPUs per group > > > > > > The task is added to the 2nd scheduler group, as the scheduler has the > > > assumption that scheduler groups are of the same size, so they should > > > also host the same number of tasks. This makes CPU 7 run into SMT > > > thread, which comes with a performance penalty. This means, that in > > > the window of 6-8 tasks, load balancing is done suboptimally, because > > > SMT is used although there is no reason to do so as fully idle CPUs > > > are still available. > > > > > > Taking the weight of the scheduler groups into account, ensures that > > > a load balancing CPU within a smaller group will not try to pull tasks > > > from a bigger group while the bigger group still has idle CPUs > > > available. > > > > > > Signed-off-by: Tobias Huschle > > > --- > > > kernel/sched/fair.c | 3 ++- > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index 48b6f0ca13ac..b1307d7e4065 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -10426,7 +10426,8 @@ static struct sched_group > > > *find_busiest_group(struct lb_env *env) > > > * group's child domain.
> > > */ > > > if (sds.prefer_sibling && local->group_type == > > > group_has_spare && > > > - busiest->sum_nr_running > local->sum_nr_running + 1) > > > + busiest->sum_nr_running * local->group_weight > > > > + local->sum_nr_running * > > > busiest->group_weight + 1) What you want to test here is that moving 1 task from busiest to local group would help and balance the ratio of tasks per cpu (busiest->sum_nr_running - 1) / busiest->group_weight > (local->sum_nr_running + 1) / local->group_weight which can be developed into busiest->sum_nr_running * local->group_weight >= local->sum_nr_running * busiest->group_weight + busiest->group_weight + local->group_weight and you also have to change how we calculate the imbalance, which currently just provides half of the diff of nr_running, by something like (busiest->sum_nr_running * local->group_weight) - (local->sum_nr_running * busiest->group_weight) / (busiest->group_weight + local->group_weight) > > > > This is the prefer_sibling path. Could it be that you should disable > > prefer_sibling bet
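The two prefer_sibling conditions quoted in the patch hunk above can be compared directly on the machine from this thread (a 12-CPU group and a 4-CPU group). The sketch below is a toy userspace model, not scheduler code; the struct and task counts are illustrative stand-ins for sg_lb_stats fields.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for the fields used by the quoted condition. */
struct sg_model {
	unsigned int sum_nr_running;
	unsigned int group_weight; /* number of CPUs in the group */
};

/* Upstream condition: pull if busiest runs at least two more tasks,
 * implicitly assuming equal-sized groups. */
static bool should_pull_unweighted(const struct sg_model *busiest,
				   const struct sg_model *local)
{
	return busiest->sum_nr_running > local->sum_nr_running + 1;
}

/* Patched condition: compare tasks-per-CPU ratios, so a small group
 * does not pull from a big group that still has idle CPUs. */
static bool should_pull_weighted(const struct sg_model *busiest,
				 const struct sg_model *local)
{
	return busiest->sum_nr_running * local->group_weight >
	       local->sum_nr_running * busiest->group_weight + 1;
}
```

With 6 tasks on the 12-CPU group and 2 on the 4-CPU group, the unweighted test (6 > 3) tells the small group to pull and run into SMT, while the weighted test (24 > 25) correctly declines.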
Re: [PATCH 08/12] drivers/firmware: Remove trailing whitespaces
Thomas Zimmermann writes: > Fix coding style. No functional changes. > > Signed-off-by: Thomas Zimmermann > --- Reviewed-by: Javier Martinez Canillas -- Best regards, Javier Martinez Canillas Core Platforms Red Hat
Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
> > +void kvm_set_page_accessed(struct page *page); > +void kvm_set_page_dirty(struct page *page); > + No one calls these two routines in this patch. How about moving this change to [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn ? ... > @@ -2930,17 +2933,19 @@ static bool kvm_is_ad_tracked_page(struct page *page) > return !PageReserved(page); > } > > -static void kvm_set_page_dirty(struct page *page) > +void kvm_set_page_dirty(struct page *page) > { > if (kvm_is_ad_tracked_page(page)) > SetPageDirty(page); > } > +EXPORT_SYMBOL_GPL(kvm_set_page_dirty); > > -static void kvm_set_page_accessed(struct page *page) > +void kvm_set_page_accessed(struct page *page) > { > if (kvm_is_ad_tracked_page(page)) > mark_page_accessed(page); > } > +EXPORT_SYMBOL_GPL(kvm_set_page_accessed); Same here. B.R. Yu
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Tue, Jul 04, 2023 at 01:22:54PM -0700, Suren Baghdasaryan wrote: > On Tue, Jul 4, 2023 at 9:18 AM Andrew Morton > wrote: > > > > On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH > > wrote: > > > > > > > > > Thanks! I'll investigate this later today. After discussing with > > > > > > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default > > > > > > > until > > > > > > > the issue is fixed. I'll post a patch shortly. > > > > > > > > > > > > Posted at: > > > > > > https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/ > > > > > > > > > > As that change fixes something in 6.4, why not cc: stable on it as > > > > > well? > > > > > > > > Sorry, I thought since per-VMA locks were introduced in 6.4 and this > > > > patch is fixing 6.4 I didn't need to send it to stable for older > > > > versions. Did I miss something? > > > > > > 6.4.y is a stable kernel tree right now, so yes, it needs to be included > > > there :) > > > > I'm in wait-a-few-days-mode on this. To see if we have a backportable > > fix rather than disabling the feature in -stable. > > Ok, I think we have a fix posted at [2] and it's cleanly applies to > 6.4.y stable branch as well. However fork() performance might slightly > regress, therefore disabling per-VMA locks by default for now seems to > be preferable even with this fix (see discussion at > https://lore.kernel.org/all/54cd9ffb-8f4b-003f-c2d6-3b6b0d2cb...@google.com/). > IOW, both [1] and [2] should be applied to 6.4.y stable. Both apply > cleanly and I CC'ed stable on [2]. Greg, should I send [1] separately > to stable@vger? We can't do anything for stable until it lands in Linus's tree, so if you didn't happen to have the stable@ tag in the patch, just email us the git SHA1 and I can pick it up that way. thanks, greg k-h