Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
On Wed, Jul 5, 2023 at 10:19 PM Zhi Wang wrote: > > On Tue, 4 Jul 2023 16:50:48 +0900 > David Stevens wrote: > > > From: David Stevens > > > > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows > > callers to resolve a gfn when the associated pfn has a valid struct page > > that isn't being actively refcounted (e.g. tail pages of non-compound > > higher order pages). For a caller to safely omit FOLL_GET, all usages of > > the returned pfn must be guarded by a mmu notifier. > > > > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that > > is set when the returned pfn has an associated struct page with a valid > > refcount. Callers that don't pass FOLL_GET should remember this value > > and use it to avoid places like kvm_is_ad_tracked_page that assume a > > non-zero refcount. > > > > Signed-off-by: David Stevens > > --- > > include/linux/kvm_host.h | 10 ++ > > virt/kvm/kvm_main.c | 67 +--- > > virt/kvm/pfncache.c | 2 +- > > 3 files changed, 47 insertions(+), 32 deletions(-) > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > index ef2763c2b12e..a45308c7d2d9 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct > > kvm_memory_slot *slot, gfn_t gfn, > > void kvm_release_page_clean(struct page *page); > > void kvm_release_page_dirty(struct page *page); > > > > +void kvm_set_page_accessed(struct page *page); > > +void kvm_set_page_dirty(struct page *page); > > + > > struct kvm_follow_pfn { > > const struct kvm_memory_slot *slot; > > gfn_t gfn; > > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn { > > bool atomic; > > /* Allow a read fault to create a writeable mapping. */ > > bool allow_write_mapping; > > + /* > > + * Usage of the returned pfn will be guared by a mmu notifier. Must > ^guarded > > + * be true if FOLL_GET is not set. 
> > + */ > > + bool guarded_by_mmu_notifier; > > It seems no one sets the guarded_by_mmu_notifier in this patch. Is > guarded_by_mmu_notifier always equal to !foll->FOLL_GET and set by the > caller of __kvm_follow_pfn()? Yes, this is the case. > If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page? > It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a > temporary solution. I don't think it is a good idea to play tricks with > a temporary solution, more like we are abusing the toleration. I'm not sure I understand what you're getting at. This series never calls gup without FOLL_GET. This series aims to provide kvm_follow_pfn as a unified API on top of gup+follow_pte. Since one of the major clients of this API uses an mmu notifier, it makes sense to support returning a pfn without taking a reference. And we indeed need to do that for certain types of memory. > Is a flag like guarded_by_mmu_notifier (perhaps a better name) enough to > indicate a tail page? What do you mean by "to indicate a tail page"? Do you mean to indicate that the returned pfn refers to a non-refcounted page? That's specified by is_refcounted_page. -David
Re: [PATCH] rtc: Kconfig: select REGMAP for RTC_DRV_DS1307
Le 06/07/2023 à 08:14, Benjamin Gray a écrit : > On Thu, 2023-07-06 at 05:13 +, Christophe Leroy wrote: >> >> >> Le 05/07/2023 à 02:30, Benjamin Gray a écrit : >>> The drivers/rtc/rtc-ds1307.c driver has a direct dependency on >>> struct regmap_config, which is guarded behind CONFIG_REGMAP. >>> >>> Commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not select >>> REGMAP") >>> exposed this by disabling the default pick unless KUNIT_ALL_TESTS >>> is >>> set, causing the ppc64be allnoconfig build to fail. >>> >>> Signed-off-by: Benjamin Gray >>> --- >>> drivers/rtc/Kconfig | 1 + >>> 1 file changed, 1 insertion(+) >>> >>> diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig >>> index ffca9a8bb878..7455ebd189fe 100644 >>> --- a/drivers/rtc/Kconfig >>> +++ b/drivers/rtc/Kconfig >>> @@ -246,6 +246,7 @@ config RTC_DRV_AS3722 >>> >>> config RTC_DRV_DS1307 >>> tristate "Dallas/Maxim DS1307/37/38/39/40/41, ST M41T00, >>> EPSON RX-8025, ISL12057" >>> + select REGMAP >> >> As far as I can see, REGMAP defaults to Y when REGMAP_I2C is >> selected. >> Can you explain more in details why you have to select it explicitely >> ? >> If there is something wrong with the logic, then the logic should be >> fixed instead of just adding a selection of REGMAP for that >> particular >> RTC_DRV_DS1307. Because others like RTC_DRV_ABB5ZES3 or >> RTC_DRV_ABEOZ9 >> might have the exact same problem. > > Right, yeah, I don't want to assert this patch is the correct solution, > sending it was more to offer a fix and allow discussion if it should be > resolved some other way (so thanks for replying, I appreciate it). > > In terms of why I made this patch, the way I see it, if a config option > requires another config option be set, then "selects" is the natural > way of phrasing this dependency. "default" on the REGMAP side seems > weird. If it's a default, does that mean it can be overridden? But > RTC_DRV_DS1307 *requires* REGMAP; it's not just a "would be nice". The > build breaks without it. 
I meant "why doesn't it work as is", and/or "why didn't you fix what doesn't work". Well, until commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not select REGMAP") it was not user-selectable so it couldn't be overridden. After commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not select REGMAP") it is overridable so we have an additional problem. Does RTC_DRV_DS1307 require REGMAP or does RTC_DRV_DS1307 require REGMAP_I2C and then REGMAP_I2C requires REGMAP ? I think that huge default like for REGMAP should be replaced by individual selection of REGMAP from each of those config items. For example REGMAP_I2C should select REGMAP, then I guess it would also solve your issue, wouldn't it ? > > But maybe KConfig works differently to my assumptions. Maybe the > referenced patch that causes the build failure is actually incorrect > (CC Geert). I spoke with Joel Stanley (CC) and he indicated you're not > supposed to depend on REGMAP like KUnit does? > > As for other drivers having the same problem, quite possibly, yes. But > the PPC configs don't seem to build those drivers, so I'd prefer to > leave it to anyone who does build them to report the error. I wasn't > planning to audit all the drivers, I was just trying to fix the PPC > configs I build and test. Well ok but that's probably the indication of a deeper problem which deserves a more generic fix.
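Christophe's suggestion — replacing the big `default` expression on REGMAP with individual selections from each bus variant — could look roughly like this hypothetical Kconfig fragment (heavily abbreviated, not the actual tree):

```kconfig
config REGMAP
	bool

config REGMAP_I2C
	tristate
	select REGMAP

config RTC_DRV_DS1307
	tristate "Dallas/Maxim DS1307/37/38/39/40/41, ST M41T00, EPSON RX-8025, ISL12057"
	select REGMAP_I2C
```

With REGMAP_I2C selecting REGMAP directly, the driver's existing `select REGMAP_I2C` pulls in REGMAP transitively, so no per-driver `select REGMAP` would be needed and the KUnit change to REGMAP's default would no longer break these builds.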
Re: [PATCH] rtc: Kconfig: select REGMAP for RTC_DRV_DS1307
On Thu, 2023-07-06 at 05:13 +, Christophe Leroy wrote: > > > Le 05/07/2023 à 02:30, Benjamin Gray a écrit : > > The drivers/rtc/rtc-ds1307.c driver has a direct dependency on > > struct regmap_config, which is guarded behind CONFIG_REGMAP. > > > > Commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not select > > REGMAP") > > exposed this by disabling the default pick unless KUNIT_ALL_TESTS > > is > > set, causing the ppc64be allnoconfig build to fail. > > > > Signed-off-by: Benjamin Gray > > --- > > drivers/rtc/Kconfig | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig > > index ffca9a8bb878..7455ebd189fe 100644 > > --- a/drivers/rtc/Kconfig > > +++ b/drivers/rtc/Kconfig > > @@ -246,6 +246,7 @@ config RTC_DRV_AS3722 > > > > config RTC_DRV_DS1307 > > tristate "Dallas/Maxim DS1307/37/38/39/40/41, ST M41T00, > > EPSON RX-8025, ISL12057" > > + select REGMAP > > As far as I can see, REGMAP defaults to Y when REGMAP_I2C is > selected. > Can you explain more in details why you have to select it explicitely > ? > If there is something wrong with the logic, then the logic should be > fixed instead of just adding a selection of REGMAP for that > particular > RTC_DRV_DS1307. Because others like RTC_DRV_ABB5ZES3 or > RTC_DRV_ABEOZ9 > might have the exact same problem. Right, yeah, I don't want to assert this patch is the correct solution, sending it was more to offer a fix and allow discussion if it should be resolved some other way (so thanks for replying, I appreciate it). In terms of why I made this patch, the way I see it, if a config option requires another config option be set, then "selects" is the natural way of phrasing this dependency. "default" on the REGMAP side seems weird. If it's a default, does that mean it can be overridden? But RTC_DRV_DS1307 *requires* REGMAP; it's not just a "would be nice". The build breaks without it. But maybe KConfig works differently to my assumptions. 
Maybe the referenced patch that causes the build failure is actually incorrect (CC Geert). I spoke with Joel Stanley (CC) and he indicated you're not supposed to depend on REGMAP like KUnit does? As for other drivers having the same problem, quite possibly, yes. But the PPC configs don't seem to build those drivers, so I'd prefer to leave it to anyone who does build them to report the error. I wasn't planning to audit all the drivers, I was just trying to fix the PPC configs I build and test.
[next-20230705] kernel BUG mm/memcontrol.c:3715! (ltp/madvise06)
While running LTP tests (madvise06) on IBM Power9 LPAR booted with 6.4.0-next-20230705 following crash is seen Injecting memory failure for pfn 0x3f79 at process virtual address 0x7fff9b74 Memory failure: 0x3f79: recovery action for clean LRU page: Recovered madvise06 (133636): drop_caches: 3 [ cut here ] kernel BUG at mm/memcontrol.c:3715! Oops: Exception in kernel mode, sig: 5 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=8192 NUMA pSeries Modules linked in: brd overlay exfat vfat fat xfs loop sctp ip6_udp_tunnel udp_tunnel dm_mod nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bonding ip_set tls rfkill nf_tables libcrc32c nfnetlink sunrpc pseries_rng vmx_crypto ext4 mbcache jbd2 sd_mod t10_pi crc64_rocksoft crc64 sg ibmvscsi scsi_transport_srp ibmveth fuse [last unloaded: init_module(O)] CPU: 10 PID: 133636 Comm: madvise06 Tainted: G O 6.4.0-next-20230705 #1 Hardware name: IBM,8375-42A POWER9 (raw) 0x4e0202 0xf05 of:IBM,FW950.80 (VL950_131) hv:phyp pSeries NIP: c054ea88 LR: c028b2a8 CTR: c054e8d0 REGS: c0029dd7b890 TRAP: 0700 Tainted: G O (6.4.0-next-20230705) MSR: 80029033 CR: 28008288 XER: CFAR: c054e904 IRQMASK: 0 GPR00: c028b2a8 c0029dd7bb30 c1431600 c002bc978000 GPR04: c2b3b288 00010192 0001 GPR08: c000f9abb180 0002 c002bc978580 GPR12: c054e8d0 c0001ec53f00 GPR16: GPR20: c0001b2e6578 00400cc0 7fff f000 GPR24: c0029dd7bd30 c0029dd7bd58 c0001b2e6568 GPR28: c0029dd7bde0 0001 0001 c0001b2e6540 NIP [c054ea88] mem_cgroup_read_u64+0x1b8/0x1d0 LR [c028b2a8] cgroup_seqfile_show+0xb8/0x160 Call Trace: [c0029dd7bb50] [c028b2a8] cgroup_seqfile_show+0xb8/0x160 [c0029dd7bbc0] [c0673ba4] kernfs_seq_show+0x44/0x60 [c0029dd7bbe0] [c05c4238] seq_read_iter+0x238/0x620 [c0029dd7bcb0] [c0675064] kernfs_fop_read_iter+0x1d4/0x2c0 [c0029dd7bd00] [c057fbac] vfs_read+0x26c/0x350 [c0029dd7bdc0] [c058077c] ksys_read+0x7c/0x140 [c0029dd7be10] [c0036900] 
system_call_exception+0x140/0x350 [c0029dd7be50] [c000d6a0] system_call_common+0x160/0x2e4 --- interrupt: c00 at 0x7fff9eb41484 NIP: 7fff9eb41484 LR: 10008540 CTR: REGS: c0029dd7be80 TRAP: 0c00 Tainted: G O (6.4.0-next-20230705) MSR: 8280f033 CR: 28002282 XER: IRQMASK: 0 GPR00: 0003 7fffc33de7d0 7fff9ec27300 0013 GPR04: 7fffc33e0aa0 1fff 0013 GPR08: 7fffc33e0aa0 GPR12: 7fff9ecca3a0 GPR16: 10035520 10035b90 100347a8 GPR20: 1002fb68 10063900 2000 1002fb68 GPR24: 004c 1002fa78 7fffc33e0aa0 GPR28: 0013 1fff 1fff NIP [7fff9eb41484] 0x7fff9eb41484 LR [10008540] 0x10008540 --- interrupt: c00 Code: 7fa34800 409effc4 7c0802a6 3861 f8010030 4bfffdfd e8010030 786383e4 7c0803a6 4b6c 7c0802a6 f8010030 <0fe0> 7c0802a6 f8010030 0fe0 ---[ end trace ]--- pstore: backend (nvram) writing error (-1) Kernel panic - not syncing: Fatal exception Rebooting in 10 seconds.. Git bisect points to following patch: commit 29bf1eb7d2abbdfc24c4ef7acf7a51b72dc43d2b memcg: drop kmem.limit_in_bytes Does the testcase madvise06 need an update? 90 tst_res(TINFO, "\tCached: %ld Kb", 91 SAFE_READ_MEMINFO("Cached:") - init_cached); 92 93 print_cgmem("memory.current"); 94 print_cgmem("memory.swap.current"); 95 print_cgmem("memory.kmem.usage_in_bytes”); <<== this line. 96 } If I comment line 95 from the testcase, it completes successfully. - Sachin
Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
On Wed, Jul 5, 2023 at 8:56 PM Yu Zhang wrote: > > On Tue, Jul 04, 2023 at 04:50:48PM +0900, David Stevens wrote: > > From: David Stevens > > > > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows > > callers to resolve a gfn when the associated pfn has a valid struct page > > that isn't being actively refcounted (e.g. tail pages of non-compound > > higher order pages). For a caller to safely omit FOLL_GET, all usages of > > the returned pfn must be guarded by a mmu notifier. > > > > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that > > is set when the returned pfn has an associated struct page with a valid > > refcount. Callers that don't pass FOLL_GET should remember this value > > and use it to avoid places like kvm_is_ad_tracked_page that assume a > > non-zero refcount. > > > > Signed-off-by: David Stevens > > --- > > include/linux/kvm_host.h | 10 ++ > > virt/kvm/kvm_main.c | 67 +--- > > virt/kvm/pfncache.c | 2 +- > > 3 files changed, 47 insertions(+), 32 deletions(-) > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > index ef2763c2b12e..a45308c7d2d9 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct > > kvm_memory_slot *slot, gfn_t gfn, > > void kvm_release_page_clean(struct page *page); > > void kvm_release_page_dirty(struct page *page); > > > > +void kvm_set_page_accessed(struct page *page); > > +void kvm_set_page_dirty(struct page *page); > > + > > struct kvm_follow_pfn { > > const struct kvm_memory_slot *slot; > > gfn_t gfn; > > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn { > > bool atomic; > > /* Allow a read fault to create a writeable mapping. */ > > bool allow_write_mapping; > > + /* > > + * Usage of the returned pfn will be guared by a mmu notifier. Must > > + * be true if FOLL_GET is not set. > > + */ > > + bool guarded_by_mmu_notifier; > > And how? 
Any place to check the invalidate seq? kvm_follow_pfn can't meaningfully validate the seq number, since the mmu notifier locking is handled by the caller. This is more of a sanity check that the API is being used properly, as proposed here [1]. I did deviate from the proposal with a bool instead of some type of integer, since the exact value of mmu_seq wouldn't be useful. [1] https://lore.kernel.org/all/zgvusf7lmkrnd...@google.com/#t -David
Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
On Thu, Jul 6, 2023 at 10:34 AM Isaku Yamahata wrote: > > On Tue, Jul 04, 2023 at 04:50:47PM +0900, > David Stevens wrote: > > > From: David Stevens > > > > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot. > > __kvm_follow_pfn refactors the old API's arguments into a struct and, > > where possible, combines the boolean arguments into a single flags > > argument. > > > > Signed-off-by: David Stevens > > --- > > include/linux/kvm_host.h | 16 > > virt/kvm/kvm_main.c | 171 ++- > > virt/kvm/kvm_mm.h| 3 +- > > virt/kvm/pfncache.c | 8 +- > > 4 files changed, 122 insertions(+), 76 deletions(-) > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > index 9d3ac7720da9..ef2763c2b12e 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -97,6 +97,7 @@ > > #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1) > > #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2) > > #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3) > > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4) > > > > /* > > * error pfns indicate that the gfn is in slot but faild to > > @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct > > kvm_memory_slot *slot, gfn_t gfn, > > void kvm_release_page_clean(struct page *page); > > void kvm_release_page_dirty(struct page *page); > > > > +struct kvm_follow_pfn { > > + const struct kvm_memory_slot *slot; > > + gfn_t gfn; > > + unsigned int flags; > > + bool atomic; > > + /* Allow a read fault to create a writeable mapping. */ > > + bool allow_write_mapping; > > Maybe, make them const for input arguments? Unfortunately using const isn't straightforward as long as the kernel continues to use -Wdeclaration-after-statement. If these fields were const, then they would need to be specified in the initializer when declaring the variable, but that's not necessarily always possible. -David
Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
On Wed, Jul 5, 2023 at 7:53 PM Yu Zhang wrote: > > On Wed, Jul 05, 2023 at 06:22:59PM +0900, David Stevens wrote: > > On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang wrote: > > > > > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, > > > > bool write_fault, > > > > * The slow path to get the pfn of the specified host virtual address, > > > > * 1 indicates success, -errno is returned if error is detected. > > > > */ > > > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool > > > > write_fault, > > > > -bool interruptible, bool *writable, kvm_pfn_t > > > > *pfn) > > > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > > > > { > > > > - unsigned int flags = FOLL_HWPOISON; > > > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags; > > > > struct page *page; > > > > int npages; > > > > > > > > might_sleep(); > > > > > > > > - if (writable) > > > > - *writable = write_fault; > > > > - > > > > - if (write_fault) > > > > - flags |= FOLL_WRITE; > > > > - if (async) > > > > - flags |= FOLL_NOWAIT; > > > > - if (interruptible) > > > > - flags |= FOLL_INTERRUPTIBLE; > > > > - > > > > - npages = get_user_pages_unlocked(addr, 1, &page, flags); > > > > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags); > > > > if (npages != 1) > > > > return npages; > > > > > > > > + foll->writable = (foll->flags & FOLL_WRITE) && > > > > foll->allow_write_mapping; > > > > + > > > > /* map read fault as writable if possible */ > > > > - if (unlikely(!write_fault) && writable) { > > > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) { > > > > > > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here. > > > > The two statements are logically equivalent, although I guess using > > !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more > > verbose. > > Well, as the comment says, we wanna try to map the read fault as writable > whenever possible. 
And __gfn_to_pfn_memslot() will only set the FOLL_WRITE > for write faults. So I guess using !foll->writable will not allow this. > Did I miss anything? We just set the foll->writable out parameter to be equal to ((foll->flags & FOLL_WRITE) && foll->allow_write_mapping). Taking a = foll->flags & FOLL_WRITE and b = foll->allow_write_mapping, we have !(a && b) && b -> (!a || !b) && b -> (!a && b) || (!b && b) -> !a && b. -David
Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
On Thu, Jul 6, 2023 at 11:10 AM Isaku Yamahata wrote: > > On Tue, Jul 04, 2023 at 04:50:50PM +0900, > David Stevens wrote: > > > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c > > index cf2c6426a6fc..46c681dc45e6 100644 > > --- a/arch/x86/kvm/mmu/spte.c > > +++ b/arch/x86/kvm/mmu/spte.c > > @@ -138,7 +138,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct > > kvm_mmu_page *sp, > > const struct kvm_memory_slot *slot, > > unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn, > > u64 old_spte, bool prefetch, bool can_unsync, > > -bool host_writable, u64 *new_spte) > > +bool host_writable, bool is_refcounted, u64 *new_spte) > > { > > int level = sp->role.level; > > u64 spte = SPTE_MMU_PRESENT_MASK; > > @@ -188,6 +188,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct > > kvm_mmu_page *sp, > > > > if (level > PG_LEVEL_4K) > > spte |= PT_PAGE_SIZE_MASK; > > + else if (is_refcounted) > > + spte |= SPTE_MMU_PAGE_REFCOUNTED; > > Is REFCOUNTED for 4K page only? What guarantees that large page doesn't have > FOLL_GET? or can we set the bit for large page? Oh, you're right, it should apply to >4K pages as well. This was based on stale thinking from earlier versions of this series. -David
Re: [PATCH] rtc: Kconfig: select REGMAP for RTC_DRV_DS1307
Le 05/07/2023 à 02:30, Benjamin Gray a écrit : > The drivers/rtc/rtc-ds1307.c driver has a direct dependency on > struct regmap_config, which is guarded behind CONFIG_REGMAP. > > Commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not select REGMAP") > exposed this by disabling the default pick unless KUNIT_ALL_TESTS is > set, causing the ppc64be allnoconfig build to fail. > > Signed-off-by: Benjamin Gray > --- > drivers/rtc/Kconfig | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig > index ffca9a8bb878..7455ebd189fe 100644 > --- a/drivers/rtc/Kconfig > +++ b/drivers/rtc/Kconfig > @@ -246,6 +246,7 @@ config RTC_DRV_AS3722 > > config RTC_DRV_DS1307 > tristate "Dallas/Maxim DS1307/37/38/39/40/41, ST M41T00, EPSON RX-8025, > ISL12057" > + select REGMAP As far as I can see, REGMAP defaults to Y when REGMAP_I2C is selected. Can you explain in more detail why you have to select it explicitly? If there is something wrong with the logic, then the logic should be fixed instead of just adding a selection of REGMAP for that particular RTC_DRV_DS1307. Because others like RTC_DRV_ABB5ZES3 or RTC_DRV_ABEOZ9 might have the exact same problem. > select REGMAP_I2C > select WATCHDOG_CORE if WATCHDOG > help
Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang wrote: > > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote: > > From: David Stevens > > > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map > > memory into the guest that is backed by un-refcounted struct pages - for > > example, higher order non-compound pages allocated by the amdgpu driver > > via ttm_pool_alloc_page. > > I guess you mean the tail pages of the higher order non-compound pages? > And as to the head page, it is said to be set to one coincidentally[*], > and shall not be considered as refcounted. IIUC, refcount of this head > page will be increased and decreased soon in hva_to_pfn_remapped(), so > this may not be a problem(?). But treating this head page differently, > as a refcounted one (e.g., to set the A/D flags), is weird. > > Or maybe I missed some context, e.g., can the head page be allocated to > the guest at all? Yes, this is to allow mapping the tail pages of higher order non-compound pages - I should have been more precise in my wording. The head pages can already be mapped into the guest. Treating the head and tail pages differently would require changing how KVM behaves in a situation it supports today (rather than just adding support for an unsupported situation). Currently, without this series, KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the guest. When that happens, KVM sets the A/D flags. I'm not sure whether that's actually valid behavior, nor do I know whether anyone actually cares about it. But it's what KVM does today, and I would shy away from modifying that behavior without good reason. > > > > The bulk of this change is tracking the is_refcounted_page flag so that > > non-refcounted pages don't trigger page_count() == 0 warnings. This is > > done by storing the flag in an unused bit in the sptes. > > Also, maybe we should mention this only works on x86-64.
> > > > > Signed-off-by: David Stevens > > --- > > arch/x86/kvm/mmu/mmu.c | 44 + > > arch/x86/kvm/mmu/mmu_internal.h | 1 + > > arch/x86/kvm/mmu/paging_tmpl.h | 9 --- > > arch/x86/kvm/mmu/spte.c | 4 ++- > > arch/x86/kvm/mmu/spte.h | 12 - > > arch/x86/kvm/mmu/tdp_mmu.c | 22 ++--- > > 6 files changed, 62 insertions(+), 30 deletions(-) > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > > index e44ab512c3a1..b1607e314497 100644 > > --- a/arch/x86/kvm/mmu/mmu.c > > +++ b/arch/x86/kvm/mmu/mmu.c > > ... > > > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct > > kvm_memory_slot *slot, > > bool host_writable = !fault || fault->map_writable; > > bool prefetch = !fault || fault->prefetch; > > bool write_fault = fault && fault->write; > > + bool is_refcounted = !fault || fault->is_refcounted_page; > > Just wonder, what if a non-refcounted page is prefetched? Or is it possible > in > practice? Prefetching is still done via gfn_to_page_many_atomic, which sets FOLL_GET. That's fixable, but it's not something this series currently does. > ... > > > > @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, > > struct kvm_mmu *mmu, > > */ > > static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page > > *sp, int i) > > { > > - bool host_writable; > > + bool host_writable, is_refcounted; > > gpa_t first_pte_gpa; > > u64 *sptep, spte; > > struct kvm_memory_slot *slot; > > @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, > > struct kvm_mmu_page *sp, int > > sptep = &sp->spt[i]; > > spte = *sptep; > > host_writable = spte & shadow_host_writable_mask; > > + // TODO: is this correct? 
> > + is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED; > > slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn); > > make_spte(vcpu, sp, slot, pte_access, gfn, > > spte_to_pfn(spte), spte, true, false, > > - host_writable, &spte); > > + host_writable, is_refcounted, &spte); > > Could we restrict that a non-refcounted page shall not be used as shadow page? I'm not very familiar with the shadow mmu, so my response might not make sense. But do you mean not allowing non-refcounted pages as the guest page tables shadowed by a kvm_mmu_page? It would probably be possible to do that, and I doubt anyone would care about the restriction. But as far as I can tell, the guest page table is only accessed via kvm_vcpu_read_guest_atomic, which handles non-refcounted pages just fine. -David
Re: [PATCH] powerpc/powermac: Use early_* IO variants in via_calibrate_decr
Le 06/07/2023 à 03:08, Benjamin Gray a écrit : > On a powermac platform, under the call path > >start_kernel > time_init > ppc_md.calibrate_decr (pmac_calibrate_decr) > via_calibrate_decr > > we run ioremap and iounmap. The unmap can enable interrupts > unexpectedly (cond_resched in vunmap_pmd_range), which is warned about > later in the boot sequence in start_kernel. > > Use the early_* variants of these IO functions to prevent this. > > The issue is pre-existing, but is surfaced by commit 721255b9826b > ("genirq: Use a maple tree for interrupt descriptor management"). > It's not clear to me why this causes it to surface. > > Signed-off-by: Benjamin Gray Reviewed-by: Christophe Leroy > --- > arch/powerpc/platforms/powermac/time.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/arch/powerpc/platforms/powermac/time.c > b/arch/powerpc/platforms/powermac/time.c > index 4c5790aff1b5..8633891b7aa5 100644 > --- a/arch/powerpc/platforms/powermac/time.c > +++ b/arch/powerpc/platforms/powermac/time.c > @@ -26,8 +26,8 @@ > #include > #include > > +#include > #include > -#include > #include > #include > #include > @@ -182,7 +182,7 @@ static int __init via_calibrate_decr(void) > return 0; > } > of_node_put(vias); > - via = ioremap(rsrc.start, resource_size(&rsrc)); > + via = early_ioremap(rsrc.start, resource_size(&rsrc)); > if (via == NULL) { > printk(KERN_ERR "Failed to map VIA for timer calibration !\n"); > return 0; > @@ -207,7 +207,7 @@ static int __init via_calibrate_decr(void) > > ppc_tb_freq = (dstart - dend) * 100 / 6; > > - iounmap(via); > + early_iounmap((void *)via, resource_size(&rsrc)); > > return 1; > }
Re: [PATCH] KVM: ppc64: Enable ring-based dirty memory tracking
On 8/6/23 10:34 pm, Kautuk Consul wrote: Need at least a little context in the commit message itself: "Enable ring-based dirty memory tracking on ppc64:" - Enable CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL as ppc64 is weakly ordered. - Enable CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP because the kvmppc_xive_native_set_attr is called in the context of an ioctl syscall and will call kvmppc_xive_native_eq_sync for setting the KVM_DEV_XIVE_EQ_SYNC attribute which will call mark_dirty_page() when there isn't a running vcpu. Implemented the kvm_arch_allow_write_without_running_vcpu to always return true to allow mark_page_dirty_in_slot to mark the page dirty in the memslot->dirty_bitmap in this case. Should kvm_arch_allow_write_without_running_vcpu() only return true in the context of kvmppc_xive_native_eq_sync()? - Set KVM_DIRTY_LOG_PAGE_OFFSET for the ring buffer's physical page offset. - Implement the kvm_arch_mmu_enable_log_dirty_pt_masked function required for the generic KVM code to call. - Add a check to kvmppc_vcpu_run_hv for checking whether the dirty ring is soft full. - Implement the kvm_arch_flush_remote_tlbs_memslot function to support the CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT config option. On testing with live migration it was found that there is around 150-180 ms improvement in overall migration time with this patch. Signed-off-by: Kautuk Consul --- Documentation/virt/kvm/api.rst | 2 +- arch/powerpc/include/uapi/asm/kvm.h | 2 ++ arch/powerpc/kvm/Kconfig| 2 ++ arch/powerpc/kvm/book3s_64_mmu_hv.c | 42 + arch/powerpc/kvm/book3s_hv.c| 3 +++ include/linux/kvm_dirty_ring.h | 5 6 files changed, 55 insertions(+), 1 deletion(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index add067793b90..ce1ebc513bae 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -8114,7 +8114,7 @@ regardless of what has actually been exposed through the CPUID leaf.
8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL -- -:Architectures: x86, arm64 +:Architectures: x86, arm64, ppc64 :Parameters: args[0] - size of the dirty log ring KVM is capable of tracking dirty memory using ring buffers that are diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 9f18fa090f1f..f722309ed7fb 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -33,6 +33,8 @@ /* Not always available, but if it is, this is the correct offset. */ #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 +#define KVM_DIRTY_LOG_PAGE_OFFSET 64 + struct kvm_regs { __u64 pc; __u64 cr; diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig index 902611954200..c93354ec3bd5 100644 --- a/arch/powerpc/kvm/Kconfig +++ b/arch/powerpc/kvm/Kconfig @@ -26,6 +26,8 @@ config KVM select IRQ_BYPASS_MANAGER select HAVE_KVM_IRQ_BYPASS select INTERVAL_TREE + select HAVE_KVM_DIRTY_RING_ACQ_REL + select NEED_KVM_DIRTY_RING_WITH_BITMAP config KVM_BOOK3S_HANDLER bool diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 7f765d5ad436..c92e8022e017 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -2147,3 +2147,45 @@ void kvmppc_mmu_book3s_hv_init(struct kvm_vcpu *vcpu) vcpu->arch.hflags |= BOOK3S_HFLAG_SLB; } + +/* + * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected + * dirty pages. + * + * It write protects selected pages to enable dirty logging for them. 
+ */ +void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm, +struct kvm_memory_slot *slot, +gfn_t gfn_offset, +unsigned long mask) +{ + phys_addr_t base_gfn = slot->base_gfn + gfn_offset; + phys_addr_t start = (base_gfn + __ffs(mask)) << PAGE_SHIFT; + phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT; + + while (start < end) { + pte_t *ptep; + unsigned int shift; + + ptep = find_kvm_secondary_pte(kvm, start, &shift); + + *ptep = __pte(pte_val(*ptep) & ~(_PAGE_WRITE)); On rpt I think you'd need to use kvmppc_radix_update_pte()? + + start += PAGE_SIZE > + } +} + +#ifdef CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP +bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm) +{ + return true; +} +#endif + +#ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT +void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm, + const struct kvm_memory_slot *memslot) +{ + kvm_flush_remote_tlbs(kvm); +} +#endif diff --git a/arch/powerpc
Re: [PATCH] powerpc/mm/book3s64/hash/4k: Add pmd_same callback for 4K page size
On Thu, 6 Jul 2023, Aneesh Kumar K.V wrote: > With commit 0d940a9b270b ("mm/pgtable: allow pte_offset_map[_lock]() to > fail") the kernel is now using pmd_same to compare pmd values that are > pointing to a level 4 page table page. Move the functions out of #ifdef > CONFIG_TRANSPARENT_HUGEPAGE and add a variant that can work with both 4K > and 64K page size. > > kernel BUG at arch/powerpc/include/asm/book3s/64/hash-4k.h:141! > Oops: Exception in kernel mode, sig: 5 [#1] > LE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries > . > NIP [c048aee0] __pte_offset_map_lock+0xf0/0x164 > LR [c048ae78] __pte_offset_map_lock+0x88/0x164 > Call Trace: > 0xc0003f09a340 (unreliable) > __handle_mm_fault+0x1340/0x1980 > handle_mm_fault+0xbc/0x380 > __get_user_pages+0x320/0x550 > get_user_pages_remote+0x13c/0x520 > get_arg_page+0x80/0x1d0 > copy_string_kernel+0xc8/0x250 > kernel_execve+0x11c/0x270 > run_init_process+0xe4/0x10c > kernel_init+0xbc/0x1a0 > ret_from_kernel_user_thread+0x14/0x1c > > Cc: Hugh Dickins > Reported-by: Michael Ellerman > Signed-off-by: Aneesh Kumar K.V Acked-by: Hugh Dickins Thanks for rescuing us so quickly! 
> --- > arch/powerpc/include/asm/book3s/64/hash-4k.h | 6 -- > arch/powerpc/include/asm/book3s/64/hash-64k.h | 5 - > arch/powerpc/include/asm/book3s/64/hash.h | 5 + > 3 files changed, 5 insertions(+), 11 deletions(-) > > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h > b/arch/powerpc/include/asm/book3s/64/hash-4k.h > index b6ac4f86c87b..6472b08fa1b0 100644 > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h > @@ -136,12 +136,6 @@ static inline int hash__pmd_trans_huge(pmd_t pmd) > return 0; > } > > -static inline int hash__pmd_same(pmd_t pmd_a, pmd_t pmd_b) > -{ > - BUG(); > - return 0; > -} > - > static inline pmd_t hash__pmd_mkhuge(pmd_t pmd) > { > BUG(); > diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h > b/arch/powerpc/include/asm/book3s/64/hash-64k.h > index 338e62fbea0b..0bf6fd0bf42a 100644 > --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h > +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h > @@ -263,11 +263,6 @@ static inline int hash__pmd_trans_huge(pmd_t pmd) > (_PAGE_PTE | H_PAGE_THP_HUGE)); > } > > -static inline int hash__pmd_same(pmd_t pmd_a, pmd_t pmd_b) > -{ > - return (((pmd_raw(pmd_a) ^ pmd_raw(pmd_b)) & > ~cpu_to_be64(_PAGE_HPTEFLAGS)) == 0); > -} > - > static inline pmd_t hash__pmd_mkhuge(pmd_t pmd) > { > return __pmd(pmd_val(pmd) | (_PAGE_PTE | H_PAGE_THP_HUGE)); > diff --git a/arch/powerpc/include/asm/book3s/64/hash.h > b/arch/powerpc/include/asm/book3s/64/hash.h > index 17e7a778c856..d4a19e6547ac 100644 > --- a/arch/powerpc/include/asm/book3s/64/hash.h > +++ b/arch/powerpc/include/asm/book3s/64/hash.h > @@ -132,6 +132,11 @@ static inline int get_region_id(unsigned long ea) > return region_id; > } > > +static inline int hash__pmd_same(pmd_t pmd_a, pmd_t pmd_b) > +{ > + return (((pmd_raw(pmd_a) ^ pmd_raw(pmd_b)) & > ~cpu_to_be64(_PAGE_HPTEFLAGS)) == 0); > +} > + > #define hash__pmd_bad(pmd) (pmd_val(pmd) & H_PMD_BAD_BITS) > #define hash__pud_bad(pud) (pud_val(pud) & 
H_PUD_BAD_BITS) > static inline int hash__p4d_bad(p4d_t p4d) > -- > 2.41.0 > >
[PATCH] powerpc/mm/book3s64/hash/4k: Add pmd_same callback for 4K page size
With commit 0d940a9b270b ("mm/pgtable: allow pte_offset_map[_lock]() to fail") the kernel is now using pmd_same to compare pmd values that are pointing to a level 4 page table page. Move the functions out of #ifdef CONFIG_TRANSPARENT_HUGEPAGE and add a variant that can work with both 4K and 64K page size. kernel BUG at arch/powerpc/include/asm/book3s/64/hash-4k.h:141! Oops: Exception in kernel mode, sig: 5 [#1] LE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries . NIP [c048aee0] __pte_offset_map_lock+0xf0/0x164 LR [c048ae78] __pte_offset_map_lock+0x88/0x164 Call Trace: 0xc0003f09a340 (unreliable) __handle_mm_fault+0x1340/0x1980 handle_mm_fault+0xbc/0x380 __get_user_pages+0x320/0x550 get_user_pages_remote+0x13c/0x520 get_arg_page+0x80/0x1d0 copy_string_kernel+0xc8/0x250 kernel_execve+0x11c/0x270 run_init_process+0xe4/0x10c kernel_init+0xbc/0x1a0 ret_from_kernel_user_thread+0x14/0x1c Cc: Hugh Dickins Reported-by: Michael Ellerman Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/hash-4k.h | 6 -- arch/powerpc/include/asm/book3s/64/hash-64k.h | 5 - arch/powerpc/include/asm/book3s/64/hash.h | 5 + 3 files changed, 5 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h index b6ac4f86c87b..6472b08fa1b0 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h @@ -136,12 +136,6 @@ static inline int hash__pmd_trans_huge(pmd_t pmd) return 0; } -static inline int hash__pmd_same(pmd_t pmd_a, pmd_t pmd_b) -{ - BUG(); - return 0; -} - static inline pmd_t hash__pmd_mkhuge(pmd_t pmd) { BUG(); diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h index 338e62fbea0b..0bf6fd0bf42a 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h @@ -263,11 +263,6 @@ static inline int hash__pmd_trans_huge(pmd_t pmd) (_PAGE_PTE | 
H_PAGE_THP_HUGE)); } -static inline int hash__pmd_same(pmd_t pmd_a, pmd_t pmd_b) -{ - return (((pmd_raw(pmd_a) ^ pmd_raw(pmd_b)) & ~cpu_to_be64(_PAGE_HPTEFLAGS)) == 0); -} - static inline pmd_t hash__pmd_mkhuge(pmd_t pmd) { return __pmd(pmd_val(pmd) | (_PAGE_PTE | H_PAGE_THP_HUGE)); } diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h index 17e7a778c856..d4a19e6547ac 100644 --- a/arch/powerpc/include/asm/book3s/64/hash.h +++ b/arch/powerpc/include/asm/book3s/64/hash.h @@ -132,6 +132,11 @@ static inline int get_region_id(unsigned long ea) return region_id; } +static inline int hash__pmd_same(pmd_t pmd_a, pmd_t pmd_b) +{ + return (((pmd_raw(pmd_a) ^ pmd_raw(pmd_b)) & ~cpu_to_be64(_PAGE_HPTEFLAGS)) == 0); +} + #define hash__pmd_bad(pmd) (pmd_val(pmd) & H_PMD_BAD_BITS) #define hash__pud_bad(pud) (pud_val(pud) & H_PUD_BAD_BITS) static inline int hash__p4d_bad(p4d_t p4d) -- 2.41.0
Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote: > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c > index cf2c6426a6fc..46c681dc45e6 100644 > --- a/arch/x86/kvm/mmu/spte.c > +++ b/arch/x86/kvm/mmu/spte.c > @@ -138,7 +138,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page > *sp, > const struct kvm_memory_slot *slot, > unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn, > u64 old_spte, bool prefetch, bool can_unsync, > -bool host_writable, u64 *new_spte) > +bool host_writable, bool is_refcounted, u64 *new_spte) > { > int level = sp->role.level; > u64 spte = SPTE_MMU_PRESENT_MASK; > @@ -188,6 +188,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page > *sp, > > if (level > PG_LEVEL_4K) > spte |= PT_PAGE_SIZE_MASK; > + else if (is_refcounted) > + spte |= SPTE_MMU_PAGE_REFCOUNTED; Is REFCOUNTED set for 4K pages only? What guarantees that a large page doesn't have FOLL_GET? Or can we set the bit for large pages? > > if (shadow_memtype_mask) > spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn, -- Isaku Yamahata
Re: [PATCH v7 4/8] KVM: x86/mmu: Migrate to __kvm_follow_pfn
On Tue, Jul 04, 2023 at 04:50:49PM +0900, David Stevens wrote: > From: David Stevens > > Migrate from __gfn_to_pfn_memslot to __kvm_follow_pfn. > > Signed-off-by: David Stevens > --- > arch/x86/kvm/mmu/mmu.c | 35 +-- > 1 file changed, 25 insertions(+), 10 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index ec169f5c7dce..e44ab512c3a1 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4296,7 +4296,12 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, > struct kvm_async_pf *work) > static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault > *fault) > { > struct kvm_memory_slot *slot = fault->slot; > - bool async; > + struct kvm_follow_pfn foll = { > + .slot = slot, > + .gfn = fault->gfn, > + .flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0), > + .allow_write_mapping = true, > + }; > > /* >* Retry the page fault if the gfn hit a memslot that is being deleted > @@ -4325,12 +4330,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, > struct kvm_page_fault *fault > return RET_PF_EMULATE; > } > > - async = false; > - fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, > &async, > - fault->write, &fault->map_writable, > - &fault->hva); > - if (!async) > - return RET_PF_CONTINUE; /* *pfn has correct page already */ > + foll.flags |= FOLL_NOWAIT; > + fault->pfn = __kvm_follow_pfn(&foll); > + > + if (!is_error_noslot_pfn(fault->pfn)) We have pfn in struct kvm_follow_pfn as output. Can we make __kvm_follow_pfn() return int instead of kvm_pfn_t? KVM_PFN_* seems widely used, though. > + goto success; > + > + if (fault->pfn != KVM_PFN_ERR_NEEDS_IO) > + return RET_PF_CONTINUE; > > if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) { > trace_kvm_try_async_get_page(fault->addr, fault->gfn); > @@ -4348,9 +4355,17 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, > struct kvm_page_fault *fault >* to wait for IO. 
Note, gup always bails if it is unable to quickly >* get a page and a fatal signal, i.e. SIGKILL, is pending. >*/ > - fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL, > - fault->write, &fault->map_writable, > - &fault->hva); > + foll.flags |= FOLL_INTERRUPTIBLE; > + foll.flags &= ~FOLL_NOWAIT; > + fault->pfn = __kvm_follow_pfn(&foll); > + > + if (!is_error_noslot_pfn(fault->pfn)) > + goto success; > + > + return RET_PF_CONTINUE; > +success: > + fault->hva = foll.hva; > + fault->map_writable = foll.writable; > return RET_PF_CONTINUE; > } > > -- > 2.41.0.255.g8b1d071c50-goog > -- Isaku Yamahata
Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
On Tue, Jul 04, 2023 at 04:50:47PM +0900, David Stevens wrote: > From: David Stevens > > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot. > __kvm_follow_pfn refactors the old API's arguments into a struct and, > where possible, combines the boolean arguments into a single flags > argument. > > Signed-off-by: David Stevens > --- > include/linux/kvm_host.h | 16 > virt/kvm/kvm_main.c | 171 ++- > virt/kvm/kvm_mm.h| 3 +- > virt/kvm/pfncache.c | 8 +- > 4 files changed, 122 insertions(+), 76 deletions(-) > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 9d3ac7720da9..ef2763c2b12e 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -97,6 +97,7 @@ > #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1) > #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2) > #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3) > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4) > > /* > * error pfns indicate that the gfn is in slot but faild to > @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct > kvm_memory_slot *slot, gfn_t gfn, > void kvm_release_page_clean(struct page *page); > void kvm_release_page_dirty(struct page *page); > > +struct kvm_follow_pfn { > + const struct kvm_memory_slot *slot; > + gfn_t gfn; > + unsigned int flags; > + bool atomic; > + /* Allow a read fault to create a writeable mapping. */ > + bool allow_write_mapping; Maybe, make them const for input arguments? 
> + > + /* Outputs of __kvm_follow_pfn */ > + hva_t hva; > + bool writable; > +}; > + > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll); > + > kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn); > kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault, > bool *writable); > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 371bd783ff2b..b13f22861d2f 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned > long addr) > * true indicates success, otherwise false is returned. It's also the > * only part that runs if we can in atomic context. > */ > -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault, > - bool *writable, kvm_pfn_t *pfn) > +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > { > struct page *page[1]; > + bool write_fault = foll->flags & FOLL_WRITE; > > /* >* Fast pin a writable pfn only if it is a write fault request >* or the caller allows to map a writable pfn for a read fault >* request. >*/ > - if (!(write_fault || writable)) > + if (!(write_fault || foll->allow_write_mapping)) > return false; > > - if (get_user_page_fast_only(addr, FOLL_WRITE, page)) { > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) { > *pfn = page_to_pfn(page[0]); > - > - if (writable) > - *writable = true; > + foll->writable = foll->allow_write_mapping; > return true; > } > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool > write_fault, > * The slow path to get the pfn of the specified host virtual address, > * 1 indicates success, -errno is returned if error is detected. 
> */ > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault, > -bool interruptible, bool *writable, kvm_pfn_t *pfn) > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > { > - unsigned int flags = FOLL_HWPOISON; > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags; Although adding FOLL_GET doesn't affect the behavior of get_user_pages_unlocked(), I wondered how this affects the next change. It's better to mention it in the commit message: get_user_pages_*() called by hva_to_pfn_{fast, slow} imply FOLL_GET, but __kvm_follow_pfn() doesn't imply FOLL_GET. > struct page *page; > int npages; > > might_sleep(); > > - if (writable) > - *writable = write_fault; > - > - if (write_fault) > - flags |= FOLL_WRITE; > - if (async) > - flags |= FOLL_NOWAIT; > - if (interruptible) > - flags |= FOLL_INTERRUPTIBLE; > - > - npages = get_user_pages_unlocked(addr, 1, &page, flags); > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags); > if (npages != 1) > return npages; > > + foll->writable = (foll->flags & FOLL_WRITE) && > foll->allow_write_mapping; > + > /* map read fault as writable if possible */ > - if (unlikely(!write_fault) && writable) { > + if (unlikely(!foll->writable) && foll->allow_write_mapping) { > struct page *wpage;
Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page
On Wed, 5 Jul 2023, Gerald Schaefer wrote: > On Tue, 4 Jul 2023 10:03:57 -0700 (PDT) > Hugh Dickins wrote: > > On Tue, 4 Jul 2023, Gerald Schaefer wrote: > > > On Sat, 1 Jul 2023 21:32:38 -0700 (PDT) > > > Hugh Dickins wrote: > > > > On Thu, 29 Jun 2023, Hugh Dickins wrote: > > ... > > > > --- a/arch/s390/mm/pgalloc.c > > > > +++ b/arch/s390/mm/pgalloc.c > > > > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page) > > > > * logic described above. Both AA bits are set to 1 to denote a > > > > 4KB-pgtable > > > > * while the PP bits are never used, nor such a page is added to or > > > > removed > > > > * from mm_context_t::pgtable_list. > > > > + * > > > > + * pte_free_defer() overrides those rules: it takes the page off > > > > pgtable_list, > > > > + * and prevents both 2K fragments from being reused. pte_free_defer() > > > > has to > > > > + * guarantee that its pgtable cannot be reused before the RCU grace > > > > period > > > > + * has elapsed (which page_table_free_rcu() does not actually > > > > guarantee). > > > > > > Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not > > > allow reuse before grace period elapsed. And I hope that it does so, by > > > setting the PP bits, which would be noticed in page_table_alloc(), in > > > case the page would be seen there. > > > > > > Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the > > > end of the list, and so they could be seen in page_table_alloc(), but they > > > should not be reused before grace period elapsed and __tlb_remove_table() > > > cleared the PP bits, as far as I understand. > > > > > > So what exactly do you mean with "which page_table_free_rcu() does not > > > actually > > > guarantee"? > > > > I'll answer without locating and re-reading what Jason explained earlier, > > perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table(): > > he may have explained it better. 
And without working out again all the > > MMU_GATHER #defines, and which of them do and do not apply to s390 here. > > > > The detail that sticks in my mind is the fallback in tlb_remove_table() > > Ah ok, I was aware of that "semi-RCU" fallback logic in tlb_remove_table(), > but that is rather a generic issue, and not s390-specific. Yes. > I thought you > meant some s390-oddity here, of which we have a lot, unfortunately... > Of course, we call tlb_remove_table() from our page_table_free_rcu(), so > I guess you could say that page_table_free_rcu() cannot guarantee what > tlb_remove_table() cannot guarantee. > > Maybe change to "which page_table_free_rcu() does not actually guarantee, > by calling tlb_remove_table()", to make it clear that this is not a problem > of page_table_free_rcu() itself. Okay - I'll rephrase slightly to avoid being sued by s390's lawyers :-) > > > in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot > > batch the tables for freeing by RCU, and resorts instead to an immediate > > TLB flush (I think: that again involves chasing definitions) followed by > > tlb_remove_table_sync_one() - which just delivers an interrupt to each CPU, > > and is commented: > > /* > > * This isn't an RCU grace period and hence the page-tables cannot be > > * assumed to be actually RCU-freed. > > * > > * It is however sufficient for software page-table walkers that rely on > > * IRQ disabling. > > */ > > > > Whether that's good for your PP pages or not, I've given no thought: > > I've just taken it on trust that what s390 has working today is good. > > Yes, we should be fine with that, current code can be trusted :-) Glad to hear it :-) Yes, I think it's not actually relying on the "rcu" implied by the function name. > > > > > If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(), > > then I would not have written "(which page_table_free_rcu() does not > > actually guarantee)". 
But it cannot use call_rcu() because it does > > not have an rcu_head to work with - it's in some generic code, and > > there is no MMU_GATHER_CAN_USE_PAGE_RCU_HEAD for architectures to set. > > > > And Jason would have much preferred us to address the issue from that > > angle; but not only would doing so destroy my sanity, I'd also destroy > > 20 architectures TLB-flushing, unbuilt and untested, in the attempt. > > Oh yes, if your changes would have allowed to get rid of that "semi RCU" > logic, that would really be a major boost in popularity, I guess. But > it probably is as it is, because it is not so easily fixed... I'm hoping that this series might help stir someone else to get into that. > > > > > ... > > > > @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, > > > > unsigned long *table) > > > > */ > > > > mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit > > > > + 24)); > > > > mask >>= 24; > > > > - if (mask & 0x03U) > > > > + if ((mask & 0
[PATCH] powerpc/powermac: Use early_* IO variants in via_calibrate_decr
On a powermac platform, under the call path start_kernel time_init ppc_md.calibrate_decr (pmac_calibrate_decr) via_calibrate_decr we run ioremap and iounmap. The unmap can enable interrupts unexpectedly (cond_resched in vunmap_pmd_range), which is warned about later in the boot sequence in start_kernel. Use the early_* variants of these IO functions to prevent this. The issue is pre-existing, but is surfaced by commit 721255b9826b ("genirq: Use a maple tree for interrupt descriptor management"). It's not clear to me why this causes it to surface. Signed-off-by: Benjamin Gray --- arch/powerpc/platforms/powermac/time.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/platforms/powermac/time.c b/arch/powerpc/platforms/powermac/time.c index 4c5790aff1b5..8633891b7aa5 100644 --- a/arch/powerpc/platforms/powermac/time.c +++ b/arch/powerpc/platforms/powermac/time.c @@ -26,8 +26,8 @@ #include #include +#include #include -#include #include #include #include @@ -182,7 +182,7 @@ static int __init via_calibrate_decr(void) return 0; } of_node_put(vias); - via = ioremap(rsrc.start, resource_size(&rsrc)); + via = early_ioremap(rsrc.start, resource_size(&rsrc)); if (via == NULL) { printk(KERN_ERR "Failed to map VIA for timer calibration !\n"); return 0; @@ -207,7 +207,7 @@ static int __init via_calibrate_decr(void) ppc_tb_freq = (dstart - dend) * 100 / 6; - iounmap(via); + early_iounmap((void *)via, resource_size(&rsrc)); return 1; } -- 2.41.0
Re: [PATCH v2 08/92] fs: new helper: simple_rename_timestamp
On Thu, 2023-07-06 at 08:19 +0900, Damien Le Moal wrote: > On 7/6/23 03:58, Jeff Layton wrote: > > A rename potentially involves updating 4 different inode timestamps. Add > > a function that handles the details sanely, and convert the libfs.c > > callers to use it. > > > > Signed-off-by: Jeff Layton > > --- > > fs/libfs.c | 36 +++- > > include/linux/fs.h | 2 ++ > > 2 files changed, 29 insertions(+), 9 deletions(-) > > > > diff --git a/fs/libfs.c b/fs/libfs.c > > index a7e56baf8bbd..9ee79668c909 100644 > > --- a/fs/libfs.c > > +++ b/fs/libfs.c > > @@ -692,6 +692,31 @@ int simple_rmdir(struct inode *dir, struct dentry > > *dentry) > > } > > EXPORT_SYMBOL(simple_rmdir); > > > > +/** > > + * simple_rename_timestamp - update the various inode timestamps for rename > > + * @old_dir: old parent directory > > + * @old_dentry: dentry that is being renamed > > + * @new_dir: new parent directory > > + * @new_dentry: target for rename > > + * > > + * POSIX mandates that the old and new parent directories have their ctime > > and > > + * mtime updated, and that inodes of @old_dentry and @new_dentry (if any), > > have > > + * their ctime updated. 
> > + */ > > +void simple_rename_timestamp(struct inode *old_dir, struct dentry > > *old_dentry, > > +struct inode *new_dir, struct dentry *new_dentry) > > +{ > > + struct inode *newino = d_inode(new_dentry); > > + > > + old_dir->i_mtime = inode_set_ctime_current(old_dir); > > + if (new_dir != old_dir) > > + new_dir->i_mtime = inode_set_ctime_current(new_dir); > > + inode_set_ctime_current(d_inode(old_dentry)); > > + if (newino) > > + inode_set_ctime_current(newino); > > +} > > +EXPORT_SYMBOL_GPL(simple_rename_timestamp); > > + > > int simple_rename_exchange(struct inode *old_dir, struct dentry > > *old_dentry, > >struct inode *new_dir, struct dentry *new_dentry) > > { > > @@ -707,11 +732,7 @@ int simple_rename_exchange(struct inode *old_dir, > > struct dentry *old_dentry, > > inc_nlink(old_dir); > > } > > } > > - old_dir->i_ctime = old_dir->i_mtime = > > - new_dir->i_ctime = new_dir->i_mtime = > > - d_inode(old_dentry)->i_ctime = > > - d_inode(new_dentry)->i_ctime = current_time(old_dir); > > - > > + simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry); > > This is somewhat changing the current behavior: before the patch, the mtime > and > ctime of old_dir, new_dir and the inodes associated with the dentries are > always > equal. But given that simple_rename_timestamp() calls > inode_set_ctime_current() > 4 times, the times could potentially be different. > > I am not sure if that is an issue, but it seems that calling > inode_set_ctime_current() once, recording the "now" time it sets and using > that > value to set all times may be more efficient and preserve the existing > behavior. > I don't believe it's an issue. I've seen nothing in the POSIX spec that mandates that timestamp updates to different inodes involved in an operation be set to the _same_ value. It just says they must be updated. It's also hard to believe that any software would depend on this either, given that it's very inconsistent across filesystems today. 
AFAICT, this was mostly done in the past just as a matter of convenience. The other problem with doing it that way is that it assumes that current_time(inode) should always return the same value when given different inodes. Is it really correct to do this? inode_set_ctime(dir, inode_set_ctime_current(inode)); "dir" and "inode" are different inodes, after all, and you're setting dir's timestamp to "inode"'s value. It's not a big deal today since they're always on the same sb, but the ultimate goal of these changes is to implement multigrain timestamps. That will mean fetching a fine-grained timestamp for an update when the existing mtime or ctime value has been queried via getattr. With that change, I think it's best that we treat updates to different inodes individually, as some of them may require updating with a fine-grained timestamp and some may not. > > return 0; > > } > > EXPORT_SYMBOL_GPL(simple_rename_exchange); > > @@ -720,7 +741,6 @@ int simple_rename(struct mnt_idmap *idmap, struct inode > > *old_dir, > > struct dentry *old_dentry, struct inode *new_dir, > > struct dentry *new_dentry, unsigned int flags) > > { > > - struct inode *inode = d_inode(old_dentry); > > int they_are_dirs = d_is_dir(old_dentry); > > > > if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE)) > > @@ -743,9 +763,7 @@ int simple_rename(struct mnt_idmap *idmap, struct inode > > *old_dir, > > inc_nlink(new_dir); > > } > > > > - old_dir->i_ctime = old_dir->i_mtime = new_dir->i_ctime = > > - new_dir->i_mtime = inode->i_ctime = current_time(old_dir); > > - > > + simple_rename_times
Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page
On Wed, 5 Jul 2023, Alexander Gordeev wrote: > On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote: > > On Thu, 29 Jun 2023, Hugh Dickins wrote: > > Hi Hugh, > > ... > > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable) > > +{ > > + struct page *page; > > If I got your and Claudio conversation right, you were going to add > here WARN_ON_ONCE() in case of mm_alloc_pgste(mm)? Well, Claudio approved, so I would have put it in, if we had stuck with that version which had "if (mm_alloc_pgste(mm)) {" in pte_free_defer(); but once that went away, it became somewhat irrelevant... to me anyway. But I don't mind adding it here, in the v3 I'll post when -rc1 is out, if it might help you guys - there is some point, since pte_free_defer() is a route which can usefully check for such a case, without confusion from harmless traffic from immediate frees of just-in-case allocations. But don't expect it to catch all such cases (if they exist): another category of s390 page_table_free()s comes from the PageAnon zap_deposited_table() in zap_huge_pmd(): those tables might or might not have been exposed to userspace at some time in the past. I'll add the WARN_ON_ONCE in pte_free_defer() (after checking that WARN_ON_ONCE is the one we want - I get confused by all the different flavours of WARN, and have to check the header file each time to be sure of the syntax and semantics): but be aware that it won't be checking all potential cases. Hugh > > > + page = virt_to_page(pgtable); > > + SetPageActive(page); > > + page_table_free(mm, (unsigned long *)pgtable); > > +} > > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > + > > /* > > * Base infrastructure required to generate basic asces, region, segment, > > * and page tables that do not make use of enhanced features like EDAT1. > > Thanks!
Re: [PATCH v2 92/92] fs: rename i_ctime field to __i_ctime
On 7/6/23 03:58, Jeff Layton wrote: > Now that everything in-tree is converted to use the accessor functions, > rename the i_ctime field in the inode to discourage direct access. > > Signed-off-by: Jeff Layton Looks OK to me. Reviewed-by: Damien Le Moal -- Damien Le Moal Western Digital Research
Re: [PATCH v2 08/92] fs: new helper: simple_rename_timestamp
On 7/6/23 03:58, Jeff Layton wrote: > A rename potentially involves updating 4 different inode timestamps. Add > a function that handles the details sanely, and convert the libfs.c > callers to use it. > > Signed-off-by: Jeff Layton > --- > fs/libfs.c | 36 +++- > include/linux/fs.h | 2 ++ > 2 files changed, 29 insertions(+), 9 deletions(-) > > diff --git a/fs/libfs.c b/fs/libfs.c > index a7e56baf8bbd..9ee79668c909 100644 > --- a/fs/libfs.c > +++ b/fs/libfs.c > @@ -692,6 +692,31 @@ int simple_rmdir(struct inode *dir, struct dentry > *dentry) > } > EXPORT_SYMBOL(simple_rmdir); > > +/** > + * simple_rename_timestamp - update the various inode timestamps for rename > + * @old_dir: old parent directory > + * @old_dentry: dentry that is being renamed > + * @new_dir: new parent directory > + * @new_dentry: target for rename > + * > + * POSIX mandates that the old and new parent directories have their ctime > and > + * mtime updated, and that inodes of @old_dentry and @new_dentry (if any), > have > + * their ctime updated. 
> + */ > +void simple_rename_timestamp(struct inode *old_dir, struct dentry > *old_dentry, > + struct inode *new_dir, struct dentry *new_dentry) > +{ > + struct inode *newino = d_inode(new_dentry); > + > + old_dir->i_mtime = inode_set_ctime_current(old_dir); > + if (new_dir != old_dir) > + new_dir->i_mtime = inode_set_ctime_current(new_dir); > + inode_set_ctime_current(d_inode(old_dentry)); > + if (newino) > + inode_set_ctime_current(newino); > +} > +EXPORT_SYMBOL_GPL(simple_rename_timestamp); > + > int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry, > struct inode *new_dir, struct dentry *new_dentry) > { > @@ -707,11 +732,7 @@ int simple_rename_exchange(struct inode *old_dir, struct > dentry *old_dentry, > inc_nlink(old_dir); > } > } > - old_dir->i_ctime = old_dir->i_mtime = > - new_dir->i_ctime = new_dir->i_mtime = > - d_inode(old_dentry)->i_ctime = > - d_inode(new_dentry)->i_ctime = current_time(old_dir); > - > + simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry); This is somewhat changing the current behavior: before the patch, the mtime and ctime of old_dir, new_dir and the inodes associated with the dentries are always equal. But given that simple_rename_timestamp() calls inode_set_ctime_current() 4 times, the times could potentially be different. I am not sure if that is an issue, but it seems that calling inode_set_ctime_current() once, recording the "now" time it sets and using that value to set all times may be more efficient and preserve the existing behavior. 
> return 0; > } > EXPORT_SYMBOL_GPL(simple_rename_exchange); > @@ -720,7 +741,6 @@ int simple_rename(struct mnt_idmap *idmap, struct inode > *old_dir, > struct dentry *old_dentry, struct inode *new_dir, > struct dentry *new_dentry, unsigned int flags) > { > - struct inode *inode = d_inode(old_dentry); > int they_are_dirs = d_is_dir(old_dentry); > > if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE)) > @@ -743,9 +763,7 @@ int simple_rename(struct mnt_idmap *idmap, struct inode > *old_dir, > inc_nlink(new_dir); > } > > - old_dir->i_ctime = old_dir->i_mtime = new_dir->i_ctime = > - new_dir->i_mtime = inode->i_ctime = current_time(old_dir); > - > + simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry); > return 0; > } > EXPORT_SYMBOL(simple_rename); > diff --git a/include/linux/fs.h b/include/linux/fs.h > index bdfbd11a5811..14e38bd900f1 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -2979,6 +2979,8 @@ extern int simple_open(struct inode *inode, struct file > *file); > extern int simple_link(struct dentry *, struct inode *, struct dentry *); > extern int simple_unlink(struct inode *, struct dentry *); > extern int simple_rmdir(struct inode *, struct dentry *); > +void simple_rename_timestamp(struct inode *old_dir, struct dentry > *old_dentry, > + struct inode *new_dir, struct dentry *new_dentry); > extern int simple_rename_exchange(struct inode *old_dir, struct dentry > *old_dentry, > struct inode *new_dir, struct dentry > *new_dentry); > extern int simple_rename(struct mnt_idmap *, struct inode *, -- Damien Le Moal Western Digital Research
Re: [PATCH v2 07/92] fs: add ctime accessors infrastructure
On 7/6/23 03:58, Jeff Layton wrote: > struct timespec64 has unused bits in the tv_nsec field that can be used > for other purposes. In future patches, we're going to change how the > inode->i_ctime is accessed in certain inodes in order to make use of > them. In order to do that safely though, we'll need to eradicate raw > accesses of the inode->i_ctime field from the kernel. > > Add new accessor functions for the ctime that we use to replace them. > > Reviewed-by: Jan Kara > Reviewed-by: Luis Chamberlain > Signed-off-by: Jeff Layton Looks OK to me. Reviewed-by: Damien Le Moal -- Damien Le Moal Western Digital Research
Re: [PATCH v2 00/89] fs: new accessors for inode->i_ctime
On Wed, 2023-07-05 at 14:58 -0400, Jeff Layton wrote: > v2: > - prepend patches to add missing ctime updates > - add simple_rename_timestamp helper function > - rename ctime accessor functions as inode_get_ctime/inode_set_ctime_* > - drop individual inode_ctime_set_{sec,nsec} helpers > > I've been working on a patchset to change how the inode->i_ctime is > accessed in order to give us conditional, high-res timestamps for the > ctime and mtime. struct timespec64 has unused bits in it that we can use > to implement this. In order to do that however, we need to wrap all > accesses of inode->i_ctime to ensure that bits used as flags are > appropriately handled. > > The patchset starts with reposts of some missing ctime updates that I > spotted in the tree. It then adds a new helper function for updating the > timestamp after a successful rename, and new ctime accessor > infrastructure. > > The bulk of the patchset is individual conversions of different > subsystems to use the new infrastructure. Finally, the patchset renames > the i_ctime field to __i_ctime to help ensure that I didn't miss > anything. > > This should apply cleanly to linux-next as of this morning. > > Most of this conversion was done via 5 different coccinelle scripts, run > in succession, with a large swath of by-hand conversions to clean up the > remainder. > A couple of other things I should note: If you sent me an Acked-by or Reviewed-by in the previous set, then I tried to keep it on the patch here, since the respun patches are mostly just renaming stuff from v1. Let me know if I've missed any. I've also pushed the pile to my tree as this tag: https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/tag/?h=ctime.20230705 In case that's easier to work with. Cheers, -- Jeff Layton
Re: [PATCH v4 12/13] s390/kexec: refactor for kernel/Kconfig.kexec
On 7/5/23 11:23, Eric DeVolder wrote: On 7/5/23 10:49, Nathan Chancellor wrote: Hi Eric, On Wed, Jul 05, 2023 at 10:20:03AM -0400, Eric DeVolder wrote: The kexec and crash kernel options are provided in the common kernel/Kconfig.kexec. Utilize the common options and provide the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the equivalent set of KEXEC and CRASH options. NOTE: The original Kconfig has a KEXEC_SIG which depends on MODULE_SIG_FORMAT. However, attempts to keep the MODULE_SIG_FORMAT dependency (using the strategy outlined in this series, and other techniques) results in 'error: recursive dependency detected' on CRYPTO. Per Alexander Gordeev : "the MODULE_SIG_FORMAT dependency was introduced with [git commit below] and in fact was not necessary, since s390 did/does not use mod_check_sig() anyway. commit c8424e776b09 ("MODSIGN: Export module signature definitions") MODULE_SIG_FORMAT is needed to select SYSTEM_DATA_VERIFICATION. But SYSTEM_DATA_VERIFICATION is also selected by FS_VERITY*, so dropping MODULE_SIG_FORMAT does not hurt." Therefore, the solution is to drop the MODULE_SIG_FORMAT dependency from KEXEC_SIG. Still results in equivalent .config files for s390. Signed-off-by: Eric DeVolder Acked-by: Alexander Gordeev --- arch/s390/Kconfig | 65 ++- 1 file changed, 19 insertions(+), 46 deletions(-) diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 5b39918b7042..5d4fbbfdd1cd 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -244,6 +244,25 @@ config PGTABLE_LEVELS source "kernel/livepatch/Kconfig" +config ARCH_DEFAULT_KEXEC + def_bool y + +config ARCH_SUPPORTS_KEXEC + def_bool y + +config ARCH_SUPPORTS_KEXEC_FILE + def_bool CRYPTO && CRYPTO_SHA256 && CRYPTO_SHA256_S390 + +config ARCH_HAS_KEXEC_PURGATORY + def_bool KEXEC_FILE + +config ARCH_SUPPORTS_CRASH_DUMP + def_bool y + help + Refer to for more details on this. + This option also enables s390 zfcpdump. 
+ See also + menu "Processor type and features" config HAVE_MARCH_Z10_FEATURES @@ -482,36 +501,6 @@ config SCHED_TOPOLOGY source "kernel/Kconfig.hz" -config KEXEC - def_bool y - select KEXEC_CORE - -config KEXEC_FILE - bool "kexec file based system call" - select KEXEC_CORE - depends on CRYPTO - depends on CRYPTO_SHA256 - depends on CRYPTO_SHA256_S390 - help - Enable the kexec file based system call. In contrast to the normal - kexec system call this system call takes file descriptors for the - kernel and initramfs as arguments. - -config ARCH_HAS_KEXEC_PURGATORY - def_bool y - depends on KEXEC_FILE - -config KEXEC_SIG - bool "Verify kernel signature during kexec_file_load() syscall" - depends on KEXEC_FILE && MODULE_SIG_FORMAT - help - This option makes kernel signature verification mandatory for - the kexec_file_load() syscall. - - In addition to that option, you need to enable signature - verification for the corresponding kernel image type being - loaded in order for this to work. - config KERNEL_NOBP def_bool n prompt "Enable modified branch prediction for the kernel by default" @@ -733,22 +722,6 @@ config VFIO_AP endmenu -menu "Dump support" - -config CRASH_DUMP - bool "kernel crash dumps" - select KEXEC - help - Generate crash dump after being started by kexec. - Crash dump kernels are loaded in the main kernel with kexec-tools - into a specially reserved region and then later executed after - a crash by kdump/kexec. - Refer to for more details on this. - This option also enables s390 zfcpdump. - See also - -endmenu - config CCW def_bool y -- 2.31.1 I just bisected the following build failure visible with 'ARCH=s390 allnoconfig' to this change as commit 842ce0e1dafa ("s390/kexec: refactor for kernel/Kconfig.kexec") in -next. 
arch/s390/kernel/machine_kexec.c:120:37: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration 120 | static bool kdump_csum_valid(struct kimage *image) | ^~ arch/s390/kernel/machine_kexec.c:188:34: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration 188 | int machine_kexec_prepare(struct kimage *image) | ^~ arch/s390/kernel/machine_kexec.c: In function 'machine_kexec_prepare': arch/s390/kernel/machine_kexec.c:192:18: error: invalid use of undefined type 'struct kimage' 192 | if (image->type == KEXEC_TYPE_CRASH) | ^~ arch/s390/kernel/machine_kexec.c:192:28: error: 'KEXEC_TYPE_CRASH' undeclared (first use in this function); did you mean 'KEXEC_ON_CRASH'? 192 | if (image->type == KEXEC_TYPE_CRASH) |
[PATCH v2 92/92] fs: rename i_ctime field to __i_ctime
Now that everything in-tree is converted to use the accessor functions, rename the i_ctime field in the inode to discourage direct access. Signed-off-by: Jeff Layton --- include/linux/fs.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 14e38bd900f1..b66442f91835 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -642,7 +642,7 @@ struct inode { loff_t i_size; struct timespec64 i_atime; struct timespec64 i_mtime; - struct timespec64 i_ctime; + struct timespec64 __i_ctime; /* use inode_*_ctime accessors! */ spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */ unsigned short i_bytes; u8 i_blkbits; @@ -1485,7 +1485,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode); */ static inline struct timespec64 inode_get_ctime(const struct inode *inode) { - return inode->i_ctime; + return inode->__i_ctime; } /** @@ -1498,7 +1498,7 @@ static inline struct timespec64 inode_get_ctime(const struct inode *inode) static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts) { - inode->i_ctime = ts; + inode->__i_ctime = ts; return ts; } -- 2.41.0
[PATCH v2 08/92] fs: new helper: simple_rename_timestamp
A rename potentially involves updating 4 different inode timestamps. Add a function that handles the details sanely, and convert the libfs.c callers to use it. Signed-off-by: Jeff Layton --- fs/libfs.c | 36 +++- include/linux/fs.h | 2 ++ 2 files changed, 29 insertions(+), 9 deletions(-) diff --git a/fs/libfs.c b/fs/libfs.c index a7e56baf8bbd..9ee79668c909 100644 --- a/fs/libfs.c +++ b/fs/libfs.c @@ -692,6 +692,31 @@ int simple_rmdir(struct inode *dir, struct dentry *dentry) } EXPORT_SYMBOL(simple_rmdir); +/** + * simple_rename_timestamp - update the various inode timestamps for rename + * @old_dir: old parent directory + * @old_dentry: dentry that is being renamed + * @new_dir: new parent directory + * @new_dentry: target for rename + * + * POSIX mandates that the old and new parent directories have their ctime and + * mtime updated, and that inodes of @old_dentry and @new_dentry (if any), have + * their ctime updated. + */ +void simple_rename_timestamp(struct inode *old_dir, struct dentry *old_dentry, +struct inode *new_dir, struct dentry *new_dentry) +{ + struct inode *newino = d_inode(new_dentry); + + old_dir->i_mtime = inode_set_ctime_current(old_dir); + if (new_dir != old_dir) + new_dir->i_mtime = inode_set_ctime_current(new_dir); + inode_set_ctime_current(d_inode(old_dentry)); + if (newino) + inode_set_ctime_current(newino); +} +EXPORT_SYMBOL_GPL(simple_rename_timestamp); + int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry) { @@ -707,11 +732,7 @@ int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry, inc_nlink(old_dir); } } - old_dir->i_ctime = old_dir->i_mtime = - new_dir->i_ctime = new_dir->i_mtime = - d_inode(old_dentry)->i_ctime = - d_inode(new_dentry)->i_ctime = current_time(old_dir); - + simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry); return 0; } EXPORT_SYMBOL_GPL(simple_rename_exchange); @@ -720,7 +741,6 @@ int simple_rename(struct 
mnt_idmap *idmap, struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry, unsigned int flags) { - struct inode *inode = d_inode(old_dentry); int they_are_dirs = d_is_dir(old_dentry); if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE)) @@ -743,9 +763,7 @@ int simple_rename(struct mnt_idmap *idmap, struct inode *old_dir, inc_nlink(new_dir); } - old_dir->i_ctime = old_dir->i_mtime = new_dir->i_ctime = - new_dir->i_mtime = inode->i_ctime = current_time(old_dir); - + simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry); return 0; } EXPORT_SYMBOL(simple_rename); diff --git a/include/linux/fs.h b/include/linux/fs.h index bdfbd11a5811..14e38bd900f1 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2979,6 +2979,8 @@ extern int simple_open(struct inode *inode, struct file *file); extern int simple_link(struct dentry *, struct inode *, struct dentry *); extern int simple_unlink(struct inode *, struct dentry *); extern int simple_rmdir(struct inode *, struct dentry *); +void simple_rename_timestamp(struct inode *old_dir, struct dentry *old_dentry, +struct inode *new_dir, struct dentry *new_dentry); extern int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry); extern int simple_rename(struct mnt_idmap *, struct inode *, -- 2.41.0
[PATCH v2 07/92] fs: add ctime accessors infrastructure
struct timespec64 has unused bits in the tv_nsec field that can be used for other purposes. In future patches, we're going to change how the inode->i_ctime is accessed in certain inodes in order to make use of them. In order to do that safely though, we'll need to eradicate raw accesses of the inode->i_ctime field from the kernel. Add new accessor functions for the ctime that we use to replace them. Reviewed-by: Jan Kara Reviewed-by: Luis Chamberlain Signed-off-by: Jeff Layton --- fs/inode.c | 16 include/linux/fs.h | 45 - 2 files changed, 60 insertions(+), 1 deletion(-) diff --git a/fs/inode.c b/fs/inode.c index d37fad91c8da..21b026d95b51 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -2499,6 +2499,22 @@ struct timespec64 current_time(struct inode *inode) } EXPORT_SYMBOL(current_time); +/** + * inode_set_ctime_current - set the ctime to current_time + * @inode: inode + * + * Set the inode->i_ctime to the current value for the inode. Returns + * the current value that was assigned to i_ctime. + */ +struct timespec64 inode_set_ctime_current(struct inode *inode) +{ + struct timespec64 now = current_time(inode); + + inode_set_ctime(inode, now.tv_sec, now.tv_nsec); + return now; +} +EXPORT_SYMBOL(inode_set_ctime_current); + /** * in_group_or_capable - check whether caller is CAP_FSETID privileged * @idmap: idmap of the mount @inode was found from diff --git a/include/linux/fs.h b/include/linux/fs.h index 824accb89a91..bdfbd11a5811 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1474,7 +1474,50 @@ static inline bool fsuidgid_has_mapping(struct super_block *sb, kgid_has_mapping(fs_userns, kgid); } -extern struct timespec64 current_time(struct inode *inode); +struct timespec64 current_time(struct inode *inode); +struct timespec64 inode_set_ctime_current(struct inode *inode); + +/** + * inode_get_ctime - fetch the current ctime from the inode + * @inode: inode from which to fetch ctime + * + * Grab the current ctime from the inode and return it. 
+ */ +static inline struct timespec64 inode_get_ctime(const struct inode *inode) +{ + return inode->i_ctime; +} + +/** + * inode_set_ctime_to_ts - set the ctime in the inode + * @inode: inode in which to set the ctime + * @ts: value to set in the ctime field + * + * Set the ctime in @inode to @ts + */ +static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode, + struct timespec64 ts) +{ + inode->i_ctime = ts; + return ts; +} + +/** + * inode_set_ctime - set the ctime in the inode + * @inode: inode in which to set the ctime + * @sec: tv_sec value to set + * @nsec: tv_nsec value to set + * + * Set the ctime in @inode to { @sec, @nsec } + */ +static inline struct timespec64 inode_set_ctime(struct inode *inode, + time64_t sec, long nsec) +{ + struct timespec64 ts = { .tv_sec = sec, +.tv_nsec = nsec }; + + return inode_set_ctime_to_ts(inode, ts); +} /* * Snapshotting support. -- 2.41.0
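The accessor pattern this patch introduces can be illustrated with a self-contained toy model (toy types and names, not the kernel's; `struct timespec` stands in for `timespec64`):

```c
#include <assert.h>
#include <time.h>

/* Toy model of the accessor pattern: the ctime field is "privatized"
 * and only touched through get/set helpers. Once all call sites go
 * through the helpers, the implementation behind them is free to
 * repurpose unused tv_nsec bits as flags. */
struct toy_inode {
	struct timespec __i_ctime; /* use the accessors! */
};

static struct timespec toy_get_ctime(const struct toy_inode *inode)
{
	return inode->__i_ctime;
}

static struct timespec toy_set_ctime(struct toy_inode *inode,
				     time_t sec, long nsec)
{
	struct timespec ts = { .tv_sec = sec, .tv_nsec = nsec };

	inode->__i_ctime = ts;
	return ts;
}
```

With these helpers in place, open-coded pairs like `inode->i_ctime.tv_sec = sec; inode->i_ctime.tv_nsec = 0;` collapse into a single `toy_set_ctime(inode, sec, 0)` call, which is exactly the shape of conversion the cover letter's coccinelle scripts automate.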
[PATCH v2 00/89] fs: new accessors for inode->i_ctime
v2: - prepend patches to add missing ctime updates - add simple_rename_timestamp helper function - rename ctime accessor functions as inode_get_ctime/inode_set_ctime_* - drop individual inode_ctime_set_{sec,nsec} helpers I've been working on a patchset to change how the inode->i_ctime is accessed in order to give us conditional, high-res timestamps for the ctime and mtime. struct timespec64 has unused bits in it that we can use to implement this. In order to do that however, we need to wrap all accesses of inode->i_ctime to ensure that bits used as flags are appropriately handled. The patchset starts with reposts of some missing ctime updates that I spotted in the tree. It then adds a new helper function for updating the timestamp after a successful rename, and new ctime accessor infrastructure. The bulk of the patchset is individual conversions of different subsystems to use the new infrastructure. Finally, the patchset renames the i_ctime field to __i_ctime to help ensure that I didn't miss anything. This should apply cleanly to linux-next as of this morning. Most of this conversion was done via 5 different coccinelle scripts, run in succession, with a large swath of by-hand conversions to clean up the remainder. The coccinelle scripts that were used are below: :: cocci/ctime1.cocci :: // convert as much to use inode_set_ctime_current as possible @@ identifier timei; struct inode *inode; expression E1, E2; @@ ( - inode->i_ctime = E1 = E2 = current_time(timei) + E1 = E2 = inode_set_ctime_current(inode) | - inode->i_ctime = E1 = current_time(timei) + E1 = inode_set_ctime_current(inode) | - E1 = inode->i_ctime = current_time(timei) + E1 = inode_set_ctime_current(inode) | - inode->i_ctime = current_time(timei) + inode_set_ctime_current(inode) ) @@ struct inode *inode; expression E1, E2, E3; @@ ( - E1 = current_time(inode) + E1 = inode_set_ctime_current(inode) | - E1 = current_time(E3) + E1 = inode_set_ctime_current(inode) ) ...
( - inode->i_ctime = E1; | - E2 = inode->i_ctime = E1; + E2 = E1; ) :: cocci/ctime2.cocci :: // get the places that set individual timespec64 fields @@ struct inode *inode; expression val, val2; @@ - inode->i_ctime.tv_sec = val + inode_set_ctime(inode, val, val2) ... - inode->i_ctime.tv_nsec = val2; // get places that just set the tv_sec @@ struct inode *inode; expression sec, E1, E2, E3; @@ ( - E3 = inode->i_ctime.tv_sec = sec + E3 = inode_set_ctime(inode, sec, 0).tv_sec | - inode->i_ctime.tv_sec = sec + inode_set_ctime(inode, sec, 0) ) <... ( - inode->i_ctime.tv_nsec = 0; | - E1 = inode->i_ctime.tv_nsec = 0 + E1 = 0 | - inode->i_ctime.tv_nsec = E1 = 0 + E1 = 0 | - inode->i_ctime.tv_nsec = E1 = E2 = 0 + E1 = E2 = 0 ) ...> :: cocci/ctime3.cocci :: // convert places that set i_ctime to a timespec64 directly @@ struct inode *inode; expression ts, E1, E2; @@ ( - inode->i_ctime = E1 = E2 = ts + E1 = E2 = inode_set_ctime_to_ts(inode, ts) | - inode->i_ctime = E1 = ts + E1 = inode_set_ctime_to_ts(inode, ts) | - inode->i_ctime = ts + inode_set_ctime_to_ts(inode, ts) ) :: cocci/ctime4.cocci :: // catch places that set the i_ctime in an inode embedded in another structure @@ expression E1, E2, E3; @@ ( - E3.i_ctime = E1 = E2 = current_time(&E3) + E1 = E2 = inode_set_ctime_current(&E3) | - E3.i_ctime = E1 = current_time(&E3) + E1 = inode_set_ctime_current(&E3) | - E1 = E3.i_ctime = current_time(&E3) + E1 = inode_set_ctime_current(&E3) | - E3.i_ctime = current_time(&E3) + inode_set_ctime_current(&E3) ) :: cocci/ctime5.cocci :: // convert the remaining i_ctime accesses @@ struct inode *inode; @@ - inode->i_ctime + inode_get_ctime(inode) Jeff Layton (92): ibmvmc: update ctime in conjunction with mtime on write bfs: update ctime in addition to mtime when adding entries efivarfs: update ctime when mtime changes on a write exfat: ensure that ctime is updated whenever the mtime is apparmor: update ctime whenever the mtime changes on an inode cifs: update the ctime on a partial page 
write fs: add ctime accessors infrastructure fs: new helper: simple_rename_timestamp btrfs: convert to simple_rename_timestamp ubifs: convert to simple_rename_timestamp shmem: convert to simple_rename_timestamp exfat: convert to simple_rename_timestamp ntfs3: convert to simple_rename_timestamp reiserfs: convert to simple_rename_timestamp spufs: convert to ctime accessor functions s390: convert to ctime accessor functions binderfs: convert to ctime accessor functions infiniband: convert to ctime accessor functions ibm: convert to ctime accessor functions usb: convert to ctime accessor functions 9p: convert to ctime accessor functions adfs: convert to ctime accessor functions affs: convert to ctime accessor functions afs: convert to ctime accessor functions fs: convert to ctime accessor functions autofs: convert to ctime accessor functions befs
Re: [PATCH v4 12/13] s390/kexec: refactor for kernel/Kconfig.kexec
On 7/5/23 10:49, Nathan Chancellor wrote: Hi Eric, On Wed, Jul 05, 2023 at 10:20:03AM -0400, Eric DeVolder wrote: The kexec and crash kernel options are provided in the common kernel/Kconfig.kexec. Utilize the common options and provide the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the equivalent set of KEXEC and CRASH options. NOTE: The original Kconfig has a KEXEC_SIG which depends on MODULE_SIG_FORMAT. However, attempts to keep the MODULE_SIG_FORMAT dependency (using the strategy outlined in this series, and other techniques) results in 'error: recursive dependency detected' on CRYPTO. Per Alexander Gordeev : "the MODULE_SIG_FORMAT dependency was introduced with [git commit below] and in fact was not necessary, since s390 did/does not use mod_check_sig() anyway. commit c8424e776b09 ("MODSIGN: Export module signature definitions") MODULE_SIG_FORMAT is needed to select SYSTEM_DATA_VERIFICATION. But SYSTEM_DATA_VERIFICATION is also selected by FS_VERITY*, so dropping MODULE_SIG_FORMAT does not hurt." Therefore, the solution is to drop the MODULE_SIG_FORMAT dependency from KEXEC_SIG. Still results in equivalent .config files for s390. Signed-off-by: Eric DeVolder Acked-by: Alexander Gordeev --- arch/s390/Kconfig | 65 ++- 1 file changed, 19 insertions(+), 46 deletions(-) diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 5b39918b7042..5d4fbbfdd1cd 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -244,6 +244,25 @@ config PGTABLE_LEVELS source "kernel/livepatch/Kconfig" +config ARCH_DEFAULT_KEXEC + def_bool y + +config ARCH_SUPPORTS_KEXEC + def_bool y + +config ARCH_SUPPORTS_KEXEC_FILE + def_bool CRYPTO && CRYPTO_SHA256 && CRYPTO_SHA256_S390 + +config ARCH_HAS_KEXEC_PURGATORY + def_bool KEXEC_FILE + +config ARCH_SUPPORTS_CRASH_DUMP + def_bool y + help + Refer to for more details on this. + This option also enables s390 zfcpdump. 
+ See also + menu "Processor type and features" config HAVE_MARCH_Z10_FEATURES @@ -482,36 +501,6 @@ config SCHED_TOPOLOGY source "kernel/Kconfig.hz" -config KEXEC - def_bool y - select KEXEC_CORE - -config KEXEC_FILE - bool "kexec file based system call" - select KEXEC_CORE - depends on CRYPTO - depends on CRYPTO_SHA256 - depends on CRYPTO_SHA256_S390 - help - Enable the kexec file based system call. In contrast to the normal - kexec system call this system call takes file descriptors for the - kernel and initramfs as arguments. - -config ARCH_HAS_KEXEC_PURGATORY - def_bool y - depends on KEXEC_FILE - -config KEXEC_SIG - bool "Verify kernel signature during kexec_file_load() syscall" - depends on KEXEC_FILE && MODULE_SIG_FORMAT - help - This option makes kernel signature verification mandatory for - the kexec_file_load() syscall. - - In addition to that option, you need to enable signature - verification for the corresponding kernel image type being - loaded in order for this to work. - config KERNEL_NOBP def_bool n prompt "Enable modified branch prediction for the kernel by default" @@ -733,22 +722,6 @@ config VFIO_AP endmenu -menu "Dump support" - -config CRASH_DUMP - bool "kernel crash dumps" - select KEXEC - help - Generate crash dump after being started by kexec. - Crash dump kernels are loaded in the main kernel with kexec-tools - into a specially reserved region and then later executed after - a crash by kdump/kexec. - Refer to for more details on this. - This option also enables s390 zfcpdump. - See also - -endmenu - config CCW def_bool y -- 2.31.1 I just bisected the following build failure visible with 'ARCH=s390 allnoconfig' to this change as commit 842ce0e1dafa ("s390/kexec: refactor for kernel/Kconfig.kexec") in -next. 
arch/s390/kernel/machine_kexec.c:120:37: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration 120 | static bool kdump_csum_valid(struct kimage *image) | ^~ arch/s390/kernel/machine_kexec.c:188:34: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration 188 | int machine_kexec_prepare(struct kimage *image) | ^~ arch/s390/kernel/machine_kexec.c: In function 'machine_kexec_prepare': arch/s390/kernel/machine_kexec.c:192:18: error: invalid use of undefined type 'struct kimage' 192 | if (image->type == KEXEC_TYPE_CRASH) | ^~ arch/s390/kernel/machine_kexec.c:192:28: error: 'KEXEC_TYPE_CRASH' undeclared (first use in this function); did you me
Re: [PATCH v4 03/13] arm/kexec: refactor for kernel/Kconfig.kexec
On Wed, Jul 5, 2023, at 17:22, Eric DeVolder wrote: > On 7/5/23 10:05, Arnd Bergmann wrote: >> On Wed, Jul 5, 2023, at 16:19, Eric DeVolder wrote: >> >> I see this is now in linux-next, and it caused a few randconfig >> build issues, these never happened in the past: > > Arnd, > Thanks for looking at this! > > I received randconfig errors from Andrew Morton's machinery. When > investigating I > found that randconfig was able to specify CRASH_DUMP without KEXEC, and > that led > to problems. I believe this situation existed prior to this series as > well. On Arm, there was definitely a bug because one could enable CRASH_DUMP without enabling KEXEC, but that had no effect at all. I only noticed the problem testing linux-next because it turned from silently broken into a build failure > Specifically CRASH_DUMP does not have a dependency on KEXEC, or select > (only s390 > has this hole closed). > > For CRASH_DUMP, this series now selects KEXEC to close this gap, which is > what a > sane config would have (i.e. both CRASH_DUMP and KEXEC). Right, that is the easier way out here. > Do you think the changes outlined below are still needed? I think only the first one still applies on arm, you need to ensure that ARCH_SUPPORTS_CRASH_DUMP has the same dependency as ARCH_SUPPORTS_KEXEC, or it just depends on ARCH_SUPPORTS_KEXEC. Arnd
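Arnd's closing suggestion — giving ARCH_SUPPORTS_CRASH_DUMP the same constraints as ARCH_SUPPORTS_KEXEC — could be expressed as a hypothetical arch/arm/Kconfig fragment like the following (a sketch of the suggestion, not the committed fix):

```
# Hypothetical sketch: CRASH_DUMP support inherits exactly the
# constraints of KEXEC support, so randconfig can no longer enable
# kdump on a configuration that cannot build the kexec machinery.
config ARCH_SUPPORTS_CRASH_DUMP
	def_bool ARCH_SUPPORTS_KEXEC
```

The alternative form Arnd mentions, `def_bool y` plus `depends on ARCH_SUPPORTS_KEXEC`, keeps the two symbols in lockstep in the same way.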
Re: [PATCH v4 12/13] s390/kexec: refactor for kernel/Kconfig.kexec
Hi Eric, On Wed, Jul 05, 2023 at 10:20:03AM -0400, Eric DeVolder wrote: > The kexec and crash kernel options are provided in the common > kernel/Kconfig.kexec. Utilize the common options and provide > the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the > equivalent set of KEXEC and CRASH options. > > NOTE: The original Kconfig has a KEXEC_SIG which depends on > MODULE_SIG_FORMAT. However, attempts to keep the MODULE_SIG_FORMAT > dependency (using the strategy outlined in this series, and other > techniques) results in 'error: recursive dependency detected' > on CRYPTO. > > Per Alexander Gordeev : "the MODULE_SIG_FORMAT > dependency was introduced with [git commit below] and in fact was not > necessary, since s390 did/does not use mod_check_sig() anyway. > > commit c8424e776b09 ("MODSIGN: Export module signature definitions") > > MODULE_SIG_FORMAT is needed to select SYSTEM_DATA_VERIFICATION. But > SYSTEM_DATA_VERIFICATION is also selected by FS_VERITY*, so dropping > MODULE_SIG_FORMAT does not hurt." > > Therefore, the solution is to drop the MODULE_SIG_FORMAT dependency > from KEXEC_SIG. Still results in equivalent .config files for s390. > > Signed-off-by: Eric DeVolder > Acked-by: Alexander Gordeev > --- > arch/s390/Kconfig | 65 ++- > 1 file changed, 19 insertions(+), 46 deletions(-) > > diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig > index 5b39918b7042..5d4fbbfdd1cd 100644 > --- a/arch/s390/Kconfig > +++ b/arch/s390/Kconfig > @@ -244,6 +244,25 @@ config PGTABLE_LEVELS > > source "kernel/livepatch/Kconfig" > > +config ARCH_DEFAULT_KEXEC > + def_bool y > + > +config ARCH_SUPPORTS_KEXEC > + def_bool y > + > +config ARCH_SUPPORTS_KEXEC_FILE > + def_bool CRYPTO && CRYPTO_SHA256 && CRYPTO_SHA256_S390 > + > +config ARCH_HAS_KEXEC_PURGATORY > + def_bool KEXEC_FILE > + > +config ARCH_SUPPORTS_CRASH_DUMP > + def_bool y > + help > + Refer to for more details on > this. > + This option also enables s390 zfcpdump. 
> + See also > + > menu "Processor type and features" > > config HAVE_MARCH_Z10_FEATURES > @@ -482,36 +501,6 @@ config SCHED_TOPOLOGY > > source "kernel/Kconfig.hz" > > -config KEXEC > - def_bool y > - select KEXEC_CORE > - > -config KEXEC_FILE > - bool "kexec file based system call" > - select KEXEC_CORE > - depends on CRYPTO > - depends on CRYPTO_SHA256 > - depends on CRYPTO_SHA256_S390 > - help > - Enable the kexec file based system call. In contrast to the normal > - kexec system call this system call takes file descriptors for the > - kernel and initramfs as arguments. > - > -config ARCH_HAS_KEXEC_PURGATORY > - def_bool y > - depends on KEXEC_FILE > - > -config KEXEC_SIG > - bool "Verify kernel signature during kexec_file_load() syscall" > - depends on KEXEC_FILE && MODULE_SIG_FORMAT > - help > - This option makes kernel signature verification mandatory for > - the kexec_file_load() syscall. > - > - In addition to that option, you need to enable signature > - verification for the corresponding kernel image type being > - loaded in order for this to work. > - > config KERNEL_NOBP > def_bool n > prompt "Enable modified branch prediction for the kernel by default" > @@ -733,22 +722,6 @@ config VFIO_AP > > endmenu > > -menu "Dump support" > - > -config CRASH_DUMP > - bool "kernel crash dumps" > - select KEXEC > - help > - Generate crash dump after being started by kexec. > - Crash dump kernels are loaded in the main kernel with kexec-tools > - into a specially reserved region and then later executed after > - a crash by kdump/kexec. > - Refer to for more details on > this. > - This option also enables s390 zfcpdump. > - See also > - > -endmenu > - > config CCW > def_bool y > > -- > 2.31.1 > I just bisected the following build failure visible with 'ARCH=s390 allnoconfig' to this change as commit 842ce0e1dafa ("s390/kexec: refactor for kernel/Kconfig.kexec") in -next. 
arch/s390/kernel/machine_kexec.c:120:37: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration 120 | static bool kdump_csum_valid(struct kimage *image) | ^~ arch/s390/kernel/machine_kexec.c:188:34: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration 188 | int machine_kexec_prepare(struct kimage *image) | ^~ arch/s390/kernel/machine_kexec.c: In function 'machine_kexec_prepare': arch/s390/kernel/machine_kexec.c:192:18: error: invalid use of undefined type 'struct kimage' 192 | if (image->type == KEXEC_TYPE_CRASH) | ^~ arch/s390/kernel/machine_kexec.c:192:
Re: [PATCH v4 03/13] arm/kexec: refactor for kernel/Kconfig.kexec
On 7/5/23 10:05, Arnd Bergmann wrote: On Wed, Jul 5, 2023, at 16:19, Eric DeVolder wrote: The kexec and crash kernel options are provided in the common kernel/Kconfig.kexec. Utilize the common options and provide the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the equivalent set of KEXEC and CRASH options. Signed-off-by: Eric DeVolder +config ARCH_SUPPORTS_KEXEC + def_bool (!SMP || PM_SLEEP_SMP) && MMU config ATAGS_PROC bool "Export atags in procfs" @@ -1668,17 +1656,8 @@ config ATAGS_PROC Should the atags used to boot the kernel be exported in an "atags" file in procfs. Useful with kexec. -config CRASH_DUMP - bool "Build kdump crash kernel (EXPERIMENTAL)" - help - Generate crash dump after being started by kexec. This should - be normally only set in special crash dump kernels which are - loaded in the main kernel with kexec-tools into a specially - reserved region and then later executed after a crash by - kdump/kexec. The crash dump kernel must be compiled to a - memory address not used by the main kernel - - For more details see Documentation/admin-guide/kdump/kdump.rst +config ARCH_SUPPORTS_CRASH_DUMP + def_bool y I see this is now in linux-next, and it caused a few randconfig build issues, these never happened in the past: Arnd, Thanks for looking at this! I received randconfig errors from Andrew Morton's machinery. When investigating I found that randconfig was able to specify CRASH_DUMP without KEXEC, and that led to problems. I believe this situation existed prior to this series as well. Specifically CRASH_DUMP does not have a dependency on KEXEC, or select (only s390 has this hole closed). For CRASH_DUMP, this series now selects KEXEC to close this gap, which is what a sane config would have (i.e. both CRASH_DUMP and KEXEC). Do you think the changes outlined below are still needed?
eric * The #ifdef CONFIG_KEXEC check in arch/arm/include/asm/kexec.h needs to be changed to CONFIG_KEXEC_CORE: include/linux/kexec.h:41:2: error: #error KEXEC_SOURCE_MEMORY_LIMIT not defined same thing on m68k * ARCH_SUPPORTS_CRASH_DUMP needs the same dependency as ARCH_SUPPORTS_KEXEC, otherwise we seem to run into an obscure assembler error building the kdump core on architectures that do not support kdump: /tmp/ccpYl6w9.s:1546: Error: selected processor does not support requested special purpose register -- `mrs r2,cpsr' * Most architectures build machine_kexec.o only when KEXEC is enabled, this also needs to be changed to KEXEC_CORE: --- a/arch/arm/kernel/Makefile +++ b/arch/arm/kernel/Makefile @@ -59,7 +59,7 @@ obj-$(CONFIG_FUNCTION_TRACER) += entry-ftrace.o obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o insn.o patch.o obj-$(CONFIG_FUNCTION_GRAPH_TRACER)+= ftrace.o insn.o patch.o obj-$(CONFIG_JUMP_LABEL) += jump_label.o insn.o patch.o -obj-$(CONFIG_KEXEC)+= machine_kexec.o relocate_kernel.o +obj-$(CONFIG_KEXEC_CORE) += machine_kexec.o relocate_kernel.o # Main staffs in KPROBES are in arch/arm/probes/ . obj-$(CONFIG_KPROBES) += patch.o insn.o obj-$(CONFIG_OABI_COMPAT) += sys_oabi-compat.o Arnd
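The dependency gap Eric describes above — randconfig enabling CRASH_DUMP without KEXEC — can be closed in the common Kconfig roughly as follows. This is an illustrative sketch of the relationship, not the exact hunk from the series; the option wording in kernel/Kconfig.kexec may differ:

```kconfig
# Sketch only: CRASH_DUMP pulls in KEXEC so a crash kernel can
# actually be loaded; the arch opt-in gate is assumed to exist.
config CRASH_DUMP
	bool "kernel crash dumps"
	depends on ARCH_SUPPORTS_CRASH_DUMP
	select KEXEC
	help
	  Generate a crash dump after being started by kexec. Selecting
	  KEXEC here prevents configurations that enable CRASH_DUMP
	  without the kexec machinery it relies on.
```

With `select KEXEC` in place, a randconfig that turns on CRASH_DUMP automatically gets KEXEC as well, matching the only sane combination.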
Re: [PATCH v4 03/13] arm/kexec: refactor for kernel/Kconfig.kexec
On Wed, Jul 5, 2023, at 16:19, Eric DeVolder wrote:
> The kexec and crash kernel options are provided in the common
> kernel/Kconfig.kexec. Utilize the common options and provide
> the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
> equivalent set of KEXEC and CRASH options.
>
> Signed-off-by: Eric DeVolder
>
> +config ARCH_SUPPORTS_KEXEC
> +	def_bool (!SMP || PM_SLEEP_SMP) && MMU
>
>  config ATAGS_PROC
>  	bool "Export atags in procfs"
> @@ -1668,17 +1656,8 @@ config ATAGS_PROC
>  	  Should the atags used to boot the kernel be exported in an "atags"
>  	  file in procfs. Useful with kexec.
>
> -config CRASH_DUMP
> -	bool "Build kdump crash kernel (EXPERIMENTAL)"
> -	help
> -	  Generate crash dump after being started by kexec. This should
> -	  be normally only set in special crash dump kernels which are
> -	  loaded in the main kernel with kexec-tools into a specially
> -	  reserved region and then later executed after a crash by
> -	  kdump/kexec. The crash dump kernel must be compiled to a
> -	  memory address not used by the main kernel
> -
> -	  For more details see Documentation/admin-guide/kdump/kdump.rst
> +config ARCH_SUPPORTS_CRASH_DUMP
> +	def_bool y

I see this is now in linux-next, and it caused a few randconfig build
issues, these never happened in the past:

* The #ifdef CONFIG_KEXEC check in arch/arm/include/asm/kexec.h
  needs to be changed to CONFIG_KEXEC_CORE:

  include/linux/kexec.h:41:2: error: #error KEXEC_SOURCE_MEMORY_LIMIT not defined

  same thing on m68k

* ARCH_SUPPORTS_CRASH_DUMP needs the same dependency as
  ARCH_SUPPORTS_KEXEC, otherwise we seem to run into an obscure
  assembler error building the kdump core on architectures that
  do not support kdump:

  /tmp/ccpYl6w9.s:1546: Error: selected processor does not support requested special purpose register -- `mrs r2,cpsr'

* Most architectures build machine_kexec.o only when KEXEC is
  enabled, this also needs to be changed to KEXEC_CORE:

--- a/arch/arm/kernel/Makefile
+++ b/arch/arm/kernel/Makefile
@@ -59,7 +59,7 @@
 obj-$(CONFIG_FUNCTION_TRACER)	+= entry-ftrace.o
 obj-$(CONFIG_DYNAMIC_FTRACE)	+= ftrace.o insn.o patch.o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER)	+= ftrace.o insn.o patch.o
 obj-$(CONFIG_JUMP_LABEL)	+= jump_label.o insn.o patch.o
-obj-$(CONFIG_KEXEC)	+= machine_kexec.o relocate_kernel.o
+obj-$(CONFIG_KEXEC_CORE)	+= machine_kexec.o relocate_kernel.o
 # Main staffs in KPROBES are in arch/arm/probes/ .
 obj-$(CONFIG_KPROBES)	+= patch.o insn.o
 obj-$(CONFIG_OABI_COMPAT)	+= sys_oabi-compat.o

Arnd
[PATCH v4 13/13] sh/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
Acked-by: John Paul Adrian Glaubitz
---
 arch/sh/Kconfig | 46 ++++++++--------------------------------------
 1 file changed, 8 insertions(+), 38 deletions(-)

diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 2b3ce4fd3956..1cf6603781c7 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -548,44 +548,14 @@ menu "Kernel features"

 source "kernel/Kconfig.hz"

-config KEXEC
-	bool "kexec system call (EXPERIMENTAL)"
-	depends on MMU
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
-
-config CRASH_DUMP
-	bool "kernel crash dumps (EXPERIMENTAL)"
-	depends on BROKEN_ON_SMP
-	help
-	  Generate crash dump after being started by kexec.
-	  This should be normally only set in special crash dump kernels
-	  which are loaded in the main kernel with kexec-tools into
-	  a specially reserved region and then later executed after
-	  a crash by kdump/kexec. The crash dump kernel must be compiled
-	  to a memory address not used by the main kernel using
-	  PHYSICAL_START.
-
-	  For more details see Documentation/admin-guide/kdump/kdump.rst
-
-config KEXEC_JUMP
-	bool "kexec jump (EXPERIMENTAL)"
-	depends on KEXEC && HIBERNATION
-	help
-	  Jump between original kernel and kexeced kernel and invoke
-	  code via KEXEC
+config ARCH_SUPPORTS_KEXEC
+	def_bool MMU
+
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool BROKEN_ON_SMP
+
+config ARCH_SUPPORTS_KEXEC_JUMP
+	def_bool y

 config PHYSICAL_START
 	hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
--
2.31.1
[PATCH v4 12/13] s390/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

NOTE: The original Kconfig has a KEXEC_SIG which depends on
MODULE_SIG_FORMAT. However, attempts to keep the MODULE_SIG_FORMAT
dependency (using the strategy outlined in this series, and other
techniques) result in 'error: recursive dependency detected' on
CRYPTO.

Per Alexander Gordeev : "the MODULE_SIG_FORMAT dependency was
introduced with [git commit below] and in fact was not necessary,
since s390 did/does not use mod_check_sig() anyway.

  commit c8424e776b09 ("MODSIGN: Export module signature definitions")

MODULE_SIG_FORMAT is needed to select SYSTEM_DATA_VERIFICATION. But
SYSTEM_DATA_VERIFICATION is also selected by FS_VERITY*, so dropping
MODULE_SIG_FORMAT does not hurt."

Therefore, the solution is to drop the MODULE_SIG_FORMAT dependency
from KEXEC_SIG. This still results in equivalent .config files for
s390.

Signed-off-by: Eric DeVolder
Acked-by: Alexander Gordeev
---
 arch/s390/Kconfig | 65 +++++++++++++++++++++------------------------------------------------
 1 file changed, 19 insertions(+), 46 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 5b39918b7042..5d4fbbfdd1cd 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -244,6 +244,25 @@ config PGTABLE_LEVELS

 source "kernel/livepatch/Kconfig"

+config ARCH_DEFAULT_KEXEC
+	def_bool y
+
+config ARCH_SUPPORTS_KEXEC
+	def_bool y
+
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool CRYPTO && CRYPTO_SHA256 && CRYPTO_SHA256_S390
+
+config ARCH_HAS_KEXEC_PURGATORY
+	def_bool KEXEC_FILE
+
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y
+	help
+	  Refer to for more details on this.
+	  This option also enables s390 zfcpdump.
+	  See also
+
 menu "Processor type and features"

 config HAVE_MARCH_Z10_FEATURES
@@ -482,36 +501,6 @@ config SCHED_TOPOLOGY

 source "kernel/Kconfig.hz"

-config KEXEC
-	def_bool y
-	select KEXEC_CORE
-
-config KEXEC_FILE
-	bool "kexec file based system call"
-	select KEXEC_CORE
-	depends on CRYPTO
-	depends on CRYPTO_SHA256
-	depends on CRYPTO_SHA256_S390
-	help
-	  Enable the kexec file based system call. In contrast to the normal
-	  kexec system call this system call takes file descriptors for the
-	  kernel and initramfs as arguments.
-
-config ARCH_HAS_KEXEC_PURGATORY
-	def_bool y
-	depends on KEXEC_FILE
-
-config KEXEC_SIG
-	bool "Verify kernel signature during kexec_file_load() syscall"
-	depends on KEXEC_FILE && MODULE_SIG_FORMAT
-	help
-	  This option makes kernel signature verification mandatory for
-	  the kexec_file_load() syscall.
-
-	  In addition to that option, you need to enable signature
-	  verification for the corresponding kernel image type being
-	  loaded in order for this to work.
-
 config KERNEL_NOBP
 	def_bool n
 	prompt "Enable modified branch prediction for the kernel by default"
@@ -733,22 +722,6 @@ config VFIO_AP

 endmenu

-menu "Dump support"
-
-config CRASH_DUMP
-	bool "kernel crash dumps"
-	select KEXEC
-	help
-	  Generate crash dump after being started by kexec.
-	  Crash dump kernels are loaded in the main kernel with kexec-tools
-	  into a specially reserved region and then later executed after
-	  a crash by kdump/kexec.
-	  Refer to for more details on this.
-	  This option also enables s390 zfcpdump.
-	  See also
-
-endmenu
-
 config CCW
 	def_bool y
--
2.31.1
[PATCH v4 11/13] riscv/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/riscv/Kconfig | 44 +++++++++++++-------------------------------
 1 file changed, 13 insertions(+), 31 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index b49793cf34eb..8a3af850597a 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -647,48 +647,30 @@ config RISCV_BOOT_SPINWAIT

 	  If unsure what to do here, say N.

-config KEXEC
-	bool "Kexec system call"
-	depends on MMU
+config ARCH_SUPPORTS_KEXEC
+	def_bool MMU
+
+config ARCH_SELECTS_KEXEC
+	def_bool y
+	depends on KEXEC
 	select HOTPLUG_CPU if SMP
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.

+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool 64BIT && MMU

-config KEXEC_FILE
-	bool "kexec file based systmem call"
-	depends on 64BIT && MMU
+config ARCH_SELECTS_KEXEC_FILE
+	def_bool y
+	depends on KEXEC_FILE
 	select HAVE_IMA_KEXEC if IMA
-	select KEXEC_CORE
 	select KEXEC_ELF
-	help
-	  This is new version of kexec system call. This system call is
-	  file based and takes file descriptors as system call argument
-	  for kernel and initramfs as opposed to list of segments as
-	  accepted by previous system call.
-
-	  If you don't know what to do here, say Y.

 config ARCH_HAS_KEXEC_PURGATORY
 	def_bool KEXEC_FILE
 	depends on CRYPTO=y
 	depends on CRYPTO_SHA256=y

-config CRASH_DUMP
-	bool "Build kdump crash kernel"
-	help
-	  Generate crash dump after being started by kexec. This should
-	  be normally only set in special crash dump kernels which are
-	  loaded in the main kernel with kexec-tools into a specially
-	  reserved region and then later executed after a crash by
-	  kdump/kexec.
-
-	  For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y

 config COMPAT
 	bool "Kernel support for 32-bit U-mode"
--
2.31.1
[PATCH v4 10/13] powerpc/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
---
 arch/powerpc/Kconfig | 55 +++++++++++++++++++--------------------------------------
 1 file changed, 17 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 0b1172cbeccb..1695a71777f0 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -589,41 +589,21 @@ config PPC64_SUPPORTS_MEMORY_FAILURE
 	default "y" if PPC_POWERNV
 	select ARCH_SUPPORTS_MEMORY_FAILURE

-config KEXEC
-	bool "kexec system call"
-	depends on PPC_BOOK3S || PPC_E500 || (44x && !SMP)
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
-
-config KEXEC_FILE
-	bool "kexec file based system call"
-	select KEXEC_CORE
-	select HAVE_IMA_KEXEC if IMA
-	select KEXEC_ELF
-	depends on PPC64
-	depends on CRYPTO=y
-	depends on CRYPTO_SHA256=y
-	help
-	  This is a new version of the kexec system call. This call is
-	  file based and takes in file descriptors as system call arguments
-	  for kernel and initramfs as opposed to a list of segments as is the
-	  case for the older kexec call.
+config ARCH_SUPPORTS_KEXEC
+	def_bool PPC_BOOK3S || PPC_E500 || (44x && !SMP)
+
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool PPC64 && CRYPTO=y && CRYPTO_SHA256=y

 config ARCH_HAS_KEXEC_PURGATORY
 	def_bool KEXEC_FILE

+config ARCH_SELECTS_KEXEC_FILE
+	def_bool y
+	depends on KEXEC_FILE
+	select KEXEC_ELF
+	select HAVE_IMA_KEXEC if IMA
+
 config PPC64_BIG_ENDIAN_ELF_ABI_V2
 	# Option is available to BFD, but LLD does not support ELFv1 so this is
 	# always true there.
@@ -683,14 +663,13 @@ config RELOCATABLE_TEST
 	  loaded at, which tends to be non-zero and therefore test the
 	  relocation code.

-config CRASH_DUMP
-	bool "Build a dump capture kernel"
-	depends on PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)
+
+config ARCH_SELECTS_CRASH_DUMP
+	def_bool y
+	depends on CRASH_DUMP
 	select RELOCATABLE if PPC64 || 44x || PPC_85xx
-	help
-	  Build a kernel suitable for use as a dump capture kernel.
-	  The same kernel binary can be used as production kernel and dump
-	  capture kernel.

 config FA_DUMP
 	bool "Firmware-assisted dump"
--
2.31.1
[PATCH v4 08/13] mips/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
Acked-by: Thomas Bogendoerfer
---
 arch/mips/Kconfig | 32 +++++---------------------------
 1 file changed, 5 insertions(+), 27 deletions(-)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index fc6fba925aea..bc8421859006 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2878,33 +2878,11 @@ config HZ

 config SCHED_HRTICK
 	def_bool HIGH_RES_TIMERS

-config KEXEC
-	bool "Kexec system call"
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
-
-config CRASH_DUMP
-	bool "Kernel crash dumps"
-	help
-	  Generate crash dump after being started by kexec.
-	  This should be normally only set in special crash dump kernels
-	  which are loaded in the main kernel with kexec-tools into
-	  a specially reserved region and then later executed after
-	  a crash by kdump/kexec. The crash dump kernel must be compiled
-	  to a memory address not used by the main kernel or firmware using
-	  PHYSICAL_START.
+config ARCH_SUPPORTS_KEXEC
+	def_bool y
+
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y

 config PHYSICAL_START
 	hex "Physical address where the kernel is loaded"
--
2.31.1
[PATCH v4 09/13] parisc/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/parisc/Kconfig | 34 +++++++++++-----------------------
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 4cb46d5c64a2..2ef6843aae60 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -339,29 +339,17 @@ config NR_CPUS
 	default "8" if 64BIT
 	default "16"

-config KEXEC
-	bool "Kexec system call"
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  shutdown, so do not be surprised if this code does not
-	  initially work for you.
-
-config KEXEC_FILE
-	bool "kexec file based system call"
-	select KEXEC_CORE
-	select KEXEC_ELF
-	help
-	  This enables the kexec_file_load() System call. This is
-	  file based and takes file descriptors as system call argument
-	  for kernel and initramfs as opposed to list of segments as
-	  accepted by previous system call.
-
 endmenu

+config ARCH_SUPPORTS_KEXEC
+	def_bool y
+
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool y
+
+config ARCH_SELECTS_KEXEC_FILE
+	def_bool y
+	depends on KEXEC_FILE
+	select KEXEC_ELF
+
 source "drivers/parisc/Kconfig"
--
2.31.1
[PATCH v4 07/13] m68k/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
Reviewed-by: Geert Uytterhoeven
Acked-by: Geert Uytterhoeven
---
 arch/m68k/Kconfig | 19 ++-----------------
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
index dc792b321f1e..3e318bf9504c 100644
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -89,23 +89,8 @@ config MMU_SUN3
 	bool
 	depends on MMU && !MMU_MOTOROLA && !MMU_COLDFIRE

-config KEXEC
-	bool "kexec system call"
-	depends on M68KCLASSIC && MMU
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
+config ARCH_SUPPORTS_KEXEC
+	def_bool M68KCLASSIC && MMU

 config BOOTINFO_PROC
 	bool "Export bootinfo in procfs"
--
2.31.1
[PATCH v4 05/13] arm64/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/arm64/Kconfig | 62 ++++++++++++--------------------------------------------------
 1 file changed, 12 insertions(+), 50 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7856c3a3e35a..bc6c9ff85a77 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1459,60 +1459,22 @@ config PARAVIRT_TIME_ACCOUNTING

 	  If in doubt, say N here.

-config KEXEC
-	depends on PM_SLEEP_SMP
-	select KEXEC_CORE
-	bool "kexec system call"
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-config KEXEC_FILE
-	bool "kexec file based system call"
-	select KEXEC_CORE
-	select HAVE_IMA_KEXEC if IMA
-	help
-	  This is new version of kexec system call. This system call is
-	  file based and takes file descriptors as system call argument
-	  for kernel and initramfs as opposed to list of segments as
-	  accepted by previous system call.
-
-config KEXEC_SIG
-	bool "Verify kernel signature during kexec_file_load() syscall"
-	depends on KEXEC_FILE
-	help
-	  Select this option to verify a signature with loaded kernel
-	  image. If configured, any attempt of loading a image without
-	  valid signature will fail.
-
-	  In addition to that option, you need to enable signature
-	  verification for the corresponding kernel image type being
-	  loaded in order for this to work.
+config ARCH_SUPPORTS_KEXEC
+	def_bool PM_SLEEP_SMP

-config KEXEC_IMAGE_VERIFY_SIG
-	bool "Enable Image signature verification support"
-	default y
-	depends on KEXEC_SIG
-	depends on EFI && SIGNED_PE_FILE_VERIFICATION
-	help
-	  Enable Image signature verification support.
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool y

-comment "Support for PE file signature verification disabled"
-	depends on KEXEC_SIG
-	depends on !EFI || !SIGNED_PE_FILE_VERIFICATION
+config ARCH_SELECTS_KEXEC_FILE
+	def_bool y
+	depends on KEXEC_FILE
+	select HAVE_IMA_KEXEC if IMA

-config CRASH_DUMP
-	bool "Build kdump crash kernel"
-	help
-	  Generate crash dump after being started by kexec. This should
-	  be normally only set in special crash dump kernels which are
-	  loaded in the main kernel with kexec-tools into a specially
-	  reserved region and then later executed after a crash by
-	  kdump/kexec.
+config ARCH_DEFAULT_KEXEC_IMAGE_VERIFY_SIG
+	def_bool y

-	  For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y

 config TRANS_TABLE
 	def_bool y
--
2.31.1
[PATCH v4 06/13] loongarch/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/loongarch/Kconfig | 26 +++++++-------------------
 1 file changed, 7 insertions(+), 19 deletions(-)

diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index e55511af4c77..397203e18800 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -537,28 +537,16 @@ config CPU_HAS_PREFETCH
 	bool
 	default y

-config KEXEC
-	bool "Kexec system call"
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
+config ARCH_SUPPORTS_KEXEC
+	def_bool y

-	  The name comes from the similarity to the exec system call.
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y

-config CRASH_DUMP
-	bool "Build kdump crash kernel"
+config ARCH_SELECTS_CRASH_DUMP
+	def_bool y
+	depends on CRASH_DUMP
 	select RELOCATABLE
-	help
-	  Generate crash dump after being started by kexec. This should
-	  be normally only set in special crash dump kernels which are
-	  loaded in the main kernel with kexec-tools into a specially
-	  reserved region and then later executed after a crash by
-	  kdump/kexec.
-
-	  For more details see Documentation/admin-guide/kdump/kdump.rst

 config RELOCATABLE
 	bool "Relocatable kernel"
--
2.31.1
[PATCH v4 02/13] x86/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/x86/Kconfig | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 13 insertions(+), 76 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7422db409770..6d2212c38046 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2040,88 +2040,25 @@ config EFI_RUNTIME_MAP

 source "kernel/Kconfig.hz"

-config KEXEC
-	bool "kexec system call"
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
-
-config KEXEC_FILE
-	bool "kexec file based system call"
-	select KEXEC_CORE
-	select HAVE_IMA_KEXEC if IMA
-	depends on X86_64
-	depends on CRYPTO=y
-	depends on CRYPTO_SHA256=y
-	help
-	  This is new version of kexec system call. This system call is
-	  file based and takes file descriptors as system call argument
-	  for kernel and initramfs as opposed to list of segments as
-	  accepted by previous system call.
+config ARCH_SUPPORTS_KEXEC
+	def_bool y

-config ARCH_HAS_KEXEC_PURGATORY
-	def_bool KEXEC_FILE
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool X86_64 && CRYPTO && CRYPTO_SHA256

-config KEXEC_SIG
-	bool "Verify kernel signature during kexec_file_load() syscall"
+config ARCH_SELECTS_KEXEC_FILE
+	def_bool y
 	depends on KEXEC_FILE
-	help
-
-	  This option makes the kexec_file_load() syscall check for a valid
-	  signature of the kernel image. The image can still be loaded without
-	  a valid signature unless you also enable KEXEC_SIG_FORCE, though if
-	  there's a signature that we can check, then it must be valid.
-
-	  In addition to this option, you need to enable signature
-	  verification for the corresponding kernel image type being
-	  loaded in order for this to work.
-
-config KEXEC_SIG_FORCE
-	bool "Require a valid signature in kexec_file_load() syscall"
-	depends on KEXEC_SIG
-	help
-	  This option makes kernel signature verification mandatory for
-	  the kexec_file_load() syscall.
+	select HAVE_IMA_KEXEC if IMA

-config KEXEC_BZIMAGE_VERIFY_SIG
-	bool "Enable bzImage signature verification support"
-	depends on KEXEC_SIG
-	depends on SIGNED_PE_FILE_VERIFICATION
-	select SYSTEM_TRUSTED_KEYRING
-	help
-	  Enable bzImage signature verification support.
+config ARCH_HAS_KEXEC_PURGATORY
+	def_bool KEXEC_FILE

-config CRASH_DUMP
-	bool "kernel crash dumps"
-	depends on X86_64 || (X86_32 && HIGHMEM)
-	help
-	  Generate crash dump after being started by kexec.
-	  This should be normally only set in special crash dump kernels
-	  which are loaded in the main kernel with kexec-tools into
-	  a specially reserved region and then later executed after
-	  a crash by kdump/kexec. The crash dump kernel must be compiled
-	  to a memory address not used by the main kernel or BIOS using
-	  PHYSICAL_START, or it must be built as a relocatable image
-	  (CONFIG_RELOCATABLE=y).
-
-	  For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_KEXEC_JUMP
+	def_bool y

-config KEXEC_JUMP
-	bool "kexec jump"
-	depends on KEXEC && HIBERNATION
-	help
-	  Jump between original kernel and kexeced kernel and invoke
-	  code in physical address mode via KEXEC
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool X86_64 || (X86_32 && HIGHMEM)

 config PHYSICAL_START
 	hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
--
2.31.1
[PATCH v4 00/13] refactor Kconfig to consolidate KEXEC and CRASH options
The Kconfig is refactored to consolidate KEXEC and CRASH options from various arch//Kconfig files into new file kernel/Kconfig.kexec. The Kconfig.kexec is now a submenu titled "Kexec and crash features" located under "General Setup". The following options are impacted: - KEXEC - KEXEC_FILE - KEXEC_SIG - KEXEC_SIG_FORCE - KEXEC_BZIMAGE_VERIFY_SIG - KEXEC_JUMP - CRASH_DUMP Over time, these options have been copied between Kconfig files and are very similar to one another, but with slight differences. The following architectures are impacted by the refactor (because of use of one or more KEXEC/CRASH options): - arm - arm64 - ia64 - loongarch - m68k - mips - parisc - powerpc - riscv - s390 - sh - x86 More information: In the patch series "crash: Kernel handling of CPU and memory hot un/plug" https://lore.kernel.org/lkml/20230503224145.7405-1-eric.devol...@oracle.com/ the new kernel feature introduces the config option CRASH_HOTPLUG. In reviewing, Thomas Gleixner requested that the new config option not be placed in x86 Kconfig. Rather the option needs a generic/common home. To Thomas' point, the KEXEC and CRASH options have largely been duplicated in the various arch//Kconfig files, with minor differences. This kind of proliferation is to be avoid/stopped. https://lore.kernel.org/lkml/875y91yv63.ffs@tglx/ To that end, I have refactored the arch Kconfigs so as to consolidate the various KEXEC and CRASH options. Generally speaking, this work has the following themes: - KEXEC and CRASH options are moved into new file kernel/Kconfig.kexec - These items from arch/Kconfig: CRASH_CORE KEXEC_CORE KEXEC_ELF HAVE_IMA_KEXEC - These items from arch/x86/Kconfig form the common options: KEXEC KEXEC_FILE KEXEC_SIG KEXEC_SIG_FORCE KEXEC_BZIMAGE_VERIFY_SIG KEXEC_JUMP CRASH_DUMP - The crash hotplug series appends CRASH_HOTPLUG to Kconfig.kexec - The Kconfig.kexec is now a submenu titled "Kexec and crash features" and is now listed in "General Setup" submenu from init/Kconfig. 
- To control the main common options, new options ARCH_SUPPORTS_KEXEC, ARCH_SUPPORTS_KEXEC_FILE and ARCH_SUPPORTS_CRASH_DUMP are introduced. NOTE: The existing ARCH_HAS_KEXEC_PURGATORY remains unchanged. - To account for the slight differences, new options ARCH_SELECTS_KEXEC, ARCH_SELECTS_KEXEC_FILE and ARCH_SELECTS_CRASH_DUMP are used to elicit the same side effects as the original arch//Kconfig files for KEXEC and CRASH options. An example, 'make menuconfig' illustrating the submenu: > General setup > Kexec and crash features [*] Enable kexec system call [*] Enable kexec file based system call [*] Verify kernel signature during kexec_file_load() syscall [ ] Require a valid signature in kexec_file_load() syscall [ ] Enable bzImage signature verification support [*] kexec jump [*] kernel crash dumps [*] Update the crash elfcorehdr on system configuration changes The three main options are KEXEC, KEXEC_FILE and CRASH_DUMP. In the process of consolidating these options, I encountered slight differences in the coding of these options in several of the architectures. As a result, I settled on the following solution: - Each of three main options has a 'depends on ARCH_SUPPORTS_' statement: ARCH_SUPPORTS_KEXEC, ARCH_SUPPORTS_KEXEC_FILE, ARCH_SUPPORTS_CRASH_DUMP. For example, the KEXEC_FILE option has a 'depends on ARCH_SUPPORTS_KEXEC_FILE' statement. - The boolean ARCH_SUPPORTS_ in effect allows the arch to determine when the feature is allowed. Archs which don't have the feature simply do not provide the corresponding ARCH_SUPPORTS_. For each arch, where there previously were KEXEC and/or CRASH options, these have been replaced with the corresponding boolean ARCH_SUPPORTS_, and an appropriate def_bool statement. For example, if the arch supports KEXEC_FILE, then the ARCH_SUPPORTS_KEXEC_FILE simply has a 'def_bool y'. This permits the KEXEC_FILE option to be available. 
If the arch has a 'depends on' statement in its original coding of the
option, then that expression becomes part of the def_bool expression.
For example, arm64 had:

  config KEXEC
    depends on PM_SLEEP_SMP

and in this solution, this converts to:

  config ARCH_SUPPORTS_KEXEC
    def_bool PM_SLEEP_SMP

- In order to account for the differences in the config coding for the
  three common options, the ARCH_SELECTS_<option> is used. This option
  has a 'depends on <option>' statement to couple it to the main option,
  and from there can insert the differences from the common option and
  the arch original coding of that option.

  For example, a few archs enable CRYPTO and CRYPTO_SHA256 for
  KEXEC_FILE. These require an ARCH_SELECTS_KEXEC_FILE with
  'select CRYPTO' and 'select CRYPTO_SHA256' statements.

Illustrating the option relationships:

For KEXEC:
  ARCH_SUPPORTS_KEXEC <- KEXEC <- ARCH_SELECTS_KEXEC

  KEXEC                 # in Kconfig.kexec
  ARCH_SUPPORTS_KEX
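Putting the pieces together, the SUPPORTS/SELECTS split sketched above looks roughly like the following Kconfig fragment. This is a composite sketch assembled from the examples in this cover letter (the KEXEC_FILE option, a 'def_bool y' gate, and the CRYPTO side effects), not a verbatim excerpt from the series:

```
# Common definition in kernel/Kconfig.kexec:
config KEXEC_FILE
	bool "Enable kexec file based system call"
	depends on ARCH_SUPPORTS_KEXEC_FILE
	select KEXEC_CORE

# In arch/<arch>/Kconfig: the gate that makes the option available...
config ARCH_SUPPORTS_KEXEC_FILE
	def_bool y

# ...and the arch-specific side effects, coupled to the main option:
config ARCH_SELECTS_KEXEC_FILE
	def_bool y
	depends on KEXEC_FILE
	select CRYPTO
	select CRYPTO_SHA256
```

The 'depends on KEXEC_FILE' in the SELECTS option is what makes the extra selects fire only when the user actually enables the feature.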
[PATCH v4 04/13] ia64/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/ia64/Kconfig | 28 +---
 1 file changed, 5 insertions(+), 23 deletions(-)

diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 2cd93e6bf0fe..88382f105301 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -361,31 +361,13 @@ config IA64_HP_AML_NFW
 	  the "force" module parameter, e.g., with the "aml_nfw.force"
 	  kernel command line option.
 
-config KEXEC
-	bool "kexec system call"
-	depends on !SMP || HOTPLUG_CPU
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
+endmenu
 
-config CRASH_DUMP
-	bool "kernel crash dumps"
-	depends on IA64_MCA_RECOVERY && (!SMP || HOTPLUG_CPU)
-	help
-	  Generate crash dump after being started by kexec.
+config ARCH_SUPPORTS_KEXEC
+	def_bool !SMP || HOTPLUG_CPU
 
-endmenu
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool IA64_MCA_RECOVERY && (!SMP || HOTPLUG_CPU)
 
 menu "Power management and ACPI options"
-- 
2.31.1
[PATCH v4 03/13] arm/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/arm/Kconfig | 29 -
 1 file changed, 4 insertions(+), 25 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 7a27550ff3c1..1a6a6eb48a15 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1645,20 +1645,8 @@ config XIP_DEFLATED_DATA
 	  copied, saving some precious ROM space. A possible drawback is a
 	  slightly longer boot delay.
 
-config KEXEC
-	bool "Kexec system call (EXPERIMENTAL)"
-	depends on (!SMP || PM_SLEEP_SMP)
-	depends on MMU
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you.
+config ARCH_SUPPORTS_KEXEC
+	def_bool (!SMP || PM_SLEEP_SMP) && MMU
 
 config ATAGS_PROC
 	bool "Export atags in procfs"
@@ -1668,17 +1656,8 @@ config ATAGS_PROC
 	  Should the atags used to boot the kernel be exported in an "atags"
 	  file in procfs. Useful with kexec.
 
-config CRASH_DUMP
-	bool "Build kdump crash kernel (EXPERIMENTAL)"
-	help
-	  Generate crash dump after being started by kexec. This should
-	  be normally only set in special crash dump kernels which are
-	  loaded in the main kernel with kexec-tools into a specially
-	  reserved region and then later executed after a crash by
-	  kdump/kexec. The crash dump kernel must be compiled to a
-	  memory address not used by the main kernel
-
-	  For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y
 
 config AUTO_ZRELADDR
 	bool "Auto calculation of the decompressed kernel image address" if !ARCH_MULTIPLATFORM
-- 
2.31.1
[PATCH v4 01/13] kexec: consolidate kexec and crash options into kernel/Kconfig.kexec
The config options for kexec and crash features are consolidated into
new file kernel/Kconfig.kexec. Under the "General Setup" submenu is a
new submenu "Kexec and crash handling". All the kexec and crash
options that were once in the arch-dependent submenu "Processor type
and features" are now consolidated in the new submenu.

The following options are impacted:
 - KEXEC
 - KEXEC_FILE
 - KEXEC_SIG
 - KEXEC_SIG_FORCE
 - KEXEC_BZIMAGE_VERIFY_SIG
 - KEXEC_JUMP
 - CRASH_DUMP

The three main options are KEXEC, KEXEC_FILE and CRASH_DUMP.

Architectures specify support of certain KEXEC and CRASH features with
similarly named new ARCH_SUPPORTS_<option> config options.

Architectures can utilize the new ARCH_SELECTS_<option> config options
to specify additional components when <option> is enabled.

To summarize, the ARCH_SUPPORTS_<option> permits the <option> to be
enabled, and the ARCH_SELECTS_<option> handles side effects (ie.
select statements).

Signed-off-by: Eric DeVolder
---
 arch/Kconfig | 13 -
 init/Kconfig | 2 +
 kernel/Kconfig.kexec | 110 +++
 3 files changed, 112 insertions(+), 13 deletions(-)
 create mode 100644 kernel/Kconfig.kexec

diff --git a/arch/Kconfig b/arch/Kconfig
index aff2746c8af2..b2872e9d3760 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -11,19 +11,6 @@ source "arch/$(SRCARCH)/Kconfig"
 
 menu "General architecture-dependent options"
 
-config CRASH_CORE
-	bool
-
-config KEXEC_CORE
-	select CRASH_CORE
-	bool
-
-config KEXEC_ELF
-	bool
-
-config HAVE_IMA_KEXEC
-	bool
-
 config ARCH_HAS_SUBPAGE_FAULTS
 	bool
 	help
diff --git a/init/Kconfig b/init/Kconfig
index f7f65af4ee12..639e8a3363c3 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1927,6 +1927,8 @@ config BINDGEN_VERSION_TEXT
 config TRACEPOINTS
 	bool
 
+source "kernel/Kconfig.kexec"
+
 endmenu		# General setup
 
 source "arch/Kconfig"
diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
new file mode 100644
index ..d82a7ce59c05
--- /dev/null
+++ b/kernel/Kconfig.kexec
@@ -0,0 +1,110 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+menu "Kexec and crash features"
+
+config CRASH_CORE
+	bool
+
+config KEXEC_CORE
+	select CRASH_CORE
+	bool
+
+config KEXEC_ELF
+	bool
+
+config HAVE_IMA_KEXEC
+	bool
+
+config KEXEC
+	bool "Enable kexec system call"
+	default ARCH_DEFAULT_KEXEC
+	depends on ARCH_SUPPORTS_KEXEC
+	select KEXEC_CORE
+	help
+	  kexec is a system call that implements the ability to shutdown your
+	  current kernel, and to start another kernel. It is like a reboot
+	  but it is independent of the system firmware. And like a reboot
+	  you can start any kernel with it, not just Linux.
+
+	  The name comes from the similarity to the exec system call.
+
+	  It is an ongoing process to be certain the hardware in a machine
+	  is properly shutdown, so do not be surprised if this code does not
+	  initially work for you. As of this writing the exact hardware
+	  interface is strongly in flux, so no good recommendation can be
+	  made.
+
+config KEXEC_FILE
+	bool "Enable kexec file based system call"
+	depends on ARCH_SUPPORTS_KEXEC_FILE
+	select KEXEC_CORE
+	help
+	  This is new version of kexec system call. This system call is
+	  file based and takes file descriptors as system call argument
+	  for kernel and initramfs as opposed to list of segments as
+	  accepted by kexec system call.
+
+config KEXEC_SIG
+	bool "Verify kernel signature during kexec_file_load() syscall"
+	depends on KEXEC_FILE
+	help
+	  This option makes the kexec_file_load() syscall check for a valid
+	  signature of the kernel image. The image can still be loaded without
+	  a valid signature unless you also enable KEXEC_SIG_FORCE, though if
+	  there's a signature that we can check, then it must be valid.
+
+	  In addition to this option, you need to enable signature
+	  verification for the corresponding kernel image type being
+	  loaded in order for this to work.
+
+config KEXEC_SIG_FORCE
+	bool "Require a valid signature in kexec_file_load() syscall"
+	depends on KEXEC_SIG
+	help
+	  This option makes kernel signature verification mandatory for
+	  the kexec_file_load() syscall.
+
+config KEXEC_IMAGE_VERIFY_SIG
+	bool "Enable Image signature verification support (ARM)"
+	default ARCH_DEFAULT_KEXEC_IMAGE_VERIFY_SIG
+	depends on KEXEC_SIG
+	depends on EFI && SIGNED_PE_FILE_VERIFICATION
+	help
+	  Enable Image signature verification support.
+
+config KEXEC_BZIMAGE_VERIFY_SIG
+	bool "Enable bzImage signature verification support"
+	depends on KEXEC_SIG
+	depends on SIGNED_PE_FILE_VERIFICATION
+	select SYSTEM_TRUSTED_KEYRING
+	help
+	  Enable b
[PATCH v2 15/92] spufs: convert to ctime accessor functions
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.

Acked-by: Jeremy Kerr
Reviewed-by: Jan Kara
Signed-off-by: Jeff Layton
---
 arch/powerpc/platforms/cell/spufs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
index ea807aa0c31a..38c5be34c895 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -86,7 +86,7 @@ spufs_new_inode(struct super_block *sb, umode_t mode)
 	inode->i_mode = mode;
 	inode->i_uid = current_fsuid();
 	inode->i_gid = current_fsgid();
-	inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+	inode->i_atime = inode->i_mtime = inode_set_ctime_current(inode);
 out:
 	return inode;
 }
-- 
2.41.0
Re: [PATCH 1/2] tools/perf/tests: perf all metrics test fails when perf_event access is restricted
On Mon, Jul 3, 2023 at 10:04 PM Athira Rajeev wrote:
>
> > On 15-Jun-2023, at 1:08 PM, Athira Rajeev wrote:
> >
> > Perf all metrics test fails as below when perf_event access
> > is restricted.
> >
> >    ./perf test -v "perf all metrics test"
> >    Metric 'Memory_RD_BW_Chip' not printed in:
> >    Error:
> >    Access to performance monitoring and observability operations is limited.
> >    Enforced MAC policy settings (SELinux) can limit access to performance
> >    —
> >    access to performance monitoring and observability operations for processes
> >    without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
> >    —
> >    test child finished with -1
> >     end
> >    perf all metrics test: FAILED!

In my system, it fails like below:

  $ ./perf test -v 101
  101: perf all metrics test :
  --- start ---
  test child forked, pid 398458
  Testing branch_misprediction_ratio
  Testing all_remote_links_outbound
  Metric 'all_remote_links_outbound' not printed in:
  Error:
  Invalid event (remote_outbound_data_controller_3:u) in per-thread mode, enable system wide with '-a'.
  Testing nps1_die_to_dram
  ...

Thanks,
Namhyung

> > Hi,
> > Looking for review comments on this patch.
> > Thanks
>
> > The perf all metrics test picks the input events from
> > "perf list --raw-dump metrics" and runs "perf stat -M "$m""
> > for each of the metrics in the list. It fails here for some
> > of the metrics which need access, since it collects system
> > wide resource details/statistics. Fix the testcase to skip
> > those metric events.
> >
> > Signed-off-by: Athira Rajeev
> > ---
> > tools/perf/tests/shell/stat_all_metrics.sh | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/perf/tests/shell/stat_all_metrics.sh b/tools/perf/tests/shell/stat_all_metrics.sh
> > index 54774525e18a..14b96484a359 100755
> > --- a/tools/perf/tests/shell/stat_all_metrics.sh
> > +++ b/tools/perf/tests/shell/stat_all_metrics.sh
> > @@ -6,7 +6,9 @@ err=0
> >  for m in $(perf list --raw-dump metrics); do
> >    echo "Testing $m"
> >    result=$(perf stat -M "$m" true 2>&1)
> > -  if [[ "$result" =~ ${m:0:50} ]] || [[ "$result" =~ "" ]]
> > +  # Skip if there is no access to perf_events monitoring
> > +  # and observability operations
> > +  if [[ "$result" =~ ${m:0:50} ]] || [[ "$result" =~ "" ]] || [[ "$result" =~ "Access to performance monitoring and observability operations is limited" ]]
> >    then
> >      continue
> >    fi
> > --
> > 2.31.1
> >
>
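The skip condition added by the patch can be exercised in isolation. A minimal sketch follows; the `result` string is a stand-in for real `perf stat` output on a restricted system, not captured from a real run, and `grep -q` is used as a portable equivalent of bash's quoted `=~` literal-substring match:

```shell
# Stand-in for: result=$(perf stat -M "$m" true 2>&1) on a restricted system
result="Error:
Access to performance monitoring and observability operations is limited."

# Same substring check the patch adds to stat_all_metrics.sh
if printf '%s' "$result" | grep -q \
     "Access to performance monitoring and observability operations is limited"; then
  status=skip   # treat the metric as skipped, not as a test failure
else
  status=fail
fi
echo "$status"   # prints "skip"
```

Because the right-hand side of bash's `=~` is quoted in the patch, it is matched literally rather than as a regex, which is exactly the substring semantics shown here.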
Re: [PATCH] tools/perf/tests: Fix objdump in Object code reading test to look for all sections
On Mon, Jul 3, 2023 at 10:04 PM Athira Rajeev wrote:
>
> Object code reading test fails intermittently with below logs:
>
>    Reading object code for memory address: 0xc00801dd34fc
>    File is: /lib/modules/6.3.0-rc7+/kernel/fs/xfs/xfs.ko
>    On file address is: 0x11359c
>    Objdump command is: objdump -z -d --start-address=0x1134fc --stop-address=0x11357c /lib/modules/6.3.0-rc7+/kernel/fs/xfs/xfs.ko
>    objdump read too few bytes: 128
>    test child finished with -1
>  end
>    Object code reading: FAILED!
>
> This issue happens randomly depending on the sample ip
> captured during the test run. In some cases, the same ip falls
> in the xfs module. The test does an objdump on the xfs.ko file and
> compares it with the dump from the dso that perf captures. But
> since the range of ip address falls in debug info section, it fails

This is strange.  Why did it fall into the debug section?

Thanks,
Namhyung

> to find the address range with objdump. Fix the objdump option so
> as to disassemble all sections to check the address range.
>
> Signed-off-by: Athira Rajeev
> ---
> tools/perf/tests/code-reading.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/perf/tests/code-reading.c b/tools/perf/tests/code-reading.c
> index ed3815163d1b..02132478856a 100644
> --- a/tools/perf/tests/code-reading.c
> +++ b/tools/perf/tests/code-reading.c
> @@ -184,7 +184,7 @@ static int read_via_objdump(const char *filename, u64 addr, void *buf,
>         FILE *f;
>         int ret;
>
> -       fmt = "%s -z -d --start-address=0x%"PRIx64" --stop-address=0x%"PRIx64" %s";
> +       fmt = "%s -z -D --start-address=0x%"PRIx64" --stop-address=0x%"PRIx64" %s";
>         ret = snprintf(cmd, sizeof(cmd), fmt, "objdump", addr, addr + len,
>                        filename);
>         if (ret <= 0 || (size_t)ret >= sizeof(cmd))
> --
> 2.35.3
>
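The one-character change swaps objdump's `-d` (disassemble only sections expected to contain instructions) for `-D` (disassemble all sections). A small shell sketch of the command the test builds after the fix; the address and module path are illustrative values lifted from the failure log above, and objdump itself is not invoked here:

```shell
addr=0xc00801dd34fc   # sample ip from the failure log (illustrative)
len=128               # number of bytes the test wants to read
ko=/lib/modules/6.3.0-rc7+/kernel/fs/xfs/xfs.ko

# Mirrors the snprintf() in read_via_objdump() after the fix: -D, not -d
cmd=$(printf '%s -z -D --start-address=0x%x --stop-address=0x%x %s' \
      objdump "$addr" "$((addr + len))" "$ko")
echo "$cmd"
```

With `-D`, objdump will produce output even when the requested range lands in a non-executable section, so the test no longer sees "read too few bytes" for such samples.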
Re: [PATCH] tools/perf/tests: Fix Basic BPF llvm compile to check for libbpf support
Hello,

On Mon, Jul 3, 2023 at 10:03 PM Athira Rajeev wrote:
>
> Basic BPF llvm compile fails in systems with libbpf
> that doesn't support BTF. Log shows below information.
>
>    libbpf: BTF is required, but is missing or corrupted.
>    Failed to parse test case 'Basic BPF llvm compile'
>    test child finished with -2
>  end
>
> Here BPF llvm compile fails due to missing BTF support.
> Fix the llvm test to skip the test in case BTF support is
> missing.
>
> Signed-off-by: Athira Rajeev
> ---
> tools/perf/tests/llvm.c | 7 ++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/tests/llvm.c b/tools/perf/tests/llvm.c
> index 0bc25a56cfef..4c73c9eab0bb 100644
> --- a/tools/perf/tests/llvm.c
> +++ b/tools/perf/tests/llvm.c
> @@ -4,6 +4,7 @@
>  #include
>  #include "tests.h"
>  #include "debug.h"
> +#include
>
>  #ifdef HAVE_LIBBPF_SUPPORT
>  #include
> @@ -14,8 +15,12 @@ static int test__bpf_parsing(void *obj_buf, size_t obj_buf_sz)
>         struct bpf_object *obj;
>
>         obj = bpf_object__open_mem(obj_buf, obj_buf_sz, NULL);
> -       if (libbpf_get_error(obj))
> +       if (libbpf_get_error(obj)) {
> +               /* Skip if there is no BTF support */
> +               if (errno == ENOENT)

Can we use the return value of libbpf_get_error() instead of using the
errno?  Also I'm curious why it requires BTF even if the BPF program
doesn't use it.  Is there an option or something for that?

Thanks,
Namhyung

> +                       return TEST_SKIP;
>                 return TEST_FAIL;
> +       }
>         bpf_object__close(obj);
>         return TEST_OK;
>  }
> --
> 2.27.0
>
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Wed, Jul 5, 2023 at 9:14 AM Suren Baghdasaryan wrote:
>
> On Wed, Jul 5, 2023 at 8:49 AM Andrew Morton wrote:
> >
> > On Wed, 5 Jul 2023 10:51:57 +0200 "Linux regression tracking (Thorsten
> > Leemhuis)" wrote:
> >
> > > >>> I'm in wait-a-few-days-mode on this. To see if we have a backportable
> > > >>> fix rather than disabling the feature in -stable.
> > >
> > > Andrew, how long will you remain in "wait-a-few-days-mode"? Given what
> > > Greg said below and that we already had three reports I know of I'd
> > > prefer if we could fix this rather sooner than later in mainline --
> > > especially as Arch Linux and openSUSE Tumbleweed likely have switched to
> > > 6.4.y already or will do so soon.
> >
> > I'll send today's 2-patch series to Linus today or tomorrow.
>
> I need to make a correction to the patch fixing the issue (the first
> one in the series) and will post it within an hour. Thanks!

Corrected patch is posted at
https://lore.kernel.org/all/20230705171213.2843068-2-sur...@google.com/.
The other patch is unchanged but I posted v3 of the entire patchset
anyway. Andrew, could you please replace the old version with this one
before sending it to Linus?
Thanks,
Suren.
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Wed, Jul 5, 2023 at 8:49 AM Andrew Morton wrote:
>
> On Wed, 5 Jul 2023 10:51:57 +0200 "Linux regression tracking (Thorsten
> Leemhuis)" wrote:
>
> > >>> I'm in wait-a-few-days-mode on this. To see if we have a backportable
> > >>> fix rather than disabling the feature in -stable.
> >
> > Andrew, how long will you remain in "wait-a-few-days-mode"? Given what
> > Greg said below and that we already had three reports I know of I'd
> > prefer if we could fix this rather sooner than later in mainline --
> > especially as Arch Linux and openSUSE Tumbleweed likely have switched to
> > 6.4.y already or will do so soon.
>
> I'll send today's 2-patch series to Linus today or tomorrow.

I need to make a correction to the patch fixing the issue (the first
one in the series) and will post it within an hour. Thanks!
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Wed, 5 Jul 2023 10:51:57 +0200 "Linux regression tracking (Thorsten
Leemhuis)" wrote:

> >>> I'm in wait-a-few-days-mode on this. To see if we have a backportable
> >>> fix rather than disabling the feature in -stable.
>
> Andrew, how long will you remain in "wait-a-few-days-mode"? Given what
> Greg said below and that we already had three reports I know of I'd
> prefer if we could fix this rather sooner than later in mainline --
> especially as Arch Linux and openSUSE Tumbleweed likely have switched to
> 6.4.y already or will do so soon.

I'll send today's 2-patch series to Linus today or tomorrow.
[PATCH v4 02/10] cpu/SMT: Move SMT prototypes into cpu_smt.h
From: Michael Ellerman

In order to export the cpuhp_smt_control enum as part of the interface
between generic and architecture code, the architecture code needs to
include asm/topology.h. But that leads to circular header dependencies.
So split the enum and related declarations into a separate header.

Signed-off-by: Michael Ellerman
[ldufour: rewording the commit's description]
Signed-off-by: Laurent Dufour
---
 arch/x86/include/asm/topology.h | 2 ++
 include/linux/cpu.h | 25 +
 include/linux/cpu_smt.h | 29 +
 kernel/cpu.c | 1 +
 4 files changed, 33 insertions(+), 24 deletions(-)
 create mode 100644 include/linux/cpu_smt.h

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index caf41c4869a0..ae49ed4417d0 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -136,6 +136,8 @@ static inline int topology_max_smt_threads(void)
 	return __max_smt_threads;
 }
 
+#include
+
 int topology_update_package_map(unsigned int apicid, unsigned int cpu);
 int topology_update_die_map(unsigned int dieid, unsigned int cpu);
 int topology_phys_to_logical_pkg(unsigned int pkg);
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 6e6e57ec69e8..6b326a9e8191 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 
 struct device;
 struct device_node;
@@ -204,30 +205,6 @@ void cpuhp_report_idle_dead(void);
 static inline void cpuhp_report_idle_dead(void) { }
 #endif /* #ifdef CONFIG_HOTPLUG_CPU */
 
-enum cpuhp_smt_control {
-	CPU_SMT_ENABLED,
-	CPU_SMT_DISABLED,
-	CPU_SMT_FORCE_DISABLED,
-	CPU_SMT_NOT_SUPPORTED,
-	CPU_SMT_NOT_IMPLEMENTED,
-};
-
-#if defined(CONFIG_SMP) && defined(CONFIG_HOTPLUG_SMT)
-extern enum cpuhp_smt_control cpu_smt_control;
-extern void cpu_smt_disable(bool force);
-extern void cpu_smt_check_topology(void);
-extern bool cpu_smt_possible(void);
-extern int cpuhp_smt_enable(void);
-extern int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval);
-#else
-# define cpu_smt_control	(CPU_SMT_NOT_IMPLEMENTED)
-static inline void cpu_smt_disable(bool force) { }
-static inline void cpu_smt_check_topology(void) { }
-static inline bool cpu_smt_possible(void) { return false; }
-static inline int cpuhp_smt_enable(void) { return 0; }
-static inline int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) { return 0; }
-#endif
-
 extern bool cpu_mitigations_off(void);
 extern bool cpu_mitigations_auto_nosmt(void);
 
diff --git a/include/linux/cpu_smt.h b/include/linux/cpu_smt.h
new file mode 100644
index ..722c2e306fef
--- /dev/null
+++ b/include/linux/cpu_smt.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CPU_SMT_H_
+#define _LINUX_CPU_SMT_H_
+
+enum cpuhp_smt_control {
+	CPU_SMT_ENABLED,
+	CPU_SMT_DISABLED,
+	CPU_SMT_FORCE_DISABLED,
+	CPU_SMT_NOT_SUPPORTED,
+	CPU_SMT_NOT_IMPLEMENTED,
+};
+
+#if defined(CONFIG_SMP) && defined(CONFIG_HOTPLUG_SMT)
+extern enum cpuhp_smt_control cpu_smt_control;
+extern void cpu_smt_disable(bool force);
+extern void cpu_smt_check_topology(void);
+extern bool cpu_smt_possible(void);
+extern int cpuhp_smt_enable(void);
+extern int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval);
+#else
+# define cpu_smt_control	(CPU_SMT_NOT_IMPLEMENTED)
+static inline void cpu_smt_disable(bool force) { }
+static inline void cpu_smt_check_topology(void) { }
+static inline bool cpu_smt_possible(void) { return false; }
+static inline int cpuhp_smt_enable(void) { return 0; }
+static inline int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) { return 0; }
+#endif
+
+#endif /* _LINUX_CPU_SMT_H_ */
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 03309f2f35a4..e02204c4675a 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -592,6 +592,7 @@ static void lockdep_release_cpus_lock(void)
 void __weak arch_smt_update(void) { }
 
 #ifdef CONFIG_HOTPLUG_SMT
+
 enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED;
 
 void __init cpu_smt_disable(bool force)
-- 
2.41.0
[PATCH] powerpc: dts: add missing space before {
Add missing whitespace between node name/label and opening {.

Signed-off-by: Krzysztof Kozlowski
---
 arch/powerpc/boot/dts/fsl/c293si-post.dtsi  | 12 ++--
 arch/powerpc/boot/dts/fsl/p1022rdk.dts      | 10 +-
 arch/powerpc/boot/dts/fsl/p1022si-post.dtsi |  2 +-
 arch/powerpc/boot/dts/fsl/p3041ds.dts       |  4 ++--
 arch/powerpc/boot/dts/fsl/p5040ds.dts       |  2 +-
 arch/powerpc/boot/dts/fsl/t4240qds.dts      |  2 +-
 arch/powerpc/boot/dts/mpc5121.dtsi          |  2 +-
 arch/powerpc/boot/dts/mpc5125twr.dts        |  2 +-
 8 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/c293si-post.dtsi b/arch/powerpc/boot/dts/fsl/c293si-post.dtsi
index bec0fc36849d..f208fb8f64b3 100644
--- a/arch/powerpc/boot/dts/fsl/c293si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/c293si-post.dtsi
@@ -124,10 +124,10 @@ crypto@8 {
 		reg = <0x8 0x2>;
 		ranges = <0x0 0x8 0x2>;
 
-		jr@1000{
+		jr@1000 {
 			interrupts = <45 2 0 0>;
 		};
-		jr@2000{
+		jr@2000 {
 			interrupts = <57 2 0 0>;
 		};
 	};
@@ -140,10 +140,10 @@ crypto@a {
 		reg = <0xa 0x2>;
 		ranges = <0x0 0xa 0x2>;
 
-		jr@1000{
+		jr@1000 {
 			interrupts = <49 2 0 0>;
 		};
-		jr@2000{
+		jr@2000 {
 			interrupts = <50 2 0 0>;
 		};
 	};
@@ -156,10 +156,10 @@ crypto@c {
 		reg = <0xc 0x2>;
 		ranges = <0x0 0xc 0x2>;
 
-		jr@1000{
+		jr@1000 {
 			interrupts = <55 2 0 0>;
 		};
-		jr@2000{
+		jr@2000 {
 			interrupts = <56 2 0 0>;
 		};
 	};
diff --git a/arch/powerpc/boot/dts/fsl/p1022rdk.dts b/arch/powerpc/boot/dts/fsl/p1022rdk.dts
index 29e8af1e3711..4261c2f7e4b3 100644
--- a/arch/powerpc/boot/dts/fsl/p1022rdk.dts
+++ b/arch/powerpc/boot/dts/fsl/p1022rdk.dts
@@ -60,23 +60,23 @@ rtc@68 {
 		compatible = "st,m41t62";
 		reg = <0x68>;
 	};
-	adt7461@4c{
+	adt7461@4c {
 		compatible = "adi,adt7461";
 		reg = <0x4c>;
 	};
-	zl6100@21{
+	zl6100@21 {
 		compatible = "isil,zl6100";
 		reg = <0x21>;
 	};
-	zl6100@24{
+	zl6100@24 {
 		compatible = "isil,zl6100";
 		reg = <0x24>;
 	};
-	zl6100@26{
+	zl6100@26 {
 		compatible = "isil,zl6100";
 		reg = <0x26>;
 	};
-	zl6100@29{
+	zl6100@29 {
 		compatible = "isil,zl6100";
 		reg = <0x29>;
 	};
diff --git a/arch/powerpc/boot/dts/fsl/p1022si-post.dtsi b/arch/powerpc/boot/dts/fsl/p1022si-post.dtsi
index 5f51b7bfc064..093e4e3ed368 100644
--- a/arch/powerpc/boot/dts/fsl/p1022si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/p1022si-post.dtsi
@@ -238,7 +238,7 @@ global-utilities@e {
 		fsl,has-rstcr;
 	};
 
-	power@e0070{
+	power@e0070 {
 		compatible = "fsl,mpc8536-pmc", "fsl,mpc8548-pmc";
 		reg = <0xe0070 0x20>;
 	};
diff --git a/arch/powerpc/boot/dts/fsl/p3041ds.dts b/arch/powerpc/boot/dts/fsl/p3041ds.dts
index 6f5f7283c533..ca0e0272ac62 100644
--- a/arch/powerpc/boot/dts/fsl/p3041ds.dts
+++ b/arch/powerpc/boot/dts/fsl/p3041ds.dts
@@ -41,7 +41,7 @@ / {
 	#size-cells = <2>;
 	interrupt-parent = <&mpic>;
 
-	aliases{
+	aliases {
 		phy_rgmii_0 = &phy_rgmii_0;
 		phy_rgmii_1 = &phy_rgmii_1;
 		phy_sgmii_1c = &phy_sgmii_1c;
@@ -165,7 +165,7 @@ adt7461@4c {
 		};
 	};
 
-	fman@40{
+	fman@40 {
 		ethernet@e {
 			phy-handle = <&phy_sgmii_1c>;
 			phy-connection-type = "sgmii";
diff --git a/arch/powerpc/boot/dts/fsl/p5040ds.dts b/arch/powerpc/boot/dts/fsl/p5040ds.dts
index 30850b3228e0..5cfc689ee474 100644
--- a/arch/powerpc/boot/dts/fsl/p5040ds.dts
+++ b/arch/powerpc/boot/dts/fsl/p5040ds.dts
@@ -41,7 +41,7 @@ / {
 	#size-cells = <2>;
 	interrupt-parent = <&mpic>;
 
-	aliases{
+	aliases {
[PATCH v4 10/10] powerpc/pseries: Honour current SMT state when DLPAR onlining CPUs
From: Michael Ellerman

Integrate with the generic SMT support, so that when a CPU is DLPAR
onlined it is brought up with the correct SMT mode.

Signed-off-by: Michael Ellerman
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 8
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 61fb7cb00880..e62835a12d73 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -398,6 +398,14 @@ static int dlpar_online_cpu(struct device_node *dn)
 	for_each_present_cpu(cpu) {
 		if (get_hard_smp_processor_id(cpu) != thread)
 			continue;
+
+		if (!topology_is_primary_thread(cpu)) {
+			if (cpu_smt_control != CPU_SMT_ENABLED)
+				break;
+			if (!topology_smt_thread_allowed(cpu))
+				break;
+		}
+
 		cpu_maps_update_done();
 		find_and_update_cpu_nid(cpu);
 		rc = device_online(get_cpu_device(cpu));
-- 
2.41.0
[PATCH v4 09/10] powerpc: Add HOTPLUG_SMT support
From: Michael Ellerman

Add support for HOTPLUG_SMT, which enables the generic sysfs SMT support
files in /sys/devices/system/cpu/smt, as well as the "nosmt" boot
parameter.

Implement the recently added hooks to allow partial SMT states, allow
any number of threads per core.

Tie the config symbol to HOTPLUG_CPU, which enables it on the major
platforms that support SMT. If there are other platforms that want the
SMT support that can be tweaked in future.

Signed-off-by: Michael Ellerman
[ldufour: pass current SMT level to cpu_smt_set_num_threads]
[ldufour: remove topology_smt_supported]
[ldufour: remove topology_smt_threads_supported]
[ldufour: select CONFIG_SMT_NUM_THREADS_DYNAMIC]
[ldufour: update kernel-parameters.txt]
Signed-off-by: Laurent Dufour
---
 Documentation/admin-guide/kernel-parameters.txt |  4 ++--
 arch/powerpc/Kconfig                            |  2 ++
 arch/powerpc/include/asm/topology.h             | 15 +++
 arch/powerpc/kernel/smp.c                       |  8 +++-
 4 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 44bcaf791ce6..979f9bad59da 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3853,10 +3853,10 @@
 	nosmp		[SMP] Tells an SMP kernel to act as a UP kernel,
 			and disable the IO APIC.  legacy for "maxcpus=0".
 
-	nosmt		[KNL,MIPS,S390] Disable symmetric multithreading (SMT).
+	nosmt		[KNL,MIPS,S390, PPC] Disable symmetric multithreading (SMT).
 			Equivalent to smt=1.
 
-			[KNL,X86] Disable symmetric multithreading (SMT).
+			[KNL,X86,PPC] Disable symmetric multithreading (SMT).
 			nosmt=force: Force disable SMT, cannot be undone
 				     via the sysfs control file.
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 0b1172cbeccb..aef38d2ca542 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -272,6 +272,8 @@ config PPC
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_VIRT_CPU_ACCOUNTING
 	select HAVE_VIRT_CPU_ACCOUNTING_GEN
+	select HOTPLUG_SMT			if HOTPLUG_CPU
+	select SMT_NUM_THREADS_DYNAMIC
 	select HUGETLB_PAGE_SIZE_VARIABLE	if PPC_BOOK3S_64 && HUGETLB_PAGE
 	select IOMMU_HELPER			if PPC64
 	select IRQ_DOMAIN
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 8a4d4f4d9749..f4e6f2dd04b7 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -143,5 +143,20 @@ static inline int cpu_to_coregroup_id(int cpu)
 #endif
 #endif
 
+#ifdef CONFIG_HOTPLUG_SMT
+#include
+#include
+
+static inline bool topology_is_primary_thread(unsigned int cpu)
+{
+	return cpu == cpu_first_thread_sibling(cpu);
+}
+
+static inline bool topology_smt_thread_allowed(unsigned int cpu)
+{
+	return cpu_thread_in_core(cpu) < cpu_smt_num_threads;
+}
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_TOPOLOGY_H */
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index fbbb695bae3d..b9f0f8f11c37 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1087,7 +1087,7 @@ static int __init init_big_cores(void)
 
 void __init smp_prepare_cpus(unsigned int max_cpus)
 {
-	unsigned int cpu;
+	unsigned int cpu, num_threads;
 
 	DBG("smp_prepare_cpus\n");
 
@@ -1154,6 +1154,12 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 
 	if (smp_ops && smp_ops->probe)
 		smp_ops->probe();
+
+	// Initialise the generic SMT topology support
+	num_threads = 1;
+	if (smt_enabled_at_boot)
+		num_threads = smt_enabled_at_boot;
+	cpu_smt_set_num_threads(num_threads, threads_per_core);
 }
 
 void smp_prepare_boot_cpu(void)
-- 
2.41.0
[PATCH v4 07/10] cpu/SMT: Allow enabling partial SMT states via sysfs
From: Michael Ellerman

Add support to the /sys/devices/system/cpu/smt/control interface for
enabling a specified number of SMT threads per core, including partial
SMT states where not all threads are brought online.

The current interface accepts "on" and "off", to enable either 1 or
all SMT threads per core.

This commit allows writing an integer, between 1 and the number of SMT
threads supported by the machine. Writing 1 is a synonym for "off", 2
or more enables SMT with the specified number of threads.

When reading the file, if all threads are online "on" is returned, to
avoid changing behaviour for existing users. If some other number of
threads is online then the integer value is returned.

Architectures like x86 only supporting 1 thread or all threads should
not define CONFIG_SMT_NUM_THREADS_DYNAMIC. Architectures supporting
partial SMT states, like PowerPC, should define it.

Signed-off-by: Michael Ellerman
[ldufour: slightly reword the commit's description]
[ldufour: remove switch() in __store_smt_control()]
Reported-by: kernel test robot
Closes: https://lore.kernel.org/oe-kbuild-all/202306282340.ihqm0fla-...@intel.com/
[ldufour: fix build issue in control_show()]
Signed-off-by: Laurent Dufour
---
 .../ABI/testing/sysfs-devices-system-cpu |  1 +
 kernel/cpu.c                             | 60 ++-
 2 files changed, 45 insertions(+), 16 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index ecd585ca2d50..6dba65fb1956 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -555,6 +555,7 @@ Description:	Control Symmetric Multi Threading (SMT)
 			=
 			"on"		 SMT is enabled
 			"off"		 SMT is disabled
+			"<N>"		 SMT is enabled with N threads per core.
 			"forceoff"	 SMT is force disabled. Cannot be changed.
 			"notsupported"	 SMT is not supported by the CPU
 			"notimplemented" SMT runtime toggling is not
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 9a8d0685e055..7e8f1b044772 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2876,11 +2876,19 @@ static const struct attribute_group cpuhp_cpu_root_attr_group = {
 
 #ifdef CONFIG_HOTPLUG_SMT
 
+static bool cpu_smt_num_threads_valid(unsigned int threads)
+{
+	if (IS_ENABLED(CONFIG_SMT_NUM_THREADS_DYNAMIC))
+		return threads >= 1 && threads <= cpu_smt_max_threads;
+	return threads == 1 || threads == cpu_smt_max_threads;
+}
+
 static ssize_t
 __store_smt_control(struct device *dev, struct device_attribute *attr,
 		    const char *buf, size_t count)
 {
-	int ctrlval, ret;
+	int ctrlval, ret, num_threads, orig_threads;
+	bool force_off;
 
 	if (cpu_smt_control == CPU_SMT_FORCE_DISABLED)
 		return -EPERM;
@@ -2888,30 +2896,39 @@ __store_smt_control(struct device *dev, struct device_attribute *attr,
 	if (cpu_smt_control == CPU_SMT_NOT_SUPPORTED)
 		return -ENODEV;
 
-	if (sysfs_streq(buf, "on"))
+	if (sysfs_streq(buf, "on")) {
 		ctrlval = CPU_SMT_ENABLED;
-	else if (sysfs_streq(buf, "off"))
+		num_threads = cpu_smt_max_threads;
+	} else if (sysfs_streq(buf, "off")) {
 		ctrlval = CPU_SMT_DISABLED;
-	else if (sysfs_streq(buf, "forceoff"))
+		num_threads = 1;
+	} else if (sysfs_streq(buf, "forceoff")) {
 		ctrlval = CPU_SMT_FORCE_DISABLED;
-	else
+		num_threads = 1;
+	} else if (kstrtoint(buf, 10, &num_threads) == 0) {
+		if (num_threads == 1)
+			ctrlval = CPU_SMT_DISABLED;
+		else if (cpu_smt_num_threads_valid(num_threads))
+			ctrlval = CPU_SMT_ENABLED;
+		else
+			return -EINVAL;
+	} else {
 		return -EINVAL;
+	}
 
 	ret = lock_device_hotplug_sysfs();
 	if (ret)
 		return ret;
 
-	if (ctrlval != cpu_smt_control) {
-		switch (ctrlval) {
-		case CPU_SMT_ENABLED:
-			ret = cpuhp_smt_enable();
-			break;
-		case CPU_SMT_DISABLED:
-		case CPU_SMT_FORCE_DISABLED:
-			ret = cpuhp_smt_disable(ctrlval);
-			break;
-		}
-	}
+	orig_threads = cpu_smt_num_threads;
+	cpu_smt_num_threads = num_threads;
+
+	force_off = ctrlval != cpu_smt_control && ctrlval == CPU_SMT_FORCE_DISABLED;
+
+	if (num_threads > orig_threads)
+		ret = cpuhp_smt_enable();
+	else if (num_threads < orig_threads || fo
[PATCH v4 06/10] cpu/SMT: Create topology_smt_thread_allowed()
From: Michael Ellerman Some architectures allow partial SMT states, ie. when not all SMT threads are brought online. To support that, add an architecture helper which checks whether a given CPU is allowed to be brought online depending on how many SMT threads are currently enabled. Since this is only applicable to architectures supporting partial SMT, only these architectures should select the new configuration variable CONFIG_SMT_NUM_THREADS_DYNAMIC. For the other architectures, which do not support partial SMT states, there is no need to define topology_smt_thread_allowed(); the generic code assumes that either all the threads are allowed or only the primary ones. Call the helper from cpuhp_smt_enable(), and from cpu_smt_allowed() when SMT is enabled, to check if the particular thread should be onlined. Notably, also call it from cpuhp_smt_disable() if CPU_SMT_ENABLED, to allow offlining some threads to move from a higher to a lower number of threads online. Signed-off-by: Michael Ellerman Suggested-by: Thomas Gleixner [ldufour: slightly reword the commit's description] [ldufour: introduce CONFIG_SMT_NUM_THREADS_DYNAMIC] Signed-off-by: Laurent Dufour --- arch/Kconfig | 3 +++ kernel/cpu.c | 24 +++- 2 files changed, 26 insertions(+), 1 deletion(-) diff --git a/arch/Kconfig b/arch/Kconfig index aff2746c8af2..63c5d6a2022b 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -34,6 +34,9 @@ config ARCH_HAS_SUBPAGE_FAULTS config HOTPLUG_SMT bool +config SMT_NUM_THREADS_DYNAMIC + bool + # Selected by HOTPLUG_CORE_SYNC_DEAD or HOTPLUG_CORE_SYNC_FULL config HOTPLUG_CORE_SYNC bool diff --git a/kernel/cpu.c b/kernel/cpu.c index 70add058e77b..9a8d0685e055 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -645,9 +645,23 @@ static int __init smt_cmdline_disable(char *str) } early_param("nosmt", smt_cmdline_disable); +/* + * For architectures supporting partial SMT states, check if the thread is + * allowed. Otherwise this has already been checked through cpu_smt_max_threads + * when setting the SMT level. 
+ */ +static inline bool cpu_smt_thread_allowed(unsigned int cpu) +{ +#ifdef CONFIG_SMT_NUM_THREADS_DYNAMIC + return topology_smt_thread_allowed(cpu); +#else + return true; +#endif +} + static inline bool cpu_smt_allowed(unsigned int cpu) { - if (cpu_smt_control == CPU_SMT_ENABLED) + if (cpu_smt_control == CPU_SMT_ENABLED && cpu_smt_thread_allowed(cpu)) return true; if (topology_is_primary_thread(cpu)) @@ -2642,6 +2656,12 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) for_each_online_cpu(cpu) { if (topology_is_primary_thread(cpu)) continue; + /* +* Disable can be called with CPU_SMT_ENABLED when changing +* from a higher to lower number of SMT threads per core. +*/ + if (ctrlval == CPU_SMT_ENABLED && cpu_smt_thread_allowed(cpu)) + continue; ret = cpu_down_maps_locked(cpu, CPUHP_OFFLINE); if (ret) break; @@ -2676,6 +2696,8 @@ int cpuhp_smt_enable(void) /* Skip online CPUs and CPUs on offline nodes */ if (cpu_online(cpu) || !node_online(cpu_to_node(cpu))) continue; + if (!cpu_smt_thread_allowed(cpu)) + continue; ret = _cpu_up(cpu, 0, CPUHP_ONLINE); if (ret) break; -- 2.41.0
[PATCH v4 08/10] powerpc/pseries: Initialise CPU hotplug callbacks earlier
From: Michael Ellerman As part of the generic HOTPLUG_SMT code, there is support for disabling secondary SMT threads at boot time, by passing "nosmt" on the kernel command line. The way that is implemented is the secondary threads are brought partly online, and then taken back offline again. That is done to support x86 CPUs needing certain initialisation done on all threads. However powerpc has similar needs, see commit d70a54e2d085 ("powerpc/powernv: Ignore smt-enabled on Power8 and later"). For that to work the powerpc CPU hotplug callbacks need to be registered before secondary CPUs are brought online, otherwise __cpu_disable() fails due to smp_ops->cpu_disable being NULL. So split the basic initialisation into pseries_cpu_hotplug_init() which can be called early from setup_arch(). The DLPAR related initialisation can still be done later, because it needs to do allocations. Signed-off-by: Michael Ellerman --- arch/powerpc/platforms/pseries/hotplug-cpu.c | 22 arch/powerpc/platforms/pseries/pseries.h | 2 ++ arch/powerpc/platforms/pseries/setup.c | 2 ++ 3 files changed, 17 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index 1a3cb313976a..61fb7cb00880 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -845,15 +845,9 @@ static struct notifier_block pseries_smp_nb = { .notifier_call = pseries_smp_notifier, }; -static int __init pseries_cpu_hotplug_init(void) +void __init pseries_cpu_hotplug_init(void) { int qcss_tok; - unsigned int node; - -#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE - ppc_md.cpu_probe = dlpar_cpu_probe; - ppc_md.cpu_release = dlpar_cpu_release; -#endif /* CONFIG_ARCH_CPU_PROBE_RELEASE */ rtas_stop_self_token = rtas_function_token(RTAS_FN_STOP_SELF); qcss_tok = rtas_function_token(RTAS_FN_QUERY_CPU_STOPPED_STATE); @@ -862,12 +856,22 @@ static int __init pseries_cpu_hotplug_init(void) qcss_tok == 
RTAS_UNKNOWN_SERVICE) { printk(KERN_INFO "CPU Hotplug not supported by firmware " "- disabling.\n"); - return 0; + return; } smp_ops->cpu_offline_self = pseries_cpu_offline_self; smp_ops->cpu_disable = pseries_cpu_disable; smp_ops->cpu_die = pseries_cpu_die; +} + +static int __init pseries_dlpar_init(void) +{ + unsigned int node; + +#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE + ppc_md.cpu_probe = dlpar_cpu_probe; + ppc_md.cpu_release = dlpar_cpu_release; +#endif /* CONFIG_ARCH_CPU_PROBE_RELEASE */ /* Processors can be added/removed only on LPAR */ if (firmware_has_feature(FW_FEATURE_LPAR)) { @@ -886,4 +890,4 @@ static int __init pseries_cpu_hotplug_init(void) return 0; } -machine_arch_initcall(pseries, pseries_cpu_hotplug_init); +machine_arch_initcall(pseries, pseries_dlpar_init); diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h index f8bce40ebd0c..f8893ba46e83 100644 --- a/arch/powerpc/platforms/pseries/pseries.h +++ b/arch/powerpc/platforms/pseries/pseries.h @@ -75,11 +75,13 @@ static inline int dlpar_hp_pmem(struct pseries_hp_errorlog *hp_elog) #ifdef CONFIG_HOTPLUG_CPU int dlpar_cpu(struct pseries_hp_errorlog *hp_elog); +void pseries_cpu_hotplug_init(void); #else static inline int dlpar_cpu(struct pseries_hp_errorlog *hp_elog) { return -EOPNOTSUPP; } +static inline void pseries_cpu_hotplug_init(void) { } #endif /* PCI root bridge prepare function override for pseries */ diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index e2a57cfa6c83..41451b76c6e5 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -816,6 +816,8 @@ static void __init pSeries_setup_arch(void) /* Discover PIC type and setup ppc_md accordingly */ smp_init_pseries(); + // Setup CPU hotplug callbacks + pseries_cpu_hotplug_init(); if (radix_enabled() && !mmu_has_feature(MMU_FTR_GTSE)) if (!firmware_has_feature(FW_FEATURE_RPT_INVALIDATE)) -- 2.41.0
[PATCH v4 05/10] cpu/SMT: Remove topology_smt_supported()
Since the maximum number of threads is now passed to cpu_smt_set_num_threads(), checking that value is enough to know if SMT is supported. Cc: Michael Ellerman Suggested-by: Thomas Gleixner Signed-off-by: Laurent Dufour --- arch/x86/include/asm/topology.h | 2 -- arch/x86/kernel/smpboot.c | 8 kernel/cpu.c| 4 ++-- 3 files changed, 2 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index ae49ed4417d0..3235ba1e5b06 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -141,7 +141,6 @@ static inline int topology_max_smt_threads(void) int topology_update_package_map(unsigned int apicid, unsigned int cpu); int topology_update_die_map(unsigned int dieid, unsigned int cpu); int topology_phys_to_logical_pkg(unsigned int pkg); -bool topology_smt_supported(void); extern struct cpumask __cpu_primary_thread_mask; #define cpu_primary_thread_mask ((const struct cpumask *)&__cpu_primary_thread_mask) @@ -164,7 +163,6 @@ static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; } static inline int topology_max_die_per_package(void) { return 1; } static inline int topology_max_smt_threads(void) { return 1; } static inline bool topology_is_primary_thread(unsigned int cpu) { return true; } -static inline bool topology_smt_supported(void) { return false; } #endif /* !CONFIG_SMP */ static inline void arch_fix_phys_package_id(int num, u32 slot) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index ed2d51960a7d..f8e709fd2cd5 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -326,14 +326,6 @@ static void notrace start_secondary(void *unused) cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); } -/** - * topology_smt_supported - Check whether SMT is supported by the CPUs - */ -bool topology_smt_supported(void) -{ - return smp_num_siblings > 1; -} - /** * topology_phys_to_logical_pkg - Map a physical package id to a logical * @phys_pkg: The physical 
package id to map diff --git a/kernel/cpu.c b/kernel/cpu.c index d7dd535cb5b5..70add058e77b 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -621,7 +621,7 @@ void __init cpu_smt_set_num_threads(unsigned int num_threads, { WARN_ON(!num_threads || (num_threads > max_threads)); - if (!topology_smt_supported()) + if (max_threads == 1) cpu_smt_control = CPU_SMT_NOT_SUPPORTED; cpu_smt_max_threads = max_threads; @@ -1801,7 +1801,7 @@ early_param("cpuhp.parallel", parallel_bringup_parse_param); static inline bool cpuhp_smt_aware(void) { - return topology_smt_supported(); + return cpu_smt_max_threads > 1; } static inline const struct cpumask *cpuhp_get_primary_thread_mask(void) -- 2.41.0
[PATCH v4 03/10] cpu/SMT: Move smt/control simple exit cases earlier
From: Michael Ellerman Move the simple exit cases, ie. which don't depend on the value written, earlier in the function. That makes it clearer that regardless of the input those states can not be transitioned out of. That does have a user-visible effect, in that the error returned will now always be EPERM/ENODEV for those states, regardless of the value written. Previously writing an invalid value would return EINVAL even when in those states. Signed-off-by: Michael Ellerman --- kernel/cpu.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/kernel/cpu.c b/kernel/cpu.c index e02204c4675a..b6fe170c93e9 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -2841,6 +2841,12 @@ __store_smt_control(struct device *dev, struct device_attribute *attr, { int ctrlval, ret; + if (cpu_smt_control == CPU_SMT_FORCE_DISABLED) + return -EPERM; + + if (cpu_smt_control == CPU_SMT_NOT_SUPPORTED) + return -ENODEV; + if (sysfs_streq(buf, "on")) ctrlval = CPU_SMT_ENABLED; else if (sysfs_streq(buf, "off")) @@ -2850,12 +2856,6 @@ __store_smt_control(struct device *dev, struct device_attribute *attr, else return -EINVAL; - if (cpu_smt_control == CPU_SMT_FORCE_DISABLED) - return -EPERM; - - if (cpu_smt_control == CPU_SMT_NOT_SUPPORTED) - return -ENODEV; - ret = lock_device_hotplug_sysfs(); if (ret) return ret; -- 2.41.0
[PATCH v4 00/10] Introduce SMT level and add PowerPC support
I'm taking over the series Michael sent previously [1], which smartly reworks the initial series I sent [2]. This series addresses the comments Thomas and I sent on Michael's one. Here is a short introduction to the issue this series is addressing: When a new CPU is added, the kernel activates all its threads. This leads to weird, but functional, results when adding a CPU on an SMT 4 system for instance. Here the newly added CPU 1 has 8 threads while the other one has 4 threads active (the system has been booted with the 'smt-enabled=4' kernel option): ltcden3-lp12:~ # ppc64_cpu --info Core 0: 0* 1* 2* 3* 4 5 6 7 Core 1: 8* 9* 10* 11* 12* 13* 14* 15* This mixed SMT level may confuse end users and/or some applications. There is no SMT level recorded in the kernel (common code), nor in user space, as far as I know. Such a level is helpful when adding a new CPU or when optimizing the energy efficiency (when reactivating CPUs). When SMP and HOTPLUG_SMT are defined, this series adds a new SMT level (cpu_smt_num_threads) and a few callbacks allowing the architecture code to control this value, setting a max and an "at boot" level, and controlling whether a thread should be onlined or not. 
v4: Rebase on top of 6.5's updates Remove a dependency on the x86 symbol cpu_primary_thread_mask v3: Fix a build error in the patch 6/9 v2: As Thomas suggested, Reword some commit descriptions Remove topology_smt_supported() Remove topology_smt_threads_supported() Introduce CONFIG_SMT_NUM_THREADS_DYNAMIC Remove switch() in __store_smt_control() Update kernel-parameters.txt [1] https://lore.kernel.org/linuxppc-dev/20230524155630.794584-1-...@ellerman.id.au/ [2] https://lore.kernel.org/linuxppc-dev/20230331153905.31698-1-lduf...@linux.ibm.com/ Laurent Dufour (2): cpu/hotplug: remove dependency against cpu_primary_thread_mask cpu/SMT: Remove topology_smt_supported() Michael Ellerman (8): cpu/SMT: Move SMT prototypes into cpu_smt.h cpu/SMT: Move smt/control simple exit cases earlier cpu/SMT: Store the current/max number of threads cpu/SMT: Create topology_smt_thread_allowed() cpu/SMT: Allow enabling partial SMT states via sysfs powerpc/pseries: Initialise CPU hotplug callbacks earlier powerpc: Add HOTPLUG_SMT support powerpc/pseries: Honour current SMT state when DLPAR onlining CPUs .../ABI/testing/sysfs-devices-system-cpu | 1 + .../admin-guide/kernel-parameters.txt | 4 +- arch/Kconfig | 3 + arch/powerpc/Kconfig | 2 + arch/powerpc/include/asm/topology.h | 15 ++ arch/powerpc/kernel/smp.c | 8 +- arch/powerpc/platforms/pseries/hotplug-cpu.c | 30 ++-- arch/powerpc/platforms/pseries/pseries.h | 2 + arch/powerpc/platforms/pseries/setup.c| 2 + arch/x86/include/asm/topology.h | 4 +- arch/x86/kernel/cpu/common.c | 2 +- arch/x86/kernel/smpboot.c | 8 - include/linux/cpu.h | 25 +-- include/linux/cpu_smt.h | 33 kernel/cpu.c | 142 +- 15 files changed, 196 insertions(+), 85 deletions(-) create mode 100644 include/linux/cpu_smt.h -- 2.41.0
[PATCH v4 04/10] cpu/SMT: Store the current/max number of threads
From: Michael Ellerman Some architectures allow partial SMT states at boot time, ie. when not all SMT threads are brought online. To support that, the SMT code needs to know the maximum number of SMT threads, and also the currently configured number. The architecture code knows the max number of threads, so have the architecture code pass that value to cpu_smt_set_num_threads(). Note that although topology_max_smt_threads() exists, it is not configured early enough to be used here. As architectures like PowerPC allow the number of threads to be set via the kernel command line, also pass that value. Signed-off-by: Michael Ellerman [ldufour: slightly reword the commit message] [ldufour: rename cpu_smt_check_topology and add a num_threads argument] Signed-off-by: Laurent Dufour --- arch/x86/kernel/cpu/common.c | 2 +- include/linux/cpu_smt.h | 8 ++-- kernel/cpu.c | 21 - 3 files changed, 27 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 52683fddafaf..12a48a85da3d 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -2317,7 +2317,7 @@ void __init arch_cpu_finalize_init(void) * identify_boot_cpu() initialized SMT support information, let the * core code know. 
*/ - cpu_smt_check_topology(); + cpu_smt_set_num_threads(smp_num_siblings, smp_num_siblings); if (!IS_ENABLED(CONFIG_SMP)) { pr_info("CPU: "); diff --git a/include/linux/cpu_smt.h b/include/linux/cpu_smt.h index 722c2e306fef..0c1664294b57 100644 --- a/include/linux/cpu_smt.h +++ b/include/linux/cpu_smt.h @@ -12,15 +12,19 @@ enum cpuhp_smt_control { #if defined(CONFIG_SMP) && defined(CONFIG_HOTPLUG_SMT) extern enum cpuhp_smt_control cpu_smt_control; +extern unsigned int cpu_smt_num_threads; extern void cpu_smt_disable(bool force); -extern void cpu_smt_check_topology(void); +extern void cpu_smt_set_num_threads(unsigned int num_threads, + unsigned int max_threads); extern bool cpu_smt_possible(void); extern int cpuhp_smt_enable(void); extern int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval); #else # define cpu_smt_control (CPU_SMT_NOT_IMPLEMENTED) +# define cpu_smt_num_threads 1 static inline void cpu_smt_disable(bool force) { } -static inline void cpu_smt_check_topology(void) { } +static inline void cpu_smt_set_num_threads(unsigned int num_threads, + unsigned int max_threads) { } static inline bool cpu_smt_possible(void) { return false; } static inline int cpuhp_smt_enable(void) { return 0; } static inline int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) { return 0; } diff --git a/kernel/cpu.c b/kernel/cpu.c index b6fe170c93e9..d7dd535cb5b5 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -594,6 +594,8 @@ void __weak arch_smt_update(void) { } #ifdef CONFIG_HOTPLUG_SMT enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED; +static unsigned int cpu_smt_max_threads __ro_after_init; +unsigned int cpu_smt_num_threads __read_mostly = UINT_MAX; void __init cpu_smt_disable(bool force) { @@ -607,16 +609,33 @@ void __init cpu_smt_disable(bool force) pr_info("SMT: disabled\n"); cpu_smt_control = CPU_SMT_DISABLED; } + cpu_smt_num_threads = 1; } /* * The decision whether SMT is supported can only be done after the full * CPU identification. 
Called from architecture code. */ -void __init cpu_smt_check_topology(void) +void __init cpu_smt_set_num_threads(unsigned int num_threads, + unsigned int max_threads) { + WARN_ON(!num_threads || (num_threads > max_threads)); + if (!topology_smt_supported()) cpu_smt_control = CPU_SMT_NOT_SUPPORTED; + + cpu_smt_max_threads = max_threads; + + /* +* If SMT has been disabled via the kernel command line or SMT is +* not supported, set cpu_smt_num_threads to 1 for consistency. +* If enabled, take the architecture requested number of threads +* to bring up into account. +*/ + if (cpu_smt_control != CPU_SMT_ENABLED) + cpu_smt_num_threads = 1; + else if (num_threads < cpu_smt_num_threads) + cpu_smt_num_threads = num_threads; } static int __init smt_cmdline_disable(char *str) -- 2.41.0
[PATCH v4 01/10] cpu/hotplug: remove dependency against cpu_primary_thread_mask
Commit 18415f33e2ac ("cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE") introduced a dependency on the global variable cpu_primary_thread_mask exported by the x86 code. This variable is only used when CONFIG_HOTPLUG_PARALLEL is set. Since cpuhp_get_primary_thread_mask() and cpuhp_smt_aware() are only used when CONFIG_HOTPLUG_PARALLEL is set, don't define them when it is not set. There is no functional change introduced by this patch. Cc: Thomas Gleixner Signed-off-by: Laurent Dufour --- kernel/cpu.c | 24 ++-- 1 file changed, 10 insertions(+), 14 deletions(-) diff --git a/kernel/cpu.c b/kernel/cpu.c index 88a7ede322bd..03309f2f35a4 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -650,22 +650,8 @@ bool cpu_smt_possible(void) } EXPORT_SYMBOL_GPL(cpu_smt_possible); -static inline bool cpuhp_smt_aware(void) -{ - return topology_smt_supported(); -} - -static inline const struct cpumask *cpuhp_get_primary_thread_mask(void) -{ - return cpu_primary_thread_mask; -} #else static inline bool cpu_smt_allowed(unsigned int cpu) { return true; } -static inline bool cpuhp_smt_aware(void) { return false; } -static inline const struct cpumask *cpuhp_get_primary_thread_mask(void) -{ - return cpu_present_mask; -} #endif static inline enum cpuhp_state @@ -1793,6 +1779,16 @@ static int __init parallel_bringup_parse_param(char *arg) } early_param("cpuhp.parallel", parallel_bringup_parse_param); +static inline bool cpuhp_smt_aware(void) +{ + return topology_smt_supported(); +} + +static inline const struct cpumask *cpuhp_get_primary_thread_mask(void) +{ + return cpu_primary_thread_mask; +} + /* * On architectures which have enabled parallel bringup this invokes all BP * prepare states for each of the to be onlined APs first. The last state -- 2.41.0
Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
> > @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, > > struct kvm_mmu *mmu, > > */ > > static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page > > *sp, int i) > > { > > - bool host_writable; > > + bool host_writable, is_refcounted; > > gpa_t first_pte_gpa; > > u64 *sptep, spte; > > struct kvm_memory_slot *slot; > > @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, > > struct kvm_mmu_page *sp, int > > sptep = &sp->spt[i]; > > spte = *sptep; > > host_writable = spte & shadow_host_writable_mask; > > + // TODO: is this correct? > > + is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED; > > slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn); > > make_spte(vcpu, sp, slot, pte_access, gfn, > > spte_to_pfn(spte), spte, true, false, > > - host_writable, &spte); > > + host_writable, is_refcounted, &spte); > > Could we restrict that a non-refcounted page shall not be used as shadow page? Oh, sorry. It's not about shadow page. It's about guest page being mapped as not refcounted. Silly me... B.R. Yu
Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
On Tue, 4 Jul 2023 16:50:48 +0900 David Stevens wrote: > From: David Stevens > > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows > callers to resolve a gfn when the associated pfn has a valid struct page > that isn't being actively refcounted (e.g. tail pages of non-compound > higher order pages). For a caller to safely omit FOLL_GET, all usages of > the returned pfn must be guarded by a mmu notifier. > > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that > is set when the returned pfn has an associated struct page with a valid > refcount. Callers that don't pass FOLL_GET should remember this value > and use it to avoid places like kvm_is_ad_tracked_page that assume a > non-zero refcount. > > Signed-off-by: David Stevens > --- > include/linux/kvm_host.h | 10 ++ > virt/kvm/kvm_main.c | 67 +--- > virt/kvm/pfncache.c | 2 +- > 3 files changed, 47 insertions(+), 32 deletions(-) > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index ef2763c2b12e..a45308c7d2d9 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct > kvm_memory_slot *slot, gfn_t gfn, > void kvm_release_page_clean(struct page *page); > void kvm_release_page_dirty(struct page *page); > > +void kvm_set_page_accessed(struct page *page); > +void kvm_set_page_dirty(struct page *page); > + > struct kvm_follow_pfn { > const struct kvm_memory_slot *slot; > gfn_t gfn; > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn { > bool atomic; > /* Allow a read fault to create a writeable mapping. */ > bool allow_write_mapping; > + /* > + * Usage of the returned pfn will be guared by a mmu notifier. Must ^guarded > + * be true if FOLL_GET is not set. > + */ > + bool guarded_by_mmu_notifier; > It seems no one sets the guraded_by_mmu_notifier in this patch. Is guarded_by_mmu_notifier always equal to !foll->FOLL_GET and set by the caller of __kvm_follow_pfn()? 
If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page? It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a temporary solution. I don't think it is a good idea to play tricks with a temporary solution, more like we are abusing the toleration. Is a flag like guarded_by_mmu_notifier (perhaps a better name) enough to indicate a tail page? > /* Outputs of __kvm_follow_pfn */ > hva_t hva; > bool writable; > + /* True if the returned pfn is for a page with a valid refcount. */ > + bool is_refcounted_page; > }; > > kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll); > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index b13f22861d2f..0f7b41f220b6 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -2502,6 +2502,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn > *foll, kvm_pfn_t *pfn) > if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) { > *pfn = page_to_pfn(page[0]); > foll->writable = foll->allow_write_mapping; > + foll->is_refcounted_page = true; > + if (!(foll->flags & FOLL_GET)) > + put_page(page[0]); > return true; > } > > @@ -2525,6 +2528,7 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, > kvm_pfn_t *pfn) > return npages; > > foll->writable = (foll->flags & FOLL_WRITE) && > foll->allow_write_mapping; > + foll->is_refcounted_page = true; > > /* map read fault as writable if possible */ > if (unlikely(!foll->writable) && foll->allow_write_mapping) { > @@ -2537,6 +2541,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, > kvm_pfn_t *pfn) > } > } > *pfn = page_to_pfn(page); > + if (!(foll->flags & FOLL_GET)) > + put_page(page); > return npages; > } > > @@ -2551,16 +2557,6 @@ static bool vma_is_valid(struct vm_area_struct *vma, > bool write_fault) > return true; > } > > -static int kvm_try_get_pfn(kvm_pfn_t pfn) > -{ > - struct page *page = kvm_pfn_to_refcounted_page(pfn); > - > - if (!page) > - return 1; > - > - return get_page_unless_zero(page); > -} > - > 
static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct > kvm_follow_pfn *foll, > kvm_pfn_t *p_pfn) > { > @@ -2568,6 +2564,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct > *vma, struct kvm_follow_pfn > pte_t *ptep; > spinlock_t *ptl; > bool write_fault = foll->flags & FOLL_WRITE; > + struct page *page; > int r; > > r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl); > @@ -2599,24 +2596,27 @@ static int hva_to_pfn_remapped(struct vm_area_struct > *vma, struct kvm_follow_pfn > pfn = pte_pfn(
Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
On Tue, 4 Jul 2023 16:50:47 +0900 David Stevens wrote: > From: David Stevens > > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot. > __kvm_follow_pfn refactors the old API's arguments into a struct and, > where possible, combines the boolean arguments into a single flags > argument. > > Signed-off-by: David Stevens > --- > include/linux/kvm_host.h | 16 > virt/kvm/kvm_main.c | 171 ++- > virt/kvm/kvm_mm.h| 3 +- > virt/kvm/pfncache.c | 8 +- > 4 files changed, 122 insertions(+), 76 deletions(-) > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 9d3ac7720da9..ef2763c2b12e 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -97,6 +97,7 @@ > #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1) > #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2) > #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3) > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4) > > /* > * error pfns indicate that the gfn is in slot but faild to > @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct > kvm_memory_slot *slot, gfn_t gfn, > void kvm_release_page_clean(struct page *page); > void kvm_release_page_dirty(struct page *page); > > +struct kvm_follow_pfn { > + const struct kvm_memory_slot *slot; > + gfn_t gfn; > + unsigned int flags; > + bool atomic; > + /* Allow a read fault to create a writeable mapping. 
*/ > + bool allow_write_mapping; > + > + /* Outputs of __kvm_follow_pfn */ > + hva_t hva; > + bool writable; > +}; > + > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll); > + > kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn); > kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault, > bool *writable); > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 371bd783ff2b..b13f22861d2f 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned > long addr) > * true indicates success, otherwise false is returned. It's also the > * only part that runs if we can in atomic context. > */ > -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault, > - bool *writable, kvm_pfn_t *pfn) > +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > { > struct page *page[1]; > + bool write_fault = foll->flags & FOLL_WRITE; > > /* >* Fast pin a writable pfn only if it is a write fault request >* or the caller allows to map a writable pfn for a read fault >* request. >*/ > - if (!(write_fault || writable)) > + if (!(write_fault || foll->allow_write_mapping)) > return false; > > - if (get_user_page_fast_only(addr, FOLL_WRITE, page)) { > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) { > *pfn = page_to_pfn(page[0]); > - > - if (writable) > - *writable = true; > + foll->writable = foll->allow_write_mapping; > return true; > } > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool > write_fault, > * The slow path to get the pfn of the specified host virtual address, > * 1 indicates success, -errno is returned if error is detected. 
> */ > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault, > -bool interruptible, bool *writable, kvm_pfn_t *pfn) > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > { > - unsigned int flags = FOLL_HWPOISON; > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags; > struct page *page; > int npages; > > might_sleep(); > > - if (writable) > - *writable = write_fault; > - > - if (write_fault) > - flags |= FOLL_WRITE; > - if (async) > - flags |= FOLL_NOWAIT; > - if (interruptible) > - flags |= FOLL_INTERRUPTIBLE; > - > - npages = get_user_pages_unlocked(addr, 1, &page, flags); > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags); > if (npages != 1) > return npages; > > + foll->writable = (foll->flags & FOLL_WRITE) && > foll->allow_write_mapping; > + > /* map read fault as writable if possible */ > - if (unlikely(!write_fault) && writable) { > + if (unlikely(!foll->writable) && foll->allow_write_mapping) { > struct page *wpage; > > - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) { > - *writable = true; > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) { > + foll->writable = true; > put_page(page); > page = wpage; > } > @@ -257
Re: [06/12] arch: Declare screen_info in
Hi

On 05.07.23 at 03:21, Sui Jingfeng wrote:

Hi, Thomas

I love your patch, yet after applying it, the Linux kernel fails to compile on my LoongArch machine. screen_info is missing.

I think this should be solved with your update to patch 1.

Best regards
Thomas

```
  CC arch/loongarch/kernel/efi.o
arch/loongarch/kernel/efi.c: In function ‘init_screen_info’:
arch/loongarch/kernel/efi.c:77:54: error: invalid application of ‘sizeof’ to incomplete type ‘struct screen_info’
   77 | si = early_memremap(screen_info_table, sizeof(*si));
      |                                              ^
arch/loongarch/kernel/efi.c:82:9: error: ‘screen_info’ undeclared (first use in this function)
   82 | screen_info = *si;
      | ^~~
arch/loongarch/kernel/efi.c:82:9: note: each undeclared identifier is reported only once for each function it appears in
arch/loongarch/kernel/efi.c:82:23: error: invalid use of undefined type ‘struct screen_info’
   82 | screen_info = *si;
      |               ^
arch/loongarch/kernel/efi.c:83:29: error: invalid application of ‘sizeof’ to incomplete type ‘struct screen_info’
   83 | memset(si, 0, sizeof(*si));
      |               ^
arch/loongarch/kernel/efi.c:84:34: error: invalid application of ‘sizeof’ to incomplete type ‘struct screen_info’
   84 | early_memunmap(si, sizeof(*si));
      |                    ^
make[3]: *** [scripts/Makefile.build:252: arch/loongarch/kernel/efi.o] Error 1
make[3]: *** Waiting for unfinished jobs
make[2]: *** [scripts/Makefile.build:494: arch/loongarch/kernel] Error 2
make[1]: *** [scripts/Makefile.build:494: arch/loongarch] Error 2
make[1]: *** Waiting for unfinished jobs
make: *** [Makefile:2026: .] Error 2
```

On 2023/6/29 19:45, Thomas Zimmermann wrote: The variable screen_info does not exist on all architectures. Declare it in . All architectures that do declare it will provide it via . Add the Kconfig token ARCH_HAS_SCREEN_INFO to guard against access on architectures that don't provide screen_info.
Signed-off-by: Thomas Zimmermann Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: Matt Turner Cc: Russell King Cc: Catalin Marinas Cc: Will Deacon Cc: Guo Ren Cc: Brian Cain Cc: Huacai Chen Cc: WANG Xuerui Cc: Thomas Bogendoerfer Cc: Dinh Nguyen Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Christophe Leroy Cc: Paul Walmsley Cc: Palmer Dabbelt Cc: Albert Ou Cc: Yoshinori Sato Cc: Rich Felker Cc: John Paul Adrian Glaubitz Cc: "David S. Miller" Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: x...@kernel.org Cc: "H. Peter Anvin" Cc: Chris Zankel Cc: Max Filippov Cc: Helge Deller Cc: Arnd Bergmann Cc: Kees Cook Cc: "Paul E. McKenney" Cc: Peter Zijlstra Cc: Frederic Weisbecker Cc: Andrew Morton Cc: Ard Biesheuvel Cc: Sami Tolvanen Cc: Juerg Haefliger Cc: Geert Uytterhoeven Cc: Anshuman Khandual Cc: Niklas Schnelle Cc: "Russell King (Oracle)" Cc: Linus Walleij Cc: Sebastian Reichel Cc: "Mike Rapoport (IBM)" Cc: "Kirill A. Shutemov" Cc: Zi Yan Acked-by: WANG Xuerui # loongarch --- arch/Kconfig | 6 ++ arch/alpha/Kconfig | 1 + arch/arm/Kconfig | 1 + arch/arm64/Kconfig | 1 + arch/csky/Kconfig | 1 + arch/hexagon/Kconfig | 1 + arch/ia64/Kconfig | 1 + arch/loongarch/Kconfig | 1 + arch/mips/Kconfig | 1 + arch/nios2/Kconfig | 1 + arch/powerpc/Kconfig | 1 + arch/riscv/Kconfig | 1 + arch/sh/Kconfig | 1 + arch/sparc/Kconfig | 1 + arch/x86/Kconfig | 1 + arch/xtensa/Kconfig | 1 + drivers/video/Kconfig | 3 +++ include/asm-generic/Kbuild | 1 + include/asm-generic/screen_info.h | 12 include/linux/screen_info.h | 2 +- 20 files changed, 38 insertions(+), 1 deletion(-) create mode 100644 include/asm-generic/screen_info.h diff --git a/arch/Kconfig b/arch/Kconfig index 205fd23e0cada..2f58293fd7bcb 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -1466,6 +1466,12 @@ config ARCH_HAS_NONLEAF_PMD_YOUNG address translations. Page table walkers that clear the accessed bit may use this capability to reduce their search space. 
+config ARCH_HAS_SCREEN_INFO + bool + help + Selected by architectures that provide a global instance of + screen_info. + source "kernel/gcov/Kconfig" source "scripts/gcc-plugins/Kconfig" diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig index a5c2b1aa46b02..d749011d88b14 100644 --- a/arch/alpha/Kconfig +++ b/arch/alpha/Kconfig @@ -4,6 +4,7 @@ config ALPHA default y select ARCH_32BIT_USTAT_F_TINODE select ARCH_HAS_CURRENT_STACK_POINTER
Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page
On Tue, 4 Jul 2023 10:03:57 -0700 (PDT) Hugh Dickins wrote: > On Tue, 4 Jul 2023, Gerald Schaefer wrote: > > On Sat, 1 Jul 2023 21:32:38 -0700 (PDT) > > Hugh Dickins wrote: > > > On Thu, 29 Jun 2023, Hugh Dickins wrote: > > > > > > > > I've grown to dislike the (ab)use of pt_frag_refcount even more, to the > > > > extent that I've not even tried to verify it; but I think I do get the > > > > point now, that we need further info than just PPHHAA to know whether > > > > the page is on the list or not. But I think that if we move where the > > > > call_rcu() is done, then the page can stay on or off the list by same > > > > rules as before (but need to check HH bits along with PP when deciding > > > > whether to allocate, and whether to list_add_tail() when freeing). > > > > > > No, not quite the same rules as before: I came to realize that using > > > list_add_tail() for the HH pages would be liable to put a page on the > > > list which forever blocked reuse of PP list_add_tail() pages after it > > > (could be solved by a list_move() somewhere, but we have agreed to > > > prefer simplicity). > > > > > > I've dropped the HH bits, I'm using PageActive like we did on powerpc, > > > I've dropped most of the pte_free_*() helpers, and list_del_init() is > > > an easier way of dealing with those "is it on the list" questions. > > > I expect that we shall be close to reaching agreement on... > > > > This looks really nice, almost too good and easy to be true. I did not > > find any obvious flaw, just some comments below. It also survived LTP > > without any visible havoc, so I guess this approach is the best so far. > > Phew! I'm of course glad to hear this: thanks for your efforts on it. > > ... > > > --- a/arch/s390/mm/pgalloc.c > > > +++ b/arch/s390/mm/pgalloc.c > > > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page) > > > * logic described above. 
Both AA bits are set to 1 to denote a > > > 4KB-pgtable > > > * while the PP bits are never used, nor such a page is added to or > > > removed > > > * from mm_context_t::pgtable_list. > > > + * > > > + * pte_free_defer() overrides those rules: it takes the page off > > > pgtable_list, > > > + * and prevents both 2K fragments from being reused. pte_free_defer() > > > has to > > > + * guarantee that its pgtable cannot be reused before the RCU grace > > > period > > > + * has elapsed (which page_table_free_rcu() does not actually > > > guarantee). > > > > Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not > > allow reuse before grace period elapsed. And I hope that it does so, by > > setting the PP bits, which would be noticed in page_table_alloc(), in > > case the page would be seen there. > > > > Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the > > end of the list, and so they could be seen in page_table_alloc(), but they > > should not be reused before grace period elapsed and __tlb_remove_table() > > cleared the PP bits, as far as I understand. > > > > So what exactly do you mean with "which page_table_free_rcu() does not > > actually > > guarantee"? > > I'll answer without locating and re-reading what Jason explained earlier, > perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table(): > he may have explained it better. And without working out again all the > MMU_GATHER #defines, and which of them do and do not apply to s390 here. > > The detail that sticks in my mind is the fallback in tlb_remove_table() Ah ok, I was aware of that "semi-RCU" fallback logic in tlb_remove_table(), but that is rather a generic issue, and not s390-specific. I thought you meant some s390-oddity here, of which we have a lot, unfortunately... Of course, we call tlb_remove_table() from our page_table_free_rcu(), so I guess you could say that page_table_free_rcu() cannot guarantee what tlb_remove_table() cannot guarantee. 
Maybe change to "which page_table_free_rcu() does not actually guarantee, by calling tlb_remove_table()", to make it clear that this is not a problem of page_table_free_rcu() itself. > in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot > batch the tables for freeing by RCU, and resorts instead to an immediate > TLB flush (I think: that again involves chasing definitions) followed by > tlb_remove_table_sync_one() - which just delivers an interrupt to each CPU, > and is commented: > /* > * This isn't an RCU grace period and hence the page-tables cannot be > * assumed to be actually RCU-freed. > * > * It is however sufficient for software page-table walkers that rely on > * IRQ disabling. > */ > > Whether that's good for your PP pages or not, I've given no thought: > I've just taken it on trust that what s390 has working today is good. Yes, we should be fine with that, current code can be trusted :-) > > If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(), > then I would not have written "(which page_tab
Re: [PATCH v3 3/9] cpu/SMT: Store the current/max number of threads
On 05/07/2023 at 05:05, Zhang, Rui wrote: On Thu, 2023-06-29 at 16:31 +0200, Laurent Dufour wrote: From: Michael Ellerman Some architectures allows partial SMT states at boot time, s/allows/allow. Thanks Rui!
Re: [PATCH v3 6/9] cpu/SMT: Allow enabling partial SMT states via sysfs
On 05/07/2023 at 05:14, Zhang, Rui wrote: On Thu, 2023-06-29 at 16:31 +0200, Laurent Dufour wrote: @@ -2580,6 +2597,17 @@ static ssize_t control_show(struct device *dev, { const char *state = smt_states[cpu_smt_control]; +#ifdef CONFIG_HOTPLUG_SMT + /* + * If SMT is enabled but not all threads are enabled then show the + * number of threads. If all threads are enabled show "on". Otherwise + * show the state name. + */ + if (cpu_smt_control == CPU_SMT_ENABLED && + cpu_smt_num_threads != cpu_smt_max_threads) + return sysfs_emit(buf, "%d\n", cpu_smt_num_threads); +#endif + My understanding is that cpu_smt_control is always set to CPU_SMT_NOT_IMPLEMENTED when CONFIG_HOTPLUG_SMT is not set, so this ifdef is not necessary, right? Hi Rui, Indeed, cpu_smt_control, cpu_smt_num_threads and cpu_smt_max_threads are only defined when CONFIG_HOTPLUG_SMT is set. This is the reason for this #ifdef block. This has been reported by the kernel test robot testing v2: https://lore.kernel.org/oe-kbuild-all/202306282340.ihqm0fla-...@intel.com Cheers, Laurent.
Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
On Tue, Jul 04, 2023 at 04:50:48PM +0900, David Stevens wrote: > From: David Stevens > > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows > callers to resolve a gfn when the associated pfn has a valid struct page > that isn't being actively refcounted (e.g. tail pages of non-compound > higher order pages). For a caller to safely omit FOLL_GET, all usages of > the returned pfn must be guarded by a mmu notifier. > > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that > is set when the returned pfn has an associated struct page with a valid > refcount. Callers that don't pass FOLL_GET should remember this value > and use it to avoid places like kvm_is_ad_tracked_page that assume a > non-zero refcount. > > Signed-off-by: David Stevens > --- > include/linux/kvm_host.h | 10 ++ > virt/kvm/kvm_main.c | 67 +--- > virt/kvm/pfncache.c | 2 +- > 3 files changed, 47 insertions(+), 32 deletions(-) > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index ef2763c2b12e..a45308c7d2d9 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct > kvm_memory_slot *slot, gfn_t gfn, > void kvm_release_page_clean(struct page *page); > void kvm_release_page_dirty(struct page *page); > > +void kvm_set_page_accessed(struct page *page); > +void kvm_set_page_dirty(struct page *page); > + > struct kvm_follow_pfn { > const struct kvm_memory_slot *slot; > gfn_t gfn; > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn { > bool atomic; > /* Allow a read fault to create a writeable mapping. */ > bool allow_write_mapping; > + /* > + * Usage of the returned pfn will be guared by a mmu notifier. Must > + * be true if FOLL_GET is not set. > + */ > + bool guarded_by_mmu_notifier; And how? Any place to check the invalidate seq? B.R. Yu
Re: [PATCH v3 0/9] Introduce SMT level and add PowerPC support
On 05/07/2023 at 05:04, Zhang, Rui wrote: Hi, Laurent, I want to test this patch set and found that it does not apply on top of the latest upstream git, because of some changes in this merge window, so better rebase. Hi Rui, Thanks for your interest in this series. Thomas's latest changes came into the PowerPC next branch. I'm working on a rebase. Cheers, Laurent. thanks, rui On Thu, 2023-06-29 at 16:31 +0200, Laurent Dufour wrote: I'm taking over the series Michael sent previously [1], which smartly reworks the initial series I sent [2]. This series addresses the comments sent by Thomas and me on Michael's one. Here is a short introduction to the issue this series is addressing: When a new CPU is added, the kernel activates all its threads. This leads to a weird, but functional, result when adding a CPU on an SMT 4 system, for instance. Here the newly added CPU 1 has 8 threads while the other one has 4 threads active (the system has been booted with the 'smt-enabled=4' kernel option): ltcden3-lp12:~ # ppc64_cpu --info Core 0: 0* 1* 2* 3* 4 5 6 7 Core 1: 8* 9* 10* 11* 12* 13* 14* 15* This mixed SMT level may confuse end users and/or some applications. There is no SMT level recorded in the kernel (common code), nor in user space, as far as I know. Such a level is helpful when adding a new CPU or when optimizing energy efficiency (when reactivating CPUs). When SMP and HOTPLUG_SMT are defined, this series adds a new SMT level (cpu_smt_num_threads) and a few callbacks allowing the architecture code to finely control this value, setting a max and an "at boot" level, and controlling whether a thread should be onlined or not.
v3: Fix a build error in the patch 6/9 v2: As Thomas suggested, Reword some commit's description Remove topology_smt_supported() Remove topology_smt_threads_supported() Introduce CONFIG_SMT_NUM_THREADS_DYNAMIC Remove switch() in __store_smt_control() Update kernel-parameters.txt [1] https://lore.kernel.org/linuxppc-dev/20230524155630.794584-1-...@ellerman.id.au/ [2] https://lore.kernel.org/linuxppc-dev/20230331153905.31698-1-lduf...@linux.ibm.com/ Laurent Dufour (1): cpu/SMT: Remove topology_smt_supported() Michael Ellerman (8): cpu/SMT: Move SMT prototypes into cpu_smt.h cpu/SMT: Move smt/control simple exit cases earlier cpu/SMT: Store the current/max number of threads cpu/SMT: Create topology_smt_thread_allowed() cpu/SMT: Allow enabling partial SMT states via sysfs powerpc/pseries: Initialise CPU hotplug callbacks earlier powerpc: Add HOTPLUG_SMT support powerpc/pseries: Honour current SMT state when DLPAR onlining CPUs .../ABI/testing/sysfs-devices-system-cpu | 1 + .../admin-guide/kernel-parameters.txt | 4 +- arch/Kconfig | 3 + arch/powerpc/Kconfig | 2 + arch/powerpc/include/asm/topology.h | 15 +++ arch/powerpc/kernel/smp.c | 8 +- arch/powerpc/platforms/pseries/hotplug-cpu.c | 30 +++-- arch/powerpc/platforms/pseries/pseries.h | 2 + arch/powerpc/platforms/pseries/setup.c | 2 + arch/x86/include/asm/topology.h | 4 +- arch/x86/kernel/cpu/bugs.c | 3 +- arch/x86/kernel/smpboot.c | 8 -- include/linux/cpu.h | 25 +--- include/linux/cpu_smt.h | 33 + kernel/cpu.c | 118 ++-- -- 15 files changed, 187 insertions(+), 71 deletions(-) create mode 100644 include/linux/cpu_smt.h
Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
On Wed, Jul 05, 2023 at 06:22:59PM +0900, David Stevens wrote: > On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang wrote: > > > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, > > > bool write_fault, > > > * The slow path to get the pfn of the specified host virtual address, > > > * 1 indicates success, -errno is returned if error is detected. > > > */ > > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool > > > write_fault, > > > -bool interruptible, bool *writable, kvm_pfn_t > > > *pfn) > > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > > > { > > > - unsigned int flags = FOLL_HWPOISON; > > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags; > > > struct page *page; > > > int npages; > > > > > > might_sleep(); > > > > > > - if (writable) > > > - *writable = write_fault; > > > - > > > - if (write_fault) > > > - flags |= FOLL_WRITE; > > > - if (async) > > > - flags |= FOLL_NOWAIT; > > > - if (interruptible) > > > - flags |= FOLL_INTERRUPTIBLE; > > > - > > > - npages = get_user_pages_unlocked(addr, 1, &page, flags); > > > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags); > > > if (npages != 1) > > > return npages; > > > > > > + foll->writable = (foll->flags & FOLL_WRITE) && > > > foll->allow_write_mapping; > > > + > > > /* map read fault as writable if possible */ > > > - if (unlikely(!write_fault) && writable) { > > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) { > > > > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here. > > The two statements are logically equivalent, although I guess using > !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more > verbose. Well, as the comment says, we wanna try to map the read fault as writable whenever possible. And __gfn_to_pfn_memslot() will only set the FOLL_WRITE for write faults. So I guess using !foll->writable will not allow this. Did I miss anything? 
> > > +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t > > > gfn, > > > +bool atomic, bool interruptible, bool *async, > > > +bool write_fault, bool *writable, hva_t *hva) > > > +{ > > > + kvm_pfn_t pfn; > > > + struct kvm_follow_pfn foll = { > > > + .slot = slot, > > > + .gfn = gfn, > > > + .flags = 0, > > > + .atomic = atomic, > > > + .allow_write_mapping = !!writable, > > > + }; > > > + > > > + if (write_fault) > > > + foll.flags |= FOLL_WRITE; > > > + if (async) > > > + foll.flags |= FOLL_NOWAIT; > > > + if (interruptible) > > > + foll.flags |= FOLL_INTERRUPTIBLE; > > > + > > > + pfn = __kvm_follow_pfn(&foll); > > > + if (pfn == KVM_PFN_ERR_NEEDS_IO) { > > > > Could we just use KVM_PFN_ERR_FAULT and foll.flags here? I.e., > > if (pfn == KVM_PFN_ERR_FAULT && (foll.flags & FOLL_NOWAIT))? > > Setting pfn to KVM_PFN_ERR_NEEDS_IO just to indicate an async fault > > seems unnecessary. > > There are the cases where the fault does not fall within a vma or when > the target vma's flags don't support the fault's access permissions. > In those cases, continuing to try to resolve the fault won't cause > problems per-se, but it's wasteful and a bit confusing. Having > hva_to_pfn detect whether or not it may be possible to resolve the > fault asynchronously and return KVM_PFN_ERR_NEEDS_IO if so seems like > a good idea. It also matches what the existing code does. Got it. Sounds reasonable. And thanks! :) B.R. Yu
Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote: > From: David Stevens > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map > memory into the guest that is backed by un-refcounted struct pages - for > example, higher order non-compound pages allocated by the amdgpu driver > via ttm_pool_alloc_page. > > The bulk of this change is tracking the is_refcounted_page flag so that > non-refcounted pages don't trigger page_count() == 0 warnings. This is > done by storing the flag in an unused bit in the sptes. > > Signed-off-by: David Stevens > --- > arch/x86/kvm/mmu/mmu.c | 44 + > arch/x86/kvm/mmu/mmu_internal.h | 1 + > arch/x86/kvm/mmu/paging_tmpl.h | 9 --- > arch/x86/kvm/mmu/spte.c | 4 ++- > arch/x86/kvm/mmu/spte.h | 12 - > arch/x86/kvm/mmu/tdp_mmu.c | 22 ++--- > 6 files changed, 62 insertions(+), 30 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index e44ab512c3a1..b1607e314497 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -553,12 +553,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte) > > if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) { > flush = true; > - kvm_set_pfn_accessed(spte_to_pfn(old_spte)); > + if (is_refcounted_page_pte(old_spte)) > + > kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte))); > } > > if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) { > flush = true; > - kvm_set_pfn_dirty(spte_to_pfn(old_spte)); > + if (is_refcounted_page_pte(old_spte)) > + kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte))); > } > > return flush; > @@ -596,14 +598,18 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, > u64 *sptep) >* before they are reclaimed. Sanity check that, if the pfn is backed >* by a refcounted page, the refcount is elevated. 
>*/ > - page = kvm_pfn_to_refcounted_page(pfn); > - WARN_ON(page && !page_count(page)); > + if (is_refcounted_page_pte(old_spte)) { > + page = kvm_pfn_to_refcounted_page(pfn); > + WARN_ON(!page || !page_count(page)); > + } > > - if (is_accessed_spte(old_spte)) > - kvm_set_pfn_accessed(pfn); > + if (is_refcounted_page_pte(old_spte)) { > + if (is_accessed_spte(old_spte)) > + kvm_set_page_accessed(pfn_to_page(pfn)); > > - if (is_dirty_spte(old_spte)) > - kvm_set_pfn_dirty(pfn); > + if (is_dirty_spte(old_spte)) > + kvm_set_page_dirty(pfn_to_page(pfn)); > + } > > return old_spte; > } > @@ -639,8 +645,8 @@ static bool mmu_spte_age(u64 *sptep) >* Capture the dirty status of the page, so that it doesn't get >* lost when the SPTE is marked for access tracking. >*/ > - if (is_writable_pte(spte)) > - kvm_set_pfn_dirty(spte_to_pfn(spte)); > + if (is_writable_pte(spte) && is_refcounted_page_pte(spte)) > + kvm_set_page_dirty(pfn_to_page(spte_to_pfn(spte))); > > spte = mark_spte_for_access_track(spte); > mmu_spte_update_no_track(sptep, spte); > @@ -1278,8 +1284,8 @@ static bool spte_wrprot_for_clear_dirty(u64 *sptep) > { > bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT, > (unsigned long *)sptep); > - if (was_writable && !spte_ad_enabled(*sptep)) > - kvm_set_pfn_dirty(spte_to_pfn(*sptep)); > + if (was_writable && !spte_ad_enabled(*sptep) && > is_refcounted_page_pte(*sptep)) > + kvm_set_page_dirty(pfn_to_page(spte_to_pfn(*sptep))); > > return was_writable; > } > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct > kvm_memory_slot *slot, > bool host_writable = !fault || fault->map_writable; > bool prefetch = !fault || fault->prefetch; > bool write_fault = fault && fault->write; > + bool is_refcounted = !fault || fault->is_refcounted_page; > > pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__, >*sptep, write_fault, gfn); > @@ -2969,7 +2976,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct > kvm_memory_slot *slot, > } > > wrprot = 
make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, > prefetch, > -true, host_writable, &spte); > +true, host_writable, is_refcounted, &spte); > > if (*sptep == spte) { > ret = RET_PF_SPURIOUS; > @@ -4299,8 +4306,9 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, > struct kvm_page_fault *fault > struct kvm_follow_pfn foll = { > .slot = slot, >
Re: [PATCH] KVM: PPC: Update MAINTAINERS
Sean Christopherson writes: > On Thu, Jun 08, 2023, Nicholas Piggin wrote: >> Michael is merging KVM PPC patches via the powerpc tree and KVM topic >> branches. He doesn't necessarily have time to be across all of KVM so >> is reluctant to call himself maintainer, but for the mechanics of how >> patches flow upstream, it is maintained and does make sense to have >> some contact people in MAINTAINERS. >> >> So add Michael Ellerman as KVM PPC maintainer and myself as reviewer. >> Split out the subarchs that don't get so much attention. >> >> Signed-off-by: Nicholas Piggin >> --- > > Thanks for documenting the reality of things, much appreciated! > > Acked-by: Sean Christopherson > >> MAINTAINERS | 6 ++ >> 1 file changed, 6 insertions(+) >> >> diff --git a/MAINTAINERS b/MAINTAINERS >> index 0dab9737ec16..44417acd2936 100644 >> --- a/MAINTAINERS >> +++ b/MAINTAINERS >> @@ -11379,7 +11379,13 @@ F: arch/mips/include/uapi/asm/kvm* >> F: arch/mips/kvm/ >> >> KERNEL VIRTUAL MACHINE FOR POWERPC (KVM/powerpc) >> +M: Michael Ellerman >> +R: Nicholas Piggin >> L: linuxppc-dev@lists.ozlabs.org >> +L: k...@vger.kernel.org >> +S: Maintained (Book3S 64-bit HV) >> +S: Odd fixes (Book3S 64-bit PR) >> +S: Orphan (Book3E and 32-bit) > > Do you think there's any chance of dropping support for everything except > Book3S > 64-bit HV at some point soonish? Nick proposed disabling BookE KVM, which prompted some users to report they are still actively using it: https://lore.kernel.org/all/20221128043623.1745708-1-npig...@gmail.com/ There are also still some KVM PR users. In total I'd guess it's only some small 100s of users, but we don't really know. > There haven't been many generic KVM changes that touch PPC, but in my > experience when such series do come along, the many flavors and layers > of PPC incur quite a bit of development and testing cost, and have a > high chance of being broken compared to other architectures. Ack. cheers
Re: [RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration
On 2023/7/5 16:43, Barry Song wrote: > On Tue, Jul 4, 2023 at 10:36 PM Yicong Yang wrote: >> >> On 2023/6/30 1:26, Catalin Marinas wrote: >>> On Thu, Jun 29, 2023 at 05:31:36PM +0100, Catalin Marinas wrote: On Thu, May 18, 2023 at 02:59:34PM +0800, Yicong Yang wrote: > From: Barry Song > > on x86, batched and deferred tlb shootdown has led to 90% > performance increase on tlb shootdown. on arm64, HW can do > tlb shootdown without software IPI. But sync tlbi is still > quite expensive. [...] > .../features/vm/TLB/arch-support.txt | 2 +- > arch/arm64/Kconfig| 1 + > arch/arm64/include/asm/tlbbatch.h | 12 > arch/arm64/include/asm/tlbflush.h | 33 - > arch/arm64/mm/flush.c | 69 +++ > arch/x86/include/asm/tlbflush.h | 5 +- > include/linux/mm_types_task.h | 4 +- > mm/rmap.c | 12 ++-- First of all, this patch needs to be split into some preparatory patches introducing/renaming functions with no functional change for x86. Once done, you can add the arm64-only changes. >> >> got it. will try to split this patch as suggested. >> Now, on the implementation, I had some comments on v7 but we didn't get to a conclusion and the thread eventually died: https://lore.kernel.org/linux-mm/y7ctoj5mwd1zb...@arm.com/ I know I said a command line argument is better than Kconfig or some random number-of-CPUs heuristic, but it would be even better if we don't bother with any, just make this always on. >> >> ok, will make this always on. >> Barry had some comments around mprotect() being racy and that's why we have flush_tlb_batched_pending(), but I don't think it's needed (or, for arm64, it can be a DSB since this patch issues the TLBIs but without the DVM Sync). So we need to clarify this (see Barry's last email on the above thread) before attempting new versions of this patchset. With flush_tlb_batched_pending() removed (or a DSB), I have a suspicion such an implementation would be faster on any SoC irrespective of the number of CPUs.
>>> >>> I think I got the need for flush_tlb_batched_pending(). If >>> try_to_unmap() marks the pte !present and we have a pending TLBI, >>> change_pte_range() will skip the TLB maintenance altogether since it did >>> not change the pte. So we could be left with stale TLB entries after >>> mprotect() before TTU does the batch flushing. >>> > > Good catch. > This could also be true for MADV_DONTNEED. After try_to_unmap(), we run > MADV_DONTNEED on this area; as the pte is not present, we don't do anything > on this PTE in zap_pte_range() afterwards. > >>> We can have an arch-specific flush_tlb_batched_pending() that can be a >>> DSB only on arm64 and a full mm flush on x86. >>> >> >> We need to do a flush/dsb in flush_tlb_batched_pending() only in a race >> condition, so we first check whether there's a pending batched flush and, >> if so, do the tlb flush. The pending check is common, and the difference >> among the archs is how to flush the TLB here within >> flush_tlb_batched_pending(); >> on arm64 it should only be a dsb. >> >> As we only need to maintain the TLBs already pending in a batched flush, >> does it make sense to only handle those TLBs in flush_tlb_batched_pending()? >> Then we can use arch_tlbbatch_flush() rather than flush_tlb_mm() in >> flush_tlb_batched_pending() and no arch-specific function is needed. > > as we have issued no-sync tlbi on those pending addresses, that means > our hardware > has already "recorded" what should be flushed in the specific mm. so > a DSB alone will flush > them correctly. right? > Yes, that's right. I had just thought of something like below. arch_tlbbatch_flush() will only be a dsb on arm64, so this will match what Catalin wants. But as you said this may be incorrect on x86, so we'd better have an arch-specific implementation of flush_tlb_batched_pending() as suggested.
diff --git a/mm/rmap.c b/mm/rmap.c
index 9699c6011b0e..afa3571503a0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -717,7 +717,7 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
 	int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
 
 	if (pending != flushed) {
-		flush_tlb_mm(mm);
+		arch_tlbbatch_flush(&current->tlb_ubc.arch);
 		/*
 		 * If the new TLB flushing is pending during flushing, leave
 		 * mm->tlb_flush_batched as is, to avoid losing flushing.
Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote: > From: David Stevens > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map > memory into the guest that is backed by un-refcounted struct pages - for > example, higher order non-compound pages allocated by the amdgpu driver > via ttm_pool_alloc_page. I guess you mean the tail pages of the higher order non-compound pages? And as to the head page, it is said to be set to one coincidentally[*], and shall not be considered as refcounted. IIUC, the refcount of this head page will be increased and decreased soon in hva_to_pfn_remapped(), so this may not be a problem(?). But treating this head page differently, as a refcounted one (e.g., to set the A/D flags), is weird. Or maybe I missed some context, e.g., can the head page be allocated to the guest at all? > > The bulk of this change is tracking the is_refcounted_page flag so that > non-refcounted pages don't trigger page_count() == 0 warnings. This is > done by storing the flag in an unused bit in the sptes. Also, maybe we should mention this only works on x86-64. > > Signed-off-by: David Stevens > --- > arch/x86/kvm/mmu/mmu.c | 44 + > arch/x86/kvm/mmu/mmu_internal.h | 1 + > arch/x86/kvm/mmu/paging_tmpl.h | 9 --- > arch/x86/kvm/mmu/spte.c | 4 ++- > arch/x86/kvm/mmu/spte.h | 12 - > arch/x86/kvm/mmu/tdp_mmu.c | 22 ++--- > 6 files changed, 62 insertions(+), 30 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index e44ab512c3a1..b1607e314497 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c ... > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct > kvm_memory_slot *slot, > bool host_writable = !fault || fault->map_writable; > bool prefetch = !fault || fault->prefetch; > bool write_fault = fault && fault->write; > + bool is_refcounted = !fault || fault->is_refcounted_page; Just wonder, what if a non-refcounted page is prefetched? Or is it possible in practice? ...
> > @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, > struct kvm_mmu *mmu, > */ > static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, > int i) > { > - bool host_writable; > + bool host_writable, is_refcounted; > gpa_t first_pte_gpa; > u64 *sptep, spte; > struct kvm_memory_slot *slot; > @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, > struct kvm_mmu_page *sp, int > sptep = &sp->spt[i]; > spte = *sptep; > host_writable = spte & shadow_host_writable_mask; > + // TODO: is this correct? > + is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED; > slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn); > make_spte(vcpu, sp, slot, pte_access, gfn, > spte_to_pfn(spte), spte, true, false, > - host_writable, &spte); > + host_writable, is_refcounted, &spte); Could we restrict that a non-refcounted page shall not be used as shadow page? [*] https://lore.kernel.org/all/8caf3008-dcf3-985a-631e-e019b277c...@amd.com/ B.R. Yu
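The idea under discussion — stashing the is_refcounted_page flag in an otherwise-unused SPTE bit so teardown knows whether the backing struct page's refcount and A/D state may be touched — can be sketched as a small userspace model. This is only an illustration: the SPTE_MMU_PAGE_REFCOUNTED name comes from the quoted patch, but the bit position and pfn mask below are hypothetical stand-ins, not the kernel's actual layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software-available bit, mirroring the patch's
 * SPTE_MMU_PAGE_REFCOUNTED; the real bit is chosen from bits the
 * x86-64 hardware ignores and may differ. */
#define SPTE_MMU_PAGE_REFCOUNTED (1ULL << 59)
#define SPTE_PFN_SHIFT 12
#define SPTE_PFN_MASK  0x000ffffffffff000ULL

/* Build a toy SPTE from a pfn, recording whether the backing page
 * carries a real refcount. */
static uint64_t make_spte(uint64_t pfn, bool is_refcounted)
{
	uint64_t spte = (pfn << SPTE_PFN_SHIFT) & SPTE_PFN_MASK;

	if (is_refcounted)
		spte |= SPTE_MMU_PAGE_REFCOUNTED;
	return spte;
}

/* On teardown (or in sync_spte), only touch the struct-page refcount
 * and A/D bits if the flag says the page was refcounted. */
static bool spte_is_refcounted(uint64_t spte)
{
	return (spte & SPTE_MMU_PAGE_REFCOUNTED) != 0;
}

static uint64_t spte_to_pfn(uint64_t spte)
{
	return (spte & SPTE_PFN_MASK) >> SPTE_PFN_SHIFT;
}
```

Because the flag lives in the SPTE itself, paths like FNAME(sync_spte) that only have the old SPTE in hand can recover it without re-resolving the gfn, which is exactly what the quoted hunk does.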
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Wed, Jul 05, 2023 at 10:51:57AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote: > On 05.07.23 09:08, Greg KH wrote: > > On Tue, Jul 04, 2023 at 01:22:54PM -0700, Suren Baghdasaryan wrote: > >> On Tue, Jul 4, 2023 at 9:18 AM Andrew Morton > >> wrote: > >>> On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH > >>> wrote: > Thanks! I'll investigate this later today. After discussing with > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until > the issue is fixed. I'll post a patch shortly. > >>> > >>> Posted at: > >>> https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/ > >> > >> As that change fixes something in 6.4, why not cc: stable on it as > >> well? > > > > Sorry, I thought since per-VMA locks were introduced in 6.4 and this > > patch is fixing 6.4 I didn't need to send it to stable for older > > versions. Did I miss something? > > 6.4.y is a stable kernel tree right now, so yes, it needs to be included > there :) > >>> > >>> I'm in wait-a-few-days-mode on this. To see if we have a backportable > >>> fix rather than disabling the feature in -stable. > > Andrew, how long will you remain in "wait-a-few-days-mode"? Given what > Greg said below and that we already had three reports I know of I'd > prefer if we could fix this rather sooner than later in mainline -- > especially as Arch Linux and openSUSE Tumbleweed likely have switched to > 6.4.y already or will do so soon. Ick, yeah, and Fedora should be switching soon too, and I want to drop support for 6.3.y "any day now". Is there just a revert we can do now first to resolve the regression and then work on fixing this up "better" for 6.6-rc1? thanks, greg k-h
Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang wrote: > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, > > bool write_fault, > > * The slow path to get the pfn of the specified host virtual address, > > * 1 indicates success, -errno is returned if error is detected. > > */ > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool > > write_fault, > > -bool interruptible, bool *writable, kvm_pfn_t *pfn) > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > > { > > - unsigned int flags = FOLL_HWPOISON; > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags; > > struct page *page; > > int npages; > > > > might_sleep(); > > > > - if (writable) > > - *writable = write_fault; > > - > > - if (write_fault) > > - flags |= FOLL_WRITE; > > - if (async) > > - flags |= FOLL_NOWAIT; > > - if (interruptible) > > - flags |= FOLL_INTERRUPTIBLE; > > - > > - npages = get_user_pages_unlocked(addr, 1, &page, flags); > > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags); > > if (npages != 1) > > return npages; > > > > + foll->writable = (foll->flags & FOLL_WRITE) && > > foll->allow_write_mapping; > > + > > /* map read fault as writable if possible */ > > - if (unlikely(!write_fault) && writable) { > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) { > > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here. The two statements are logically equivalent, although I guess using !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more verbose. > > struct page *wpage; > > > > - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) { > > - *writable = true; > > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) { > > + foll->writable = true; > > put_page(page); > > page = wpage; > > } > > @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn) > > return get_page_unless_zero(page); > > } > > > ... 
> > > +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t > > gfn, > > +bool atomic, bool interruptible, bool *async, > > +bool write_fault, bool *writable, hva_t *hva) > > +{ > > + kvm_pfn_t pfn; > > + struct kvm_follow_pfn foll = { > > + .slot = slot, > > + .gfn = gfn, > > + .flags = 0, > > + .atomic = atomic, > > + .allow_write_mapping = !!writable, > > + }; > > + > > + if (write_fault) > > + foll.flags |= FOLL_WRITE; > > + if (async) > > + foll.flags |= FOLL_NOWAIT; > > + if (interruptible) > > + foll.flags |= FOLL_INTERRUPTIBLE; > > + > > + pfn = __kvm_follow_pfn(&foll); > > + if (pfn == KVM_PFN_ERR_NEEDS_IO) { > > Could we just use KVM_PFN_ERR_FAULT and foll.flags here? I.e., > if (pfn == KVM_PFN_ERR_FAULT && (foll.flags & FOLL_NOWAIT))? > Setting pfn to KVM_PFN_ERR_NEEDS_IO just to indicate an async fault > seems unnecessary. There are the cases where the fault does not fall within a vma or when the target vma's flags don't support the fault's access permissions. In those cases, continuing to try to resolve the fault won't cause problems per-se, but it's wasteful and a bit confusing. Having hva_to_pfn detect whether or not it may be possible to resolve the fault asynchronously and return KVM_PFN_ERR_NEEDS_IO if so seems like a good idea. It also matches what the existing code does. -David
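The compatibility wrapper discussed above does two things: it folds the old boolean arguments into foll.flags, and it uses a distinct KVM_PFN_ERR_NEEDS_IO return value so that, under FOLL_NOWAIT, "retrying asynchronously may succeed" is distinguishable from "this vma can never satisfy the access". A minimal userspace sketch of both, with illustrative constant values (the kernel's differ):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the gup flags and KVM error pfns. */
#define FOLL_WRITE         0x01u
#define FOLL_NOWAIT        0x02u
#define FOLL_INTERRUPTIBLE 0x04u

#define KVM_PFN_ERR_FAULT    ((uint64_t)-1)
#define KVM_PFN_ERR_NEEDS_IO ((uint64_t)-2)

/* How the quoted __gfn_to_pfn_memslot() hunk folds the legacy boolean
 * arguments into a single flags word. */
static unsigned int fold_bools_into_flags(bool write_fault, bool async,
					  bool interruptible)
{
	unsigned int flags = 0;

	if (write_fault)
		flags |= FOLL_WRITE;
	if (async)
		flags |= FOLL_NOWAIT;
	if (interruptible)
		flags |= FOLL_INTERRUPTIBLE;
	return flags;
}

/* Only KVM_PFN_ERR_NEEDS_IO signals that completing the fault
 * asynchronously may help; a plain KVM_PFN_ERR_FAULT (e.g. no vma, or
 * incompatible vma permissions) should not be retried. */
static bool fault_may_complete_async(uint64_t pfn)
{
	return pfn == KVM_PFN_ERR_NEEDS_IO;
}
```

This mirrors David's point in the reply: collapsing NEEDS_IO into FAULT plus a FOLL_NOWAIT check would make the caller retry faults that can never be resolved.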
Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
On Wed, Jul 5, 2023 at 5:47 PM Zhi Wang wrote: > > On Tue, 4 Jul 2023 16:50:47 +0900 > David Stevens wrote: > > > From: David Stevens > > > > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot. > > __kvm_follow_pfn refactors the old API's arguments into a struct and, > > where possible, combines the boolean arguments into a single flags > > argument. > > > > Signed-off-by: David Stevens > > --- > > include/linux/kvm_host.h | 16 > > virt/kvm/kvm_main.c | 171 ++- > > virt/kvm/kvm_mm.h| 3 +- > > virt/kvm/pfncache.c | 8 +- > > 4 files changed, 122 insertions(+), 76 deletions(-) > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > index 9d3ac7720da9..ef2763c2b12e 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -97,6 +97,7 @@ > > #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1) > > #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2) > > #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3) > > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4) > > > > /* > > * error pfns indicate that the gfn is in slot but faild to > > @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct > > kvm_memory_slot *slot, gfn_t gfn, > > void kvm_release_page_clean(struct page *page); > > void kvm_release_page_dirty(struct page *page); > > > > +struct kvm_follow_pfn { > > + const struct kvm_memory_slot *slot; > > + gfn_t gfn; > > + unsigned int flags; > > + bool atomic; > > + /* Allow a read fault to create a writeable mapping. 
*/ > > + bool allow_write_mapping; > > + > > + /* Outputs of __kvm_follow_pfn */ > > + hva_t hva; > > + bool writable; > > +}; > > + > > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll); > > + > > kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn); > > kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault, > > bool *writable); > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > index 371bd783ff2b..b13f22861d2f 100644 > > --- a/virt/kvm/kvm_main.c > > +++ b/virt/kvm/kvm_main.c > > @@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned > > long addr) > > * true indicates success, otherwise false is returned. It's also the > > * only part that runs if we can in atomic context. > > */ > > -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault, > > - bool *writable, kvm_pfn_t *pfn) > > +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > > { > > struct page *page[1]; > > + bool write_fault = foll->flags & FOLL_WRITE; > > > > /* > >* Fast pin a writable pfn only if it is a write fault request > >* or the caller allows to map a writable pfn for a read fault > >* request. > >*/ > > - if (!(write_fault || writable)) > > + if (!(write_fault || foll->allow_write_mapping)) > > return false; > > > > - if (get_user_page_fast_only(addr, FOLL_WRITE, page)) { > > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) { > > *pfn = page_to_pfn(page[0]); > > - > > - if (writable) > > - *writable = true; > > + foll->writable = foll->allow_write_mapping; > > return true; > > } > > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, > > bool write_fault, > > * The slow path to get the pfn of the specified host virtual address, > > * 1 indicates success, -errno is returned if error is detected. 
> > */ > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool > > write_fault, > > -bool interruptible, bool *writable, kvm_pfn_t *pfn) > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn) > > { > > - unsigned int flags = FOLL_HWPOISON; > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags; > > struct page *page; > > int npages; > > > > might_sleep(); > > > > - if (writable) > > - *writable = write_fault; > > - > > - if (write_fault) > > - flags |= FOLL_WRITE; > > - if (async) > > - flags |= FOLL_NOWAIT; > > - if (interruptible) > > - flags |= FOLL_INTERRUPTIBLE; > > - > > - npages = get_user_pages_unlocked(addr, 1, &page, flags); > > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags); > > if (npages != 1) > > return npages; > > > > + foll->writable = (foll->flags & FOLL_WRITE) && > > foll->allow_write_mapping; > > + > > /* map read fault as writable if possible */ > > - if (unlikely(!write_fault) && writable) { > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) { > > struct page *wpage; > > > > - if (get_user_page_fast_only(addr, FOLL_W
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On 05.07.23 09:08, Greg KH wrote: > On Tue, Jul 04, 2023 at 01:22:54PM -0700, Suren Baghdasaryan wrote: >> On Tue, Jul 4, 2023 at 9:18 AM Andrew Morton >> wrote: >>> On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH >>> wrote: Thanks! I'll investigate this later today. After discussing with Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until the issue is fixed. I'll post a patch shortly. >>> >>> Posted at: >>> https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/ >> >> As that change fixes something in 6.4, why not cc: stable on it as well? > > Sorry, I thought since per-VMA locks were introduced in 6.4 and this > patch is fixing 6.4 I didn't need to send it to stable for older > versions. Did I miss something? 6.4.y is a stable kernel tree right now, so yes, it needs to be included there :) >>> >>> I'm in wait-a-few-days-mode on this. To see if we have a backportable >>> fix rather than disabling the feature in -stable. Andrew, how long will you remain in "wait-a-few-days-mode"? Given what Greg said below and that we already had three reports I know of I'd prefer if we could fix this rather sooner than later in mainline -- especially as Arch Linux and openSUSE Tumbleweed likely have switched to 6.4.y already or will do so soon. >> Ok, I think we have a fix posted at [2] and it's cleanly applies to >> 6.4.y stable branch as well. However fork() performance might slightly >> regress, therefore disabling per-VMA locks by default for now seems to >> be preferable even with this fix (see discussion at >> https://lore.kernel.org/all/54cd9ffb-8f4b-003f-c2d6-3b6b0d2cb...@google.com/). >> IOW, both [1] and [2] should be applied to 6.4.y stable. Both apply >> cleanly and I CC'ed stable on [2]. Greg, should I send [1] separately >> to stable@vger? > > We can't do anything for stable until it lands in Linus's tree, so if > you didn't happen to have the stable@ tag in the patch, just email us > the git SHA1 and I can pick it up that way. 
> > thanks, > > greg k-h Ciao, Thorsten
Re: [RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration
On Tue, Jul 4, 2023 at 10:36 PM Yicong Yang wrote: > > On 2023/6/30 1:26, Catalin Marinas wrote: > > On Thu, Jun 29, 2023 at 05:31:36PM +0100, Catalin Marinas wrote: > >> On Thu, May 18, 2023 at 02:59:34PM +0800, Yicong Yang wrote: > >>> From: Barry Song > >>> > >>> on x86, batched and deferred tlb shootdown has lead to 90% > >>> performance increase on tlb shootdown. on arm64, HW can do > >>> tlb shootdown without software IPI. But sync tlbi is still > >>> quite expensive. > >> [...] > >>> .../features/vm/TLB/arch-support.txt | 2 +- > >>> arch/arm64/Kconfig| 1 + > >>> arch/arm64/include/asm/tlbbatch.h | 12 > >>> arch/arm64/include/asm/tlbflush.h | 33 - > >>> arch/arm64/mm/flush.c | 69 +++ > >>> arch/x86/include/asm/tlbflush.h | 5 +- > >>> include/linux/mm_types_task.h | 4 +- > >>> mm/rmap.c | 12 ++-- > >> > >> First of all, this patch needs to be split in some preparatory patches > >> introducing/renaming functions with no functional change for x86. Once > >> done, you can add the arm64-only changes. > >> > > got it. will try to split this patch as suggested. > > >> Now, on the implementation, I had some comments on v7 but we didn't get > >> to a conclusion and the thread eventually died: > >> > >> https://lore.kernel.org/linux-mm/y7ctoj5mwd1zb...@arm.com/ > >> > >> I know I said a command line argument is better than Kconfig or some > >> random number of CPUs heuristics but it would be even better if we don't > >> bother with any, just make this always on. > > ok, will make this always on. > > >> Barry had some comments > >> around mprotect() being racy and that's why we have > >> flush_tlb_batched_pending() but I don't think it's needed (or, for > >> arm64, it can be a DSB since this patch issues the TLBIs but without the > >> DVM Sync). So we need to clarify this (see Barry's last email on the > >> above thread) and before attempting new versions of this patchset. 
With > >> flush_tlb_batched_pending() removed (or DSB), I have a suspicion such > >> implementation would be faster on any SoC irrespective of the number of > >> CPUs. > > > > I think I got the need for flush_tlb_batched_pending(). If > > try_to_unmap() marks the pte !present and we have a pending TLBI, > > change_pte_range() will skip the TLB maintenance altogether since it did > > not change the pte. So we could be left with stale TLB entries after > > mprotect() before TTU does the batch flushing. > > Good catch. This could also be true for MADV_DONTNEED. After try_to_unmap, we run MADV_DONTNEED on this area; as the pte is not present, we don't do anything on this PTE in zap_pte_range afterwards. > > We can have an arch-specific flush_tlb_batched_pending() that can be a > > DSB only on arm64 and a full mm flush on x86. > > > > We need to do a flush/dsb in flush_tlb_batched_pending() only in a race > condition, so we first check whether there's a pending batched flush and > if so do the tlb flush. The pending check is common and the difference > among the archs is how to flush the TLB here within > flush_tlb_batched_pending(); > on arm64 it should only be a dsb. > > As we only need to maintain the TLBs already pending in the batched flush, > does it make sense to only handle those TLBs in flush_tlb_batched_pending()? > Then we can use arch_tlbbatch_flush() rather than flush_tlb_mm() in > flush_tlb_batched_pending() and no arch-specific function is needed. As we have issued no-sync tlbi on those pending addresses, that means our hardware has already "recorded" what should be flushed in the specific mm, so a DSB alone will flush them correctly, right? > > Thanks. > Barry
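The race Catalin describes — a batched unmap clears the PTE and issues a no-sync TLBI, then mprotect()/MADV_DONTNEED sees a non-present PTE, changes nothing, and would skip TLB maintenance, leaving a stale entry — can be captured in a toy model. This is purely illustrative state-machine code, not kernel code: the struct and functions below are hypothetical, and the "flush" stands in for what would be a dsb(ish) on arm64 or a full mm flush on x86.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one mapping plus its cached translation. */
struct mm_model {
	bool pte_present;
	bool tlb_entry_valid;   /* a stale translation is still cached */
	bool tlb_flush_batched; /* a no-sync TLBI has been issued */
};

/* try_to_unmap() path: clear the PTE, broadcast the TLBI, but do not
 * wait for completion -- the stale entry may still be live. */
static void batched_unmap(struct mm_model *mm)
{
	mm->pte_present = false;
	mm->tlb_flush_batched = true;
}

/* The fix under discussion: before relying on PTE state, complete any
 * pending batched invalidations (just a barrier on arm64). */
static void flush_tlb_batched_pending(struct mm_model *mm)
{
	if (mm->tlb_flush_batched) {
		mm->tlb_entry_valid = false; /* dsb(ish) in the real code */
		mm->tlb_flush_batched = false;
	}
}

/* change_pte_range()/zap_pte_range() analogue: a non-present PTE is
 * skipped, so without the call above the stale entry would survive. */
static void mprotect_range(struct mm_model *mm)
{
	flush_tlb_batched_pending(mm);
	if (mm->pte_present) {
		/* would modify the pte and flush here */
	}
}
```

Running the unmap followed by the mprotect path shows the pending flush closing the window that would otherwise leave a stale TLB entry.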
Re: [PATCH 07/12] arch/x86: Declare edid_info in
Hi Arnd On 30.06.23 13:53, Arnd Bergmann wrote: On Fri, Jun 30, 2023, at 09:46, Thomas Zimmermann wrote: On 29.06.23 15:21, Arnd Bergmann wrote: On Thu, Jun 29, 2023, at 15:01, Thomas Zimmermann wrote: On 29.06.23 14:35, Arnd Bergmann wrote: On Thu, Jun 29, 2023, at 13:45, Thomas Zimmermann wrote: FIRMWARE_EDID is a user-selectable feature, while ARCH_HAS_EDID_INFO announces an architecture feature. They do different things. I still have trouble seeing the difference. The idea here is that ARCH_HAS_ signals the architecture's support for the feature. Drivers set 'depends on' in their Kconfig. Another Kconfig token, VIDEO_SCREEN_INFO or FIRMWARE_EDID, would then actually enable the feature. Drivers select VIDEO_SCREEN_INFO or FIRMWARE_EDID and the architectures contain code like Fair enough. In that case, I guess FIRMWARE_EDID will just depend on ARCH_HAS_EDID_INFO, or possibly "depends on FIRMWARE_EDID || EFI" after it starts calling into an EFI specific function, right? #ifdef VIDEO_SCREEN_INFO struct screen_info screen_info = { /* set values here */ } #endif This allows us to disable code that requires screen_info/edid_info, but also disable screen_info/edid_info unless such code has been enabled in the kernel config. Some architectures currently mimic this by guarding screen_info with ifdef CONFIG_VT or similar. I'd like to make this more flexible. The cost of a few more internal Kconfig tokens seems negligible. I definitely get it for the screen_info, which needs the complexity. For ARCH_HAS_EDID_INFO I would hope that it's never selected by anything other than x86, so I would still go with just a dependency on x86 for simplicity, but I don't mind having the extra symbol if that keeps it more consistent with how the screen_info is handled. Well, I'd like to add edid_info to platforms with EFI. That would be arm/arm64 and loongarch, I guess. See below for the future plans.
I suppose you could use FIRMWARE_EDID on EFI or OF systems without the need for a global edid_info structure, but that would not share any code with the current fb_firmware_edid() function. The current code is built on top of screen_info and edid_info. I'd preferably not replace that, if possible. One way I could imagine this looking in the end would be something like struct screen_info *fb_screen_info(struct device *dev) { struct screen_info *si = NULL; if (IS_ENABLED(CONFIG_EFI)) si = efi_get_screen_info(dev); if (IS_ENABLED(CONFIG_ARCH_HAS_SCREEN_INFO) && !si) si = screen_info; return si; } corresponding to fb_firmware_edid(). With this, any driver that wants to access screen_info would call this function instead of using the global pointer, plus either a NULL pointer check or a CONFIG_ARCH_HAS_SCREEN_INFO dependency. This way we could completely eliminate the global screen_info on arm64, riscv, and loongarch but still use the efi and hyperv framebuffer/drm drivers. If possible, I'd like to remove global screen_info and edid_info entirely from fbdev and the various consoles. We currently use screen_info to set up the generic framebuffer device in drivers/firmware/sysfb.c. I'd like to use edid_info here as well, so that the generic graphics drivers can get EDID information. For the few fbdev drivers and consoles that require the global screen_info/edid_info, I'd rather provide lookup functions in sysfb (e.g., sysfb_get_screen_info(), sysfb_get_edid_info()). The global screen_info/edid_info state would then become an internal artifact of the sysfb code. Hopefully that explains some of the decisions made in this patchset. Best regards Thomas Arnd -- Thomas Zimmermann Graphics Driver Developer SUSE Software Solutions Germany GmbH Frankenstrasse 146, 90461 Nuernberg, Germany GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman HRB 36809 (AG Nuernberg)
Re: [PATCH v7 4/8] KVM: x86/mmu: Migrate to __kvm_follow_pfn
On Tue, Jul 04, 2023 at 04:50:49PM +0900, David Stevens wrote: > From: David Stevens > > Migrate from __gfn_to_pfn_memslot to __kvm_follow_pfn. > > Signed-off-by: David Stevens > --- > arch/x86/kvm/mmu/mmu.c | 35 +-- > 1 file changed, 25 insertions(+), 10 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index ec169f5c7dce..e44ab512c3a1 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4296,7 +4296,12 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, > struct kvm_async_pf *work) > static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault > *fault) > { > struct kvm_memory_slot *slot = fault->slot; > - bool async; > + struct kvm_follow_pfn foll = { > + .slot = slot, > + .gfn = fault->gfn, > + .flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0), > + .allow_write_mapping = true, > + }; > > /* >* Retry the page fault if the gfn hit a memslot that is being deleted > @@ -4325,12 +4330,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, > struct kvm_page_fault *fault > return RET_PF_EMULATE; > } > > - async = false; > - fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, > &async, > - fault->write, &fault->map_writable, > - &fault->hva); > - if (!async) > - return RET_PF_CONTINUE; /* *pfn has correct page already */ > + foll.flags |= FOLL_NOWAIT; > + fault->pfn = __kvm_follow_pfn(&foll); > + > + if (!is_error_noslot_pfn(fault->pfn)) > + goto success; > + > + if (fault->pfn != KVM_PFN_ERR_NEEDS_IO) > + return RET_PF_CONTINUE; IIUC, FOLL_NOWAIT is set only when we want an async fault. So KVM_PFN_ERR_NEEDS_IO may not be necessary? B.R. Yu
Re: [PATCH 10/12] fbdev/core: Use fb_is_primary_device() in fb_firmware_edid()
Thomas Zimmermann writes: > Detect the primary VGA device with fb_is_primary_device() in the > implementation of fb_firmware_edid(). Remove the existing code. > An explanation about why this is possible would be useful here. > Adapt the function to receive an instance of struct fb_info and > update all callers. > [...] > -const unsigned char *fb_firmware_edid(struct device *device) > +const unsigned char *fb_firmware_edid(struct fb_info *info) > { > - struct pci_dev *dev = NULL; > - struct resource *res = NULL; > unsigned char *edid = NULL; > > - if (device) > - dev = to_pci_dev(device); > - > - if (dev) > - res = &dev->resource[PCI_ROM_RESOURCE]; > - > - if (res && res->flags & IORESOURCE_ROM_SHADOW) This open codes what used to be the fb_is_primary_device() logic before commit 5ca1479cd35d ("fbdev: Simplify fb_is_primary_device for x86"). But now after that commit there is a functional change, since the ROM shadowing check would be dropped. I believe that's OK, and Sima explains in their commit message that vga_default_device() should be enough and the check is redundant. Still, I think that this change should be documented in your commit message. With that change, Reviewed-by: Javier Martinez Canillas -- Best regards, Javier Martinez Canillas Core Platforms Red Hat
Re: [01/12] efi: Do not include from EFI header
Hi On 05.07.23 03:40, Sui Jingfeng wrote: Hi, Thomas I love your patch. LoongArch also has UEFI GOP support; maybe arch/loongarch/kernel/efi.c doesn't include the '#include ' explicitly. Oh, thank you for testing. I try to build arch/ changes on all of the affected platforms, but I cannot build for all of them. I have no means of building for loongarch ATM. I'll update the patch according to your feedback. Best regards Thomas ``` diff --git a/arch/loongarch/kernel/efi.c b/arch/loongarch/kernel/efi.c index 3d448fef3af4..04f4d217aefb 100644 --- a/arch/loongarch/kernel/efi.c +++ b/arch/loongarch/kernel/efi.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include ``` On 2023/6/29 19:45, Thomas Zimmermann wrote: The header file does not need anything from . Declare struct screen_info and remove the include statements. Update a number of source files that require struct screen_info's definition. Signed-off-by: Thomas Zimmermann Cc: Ard Biesheuvel Cc: Russell King Cc: Catalin Marinas Cc: Will Deacon Reviewed-by: Javier Martinez Canillas With the above issue solved, please take my R-B if you would like.
Reviewed-by: Sui Jingfeng --- arch/arm/kernel/efi.c | 2 ++ arch/arm64/kernel/efi.c | 1 + drivers/firmware/efi/libstub/efi-stub-entry.c | 2 ++ drivers/firmware/efi/libstub/screen_info.c | 2 ++ include/linux/efi.h | 3 ++- 5 files changed, 9 insertions(+), 1 deletion(-) diff --git a/arch/arm/kernel/efi.c b/arch/arm/kernel/efi.c index e2b9d2618c672..e94655ef16bb3 100644 --- a/arch/arm/kernel/efi.c +++ b/arch/arm/kernel/efi.c @@ -5,6 +5,8 @@ #include #include +#include + #include #include #include diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c index baab8dd3ead3c..3afbe503b066f 100644 --- a/arch/arm64/kernel/efi.c +++ b/arch/arm64/kernel/efi.c @@ -9,6 +9,7 @@ #include #include +#include #include #include diff --git a/drivers/firmware/efi/libstub/efi-stub-entry.c b/drivers/firmware/efi/libstub/efi-stub-entry.c index cc4dcaea67fa6..2f1902e5d4075 100644 --- a/drivers/firmware/efi/libstub/efi-stub-entry.c +++ b/drivers/firmware/efi/libstub/efi-stub-entry.c @@ -1,6 +1,8 @@ // SPDX-License-Identifier: GPL-2.0-only #include +#include + #include #include "efistub.h" diff --git a/drivers/firmware/efi/libstub/screen_info.c b/drivers/firmware/efi/libstub/screen_info.c index 4be1c4d1f922b..a51ec201ca3cb 100644 --- a/drivers/firmware/efi/libstub/screen_info.c +++ b/drivers/firmware/efi/libstub/screen_info.c @@ -1,6 +1,8 @@ // SPDX-License-Identifier: GPL-2.0 #include +#include + #include #include "efistub.h" diff --git a/include/linux/efi.h b/include/linux/efi.h index 571d1a6e1b744..360895a5572c0 100644 --- a/include/linux/efi.h +++ b/include/linux/efi.h @@ -24,10 +24,11 @@ #include #include #include -#include #include +struct screen_info; + #define EFI_SUCCESS 0 #define EFI_LOAD_ERROR ( 1 | (1UL << (BITS_PER_LONG-1))) #define EFI_INVALID_PARAMETER ( 2 | (1UL << (BITS_PER_LONG-1))) -- Thomas Zimmermann Graphics Driver Developer SUSE Software Solutions Germany GmbH Frankenstrasse 146, 90461 Nuernberg, Germany GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien 
Moerman HRB 36809 (AG Nuernberg)
Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer
On Monday, 05 June 2023 at 10:07:16 (+0200), Tobias Huschle wrote: > On 2023-05-16 15:36, Vincent Guittot wrote: > > On Mon, 15 May 2023 at 13:46, Tobias Huschle > > wrote: > > > > > > The current load balancer implementation implies that scheduler > > > groups, > > > within the same domain, all host the same number of CPUs. This is > > > reflected in the condition, that a scheduler group, which is load > > > balancing and classified as having spare capacity, should pull work > > > from the busiest group, if the local group runs less processes than > > > the busiest one. This implies that these two groups should run the > > > same number of processes, which is problematic if the groups are not > > > of the same size. > > > > > > The assumption that scheduler groups within the same scheduler domain > > > host the same number of CPUs appears to be true for non-s390 > > > architectures. Nevertheless, s390 can have scheduler groups of unequal > > > size. > > > > > > This introduces a performance degradation in the following scenario: > > > > > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket, > > > the remaining 2 are located on another socket: > > > > > > Socket -1--2- > > > CPU 1 2 3 4 5 67 8 > > > > > > Placing some workload ( x = one task ) yields the following > > > scenarios: > > > > > > The first 5 tasks are distributed evenly across the two groups. > > > > > > Socket -1--2- > > > CPU 1 2 3 4 5 67 8 > > > x x x x x > > > > > > Adding a 6th task yields the following distribution: > > > > > > Socket -1--2- > > > CPU 1 2 3 4 5 67 8 > > > SMT1 x x x x x > > > SMT2x > > > > Your description is a bit confusing for me. What you name CPU above > > should be named Core, doesn't it? > > > > Could you share with us your scheduler topology ? > > > > You are correct, it should say core instead of CPU. > > One actual configuration from one of my test machines (lscpu -e): > [...] > > So, 6 cores / 12 CPUs in one group, 2 cores / 4 CPUs in the other.
Thanks for the details > > If I run stress-ng with 8 cpu stressors on the original code I get a > distribution > like this: > > 00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15 > x x x x x x x x > > Which means that the two cores in the smaller group are running into SMT > while two > cores in the larger group are still idle. This is caused by the > prefer_sibling path > which really wants to see both groups run the same number of tasks. Yes, and it considers that there are the same number of CPUs per group > > > > > > The task is added to the 2nd scheduler group, as the scheduler has the > > > assumption that scheduler groups are of the same size, so they should > > > also host the same number of tasks. This makes CPU 7 run into SMT > > > thread, which comes with a performance penalty. This means, that in > > > the window of 6-8 tasks, load balancing is done suboptimally, because > > > SMT is used although there is no reason to do so as fully idle CPUs > > > are still available. > > > > > > Taking the weight of the scheduler groups into account, ensures that > > > a load balancing CPU within a smaller group will not try to pull tasks > > > from a bigger group while the bigger group still has idle CPUs > > > available. > > > > > > Signed-off-by: Tobias Huschle > > > --- > > > kernel/sched/fair.c | 3 ++- > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index 48b6f0ca13ac..b1307d7e4065 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -10426,7 +10426,8 @@ static struct sched_group > > > *find_busiest_group(struct lb_env *env) > > > * group's child domain.
> > > */ > > > if (sds.prefer_sibling && local->group_type == > > > group_has_spare && > > > - busiest->sum_nr_running > local->sum_nr_running + 1) > > > + busiest->sum_nr_running * local->group_weight > > > > + local->sum_nr_running * > > > busiest->group_weight + 1) What you want to test here is that moving 1 task from busiest to local group would help and balance the ratio of tasks per cpu (busiest->sum_nr_running - 1) / busiest->group_weight > (local->sum_nr_running + 1) / local->group_weight which can be developed into busiest->sum_nr_running * local->group_weight >= local->sum_nr_running * busiest->group_weight + busiest->group_weight + local->group_weight and you also have to change how we calculate the imbalance, which currently just provides half of the diff of nr_running, by something like (busiest->sum_nr_running * local->group_weight) - (local->sum_nr_running * busiest->group_weight) / (busiest->group_weight + local->group_weight) > > > > This is the prefer_sibling path. Could it be that you should disable > > prefer_sibling bet
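The two prefer_sibling conditions quoted in the patch hunk above can be compared directly on the machine from this thread (a 12-CPU group and a 4-CPU group). The sketch below is a toy userspace model, not scheduler code; the struct and task counts are illustrative stand-ins for sg_lb_stats fields.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for the fields used by the quoted condition. */
struct sg_model {
	unsigned int sum_nr_running;
	unsigned int group_weight; /* number of CPUs in the group */
};

/* Upstream condition: pull if busiest runs at least two more tasks,
 * implicitly assuming equal-sized groups. */
static bool should_pull_unweighted(const struct sg_model *busiest,
				   const struct sg_model *local)
{
	return busiest->sum_nr_running > local->sum_nr_running + 1;
}

/* Patched condition: compare tasks-per-CPU ratios, so a small group
 * does not pull from a big group that still has idle CPUs. */
static bool should_pull_weighted(const struct sg_model *busiest,
				 const struct sg_model *local)
{
	return busiest->sum_nr_running * local->group_weight >
	       local->sum_nr_running * busiest->group_weight + 1;
}
```

With 6 tasks on the 12-CPU group and 2 on the 4-CPU group, the unweighted test (6 > 3) tells the small group to pull and run into SMT, while the weighted test (24 > 25) correctly declines.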
Re: [PATCH 08/12] drivers/firmware: Remove trailing whitespaces
Thomas Zimmermann writes: > Fix coding style. No functional changes. > > Signed-off-by: Thomas Zimmermann > --- Reviewed-by: Javier Martinez Canillas -- Best regards, Javier Martinez Canillas Core Platforms Red Hat
Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
> > +void kvm_set_page_accessed(struct page *page); > +void kvm_set_page_dirty(struct page *page); > + No one calls these two routines in this patch. How about moving this change to [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn ? ... > @@ -2930,17 +2933,19 @@ static bool kvm_is_ad_tracked_page(struct page *page) > return !PageReserved(page); > } > > -static void kvm_set_page_dirty(struct page *page) > +void kvm_set_page_dirty(struct page *page) > { > if (kvm_is_ad_tracked_page(page)) > SetPageDirty(page); > } > +EXPORT_SYMBOL_GPL(kvm_set_page_dirty); > > -static void kvm_set_page_accessed(struct page *page) > +void kvm_set_page_accessed(struct page *page) > { > if (kvm_is_ad_tracked_page(page)) > mark_page_accessed(page); > } > +EXPORT_SYMBOL_GPL(kvm_set_page_accessed); Same here. B.R. Yu
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Tue, Jul 04, 2023 at 01:22:54PM -0700, Suren Baghdasaryan wrote: > On Tue, Jul 4, 2023 at 9:18 AM Andrew Morton > wrote: > > > > On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH > > wrote: > > > > > > > > > Thanks! I'll investigate this later today. After discussing with > > > > > > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default > > > > > > > until > > > > > > > the issue is fixed. I'll post a patch shortly. > > > > > > > > > > > > Posted at: > > > > > > https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/ > > > > > > > > > > As that change fixes something in 6.4, why not cc: stable on it as > > > > > well? > > > > > > > > Sorry, I thought since per-VMA locks were introduced in 6.4 and this > > > > patch is fixing 6.4 I didn't need to send it to stable for older > > > > versions. Did I miss something? > > > > > > 6.4.y is a stable kernel tree right now, so yes, it needs to be included > > > there :) > > > > I'm in wait-a-few-days-mode on this. To see if we have a backportable > > fix rather than disabling the feature in -stable. > > Ok, I think we have a fix posted at [2] and it's cleanly applies to > 6.4.y stable branch as well. However fork() performance might slightly > regress, therefore disabling per-VMA locks by default for now seems to > be preferable even with this fix (see discussion at > https://lore.kernel.org/all/54cd9ffb-8f4b-003f-c2d6-3b6b0d2cb...@google.com/). > IOW, both [1] and [2] should be applied to 6.4.y stable. Both apply > cleanly and I CC'ed stable on [2]. Greg, should I send [1] separately > to stable@vger? We can't do anything for stable until it lands in Linus's tree, so if you didn't happen to have the stable@ tag in the patch, just email us the git SHA1 and I can pick it up that way. thanks, greg k-h