Re: [PATCH] powerpc: drop port I/O helpers for CONFIG_HAS_IOPORT=n

2024-04-18 Thread Arnd Bergmann
On Fri, Apr 19, 2024, at 07:12, Michael Ellerman wrote:
> Michael Ellerman  writes:
>> "Arnd Bergmann"  writes:
>>>
>>> I had included this at first, but then I still ran into
>>> the same warnings because it ends up pulling in the
>>> generic outsb() etc from include/asm-generic/io.h
>>> that relies on setting a non-NULL PCI_IOBASE.
>>
>> Yes you're right. The above fixes the gcc build, but not clang.
>>
>> So I think I'll just cherry pick f0a816fb12da ("/dev/port: don't compile
>> file operations without CONFIG_DEVPORT") into my next and then apply
>> this. But will see if there's any other build failures over night.
>
> That didn't work. Still lots of drivers in my tree (based on rc2) which
> use inb/outb etc, and barf on the empty #define inb.

Right, the patches from Niklas only went into linux-next so far,
and a few are missing (including the 8250 one I think), so -rc2
at the moment regresses, but that doesn't have the warning either.

The idea of my patch was to both fix the current linux-next
build regression and have something that works in the long
run, I didn't expect it to work by itself. Sorry that wasn't
clear from my description.

> So I think this patch needs to wait until all the CONFIG_HAS_IOPORT
> checks have been merged for various drivers.
>
> For now the below fixes the clang warning. AFAICS it's safe because any
> code using inb() etc. with CONFIG_PCI=n is currently just doing a plain
> load from virtual address ~zero which should fault anyway.

If the port number is high enough, the current code might end
up referencing a user space address, depending on mmap_min_addr,
which defaults to 4096.

Using POISON_POINTER_DELTA is clearly an improvement over that.

> diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
> index 08c550ed49be..1cd6eb6c8101 100644
> --- a/arch/powerpc/include/asm/io.h
> +++ b/arch/powerpc/include/asm/io.h
> @@ -37,7 +37,7 @@ extern struct pci_dev *isa_bridge_pcidev;
>   * define properly based on the platform
>   */
>  #ifndef CONFIG_PCI
> -#define _IO_BASE   0
> +#define _IO_BASE   POISON_POINTER_DELTA
>  #define _ISA_MEM_BASE  0
>  #define PCI_DRAM_OFFSET 0
>  #elif defined(CONFIG_PPC32)

You may need to double-check, but I think for ppc64 we
can just unconditionally set _IO_BASE to ISA_IO_BASE
regardless of CONFIG_PCI. 3d5134ee8341 ("[POWERPC]
Rewrite IO allocation & mapping on powerpc64") ensured
that the I/O space is only ever mapped to this virtual
address, and the same method is used with the
asm-generic/io.h implementation on arm/arm64/loongarch/
m68k/riscv/xtensa. Using this would be both safer and
more efficient than the current version.
It should also not cause any regressions ;-)

Unfortunately, ppc32 never got that cleanup, so
POISON_POINTER_DELTA is probably still best until Niklas's
series is merged. You could set _ISA_MEM_BASE to the same 
here for good measure.

[another side note: the non-zero PCI_DRAM_OFFSET looks
unnecessary as well now, apparently this was meant for
ibm cpc710 and ppc440 platforms that are no longer
supported.]

 Arnd


Re: [PATCH] powerpc: drop port I/O helpers for CONFIG_HAS_IOPORT=n

2024-04-18 Thread Michael Ellerman
Michael Ellerman  writes:
> "Arnd Bergmann"  writes:
>> On Thu, Apr 18, 2024, at 08:26, Michael Ellerman wrote:
>>> Arnd Bergmann  writes:
>>
>>> @@ -692,6 +692,7 @@ static inline void name at  
>>> \
>>>  #define writesw writesw
>>>  #define writesl writesl
>>>
>>> +#ifdef CONFIG_HAS_IOPORT
>>>  #define inb inb
>>>  #define inw inw
>>>  #define inl inl
>>> @@ -704,6 +705,8 @@ static inline void name at  
>>> \
>>>  #define outsb outsb
>>>  #define outsw outsw
>>>  #define outsl outsl
>>> +#endif // CONFIG_HAS_IOPORT
>>> +
>>>  #ifdef __powerpc64__
>>>  #define readq  readq
>>>  #define writeq writeq
>>
>> I had included this at first, but then I still ran into
>> the same warnings because it ends up pulling in the
>> generic outsb() etc from include/asm-generic/io.h
>> that relies on setting a non-NULL PCI_IOBASE.
>
> Yes you're right. The above fixes the gcc build, but not clang.
>
> So I think I'll just cherry pick f0a816fb12da ("/dev/port: don't compile
> file operations without CONFIG_DEVPORT") into my next and then apply
> this. But will see if there's any other build failures over night.

That didn't work. Still lots of drivers in my tree (based on rc2) which
use inb/outb etc, and barf on the empty #define inb.

So I think this patch needs to wait until all the CONFIG_HAS_IOPORT
checks have been merged for various drivers.

For now the below fixes the clang warning. AFAICS it's safe because any
code using inb() etc. with CONFIG_PCI=n is currently just doing a plain
load from virtual address ~zero which should fault anyway.

cheers


diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index 08c550ed49be..1cd6eb6c8101 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -37,7 +37,7 @@ extern struct pci_dev *isa_bridge_pcidev;
  * define properly based on the platform
  */
 #ifndef CONFIG_PCI
-#define _IO_BASE   0
+#define _IO_BASE   POISON_POINTER_DELTA
 #define _ISA_MEM_BASE  0
 #define PCI_DRAM_OFFSET 0
 #elif defined(CONFIG_PPC32)


[powerpc:next] BUILD SUCCESS f318c8be797f8572629d5386a88cde7d753457a8

2024-04-18 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
next
branch HEAD: f318c8be797f8572629d5386a88cde7d753457a8  powerpc/ptdump: Fix 
walk_vmemmap() to also print first vmemmap entry

elapsed time: 726m

configs tested: 18
configs skipped: 157

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alphaallyesconfig   gcc  
i386 allmodconfig   gcc  
i386  allnoconfig   gcc  
i386 allyesconfig   gcc  
powerpc  allmodconfig   gcc  
powerpc   allnoconfig   gcc  
powerpc  allyesconfig   clang
powerpc   randconfig-001-20240419   gcc  
powerpc   randconfig-002-20240419   gcc  
powerpc   randconfig-003-20240419   gcc  
powerpc64 randconfig-001-20240419   gcc  
powerpc64 randconfig-002-20240419   gcc  
powerpc64 randconfig-003-20240419   clang
riscvallmodconfig   clang
riscvallyesconfig   clang
um   allyesconfig   gcc  
x86_64  defconfig   gcc  
x86_64   rhel-8.3   gcc  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


Re: [PATCH net] MAINTAINERS: eth: mark IBM eHEA as an Orphan

2024-04-18 Thread Michael Ellerman
David Christensen  writes:
> Current maintainer Douglas Miller has left IBM and no replacement has
> been assigned for the driver. The eHEA hardware was last used on
> IBM POWER7 systems, the last of which reached end-of-support at the
> end of 2020.
>
> Signed-off-by: David Christensen 
> Reviewed-by: Pradeep Satyanarayana 
> ---
>  MAINTAINERS | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)

Acked-by: Michael Ellerman  (powerpc)

cheers

> diff --git a/MAINTAINERS b/MAINTAINERS
> index b5b89687680b..bcbbc240e51d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7831,9 +7831,8 @@ W:  http://aeschi.ch.eu.org/efs/
>  F:   fs/efs/
>  
>  EHEA (IBM pSeries eHEA 10Gb ethernet adapter) DRIVER
> -M:   Douglas Miller 
>  L:   net...@vger.kernel.org
> -S:   Maintained
> +S:   Orphan
>  F:   drivers/net/ethernet/ibm/ehea/
>  
>  ELM327 CAN NETWORK DRIVER
> -- 
> 2.39.3


Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-18 Thread Song Liu
On Thu, Apr 18, 2024 at 10:54 AM Mike Rapoport  wrote:
>
> On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote:
> > On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport  wrote:
> > > > >
> > > > > I'm looking at execmem_types more as definition of the consumers, 
> > > > > maybe I
> > > > > should have named the enum execmem_consumer at the first place.
> > > >
> > > > I think looking at execmem_type from consumers' point of view adds
> > > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, 
> > > > kprobe,
> > > > and bpf (and maybe also module text) all have the same requirements.
> > > > Did I miss something?
> > >
> > > It's enough to have one architecture with different constrains for kprobes
> > > and bpf to warrant a type for each.
> >
> > AFAICT, some of these constraints can be changed without too much work.
>
> But why?
> I honestly don't understand what are you trying to optimize here. A few
> lines of initialization in execmem_info?

IIUC, having separate EXECMEM_BPF and EXECMEM_KPROBE makes it
harder for bpf and kprobe to share the same ROX page. In many use cases,
a 2MiB page (assuming x86_64) is enough for all BPF, kprobe, ftrace, and
module text. It is not efficient if we have to allocate separate pages for each
of these use cases. If this is not a problem, the current approach works.

Thanks,
Song


Re: [PATCH v8 2/3] PCI/AER: Disable AER service on suspend

2024-04-18 Thread Bjorn Helgaas
On Tue, Apr 16, 2024 at 12:32:24PM +0800, Kai-Heng Feng wrote:
> When the power rail gets cut off, the hardware can create some electric
> noise on the link that triggers AER. If IRQ is shared between AER with
> PME, such AER noise will cause a spurious wakeup on system suspend.
> 
> When the power rail gets back, the firmware of the device resets itself
> and can create unexpected behavior like sending PTM messages. For this
> case, the driver will always be too late to toggle off features should
> be disabled.
> 
> As Per PCIe Base Spec 5.0, section 5.2, titled "Link State Power
> Management", TLP and DLLP transmission are disabled for a Link in L2/L3
> Ready (D3hot), L2 (D3cold with aux power) and L3 (D3cold) states. So if
> the power will be turned off during suspend process, disable AER service
> and re-enable it during the resume process. This should not affect the
> basic functionality.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=209149
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216295
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218090
> Signed-off-by: Kai-Heng Feng 

Thanks for reviving this series.  I tried follow the history about
this, but there are at least two series that were very similar and I
can't put it all together.

> ---
> v8:
>  - Add more bug reports.
> 
> v7:
>  - Wording
>  - Disable AER completely (again) if power will be turned off
> 
> v6:
> v5:
>  - Wording.
> 
> v4:
> v3:
>  - No change.
> 
> v2:
>  - Only disable AER IRQ.
>  - No more check on PME IRQ#.
>  - Use helper.
> 
>  drivers/pci/pcie/aer.c | 25 +
>  1 file changed, 25 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index ac6293c24976..bea7818c2d1b 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1497,6 +1498,28 @@ static int aer_probe(struct pcie_device *dev)
>   return 0;
>  }
>  
> +static int aer_suspend(struct pcie_device *dev)
> +{
> + struct aer_rpc *rpc = get_service_data(dev);
> + struct pci_dev *pdev = rpc->rpd;
> +
> + if (pci_ancestor_pr3_present(pdev) || pm_suspend_via_firmware())
> + aer_disable_rootport(rpc);

Why do we check pci_ancestor_pr3_present(pdev) and
pm_suspend_via_firmware()?  I'm getting pretty convinced that we need
to disable AER interrupts on suspend in general.  I think it will be
better if we do that consistently on all platforms, not special cases
based on details of how we suspend.

Also, why do we use aer_disable_rootport() instead of just
aer_disable_irq()?  I think it's the interrupt that causes issues on
suspend.  I see that there *were* some versions that used
aer_disable_irq(), but I can't find the reason it changed.

> +
> + return 0;
> +}
> +
> +static int aer_resume(struct pcie_device *dev)
> +{
> + struct aer_rpc *rpc = get_service_data(dev);
> + struct pci_dev *pdev = rpc->rpd;
> +
> + if (pci_ancestor_pr3_present(pdev) || pm_resume_via_firmware())
> + aer_enable_rootport(rpc);
> +
> + return 0;
> +}
> +
>  /**
>   * aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
>   * @dev: pointer to Root Port, RCEC, or RCiEP
> @@ -1561,6 +1584,8 @@ static struct pcie_port_service_driver aerdriver = {
>   .service= PCIE_PORT_SERVICE_AER,
>  
>   .probe  = aer_probe,
> + .suspend= aer_suspend,
> + .resume = aer_resume,
>   .remove = aer_remove,
>  };
>  
> -- 
> 2.34.1
> 


Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-18 Thread Sean Christopherson
On Thu, Apr 18, 2024, Will Deacon wrote:
> On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote:
> > On Sat, Apr 13, 2024, Marc Zyngier wrote:
> > > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson 
> > >  wrote:
> > > > 
> > > > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  
> > > > > wrote:
> > > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > > > Also, if you're in the business of hacking the MMU notifier code, it
> > > > > > would be really great to change the .clear_flush_young() callback so
> > > > > > that the architecture could handle the TLB invalidation. At the 
> > > > > > moment,
> > > > > > the core KVM code invalidates the whole VMID courtesy of 
> > > > > > 'flush_on_ret'
> > > > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > > > lighter-weight and targetted TLBI in the architecture page-table 
> > > > > > code
> > > > > > when we actually update the ptes for small ranges.
> > > > > 
> > > > > Indeed, and I was looking at this earlier this week as it has a pretty
> > > > > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > > > > costly consequences).
> > > > > 
> > > > > In general, it feels like the TLB invalidation should stay with the
> > > > > code that deals with the page tables, as it has a pretty good idea of
> > > > > what needs to be invalidated and how -- specially on architectures
> > > > > that have a HW-broadcast facility like arm64.
> > > > 
> > > > Would this be roughly on par with an in-line flush on arm64?  The 
> > > > simpler, more
> > > > straightforward solution would be to let architectures override 
> > > > flush_on_ret,
> > > > but I would prefer something like the below as x86 can also utilize a 
> > > > range-based
> > > > flush when running as a nested hypervisor.
> > 
> > ...
> > 
> > > I think this works for us on HW that has range invalidation, which
> > > would already be a positive move.
> > > 
> > > For the lesser HW that isn't range capable, it also gives the
> > > opportunity to perform the iteration ourselves or go for the nuclear
> > > option if the range is larger than some arbitrary constant (though
> > > this is additional work).
> > > 
> > > But this still considers the whole range as being affected by
> > > range->handler(). It'd be interesting to try and see whether more
> > > precise tracking is (or isn't) generally beneficial.
> > 
> > I assume the idea would be to let arch code do single-page invalidations of
> > stage-2 entries for each gfn?
> 
> Right, as it's the only code which knows which ptes actually ended up
> being aged.
> 
> > Unless I'm having a brain fart, x86 can't make use of that functionality.  
> > Intel
> > doesn't provide any way to do targeted invalidation of stage-2 mappings.  
> > AMD
> > provides an instruction to do broadcast invalidations, but it takes a 
> > virtual
> > address, i.e. a stage-1 address.  I can't tell if it's a host virtual 
> > address or
> > a guest virtual address, but it's a moot point because KVM doen't have the 
> > guest
> > virtual address, and if it's a host virtual address, there would need to be 
> > valid
> > mappings in the host page tables for it to work, which KVM can't guarantee.
> 
> Ah, so it sounds like it would need to be an arch opt-in then.

Even if x86 (or some other arch code) could use the precise tracking, I think it
would make sense to have the behavior be arch specific.  Adding infrastructure
to get information from arch code, only to turn around and give it back to arch
code would be odd.

Unless arm64 can't do the invalidation immediately after aging the stage-2 PTE,
the best/easiest solution would be to let arm64 opt out of the common TLB flush
when a SPTE is made young.

With the range-based flushing bundled in, this?

---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c  | 40 +---
 2 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index afbc99264ffa..8fe5f5e16919 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2010,6 +2010,8 @@ extern const struct kvm_stats_header 
kvm_vcpu_stats_header;
 extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
 
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
+int kvm_arch_flush_tlb_if_young(void);
+
 static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
 {
if (unlikely(kvm->mmu_invalidate_in_progress))
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 38b498669ef9..5ebef8ef239c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -595,6 +595,11 @@ static void kvm_null_fn(void)
 }
 #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
 
+int __weak kvm_arch_flush_tlb_if_young(void)
+{
+   return true;
+}
+
 /* Iterate over each memslot intersecting [start, last] (inclusive) range */
 #define 

Re: [RFC PATCH 3/7] module: [

2024-04-18 Thread Mike Rapoport
On Thu, Apr 18, 2024 at 10:31:16PM +0300, Nadav Amit wrote:
> 
> > On 18 Apr 2024, at 13:20, Mike Rapoport  wrote:
> > 
> > On Tue, Apr 16, 2024 at 12:36:08PM +0300, Nadav Amit wrote:
> >> 
> >> 
> >> 
> >> I might be missing something, but it seems a bit racy.
> >> 
> >> IIUC, module_finalize() calls alternatives_smp_module_add(). At this
> >> point, since you don’t hold the text_mutex, some might do text_poke(),
> >> e.g., by enabling/disabling static-key, and the update would be
> >> overwritten. No?
> > 
> > Right :(
> > Even worse, for UP case alternatives_smp_unlock() will "patch" still empty
> > area.
> > 
> > So I'm thinking about calling alternatives_smp_module_add() from an
> > additional callback after the execmem_update_copy().
> > 
> > Does it make sense to you?
> 
> Going over the code again - I might have just been wrong: I confused the
> alternatives and the jump-label mechanisms (as they do share a lot of
> code and characteristics).
> 
> The jump-labels are updated when prepare_coming_module() is called, which
> happens after post_relocation() [which means they would be updated using
> text_poke() “inefficiently” but should be safe].
> 
> The “alternatives” appear only to use text_poke() (in contrast for
> text_poke_early()) from very specific few flows, e.g., 
> common_cpu_up() -> alternatives_enable_smp().
> 
> Are those flows pose a problem after boot?

Yes, common_cpu_up is called  on CPU hotplug, so it's possible to have a
race between alternatives_smp_module_add() and common_cpu_up() ->
alternatives_enable_smp().

And in UP case alternatives_smp_module_add() will call
alternatives_smp_unlock() that will patch module text before it is updated.

> Anyhow, sorry for the noise.

On the contrary, I would have missed it.

-- 
Sincerely yours,
Mike.


Re: [RFC PATCH 3/7] module: [

2024-04-18 Thread Nadav Amit



> On 18 Apr 2024, at 13:20, Mike Rapoport  wrote:
> 
> On Tue, Apr 16, 2024 at 12:36:08PM +0300, Nadav Amit wrote:
>> 
>> 
>> 
>> I might be missing something, but it seems a bit racy.
>> 
>> IIUC, module_finalize() calls alternatives_smp_module_add(). At this
>> point, since you don’t hold the text_mutex, some might do text_poke(),
>> e.g., by enabling/disabling static-key, and the update would be
>> overwritten. No?
> 
> Right :(
> Even worse, for UP case alternatives_smp_unlock() will "patch" still empty
> area.
> 
> So I'm thinking about calling alternatives_smp_module_add() from an
> additional callback after the execmem_update_copy().
> 
> Does it make sense to you?

Going over the code again - I might have just been wrong: I confused the
alternatives and the jump-label mechanisms (as they do share a lot of
code and characteristics).

The jump-labels are updated when prepare_coming_module() is called, which
happens after post_relocation() [which means they would be updated using
text_poke() “inefficiently” but should be safe].

The “alternatives” appear only to use text_poke() (in contrast for
text_poke_early()) from very specific few flows, e.g., 
common_cpu_up() -> alternatives_enable_smp().

Are those flows pose a problem after boot?

Anyhow, sorry for the noise.

Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-18 Thread Mike Rapoport
On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote:
> On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport  wrote:
> > > >
> > > > I'm looking at execmem_types more as definition of the consumers, maybe 
> > > > I
> > > > should have named the enum execmem_consumer at the first place.
> > >
> > > I think looking at execmem_type from consumers' point of view adds
> > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
> > > and bpf (and maybe also module text) all have the same requirements.
> > > Did I miss something?
> >
> > It's enough to have one architecture with different constrains for kprobes
> > and bpf to warrant a type for each.
> 
> AFAICT, some of these constraints can be changed without too much work.

But why?
I honestly don't understand what are you trying to optimize here. A few
lines of initialization in execmem_info?
What is the advantage in forcing architectures to have imposed limits on
kprobes or bpf allocations?

> > Where do you see unnecessary complexity?
> >
> > > IOW, we have
> > >
> > > enum execmem_type {
> > > EXECMEM_DEFAULT,
> > > EXECMEM_TEXT,
> > > EXECMEM_KPROBES = EXECMEM_TEXT,
> > > EXECMEM_FTRACE = EXECMEM_TEXT,
> > > EXECMEM_BPF = EXECMEM_TEXT,  /* we may end up without
> > > _KPROBE, _FTRACE, _BPF */
> > > EXECMEM_DATA,  /* rw */
> > > EXECMEM_RO_DATA,
> > > EXECMEM_RO_AFTER_INIT,
> > > EXECMEM_TYPE_MAX,
> > > };
> > >
> > > Does this make sense?
> >
> > How do you suggest to deal with e.g. riscv that has separate address spaces
> > for modules, kprobes and bpf?
> 
> IIUC, modules and bpf use the same address space on riscv

Not exactly, bpf is a subset of modules on riscv.

> while kprobes use vmalloc address.

The whole point of using the entire vmalloc for kprobes is to avoid
pollution of limited modules space.
 
> Thanks,
> Song

-- 
Sincerely yours,
Mike.


Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-18 Thread Song Liu
On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport  wrote:
>
[...]
> >
> > Is +/- 2G enough for all realistic use cases? If so, I guess we don't
> > really need
> > EXECMEM_ANYWHERE below?
> >
> > > >
> > > > * I'm not sure about BPF's requirements; it seems happy doing the same 
> > > > as
> > > >   modules.
> > >
> > > BPF are happy with vmalloc().
> > >
> > > > So if we *must* use a common execmem allocator, what we'd reall want is 
> > > > our own
> > > > types, e.g.
> > > >
> > > >   EXECMEM_ANYWHERE
> > > >   EXECMEM_NOPLT
> > > >   EXECMEM_PREL32
> > > >
> > > > ... and then we use those in arch code to implement module_alloc() and 
> > > > friends.
> > >
> > > I'm looking at execmem_types more as definition of the consumers, maybe I
> > > should have named the enum execmem_consumer at the first place.
> >
> > I think looking at execmem_type from consumers' point of view adds
> > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
> > and bpf (and maybe also module text) all have the same requirements.
> > Did I miss something?
>
> It's enough to have one architecture with different constrains for kprobes
> and bpf to warrant a type for each.
>

AFAICT, some of these constraints can be changed without too much work.

> Where do you see unnecessary complexity?
>
> > IOW, we have
> >
> > enum execmem_type {
> > EXECMEM_DEFAULT,
> > EXECMEM_TEXT,
> > EXECMEM_KPROBES = EXECMEM_TEXT,
> > EXECMEM_FTRACE = EXECMEM_TEXT,
> > EXECMEM_BPF = EXECMEM_TEXT,  /* we may end up without
> > _KPROBE, _FTRACE, _BPF */
> > EXECMEM_DATA,  /* rw */
> > EXECMEM_RO_DATA,
> > EXECMEM_RO_AFTER_INIT,
> > EXECMEM_TYPE_MAX,
> > };
> >
> > Does this make sense?
>
> How do you suggest to deal with e.g. riscv that has separate address spaces
> for modules, kprobes and bpf?

IIUC, modules and bpf use the same address space on riscv, while kprobes use
vmalloc address. I haven't tried this yet, but I think we can let
kprobes use the
same space as modules and bpf, which is:

 |  -4 GB | 7fff |2 GB | modules, BPF

Did I get this right?

Thanks,
Song


Re: [PATCH v4 14/15] kprobes: remove dependency on CONFIG_MODULES

2024-04-18 Thread Mike Rapoport
Hi Masami,

On Thu, Apr 18, 2024 at 06:16:15AM +0900, Masami Hiramatsu wrote:
> Hi Mike,
> 
> On Thu, 11 Apr 2024 19:00:50 +0300
> Mike Rapoport  wrote:
> 
> > From: "Mike Rapoport (IBM)" 
> > 
> > kprobes depended on CONFIG_MODULES because it has to allocate memory for
> > code.
> > 
> > Since code allocations are now implemented with execmem, kprobes can be
> > enabled in non-modular kernels.
> > 
> > Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside
> > modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
> > dependency of CONFIG_KPROBES on CONFIG_MODULES.
> 
> Thanks for this work, but this conflicts with the latest fix in v6.9-rc4.
> Also, can you use IS_ENABLED(CONFIG_MODULES) instead of #ifdefs in
> function body? We have enough dummy functions for that, so it should
> not make a problem.

I'll rebase and will try to reduce ifdefery where possible.

> Thank you,
> 
> > 
> > Signed-off-by: Mike Rapoport (IBM) 
> > ---
> >  arch/Kconfig|  2 +-
> >  kernel/kprobes.c| 43 +
> >  kernel/trace/trace_kprobe.c | 11 ++
> >  3 files changed, 37 insertions(+), 19 deletions(-)
> > 
> 
> -- 
> Masami Hiramatsu

-- 
Sincerely yours,
Mike.


Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-18 Thread Mike Rapoport
On Wed, Apr 17, 2024 at 04:32:49PM -0700, Song Liu wrote:
> On Tue, Apr 16, 2024 at 12:23 AM Mike Rapoport  wrote:
> >
> > On Mon, Apr 15, 2024 at 06:36:39PM +0100, Mark Rutland wrote:
> > > On Mon, Apr 15, 2024 at 09:52:41AM +0200, Peter Zijlstra wrote:
> > > > On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> > > > > +/**
> > > > > + * enum execmem_type - types of executable memory ranges
> > > > > + *
> > > > > + * There are several subsystems that allocate executable memory.
> > > > > + * Architectures define different restrictions on placement,
> > > > > + * permissions, alignment and other parameters for memory that can 
> > > > > be used
> > > > > + * by these subsystems.
> > > > > + * Types in this enum identify subsystems that allocate executable 
> > > > > memory
> > > > > + * and let architectures define parameters for ranges suitable for
> > > > > + * allocations by each subsystem.
> > > > > + *
> > > > > + * @EXECMEM_DEFAULT: default parameters that would be used for types 
> > > > > that
> > > > > + * are not explcitly defined.
> > > > > + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> > > > > + * @EXECMEM_KPROBES: parameters for kprobes
> > > > > + * @EXECMEM_FTRACE: parameters for ftrace
> > > > > + * @EXECMEM_BPF: parameters for BPF
> > > > > + * @EXECMEM_TYPE_MAX:
> > > > > + */
> > > > > +enum execmem_type {
> > > > > + EXECMEM_DEFAULT,
> > > > > + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> > > > > + EXECMEM_KPROBES,
> > > > > + EXECMEM_FTRACE,
> > > > > + EXECMEM_BPF,
> > > > > + EXECMEM_TYPE_MAX,
> > > > > +};
> > > >
> > > > Can we please get a break-down of how all these types are actually
> > > > different from one another?
> > > >
> > > > I'm thinking some platforms have a tiny immediate space (arm64 comes to
> > > > mind) and has less strict placement constraints for some of them?
> > >
> > > Yeah, and really I'd *much* rather deal with that in arch code, as I have 
> > > said
> > > several times.
> > >
> > > For arm64 we have two bsaic restrictions:
> > >
> > > 1) Direct branches can go +/-128M
> > >We can expand this range by having direct branches go to PLTs, at a
> > >performance cost.
> > >
> > > 2) PREL32 relocations can go +/-2G
> > >We cannot expand this further.
> > >
> > > * We don't need to allocate memory for ftrace. We do not use trampolines.
> > >
> > > * Kprobes XOL areas don't care about either of those; we don't place any
> > >   PC-relative instructions in those. Maybe we want to in future.
> > >
> > > * Modules care about both; we'd *prefer* to place them within +/-128M of 
> > > all
> > >   other kernel/module code, but if there's no space we can use PLTs and 
> > > expand
> > >   that to +/-2G. Since modules can refreence other modules, that ends up
> > >   actually being halved, and modules have to fit within some 2G window 
> > > that
> > >   also covers the kernel.
> 
> Is +/- 2G enough for all realistic use cases? If so, I guess we don't
> really need
> EXECMEM_ANYWHERE below?
> 
> > >
> > > * I'm not sure about BPF's requirements; it seems happy doing the same as
> > >   modules.
> >
> > BPF are happy with vmalloc().
> >
> > > So if we *must* use a common execmem allocator, what we'd reall want is 
> > > our own
> > > types, e.g.
> > >
> > >   EXECMEM_ANYWHERE
> > >   EXECMEM_NOPLT
> > >   EXECMEM_PREL32
> > >
> > > ... and then we use those in arch code to implement module_alloc() and 
> > > friends.
> >
> > I'm looking at execmem_types more as definition of the consumers, maybe I
> > should have named the enum execmem_consumer at the first place.
> 
> I think looking at execmem_type from consumers' point of view adds
> unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
> and bpf (and maybe also module text) all have the same requirements.
> Did I miss something?

It's enough to have one architecture with different constrains for kprobes
and bpf to warrant a type for each.

Where do you see unnecessary complexity?
 
> IOW, we have
> 
> enum execmem_type {
> EXECMEM_DEFAULT,
> EXECMEM_TEXT,
> EXECMEM_KPROBES = EXECMEM_TEXT,
> EXECMEM_FTRACE = EXECMEM_TEXT,
> EXECMEM_BPF = EXECMEM_TEXT,  /* we may end up without
> _KPROBE, _FTRACE, _BPF */
> EXECMEM_DATA,  /* rw */
> EXECMEM_RO_DATA,
> EXECMEM_RO_AFTER_INIT,
> EXECMEM_TYPE_MAX,
> };
> 
> Does this make sense?
 
How do you suggest to deal with e.g. riscv that has separate address spaces
for modules, kprobes and bpf?

> Thanks,
> Song

-- 
Sincerely yours,
Mike.


Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-18 Thread Will Deacon
On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote:
> On Sat, Apr 13, 2024, Marc Zyngier wrote:
> > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson  
> > wrote:
> > > 
> > > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  wrote:
> > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > > Also, if you're in the business of hacking the MMU notifier code, it
> > > > > would be really great to change the .clear_flush_young() callback so
> > > > > that the architecture could handle the TLB invalidation. At the 
> > > > > moment,
> > > > > the core KVM code invalidates the whole VMID courtesy of 
> > > > > 'flush_on_ret'
> > > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > > lighter-weight and targetted TLBI in the architecture page-table code
> > > > > when we actually update the ptes for small ranges.
> > > > 
> > > > Indeed, and I was looking at this earlier this week as it has a pretty
> > > > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > > > costly consequences).
> > > > 
> > > > In general, it feels like the TLB invalidation should stay with the
> > > > code that deals with the page tables, as it has a pretty good idea of
> > > > what needs to be invalidated and how -- specially on architectures
> > > > that have a HW-broadcast facility like arm64.
> > > 
> > > Would this be roughly on par with an in-line flush on arm64?  The 
> > > simpler, more
> > > straightforward solution would be to let architectures override 
> > > flush_on_ret,
> > > but I would prefer something like the below as x86 can also utilize a 
> > > range-based
> > > flush when running as a nested hypervisor.
> 
> ...
> 
> > I think this works for us on HW that has range invalidation, which
> > would already be a positive move.
> > 
> > For the lesser HW that isn't range capable, it also gives the
> > opportunity to perform the iteration ourselves or go for the nuclear
> > option if the range is larger than some arbitrary constant (though
> > this is additional work).
> > 
> > But this still considers the whole range as being affected by
> > range->handler(). It'd be interesting to try and see whether more
> > precise tracking is (or isn't) generally beneficial.
> 
> I assume the idea would be to let arch code do single-page invalidations of
> stage-2 entries for each gfn?

Right, as it's the only code which knows which ptes actually ended up
being aged.

> Unless I'm having a brain fart, x86 can't make use of that functionality.  
> Intel
> doesn't provide any way to do targeted invalidation of stage-2 mappings.  AMD
> provides an instruction to do broadcast invalidations, but it takes a virtual
> address, i.e. a stage-1 address.  I can't tell if it's a host virtual address 
> or
> a guest virtual address, but it's a moot point because KVM doen't have the 
> guest
> virtual address, and if it's a host virtual address, there would need to be 
> valid
> mappings in the host page tables for it to work, which KVM can't guarantee.

Ah, so it sounds like it would need to be an arch opt-in then.

Will


Re: [PATCH] powerpc: drop port I/O helpers for CONFIG_HAS_IOPORT=n

2024-04-18 Thread Michael Ellerman
"Arnd Bergmann"  writes:
> On Thu, Apr 18, 2024, at 08:26, Michael Ellerman wrote:
>> Arnd Bergmann  writes:
>
>> @@ -692,6 +692,7 @@ static inline void name at  
>> \
>>  #define writesw writesw
>>  #define writesl writesl
>>
>> +#ifdef CONFIG_HAS_IOPORT
>>  #define inb inb
>>  #define inw inw
>>  #define inl inl
>> @@ -704,6 +705,8 @@ static inline void name at  
>> \
>>  #define outsb outsb
>>  #define outsw outsw
>>  #define outsl outsl
>> +#endif // CONFIG_HAS_IOPORT
>> +
>>  #ifdef __powerpc64__
>>  #define readq  readq
>>  #define writeq writeq
>
> I had included this at first, but then I still ran into
> the same warnings because it ends up pulling in the
> generic outsb() etc from include/asm-generic/io.h
> that relies on setting a non-NULL PCI_IOBASE.

Yes you're right. The above fixes the gcc build, but not clang.

So I think I'll just cherry pick f0a816fb12da ("/dev/port: don't compile
file operations without CONFIG_DEVPORT") into my next and then apply
this. But will see if there's any other build failures over night.

cheers


Re: [RFC PATCH 6/7] execmem: add support for cache of large ROX pages

2024-04-18 Thread Mike Rapoport
On Tue, Apr 16, 2024 at 09:52:34AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 15, 2024 at 08:00:26PM +0300, Mike Rapoport wrote:
> > On Mon, Apr 15, 2024 at 12:47:50PM +0200, Peter Zijlstra wrote:
> > > On Thu, Apr 11, 2024 at 07:05:25PM +0300, Mike Rapoport wrote:
> > > 
> > > > To populate the cache, a writable large page is allocated from vmalloc 
> > > > with
> > > > VM_ALLOW_HUGE_VMAP, filled with invalid instructions and then remapped 
> > > > as
> > > > ROX.
> > > 
> > > > +static void execmem_invalidate(void *ptr, size_t size, bool writable)
> > > > +{
> > > > +   if (execmem_info->invalidate)
> > > > +   execmem_info->invalidate(ptr, size, writable);
> > > > +   else
> > > > +   memset(ptr, 0, size);
> > > > +}
> > > 
> > > +static void execmem_invalidate(void *ptr, size_t size, bool writeable)
> > > +{
> > > +   /* fill memory with INT3 instructions */
> > > +   if (writeable)
> > > +   memset(ptr, 0xcc, size);
> > > +   else
> > > +   text_poke_set(ptr, 0xcc, size);
> > > +}
> > > 
> > > Thing is, 0xcc (aka INT3_INSN_OPCODE) is not an invalid instruction.
> > > It raises #BP not #UD.
> > 
> > Do you mean that _invalidate is a poor name choice or that it's necessary
> > to use an instruction that raises #UD?
> 
> Poor naming, mostly. #BP handler will still scream bloody murder if the
> site is otherwise unclaimed.
> 
> It just isn't an invalid instruction.

Well, execmem_fill_with_insns_screaming_bloody_murder seems too long, how
about execmem_fill_trapping_insns?

-- 
Sincerely yours,
Mike.


Re: [RFC PATCH 3/7] module: prepare to handle ROX allocations for text

2024-04-18 Thread Mike Rapoport
On Tue, Apr 16, 2024 at 12:36:08PM +0300, Nadav Amit wrote:
> 
> 
> > On 11 Apr 2024, at 19:05, Mike Rapoport  wrote:
> > 
> > @@ -2440,7 +2479,24 @@ static int post_relocation(struct module *mod, const 
> > struct load_info *info)
> > add_kallsyms(mod, info);
> > 
> > /* Arch-specific module finalizing. */
> > -   return module_finalize(info->hdr, info->sechdrs, mod);
> > +   ret = module_finalize(info->hdr, info->sechdrs, mod);
> > +   if (ret)
> > +   return ret;
> > +
> > +   for_each_mod_mem_type(type) {
> > +   struct module_memory *mem = >mem[type];
> > +
> > +   if (mem->is_rox) {
> > +   if (!execmem_update_copy(mem->base, mem->rw_copy,
> > +mem->size))
> > +   return -ENOMEM;
> > +
> > +   vfree(mem->rw_copy);
> > +   mem->rw_copy = NULL;
> > +   }
> > +   }
> > +
> > +   return 0;
> > }
> 
> I might be missing something, but it seems a bit racy.
> 
> IIUC, module_finalize() calls alternatives_smp_module_add(). At this
> point, since you don’t hold the text_mutex, some might do text_poke(),
> e.g., by enabling/disabling static-key, and the update would be
> overwritten. No?

Right :(
Even worse, for UP case alternatives_smp_unlock() will "patch" still empty
area.

So I'm thinking about calling alternatives_smp_module_add() from an
additional callback after the execmem_update_copy().

Does it make sense to you?

-- 
Sincerely yours,
Mike.


Re: [PATCH] powerpc: drop port I/O helpers for CONFIG_HAS_IOPORT=n

2024-04-18 Thread Arnd Bergmann
On Thu, Apr 18, 2024, at 08:26, Michael Ellerman wrote:
> Arnd Bergmann  writes:

> @@ -692,6 +692,7 @@ static inline void name at  
> \
>  #define writesw writesw
>  #define writesl writesl
>
> +#ifdef CONFIG_HAS_IOPORT
>  #define inb inb
>  #define inw inw
>  #define inl inl
> @@ -704,6 +705,8 @@ static inline void name at  
> \
>  #define outsb outsb
>  #define outsw outsw
>  #define outsl outsl
> +#endif // CONFIG_HAS_IOPORT
> +
>  #ifdef __powerpc64__
>  #define readq  readq
>  #define writeq writeq

I had included this at first, but then I still ran into
the same warnings because it ends up pulling in the
generic outsb() etc from include/asm-generic/io.h
that relies on setting a non-NULL PCI_IOBASE.

  Arnd


Re: [PATCH] powerpc: drop port I/O helpers for CONFIG_HAS_IOPORT=n

2024-04-18 Thread Michael Ellerman
Michael Ellerman  writes:
> Arnd Bergmann  writes:
>> From: Arnd Bergmann 
>>
>> Calling inb()/outb() on powerpc when CONFIG_PCI is disabled causes
>> a NULL pointer dereference, which is bad for a number of reasons.
>>
>> After my patch to turn on -Werror in linux-next, this caused a
>> compiler-time warning with clang:
>>
>> In file included from arch/powerpc/include/asm/io.h:672:
>> arch/powerpc/include/asm/io-defs.h:43:1: error: performing pointer
>> arithmetic on a null pointer has undefined behavior
>> [-Werror,-Wnull-pointer-arithmetic]
>>43 | DEF_PCI_AC_NORET(insb, (unsigned long p, void *b, unsigned long c),
>>   | ^~~
>>44 |  (p, b, c), pio, p)
>>   |  ~~
>>
>> In this configuration, CONFIG_HAS_IOPORT is already disabled, and all
>> drivers that use inb()/outb() should now depend on that (some patches are
>> still in the process of getting marged).
>>
>> Hide all references to inb()/outb() in the powerpc code and the definitions
>> when HAS_IOPORT is disabled to remove the possible NULL pointer access.
>> The same should happin in asm-generic in the near future, but for now
>> the empty inb() macros are still defined to ensure the generic version
>> does not get pulled in.
>>
>> Signed-off-by: Arnd Bergmann 
>> Reported-by: Naresh Kamboju 
>> --
>
> This needs a small fixup:

Well, only because my tree doesn't have f0a816fb12da ("/dev/port: don't compile 
file operations without CONFIG_DEVPORT").

cheers


Re: [PATCH] powerpc: drop port I/O helpers for CONFIG_HAS_IOPORT=n

2024-04-18 Thread Michael Ellerman
Arnd Bergmann  writes:
> From: Arnd Bergmann 
>
> Calling inb()/outb() on powerpc when CONFIG_PCI is disabled causes
> a NULL pointer dereference, which is bad for a number of reasons.
>
> After my patch to turn on -Werror in linux-next, this caused a
> compiler-time warning with clang:
>
> In file included from arch/powerpc/include/asm/io.h:672:
> arch/powerpc/include/asm/io-defs.h:43:1: error: performing pointer
> arithmetic on a null pointer has undefined behavior
> [-Werror,-Wnull-pointer-arithmetic]
>43 | DEF_PCI_AC_NORET(insb, (unsigned long p, void *b, unsigned long c),
>   | ^~~
>44 |  (p, b, c), pio, p)
>   |  ~~
>
> In this configuration, CONFIG_HAS_IOPORT is already disabled, and all
> drivers that use inb()/outb() should now depend on that (some patches are
> still in the process of getting marged).
>
> Hide all references to inb()/outb() in the powerpc code and the definitions
> when HAS_IOPORT is disabled to remove the possible NULL pointer access.
> The same should happin in asm-generic in the near future, but for now
> the empty inb() macros are still defined to ensure the generic version
> does not get pulled in.
>
> Signed-off-by: Arnd Bergmann 
> Reported-by: Naresh Kamboju 
> --

This needs a small fixup:

diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index 86c212fcbc0c..60c80d0baf40 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -692,6 +692,7 @@ static inline void name at  
\
 #define writesw writesw
 #define writesl writesl

+#ifdef CONFIG_HAS_IOPORT
 #define inb inb
 #define inw inw
 #define inl inl
@@ -704,6 +705,8 @@ static inline void name at  
\
 #define outsb outsb
 #define outsw outsw
 #define outsl outsl
+#endif // CONFIG_HAS_IOPORT
+
 #ifdef __powerpc64__
 #define readq  readq
 #define writeq writeq


I'm running it through some randconfig builds now.

cheers