Re: [PATCH] powerpc: drop port I/O helpers for CONFIG_HAS_IOPORT=n
On Fri, Apr 19, 2024, at 07:12, Michael Ellerman wrote: > Michael Ellerman writes: >> "Arnd Bergmann" writes: >>> >>> I had included this at first, but then I still ran into >>> the same warnings because it ends up pulling in the >>> generic outsb() etc from include/asm-generic/io.h >>> that relies on setting a non-NULL PCI_IOBASE. >> >> Yes you're right. The above fixes the gcc build, but not clang. >> >> So I think I'll just cherry pick f0a816fb12da ("/dev/port: don't compile >> file operations without CONFIG_DEVPORT") into my next and then apply >> this. But will see if there's any other build failures over night. > > That didn't work. Still lots of drivers in my tree (based on rc2) which > use inb/outb etc, and barf on the empty #define inb. Right, the patches from Niklas only went into linux-next so far, and a few are missing (including the 8250 one I think), so -rc2 at the moment regresses, but that doesn't have the warning either. The idea of my patch was to both fix the current linux-next build regression and have something that works in the long run, I didn't expect it to work by itself. Sorry that wasn't clear from my description. > So I think this patch needs to wait until all the CONFIG_HAS_IOPORT > checks have been merged for various drivers. > > For now the below fixes the clang warning. AFAICS it's safe because any > code using inb() etc. with CONFIG_PCI=n is currently just doing a plain > load from virtual address ~zero which should fault anyway. If the port number is high enough, the current code might end up referencing a user space address, depending on mmap_min_addr, which defaults to 4096. Using POISON_POINTER_DELTA is clearly an improvement over that. > diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h > index 08c550ed49be..1cd6eb6c8101 100644 > --- a/arch/powerpc/include/asm/io.h > +++ b/arch/powerpc/include/asm/io.h > @@ -37,7 +37,7 @@ extern struct pci_dev *isa_bridge_pcidev; > * define properly based on the platform > */ > #ifndef CONFIG_PCI > -#define _IO_BASE 0 > +#define _IO_BASE POISON_POINTER_DELTA > #define _ISA_MEM_BASE 0 > #define PCI_DRAM_OFFSET 0 > #elif defined(CONFIG_PPC32) You may need to double-check, but I think for ppc64 we can just unconditionally set _IO_BASE to ISA_IO_BASE regardless of CONFIG_PCI. 3d5134ee8341 ("[POWERPC] Rewrite IO allocation & mapping on powerpc64") ensured that the I/O space is only ever mapped to this virtual address, and the same method is used with the asm-generic/io.h implementation on arm/arm64/loongarch/ m68k/riscv/xtensa. Using this would be both safer and more efficient than the current version. It should also not cause any regressions ;-) Unfortunately, ppc32 never got that cleanup, so POISON_POINTER_DELTA is probably still best until Niklas's series is merged. You could set _ISA_MEM_BASE to the same here for good measure. [another side note: the non-zero PCI_DRAM_OFFSET looks unnecessary as well now, apparently this was meant for ibm cpc710 and ppc440 platforms that are no longer supported.] Arnd
Re: [PATCH] powerpc: drop port I/O helpers for CONFIG_HAS_IOPORT=n
Michael Ellerman writes: > "Arnd Bergmann" writes: >> On Thu, Apr 18, 2024, at 08:26, Michael Ellerman wrote: >>> Arnd Bergmann writes: >> >>> @@ -692,6 +692,7 @@ static inline void name at >>> \ >>> #define writesw writesw >>> #define writesl writesl >>> >>> +#ifdef CONFIG_HAS_IOPORT >>> #define inb inb >>> #define inw inw >>> #define inl inl >>> @@ -704,6 +705,8 @@ static inline void name at >>> \ >>> #define outsb outsb >>> #define outsw outsw >>> #define outsl outsl >>> +#endif // CONFIG_HAS_IOPORT >>> + >>> #ifdef __powerpc64__ >>> #define readq readq >>> #define writeq writeq >> >> I had included this at first, but then I still ran into >> the same warnings because it ends up pulling in the >> generic outsb() etc from include/asm-generic/io.h >> that relies on setting a non-NULL PCI_IOBASE. > > Yes you're right. The above fixes the gcc build, but not clang. > > So I think I'll just cherry pick f0a816fb12da ("/dev/port: don't compile > file operations without CONFIG_DEVPORT") into my next and then apply > this. But will see if there's any other build failures over night. That didn't work. Still lots of drivers in my tree (based on rc2) which use inb/outb etc, and barf on the empty #define inb. So I think this patch needs to wait until all the CONFIG_HAS_IOPORT checks have been merged for various drivers. For now the below fixes the clang warning. AFAICS it's safe because any code using inb() etc. with CONFIG_PCI=n is currently just doing a plain load from virtual address ~zero which should fault anyway. cheers diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h index 08c550ed49be..1cd6eb6c8101 100644 --- a/arch/powerpc/include/asm/io.h +++ b/arch/powerpc/include/asm/io.h @@ -37,7 +37,7 @@ extern struct pci_dev *isa_bridge_pcidev; * define properly based on the platform */ #ifndef CONFIG_PCI -#define _IO_BASE 0 +#define _IO_BASE POISON_POINTER_DELTA #define _ISA_MEM_BASE 0 #define PCI_DRAM_OFFSET 0 #elif defined(CONFIG_PPC32)
[powerpc:next] BUILD SUCCESS f318c8be797f8572629d5386a88cde7d753457a8
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next branch HEAD: f318c8be797f8572629d5386a88cde7d753457a8 powerpc/ptdump: Fix walk_vmemmap() to also print first vmemmap entry elapsed time: 726m configs tested: 18 configs skipped: 157 The following configs have been built successfully. More configs may be tested in the coming days. tested configs: alphaallyesconfig gcc i386 allmodconfig gcc i386 allnoconfig gcc i386 allyesconfig gcc powerpc allmodconfig gcc powerpc allnoconfig gcc powerpc allyesconfig clang powerpc randconfig-001-20240419 gcc powerpc randconfig-002-20240419 gcc powerpc randconfig-003-20240419 gcc powerpc64 randconfig-001-20240419 gcc powerpc64 randconfig-002-20240419 gcc powerpc64 randconfig-003-20240419 clang riscvallmodconfig clang riscvallyesconfig clang um allyesconfig gcc x86_64 defconfig gcc x86_64 rhel-8.3 gcc -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki
Re: [PATCH net] MAINTAINERS: eth: mark IBM eHEA as an Orphan
David Christensen writes: > Current maintainer Douglas Miller has left IBM and no replacement has > been assigned for the driver. The eHEA hardware was last used on > IBM POWER7 systems, the last of which reached end-of-support at the > end of 2020. > > Signed-off-by: David Christensen > Reviewed-by: Pradeep Satyanarayana > --- > MAINTAINERS | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) Acked-by: Michael Ellerman (powerpc) cheers > diff --git a/MAINTAINERS b/MAINTAINERS > index b5b89687680b..bcbbc240e51d 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -7831,9 +7831,8 @@ W: http://aeschi.ch.eu.org/efs/ > F: fs/efs/ > > EHEA (IBM pSeries eHEA 10Gb ethernet adapter) DRIVER > -M: Douglas Miller > L: net...@vger.kernel.org > -S: Maintained > +S: Orphan > F: drivers/net/ethernet/ibm/ehea/ > > ELM327 CAN NETWORK DRIVER > -- > 2.39.3
Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()
On Thu, Apr 18, 2024 at 10:54 AM Mike Rapoport wrote: > > On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote: > > On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport wrote: > > > > > > > > > > I'm looking at execmem_types more as definition of the consumers, > > > > > maybe I > > > > > should have named the enum execmem_consumer at the first place. > > > > > > > > I think looking at execmem_type from consumers' point of view adds > > > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, > > > > kprobe, > > > > and bpf (and maybe also module text) all have the same requirements. > > > > Did I miss something? > > > > > > It's enough to have one architecture with different constrains for kprobes > > > and bpf to warrant a type for each. > > > > AFAICT, some of these constraints can be changed without too much work. > > But why? > I honestly don't understand what are you trying to optimize here. A few > lines of initialization in execmem_info? IIUC, having separate EXECMEM_BPF and EXECMEM_KPROBE makes it harder for bpf and kprobe to share the same ROX page. In many use cases, a 2MiB page (assuming x86_64) is enough for all BPF, kprobe, ftrace, and module text. It is not efficient if we have to allocate separate pages for each of these use cases. If this is not a problem, the current approach works. Thanks, Song
Re: [PATCH v8 2/3] PCI/AER: Disable AER service on suspend
On Tue, Apr 16, 2024 at 12:32:24PM +0800, Kai-Heng Feng wrote: > When the power rail gets cut off, the hardware can create some electric > noise on the link that triggers AER. If IRQ is shared between AER with > PME, such AER noise will cause a spurious wakeup on system suspend. > > When the power rail gets back, the firmware of the device resets itself > and can create unexpected behavior like sending PTM messages. For this > case, the driver will always be too late to toggle off features should > be disabled. > > As Per PCIe Base Spec 5.0, section 5.2, titled "Link State Power > Management", TLP and DLLP transmission are disabled for a Link in L2/L3 > Ready (D3hot), L2 (D3cold with aux power) and L3 (D3cold) states. So if > the power will be turned off during suspend process, disable AER service > and re-enable it during the resume process. This should not affect the > basic functionality. > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=209149 > Link: https://bugzilla.kernel.org/show_bug.cgi?id=216295 > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218090 > Signed-off-by: Kai-Heng Feng Thanks for reviving this series. I tried follow the history about this, but there are at least two series that were very similar and I can't put it all together. > --- > v8: > - Add more bug reports. > > v7: > - Wording > - Disable AER completely (again) if power will be turned off > > v6: > v5: > - Wording. > > v4: > v3: > - No change. > > v2: > - Only disable AER IRQ. > - No more check on PME IRQ#. > - Use helper. > > drivers/pci/pcie/aer.c | 25 + > 1 file changed, 25 insertions(+) > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index ac6293c24976..bea7818c2d1b 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -28,6 +28,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -1497,6 +1498,28 @@ static int aer_probe(struct pcie_device *dev) > return 0; > } > > +static int aer_suspend(struct pcie_device *dev) > +{ > + struct aer_rpc *rpc = get_service_data(dev); > + struct pci_dev *pdev = rpc->rpd; > + > + if (pci_ancestor_pr3_present(pdev) || pm_suspend_via_firmware()) > + aer_disable_rootport(rpc); Why do we check pci_ancestor_pr3_present(pdev) and pm_suspend_via_firmware()? I'm getting pretty convinced that we need to disable AER interrupts on suspend in general. I think it will be better if we do that consistently on all platforms, not special cases based on details of how we suspend. Also, why do we use aer_disable_rootport() instead of just aer_disable_irq()? I think it's the interrupt that causes issues on suspend. I see that there *were* some versions that used aer_disable_irq(), but I can't find the reason it changed. > + > + return 0; > +} > + > +static int aer_resume(struct pcie_device *dev) > +{ > + struct aer_rpc *rpc = get_service_data(dev); > + struct pci_dev *pdev = rpc->rpd; > + > + if (pci_ancestor_pr3_present(pdev) || pm_resume_via_firmware()) > + aer_enable_rootport(rpc); > + > + return 0; > +} > + > /** > * aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP > * @dev: pointer to Root Port, RCEC, or RCiEP > @@ -1561,6 +1584,8 @@ static struct pcie_port_service_driver aerdriver = { > .service= PCIE_PORT_SERVICE_AER, > > .probe = aer_probe, > + .suspend= aer_suspend, > + .resume = aer_resume, > .remove = aer_remove, > }; > > -- > 2.34.1 >
Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
On Thu, Apr 18, 2024, Will Deacon wrote: > On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote: > > On Sat, Apr 13, 2024, Marc Zyngier wrote: > > > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson > > > wrote: > > > > > > > > On Fri, Apr 12, 2024, Marc Zyngier wrote: > > > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon > > > > > wrote: > > > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote: > > > > > > Also, if you're in the business of hacking the MMU notifier code, it > > > > > > would be really great to change the .clear_flush_young() callback so > > > > > > that the architecture could handle the TLB invalidation. At the > > > > > > moment, > > > > > > the core KVM code invalidates the whole VMID courtesy of > > > > > > 'flush_on_ret' > > > > > > being set by kvm_handle_hva_range(), whereas we could do a much > > > > > > lighter-weight and targetted TLBI in the architecture page-table > > > > > > code > > > > > > when we actually update the ptes for small ranges. > > > > > > > > > > Indeed, and I was looking at this earlier this week as it has a pretty > > > > > devastating effect with NV (it blows the shadow S2 for that VMID, with > > > > > costly consequences). > > > > > > > > > > In general, it feels like the TLB invalidation should stay with the > > > > > code that deals with the page tables, as it has a pretty good idea of > > > > > what needs to be invalidated and how -- specially on architectures > > > > > that have a HW-broadcast facility like arm64. > > > > > > > > Would this be roughly on par with an in-line flush on arm64? The > > > > simpler, more > > > > straightforward solution would be to let architectures override > > > > flush_on_ret, > > > > but I would prefer something like the below as x86 can also utilize a > > > > range-based > > > > flush when running as a nested hypervisor. > > > > ... > > > > > I think this works for us on HW that has range invalidation, which > > > would already be a positive move. > > > > > > For the lesser HW that isn't range capable, it also gives the > > > opportunity to perform the iteration ourselves or go for the nuclear > > > option if the range is larger than some arbitrary constant (though > > > this is additional work). > > > > > > But this still considers the whole range as being affected by > > > range->handler(). It'd be interesting to try and see whether more > > > precise tracking is (or isn't) generally beneficial. > > > > I assume the idea would be to let arch code do single-page invalidations of > > stage-2 entries for each gfn? > > Right, as it's the only code which knows which ptes actually ended up > being aged. > > > Unless I'm having a brain fart, x86 can't make use of that functionality. > > Intel > > doesn't provide any way to do targeted invalidation of stage-2 mappings. > > AMD > > provides an instruction to do broadcast invalidations, but it takes a > > virtual > > address, i.e. a stage-1 address. I can't tell if it's a host virtual > > address or > > a guest virtual address, but it's a moot point because KVM doen't have the > > guest > > virtual address, and if it's a host virtual address, there would need to be > > valid > > mappings in the host page tables for it to work, which KVM can't guarantee. > > Ah, so it sounds like it would need to be an arch opt-in then. Even if x86 (or some other arch code) could use the precise tracking, I think it would make sense to have the behavior be arch specific. Adding infrastructure to get information from arch code, only to turn around and give it back to arch code would be odd. Unless arm64 can't do the invalidation immediately after aging the stage-2 PTE, the best/easiest solution would be to let arm64 opt out of the common TLB flush when a SPTE is made young. With the range-based flushing bundled in, this? --- include/linux/kvm_host.h | 2 ++ virt/kvm/kvm_main.c | 40 +--- 2 files changed, 27 insertions(+), 15 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index afbc99264ffa..8fe5f5e16919 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -2010,6 +2010,8 @@ extern const struct kvm_stats_header kvm_vcpu_stats_header; extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[]; #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER +int kvm_arch_flush_tlb_if_young(void); + static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq) { if (unlikely(kvm->mmu_invalidate_in_progress)) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 38b498669ef9..5ebef8ef239c 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -595,6 +595,11 @@ static void kvm_null_fn(void) } #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn) +int __weak kvm_arch_flush_tlb_if_young(void) +{ + return true; +} + /* Iterate over each memslot intersecting [start, last] (inclusive) range */ #define
Re: [RFC PATCH 3/7] module: [
On Thu, Apr 18, 2024 at 10:31:16PM +0300, Nadav Amit wrote: > > > On 18 Apr 2024, at 13:20, Mike Rapoport wrote: > > > > On Tue, Apr 16, 2024 at 12:36:08PM +0300, Nadav Amit wrote: > >> > >> > >> > >> I might be missing something, but it seems a bit racy. > >> > >> IIUC, module_finalize() calls alternatives_smp_module_add(). At this > >> point, since you don’t hold the text_mutex, some might do text_poke(), > >> e.g., by enabling/disabling static-key, and the update would be > >> overwritten. No? > > > > Right :( > > Even worse, for UP case alternatives_smp_unlock() will "patch" still empty > > area. > > > > So I'm thinking about calling alternatives_smp_module_add() from an > > additional callback after the execmem_update_copy(). > > > > Does it make sense to you? > > Going over the code again - I might have just been wrong: I confused the > alternatives and the jump-label mechanisms (as they do share a lot of > code and characteristics). > > The jump-labels are updated when prepare_coming_module() is called, which > happens after post_relocation() [which means they would be updated using > text_poke() “inefficiently” but should be safe]. > > The “alternatives” appear only to use text_poke() (in contrast for > text_poke_early()) from very specific few flows, e.g., > common_cpu_up() -> alternatives_enable_smp(). > > Are those flows pose a problem after boot? Yes, common_cpu_up is called on CPU hotplug, so it's possible to have a race between alternatives_smp_module_add() and common_cpu_up() -> alternatives_enable_smp(). And in UP case alternatives_smp_module_add() will call alternatives_smp_unlock() that will patch module text before it is updated. > Anyhow, sorry for the noise. On the contrary, I would have missed it. -- Sincerely yours, Mike.
Re: [RFC PATCH 3/7] module: [
> On 18 Apr 2024, at 13:20, Mike Rapoport wrote: > > On Tue, Apr 16, 2024 at 12:36:08PM +0300, Nadav Amit wrote: >> >> >> >> I might be missing something, but it seems a bit racy. >> >> IIUC, module_finalize() calls alternatives_smp_module_add(). At this >> point, since you don’t hold the text_mutex, some might do text_poke(), >> e.g., by enabling/disabling static-key, and the update would be >> overwritten. No? > > Right :( > Even worse, for UP case alternatives_smp_unlock() will "patch" still empty > area. > > So I'm thinking about calling alternatives_smp_module_add() from an > additional callback after the execmem_update_copy(). > > Does it make sense to you? Going over the code again - I might have just been wrong: I confused the alternatives and the jump-label mechanisms (as they do share a lot of code and characteristics). The jump-labels are updated when prepare_coming_module() is called, which happens after post_relocation() [which means they would be updated using text_poke() “inefficiently” but should be safe]. The “alternatives” appear only to use text_poke() (in contrast for text_poke_early()) from very specific few flows, e.g., common_cpu_up() -> alternatives_enable_smp(). Are those flows pose a problem after boot? Anyhow, sorry for the noise.
Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()
On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote: > On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport wrote: > > > > > > > > I'm looking at execmem_types more as definition of the consumers, maybe > > > > I > > > > should have named the enum execmem_consumer at the first place. > > > > > > I think looking at execmem_type from consumers' point of view adds > > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe, > > > and bpf (and maybe also module text) all have the same requirements. > > > Did I miss something? > > > > It's enough to have one architecture with different constrains for kprobes > > and bpf to warrant a type for each. > > AFAICT, some of these constraints can be changed without too much work. But why? I honestly don't understand what are you trying to optimize here. A few lines of initialization in execmem_info? What is the advantage in forcing architectures to have imposed limits on kprobes or bpf allocations? > > Where do you see unnecessary complexity? > > > > > IOW, we have > > > > > > enum execmem_type { > > > EXECMEM_DEFAULT, > > > EXECMEM_TEXT, > > > EXECMEM_KPROBES = EXECMEM_TEXT, > > > EXECMEM_FTRACE = EXECMEM_TEXT, > > > EXECMEM_BPF = EXECMEM_TEXT, /* we may end up without > > > _KPROBE, _FTRACE, _BPF */ > > > EXECMEM_DATA, /* rw */ > > > EXECMEM_RO_DATA, > > > EXECMEM_RO_AFTER_INIT, > > > EXECMEM_TYPE_MAX, > > > }; > > > > > > Does this make sense? > > > > How do you suggest to deal with e.g. riscv that has separate address spaces > > for modules, kprobes and bpf? > > IIUC, modules and bpf use the same address space on riscv Not exactly, bpf is a subset of modules on riscv. > while kprobes use vmalloc address. The whole point of using the entire vmalloc for kprobes is to avoid pollution of limited modules space. > Thanks, > Song -- Sincerely yours, Mike.
Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()
On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport wrote: > [...] > > > > Is +/- 2G enough for all realistic use cases? If so, I guess we don't > > really need > > EXECMEM_ANYWHERE below? > > > > > > > > > > * I'm not sure about BPF's requirements; it seems happy doing the same > > > > as > > > > modules. > > > > > > BPF are happy with vmalloc(). > > > > > > > So if we *must* use a common execmem allocator, what we'd reall want is > > > > our own > > > > types, e.g. > > > > > > > > EXECMEM_ANYWHERE > > > > EXECMEM_NOPLT > > > > EXECMEM_PREL32 > > > > > > > > ... and then we use those in arch code to implement module_alloc() and > > > > friends. > > > > > > I'm looking at execmem_types more as definition of the consumers, maybe I > > > should have named the enum execmem_consumer at the first place. > > > > I think looking at execmem_type from consumers' point of view adds > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe, > > and bpf (and maybe also module text) all have the same requirements. > > Did I miss something? > > It's enough to have one architecture with different constrains for kprobes > and bpf to warrant a type for each. > AFAICT, some of these constraints can be changed without too much work. > Where do you see unnecessary complexity? > > > IOW, we have > > > > enum execmem_type { > > EXECMEM_DEFAULT, > > EXECMEM_TEXT, > > EXECMEM_KPROBES = EXECMEM_TEXT, > > EXECMEM_FTRACE = EXECMEM_TEXT, > > EXECMEM_BPF = EXECMEM_TEXT, /* we may end up without > > _KPROBE, _FTRACE, _BPF */ > > EXECMEM_DATA, /* rw */ > > EXECMEM_RO_DATA, > > EXECMEM_RO_AFTER_INIT, > > EXECMEM_TYPE_MAX, > > }; > > > > Does this make sense? > > How do you suggest to deal with e.g. riscv that has separate address spaces > for modules, kprobes and bpf? IIUC, modules and bpf use the same address space on riscv, while kprobes use vmalloc address. I haven't tried this yet, but I think we can let kprobes use the same space as modules and bpf, which is: | -4 GB | 7fff |2 GB | modules, BPF Did I get this right? Thanks, Song
Re: [PATCH v4 14/15] kprobes: remove dependency on CONFIG_MODULES
Hi Masami, On Thu, Apr 18, 2024 at 06:16:15AM +0900, Masami Hiramatsu wrote: > Hi Mike, > > On Thu, 11 Apr 2024 19:00:50 +0300 > Mike Rapoport wrote: > > > From: "Mike Rapoport (IBM)" > > > > kprobes depended on CONFIG_MODULES because it has to allocate memory for > > code. > > > > Since code allocations are now implemented with execmem, kprobes can be > > enabled in non-modular kernels. > > > > Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside > > modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the > > dependency of CONFIG_KPROBES on CONFIG_MODULES. > > Thanks for this work, but this conflicts with the latest fix in v6.9-rc4. > Also, can you use IS_ENABLED(CONFIG_MODULES) instead of #ifdefs in > function body? We have enough dummy functions for that, so it should > not make a problem. I'll rebase and will try to reduce ifdefery where possible. > Thank you, > > > > > Signed-off-by: Mike Rapoport (IBM) > > --- > > arch/Kconfig| 2 +- > > kernel/kprobes.c| 43 + > > kernel/trace/trace_kprobe.c | 11 ++ > > 3 files changed, 37 insertions(+), 19 deletions(-) > > > > -- > Masami Hiramatsu -- Sincerely yours, Mike.
Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()
On Wed, Apr 17, 2024 at 04:32:49PM -0700, Song Liu wrote: > On Tue, Apr 16, 2024 at 12:23 AM Mike Rapoport wrote: > > > > On Mon, Apr 15, 2024 at 06:36:39PM +0100, Mark Rutland wrote: > > > On Mon, Apr 15, 2024 at 09:52:41AM +0200, Peter Zijlstra wrote: > > > > On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote: > > > > > +/** > > > > > + * enum execmem_type - types of executable memory ranges > > > > > + * > > > > > + * There are several subsystems that allocate executable memory. > > > > > + * Architectures define different restrictions on placement, > > > > > + * permissions, alignment and other parameters for memory that can > > > > > be used > > > > > + * by these subsystems. > > > > > + * Types in this enum identify subsystems that allocate executable > > > > > memory > > > > > + * and let architectures define parameters for ranges suitable for > > > > > + * allocations by each subsystem. > > > > > + * > > > > > + * @EXECMEM_DEFAULT: default parameters that would be used for types > > > > > that > > > > > + * are not explcitly defined. > > > > > + * @EXECMEM_MODULE_TEXT: parameters for module text sections > > > > > + * @EXECMEM_KPROBES: parameters for kprobes > > > > > + * @EXECMEM_FTRACE: parameters for ftrace > > > > > + * @EXECMEM_BPF: parameters for BPF > > > > > + * @EXECMEM_TYPE_MAX: > > > > > + */ > > > > > +enum execmem_type { > > > > > + EXECMEM_DEFAULT, > > > > > + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT, > > > > > + EXECMEM_KPROBES, > > > > > + EXECMEM_FTRACE, > > > > > + EXECMEM_BPF, > > > > > + EXECMEM_TYPE_MAX, > > > > > +}; > > > > > > > > Can we please get a break-down of how all these types are actually > > > > different from one another? > > > > > > > > I'm thinking some platforms have a tiny immediate space (arm64 comes to > > > > mind) and has less strict placement constraints for some of them? > > > > > > Yeah, and really I'd *much* rather deal with that in arch code, as I have > > > said > > > several times. > > > > > > For arm64 we have two bsaic restrictions: > > > > > > 1) Direct branches can go +/-128M > > >We can expand this range by having direct branches go to PLTs, at a > > >performance cost. > > > > > > 2) PREL32 relocations can go +/-2G > > >We cannot expand this further. > > > > > > * We don't need to allocate memory for ftrace. We do not use trampolines. > > > > > > * Kprobes XOL areas don't care about either of those; we don't place any > > > PC-relative instructions in those. Maybe we want to in future. > > > > > > * Modules care about both; we'd *prefer* to place them within +/-128M of > > > all > > > other kernel/module code, but if there's no space we can use PLTs and > > > expand > > > that to +/-2G. Since modules can refreence other modules, that ends up > > > actually being halved, and modules have to fit within some 2G window > > > that > > > also covers the kernel. > > Is +/- 2G enough for all realistic use cases? If so, I guess we don't > really need > EXECMEM_ANYWHERE below? > > > > > > > * I'm not sure about BPF's requirements; it seems happy doing the same as > > > modules. > > > > BPF are happy with vmalloc(). > > > > > So if we *must* use a common execmem allocator, what we'd reall want is > > > our own > > > types, e.g. > > > > > > EXECMEM_ANYWHERE > > > EXECMEM_NOPLT > > > EXECMEM_PREL32 > > > > > > ... and then we use those in arch code to implement module_alloc() and > > > friends. > > > > I'm looking at execmem_types more as definition of the consumers, maybe I > > should have named the enum execmem_consumer at the first place. > > I think looking at execmem_type from consumers' point of view adds > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe, > and bpf (and maybe also module text) all have the same requirements. > Did I miss something? It's enough to have one architecture with different constrains for kprobes and bpf to warrant a type for each. Where do you see unnecessary complexity? > IOW, we have > > enum execmem_type { > EXECMEM_DEFAULT, > EXECMEM_TEXT, > EXECMEM_KPROBES = EXECMEM_TEXT, > EXECMEM_FTRACE = EXECMEM_TEXT, > EXECMEM_BPF = EXECMEM_TEXT, /* we may end up without > _KPROBE, _FTRACE, _BPF */ > EXECMEM_DATA, /* rw */ > EXECMEM_RO_DATA, > EXECMEM_RO_AFTER_INIT, > EXECMEM_TYPE_MAX, > }; > > Does this make sense? How do you suggest to deal with e.g. riscv that has separate address spaces for modules, kprobes and bpf? > Thanks, > Song -- Sincerely yours, Mike.
Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote: > On Sat, Apr 13, 2024, Marc Zyngier wrote: > > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson > > wrote: > > > > > > On Fri, Apr 12, 2024, Marc Zyngier wrote: > > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon wrote: > > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote: > > > > > Also, if you're in the business of hacking the MMU notifier code, it > > > > > would be really great to change the .clear_flush_young() callback so > > > > > that the architecture could handle the TLB invalidation. At the > > > > > moment, > > > > > the core KVM code invalidates the whole VMID courtesy of > > > > > 'flush_on_ret' > > > > > being set by kvm_handle_hva_range(), whereas we could do a much > > > > > lighter-weight and targetted TLBI in the architecture page-table code > > > > > when we actually update the ptes for small ranges. > > > > > > > > Indeed, and I was looking at this earlier this week as it has a pretty > > > > devastating effect with NV (it blows the shadow S2 for that VMID, with > > > > costly consequences). > > > > > > > > In general, it feels like the TLB invalidation should stay with the > > > > code that deals with the page tables, as it has a pretty good idea of > > > > what needs to be invalidated and how -- specially on architectures > > > > that have a HW-broadcast facility like arm64. > > > > > > Would this be roughly on par with an in-line flush on arm64? The > > > simpler, more > > > straightforward solution would be to let architectures override > > > flush_on_ret, > > > but I would prefer something like the below as x86 can also utilize a > > > range-based > > > flush when running as a nested hypervisor. > > ... > > > I think this works for us on HW that has range invalidation, which > > would already be a positive move. > > > > For the lesser HW that isn't range capable, it also gives the > > opportunity to perform the iteration ourselves or go for the nuclear > > option if the range is larger than some arbitrary constant (though > > this is additional work). > > > > But this still considers the whole range as being affected by > > range->handler(). It'd be interesting to try and see whether more > > precise tracking is (or isn't) generally beneficial. > > I assume the idea would be to let arch code do single-page invalidations of > stage-2 entries for each gfn? Right, as it's the only code which knows which ptes actually ended up being aged. > Unless I'm having a brain fart, x86 can't make use of that functionality. > Intel > doesn't provide any way to do targeted invalidation of stage-2 mappings. AMD > provides an instruction to do broadcast invalidations, but it takes a virtual > address, i.e. a stage-1 address. I can't tell if it's a host virtual address > or > a guest virtual address, but it's a moot point because KVM doen't have the > guest > virtual address, and if it's a host virtual address, there would need to be > valid > mappings in the host page tables for it to work, which KVM can't guarantee. Ah, so it sounds like it would need to be an arch opt-in then. Will
Re: [PATCH] powerpc: drop port I/O helpers for CONFIG_HAS_IOPORT=n
"Arnd Bergmann" writes: > On Thu, Apr 18, 2024, at 08:26, Michael Ellerman wrote: >> Arnd Bergmann writes: > >> @@ -692,6 +692,7 @@ static inline void name at >> \ >> #define writesw writesw >> #define writesl writesl >> >> +#ifdef CONFIG_HAS_IOPORT >> #define inb inb >> #define inw inw >> #define inl inl >> @@ -704,6 +705,8 @@ static inline void name at >> \ >> #define outsb outsb >> #define outsw outsw >> #define outsl outsl >> +#endif // CONFIG_HAS_IOPORT >> + >> #ifdef __powerpc64__ >> #define readq readq >> #define writeq writeq > > I had included this at first, but then I still ran into > the same warnings because it ends up pulling in the > generic outsb() etc from include/asm-generic/io.h > that relies on setting a non-NULL PCI_IOBASE. Yes you're right. The above fixes the gcc build, but not clang. So I think I'll just cherry pick f0a816fb12da ("/dev/port: don't compile file operations without CONFIG_DEVPORT") into my next and then apply this. But will see if there's any other build failures over night. cheers
Re: [RFC PATCH 6/7] execmem: add support for cache of large ROX pages
On Tue, Apr 16, 2024 at 09:52:34AM +0200, Peter Zijlstra wrote: > On Mon, Apr 15, 2024 at 08:00:26PM +0300, Mike Rapoport wrote: > > On Mon, Apr 15, 2024 at 12:47:50PM +0200, Peter Zijlstra wrote: > > > On Thu, Apr 11, 2024 at 07:05:25PM +0300, Mike Rapoport wrote: > > > > > > > To populate the cache, a writable large page is allocated from vmalloc > > > > with > > > > VM_ALLOW_HUGE_VMAP, filled with invalid instructions and then remapped > > > > as > > > > ROX. > > > > > > > +static void execmem_invalidate(void *ptr, size_t size, bool writable) > > > > +{ > > > > + if (execmem_info->invalidate) > > > > + execmem_info->invalidate(ptr, size, writable); > > > > + else > > > > + memset(ptr, 0, size); > > > > +} > > > > > > +static void execmem_invalidate(void *ptr, size_t size, bool writeable) > > > +{ > > > + /* fill memory with INT3 instructions */ > > > + if (writeable) > > > + memset(ptr, 0xcc, size); > > > + else > > > + text_poke_set(ptr, 0xcc, size); > > > +} > > > > > > Thing is, 0xcc (aka INT3_INSN_OPCODE) is not an invalid instruction. > > > It raises #BP not #UD. > > > > Do you mean that _invalidate is a poor name choice or that it's necessary > > to use an instruction that raises #UD? > > Poor naming, mostly. #BP handler will still scream bloody murder if the > site is otherwise unclaimed. > > It just isn't an invalid instruction. Well, execmem_fill_with_insns_screaming_bloody_murder seems too long, how about execmem_fill_trapping_insns? -- Sincerely yours, Mike.
Re: [RFC PATCH 3/7] module: prepare to handle ROX allocations for text
On Tue, Apr 16, 2024 at 12:36:08PM +0300, Nadav Amit wrote: > > > > On 11 Apr 2024, at 19:05, Mike Rapoport wrote: > > > > @@ -2440,7 +2479,24 @@ static int post_relocation(struct module *mod, const > > struct load_info *info) > > add_kallsyms(mod, info); > > > > /* Arch-specific module finalizing. */ > > - return module_finalize(info->hdr, info->sechdrs, mod); > > + ret = module_finalize(info->hdr, info->sechdrs, mod); > > + if (ret) > > + return ret; > > + > > + for_each_mod_mem_type(type) { > > + struct module_memory *mem = >mem[type]; > > + > > + if (mem->is_rox) { > > + if (!execmem_update_copy(mem->base, mem->rw_copy, > > +mem->size)) > > + return -ENOMEM; > > + > > + vfree(mem->rw_copy); > > + mem->rw_copy = NULL; > > + } > > + } > > + > > + return 0; > > } > > I might be missing something, but it seems a bit racy. > > IIUC, module_finalize() calls alternatives_smp_module_add(). At this > point, since you don’t hold the text_mutex, some might do text_poke(), > e.g., by enabling/disabling static-key, and the update would be > overwritten. No? Right :( Even worse, for UP case alternatives_smp_unlock() will "patch" still empty area. So I'm thinking about calling alternatives_smp_module_add() from an additional callback after the execmem_update_copy(). Does it make sense to you? -- Sincerely yours, Mike.
Re: [PATCH] powerpc: drop port I/O helpers for CONFIG_HAS_IOPORT=n
On Thu, Apr 18, 2024, at 08:26, Michael Ellerman wrote: > Arnd Bergmann writes: > @@ -692,6 +692,7 @@ static inline void name at > \ > #define writesw writesw > #define writesl writesl > > +#ifdef CONFIG_HAS_IOPORT > #define inb inb > #define inw inw > #define inl inl > @@ -704,6 +705,8 @@ static inline void name at > \ > #define outsb outsb > #define outsw outsw > #define outsl outsl > +#endif // CONFIG_HAS_IOPORT > + > #ifdef __powerpc64__ > #define readq readq > #define writeq writeq I had included this at first, but then I still ran into the same warnings because it ends up pulling in the generic outsb() etc from include/asm-generic/io.h that relies on setting a non-NULL PCI_IOBASE. Arnd
Re: [PATCH] powerpc: drop port I/O helpers for CONFIG_HAS_IOPORT=n
Michael Ellerman writes: > Arnd Bergmann writes: >> From: Arnd Bergmann >> >> Calling inb()/outb() on powerpc when CONFIG_PCI is disabled causes >> a NULL pointer dereference, which is bad for a number of reasons. >> >> After my patch to turn on -Werror in linux-next, this caused a >> compiler-time warning with clang: >> >> In file included from arch/powerpc/include/asm/io.h:672: >> arch/powerpc/include/asm/io-defs.h:43:1: error: performing pointer >> arithmetic on a null pointer has undefined behavior >> [-Werror,-Wnull-pointer-arithmetic] >>43 | DEF_PCI_AC_NORET(insb, (unsigned long p, void *b, unsigned long c), >> | ^~~ >>44 | (p, b, c), pio, p) >> | ~~ >> >> In this configuration, CONFIG_HAS_IOPORT is already disabled, and all >> drivers that use inb()/outb() should now depend on that (some patches are >> still in the process of getting marged). >> >> Hide all references to inb()/outb() in the powerpc code and the definitions >> when HAS_IOPORT is disabled to remove the possible NULL pointer access. >> The same should happin in asm-generic in the near future, but for now >> the empty inb() macros are still defined to ensure the generic version >> does not get pulled in. >> >> Signed-off-by: Arnd Bergmann >> Reported-by: Naresh Kamboju >> -- > > This needs a small fixup: Well, only because my tree doesn't have f0a816fb12da ("/dev/port: don't compile file operations without CONFIG_DEVPORT"). cheers
Re: [PATCH] powerpc: drop port I/O helpers for CONFIG_HAS_IOPORT=n
Arnd Bergmann writes: > From: Arnd Bergmann > > Calling inb()/outb() on powerpc when CONFIG_PCI is disabled causes > a NULL pointer dereference, which is bad for a number of reasons. > > After my patch to turn on -Werror in linux-next, this caused a > compiler-time warning with clang: > > In file included from arch/powerpc/include/asm/io.h:672: > arch/powerpc/include/asm/io-defs.h:43:1: error: performing pointer > arithmetic on a null pointer has undefined behavior > [-Werror,-Wnull-pointer-arithmetic] >43 | DEF_PCI_AC_NORET(insb, (unsigned long p, void *b, unsigned long c), > | ^~~ >44 | (p, b, c), pio, p) > | ~~ > > In this configuration, CONFIG_HAS_IOPORT is already disabled, and all > drivers that use inb()/outb() should now depend on that (some patches are > still in the process of getting marged). > > Hide all references to inb()/outb() in the powerpc code and the definitions > when HAS_IOPORT is disabled to remove the possible NULL pointer access. > The same should happin in asm-generic in the near future, but for now > the empty inb() macros are still defined to ensure the generic version > does not get pulled in. > > Signed-off-by: Arnd Bergmann > Reported-by: Naresh Kamboju > -- This needs a small fixup: diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h index 86c212fcbc0c..60c80d0baf40 100644 --- a/arch/powerpc/include/asm/io.h +++ b/arch/powerpc/include/asm/io.h @@ -692,6 +692,7 @@ static inline void name at \ #define writesw writesw #define writesl writesl +#ifdef CONFIG_HAS_IOPORT #define inb inb #define inw inw #define inl inl @@ -704,6 +705,8 @@ static inline void name at \ #define outsb outsb #define outsw outsw #define outsl outsl +#endif // CONFIG_HAS_IOPORT + #ifdef __powerpc64__ #define readq readq #define writeq writeq I'm running it through some randconfig builds now. cheers