Re: [PATCH v4 4/9] powerpc/vas: Return paste instruction failure if no active window

2022-02-22 Thread Haren Myneni
On Wed, 2022-02-23 at 17:05 +1000, Nicholas Piggin wrote:
> Excerpts from Haren Myneni's message of February 20, 2022 5:58 am:
> > The VAS window may not be active if the system loses credits and
> > the NX generates a page fault when it receives a request on the
> > unmapped paste address.
> > 
> > The kernel handles the fault by remapping the new paste address if
> > the window is active again. Otherwise it returns the paste
> > instruction failure if the executed instruction that caused the
> > fault was a paste.
> 
> Looks good, thanks for fixing the SIGBUS thing, was that my
> fault? I vaguely remember writing some of this patch :P

Thanks for your reviews on all patches. 

No, it was my fault not handling the -EAGAIN error.

> 
> Thanks,
> Nick
> 
> > Signed-off-by: Nicholas Piggin 
> > Signed-off-by: Haren Myneni 
> > ---
> >  arch/powerpc/include/asm/ppc-opcode.h   |  2 +
> >  arch/powerpc/platforms/book3s/vas-api.c | 55 +-
> >  2 files changed, 56 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/include/asm/ppc-opcode.h
> > b/arch/powerpc/include/asm/ppc-opcode.h
> > index 9675303b724e..82f1f0041c6f 100644
> > --- a/arch/powerpc/include/asm/ppc-opcode.h
> > +++ b/arch/powerpc/include/asm/ppc-opcode.h
> > @@ -262,6 +262,8 @@
> >  #define PPC_INST_MFSPR_PVR 0x7c1f42a6
> >  #define PPC_INST_MFSPR_PVR_MASK    0xfc1ffffe
> >  #define PPC_INST_MTMSRD    0x7c000164
> > +#define PPC_INST_PASTE 0x7c20070d
> > +#define PPC_INST_PASTE_MASK    0xfc2007ff
> >  #define PPC_INST_POPCNTB   0x7c0000f4
> >  #define PPC_INST_POPCNTB_MASK  0xfc0007fe
> >  #define PPC_INST_RFEBB 0x4c000124
> > diff --git a/arch/powerpc/platforms/book3s/vas-api.c
> > b/arch/powerpc/platforms/book3s/vas-api.c
> > index f359e7b2bf90..f3e421511ea6 100644
> > --- a/arch/powerpc/platforms/book3s/vas-api.c
> > +++ b/arch/powerpc/platforms/book3s/vas-api.c
> > @@ -351,6 +351,41 @@ static int coproc_release(struct inode *inode,
> > struct file *fp)
> > return 0;
> >  }
> >  
> > +/*
> > + * If the executed instruction that caused the fault was a paste,
> > then
> > + * clear regs CR0[EQ], advance NIP, and return 0. Else return
> > error code.
> > + */
> > +static int do_fail_paste(void)
> > +{
> > +   struct pt_regs *regs = current->thread.regs;
> > +   u32 instword;
> > +
> > +   if (WARN_ON_ONCE(!regs))
> > +   return -EINVAL;
> > +
> > +   if (WARN_ON_ONCE(!user_mode(regs)))
> > +   return -EINVAL;
> > +
> > +   /*
> > +* If we couldn't translate the instruction, the driver should
> > +* return success without handling the fault, it will be
> > retried
> > +* or the instruction fetch will fault.
> > +*/
> > +   if (get_user(instword, (u32 __user *)(regs->nip)))
> > +   return -EAGAIN;
> > +
> > +   /*
> > +* Not a paste instruction, driver may fail the fault.
> > +*/
> > +   if ((instword & PPC_INST_PASTE_MASK) != PPC_INST_PASTE)
> > +   return -ENOENT;
> > +
> > +   regs->ccr &= ~0xe0000000;   /* Clear CR0[0-2] to fail paste */
> > +   regs_add_return_ip(regs, 4);/* Emulate the paste */
> > +
> > +   return 0;
> > +}
> > +
> >  /*
> >   * This fault handler is invoked when the core generates page
> > fault on
> >   * the paste address. Happens if the kernel closes window in
> > hypervisor
> > @@ -408,9 +443,27 @@ static vm_fault_t vas_mmap_fault(struct
> > vm_fault *vmf)
> > }
> > mutex_unlock(&txwin->task_ref.mmap_mutex);
> >  
> > -   return VM_FAULT_SIGBUS;
> > +   /*
> > +* Received this fault due to closing the actual window.
> > +* It can happen during migration or lost credits.
> > +* Since no mapping, return the paste instruction failure
> > +* to the user space.
> > +*/
> > +   ret = do_fail_paste();
> > +   /*
> > +* The user space can retry several times until success (needed
> > +* for migration) or should fallback to SW compression or
> > +* manage with the existing open windows if available.
> > +* Looking at sysfs interface, it can determine whether these
> > +* failures are coming during migration or core removal:
> > +* nr_used_credits > nr_total_credits when lost credits
> > +*/
> > +   if (!ret || (ret == -EAGAIN))
> > +   return VM_FAULT_NOPAGE;
> >  
> > +   return VM_FAULT_SIGBUS;
> >  }
> > +
> >  static const struct vm_operations_struct vas_vm_ops = {
> > .fault = vas_mmap_fault,
> >  };
> > -- 
> > 2.27.0
> > 
> > 
> > 
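
[Not from the patch -- an illustrative sketch only.] With this change a
paste on a closed window fails (CR0[EQ] is cleared by do_fail_paste())
instead of the process taking SIGBUS, so user space can retry or fall
back roughly as below. vas_paste_crb(), sw_compress() and the retry
bound are assumed names/values, not part of the kernel API or of any
library.

#include <unistd.h>

#define MAX_RETRIES	10			/* arbitrary bound for the sketch */

extern int vas_paste_crb(void *paste_addr, void *crb);	/* assumed helper: non-zero if CR0[EQ] was set */
extern int sw_compress(void *crb);			/* assumed software fallback */

static int nx_submit(void *paste_addr, void *crb)
{
	int i;

	for (i = 0; i < MAX_RETRIES; i++) {
		if (vas_paste_crb(paste_addr, crb))
			return 0;		/* request accepted by NX */
		usleep(1000);			/* window may reopen after DLPAR add or migration */
	}

	return sw_compress(crb);		/* credits did not come back: fall back to software */
}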



Re: [PATCH 00/16] Remove usage of the deprecated "pci-dma-compat.h" API

2022-02-22 Thread Christoph Hellwig
Hi Christophe,

do you know what the state is in current linux-next?

I think we'll just want to queue up anything left at this point in the
dma-mapping or PCI tree and get it done.


Re: [PATCH v2 09/18] mips: use simpler access_ok()

2022-02-22 Thread Thomas Bogendoerfer
On Wed, Feb 16, 2022 at 02:13:23PM +0100, Arnd Bergmann wrote:
> diff --git a/arch/mips/include/asm/uaccess.h b/arch/mips/include/asm/uaccess.h
> index db9a8e002b62..d7c89dc3426c 100644
> --- a/arch/mips/include/asm/uaccess.h
> +++ b/arch/mips/include/asm/uaccess.h
> @@ -19,6 +19,7 @@
>  #ifdef CONFIG_32BIT
>  
>  #define __UA_LIMIT 0x8000UL
> +#define TASK_SIZE_MAX  __UA_LIMIT

using KSEG0 instead would IMHO be the better choice. This gives the
chance to remove __UA_LIMIT completely after cleaning up ptrace.c.
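
A minimal sketch of that alternative (untested, purely illustrative):

#ifdef CONFIG_32BIT
/* on 32-bit MIPS, user addresses end where KSEG0 begins */
#define TASK_SIZE_MAX	KSEG0
#endif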

Thomas.

-- 
Crap can work. Given enough thrust pigs will fly, but it's not necessarily a
good idea.[ RFC1925, 2.3 ]


Re: [PATCH v4 9/9] powerpc/pseries/vas: Write 'nr_total_credits' for QoS credits change

2022-02-22 Thread Nicholas Piggin
Excerpts from Haren Myneni's message of February 20, 2022 6:03 am:
> 
> pseries supports two types of credits - Default (uses normal priority
> FIFO) and Quality of service (QoS uses high priority FIFO). The user
> decides the number of QoS credits and sets this value with HMC
> interface. With the core add/removal, this value can be changed in HMC
> which invokes drmgr to communicate to the kernel.
> 
> This patch adds an interface so that drmgr command can write the new
> target QoS credits in sysfs. But the kernel gets the new QoS
> capabilities from the hypervisor whenever nr_total_credits is updated
> to make sure it stays in sync with the values in the hypervisor.
> 
> Signed-off-by: Haren Myneni 
> ---
>  arch/powerpc/platforms/pseries/vas-sysfs.c | 33 +-
>  arch/powerpc/platforms/pseries/vas.c   |  2 +-
>  arch/powerpc/platforms/pseries/vas.h   |  1 +
>  3 files changed, 34 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/vas-sysfs.c 
> b/arch/powerpc/platforms/pseries/vas-sysfs.c
> index e24d3edb3021..20745cd75f27 100644
> --- a/arch/powerpc/platforms/pseries/vas-sysfs.c
> +++ b/arch/powerpc/platforms/pseries/vas-sysfs.c
> @@ -25,6 +25,33 @@ struct vas_caps_entry {
>  
>  #define to_caps_entry(entry) container_of(entry, struct vas_caps_entry, kobj)
>  
> +/*
> + * This function is used to get the notification from the drmgr when
> + * QoS credits are changed. Though receiving the target total QoS
> + * credits here, get the official QoS capabilities from the hypervisor.
> + */
> +static ssize_t nr_total_credits_store(struct vas_cop_feat_caps *caps,
> +const char *buf, size_t count)
> +{
> + int err;
> + u16 creds;
> +
> + /*
> +  * Nothing to do for default credit type.
> +  */
> + if (caps->win_type == VAS_GZIP_DEF_FEAT_TYPE)
> + return -EOPNOTSUPP;
> +
> + err = kstrtou16(buf, 0, &creds);
> + if (!err)
> + err = vas_reconfig_capabilties(caps->win_type);

So what's happening here? The creds value is ignored? Can it just
be a write-only file which is named appropriately to indicate it
can be written-to to trigger an update?
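
[Illustrative sketch only, not from the patch.] A write-only trigger
along those lines could look like this; the attribute name is invented,
the helpers are the ones already used in this patch:

/*
 * Ignore the written value entirely; writing anything just makes the
 * kernel re-read the QoS capabilities from the hypervisor.
 */
static ssize_t update_total_credits_store(struct vas_cop_feat_caps *caps,
					  const char *buf, size_t count)
{
	int err;

	if (caps->win_type == VAS_GZIP_DEF_FEAT_TYPE)
		return -EOPNOTSUPP;

	err = vas_reconfig_capabilties(caps->win_type);

	return err ? err : count;
}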

Thanks,
Nick


Re: [PATCH v4 8/9] powerpc/pseries/vas: sysfs interface to export capabilities

2022-02-22 Thread Nicholas Piggin
Excerpts from Haren Myneni's message of February 20, 2022 6:01 am:
> 
> The hypervisor provides the available VAS GZIP capabilities such
> as default or QoS window type and the target available credits in
> each type. This patch creates sysfs entries and exports the target,
> used and the available credits for each feature.
> 
> This interface can be used by the user space to determine the credits
> usage or to set the target credits in the case of QoS type (for DLPAR).
> 
> /sys/devices/vas/vas0/gzip/default_capabilities (default GZIP capabilities)
>   nr_total_credits /* Total credits available. Can be
>    * changed with DLPAR operation */
>   nr_used_credits  /* Used credits */
> 
> /sys/devices/vas/vas0/gzip/qos_capabilities (QoS GZIP capabilities)
>   nr_total_credits
>   nr_used_credits
> 
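
[Illustrative sketch only.] The intended user-space consumption might
look roughly like this; the sysfs paths are the ones listed above, and
the read_u64() parsing helper is an assumption of the sketch:

#define VAS_DEF_CAPS "/sys/devices/vas/vas0/gzip/default_capabilities/"

extern long read_u64(const char *path);	/* assumed: parse one integer from a sysfs file */

/*
 * nr_used_credits > nr_total_credits means credits were taken away
 * (e.g. DLPAR core removal), so paste failures are not just transient.
 */
static int lost_credits(void)
{
	long total = read_u64(VAS_DEF_CAPS "nr_total_credits");
	long used  = read_u64(VAS_DEF_CAPS "nr_used_credits");

	return used > total;
}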

Looks good, thanks


Reviewed-by: Nicholas Piggin 

> Signed-off-by: Haren Myneni 
> ---
>  arch/powerpc/platforms/pseries/Makefile|   2 +-
>  arch/powerpc/platforms/pseries/vas-sysfs.c | 226 +
>  arch/powerpc/platforms/pseries/vas.c   |   6 +
>  arch/powerpc/platforms/pseries/vas.h   |   6 +
>  4 files changed, 239 insertions(+), 1 deletion(-)
>  create mode 100644 arch/powerpc/platforms/pseries/vas-sysfs.c
> 
> diff --git a/arch/powerpc/platforms/pseries/Makefile 
> b/arch/powerpc/platforms/pseries/Makefile
> index ee60b59024b4..29b522d2c755 100644
> --- a/arch/powerpc/platforms/pseries/Makefile
> +++ b/arch/powerpc/platforms/pseries/Makefile
> @@ -29,6 +29,6 @@ obj-$(CONFIG_PPC_SVM)   += svm.o
>  obj-$(CONFIG_FA_DUMP)+= rtas-fadump.o
>  
>  obj-$(CONFIG_SUSPEND)+= suspend.o
> -obj-$(CONFIG_PPC_VAS)+= vas.o
> +obj-$(CONFIG_PPC_VAS)+= vas.o vas-sysfs.o
>  
>  obj-$(CONFIG_ARCH_HAS_CC_PLATFORM)   += cc_platform.o
> diff --git a/arch/powerpc/platforms/pseries/vas-sysfs.c 
> b/arch/powerpc/platforms/pseries/vas-sysfs.c
> new file mode 100644
> index ..e24d3edb3021
> --- /dev/null
> +++ b/arch/powerpc/platforms/pseries/vas-sysfs.c
> @@ -0,0 +1,226 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright 2022-23 IBM Corp.
> + */
> +
> +#define pr_fmt(fmt) "vas: " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "vas.h"
> +
> +#ifdef CONFIG_SYSFS
> +static struct kobject *pseries_vas_kobj;
> +static struct kobject *gzip_caps_kobj;
> +
> +struct vas_caps_entry {
> + struct kobject kobj;
> + struct vas_cop_feat_caps *caps;
> +};
> +
> +#define to_caps_entry(entry) container_of(entry, struct vas_caps_entry, kobj)
> +
> +#define sysfs_caps_entry_read(_name) \
> +static ssize_t _name##_show(struct vas_cop_feat_caps *caps, char *buf) \
> +{ \
> + return sprintf(buf, "%d\n", atomic_read(&caps->_name)); \
> +}
> +
> +struct vas_sysfs_entry {
> + struct attribute attr;
> + ssize_t (*show)(struct vas_cop_feat_caps *, char *);
> + ssize_t (*store)(struct vas_cop_feat_caps *, const char *, size_t);
> +};
> +
> +#define VAS_ATTR_RO(_name)   \
> + sysfs_caps_entry_read(_name);   \
> + static struct vas_sysfs_entry _name##_attribute = __ATTR(_name, \
> + 0444, _name##_show, NULL);
> +
> +/*
> + * Create sysfs interface:
> + * /sys/devices/vas/vas0/gzip/default_capabilities
> + *   This directory contains the following VAS GZIP capabilities
> + *   for the default credit type.
> + * /sys/devices/vas/vas0/gzip/default_capabilities/nr_total_credits
> + *   Total number of default credits assigned to the LPAR which
> + *   can be changed with DLPAR operation.
> + * /sys/devices/vas/vas0/gzip/default_capabilities/nr_used_credits
> + *   Number of credits used by the user space. One credit will
> + *   be assigned for each window open.
> + *
> + * /sys/devices/vas/vas0/gzip/qos_capabilities
> + *   This directory contains the following VAS GZIP capabilities
> + *   for the Quality of Service (QoS) credit type.
> + * /sys/devices/vas/vas0/gzip/qos_capabilities/nr_total_credits
> + *   Total number of QoS credits assigned to the LPAR. The user
> + *   has to define this value using HMC interface. It can be
> + *   changed dynamically by the user.
> + * /sys/devices/vas/vas0/gzip/qos_capabilities/nr_used_credits
> + *   Number of credits used by the user space.
> + */
> +
> +VAS_ATTR_RO(nr_total_credits);
> +VAS_ATTR_RO(nr_used_credits);
> +
> +static struct attribute *vas_capab_attrs[] = {
> + &nr_total_credits_attribute.attr,
> + &nr_used_credits_attribute.attr,
> + NULL,
> +};
> +
> +static ssize_t vas_type_show(struct kobject *kobj, struct attribute *attr,
> +  char *buf)
> +{
> + struct vas_caps_entry *centry;
> + struct vas_cop_feat_caps *caps;
> + struct vas_sysfs_entry *entry;
> +
> + centry = 

Re: [PATCH v4 7/9] powerpc/pseries/vas: Reopen windows with DLPAR core add

2022-02-22 Thread Nicholas Piggin
Excerpts from Haren Myneni's message of February 20, 2022 6:01 am:
> 
> VAS windows can be closed in the hypervisor due to lost credits
> when the core is removed and the kernel gets faults for NX
> requests on these inactive windows. If these credits are
> available later for core add, reopen these windows and set them
> active. When the OS sees page faults on these active windows,
> it creates mapping on the new paste address. Then the user space
> can continue to use these windows and send HW compression
> requests to NX successfully.

Just for my own ignorance, what happens if userspace does not get
another page fault on that window? Presumably when it gets a page
fault it changes to an available window and doesn't just keep
re-trying. So in what situation does it attempt to re-access a
faulting window?

Thanks,
Nick

> 
> Signed-off-by: Haren Myneni 
> ---
>  arch/powerpc/platforms/pseries/vas.c | 91 +++-
>  1 file changed, 90 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/vas.c 
> b/arch/powerpc/platforms/pseries/vas.c
> index a297720bcdae..96178dd58adf 100644
> --- a/arch/powerpc/platforms/pseries/vas.c
> +++ b/arch/powerpc/platforms/pseries/vas.c
> @@ -565,6 +565,88 @@ static int __init get_vas_capabilities(u8 feat, enum 
> vas_cop_feat_type type,
>   return 0;
>  }
>  
> +/*
> + * VAS windows can be closed due to lost credits when the core is
> + * removed. So reopen them if credits are available due to DLPAR
> + * core add and set the window active status. When NX sees the page
> + * fault on the unmapped paste address, the kernel handles the fault
> + * by setting the remapping to new paste address if the window is
> + * active.
> + */
> +static int reconfig_open_windows(struct vas_caps *vcaps, int creds)
> +{
> + long domain[PLPAR_HCALL9_BUFSIZE] = {VAS_DEFAULT_DOMAIN_ID};
> + struct vas_cop_feat_caps *caps = &vcaps->caps;
> + struct pseries_vas_window *win = NULL, *tmp;
> + int rc, mv_ents = 0;
> +
> + /*
> +  * Nothing to do if there are no closed windows.
> +  */
> + if (!vcaps->nr_close_wins)
> + return 0;
> +
> + /*
> +  * For the core removal, the hypervisor reduces the credits
> +  * assigned to the LPAR and the kernel closes VAS windows
> +  * in the hypervisor depending on the reduced credits. The kernel
> +  * uses LIFO (the last windows that are opened will be closed
> +  * first) and expects to open in the same order when credits
> +  * are available.
> +  * For example, 40 windows are closed when the LPAR lost 2 cores
> +  * (dedicated). If 1 core is added, this LPAR can have 20 more
> +  * credits. It means the kernel can reopen 20 windows. So move
> +  * 20 entries in the VAS windows lost and reopen next 20 windows.
> +  */
> + if (vcaps->nr_close_wins > creds)
> + mv_ents = vcaps->nr_close_wins - creds;
> +
> + list_for_each_entry_safe(win, tmp, &vcaps->list, win_list) {
> + if (!mv_ents)
> + break;
> +
> + mv_ents--;
> + }
> +
> + list_for_each_entry_safe_from(win, tmp, &vcaps->list, win_list) {
> + /*
> +  * Nothing to do on this window if it is not closed
> +  * with VAS_WIN_NO_CRED_CLOSE
> +  */
> + if (!(win->vas_win.status & VAS_WIN_NO_CRED_CLOSE))
> + continue;
> +
> + rc = allocate_setup_window(win, (u64 *)&domain[0],
> +caps->win_type);
> + if (rc)
> + return rc;
> +
> + rc = h_modify_vas_window(win);
> + if (rc)
> + goto out;
> +
> + mutex_lock(&win->vas_win.task_ref.mmap_mutex);
> + /*
> +  * Set window status to active
> +  */
> + win->vas_win.status &= ~VAS_WIN_NO_CRED_CLOSE;
> + mutex_unlock(&win->vas_win.task_ref.mmap_mutex);
> + win->win_type = caps->win_type;
> + if (!--vcaps->nr_close_wins)
> + break;
> + }
> +
> + return 0;
> +out:
> + /*
> +  * Window modify HCALL failed. So close the window to the
> +  * hypervisor and return.
> +  */
> + free_irq_setup(win);
> + h_deallocate_vas_window(win->vas_win.winid);
> + return rc;
> +}
> +
>  /*
>   * The hypervisor reduces the available credits if the LPAR lost core. It
>   * means the excessive windows should not be active and the user space
> @@ -673,7 +755,14 @@ static int vas_reconfig_capabilties(u8 type)
>* closed / reopened. Hold the vas_pseries_mutex so that the
>* the user space can not open new windows.
>*/
> - if (old_nr_creds >  new_nr_creds) {
> + if (old_nr_creds <  new_nr_creds) {
> + /*
> +  * If the existing target credits is less than the new
> +  * target, reopen windows if they are closed due to
> +   

Re: [PATCH v4 6/9] powerpc/pseries/vas: Close windows with DLPAR core removal

2022-02-22 Thread Nicholas Piggin
Excerpts from Haren Myneni's message of February 20, 2022 6:00 am:
> 
> The hypervisor assigns vas credits (windows) for each LPAR based
> on the number of cores configured in that system. The OS is
> expected to release credits when cores are removed, and may
> allocate more when cores are added. So there is a possibility of
> using excessive credits (windows) in the LPAR and the hypervisor
> expects the system to close the excessive windows so that NX load
> can be equally distributed across all LPARs in the system.
> 
> When the OS closes the excessive windows in the hypervisor,
> it sets the window status in-active and invalidates window
> virtual address mapping. The user space receives paste instruction
> failure if any NX requests are issued on the in-active window.

Thanks for adding this paragraph. Then presumably userspace can
update their windows and be able to re-try with an available open
window?

in-active can be one word, not hyphenated.


> 
> This patch also adds the notifier for core removal/add to close
> windows in the hypervisor if the system lost credits (core
> removal) and reopen windows in the hypervisor when the previously
> lost credits are available.
> 
> Signed-off-by: Haren Myneni 
> ---
>  arch/powerpc/include/asm/vas.h   |   2 +
>  arch/powerpc/platforms/pseries/vas.c | 207 +--
>  arch/powerpc/platforms/pseries/vas.h |   3 +
>  3 files changed, 204 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
> index 27251af18c65..6baf7b9ffed4 100644
> --- a/arch/powerpc/include/asm/vas.h
> +++ b/arch/powerpc/include/asm/vas.h
> @@ -34,6 +34,8 @@
>   */
>  #define VAS_WIN_ACTIVE   0x0 /* Used in platform independent 
> */
>   /* vas mmap() */
> +/* Window is closed in the hypervisor due to lost credit */
> +#define VAS_WIN_NO_CRED_CLOSE0x0001

I thought we were getting a different status for software
status vs status returned by hypervisor?

> diff --git a/arch/powerpc/platforms/pseries/vas.h 
> b/arch/powerpc/platforms/pseries/vas.h
> index 2872532ed72a..701363cfd7c1 100644
> --- a/arch/powerpc/platforms/pseries/vas.h
> +++ b/arch/powerpc/platforms/pseries/vas.h
> @@ -83,6 +83,9 @@ struct vas_cop_feat_caps {
>  struct vas_caps {
>   struct vas_cop_feat_caps caps;
>   struct list_head list;  /* List of open windows */
> + int nr_close_wins;  /* closed windows in the hypervisor for DLPAR */
> + int nr_open_windows;/* Number of successful open windows */
> + u8 feat;/* Feature type */
>  };

Still not entirely sold on the idea that nr_open_windows is a feature
or capability, but if the code works out easier this way, sometimes
these little hacks are reasonable.

Thanks,
Nick


Re: [PATCH v4 5/9] powerpc/vas: Map paste address only if window is active

2022-02-22 Thread Nicholas Piggin
Excerpts from Haren Myneni's message of February 20, 2022 5:59 am:
> 
> The paste address mapping is done with mmap() after the window is
> opened with ioctl. If the window is closed by OS in the hypervisor
> due to DLPAR after this mmap(), the paste instruction returns

I don't think the changelog was improved here.

The window is closed by the OS in response to a DLPAR operation
by the hypervisor? The OS can't be in the hypervisor.


> failure until the OS reopens this window again. But before mmap(),
> DLPAR core removal can happen which causes the corresponding
> window in-active. So if the window is not active, return mmap()
> failure with -EACCES and expects the user space reissue mmap()
> when the window is active or open a new window when the credit
> is available.
> 
> Signed-off-by: Haren Myneni 
> ---
>  arch/powerpc/platforms/book3s/vas-api.c | 20 +++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/book3s/vas-api.c 
> b/arch/powerpc/platforms/book3s/vas-api.c
> index f3e421511ea6..eb4489b2b46b 100644
> --- a/arch/powerpc/platforms/book3s/vas-api.c
> +++ b/arch/powerpc/platforms/book3s/vas-api.c
> @@ -496,10 +496,26 @@ static int coproc_mmap(struct file *fp, struct 
> vm_area_struct *vma)
>   return -EACCES;
>   }
>  
> + /*
> +  * The initial mmap is done after the window is opened
> +  * with ioctl. But before mmap(), this window can be closed in
> +  * the hypervisor due to lost credit (core removal on pseries).
> +  * So if the window is not active, return mmap() failure with
> +  * -EACCES and expects the user space reissue mmap() when it
> +  * is active again or open new window when the credit is available.
> +  */
> + mutex_lock(&txwin->task_ref.mmap_mutex);
> + if (txwin->status != VAS_WIN_ACTIVE) {
> + pr_err("%s(): Window is not active\n", __func__);
> + rc = -EACCES;
> + goto out;
> + }
> +
>   paste_addr = cp_inst->coproc->vops->paste_addr(txwin);
>   if (!paste_addr) {
>   pr_err("%s(): Window paste address failed\n", __func__);
> - return -EINVAL;
> + rc = -EINVAL;
> + goto out;
>   }
>  
>   pfn = paste_addr >> PAGE_SHIFT;
> @@ -519,6 +535,8 @@ static int coproc_mmap(struct file *fp, struct 
> vm_area_struct *vma)
>   txwin->task_ref.vma = vma;
>   vma->vm_ops = _vm_ops;
>  
> +out:
> + mutex_unlock(&txwin->task_ref.mmap_mutex);

Did we have an explanation of what mmap_mutex is protecting? Sorry if 
you explained it and I forgot -- would be good to have a small comment
(what is it protecting against).

Thanks,
Nick
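
[Illustrative sketch only, appended for context.] The changelog above
expects user space to reissue mmap() when it gets -EACCES because the
window went inactive between the open ioctl and mmap(). That might look
roughly like this; the mapping size/flags and the back-off are assumed
values for the sketch, not mandated by the patch:

#include <sys/mman.h>
#include <errno.h>
#include <unistd.h>
#include <stddef.h>

static void *map_paste_addr(int winfd)
{
	for (;;) {
		void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_SHARED, winfd, 0);
		if (p != MAP_FAILED)
			return p;
		if (errno != EACCES)
			return NULL;	/* a real error, give up */
		usleep(1000);		/* window inactive: wait for credits and retry */
	}
}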


Re: [PATCH v4 4/9] powerpc/vas: Return paste instruction failure if no active window

2022-02-22 Thread Nicholas Piggin
Excerpts from Haren Myneni's message of February 20, 2022 5:58 am:
> 
> The VAS window may not be active if the system loses credits and
> the NX generates a page fault when it receives a request on the
> unmapped paste address.
> 
> The kernel handles the fault by remapping the new paste address if
> the window is active again. Otherwise it returns the paste
> instruction failure if the executed instruction that caused the
> fault was a paste.

Looks good, thanks for fixing the SIGBUS thing, was that my
fault? I vaguely remember writing some of this patch :P

Thanks,
Nick

> 
> Signed-off-by: Nicholas Piggin 
> Signed-off-by: Haren Myneni 
> ---
>  arch/powerpc/include/asm/ppc-opcode.h   |  2 +
>  arch/powerpc/platforms/book3s/vas-api.c | 55 -
>  2 files changed, 56 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
> b/arch/powerpc/include/asm/ppc-opcode.h
> index 9675303b724e..82f1f0041c6f 100644
> --- a/arch/powerpc/include/asm/ppc-opcode.h
> +++ b/arch/powerpc/include/asm/ppc-opcode.h
> @@ -262,6 +262,8 @@
>  #define PPC_INST_MFSPR_PVR   0x7c1f42a6
>  #define PPC_INST_MFSPR_PVR_MASK  0xfc1ffffe
>  #define PPC_INST_MTMSRD  0x7c000164
> +#define PPC_INST_PASTE   0x7c20070d
> +#define PPC_INST_PASTE_MASK  0xfc2007ff
>  #define PPC_INST_POPCNTB 0x7c0000f4
>  #define PPC_INST_POPCNTB_MASK    0xfc0007fe
>  #define PPC_INST_RFEBB   0x4c000124
> diff --git a/arch/powerpc/platforms/book3s/vas-api.c 
> b/arch/powerpc/platforms/book3s/vas-api.c
> index f359e7b2bf90..f3e421511ea6 100644
> --- a/arch/powerpc/platforms/book3s/vas-api.c
> +++ b/arch/powerpc/platforms/book3s/vas-api.c
> @@ -351,6 +351,41 @@ static int coproc_release(struct inode *inode, struct 
> file *fp)
>   return 0;
>  }
>  
> +/*
> + * If the executed instruction that caused the fault was a paste, then
> + * clear regs CR0[EQ], advance NIP, and return 0. Else return error code.
> + */
> +static int do_fail_paste(void)
> +{
> + struct pt_regs *regs = current->thread.regs;
> + u32 instword;
> +
> + if (WARN_ON_ONCE(!regs))
> + return -EINVAL;
> +
> + if (WARN_ON_ONCE(!user_mode(regs)))
> + return -EINVAL;
> +
> + /*
> +  * If we couldn't translate the instruction, the driver should
> +  * return success without handling the fault, it will be retried
> +  * or the instruction fetch will fault.
> +  */
> + if (get_user(instword, (u32 __user *)(regs->nip)))
> + return -EAGAIN;
> +
> + /*
> +  * Not a paste instruction, driver may fail the fault.
> +  */
> + if ((instword & PPC_INST_PASTE_MASK) != PPC_INST_PASTE)
> + return -ENOENT;
> +
> + regs->ccr &= ~0xe0000000;   /* Clear CR0[0-2] to fail paste */
> + regs_add_return_ip(regs, 4);/* Emulate the paste */
> +
> + return 0;
> +}
> +
>  /*
>   * This fault handler is invoked when the core generates page fault on
>   * the paste address. Happens if the kernel closes window in hypervisor
> @@ -408,9 +443,27 @@ static vm_fault_t vas_mmap_fault(struct vm_fault *vmf)
>   }
>   mutex_unlock(&txwin->task_ref.mmap_mutex);
>  
> - return VM_FAULT_SIGBUS;
> + /*
> +  * Received this fault due to closing the actual window.
> +  * It can happen during migration or lost credits.
> +  * Since no mapping, return the paste instruction failure
> +  * to the user space.
> +  */
> + ret = do_fail_paste();
> + /*
> +  * The user space can retry several times until success (needed
> +  * for migration) or should fallback to SW compression or
> +  * manage with the existing open windows if available.
> +  * Looking at sysfs interface, it can determine whether these
> +  * failures are coming during migration or core removal:
> +  * nr_used_credits > nr_total_credits when lost credits
> +  */
> + if (!ret || (ret == -EAGAIN))
> + return VM_FAULT_NOPAGE;
>  
> + return VM_FAULT_SIGBUS;
>  }
> +
>  static const struct vm_operations_struct vas_vm_ops = {
>   .fault = vas_mmap_fault,
>  };
> -- 
> 2.27.0
> 
> 
> 


Re: [PATCH v4 3/9] powerpc/vas: Add paste address mmap fault handler

2022-02-22 Thread Nicholas Piggin
Excerpts from Haren Myneni's message of February 20, 2022 5:55 am:
> 
> The user space opens VAS windows and issues NX requests by pasting
> CRB on the corresponding paste address mmap. When the system lost
> credits due to core removal, the kernel has to close the window in
> the hypervisor and make the window inactive by unmapping this paste
> address. Also the OS has to handle NX request page faults if the user
> space issues NX requests.
> 
> This handler maps the new paste address with the same VMA when the
> window is active again (due to core add with DLPAR). Otherwise
> returns paste failure.
> 

Reviewed-by: Nicholas Piggin 

> Signed-off-by: Haren Myneni 
> ---
>  arch/powerpc/include/asm/vas.h  | 10 
>  arch/powerpc/platforms/book3s/vas-api.c | 68 +
>  2 files changed, 78 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
> index 57573d9c1e09..27251af18c65 100644
> --- a/arch/powerpc/include/asm/vas.h
> +++ b/arch/powerpc/include/asm/vas.h
> @@ -29,6 +29,12 @@
>  #define VAS_THRESH_FIFO_GT_QTR_FULL  2
>  #define VAS_THRESH_FIFO_GT_EIGHTH_FULL   3
>  
> +/*
> + * VAS window Linux status bits
> + */
> +#define VAS_WIN_ACTIVE   0x0 /* Used in platform independent 
> */
> + /* vas mmap() */
> +
>  /*
>   * Get/Set bit fields
>   */
> @@ -59,6 +65,9 @@ struct vas_user_win_ref {
>   struct pid *pid;/* PID of owner */
>   struct pid *tgid;   /* Thread group ID of owner */
>   struct mm_struct *mm;   /* Linux process mm_struct */
> + struct mutex mmap_mutex;/* protects paste address mmap() */
> + /* with DLPAR close/open windows */
> + struct vm_area_struct *vma; /* Save VMA and used in DLPAR ops */
>  };
>  
>  /*
> @@ -67,6 +76,7 @@ struct vas_user_win_ref {
>  struct vas_window {
>   u32 winid;
>   u32 wcreds_max; /* Window credits */
> + u32 status; /* Window status used in OS */
>   enum vas_cop_type cop;
>   struct vas_user_win_ref task_ref;
>   char *dbgname;
> diff --git a/arch/powerpc/platforms/book3s/vas-api.c 
> b/arch/powerpc/platforms/book3s/vas-api.c
> index 4d82c92ddd52..f359e7b2bf90 100644
> --- a/arch/powerpc/platforms/book3s/vas-api.c
> +++ b/arch/powerpc/platforms/book3s/vas-api.c
> @@ -316,6 +316,7 @@ static int coproc_ioc_tx_win_open(struct file *fp, 
> unsigned long arg)
>   return PTR_ERR(txwin);
>   }
>  
> + mutex_init(&txwin->task_ref.mmap_mutex);
>   cp_inst->txwin = txwin;
>  
>   return 0;
> @@ -350,6 +351,70 @@ static int coproc_release(struct inode *inode, struct 
> file *fp)
>   return 0;
>  }
>  
> +/*
> + * This fault handler is invoked when the core generates page fault on
> + * the paste address. Happens if the kernel closes window in hypervisor
> + * (on pseries) due to lost credit or the paste address is not mapped.
> + */
> +static vm_fault_t vas_mmap_fault(struct vm_fault *vmf)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + struct file *fp = vma->vm_file;
> + struct coproc_instance *cp_inst = fp->private_data;
> + struct vas_window *txwin;
> + u64 paste_addr;
> + int ret;
> +
> + /*
> +  * window is not opened. Shouldn't expect this error.
> +  */
> + if (!cp_inst || !cp_inst->txwin) {
> + pr_err("%s(): Unexpected fault on paste address with TX window 
> closed\n",
> + __func__);
> + return VM_FAULT_SIGBUS;
> + }
> +
> + txwin = cp_inst->txwin;
> + /*
> +  * When the LPAR lost credits due to core removal or during
> +  * migration, invalidate the existing mapping for the current
> +  * paste addresses and set windows in-active (zap_page_range in
> +  * reconfig_close_windows()).
> +  * New mapping will be done later after migration or new credits
> +  * available. So continue to receive faults if the user space
> +  * issue NX request.
> +  */
> + if (txwin->task_ref.vma != vmf->vma) {
> + pr_err("%s(): No previous mapping with paste address\n",
> + __func__);
> + return VM_FAULT_SIGBUS;
> + }
> +
> + mutex_lock(&txwin->task_ref.mmap_mutex);
> + /*
> +  * The window may be inactive due to lost credit (Ex: core
> +  * removal with DLPAR). If the window is active again when
> +  * the credit is available, map the new paste address at the
> +  * window virtual address.
> +  */
> + if (txwin->status == VAS_WIN_ACTIVE) {
> + paste_addr = cp_inst->coproc->vops->paste_addr(txwin);
> + if (paste_addr) {
> + ret = vmf_insert_pfn(vma, vma->vm_start,
> + (paste_addr >> PAGE_SHIFT));
> + mutex_unlock(&txwin->task_ref.mmap_mutex);
> + return ret;
> + }
> + }
> + 

Re: [PATCH v4 2/9] powerpc/pseries/vas: Save PID in pseries_vas_window struct

2022-02-22 Thread Nicholas Piggin
Excerpts from Haren Myneni's message of February 20, 2022 5:55 am:
> 
> The kernel sets the VAS window with PID when it is opened in
> the hypervisor. During DLPAR operation, windows can be closed and
> reopened in the hypervisor when the credit is available. So save
> this PID in the pseries_vas_window struct when the window is opened
> initially and reuse it later during the DLPAR operation.

Thanks for renaming it lpid->pid and adding the comment.

Reviewed-by: Nicholas Piggin 

> 
> Signed-off-by: Haren Myneni 
> ---
>  arch/powerpc/platforms/pseries/vas.c | 9 +
>  arch/powerpc/platforms/pseries/vas.h | 1 +
>  2 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/vas.c 
> b/arch/powerpc/platforms/pseries/vas.c
> index 18aae037ffe9..1035446f985b 100644
> --- a/arch/powerpc/platforms/pseries/vas.c
> +++ b/arch/powerpc/platforms/pseries/vas.c
> @@ -107,7 +107,6 @@ static int h_deallocate_vas_window(u64 winid)
>  static int h_modify_vas_window(struct pseries_vas_window *win)
>  {
>   long rc;
> - u32 lpid = mfspr(SPRN_PID);
>  
>   /*
>* AMR value is not supported in Linux VAS implementation.
> @@ -115,7 +114,7 @@ static int h_modify_vas_window(struct pseries_vas_window 
> *win)
>*/
>   do {
>   rc = plpar_hcall_norets(H_MODIFY_VAS_WINDOW,
> - win->vas_win.winid, lpid, 0,
> + win->vas_win.winid, win->pid, 0,
>   VAS_MOD_WIN_FLAGS, 0);
>  
>   rc = hcall_return_busy_check(rc);
> @@ -124,8 +123,8 @@ static int h_modify_vas_window(struct pseries_vas_window 
> *win)
>   if (rc == H_SUCCESS)
>   return 0;
>  
> - pr_err("H_MODIFY_VAS_WINDOW error: %ld, winid %u lpid %u\n",
> - rc, win->vas_win.winid, lpid);
> + pr_err("H_MODIFY_VAS_WINDOW error: %ld, winid %u pid %u\n",
> + rc, win->vas_win.winid, win->pid);
>   return -EIO;
>  }
>  
> @@ -338,6 +337,8 @@ static struct vas_window *vas_allocate_window(int vas_id, 
> u64 flags,
>   }
>   }
>  
> + txwin->pid = mfspr(SPRN_PID);
> +
>   /*
>* Allocate / Deallocate window hcalls and setup / free IRQs
>* have to be protected with mutex.
> diff --git a/arch/powerpc/platforms/pseries/vas.h 
> b/arch/powerpc/platforms/pseries/vas.h
> index d6ea8ab8b07a..2872532ed72a 100644
> --- a/arch/powerpc/platforms/pseries/vas.h
> +++ b/arch/powerpc/platforms/pseries/vas.h
> @@ -114,6 +114,7 @@ struct pseries_vas_window {
>   u64 domain[6];  /* Associativity domain Ids */
>   /* this window is allocated */
>   u64 util;
> + u32 pid;/* PID associated with this window */
>  
>   /* List of windows opened which is used for LPM */
>   struct list_head win_list;
> -- 
> 2.27.0
> 
> 
> 


Re: [PATCH] powerpc/64s: Don't use DSISR for SLB faults

2022-02-22 Thread Nicholas Piggin
Excerpts from Michael Ellerman's message of February 22, 2022 9:34 pm:
> Since commit 46ddcb3950a2 ("powerpc/mm: Show if a bad page fault on data
> is read or write.") we use page_fault_is_write(regs->dsisr) in
> __bad_page_fault() to determine if the fault is for a read or write, and
> change the message printed accordingly.
> 
> But SLB faults, aka Data Segment Interrupts, don't set DSISR (Data
> Storage Interrupt Status Register) to a useful value. All ISA versions
> from v2.03 through v3.1 specify that the Data Segment Interrupt sets
> DSISR "to an undefined value". As far as I can see there's no mention of
> SLB faults setting DSISR in any BookIV content either.
> 
> This manifests as accesses that should be a read being incorrectly
> reported as writes, for example, using the xmon "dump" command:
> 
>   0:mon> d 0x5deadbeef000
>   5deadbeef000
>   [359526.415354][C6] BUG: Unable to handle kernel data access on write 
> at 0x5deadbeef000
>   [359526.415611][C6] Faulting instruction address: 0xc010a300
>   cpu 0x6: Vector: 380 (Data SLB Access) at [cffbf400]
>   pc: c010a300: mread+0x90/0x190
> 
> If we disassemble the PC, we see a load instruction:
> 
>   0:mon> di c010a300
>   c010a300 8949  lbz r10,0(r9)
> 
> We can also see in exceptions-64s.S that the data_access_slb block
> doesn't set IDSISR=1, which means it doesn't load DSISR into pt_regs. So
> the value we're using to determine if the fault is a read/write is some
> stale value in pt_regs from a previous page fault.
> 
> Rework the printing logic to separate the SLB fault case out, and only
> print read/write in the cases where we can determine it.
> 
> The result looks like eg:
> 
>   0:mon> d 0x5deadbeef000
>   5deadbeef000
>   [  721.779525][C6] BUG: Unable to handle kernel data access at 
> 0x5deadbeef000
>   [  721.779697][C6] Faulting instruction address: 0xc014cbe0
>   cpu 0x6: Vector: 380 (Data SLB Access) at [cffbf390]
> 
>   0:mon> d 0
>   
>   [  742.793242][C6] BUG: Kernel NULL pointer dereference at 0x
>   [  742.793316][C6] Faulting instruction address: 0xc014cbe0
>   cpu 0x6: Vector: 380 (Data SLB Access) at [cffbf390]
> 

Reviewed-by: Nicholas Piggin 

> Fixes: 46ddcb3950a2 ("powerpc/mm: Show if a bad page fault on data is read or 
> write.")
> Reported-by: Nageswara R Sastry 
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/mm/fault.c | 14 ++
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index eb8ecd7343a9..7ba6d3eff636 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -567,18 +567,24 @@ NOKPROBE_SYMBOL(hash__do_page_fault);
>  static void __bad_page_fault(struct pt_regs *regs, int sig)
>  {
>   int is_write = page_fault_is_write(regs->dsisr);
> + const char *msg;
>  
>   /* kernel has accessed a bad area */
>  
> + if (regs->dar < PAGE_SIZE)
> + msg = "Kernel NULL pointer dereference";
> + else
> + msg = "Unable to handle kernel data access";
> +
>   switch (TRAP(regs)) {
>   case INTERRUPT_DATA_STORAGE:
> - case INTERRUPT_DATA_SEGMENT:
>   case INTERRUPT_H_DATA_STORAGE:
> - pr_alert("BUG: %s on %s at 0x%08lx\n",
> -  regs->dar < PAGE_SIZE ? "Kernel NULL pointer 
> dereference" :
> -  "Unable to handle kernel data access",
> + pr_alert("BUG: %s on %s at 0x%08lx\n", msg,
>is_write ? "write" : "read", regs->dar);
>   break;
> + case INTERRUPT_DATA_SEGMENT:
> + pr_alert("BUG: %s at 0x%08lx\n", msg, regs->dar);
> + break;
>   case INTERRUPT_INST_STORAGE:
>   case INTERRUPT_INST_SEGMENT:
>   pr_alert("BUG: Unable to handle kernel instruction fetch%s",
> -- 
> 2.34.1
> 
> 


Re: [PATCH v4 0/3] KVM: PPC: Book3S PR: Fixes for AIL and SCV

2022-02-22 Thread Nicholas Piggin
Excerpts from Paolo Bonzini's message of February 23, 2022 12:11 am:
> On 2/22/22 07:47, Nicholas Piggin wrote:
>> Patch 3 requires a KVM_CAP_PPC number allocated. QEMU maintainers are
>> happy with it (link in changelog) just waiting on KVM upstreaming. Do
>> you have objections to the series going to ppc/kvm tree first, or
>> another option is you could take patch 3 alone first (it's relatively
>> independent of the other 2) and ppc/kvm gets it from you?
> 
> Hi Nick,
> 
> I have pushed a topic branch kvm-cap-ppc-210 to kvm.git with just the 
> definition and documentation of the capability.  ppc/kvm can apply your 
> patch based on it (and drop the relevant parts of patch 3).  I'll send 
> it to Linus this week.

Hey Paolo,

Thanks for this, I could have done it for you! This seems like a good 
way to reserve/merge caps: when there is a series ready for the N+1
merge window, then the cap number and description could have a topic
branch based on an earlier release. I'm not sure if you'd been doing 
that before (looks like not for the most recent few caps, at least).

One thing that might improve it is if you used 5.16 as the base for
the kvm-cap branch. I realise it wasn't so simple this time because 
5.17-rc2 had a new cap merged. But it should be possible if all new caps 
took this approach. It would give the arch tree more flexibility where 
to base their tree on without (mpe usually does -rc2). NBD just an idea 
for next time.

Thanks,
Nick


[PATCH v6 5/5] drivers: virtio_mem: use pageblock size as the minimum virtio_mem size.

2022-02-22 Thread Zi Yan
From: Zi Yan 

alloc_contig_range() now only needs to be aligned to pageblock_order, so
drop the virtio_mem size requirement that it be the max of
pageblock_order and MAX_ORDER.
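
For scale (illustrative numbers assuming the usual x86-64 defaults of
4 KiB pages, MAX_ORDER = 11 and pageblock_order = 9, not taken from the
patch):

/*
 * before: sb_size = PAGE_SIZE * MAX_ORDER_NR_PAGES = 4096 * 1024 = 4 MiB
 * after:  sb_size = PAGE_SIZE * pageblock_nr_pages = 4096 *  512 = 2 MiB
 * (still raised to vm->device_block_size by the max_t() in the patch)
 */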

Signed-off-by: Zi Yan 
---
 drivers/virtio/virtio_mem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index e7d6b679596d..e07486f01999 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -2476,10 +2476,10 @@ static int virtio_mem_init_hotplug(struct virtio_mem 
*vm)
  VIRTIO_MEM_DEFAULT_OFFLINE_THRESHOLD);
 
/*
-* TODO: once alloc_contig_range() works reliably with pageblock
-* granularity on ZONE_NORMAL, use pageblock_nr_pages instead.
+* alloc_contig_range() works reliably with pageblock
+* granularity on ZONE_NORMAL, use pageblock_nr_pages.
 */
-   sb_size = PAGE_SIZE * MAX_ORDER_NR_PAGES;
+   sb_size = PAGE_SIZE * pageblock_nr_pages;
sb_size = max_t(uint64_t, vm->device_block_size, sb_size);
 
if (sb_size < memory_block_size_bytes() && !force_bbm) {
-- 
2.34.1



[PATCH v6 3/5] mm: make alloc_contig_range work at pageblock granularity

2022-02-22 Thread Zi Yan
From: Zi Yan 

alloc_contig_range() worked at MAX_ORDER-1 granularity to avoid merging
pageblocks with different migratetypes. It might unnecessarily convert
extra pageblocks at the beginning and at the end of the range. Change
alloc_contig_range() to work at pageblock granularity.

Special handling is needed for free pages and in-use pages across the
boundaries of the range specified by alloc_contig_range(), because these
partially isolated pages cause free page accounting issues. The free
pages will be split and freed into separate migratetype lists; the
in-use pages will be migrated and then the freed pages will be handled.
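
As a concrete illustration of the new split_free_page() helper added
below (all numbers invented for the example):

/*
 * An order-10 free page starting at pfn 0x80000 spans two pageblocks
 * when pageblock_order == 9.  Splitting it at the pageblock boundary
 * lets each half be freed onto the list that matches its own
 * pageblock's migratetype.
 */
split_free_page(pfn_to_page(0x80000), 10, pageblock_nr_pages);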

Reported-by: kernel test robot 
Signed-off-by: Zi Yan 
---
 include/linux/page-isolation.h |   2 +-
 mm/internal.h  |   6 ++
 mm/memory_hotplug.c|   3 +-
 mm/page_alloc.c| 112 ---
 mm/page_isolation.c| 156 +++--
 5 files changed, 214 insertions(+), 65 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index eb4a208fe907..20ec9cad3882 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -52,7 +52,7 @@ int move_freepages_block(struct zone *zone, struct page *page,
  */
 int
 start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-unsigned migratetype, int flags);
+unsigned migratetype, int flags, gfp_t gfp_flags);
 
 /*
  * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
diff --git a/mm/internal.h b/mm/internal.h
index 7ed98955c8f4..2626e38dd62c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -256,6 +256,9 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t 
align,
  phys_addr_t min_addr,
  int nid, bool exact_nid);
 
+void split_free_page(struct page *free_page,
+   int order, unsigned long split_pfn_offset);
+
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
 /*
@@ -319,6 +322,9 @@ isolate_freepages_range(struct compact_control *cc,
 int
 isolate_migratepages_range(struct compact_control *cc,
   unsigned long low_pfn, unsigned long end_pfn);
+
+int __alloc_contig_migrate_range(struct compact_control *cc,
+   unsigned long start, unsigned long end);
 #endif
 int find_suitable_fallback(struct free_area *area, unsigned int order,
int migratetype, bool only_stealable, bool *can_steal);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index aee69281dad6..bbd1ff39121f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1833,7 +1833,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned 
long nr_pages,
/* set above range as isolated */
ret = start_isolate_page_range(start_pfn, end_pfn,
   MIGRATE_MOVABLE,
-  MEMORY_OFFLINE | REPORT_FAILURE);
+  MEMORY_OFFLINE | REPORT_FAILURE,
+  GFP_USER | __GFP_MOVABLE | 
__GFP_RETRY_MAYFAIL);
if (ret) {
reason = "failure to isolate range";
goto failed_removal_pcplists_disabled;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b900315657cf..038e044c5a80 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1139,6 +1139,43 @@ static inline void __free_one_page(struct page *page,
page_reporting_notify_free(order);
 }
 
+/**
+ * split_free_page() -- split a free page at split_pfn_offset
+ * @free_page: the original free page
+ * @order: the order of the page
+ * @split_pfn_offset:  split offset within the page
+ *
+ * It is used when the free page crosses two pageblocks with different 
migratetypes
+ * at split_pfn_offset within the page. The split free page will be put into
+ * separate migratetype lists afterwards. Otherwise, the function achieves
+ * nothing.
+ */
+void split_free_page(struct page *free_page,
+   int order, unsigned long split_pfn_offset)
+{
+   struct zone *zone = page_zone(free_page);
+   unsigned long free_page_pfn = page_to_pfn(free_page);
+   unsigned long pfn;
+   unsigned long flags;
+   int free_page_order;
+
+   spin_lock_irqsave(&zone->lock, flags);
+   del_page_from_free_list(free_page, zone, order);
+   for (pfn = free_page_pfn;
+pfn < free_page_pfn + (1UL << order);) {
+   int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
+
+   free_page_order = order_base_2(split_pfn_offset);
+   __free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order,
+   mt, FPI_NONE);
+   pfn += 1UL << free_page_order;
+   split_pfn_offset -= (1UL << free_page_order);
+   /* we have done the first part, now switch to 

[PATCH v6 1/5] mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c

2022-02-22 Thread Zi Yan
From: Zi Yan 

has_unmovable_pages() is only used in mm/page_isolation.c. Move it from
mm/page_alloc.c and make it static.

Signed-off-by: Zi Yan 
Reviewed-by: Oscar Salvador 
Reviewed-by: Mike Rapoport 
---
 include/linux/page-isolation.h |   2 -
 mm/page_alloc.c| 119 -
 mm/page_isolation.c| 119 +
 3 files changed, 119 insertions(+), 121 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 572458016331..e14eddf6741a 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -33,8 +33,6 @@ static inline bool is_migrate_isolate(int migratetype)
 #define MEMORY_OFFLINE 0x1
 #define REPORT_FAILURE 0x2
 
-struct page *has_unmovable_pages(struct zone *zone, struct page *page,
-int migratetype, int flags);
 void set_pageblock_migratetype(struct page *page, int migratetype);
 int move_freepages_block(struct zone *zone, struct page *page,
int migratetype, int *num_movable);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7ff1efc84205..228751019fd8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8948,125 +8948,6 @@ void *__init alloc_large_system_hash(const char 
*tablename,
return table;
 }
 
-/*
- * This function checks whether pageblock includes unmovable pages or not.
- *
- * PageLRU check without isolation or lru_lock could race so that
- * MIGRATE_MOVABLE block might include unmovable pages. And __PageMovable
- * check without lock_page also may miss some movable non-lru pages at
- * race condition. So you can't expect this function should be exact.
- *
- * Returns a page without holding a reference. If the caller wants to
- * dereference that page (e.g., dumping), it has to make sure that it
- * cannot get removed (e.g., via memory unplug) concurrently.
- *
- */
-struct page *has_unmovable_pages(struct zone *zone, struct page *page,
-int migratetype, int flags)
-{
-   unsigned long iter = 0;
-   unsigned long pfn = page_to_pfn(page);
-   unsigned long offset = pfn % pageblock_nr_pages;
-
-   if (is_migrate_cma_page(page)) {
-   /*
-* CMA allocations (alloc_contig_range) really need to mark
-* isolate CMA pageblocks even when they are not movable in fact
-* so consider them movable here.
-*/
-   if (is_migrate_cma(migratetype))
-   return NULL;
-
-   return page;
-   }
-
-   for (; iter < pageblock_nr_pages - offset; iter++) {
-   page = pfn_to_page(pfn + iter);
-
-   /*
-* Both, bootmem allocations and memory holes are marked
-* PG_reserved and are unmovable. We can even have unmovable
-* allocations inside ZONE_MOVABLE, for example when
-* specifying "movablecore".
-*/
-   if (PageReserved(page))
-   return page;
-
-   /*
-* If the zone is movable and we have ruled out all reserved
-* pages then it should be reasonably safe to assume the rest
-* is movable.
-*/
-   if (zone_idx(zone) == ZONE_MOVABLE)
-   continue;
-
-   /*
-* Hugepages are not in LRU lists, but they're movable.
-* THPs are on the LRU, but need to be counted as #small pages.
-* We need not scan over tail pages because we don't
-* handle each tail page individually in migration.
-*/
-   if (PageHuge(page) || PageTransCompound(page)) {
-   struct page *head = compound_head(page);
-   unsigned int skip_pages;
-
-   if (PageHuge(page)) {
-   if 
(!hugepage_migration_supported(page_hstate(head)))
-   return page;
-   } else if (!PageLRU(head) && !__PageMovable(head)) {
-   return page;
-   }
-
-   skip_pages = compound_nr(head) - (page - head);
-   iter += skip_pages - 1;
-   continue;
-   }
-
-   /*
-* We can't use page_count without pin a page
-* because another CPU can free compound page.
-* This check already skips compound tails of THP
-* because their page->_refcount is zero at all time.
-*/
-   if (!page_ref_count(page)) {
-   if (PageBuddy(page))
-   iter += (1 << buddy_order(page)) - 1;
-   continue;
-   }
-
-   /*
-

[PATCH v6 4/5] mm: cma: use pageblock_order as the single alignment

2022-02-22 Thread Zi Yan
From: Zi Yan 

Now alloc_contig_range() works at pageblock granularity. Change CMA
allocation, which uses alloc_contig_range(), to use pageblock_order
alignment.

Signed-off-by: Zi Yan 
---
 include/linux/cma.h| 4 ++--
 include/linux/mmzone.h | 5 +
 mm/page_alloc.c| 4 ++--
 3 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/include/linux/cma.h b/include/linux/cma.h
index 90fd742fd1ef..22fa94231dfe 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -21,11 +21,11 @@
 #define CMA_MAX_NAME 64
 
 /*
- * TODO: once the buddy -- especially pageblock merging and 
alloc_contig_range()
+ *  the buddy -- especially pageblock merging and alloc_contig_range()
  * -- can deal with only some pageblocks of a higher-order page being
  *  MIGRATE_CMA, we can use pageblock_nr_pages.
  */
-#define CMA_MIN_ALIGNMENT_PAGES MAX_ORDER_NR_PAGES
+#define CMA_MIN_ALIGNMENT_PAGES pageblock_nr_pages
 #define CMA_MIN_ALIGNMENT_BYTES (PAGE_SIZE * CMA_MIN_ALIGNMENT_PAGES)
 
 struct cma;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3fff6deca2c0..da38c8436493 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -54,10 +54,7 @@ enum migratetype {
 *
 * The way to use it is to change migratetype of a range of
 * pageblocks to MIGRATE_CMA which can be done by
-* __free_pageblock_cma() function.  What is important though
-* is that a range of pageblocks must be aligned to
-* MAX_ORDER_NR_PAGES should biggest page be bigger than
-* a single pageblock.
+* __free_pageblock_cma() function.
 */
MIGRATE_CMA,
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 038e044c5a80..90281e33e20a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -9077,8 +9077,8 @@ int __alloc_contig_migrate_range(struct compact_control 
*cc,
  * be either of the two.
  * @gfp_mask:  GFP mask to use during compaction
  *
- * The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
- * aligned.  The PFN range must belong to a single zone.
+ * The PFN range does not have to be pageblock aligned. The PFN range must
+ * belong to a single zone.
  *
  * The first thing this routine does is attempt to MIGRATE_ISOLATE all
  * pageblocks in the range.  Once isolated, the pageblocks should not
-- 
2.34.1



[PATCH v6 0/5] Use pageblock_order for cma and alloc_contig_range alignment.

2022-02-22 Thread Zi Yan
From: Zi Yan 

Hi all,

This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA
and alloc_contig_range(). It prepares for my upcoming changes to make
MAX_ORDER adjustable at boot time[1]. It is on top of mmotm-2022-02-14-17-46.

Changelog
===
V6
---
1. Resolved compilation error/warning reported by kernel test robot.
2. Tried to solve the coding concerns from Christophe Leroy.
3. Shortened lengthy lines (pointed out by Christoph Hellwig).

V5
---
1. Moved isolation address alignment handling in start_isolate_page_range().
2. Rewrote and simplified how alloc_contig_range() works at pageblock
   granularity (Patch 3). Only two pageblock migratetypes need to be saved and
   restored. start_isolate_page_range() might need to migrate pages in this
   version, but it prevents the caller from worrying about
   max(MAX_ORDER_NR_PAGES, pageblock_nr_pages) alignment after the page range
   is isolated.

V4
---
1. Dropped two irrelevant patches on non-lru compound page handling, as
   it is not supported upstream.
2. Renamed migratetype_has_fallback() to migratetype_is_mergeable().
3. Always check whether two pageblocks can be merged in
   __free_one_page() when order is >= pageblock_order, as the case (not
   mergeable pageblocks are isolated, CMA, and HIGHATOMIC) becomes more common.
4. Moving has_unmovable_pages() is now a separate patch.
5. Removed MAX_ORDER-1 alignment requirement in the comment in virtio_mem code.

Description
===

The MAX_ORDER - 1 alignment requirement comes from the fact that
alloc_contig_range() isolates pageblocks to remove free memory from the buddy
allocator, but isolating only a subset of the pageblocks within a page that
spans multiple pageblocks causes free page accounting issues. An isolated page
might not be put onto the right free list, since the code assumes the
migratetype of the first pageblock is the migratetype of the whole free page.
This is based on the discussion at [2].
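
To put rough numbers on the difference (assuming x86-64 defaults of 4 KiB
pages, MAX_ORDER = 11 and pageblock_order = 9):

/*
 * MAX_ORDER - 1 alignment: MAX_ORDER_NR_PAGES = 1024 pages = 4 MiB
 * pageblock alignment:     pageblock_nr_pages =  512 pages = 2 MiB
 *
 * so dropping the MAX_ORDER - 1 requirement lets CMA areas and
 * alloc_contig_range() callers align to 2 MiB instead of 4 MiB.
 */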

To remove the requirement, this patchset:
1. isolates pages at pageblock granularity instead of
   max(MAX_ORDER_NR_PAGES, pageblock_nr_pages);
2. splits free pages across the specified range or migrates in-use pages
   across the specified range then splits the freed page to avoid free page
   accounting issues (it happens when multiple pageblocks within a single page
   have different migratetypes);
3. only checks unmovable pages within the range instead of MAX_ORDER - 1 aligned
   range during isolation to avoid alloc_contig_range() failure when pageblocks
   within a MAX_ORDER - 1 aligned range are allocated separately.
4. returns pages not in the range as it did before.

One optimization might come later:
1. make MIGRATE_ISOLATE a separate bit to be able to restore the original
   migratetypes when isolation fails in the middle of the range.

Feel free to give comments and suggestions. Thanks.

[1] https://lore.kernel.org/linux-mm/20210805190253.2795604-1-zi@sent.com/
[2] 
https://lore.kernel.org/linux-mm/d19fb078-cb9b-f60f-e310-fdeea1b94...@redhat.com/

Zi Yan (5):
  mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c
  mm: page_isolation: check specified range for unmovable pages
  mm: make alloc_contig_range work at pageblock granularity
  mm: cma: use pageblock_order as the single alignment
  drivers: virtio_mem: use pageblock size as the minimum virtio_mem
size.

 drivers/virtio/virtio_mem.c|   6 +-
 include/linux/cma.h|   4 +-
 include/linux/mmzone.h |   5 +-
 include/linux/page-isolation.h |  14 +-
 mm/internal.h  |   6 +
 mm/memory_hotplug.c|   3 +-
 mm/page_alloc.c| 246 +++
 mm/page_isolation.c| 296 +++--
 8 files changed, 367 insertions(+), 213 deletions(-)

-- 
2.34.1



[PATCH v6 2/5] mm: page_isolation: check specified range for unmovable pages

2022-02-22 Thread Zi Yan
From: Zi Yan 

Enable set_migratetype_isolate() to check specified sub-range for
unmovable pages during isolation. Page isolation is done
at max(MAX_ORDER_NR_PAGES, pageblock_nr_pages) granularity, but not all
pages within that granularity are intended to be isolated. For example,
alloc_contig_range(), which uses page isolation, allows ranges without
alignment. This commit makes unmovable page check only look for
interesting pages, so that page isolation can succeed for any
non-overlapping ranges.
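
A small worked example of the added clamping (numbers invented):

/*
 * pageblock_nr_pages == 512, the pageblock spans pfns [1024, 1536),
 * and the caller asked to isolate [1000, 1300):
 *
 *   first_pfn = max(1024, 1000)                 = 1024
 *   last_pfn  = min(ALIGN(1024 + 1, 512), 1300) = min(1536, 1300) = 1300
 *
 * so only pfns [1024, 1300) are scanned for unmovable pages instead of
 * the whole pageblock.
 */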

Signed-off-by: Zi Yan 
---
 include/linux/page-isolation.h | 10 
 mm/page_alloc.c| 13 +-
 mm/page_isolation.c| 47 +-
 3 files changed, 40 insertions(+), 30 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index e14eddf6741a..eb4a208fe907 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -15,6 +15,16 @@ static inline bool is_migrate_isolate(int migratetype)
 {
return migratetype == MIGRATE_ISOLATE;
 }
+static inline unsigned long pfn_max_align_down(unsigned long pfn)
+{
+   return ALIGN_DOWN(pfn, MAX_ORDER_NR_PAGES);
+}
+
+static inline unsigned long pfn_max_align_up(unsigned long pfn)
+{
+   return ALIGN(pfn, MAX_ORDER_NR_PAGES);
+}
+
 #else
 static inline bool has_isolate_pageblock(struct zone *zone)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 228751019fd8..b900315657cf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8949,16 +8949,6 @@ void *__init alloc_large_system_hash(const char 
*tablename,
 }
 
 #ifdef CONFIG_CONTIG_ALLOC
-static unsigned long pfn_max_align_down(unsigned long pfn)
-{
-   return ALIGN_DOWN(pfn, MAX_ORDER_NR_PAGES);
-}
-
-static unsigned long pfn_max_align_up(unsigned long pfn)
-{
-   return ALIGN(pfn, MAX_ORDER_NR_PAGES);
-}
-
 #if defined(CONFIG_DYNAMIC_DEBUG) || \
(defined(CONFIG_DYNAMIC_DEBUG_CORE) && defined(DYNAMIC_DEBUG_MODULE))
 /* Usage: See admin-guide/dynamic-debug-howto.rst */
@@ -9103,8 +9093,7 @@ int alloc_contig_range(unsigned long start, unsigned long 
end,
 * put back to page allocator so that buddy can use them.
 */
 
-   ret = start_isolate_page_range(pfn_max_align_down(start),
-  pfn_max_align_up(end), migratetype, 0);
+   ret = start_isolate_page_range(start, end, migratetype, 0);
if (ret)
return ret;
 
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b34f1310aeaa..e0afc3ee8cf9 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -16,7 +16,8 @@
 #include 
 
 /*
- * This function checks whether pageblock includes unmovable pages or not.
+ * This function checks whether pageblock within [start_pfn, end_pfn) includes
+ * unmovable pages or not.
  *
  * PageLRU check without isolation or lru_lock could race so that
  * MIGRATE_MOVABLE block might include unmovable pages. And __PageMovable
@@ -29,11 +30,14 @@
  *
  */
 static struct page *has_unmovable_pages(struct zone *zone, struct page *page,
-int migratetype, int flags)
+int migratetype, int flags,
+unsigned long start_pfn, unsigned long end_pfn)
 {
-   unsigned long iter = 0;
-   unsigned long pfn = page_to_pfn(page);
-   unsigned long offset = pfn % pageblock_nr_pages;
+   unsigned long first_pfn = max(page_to_pfn(page), start_pfn);
+   unsigned long pfn = first_pfn;
+   unsigned long last_pfn = min(ALIGN(pfn + 1, pageblock_nr_pages), 
end_pfn);
+
+   page = pfn_to_page(pfn);
 
if (is_migrate_cma_page(page)) {
/*
@@ -47,8 +51,8 @@ static struct page *has_unmovable_pages(struct zone *zone, 
struct page *page,
return page;
}
 
-   for (; iter < pageblock_nr_pages - offset; iter++) {
-   page = pfn_to_page(pfn + iter);
+   for (pfn = first_pfn; pfn < last_pfn; pfn++) {
+   page = pfn_to_page(pfn);
 
/*
 * Both, bootmem allocations and memory holes are marked
@@ -85,7 +89,7 @@ static struct page *has_unmovable_pages(struct zone *zone, 
struct page *page,
}
 
skip_pages = compound_nr(head) - (page - head);
-   iter += skip_pages - 1;
+   pfn += skip_pages - 1;
continue;
}
 
@@ -97,7 +101,7 @@ static struct page *has_unmovable_pages(struct zone *zone, 
struct page *page,
 */
if (!page_ref_count(page)) {
if (PageBuddy(page))
-   iter += (1 << buddy_order(page)) - 1;
+   pfn += (1 << buddy_order(page)) - 1;
continue;
}
 
@@ -134,7 +138,13 @@ static struct page *has_unmovable_pages(struct zone *zone, 
struct page 

Re: [PATCH v3 2/2] crypto: vmx - add missing dependencies

2022-02-22 Thread Herbert Xu
On Thu, Feb 17, 2022 at 11:57:51AM +0100, Petr Vorel wrote:
> vmx-crypto module depends on CRYPTO_AES, CRYPTO_CBC, CRYPTO_CTR or
> CRYPTO_XTS, thus add them.
> 
> These dependencies are likely to be enabled, but if
> CRYPTO_DEV_VMX=y && !CRYPTO_MANAGER_DISABLE_TESTS
> and either of CRYPTO_AES, CRYPTO_CBC, CRYPTO_CTR or CRYPTO_XTS is built
> as module or disabled, alg_test() from crypto/testmgr.c complains during
> boot about failing to allocate the generic fallback implementations
> (2 == ENOENT):
> 
> [0.540953] Failed to allocate xts(aes) fallback: -2
> [0.541014] alg: skcipher: failed to allocate transform for p8_aes_xts: -2
> [0.541120] alg: self-tests for p8_aes_xts (xts(aes)) failed (rc=-2)
> [0.50] Failed to allocate ctr(aes) fallback: -2
> [0.544497] alg: skcipher: failed to allocate transform for p8_aes_ctr: -2
> [0.544603] alg: self-tests for p8_aes_ctr (ctr(aes)) failed (rc=-2)
> [0.547992] Failed to allocate cbc(aes) fallback: -2
> [0.548052] alg: skcipher: failed to allocate transform for p8_aes_cbc: -2
> [0.548156] alg: self-tests for p8_aes_cbc (cbc(aes)) failed (rc=-2)
> [0.550745] Failed to allocate transformation for 'aes': -2
> [0.550801] alg: cipher: Failed to load transform for p8_aes: -2
> [0.550892] alg: self-tests for p8_aes (aes) failed (rc=-2)
> 
> Fixes: c07f5d3da643 ("crypto: vmx - Adding support for XTS")
> Fixes: d2e3ae6f3aba ("crypto: vmx - Enabling VMX module for PPC64")
> 
> Suggested-by: Nicolai Stange 
> Signed-off-by: Petr Vorel 
> ---
> changes v2->v3:
> * more or less the same, just in drivers/crypto/Kconfig (previously it was
>   in drivers/crypto/vmx/Kconfig)
> * change commit subject to be compatible
> 
>  drivers/crypto/Kconfig | 4 
>  1 file changed, 4 insertions(+)

Please respin this patch to add the selects to the existing tristate.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH v3 1/2] crypto: vmx - merge CRYPTO_DEV_VMX_ENCRYPT into CRYPTO_DEV_VMX

2022-02-22 Thread Herbert Xu
On Thu, Feb 17, 2022 at 11:57:50AM +0100, Petr Vorel wrote:
> CRYPTO_DEV_VMX_ENCRYPT is redundant with CRYPTO_DEV_VMX.
> 
> And it also forces CRYPTO_GHASH to be builtin even
> CRYPTO_DEV_VMX_ENCRYPT was configured as module.

Just because a tristate sits under a bool, it does not force
the options that it selects to y/n.  The select still operates
on the basis of the tristate.

So I don't see the point to this code churn unless the powerpc
folks want to move in this direction.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[PATCH] powerpc/code-patching: Pre-map patch area

2022-02-22 Thread Michael Ellerman
Paul reported a warning with DEBUG_ATOMIC_SLEEP=y:

  BUG: sleeping function called from invalid context at 
include/linux/sched/mm.h:256
  in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
  preempt_count: 0, expected: 0
  ...
  Call Trace:
dump_stack_lvl+0xa0/0xec (unreliable)
__might_resched+0x2f4/0x310
kmem_cache_alloc+0x220/0x4b0
__pud_alloc+0x74/0x1d0
hash__map_kernel_page+0x2cc/0x390
do_patch_instruction+0x134/0x4a0
arch_jump_label_transform+0x64/0x78
__jump_label_update+0x148/0x180
static_key_enable_cpuslocked+0xd0/0x120
static_key_enable+0x30/0x50
check_kvm_guest+0x60/0x88
pSeries_smp_probe+0x54/0xb0
smp_prepare_cpus+0x3e0/0x430
kernel_init_freeable+0x20c/0x43c
kernel_init+0x30/0x1a0
ret_from_kernel_thread+0x5c/0x64

Peter pointed out that this is because do_patch_instruction() has
disabled interrupts, but then map_patch_area() calls map_kernel_page()
then hash__map_kernel_page() which does a sleeping memory allocation.

We only see the warning in KVM guests with SMT enabled, which is not
particularly common, or on other platforms if CONFIG_KPROBES is
disabled, also not common. The reason we don't see it in most
configurations is that another path that happens to have interrupts
enabled has allocated the required page tables for us, eg. there's a
path in kprobes init that does that. That's just pure luck though.

As Christophe suggested, the simplest solution is to do a dummy
map/unmap when we initialise the patching, so that any required page
table levels are pre-allocated before the first call to
do_patch_instruction(). This works because the unmap doesn't free any
page tables that were allocated by the map, it just clears the PTE,
leaving the page table levels there for the next map.

Reported-by: Paul Menzel 
Debugged-by: Peter Zijlstra 
Suggested-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/lib/code-patching.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 906d43463366..00c68e7fb11e 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -43,9 +43,14 @@ int raw_patch_instruction(u32 *addr, ppc_inst_t instr)
 #ifdef CONFIG_STRICT_KERNEL_RWX
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 
+static int map_patch_area(void *addr, unsigned long text_poke_addr);
+static void unmap_patch_area(unsigned long addr);
+
 static int text_area_cpu_up(unsigned int cpu)
 {
struct vm_struct *area;
+   unsigned long addr;
+   int err;
 
area = get_vm_area(PAGE_SIZE, VM_ALLOC);
if (!area) {
@@ -53,6 +58,15 @@ static int text_area_cpu_up(unsigned int cpu)
cpu);
return -1;
}
+
+   // Map/unmap the area to ensure all page tables are pre-allocated
+   addr = (unsigned long)area->addr;
+   err = map_patch_area(empty_zero_page, addr);
+   if (err)
+   return err;
+
+   unmap_patch_area(addr);
+
this_cpu_write(text_poke_area, area);
 
return 0;
-- 
2.34.1



Re: [PATCH V5 00/21] riscv: compat: Add COMPAT mode support for rv64

2022-02-22 Thread Palmer Dabbelt

On Tue, 01 Feb 2022 07:05:24 PST (-0800), guo...@kernel.org wrote:

From: Guo Ren 

Currently, most 64-bit architectures (x86, parisc, powerpc, arm64,
s390, mips, sparc) support COMPAT mode. But they all have historical
issues and can't use the standard Linux unistd.h. RISC-V would be the
first standard __SYSCALL_COMPAT user of
include/uapi/asm-generic/unistd.h.


TBH, I'd always sort of hoped we wouldn't have to do this: it's a lot of 
ABI surface to keep around for a use case I'm not really sure is ever 
going to get any traction (it's not like we have legacy 32-bit 
userspaces floating around, the 32-bit userspace is newer than the 
64-bit userspace).  My assumption is that users who actually wanted the 
memory savings (likely a very small number) would be better served with 
rv64/ilp32, as that'll allow the larger registers that the hardware 
supports.  From some earlier discussions it looks like rv64/ilp32 isn't 
going to be allowed, though, so this seems like the only way to go.


I've left some minor comments, and I saw some unresolved ones as well.  
I'm going to assume there'll be a v6, but LMK if you want me to deal 
with cleaning things up.


I'm OK targeting this for 5.18, assuming the cleanups can be resolved in 
the next week or two: generally with a lot of ABI surface I'd prefer to 
let things sit on for-next for a whole cycle, but I don't think it's a 
stretch to say that the actual ABI here is the rv32 port and that 
rv64/COMPAT=y is just our best attempt at matching it.


It's probably worth running at least the glibc test suite, just to see 
if there's anything obvious that's failing.



The patchset are based on v5.17-rc2, you can compare rv64-compat32
v.s. rv32-whole in qemu with following step:

 - Prepare rv32 rootfs & fw_jump.bin by buildroot.org
   $ git clone git://git.busybox.net/buildroot
   $ cd buildroot
   $ make qemu_riscv32_virt_defconfig O=qemu_riscv32_virt_defconfig
   $ make -C qemu_riscv32_virt_defconfig
   $ make qemu_riscv64_virt_defconfig O=qemu_riscv64_virt_defconfig
   $ make -C qemu_riscv64_virt_defconfig
   (Got fw_jump.bin & rootfs.ext2 in qemu_riscvXX_virt_defconfig/images)

 - Prepare Linux rv32 & rv64 Image
   $ git clone g...@github.com:c-sky/csky-linux.git -b riscv_compat_v5 linux
   $ cd linux
   $ echo "CONFIG_STRICT_KERNEL_RWX=n" >> arch/riscv/configs/defconfig
   $ echo "CONFIG_STRICT_MODULE_RWX=n" >> arch/riscv/configs/defconfig
   $ make ARCH=riscv CROSS_COMPILE=riscv32-buildroot-linux-gnu- 
O=../build-rv32/ rv32_defconfig
   $ make ARCH=riscv CROSS_COMPILE=riscv32-buildroot-linux-gnu- 
O=../build-rv32/ Image
   $ make ARCH=riscv CROSS_COMPILE=riscv64-buildroot-linux-gnu- 
O=../build-rv64/ defconfig
   $ make ARCH=riscv CROSS_COMPILE=riscv64-buildroot-linux-gnu- 
O=../build-rv64/ Image

 - Prepare Qemu: (made by LIU Zhiwei )
   $ git clone g...@github.com:alistair23/qemu.git -b 
riscv-to-apply.for-upstream linux
   $ cd qemu
   $ ./configure --target-list="riscv64-softmmu riscv32-softmmu"
   $ make

Now let's compare rv32-compat with rv32-native memory footprint. Kernel with 
rv32 = rv64
defconfig, rootfs, opensbi, Qemu are the same.

 - Run rv64 with rv32 rootfs in compat mode:
   $ ./build/qemu-system-riscv64 -cpu rv64,x-h=true -M virt -m 64m -nographic -bios 
qemu_riscv64_virt_defconfig/images/fw_jump.bin -kernel build-rv64/Image -drive file 
qemu_riscv32_virt_defconfig/images/rootfs.ext2,format=raw,id=hd0 -device 
virtio-blk-device,drive=hd0 -append "rootwait root=/dev/vda ro console=ttyS0 
earlycon=sbi" -netdev user,id=net0 -device virtio-net-device,netdev=net0

QEMU emulator version 6.2.50 (v6.2.0-29-g196d7182c8)
OpenSBI v0.9
[0.00] Linux version 5.16.0-rc6-00017-g750f87086bdd-dirty 
(guoren@guoren-Z87-HD3) (riscv64-unknown-linux-gnu-gcc (GCC) 10.2.0, GNU ld 
(GNU Binutils) 2.37) #96 SMP Tue Dec 28 21:01:55 CST 2021
[0.00] OF: fdt: Ignoring memory range 0x8000 - 0x8020
[0.00] Machine model: riscv-virtio,qemu
[0.00] earlycon: sbi0 at I/O port 0x0 (options '')
[0.00] printk: bootconsole [sbi0] enabled
[0.00] efi: UEFI not found.
[0.00] Zone ranges:
[0.00]   DMA32[mem 0x8020-0x83ff]
[0.00]   Normal   empty
[0.00] Movable zone start for each node
[0.00] Early memory node ranges
[0.00]   node   0: [mem 0x8020-0x83ff]
[0.00] Initmem setup node 0 [mem 0x8020-0x83ff]
[0.00] SBI specification v0.2 detected
[0.00] SBI implementation ID=0x1 Version=0x9
[0.00] SBI TIME extension detected
[0.00] SBI IPI extension detected
[0.00] SBI RFENCE extension detected
[0.00] SBI v0.2 HSM extension detected
[0.00] riscv: ISA extensions acdfhimsu
[0.00] riscv: ELF capabilities acdfim
[0.00] percpu: Embedded 17 pages/cpu s30696 r8192 d30744 u69632
[0.00] Built 1 zonelists, mobility grouping on.  Total pages: 

Re: [PATCH V5 21/21] KVM: compat: riscv: Prevent KVM_COMPAT from being selected

2022-02-22 Thread Palmer Dabbelt

On Tue, 01 Feb 2022 07:05:45 PST (-0800), guo...@kernel.org wrote:

From: Guo Ren 

Currently riscv doesn't support the 32-bit KVM API. Let's make that
clear by not selecting KVM_COMPAT.

Signed-off-by: Guo Ren 
Signed-off-by: Guo Ren 
Cc: Arnd Bergmann 
Cc: Anup Patel 
---
 virt/kvm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index f4834c20e4a6..a8c5c9f06b3c 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -53,7 +53,7 @@ config KVM_GENERIC_DIRTYLOG_READ_PROTECT

 config KVM_COMPAT
def_bool y
-   depends on KVM && COMPAT && !(S390 || ARM64)
+   depends on KVM && COMPAT && !(S390 || ARM64 || RISCV)

 config HAVE_KVM_IRQ_BYPASS
bool


Reviewed-by: Palmer Dabbelt 
Acked-by: Palmer Dabbelt 

I'm assuming Anup is going to take this as per the discussion, but LMK 
if you want me to take it along with the rest of the series.  There's 
some minor comments outstanding on the other patches.


Re: [PATCH V5 19/21] riscv: compat: ptrace: Add compat_arch_ptrace implement

2022-02-22 Thread Palmer Dabbelt

On Tue, 01 Feb 2022 07:05:43 PST (-0800), guo...@kernel.org wrote:

From: Guo Ren 

Now, you can use native gdb on riscv64 for rv32 app debugging.

$ uname -a
Linux buildroot 5.16.0-rc4-00036-gbef6b82fdf23-dirty #53 SMP Mon Dec 20 
23:06:53 CST 2021 riscv64 GNU/Linux
$ cat /proc/cpuinfo
processor   : 0
hart: 0
isa : rv64imafdcsuh
mmu : sv48

$ file /bin/busybox
/bin/busybox: setuid ELF 32-bit LSB shared object, UCB RISC-V, version 1 
(SYSV), dynamically linked, interpreter /lib/ld-linux-riscv32-ilp32d.so.1, for 
GNU/Linux 5.15.0, stripped
$ file /usr/bin/gdb
/usr/bin/gdb: ELF 32-bit LSB shared object, UCB RISC-V, version 1 (GNU/Linux), 
dynamically linked, interpreter /lib/ld-linux-riscv32-ilp32d.so.1, for 
GNU/Linux 5.15.0, stripped
$ /usr/bin/gdb /bin/busybox
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
...
Reading symbols from /bin/busybox...
(No debugging symbols found in /bin/busybox)
(gdb) b main
Breakpoint 1 at 0x8ddc
(gdb) r
Starting program: /bin/busybox
Failed to read a valid object file image from memory.

Breakpoint 1, 0x555a8ddc in main ()
(gdb) i r
ra 0x77df0b74   0x77df0b74
sp 0x7fdd3d10   0x7fdd3d10
gp 0x5567e800   0x5567e800 
tp 0x77f64280   0x77f64280
t0 0x0  0
t1 0x555a6fac   1431990188
t2 0x77dd8db4   2011008436
fp 0x7fdd3e34   0x7fdd3e34
s1 0x7fdd3e34   2145205812
a0 0x   -1
a1 0x2000   8192
a2 0x7fdd3e3c   2145205820
a3 0x0  0
a4 0x7fdd3d30   2145205552
a5 0x555a8dc0   1431997888
a6 0x77f2c170   2012397936
a7 0x6a7c7a2f   1786542639
s2 0x0  0
s3 0x0  0
s4 0x555a8dc0   1431997888
s5 0x77f8a3a8   2012783528
s6 0x7fdd3e3c   2145205820
s7 0x5567cecc   1432866508
--Type  for more, q to quit, c to continue without paging--
s8 0x1  1
s9 0x0  0
s100x55634448   1432568904
s110x0  0
t3 0x77df0bb8   2011106232
t4 0x42fc   17148
t5 0x0  0
t6 0x40 64
pc 0x555a8ddc   0x555a8ddc 
(gdb) si
0x555a78f0 in mallopt@plt ()
(gdb) c
Continuing.
BusyBox v1.34.1 (2021-12-19 22:39:48 CST) multi-call binary.
BusyBox is copyrighted by many authors between 1998-2015.
Licensed under GPLv2. See source distribution for detailed
copyright notices.

Usage: busybox [function [arguments]...]
   or: busybox --list[-full]
...
[Inferior 1 (process 107) exited normally]
(gdb) q

Signed-off-by: Guo Ren 
Signed-off-by: Guo Ren 
Cc: Arnd Bergmann 
Cc: Palmer Dabbelt 
---
 arch/riscv/kernel/ptrace.c | 87 +++---
 1 file changed, 82 insertions(+), 5 deletions(-)

diff --git a/arch/riscv/kernel/ptrace.c b/arch/riscv/kernel/ptrace.c
index a89243730153..bb387593a121 100644
--- a/arch/riscv/kernel/ptrace.c
+++ b/arch/riscv/kernel/ptrace.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -111,11 +112,6 @@ static const struct user_regset_view 
riscv_user_native_view = {
.n = ARRAY_SIZE(riscv_user_regset),
 };

-const struct user_regset_view *task_user_regset_view(struct task_struct *task)
-{
-   return &riscv_user_native_view;
-}
-
 struct pt_regs_offset {
const char *name;
int offset;
@@ -273,3 +269,84 @@ __visible void do_syscall_trace_exit(struct pt_regs *regs)
trace_sys_exit(regs, regs_return_value(regs));
 #endif
 }
+
+#ifdef CONFIG_COMPAT
+static int compat_riscv_gpr_get(struct task_struct *target,
+   const struct user_regset *regset,
+   struct membuf to)
+{
+   struct compat_user_regs_struct cregs;
+
+   regs_to_cregs(&cregs, task_pt_regs(target));
+
+   return membuf_write(&to, &cregs,
+   sizeof(struct compat_user_regs_struct));
+}
+
+static int compat_riscv_gpr_set(struct task_struct *target,
+   const struct user_regset *regset,
+   unsigned int pos, unsigned int count,
+   const void *kbuf, const void __user *ubuf)
+{
+   int ret;
+   struct compat_user_regs_struct cregs;
+
+   ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &cregs, 0, -1);
+
+   cregs_to_regs(&cregs, task_pt_regs(target));
+
+   return ret;
+}
+
+static const struct user_regset compat_riscv_user_regset[] = {
+   [REGSET_X] = {
+   .core_note_type = NT_PRSTATUS,
+   .n = ELF_NGREG,
+   .size = sizeof(compat_elf_greg_t),
+   .align = sizeof(compat_elf_greg_t),
+   .regset_get = 

Re: [PATCH V5 18/21] riscv: compat: signal: Add rt_frame implementation

2022-02-22 Thread Palmer Dabbelt

On Tue, 01 Feb 2022 07:05:42 PST (-0800), guo...@kernel.org wrote:

From: Guo Ren 

Implement compat_setup_rt_frame for sigcontext save & restore. The
main flow is the same as the native signal path, but the rv32 pt_regs
size differs from rv64's, so we need to convert between them.


It's kind of ugly to have two copies of essentially exactly the same 
code, just targeted at different structures.  The other ports have 
sufficiently different 32-bit and 64-bit ABIs that it makes sense there, 
but we should be able to share pretty much everything.  That said, all 
that would probably only ever benefit RISC-V so I'm not sure it'd be 
worth doing.


Reviewed-by: Palmer Dabbelt 

Happy to see someone clean this up later, but it seems good enough for 
now.


Thanks!


Signed-off-by: Guo Ren 
Signed-off-by: Guo Ren 
Cc: Arnd Bergmann 
Cc: Palmer Dabbelt 
---
 arch/riscv/kernel/Makefile|   1 +
 arch/riscv/kernel/compat_signal.c | 243 ++
 arch/riscv/kernel/signal.c|  13 +-
 3 files changed, 256 insertions(+), 1 deletion(-)
 create mode 100644 arch/riscv/kernel/compat_signal.c

diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 88e79f481c21..a46f9807c59e 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -67,4 +67,5 @@ obj-$(CONFIG_JUMP_LABEL)  += jump_label.o

 obj-$(CONFIG_EFI)  += efi.o
 obj-$(CONFIG_COMPAT)   += compat_syscall_table.o
+obj-$(CONFIG_COMPAT)   += compat_signal.o
 obj-$(CONFIG_COMPAT)   += compat_vdso/
diff --git a/arch/riscv/kernel/compat_signal.c 
b/arch/riscv/kernel/compat_signal.c
new file mode 100644
index ..7041742ded08
--- /dev/null
+++ b/arch/riscv/kernel/compat_signal.c
@@ -0,0 +1,243 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#define COMPAT_DEBUG_SIG 0
+
+struct compat_sigcontext {
+   struct compat_user_regs_struct sc_regs;
+   union __riscv_fp_state sc_fpregs;
+};
+
+struct compat_ucontext {
+   compat_ulong_t  uc_flags;
+   struct compat_ucontext  *uc_link;
+   compat_stack_t  uc_stack;
+   sigset_tuc_sigmask;
+   /* There's some padding here to allow sigset_t to be expanded in the
+* future.  Though this is unlikely, other architectures put uc_sigmask
+* at the end of this structure and explicitly state it can be
+* expanded, so we didn't want to box ourselves in here. */
+   __u8  __unused[1024 / 8 - sizeof(sigset_t)];
+   /* We can't put uc_sigmask at the end of this structure because we need
+* to be able to expand sigcontext in the future.  For example, the
+* vector ISA extension will almost certainly add ISA state.  We want
+* to ensure all user-visible ISA state can be saved and restored via a
+* ucontext, so we're putting this at the end in order to allow for
+* infinite extensibility.  Since we know this will be extended and we
+* assume sigset_t won't be extended an extreme amount, we're
+* prioritizing this. */
+   struct compat_sigcontext uc_mcontext;
+};
+
+struct compat_rt_sigframe {
+   struct compat_siginfo info;
+   struct compat_ucontext uc;
+};
+
+#ifdef CONFIG_FPU
+static long compat_restore_fp_state(struct pt_regs *regs,
+   union __riscv_fp_state __user *sc_fpregs)
+{
+   long err;
+   struct __riscv_d_ext_state __user *state = &sc_fpregs->d;
+   size_t i;
+
+   err = __copy_from_user(&current->thread.fstate, state, sizeof(*state));
+   if (unlikely(err))
+   return err;
+
+   fstate_restore(current, regs);
+
+   /* We support no other extension state at this time. */
+   for (i = 0; i < ARRAY_SIZE(sc_fpregs->q.reserved); i++) {
+   u32 value;
+
+   err = __get_user(value, &sc_fpregs->q.reserved[i]);
+   if (unlikely(err))
+   break;
+   if (value != 0)
+   return -EINVAL;
+   }
+
+   return err;
+}
+
+static long compat_save_fp_state(struct pt_regs *regs,
+ union __riscv_fp_state __user *sc_fpregs)
+{
+   long err;
+   struct __riscv_d_ext_state __user *state = &sc_fpregs->d;
+   size_t i;
+
+   fstate_save(current, regs);
+   err = __copy_to_user(state, &current->thread.fstate, sizeof(*state));
+   if (unlikely(err))
+   return err;
+
+   /* We support no other extension state at this time. */
+   for (i = 0; i < ARRAY_SIZE(sc_fpregs->q.reserved); i++) {
+   err = __put_user(0, &sc_fpregs->q.reserved[i]);
+   if (unlikely(err))
+   break;
+   }
+
+   return err;
+}
+#else
+#define compat_save_fp_state(task, regs) (0)
+#define compat_restore_fp_state(task, regs) (0)
+#endif
+
+static long 

Re: [PATCH V5 16/21] riscv: compat: vdso: Add rv32 VDSO base code implementation

2022-02-22 Thread Palmer Dabbelt

On Tue, 01 Feb 2022 07:05:40 PST (-0800), guo...@kernel.org wrote:

From: Guo Ren 

There is no vgettimeofday support in rv32, which makes it simple to
generate the rv32 vDSO code with only a riscv64 compiler. Other
architectures need to change the compiler or pass -m (machine
parameter) to build a vdso32. If rv32 gains vgettimeofday support
(which requires compiling C) in the future, we would add CROSS_COMPILE
support, which puts more requirements on the compiler environment.


IMO this is the wrong way to go, as there's some subtle differences 
between elf32 and elf64 (the .gnu.hash layout, for example).  I'm kind 
of surprised userspace tolerates this sort of thing at all, but given 
how easy it is to target rv32 from all toolchains (we don't need 
libraries here, so just -march should do it) I don't think it's worth 
chasing around the likely long-tail issues that will arise.



linux-rv64/arch/riscv/kernel/compat_vdso/compat_vdso.so.dbg:
file format elf64-littleriscv

Disassembly of section .text:

0000000000000800 <__vdso_rt_sigreturn>:
 800:   08b00893                li      a7,139
 804:   00000073                ecall
 808:   0000                    unimp
...

000000000000080c <__vdso_getcpu>:
 80c:   0a800893                li      a7,168
 810:   00000073                ecall
 814:   8082                    ret
...

0000000000000818 <__vdso_flush_icache>:
 818:   10300893                li      a7,259
 81c:   00000073                ecall
 820:   8082                    ret

linux-rv32/arch/riscv/kernel/vdso/vdso.so.dbg:
file format elf32-littleriscv

Disassembly of section .text:

00000800 <__vdso_rt_sigreturn>:
 800:   08b00893                li      a7,139
 804:   00000073                ecall
 808:   0000                    unimp
...

0000080c <__vdso_getcpu>:
 80c:   0a800893                li      a7,168
 810:   00000073                ecall
 814:   8082                    ret
...

00000818 <__vdso_flush_icache>:
 818:   10300893                li      a7,259
 81c:   00000073                ecall
 820:   8082                    ret

Finally, reuse all *.S from vdso in compat_vdso that makes
implementation clear and readable.

Signed-off-by: Guo Ren 
Signed-off-by: Guo Ren 
Cc: Arnd Bergmann 
Cc: Palmer Dabbelt 
---
 arch/riscv/Makefile   |  5 ++
 arch/riscv/include/asm/vdso.h |  9 +++
 arch/riscv/kernel/Makefile|  1 +
 arch/riscv/kernel/compat_vdso/.gitignore  |  2 +
 arch/riscv/kernel/compat_vdso/Makefile| 68 +++
 arch/riscv/kernel/compat_vdso/compat_vdso.S   |  8 +++
 .../kernel/compat_vdso/compat_vdso.lds.S  |  3 +
 arch/riscv/kernel/compat_vdso/flush_icache.S  |  3 +
 .../compat_vdso/gen_compat_vdso_offsets.sh|  5 ++
 arch/riscv/kernel/compat_vdso/getcpu.S|  3 +
 arch/riscv/kernel/compat_vdso/note.S  |  3 +
 arch/riscv/kernel/compat_vdso/rt_sigreturn.S  |  3 +
 arch/riscv/kernel/vdso/vdso.S |  6 +-
 13 files changed, 118 insertions(+), 1 deletion(-)
 create mode 100644 arch/riscv/kernel/compat_vdso/.gitignore
 create mode 100644 arch/riscv/kernel/compat_vdso/Makefile
 create mode 100644 arch/riscv/kernel/compat_vdso/compat_vdso.S
 create mode 100644 arch/riscv/kernel/compat_vdso/compat_vdso.lds.S
 create mode 100644 arch/riscv/kernel/compat_vdso/flush_icache.S
 create mode 100755 arch/riscv/kernel/compat_vdso/gen_compat_vdso_offsets.sh
 create mode 100644 arch/riscv/kernel/compat_vdso/getcpu.S
 create mode 100644 arch/riscv/kernel/compat_vdso/note.S
 create mode 100644 arch/riscv/kernel/compat_vdso/rt_sigreturn.S

diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index a02e588c4947..f73d50552e09 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -106,12 +106,17 @@ libs-$(CONFIG_EFI_STUB) += 
$(objtree)/drivers/firmware/efi/libstub/lib.a
 PHONY += vdso_install
 vdso_install:
$(Q)$(MAKE) $(build)=arch/riscv/kernel/vdso $@
+   $(if $(CONFIG_COMPAT),$(Q)$(MAKE) \
+   $(build)=arch/riscv/kernel/compat_vdso $@)

 ifeq ($(KBUILD_EXTMOD),)
 ifeq ($(CONFIG_MMU),y)
 prepare: vdso_prepare
 vdso_prepare: prepare0
$(Q)$(MAKE) $(build)=arch/riscv/kernel/vdso 
include/generated/vdso-offsets.h
+   $(if $(CONFIG_COMPAT),$(Q)$(MAKE) \
+   $(build)=arch/riscv/kernel/compat_vdso 
include/generated/compat_vdso-offsets.h)
+
 endif
 endif

diff --git a/arch/riscv/include/asm/vdso.h b/arch/riscv/include/asm/vdso.h
index bc6f75f3a199..af981426fe0f 100644
--- a/arch/riscv/include/asm/vdso.h
+++ b/arch/riscv/include/asm/vdso.h
@@ -21,6 +21,15 @@

 #define VDSO_SYMBOL(base, name)
\
(void __user *)((unsigned long)(base) + __vdso_##name##_offset)
+
+#ifdef CONFIG_COMPAT
+#include 
+
+#define COMPAT_VDSO_SYMBOL(base, name) 
\
+   (void __user *)((unsigned long)(base) + compat__vdso_##name##_offset)
+
+#endif /* 

Re: [PATCH V5 12/21] riscv: compat: syscall: Add entry.S implementation

2022-02-22 Thread Palmer Dabbelt

On Tue, 01 Feb 2022 07:05:36 PST (-0800), guo...@kernel.org wrote:

From: Guo Ren 

Implement the entry of compat_sys_call_table[] in asm. Ref to
riscv-privileged spec 4.1.1 Supervisor Status Register (sstatus):

 BIT[32:33] = UXL[1:0]:
 - 1:32
 - 2:64
 - 3:128

Signed-off-by: Guo Ren 
Signed-off-by: Guo Ren 
Cc: Arnd Bergmann 
Cc: Palmer Dabbelt 
---
 arch/riscv/include/asm/csr.h |  7 +++
 arch/riscv/kernel/entry.S| 18 --
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index ae711692eec9..eed96fa62d66 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -36,6 +36,13 @@
 #define SR_SD  _AC(0x8000, UL) /* FS/XS dirty */
 #endif

+#ifdef CONFIG_COMPAT
+#define SR_UXL _AC(0x300000000, UL) /* XLEN mask for U-mode */
+#define SR_UXL_32  _AC(0x100000000, UL) /* XLEN = 32 for U-mode */
+#define SR_UXL_64  _AC(0x200000000, UL) /* XLEN = 64 for U-mode */
+#define SR_UXL_SHIFT   32
+#endif
+
 /* SATP flags */
 #ifndef CONFIG_64BIT
 #define SATP_PPN   _AC(0x003F, UL)
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index ed29e9c8f660..1951743f09b3 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -207,13 +207,27 @@ check_syscall_nr:
 * Syscall number held in a7.
 * If syscall number is above allowed value, redirect to ni_syscall.
 */
-   bgeu a7, t0, 1f
+   bgeu a7, t0, 3f
+#ifdef CONFIG_COMPAT
+   REG_L s0, PT_STATUS(sp)
+   srli s0, s0, SR_UXL_SHIFT
+   andi s0, s0, (SR_UXL >> SR_UXL_SHIFT)
+   li t0, (SR_UXL_32 >> SR_UXL_SHIFT)
+   sub t0, s0, t0
+   bnez t0, 1f
+
+   /* Call compat_syscall */
+   la s0, compat_sys_call_table
+   j 2f
+1:
+#endif
/* Call syscall */
la s0, sys_call_table
+2:
slli t0, a7, RISCV_LGPTR
add s0, s0, t0
REG_L s0, 0(s0)
-1:
+3:
jalr s0

 ret_from_syscall:


Reviewed-by: Palmer Dabbelt 


Re: [PATCH V5 13/21] riscv: compat: process: Add UXL_32 support in start_thread

2022-02-22 Thread Palmer Dabbelt

On Tue, 01 Feb 2022 07:05:37 PST (-0800), guo...@kernel.org wrote:

From: Guo Ren 

If the current task is in COMPAT mode, set SR_UXL_32 in the status CSR
before returning to userspace. We need CONFIG_COMPAT to prevent compile
errors with the rv32 defconfig.

Signed-off-by: Guo Ren 
Signed-off-by: Guo Ren 
Cc: Arnd Bergmann 
Cc: Palmer Dabbelt 
---
 arch/riscv/kernel/process.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index 03ac3aa611f5..1a666ad299b4 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -97,6 +97,11 @@ void start_thread(struct pt_regs *regs, unsigned long pc,
}
regs->epc = pc;
regs->sp = sp;
+
+#ifdef CONFIG_COMPAT
+   if (is_compat_task())
+   regs->status |= SR_UXL_32;


Not sure if I'm just misunderstanding the bit ops here, but aren't we 
trying to set the UXL field to 1 (for UXL=32)?  That should be a bit 
field set op, not just an OR.
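
For illustration, a read-modify-write of the whole UXL field (a minimal
sketch using the SR_UXL* definitions from patch 12 of this series, not
the code from the posted patch) would be something like:

	/* sketch: clear the two-bit UXL field, then select UXL=32 */
	regs->status = (regs->status & ~SR_UXL) | SR_UXL_32;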



+#endif
 }

 void flush_thread(void)


Additionally: this isn't really an issue so much with this patch, but it 
does bring up that we're relying on someone else to have set UXL=64 on 
CONFIG_COMPAT=n systems.  I don't see that in any spec anywhere, so we 
should really be setting UXL in Linux for all systems (ie, not just those
with COMPAT=y).
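
For illustration, always selecting an explicit UXL could look roughly like
the sketch below (it assumes the SR_UXL* definitions from patch 12 and
is_compat_task() from patch 09, and ignores the rv32 build where these
bits do not exist):

	/* sketch: never rely on firmware/boot state for the U-mode XLEN */
	regs->status &= ~SR_UXL;
	if (IS_ENABLED(CONFIG_COMPAT) && is_compat_task())
		regs->status |= SR_UXL_32;
	else
		regs->status |= SR_UXL_64;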


Re: [PATCH V5 17/21] riscv: compat: vdso: Add setup additional pages implementation

2022-02-22 Thread Palmer Dabbelt

On Tue, 01 Feb 2022 07:05:41 PST (-0800), guo...@kernel.org wrote:

From: Guo Ren 

Rework __setup_additional_pages() to take a vdso info pointer argument
so it can also handle compat_vdso_info. Also change the
vm_special_mapping *dm, *cm initialization to static.

Signed-off-by: Guo Ren 
Signed-off-by: Guo Ren 
Cc: Arnd Bergmann 
Cc: Palmer Dabbelt 
---
 arch/riscv/include/asm/elf.h |   5 ++
 arch/riscv/include/asm/mmu.h |   1 +
 arch/riscv/kernel/vdso.c | 104 +--
 3 files changed, 81 insertions(+), 29 deletions(-)

diff --git a/arch/riscv/include/asm/elf.h b/arch/riscv/include/asm/elf.h
index 3a4293dc7229..d87d3bcc758d 100644
--- a/arch/riscv/include/asm/elf.h
+++ b/arch/riscv/include/asm/elf.h
@@ -134,5 +134,10 @@ do {if ((ex).e_ident[EI_CLASS] == ELFCLASS32)  
\
 typedef compat_ulong_t compat_elf_greg_t;
 typedef compat_elf_greg_t  compat_elf_gregset_t[ELF_NGREG];

+extern int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
+ int uses_interp);
+#define compat_arch_setup_additional_pages \
+   compat_arch_setup_additional_pages
+
 #endif /* CONFIG_COMPAT */
 #endif /* _ASM_RISCV_ELF_H */
diff --git a/arch/riscv/include/asm/mmu.h b/arch/riscv/include/asm/mmu.h
index 0099dc116168..cedcf8ea3c76 100644
--- a/arch/riscv/include/asm/mmu.h
+++ b/arch/riscv/include/asm/mmu.h
@@ -16,6 +16,7 @@ typedef struct {
atomic_long_t id;
 #endif
void *vdso;
+   void *vdso_info;
 #ifdef CONFIG_SMP
/* A local icache flush is needed before user execution can resume. */
cpumask_t icache_stale_mask;
diff --git a/arch/riscv/kernel/vdso.c b/arch/riscv/kernel/vdso.c
index a9436a65161a..deca69524799 100644
--- a/arch/riscv/kernel/vdso.c
+++ b/arch/riscv/kernel/vdso.c
@@ -23,6 +23,9 @@ struct vdso_data {
 #endif

 extern char vdso_start[], vdso_end[];
+#ifdef CONFIG_COMPAT
+extern char compat_vdso_start[], compat_vdso_end[];
+#endif

 enum vvar_pages {
VVAR_DATA_PAGE_OFFSET,
@@ -30,6 +33,11 @@ enum vvar_pages {
VVAR_NR_PAGES,
 };

+enum rv_vdso_map {
+   RV_VDSO_MAP_VVAR,
+   RV_VDSO_MAP_VDSO,
+};
+
 #define VVAR_SIZE  (VVAR_NR_PAGES << PAGE_SHIFT)

 /*
@@ -52,12 +60,6 @@ struct __vdso_info {
struct vm_special_mapping *cm;
 };

-static struct __vdso_info vdso_info __ro_after_init = {
-   .name = "vdso",
-   .vdso_code_start = vdso_start,
-   .vdso_code_end = vdso_end,
-};
-
 static int vdso_mremap(const struct vm_special_mapping *sm,
   struct vm_area_struct *new_vma)
 {
@@ -66,35 +68,35 @@ static int vdso_mremap(const struct vm_special_mapping *sm,
return 0;
 }

-static int __init __vdso_init(void)
+static int __init __vdso_init(struct __vdso_info *vdso_info)
 {
unsigned int i;
struct page **vdso_pagelist;
unsigned long pfn;

-   if (memcmp(vdso_info.vdso_code_start, "\177ELF", 4)) {
+   if (memcmp(vdso_info->vdso_code_start, "\177ELF", 4)) {
pr_err("vDSO is not a valid ELF object!\n");
return -EINVAL;
}

-   vdso_info.vdso_pages = (
-   vdso_info.vdso_code_end -
-   vdso_info.vdso_code_start) >>
+   vdso_info->vdso_pages = (
+   vdso_info->vdso_code_end -
+   vdso_info->vdso_code_start) >>
PAGE_SHIFT;

-   vdso_pagelist = kcalloc(vdso_info.vdso_pages,
+   vdso_pagelist = kcalloc(vdso_info->vdso_pages,
sizeof(struct page *),
GFP_KERNEL);
if (vdso_pagelist == NULL)
return -ENOMEM;

/* Grab the vDSO code pages. */
-   pfn = sym_to_pfn(vdso_info.vdso_code_start);
+   pfn = sym_to_pfn(vdso_info->vdso_code_start);

-   for (i = 0; i < vdso_info.vdso_pages; i++)
+   for (i = 0; i < vdso_info->vdso_pages; i++)
vdso_pagelist[i] = pfn_to_page(pfn + i);

-   vdso_info.cm->pages = vdso_pagelist;
+   vdso_info->cm->pages = vdso_pagelist;

return 0;
 }
@@ -116,13 +118,14 @@ int vdso_join_timens(struct task_struct *task, struct 
time_namespace *ns)
 {
struct mm_struct *mm = task->mm;
struct vm_area_struct *vma;
+   struct __vdso_info *vdso_info = mm->context.vdso_info;


IIUC this is the only use for context.vdso_info?  If that's the case, 
can we just switch between VDSO targets based on __is_compat_task(task)?  
That'd save an mm_struct pointer, which is always nice.  It'd probably 
be worth cleaning up the arm64 port too, which zaps both mappings.
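
For illustration, the suggestion amounts to something like the helper
below (the helper name is made up; it assumes a static compat_vdso_info
object next to vdso_info and the TIF_32BIT flag from patch 09):

/* sketch: derive the vdso info from the task instead of caching it in mm */
static struct __vdso_info *task_vdso_info(struct task_struct *task)
{
	if (IS_ENABLED(CONFIG_COMPAT) &&
	    test_tsk_thread_flag(task, TIF_32BIT))
		return &compat_vdso_info;
	return &vdso_info;
}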




mmap_read_lock(mm);

for (vma = mm->mmap; vma; vma = vma->vm_next) {
unsigned long size = vma->vm_end - vma->vm_start;

-   if (vma_is_special_mapping(vma, vdso_info.dm))
+   if (vma_is_special_mapping(vma, vdso_info->dm))

Re: [PATCH V5 09/21] riscv: compat: Add basic compat data type implementation

2022-02-22 Thread Palmer Dabbelt

On Tue, 01 Feb 2022 07:05:33 PST (-0800), guo...@kernel.org wrote:

From: Guo Ren 

Implement riscv asm/compat.h for struct compat_xxx,
is_compat_task, compat_user_regset, and regset conversion.

The rv64 compat.h inherits most of the structs
from the generic one.

Signed-off-by: Guo Ren 
Signed-off-by: Guo Ren 
Cc: Arnd Bergmann 
Cc: Palmer Dabbelt 
---
 arch/riscv/include/asm/compat.h  | 129 +++
 arch/riscv/include/asm/thread_info.h |   1 +
 2 files changed, 130 insertions(+)
 create mode 100644 arch/riscv/include/asm/compat.h

diff --git a/arch/riscv/include/asm/compat.h b/arch/riscv/include/asm/compat.h
new file mode 100644
index ..2ac955b51148
--- /dev/null
+++ b/arch/riscv/include/asm/compat.h
@@ -0,0 +1,129 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __ASM_COMPAT_H
+#define __ASM_COMPAT_H
+
+#define COMPAT_UTS_MACHINE "riscv\0\0"
+
+/*
+ * Architecture specific compatibility types
+ */
+#include 
+#include 
+#include 
+#include 
+
+static inline int is_compat_task(void)
+{
+   return test_thread_flag(TIF_32BIT);
+}
+
+struct compat_user_regs_struct {
+   compat_ulong_t pc;
+   compat_ulong_t ra;
+   compat_ulong_t sp;
+   compat_ulong_t gp;
+   compat_ulong_t tp;
+   compat_ulong_t t0;
+   compat_ulong_t t1;
+   compat_ulong_t t2;
+   compat_ulong_t s0;
+   compat_ulong_t s1;
+   compat_ulong_t a0;
+   compat_ulong_t a1;
+   compat_ulong_t a2;
+   compat_ulong_t a3;
+   compat_ulong_t a4;
+   compat_ulong_t a5;
+   compat_ulong_t a6;
+   compat_ulong_t a7;
+   compat_ulong_t s2;
+   compat_ulong_t s3;
+   compat_ulong_t s4;
+   compat_ulong_t s5;
+   compat_ulong_t s6;
+   compat_ulong_t s7;
+   compat_ulong_t s8;
+   compat_ulong_t s9;
+   compat_ulong_t s10;
+   compat_ulong_t s11;
+   compat_ulong_t t3;
+   compat_ulong_t t4;
+   compat_ulong_t t5;
+   compat_ulong_t t6;
+};
+
+static inline void regs_to_cregs(struct compat_user_regs_struct *cregs,
+struct pt_regs *regs)
+{
+   cregs->pc= (compat_ulong_t) regs->epc;
+   cregs->ra= (compat_ulong_t) regs->ra;
+   cregs->sp= (compat_ulong_t) regs->sp;
+   cregs->gp= (compat_ulong_t) regs->gp;
+   cregs->tp= (compat_ulong_t) regs->tp;
+   cregs->t0= (compat_ulong_t) regs->t0;
+   cregs->t1= (compat_ulong_t) regs->t1;
+   cregs->t2= (compat_ulong_t) regs->t2;
+   cregs->s0= (compat_ulong_t) regs->s0;
+   cregs->s1= (compat_ulong_t) regs->s1;
+   cregs->a0= (compat_ulong_t) regs->a0;
+   cregs->a1= (compat_ulong_t) regs->a1;
+   cregs->a2= (compat_ulong_t) regs->a2;
+   cregs->a3= (compat_ulong_t) regs->a3;
+   cregs->a4= (compat_ulong_t) regs->a4;
+   cregs->a5= (compat_ulong_t) regs->a5;
+   cregs->a6= (compat_ulong_t) regs->a6;
+   cregs->a7= (compat_ulong_t) regs->a7;
+   cregs->s2= (compat_ulong_t) regs->s2;
+   cregs->s3= (compat_ulong_t) regs->s3;
+   cregs->s4= (compat_ulong_t) regs->s4;
+   cregs->s5= (compat_ulong_t) regs->s5;
+   cregs->s6= (compat_ulong_t) regs->s6;
+   cregs->s7= (compat_ulong_t) regs->s7;
+   cregs->s8= (compat_ulong_t) regs->s8;
+   cregs->s9= (compat_ulong_t) regs->s9;
+   cregs->s10   = (compat_ulong_t) regs->s10;
+   cregs->s11   = (compat_ulong_t) regs->s11;
+   cregs->t3= (compat_ulong_t) regs->t3;
+   cregs->t4= (compat_ulong_t) regs->t4;
+   cregs->t5= (compat_ulong_t) regs->t5;
+   cregs->t6= (compat_ulong_t) regs->t6;
+};
+
+static inline void cregs_to_regs(struct compat_user_regs_struct *cregs,
+struct pt_regs *regs)
+{
+   regs->epc= (unsigned long) cregs->pc;
+   regs->ra = (unsigned long) cregs->ra;
+   regs->sp = (unsigned long) cregs->sp;
+   regs->gp = (unsigned long) cregs->gp;
+   regs->tp = (unsigned long) cregs->tp;
+   regs->t0 = (unsigned long) cregs->t0;
+   regs->t1 = (unsigned long) cregs->t1;
+   regs->t2 = (unsigned long) cregs->t2;
+   regs->s0 = (unsigned long) cregs->s0;
+   regs->s1 = (unsigned long) cregs->s1;
+   regs->a0 = (unsigned long) cregs->a0;
+   regs->a1 = (unsigned long) cregs->a1;
+   regs->a2 = (unsigned long) cregs->a2;
+   regs->a3 = (unsigned long) cregs->a3;
+   regs->a4 = (unsigned long) cregs->a4;
+   regs->a5 = (unsigned long) cregs->a5;
+   regs->a6 = (unsigned long) cregs->a6;
+   regs->a7 = (unsigned long) cregs->a7;
+   regs->s2 = (unsigned long) cregs->s2;
+   regs->s3 = (unsigned long) cregs->s3;
+   regs->s4 = (unsigned long) cregs->s4;
+   regs->s5 = (unsigned long) cregs->s5;
+   regs->s6 

[PATCH] powerpc: Fix missing declaration of [en/dis]able_kernel_altivec()

2022-02-22 Thread Magali Lemes
When CONFIG_PPC64 is set and CONFIG_ALTIVEC is not, the following build
failures occur:

   drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/dc_fpu.c: In function 
'dc_fpu_begin':
>> drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/dc_fpu.c:61:17: error: 
>> implicit declaration of function 'enable_kernel_altivec'; did you mean 
>> 'enable_kernel_vsx'? [-Werror=implicit-function-declaration]
  61 | enable_kernel_altivec();
 | ^
 | enable_kernel_vsx
   drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/dc_fpu.c: In function 
'dc_fpu_end':
>> drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/dc_fpu.c:89:17: error: 
>> implicit declaration of function 'disable_kernel_altivec'; did you mean 
>> 'disable_kernel_vsx'? [-Werror=implicit-function-declaration]
  89 | disable_kernel_altivec();
 | ^~
 | disable_kernel_vsx
   cc1: some warnings being treated as errors

This commit adds stub instances of both enable_kernel_altivec() and
disable_kernel_altivec(), in the same way commit bd73758803c2 did for
enable_kernel_vsx() and disable_kernel_vsx().

Reported-by: kernel test robot 
Signed-off-by: Magali Lemes 
---
 arch/powerpc/include/asm/switch_to.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/include/asm/switch_to.h 
b/arch/powerpc/include/asm/switch_to.h
index 1f43ef696033..aee25e3ebf96 100644
--- a/arch/powerpc/include/asm/switch_to.h
+++ b/arch/powerpc/include/asm/switch_to.h
@@ -62,6 +62,15 @@ static inline void disable_kernel_altivec(void)
 #else
 static inline void save_altivec(struct task_struct *t) { }
 static inline void __giveup_altivec(struct task_struct *t) { }
+static inline void enable_kernel_altivec(void)
+{
+   BUILD_BUG();
+}
+
+static inline void disable_kernel_altivec(void)
+{
+   BUILD_BUG();
+}
 #endif
 
 #ifdef CONFIG_VSX
-- 
2.25.1



Re: [PATCH v5 1/6] module: Always have struct mod_tree_root

2022-02-22 Thread Aaron Tomlin
On Tue 2022-02-22 16:00 +0100, Christophe Leroy wrote:
> In order to separate text and data, we need to setup
> two rb trees.
> 
> This means that struct mod_tree_root is required even without
> MODULES_TREE_LOOKUP.
> 
> Signed-off-by: Christophe Leroy 
> ---
>  kernel/module/internal.h | 4 +++-
>  kernel/module/main.c | 5 -
>  2 files changed, 3 insertions(+), 6 deletions(-)

Reviewed-by: Aaron Tomlin 

-- 
Aaron Tomlin



Re: [PATCH v2 09/18] mips: use simpler access_ok()

2022-02-22 Thread Thomas Bogendoerfer
On Mon, Feb 21, 2022 at 03:31:23PM +0100, Arnd Bergmann wrote:
> I'll update it this way, otherwise I'd need help in form of a patch
> that changes the exception handling so __get_user/__put_user
> also return -EFAULT for an address error.

https://lore.kernel.org/all/20220222155345.138861-1-tsbog...@alpha.franken.de/

That does the trick.

Thomas.

-- 
Crap can work. Given enough thrust pigs will fly, but it's not necessarily a
good idea.[ RFC1925, 2.3 ]


[PATCH 08/11] swiotlb: make the swiotlb_init interface more useful

2022-02-22 Thread Christoph Hellwig
Pass a bool to indicate whether swiotlb needs to be enabled based on
the addressing needs, and replace the verbose argument with a set of
flags, including one to force-enable bounce buffering.

Note that this patch removes the possibility to force xen-swiotlb
use using swiotlb=force on the command line on x86 (arm and arm64
never supported that), but this interface will be restored shortly.
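
For readers skimming the series, the reworked entry point boils down to
roughly the following (a sketch assembled from the hunks below and from
patch 09, not copied verbatim from the patch):

/* bounce-buffer setup is now driven by a bool plus a set of flags */
#define SWIOTLB_VERBOSE	(1 << 0)	/* verbose initialization */
#define SWIOTLB_FORCE	(1 << 1)	/* force bounce buffering */

void __init swiotlb_init(bool addressing_limit, unsigned int flags);

/* e.g. arm64 now simply does: */
swiotlb_init(max_pfn > PFN_DOWN(arm64_dma_phys_limit), SWIOTLB_VERBOSE);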

Signed-off-by: Christoph Hellwig 
---
 arch/arm/mm/init.c |  6 +
 arch/arm64/mm/init.c   |  6 +
 arch/ia64/mm/init.c|  4 +--
 arch/mips/cavium-octeon/dma-octeon.c   |  2 +-
 arch/mips/loongson64/dma.c |  2 +-
 arch/mips/sibyte/common/dma.c  |  2 +-
 arch/powerpc/include/asm/swiotlb.h |  1 +
 arch/powerpc/mm/mem.c  |  3 ++-
 arch/powerpc/platforms/pseries/setup.c |  3 ---
 arch/riscv/mm/init.c   |  8 +-
 arch/s390/mm/init.c|  3 +--
 arch/x86/kernel/cpu/mshyperv.c |  8 --
 arch/x86/kernel/pci-dma.c  | 17 ++---
 arch/x86/mm/mem_encrypt_amd.c  |  3 ---
 drivers/xen/swiotlb-xen.c  |  4 +--
 include/linux/swiotlb.h| 15 ++-
 include/trace/events/swiotlb.h | 29 -
 kernel/dma/swiotlb.c   | 35 ++
 18 files changed, 57 insertions(+), 94 deletions(-)

diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index 6d0cb0f7bc54b..73f30d278b565 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -312,11 +312,7 @@ static void __init free_highpages(void)
 void __init mem_init(void)
 {
 #ifdef CONFIG_ARM_LPAE
-   if (swiotlb_force == SWIOTLB_FORCE ||
-   max_pfn > arm_dma_pfn_limit)
-   swiotlb_init(1);
-   else
-   swiotlb_force = SWIOTLB_NO_FORCE;
+   swiotlb_init(max_pfn > arm_dma_pfn_limit, SWIOTLB_VERBOSE);
 #endif
 
set_max_mapnr(pfn_to_page(max_pfn) - mem_map);
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index db63cc885771a..52102adda3d28 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -373,11 +373,7 @@ void __init bootmem_init(void)
  */
 void __init mem_init(void)
 {
-   if (swiotlb_force == SWIOTLB_FORCE ||
-   max_pfn > PFN_DOWN(arm64_dma_phys_limit))
-   swiotlb_init(1);
-   else if (!xen_swiotlb_detect())
-   swiotlb_force = SWIOTLB_NO_FORCE;
+   swiotlb_init(max_pfn > PFN_DOWN(arm64_dma_phys_limit), SWIOTLB_VERBOSE);
 
/* this will put all unused low memory onto the freelists */
memblock_free_all();
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 5d165607bf354..3c3e15b22608f 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -437,9 +437,7 @@ mem_init (void)
if (iommu_detected)
break;
 #endif
-#ifdef CONFIG_SWIOTLB
-   swiotlb_init(1);
-#endif
+   swiotlb_init(true, SWIOTLB_VERBOSE);
} while (0);
 
 #ifdef CONFIG_FLATMEM
diff --git a/arch/mips/cavium-octeon/dma-octeon.c 
b/arch/mips/cavium-octeon/dma-octeon.c
index fb7547e217263..9fbba6a8fa4c5 100644
--- a/arch/mips/cavium-octeon/dma-octeon.c
+++ b/arch/mips/cavium-octeon/dma-octeon.c
@@ -235,5 +235,5 @@ void __init plat_swiotlb_setup(void)
 #endif
 
swiotlb_adjust_size(swiotlbsize);
-   swiotlb_init(1);
+   swiotlb_init(true, SWIOTLB_VERBOSE);
 }
diff --git a/arch/mips/loongson64/dma.c b/arch/mips/loongson64/dma.c
index 364f2f27c8723..8220a1bc0db64 100644
--- a/arch/mips/loongson64/dma.c
+++ b/arch/mips/loongson64/dma.c
@@ -24,5 +24,5 @@ phys_addr_t dma_to_phys(struct device *dev, dma_addr_t daddr)
 
 void __init plat_swiotlb_setup(void)
 {
-   swiotlb_init(1);
+   swiotlb_init(true, SWIOTLB_VERBOSE);
 }
diff --git a/arch/mips/sibyte/common/dma.c b/arch/mips/sibyte/common/dma.c
index eb47a94f3583e..c5c2c782aff68 100644
--- a/arch/mips/sibyte/common/dma.c
+++ b/arch/mips/sibyte/common/dma.c
@@ -10,5 +10,5 @@
 
 void __init plat_swiotlb_setup(void)
 {
-   swiotlb_init(1);
+   swiotlb_init(true, SWIOTLB_VERBOSE);
 }
diff --git a/arch/powerpc/include/asm/swiotlb.h 
b/arch/powerpc/include/asm/swiotlb.h
index 3c1a1cd161286..4203b5e0a88ed 100644
--- a/arch/powerpc/include/asm/swiotlb.h
+++ b/arch/powerpc/include/asm/swiotlb.h
@@ -9,6 +9,7 @@
 #include 
 
 extern unsigned int ppc_swiotlb_enable;
+extern unsigned int ppc_swiotlb_flags;
 
 #ifdef CONFIG_SWIOTLB
 void swiotlb_detect_4g(void);
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 8e301cd8925b2..d99b8b5b40ca6 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 #include 
@@ -251,7 +252,7 @@ void __init mem_init(void)
if (is_secure_guest())
svm_swiotlb_init();
else
-   swiotlb_init(0);
+   swiotlb_init(ppc_swiotlb_enable, 

[PATCH 11/11] x86: remove cruft from <asm/dma-mapping.h>

2022-02-22 Thread Christoph Hellwig
<asm/dma-mapping.h> gets pulled in by all drivers using the DMA API.
Remove x86 internal variables and unnecessary includes from it.

Signed-off-by: Christoph Hellwig 
---
 arch/x86/include/asm/dma-mapping.h | 11 ---
 arch/x86/include/asm/iommu.h   |  2 ++
 2 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/dma-mapping.h 
b/arch/x86/include/asm/dma-mapping.h
index 256fd8115223d..1c66708e30623 100644
--- a/arch/x86/include/asm/dma-mapping.h
+++ b/arch/x86/include/asm/dma-mapping.h
@@ -2,17 +2,6 @@
 #ifndef _ASM_X86_DMA_MAPPING_H
 #define _ASM_X86_DMA_MAPPING_H
 
-/*
- * IOMMU interface. See Documentation/core-api/dma-api-howto.rst and
- * Documentation/core-api/dma-api.rst for documentation.
- */
-
-#include 
-#include 
-
-extern int iommu_merge;
-extern int panic_on_overflow;
-
 extern const struct dma_map_ops *dma_ops;
 
 static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus)
diff --git a/arch/x86/include/asm/iommu.h b/arch/x86/include/asm/iommu.h
index dba89ed40d38d..0bef44d30a278 100644
--- a/arch/x86/include/asm/iommu.h
+++ b/arch/x86/include/asm/iommu.h
@@ -8,6 +8,8 @@
 
 extern int force_iommu, no_iommu;
 extern int iommu_detected;
+extern int iommu_merge;
+extern int panic_on_overflow;
 
 #ifdef CONFIG_SWIOTLB
 extern bool x86_swiotlb_enable;
-- 
2.30.2



[PATCH 10/11] swiotlb: merge swiotlb-xen initialization into swiotlb

2022-02-22 Thread Christoph Hellwig
Allow passing a remap argument to the swiotlb initialization functions
to handle the Xen/x86 remap case.  ARM/ARM64 never did any remapping
from xen_swiotlb_fixup, so we don't even need that quirk.
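
At the interface level the change boils down to roughly the prototypes
below (a sketch inferred from the callers in this patch, not the exact
header hunk; the remap callback type is an assumption based on how
xen_swiotlb_fixup is passed in):

/* optional callback that fixes up/remaps the bounce buffer, as Xen needs */
void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
			       int (*remap)(void *tlb, unsigned long nslabs));

int swiotlb_init_late(size_t size, gfp_t gfp_mask,
		      int (*remap)(void *tlb, unsigned long nslabs));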

Signed-off-by: Christoph Hellwig 
---
 arch/arm/xen/mm.c   |  23 +++---
 arch/x86/include/asm/xen/page.h |   5 --
 arch/x86/kernel/pci-dma.c   |  27 ---
 arch/x86/pci/sta2x11-fixup.c|   2 +-
 drivers/xen/swiotlb-xen.c   | 128 +---
 include/linux/swiotlb.h |   7 +-
 include/xen/arm/page.h  |   1 -
 include/xen/swiotlb-xen.h   |   8 +-
 kernel/dma/swiotlb.c| 120 +++---
 9 files changed, 102 insertions(+), 219 deletions(-)

diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
index a7e54a087b802..58b40f87617d3 100644
--- a/arch/arm/xen/mm.c
+++ b/arch/arm/xen/mm.c
@@ -23,22 +23,20 @@
 #include 
 #include 
 
-unsigned long xen_get_swiotlb_free_pages(unsigned int order)
+static gfp_t xen_swiotlb_gfp(void)
 {
phys_addr_t base;
-   gfp_t flags = __GFP_NOWARN|__GFP_KSWAPD_RECLAIM;
u64 i;
 
for_each_mem_range(i, &base, NULL) {
if (base < (phys_addr_t)0x) {
if (IS_ENABLED(CONFIG_ZONE_DMA32))
-   flags |= __GFP_DMA32;
-   else
-   flags |= __GFP_DMA;
-   break;
+   return __GFP_DMA32;
+   return __GFP_DMA;
}
}
-   return __get_free_pages(flags, order);
+
+   return GFP_KERNEL;
 }
 
 static bool hypercall_cflush = false;
@@ -143,10 +141,15 @@ static int __init xen_mm_init(void)
if (!xen_swiotlb_detect())
return 0;
 
-   rc = xen_swiotlb_init();
/* we can work with the default swiotlb */
-   if (rc < 0 && rc != -EEXIST)
-   return rc;
+   if (!io_tlb_default_mem.nslabs) {
+   if (!xen_initial_domain())
+   return -EINVAL;
+   rc = swiotlb_init_late(swiotlb_size_or_default(),
+  xen_swiotlb_gfp(), NULL);
+   if (rc < 0)
+   return rc;
+   }
 
cflush.op = 0;
cflush.a.dev_bus_addr = 0;
diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index e989bc2269f54..1fc67df500145 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -357,9 +357,4 @@ static inline bool xen_arch_need_swiotlb(struct device *dev,
return false;
 }
 
-static inline unsigned long xen_get_swiotlb_free_pages(unsigned int order)
-{
-   return __get_free_pages(__GFP_NOWARN, order);
-}
-
 #endif /* _ASM_X86_XEN_PAGE_H */
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 9576a02a2590f..b849f11a756d0 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -40,7 +40,6 @@ int iommu_detected __read_mostly = 0;
 #ifdef CONFIG_SWIOTLB
 bool x86_swiotlb_enable;
 static unsigned int x86_swiotlb_flags;
-static bool xen_swiotlb;
 
 /*
  * If 4GB or more detected (and iommu=off not set) or if SME is active
@@ -50,17 +49,16 @@ static void __init pci_swiotlb_detect_4gb(void)
 {
 #ifdef CONFIG_SWIOTLB_XEN
if (xen_pv_domain()) {
-   if (xen_initial_domain() || x86_swiotlb_enable) {
-   xen_swiotlb = true;
-   xen_swiotlb_init_early();
-   dma_ops = &xen_swiotlb_dma_ops;
+   if (xen_initial_domain())
+   x86_swiotlb_enable = true;
 
+   if (x86_swiotlb_enable) {
+   dma_ops = &xen_swiotlb_dma_ops;
 #ifdef CONFIG_PCI
/* Make sure ACS will be enabled */
pci_request_acs();
 #endif
}
-   x86_swiotlb_enable = false;
return;
}
 #endif /* CONFIG_SWIOTLB_XEN */
@@ -91,7 +89,8 @@ void __init pci_iommu_alloc(void)
amd_iommu_detect();
detect_intel_iommu();
 #ifdef CONFIG_SWIOTLB
-   swiotlb_init(x86_swiotlb_enable, x86_swiotlb_flags);
+   swiotlb_init_remap(x86_swiotlb_enable, x86_swiotlb_flags,
+  xen_pv_domain() ? xen_swiotlb_fixup : NULL);
 #endif
 }
 
@@ -205,13 +204,17 @@ int pci_xen_swiotlb_init_late(void)
 {
int rc;
 
-   if (xen_swiotlb)
+   if (dma_ops == &xen_swiotlb_dma_ops)
return 0;
 
-   rc = xen_swiotlb_init();
-   if (rc)
-   return rc;
-
+   /* we can work with the default swiotlb */
+   if (!io_tlb_default_mem.nslabs) {
+   rc = swiotlb_init_late(swiotlb_size_or_default(),
+  GFP_KERNEL, xen_swiotlb_fixup);
+   if (rc < 0)
+   return rc;
+   }
+ 
/* XXX: this switches the dma ops under live devices! */
   

[PATCH 09/11] swiotlb: add a SWIOTLB_ANY flag to lift the low memory restriction

2022-02-22 Thread Christoph Hellwig
Power SVM wants to allocate a swiotlb buffer that is not restricted to
low memory for the trusted hypervisor scheme.  Consolidate the support
for this into the swiotlb_init interface by adding a new flag.

Signed-off-by: Christoph Hellwig 
---
 arch/powerpc/include/asm/svm.h   |  4 
 arch/powerpc/mm/mem.c|  5 +
 arch/powerpc/platforms/pseries/svm.c | 26 +-
 include/linux/swiotlb.h  |  1 +
 kernel/dma/swiotlb.c |  9 +++--
 5 files changed, 10 insertions(+), 35 deletions(-)

diff --git a/arch/powerpc/include/asm/svm.h b/arch/powerpc/include/asm/svm.h
index 7546402d796af..85580b30aba48 100644
--- a/arch/powerpc/include/asm/svm.h
+++ b/arch/powerpc/include/asm/svm.h
@@ -15,8 +15,6 @@ static inline bool is_secure_guest(void)
return mfmsr() & MSR_S;
 }
 
-void __init svm_swiotlb_init(void);
-
 void dtl_cache_ctor(void *addr);
 #define get_dtl_cache_ctor()   (is_secure_guest() ? dtl_cache_ctor : NULL)
 
@@ -27,8 +25,6 @@ static inline bool is_secure_guest(void)
return false;
 }
 
-static inline void svm_swiotlb_init(void) {}
-
 #define get_dtl_cache_ctor() NULL
 
 #endif /* CONFIG_PPC_SVM */
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index d99b8b5b40ca6..a4d65418c30a9 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -249,10 +249,7 @@ void __init mem_init(void)
 * back to to-down.
 */
memblock_set_bottom_up(true);
-   if (is_secure_guest())
-   svm_swiotlb_init();
-   else
-   swiotlb_init(ppc_swiotlb_enable, ppc_swiotlb_flags);
+   swiotlb_init(ppc_swiotlb_enable, ppc_swiotlb_flags);
 #endif
 
high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);
diff --git a/arch/powerpc/platforms/pseries/svm.c 
b/arch/powerpc/platforms/pseries/svm.c
index c5228f4969eb2..3b4045d508ec8 100644
--- a/arch/powerpc/platforms/pseries/svm.c
+++ b/arch/powerpc/platforms/pseries/svm.c
@@ -28,7 +28,7 @@ static int __init init_svm(void)
 * need to use the SWIOTLB buffer for DMA even if dma_capable() says
 * otherwise.
 */
-   swiotlb_force = SWIOTLB_FORCE;
+   ppc_swiotlb_flags |= SWIOTLB_ANY | SWIOTLB_FORCE;
 
/* Share the SWIOTLB buffer with the host. */
swiotlb_update_mem_attributes();
@@ -37,30 +37,6 @@ static int __init init_svm(void)
 }
 machine_early_initcall(pseries, init_svm);
 
-/*
- * Initialize SWIOTLB. Essentially the same as swiotlb_init(), except that it
- * can allocate the buffer anywhere in memory. Since the hypervisor doesn't 
have
- * any addressing limitation, we don't need to allocate it in low addresses.
- */
-void __init svm_swiotlb_init(void)
-{
-   unsigned char *vstart;
-   unsigned long bytes, io_tlb_nslabs;
-
-   io_tlb_nslabs = (swiotlb_size_or_default() >> IO_TLB_SHIFT);
-   io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
-
-   bytes = io_tlb_nslabs << IO_TLB_SHIFT;
-
-   vstart = memblock_alloc(PAGE_ALIGN(bytes), PAGE_SIZE);
-   if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, false))
-   return;
-
-
-   memblock_free(vstart, PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
-   panic("SVM: Cannot allocate SWIOTLB buffer");
-}
-
 int set_memory_encrypted(unsigned long addr, int numpages)
 {
if (!cc_platform_has(CC_ATTR_MEM_ENCRYPT))
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index dcecf953f7997..ee655f2e4d28b 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -15,6 +15,7 @@ struct scatterlist;
 
 #define SWIOTLB_VERBOSE(1 << 0) /* verbose initialization */
 #define SWIOTLB_FORCE  (1 << 1) /* force bounce buffering */
+#define SWIOTLB_ANY(1 << 2) /* allow any memory for the buffer */
 
 /*
  * Maximum allowable number of contiguous slabs to map,
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index ad604e5a0983d..ec200e40fc397 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -275,8 +275,13 @@ void __init swiotlb_init(bool addressing_limit, unsigned 
int flags)
if (swiotlb_force_disable)
return;
 
-   /* Get IO TLB memory from the low pages */
-   tlb = memblock_alloc_low(bytes, PAGE_SIZE);
+   /*
+* By default allocate the bounce buffer memory from low memory.
+*/
+   if (flags & SWIOTLB_ANY)
+   tlb = memblock_alloc(bytes, PAGE_SIZE);
+   else
+   tlb = memblock_alloc_low(bytes, PAGE_SIZE);
if (!tlb)
goto fail;
if (swiotlb_init_with_tbl(tlb, default_nslabs, flags))
-- 
2.30.2



[PATCH 07/11] x86: remove the IOMMU table infrastructure

2022-02-22 Thread Christoph Hellwig
The IOMMU table tries to separate the different IOMMUs into different
backends, but actually requires various cross calls.

Rewrite the code to do the generic swiotlb/swiotlb-xen setup directly
in pci-dma.c and then just call into the IOMMU drivers.

Signed-off-by: Christoph Hellwig 
---
 arch/ia64/include/asm/iommu_table.h|   7 --
 arch/x86/include/asm/dma-mapping.h |   1 -
 arch/x86/include/asm/gart.h|   5 +-
 arch/x86/include/asm/iommu.h   |   6 ++
 arch/x86/include/asm/iommu_table.h | 102 --
 arch/x86/include/asm/swiotlb.h |  30 ---
 arch/x86/include/asm/xen/swiotlb-xen.h |   2 -
 arch/x86/kernel/Makefile   |   2 -
 arch/x86/kernel/amd_gart_64.c  |   5 +-
 arch/x86/kernel/aperture_64.c  |  14 ++--
 arch/x86/kernel/pci-dma.c  | 112 -
 arch/x86/kernel/pci-iommu_table.c  |  77 -
 arch/x86/kernel/pci-swiotlb.c  |  77 -
 arch/x86/kernel/tboot.c|   1 -
 arch/x86/kernel/vmlinux.lds.S  |  12 ---
 arch/x86/xen/Makefile  |   2 -
 arch/x86/xen/pci-swiotlb-xen.c |  96 -
 drivers/iommu/amd/init.c   |   6 --
 drivers/iommu/amd/iommu.c  |   5 +-
 drivers/iommu/intel/dmar.c |   6 +-
 include/linux/dmar.h   |   6 +-
 21 files changed, 115 insertions(+), 459 deletions(-)
 delete mode 100644 arch/ia64/include/asm/iommu_table.h
 delete mode 100644 arch/x86/include/asm/iommu_table.h
 delete mode 100644 arch/x86/include/asm/swiotlb.h
 delete mode 100644 arch/x86/kernel/pci-iommu_table.c
 delete mode 100644 arch/x86/kernel/pci-swiotlb.c
 delete mode 100644 arch/x86/xen/pci-swiotlb-xen.c

diff --git a/arch/ia64/include/asm/iommu_table.h 
b/arch/ia64/include/asm/iommu_table.h
deleted file mode 100644
index cc96116ac276a..0
--- a/arch/ia64/include/asm/iommu_table.h
+++ /dev/null
@@ -1,7 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_IA64_IOMMU_TABLE_H
-#define _ASM_IA64_IOMMU_TABLE_H
-
-#define IOMMU_INIT_POST(_detect)
-
-#endif /* _ASM_IA64_IOMMU_TABLE_H */
diff --git a/arch/x86/include/asm/dma-mapping.h 
b/arch/x86/include/asm/dma-mapping.h
index bb1654fe0ce74..256fd8115223d 100644
--- a/arch/x86/include/asm/dma-mapping.h
+++ b/arch/x86/include/asm/dma-mapping.h
@@ -9,7 +9,6 @@
 
 #include 
 #include 
-#include 
 
 extern int iommu_merge;
 extern int panic_on_overflow;
diff --git a/arch/x86/include/asm/gart.h b/arch/x86/include/asm/gart.h
index 3185565743459..5af8088a10df6 100644
--- a/arch/x86/include/asm/gart.h
+++ b/arch/x86/include/asm/gart.h
@@ -38,7 +38,7 @@ extern int gart_iommu_aperture_disabled;
 extern void early_gart_iommu_check(void);
 extern int gart_iommu_init(void);
 extern void __init gart_parse_options(char *);
-extern int gart_iommu_hole_init(void);
+void gart_iommu_hole_init(void);
 
 #else
 #define gart_iommu_aperture0
@@ -51,9 +51,8 @@ static inline void early_gart_iommu_check(void)
 static inline void gart_parse_options(char *options)
 {
 }
-static inline int gart_iommu_hole_init(void)
+static inline void gart_iommu_hole_init(void)
 {
-   return -ENODEV;
 }
 #endif
 
diff --git a/arch/x86/include/asm/iommu.h b/arch/x86/include/asm/iommu.h
index bf1ed2ddc74bd..dba89ed40d38d 100644
--- a/arch/x86/include/asm/iommu.h
+++ b/arch/x86/include/asm/iommu.h
@@ -9,6 +9,12 @@
 extern int force_iommu, no_iommu;
 extern int iommu_detected;
 
+#ifdef CONFIG_SWIOTLB
+extern bool x86_swiotlb_enable;
+#else
+#define x86_swiotlb_enable false
+#endif
+
 /* 10 seconds */
 #define DMAR_OPERATION_TIMEOUT ((cycles_t) tsc_khz*10*1000)
 
diff --git a/arch/x86/include/asm/iommu_table.h 
b/arch/x86/include/asm/iommu_table.h
deleted file mode 100644
index 1fb3fd1a83c25..0
--- a/arch/x86/include/asm/iommu_table.h
+++ /dev/null
@@ -1,102 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_X86_IOMMU_TABLE_H
-#define _ASM_X86_IOMMU_TABLE_H
-
-#include 
-
-/*
- * History lesson:
- * The execution chain of IOMMUs in 2.6.36 looks as so:
- *
- *[xen-swiotlb]
- * |
- * +[swiotlb *]--+
- */ | \
- *   /  |  \
- *[GART] [Calgary]  [Intel VT-d]
- * /
- */
- * [AMD-Vi]
- *
- * *: if SWIOTLB detected 'iommu=soft'/'swiotlb=force' it would skip
- * over the rest of IOMMUs and unconditionally initialize the SWIOTLB.
- * Also it would surreptitiously initialize set the swiotlb=1 if there were
- * more than 4GB and if the user did not pass in 'iommu=off'. The swiotlb
- * flag would be turned off by all IOMMUs except the Calgary one.
- *
- * The IOMMU_INIT* macros allow a similar tree (or more complex if desired)
- * to be built by defining who we depend on.
- *
- * And all that needs to be done is to use one of the macros in the IOMMU
- * and the pci-dma.c will take care of the rest.
- */
-
-struct 

[PATCH 06/11] MIPS/octeon: use swiotlb_init instead of open coding it

2022-02-22 Thread Christoph Hellwig
Use the generic swiotlb initialization helper instead of open coding it.

Signed-off-by: Christoph Hellwig 
---
 arch/mips/cavium-octeon/dma-octeon.c | 15 ++-
 arch/mips/pci/pci-octeon.c   |  2 +-
 2 files changed, 3 insertions(+), 14 deletions(-)

diff --git a/arch/mips/cavium-octeon/dma-octeon.c 
b/arch/mips/cavium-octeon/dma-octeon.c
index df70308db0e69..fb7547e217263 100644
--- a/arch/mips/cavium-octeon/dma-octeon.c
+++ b/arch/mips/cavium-octeon/dma-octeon.c
@@ -186,15 +186,12 @@ phys_addr_t dma_to_phys(struct device *dev, dma_addr_t 
daddr)
return daddr;
 }
 
-char *octeon_swiotlb;
-
 void __init plat_swiotlb_setup(void)
 {
phys_addr_t start, end;
phys_addr_t max_addr;
phys_addr_t addr_size;
size_t swiotlbsize;
-   unsigned long swiotlb_nslabs;
u64 i;
 
max_addr = 0;
@@ -236,15 +233,7 @@ void __init plat_swiotlb_setup(void)
if (OCTEON_IS_OCTEON2() && max_addr >= 0x1ul)
swiotlbsize = 64 * (1<<20);
 #endif
-   swiotlb_nslabs = swiotlbsize >> IO_TLB_SHIFT;
-   swiotlb_nslabs = ALIGN(swiotlb_nslabs, IO_TLB_SEGSIZE);
-   swiotlbsize = swiotlb_nslabs << IO_TLB_SHIFT;
-
-   octeon_swiotlb = memblock_alloc_low(swiotlbsize, PAGE_SIZE);
-   if (!octeon_swiotlb)
-   panic("%s: Failed to allocate %zu bytes align=%lx\n",
- __func__, swiotlbsize, PAGE_SIZE);
 
-   if (swiotlb_init_with_tbl(octeon_swiotlb, swiotlb_nslabs, 1) == -ENOMEM)
-   panic("Cannot allocate SWIOTLB buffer");
+   swiotlb_adjust_size(swiotlbsize);
+   swiotlb_init(1);
 }
diff --git a/arch/mips/pci/pci-octeon.c b/arch/mips/pci/pci-octeon.c
index fc29b85cfa926..e457a18cbdc59 100644
--- a/arch/mips/pci/pci-octeon.c
+++ b/arch/mips/pci/pci-octeon.c
@@ -664,7 +664,7 @@ static int __init octeon_pci_setup(void)
 
/* BAR1 movable regions contiguous to cover the swiotlb */
octeon_bar1_pci_phys =
-   virt_to_phys(octeon_swiotlb) & ~((1ull << 22) - 1);
+   io_tlb_default_mem.start & ~((1ull << 22) - 1);
 
for (index = 0; index < 32; index++) {
union cvmx_pci_bar1_indexx bar1_index;
-- 
2.30.2



[PATCH 03/11] swiotlb: simplify swiotlb_max_segment

2022-02-22 Thread Christoph Hellwig
Remove the bogus Xen override that was usually larger than the actual
size and just calculate the value on demand.  Note that
swiotlb_max_segment still doesn't make sense as an interface and should
eventually be removed.
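
As a hedged illustration of the consumer side (driver code like this is
not part of the patch, and the function below is made up), the returned
value is meant to cap scatter-gather segment sizes so that physically
contiguous buffers never have to be bounced:

    /* Example only: clamp a driver's preferred segment size. */
    static unsigned int example_sg_segment_size(unsigned int preferred)
    {
            unsigned int max_seg = swiotlb_max_segment();

            if (!max_seg)           /* no SWIOTLB: no extra restriction */
                    return preferred;
            return min(preferred, max_seg);
    }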

Signed-off-by: Christoph Hellwig 
---
 drivers/xen/swiotlb-xen.c |  2 --
 include/linux/swiotlb.h   |  1 -
 kernel/dma/swiotlb.c  | 20 +++-
 3 files changed, 3 insertions(+), 20 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 47aebd98f52f5..485cd06ed39e7 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -202,7 +202,6 @@ int xen_swiotlb_init(void)
rc = swiotlb_late_init_with_tbl(start, nslabs);
if (rc)
return rc;
-   swiotlb_set_max_segment(PAGE_SIZE);
return 0;
 error:
if (nslabs > 1024 && repeat--) {
@@ -254,7 +253,6 @@ void __init xen_swiotlb_init_early(void)
 
if (swiotlb_init_with_tbl(start, nslabs, true))
panic("Cannot allocate SWIOTLB buffer");
-   swiotlb_set_max_segment(PAGE_SIZE);
 }
 #endif /* CONFIG_X86 */
 
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index f6c3638255d54..9fb3a568f0c51 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -164,7 +164,6 @@ static inline void swiotlb_adjust_size(unsigned long size)
 #endif /* CONFIG_SWIOTLB */
 
 extern void swiotlb_print_info(void);
-extern void swiotlb_set_max_segment(unsigned int);
 
 #ifdef CONFIG_DMA_RESTRICTED_POOL
 struct page *swiotlb_alloc(struct device *dev, size_t size);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 36fbf1181d285..519e363097190 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -75,12 +75,6 @@ struct io_tlb_mem io_tlb_default_mem;
 
 phys_addr_t swiotlb_unencrypted_base;
 
-/*
- * Max segment that we can provide which (if pages are contingous) will
- * not be bounced (unless SWIOTLB_FORCE is set).
- */
-static unsigned int max_segment;
-
 static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
 
 static int __init
@@ -104,18 +98,12 @@ early_param("swiotlb", setup_io_tlb_npages);
 
 unsigned int swiotlb_max_segment(void)
 {
-   return io_tlb_default_mem.nslabs ? max_segment : 0;
+   if (!io_tlb_default_mem.nslabs)
+   return 0;
+   return rounddown(io_tlb_default_mem.nslabs << IO_TLB_SHIFT, PAGE_SIZE);
 }
 EXPORT_SYMBOL_GPL(swiotlb_max_segment);
 
-void swiotlb_set_max_segment(unsigned int val)
-{
-   if (swiotlb_force == SWIOTLB_FORCE)
-   max_segment = 1;
-   else
-   max_segment = rounddown(val, PAGE_SIZE);
-}
-
 unsigned long swiotlb_size_or_default(void)
 {
return default_nslabs << IO_TLB_SHIFT;
@@ -267,7 +255,6 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long 
nslabs, int verbose)
 
if (verbose)
swiotlb_print_info();
-   swiotlb_set_max_segment(mem->nslabs << IO_TLB_SHIFT);
return 0;
 }
 
@@ -368,7 +355,6 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
swiotlb_init_io_tlb_mem(mem, virt_to_phys(tlb), nslabs, true);
 
swiotlb_print_info();
-   swiotlb_set_max_segment(mem->nslabs << IO_TLB_SHIFT);
return 0;
 }
 
-- 
2.30.2



[PATCH 05/11] swiotlb: pass a gfp_mask argument to swiotlb_init_late

2022-02-22 Thread Christoph Hellwig
Let the caller choose a zone to allocate from.
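
As a hedged usage note (not a caller added by this series), a platform
without a low-memory restriction could now pass an ordinary kernel
zone, for example:

    if (swiotlb_init_late(64 << 20, GFP_KERNEL))    /* size arbitrary */
            pr_warn("swiotlb: late initialization failed\n");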

Signed-off-by: Christoph Hellwig 
---
 arch/x86/pci/sta2x11-fixup.c | 2 +-
 include/linux/swiotlb.h  | 2 +-
 kernel/dma/swiotlb.c | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/pci/sta2x11-fixup.c b/arch/x86/pci/sta2x11-fixup.c
index e0c039a75b2db..c7e6faf59a861 100644
--- a/arch/x86/pci/sta2x11-fixup.c
+++ b/arch/x86/pci/sta2x11-fixup.c
@@ -57,7 +57,7 @@ static void sta2x11_new_instance(struct pci_dev *pdev)
int size = STA2X11_SWIOTLB_SIZE;
/* First instance: register your own swiotlb area */
dev_info(&pdev->dev, "Using SWIOTLB (size %i)\n", size);
-   if (swiotlb_init_late(size))
+   if (swiotlb_init_late(size, GFP_DMA))
dev_emerg(&pdev->dev, "init swiotlb failed\n");
}
list_add(&instance->list, &sta2x11_instance_list);
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index b48b26bfa0edb..1befd6b2ccf5e 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -40,7 +40,7 @@ extern void swiotlb_init(int verbose);
 int swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose);
 unsigned long swiotlb_size_or_default(void);
 extern int swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs);
-int swiotlb_init_late(size_t size);
+int swiotlb_init_late(size_t size, gfp_t gfp_mask);
 extern void __init swiotlb_update_mem_attributes(void);
 
 phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t phys,
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 5f64b02fbb732..a653fcf1fe6c2 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -290,7 +290,7 @@ swiotlb_init(int verbose)
  * initialize the swiotlb later using the slab allocator if needed.
  * This should be just like above, but with some error catching.
  */
-int swiotlb_init_late(size_t size)
+int swiotlb_init_late(size_t size, gfp_t gfp_mask)
 {
unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
unsigned long bytes;
@@ -309,7 +309,7 @@ int swiotlb_init_late(size_t size)
bytes = nslabs << IO_TLB_SHIFT;
 
while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) {
-   vstart = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
+   vstart = (void *)__get_free_pages(gfp_mask | __GFP_NOWARN,
  order);
if (vstart)
break;
-- 
2.30.2



[PATCH 04/11] swiotlb: rename swiotlb_late_init_with_default_size

2022-02-22 Thread Christoph Hellwig
swiotlb_late_init_with_default_size is an overly verbose name that
doesn't even capture what the function is doing, given that the size is
not just a default but the actual requested size.

Rename it to swiotlb_init_late.

Signed-off-by: Christoph Hellwig 
---
 arch/x86/pci/sta2x11-fixup.c | 2 +-
 include/linux/swiotlb.h  | 2 +-
 kernel/dma/swiotlb.c | 6 ++
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/pci/sta2x11-fixup.c b/arch/x86/pci/sta2x11-fixup.c
index 101081ad64b6d..e0c039a75b2db 100644
--- a/arch/x86/pci/sta2x11-fixup.c
+++ b/arch/x86/pci/sta2x11-fixup.c
@@ -57,7 +57,7 @@ static void sta2x11_new_instance(struct pci_dev *pdev)
int size = STA2X11_SWIOTLB_SIZE;
/* First instance: register your own swiotlb area */
dev_info(&pdev->dev, "Using SWIOTLB (size %i)\n", size);
-   if (swiotlb_late_init_with_default_size(size))
+   if (swiotlb_init_late(size))
dev_emerg(&pdev->dev, "init swiotlb failed\n");
}
list_add(&instance->list, &sta2x11_instance_list);
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 9fb3a568f0c51..b48b26bfa0edb 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -40,7 +40,7 @@ extern void swiotlb_init(int verbose);
 int swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose);
 unsigned long swiotlb_size_or_default(void);
 extern int swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs);
-extern int swiotlb_late_init_with_default_size(size_t default_size);
+int swiotlb_init_late(size_t size);
 extern void __init swiotlb_update_mem_attributes(void);
 
 phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t phys,
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 519e363097190..5f64b02fbb732 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -290,11 +290,9 @@ swiotlb_init(int verbose)
  * initialize the swiotlb later using the slab allocator if needed.
  * This should be just like above, but with some error catching.
  */
-int
-swiotlb_late_init_with_default_size(size_t default_size)
+int swiotlb_init_late(size_t size)
 {
-   unsigned long nslabs =
-   ALIGN(default_size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
+   unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
unsigned long bytes;
unsigned char *vstart = NULL;
unsigned int order;
-- 
2.30.2



[PATCH 02/11] swiotlb: make swiotlb_exit a no-op if SWIOTLB_FORCE is set

2022-02-22 Thread Christoph Hellwig
If force bouncing is enabled we can't release the buffers.

Signed-off-by: Christoph Hellwig 
---
 kernel/dma/swiotlb.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index f1e7ea160b433..36fbf1181d285 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -378,6 +378,9 @@ void __init swiotlb_exit(void)
unsigned long tbl_vaddr;
size_t tbl_size, slots_size;
 
+   if (swiotlb_force == SWIOTLB_FORCE)
+   return;
+
if (!mem->nslabs)
return;
 
-- 
2.30.2



cleanup swiotlb initialization

2022-02-22 Thread Christoph Hellwig
Hi all,

this series tries to clean up the swiotlb initialization, including
that of swiotlb-xen.  To get there it also removes the x86 iommu table
infrastructure that massively obfuscates the initialization path.

Git tree:

git://git.infradead.org/users/hch/misc.git swiotlb-init-cleanup

Gitweb:


http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/swiotlb-init-cleanup

Diffstat:
 arch/ia64/include/asm/iommu_table.h  |7 -
 arch/x86/include/asm/iommu_table.h   |  102 ---
 arch/x86/include/asm/swiotlb.h   |   30 -
 arch/x86/kernel/pci-iommu_table.c|   77 --
 arch/x86/kernel/pci-swiotlb.c|   77 --
 arch/x86/xen/pci-swiotlb-xen.c   |   96 --
 b/arch/arm/mm/init.c |6 -
 b/arch/arm/xen/mm.c  |   23 ++--
 b/arch/arm64/mm/init.c   |6 -
 b/arch/ia64/mm/init.c|4 
 b/arch/mips/cavium-octeon/dma-octeon.c   |   15 --
 b/arch/mips/loongson64/dma.c |2 
 b/arch/mips/pci/pci-octeon.c |2 
 b/arch/mips/sibyte/common/dma.c  |2 
 b/arch/powerpc/include/asm/svm.h |4 
 b/arch/powerpc/include/asm/swiotlb.h |1 
 b/arch/powerpc/mm/mem.c  |6 -
 b/arch/powerpc/platforms/pseries/setup.c |3 
 b/arch/powerpc/platforms/pseries/svm.c   |   26 
 b/arch/riscv/mm/init.c   |8 -
 b/arch/s390/mm/init.c|3 
 b/arch/x86/include/asm/dma-mapping.h |   12 --
 b/arch/x86/include/asm/gart.h|5 
 b/arch/x86/include/asm/iommu.h   |8 +
 b/arch/x86/include/asm/xen/page.h|5 
 b/arch/x86/include/asm/xen/swiotlb-xen.h |2 
 b/arch/x86/kernel/Makefile   |2 
 b/arch/x86/kernel/amd_gart_64.c  |5 
 b/arch/x86/kernel/aperture_64.c  |   14 --
 b/arch/x86/kernel/cpu/mshyperv.c |8 -
 b/arch/x86/kernel/pci-dma.c  |  114 +
 b/arch/x86/kernel/tboot.c|1 
 b/arch/x86/kernel/vmlinux.lds.S  |   12 --
 b/arch/x86/mm/mem_encrypt_amd.c  |3 
 b/arch/x86/pci/sta2x11-fixup.c   |2 
 b/arch/x86/xen/Makefile  |2 
 b/drivers/iommu/amd/init.c   |6 -
 b/drivers/iommu/amd/iommu.c  |5 
 b/drivers/iommu/intel/dmar.c |6 -
 b/drivers/xen/swiotlb-xen.c  |  132 -
 b/include/linux/dmar.h   |6 -
 b/include/linux/swiotlb.h|   22 ++--
 b/include/trace/events/swiotlb.h |   29 +
 b/include/xen/arm/page.h |1 
 b/include/xen/swiotlb-xen.h  |8 +
 b/kernel/dma/direct.h|2 
 b/kernel/dma/swiotlb.c   |  163 +++
 47 files changed, 258 insertions(+), 817 deletions(-)


[PATCH 01/11] dma-direct: use is_swiotlb_active in dma_direct_map_page

2022-02-22 Thread Christoph Hellwig
Use the more specific is_swiotlb_active check instead of checking the
global swiotlb_force variable.

Signed-off-by: Christoph Hellwig 
---
 kernel/dma/direct.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index 4632b0f4f72eb..4dc16e08c7e1a 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -91,7 +91,7 @@ static inline dma_addr_t dma_direct_map_page(struct device 
*dev,
return swiotlb_map(dev, phys, size, dir, attrs);
 
if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
-   if (swiotlb_force != SWIOTLB_NO_FORCE)
+   if (is_swiotlb_active(dev))
return swiotlb_map(dev, phys, size, dir, attrs);
 
dev_WARN_ONCE(dev, 1,
-- 
2.30.2



Re: [PATCH] platforms/83xx: Use of_device_get_match_data()

2022-02-22 Thread Christophe Leroy
Resending as I accidentally sent my response to the list only.

On 21/02/2022 at 03:03, cgel@gmail.com wrote:
> From: Minghao Chi (CGEL ZTE) 
> 
> Use of_device_get_match_data() to simplify the code.
> 
> Reported-by: Zeal Robot 
> Signed-off-by: Minghao Chi (CGEL ZTE) 
> ---
>   arch/powerpc/platforms/83xx/suspend.c | 7 +--
>   1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/83xx/suspend.c 
> b/arch/powerpc/platforms/83xx/suspend.c
> index bb147d34d4a6..9ae9268b683c 100644
> --- a/arch/powerpc/platforms/83xx/suspend.c
> +++ b/arch/powerpc/platforms/83xx/suspend.c
> @@ -322,17 +322,12 @@ static const struct platform_suspend_ops 
> mpc83xx_suspend_ops = {
>   static const struct of_device_id pmc_match[];
>   static int pmc_probe(struct platform_device *ofdev)
>   {
> - const struct of_device_id *match;
>   struct device_node *np = ofdev->dev.of_node;
>   struct resource res;
>   const struct pmc_type *type;
>   int ret = 0;
>   
> - match = of_match_device(pmc_match, &ofdev->dev);
> - if (!match)
> - return -EINVAL;
> -
> - type = match->data;
> + type = of_device_get_match_data(&ofdev->dev);

What happens when of_device_get_match_data() returns NULL ?

>   
>   if (!of_device_is_available(np))
>   return -ENODEV;
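
For illustration, the kind of defensive check the question above hints
at could look like this inside pmc_probe() (sketch only, not part of
the posted patch):

    type = of_device_get_match_data(&ofdev->dev);
    if (!type)          /* no match data attached to this device */
            return -EINVAL;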

[PATCH v5 0/6] Allocate module text and data separately

2022-02-22 Thread Christophe Leroy
This series applies on top of Aaron's series "module: core code clean up" v8.


This series allow architectures to request having modules data in
vmalloc area instead of module area.

This is required on powerpc book3s/32 in order to set data non
executable, because it is not possible to set executability on page
basis, this is done per 256 Mbytes segments. The module area has exec
right, vmalloc area has noexec. Without this change module data
remains executable regardless of CONFIG_STRICT_MODULES_RWX.

This can also be useful on other powerpc/32 in order to maximize the
chance of code being close enough to kernel core to avoid branch
trampolines.

Changes in v5:
- Rebased on top of Aaron's series "module: core code clean up" v8

Changes in v4:
- Rebased on top of Aaron's series "module: core code clean up" v6

Changes in v3:
- Fixed the tree for data_layout at one place (Thanks Miroslav)
- Moved removal of module_addr_min/module_addr_max macro out of patch 1 in a 
new patch at the end of the series to reduce churn.

Changes in v2:
- Dropped first two patches which are not necessary. They may be added back 
later as a follow-up series.
- Fixed the printks in GDB

Christophe Leroy (6):
  module: Always have struct mod_tree_root
  module: Prepare for handling several RB trees
  module: Introduce data_layout
  module: Add CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
  module: Remove module_addr_min and module_addr_max
  powerpc: Select ARCH_WANTS_MODULES_DATA_IN_VMALLOC on book3s/32 and
8xx

 arch/Kconfig|   6 +++
 arch/powerpc/Kconfig|   1 +
 include/linux/module.h  |   8 +++
 kernel/debug/kdb/kdb_main.c |  10 +++-
 kernel/module/internal.h|  13 +++--
 kernel/module/kallsyms.c|  18 +++
 kernel/module/main.c| 103 +++-
 kernel/module/procfs.c  |   8 ++-
 kernel/module/strict_rwx.c  |  10 ++--
 kernel/module/tree_lookup.c |  28 ++
 10 files changed, 149 insertions(+), 56 deletions(-)

-- 
2.34.1



[PATCH v5 4/6] module: Add CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC

2022-02-22 Thread Christophe Leroy
Add CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC to allow architectures
to request having modules data in vmalloc area instead of module area.

This is required on powerpc book3s/32 in order to set data non
executable, because it is not possible to set executability on page
basis, this is done per 256 Mbytes segments. The module area has exec
right, vmalloc area has noexec.

This can also be useful on other powerpc/32 in order to maximize the
chance of code being close enough to kernel core to avoid branch
trampolines.
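
A minimal sketch of the idea, assuming a helper named
module_alloc_data() (name and body are illustrative; the real
allocation changes live in kernel/module/main.c further down in this
patch):

    #ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
    /* Data lands in plain vmalloc space, which sits in a no-exec segment
     * on book3s/32; text keeps coming from module_alloc().
     */
    static void *module_alloc_data(unsigned long size)
    {
            return __vmalloc(size, GFP_KERNEL);
    }
    #else
    static void *module_alloc_data(unsigned long size)
    {
            return module_alloc(size);
    }
    #endif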

Cc: Jason Wessel 
Acked-by: Daniel Thompson 
Cc: Douglas Anderson 
Signed-off-by: Christophe Leroy 
---
 arch/Kconfig|  6 
 include/linux/module.h  |  8 +
 kernel/debug/kdb/kdb_main.c | 10 +--
 kernel/module/internal.h|  3 ++
 kernel/module/main.c| 58 +++--
 kernel/module/procfs.c  |  8 +++--
 kernel/module/tree_lookup.c |  8 +
 7 files changed, 95 insertions(+), 6 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 678a80713b21..b5d1f2c19c27 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -882,6 +882,12 @@ config MODULES_USE_ELF_REL
  Modules only use ELF REL relocations.  Modules with ELF RELA
  relocations will give an error.
 
+config ARCH_WANTS_MODULES_DATA_IN_VMALLOC
+   bool
+   help
+ For architectures like powerpc/32 which have constraints on module
+ allocation and need to allocate module data outside of module area.
+
 config HAVE_IRQ_EXIT_ON_IRQ_STACK
bool
help
diff --git a/include/linux/module.h b/include/linux/module.h
index 7ec9715de7dc..134c1df04b14 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -422,6 +422,9 @@ struct module {
/* Core layout: rbtree is accessed frequently, so keep together. */
struct module_layout core_layout __module_layout_align;
struct module_layout init_layout;
+#ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
+   struct module_layout data_layout;
+#endif
 
/* Arch-specific module values */
struct mod_arch_specific arch;
@@ -569,6 +572,11 @@ bool is_module_text_address(unsigned long addr);
 static inline bool within_module_core(unsigned long addr,
  const struct module *mod)
 {
+#ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
+   if ((unsigned long)mod->data_layout.base <= addr &&
+   addr < (unsigned long)mod->data_layout.base + mod->data_layout.size)
+   return true;
+#endif
return (unsigned long)mod->core_layout.base <= addr &&
   addr < (unsigned long)mod->core_layout.base + 
mod->core_layout.size;
 }
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index 5369bf45c5d4..94de2889a062 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -2027,8 +2027,11 @@ static int kdb_lsmod(int argc, const char **argv)
if (mod->state == MODULE_STATE_UNFORMED)
continue;
 
-   kdb_printf("%-20s%8u  0x%px ", mod->name,
-  mod->core_layout.size, (void *)mod);
+   kdb_printf("%-20s%8u", mod->name, mod->core_layout.size);
+#ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
+   kdb_printf("/%8u", mod->data_layout.size);
+#endif
+   kdb_printf("  0x%px ", (void *)mod);
 #ifdef CONFIG_MODULE_UNLOAD
kdb_printf("%4d ", module_refcount(mod));
 #endif
@@ -2039,6 +2042,9 @@ static int kdb_lsmod(int argc, const char **argv)
else
kdb_printf(" (Live)");
kdb_printf(" 0x%px", mod->core_layout.base);
+#ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
+   kdb_printf("/0x%px", mod->data_layout.base);
+#endif
 
 #ifdef CONFIG_MODULE_UNLOAD
{
diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index 5ad6233d409a..6911c7533ede 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -20,7 +20,9 @@
 /* Maximum number of characters written by module_flags() */
 #define MODULE_FLAGS_BUF_SIZE (TAINT_FLAGS_COUNT + 4)
 
+#ifndef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
 #define data_layout core_layout
+#endif
 
 /*
  * Modules' sections will be aligned on page boundaries
@@ -154,6 +156,7 @@ struct mod_tree_root {
 };
 
 extern struct mod_tree_root mod_tree;
+extern struct mod_tree_root mod_data_tree;
 
 #ifdef CONFIG_MODULES_TREE_LOOKUP
 void mod_tree_insert(struct module *mod);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index bd26280f2880..f4d95a2ff08f 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -78,6 +78,12 @@ struct mod_tree_root mod_tree __cacheline_aligned = {
.addr_min = -1UL,
 };
 
+#ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
+struct mod_tree_root mod_data_tree __cacheline_aligned = {
+   .addr_min = -1UL,
+};
+#endif
+
 #define module_addr_min mod_tree.addr_min
 

[PATCH v5 2/6] module: Prepare for handling several RB trees

2022-02-22 Thread Christophe Leroy
In order to separate text and data, we need to setup
two rb trees.

Modify functions to give the tree as a parameter.

Signed-off-by: Christophe Leroy 
---
 kernel/module/internal.h|  4 ++--
 kernel/module/main.c| 16 
 kernel/module/tree_lookup.c | 20 ++--
 3 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index 743b598e7cc2..99a5be36190c 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -157,13 +157,13 @@ extern struct mod_tree_root mod_tree;
 void mod_tree_insert(struct module *mod);
 void mod_tree_remove_init(struct module *mod);
 void mod_tree_remove(struct module *mod);
-struct module *mod_find(unsigned long addr);
+struct module *mod_find(unsigned long addr, struct mod_tree_root *tree);
 #else /* !CONFIG_MODULES_TREE_LOOKUP */
 
 static inline void mod_tree_insert(struct module *mod) { }
 static inline void mod_tree_remove_init(struct module *mod) { }
 static inline void mod_tree_remove(struct module *mod) { }
-static inline struct module *mod_find(unsigned long addr)
+static inline struct module *mod_find(unsigned long addr, struct mod_tree_root 
*tree)
 {
struct module *mod;
 
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 3b75cb97f8c2..c0b961e02909 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -91,22 +91,22 @@ struct symsearch {
  * Bounds of module text, for speeding up __module_address.
  * Protected by module_mutex.
  */
-static void __mod_update_bounds(void *base, unsigned int size)
+static void __mod_update_bounds(void *base, unsigned int size, struct 
mod_tree_root *tree)
 {
unsigned long min = (unsigned long)base;
unsigned long max = min + size;
 
-   if (min < module_addr_min)
-   module_addr_min = min;
-   if (max > module_addr_max)
-   module_addr_max = max;
+   if (min < tree->addr_min)
+   tree->addr_min = min;
+   if (max > tree->addr_max)
+   tree->addr_max = max;
 }
 
 static void mod_update_bounds(struct module *mod)
 {
-   __mod_update_bounds(mod->core_layout.base, mod->core_layout.size);
+   __mod_update_bounds(mod->core_layout.base, mod->core_layout.size, &mod_tree);
	if (mod->init_layout.size)
-   __mod_update_bounds(mod->init_layout.base, mod->init_layout.size);
+   __mod_update_bounds(mod->init_layout.base, mod->init_layout.size, &mod_tree);
 }
 
 static void module_assert_mutex_or_preempt(void)
@@ -3051,7 +3051,7 @@ struct module *__module_address(unsigned long addr)
 
module_assert_mutex_or_preempt();
 
-   mod = mod_find(addr);
+   mod = mod_find(addr, &mod_tree);
if (mod) {
BUG_ON(!within_module(addr, mod));
if (mod->state == MODULE_STATE_UNFORMED)
diff --git a/kernel/module/tree_lookup.c b/kernel/module/tree_lookup.c
index 0bc4ec3b22ce..995fe68059db 100644
--- a/kernel/module/tree_lookup.c
+++ b/kernel/module/tree_lookup.c
@@ -61,14 +61,14 @@ static const struct latch_tree_ops mod_tree_ops = {
.comp = mod_tree_comp,
 };
 
-static noinline void __mod_tree_insert(struct mod_tree_node *node)
+static noinline void __mod_tree_insert(struct mod_tree_node *node, struct 
mod_tree_root *tree)
 {
-   latch_tree_insert(&node->node, &mod_tree.root, &mod_tree_ops);
+   latch_tree_insert(&node->node, &tree->root, &mod_tree_ops);
 }
 
-static void __mod_tree_remove(struct mod_tree_node *node)
+static void __mod_tree_remove(struct mod_tree_node *node, struct mod_tree_root 
*tree)
 {
-   latch_tree_erase(&node->node, &mod_tree.root, &mod_tree_ops);
+   latch_tree_erase(&node->node, &tree->root, &mod_tree_ops);
 }
 
 /*
@@ -80,28 +80,28 @@ void mod_tree_insert(struct module *mod)
mod->core_layout.mtn.mod = mod;
mod->init_layout.mtn.mod = mod;
 
-   __mod_tree_insert(&mod->core_layout.mtn);
+   __mod_tree_insert(&mod->core_layout.mtn, &mod_tree);
	if (mod->init_layout.size)
-   __mod_tree_insert(&mod->init_layout.mtn);
+   __mod_tree_insert(&mod->init_layout.mtn, &mod_tree);
 }
 
 void mod_tree_remove_init(struct module *mod)
 {
if (mod->init_layout.size)
-   __mod_tree_remove(>init_layout.mtn);
+   __mod_tree_remove(>init_layout.mtn, _tree);
 }
 
 void mod_tree_remove(struct module *mod)
 {
-   __mod_tree_remove(&mod->core_layout.mtn);
+   __mod_tree_remove(&mod->core_layout.mtn, &mod_tree);
mod_tree_remove_init(mod);
 }
 
-struct module *mod_find(unsigned long addr)
+struct module *mod_find(unsigned long addr, struct mod_tree_root *tree)
 {
struct latch_tree_node *ltn;
 
-   ltn = latch_tree_find((void *)addr, &mod_tree.root, &mod_tree_ops);
+   ltn = latch_tree_find((void *)addr, &tree->root, &mod_tree_ops);
if (!ltn)
return NULL;
 
-- 
2.34.1



[PATCH v5 3/6] module: Introduce data_layout

2022-02-22 Thread Christophe Leroy
In order to allow separation of data from text, add another layout,
called data_layout. For architectures requesting separation of text
and data, only text will go in core_layout and data will go in
data_layout.

For architectures which keep text and data together, make data_layout
an alias of core_layout, that way data_layout can be used for all
data manipulations, regardless of whether data is in core_layout or
data_layout.

Signed-off-by: Christophe Leroy 
---
 kernel/module/internal.h   |  2 ++
 kernel/module/kallsyms.c   | 18 +-
 kernel/module/main.c   | 20 
 kernel/module/strict_rwx.c | 10 +-
 4 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index 99a5be36190c..5ad6233d409a 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -20,6 +20,8 @@
 /* Maximum number of characters written by module_flags() */
 #define MODULE_FLAGS_BUF_SIZE (TAINT_FLAGS_COUNT + 4)
 
+#define data_layout core_layout
+
 /*
  * Modules' sections will be aligned on page boundaries
  * to ensure complete separation of code and data, but
diff --git a/kernel/module/kallsyms.c b/kernel/module/kallsyms.c
index b6d49bb5afed..850cc66bb28c 100644
--- a/kernel/module/kallsyms.c
+++ b/kernel/module/kallsyms.c
@@ -134,12 +134,12 @@ void layout_symtab(struct module *mod, struct load_info 
*info)
}
 
/* Append room for core symbols at end of core part. */
-   info->symoffs = ALIGN(mod->core_layout.size, symsect->sh_addralign ?: 
1);
-   info->stroffs = mod->core_layout.size = info->symoffs + ndst * 
sizeof(Elf_Sym);
-   mod->core_layout.size += strtab_size;
-   info->core_typeoffs = mod->core_layout.size;
-   mod->core_layout.size += ndst * sizeof(char);
-   mod->core_layout.size = debug_align(mod->core_layout.size);
+   info->symoffs = ALIGN(mod->data_layout.size, symsect->sh_addralign ?: 
1);
+   info->stroffs = mod->data_layout.size = info->symoffs + ndst * 
sizeof(Elf_Sym);
+   mod->data_layout.size += strtab_size;
+   info->core_typeoffs = mod->data_layout.size;
+   mod->data_layout.size += ndst * sizeof(char);
+   mod->data_layout.size = debug_align(mod->data_layout.size);
 
/* Put string table section at end of init part of module. */
strsect->sh_flags |= SHF_ALLOC;
@@ -187,9 +187,9 @@ void add_kallsyms(struct module *mod, const struct 
load_info *info)
 * Now populate the cut down core kallsyms for after init
 * and set types up while we still have access to sections.
 */
-   mod->core_kallsyms.symtab = dst = mod->core_layout.base + info->symoffs;
-   mod->core_kallsyms.strtab = s = mod->core_layout.base + info->stroffs;
-   mod->core_kallsyms.typetab = mod->core_layout.base + 
info->core_typeoffs;
+   mod->core_kallsyms.symtab = dst = mod->data_layout.base + info->symoffs;
+   mod->core_kallsyms.strtab = s = mod->data_layout.base + info->stroffs;
+   mod->core_kallsyms.typetab = mod->data_layout.base + 
info->core_typeoffs;
src = rcu_dereference_sched(mod->kallsyms)->symtab;
for (ndst = i = 0; i < 
rcu_dereference_sched(mod->kallsyms)->num_symtab; i++) {
rcu_dereference_sched(mod->kallsyms)->typetab[i] = elf_type(src 
+ i, info);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index c0b961e02909..bd26280f2880 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1229,7 +1229,7 @@ static void free_module(struct module *mod)
percpu_modfree(mod);
 
/* Free lock-classes; relies on the preceding sync_rcu(). */
-   lockdep_free_key_range(mod->core_layout.base, mod->core_layout.size);
+   lockdep_free_key_range(mod->data_layout.base, mod->data_layout.size);
 
/* Finally, free the core (containing the module structure) */
module_memfree(mod->core_layout.base);
@@ -1470,13 +1470,15 @@ static void layout_sections(struct module *mod, struct 
load_info *info)
for (i = 0; i < info->hdr->e_shnum; ++i) {
Elf_Shdr *s = &info->sechdrs[i];
const char *sname = info->secstrings + s->sh_name;
+   unsigned int *sizep;
 
if ((s->sh_flags & masks[m][0]) != masks[m][0]
|| (s->sh_flags & masks[m][1])
|| s->sh_entsize != ~0UL
|| module_init_layout_section(sname))
continue;
-   s->sh_entsize = module_get_offset(mod, &mod->core_layout.size, s, i);
+   sizep = m ? &mod->data_layout.size : &mod->core_layout.size;
+   s->sh_entsize = module_get_offset(mod, sizep, s, i);
pr_debug("\t%s\n", sname);
}
switch (m) {
@@ -1485,15 +1487,15 @@ static void layout_sections(struct module *mod, struct 

[PATCH v5 5/6] module: Remove module_addr_min and module_addr_max

2022-02-22 Thread Christophe Leroy
Replace module_addr_min and module_addr_max by
mod_tree.addr_min and mod_tree.addr_max

Signed-off-by: Christophe Leroy 
---
 kernel/module/main.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index f4d95a2ff08f..db503a212532 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -63,7 +63,7 @@
  * Mutex protects:
  * 1) List of modules (also safely readable with preempt_disable),
  * 2) module_use links,
- * 3) module_addr_min/module_addr_max.
+ * 3) mod_tree.addr_min/mod_tree.addr_max.
  * (delete and add uses RCU list operations).
  */
 DEFINE_MUTEX(module_mutex);
@@ -3006,14 +3006,14 @@ static void cfi_init(struct module *mod)
mod->exit = *exit;
 #endif
 
-   cfi_module_add(mod, module_addr_min);
+   cfi_module_add(mod, mod_tree.addr_min);
 #endif
 }
 
 static void cfi_cleanup(struct module *mod)
 {
 #ifdef CONFIG_CFI_CLANG
-   cfi_module_remove(mod, module_addr_min);
+   cfi_module_remove(mod, mod_tree.addr_min);
 #endif
 }
 
-- 
2.34.1



[PATCH v5 6/6] powerpc: Select ARCH_WANTS_MODULES_DATA_IN_VMALLOC on book3s/32 and 8xx

2022-02-22 Thread Christophe Leroy
book3s/32 and 8xx have a separate area for allocating modules,
defined by MODULES_VADDR / MODULES_END.

On book3s/32, it is not possible to protect against execution
on a page basis. A full 256M segment is either Exec or NoExec.
The module area is in an Exec segment while vmalloc area is
in a NoExec segment.

In order to protect module data against execution, select
ARCH_WANTS_MODULES_DATA_IN_VMALLOC.

For the 8xx (and possibly other 32 bits platform in the future),
there is no such constraint on Exec/NoExec protection, however
there is a critical distance between kernel functions and callers
that needs to remain below 32Mbytes in order to avoid costly
trampolines. By allocating data outside of module area, we
increase the chance for module text to remain within acceptable
distance from kernel core text.

So select ARCH_WANTS_MODULES_DATA_IN_VMALLOC for 8xx as well.

Signed-off-by: Christophe Leroy 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 28e4047e99e8..478ee49a4fb4 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -156,6 +156,7 @@ config PPC
select ARCH_WANT_IPC_PARSE_VERSION
select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
select ARCH_WANT_LD_ORPHAN_WARN
+   select ARCH_WANTS_MODULES_DATA_IN_VMALLOC   if PPC_BOOK3S_32 || 
PPC_8xx
select ARCH_WEAK_RELEASE_ACQUIRE
select BINFMT_ELF
select BUILDTIME_TABLE_SORT
-- 
2.34.1



[PATCH v5 1/6] module: Always have struct mod_tree_root

2022-02-22 Thread Christophe Leroy
In order to separate text and data, we need to setup
two rb trees.

This means that struct mod_tree_root is required even without
MODULES_TREE_LOOKUP.

Signed-off-by: Christophe Leroy 
---
 kernel/module/internal.h | 4 +++-
 kernel/module/main.c | 5 -
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index 3fc139d5074b..743b598e7cc2 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -143,15 +143,17 @@ static inline void module_decompress_cleanup(struct 
load_info *info)
 }
 #endif
 
-#ifdef CONFIG_MODULES_TREE_LOOKUP
 struct mod_tree_root {
+#ifdef CONFIG_MODULES_TREE_LOOKUP
struct latch_tree_root root;
+#endif
unsigned long addr_min;
unsigned long addr_max;
 };
 
 extern struct mod_tree_root mod_tree;
 
+#ifdef CONFIG_MODULES_TREE_LOOKUP
 void mod_tree_insert(struct module *mod);
 void mod_tree_remove_init(struct module *mod);
 void mod_tree_remove(struct module *mod);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 0749afdc34b5..3b75cb97f8c2 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -74,7 +74,6 @@ static void do_free_init(struct work_struct *w);
 static DECLARE_WORK(init_free_wq, do_free_init);
 static LLIST_HEAD(init_free_list);
 
-#ifdef CONFIG_MODULES_TREE_LOOKUP
 struct mod_tree_root mod_tree __cacheline_aligned = {
.addr_min = -1UL,
 };
@@ -82,10 +81,6 @@ struct mod_tree_root mod_tree __cacheline_aligned = {
 #define module_addr_min mod_tree.addr_min
 #define module_addr_max mod_tree.addr_max
 
-#else /* !CONFIG_MODULES_TREE_LOOKUP */
-static unsigned long module_addr_min = -1UL, module_addr_max;
-#endif /* CONFIG_MODULES_TREE_LOOKUP */
-
 struct symsearch {
const struct kernel_symbol *start, *stop;
const s32 *crcs;
-- 
2.34.1



Re: [PATCH 3/4] powerpc: Handle prefixed instructions in show_user_instructions()

2022-02-22 Thread Christophe Leroy




On 02/06/2020 at 07:27, Jordan Niethe wrote:

Currently prefixed instructions are treated as two word instructions by
show_user_instructions(); treat them as a single instruction instead. '<' and
'>' are placed around the instruction at the NIP, and for prefixed
instructions this is placed around the prefix only. Make the '<' and '>'
wrap the prefix and suffix.

Currently showing a prefixed instruction looks like:
fbe1fff8 3920 0600 a3e3 <0400> f7e4 ebe1fff8 4e800020

Make it look like:
0xfbe1fff8 0x3920 0x0600 0xa3e3 <0x0400 0xf7e4> 0xebe1fff8 
0x4e800020 0x 0x


Is it really needed to have the leading 0x ?

And is there a reason for those two 0x at the end of the new line 
that we don't have at the end of the old line ?


This is initially split into 8 instructions per line in order to fit in
an 80-column screen/terminal.


Could you make it such that it still fits within 80 cols ?

Same for patch 4 on show_user_instructions()

Christophe


Re: [PATCH 3/3] powerpc/bpf: Reallocate BPF registers to volatile registers when possible on PPC64

2022-02-22 Thread Christophe Leroy




On 27/07/2021 at 08:55, Jordan Niethe wrote:

Implement commit 40272035e1d0 ("powerpc/bpf: Reallocate BPF registers to
volatile registers when possible on PPC32") for PPC64.

When the BPF routine doesn't call any function, the non-volatile
registers can be reallocated to volatile registers in order to avoid
having to save/restore them on the stack. Keep track of which
registers can be reallocated, and make sure a register is seen as set
before it is used.

Before this patch, the test #359 ADD default X is:
0:   nop
4:   nop
8:   std r27,-40(r1)
c:   std r28,-32(r1)
   10:   xor r8,r8,r8
   14:   rotlwi  r8,r8,0
   18:   xor r28,r28,r28
   1c:   rotlwi  r28,r28,0
   20:   mr  r27,r3
   24:   li  r8,66
   28:   add r8,r8,r28
   2c:   rotlwi  r8,r8,0
   30:   ld  r27,-40(r1)
   34:   ld  r28,-32(r1)
   38:   mr  r3,r8
   3c:   blr

After this patch, the same test has become:
0:   nop
4:   nop
8:   xor r8,r8,r8
c:   rotlwi  r8,r8,0
   10:   xor r5,r5,r5
   14:   rotlwi  r5,r5,0
   18:   mr  r4,r3
   1c:   li  r8,66
   20:   add r8,r8,r5
   24:   rotlwi  r8,r8,0
   28:   mr  r3,r8
   2c:   blr

Signed-off-by: Jordan Niethe 


If this series is still applicable, it needs to be rebased of Naveen's 
series https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=286000


Christophe



Re: [PATCH v4 0/3] KVM: PPC: Book3S PR: Fixes for AIL and SCV

2022-02-22 Thread Paolo Bonzini

On 2/22/22 07:47, Nicholas Piggin wrote:

Patch 3 requires a KVM_CAP_PPC number allocated. QEMU maintainers are
happy with it (link in changelog) just waiting on KVM upstreaming. Do
you have objections to the series going to ppc/kvm tree first, or
another option is you could take patch 3 alone first (it's relatively
independent of the other 2) and ppc/kvm gets it from you?


Hi Nick,

I have pushed a topic branch kvm-cap-ppc-210 to kvm.git with just the 
definition and documentation of the capability.  ppc/kvm can apply your 
patch based on it (and drop the relevant parts of patch 3).  I'll send 
it to Linus this week.


Paolo



[PATCH] powerpc: Remove remaining stab codes

2022-02-22 Thread Christophe Leroy
Following commit 12318163737c ("powerpc/32: Remove remaining .stabs
annotations"), stabs code are not used anymore.

Remove them.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/ppc_asm.h | 6 --
 1 file changed, 6 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc_asm.h 
b/arch/powerpc/include/asm/ppc_asm.h
index d9c6f12e6d3e..ee52667d76e2 100644
--- a/arch/powerpc/include/asm/ppc_asm.h
+++ b/arch/powerpc/include/asm/ppc_asm.h
@@ -693,12 +693,6 @@ END_FTR_SECTION_NESTED(CPU_FTR_CELL_TB_BUG, 
CPU_FTR_CELL_TB_BUG, 96)
 #define evr30  30
 #define evr31  31
 
-/* some stab codes */
-#define N_FUN  36
-#define N_RSYM 64
-#define N_SLINE68
-#define N_SO   100
-
 #define RFSCV  .long 0x4ca4
 
 /*
-- 
2.34.1



[PATCH] powerpc/64s: Don't use DSISR for SLB faults

2022-02-22 Thread Michael Ellerman
Since commit 46ddcb3950a2 ("powerpc/mm: Show if a bad page fault on data
is read or write.") we use page_fault_is_write(regs->dsisr) in
__bad_page_fault() to determine if the fault is for a read or write, and
change the message printed accordingly.

But SLB faults, aka Data Segment Interrupts, don't set DSISR (Data
Storage Interrupt Status Register) to a useful value. All ISA versions
from v2.03 through v3.1 specify that the Data Segment Interrupt sets
DSISR "to an undefined value". As far as I can see there's no mention of
SLB faults setting DSISR in any BookIV content either.

This manifests as accesses that should be a read being incorrectly
reported as writes, for example, using the xmon "dump" command:

  0:mon> d 0x5deadbeef000
  5deadbeef000
  [359526.415354][C6] BUG: Unable to handle kernel data access on write at 
0x5deadbeef000
  [359526.415611][C6] Faulting instruction address: 0xc010a300
  cpu 0x6: Vector: 380 (Data SLB Access) at [cffbf400]
  pc: c010a300: mread+0x90/0x190

If we disassemble the PC, we see a load instruction:

  0:mon> di c010a300
  c010a300 8949  lbz r10,0(r9)

We can also see in exceptions-64s.S that the data_access_slb block
doesn't set IDSISR=1, which means it doesn't load DSISR into pt_regs. So
the value we're using to determine if the fault is a read/write is some
stale value in pt_regs from a previous page fault.

Rework the printing logic to separate the SLB fault case out, and only
print read/write in the cases where we can determine it.

The result looks like eg:

  0:mon> d 0x5deadbeef000
  5deadbeef000
  [  721.779525][C6] BUG: Unable to handle kernel data access at 
0x5deadbeef000
  [  721.779697][C6] Faulting instruction address: 0xc014cbe0
  cpu 0x6: Vector: 380 (Data SLB Access) at [cffbf390]

  0:mon> d 0
  
  [  742.793242][C6] BUG: Kernel NULL pointer dereference at 0x
  [  742.793316][C6] Faulting instruction address: 0xc014cbe0
  cpu 0x6: Vector: 380 (Data SLB Access) at [cffbf390]

Fixes: 46ddcb3950a2 ("powerpc/mm: Show if a bad page fault on data is read or 
write.")
Reported-by: Nageswara R Sastry 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/mm/fault.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index eb8ecd7343a9..7ba6d3eff636 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -567,18 +567,24 @@ NOKPROBE_SYMBOL(hash__do_page_fault);
 static void __bad_page_fault(struct pt_regs *regs, int sig)
 {
int is_write = page_fault_is_write(regs->dsisr);
+   const char *msg;
 
/* kernel has accessed a bad area */
 
+   if (regs->dar < PAGE_SIZE)
+   msg = "Kernel NULL pointer dereference";
+   else
+   msg = "Unable to handle kernel data access";
+
switch (TRAP(regs)) {
case INTERRUPT_DATA_STORAGE:
-   case INTERRUPT_DATA_SEGMENT:
case INTERRUPT_H_DATA_STORAGE:
-   pr_alert("BUG: %s on %s at 0x%08lx\n",
-regs->dar < PAGE_SIZE ? "Kernel NULL pointer 
dereference" :
-"Unable to handle kernel data access",
+   pr_alert("BUG: %s on %s at 0x%08lx\n", msg,
 is_write ? "write" : "read", regs->dar);
break;
+   case INTERRUPT_DATA_SEGMENT:
+   pr_alert("BUG: %s at 0x%08lx\n", msg, regs->dar);
+   break;
case INTERRUPT_INST_STORAGE:
case INTERRUPT_INST_SEGMENT:
pr_alert("BUG: Unable to handle kernel instruction fetch%s",
-- 
2.34.1



[PATCH v4 2/6] module: Prepare for handling several RB trees

2022-02-22 Thread Christophe Leroy
In order to separate text and data, we need to setup
two rb trees.

Modify functions to give the tree as a parameter.

Signed-off-by: Christophe Leroy 
---
 kernel/module/internal.h|  4 ++--
 kernel/module/main.c| 16 
 kernel/module/tree_lookup.c | 20 ++--
 3 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index 07561753158d..26a1a3711d66 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -157,13 +157,13 @@ extern struct mod_tree_root mod_tree;
 void mod_tree_insert(struct module *mod);
 void mod_tree_remove_init(struct module *mod);
 void mod_tree_remove(struct module *mod);
-struct module *mod_find(unsigned long addr);
+struct module *mod_find(unsigned long addr, struct mod_tree_root *tree);
 #else /* !CONFIG_MODULES_TREE_LOOKUP */
 
 static inline void mod_tree_insert(struct module *mod) { }
 static inline void mod_tree_remove_init(struct module *mod) { }
 static inline void mod_tree_remove(struct module *mod) { }
-static inline struct module *mod_find(unsigned long addr)
+static inline struct module *mod_find(unsigned long addr, struct mod_tree_root 
*tree)
 {
struct module *mod;
 
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 3b75cb97f8c2..c0b961e02909 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -91,22 +91,22 @@ struct symsearch {
  * Bounds of module text, for speeding up __module_address.
  * Protected by module_mutex.
  */
-static void __mod_update_bounds(void *base, unsigned int size)
+static void __mod_update_bounds(void *base, unsigned int size, struct 
mod_tree_root *tree)
 {
unsigned long min = (unsigned long)base;
unsigned long max = min + size;
 
-   if (min < module_addr_min)
-   module_addr_min = min;
-   if (max > module_addr_max)
-   module_addr_max = max;
+   if (min < tree->addr_min)
+   tree->addr_min = min;
+   if (max > tree->addr_max)
+   tree->addr_max = max;
 }
 
 static void mod_update_bounds(struct module *mod)
 {
-   __mod_update_bounds(mod->core_layout.base, mod->core_layout.size);
+   __mod_update_bounds(mod->core_layout.base, mod->core_layout.size, &mod_tree);
	if (mod->init_layout.size)
-   __mod_update_bounds(mod->init_layout.base, mod->init_layout.size);
+   __mod_update_bounds(mod->init_layout.base, mod->init_layout.size, &mod_tree);
 }
 
 static void module_assert_mutex_or_preempt(void)
@@ -3051,7 +3051,7 @@ struct module *__module_address(unsigned long addr)
 
module_assert_mutex_or_preempt();
 
-   mod = mod_find(addr);
+   mod = mod_find(addr, &mod_tree);
if (mod) {
BUG_ON(!within_module(addr, mod));
if (mod->state == MODULE_STATE_UNFORMED)
diff --git a/kernel/module/tree_lookup.c b/kernel/module/tree_lookup.c
index 0bc4ec3b22ce..995fe68059db 100644
--- a/kernel/module/tree_lookup.c
+++ b/kernel/module/tree_lookup.c
@@ -61,14 +61,14 @@ static const struct latch_tree_ops mod_tree_ops = {
.comp = mod_tree_comp,
 };
 
-static noinline void __mod_tree_insert(struct mod_tree_node *node)
+static noinline void __mod_tree_insert(struct mod_tree_node *node, struct 
mod_tree_root *tree)
 {
-   latch_tree_insert(&node->node, &mod_tree.root, &mod_tree_ops);
+   latch_tree_insert(&node->node, &tree->root, &mod_tree_ops);
 }
 
-static void __mod_tree_remove(struct mod_tree_node *node)
+static void __mod_tree_remove(struct mod_tree_node *node, struct mod_tree_root 
*tree)
 {
-   latch_tree_erase(&node->node, &mod_tree.root, &mod_tree_ops);
+   latch_tree_erase(&node->node, &tree->root, &mod_tree_ops);
 }
 
 /*
@@ -80,28 +80,28 @@ void mod_tree_insert(struct module *mod)
mod->core_layout.mtn.mod = mod;
mod->init_layout.mtn.mod = mod;
 
-   __mod_tree_insert(&mod->core_layout.mtn);
+   __mod_tree_insert(&mod->core_layout.mtn, &mod_tree);
	if (mod->init_layout.size)
-   __mod_tree_insert(&mod->init_layout.mtn);
+   __mod_tree_insert(&mod->init_layout.mtn, &mod_tree);
 }
 
 void mod_tree_remove_init(struct module *mod)
 {
if (mod->init_layout.size)
-   __mod_tree_remove(&mod->init_layout.mtn);
+   __mod_tree_remove(&mod->init_layout.mtn, &mod_tree);
 }
 
 void mod_tree_remove(struct module *mod)
 {
-   __mod_tree_remove(&mod->core_layout.mtn);
+   __mod_tree_remove(&mod->core_layout.mtn, &mod_tree);
mod_tree_remove_init(mod);
 }
 
-struct module *mod_find(unsigned long addr)
+struct module *mod_find(unsigned long addr, struct mod_tree_root *tree)
 {
struct latch_tree_node *ltn;
 
-   ltn = latch_tree_find((void *)addr, &mod_tree.root, &mod_tree_ops);
+   ltn = latch_tree_find((void *)addr, &tree->root, &mod_tree_ops);
if (!ltn)
return NULL;
 
-- 
2.34.1



[PATCH v4 5/6] module: Remove module_addr_min and module_addr_max

2022-02-22 Thread Christophe Leroy
Replace module_addr_min and module_addr_max by
mod_tree.addr_min and mod_tree.addr_max

Signed-off-by: Christophe Leroy 
---
 kernel/module/main.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index f4d95a2ff08f..db503a212532 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -63,7 +63,7 @@
  * Mutex protects:
  * 1) List of modules (also safely readable with preempt_disable),
  * 2) module_use links,
- * 3) module_addr_min/module_addr_max.
+ * 3) mod_tree.addr_min/mod_tree.addr_max.
  * (delete and add uses RCU list operations).
  */
 DEFINE_MUTEX(module_mutex);
@@ -3006,14 +3006,14 @@ static void cfi_init(struct module *mod)
mod->exit = *exit;
 #endif
 
-   cfi_module_add(mod, module_addr_min);
+   cfi_module_add(mod, mod_tree.addr_min);
 #endif
 }
 
 static void cfi_cleanup(struct module *mod)
 {
 #ifdef CONFIG_CFI_CLANG
-   cfi_module_remove(mod, module_addr_min);
+   cfi_module_remove(mod, mod_tree.addr_min);
 #endif
 }
 
-- 
2.34.1



[PATCH v4 4/6] module: Add CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC

2022-02-22 Thread Christophe Leroy
Add CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC to allow architectures
to request having modules data in vmalloc area instead of module area.

This is required on powerpc book3s/32 in order to set data non
executable, because it is not possible to set executability on page
basis, this is done per 256 Mbytes segments. The module area has exec
right, vmalloc area has noexec.

This can also be useful on other powerpc/32 in order to maximize the
chance of code being close enough to kernel core to avoid branch
trampolines.

Cc: Jason Wessel 
Acked-by: Daniel Thompson 
Cc: Douglas Anderson 
Signed-off-by: Christophe Leroy 
---
 arch/Kconfig|  6 
 include/linux/module.h  |  8 +
 kernel/debug/kdb/kdb_main.c | 10 +--
 kernel/module/internal.h|  3 ++
 kernel/module/main.c| 58 +++--
 kernel/module/procfs.c  |  8 +++--
 kernel/module/tree_lookup.c |  8 +
 7 files changed, 95 insertions(+), 6 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 678a80713b21..b5d1f2c19c27 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -882,6 +882,12 @@ config MODULES_USE_ELF_REL
  Modules only use ELF REL relocations.  Modules with ELF RELA
  relocations will give an error.
 
+config ARCH_WANTS_MODULES_DATA_IN_VMALLOC
+   bool
+   help
+ For architectures like powerpc/32 which have constraints on module
+ allocation and need to allocate module data outside of module area.
+
 config HAVE_IRQ_EXIT_ON_IRQ_STACK
bool
help
diff --git a/include/linux/module.h b/include/linux/module.h
index 7ec9715de7dc..134c1df04b14 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -422,6 +422,9 @@ struct module {
/* Core layout: rbtree is accessed frequently, so keep together. */
struct module_layout core_layout __module_layout_align;
struct module_layout init_layout;
+#ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
+   struct module_layout data_layout;
+#endif
 
/* Arch-specific module values */
struct mod_arch_specific arch;
@@ -569,6 +572,11 @@ bool is_module_text_address(unsigned long addr);
 static inline bool within_module_core(unsigned long addr,
  const struct module *mod)
 {
+#ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
+   if ((unsigned long)mod->data_layout.base <= addr &&
+   addr < (unsigned long)mod->data_layout.base + mod->data_layout.size)
+   return true;
+#endif
return (unsigned long)mod->core_layout.base <= addr &&
   addr < (unsigned long)mod->core_layout.base + 
mod->core_layout.size;
 }
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index 5369bf45c5d4..94de2889a062 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -2027,8 +2027,11 @@ static int kdb_lsmod(int argc, const char **argv)
if (mod->state == MODULE_STATE_UNFORMED)
continue;
 
-   kdb_printf("%-20s%8u  0x%px ", mod->name,
-  mod->core_layout.size, (void *)mod);
+   kdb_printf("%-20s%8u", mod->name, mod->core_layout.size);
+#ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
+   kdb_printf("/%8u", mod->data_layout.size);
+#endif
+   kdb_printf("  0x%px ", (void *)mod);
 #ifdef CONFIG_MODULE_UNLOAD
kdb_printf("%4d ", module_refcount(mod));
 #endif
@@ -2039,6 +2042,9 @@ static int kdb_lsmod(int argc, const char **argv)
else
kdb_printf(" (Live)");
kdb_printf(" 0x%px", mod->core_layout.base);
+#ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
+   kdb_printf("/0x%px", mod->data_layout.base);
+#endif
 
 #ifdef CONFIG_MODULE_UNLOAD
{
diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index db85058f4acb..27ea99707059 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -20,7 +20,9 @@
 /* Maximum number of characters written by module_flags() */
 #define MODULE_FLAGS_BUF_SIZE (TAINT_FLAGS_COUNT + 4)
 
+#ifndef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
 #define data_layout core_layout
+#endif
 
 /*
  * Modules' sections will be aligned on page boundaries
@@ -154,6 +156,7 @@ struct mod_tree_root {
 };
 
 extern struct mod_tree_root mod_tree;
+extern struct mod_tree_root mod_data_tree;
 
 #ifdef CONFIG_MODULES_TREE_LOOKUP
 void mod_tree_insert(struct module *mod);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index bd26280f2880..f4d95a2ff08f 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -78,6 +78,12 @@ struct mod_tree_root mod_tree __cacheline_aligned = {
.addr_min = -1UL,
 };
 
+#ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
+struct mod_tree_root mod_data_tree __cacheline_aligned = {
+   .addr_min = -1UL,
+};
+#endif
+
 #define module_addr_min mod_tree.addr_min
 

[PATCH v4 3/6] module: Introduce data_layout

2022-02-22 Thread Christophe Leroy
In order to allow separation of data from text, add another layout,
called data_layout. For architectures requesting separation of text
and data, only text will go in core_layout and data will go in
data_layout.

For architectures which keep text and data together, make data_layout
an alias of core_layout, so that data_layout can be used for all
data manipulations, regardless of whether the data lives in
core_layout or data_layout.
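
As an illustration of that aliasing trick, here is a minimal standalone
sketch (not kernel code; the struct names and the SEPARATE_MODULE_DATA
switch are invented stand-ins for the real definitions, which appear in
the kernel/module/internal.h hunk below):

#include <stdio.h>

struct layout_sketch {
	void *base;
	unsigned int size;
};

struct module_sketch {
	struct layout_sketch core_layout;
#ifdef SEPARATE_MODULE_DATA	/* stand-in for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC */
	struct layout_sketch data_layout;
#endif
};

#ifndef SEPARATE_MODULE_DATA
#define data_layout core_layout	/* generic code always says data_layout... */
#endif

int main(void)
{
	struct module_sketch mod = { .core_layout = { (void *)0x1000, 4096 } };

	mod.data_layout.size += 256;	/* ...and it lands in core_layout when not separated */
	printf("core size: %u\n", mod.core_layout.size);
	return 0;
}

Built without -DSEPARATE_MODULE_DATA this prints 4352, showing that the
data_layout access transparently updated core_layout.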

Signed-off-by: Christophe Leroy 
---
 kernel/module/internal.h   |  2 ++
 kernel/module/kallsyms.c   | 18 +-
 kernel/module/main.c   | 20 
 kernel/module/strict_rwx.c | 10 +-
 4 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index 26a1a3711d66..db85058f4acb 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -20,6 +20,8 @@
 /* Maximum number of characters written by module_flags() */
 #define MODULE_FLAGS_BUF_SIZE (TAINT_FLAGS_COUNT + 4)
 
+#define data_layout core_layout
+
 /*
  * Modules' sections will be aligned on page boundaries
  * to ensure complete separation of code and data, but
diff --git a/kernel/module/kallsyms.c b/kernel/module/kallsyms.c
index 2ee8d2e67068..fe6723c040be 100644
--- a/kernel/module/kallsyms.c
+++ b/kernel/module/kallsyms.c
@@ -134,12 +134,12 @@ void layout_symtab(struct module *mod, struct load_info *info)
}
 
/* Append room for core symbols at end of core part. */
-   info->symoffs = ALIGN(mod->core_layout.size, symsect->sh_addralign ?: 1);
-   info->stroffs = mod->core_layout.size = info->symoffs + ndst * sizeof(Elf_Sym);
-   mod->core_layout.size += strtab_size;
-   info->core_typeoffs = mod->core_layout.size;
-   mod->core_layout.size += ndst * sizeof(char);
-   mod->core_layout.size = debug_align(mod->core_layout.size);
+   info->symoffs = ALIGN(mod->data_layout.size, symsect->sh_addralign ?: 1);
+   info->stroffs = mod->data_layout.size = info->symoffs + ndst * sizeof(Elf_Sym);
+   mod->data_layout.size += strtab_size;
+   info->core_typeoffs = mod->data_layout.size;
+   mod->data_layout.size += ndst * sizeof(char);
+   mod->data_layout.size = debug_align(mod->data_layout.size);
 
/* Put string table section at end of init part of module. */
strsect->sh_flags |= SHF_ALLOC;
@@ -186,9 +186,9 @@ void add_kallsyms(struct module *mod, const struct load_info *info)
 * Now populate the cut down core kallsyms for after init
 * and set types up while we still have access to sections.
 */
-   mod->core_kallsyms.symtab = dst = mod->core_layout.base + info->symoffs;
-   mod->core_kallsyms.strtab = s = mod->core_layout.base + info->stroffs;
-   mod->core_kallsyms.typetab = mod->core_layout.base + info->core_typeoffs;
+   mod->core_kallsyms.symtab = dst = mod->data_layout.base + info->symoffs;
+   mod->core_kallsyms.strtab = s = mod->data_layout.base + info->stroffs;
+   mod->core_kallsyms.typetab = mod->data_layout.base + info->core_typeoffs;
src = rcu_dereference_sched(mod->kallsyms)->symtab;
	for (ndst = i = 0; i < rcu_dereference_sched(mod->kallsyms)->num_symtab; i++) {
		rcu_dereference_sched(mod->kallsyms)->typetab[i] = elf_type(src + i, info);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index c0b961e02909..bd26280f2880 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1229,7 +1229,7 @@ static void free_module(struct module *mod)
percpu_modfree(mod);
 
/* Free lock-classes; relies on the preceding sync_rcu(). */
-   lockdep_free_key_range(mod->core_layout.base, mod->core_layout.size);
+   lockdep_free_key_range(mod->data_layout.base, mod->data_layout.size);
 
/* Finally, free the core (containing the module structure) */
module_memfree(mod->core_layout.base);
@@ -1470,13 +1470,15 @@ static void layout_sections(struct module *mod, struct load_info *info)
for (i = 0; i < info->hdr->e_shnum; ++i) {
Elf_Shdr *s = &info->sechdrs[i];
const char *sname = info->secstrings + s->sh_name;
+   unsigned int *sizep;
 
if ((s->sh_flags & masks[m][0]) != masks[m][0]
|| (s->sh_flags & masks[m][1])
|| s->sh_entsize != ~0UL
|| module_init_layout_section(sname))
continue;
-   s->sh_entsize = module_get_offset(mod, &mod->core_layout.size, s, i);
+   sizep = m ? &mod->data_layout.size : &mod->core_layout.size;
+   s->sh_entsize = module_get_offset(mod, sizep, s, i);
pr_debug("\t%s\n", sname);
}
switch (m) {
@@ -1485,15 +1487,15 @@ static void layout_sections(struct module *mod, struct 

[PATCH v4 1/6] module: Always have struct mod_tree_root

2022-02-22 Thread Christophe Leroy
In order to separate text and data, we need to set up
two RB trees.

This means that struct mod_tree_root is required even without
MODULES_TREE_LOOKUP.
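
For illustration, a compile-only sketch of the resulting shape (simplified
stand-in types, not the real definitions from kernel/module/internal.h):
only the latch tree root stays behind the config option, the address bounds
are always present, and because the struct now always exists a second
instance can later be declared for module data.

/* stand-in for the kernel's struct latch_tree_root */
struct latch_tree_root_stub { void *node; unsigned int seq; };

struct mod_tree_root_sketch {
#ifdef CONFIG_MODULES_TREE_LOOKUP
	struct latch_tree_root_stub root;	/* only needed for RB-tree lookup */
#endif
	unsigned long addr_min;			/* tracked with or without the tree */
	unsigned long addr_max;
};

/* a second root for data becomes a one-liner once the struct is unconditional */
static struct mod_tree_root_sketch text_tree_sketch = { .addr_min = -1UL };
static struct mod_tree_root_sketch data_tree_sketch = { .addr_min = -1UL };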

Signed-off-by: Christophe Leroy 
---
 kernel/module/internal.h | 4 +++-
 kernel/module/main.c | 5 -
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index fecfa590c149..07561753158d 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -143,15 +143,17 @@ static inline void module_decompress_cleanup(struct load_info *info)
 }
 #endif
 
-#ifdef CONFIG_MODULES_TREE_LOOKUP
 struct mod_tree_root {
+#ifdef CONFIG_MODULES_TREE_LOOKUP
struct latch_tree_root root;
+#endif
unsigned long addr_min;
unsigned long addr_max;
 };
 
 extern struct mod_tree_root mod_tree;
 
+#ifdef CONFIG_MODULES_TREE_LOOKUP
 void mod_tree_insert(struct module *mod);
 void mod_tree_remove_init(struct module *mod);
 void mod_tree_remove(struct module *mod);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 0749afdc34b5..3b75cb97f8c2 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -74,7 +74,6 @@ static void do_free_init(struct work_struct *w);
 static DECLARE_WORK(init_free_wq, do_free_init);
 static LLIST_HEAD(init_free_list);
 
-#ifdef CONFIG_MODULES_TREE_LOOKUP
 struct mod_tree_root mod_tree __cacheline_aligned = {
.addr_min = -1UL,
 };
@@ -82,10 +81,6 @@ struct mod_tree_root mod_tree __cacheline_aligned = {
 #define module_addr_min mod_tree.addr_min
 #define module_addr_max mod_tree.addr_max
 
-#else /* !CONFIG_MODULES_TREE_LOOKUP */
-static unsigned long module_addr_min = -1UL, module_addr_max;
-#endif /* CONFIG_MODULES_TREE_LOOKUP */
-
 struct symsearch {
const struct kernel_symbol *start, *stop;
const s32 *crcs;
-- 
2.34.1



[PATCH v4 0/6] Allocate module text and data separately

2022-02-22 Thread Christophe Leroy
This series applies on top of Aaron's series "module: core code clean up" v6, 
plus the 4 fixups I just sent:
- Fixup for 54f2273e5fef ("module: Move kallsyms support into a separate file")
- Fixup for e5973a14d187 ("module: Move strict rwx support to a separate file")
- Fixup for 1df95c1b9fb2 ("module: Move latched RB-tree support to a separate file")
- Fixup for 87b31159f78a ("module: Move all into module/")


This series allows architectures to request having module data in
the vmalloc area instead of the module area.

This is required on powerpc book3s/32 in order to set data
non-executable, because executability cannot be set on a per-page
basis there; it is set per 256 Mbyte segment. The module area has
exec rights while the vmalloc area is noexec. Without this change,
module data remains executable regardless of CONFIG_STRICT_MODULES_RWX.

This can also be useful on other powerpc/32 platforms in order to
maximize the chance of module code being close enough to the kernel
core to avoid branch trampolines.
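
A rough userspace sketch of the intended split (illustrative only: the
helper names are invented and malloc() stands in for module_alloc()/vmalloc()):

#include <stdio.h>
#include <stdlib.h>

/* stand-in for module_alloc(): MODULES_VADDR..MODULES_END, an exec segment */
static void *alloc_module_text(size_t size)
{
	return malloc(size);
}

/* stand-in for the data allocation this series introduces */
static void *alloc_module_data(size_t size)
{
#ifdef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
	return malloc(size);		/* stand-in for vmalloc(): a noexec segment */
#else
	return alloc_module_text(size);	/* legacy: data shares the exec module area */
#endif
}

int main(void)
{
	void *text = alloc_module_text(4096);
	void *data = alloc_module_data(4096);

	printf("text=%p data=%p\n", text, data);
	free(data);
	free(text);
	return 0;
}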

Changes in v4:
- Rebased on top of Aaron's series "module: core code clean up" v6

Changes in v3:
- Fixed the tree for data_layout at one place (Thanks Miroslav)
- Moved the removal of the module_addr_min/module_addr_max macros out of patch 1
into a new patch at the end of the series, to reduce churn.

Changes in v2:
- Dropped first two patches which are not necessary. They may be added back 
later as a follow-up series.
- Fixed the printks in GDB

Christophe Leroy (6):
  module: Always have struct mod_tree_root
  module: Prepare for handling several RB trees
  module: Introduce data_layout
  module: Add CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
  module: Remove module_addr_min and module_addr_max
  powerpc: Select ARCH_WANTS_MODULES_DATA_IN_VMALLOC on book3s/32 and
8xx

 arch/Kconfig|   6 +++
 arch/powerpc/Kconfig|   1 +
 include/linux/module.h  |   8 +++
 kernel/debug/kdb/kdb_main.c |  10 +++-
 kernel/module/internal.h|  13 +++--
 kernel/module/kallsyms.c|  18 +++
 kernel/module/main.c| 103 +++-
 kernel/module/procfs.c  |   8 ++-
 kernel/module/strict_rwx.c  |  10 ++--
 kernel/module/tree_lookup.c |  28 ++
 10 files changed, 149 insertions(+), 56 deletions(-)

-- 
2.34.1



[PATCH v4 6/6] powerpc: Select ARCH_WANTS_MODULES_DATA_IN_VMALLOC on book3s/32 and 8xx

2022-02-22 Thread Christophe Leroy
book3s/32 and 8xx have a separate area for allocating modules,
defined by MODULES_VADDR / MODULES_END.

On book3s/32, it is not possible to protect against execution
on a per-page basis. A full 256M segment is either Exec or NoExec.
The module area is in an Exec segment while vmalloc area is
in a NoExec segment.

In order to protect module data against execution, select
ARCH_WANTS_MODULES_DATA_IN_VMALLOC.

For the 8xx (and possibly other 32-bit platforms in the future),
there is no such constraint on Exec/NoExec protection; however,
there is a critical distance between kernel functions and their
callers that needs to remain below 32 Mbytes in order to avoid
costly trampolines. By allocating data outside of the module area,
we increase the chance for module text to remain within an
acceptable distance from the kernel core text.

So select ARCH_WANTS_MODULES_DATA_IN_VMALLOC for 8xx as well.
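
To illustrate the distance constraint (not kernel code): a powerpc relative
branch encodes a signed 26-bit byte displacement, so a direct "bl" only
reaches a target within about +/- 32 Mbytes of the call site; anything
further needs a trampoline. A hypothetical range check:

#include <stdbool.h>
#include <stdio.h>

#define DIRECT_BRANCH_RANGE 0x2000000L	/* 32 Mbytes */

static bool within_direct_branch_range(unsigned long caller, unsigned long callee)
{
	long delta = (long)(callee - caller);

	return delta >= -DIRECT_BRANCH_RANGE && delta < DIRECT_BRANCH_RANGE;
}

int main(void)
{
	/* made-up addresses: a module calling kernel text located 40 Mbytes away */
	unsigned long kernel_func = 0xc0100000UL;
	unsigned long module_func = kernel_func + 40 * 1024 * 1024;

	printf("direct call possible: %d\n",
	       within_direct_branch_range(module_func, kernel_func));
	return 0;
}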

Signed-off-by: Christophe Leroy 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 28e4047e99e8..478ee49a4fb4 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -156,6 +156,7 @@ config PPC
select ARCH_WANT_IPC_PARSE_VERSION
select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
select ARCH_WANT_LD_ORPHAN_WARN
+   select ARCH_WANTS_MODULES_DATA_IN_VMALLOC   if PPC_BOOK3S_32 || PPC_8xx
select ARCH_WEAK_RELEASE_ACQUIRE
select BINFMT_ELF
select BUILDTIME_TABLE_SORT
-- 
2.34.1