[PATCH 1/1] ASoC: imx-hdmi: Use dev_err_probe

2023-01-18 Thread Alexander Stein
This silences -517 errors and helps figure out why the device probe
is deferred.

Signed-off-by: Alexander Stein 
---
 sound/soc/fsl/imx-hdmi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sound/soc/fsl/imx-hdmi.c b/sound/soc/fsl/imx-hdmi.c
index a780cf5a65ffa..b6cc7e6c2a320 100644
--- a/sound/soc/fsl/imx-hdmi.c
+++ b/sound/soc/fsl/imx-hdmi.c
@@ -202,7 +202,7 @@ static int imx_hdmi_probe(struct platform_device *pdev)
snd_soc_card_set_drvdata(&data->card, data);
ret = devm_snd_soc_register_card(&pdev->dev, &data->card);
if (ret) {
-   dev_err(&pdev->dev, "snd_soc_register_card failed (%d)\n", ret);
+   dev_err_probe(&pdev->dev, ret, "snd_soc_register_card failed\n");
goto fail;
}
 
-- 
2.34.1
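
For context: dev_err_probe() logs via dev_err() only when the error is not
-EPROBE_DEFER, records the deferral reason (visible in debugfs) otherwise, and
returns the error it was passed. A probe error path can therefore usually
collapse into a single statement, along these lines (a minimal sketch with
made-up names, not code from this driver):

	/* Sketch only: assumes a driver with its own snd_soc_card. */
	static struct snd_soc_card example_card;

	static int example_probe(struct platform_device *pdev)
	{
		int ret;

		ret = devm_snd_soc_register_card(&pdev->dev, &example_card);
		if (ret)
			return dev_err_probe(&pdev->dev, ret,
					     "snd_soc_register_card failed\n");

		return 0;
	}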



[PATCH v2] powerpc: Implement arch_within_stack_frames

2023-01-18 Thread Nicholas Miehlbradt
Walks the stack when copy_{to,from}_user address is in the stack to
ensure that the object being copied is entirely within a single stack
frame.

Substantially similar to the x86 implementation except using the back
chain to traverse the stack and identify stack frame boundaries.

Signed-off-by: Nicholas Miehlbradt 
---
v2: Rename PARAMETER_SAVE_OFFSET to STACK_FRAME_PARAMS
Add definitions of STACK_FRAME_PARAMS for PPC32 and remove dependency on 
PPC64
Ignore the current stack frame and start with its parent, similar to x86

v1: 
https://lore.kernel.org/linuxppc-dev/20221214044252.1910657-1-nicho...@linux.ibm.com/
---
 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/include/asm/thread_info.h | 36 ++
 2 files changed, 37 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2ca5418457ed..97ca54773521 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -198,6 +198,7 @@ config PPC
select HAVE_ARCH_KASAN_VMALLOC  if HAVE_ARCH_KASAN
select HAVE_ARCH_KFENCE if ARCH_SUPPORTS_DEBUG_PAGEALLOC
select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
+   select HAVE_ARCH_WITHIN_STACK_FRAMES
select HAVE_ARCH_KGDB
select HAVE_ARCH_MMAP_RND_BITS
select HAVE_ARCH_MMAP_RND_COMPAT_BITS   if COMPAT
diff --git a/arch/powerpc/include/asm/thread_info.h 
b/arch/powerpc/include/asm/thread_info.h
index af58f1ed3952..c5dce5f239c1 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -186,6 +186,42 @@ static inline bool test_thread_local_flags(unsigned int flags)
 #define is_elf2_task() (0)
 #endif
 
+#if defined(CONFIG_PPC64_ELF_ABI_V1)
+#define STACK_FRAME_PARAMS 48
+#elif defined(CONFIG_PPC64_ELF_ABI_V2)
+#define STACK_FRAME_PARAMS 32
+#elif defined(CONFIG_PPC32)
+#define STACK_FRAME_PARAMS 8
+#endif
+
+/*
+ * Walks up the stack frames to make sure that the specified object is
+ * entirely contained by a single stack frame.
+ *
+ * Returns:
+ * GOOD_FRAME  if within a frame
+ * BAD_STACK   if placed across a frame boundary (or outside stack)
+ */
+static inline int arch_within_stack_frames(const void * const stack,
+  const void * const stackend,
+  const void *obj, unsigned long len)
+{
+   const void *params;
+   const void *frame;
+
+   params = *(const void * const *)current_stack_pointer + STACK_FRAME_PARAMS;
+   frame = **(const void * const * const *)current_stack_pointer;
+
+   while (stack <= frame && frame < stackend) {
+   if (obj + len <= frame)
+   return obj >= params ? GOOD_FRAME : BAD_STACK;
+   params = frame + STACK_FRAME_PARAMS;
+   frame = *(const void * const *)frame;
+   }
+
+   return BAD_STACK;
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __KERNEL__ */
-- 
2.34.1
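
For reference, this hook is consumed by the generic hardened usercopy code.
Roughly (a simplified sketch of the stack check in mm/usercopy.c, not the
exact upstream source), the caller looks like this:

	static noinline int check_stack_object(const void *obj, unsigned long len)
	{
		const void * const stack = task_stack_page(current);
		const void * const stackend = stack + THREAD_SIZE;

		/* Object is not on this task's stack at all. */
		if (obj + len <= stack || stackend <= obj)
			return NOT_STACK;

		/* Reject objects that straddle the stack boundaries. */
		if (obj < stack || stackend < obj + len)
			return BAD_STACK;

		/* Otherwise ask the architecture to check frame boundaries. */
		return arch_within_stack_frames(stack, stackend, obj, len);
	}

So with the back-chain walk above, a copy_{to,from}_user() object that spans a
frame boundary comes back as BAD_STACK and is rejected by hardened usercopy.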



Re: [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme

2023-01-18 Thread Nicholas Piggin
On Thu Jan 19, 2023 at 8:22 AM AEST, Nadav Amit wrote:
>
>
> > On Jan 18, 2023, at 12:00 AM, Nicholas Piggin  wrote:
> > 
> > +static void do_shoot_lazy_tlb(void *arg)
> > +{
> > +   struct mm_struct *mm = arg;
> > +
> > +   if (current->active_mm == mm) {
> > +   WARN_ON_ONCE(current->mm);
> > +   current->active_mm = &init_mm;
> > +   switch_mm(mm, &init_mm, current);
> > +   }
> > +}
>
> I might be out of touch - doesn’t a flush already take place when we free
> the page-tables, at least on common cases on x86?
>
> IIUC exit_mmap() would free page-tables, and whenever page-tables are
> freed, on x86, we do shootdown regardless of whether the target CPU TLB state
> marks is_lazy. Then, flush_tlb_func() should call switch_mm_irqs_off() and
> everything should be fine, no?
>
> [ I understand you care about powerpc, just wondering on the effect on x86 ]

Now I come to think of it, Rik had done this for x86 a while back.

https://lore.kernel.org/all/20180728215357.3249-10-r...@surriel.com/

I didn't know about it when I wrote this, so I never dug into why it
didn't get merged. It might have missed the final __mmdrop races but
I'm not sure, x86 lazy tlb mode is too complicated to know at a
glance. I would check with him though.

Thanks,
Nick


Re: [PATCH v6 4/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN

2023-01-18 Thread Nicholas Piggin
On Thu Jan 19, 2023 at 3:30 AM AEST, Linus Torvalds wrote:
> [ Adding a few more x86 and arm64 maintainers - while linux-arch is
> the right mailing list, I'm not convinced people actually follow it
> all that closely ]
>
> On Wed, Jan 18, 2023 at 12:00 AM Nicholas Piggin  wrote:
> >
> > On a 16-socket 192-core POWER8 system, a context switching benchmark
> > with as many software threads as CPUs (so each switch will go in and
> > out of idle), upstream can achieve a rate of about 1 million context
> > switches per second, due to contention on the mm refcount.
> >
> > 64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
> > the option. This increases the above benchmark to 118 million context
> > switches per second.
>
> Well, the 1M -> 118M change does seem like a good reason for this series.

It was an artificial corner case, mind you. I don't think it's a reason
to panic and likely smaller systems with faster atomics will care far
less than our big 2-hop systems.

Benchmark is will-it-scale:

  ./context_switch1_threads -t 768
  min:2174 max:2690 total:1827952

33.52%  [k] finish_task_switch
27.26%  [k] interrupt_return
22.66%  [k] __schedule
 2.30%  [k] _raw_spin_trylock

  ./context_switch1_threads -t 1536
  min:103000 max:120100 total:177201906

The top case has half as many switching pairs as available CPUs, which
makes them all switch the same mm between real and lazy. The bottom case
just switches between user threads, so it doesn't hit the lazy
refcount.

> The patches certainly don't look offensive to me, so Ack as far as I'm
> concerned, but honestly, it's been some time since I've personally
> been active on the idle and lazy TLB code, so that ack is probably
> largely worthless.
>
> If anything, my main reaction to this all is to wonder whether the
> config option is a good idea - maybe we could do this unconditionally,
> and make the source code (and logic) simpler to follow when you don't
> have to worry about the CONFIG_MMU_LAZY_TLB_REFCOUNT option.
>
> I wouldn't be surprised to hear that x86 can have the same issue where
> the mm_struct refcount is a bigger issue than the possibility of an
> extra TLB shootdown at the final exit time.
>
> But having the config options as a way to switch people over gradually
> (and perhaps then removing it later) doesn't sound wrong to me either.

IMO it's trivial enough that we could carry both, but everything's a
straw on the camel's back, so if we can consolidate it would always be
preferable. Let's see how it plays out for a few releases.

> And I personally find the argument in patch 3/5 fairly convincing:
>
>   Shootdown IPIs cost could be an issue, but they have not been observed
>   to be a serious problem with this scheme, because short-lived processes
>   tend not to migrate CPUs much, therefore they don't get much chance to
>   leave lazy tlb mm references on remote CPUs.
>
> Andy? PeterZ? Catalin?
>
> Nick - it might be good to link to the actual benchmark, and let
> people who have access to big machines perhaps just try it out on
> non-powerpc platforms...

Yep good point, I'll put it in the changelog. I might submit another
round to Andrew in a bit with acks and any minor tweaks and minus the
last patch, assuming no major changes or objections.

Thanks,
Nick


Re: [PATCH v2 3/5] dt-bindings: usb: Convert OMAP OHCI/EHCI bindings to schema

2023-01-18 Thread kernel test robot
Hi Rob,

I love your patch! Perhaps something to improve:

[auto build test WARNING on 1b929c02afd37871d5afb9d498426f83432e71c2]

url:
https://github.com/intel-lab-lkp/linux/commits/Rob-Herring/dt-bindings-usb-Remove-obsolete-brcm-bcm3384-usb-txt/20230119-030120
base:   1b929c02afd37871d5afb9d498426f83432e71c2
patch link:
https://lore.kernel.org/r/20230110-dt-usb-v2-3-926bc1260e51%40kernel.org
patch subject: [PATCH v2 3/5] dt-bindings: usb: Convert OMAP OHCI/EHCI bindings 
to schema
reproduce:
# 
https://github.com/intel-lab-lkp/linux/commit/e7220b26de1a7fcd192feec481c1a90f7bf5c949
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review 
Rob-Herring/dt-bindings-usb-Remove-obsolete-brcm-bcm3384-usb-txt/20230119-030120
git checkout e7220b26de1a7fcd192feec481c1a90f7bf5c949
make menuconfig
# enable CONFIG_COMPILE_TEST, CONFIG_WARN_MISSING_DOCUMENTS, 
CONFIG_WARN_ABI_ERRORS
make htmldocs

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot 

All warnings (new ones prefixed by >>):

>> Warning: Documentation/devicetree/bindings/mfd/omap-usb-host.txt references 
>> a file that doesn't exist: 
>> Documentation/devicetree/bindings/usb/ehci-generic.yaml
>> Warning: Documentation/devicetree/bindings/mfd/omap-usb-host.txt references 
>> a file that doesn't exist: 
>> Documentation/devicetree/bindings/usb/ohci-generic.yaml

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests


Re: Memory transaction instructions

2023-01-18 Thread Nicholas Piggin
On Wed Jan 18, 2023 at 7:05 PM AEST, David Howells wrote:
> Linus Torvalds  wrote:
>
> > And for the kernel, where we don't have bad locking, and where we
> > actually use fine-grained locks that are _near_ the data that we are
> > locking (the lockref of the dcache is obviously one example of that,
> > but the skbuff queue you mention is almost certainly exactly the same
> > situation): the lock is right by the data that the lock protects, and
> > the "shared lock cacheline" model simply does not work. You'll bounce
> > the data, and most likely you'll also touch the same lock cacheline
> > too.
>
> Yeah.  The reason I was actually wondering about them was if it would be
> possible to avoid the requirement to disable interrupts/softirqs to, say,
> modify the skbuff queue.  On some arches actually disabling irqs is quite a
> heavy operation (I think this is/was true on ppc64, for example; it certainly
> was on frv) and it was necessary to "emulate" the disablement.

Not too bad on modern ppc64. Changing MSR in general has to flush the
pipe and even re-fetch, because it can alter memory translation among
other things, so it was heavy. Everything we support has a lightweight
MSR change that just modifies the interrupt enable bit and only needs
minor serialisation (although we still have that software-irq-disable
thing which avoids the heavy MSR problem on old CPUs).

Thanks,
Nick


Re: [PATCH v3 04/24] powerpc/secvar: Handle format string in the consumer

2023-01-18 Thread Nicholas Piggin
On Wed Jan 18, 2023 at 4:10 PM AEST, Andrew Donnellan wrote:
> From: Russell Currey 
>
> The code that handles the format string in secvar-sysfs.c is entirely
> OPAL specific, so create a new "format" op in secvar_operations to make
> the secvar code more generic.  No functional change.
>
> Signed-off-by: Russell Currey 
> Signed-off-by: Andrew Donnellan 
>
> ---
>
> v2: Use sysfs_emit() instead of sprintf() (gregkh)
>
> v3: Enforce format string size limit (ruscur)
> ---
>  arch/powerpc/include/asm/secvar.h|  3 +++
>  arch/powerpc/kernel/secvar-sysfs.c   | 23 --
>  arch/powerpc/platforms/powernv/opal-secvar.c | 25 
>  3 files changed, 33 insertions(+), 18 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/secvar.h 
> b/arch/powerpc/include/asm/secvar.h
> index 07ba36f868a7..8b6475589120 100644
> --- a/arch/powerpc/include/asm/secvar.h
> +++ b/arch/powerpc/include/asm/secvar.h
> @@ -11,12 +11,15 @@
>  #include 
>  #include 
>  
> +#define SECVAR_MAX_FORMAT_LEN 30 // max length of string returned by 
> ->format()
> +
>  extern const struct secvar_operations *secvar_ops;
>  
>  struct secvar_operations {
>   int (*get)(const char *key, u64 key_len, u8 *data, u64 *data_size);
>   int (*get_next)(const char *key, u64 *key_len, u64 keybufsize);
>   int (*set)(const char *key, u64 key_len, u8 *data, u64 data_size);
> + ssize_t (*format)(char *buf);
>  };
>  
>  #ifdef CONFIG_PPC_SECURE_BOOT
> diff --git a/arch/powerpc/kernel/secvar-sysfs.c 
> b/arch/powerpc/kernel/secvar-sysfs.c
> index 462cacc0ca60..d3858eedd72c 100644
> --- a/arch/powerpc/kernel/secvar-sysfs.c
> +++ b/arch/powerpc/kernel/secvar-sysfs.c
> @@ -21,26 +21,13 @@ static struct kset *secvar_kset;
>  static ssize_t format_show(struct kobject *kobj, struct kobj_attribute *attr,
>  char *buf)
>  {
> - ssize_t rc = 0;
> - struct device_node *node;
> - const char *format;
> -
> - node = of_find_compatible_node(NULL, NULL, "ibm,secvar-backend");
> - if (!of_device_is_available(node)) {
> - rc = -ENODEV;
> - goto out;
> - }
> + char tmp[SECVAR_MAX_FORMAT_LEN];
> + ssize_t len = secvar_ops->format(tmp);
>  
> - rc = of_property_read_string(node, "format", &format);
> - if (rc)
> - goto out;
> + if (len <= 0)
> + return -EIO;

AFAIKS this does have a functional change, it loses the return value.
Why not return len if it is < 0, and -EIO if len == 0?
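Something along these lines, say (sketch only; the tail of format_show()
presumably keeps the sysfs_emit() call introduced earlier in the series):

	char tmp[SECVAR_MAX_FORMAT_LEN];
	ssize_t len = secvar_ops->format(tmp);

	if (len < 0)
		return len;	/* pass the backend's error code through */
	if (len == 0)
		return -EIO;

	/* ... then emit tmp into buf as the patch already does ... */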

Thanks,
Nick


Re: [PATCH v3 13/24] powerpc/pseries: Fix handling of PLPKS object flushing timeout

2023-01-18 Thread Nicholas Piggin
On Wed Jan 18, 2023 at 4:10 PM AEST, Andrew Donnellan wrote:
> plpks_confirm_object_flushed() uses the H_PKS_CONFIRM_OBJECT_FLUSHED hcall
> to check whether changes to an object in the Platform KeyStore have been
> flushed to non-volatile storage.
>
> The hcall returns two output values, the return code and the flush status.
> plpks_confirm_object_flushed() polls the hcall until either the flush
> status has updated, the return code is an error, or a timeout has been
> exceeded.
>
> While we're still polling, the hcall is returning H_SUCCESS (0) as the
> return code. In the timeout case, this means that upon exiting the polling
> loop, rc is 0, and therefore 0 is returned to the user.
>
> Handle the timeout case separately and return ETIMEDOUT if triggered.
>
> Fixes: 2454a7af0f2a ("powerpc/pseries: define driver for Platform KeyStore")

Can fixes go to the start of the series?

Thanks,
Nick


Re: [PATCH v3 16/24] powerpc/pseries: Implement signed update for PLPKS objects

2023-01-18 Thread Nicholas Piggin
On Wed Jan 18, 2023 at 4:10 PM AEST, Andrew Donnellan wrote:
> From: Nayna Jain 
>
> The Platform Keystore provides a signed update interface which can be used
> to create, replace or append to certain variables in the PKS in a secure
> fashion, with the hypervisor requiring that the update be signed using the
> Platform Key.
>
> Implement an interface to the H_PKS_SIGNED_UPDATE hcall in the plpks
> driver to allow signed updates to PKS objects.
>
> (The plpks driver doesn't need to do any cryptography or otherwise handle
> the actual signed variable contents - that will be handled by userspace
> tooling.)
>
> Signed-off-by: Nayna Jain 
> [ajd: split patch, add timeout handling and misc cleanups]
> Co-developed-by: Andrew Donnellan 
> Signed-off-by: Andrew Donnellan 
> Signed-off-by: Russell Currey 
>
> ---
>
> v3: Merge plpks fixes and signed update series with secvar series
>
> Fix error code handling in plpks_confirm_object_flushed() (ruscur)
>
> Pass plpks_var struct to plpks_signed_update_var() by reference (mpe)
>
> Consistent constant naming scheme (ruscur)
> ---
>  arch/powerpc/include/asm/hvcall.h  |  3 +-
>  arch/powerpc/include/asm/plpks.h   |  5 ++
>  arch/powerpc/platforms/pseries/plpks.c | 71 --
>  3 files changed, 73 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/hvcall.h 
> b/arch/powerpc/include/asm/hvcall.h
> index 95fd7f9485d5..33b26c0cb69b 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -336,7 +336,8 @@
>  #define H_SCM_FLUSH  0x44C
>  #define H_GET_ENERGY_SCALE_INFO  0x450
>  #define H_WATCHDOG   0x45C
> -#define MAX_HCALL_OPCODE H_WATCHDOG
> +#define H_PKS_SIGNED_UPDATE  0x454
> +#define MAX_HCALL_OPCODE H_PKS_SIGNED_UPDATE

^ Bad rebase.
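H_PKS_SIGNED_UPDATE is 0x454, which is lower than H_WATCHDOG's 0x45C, so this
both misplaces the new opcode and moves MAX_HCALL_OPCODE backwards. Presumably
the intent is something like the following, keeping the list sorted and
MAX_HCALL_OPCODE at the largest opcode:

	#define H_GET_ENERGY_SCALE_INFO	0x450
	#define H_PKS_SIGNED_UPDATE	0x454
	#define H_WATCHDOG		0x45C
	#define MAX_HCALL_OPCODE	H_WATCHDOG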

Thanks,
Nick


Re: [PATCH v3 08/24] powerpc/secvar: Allow backend to populate static list of variable names

2023-01-18 Thread Nicholas Piggin
On Wed Jan 18, 2023 at 4:10 PM AEST, Andrew Donnellan wrote:
> Currently, the list of variables is populated by calling
> secvar_ops->get_next() repeatedly, which is explicitly modelled on the
> OPAL API (including the keylen parameter).
>
> For the upcoming PLPKS backend, we have a static list of variable names.
> It is messy to fit that into get_next(), so instead, let the backend put
> a NULL-terminated array of variable names into secvar_ops->var_names,
> which will be used if get_next() is undefined.
>
> Signed-off-by: Andrew Donnellan 
> Signed-off-by: Russell Currey 
>
> ---
>
> v3: New patch (ajd/mpe)
> ---
>  arch/powerpc/include/asm/secvar.h  |  4 ++
>  arch/powerpc/kernel/secvar-sysfs.c | 67 --
>  2 files changed, 50 insertions(+), 21 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/secvar.h 
> b/arch/powerpc/include/asm/secvar.h
> index ebf95386d720..c8bee1834b54 100644
> --- a/arch/powerpc/include/asm/secvar.h
> +++ b/arch/powerpc/include/asm/secvar.h
> @@ -23,6 +23,10 @@ struct secvar_operations {
>   ssize_t (*format)(char *buf);
>   int (*max_size)(u64 *max_size);
>   const struct attribute **config_attrs;
> +
> + // NULL-terminated array of fixed variable names
> + // Only used if get_next() isn't provided
> + const char * const *var_names;

The other way you could go is provide a sysfs_init() ops call here,
and export the add_var as a library function that backends can use.
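A rough shape of that alternative might be the following (names and signatures
are only illustrative here, not an existing API):

	struct secvar_operations {
		int (*get)(const char *key, u64 key_len, u8 *data, u64 *data_size);
		int (*get_next)(const char *key, u64 *key_len, u64 keybufsize);
		int (*set)(const char *key, u64 key_len, u8 *data, u64 data_size);
		ssize_t (*format)(char *buf);
		int (*max_size)(u64 *max_size);
		const struct attribute **config_attrs;

		/* backend hook invoked at secvar sysfs init time */
		int (*sysfs_init)(void);
	};

	/* exported by secvar-sysfs.c so a backend's sysfs_init() can add entries */
	int secvar_sysfs_add_var(const char *name);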

Thanks,
Nick


Re: [PATCH v3 04/24] powerpc/secvar: Handle format string in the consumer

2023-01-18 Thread Nicholas Piggin
On Wed Jan 18, 2023 at 4:10 PM AEST, Andrew Donnellan wrote:
> From: Russell Currey 
>
> The code that handles the format string in secvar-sysfs.c is entirely
> OPAL specific, so create a new "format" op in secvar_operations to make
> the secvar code more generic.  No functional change.
>
> Signed-off-by: Russell Currey 
> Signed-off-by: Andrew Donnellan 
>
> ---
>
> v2: Use sysfs_emit() instead of sprintf() (gregkh)
>
> v3: Enforce format string size limit (ruscur)
> ---
>  arch/powerpc/include/asm/secvar.h|  3 +++
>  arch/powerpc/kernel/secvar-sysfs.c   | 23 --
>  arch/powerpc/platforms/powernv/opal-secvar.c | 25 
>  3 files changed, 33 insertions(+), 18 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/secvar.h 
> b/arch/powerpc/include/asm/secvar.h
> index 07ba36f868a7..8b6475589120 100644
> --- a/arch/powerpc/include/asm/secvar.h
> +++ b/arch/powerpc/include/asm/secvar.h
> @@ -11,12 +11,15 @@
>  #include 
>  #include 
>  
> +#define SECVAR_MAX_FORMAT_LEN 30 // max length of string returned by 
> ->format()
> +
>  extern const struct secvar_operations *secvar_ops;
>  
>  struct secvar_operations {
>   int (*get)(const char *key, u64 key_len, u8 *data, u64 *data_size);
>   int (*get_next)(const char *key, u64 *key_len, u64 keybufsize);
>   int (*set)(const char *key, u64 key_len, u8 *data, u64 data_size);
> + ssize_t (*format)(char *buf);

Maybe pass the buf size as an argument here? That would be a bit less
error prone and more flexible than finding the right #define for it.
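That is, roughly (sketch only):

	ssize_t (*format)(char *buf, size_t bufsize);

with the sysfs consumer then passing its own buffer size:

	char tmp[SECVAR_MAX_FORMAT_LEN];
	ssize_t len = secvar_ops->format(tmp, sizeof(tmp));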

Thanks,
Nick


Re: [PATCH v3 02/24] powerpc/secvar: WARN_ON_ONCE() if multiple secvar ops are set

2023-01-18 Thread Nicholas Piggin
On Wed Jan 18, 2023 at 4:10 PM AEST, Andrew Donnellan wrote:
> From: Russell Currey 
>
> The secvar code only supports one consumer at a time.
>
> Multiple consumers aren't possible at this point in time, but we'd want
> it to be obvious if it ever could happen.
>
> Signed-off-by: Russell Currey 
> Signed-off-by: Andrew Donnellan 
> ---
>  arch/powerpc/kernel/secvar-ops.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/secvar-ops.c 
> b/arch/powerpc/kernel/secvar-ops.c
> index 6a29777d6a2d..aa1b2adc2710 100644
> --- a/arch/powerpc/kernel/secvar-ops.c
> +++ b/arch/powerpc/kernel/secvar-ops.c
> @@ -8,10 +8,12 @@
>  
>  #include 
>  #include 
> +#include 
>  
> -const struct secvar_operations *secvar_ops __ro_after_init;
> +const struct secvar_operations *secvar_ops __ro_after_init = NULL;
>  
>  void set_secvar_ops(const struct secvar_operations *ops)
>  {
> + WARN_ON_ONCE(secvar_ops);
>   secvar_ops = ops;

You could make it return an error here, plus a two-line patch in the
caller to handle the error, and then things wouldn't get corrupted.
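For example (sketch only; the error value is picked arbitrarily):

	int set_secvar_ops(const struct secvar_operations *ops)
	{
		if (WARN_ON_ONCE(secvar_ops))
			return -EBUSY;

		secvar_ops = ops;
		return 0;
	}

with the powernv/pseries caller growing the corresponding two-line check of
the return value.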

Thanks,
Nick


Re: [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme

2023-01-18 Thread Nicholas Piggin
On Thu Jan 19, 2023 at 8:22 AM AEST, Nadav Amit wrote:
>
>
> > On Jan 18, 2023, at 12:00 AM, Nicholas Piggin  wrote:
> > 
> > +static void do_shoot_lazy_tlb(void *arg)
> > +{
> > +   struct mm_struct *mm = arg;
> > +
> > +   if (current->active_mm == mm) {
> > +   WARN_ON_ONCE(current->mm);
> > +   current->active_mm = &init_mm;
> > +   switch_mm(mm, &init_mm, current);
> > +   }
> > +}
>
> I might be out of touch - doesn’t a flush already take place when we free
> the page-tables, at least on common cases on x86?
>
> IIUC exit_mmap() would free page-tables, and whenever page-tables are
> freed, on x86, we do shootdown regardless of whether the target CPU TLB state
> marks is_lazy. Then, flush_tlb_func() should call switch_mm_irqs_off() and
> everything should be fine, no?
>
> [ I understand you care about powerpc, just wondering on the effect on x86 ]

If you can easily piggyback on IPI work you already do in exit_mmap then
that's likely to be preferable. I don't know the details of x86 these
days but there is some discussion about it in last year's thread, it
sounded quite feasible.

This is still required at final __mmdrop() time because it's still
possible that lazy mm refs will need to be cleaned. exit_mmap() itself
explicitly creates one, so if the __mmdrop() runs on a different CPU,
then there's one. kthreads using the mm could create others. If that
part of it is unclear or under-commented, I can try to improve it.

Thanks,
Nick



Re: [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme

2023-01-18 Thread Nadav Amit



> On Jan 18, 2023, at 12:00 AM, Nicholas Piggin  wrote:
> 
> +static void do_shoot_lazy_tlb(void *arg)
> +{
> + struct mm_struct *mm = arg;
> +
> + if (current->active_mm == mm) {
> + WARN_ON_ONCE(current->mm);
> + current->active_mm = &init_mm;
> + switch_mm(mm, &init_mm, current);
> + }
> +}

I might be out of touch - doesn’t a flush already take place when we free
the page-tables, at least on common cases on x86?

IIUC exit_mmap() would free page-tables, and whenever page-tables are
freed, on x86, we do shootdown regardless of whether the target CPU TLB state
marks is_lazy. Then, flush_tlb_func() should call switch_mm_irqs_off() and
everything should be fine, no?

[ I understand you care about powerpc, just wondering on the effect on x86 ]



[PATCH] of: Fix of platform build on powerpc due to bad of display code

2023-01-18 Thread Michal Suchanek


The commit 2d681d6a23a1 ("of: Make of framebuffer devices unique")
breaks the build because of a wrong argument to snprintf. That certainly
avoids the runtime error, but it is not the intended outcome.

Fixes: 2d681d6a23a1 ("of: Make of framebuffer devices unique")
Signed-off-by: Michal Suchanek 
---
 drivers/of/platform.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/of/platform.c b/drivers/of/platform.c
index f2a5d679a324..e9dd7371f27a 100644
--- a/drivers/of/platform.c
+++ b/drivers/of/platform.c
@@ -564,10 +564,10 @@ static int __init of_platform_default_populate_init(void)
}
 
for_each_node_by_type(node, "display") {
-   char *buf[14];
+   char buf[14];
if (!of_get_property(node, "linux,opened", NULL) || 
node == boot_display)
continue;
-   ret = snprintf(buf, "of-display-%d", display_number++);
+   ret = snprintf(buf, sizeof(buf), "of-display-%d", display_number++);
if (ret >= sizeof(buf))
continue;
of_platform_device_create(node, buf, NULL);
-- 
2.35.3
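
For reference, snprintf() returns the length the formatted string would have
had without truncation, so the ret >= sizeof(buf) check in the hunk above is
what catches a truncated "of-display-N" name. A standalone illustration
(userspace, for demonstration only):

	#include <stdio.h>

	int main(void)
	{
		char buf[14];
		int ret = snprintf(buf, sizeof(buf), "of-display-%d", 123456);

		/* ret is the untruncated length; ret >= sizeof(buf) means truncation */
		/* prints: ret=17 truncated=1 buf="of-display-12" */
		printf("ret=%d truncated=%d buf=\"%s\"\n",
		       ret, ret >= (int)sizeof(buf), buf);
		return 0;
	}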



Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call

2023-01-18 Thread Suren Baghdasaryan
On Wed, Jan 18, 2023 at 1:33 PM Michal Hocko  wrote:
>
> On Wed 18-01-23 10:09:29, Suren Baghdasaryan wrote:
> > On Wed, Jan 18, 2023 at 1:23 AM Michal Hocko  wrote:
> > >
> > > On Tue 17-01-23 18:01:01, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko  wrote:
> > > > >
> > > > > On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> > > > > > Move VMA flag modification (which now implies VMA locking) before
> > > > > > anon_vma_lock_write to match the locking order of page fault 
> > > > > > handler.
> > > > >
> > > > > Does this changelog assumes per vma locking in the #PF?
> > > >
> > > > Hmm, you are right. Page fault handlers do not use per-vma locks yet
> > > > but the changelog already talks about that. Maybe I should change it
> > > > to simply:
> > > > ```
> > > > Move VMA flag modification (which now implies VMA locking) before
> > > > vma_adjust_trans_huge() to ensure the modifications are done after VMA
> > > > has been locked.
> > >
> > > Because 
> >
> > because vma_adjust_trans_huge() modifies the VMA and such
> > modifications should be done under VMA write-lock protection.
>
> So it will become:
> Move VMA flag modification (which now implies VMA locking) before
> vma_adjust_trans_huge() to ensure the modifications are done after VMA
> has been locked. Because vma_adjust_trans_huge() modifies the VMA and such
> modifications should be done under VMA write-lock protection.
>
> which is effectively saying
> vma_adjust_trans_huge() modifies the VMA and such modifications should
> be done under VMA write-lock protection so move VMA flag modifications
> before so all of them are covered by the same write protection.
>
> right?

Yes, and the wording in the latter version is simpler to understand
IMO, so I would like to adopt it. Do you agree?

> --
> Michal Hocko
> SUSE Labs


Re: [PATCH] of: Make of framebuffer devices unique

2023-01-18 Thread Michal Suchánek
On Wed, Jan 18, 2023 at 09:13:05PM +0100, Erhard F. wrote:
> On Tue, 17 Jan 2023 17:58:04 +0100
> Michal Suchanek  wrote:
> 
> > Since Linux 5.19 this error is observed:
> > 
> > sysfs: cannot create duplicate filename '/devices/platform/of-display'
> > 
> > This is because multiple devices with the same name 'of-display' are
> > created on the same bus.
> > 
> > Update the code to create numbered device names for the non-boot
> > display.
> > 
> > cc: linuxppc-dev@lists.ozlabs.org
> > References: https://bugzilla.kernel.org/show_bug.cgi?id=216095
> > Fixes: 52b1b46c39ae ("of: Create platform devices for OF framebuffers")
> > Reported-by: Erhard F. 
> > Suggested-by: Thomas Zimmermann 
> > Signed-off-by: Michal Suchanek 
> > ---
> >  drivers/of/platform.c | 8 +++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/of/platform.c b/drivers/of/platform.c
> > index 81c8c227ab6b..f2a5d679a324 100644
> > --- a/drivers/of/platform.c
> > +++ b/drivers/of/platform.c
> > @@ -525,6 +525,7 @@ static int __init 
> > of_platform_default_populate_init(void)
> > if (IS_ENABLED(CONFIG_PPC)) {
> > struct device_node *boot_display = NULL;
> > struct platform_device *dev;
> > +   int display_number = 1;
> > int ret;
> >  
> > /* Check if we have a MacOS display without a node spec */
> > @@ -561,10 +562,15 @@ static int __init 
> > of_platform_default_populate_init(void)
> > boot_display = node;
> > break;
> > }
> > +
> > for_each_node_by_type(node, "display") {
> > +   char *buf[14];
> > if (!of_get_property(node, "linux,opened", NULL) || 
> > node == boot_display)
> > continue;
> > -   of_platform_device_create(node, "of-display", NULL);
> > +   ret = snprintf(buf, "of-display-%d", display_number++);
> > +   if (ret >= sizeof(buf))
> > +   continue;
> > +   of_platform_device_create(node, buf, NULL);
> > }
> >  
> > } else {
> > -- 
> > 2.35.3
> > 
> 
> Thank you for the patch Michal!
> 
> It applies on 6.2-rc4 but I get this build error with my config:

Indeed, it's doubly bad.

Where is the kernel test robot when you need it?

It should not be that easy to miss this file but clearly it can happen.

I will send a fixup.

Sorry about the mess.

Thanks

Michal


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-18 Thread Suren Baghdasaryan
On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko  wrote:
>
> On Wed 18-01-23 09:36:44, Suren Baghdasaryan wrote:
> > On Wed, Jan 18, 2023 at 7:11 AM 'Michal Hocko' via kernel-team
> >  wrote:
> > >
> > > On Wed 18-01-23 14:23:32, Jann Horn wrote:
> > > > On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko  wrote:
> > > > > On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > > > > > +locking maintainers
> > > > > >
> > > > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan 
> > > > > >  wrote:
> > > > > > > Introduce a per-VMA rw_semaphore to be used during page fault 
> > > > > > > handling
> > > > > > > instead of mmap_lock. Because there are cases when multiple VMAs 
> > > > > > > need
> > > > > > > to be exclusively locked during VMA tree modifications, instead 
> > > > > > > of the
> > > > > > > usual lock/unlock pattern we mark a VMA as locked by taking 
> > > > > > > per-VMA lock
> > > > > > > exclusively and setting vma->lock_seq to the current 
> > > > > > > mm->lock_seq. When
> > > > > > > mmap_write_lock holder is done with all modifications and drops 
> > > > > > > mmap_lock,
> > > > > > > it will increment mm->lock_seq, effectively unlocking all VMAs 
> > > > > > > marked as
> > > > > > > locked.
> > > > > > [...]
> > > > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > > +{
> > > > > > > +   up_read(&vma->lock);
> > > > > > > +}
> > > > > >
> > > > > > One thing that might be gnarly here is that I think you might not be
> > > > > > allowed to use up_read() to fully release ownership of an object -
> > > > > > from what I remember, I think that up_read() (unlike something like
> > > > > > spin_unlock()) can access the lock object after it's already been
> > > > > > acquired by someone else.
> > > > >
> > > > > Yes, I think you are right. From a look into the code it seems that
> > > > > the UAF is quite unlikely as there is a ton of work to be done between
> > > > > vma_write_lock used to prepare vma for removal and actual removal.
> > > > > That doesn't make it less of a problem though.
> > > > >
> > > > > > So if you want to protect against concurrent
> > > > > > deletion, this might have to be something like:
> > > > > >
> > > > > > rcu_read_lock(); /* keeps vma alive */
> > > > > > up_read(&vma->lock);
> > > > > > rcu_read_unlock();
> > > > > >
> > > > > > But I'm not entirely sure about that, the locking folks might know 
> > > > > > better.
> > > > >
> > > > > I am not a locking expert but to me it looks like this should work
> > > > > because the final cleanup would have to happen rcu_read_unlock.
> > > > >
> > > > > Thanks, I have completely missed this aspect of the locking when 
> > > > > looking
> > > > > into the code.
> > > > >
> > > > > Btw. looking at this again I have fully realized how hard it is 
> > > > > actually
> > > > > to see that vm_area_free is guaranteed to sync up with ongoing 
> > > > > readers.
> > > > > vma manipulation functions like __adjust_vma make my head spin. Would 
> > > > > it
> > > > > make more sense to have a rcu style synchronization point in
> > > > > vm_area_free directly before call_rcu? This would add an overhead of
> > > > > uncontended down_write of course.
> > > >
> > > > Something along those lines might be a good idea, but I think that
> > > > rather than synchronizing the removal, it should maybe be something
> > > > that splats (and bails out?) if it detects pending readers. If we get
> > > > to vm_area_free() on a VMA that has pending readers, we might already
> > > > be in a lot of trouble because the concurrent readers might have been
> > > > traversing page tables while we were tearing them down or fun stuff
> > > > like that.
> > > >
> > > > I think maybe Suren was already talking about something like that in
> > > > another part of this patch series but I don't remember...
> > >
> > > This http://lkml.kernel.org/r/20230109205336.3665937-27-sur...@google.com?
> >
> > Yes, I spent a lot of time ensuring that __adjust_vma locks the right
> > VMAs and that VMAs are freed or isolated under VMA write lock
> > protection to exclude any readers. If the VM_BUG_ON_VMA in the patch
> > Michal mentioned gets hit then it's a bug in my design and I'll have
> > to fix it. But please, let's not add synchronize_rcu() in the
> > vm_area_free().
>
> Just to clarify. I didn't suggest to add synchronize_rcu into
> vm_area_free. What I really meant was a synchronize_rcu-like primitive to
> effectively synchronize with any potential pending read locker (so
> something like vma_write_lock, or whatever it is called). The point is
> that vma freeing is an event all readers should be notified about.

I don't think readers need to be notified if we are ensuring that the
VMA is not used by anyone else and is not reachable by the readers.
This is currently done by write-locking the VMA either before removing
it from the tree or before freeing it.

> This can be done explicitly for each and every vma before vm_area_free
> is called but this is just hard to review and easy to break over time.

Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call

2023-01-18 Thread Michal Hocko
On Wed 18-01-23 10:09:29, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 1:23 AM Michal Hocko  wrote:
> >
> > On Tue 17-01-23 18:01:01, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko  wrote:
> > > >
> > > > On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> > > > > Move VMA flag modification (which now implies VMA locking) before
> > > > > anon_vma_lock_write to match the locking order of page fault handler.
> > > >
> > > > Does this changelog assumes per vma locking in the #PF?
> > >
> > > Hmm, you are right. Page fault handlers do not use per-vma locks yet
> > > but the changelog already talks about that. Maybe I should change it
> > > to simply:
> > > ```
> > > Move VMA flag modification (which now implies VMA locking) before
> > > vma_adjust_trans_huge() to ensure the modifications are done after VMA
> > > has been locked.
> >
> > Because 
> 
> because vma_adjust_trans_huge() modifies the VMA and such
> modifications should be done under VMA write-lock protection.

So it will become:
Move VMA flag modification (which now implies VMA locking) before
vma_adjust_trans_huge() to ensure the modifications are done after VMA
has been locked. Because vma_adjust_trans_huge() modifies the VMA and such
modifications should be done under VMA write-lock protection.

which is effectively saying
vma_adjust_trans_huge() modifies the VMA and such modifications should
be done under VMA write-lock protection so move VMA flag modifications
before so all of them are covered by the same write protection.

right?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code

2023-01-18 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 6:44 PM Matthew Wilcox  wrote:
>
> On Tue, Jan 17, 2023 at 05:06:57PM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:47 AM Michal Hocko  wrote:
> > >
> > > On Mon 09-01-23 12:53:23, Suren Baghdasaryan wrote:
> > > > Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> > > > page fault handling. When VMA is not found, can't be locked or changes
> > > > after being locked, the function returns NULL. The lookup is performed
> > > > under RCU protection to prevent the found VMA from being destroyed 
> > > > before
> > > > the VMA lock is acquired. VMA lock statistics are updated according to
> > > > the results.
> > > > For now only anonymous VMAs can be searched this way. In other cases the
> > > > function returns NULL.
> > >
> > > Could you describe why only anonymous vmas are handled at this stage and
> > > what (roughly) has to be done to support other vmas? lock_vma_under_rcu
> > > doesn't seem to have any anonymous vma specific requirements AFAICS.
> >
> > TBH I haven't spent too much time looking into file-backed page faults
> > yet but a couple of tasks I can think of are:
> > - Ensure that all vma->vm_ops->fault() handlers do not rely on
> > mmap_lock being read-locked;
>
> I think this way lies madness.  There are just too many device drivers
> that implement ->fault.  My plan is to call the ->map_pages() method
> under RCU without even read-locking the VMA.  If that doesn't satisfy
> the fault, then drop all the way back to taking the mmap_sem for read
> before calling into ->fault.

Sounds reasonable to me but I guess the devil is in the details...

>


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-18 Thread Michal Hocko
On Wed 18-01-23 09:36:44, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 7:11 AM 'Michal Hocko' via kernel-team
>  wrote:
> >
> > On Wed 18-01-23 14:23:32, Jann Horn wrote:
> > > On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko  wrote:
> > > > On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > > > > +locking maintainers
> > > > >
> > > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  
> > > > > wrote:
> > > > > > Introduce a per-VMA rw_semaphore to be used during page fault 
> > > > > > handling
> > > > > > instead of mmap_lock. Because there are cases when multiple VMAs 
> > > > > > need
> > > > > > to be exclusively locked during VMA tree modifications, instead of 
> > > > > > the
> > > > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA 
> > > > > > lock
> > > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. 
> > > > > > When
> > > > > > mmap_write_lock holder is done with all modifications and drops 
> > > > > > mmap_lock,
> > > > > > it will increment mm->lock_seq, effectively unlocking all VMAs 
> > > > > > marked as
> > > > > > locked.
> > > > > [...]
> > > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > +{
> > > > > > +   up_read(&vma->lock);
> > > > > > +}
> > > > >
> > > > > One thing that might be gnarly here is that I think you might not be
> > > > > allowed to use up_read() to fully release ownership of an object -
> > > > > from what I remember, I think that up_read() (unlike something like
> > > > > spin_unlock()) can access the lock object after it's already been
> > > > > acquired by someone else.
> > > >
> > > > Yes, I think you are right. From a look into the code it seems that
> > > > the UAF is quite unlikely as there is a ton of work to be done between
> > > > vma_write_lock used to prepare vma for removal and actual removal.
> > > > That doesn't make it less of a problem though.
> > > >
> > > > > So if you want to protect against concurrent
> > > > > deletion, this might have to be something like:
> > > > >
> > > > > rcu_read_lock(); /* keeps vma alive */
> > > > > up_read(&vma->lock);
> > > > > rcu_read_unlock();
> > > > >
> > > > > But I'm not entirely sure about that, the locking folks might know 
> > > > > better.
> > > >
> > > > I am not a locking expert but to me it looks like this should work
> > > > because the final cleanup would have to happen rcu_read_unlock.
> > > >
> > > > Thanks, I have completely missed this aspect of the locking when looking
> > > > into the code.
> > > >
> > > > Btw. looking at this again I have fully realized how hard it is actually
> > > > to see that vm_area_free is guaranteed to sync up with ongoing readers.
> > > > vma manipulation functions like __adjust_vma make my head spin. Would it
> > > > make more sense to have a rcu style synchronization point in
> > > > vm_area_free directly before call_rcu? This would add an overhead of
> > > > uncontended down_write of course.
> > >
> > > Something along those lines might be a good idea, but I think that
> > > rather than synchronizing the removal, it should maybe be something
> > > that splats (and bails out?) if it detects pending readers. If we get
> > > to vm_area_free() on a VMA that has pending readers, we might already
> > > be in a lot of trouble because the concurrent readers might have been
> > > traversing page tables while we were tearing them down or fun stuff
> > > like that.
> > >
> > > I think maybe Suren was already talking about something like that in
> > > another part of this patch series but I don't remember...
> >
> > This http://lkml.kernel.org/r/20230109205336.3665937-27-sur...@google.com?
> 
> Yes, I spent a lot of time ensuring that __adjust_vma locks the right
> VMAs and that VMAs are freed or isolated under VMA write lock
> protection to exclude any readers. If the VM_BUG_ON_VMA in the patch
> Michal mentioned gets hit then it's a bug in my design and I'll have
> to fix it. But please, let's not add synchronize_rcu() in the
> vm_area_free().

Just to clarify. I didn't suggest to add synchronize_rcu into
vm_area_free. What I really meant was a synchronize_rcu-like primitive to
effectively synchronize with any potential pending read locker (so
something like vma_write_lock, or whatever it is called). The point is
that vma freeing is an event all readers should be notified about.
This can be done explicitly for each and every vma before vm_area_free
is called but this is just hard to review and easy to break over time.
See my point?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-18 Thread Paul E. McKenney
On Wed, Jan 18, 2023 at 11:01:08AM -0800, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney  wrote:
> >
> > On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote:
> > > On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko  wrote:
> > > >
> > > > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> > > > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko  wrote:
> > > > > >
> > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > call_rcu() can take a long time when callback offloading is 
> > > > > > > enabled.
> > > > > > > Its use in the vm_area_free can cause regressions in the exit 
> > > > > > > path when
> > > > > > > multiple VMAs are being freed.
> > > > > >
> > > > > > What kind of regressions.
> > > > > >
> > > > > > > To minimize that impact, place VMAs into
> > > > > > > a list and free them in groups using one call_rcu() call per 
> > > > > > > group.
> > > > > >
> > > > > > Please add some data to justify this additional complexity.
> > > > >
> > > > > Sorry, should have done that in the first place. A 4.3% regression was
> > > > > noticed when running execl test from unixbench suite. spawn test also
> > > > > showed 1.6% regression. Profiling revealed that vma freeing was taking
> > > > > longer due to call_rcu() which is slow when RCU callback offloading is
> > > > > enabled.
> > > >
> > > > Could you be more specific? vma freeing is async with the RCU so how
> > > > come this has resulted in a regression? Is there any heavy
> > > > rcu_synchronize in the exec path? That would be an interesting
> > > > information.
> > >
> > > No, there is no heavy rcu_synchronize() or any other additional
> > > synchronous load in the exit path. It's the call_rcu() which can block
> > > the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of
> > > other call_rcu()'s going on in parallel. Note that call_rcu() calls
> > > rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling
> > > revealed that this function was taking multiple ms (don't recall the
> > > actual number, sorry). Paul's explanation implied that this happens
> > > due to contention on the locks taken in this function. For more
> > > in-depth details I'll have to ask Paul for help :) This code is quite
> > > complex and I don't know all the details of RCU implementation.
> >
> > There are a couple of possibilities here.
> >
> > First, if I am remembering correctly, the time between the call_rcu()
> > and invocation of the corresponding callback was taking multiple seconds,
> > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > order to save power by batching RCU work over multiple call_rcu()
> > invocations.  If this is causing a problem for a given call site, the
> > shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> > to the old-school non-laziness, but can of course consume more power.
> 
> That would not be the case because CONFIG_LAZY_RCU was not an option
> at the time I was profiling this issue.
> Lazy RCU would be a great option to replace this patch but
> unfortunately it's not the default behavior, so I would still have to
> implement this batching in case lazy RCU is not enabled.
> 
> > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > and the invocation of the corresponding callback in kernels built with
> > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
> > of this delay is to avoid lock contention, and so this delay is incurred
> > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > invoking call_rcu() at least 16 times within a given jiffy will incur
> > the added delay.  The reason for this delay is the use of a separate
> > ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> > lock contention on the main ->cblist.  This is not needed in old-school
> > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > (including most datacenter kernels) because in that case the callbacks
> > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > that there is no need for locks.
> 
> I believe this is the reason in my profiled case.
> 
> >
> > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > call_rcu() in the common case (for example, without the aid of interrupts,
> > NMIs, or SMIs), please do let me know.  That sounds to me like a bug.
> 
> I don't think I've seen such a case.

Whew!!!  ;-)

> Thanks for clarifications, Paul!

No problem!

Thanx, Paul


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-18 Thread Suren Baghdasaryan
On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney  wrote:
>
> On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote:
> > On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko  wrote:
> > >
> > > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko  wrote:
> > > > >
> > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > Its use in the vm_area_free can cause regressions in the exit path 
> > > > > > when
> > > > > > multiple VMAs are being freed.
> > > > >
> > > > > What kind of regressions.
> > > > >
> > > > > > To minimize that impact, place VMAs into
> > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > >
> > > > > Please add some data to justify this additional complexity.
> > > >
> > > > Sorry, should have done that in the first place. A 4.3% regression was
> > > > noticed when running execl test from unixbench suite. spawn test also
> > > > showed 1.6% regression. Profiling revealed that vma freeing was taking
> > > > longer due to call_rcu() which is slow when RCU callback offloading is
> > > > enabled.
> > >
> > > Could you be more specific? vma freeing is async with the RCU so how
> > > come this has resulted in a regression? Is there any heavy
> > > rcu_synchronize in the exec path? That would be an interesting
> > > information.
> >
> > No, there is no heavy rcu_synchronize() or any other additional
> > synchronous load in the exit path. It's the call_rcu() which can block
> > the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of
> > other call_rcu()'s going on in parallel. Note that call_rcu() calls
> > rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling
> > revealed that this function was taking multiple ms (don't recall the
> > actual number, sorry). Paul's explanation implied that this happens
> > due to contention on the locks taken in this function. For more
> > in-depth details I'll have to ask Paul for help :) This code is quite
> > complex and I don't know all the details of RCU implementation.
>
> There are a couple of possibilities here.
>
> First, if I am remembering correctly, the time between the call_rcu()
> and invocation of the corresponding callback was taking multiple seconds,
> but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> order to save power by batching RCU work over multiple call_rcu()
> invocations.  If this is causing a problem for a given call site, the
> shiny new call_rcu_hurry() can be used instead.  Doing this gets back
> to the old-school non-laziness, but can of course consume more power.

That would not be the case because CONFIG_LAZY_RCU was not an option
at the time I was profiling this issue.
Lazy RCU would be a great option to replace this patch but
unfortunately it's not the default behavior, so I would still have to
implement this batching in case lazy RCU is not enabled.

>
> Second, there is a much shorter one-jiffy delay between the call_rcu()
> and the invocation of the corresponding callback in kernels built with
> either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
> of this delay is to avoid lock contention, and so this delay is incurred
> only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> invoking call_rcu() at least 16 times within a given jiffy will incur
> the added delay.  The reason for this delay is the use of a separate
> ->nocb_bypass list.  As Suren says, this bypass list is used to reduce
> lock contention on the main ->cblist.  This is not needed in old-school
> kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> (including most datacenter kernels) because in that case the callbacks
> enqueued by call_rcu() are touched only by the corresponding CPU, so
> that there is no need for locks.

I believe this is the reason in my profiled case.

>
> Third, if you are instead seeing multiple milliseconds of CPU consumed by
> call_rcu() in the common case (for example, without the aid of interrupts,
> NMIs, or SMIs), please do let me know.  That sounds to me like a bug.

I don't think I've seen such a case.
Thanks for clarifications, Paul!

>
> Or have I lost track of some other slow case?
>
> Thanx, Paul
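
The batching that the changelog describes ("place VMAs into a list and free
them in groups using one call_rcu() call per group") amounts to roughly the
following. This is an illustration only, with made-up names and batch size,
not the code from the patch:

	#define VMA_FREE_BATCH	32

	struct vma_free_batch {
		struct rcu_head rcu;
		int nr;
		struct vm_area_struct *vmas[VMA_FREE_BATCH];
	};

	static void vma_free_batch_rcu(struct rcu_head *head)
	{
		struct vma_free_batch *b = container_of(head, struct vma_free_batch, rcu);
		int i;

		/* Runs after a grace period; safe to free the whole group. */
		for (i = 0; i < b->nr; i++)
			kmem_cache_free(vm_area_cachep, b->vmas[i]);
		kfree(b);
	}

	static void vma_free_batch_flush(struct vma_free_batch *b)
	{
		/* One call_rcu() covers the whole group instead of one per VMA. */
		if (b->nr)
			call_rcu(&b->rcu, vma_free_batch_rcu);
		else
			kfree(b);
	}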


[PATCH v2 2/5] dt-bindings: usb: Convert multiple "usb-ohci" bindings to DT schema

2023-01-18 Thread Rob Herring
"usb-ohci" is another "generic" OHCI controller compatible string used by
several platforms. Add it to the generic-ohci.yaml schema and remove all
the old binding docs.

Marvell pxa-usb.txt has "usb-ohci" in the example, but actual users don't,
so drop it.

Signed-off-by: Rob Herring 
---
v2:
 - Fix transceiver property in if/then schema
 - Move OMAP to separate patch
---
 .../devicetree/bindings/powerpc/nintendo/wii.txt   | 10 ---
 .../devicetree/bindings/usb/generic-ohci.yaml  | 28 +++--
 Documentation/devicetree/bindings/usb/ohci-nxp.txt | 24 ---
 Documentation/devicetree/bindings/usb/pxa-usb.txt  |  2 +-
 .../devicetree/bindings/usb/spear-usb.txt  | 35 --
 5 files changed, 26 insertions(+), 73 deletions(-)

diff --git a/Documentation/devicetree/bindings/powerpc/nintendo/wii.txt 
b/Documentation/devicetree/bindings/powerpc/nintendo/wii.txt
index c4d78f28d23c..3ff6ebbb4998 100644
--- a/Documentation/devicetree/bindings/powerpc/nintendo/wii.txt
+++ b/Documentation/devicetree/bindings/powerpc/nintendo/wii.txt
@@ -97,16 +97,6 @@ Nintendo Wii device tree
- reg : should contain the EXI registers location and length
- interrupts : should contain the EXI interrupt
 
-1.g) The Open Host Controller Interface (OHCI) nodes
-
-  Represent the USB 1.x Open Host Controller Interfaces.
-
-  Required properties:
-
-   - compatible : should be "nintendo,hollywood-usb-ohci","usb-ohci"
-   - reg : should contain the OHCI registers location and length
-   - interrupts : should contain the OHCI interrupt
-
 1.h) The Enhanced Host Controller Interface (EHCI) node
 
   Represents the USB 2.0 Enhanced Host Controller Interface.
diff --git a/Documentation/devicetree/bindings/usb/generic-ohci.yaml 
b/Documentation/devicetree/bindings/usb/generic-ohci.yaml
index 4fcbd0add49d..8492d809ba40 100644
--- a/Documentation/devicetree/bindings/usb/generic-ohci.yaml
+++ b/Documentation/devicetree/bindings/usb/generic-ohci.yaml
@@ -6,9 +6,6 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
 
 title: USB OHCI Controller
 
-allOf:
-  - $ref: "usb-hcd.yaml"
-
 maintainers:
   - Greg Kroah-Hartman 
 
@@ -50,6 +47,13 @@ properties:
   - snps,hsdk-v1.0-ohci
   - const: generic-ohci
   - const: generic-ohci
+  - items:
+  - enum:
+  - cavium,octeon-6335-ohci
+  - nintendo,hollywood-usb-ohci
+  - nxp,ohci-nxp
+  - st,spear600-ohci
+  - const: usb-ohci
 
   reg:
 maxItems: 1
@@ -119,11 +123,29 @@ properties:
   - host
   - otg
 
+  transceiver:
+$ref: /schemas/types.yaml#/definitions/phandle
+description:
+  The associated ISP1301 device. Necessary for the UDC controller for
+  connecting to the USB physical layer.
+
 required:
   - compatible
   - reg
   - interrupts
 
+allOf:
+  - $ref: usb-hcd.yaml
+  - if:
+  not:
+properties:
+  compatible:
+contains:
+  const: nxp,ohci-nxp
+then:
+  properties:
+transceiver: false
+
 additionalProperties: false
 
 examples:
diff --git a/Documentation/devicetree/bindings/usb/ohci-nxp.txt 
b/Documentation/devicetree/bindings/usb/ohci-nxp.txt
deleted file mode 100644
index 71e28c1017ed..
--- a/Documentation/devicetree/bindings/usb/ohci-nxp.txt
+++ /dev/null
@@ -1,24 +0,0 @@
-* OHCI controller, NXP ohci-nxp variant
-
-Required properties:
-- compatible: must be "nxp,ohci-nxp"
-- reg: physical base address of the controller and length of memory mapped
-  region.
-- interrupts: The OHCI interrupt
-- transceiver: phandle of the associated ISP1301 device - this is necessary for
-   the UDC controller for connecting to the USB physical layer
-
-Example (LPC32xx):
-
-   isp1301: usb-transceiver@2c {
-   compatible = "nxp,isp1301";
-   reg = <0x2c>;
-   };
-
-   ohci@3102 {
-   compatible = "nxp,ohci-nxp";
-   reg = <0x3102 0x300>;
-   interrupt-parent = <&mic>;
-   interrupts = <0x3b 0>;
-   transceiver = <&isp1301>;
-   };
diff --git a/Documentation/devicetree/bindings/usb/pxa-usb.txt 
b/Documentation/devicetree/bindings/usb/pxa-usb.txt
index 9c331799b87c..53fdae4fa6f6 100644
--- a/Documentation/devicetree/bindings/usb/pxa-usb.txt
+++ b/Documentation/devicetree/bindings/usb/pxa-usb.txt
@@ -22,7 +22,7 @@ Optional properties:
 Example:
 
usb0: ohci@4c00 {
-   compatible = "marvell,pxa-ohci", "usb-ohci";
+   compatible = "marvell,pxa-ohci";
reg = <0x4c00 0x10>;
interrupts = <18>;
marvell,enable-port1;
diff --git a/Documentation/devicetree/bindings/usb/spear-usb.txt 
b/Documentation/devicetree/bindings/usb/spear-usb.txt
deleted file mode 100644
index 1dc91cc459c0..
--- a/Documentation/devicetree/bindings/usb/spear-usb.txt
+++ /dev/null
@@ -1

[PATCH v2 4/5] dt-bindings: usb: Convert Marvell Orion EHCI to DT schema

2023-01-18 Thread Rob Herring
The Marvell Orion EHCI binding is just some compatible strings, so add it
to the generic-ehci.yaml schema.

Signed-off-by: Rob Herring 
---
 .../devicetree/bindings/usb/ehci-orion.txt | 22 --
 .../devicetree/bindings/usb/generic-ehci.yaml  |  2 ++
 2 files changed, 2 insertions(+), 22 deletions(-)

diff --git a/Documentation/devicetree/bindings/usb/ehci-orion.txt 
b/Documentation/devicetree/bindings/usb/ehci-orion.txt
deleted file mode 100644
index 2855bae79fda..
--- a/Documentation/devicetree/bindings/usb/ehci-orion.txt
+++ /dev/null
@@ -1,22 +0,0 @@
-* EHCI controller, Orion Marvell variants
-
-Required properties:
-- compatible: must be one of the following
-   "marvell,orion-ehci"
-   "marvell,armada-3700-ehci"
-- reg: physical base address of the controller and length of memory mapped
-  region.
-- interrupts: The EHCI interrupt
-
-Optional properties:
-- clocks: reference to the clock
-- phys: reference to the USB PHY
-- phy-names: name of the USB PHY, should be "usb"
-
-Example:
-
-   ehci@5 {
-   compatible = "marvell,orion-ehci";
-   reg = <0x5 0x1000>;
-   interrupts = <19>;
-   };
diff --git a/Documentation/devicetree/bindings/usb/generic-ehci.yaml 
b/Documentation/devicetree/bindings/usb/generic-ehci.yaml
index 2d382ae424da..ebbb01b39a92 100644
--- a/Documentation/devicetree/bindings/usb/generic-ehci.yaml
+++ b/Documentation/devicetree/bindings/usb/generic-ehci.yaml
@@ -74,6 +74,8 @@ properties:
   - const: usb-ehci
   - enum:
   - generic-ehci
+  - marvell,armada-3700-ehci
+  - marvell,orion-ehci
   - ti,ehci-omap
   - usb-ehci
 

-- 
2.39.0



[PATCH v2 3/5] dt-bindings: usb: Convert OMAP OHCI/EHCI bindings to schema

2023-01-18 Thread Rob Herring
The OMAP OHCI and EHCI USB host bindings follow the generic binding, so
add the compatibles and remove the old txt binding docs.

The examples in omap-usb-host.txt don't match actual users, so update
them, dropping the fallback compatible.

Signed-off-by: Rob Herring 
---
v2:
 - New patch
---
 .../devicetree/bindings/mfd/omap-usb-host.txt  |  8 +++---
 .../devicetree/bindings/usb/ehci-omap.txt  | 31 --
 .../devicetree/bindings/usb/generic-ehci.yaml  |  1 +
 .../devicetree/bindings/usb/generic-ohci.yaml  |  4 ++-
 .../devicetree/bindings/usb/ohci-omap3.txt | 15 ---
 5 files changed, 8 insertions(+), 51 deletions(-)

diff --git a/Documentation/devicetree/bindings/mfd/omap-usb-host.txt 
b/Documentation/devicetree/bindings/mfd/omap-usb-host.txt
index aa1eaa59581b..7ce5800dd36d 100644
--- a/Documentation/devicetree/bindings/mfd/omap-usb-host.txt
+++ b/Documentation/devicetree/bindings/mfd/omap-usb-host.txt
@@ -64,8 +64,8 @@ Required properties if child node exists:
 Properties for children:
 
 The OMAP HS USB Host subsystem contains EHCI and OHCI controllers.
-See Documentation/devicetree/bindings/usb/ehci-omap.txt and
-Documentation/devicetree/bindings/usb/ohci-omap3.txt.
+See Documentation/devicetree/bindings/usb/generic-ehci.yaml and
+Documentation/devicetree/bindings/usb/generic-ohci.yaml.
 
 Example for OMAP4:
 
@@ -78,14 +78,14 @@ usbhshost: usbhshost@4a064000 {
ranges;
 
usbhsohci: ohci@4a064800 {
-   compatible = "ti,ohci-omap3", "usb-ohci";
+   compatible = "ti,ohci-omap3";
reg = <0x4a064800 0x400>;
interrupt-parent = <&gic>;
interrupts = <0 76 0x4>;
};
 
usbhsehci: ehci@4a064c00 {
-   compatible = "ti,ehci-omap", "usb-ehci";
+   compatible = "ti,ehci-omap";
reg = <0x4a064c00 0x400>;
interrupt-parent = <&gic>;
interrupts = <0 77 0x4>;
diff --git a/Documentation/devicetree/bindings/usb/ehci-omap.txt 
b/Documentation/devicetree/bindings/usb/ehci-omap.txt
deleted file mode 100644
index d77e11a975a2..
--- a/Documentation/devicetree/bindings/usb/ehci-omap.txt
+++ /dev/null
@@ -1,31 +0,0 @@
-OMAP HS USB EHCI controller
-
-This device is usually the child of the omap-usb-host
-Documentation/devicetree/bindings/mfd/omap-usb-host.txt
-
-Required properties:
-
-- compatible: should be "ti,ehci-omap"
-- reg: should contain one register range i.e. start and length
-- interrupts: description of the interrupt line
-
-Optional properties:
-
-- phys: list of phandles to PHY nodes.
-  This property is required if at least one of the ports are in
-  PHY mode i.e. OMAP_EHCI_PORT_MODE_PHY
-
-To specify the port mode, see
-Documentation/devicetree/bindings/mfd/omap-usb-host.txt
-
-Example for OMAP4:
-
-usbhsehci: ehci@4a064c00 {
-   compatible = "ti,ehci-omap";
-   reg = <0x4a064c00 0x400>;
-   interrupts = <0 77 0x4>;
-};
-
-&usbhsehci {
-   phys = <&hsusb1_phy 0 &hsusb3_phy>;
-};
diff --git a/Documentation/devicetree/bindings/usb/generic-ehci.yaml 
b/Documentation/devicetree/bindings/usb/generic-ehci.yaml
index 994818cb6044..2d382ae424da 100644
--- a/Documentation/devicetree/bindings/usb/generic-ehci.yaml
+++ b/Documentation/devicetree/bindings/usb/generic-ehci.yaml
@@ -74,6 +74,7 @@ properties:
   - const: usb-ehci
   - enum:
   - generic-ehci
+  - ti,ehci-omap
   - usb-ehci
 
   reg:
diff --git a/Documentation/devicetree/bindings/usb/generic-ohci.yaml 
b/Documentation/devicetree/bindings/usb/generic-ohci.yaml
index 8492d809ba40..a9ba7257b884 100644
--- a/Documentation/devicetree/bindings/usb/generic-ohci.yaml
+++ b/Documentation/devicetree/bindings/usb/generic-ohci.yaml
@@ -46,7 +46,9 @@ properties:
   - ingenic,jz4740-ohci
   - snps,hsdk-v1.0-ohci
   - const: generic-ohci
-  - const: generic-ohci
+  - enum:
+  - generic-ohci
+  - ti,ohci-omap3
   - items:
   - enum:
   - cavium,octeon-6335-ohci
diff --git a/Documentation/devicetree/bindings/usb/ohci-omap3.txt 
b/Documentation/devicetree/bindings/usb/ohci-omap3.txt
deleted file mode 100644
index ce8c47cff6d0..
--- a/Documentation/devicetree/bindings/usb/ohci-omap3.txt
+++ /dev/null
@@ -1,15 +0,0 @@
-OMAP HS USB OHCI controller (OMAP3 and later)
-
-Required properties:
-
-- compatible: should be "ti,ohci-omap3"
-- reg: should contain one register range i.e. start and length
-- interrupts: description of the interrupt line
-
-Example for OMAP4:
-
-usbhsohci: ohci@4a064800 {
-   compatible = "ti,ohci-omap3";
-   reg = <0x4a064800 0x400>;
-   interrupts = <0 76 0x4>;
-};

-- 
2.39.0



[PATCH v2 1/5] dt-bindings: usb: Remove obsolete brcm,bcm3384-usb.txt

2023-01-18 Thread Rob Herring
The "brcm,bcm3384-ohci" and "brcm,bcm3384-ehci" compatibles are already
documented in generic-ohci.yaml and generic-ehci.yaml, respectively, so
remove the old txt binding.

Signed-off-by: Rob Herring 
---
 Documentation/devicetree/bindings/usb/brcm,bcm3384-usb.txt | 11 ---
 1 file changed, 11 deletions(-)

diff --git a/Documentation/devicetree/bindings/usb/brcm,bcm3384-usb.txt 
b/Documentation/devicetree/bindings/usb/brcm,bcm3384-usb.txt
deleted file mode 100644
index 452c45c7bf29..
--- a/Documentation/devicetree/bindings/usb/brcm,bcm3384-usb.txt
+++ /dev/null
@@ -1,11 +0,0 @@
-* Broadcom USB controllers
-
-Required properties:
-- compatible: "brcm,bcm3384-ohci", "brcm,bcm3384-ehci"
-
-  These currently use the generic-ohci and generic-ehci drivers.  On some
-  systems, special handling may be needed in the following cases:
-
-  - Restoring state after systemwide power save modes
-  - Sharing PHYs with the USBD (UDC) hardware
-  - Figuring out which controllers are disabled on ASIC bondout variants

-- 
2.39.0



[PATCH v2 5/5] dt-bindings: usb: Convert Nuvoton EHCI to DT schema

2023-01-18 Thread Rob Herring
The Nuvoton EHCI binding is just some compatible strings, so add it to the
generic-ehci.yaml schema.

Signed-off-by: Rob Herring 
---
 .../devicetree/bindings/usb/generic-ehci.yaml|  2 ++
 .../devicetree/bindings/usb/npcm7xx-usb.txt  | 20 
 2 files changed, 2 insertions(+), 20 deletions(-)

diff --git a/Documentation/devicetree/bindings/usb/generic-ehci.yaml 
b/Documentation/devicetree/bindings/usb/generic-ehci.yaml
index ebbb01b39a92..050cfd5acdaa 100644
--- a/Documentation/devicetree/bindings/usb/generic-ehci.yaml
+++ b/Documentation/devicetree/bindings/usb/generic-ehci.yaml
@@ -76,6 +76,8 @@ properties:
   - generic-ehci
   - marvell,armada-3700-ehci
   - marvell,orion-ehci
+  - nuvoton,npcm750-ehci
+  - nuvoton,npcm845-ehci
   - ti,ehci-omap
   - usb-ehci
 
diff --git a/Documentation/devicetree/bindings/usb/npcm7xx-usb.txt 
b/Documentation/devicetree/bindings/usb/npcm7xx-usb.txt
deleted file mode 100644
index 352a0a1e2f76..
--- a/Documentation/devicetree/bindings/usb/npcm7xx-usb.txt
+++ /dev/null
@@ -1,20 +0,0 @@
-Nuvoton NPCM7XX SoC USB controllers:
--
-
-EHCI:
--
-
-Required properties:
-- compatible: should be one of
-"nuvoton,npcm750-ehci"
-"nuvoton,npcm845-ehci"
-- interrupts: Should contain the EHCI interrupt
-- reg:Physical address and length of the register set for the device
-
-Example:
-
-   ehci1: usb@f0806000 {
-   compatible = "nuvoton,npcm750-ehci";
-   reg = <0xf0806000 0x1000>;
-   interrupts = <0 61 4>;
-   };

-- 
2.39.0



[PATCH v2 0/5] dt-bindings: usb: Convert some more simple OHCI/EHCI bindings

2023-01-18 Thread Rob Herring
The 'usb-ohci' compatible is another 'generic' compatible for OHCI, but 
it isn't documented with a schema. Let's add it to the generic-ohci.yaml 
schema. While looking at this, I found a few other USB host bindings 
which are simple enough to use the 'generic' schemas.

Signed-off-by: Rob Herring 
---
Changes in v2:
- Fix schema error for 'transceiver'
- Split OMAP changes to separate patch and convert omap-ehci
- Link to v1: 
https://lore.kernel.org/r/20230110-dt-usb-v1-0-8e366e326...@kernel.org

---
Rob Herring (5):
  dt-bindings: usb: Remove obsolete brcm,bcm3384-usb.txt
  dt-bindings: usb: Convert multiple "usb-ohci" bindings to DT schema
  dt-bindings: usb: Convert OMAP OHCI/EHCI bindings to schema
  dt-bindings: usb: Convert Marvell Orion EHCI to DT schema
  dt-bindings: usb: Convert Nuvoton EHCI to DT schema

 .../devicetree/bindings/mfd/omap-usb-host.txt  |  8 ++---
 .../devicetree/bindings/powerpc/nintendo/wii.txt   | 10 ---
 .../devicetree/bindings/usb/brcm,bcm3384-usb.txt   | 11 ---
 .../devicetree/bindings/usb/ehci-omap.txt  | 31 ---
 .../devicetree/bindings/usb/ehci-orion.txt | 22 --
 .../devicetree/bindings/usb/generic-ehci.yaml  |  5 
 .../devicetree/bindings/usb/generic-ohci.yaml  | 32 +---
 .../devicetree/bindings/usb/npcm7xx-usb.txt| 20 -
 Documentation/devicetree/bindings/usb/ohci-nxp.txt | 24 ---
 .../devicetree/bindings/usb/ohci-omap3.txt | 15 --
 Documentation/devicetree/bindings/usb/pxa-usb.txt  |  2 +-
 .../devicetree/bindings/usb/spear-usb.txt  | 35 --
 12 files changed, 38 insertions(+), 177 deletions(-)
---
base-commit: 1b929c02afd37871d5afb9d498426f83432e71c2
change-id: 20230110-dt-usb-1c637752a83b

Best regards,
-- 
Rob Herring 



Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-18 Thread Paul E. McKenney
On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko  wrote:
> >
> > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko  wrote:
> > > >
> > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > Its use in the vm_area_free can cause regressions in the exit path 
> > > > > when
> > > > > multiple VMAs are being freed.
> > > >
> > > > What kind of regressions.
> > > >
> > > > > To minimize that impact, place VMAs into
> > > > > a list and free them in groups using one call_rcu() call per group.
> > > >
> > > > Please add some data to justify this additional complexity.
> > >
> > > Sorry, should have done that in the first place. A 4.3% regression was
> > > noticed when running execl test from unixbench suite. spawn test also
> > > showed 1.6% regression. Profiling revealed that vma freeing was taking
> > > longer due to call_rcu() which is slow when RCU callback offloading is
> > > enabled.
> >
> > Could you be more specific? vma freeing is async with the RCU so how
> > come this has resulted in a regression? Is there any heavy
> > rcu_synchronize in the exec path? That would be an interesting
> > information.
> 
> No, there is no heavy rcu_synchronize() or any other additional
> synchronous load in the exit path. It's the call_rcu() which can block
> the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of
> other call_rcu()'s going on in parallel. Note that call_rcu() calls
> rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling
> revealed that this function was taking multiple ms (don't recall the
> actual number, sorry). Paul's explanation implied that this happens
> due to contention on the locks taken in this function. For more
> in-depth details I'll have to ask Paul for help :) This code is quite
> complex and I don't know all the details of RCU implementation.

There are a couple of possibilities here.

First, if I am remembering correctly, the time between the call_rcu()
and invocation of the corresponding callback was taking multiple seconds,
but that was because the kernel was built with CONFIG_RCU_LAZY=y in
order to save power by batching RCU work over multiple call_rcu()
invocations.  If this is causing a problem for a given call site, the
shiny new call_rcu_hurry() can be used instead.  Doing this gets back
to the old-school non-laziness, but can of course consume more power.
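
As a rough illustration of that last point (this is only a sketch; the
callback name, the vm_rcu field, and __vm_area_free() are assumed here,
not taken from the series under discussion):

	#include <linux/rcupdate.h>
	#include <linux/mm_types.h>

	/* hypothetical callback; all names are placeholders for this sketch */
	static void vm_area_free_rcu_cb(struct rcu_head *head)
	{
		struct vm_area_struct *vma = container_of(head,
					struct vm_area_struct, vm_rcu);

		__vm_area_free(vma);
	}

	void vm_area_free(struct vm_area_struct *vma)
	{
		/* call_rcu() may be lazily batched; call_rcu_hurry() is not */
		call_rcu_hurry(&vma->vm_rcu, vm_area_free_rcu_cb);
	}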

Second, there is a much shorter one-jiffy delay between the call_rcu()
and the invocation of the corresponding callback in kernels built with
either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
of this delay is to avoid lock contention, and so this delay is incurred
only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
invoking call_rcu() at least 16 times within a given jiffy will incur
the added delay.  The reason for this delay is the use of a separate
->nocb_bypass list.  As Suren says, this bypass list is used to reduce
lock contention on the main ->cblist.  This is not needed in old-school
kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
(including most datacenter kernels) because in that case the callbacks
enqueued by call_rcu() are touched only by the corresponding CPU, so
that there is no need for locks.

Third, if you are instead seeing multiple milliseconds of CPU consumed by
call_rcu() in the common case (for example, without the aid of interrupts,
NMIs, or SMIs), please do let me know.  That sounds to me like a bug.

Or have I lost track of some other slow case?

Thanx, Paul


Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call

2023-01-18 Thread Suren Baghdasaryan
On Wed, Jan 18, 2023 at 1:23 AM Michal Hocko  wrote:
>
> On Tue 17-01-23 18:01:01, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko  wrote:
> > >
> > > On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> > > > Move VMA flag modification (which now implies VMA locking) before
> > > > anon_vma_lock_write to match the locking order of page fault handler.
> > >
> > > Does this changelog assumes per vma locking in the #PF?
> >
> > Hmm, you are right. Page fault handlers do not use per-vma locks yet
> > but the changelog already talks about that. Maybe I should change it
> > to simply:
> > ```
> > Move VMA flag modification (which now implies VMA locking) before
> > vma_adjust_trans_huge() to ensure the modifications are done after VMA
> > has been locked.
>
> Because 

because vma_adjust_trans_huge() modifies the VMA and such
modifications should be done under VMA write-lock protection.

>
> Withtout that additional reasonaning it is not really clear why that is
> needed and seems arbitrary.

Would the above be a good reasoning?

>
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 26/41] kernel/fork: assert no VMA readers during its destruction

2023-01-18 Thread Suren Baghdasaryan
On Wed, Jan 18, 2023 at 1:43 AM 'Michal Hocko' via kernel-team
 wrote:
>
> On Tue 17-01-23 17:53:00, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:42 AM 'Michal Hocko' via kernel-team
> >  wrote:
> > >
> > > On Mon 09-01-23 12:53:21, Suren Baghdasaryan wrote:
> > > > Assert there are no holders of VMA lock for reading when it is about to 
> > > > be
> > > > destroyed.
> > > >
> > > > Signed-off-by: Suren Baghdasaryan 
> > > > ---
> > > >  include/linux/mm.h | 8 
> > > >  kernel/fork.c  | 2 ++
> > > >  2 files changed, 10 insertions(+)
> > > >
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > index 594e835bad9c..c464fc8a514c 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct 
> > > > vm_area_struct *vma)
> > > >   VM_BUG_ON_VMA(vma->vm_lock_seq != 
> > > > READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> > > >  }
> > > >
> > > > +static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> > > > +{
> > > > + VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
> > > > +   vma->vm_lock_seq != 
> > > > READ_ONCE(vma->vm_mm->mm_lock_seq),
> > > > +   vma);
> > >
> > > Do we really need to check for vm_lock_seq? rwsem_is_locked should tell
> > > us something is wrong on its own, no? This could be somebody racing with
> > > the vma destruction and using the write lock. Unlikely but I do not see
> > > why to narrow debugging scope.
> >
> > I wanted to ensure there are no page fault handlers (read-lockers)
> > when we are destroying the VMA and rwsem_is_locked(&vma->lock) alone
> > could trigger if someone is concurrently calling vma_write_lock(). But
> > I don't think we expect someone to be write-locking the VMA while we
>
> That would be UAF, no?

Yes. That's why what I have is an overkill (which is also racy).

>
> > are destroying it, so you are right, I'm overcomplicating things here.
> > I think I can get rid of vma_assert_no_reader() and add
> > VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock)) directly in
> > __vm_area_free(). WDYT?
>
> Yes, that adds some debugging. Not sure it is really necessary buyt it
> is VM_BUG_ON so why not.

I would like to keep it if possible. If it triggers that would be a
clear signal what the issue is. Otherwise it might be hard to debug
such a corner case.
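
For reference, a sketch of the simplified check discussed above (only the
placement of the assertion is shown; the rest of the function body is
elided):

	static void __vm_area_free(struct vm_area_struct *vma)
	{
		/* no reader (or writer) may still hold the per-VMA lock here */
		VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock), vma);
		/* ... actual freeing of the VMA ... */
	}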

> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@android.com.
>


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-18 Thread Suren Baghdasaryan
On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko  wrote:
>
> On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko  wrote:
> > >
> > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > multiple VMAs are being freed.
> > >
> > > What kind of regressions.
> > >
> > > > To minimize that impact, place VMAs into
> > > > a list and free them in groups using one call_rcu() call per group.
> > >
> > > Please add some data to justify this additional complexity.
> >
> > Sorry, should have done that in the first place. A 4.3% regression was
> > noticed when running execl test from unixbench suite. spawn test also
> > showed 1.6% regression. Profiling revealed that vma freeing was taking
> > longer due to call_rcu() which is slow when RCU callback offloading is
> > enabled.
>
> Could you be more specific? vma freeing is async with the RCU so how
> come this has resulted in a regression? Is there any heavy
> rcu_synchronize in the exec path? That would be an interesting
> information.

No, there is no heavy rcu_synchronize() or any other additional
synchronous load in the exit path. It's the call_rcu() which can block
the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of
other call_rcu()'s going on in parallel. Note that call_rcu() calls
rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling
revealed that this function was taking multiple ms (don't recall the
actual number, sorry). Paul's explanation implied that this happens
due to contention on the locks taken in this function. For more
in-depth details I'll have to ask Paul for help :) This code is quite
complex and I don't know all the details of RCU implementation.


> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page

2023-01-18 Thread Suren Baghdasaryan
On Wed, Jan 18, 2023 at 1:40 AM Michal Hocko  wrote:
>
> On Tue 17-01-23 21:28:06, Jann Horn wrote:
> > On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko  wrote:
> > > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > > > Protect VMA from concurrent page fault handler while collapsing a huge
> > > > page. Page fault handler needs a stable PMD to use PTL and relies on
> > > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > > > not be detected by a page fault handler without proper locking.
> > >
> > > I am struggling with this changelog. Maybe because my recollection of
> > > the THP collapsing subtleties is weak. But aren't you just trying to say
> > > that the current #PF handling and THP collapsing need to be mutually
> > > exclusive currently so in order to keep that assumption you have mark
> > > the vma write locked?
> > >
> > > Also it is not really clear to me how that handles other vmas which can
> > > share the same thp?
> >
> > It's not about the hugepage itself, it's about how the THP collapse
> > operation frees page tables.
> >
> > Before this series, page tables can be walked under any one of the
> > mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
> > unlinks and frees page tables, it must ensure that all of those either
> > are locked or don't exist. This series adds a fourth lock under which
> > page tables can be traversed, and so khugepaged must also lock out that one.
> >
> > There is a codepath in khugepaged that iterates through all mappings
> > of a file to zap page tables (retract_page_tables()), which locks each
> > visited mm with mmap_write_trylock() and now also does
> > vma_write_lock().
>
> OK, I see. This would be a great addendum to the changelog.

I'll add Jann's description in the changelog. Thanks Jann!

>
> > I think one aspect of this patch that might cause trouble later on, if
> > support for non-anonymous VMAs is added, is that retract_page_tables()
> > now does vma_write_lock() while holding the mapping lock; the page
> > fault handling path would probably take the locks the other way
> > around, leading to a deadlock? So the vma_write_lock() in
> > retract_page_tables() might have to become a trylock later on.
>
> This, right?
> #PF retract_page_tables
> vma_read_lock
> i_mmap_lock_write
> i_mmap_lock_read
> vma_write_lock
>
>
> I might be missing something but I have only found huge_pmd_share to be
> called from the #PF path. That one should be safe as it cannot be a
> target for THP. Not that it would matter much because such a dependency
> chain would be really subtle.
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 27/41] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration

2023-01-18 Thread Suren Baghdasaryan
On Wed, Jan 18, 2023 at 4:51 AM Jann Horn  wrote:
>
> On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  wrote:
> > Page fault handlers might need to fire MMU notifications while a new
> > notifier is being registered. Modify mm_take_all_locks to write-lock all
> > VMAs and prevent this race with fault handlers that would hold VMA locks.
> > VMAs are locked before i_mmap_rwsem and anon_vma to keep the same
> > locking order as in page fault handlers.
> >
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  mm/mmap.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 30c7d1c5206e..a256deca0bc0 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3566,6 +3566,7 @@ static void vm_lock_mapping(struct mm_struct *mm, 
> > struct address_space *mapping)
> >   * of mm/rmap.c:
> >   *   - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for
> >   * hugetlb mapping);
> > + *   - all vmas marked locked
>
> The existing comment above says that this is an *ordered* listing of
> which locks are taken.
>
> >   *   - all i_mmap_rwsem locks;
> >   *   - all anon_vma->rwseml
> >   *
> > @@ -3591,6 +3592,7 @@ int mm_take_all_locks(struct mm_struct *mm)
> > mas_for_each(&mas, vma, ULONG_MAX) {
> > if (signal_pending(current))
> > goto out_unlock;
> > +   vma_write_lock(vma);
> > if (vma->vm_file && vma->vm_file->f_mapping &&
> > is_vm_hugetlb_page(vma))
> > vm_lock_mapping(mm, vma->vm_file->f_mapping);
>
> Note that multiple VMAs can have the same ->f_mapping, so with this,
> the lock ordering between VMA locks and the mapping locks of hugetlb
> VMAs is mixed: If you have two adjacent hugetlb VMAs with the same
> ->f_mapping, then the following operations happen:
>
> 1. lock VMA 1
> 2. lock mapping of VMAs 1 and 2
> 3. lock VMA 2
> 4. [second vm_lock_mapping() is a no-op]
>
> So for VMA 1, we ended up taking the VMA lock first, but for VMA 2, we
> took the mapping lock first.
>
> The existing code has one loop per lock type to ensure that the locks
> really are taken in the specified order, even when some of the locks
> are associated with multiple VMAs.
>
> If we don't care about the ordering between these two, maybe that's
> fine and you just have to adjust the comment; but it would be clearer
> to add a separate loop for the VMA locks.

Oh, thanks for pointing out this detail. A separate loop is definitely
needed here. Will do that in the next respin.
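
For clarity, a sketch of the separate-loop ordering being discussed, based
on the quoted mm_take_all_locks() hunk (illustrative only, not the actual
respin):

	/* first pass: write-lock every VMA */
	mas_for_each(&mas, vma, ULONG_MAX) {
		if (signal_pending(current))
			goto out_unlock;
		vma_write_lock(vma);
	}

	/* second pass: take the hugetlb mapping locks */
	mas_set(&mas, 0);
	mas_for_each(&mas, vma, ULONG_MAX) {
		if (signal_pending(current))
			goto out_unlock;
		if (vma->vm_file && vma->vm_file->f_mapping &&
		    is_vm_hugetlb_page(vma))
			vm_lock_mapping(mm, vma->vm_file->f_mapping);
	}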

>
> > @@ -3677,6 +3679,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
> > if (vma->vm_file && vma->vm_file->f_mapping)
> > vm_unlock_mapping(vma->vm_file->f_mapping);
> > }
> > +   vma_write_unlock_mm(mm);
> >
> > mutex_unlock(&mm_all_locks_mutex);
> >  }
> > --
> > 2.39.0
> >


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-18 Thread Suren Baghdasaryan
On Wed, Jan 18, 2023 at 7:11 AM 'Michal Hocko' via kernel-team
 wrote:
>
> On Wed 18-01-23 14:23:32, Jann Horn wrote:
> > On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko  wrote:
> > > On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > > > +locking maintainers
> > > >
> > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  
> > > > wrote:
> > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA 
> > > > > lock
> > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. 
> > > > > When
> > > > > mmap_write_lock holder is done with all modifications and drops 
> > > > > mmap_lock,
> > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked 
> > > > > as
> > > > > locked.
> > > > [...]
> > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > +{
> > > > > +   up_read(&vma->lock);
> > > > > +}
> > > >
> > > > One thing that might be gnarly here is that I think you might not be
> > > > allowed to use up_read() to fully release ownership of an object -
> > > > from what I remember, I think that up_read() (unlike something like
> > > > spin_unlock()) can access the lock object after it's already been
> > > > acquired by someone else.
> > >
> > > Yes, I think you are right. From a look into the code it seems that
> > > the UAF is quite unlikely as there is a ton of work to be done between
> > > vma_write_lock used to prepare vma for removal and actual removal.
> > > That doesn't make it less of a problem though.
> > >
> > > > So if you want to protect against concurrent
> > > > deletion, this might have to be something like:
> > > >
> > > > rcu_read_lock(); /* keeps vma alive */
> > > > up_read(&vma->lock);
> > > > rcu_read_unlock();
> > > >
> > > > But I'm not entirely sure about that, the locking folks might know 
> > > > better.
> > >
> > > I am not a locking expert but to me it looks like this should work
> > > because the final cleanup would have to happen rcu_read_unlock.
> > >
> > > Thanks, I have completely missed this aspect of the locking when looking
> > > into the code.
> > >
> > > Btw. looking at this again I have fully realized how hard it is actually
> > > to see that vm_area_free is guaranteed to sync up with ongoing readers.
> > > vma manipulation functions like __adjust_vma make my head spin. Would it
> > > make more sense to have a rcu style synchronization point in
> > > vm_area_free directly before call_rcu? This would add an overhead of
> > > uncontended down_write of course.
> >
> > Something along those lines might be a good idea, but I think that
> > rather than synchronizing the removal, it should maybe be something
> > that splats (and bails out?) if it detects pending readers. If we get
> > to vm_area_free() on a VMA that has pending readers, we might already
> > be in a lot of trouble because the concurrent readers might have been
> > traversing page tables while we were tearing them down or fun stuff
> > like that.
> >
> > I think maybe Suren was already talking about something like that in
> > another part of this patch series but I don't remember...
>
> This http://lkml.kernel.org/r/20230109205336.3665937-27-sur...@google.com?

Yes, I spent a lot of time ensuring that __adjust_vma locks the right
VMAs and that VMAs are freed or isolated under VMA write lock
protection to exclude any readers. If the VM_BUG_ON_VMA in the patch
Michal mentioned gets hit then it's a bug in my design and I'll have
to fix it. But please, let's not add synchronize_rcu() in the
vm_area_free(). That will slow down any path that frees a VMA,
especially the exit path which might be freeing thousands of them. I
had an SPF version with synchronize_rcu() in the vm_area_free() and
phone vendors started yelling at me the very next day. call_rcu() with
CONFIG_RCU_NOCB_CPU (which Android uses for power saving purposes) is
already bad enough to show up in the benchmarks and that's why I had
to add call_rcu() batching in
https://lore.kernel.org/all/20230109205336.3665937-40-sur...@google.com.
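
For reference, a sketch of the RCU-protected unlock Jann suggested earlier
in this thread, applied to the helper from the patch (illustrative only,
not the final form):

	static inline void vma_read_unlock(struct vm_area_struct *vma)
	{
		rcu_read_lock();	/* keep the vma, and its lock, alive */
		up_read(&vma->lock);
		rcu_read_unlock();
	}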

>
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@android.com.
>


Re: [PATCH v6 4/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN

2023-01-18 Thread Linus Torvalds
[ Adding a few more x86 and arm64 maintainers - while linux-arch is
the right mailing list, I'm not convinced people actually follow it
all that closely ]

On Wed, Jan 18, 2023 at 12:00 AM Nicholas Piggin  wrote:
>
> On a 16-socket 192-core POWER8 system, a context switching benchmark
> with as many software threads as CPUs (so each switch will go in and
> out of idle), upstream can achieve a rate of about 1 million context
> switches per second, due to contention on the mm refcount.
>
> 64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
> the option. This increases the above benchmark to 118 million context
> switches per second.

Well, the 1M -> 118M change does seem like a good reason for this series.

The patches certainly don't look offensive to me, so Ack as far as I'm
concerned, but honestly, it's been some time since I've personally
been active on the idle and lazy TLB code, so that ack is probably
largely worthless.

If anything, my main reaction to this all is to wonder whether the
config option is a good idea - maybe we could do this unconditionally,
and make the source code (and logic) simpler to follow when you don't
have to worry about the CONFIG_MMU_LAZY_TLB_REFCOUNT option.

I wouldn't be surprised to hear that x86 can have the same issue where
the mm_struct refcount is a bigger issue than the possibility of an
extra TLB shootdown at the final exit time.

But having the config options as a way to switch people over gradually
(and perhaps then removing it later) doesn't sound wrong to me either.

And I personally find the argument in patch 3/5 fairly convincing:

  Shootdown IPIs cost could be an issue, but they have not been observed
  to be a serious problem with this scheme, because short-lived processes
  tend not to migrate CPUs much, therefore they don't get much chance to
  leave lazy tlb mm references on remote CPUs.

Andy? PeterZ? Catalin?

Nick - it might be good to link to the actual benchmark, and let
people who have access to big machines perhaps just try it out on
non-powerpc platforms...

   Linus


Re: [PATCH v9 00/10] phy: Add support for Lynx 10G SerDes

2023-01-18 Thread Vinod Koul
On 17-01-23, 11:46, Sean Anderson wrote:
> 
> I noticed that this series is marked "changes requested" on patchwork.
> However, I have received only automated feedback. I have done my best
> effort to address feedback I have received on prior revisions. I would
> appreciate getting another round of review before resending this series.

Looking at the series, it looks like kernel-bot sent some warnings on the
series, so I was expecting an updated series for review

-- 
~Vinod


Re: [PATCH] of: Make of framebuffer devices unique

2023-01-18 Thread Rob Herring


On Tue, 17 Jan 2023 17:58:04 +0100, Michal Suchanek wrote:
> Since Linux 5.19 this error is observed:
> 
> sysfs: cannot create duplicate filename '/devices/platform/of-display'
> 
> This is because multiple devices with the same name 'of-display' are
> created on the same bus.
> 
> Update the code to create numbered device names for the non-boot
> display.
> 
> cc: linuxppc-dev@lists.ozlabs.org
> References: https://bugzilla.kernel.org/show_bug.cgi?id=216095
> Fixes: 52b1b46c39ae ("of: Create platform devices for OF framebuffers")
> Reported-by: Erhard F. 
> Suggested-by: Thomas Zimmermann 
> Signed-off-by: Michal Suchanek 
> ---
>  drivers/of/platform.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 

Applied, thanks!


[PATCH v5 7/7] drm/i915/gt: use __xchg instead of internal helper

2023-01-18 Thread Andrzej Hajda
Prefer the core helper if available.

Signed-off-by: Andrzej Hajda 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c| 2 +-
 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 4 ++--
 drivers/gpu/drm/i915/gt/intel_execlists_submission.c | 4 ++--
 drivers/gpu/drm/i915/gt/intel_ggtt.c | 4 ++--
 drivers/gpu/drm/i915/gt/intel_gsc.c  | 2 +-
 drivers/gpu/drm/i915/gt/intel_gt.c   | 4 ++--
 drivers/gpu/drm/i915/gt/intel_gt_pm.c| 2 +-
 drivers/gpu/drm/i915/gt/intel_lrc.c  | 6 +++---
 drivers/gpu/drm/i915/gt/intel_migrate.c  | 2 +-
 drivers/gpu/drm/i915/gt/intel_rc6.c  | 2 +-
 drivers/gpu/drm/i915/gt/intel_rps.c  | 2 +-
 drivers/gpu/drm/i915/gt/selftest_context.c   | 2 +-
 drivers/gpu/drm/i915/gt/selftest_ring_submission.c   | 2 +-
 drivers/gpu/drm/i915/gt/selftest_timeline.c  | 2 +-
 drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c| 2 +-
 drivers/gpu/drm/i915/gt/uc/intel_uc.c| 2 +-
 drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c | 2 +-
 drivers/gpu/drm/i915/i915_utils.h| 1 +
 18 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 922f1bb22dc685..9712bfc2c6523d 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1042,7 +1042,7 @@ static void cleanup_status_page(struct intel_engine_cs 
*engine)
/* Prevent writes into HWSP after returning the page to the system */
intel_engine_set_hwsp_writemask(engine, ~0u);
 
-   vma = fetch_and_zero(&engine->status_page.vma);
+   vma = __xchg(&engine->status_page.vma, 0);
if (!vma)
return;
 
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c 
b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 9a527e1f5be655..09befcc6a84fa1 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -229,7 +229,7 @@ static void heartbeat(struct work_struct *wrk)
mutex_unlock(&ce->timeline->mutex);
 out:
if (!engine->i915->params.enable_hangcheck || !next_heartbeat(engine))
-   i915_request_put(fetch_and_zero(&engine->heartbeat.systole));
+   i915_request_put(__xchg(&engine->heartbeat.systole, 0));
intel_engine_pm_put(engine);
 }
 
@@ -244,7 +244,7 @@ void intel_engine_unpark_heartbeat(struct intel_engine_cs 
*engine)
 void intel_engine_park_heartbeat(struct intel_engine_cs *engine)
 {
if (cancel_delayed_work(&engine->heartbeat.work))
-   i915_request_put(fetch_and_zero(&engine->heartbeat.systole));
+   i915_request_put(__xchg(&engine->heartbeat.systole, 0));
 }
 
 void intel_gt_unpark_heartbeats(struct intel_gt *gt)
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c 
b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 18ffe55282e594..5c985e6fa1be2f 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -3199,7 +3199,7 @@ static void execlists_reset_cancel(struct intel_engine_cs 
*engine)
RB_CLEAR_NODE(rb);
 
spin_lock(&ve->base.sched_engine->lock);
-   rq = fetch_and_zero(&ve->request);
+   rq = __xchg(&ve->request, NULL);
if (rq) {
if (i915_request_mark_eio(rq)) {
rq->engine = engine;
@@ -3604,7 +3604,7 @@ static void rcu_virtual_context_destroy(struct 
work_struct *wrk)
 
spin_lock_irq(&ve->base.sched_engine->lock);
 
-   old = fetch_and_zero(&ve->request);
+   old = __xchg(&ve->request, NULL);
if (old) {
GEM_BUG_ON(!__i915_request_is_complete(old));
__i915_request_submit(old);
diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c 
b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index fe64c13fd3b4aa..6f441c3d3d1cef 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -684,7 +684,7 @@ static void fini_aliasing_ppgtt(struct i915_ggtt *ggtt)
 {
struct i915_ppgtt *ppgtt;
 
-   ppgtt = fetch_and_zero(&ggtt->alias);
+   ppgtt = __xchg(&ggtt->alias, NULL);
if (!ppgtt)
return;
 
@@ -1238,7 +1238,7 @@ bool i915_ggtt_resume_vm(struct i915_address_space *vm)
   was_bound);
 
if (obj) { /* only used during resume => exclusive access */
-   write_domain_objs |= fetch_and_zero(&obj->write_domain);
+   write_domain_objs |= __xchg(&obj->write_domain, 0);
obj->read_domains |= I915_GEM_DOMAIN_GTT;
}
}
diff --git a/drivers/gpu/drm/i915/gt/int

[PATCH v5 6/7] qed: use __xchg if possible

2023-01-18 Thread Andrzej Hajda
The recently introduced helper simplifies the code.

Signed-off-by: Andrzej Hajda 
---
 include/linux/qed/qed_chain.h | 19 +++
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/include/linux/qed/qed_chain.h b/include/linux/qed/qed_chain.h
index a84063492c71ae..6355d558cf01b2 100644
--- a/include/linux/qed/qed_chain.h
+++ b/include/linux/qed/qed_chain.h
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include <linux/non-atomic/xchg.h>
 #include 
 #include 
 #include 
@@ -368,7 +369,7 @@ static inline void qed_chain_return_produced(struct 
qed_chain *p_chain)
  */
 static inline void *qed_chain_produce(struct qed_chain *p_chain)
 {
-   void *p_ret = NULL, *p_prod_idx, *p_prod_page_idx;
+   void *p_prod_idx, *p_prod_page_idx;
 
if (is_chain_u16(p_chain)) {
if ((p_chain->u.chain16.prod_idx &
@@ -390,11 +391,8 @@ static inline void *qed_chain_produce(struct qed_chain 
*p_chain)
p_chain->u.chain32.prod_idx++;
}
 
-   p_ret = p_chain->p_prod_elem;
-   p_chain->p_prod_elem = (void *)(((u8 *)p_chain->p_prod_elem) +
-   p_chain->elem_size);
-
-   return p_ret;
+   return __xchg(&p_chain->p_prod_elem,
+ (u8 *)p_chain->p_prod_elem + p_chain->elem_size);
 }
 
 /**
@@ -439,7 +437,7 @@ static inline void qed_chain_recycle_consumed(struct 
qed_chain *p_chain)
  */
 static inline void *qed_chain_consume(struct qed_chain *p_chain)
 {
-   void *p_ret = NULL, *p_cons_idx, *p_cons_page_idx;
+   void *p_cons_idx, *p_cons_page_idx;
 
if (is_chain_u16(p_chain)) {
if ((p_chain->u.chain16.cons_idx &
@@ -461,11 +459,8 @@ static inline void *qed_chain_consume(struct qed_chain 
*p_chain)
p_chain->u.chain32.cons_idx++;
}
 
-   p_ret = p_chain->p_cons_elem;
-   p_chain->p_cons_elem = (void *)(((u8 *)p_chain->p_cons_elem) +
-   p_chain->elem_size);
-
-   return p_ret;
+   return __xchg(&p_chain->p_cons_elem,
+ (u8 *)p_chain->p_cons_elem + p_chain->elem_size);
 }
 
 /**
-- 
2.34.1



[PATCH v5 5/7] io_uring: use __xchg if possible

2023-01-18 Thread Andrzej Hajda
The recently introduced helper simplifies the code.

Signed-off-by: Andrzej Hajda 
---
 io_uring/io_uring.c | 7 ++-
 io_uring/slist.h| 6 ++
 2 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 2ac1cd8d23ea62..2b46a692d69022 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -54,6 +54,7 @@
 #include 
 #include 
 #include 
+#include <linux/non-atomic/xchg.h>
 #include 
 #include 
 #include 
@@ -1095,8 +1096,6 @@ static void __io_req_find_next_prep(struct io_kiocb *req)
 
 static inline struct io_kiocb *io_req_find_next(struct io_kiocb *req)
 {
-   struct io_kiocb *nxt;
-
/*
 * If LINK is set, we have dependent requests in this chain. If we
 * didn't fail this request, queue the first one up, moving any other
@@ -1105,9 +1104,7 @@ static inline struct io_kiocb *io_req_find_next(struct 
io_kiocb *req)
 */
if (unlikely(req->flags & IO_DISARM_MASK))
__io_req_find_next_prep(req);
-   nxt = req->link;
-   req->link = NULL;
-   return nxt;
+   return __xchg(&req->link, NULL);
 }
 
 static void ctx_flush_and_put(struct io_ring_ctx *ctx, bool *locked)
diff --git a/io_uring/slist.h b/io_uring/slist.h
index f27601fa46607b..d3a743a1adc626 100644
--- a/io_uring/slist.h
+++ b/io_uring/slist.h
@@ -2,6 +2,7 @@
 #define INTERNAL_IO_SLIST_H
 
 #include 
+#include <linux/non-atomic/xchg.h>
 
 #define wq_list_for_each(pos, prv, head)   \
for (pos = (head)->first, prv = NULL; pos; prv = pos, pos = (pos)->next)
@@ -121,10 +122,7 @@ static inline void wq_list_del(struct io_wq_work_list 
*list,
 static inline
 struct io_wq_work_node *wq_stack_extract(struct io_wq_work_node *stack)
 {
-   struct io_wq_work_node *node = stack->next;
-
-   stack->next = node->next;
-   return node;
+   return __xchg(&stack->next, stack->next->next);
 }
 
 static inline struct io_wq_work *wq_next_work(struct io_wq_work *work)
-- 
2.34.1



[PATCH v5 4/7] llist: simplify __llist_del_all

2023-01-18 Thread Andrzej Hajda
llist_del_all uses xchg; let's use __xchg here.

Signed-off-by: Andrzej Hajda 
---
 include/linux/llist.h | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/llist.h b/include/linux/llist.h
index 85bda2d02d65be..4dc1d185ea98ab 100644
--- a/include/linux/llist.h
+++ b/include/linux/llist.h
@@ -50,6 +50,7 @@
 
 #include 
 #include 
+#include <linux/non-atomic/xchg.h>
 #include 
 #include 
 
@@ -241,10 +242,7 @@ static inline struct llist_node *llist_del_all(struct 
llist_head *head)
 
 static inline struct llist_node *__llist_del_all(struct llist_head *head)
 {
-   struct llist_node *first = head->first;
-
-   head->first = NULL;
-   return first;
+   return __xchg(&head->first, NULL);
 }
 
 extern struct llist_node *llist_del_first(struct llist_head *head);
-- 
2.34.1



[PATCH v5 3/7] arch/*/uprobes: simplify arch_uretprobe_hijack_return_addr

2023-01-18 Thread Andrzej Hajda
In all architectures except x86, arch_uretprobe_hijack_return_addr
is just __xchg.

Signed-off-by: Andrzej Hajda 
---
 arch/arm/probes/uprobes/core.c |  8 ++--
 arch/arm64/kernel/probes/uprobes.c |  9 ++---
 arch/csky/kernel/probes/uprobes.c  |  9 ++---
 arch/mips/kernel/uprobes.c | 10 ++
 arch/powerpc/kernel/uprobes.c  | 10 ++
 arch/riscv/kernel/probes/uprobes.c |  9 ++---
 arch/s390/kernel/uprobes.c |  7 ++-
 arch/sparc/kernel/uprobes.c|  7 ++-
 8 files changed, 16 insertions(+), 53 deletions(-)

diff --git a/arch/arm/probes/uprobes/core.c b/arch/arm/probes/uprobes/core.c
index f5f790c6e5f896..77ce8ae431376d 100644
--- a/arch/arm/probes/uprobes/core.c
+++ b/arch/arm/probes/uprobes/core.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include <linux/non-atomic/xchg.h>
 #include 
 
 #include 
@@ -61,12 +62,7 @@ unsigned long
 arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr,
  struct pt_regs *regs)
 {
-   unsigned long orig_ret_vaddr;
-
-   orig_ret_vaddr = regs->ARM_lr;
-   /* Replace the return addr with trampoline addr */
-   regs->ARM_lr = trampoline_vaddr;
-   return orig_ret_vaddr;
+   return __xchg(®s->ARM_lr, trampoline_vaddr);
 }
 
 int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm,
diff --git a/arch/arm64/kernel/probes/uprobes.c 
b/arch/arm64/kernel/probes/uprobes.c
index d49aef2657cdf7..d7171d30ea6681 100644
--- a/arch/arm64/kernel/probes/uprobes.c
+++ b/arch/arm64/kernel/probes/uprobes.c
@@ -3,6 +3,7 @@
  * Copyright (C) 2014-2016 Pratyush Anand 
  */
 #include 
+#include <linux/non-atomic/xchg.h>
 #include 
 #include 
 #include 
@@ -150,13 +151,7 @@ unsigned long
 arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr,
  struct pt_regs *regs)
 {
-   unsigned long orig_ret_vaddr;
-
-   orig_ret_vaddr = procedure_link_pointer(regs);
-   /* Replace the return addr with trampoline addr */
-   procedure_link_pointer_set(regs, trampoline_vaddr);
-
-   return orig_ret_vaddr;
+   return __xchg(&procedure_link_pointer(regs), trampoline_vaddr);
 }
 
 int arch_uprobe_exception_notify(struct notifier_block *self,
diff --git a/arch/csky/kernel/probes/uprobes.c 
b/arch/csky/kernel/probes/uprobes.c
index 2d31a12e46cfee..775fe88b5f0016 100644
--- a/arch/csky/kernel/probes/uprobes.c
+++ b/arch/csky/kernel/probes/uprobes.c
@@ -3,6 +3,7 @@
  * Copyright (C) 2014-2016 Pratyush Anand 
  */
 #include 
+#include <linux/non-atomic/xchg.h>
 #include 
 #include 
 #include 
@@ -123,13 +124,7 @@ unsigned long
 arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr,
  struct pt_regs *regs)
 {
-   unsigned long ra;
-
-   ra = regs->lr;
-
-   regs->lr = trampoline_vaddr;
-
-   return ra;
+   return __xchg(®s->lr, trampoline_vaddr);
 }
 
 int arch_uprobe_exception_notify(struct notifier_block *self,
diff --git a/arch/mips/kernel/uprobes.c b/arch/mips/kernel/uprobes.c
index 6c063aa188e626..2b1f375c407b87 100644
--- a/arch/mips/kernel/uprobes.c
+++ b/arch/mips/kernel/uprobes.c
@@ -2,6 +2,7 @@
 #include 
 #include 
 #include 
+#include <linux/non-atomic/xchg.h>
 #include 
 #include 
 #include 
@@ -197,14 +198,7 @@ void arch_uprobe_abort_xol(struct arch_uprobe *aup,
 unsigned long arch_uretprobe_hijack_return_addr(
unsigned long trampoline_vaddr, struct pt_regs *regs)
 {
-   unsigned long ra;
-
-   ra = regs->regs[31];
-
-   /* Replace the return address with the trampoline address */
-   regs->regs[31] = trampoline_vaddr;
-
-   return ra;
+   return __xchg(®s->regs[31], trampoline_vaddr);
 }
 
 /**
diff --git a/arch/powerpc/kernel/uprobes.c b/arch/powerpc/kernel/uprobes.c
index 95a41ae9dfa755..3c15c320315e6c 100644
--- a/arch/powerpc/kernel/uprobes.c
+++ b/arch/powerpc/kernel/uprobes.c
@@ -7,6 +7,7 @@
  * Adapted from the x86 port by Ananth N Mavinakayanahalli 
  */
 #include 
+#include <linux/non-atomic/xchg.h>
 #include 
 #include 
 #include 
@@ -197,14 +198,7 @@ bool arch_uprobe_skip_sstep(struct arch_uprobe *auprobe, 
struct pt_regs *regs)
 unsigned long
 arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct 
pt_regs *regs)
 {
-   unsigned long orig_ret_vaddr;
-
-   orig_ret_vaddr = regs->link;
-
-   /* Replace the return addr with trampoline addr */
-   regs->link = trampoline_vaddr;
-
-   return orig_ret_vaddr;
+   return __xchg(®s->link, trampoline_vaddr);
 }
 
 bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx,
diff --git a/arch/riscv/kernel/probes/uprobes.c 
b/arch/riscv/kernel/probes/uprobes.c
index c976a21cd4bd5b..5c8415e216536d 100644
--- a/arch/riscv/kernel/probes/uprobes.c
+++ b/arch/riscv/kernel/probes/uprobes.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
 #include 
+#include <linux/non-atomic/xchg.h>
 #include 
 #include 
 
@@ -122,13 +123,7 @@ unsigned long
 arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr,

[PATCH v5 2/7] linux/include: add non-atomic version of xchg

2023-01-18 Thread Andrzej Hajda
The pattern of setting a variable to a new value and returning the old
one is very common in the kernel. Usually atomicity of the operation
is not required, so xchg seems to be suboptimal and confusing in
such cases.

Signed-off-by: Andrzej Hajda 
Reviewed-by: Andy Shevchenko 
---
 include/linux/non-atomic/xchg.h | 19 +++
 1 file changed, 19 insertions(+)
 create mode 100644 include/linux/non-atomic/xchg.h

diff --git a/include/linux/non-atomic/xchg.h b/include/linux/non-atomic/xchg.h
new file mode 100644
index 00..f7fa5dd746f37d
--- /dev/null
+++ b/include/linux/non-atomic/xchg.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_NON_ATOMIC_XCHG_H
+#define _LINUX_NON_ATOMIC_XCHG_H
+
+/**
+ * __xchg - set variable pointed by @ptr to @val, return old value
+ * @ptr: pointer to affected variable
+ * @val: value to be written
+ *
+ * This is a non-atomic variant of xchg.
+ */
+#define __xchg(ptr, val) ({\
+   __auto_type __ptr = ptr;\
+   __auto_type __t = *__ptr;   \
+   *__ptr = (val); \
+   __t;\
+})
+
+#endif
-- 
2.34.1
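
For illustration, a minimal (hypothetical) caller of the helper added
above; the struct and function names are invented for this sketch:

	#include <linux/types.h>
	#include <linux/non-atomic/xchg.h>

	struct req {
		struct req *link;
	};

	/*
	 * Detach and return req->link in a single expression. No atomicity
	 * is implied; the caller must hold whatever lock protects 'req'.
	 */
	static struct req *req_take_link(struct req *req)
	{
		return __xchg(&req->link, NULL);
	}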



[PATCH v5 1/7] arch: rename all internal names __xchg to __arch_xchg

2023-01-18 Thread Andrzej Hajda
__xchg will be used for the non-atomic xchg macro.

Signed-off-by: Andrzej Hajda 
Reviewed-by: Arnd Bergmann 
Acked-by: Geert Uytterhoeven  [m68k]
Acked-by: Palmer Dabbelt  [riscv]
---
v2: squashed all arch patches into one
v3: fixed alpha/xchg_local, thx to l...@intel.com
v4: adjusted indentation (Heiko)
---
 arch/alpha/include/asm/cmpxchg.h | 10 +-
 arch/arc/include/asm/cmpxchg.h   |  4 ++--
 arch/arm/include/asm/cmpxchg.h   |  7 ---
 arch/arm64/include/asm/cmpxchg.h |  7 +++
 arch/hexagon/include/asm/cmpxchg.h   | 10 +-
 arch/ia64/include/asm/cmpxchg.h  |  2 +-
 arch/ia64/include/uapi/asm/cmpxchg.h |  4 ++--
 arch/loongarch/include/asm/cmpxchg.h |  4 ++--
 arch/m68k/include/asm/cmpxchg.h  |  6 +++---
 arch/mips/include/asm/cmpxchg.h  |  4 ++--
 arch/openrisc/include/asm/cmpxchg.h  | 10 +-
 arch/parisc/include/asm/cmpxchg.h|  4 ++--
 arch/powerpc/include/asm/cmpxchg.h   |  4 ++--
 arch/riscv/include/asm/atomic.h  |  2 +-
 arch/riscv/include/asm/cmpxchg.h |  4 ++--
 arch/s390/include/asm/cmpxchg.h  |  8 
 arch/sh/include/asm/cmpxchg.h|  4 ++--
 arch/sparc/include/asm/cmpxchg_32.h  |  4 ++--
 arch/sparc/include/asm/cmpxchg_64.h  |  6 +++---
 arch/xtensa/include/asm/cmpxchg.h|  4 ++--
 20 files changed, 54 insertions(+), 54 deletions(-)

diff --git a/arch/alpha/include/asm/cmpxchg.h b/arch/alpha/include/asm/cmpxchg.h
index 6e0a850aa9d38c..91d4a4d9258cd2 100644
--- a/arch/alpha/include/asm/cmpxchg.h
+++ b/arch/alpha/include/asm/cmpxchg.h
@@ -6,15 +6,15 @@
  * Atomic exchange routines.
  */
 
-#define xchg(type, args...)__xchg ## type ## _local(args)
+#define xchg(type, args...)__arch_xchg ## type ## 
_local(args)
 #define cmpxchg(type, args...) __cmpxchg ## type ## _local(args)
 #include 
 
 #define xchg_local(ptr, x) \
 ({ \
__typeof__(*(ptr)) _x_ = (x);   \
-   (__typeof__(*(ptr))) __xchg_local((ptr), (unsigned long)_x_,\
-  sizeof(*(ptr))); \
+   (__typeof__(*(ptr))) __arch_xchg_local((ptr), (unsigned long)_x_,\
+  sizeof(*(ptr))); \
 })
 
 #define arch_cmpxchg_local(ptr, o, n)  \
@@ -34,7 +34,7 @@
 
 #undef xchg
 #undef cmpxchg
-#define xchg(type, args...)__xchg ##type(args)
+#define xchg(type, args...)__arch_xchg ##type(args)
 #define cmpxchg(type, args...) __cmpxchg ##type(args)
 #include 
 
@@ -48,7 +48,7 @@
__typeof__(*(ptr)) _x_ = (x);   \
smp_mb();   \
__ret = (__typeof__(*(ptr)))\
-   __xchg((ptr), (unsigned long)_x_, sizeof(*(ptr)));  \
+   __arch_xchg((ptr), (unsigned long)_x_, sizeof(*(ptr))); \
smp_mb();   \
__ret;  \
 })
diff --git a/arch/arc/include/asm/cmpxchg.h b/arch/arc/include/asm/cmpxchg.h
index c5b544a5fe8106..e138fde067dea5 100644
--- a/arch/arc/include/asm/cmpxchg.h
+++ b/arch/arc/include/asm/cmpxchg.h
@@ -85,7 +85,7 @@
  */
 #ifdef CONFIG_ARC_HAS_LLSC
 
-#define __xchg(ptr, val)   \
+#define __arch_xchg(ptr, val)  \
 ({ \
__asm__ __volatile__(   \
"   ex  %0, [%1]\n" /* set new value */ \
@@ -102,7 +102,7 @@
\
switch(sizeof(*(_p_))) {\
case 4: \
-   _val_ = __xchg(_p_, _val_); \
+   _val_ = __arch_xchg(_p_, _val_);\
break;  \
default:\
BUILD_BUG();\
diff --git a/arch/arm/include/asm/cmpxchg.h b/arch/arm/include/asm/cmpxchg.h
index 4dfe538dfc689b..44667bdb4707a9 100644
--- a/arch/arm/include/asm/cmpxchg.h
+++ b/arch/arm/include/asm/cmpxchg.h
@@ -25,7 +25,8 @@
 #define swp_is_buggy
 #endif
 
-static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int 
size)
+static inline unsigned long
+__arch_xchg(unsigned long x, volatile void *ptr, int size)
 {
extern void __bad_xchg(volatile void *, int);
unsigned long r

[PATCH v5 0/7] Introduce __xchg, non-atomic xchg

2023-01-18 Thread Andrzej Hajda
Hi all,

The helper is tiny and there has been advice that we can live without it,
so I want to present a few arguments why it would be good to have it:
1. Code readability/simplification/number of lines:
  - decreases number of lines,
  - it often eliminates local variables,
  - for real examples see patches 3+.

2. Presence of similar helpers in other, somewhat related, languages/libs:

a) Rust [1]: 'replace' from the std::mem module; there is also a 'take'
helper (equivalent to __xchg(&x, 0)), which is the same as the private
fetch_and_zero helper in i915, see the latest patch.
b) C++ [2]: 'exchange' from the utility header.

If the idea is OK, there are still two questions to answer:

1. The name of the helper: __xchg follows kernel conventions,
but for me the Rust names are also OK.
2. Where to put the helper:
a) as in this patchset, include/linux/non-atomic/xchg.h,
proposed by Andy Shevchenko,
b) include/linux/utils.h? Any better name? Some kind
of container for simple helpers.

All __xchg conversions were performed using a cocci script,
then manually adjusted where necessary.

There are lots of places it can be used in; I have just chosen
some of them. I can provide the cocci script to detect others (not all),
if necessary.

Changes:
v2: squashed all __xchg -> __arch_xchg to one patch (Arnd)
v3: fixed alpha/xchg_local (l...@intel.com)
v4: adjusted indentation (Heiko)
v5: added more __xchg conversions - patches 3-6, added tags

[1]: https://doc.rust-lang.org/std/mem/index.html
[2]: https://en.cppreference.com/w/cpp/header/utility

Regards
Andrzej

Andrzej Hajda (7):
  arch: rename all internal names __xchg to __arch_xchg
  linux/include: add non-atomic version of xchg
  arch/*/uprobes: simplify arch_uretprobe_hijack_return_addr
  llist: simplify __llist_del_all
  io_uring: use __xchg if possible
  qed: use __xchg if possible
  drm/i915/gt: use __xchg instead of internal helper

 arch/alpha/include/asm/cmpxchg.h  | 10 +-
 arch/arc/include/asm/cmpxchg.h|  4 ++--
 arch/arm/include/asm/cmpxchg.h|  7 ---
 arch/arm/probes/uprobes/core.c|  8 ++--
 arch/arm64/include/asm/cmpxchg.h  |  7 +++
 arch/arm64/kernel/probes/uprobes.c|  9 ++---
 arch/csky/kernel/probes/uprobes.c |  9 ++---
 arch/hexagon/include/asm/cmpxchg.h| 10 +-
 arch/ia64/include/asm/cmpxchg.h   |  2 +-
 arch/ia64/include/uapi/asm/cmpxchg.h  |  4 ++--
 arch/loongarch/include/asm/cmpxchg.h  |  4 ++--
 arch/m68k/include/asm/cmpxchg.h   |  6 +++---
 arch/mips/include/asm/cmpxchg.h   |  4 ++--
 arch/mips/kernel/uprobes.c| 10 ++
 arch/openrisc/include/asm/cmpxchg.h   | 10 +-
 arch/parisc/include/asm/cmpxchg.h |  4 ++--
 arch/powerpc/include/asm/cmpxchg.h|  4 ++--
 arch/powerpc/kernel/uprobes.c | 10 ++
 arch/riscv/include/asm/atomic.h   |  2 +-
 arch/riscv/include/asm/cmpxchg.h  |  4 ++--
 arch/riscv/kernel/probes/uprobes.c|  9 ++---
 arch/s390/include/asm/cmpxchg.h   |  8 
 arch/s390/kernel/uprobes.c|  7 ++-
 arch/sh/include/asm/cmpxchg.h |  4 ++--
 arch/sparc/include/asm/cmpxchg_32.h   |  4 ++--
 arch/sparc/include/asm/cmpxchg_64.h   |  6 +++---
 arch/sparc/kernel/uprobes.c   |  7 ++-
 arch/xtensa/include/asm/cmpxchg.h |  4 ++--
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  2 +-
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |  4 ++--
 .../drm/i915/gt/intel_execlists_submission.c  |  4 ++--
 drivers/gpu/drm/i915/gt/intel_ggtt.c  |  4 ++--
 drivers/gpu/drm/i915/gt/intel_gsc.c   |  2 +-
 drivers/gpu/drm/i915/gt/intel_gt.c|  4 ++--
 drivers/gpu/drm/i915/gt/intel_gt_pm.c |  2 +-
 drivers/gpu/drm/i915/gt/intel_lrc.c   |  6 +++---
 drivers/gpu/drm/i915/gt/intel_migrate.c   |  2 +-
 drivers/gpu/drm/i915/gt/intel_rc6.c   |  2 +-
 drivers/gpu/drm/i915/gt/intel_rps.c   |  2 +-
 drivers/gpu/drm/i915/gt/selftest_context.c|  2 +-
 .../drm/i915/gt/selftest_ring_submission.c|  2 +-
 drivers/gpu/drm/i915/gt/selftest_timeline.c   |  2 +-
 drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c |  2 +-
 drivers/gpu/drm/i915/gt/uc/intel_uc.c |  2 +-
 drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c  |  2 +-
 drivers/gpu/drm/i915/i915_utils.h |  1 +
 include/linux/llist.h |  6 ++
 include/linux/non-atomic/xchg.h   | 19 +++
 include/linux/qed/qed_chain.h | 19 +++
 io_uring/io_uring.c   |  7 ++-
 io_uring/slist.h  |  6 ++
 51 files changed, 126 insertions(+), 155 deletions(-)
 create mode 100644 include/linux/non-atomic/xchg.h

-- 
2.34.1



Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-18 Thread Michal Hocko
On Wed 18-01-23 14:23:32, Jann Horn wrote:
> On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko  wrote:
> > On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > > +locking maintainers
> > >
> > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  
> > > wrote:
> > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > mmap_write_lock holder is done with all modifications and drops 
> > > > mmap_lock,
> > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > locked.
> > > [...]
> > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > +{
> > > > +   up_read(&vma->lock);
> > > > +}
> > >
> > > One thing that might be gnarly here is that I think you might not be
> > > allowed to use up_read() to fully release ownership of an object -
> > > from what I remember, I think that up_read() (unlike something like
> > > spin_unlock()) can access the lock object after it's already been
> > > acquired by someone else.
> >
> > Yes, I think you are right. From a look into the code it seems that
> > the UAF is quite unlikely as there is a ton of work to be done between
> > vma_write_lock used to prepare vma for removal and actual removal.
> > That doesn't make it less of a problem though.
> >
> > > So if you want to protect against concurrent
> > > deletion, this might have to be something like:
> > >
> > > rcu_read_lock(); /* keeps vma alive */
> > > up_read(&vma->lock);
> > > rcu_read_unlock();
> > >
> > > But I'm not entirely sure about that, the locking folks might know better.
> >
> > I am not a locking expert but to me it looks like this should work
> > because the final cleanup would have to happen rcu_read_unlock.
> >
> > Thanks, I have completely missed this aspect of the locking when looking
> > into the code.
> >
> > Btw. looking at this again I have fully realized how hard it is actually
> > to see that vm_area_free is guaranteed to sync up with ongoing readers.
> > vma manipulation functions like __adjust_vma make my head spin. Would it
> > make more sense to have a rcu style synchronization point in
> > vm_area_free directly before call_rcu? This would add an overhead of
> > uncontended down_write of course.
> 
> Something along those lines might be a good idea, but I think that
> rather than synchronizing the removal, it should maybe be something
> that splats (and bails out?) if it detects pending readers. If we get
> to vm_area_free() on a VMA that has pending readers, we might already
> be in a lot of trouble because the concurrent readers might have been
> traversing page tables while we were tearing them down or fun stuff
> like that.
> 
> I think maybe Suren was already talking about something like that in
> another part of this patch series but I don't remember...

This http://lkml.kernel.org/r/20230109205336.3665937-27-sur...@google.com?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] tools/perf: Fix the file mode with copyfile while adding file to build-id cache

2023-01-18 Thread Arnaldo Carvalho de Melo
Em Mon, Jan 16, 2023 at 10:31:30AM +0530, Athira Rajeev escreveu:
> The test "build id cache operations" fails on powerpc
> As below:
> 
>   Adding 5a0fd882b53084224ba47b624c55a469 ./tests/shell/../pe-file.exe: Ok
>   build id: 5a0fd882b53084224ba47b624c55a469
>   link: /tmp/perf.debug.ZTu/.build-id/5a/0fd882b53084224ba47b624c55a469
>   file: 
> /tmp/perf.debug.ZTu/.build-id/5a/../../root/linux/tools/perf/tests/pe-file.exe/5a0fd882b53084224ba47b624c55a469/elf
>   failed: file 
> /tmp/perf.debug.ZTu/.build-id/5a/../../root/linux/tools/perf/tests/pe-file.exe/5a0fd882b53084224ba47b624c55a469/elf
>  does not exist
>   test child finished with -1
>    end 
>   build id cache operations: FAILED!
> 
> The failing test is when trying to add pe-file.exe to
> build id cache.
> 
> Perf buildid-cache can be used to add/remove/manage
> files from the build-id cache. "-a" option is used to
> add a file to the build-id cache.
> 
> Simple command to do so for a PE exe file:
>  # ls -ltr tests/pe-file.exe
>  -rw-r--r--. 1 root root 75595 Jan 10 23:35 tests/pe-file.exe
>  The file is in home directory.
> 
>  # mkdir  /tmp/perf.debug.TeY1
>  # perf --buildid-dir /tmp/perf.debug.TeY1 buildid-cache -v
>-a tests/pe-file.exe
> 
> The above will create ".build-id" folder in build id
> directory, which is /tmp/perf.debug.TeY1. Also adds file
> to this folder under build id. Example:
> 
>  # ls -ltr /tmp/perf.debug.TeY1/.build-id/5a/0fd882b53084224ba47b624c55a469/
>  total 76
>  -rw-r--r--. 1 root root 0 Jan 11 00:38 probes
>  -rwxr-xr-x. 1 root root 75595 Jan 11 00:38 elf
> 
> We can see in the results that file mode for original
> file and file in build id directory is different. ie,
> build id file has executable permission whereas original
> file doesn’t have.
> 
> The code path and function ( build_id_cache__add ) to
> add file to cache is in "util/build-id.c". In
> build_id_cache__add() function, it first attempts to link
> the original file to destination cache folder. If linking
> the file fails ( which can happen if the destination and
> source is on a different mount points ), it will copy the
> file to destination. Here copyfile() routine explicitly uses
> mode as "755" and hence file in the destination will have
> executable permission.
> 
> Code snippet:
> 
>  if (link(realname, filename) && errno != EEXIST &&
>copyfile(name, filename))
> 
> Strace logs:
> 
>   172285 link("/home//linux/tools/perf/tests/pe-file.exe", 
> "/tmp/perf.debug.TeY1/home//linux/tools/perf/tests/pe-file.exe/5a0fd882b53084224ba47b624c55a469/elf")
>  = -1 EXDEV (Invalid cross-device link)
>   172285 newfstatat(AT_FDCWD, "tests/pe-file.exe", {st_mode=S_IFREG|0644, 
> st_size=75595, ...}, 0) = 0
>   172285 openat(AT_FDCWD, 
> "/tmp/perf.debug.TeY1/home//linux/tools/perf/tests/pe-file.exe/5a0fd882b53084224ba47b624c55a469/.elf.KbAnsl",
>  O_RDWR|O_CREAT|O_EXCL, 0600) = 3
>   172285 fchmod(3, 0755)  = 0
>   172285 openat(AT_FDCWD, "tests/pe-file.exe", O_RDONLY) = 4
>   172285 mmap(NULL, 75595, PROT_READ, MAP_PRIVATE, 4, 0) = 0x7fffa5cd
>   172285 pwrite64(3, 
> "MZ\220\0\3\0\0\0\4\0\0\0\377\377\0\0\270\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 
> 75595, 0) = 75595
> 
> Whereas if the link succeeds, it succeeds in the first
> attempt itself and the file in the build-id dir will
> have same permission as original file.
> 
> Example, above uses /tmp. Instead if we use
> "--buildid-dir /home/build", linking will work here
> since mount points are same. Hence the destination file
> will not have executable permission.
> 
> Since the testcase "tests/shell/buildid.sh" always looks
> for executable file, test fails in powerpc environment
> when test is run from /root.
> 
> The patch adds a change in build_id_cache__add to use
> copyfile_mode which also passes the file’s original mode as
> argument. This way the destination file mode also will
> be same as original file.
> 
> Signed-off-by: Athira Rajeev 

Thanks, applied both patches. IIRC there was an Acked-by from Ian for
this one? Or was that reference a cut'n'paste error with that other
series for the #!/bin/bash changes?

- Arnaldo

> ---
>  tools/perf/util/build-id.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/perf/util/build-id.c b/tools/perf/util/build-id.c
> index a839b30c981b..ea9c083ab1e3 100644
> --- a/tools/perf/util/build-id.c
> +++ b/tools/perf/util/build-id.c
> @@ -715,9 +715,13 @@ build_id_cache__add(const char *sbuild_id, const char 
> *name, const char *realnam
>   } else if (nsi && nsinfo__need_setns(nsi)) {
>   if (copyfile_ns(name, filename, nsi))
>   goto out_free;
> - } else if (link(realname, filename) && errno != EEXIST &&
> - copyfile(name, filename))
> - goto out_free;
> + } else if (link

Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-18 Thread Jann Horn
On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko  wrote:
> On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > +locking maintainers
> >
> > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  wrote:
> > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > to be exclusively locked during VMA tree modifications, instead of the
> > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > locked.
> > [...]
> > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > +{
> > > +   up_read(&vma->lock);
> > > +}
> >
> > One thing that might be gnarly here is that I think you might not be
> > allowed to use up_read() to fully release ownership of an object -
> > from what I remember, I think that up_read() (unlike something like
> > spin_unlock()) can access the lock object after it's already been
> > acquired by someone else.
>
> Yes, I think you are right. From a look into the code it seems that
> the UAF is quite unlikely as there is a ton of work to be done between
> vma_write_lock used to prepare vma for removal and actual removal.
> That doesn't make it less of a problem though.
>
> > So if you want to protect against concurrent
> > deletion, this might have to be something like:
> >
> > rcu_read_lock(); /* keeps vma alive */
> > up_read(&vma->lock);
> > rcu_read_unlock();
> >
> > But I'm not entirely sure about that, the locking folks might know better.
>
> I am not a locking expert but to me it looks like this should work
> because the final cleanup would have to happen rcu_read_unlock.
>
> Thanks, I have completely missed this aspect of the locking when looking
> into the code.
>
> Btw. looking at this again I have fully realized how hard it is actually
> to see that vm_area_free is guaranteed to sync up with ongoing readers.
> vma manipulation functions like __adjust_vma make my head spin. Would it
> make more sense to have a rcu style synchronization point in
> vm_area_free directly before call_rcu? This would add an overhead of
> uncontended down_write of course.

Something along those lines might be a good idea, but I think that
rather than synchronizing the removal, it should maybe be something
that splats (and bails out?) if it detects pending readers. If we get
to vm_area_free() on a VMA that has pending readers, we might already
be in a lot of trouble because the concurrent readers might have been
traversing page tables while we were tearing them down or fun stuff
like that.
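
A rough sketch of that kind of check (hypothetical helper, assuming the
per-VMA rw_semaphore is vma->lock as elsewhere in this series):

static void vm_area_free_checked(struct vm_area_struct *vma)
{
	/*
	 * Splat, and bail out (leaking the VMA) rather than free an
	 * object that a racing reader may still be holding.
	 */
	if (WARN_ON_ONCE(rwsem_is_locked(&vma->lock)))
		return;

	vm_area_free(vma);
}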

I think maybe Suren was already talking about something like that in
another part of this patch series but I don't remember...


RE: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-18 Thread David Laight
...
> > One thing that might be gnarly here is that I think you might not be
> > allowed to use up_read() to fully release ownership of an object -
> > from what I remember, I think that up_read() (unlike something like
> > spin_unlock()) can access the lock object after it's already been
> > acquired by someone else.
> 
> Yes, I think you are right. From a look into the code it seems that
> the UAF is quite unlikely as there is a ton of work to be done between
> vma_write_lock used to prepare vma for removal and actual removal.
> That doesn't make it less of a problem though.

All it takes is a hardware interrupt.
Especially if the softint code can also run.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



Re: [PATCH v3 22/24] powerpc/pseries: Implement secvars for dynamic secure boot

2023-01-18 Thread Stefan Berger




On 1/18/23 01:10, Andrew Donnellan wrote:


+
+// PLPKS dynamic secure boot doesn't give us a format string in the same way 
OPAL does.
+// Instead, report the format using the SB_VERSION variable in the keystore.
+static ssize_t plpks_secvar_format(char *buf)


Ideally there would be a size_t accompanying this buffer...
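
For example, something along these lines (a hypothetical signature, not taken
from the patch), with the size passed down to the snprintf() calls instead of
SECVAR_MAX_FORMAT_LEN:

static ssize_t plpks_secvar_format(char *buf, size_t bufsize);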


+{
+   struct plpks_var var = {0};
+   ssize_t ret;
+
+   var.component = NULL;
+   // Only the signed variables have null bytes in their names, this one 
doesn't
+   var.name = "SB_VERSION";
+   var.namelen = 10;
+   var.datalen = 1;
+   var.data = kzalloc(1, GFP_KERNEL);


NULL pointer check?
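
For example (sketch only):

	var.data = kzalloc(1, GFP_KERNEL);
	if (!var.data)
		return -ENOMEM;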


+
+   // Unlike the other vars, SB_VERSION is owned by firmware instead of 
the OS
+   ret = plpks_read_fw_var(&var);
+   if (ret) {
+   if (ret == -ENOENT) {
+   ret = snprintf(buf, SECVAR_MAX_FORMAT_LEN, 
"ibm,plpks-sb-unknown");
+   } else {
+   pr_err("Error %ld reading SB_VERSION from firmware\n", 
ret);
+   ret = -EIO;
+   }
+   goto err;
+   }
+
+   // This string is made up by us - the hypervisor doesn't provide us
+   // with a format string in the way that OPAL firmware does. Hypervisor
+   // defines SB_VERSION as a "1 byte unsigned integer value".
+   ret = snprintf(buf, SECVAR_MAX_FORMAT_LEN, "ibm,plpks-sb-v%hhu", 
var.data[0]);
+
+err:
+   kfree(var.data);
+   return ret;
+}
+


Re: [PATCH 27/41] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration

2023-01-18 Thread Jann Horn
On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  wrote:
> Page fault handlers might need to fire MMU notifications while a new
> notifier is being registered. Modify mm_take_all_locks to write-lock all
> VMAs and prevent this race with fault handlers that would hold VMA locks.
> VMAs are locked before i_mmap_rwsem and anon_vma to keep the same
> locking order as in page fault handlers.
>
> Signed-off-by: Suren Baghdasaryan 
> ---
>  mm/mmap.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 30c7d1c5206e..a256deca0bc0 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3566,6 +3566,7 @@ static void vm_lock_mapping(struct mm_struct *mm, 
> struct address_space *mapping)
>   * of mm/rmap.c:
>   *   - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for
>   * hugetlb mapping);
> + *   - all vmas marked locked

The existing comment above says that this is an *ordered* listing of
which locks are taken.

>   *   - all i_mmap_rwsem locks;
>   *   - all anon_vma->rwseml
>   *
> @@ -3591,6 +3592,7 @@ int mm_take_all_locks(struct mm_struct *mm)
> mas_for_each(&mas, vma, ULONG_MAX) {
> if (signal_pending(current))
> goto out_unlock;
> +   vma_write_lock(vma);
> if (vma->vm_file && vma->vm_file->f_mapping &&
> is_vm_hugetlb_page(vma))
> vm_lock_mapping(mm, vma->vm_file->f_mapping);

Note that multiple VMAs can have the same ->f_mapping, so with this,
the lock ordering between VMA locks and the mapping locks of hugetlb
VMAs is mixed: If you have two adjacent hugetlb VMAs with the same
->f_mapping, then the following operations happen:

1. lock VMA 1
2. lock mapping of VMAs 1 and 2
3. lock VMA 2
4. [second vm_lock_mapping() is a no-op]

So for VMA 1, we ended up taking the VMA lock first, but for VMA 2, we
took the mapping lock first.

The existing code has one loop per lock type to ensure that the locks
really are taken in the specified order, even when some of the locks
are associated with multiple VMAs.

If we don't care about the ordering between these two, maybe that's
fine and you just have to adjust the comment; but it would be clearer
to add a separate loop for the VMA locks.
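
Something like a dedicated pass over the VMAs before the existing mapping and
anon_vma loops (sketch only, reusing the maple tree iterator from the loop
quoted above):

	mas_for_each(&mas, vma, ULONG_MAX) {
		if (signal_pending(current))
			goto out_unlock;
		vma_write_lock(vma);
	}

	mas_set(&mas, 0);
	mas_for_each(&mas, vma, ULONG_MAX) {
		if (signal_pending(current))
			goto out_unlock;
		if (vma->vm_file && vma->vm_file->f_mapping &&
				is_vm_hugetlb_page(vma))
			vm_lock_mapping(mm, vma->vm_file->f_mapping);
	}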

> @@ -3677,6 +3679,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
> if (vma->vm_file && vma->vm_file->f_mapping)
> vm_unlock_mapping(vma->vm_file->f_mapping);
> }
> +   vma_write_unlock_mm(mm);
>
> mutex_unlock(&mm_all_locks_mutex);
>  }
> --
> 2.39.0
>


Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page

2023-01-18 Thread Jann Horn
On Wed, Jan 18, 2023 at 10:40 AM Michal Hocko  wrote:
> On Tue 17-01-23 21:28:06, Jann Horn wrote:
> > On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko  wrote:
> > > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > > > Protect VMA from concurrent page fault handler while collapsing a huge
> > > > page. Page fault handler needs a stable PMD to use PTL and relies on
> > > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > > > not be detected by a page fault handler without proper locking.
> > >
> > > I am struggling with this changelog. Maybe because my recollection of
> > > the THP collapsing subtleties is weak. But aren't you just trying to say
> > > that the current #PF handling and THP collapsing need to be mutually
> > > exclusive currently so in order to keep that assumption you have mark
> > > the vma write locked?
> > >
> > > Also it is not really clear to me how that handles other vmas which can
> > > share the same thp?
> >
> > It's not about the hugepage itself, it's about how the THP collapse
> > operation frees page tables.
> >
> > Before this series, page tables can be walked under any one of the
> > mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
> > unlinks and frees page tables, it must ensure that all of those either
> > are locked or don't exist. This series adds a fourth lock under which
> > page tables can be traversed, and so khugepaged must also lock out that one.
> >
> > There is a codepath in khugepaged that iterates through all mappings
> > of a file to zap page tables (retract_page_tables()), which locks each
> > visited mm with mmap_write_trylock() and now also does
> > vma_write_lock().
>
> OK, I see. This would be a great addendum to the changelog.
>
> > I think one aspect of this patch that might cause trouble later on, if
> > support for non-anonymous VMAs is added, is that retract_page_tables()
> > now does vma_write_lock() while holding the mapping lock; the page
> > fault handling path would probably take the locks the other way
> > around, leading to a deadlock? So the vma_write_lock() in
> > retract_page_tables() might have to become a trylock later on.
>
> This, right?
> #PF retract_page_tables
> vma_read_lock
> i_mmap_lock_write
> i_mmap_lock_read
> vma_write_lock
>
>
> I might be missing something but I have only found huge_pmd_share to be
> called from the #PF path. That one should be safe as it cannot be a
> target for THP. Not that it would matter much because such a dependency
> chain would be really subtle.

Oops, yeah. Now that I'm looking closer I also don't see a path from
the #PF path to i_mmap_lock_read. Sorry for sending you on a wild
goose chase.


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-18 Thread Michal Hocko
On Tue 17-01-23 19:02:55, Jann Horn wrote:
> +locking maintainers
> 
> On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  wrote:
> > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > instead of mmap_lock. Because there are cases when multiple VMAs need
> > to be exclusively locked during VMA tree modifications, instead of the
> > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > locked.
> [...]
> > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > +{
> > +   up_read(&vma->lock);
> > +}
> 
> One thing that might be gnarly here is that I think you might not be
> allowed to use up_read() to fully release ownership of an object -
> from what I remember, I think that up_read() (unlike something like
> spin_unlock()) can access the lock object after it's already been
> acquired by someone else.

Yes, I think you are right. From a look into the code it seems that
the UAF is quite unlikely as there is a ton of work to be done between
vma_write_lock used to prepare vma for removal and actual removal.
That doesn't make it less of a problem though.

> So if you want to protect against concurrent
> deletion, this might have to be something like:
> 
> rcu_read_lock(); /* keeps vma alive */
> up_read(&vma->lock);
> rcu_read_unlock();
> 
> But I'm not entirely sure about that, the locking folks might know better.

I am not a locking expert but to me it looks like this should work
because the final cleanup would have to happen rcu_read_unlock.

Thanks, I have completely missed this aspect of the locking when looking
into the code.

Btw. looking at this again I have fully realized how hard it is actually
to see that vm_area_free is guaranteed to sync up with ongoing readers.
vma manipulation functions like __adjust_vma make my head spin. Would it
make more sense to have an RCU-style synchronization point in
vm_area_free directly before call_rcu? This would add an overhead of
uncontended down_write of course.
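
A minimal sketch of such a synchronization point (illustration only; the
vma->vm_rcu / __vm_area_free callback pair is assumed from the rest of the
series):

void vm_area_free(struct vm_area_struct *vma)
{
	/*
	 * Wait for any reader that raced with the last write lock to
	 * drop vma->lock before the VMA is handed to call_rcu(). This
	 * is the uncontended down_write mentioned above.
	 */
	down_write(&vma->lock);
	up_write(&vma->lock);

	call_rcu(&vma->vm_rcu, __vm_area_free);
}
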
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3 21/24] powerpc/pseries: Pass PLPKS password on kexec

2023-01-18 Thread Andrew Donnellan
On Wed, 2023-01-18 at 17:10 +1100, Andrew Donnellan wrote:
> 
>  struct umem_info {
> u64 *buf;   /* data buffer for usable-memory
> property */
> @@ -1155,7 +1156,7 @@ int setup_new_fdt_ppc64(const struct kimage
> *image, void *fdt,
> unsigned long initrd_len, const char
> *cmdline)
>  {
> struct crash_mem *umem = NULL, *rmem = NULL;
> -   int i, nr_ranges, ret;
> +   int i, nr_ranges, ret, chosen_node;

rip, we broke the build when CONFIG_PSERIES_PLPKS=n.

v4 incoming shortly, after we wait a little while longer for comments.

-- 
Andrew DonnellanOzLabs, ADL Canberra
a...@linux.ibm.com   IBM Australia Limited


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-18 Thread Michal Hocko
On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko  wrote:
> >
> > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > call_rcu() can take a long time when callback offloading is enabled.
> > > Its use in the vm_area_free can cause regressions in the exit path when
> > > multiple VMAs are being freed.
> >
> > What kind of regressions.
> >
> > > To minimize that impact, place VMAs into
> > > a list and free them in groups using one call_rcu() call per group.
> >
> > Please add some data to justify this additional complexity.
> 
> Sorry, should have done that in the first place. A 4.3% regression was
> noticed when running execl test from unixbench suite. spawn test also
> showed 1.6% regression. Profiling revealed that vma freeing was taking
> longer due to call_rcu() which is slow when RCU callback offloading is
> enabled.

Could you be more specific? vma freeing is async with the RCU so how
come this has resulted in a regression? Is there any heavy
rcu_synchronize in the exec path? That would be interesting information.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 26/41] kernel/fork: assert no VMA readers during its destruction

2023-01-18 Thread Michal Hocko
On Tue 17-01-23 17:53:00, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:42 AM 'Michal Hocko' via kernel-team
>  wrote:
> >
> > On Mon 09-01-23 12:53:21, Suren Baghdasaryan wrote:
> > > Assert there are no holders of VMA lock for reading when it is about to be
> > > destroyed.
> > >
> > > Signed-off-by: Suren Baghdasaryan 
> > > ---
> > >  include/linux/mm.h | 8 
> > >  kernel/fork.c  | 2 ++
> > >  2 files changed, 10 insertions(+)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 594e835bad9c..c464fc8a514c 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct 
> > > vm_area_struct *vma)
> > >   VM_BUG_ON_VMA(vma->vm_lock_seq != 
> > > READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> > >  }
> > >
> > > +static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> > > +{
> > > + VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
> > > +   vma->vm_lock_seq != 
> > > READ_ONCE(vma->vm_mm->mm_lock_seq),
> > > +   vma);
> >
> > Do we really need to check for vm_lock_seq? rwsem_is_locked should tell
> > us something is wrong on its own, no? This could be somebody racing with
> > the vma destruction and using the write lock. Unlikely but I do not see
> > why to narrow debugging scope.
> 
> I wanted to ensure there are no page fault handlers (read-lockers)
> when we are destroying the VMA and rwsem_is_locked(&vma->lock) alone
> could trigger if someone is concurrently calling vma_write_lock(). But
> I don't think we expect someone to be write-locking the VMA while we

That would be UAF, no?

> are destroying it, so you are right, I'm overcomplicating things here.
> I think I can get rid of vma_assert_no_reader() and add
> VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock)) directly in
> __vm_area_free(). WDYT?

Yes, that adds some debugging. Not sure it is really necessary, but it
is VM_BUG_ON so why not.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page

2023-01-18 Thread Michal Hocko
On Tue 17-01-23 21:28:06, Jann Horn wrote:
> On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko  wrote:
> > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > > Protect VMA from concurrent page fault handler while collapsing a huge
> > > page. Page fault handler needs a stable PMD to use PTL and relies on
> > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > > not be detected by a page fault handler without proper locking.
> >
> > I am struggling with this changelog. Maybe because my recollection of
> > the THP collapsing subtleties is weak. But aren't you just trying to say
> > that the current #PF handling and THP collapsing need to be mutually
> > exclusive currently so in order to keep that assumption you have mark
> > the vma write locked?
> >
> > Also it is not really clear to me how that handles other vmas which can
> > share the same thp?
> 
> It's not about the hugepage itself, it's about how the THP collapse
> operation frees page tables.
> 
> Before this series, page tables can be walked under any one of the
> mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
> unlinks and frees page tables, it must ensure that all of those either
> are locked or don't exist. This series adds a fourth lock under which
> page tables can be traversed, and so khugepaged must also lock out that one.
> 
> There is a codepath in khugepaged that iterates through all mappings
> of a file to zap page tables (retract_page_tables()), which locks each
> visited mm with mmap_write_trylock() and now also does
> vma_write_lock().

OK, I see. This would be a great addendum to the changelog.
 
> I think one aspect of this patch that might cause trouble later on, if
> support for non-anonymous VMAs is added, is that retract_page_tables()
> now does vma_write_lock() while holding the mapping lock; the page
> fault handling path would probably take the locks the other way
> around, leading to a deadlock? So the vma_write_lock() in
> retract_page_tables() might have to become a trylock later on.

This, right?
#PF retract_page_tables
vma_read_lock
i_mmap_lock_write
i_mmap_lock_read
vma_write_lock


I might be missing something but I have only found huge_pmd_share to be
called from the #PF path. That one should be safe as it cannot be a
target for THP. Not that it would matter much because such a dependency
chain would be really subtle.
-- 
Michal Hocko
SUSE Labs


Re: crypto: p10-aes-gcm - Add asm markings necessary for kernel code

2023-01-18 Thread Herbert Xu
On Wed, Jan 18, 2023 at 02:04:44PM +1100, Stephen Rothwell wrote:
>
> arch/powerpc/crypto/aesp8-ppc.o: warning: objtool: 
> aes_p8_set_encrypt_key+0x44: unannotated intra-function call
> 
> from the powerpc pseries_le_defconfig build (which is otherwise ok).

Thanks Stephen.  I've reverted these changes completely.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call

2023-01-18 Thread Michal Hocko
On Tue 17-01-23 18:01:01, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko  wrote:
> >
> > On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> > > Move VMA flag modification (which now implies VMA locking) before
> > > anon_vma_lock_write to match the locking order of page fault handler.
> >
> > Does this changelog assumes per vma locking in the #PF?
> 
> Hmm, you are right. Page fault handlers do not use per-vma locks yet
> but the changelog already talks about that. Maybe I should change it
> to simply:
> ```
> Move VMA flag modification (which now implies VMA locking) before
> vma_adjust_trans_huge() to ensure the modifications are done after VMA
> has been locked.

Because 

Without that additional reasoning it is not really clear why that is
needed and seems arbitrary.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-18 Thread Michal Hocko
On Tue 17-01-23 21:54:58, Matthew Wilcox wrote:
> On Tue, Jan 17, 2023 at 01:21:47PM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:12 AM Michal Hocko  wrote:
> > >
> > > On Tue 17-01-23 16:04:26, Michal Hocko wrote:
> > > > On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA 
> > > > > lock
> > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. 
> > > > > When
> > > > > mmap_write_lock holder is done with all modifications and drops 
> > > > > mmap_lock,
> > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked 
> > > > > as
> > > > > locked.
> > > >
> > > > I have to say I was struggling a bit with the above and only understood
> > > > what you mean by reading the patch several times. I would phrase it like
> > > > this (feel free to use if you consider this to be an improvement).
> > > >
> > > > Introduce a per-VMA rw_semaphore. The lock implementation relies on a
> > > > per-vma and per-mm sequence counters to note exclusive locking:
> > > > - read lock - (implemented by vma_read_trylock) requires the
> > > >   vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
> > > >   differ. If they match then there must be a vma exclusive lock
> > > >   held somewhere.
> > > > - read unlock - (implemented by vma_read_unlock) is a trivial
> > > >   vma->lock unlock.
> > > > - write lock - (vma_write_lock) requires the mmap_lock to be
> > > >   held exclusively and the current mm counter is noted to the 
> > > > vma
> > > >   side. This will allow multiple vmas to be locked under a 
> > > > single
> > > >   mmap_lock write lock (e.g. during vma merging). The vma 
> > > > counter
> > > >   is modified under exclusive vma lock.
> > >
> > > Didn't realize one more thing.
> > > Unlike standard write lock this implementation allows to be
> > > called multiple times under a single mmap_lock. In a sense
> > > it is more of mark_vma_potentially_modified than a lock.
> > 
> > In the RFC it was called vma_mark_locked() originally and renames were
> > discussed in the email thread ending here:
> > https://lore.kernel.org/all/621612d7-c537-3971-9520-a3dec7b43...@suse.cz/.
> > If other names are preferable I'm open to changing them.
> 
> I don't want to bikeshed this, but rather than locking it seems to be
> more:
> 
>   vma_start_read()
>   vma_end_read()
>   vma_start_write()
>   vma_end_write()
>   vma_downgrade_write()
> 
> ... and that these are _implemented_ with locks (in part) is an
> implementation detail?

Agreed!

> Would that reduce people's confusion?

Yes I believe that naming it less like a locking primitive will clarify
it. vma_{start,end}_[try]read is better indeed. I am wondering about the
write side of things because that is where things get confusing. There
is no explicit write lock nor unlock. vma_start_write sounds better than
the vma_write_lock but it still lacks that pairing with vma_end_write
which is never the right thing to call. Wouldn't vma_mark_modified and
vma_publish_changes describe the scheme better?
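
A rough sketch of how that naming could map onto the sequence counter scheme
described above (hypothetical, for illustration only):

/* under mmap_write_lock: mark the VMA as modified ("write lock" it) */
static inline void vma_mark_modified(struct vm_area_struct *vma)
{
	down_write(&vma->lock);
	WRITE_ONCE(vma->vm_lock_seq, vma->vm_mm->mm_lock_seq);
	up_write(&vma->lock);
}

/* when mmap_lock is dropped: publish all marked VMAs in one go */
static inline void vma_publish_changes(struct mm_struct *mm)
{
	WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
}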

Downgrade case is probably the least interesting one because that is
just one off thing that can be completely hidden from any code besides
mmap_write_downgrade so I wouldn't be too concern about that one.

But as you say, no need to bikeshed this too much. Great naming is hard
and if the scheme is documented properly we can live with a suboptimal
naming as well.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 4/8] perf/core: Add perf_sample_save_brstack() helper

2023-01-18 Thread Athira Rajeev



> On 18-Jan-2023, at 11:35 AM, Namhyung Kim  wrote:
> 
> When it saves the branch stack to the perf sample data, it needs to
> update the sample flags and the dynamic size.  To make sure this,
> add the perf_sample_save_brstack() helper and convert all call sites.
> 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: x...@kernel.org
> Suggested-by: Peter Zijlstra 
> Acked-by: Jiri Olsa 
> Tested-by: Jiri Olsa 
> Signed-off-by: Namhyung Kim 

Hi Namhyung,

Changes looks good to me.

Acked-by: Athira Rajeev 

Thanks
Athira

> ---
> arch/powerpc/perf/core-book3s.c |  3 +-
> arch/x86/events/amd/core.c  |  6 +--
> arch/x86/events/intel/core.c|  6 +--
> arch/x86/events/intel/ds.c  |  9 ++---
> include/linux/perf_event.h  | 66 -
> kernel/events/core.c| 16 +++-
> 6 files changed, 53 insertions(+), 53 deletions(-)
> 
> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index bf318dd9b709..8c1f7def596e 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -2313,8 +2313,7 @@ static void record_and_restart(struct perf_event 
> *event, unsigned long val,
>   struct cpu_hw_events *cpuhw;
>   cpuhw = this_cpu_ptr(&cpu_hw_events);
>   power_pmu_bhrb_read(event, cpuhw);
> - data.br_stack = &cpuhw->bhrb_stack;
> - data.sample_flags |= PERF_SAMPLE_BRANCH_STACK;
> + perf_sample_save_brstack(&data, event, 
> &cpuhw->bhrb_stack);
>   }
> 
>   if (event->attr.sample_type & PERF_SAMPLE_DATA_SRC &&
> diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
> index d6f3703e4119..463f3eb8bbd7 100644
> --- a/arch/x86/events/amd/core.c
> +++ b/arch/x86/events/amd/core.c
> @@ -928,10 +928,8 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
>   if (!x86_perf_event_set_period(event))
>   continue;
> 
> - if (has_branch_stack(event)) {
> - data.br_stack = &cpuc->lbr_stack;
> - data.sample_flags |= PERF_SAMPLE_BRANCH_STACK;
> - }
> + if (has_branch_stack(event))
> + perf_sample_save_brstack(&data, event, 
> &cpuc->lbr_stack);
> 
>   if (perf_event_overflow(event, &data, regs))
>   x86_pmu_stop(event, 0);
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 29d2d0411caf..14f0a746257d 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3036,10 +3036,8 @@ static int handle_pmi_common(struct pt_regs *regs, u64 
> status)
> 
>   perf_sample_data_init(&data, 0, event->hw.last_period);
> 
> - if (has_branch_stack(event)) {
> - data.br_stack = &cpuc->lbr_stack;
> - data.sample_flags |= PERF_SAMPLE_BRANCH_STACK;
> - }
> + if (has_branch_stack(event))
> + perf_sample_save_brstack(&data, event, 
> &cpuc->lbr_stack);
> 
>   if (perf_event_overflow(event, &data, regs))
>   x86_pmu_stop(event, 0);
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 158cf845fc80..07c8a2cdc3ee 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1720,10 +1720,8 @@ static void setup_pebs_fixed_sample_data(struct 
> perf_event *event,
>   data->sample_flags |= PERF_SAMPLE_TIME;
>   }
> 
> - if (has_branch_stack(event)) {
> - data->br_stack = &cpuc->lbr_stack;
> - data->sample_flags |= PERF_SAMPLE_BRANCH_STACK;
> - }
> + if (has_branch_stack(event))
> + perf_sample_save_brstack(data, event, &cpuc->lbr_stack);
> }
> 
> static void adaptive_pebs_save_regs(struct pt_regs *regs,
> @@ -1883,8 +1881,7 @@ static void setup_pebs_adaptive_sample_data(struct 
> perf_event *event,
> 
>   if (has_branch_stack(event)) {
>   intel_pmu_store_pebs_lbrs(lbr);
> - data->br_stack = &cpuc->lbr_stack;
> - data->sample_flags |= PERF_SAMPLE_BRANCH_STACK;
> + perf_sample_save_brstack(data, event, &cpuc->lbr_stack);
>   }
>   }
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 569dfac5887f..7db0e9cc2682 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1102,6 +1102,31 @@ extern u64 perf_event_read_value(struct perf_event 
> *event,
> 
> extern struct perf_callchain_entry *perf_callchain(struct perf_event *event, 
> struct pt_regs *regs);
> 
> +static inline bool branch_sample_no_flags(const struct perf_event *event)
> +{
> + return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_NO_FLAGS;
> +}
> +
> +static inline bool branch_sample_no_cycles(const struct perf_

Re: Memory transaction instructions

2023-01-18 Thread David Howells
Linus Torvalds  wrote:

> And for the kernel, where we don't have bad locking, and where we
> actually use fine-grained locks that are _near_ the data that we are
> locking (the lockref of the dcache is obviously one example of that,
> but the skbuff queue you mention is almost certainly exactly the same
> situation): the lock is right by the data that the lock protects, and
> the "shared lock cacheline" model simply does not work. You'll bounce
> the data, and most likely you'll also touch the same lock cacheline
> too.

Yeah.  The reason I was actually wondering about them was whether it would be
possible to avoid the requirement to disable interrupts/softirqs to, say,
modify the skbuff queue.  On some arches actually disabling irqs is quite a
heavy operation (I think this is/was true on ppc64, for example; it certainly
was on frv) and it was necessary to "emulate" the disablement.

David



Re: crypto: p10-aes-gcm - Add asm markings necessary for kernel code

2023-01-18 Thread Sathvika Vasireddy

Hi Stephen,

On 18/01/23 08:34, Stephen Rothwell wrote:

Hi Herbert,

On Tue, 17 Jan 2023 15:26:24 +0800 Herbert Xu  
wrote:

On Tue, Jan 17, 2023 at 02:47:47PM +1100, Stephen Rothwell wrote:

Hi all,

After merging the crypto tree, today's linux-next build (powerpc
pseries_le_defconfig) failed like this:

arch/powerpc/crypto/p10_aes_gcm.o: warning: objtool: .text+0x884: unannotated 
intra-function call
arch/powerpc/crypto/aesp8-ppc.o: warning: objtool: aes_p8_set_encrypt_key+0x44: 
unannotated intra-function call
ld: arch/powerpc/crypto/p10_aes_gcm.o: ABI version 1 is not compatible with ABI 
version 2 output
ld: failed to merge target specific data of file 
arch/powerpc/crypto/p10_aes_gcm.o

Caused by commit

   ca68a96c37eb ("crypto: p10-aes-gcm - An accelerated AES/GCM stitched 
implementation")

I have applied the following hack for today.

Thanks Stephen, I'm going to update the previous fix as follows:

I still get:

arch/powerpc/crypto/aesp8-ppc.o: warning: objtool: aes_p8_set_encrypt_key+0x44: 
unannotated intra-function call

from the powerpc pseries_le_defconfig build (which is otherwise ok).


Warnings [1], [2], and [3] are seen with pseries_le_defconfig.

[1] - arch/powerpc/crypto/aesp8-ppc.o: warning: objtool: 
aes_p8_set_encrypt_key+0x44: unannotated intra-function call
[2] - arch/powerpc/crypto/aesp8-ppc.o: warning: objtool: .text+0x2448: 
unannotated intra-function call
[3] - arch/powerpc/crypto/aesp8-ppc.o: warning: objtool: .text+0x2d68: 
unannotated intra-function call

Given that there are no calls to _mcount, one way to address this warning is
to skip running objtool on the arch/powerpc/crypto/aesp8-ppc.o file.
The diff below works for me.

=
diff --git a/arch/powerpc/crypto/Makefile b/arch/powerpc/crypto/Makefile
index 5b8252013abd..d00664c8d761 100644
--- a/arch/powerpc/crypto/Makefile
+++ b/arch/powerpc/crypto/Makefile
@@ -31,3 +31,5 @@ targets += aesp8-ppc.S ghashp8-ppc.S
 
 $(obj)/aesp8-ppc.S $(obj)/ghashp8-ppc.S: $(obj)/%.S: $(src)/%.pl FORCE

$(call if_changed,perl)
+
+OBJECT_FILES_NON_STANDARD_aesp8-ppc.o := y
=


The other way to fix these warnings is by using the ANNOTATE_INTRA_FUNCTION_CALL
macro to indicate that the branch target is valid, and by annotating symbols
with the SYM_FUNC_START_LOCAL and SYM_FUNC_END macros.
The diff below works for me:

=
diff --git a/arch/powerpc/crypto/aesp8-ppc.pl b/arch/powerpc/crypto/aesp8-ppc.pl
index cdbcf6e13efc..355e0036869a 100644
--- a/arch/powerpc/crypto/aesp8-ppc.pl
+++ b/arch/powerpc/crypto/aesp8-ppc.pl
@@ -80,6 +80,9 @@
 # POWER8[le]   3.96/0.72   0.741.1
 # POWER8[be]   3.75/0.65   0.661.0
 
+print "#include \n";

+print "#include \n";
+
 $flavour = shift;
 
 if ($flavour =~ /64/) {

@@ -185,7 +188,8 @@ Lset_encrypt_key:
lis r0,0xfff0
mfspr   $vrsave,256
mtspr   256,r0
-
+
+   ANNOTATE_INTRA_FUNCTION_CALL
bl  Lconsts
mtlrr11
 
@@ -3039,7 +3043,7 @@ Lxts_enc6x_ret:

.long   0
 
 .align 5

-_aesp8_xts_enc5x:
+SYM_FUNC_START_LOCAL(_aesp8_xts_enc5x)
vcipher $out0,$out0,v24
vcipher $out1,$out1,v24
vcipher $out2,$out2,v24
@@ -3121,6 +3125,7 @@ _aesp8_xts_enc5x:
blr
 .long  0
 .byte  0,12,0x14,0,0,0,0,0
+SYM_FUNC_END(_aesp8_xts_enc5x)
 
 .align 5

 _aesp8_xts_decrypt6x:
@@ -3727,7 +3732,7 @@ Lxts_dec6x_ret:
.long   0
 
 .align 5

-_aesp8_xts_dec5x:
+SYM_FUNC_START_LOCAL(_aesp8_xts_dec5x)
vncipher$out0,$out0,v24
vncipher$out1,$out1,v24
vncipher$out2,$out2,v24
@@ -3809,6 +3814,7 @@ _aesp8_xts_dec5x:
blr
 .long  0
 .byte  0,12,0x14,0,0,0,0,0
+SYM_FUNC_END(_aesp8_xts_dec5x)
 ___
 }} }}}

=


Thanks,
Sathvika



[PATCH v6 5/5] powerpc/64s/radix: combine final TLB flush and lazy tlb mm shootdown IPIs

2023-01-18 Thread Nicholas Piggin
** Not for merge **

CONFIG_MMU_LAZY_TLB_SHOOTDOWN requires IPIs to clear the "lazy tlb"
references to an mm that is being freed. With the radix MMU, the final
userspace exit TLB flush can be performed with IPIs, and those IPIs can
also clear lazy tlb mm references, which mostly eliminates the final
IPIs required by MMU_LAZY_TLB_SHOOTDOWN.

This does mean the final TLB flush is not done with TLBIE, which can be
faster than IPI+TLBIEL, but we would have to do those IPIs for lazy
shootdown so using TLBIEL should be a win.

The final cpumask test and possible IPIs are still needed to clean up
some rare race cases. We could prevent those entirely (e.g., prevent new
lazy tlb mm references if userspace has gone away, or move the final
TLB flush later), but I'd have to see actual numbers that matter before
adding any more complexity for it. I can't imagine it would ever be
worthwhile.

This takes lazy tlb mm shootdown IPI interrupts from 314 to 3 on a 144
CPU system doing a kernel compile. It also takes care of the one
potential problem workload which is a short-lived process with multiple
CPU-bound threads that want to be spread to other CPUs, because the mm
exit happens after the process is back to single-threaded.

---
 arch/powerpc/mm/book3s64/radix_tlb.c | 26 +-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index 282359ab525b..f34b78cb4c7d 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -1303,7 +1303,31 @@ void radix__tlb_flush(struct mmu_gather *tlb)
 * See the comment for radix in arch_exit_mmap().
 */
if (tlb->fullmm || tlb->need_flush_all) {
-   __flush_all_mm(mm, true);
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
+   /*
+* Shootdown based lazy tlb mm refcounting means we
+* have to IPI everyone in the mm_cpumask anyway soon
+* when the mm goes away, so might as well do it as
+* part of the final flush now.
+*
+* If lazy shootdown was improved to reduce IPIs (e.g.,
+* by batching), then it may end up being better to use
+* tlbies here instead.
+*/
+   smp_mb(); /* see radix__flush_tlb_mm */
+   exit_flush_lazy_tlbs(mm);
+   _tlbiel_pid(mm->context.id, RIC_FLUSH_ALL);
+
+   /*
+* It should not be possible to have coprocessors still
+* attached here.
+*/
+   if (WARN_ON_ONCE(atomic_read(&mm->context.copros) > 0))
+   __flush_all_mm(mm, true);
+   } else {
+   __flush_all_mm(mm, true);
+   }
+
} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
if (!tlb->freed_tables)
radix__flush_tlb_mm(mm);
-- 
2.37.2



[PATCH v6 4/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN

2023-01-18 Thread Nicholas Piggin
On a 16-socket 192-core POWER8 system, a context switching benchmark
with as many software threads as CPUs (so each switch will go in and
out of idle), upstream can achieve a rate of about 1 million context
switches per second, due to contention on the mm refcount.

64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
the option. This increases the above benchmark to 118 million context
switches per second.

This generates 314 additional IPI interrupts on a 144 CPU system doing
a kernel compile, which is in the noise in terms of kernel cycles.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index b8c4ac56bddc..600ace5a7f1a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -265,6 +265,7 @@ config PPC
select MMU_GATHER_PAGE_SIZE
select MMU_GATHER_RCU_TABLE_FREE
select MMU_GATHER_MERGE_VMAS
+   select MMU_LAZY_TLB_SHOOTDOWN   if PPC_BOOK3S_64
select MODULES_USE_ELF_RELA
select NEED_DMA_MAP_STATE   if PPC64 || NOT_COHERENT_CACHE
select NEED_PER_CPU_EMBED_FIRST_CHUNK   if PPC64
-- 
2.37.2



[PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme

2023-01-18 Thread Nicholas Piggin
On big systems, the mm refcount can become highly contended when doing
a lot of context switching with threaded applications (particularly
switching between the idle thread and an application thread).

Abandoning lazy tlb slows switching down quite a bit in the important
user->idle->user cases, so instead implement a non-refcounted scheme
that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
any remaining lazy ones.

Shootdown IPI cost could be an issue, but the IPIs have not been observed
to be a serious problem with this scheme, because short-lived processes
tend not to migrate CPUs much, therefore they don't get much chance to
leave lazy tlb mm references on remote CPUs. There are a lot of options
to reduce them if necessary.

Signed-off-by: Nicholas Piggin 
---
 arch/Kconfig  | 15 
 kernel/fork.c | 65 +++
 2 files changed, 80 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index b07d36f08fea..f7da34e4bc62 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -481,6 +481,21 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 # already).
 config MMU_LAZY_TLB_REFCOUNT
def_bool y
+   depends on !MMU_LAZY_TLB_SHOOTDOWN
+
+# This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
+# mm as a lazy tlb beyond its last reference count, by shooting down these
+# users before the mm is deallocated. __mmdrop() first IPIs all CPUs that may
+# be using the mm as a lazy tlb, so that they may switch themselves to using
+# init_mm for their active mm. mm_cpumask(mm) is used to determine which CPUs
+# may be using mm as a lazy tlb mm.
+#
+# To implement this, an arch *must*:
+# - At the time of the final mmdrop of the mm, ensure mm_cpumask(mm) contains
+#   at least all possible CPUs in which the mm is lazy.
+# - It must meet the requirements for MMU_LAZY_TLB_REFCOUNT=n (see above).
+config MMU_LAZY_TLB_SHOOTDOWN
+   bool
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..263660e78c2a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -780,6 +780,67 @@ static void check_mm(struct mm_struct *mm)
 #define allocate_mm()  (kmem_cache_alloc(mm_cachep, GFP_KERNEL))
 #define free_mm(mm)(kmem_cache_free(mm_cachep, (mm)))
 
+static void do_check_lazy_tlb(void *arg)
+{
+   struct mm_struct *mm = arg;
+
+   WARN_ON_ONCE(current->active_mm == mm);
+}
+
+static void do_shoot_lazy_tlb(void *arg)
+{
+   struct mm_struct *mm = arg;
+
+   if (current->active_mm == mm) {
+   WARN_ON_ONCE(current->mm);
+   current->active_mm = &init_mm;
+   switch_mm(mm, &init_mm, current);
+   }
+}
+
+static void cleanup_lazy_tlbs(struct mm_struct *mm)
+{
+   if (!IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
+   /*
+* In this case, lazy tlb mms are refounted and would not reach
+* __mmdrop until all CPUs have switched away and mmdrop()ed.
+*/
+   return;
+   }
+
+   /*
+* Lazy TLB shootdown does not refcount "lazy tlb mm" usage, rather it
+* requires lazy mm users to switch to another mm when the refcount
+* drops to zero, before the mm is freed. This requires IPIs here to
+* switch kernel threads to init_mm.
+*
+* archs that use IPIs to flush TLBs can piggy-back that lazy tlb mm
+* switch with the final userspace teardown TLB flush which leaves the
+* mm lazy on this CPU but no others, reducing the need for additional
+* IPIs here. There are cases where a final IPI is still required here,
+* such as the final mmdrop being performed on a different CPU than the
+* one exiting, or kernel threads using the mm when userspace exits.
+*
+* IPI overheads have not been found to be expensive, but they could be
+* reduced in a number of possible ways, for example (roughly
+* increasing order of complexity):
+* - The last lazy reference created by exit_mm() could instead switch
+*   to init_mm, however it's probable this will run on the same CPU
+*   immediately afterwards, so this may not reduce IPIs much.
+* - A batch of mms requiring IPIs could be gathered and freed at once.
+* - CPUs store active_mm where it can be remotely checked without a
+*   lock, to filter out false-positives in the cpumask.
+* - After mm_users or mm_count reaches zero, switching away from the
+*   mm could clear mm_cpumask to reduce some IPIs, perhaps together
+*   with some batching or delaying of the final IPIs.
+* - A delayed freeing and RCU-like quiescing sequence based on mm
+*   switching to avoid IPIs completely.
+*/
+   on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
+   if (IS_ENABLED(CONFIG_DEBUG_VM))
+   on_each_cpu(do_check_lazy

[PATCH v6 2/5] lazy tlb: allow lazy tlb mm refcounting to be configurable

2023-01-18 Thread Nicholas Piggin
Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
when it is context switched. This can be disabled by architectures that
don't require this refcounting if they clean up lazy tlb mms when the
last refcount is dropped. Currently this is always enabled, which is
what existing code does, so the patch is effectively a no-op.

Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.

Signed-off-by: Nicholas Piggin 
---
 Documentation/mm/active_mm.rst |  6 ++
 arch/Kconfig   | 17 +
 include/linux/sched/mm.h   | 18 +++---
 kernel/sched/core.c| 22 ++
 kernel/sched/sched.h   |  4 +++-
 5 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/Documentation/mm/active_mm.rst b/Documentation/mm/active_mm.rst
index 6f8269c284ed..2b0d08332400 100644
--- a/Documentation/mm/active_mm.rst
+++ b/Documentation/mm/active_mm.rst
@@ -4,6 +4,12 @@
 Active MM
 =
 
+Note, the mm_count refcount may no longer include the "lazy" users
+(running tasks with ->active_mm == mm && ->mm == NULL) on kernels
+with CONFIG_MMU_LAZY_TLB_REFCOUNT=n. Taking and releasing these lazy
+references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb()
+helpers which abstract this config option.
+
 ::
 
  List:   linux-kernel
diff --git a/arch/Kconfig b/arch/Kconfig
index 12e3ddabac9d..b07d36f08fea 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -465,6 +465,23 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
  irqs disabled over activate_mm. Architectures that do IPI based TLB
  shootdowns should enable this.
 
+# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
+# MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching
+# to/from kernel threads when the same mm is running on a lot of CPUs (a large
+# multi-threaded application), by reducing contention on the mm refcount.
+#
+# This can be disabled if the architecture ensures no CPUs are using an mm as a
+# "lazy tlb" beyond its final refcount (i.e., by the time __mmdrop frees the mm
+# or its kernel page tables). This could be arranged by arch_exit_mmap(), or
+# final exit(2) TLB flush, for example.
+#
+# To implement this, an arch *must*:
+# Ensure the _lazy_tlb variants of mmgrab/mmdrop are used when dropping the
+# lazy reference of a kthread's ->active_mm (non-arch code has been converted
+# already).
+config MMU_LAZY_TLB_REFCOUNT
+   def_bool y
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 5376caf6fcf3..68bbe8d90c2e 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -82,17 +82,29 @@ static inline void mmdrop_sched(struct mm_struct *mm)
 /* Helpers for lazy TLB mm refcounting */
 static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
 {
-   mmgrab(mm);
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
+   mmgrab(mm);
 }
 
 static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
 {
-   mmdrop(mm);
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
+   mmdrop(mm);
+   } else {
+   /*
+* mmdrop_lazy_tlb must provide a full memory barrier, see the
+* membarrier comment finish_task_switch which relies on this.
+*/
+   smp_mb();
+   }
 }
 
 static inline void mmdrop_lazy_tlb_sched(struct mm_struct *mm)
 {
-   mmdrop_sched(mm);
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
+   mmdrop_sched(mm);
+   else
+   smp_mb(); // see above
 }
 
 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26aaa974ee6d..1ea14d849a0d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5081,7 +5081,7 @@ static struct rq *finish_task_switch(struct task_struct 
*prev)
__releases(rq->lock)
 {
struct rq *rq = this_rq();
-   struct mm_struct *mm = rq->prev_mm;
+   struct mm_struct *mm = NULL;
unsigned int prev_state;
 
/*
@@ -5100,7 +5100,10 @@ static struct rq *finish_task_switch(struct task_struct 
*prev)
  current->comm, current->pid, preempt_count()))
preempt_count_set(FORK_PREEMPT_COUNT);
 
-   rq->prev_mm = NULL;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+   mm = rq->prev_lazy_mm;
+   rq->prev_lazy_mm = NULL;
+#endif
 
/*
 * A task struct has one reference for the use as "current".
@@ -5231,9 +5234,20 @@ context_switch(struct rq *rq, struct task_struct *prev,
lru_gen_use_mm(next->mm);
 
if (!prev->mm) {// from kernel
-   /* will mmdrop_lazy_tlb() in finish_task_switch(). */
-   rq->prev_mm = prev->active_mm;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+   /* Will mmdrop_lazy_tlb() in finish_task_switch(). */
+   rq->prev_lazy_m

[PATCH v6 1/5] lazy tlb: introduce lazy tlb mm refcount helper functions

2023-01-18 Thread Nicholas Piggin
Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting.
This makes the lazy tlb mm references more obvious, and allows the
refcounting scheme to be modified in later changes.

The only functional change is in kthread_use_mm/kthread_unuse_mm,
because it is clever with refcounting: If it happens that the kthread's
lazy tlb mm (active_mm) is the same as the mm to be used, the code
doesn't touch the refcount but rather transfers the lazy refcount to
used-mm refcount. If the lazy tlb mm refcount is no longer equivalent to
the regular refcount, this trick cannot be used. Instead, mmgrab a regular
reference on mm to use, and mmdrop_lazy_tlb the previous active_mm.
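
A condensed sketch of the resulting flow in kthread_use_mm() (hypothetical
helper shown for illustration; see the actual diff for the real sequence):

static void kthread_switch_to_mm(struct task_struct *tsk, struct mm_struct *mm)
{
	struct mm_struct *old_active_mm = tsk->active_mm;

	mmgrab(mm);			/* regular reference for tsk->mm */

	task_lock(tsk);
	tsk->active_mm = mm;
	tsk->mm = mm;
	task_unlock(tsk);

	switch_mm(old_active_mm, mm, tsk);

	/* the previous active_mm only ever held a lazy tlb reference */
	mmdrop_lazy_tlb(old_active_mm);
}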

Signed-off-by: Nicholas Piggin 
---
 arch/arm/mach-rpc/ecard.c|  2 +-
 arch/powerpc/kernel/smp.c|  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c |  4 ++--
 fs/exec.c|  2 +-
 include/linux/sched/mm.h | 16 
 kernel/cpu.c |  2 +-
 kernel/exit.c|  2 +-
 kernel/kthread.c | 21 +
 kernel/sched/core.c  | 15 ---
 9 files changed, 44 insertions(+), 22 deletions(-)

diff --git a/arch/arm/mach-rpc/ecard.c b/arch/arm/mach-rpc/ecard.c
index 53813f9464a2..c30df1097c52 100644
--- a/arch/arm/mach-rpc/ecard.c
+++ b/arch/arm/mach-rpc/ecard.c
@@ -253,7 +253,7 @@ static int ecard_init_mm(void)
current->mm = mm;
current->active_mm = mm;
activate_mm(active_mm, mm);
-   mmdrop(active_mm);
+   mmdrop_lazy_tlb(active_mm);
ecard_init_pgtables(mm);
return 0;
 }
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6b90f10a6c81..7db6b3faea65 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1611,7 +1611,7 @@ void start_secondary(void *unused)
if (IS_ENABLED(CONFIG_PPC32))
setup_kup();
 
-   mmgrab(&init_mm);
+   mmgrab_lazy_tlb(&init_mm);
current->active_mm = &init_mm;
 
smp_store_cpu_info(cpu);
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 4e29b619578c..282359ab525b 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -794,10 +794,10 @@ void exit_lazy_flush_tlb(struct mm_struct *mm, bool always_flush)
if (current->active_mm == mm) {
WARN_ON_ONCE(current->mm != NULL);
/* Is a kernel thread and is using mm as the lazy tlb */
-   mmgrab(&init_mm);
+   mmgrab_lazy_tlb(&init_mm);
current->active_mm = &init_mm;
switch_mm_irqs_off(mm, &init_mm, current);
-   mmdrop(mm);
+   mmdrop_lazy_tlb(mm);
}
 
/*
diff --git a/fs/exec.c b/fs/exec.c
index ab913243a367..1a32a88db173 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1033,7 +1033,7 @@ static int exec_mmap(struct mm_struct *mm)
mmput(old_mm);
return 0;
}
-   mmdrop(active_mm);
+   mmdrop_lazy_tlb(active_mm);
return 0;
 }
 
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2a243616f222..5376caf6fcf3 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -79,6 +79,22 @@ static inline void mmdrop_sched(struct mm_struct *mm)
 }
 #endif
 
+/* Helpers for lazy TLB mm refcounting */
+static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
+{
+   mmgrab(mm);
+}
+
+static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
+{
+   mmdrop(mm);
+}
+
+static inline void mmdrop_lazy_tlb_sched(struct mm_struct *mm)
+{
+   mmdrop_sched(mm);
+}
+
 /**
  * mmget() - Pin the address space associated with a &struct mm_struct.
  * @mm: The address space to pin.
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 6c0a92ca6bb5..189895288d9d 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -623,7 +623,7 @@ static int finish_cpu(unsigned int cpu)
 */
if (mm != &init_mm)
idle->active_mm = &init_mm;
-   mmdrop(mm);
+   mmdrop_lazy_tlb(mm);
return 0;
 }
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 15dc2ec80c46..1a4608d765e4 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -537,7 +537,7 @@ static void exit_mm(void)
return;
sync_mm_rss(mm);
mmap_read_lock(mm);
-   mmgrab(mm);
+   mmgrab_lazy_tlb(mm);
BUG_ON(mm != current->active_mm);
/* more a memory barrier than a real lock */
task_lock(current);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index f97fd01a2932..691b213e578f 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1410,14 +1410,19 @@ void kthread_use_mm(struct mm_struct *mm)
WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
WARN_ON_ONCE(tsk->mm);
 
+   /*
+* It's possible that tsk->active_mm == mm here, but we must
+* still mmgrab(mm) and mmdrop_lazy_

[PATCH v6 0/5] shoot lazy tlbs

2023-01-18 Thread Nicholas Piggin
It's time for that annual flamewar. Nothing really changed in core code
to clean things up or make it better for x86 last year, so I think we
can table that objection.

IIRC the situation left off with Andy proposing a different approach,
and Linus preferring to shoot the lazies at exit time (piggybacking on
the TLB flush IPI), which is what this series allows an arch to do.
Discussion thread here:

https://lore.kernel.org/linux-arch/7c9c388c388df8e88bb5d14828053ac0cb11cf69.1641659630.git.l...@kernel.org/
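
The exit-time approach amounts to reusing the IPI that the final TLB
flush already has to send to each CPU that has used the dying mm: if
that CPU is a kernel thread still holding mm as its lazy tlb mm, switch
it over to init_mm there and then. A sketch of the per-CPU step,
following the exit_lazy_flush_tlb() hunk in patch 1 (the function name
below is only illustrative):

	/* Runs on a CPU targeted by the final exit TLB flush IPI */
	static void shoot_lazy_tlb_mm(struct mm_struct *mm)
	{
		if (current->active_mm == mm) {
			WARN_ON_ONCE(current->mm != NULL);
			/* kernel thread using mm as its lazy tlb mm */
			mmgrab_lazy_tlb(&init_mm);
			current->active_mm = &init_mm;
			switch_mm_irqs_off(mm, &init_mm, current);
			mmdrop_lazy_tlb(mm);
		}
	}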

I don't think there was any movement on this or other alternatives, or
code cleanups since then, but correct me if I'm wrong.

Since v5 of this series, there has just been a minor rebase to upstream,
and some tweaking of comments and code style. No functional changes.

Also included is patch 5, the optimisation that combines the final TLB
shootdown with the lazy tlb mm shootdown IPIs. It is included because
Linus expected to see it. It works fine, but I have some other powerpc
changes I would like to go in ahead of it, so I would like to take
those through the powerpc tree. Giving it a release cycle without that
optimisation will also help stress-test the final IPI cleanup path,
which I would like.

Even without the last patch, the additional IPIs caused by shoot lazy
are down in the noise, so I'm not too concerned about it.

Thanks,
Nick

Nicholas Piggin (5):
  lazy tlb: introduce lazy tlb mm refcount helper functions
  lazy tlb: allow lazy tlb mm refcounting to be configurable
  lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling
scheme
  powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
  powerpc/64s/radix: combine final TLB flush and lazy tlb mm shootdown
IPIs

 Documentation/mm/active_mm.rst   |  6 +++
 arch/Kconfig | 32 ++
 arch/arm/mach-rpc/ecard.c|  2 +-
 arch/powerpc/Kconfig |  1 +
 arch/powerpc/kernel/smp.c|  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c | 30 +++--
 fs/exec.c|  2 +-
 include/linux/sched/mm.h | 28 
 kernel/cpu.c |  2 +-
 kernel/exit.c|  2 +-
 kernel/fork.c| 65 
 kernel/kthread.c | 21 +
 kernel/sched/core.c  | 35 ++-
 kernel/sched/sched.h |  4 +-
 14 files changed, 205 insertions(+), 27 deletions(-)

-- 
2.37.2