[PATCH 1/1] ASoC: imx-hdmi: Use dev_err_probe
This silences -517 errors and helps figure out why the device probe is deferred.

Signed-off-by: Alexander Stein
---
 sound/soc/fsl/imx-hdmi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sound/soc/fsl/imx-hdmi.c b/sound/soc/fsl/imx-hdmi.c
index a780cf5a65ffa..b6cc7e6c2a320 100644
--- a/sound/soc/fsl/imx-hdmi.c
+++ b/sound/soc/fsl/imx-hdmi.c
@@ -202,7 +202,7 @@ static int imx_hdmi_probe(struct platform_device *pdev)
 	snd_soc_card_set_drvdata(&data->card, data);

 	ret = devm_snd_soc_register_card(&pdev->dev, &data->card);
 	if (ret) {
-		dev_err(&pdev->dev, "snd_soc_register_card failed (%d)\n", ret);
+		dev_err_probe(&pdev->dev, ret, "snd_soc_register_card failed\n");
 		goto fail;
 	}
--
2.34.1
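As a userspace illustration of the behavior the patch relies on (a sketch only — `dev_err_probe_model()` below is a hypothetical stand-in, not the kernel's `dev_err_probe()`): the helper suppresses logging when the error is -EPROBE_DEFER (-517) and hands the error code back to the caller.

```c
#include <assert.h>
#include <stdio.h>

#define EPROBE_DEFER 517	/* the -517 the changelog refers to */

/*
 * Toy model (not the kernel function): dev_err_probe() logs only when the
 * error is not -EPROBE_DEFER, and returns the error code so callers can
 * also write "return dev_err_probe(dev, ret, ...);".
 */
static int dev_err_probe_model(int err, const char *msg)
{
	if (err != -EPROBE_DEFER)
		fprintf(stderr, "error %d: %s\n", err, msg);
	/* a deferred probe records msg as the deferral reason instead */
	return err;
}
```

A deferred probe is silenced while a genuine failure is still reported, and in both cases the original code is returned unchanged.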
[PATCH v2] powerpc: Implement arch_within_stack_frames
Walks the stack when copy_{to,from}_user address is in the stack to ensure that the object being copied is entirely within a single stack frame. Substantially similar to the x86 implementation except using the back chain to traverse the stack and identify stack frame boundaries.

Signed-off-by: Nicholas Miehlbradt
---
v2: Rename PARAMETER_SAVE_OFFSET to STACK_FRAME_PARAMS
    Add definitions of STACK_FRAME_PARAMS for PPC32 and remove dependency on PPC64
    Ignore the current stack frame and start with its parent, similar to x86
v1: https://lore.kernel.org/linuxppc-dev/20221214044252.1910657-1-nicho...@linux.ibm.com/
---
 arch/powerpc/Kconfig                   |  1 +
 arch/powerpc/include/asm/thread_info.h | 36 ++
 2 files changed, 37 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2ca5418457ed..97ca54773521 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -198,6 +198,7 @@ config PPC
 	select HAVE_ARCH_KASAN_VMALLOC if HAVE_ARCH_KASAN
 	select HAVE_ARCH_KFENCE if ARCH_SUPPORTS_DEBUG_PAGEALLOC
 	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
+	select HAVE_ARCH_WITHIN_STACK_FRAMES
 	select HAVE_ARCH_KGDB
 	select HAVE_ARCH_MMAP_RND_BITS
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index af58f1ed3952..c5dce5f239c1 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -186,6 +186,42 @@ static inline bool test_thread_local_flags(unsigned int flags)
 #define is_elf2_task() (0)
 #endif

+#if defined(CONFIG_PPC64_ELF_ABI_V1)
+#define STACK_FRAME_PARAMS 48
+#elif defined(CONFIG_PPC64_ELF_ABI_V2)
+#define STACK_FRAME_PARAMS 32
+#elif defined(CONFIG_PPC32)
+#define STACK_FRAME_PARAMS 8
+#endif
+
+/*
+ * Walks up the stack frames to make sure that the specified object is
+ * entirely contained by a single stack frame.
+ *
+ * Returns:
+ *	GOOD_FRAME	if within a frame
+ *	BAD_STACK	if placed across a frame boundary (or outside stack)
+ */
+static inline int arch_within_stack_frames(const void * const stack,
+					   const void * const stackend,
+					   const void *obj, unsigned long len)
+{
+	const void *params;
+	const void *frame;
+
+	params = *(const void * const *)current_stack_pointer + STACK_FRAME_PARAMS;
+	frame = **(const void * const * const *)current_stack_pointer;
+
+	while (stack <= frame && frame < stackend) {
+		if (obj + len <= frame)
+			return obj >= params ? GOOD_FRAME : BAD_STACK;
+		params = frame + STACK_FRAME_PARAMS;
+		frame = *(const void * const *)frame;
+	}
+
+	return BAD_STACK;
+}
+
 #endif /* !__ASSEMBLY__ */
 #endif /* __KERNEL__ */
--
2.34.1
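The frame walk in the patch can be exercised in userspace against a fake back-chained stack. This is a sketch under stated assumptions — `within_stack_frames()`, the 64-byte frames, and the fake-stack layout below are illustrative models, not the kernel code; it uses the ELFv2 value of STACK_FRAME_PARAMS.

```c
#include <assert.h>
#include <stddef.h>

#define GOOD_FRAME 1
#define BAD_STACK  0
#define STACK_FRAME_PARAMS 32	/* the ELFv2 value from the patch */

/*
 * Userspace model of the patch's walk: each frame begins with a back-chain
 * pointer to its parent frame.  The current frame is skipped; an object is
 * GOOD only if it fits between one frame's parameter-save area and the
 * base of the next frame up.
 */
static int within_stack_frames(const void *stack, const void *stackend,
			       const void *sp, const void *obj,
			       unsigned long len)
{
	const void *params = (const char *)(*(const void * const *)sp) +
			     STACK_FRAME_PARAMS;
	const void *frame = *(const void * const *)(*(const void * const *)sp);

	while (stack <= frame && frame < stackend) {
		if ((const char *)obj + len <= (const char *)frame)
			return obj >= params ? GOOD_FRAME : BAD_STACK;
		params = (const char *)frame + STACK_FRAME_PARAMS;
		frame = *(const void * const *)frame;
	}
	return BAD_STACK;
}

/* A 256-byte fake stack of four 64-byte frames linked by back chains. */
static void *fake_stack[256 / sizeof(void *)];

static int check(unsigned long obj_off, unsigned long len)
{
	unsigned char *base = (unsigned char *)fake_stack;

	if (!fake_stack[0]) {
		*(void **)(base +   0) = base +  64;	/* current -> parent */
		*(void **)(base +  64) = base + 128;	/* parent -> grandparent */
		*(void **)(base + 128) = base + 192;
		*(void **)(base + 192) = base + 256;	/* chain leaves the stack */
	}
	return within_stack_frames(base, base + 256, base, base + obj_off, len);
}
```

An object wholly inside the parent frame's locals is GOOD; one straddling a frame boundary, or sitting in the (skipped) current frame, is BAD.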
Re: [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme
On Thu Jan 19, 2023 at 8:22 AM AEST, Nadav Amit wrote:
>
> > On Jan 18, 2023, at 12:00 AM, Nicholas Piggin wrote:
> >
> > +static void do_shoot_lazy_tlb(void *arg)
> > +{
> > +	struct mm_struct *mm = arg;
> > +
> > +	if (current->active_mm == mm) {
> > +		WARN_ON_ONCE(current->mm);
> > +		current->active_mm = &init_mm;
> > +		switch_mm(mm, &init_mm, current);
> > +	}
> > +}
>
> I might be out of touch - doesn’t a flush already take place when we free
> the page-tables, at least on common cases on x86?
>
> IIUC exit_mmap() would free page-tables, and whenever page-tables are
> freed, on x86, we do shootdown regardless to whether the target CPU TLB
> state marks is_lazy. Then, flush_tlb_func() should call switch_mm_irqs_off()
> and everything should be fine, no?
>
> [ I understand you care about powerpc, just wondering on the effect on x86 ]

Now I come to think of it, Rik had done this for x86 a while back.

https://lore.kernel.org/all/20180728215357.3249-10-r...@surriel.com/

I didn't know about it when I wrote this, so I never dug into why it didn't
get merged. It might have missed the final __mmdrop races but I'm not sure,
x86 lazy tlb mode is too complicated to know at a glance. I would check with
him though.

Thanks,
Nick
Re: [PATCH v6 4/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
On Thu Jan 19, 2023 at 3:30 AM AEST, Linus Torvalds wrote:
> [ Adding a few more x86 and arm64 maintainers - while linux-arch is
> the right mailing list, I'm not convinced people actually follow it
> all that closely ]
>
> On Wed, Jan 18, 2023 at 12:00 AM Nicholas Piggin wrote:
> >
> > On a 16-socket 192-core POWER8 system, a context switching benchmark
> > with as many software threads as CPUs (so each switch will go in and
> > out of idle), upstream can achieve a rate of about 1 million context
> > switches per second, due to contention on the mm refcount.
> >
> > 64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
> > the option. This increases the above benchmark to 118 million context
> > switches per second.
>
> Well, the 1M -> 118M change does seem like a good reason for this series.

It was an artificial corner case, mind you. I don't think it's a reason to
panic and likely smaller systems with faster atomics will care far less than
our big 2-hop systems.

Benchmark is will-it-scale:

  ./context_switch1_threads -t 768
  min:2174 max:2690 total:1827952

  33.52%  [k] finish_task_switch
  27.26%  [k] interrupt_return
  22.66%  [k] __schedule
   2.30%  [k] _raw_spin_trylock

  ./context_switch1_threads -t 1536
  min:103000 max:120100 total:177201906

The top case has 1/2 the switching pairs to available CPU, which makes them
all switch the same mm between real and lazy. Bottom case is just switching
between user threads so that doesn't hit the lazy refcount.

> The patches certainly don't look offensive to me, so Ack as far as I'm
> concerned, but honestly, it's been some time since I've personally
> been active on the idle and lazy TLB code, so that ack is probably
> largely worthless.
>
> If anything, my main reaction to this all is to wonder whether the
> config option is a good idea - maybe we could do this unconditionally,
> and make the source code (and logic) simpler to follow when you don't
> have to worry about the CONFIG_MMU_LAZY_TLB_REFCOUNT option.
>
> I wouldn't be surprised to hear that x86 can have the same issue where
> the mm_struct refcount is a bigger issue than the possibility of an
> extra TLB shootdown at the final exit time.
>
> But having the config options as a way to switch people over gradually
> (and perhaps then removing it later) doesn't sound wrong to me either.

IMO it's trivial enough that we could carry both, but everything's a straw
on the camel's back so if we can consolidate it would always be preferable.
Let's see how it plays out for a few releases.

> And I personally find the argument in patch 3/5 fairly convincing:
>
>    Shootdown IPIs cost could be an issue, but they have not been observed
>    to be a serious problem with this scheme, because short-lived processes
>    tend not to migrate CPUs much, therefore they don't get much chance to
>    leave lazy tlb mm references on remote CPUs.
>
> Andy? PeterZ? Catalin?
>
> Nick - it might be good to link to the actual benchmark, and let
> people who have access to big machines perhaps just try it out on
> non-powerpc platforms...

Yep good point, I'll put it in the changelog.

I might submit another round to Andrew in a bit with acks and any minor
tweaks and minus the last patch, assuming no major changes or objections.

Thanks,
Nick
Re: [PATCH v2 3/5] dt-bindings: usb: Convert OMAP OHCI/EHCI bindings to schema
Hi Rob, I love your patch! Perhaps something to improve: [auto build test WARNING on 1b929c02afd37871d5afb9d498426f83432e71c2] url: https://github.com/intel-lab-lkp/linux/commits/Rob-Herring/dt-bindings-usb-Remove-obsolete-brcm-bcm3384-usb-txt/20230119-030120 base: 1b929c02afd37871d5afb9d498426f83432e71c2 patch link: https://lore.kernel.org/r/20230110-dt-usb-v2-3-926bc1260e51%40kernel.org patch subject: [PATCH v2 3/5] dt-bindings: usb: Convert OMAP OHCI/EHCI bindings to schema reproduce: # https://github.com/intel-lab-lkp/linux/commit/e7220b26de1a7fcd192feec481c1a90f7bf5c949 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Rob-Herring/dt-bindings-usb-Remove-obsolete-brcm-bcm3384-usb-txt/20230119-030120 git checkout e7220b26de1a7fcd192feec481c1a90f7bf5c949 make menuconfig # enable CONFIG_COMPILE_TEST, CONFIG_WARN_MISSING_DOCUMENTS, CONFIG_WARN_ABI_ERRORS make htmldocs If you fix the issue, kindly add following tag where applicable | Reported-by: kernel test robot All warnings (new ones prefixed by >>): >> Warning: Documentation/devicetree/bindings/mfd/omap-usb-host.txt references >> a file that doesn't exist: >> Documentation/devicetree/bindings/usb/ehci-generic.yaml >> Warning: Documentation/devicetree/bindings/mfd/omap-usb-host.txt references >> a file that doesn't exist: >> Documentation/devicetree/bindings/usb/ohci-generic.yaml -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests
Re: Memory transaction instructions
On Wed Jan 18, 2023 at 7:05 PM AEST, David Howells wrote: > Linus Torvalds wrote: > > > And for the kernel, where we don't have bad locking, and where we > > actually use fine-grained locks that are _near_ the data that we are > > locking (the lockref of the dcache is obviously one example of that, > > but the skbuff queue you mention is almost certainly exactly the same > > situation): the lock is right by the data that the lock protects, and > > the "shared lock cacheline" model simply does not work. You'll bounce > > the data, and most likely you'll also touch the same lock cacheline > > too. > > Yeah. The reason I was actually wondering about them was if it would be > possible to avoid the requirement to disable interrupts/softirqs to, say, > modify the skbuff queue. On some arches actually disabling irqs is quite a > heavy operation (I think this is/was true on ppc64, for example; it certainly > was on frv) and it was necessary to "emulate" the disablement. Not too bad on modern ppc64. Changing MSR in general has to flush the pipe and even re-fetch, because it can alter memory translation among other things, so it was heavy. Everything we support has a lightweight MSR change that just modifies the interrupt enable bit and only needs minor serialisation (although we still have that software-irq-disable thing which avoids the heavy MSR problem on old CPUs). Thanks, Nick
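The "software-irq-disable thing" mentioned above can be sketched as a toy model (all names below are made up for illustration, not powerpc code): "disabling" interrupts only sets a per-CPU flag, avoiding the expensive MSR write; an interrupt that arrives while masked is recorded and replayed when interrupts are re-enabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of lazy ("soft") interrupt masking. */
static bool irqs_soft_masked;
static int pending_irqs, handled_irqs;

static void soft_irq_disable(void)
{
	irqs_soft_masked = true;	/* cheap flag write, no MSR access */
}

static void irq_arrives(void)
{
	if (irqs_soft_masked)
		pending_irqs++;		/* remember it for replay */
	else
		handled_irqs++;
}

static void soft_irq_enable(void)
{
	irqs_soft_masked = false;
	while (pending_irqs) {		/* replay what arrived while masked */
		pending_irqs--;
		handled_irqs++;
	}
}
```

Interrupts arriving inside a disable/enable window are deferred rather than dropped, which is the essential property the real scheme preserves.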
Re: [PATCH v3 04/24] powerpc/secvar: Handle format string in the consumer
On Wed Jan 18, 2023 at 4:10 PM AEST, Andrew Donnellan wrote: > From: Russell Currey > > The code that handles the format string in secvar-sysfs.c is entirely > OPAL specific, so create a new "format" op in secvar_operations to make > the secvar code more generic. No functional change. > > Signed-off-by: Russell Currey > Signed-off-by: Andrew Donnellan > > --- > > v2: Use sysfs_emit() instead of sprintf() (gregkh) > > v3: Enforce format string size limit (ruscur) > --- > arch/powerpc/include/asm/secvar.h| 3 +++ > arch/powerpc/kernel/secvar-sysfs.c | 23 -- > arch/powerpc/platforms/powernv/opal-secvar.c | 25 > 3 files changed, 33 insertions(+), 18 deletions(-) > > diff --git a/arch/powerpc/include/asm/secvar.h > b/arch/powerpc/include/asm/secvar.h > index 07ba36f868a7..8b6475589120 100644 > --- a/arch/powerpc/include/asm/secvar.h > +++ b/arch/powerpc/include/asm/secvar.h > @@ -11,12 +11,15 @@ > #include > #include > > +#define SECVAR_MAX_FORMAT_LEN30 // max length of string returned by > ->format() > + > extern const struct secvar_operations *secvar_ops; > > struct secvar_operations { > int (*get)(const char *key, u64 key_len, u8 *data, u64 *data_size); > int (*get_next)(const char *key, u64 *key_len, u64 keybufsize); > int (*set)(const char *key, u64 key_len, u8 *data, u64 data_size); > + ssize_t (*format)(char *buf); > }; > > #ifdef CONFIG_PPC_SECURE_BOOT > diff --git a/arch/powerpc/kernel/secvar-sysfs.c > b/arch/powerpc/kernel/secvar-sysfs.c > index 462cacc0ca60..d3858eedd72c 100644 > --- a/arch/powerpc/kernel/secvar-sysfs.c > +++ b/arch/powerpc/kernel/secvar-sysfs.c > @@ -21,26 +21,13 @@ static struct kset *secvar_kset; > static ssize_t format_show(struct kobject *kobj, struct kobj_attribute *attr, > char *buf) > { > - ssize_t rc = 0; > - struct device_node *node; > - const char *format; > - > - node = of_find_compatible_node(NULL, NULL, "ibm,secvar-backend"); > - if (!of_device_is_available(node)) { > - rc = -ENODEV; > - goto out; > - } > + char 
tmp[SECVAR_MAX_FORMAT_LEN]; > + ssize_t len = secvar_ops->format(tmp); > > - rc = of_property_read_string(node, "format", &format); > - if (rc) > - goto out; > + if (len <= 0) > + return -EIO; AFAIKS this does have a functional change, it loses the return value. Why not return len if it is < 0, and -EIO if len == 0? Thanks, Nick
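The error mapping suggested above could look like this sketch (`map_format_result()` is a hypothetical helper, shown only to make the suggestion concrete): keep the backend's own negative error code, and reserve -EIO for the zero-length case.

```c
#include <assert.h>

#define EIO	5
#define ENODEV	19

/*
 * Hypothetical mapping of the ->format() return value: a backend error
 * such as -ENODEV survives instead of being collapsed to -EIO.
 */
static long map_format_result(long len)
{
	if (len < 0)
		return len;	/* preserve the backend's error code */
	if (len == 0)
		return -EIO;	/* only "no data" becomes -EIO */
	return len;
}
```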
Re: [PATCH v3 13/24] powerpc/pseries: Fix handling of PLPKS object flushing timeout
On Wed Jan 18, 2023 at 4:10 PM AEST, Andrew Donnellan wrote: > plpks_confirm_object_flushed() uses the H_PKS_CONFIRM_OBJECT_FLUSHED hcall > to check whether changes to an object in the Platform KeyStore have been > flushed to non-volatile storage. > > The hcall returns two output values, the return code and the flush status. > plpks_confirm_object_flushed() polls the hcall until either the flush > status has updated, the return code is an error, or a timeout has been > exceeded. > > While we're still polling, the hcall is returning H_SUCCESS (0) as the > return code. In the timeout case, this means that upon exiting the polling > loop, rc is 0, and therefore 0 is returned to the user. > > Handle the timeout case separately and return ETIMEDOUT if triggered. > > Fixes: 2454a7af0f2a ("powerpc/pseries: define driver for Platform KeyStore") Can fixes go to the start of the series? Thanks, Nick
Re: [PATCH v3 16/24] powerpc/pseries: Implement signed update for PLPKS objects
On Wed Jan 18, 2023 at 4:10 PM AEST, Andrew Donnellan wrote: > From: Nayna Jain > > The Platform Keystore provides a signed update interface which can be used > to create, replace or append to certain variables in the PKS in a secure > fashion, with the hypervisor requiring that the update be signed using the > Platform Key. > > Implement an interface to the H_PKS_SIGNED_UPDATE hcall in the plpks > driver to allow signed updates to PKS objects. > > (The plpks driver doesn't need to do any cryptography or otherwise handle > the actual signed variable contents - that will be handled by userspace > tooling.) > > Signed-off-by: Nayna Jain > [ajd: split patch, add timeout handling and misc cleanups] > Co-developed-by: Andrew Donnellan > Signed-off-by: Andrew Donnellan > Signed-off-by: Russell Currey > > --- > > v3: Merge plpks fixes and signed update series with secvar series > > Fix error code handling in plpks_confirm_object_flushed() (ruscur) > > Pass plpks_var struct to plpks_signed_update_var() by reference (mpe) > > Consistent constant naming scheme (ruscur) > --- > arch/powerpc/include/asm/hvcall.h | 3 +- > arch/powerpc/include/asm/plpks.h | 5 ++ > arch/powerpc/platforms/pseries/plpks.c | 71 -- > 3 files changed, 73 insertions(+), 6 deletions(-) > > diff --git a/arch/powerpc/include/asm/hvcall.h > b/arch/powerpc/include/asm/hvcall.h > index 95fd7f9485d5..33b26c0cb69b 100644 > --- a/arch/powerpc/include/asm/hvcall.h > +++ b/arch/powerpc/include/asm/hvcall.h > @@ -336,7 +336,8 @@ > #define H_SCM_FLUSH 0x44C > #define H_GET_ENERGY_SCALE_INFO 0x450 > #define H_WATCHDOG 0x45C > -#define MAX_HCALL_OPCODE H_WATCHDOG > +#define H_PKS_SIGNED_UPDATE 0x454 > +#define MAX_HCALL_OPCODE H_PKS_SIGNED_UPDATE ^ Bad rebase. Thanks, Nick
Re: [PATCH v3 08/24] powerpc/secvar: Allow backend to populate static list of variable names
On Wed Jan 18, 2023 at 4:10 PM AEST, Andrew Donnellan wrote: > Currently, the list of variables is populated by calling > secvar_ops->get_next() repeatedly, which is explicitly modelled on the > OPAL API (including the keylen parameter). > > For the upcoming PLPKS backend, we have a static list of variable names. > It is messy to fit that into get_next(), so instead, let the backend put > a NULL-terminated array of variable names into secvar_ops->var_names, > which will be used if get_next() is undefined. > > Signed-off-by: Andrew Donnellan > Signed-off-by: Russell Currey > > --- > > v3: New patch (ajd/mpe) > --- > arch/powerpc/include/asm/secvar.h | 4 ++ > arch/powerpc/kernel/secvar-sysfs.c | 67 -- > 2 files changed, 50 insertions(+), 21 deletions(-) > > diff --git a/arch/powerpc/include/asm/secvar.h > b/arch/powerpc/include/asm/secvar.h > index ebf95386d720..c8bee1834b54 100644 > --- a/arch/powerpc/include/asm/secvar.h > +++ b/arch/powerpc/include/asm/secvar.h > @@ -23,6 +23,10 @@ struct secvar_operations { > ssize_t (*format)(char *buf); > int (*max_size)(u64 *max_size); > const struct attribute **config_attrs; > + > + // NULL-terminated array of fixed variable names > + // Only used if get_next() isn't provided > + const char * const *var_names; The other way you could go is provide a sysfs_init() ops call here, and export the add_var as a library function that backends can use. Thanks, Nick
Re: [PATCH v3 04/24] powerpc/secvar: Handle format string in the consumer
On Wed Jan 18, 2023 at 4:10 PM AEST, Andrew Donnellan wrote: > From: Russell Currey > > The code that handles the format string in secvar-sysfs.c is entirely > OPAL specific, so create a new "format" op in secvar_operations to make > the secvar code more generic. No functional change. > > Signed-off-by: Russell Currey > Signed-off-by: Andrew Donnellan > > --- > > v2: Use sysfs_emit() instead of sprintf() (gregkh) > > v3: Enforce format string size limit (ruscur) > --- > arch/powerpc/include/asm/secvar.h| 3 +++ > arch/powerpc/kernel/secvar-sysfs.c | 23 -- > arch/powerpc/platforms/powernv/opal-secvar.c | 25 > 3 files changed, 33 insertions(+), 18 deletions(-) > > diff --git a/arch/powerpc/include/asm/secvar.h > b/arch/powerpc/include/asm/secvar.h > index 07ba36f868a7..8b6475589120 100644 > --- a/arch/powerpc/include/asm/secvar.h > +++ b/arch/powerpc/include/asm/secvar.h > @@ -11,12 +11,15 @@ > #include > #include > > +#define SECVAR_MAX_FORMAT_LEN30 // max length of string returned by > ->format() > + > extern const struct secvar_operations *secvar_ops; > > struct secvar_operations { > int (*get)(const char *key, u64 key_len, u8 *data, u64 *data_size); > int (*get_next)(const char *key, u64 *key_len, u64 keybufsize); > int (*set)(const char *key, u64 key_len, u8 *data, u64 data_size); > + ssize_t (*format)(char *buf); Maybe pass the buf size as an argument here? Which is a bit less error prone and more flexible than finding the right #define for it. Thanks, Nick
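The suggested signature change might look like the following userspace sketch — `format_op()` and the "ieee" string are placeholders, not the real OPAL backend. Passing the size lets the backend guard against truncation itself instead of trusting a shared #define.

```c
#include <assert.h>
#include <stdio.h>

#define SECVAR_MAX_FORMAT_LEN 30

/*
 * Sketch: the caller passes the buffer size, so the backend needn't
 * hard-code SECVAR_MAX_FORMAT_LEN.
 */
static long format_op(char *buf, unsigned long bufsize)
{
	int len = snprintf(buf, bufsize, "%s", "ieee");

	if (len < 0 || (unsigned long)len >= bufsize)
		return -1;	/* output would have been truncated */
	return len;
}
```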
Re: [PATCH v3 02/24] powerpc/secvar: WARN_ON_ONCE() if multiple secvar ops are set
On Wed Jan 18, 2023 at 4:10 PM AEST, Andrew Donnellan wrote: > From: Russell Currey > > The secvar code only supports one consumer at a time. > > Multiple consumers aren't possible at this point in time, but we'd want > it to be obvious if it ever could happen. > > Signed-off-by: Russell Currey > Signed-off-by: Andrew Donnellan > --- > arch/powerpc/kernel/secvar-ops.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/kernel/secvar-ops.c > b/arch/powerpc/kernel/secvar-ops.c > index 6a29777d6a2d..aa1b2adc2710 100644 > --- a/arch/powerpc/kernel/secvar-ops.c > +++ b/arch/powerpc/kernel/secvar-ops.c > @@ -8,10 +8,12 @@ > > #include > #include > +#include > > -const struct secvar_operations *secvar_ops __ro_after_init; > +const struct secvar_operations *secvar_ops __ro_after_init = NULL; > > void set_secvar_ops(const struct secvar_operations *ops) > { > + WARN_ON_ONCE(secvar_ops); > secvar_ops = ops; You could make it return error here and two line patch in the caller to handle the error and then things wouldn't get corrupted. Thanks, Nick
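An error-returning variant as suggested could be sketched like this (userspace model with a dummy ops struct; -EBUSY is one plausible choice of error code, not necessarily what the patch author would pick):

```c
#include <assert.h>
#include <stddef.h>

#define EBUSY 16

struct secvar_operations { int unused; };	/* contents don't matter here */

static const struct secvar_operations *secvar_ops;

/*
 * Sketch of the alternative: reject (and report) a second registration
 * instead of overwriting the first consumer's ops.
 */
static int set_secvar_ops(const struct secvar_operations *ops)
{
	if (secvar_ops)
		return -EBUSY;	/* a consumer is already registered */
	secvar_ops = ops;
	return 0;
}
```

The caller can then handle the failure in a couple of lines, and the first consumer's ops are never corrupted.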
Re: [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme
On Thu Jan 19, 2023 at 8:22 AM AEST, Nadav Amit wrote:
>
> > On Jan 18, 2023, at 12:00 AM, Nicholas Piggin wrote:
> >
> > +static void do_shoot_lazy_tlb(void *arg)
> > +{
> > +	struct mm_struct *mm = arg;
> > +
> > +	if (current->active_mm == mm) {
> > +		WARN_ON_ONCE(current->mm);
> > +		current->active_mm = &init_mm;
> > +		switch_mm(mm, &init_mm, current);
> > +	}
> > +}
>
> I might be out of touch - doesn’t a flush already take place when we free
> the page-tables, at least on common cases on x86?
>
> IIUC exit_mmap() would free page-tables, and whenever page-tables are
> freed, on x86, we do shootdown regardless to whether the target CPU TLB
> state marks is_lazy. Then, flush_tlb_func() should call switch_mm_irqs_off()
> and everything should be fine, no?
>
> [ I understand you care about powerpc, just wondering on the effect on x86 ]

If you can easily piggyback on IPI work you already do in exit_mmap then
that's likely to be preferable. I don't know the details of x86 these days
but there is some discussion about it in last year's thread, it sounded
quite feasible.

This is still required at final __mmdrop() time because it's still possible
that lazy mm refs will need to be cleaned. exit_mmap() itself explicitly
creates one, so if the __mmdrop() runs on a different CPU, then there's one.
kthreads using the mm could create others. If that part of it is unclear or
under-commented, I can try to improve it.

Thanks,
Nick
Re: [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme
> On Jan 18, 2023, at 12:00 AM, Nicholas Piggin wrote: > > +static void do_shoot_lazy_tlb(void *arg) > +{ > + struct mm_struct *mm = arg; > + > + if (current->active_mm == mm) { > + WARN_ON_ONCE(current->mm); > + current->active_mm = &init_mm; > + switch_mm(mm, &init_mm, current); > + } > +} I might be out of touch - doesn’t a flush already take place when we free the page-tables, at least on common cases on x86? IIUC exit_mmap() would free page-tables, and whenever page-tables are freed, on x86, we do shootdown regardless to whether the target CPU TLB state marks is_lazy. Then, flush_tlb_func() should call switch_mm_irqs_off() and everything should be fine, no? [ I understand you care about powerpc, just wondering on the effect on x86 ]
[PATCH] of: Fix of platform build on powerpc due to bad of display code
The commit 2d681d6a23a1 ("of: Make of framebuffer devices unique") breaks the build because of a wrong argument to snprintf. That certainly avoids the runtime error but is not the intended outcome.

Fixes: 2d681d6a23a1 ("of: Make of framebuffer devices unique")
Signed-off-by: Michal Suchanek
---
 drivers/of/platform.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/of/platform.c b/drivers/of/platform.c
index f2a5d679a324..e9dd7371f27a 100644
--- a/drivers/of/platform.c
+++ b/drivers/of/platform.c
@@ -564,10 +564,10 @@ static int __init of_platform_default_populate_init(void)
 		}

 		for_each_node_by_type(node, "display") {
-			char *buf[14];
+			char buf[14];
 			if (!of_get_property(node, "linux,opened", NULL) || node == boot_display)
 				continue;
-			ret = snprintf(buf, "of-display-%d", display_number++);
+			ret = snprintf(buf, sizeof(buf), "of-display-%d", display_number++);
 			if (ret >= sizeof(buf))
 				continue;
 			of_platform_device_create(node, buf, NULL);
 		}
--
2.35.3
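The `ret >= sizeof(buf)` guard relies on C99 `snprintf()` semantics, which can be demonstrated in isolation (`make_name()` is a demo wrapper, not the kernel code): the return value is the length the output *would* have had without truncation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * C99 snprintf() returns the untruncated length, so comparing the result
 * against the buffer size detects names that didn't fit.
 */
static int make_name(char *buf, unsigned long bufsize, int display_number)
{
	return snprintf(buf, bufsize, "of-display-%d", display_number);
}
```

With a 14-byte buffer, "of-display-7" (12 chars) fits, while "of-display-123456" (17 chars) is truncated and the returned 17 trips the guard, so the device is skipped.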
Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call
On Wed, Jan 18, 2023 at 1:33 PM Michal Hocko wrote: > > On Wed 18-01-23 10:09:29, Suren Baghdasaryan wrote: > > On Wed, Jan 18, 2023 at 1:23 AM Michal Hocko wrote: > > > > > > On Tue 17-01-23 18:01:01, Suren Baghdasaryan wrote: > > > > On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko wrote: > > > > > > > > > > On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote: > > > > > > Move VMA flag modification (which now implies VMA locking) before > > > > > > anon_vma_lock_write to match the locking order of page fault > > > > > > handler. > > > > > > > > > > Does this changelog assumes per vma locking in the #PF? > > > > > > > > Hmm, you are right. Page fault handlers do not use per-vma locks yet > > > > but the changelog already talks about that. Maybe I should change it > > > > to simply: > > > > ``` > > > > Move VMA flag modification (which now implies VMA locking) before > > > > vma_adjust_trans_huge() to ensure the modifications are done after VMA > > > > has been locked. > > > > > > Because > > > > because vma_adjust_trans_huge() modifies the VMA and such > > modifications should be done under VMA write-lock protection. > > So it will become: > Move VMA flag modification (which now implies VMA locking) before > vma_adjust_trans_huge() to ensure the modifications are done after VMA > has been locked. Because vma_adjust_trans_huge() modifies the VMA and such > modifications should be done under VMA write-lock protection. > > which is effectivelly saying > vma_adjust_trans_huge() modifies the VMA and such modifications should > be done under VMA write-lock protection so move VMA flag modifications > before so all of them are covered by the same write protection. > > right? Yes, and the wording in the latter version is simpler to understand IMO, so I would like to adopt it. Do you agree? > -- > Michal Hocko > SUSE Labs
Re: [PATCH] of: Make of framebuffer devices unique
On Wed, Jan 18, 2023 at 09:13:05PM +0100, Erhard F. wrote: > On Tue, 17 Jan 2023 17:58:04 +0100 > Michal Suchanek wrote: > > > Since Linux 5.19 this error is observed: > > > > sysfs: cannot create duplicate filename '/devices/platform/of-display' > > > > This is because multiple devices with the same name 'of-display' are > > created on the same bus. > > > > Update the code to create numbered device names for the non-boot > > disaplay. > > > > cc: linuxppc-dev@lists.ozlabs.org > > References: https://bugzilla.kernel.org/show_bug.cgi?id=216095 > > Fixes: 52b1b46c39ae ("of: Create platform devices for OF framebuffers") > > Reported-by: Erhard F. > > Suggested-by: Thomas Zimmermann > > Signed-off-by: Michal Suchanek > > --- > > drivers/of/platform.c | 8 +++- > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/of/platform.c b/drivers/of/platform.c > > index 81c8c227ab6b..f2a5d679a324 100644 > > --- a/drivers/of/platform.c > > +++ b/drivers/of/platform.c > > @@ -525,6 +525,7 @@ static int __init > > of_platform_default_populate_init(void) > > if (IS_ENABLED(CONFIG_PPC)) { > > struct device_node *boot_display = NULL; > > struct platform_device *dev; > > + int display_number = 1; > > int ret; > > > > /* Check if we have a MacOS display without a node spec */ > > @@ -561,10 +562,15 @@ static int __init > > of_platform_default_populate_init(void) > > boot_display = node; > > break; > > } > > + > > for_each_node_by_type(node, "display") { > > + char *buf[14]; > > if (!of_get_property(node, "linux,opened", NULL) || > > node == boot_display) > > continue; > > - of_platform_device_create(node, "of-display", NULL); > > + ret = snprintf(buf, "of-display-%d", display_number++); > > + if (ret >= sizeof(buf)) > > + continue; > > + of_platform_device_create(node, buf, NULL); > > } > > > > } else { > > -- > > 2.35.3 > > > > Thank you for the patch Michal! > > It applies on 6.2-rc4 but I get this build error with my config: Indeed, it's doubly bad. 
Where is the kernel test robot when you need it? It should not be that easy to miss this file but clearly it can happen. I will send a fixup. Sorry about the mess. Thanks Michal
Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko wrote: > > On Wed 18-01-23 09:36:44, Suren Baghdasaryan wrote: > > On Wed, Jan 18, 2023 at 7:11 AM 'Michal Hocko' via kernel-team > > wrote: > > > > > > On Wed 18-01-23 14:23:32, Jann Horn wrote: > > > > On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko wrote: > > > > > On Tue 17-01-23 19:02:55, Jann Horn wrote: > > > > > > +locking maintainers > > > > > > > > > > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan > > > > > > wrote: > > > > > > > Introduce a per-VMA rw_semaphore to be used during page fault > > > > > > > handling > > > > > > > instead of mmap_lock. Because there are cases when multiple VMAs > > > > > > > need > > > > > > > to be exclusively locked during VMA tree modifications, instead > > > > > > > of the > > > > > > > usual lock/unlock patter we mark a VMA as locked by taking > > > > > > > per-VMA lock > > > > > > > exclusively and setting vma->lock_seq to the current > > > > > > > mm->lock_seq. When > > > > > > > mmap_write_lock holder is done with all modifications and drops > > > > > > > mmap_lock, > > > > > > > it will increment mm->lock_seq, effectively unlocking all VMAs > > > > > > > marked as > > > > > > > locked. > > > > > > [...] > > > > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma) > > > > > > > +{ > > > > > > > + up_read(&vma->lock); > > > > > > > +} > > > > > > > > > > > > One thing that might be gnarly here is that I think you might not be > > > > > > allowed to use up_read() to fully release ownership of an object - > > > > > > from what I remember, I think that up_read() (unlike something like > > > > > > spin_unlock()) can access the lock object after it's already been > > > > > > acquired by someone else. > > > > > > > > > > Yes, I think you are right. From a look into the code it seems that > > > > > the UAF is quite unlikely as there is a ton of work to be done between > > > > > vma_write_lock used to prepare vma for removal and actual removal. 
> > > > > That doesn't make it less of a problem though. > > > > > > > > > > > So if you want to protect against concurrent > > > > > > deletion, this might have to be something like: > > > > > > > > > > > > rcu_read_lock(); /* keeps vma alive */ > > > > > > up_read(&vma->lock); > > > > > > rcu_read_unlock(); > > > > > > > > > > > > But I'm not entirely sure about that, the locking folks might know > > > > > > better. > > > > > > > > > > I am not a locking expert but to me it looks like this should work > > > > > because the final cleanup would have to happen rcu_read_unlock. > > > > > > > > > > Thanks, I have completely missed this aspect of the locking when > > > > > looking > > > > > into the code. > > > > > > > > > > Btw. looking at this again I have fully realized how hard it is > > > > > actually > > > > > to see that vm_area_free is guaranteed to sync up with ongoing > > > > > readers. > > > > > vma manipulation functions like __adjust_vma make my head spin. Would > > > > > it > > > > > make more sense to have a rcu style synchronization point in > > > > > vm_area_free directly before call_rcu? This would add an overhead of > > > > > uncontended down_write of course. > > > > > > > > Something along those lines might be a good idea, but I think that > > > > rather than synchronizing the removal, it should maybe be something > > > > that splats (and bails out?) if it detects pending readers. If we get > > > > to vm_area_free() on a VMA that has pending readers, we might already > > > > be in a lot of trouble because the concurrent readers might have been > > > > traversing page tables while we were tearing them down or fun stuff > > > > like that. > > > > > > > > I think maybe Suren was already talking about something like that in > > > > another part of this patch series but I don't remember... > > > > > > This http://lkml.kernel.org/r/20230109205336.3665937-27-sur...@google.com? 
> > > > Yes, I spent a lot of time ensuring that __adjust_vma locks the right > > VMAs and that VMAs are freed or isolated under VMA write lock > > protection to exclude any readers. If the VM_BUG_ON_VMA in the patch > > Michal mentioned gets hit then it's a bug in my design and I'll have > > to fix it. But please, let's not add synchronize_rcu() in the > > vm_area_free(). > > Just to clarify. I didn't suggest to add synchronize_rcu into > vm_area_free. What I really meant was synchronize_rcu like primitive to > effectively synchronize with any potential pending read locker (so > something like vma_write_lock (or whatever it is called). The point is > that vma freeing is an event all readers should be notified about. I don't think readers need to be notified if we are ensuring that the VMA is not used by anyone else and is not reachable by the readers. This is currently done by write-locking the VMA either before removing it from the tree or before freeing it. > This can be done explicitly for each and every vma before vm_area_free > is called but this is > just hard to review and easy to break over time. See my point?
Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call
On Wed 18-01-23 10:09:29, Suren Baghdasaryan wrote: > On Wed, Jan 18, 2023 at 1:23 AM Michal Hocko wrote: > > > > On Tue 17-01-23 18:01:01, Suren Baghdasaryan wrote: > > > On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko wrote: > > > > > > > > On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote: > > > > > Move VMA flag modification (which now implies VMA locking) before > > > > > anon_vma_lock_write to match the locking order of page fault handler. > > > > > > > > Does this changelog assume per vma locking in the #PF? > > > > > > Hmm, you are right. Page fault handlers do not use per-vma locks yet > > > but the changelog already talks about that. Maybe I should change it > > > to simply: > > > ``` > > > Move VMA flag modification (which now implies VMA locking) before > > > vma_adjust_trans_huge() to ensure the modifications are done after VMA > > > has been locked. > > > > Because > > because vma_adjust_trans_huge() modifies the VMA and such > modifications should be done under VMA write-lock protection. So it will become: Move VMA flag modification (which now implies VMA locking) before vma_adjust_trans_huge() to ensure the modifications are done after VMA has been locked. Because vma_adjust_trans_huge() modifies the VMA and such modifications should be done under VMA write-lock protection. which is effectively saying vma_adjust_trans_huge() modifies the VMA and such modifications should be done under VMA write-lock protection so move VMA flag modifications before so all of them are covered by the same write protection. right? -- Michal Hocko SUSE Labs
Re: [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code
On Tue, Jan 17, 2023 at 6:44 PM Matthew Wilcox wrote: > > On Tue, Jan 17, 2023 at 05:06:57PM -0800, Suren Baghdasaryan wrote: > > On Tue, Jan 17, 2023 at 7:47 AM Michal Hocko wrote: > > > > > > On Mon 09-01-23 12:53:23, Suren Baghdasaryan wrote: > > > > Introduce lock_vma_under_rcu function to lookup and lock a VMA during > > > > page fault handling. When VMA is not found, can't be locked or changes > > > > after being locked, the function returns NULL. The lookup is performed > > > > under RCU protection to prevent the found VMA from being destroyed > > > > before > > > > the VMA lock is acquired. VMA lock statistics are updated according to > > > > the results. > > > > For now only anonymous VMAs can be searched this way. In other cases the > > > > function returns NULL. > > > > > > Could you describe why only anonymous vmas are handled at this stage and > > > what (roughly) has to be done to support other vmas? lock_vma_under_rcu > > > doesn't seem to have any anonymous vma specific requirements AFAICS. > > > > TBH I haven't spent too much time looking into file-backed page faults > > yet but a couple of tasks I can think of are: > > - Ensure that all vma->vm_ops->fault() handlers do not rely on > > mmap_lock being read-locked; > > I think this way lies madness. There are just too many device drivers > that implement ->fault. My plan is to call the ->map_pages() method > under RCU without even read-locking the VMA. If that doesn't satisfy > the fault, then drop all the way back to taking the mmap_sem for read > before calling into ->fault. Sounds reasonable to me but I guess the devil is in the details... >
Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
On Wed 18-01-23 09:36:44, Suren Baghdasaryan wrote: > On Wed, Jan 18, 2023 at 7:11 AM 'Michal Hocko' via kernel-team > wrote: > > > > On Wed 18-01-23 14:23:32, Jann Horn wrote: > > > On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko wrote: > > > > On Tue 17-01-23 19:02:55, Jann Horn wrote: > > > > > +locking maintainers > > > > > > > > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan > > > > > wrote: > > > > > > Introduce a per-VMA rw_semaphore to be used during page fault > > > > > > handling > > > > > > instead of mmap_lock. Because there are cases when multiple VMAs > > > > > > need > > > > > > to be exclusively locked during VMA tree modifications, instead of > > > > > > the > > > > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA > > > > > > lock > > > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. > > > > > > When > > > > > > mmap_write_lock holder is done with all modifications and drops > > > > > > mmap_lock, > > > > > > it will increment mm->lock_seq, effectively unlocking all VMAs > > > > > > marked as > > > > > > locked. > > > > > [...] > > > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma) > > > > > > +{ > > > > > > + up_read(&vma->lock); > > > > > > +} > > > > > > > > > > One thing that might be gnarly here is that I think you might not be > > > > > allowed to use up_read() to fully release ownership of an object - > > > > > from what I remember, I think that up_read() (unlike something like > > > > > spin_unlock()) can access the lock object after it's already been > > > > > acquired by someone else. > > > > > > > > Yes, I think you are right. From a look into the code it seems that > > > > the UAF is quite unlikely as there is a ton of work to be done between > > > > vma_write_lock used to prepare vma for removal and actual removal. > > > > That doesn't make it less of a problem though.
> > > > > > > > > So if you want to protect against concurrent > > > > > deletion, this might have to be something like: > > > > > > > > > > rcu_read_lock(); /* keeps vma alive */ > > > > > up_read(&vma->lock); > > > > > rcu_read_unlock(); > > > > > > > > > > But I'm not entirely sure about that, the locking folks might know > > > > > better. > > > > > > > > I am not a locking expert but to me it looks like this should work > > > > because the final cleanup would have to happen after rcu_read_unlock. > > > > > > > > Thanks, I have completely missed this aspect of the locking when looking > > > > into the code. > > > > > > > > Btw. looking at this again I have fully realized how hard it is actually > > > > to see that vm_area_free is guaranteed to sync up with ongoing readers. > > > > vma manipulation functions like __adjust_vma make my head spin. Would it > > > > make more sense to have a rcu style synchronization point in > > > > vm_area_free directly before call_rcu? This would add an overhead of > > > > uncontended down_write of course. > > > > > > Something along those lines might be a good idea, but I think that > > > rather than synchronizing the removal, it should maybe be something > > > that splats (and bails out?) if it detects pending readers. If we get > > > to vm_area_free() on a VMA that has pending readers, we might already > > > be in a lot of trouble because the concurrent readers might have been > > > traversing page tables while we were tearing them down or fun stuff > > > like that. > > > > > > I think maybe Suren was already talking about something like that in > > > another part of this patch series but I don't remember... > > > > This http://lkml.kernel.org/r/20230109205336.3665937-27-sur...@google.com? > > Yes, I spent a lot of time ensuring that __adjust_vma locks the right > VMAs and that VMAs are freed or isolated under VMA write lock > protection to exclude any readers.
If the VM_BUG_ON_VMA in the patch > Michal mentioned gets hit then it's a bug in my design and I'll have > to fix it. But please, let's not add synchronize_rcu() in the > vm_area_free(). Just to clarify. I didn't suggest to add synchronize_rcu into vm_area_free. What I really meant was synchronize_rcu like primitive to effectively synchronize with any potential pending read locker (so something like vma_write_lock (or whatever it is called). The point is that vma freeing is an event all readers should be notified about. This can be done explicitly for each and every vma before vm_area_free is called but this is just hard to review and easy to break over time. See my point? -- Michal Hocko SUSE Labs
Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
On Wed, Jan 18, 2023 at 11:01:08AM -0800, Suren Baghdasaryan wrote: > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney wrote: > > > > On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote: > > > On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko wrote: > > > > > > > > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote: > > > > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko wrote: > > > > > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > > > call_rcu() can take a long time when callback offloading is > > > > > > > enabled. > > > > > > > Its use in the vm_area_free can cause regressions in the exit > > > > > > > path when > > > > > > > multiple VMAs are being freed. > > > > > > > > > > > > What kind of regressions. > > > > > > > > > > > > > To minimize that impact, place VMAs into > > > > > > > a list and free them in groups using one call_rcu() call per > > > > > > > group. > > > > > > > > > > > > Please add some data to justify this additional complexity. > > > > > > > > > > Sorry, should have done that in the first place. A 4.3% regression was > > > > > noticed when running execl test from unixbench suite. spawn test also > > > > > showed 1.6% regression. Profiling revealed that vma freeing was taking > > > > > longer due to call_rcu() which is slow when RCU callback offloading is > > > > > enabled. > > > > > > > > Could you be more specific? vma freeing is async with the RCU so how > > > > come this has resulted in a regression? Is there any heavy > > > > rcu_synchronize in the exec path? That would be an interesting > > > > information. > > > > > > No, there is no heavy rcu_synchronize() or any other additional > > > synchronous load in the exit path. It's the call_rcu() which can block > > > the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of > > > other call_rcu()'s going on in parallel. 
Note that call_rcu() calls > > > rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling > > > revealed that this function was taking multiple ms (don't recall the > > > actual number, sorry). Paul's explanation implied that this happens > > > due to contention on the locks taken in this function. For more > > > in-depth details I'll have to ask Paul for help :) This code is quite > > > complex and I don't know all the details of RCU implementation. > > > > There are a couple of possibilities here. > > > > First, if I am remembering correctly, the time between the call_rcu() > > and invocation of the corresponding callback was taking multiple seconds, > > but that was because the kernel was built with CONFIG_LAZY_RCU=y in > > order to save power by batching RCU work over multiple call_rcu() > > invocations. If this is causing a problem for a given call site, the > > shiny new call_rcu_hurry() can be used instead. Doing this gets back > > to the old-school non-laziness, but can of course consume more power. > > That would not be the case because CONFIG_LAZY_RCU was not an option > at the time I was profiling this issue. > Lazy RCU would be a great option to replace this patch but > unfortunately it's not the default behavior, so I would still have to > implement this batching in case lazy RCU is not enabled. > > > Second, there is a much shorter one-jiffy delay between the call_rcu() > > and the invocation of the corresponding callback in kernels built with > > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full > > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only > > on CPUs mentioned in the rcu_nocbs kernel boot parameters). The purpose > > of this delay is to avoid lock contention, and so this delay is incurred > > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU > > invoking call_rcu() at least 16 times within a given jiffy will incur > > the added delay. The reason for this delay is the use of a separate > > ->nocb_bypass list. As Suren says, this bypass list is used to reduce > > lock contention on the main ->cblist. This is not needed in old-school > > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y > > (including most datacenter kernels) because in that case the callbacks > > enqueued by call_rcu() are touched only by the corresponding CPU, so > > that there is no need for locks. > > I believe this is the reason in my profiled case. > > > > > Third, if you are instead seeing multiple milliseconds of CPU consumed by > > call_rcu() in the common case (for example, without the aid of interrupts, > > NMIs, or SMIs), please do let me know. That sounds to me like a bug. > > I don't think I've seen such a case. Whew!!! ;-) > Thanks for clarifications, Paul! No problem! Thanx, Paul
Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney wrote: > > On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote: > > On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko wrote: > > > > > > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote: > > > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko wrote: > > > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > > > Its use in the vm_area_free can cause regressions in the exit path > > > > > > when > > > > > > multiple VMAs are being freed. > > > > > > > > > > What kind of regressions. > > > > > > > > > > > To minimize that impact, place VMAs into > > > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > > > > > Please add some data to justify this additional complexity. > > > > > > > > Sorry, should have done that in the first place. A 4.3% regression was > > > > noticed when running execl test from unixbench suite. spawn test also > > > > showed 1.6% regression. Profiling revealed that vma freeing was taking > > > > longer due to call_rcu() which is slow when RCU callback offloading is > > > > enabled. > > > > > > Could you be more specific? vma freeing is async with the RCU so how > > > come this has resulted in a regression? Is there any heavy > > > rcu_synchronize in the exec path? That would be an interesting > > > information. > > > > No, there is no heavy rcu_synchronize() or any other additional > > synchronous load in the exit path. It's the call_rcu() which can block > > the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of > > other call_rcu()'s going on in parallel. Note that call_rcu() calls > > rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling > > revealed that this function was taking multiple ms (don't recall the > > actual number, sorry). 
Paul's explanation implied that this happens > > due to contention on the locks taken in this function. For more > > in-depth details I'll have to ask Paul for help :) This code is quite > > complex and I don't know all the details of RCU implementation. > > There are a couple of possibilities here. > > First, if I am remembering correctly, the time between the call_rcu() > and invocation of the corresponding callback was taking multiple seconds, > but that was because the kernel was built with CONFIG_LAZY_RCU=y in > order to save power by batching RCU work over multiple call_rcu() > invocations. If this is causing a problem for a given call site, the > shiny new call_rcu_hurry() can be used instead. Doing this gets back > to the old-school non-laziness, but can of course consume more power. That would not be the case because CONFIG_LAZY_RCU was not an option at the time I was profiling this issue. Lazy RCU would be a great option to replace this patch but unfortunately it's not the default behavior, so I would still have to implement this batching in case lazy RCU is not enabled. > > Second, there is a much shorter one-jiffy delay between the call_rcu() > and the invocation of the corresponding callback in kernels built with > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only > on CPUs mentioned in the rcu_nocbs kernel boot parameters). The purpose > of this delay is to avoid lock contention, and so this delay is incurred > only on CPUs that are queuing callbacks at a rate exceeding 16K/second. > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU > invoking call_rcu() at least 16 times within a given jiffy will incur > the added delay. The reason for this delay is the use of a separate > ->nocb_bypass list. As Suren says, this bypass list is used to reduce > lock contention on the main ->cblist.
This is not needed in old-school > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y > (including most datacenter kernels) because in that case the callbacks > enqueued by call_rcu() are touched only by the corresponding CPU, so > that there is no need for locks. I believe this is the reason in my profiled case. > > Third, if you are instead seeing multiple milliseconds of CPU consumed by > call_rcu() in the common case (for example, without the aid of interrupts, > NMIs, or SMIs), please do let me know. That sounds to me like a bug. I don't think I've seen such a case. Thanks for clarifications, Paul! > > Or have I lost track of some other slow case? > > Thanx, Paul
[PATCH v2 2/5] dt-bindings: usb: Convert multiple "usb-ohci" bindings to DT schema
"usb-ohci" is another "generic" OHCI controller compatible string used by several platforms. Add it to the generic-ohci.yaml schema and remove all the old binding docs. Marvell pxa-usb.txt has "usb-ohci" in the example, but actual users don't, so drop it. Signed-off-by: Rob Herring --- v2: - Fix transceiver property in if/then schema - Move OMAP to separate patch --- .../devicetree/bindings/powerpc/nintendo/wii.txt | 10 --- .../devicetree/bindings/usb/generic-ohci.yaml | 28 +++-- Documentation/devicetree/bindings/usb/ohci-nxp.txt | 24 --- Documentation/devicetree/bindings/usb/pxa-usb.txt | 2 +- .../devicetree/bindings/usb/spear-usb.txt | 35 -- 5 files changed, 26 insertions(+), 73 deletions(-) diff --git a/Documentation/devicetree/bindings/powerpc/nintendo/wii.txt b/Documentation/devicetree/bindings/powerpc/nintendo/wii.txt index c4d78f28d23c..3ff6ebbb4998 100644 --- a/Documentation/devicetree/bindings/powerpc/nintendo/wii.txt +++ b/Documentation/devicetree/bindings/powerpc/nintendo/wii.txt @@ -97,16 +97,6 @@ Nintendo Wii device tree - reg : should contain the EXI registers location and length - interrupts : should contain the EXI interrupt -1.g) The Open Host Controller Interface (OHCI) nodes - - Represent the USB 1.x Open Host Controller Interfaces. - - Required properties: - - - compatible : should be "nintendo,hollywood-usb-ohci","usb-ohci" - - reg : should contain the OHCI registers location and length - - interrupts : should contain the OHCI interrupt - 1.h) The Enhanced Host Controller Interface (EHCI) node Represents the USB 2.0 Enhanced Host Controller Interface. 
diff --git a/Documentation/devicetree/bindings/usb/generic-ohci.yaml b/Documentation/devicetree/bindings/usb/generic-ohci.yaml index 4fcbd0add49d..8492d809ba40 100644 --- a/Documentation/devicetree/bindings/usb/generic-ohci.yaml +++ b/Documentation/devicetree/bindings/usb/generic-ohci.yaml @@ -6,9 +6,6 @@ $schema: http://devicetree.org/meta-schemas/core.yaml# title: USB OHCI Controller -allOf: - - $ref: "usb-hcd.yaml" - maintainers: - Greg Kroah-Hartman @@ -50,6 +47,13 @@ properties: - snps,hsdk-v1.0-ohci - const: generic-ohci - const: generic-ohci + - items: + - enum: + - cavium,octeon-6335-ohci + - nintendo,hollywood-usb-ohci + - nxp,ohci-nxp + - st,spear600-ohci + - const: usb-ohci reg: maxItems: 1 @@ -119,11 +123,29 @@ properties: - host - otg + transceiver: +$ref: /schemas/types.yaml#/definitions/phandle +description: + The associated ISP1301 device. Necessary for the UDC controller for + connecting to the USB physical layer. + required: - compatible - reg - interrupts +allOf: + - $ref: usb-hcd.yaml + - if: + not: +properties: + compatible: +contains: + const: nxp,ohci-nxp +then: + properties: +transceiver: false + additionalProperties: false examples: diff --git a/Documentation/devicetree/bindings/usb/ohci-nxp.txt b/Documentation/devicetree/bindings/usb/ohci-nxp.txt deleted file mode 100644 index 71e28c1017ed.. --- a/Documentation/devicetree/bindings/usb/ohci-nxp.txt +++ /dev/null @@ -1,24 +0,0 @@ -* OHCI controller, NXP ohci-nxp variant - -Required properties: -- compatible: must be "nxp,ohci-nxp" -- reg: physical base address of the controller and length of memory mapped - region. 
-- interrupts: The OHCI interrupt -- transceiver: phandle of the associated ISP1301 device - this is necessary for - the UDC controller for connecting to the USB physical layer - -Example (LPC32xx): - - isp1301: usb-transceiver@2c { - compatible = "nxp,isp1301"; - reg = <0x2c>; - }; - - ohci@3102 { - compatible = "nxp,ohci-nxp"; - reg = <0x3102 0x300>; - interrupt-parent = <&mic>; - interrupts = <0x3b 0>; - transceiver = <&isp1301>; - }; diff --git a/Documentation/devicetree/bindings/usb/pxa-usb.txt b/Documentation/devicetree/bindings/usb/pxa-usb.txt index 9c331799b87c..53fdae4fa6f6 100644 --- a/Documentation/devicetree/bindings/usb/pxa-usb.txt +++ b/Documentation/devicetree/bindings/usb/pxa-usb.txt @@ -22,7 +22,7 @@ Optional properties: Example: usb0: ohci@4c00 { - compatible = "marvell,pxa-ohci", "usb-ohci"; + compatible = "marvell,pxa-ohci"; reg = <0x4c00 0x10>; interrupts = <18>; marvell,enable-port1; diff --git a/Documentation/devicetree/bindings/usb/spear-usb.txt b/Documentation/devicetree/bindings/usb/spear-usb.txt deleted file mode 100644 index 1dc91cc459c0.. --- a/Documentation/devicetree/bindings/usb/spear-usb.txt +++ /dev/null @@ -1
[PATCH v2 4/5] dt-bindings: usb: Convert Marvell Orion EHCI to DT schema
The Marvell Orion EHCI binding is just some compatible strings, so add it to the generic-ehci.yaml schema. Signed-off-by: Rob Herring --- .../devicetree/bindings/usb/ehci-orion.txt | 22 -- .../devicetree/bindings/usb/generic-ehci.yaml | 2 ++ 2 files changed, 2 insertions(+), 22 deletions(-) diff --git a/Documentation/devicetree/bindings/usb/ehci-orion.txt b/Documentation/devicetree/bindings/usb/ehci-orion.txt deleted file mode 100644 index 2855bae79fda.. --- a/Documentation/devicetree/bindings/usb/ehci-orion.txt +++ /dev/null @@ -1,22 +0,0 @@ -* EHCI controller, Orion Marvell variants - -Required properties: -- compatible: must be one of the following - "marvell,orion-ehci" - "marvell,armada-3700-ehci" -- reg: physical base address of the controller and length of memory mapped - region. -- interrupts: The EHCI interrupt - -Optional properties: -- clocks: reference to the clock -- phys: reference to the USB PHY -- phy-names: name of the USB PHY, should be "usb" - -Example: - - ehci@5 { - compatible = "marvell,orion-ehci"; - reg = <0x5 0x1000>; - interrupts = <19>; - }; diff --git a/Documentation/devicetree/bindings/usb/generic-ehci.yaml b/Documentation/devicetree/bindings/usb/generic-ehci.yaml index 2d382ae424da..ebbb01b39a92 100644 --- a/Documentation/devicetree/bindings/usb/generic-ehci.yaml +++ b/Documentation/devicetree/bindings/usb/generic-ehci.yaml @@ -74,6 +74,8 @@ properties: - const: usb-ehci - enum: - generic-ehci + - marvell,armada-3700-ehci + - marvell,orion-ehci - ti,ehci-omap - usb-ehci -- 2.39.0
[PATCH v2 3/5] dt-bindings: usb: Convert OMAP OHCI/EHCI bindings to schema
The OMAP OHCI and EHCI USB host bindings follow the generic binding, so add the compatibles and remove the old txt binding docs. The examples in omap-usb-host.txt don't match actual users, so update them dropping the fallback compatible. Signed-off-by: Rob Herring --- v2: - New patch --- .../devicetree/bindings/mfd/omap-usb-host.txt | 8 +++--- .../devicetree/bindings/usb/ehci-omap.txt | 31 -- .../devicetree/bindings/usb/generic-ehci.yaml | 1 + .../devicetree/bindings/usb/generic-ohci.yaml | 4 ++- .../devicetree/bindings/usb/ohci-omap3.txt | 15 --- 5 files changed, 8 insertions(+), 51 deletions(-) diff --git a/Documentation/devicetree/bindings/mfd/omap-usb-host.txt b/Documentation/devicetree/bindings/mfd/omap-usb-host.txt index aa1eaa59581b..7ce5800dd36d 100644 --- a/Documentation/devicetree/bindings/mfd/omap-usb-host.txt +++ b/Documentation/devicetree/bindings/mfd/omap-usb-host.txt @@ -64,8 +64,8 @@ Required properties if child node exists: Properties for children: The OMAP HS USB Host subsystem contains EHCI and OHCI controllers. -See Documentation/devicetree/bindings/usb/ehci-omap.txt and -Documentation/devicetree/bindings/usb/ohci-omap3.txt. +See Documentation/devicetree/bindings/usb/ehci-generic.yaml and +Documentation/devicetree/bindings/usb/ohci-generic.yaml. Example for OMAP4: @@ -78,14 +78,14 @@ usbhshost: usbhshost@4a064000 { ranges; usbhsohci: ohci@4a064800 { - compatible = "ti,ohci-omap3", "usb-ohci"; + compatible = "ti,ohci-omap3"; reg = <0x4a064800 0x400>; interrupt-parent = <&gic>; interrupts = <0 76 0x4>; }; usbhsehci: ehci@4a064c00 { - compatible = "ti,ehci-omap", "usb-ehci"; + compatible = "ti,ehci-omap"; reg = <0x4a064c00 0x400>; interrupt-parent = <&gic>; interrupts = <0 77 0x4>; diff --git a/Documentation/devicetree/bindings/usb/ehci-omap.txt b/Documentation/devicetree/bindings/usb/ehci-omap.txt deleted file mode 100644 index d77e11a975a2.. 
--- a/Documentation/devicetree/bindings/usb/ehci-omap.txt +++ /dev/null @@ -1,31 +0,0 @@ -OMAP HS USB EHCI controller - -This device is usually the child of the omap-usb-host -Documentation/devicetree/bindings/mfd/omap-usb-host.txt - -Required properties: - -- compatible: should be "ti,ehci-omap" -- reg: should contain one register range i.e. start and length -- interrupts: description of the interrupt line - -Optional properties: - -- phys: list of phandles to PHY nodes. - This property is required if at least one of the ports are in - PHY mode i.e. OMAP_EHCI_PORT_MODE_PHY - -To specify the port mode, see -Documentation/devicetree/bindings/mfd/omap-usb-host.txt - -Example for OMAP4: - -usbhsehci: ehci@4a064c00 { - compatible = "ti,ehci-omap"; - reg = <0x4a064c00 0x400>; - interrupts = <0 77 0x4>; -}; - -&usbhsehci { - phys = <&hsusb1_phy 0 &hsusb3_phy>; -}; diff --git a/Documentation/devicetree/bindings/usb/generic-ehci.yaml b/Documentation/devicetree/bindings/usb/generic-ehci.yaml index 994818cb6044..2d382ae424da 100644 --- a/Documentation/devicetree/bindings/usb/generic-ehci.yaml +++ b/Documentation/devicetree/bindings/usb/generic-ehci.yaml @@ -74,6 +74,7 @@ properties: - const: usb-ehci - enum: - generic-ehci + - ti,ehci-omap - usb-ehci reg: diff --git a/Documentation/devicetree/bindings/usb/generic-ohci.yaml b/Documentation/devicetree/bindings/usb/generic-ohci.yaml index 8492d809ba40..a9ba7257b884 100644 --- a/Documentation/devicetree/bindings/usb/generic-ohci.yaml +++ b/Documentation/devicetree/bindings/usb/generic-ohci.yaml @@ -46,7 +46,9 @@ properties: - ingenic,jz4740-ohci - snps,hsdk-v1.0-ohci - const: generic-ohci - - const: generic-ohci + - enum: + - generic-ohci + - ti,ohci-omap3 - items: - enum: - cavium,octeon-6335-ohci diff --git a/Documentation/devicetree/bindings/usb/ohci-omap3.txt b/Documentation/devicetree/bindings/usb/ohci-omap3.txt deleted file mode 100644 index ce8c47cff6d0.. 
--- a/Documentation/devicetree/bindings/usb/ohci-omap3.txt +++ /dev/null @@ -1,15 +0,0 @@ -OMAP HS USB OHCI controller (OMAP3 and later) - -Required properties: - -- compatible: should be "ti,ohci-omap3" -- reg: should contain one register range i.e. start and length -- interrupts: description of the interrupt line - -Example for OMAP4: - -usbhsohci: ohci@4a064800 { - compatible = "ti,ohci-omap3"; - reg = <0x4a064800 0x400>; - interrupts = <0 76 0x4>; -}; -- 2.39.0
[PATCH v2 1/5] dt-bindings: usb: Remove obsolete brcm,bcm3384-usb.txt
The "brcm,bcm3384-ohci" and "brcm,bcm3384-ehci" compatibles are already documented in generic-ohci.yaml and generic-ehci.yaml, respectively, so remove the old txt binding. Signed-off-by: Rob Herring --- Documentation/devicetree/bindings/usb/brcm,bcm3384-usb.txt | 11 --- 1 file changed, 11 deletions(-) diff --git a/Documentation/devicetree/bindings/usb/brcm,bcm3384-usb.txt b/Documentation/devicetree/bindings/usb/brcm,bcm3384-usb.txt deleted file mode 100644 index 452c45c7bf29.. --- a/Documentation/devicetree/bindings/usb/brcm,bcm3384-usb.txt +++ /dev/null @@ -1,11 +0,0 @@ -* Broadcom USB controllers - -Required properties: -- compatible: "brcm,bcm3384-ohci", "brcm,bcm3384-ehci" - - These currently use the generic-ohci and generic-ehci drivers. On some - systems, special handling may be needed in the following cases: - - - Restoring state after systemwide power save modes - - Sharing PHYs with the USBD (UDC) hardware - - Figuring out which controllers are disabled on ASIC bondout variants -- 2.39.0
[PATCH v2 5/5] dt-bindings: usb: Convert Nuvoton EHCI to DT schema
The Nuvoton EHCI binding is just some compatible strings, so add it to the generic-ehci.yaml schema. Signed-off-by: Rob Herring --- .../devicetree/bindings/usb/generic-ehci.yaml| 2 ++ .../devicetree/bindings/usb/npcm7xx-usb.txt | 20 2 files changed, 2 insertions(+), 20 deletions(-) diff --git a/Documentation/devicetree/bindings/usb/generic-ehci.yaml b/Documentation/devicetree/bindings/usb/generic-ehci.yaml index ebbb01b39a92..050cfd5acdaa 100644 --- a/Documentation/devicetree/bindings/usb/generic-ehci.yaml +++ b/Documentation/devicetree/bindings/usb/generic-ehci.yaml @@ -76,6 +76,8 @@ properties: - generic-ehci - marvell,armada-3700-ehci - marvell,orion-ehci + - nuvoton,npcm750-ehci + - nuvoton,npcm845-ehci - ti,ehci-omap - usb-ehci diff --git a/Documentation/devicetree/bindings/usb/npcm7xx-usb.txt b/Documentation/devicetree/bindings/usb/npcm7xx-usb.txt deleted file mode 100644 index 352a0a1e2f76.. --- a/Documentation/devicetree/bindings/usb/npcm7xx-usb.txt +++ /dev/null @@ -1,20 +0,0 @@ -Nuvoton NPCM7XX SoC USB controllers: -- - -EHCI: -- - -Required properties: -- compatible: should be one of -"nuvoton,npcm750-ehci" -"nuvoton,npcm845-ehci" -- interrupts: Should contain the EHCI interrupt -- reg:Physical address and length of the register set for the device - -Example: - - ehci1: usb@f0806000 { - compatible = "nuvoton,npcm750-ehci"; - reg = <0xf0806000 0x1000>; - interrupts = <0 61 4>; - }; -- 2.39.0
[PATCH v2 0/5] dt-bindings: usb: Convert some more simple OHCI/EHCI bindings
The 'ohci-usb' compatible is another 'generic' compatible for OHCI, but isn't documented with a schema. Let's add it to generic-ohci.yaml schema. While looking at this, I found a few other USB host bindings which are simple enough to use the 'generic' schemas. Signed-off-by: Rob Herring --- Changes in v2: - Fix schema error for 'transceiver' - Split OMAP changes to separate patch and convert omap-ehci - Link to v1: https://lore.kernel.org/r/20230110-dt-usb-v1-0-8e366e326...@kernel.org --- Rob Herring (5): dt-bindings: usb: Remove obsolete brcm,bcm3384-usb.txt dt-bindings: usb: Convert multiple "usb-ohci" bindings to DT schema dt-bindings: usb: Convert OMAP OHCI/EHCI bindings to schema dt-bindings: usb: Convert Marvell Orion EHCI to DT schema dt-bindings: usb: Convert Nuvoton EHCI to DT schema .../devicetree/bindings/mfd/omap-usb-host.txt | 8 ++--- .../devicetree/bindings/powerpc/nintendo/wii.txt | 10 --- .../devicetree/bindings/usb/brcm,bcm3384-usb.txt | 11 --- .../devicetree/bindings/usb/ehci-omap.txt | 31 --- .../devicetree/bindings/usb/ehci-orion.txt | 22 -- .../devicetree/bindings/usb/generic-ehci.yaml | 5 .../devicetree/bindings/usb/generic-ohci.yaml | 32 +--- .../devicetree/bindings/usb/npcm7xx-usb.txt| 20 - Documentation/devicetree/bindings/usb/ohci-nxp.txt | 24 --- .../devicetree/bindings/usb/ohci-omap3.txt | 15 -- Documentation/devicetree/bindings/usb/pxa-usb.txt | 2 +- .../devicetree/bindings/usb/spear-usb.txt | 35 -- 12 files changed, 38 insertions(+), 177 deletions(-) --- base-commit: 1b929c02afd37871d5afb9d498426f83432e71c2 change-id: 20230110-dt-usb-1c637752a83b Best regards, -- Rob Herring
Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote: > On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko wrote: > > > > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote: > > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko wrote: > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > > Its use in the vm_area_free can cause regressions in the exit path > > > > > when > > > > > multiple VMAs are being freed. > > > > > > > > What kind of regressions. > > > > > > > > > To minimize that impact, place VMAs into > > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > > > Please add some data to justify this additional complexity. > > > > > > Sorry, should have done that in the first place. A 4.3% regression was > > > noticed when running execl test from unixbench suite. spawn test also > > > showed 1.6% regression. Profiling revealed that vma freeing was taking > > > longer due to call_rcu() which is slow when RCU callback offloading is > > > enabled. > > > > Could you be more specific? vma freeing is async with the RCU so how > > come this has resulted in a regression? Is there any heavy > > rcu_synchronize in the exec path? That would be an interesting > > information. > > No, there is no heavy rcu_synchronize() or any other additional > synchronous load in the exit path. It's the call_rcu() which can block > the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of > other call_rcu()'s going on in parallel. Note that call_rcu() calls > rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling > revealed that this function was taking multiple ms (don't recall the > actual number, sorry). Paul's explanation implied that this happens > due to contention on the locks taken in this function. 
For more > in-depth details I'll have to ask Paul for help :) This code is quite > complex and I don't know all the details of RCU implementation. There are a couple of possibilities here. First, if I am remembering correctly, the time between the call_rcu() and invocation of the corresponding callback was taking multiple seconds, but that was because the kernel was built with CONFIG_LAZY_RCU=y in order to save power by batching RCU work over multiple call_rcu() invocations. If this is causing a problem for a given call site, the shiny new call_rcu_hurry() can be used instead. Doing this gets back to the old-school non-laziness, but can of course consume more power. Second, there is a much shorter one-jiffy delay between the call_rcu() and the invocation of the corresponding callback in kernels built with either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only on CPUs mentioned in the rcu_nocbs kernel boot parameters). The purpose of this delay is to avoid lock contention, and so this delay is incurred only on CPUs that are queuing callbacks at a rate exceeding 16K/second. This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU invoking call_rcu() at least 16 times within a given jiffy will incur the added delay. The reason for this delay is the use of a separate ->nocb_bypass list. As Suren says, this bypass list is used to reduce lock contention on the main ->cblist. This is not needed in old-school kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y (including most datacenter kernels) because in that case the callbacks enqueued by call_rcu() are touched only by the corresponding CPU, so that there is no need for locks. Third, if you are instead seeing multiple milliseconds of CPU consumed by call_rcu() in the common case (for example, without the aid of interrupts, NMIs, or SMIs), please do let me know. That sounds to me like a bug. 
Or have I lost track of some other slow case? Thanx, Paul
Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call
On Wed, Jan 18, 2023 at 1:23 AM Michal Hocko wrote: > > On Tue 17-01-23 18:01:01, Suren Baghdasaryan wrote: > > On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko wrote: > > > > > > On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote: > > > > Move VMA flag modification (which now implies VMA locking) before > > > > anon_vma_lock_write to match the locking order of page fault handler. > > > > > > Does this changelog assume per vma locking in the #PF? > > > > Hmm, you are right. Page fault handlers do not use per-vma locks yet > > but the changelog already talks about that. Maybe I should change it > > to simply: > > ``` > > Move VMA flag modification (which now implies VMA locking) before > > vma_adjust_trans_huge() to ensure the modifications are done after VMA > > has been locked. > > Because vma_adjust_trans_huge() modifies the VMA and such modifications should be done under VMA write-lock protection. > > Without that additional reasoning it is not really clear why that is > needed and seems arbitrary. Would the above be a good reasoning? > > -- > Michal Hocko > SUSE Labs
Re: [PATCH 26/41] kernel/fork: assert no VMA readers during its destruction
On Wed, Jan 18, 2023 at 1:43 AM 'Michal Hocko' via kernel-team wrote: > > On Tue 17-01-23 17:53:00, Suren Baghdasaryan wrote: > > On Tue, Jan 17, 2023 at 7:42 AM 'Michal Hocko' via kernel-team > > wrote: > > > > > > On Mon 09-01-23 12:53:21, Suren Baghdasaryan wrote: > > > > Assert there are no holders of VMA lock for reading when it is about to > > > > be > > > > destroyed. > > > > > > > > Signed-off-by: Suren Baghdasaryan > > > > --- > > > > include/linux/mm.h | 8 > > > > kernel/fork.c | 2 ++ > > > > 2 files changed, 10 insertions(+) > > > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > index 594e835bad9c..c464fc8a514c 100644 > > > > --- a/include/linux/mm.h > > > > +++ b/include/linux/mm.h > > > > @@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct > > > > vm_area_struct *vma) > > > > VM_BUG_ON_VMA(vma->vm_lock_seq != > > > > READ_ONCE(vma->vm_mm->mm_lock_seq), vma); > > > > } > > > > > > > > +static inline void vma_assert_no_reader(struct vm_area_struct *vma) > > > > +{ > > > > + VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) && > > > > + vma->vm_lock_seq != > > > > READ_ONCE(vma->vm_mm->mm_lock_seq), > > > > + vma); > > > > > > Do we really need to check for vm_lock_seq? rwsem_is_locked should tell > > > us something is wrong on its own, no? This could be somebody racing with > > > the vma destruction and using the write lock. Unlikely but I do not see > > > why to narrow debugging scope. > > > > I wanted to ensure there are no page fault handlers (read-lockers) > > when we are destroying the VMA and rwsem_is_locked(&vma->lock) alone > > could trigger if someone is concurrently calling vma_write_lock(). But > > I don't think we expect someone to be write-locking the VMA while we > > That would be UAF, no? Yes. That's why what I have is an overkill (which is also racy). > > > are destroying it, so you are right, I'm overcomplicating things here. 
> > I think I can get rid of vma_assert_no_reader() and add > > VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock)) directly in > > __vm_area_free(). WDYT? > > Yes, that adds some debugging. Not sure it is really necessary buyt it > is VM_BUG_ON so why not. I would like to keep it if possible. If it triggers that would be a clear signal what the issue is. Otherwise it might be hard to debug such a corner case. > -- > Michal Hocko > SUSE Labs > > -- > To unsubscribe from this group and stop receiving emails from it, send an > email to kernel-team+unsubscr...@android.com. >
Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko wrote: > > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote: > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko wrote: > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > > call_rcu() can take a long time when callback offloading is enabled. > > > > Its use in the vm_area_free can cause regressions in the exit path when > > > > multiple VMAs are being freed. > > > > > > What kind of regressions. > > > > > > > To minimize that impact, place VMAs into > > > > a list and free them in groups using one call_rcu() call per group. > > > > > > Please add some data to justify this additional complexity. > > > > Sorry, should have done that in the first place. A 4.3% regression was > > noticed when running execl test from unixbench suite. spawn test also > > showed 1.6% regression. Profiling revealed that vma freeing was taking > > longer due to call_rcu() which is slow when RCU callback offloading is > > enabled. > > Could you be more specific? vma freeing is async with the RCU so how > come this has resulted in a regression? Is there any heavy > rcu_synchronize in the exec path? That would be an interesting > information. No, there is no heavy rcu_synchronize() or any other additional synchronous load in the exit path. It's the call_rcu() which can block the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of other call_rcu()'s going on in parallel. Note that call_rcu() calls rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling revealed that this function was taking multiple ms (don't recall the actual number, sorry). Paul's explanation implied that this happens due to contention on the locks taken in this function. For more in-depth details I'll have to ask Paul for help :) This code is quite complex and I don't know all the details of RCU implementation. > -- > Michal Hocko > SUSE Labs
Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page
On Wed, Jan 18, 2023 at 1:40 AM Michal Hocko wrote: > > On Tue 17-01-23 21:28:06, Jann Horn wrote: > > On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko wrote: > > > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote: > > > > Protect VMA from concurrent page fault handler while collapsing a huge > > > > page. Page fault handler needs a stable PMD to use PTL and relies on > > > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(), > > > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will > > > > not be detected by a page fault handler without proper locking. > > > > > > I am struggling with this changelog. Maybe because my recollection of > > > the THP collapsing subtleties is weak. But aren't you just trying to say > > > that the current #PF handling and THP collapsing need to be mutually > > > exclusive currently so in order to keep that assumption you have mark > > > the vma write locked? > > > > > > Also it is not really clear to me how that handles other vmas which can > > > share the same thp? > > > > It's not about the hugepage itself, it's about how the THP collapse > > operation frees page tables. > > > > Before this series, page tables can be walked under any one of the > > mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged > > unlinks and frees page tables, it must ensure that all of those either > > are locked or don't exist. This series adds a fourth lock under which > > page tables can be traversed, and so khugepaged must also lock out that one. > > > > There is a codepath in khugepaged that iterates through all mappings > > of a file to zap page tables (retract_page_tables()), which locks each > > visited mm with mmap_write_trylock() and now also does > > vma_write_lock(). > > OK, I see. This would be a great addendum to the changelog. I'll add Jann's description in the changelog. Thanks Jann! 
> > I think one aspect of this patch that might cause trouble later on, if > support for non-anonymous VMAs is added, is that retract_page_tables() > now does vma_write_lock() while holding the mapping lock; the page > fault handling path would probably take the locks the other way > around, leading to a deadlock? So the vma_write_lock() in > retract_page_tables() might have to become a trylock later on.
>
> This, right?
>
>   #PF                     retract_page_tables
>   vma_read_lock
>                           i_mmap_lock_write
>   i_mmap_lock_read
>                           vma_write_lock
>
> I might be missing something but I have only found huge_pmd_share to be > called from the #PF path. That one should be safe as it cannot be a > target for THP. Not that it would matter much because such a dependency > chain would be really subtle. > -- > Michal Hocko > SUSE Labs
Re: [PATCH 27/41] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration
On Wed, Jan 18, 2023 at 4:51 AM Jann Horn wrote: > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan wrote: > > Page fault handlers might need to fire MMU notifications while a new > > notifier is being registered. Modify mm_take_all_locks to write-lock all > > VMAs and prevent this race with fault handlers that would hold VMA locks. > > VMAs are locked before i_mmap_rwsem and anon_vma to keep the same > > locking order as in page fault handlers. > > > > Signed-off-by: Suren Baghdasaryan > > --- > > mm/mmap.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > index 30c7d1c5206e..a256deca0bc0 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -3566,6 +3566,7 @@ static void vm_lock_mapping(struct mm_struct *mm, > > struct address_space *mapping) > > * of mm/rmap.c: > > * - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for > > * hugetlb mapping); > > + * - all vmas marked locked > > The existing comment above says that this is an *ordered* listing of > which locks are taken. > > > * - all i_mmap_rwsem locks; > > * - all anon_vma->rwseml > > * > > @@ -3591,6 +3592,7 @@ int mm_take_all_locks(struct mm_struct *mm) > > mas_for_each(&mas, vma, ULONG_MAX) { > > if (signal_pending(current)) > > goto out_unlock; > > + vma_write_lock(vma); > > if (vma->vm_file && vma->vm_file->f_mapping && > > is_vm_hugetlb_page(vma)) > > vm_lock_mapping(mm, vma->vm_file->f_mapping); > > Note that multiple VMAs can have the same ->f_mapping, so with this, > the lock ordering between VMA locks and the mapping locks of hugetlb > VMAs is mixed: If you have two adjacent hugetlb VMAs with the same > ->f_mapping, then the following operations happen: > > 1. lock VMA 1 > 2. lock mapping of VMAs 1 and 2 > 3. lock VMA 2 > 4. [second vm_lock_mapping() is a no-op] > > So for VMA 1, we ended up taking the VMA lock first, but for VMA 2, we > took the mapping lock first. 
> > The existing code has one loop per lock type to ensure that the locks > really are taken in the specified order, even when some of the locks > are associated with multiple VMAs. > > If we don't care about the ordering between these two, maybe that's > fine and you just have to adjust the comment; but it would be clearer > to add a separate loop for the VMA locks. Oh, thanks for pointing out this detail. A separate loop is definitely needed here. Will do that in the next respin. > > > @@ -3677,6 +3679,7 @@ void mm_drop_all_locks(struct mm_struct *mm) > > if (vma->vm_file && vma->vm_file->f_mapping) > > vm_unlock_mapping(vma->vm_file->f_mapping); > > } > > + vma_write_unlock_mm(mm); > > > > mutex_unlock(&mm_all_locks_mutex); > > } > > -- > > 2.39.0 > >
Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
On Wed, Jan 18, 2023 at 7:11 AM 'Michal Hocko' via kernel-team wrote: > > On Wed 18-01-23 14:23:32, Jann Horn wrote: > > On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko wrote: > > > On Tue 17-01-23 19:02:55, Jann Horn wrote: > > > > +locking maintainers > > > > > > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan > > > > wrote: > > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling > > > > > instead of mmap_lock. Because there are cases when multiple VMAs need > > > > > to be exclusively locked during VMA tree modifications, instead of the > > > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA > > > > > lock > > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. > > > > > When > > > > > mmap_write_lock holder is done with all modifications and drops > > > > > mmap_lock, > > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked > > > > > as > > > > > locked. > > > > [...] > > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma) > > > > > +{ > > > > > + up_read(&vma->lock); > > > > > +} > > > > > > > > One thing that might be gnarly here is that I think you might not be > > > > allowed to use up_read() to fully release ownership of an object - > > > > from what I remember, I think that up_read() (unlike something like > > > > spin_unlock()) can access the lock object after it's already been > > > > acquired by someone else. > > > > > > Yes, I think you are right. From a look into the code it seems that > > > the UAF is quite unlikely as there is a ton of work to be done between > > > vma_write_lock used to prepare vma for removal and actual removal. > > > That doesn't make it less of a problem though. 
> > > > > > > So if you want to protect against concurrent > > > > deletion, this might have to be something like: > > > > > > > > rcu_read_lock(); /* keeps vma alive */ > > > > up_read(&vma->lock); > > > > rcu_read_unlock(); > > > > > > > > But I'm not entirely sure about that, the locking folks might know > > > > better. > > > > > > I am not a locking expert but to me it looks like this should work > > > because the final cleanup would have to happen rcu_read_unlock. > > > > > > Thanks, I have completely missed this aspect of the locking when looking > > > into the code. > > > > > > Btw. looking at this again I have fully realized how hard it is actually > > > to see that vm_area_free is guaranteed to sync up with ongoing readers. > > > vma manipulation functions like __adjust_vma make my head spin. Would it > > > make more sense to have a rcu style synchronization point in > > > vm_area_free directly before call_rcu? This would add an overhead of > > > uncontended down_write of course. > > > > Something along those lines might be a good idea, but I think that > > rather than synchronizing the removal, it should maybe be something > > that splats (and bails out?) if it detects pending readers. If we get > > to vm_area_free() on a VMA that has pending readers, we might already > > be in a lot of trouble because the concurrent readers might have been > > traversing page tables while we were tearing them down or fun stuff > > like that. > > > > I think maybe Suren was already talking about something like that in > > another part of this patch series but I don't remember... > > This http://lkml.kernel.org/r/20230109205336.3665937-27-sur...@google.com? Yes, I spent a lot of time ensuring that __adjust_vma locks the right VMAs and that VMAs are freed or isolated under VMA write lock protection to exclude any readers. If the VM_BUG_ON_VMA in the patch Michal mentioned gets hit then it's a bug in my design and I'll have to fix it. 
But please, let's not add synchronize_rcu() in the vm_area_free(). That will slow down any path that frees a VMA, especially the exit path which might be freeing thousands of them. I had an SPF version with synchronize_rcu() in the vm_area_free() and phone vendors started yelling at me the very next day. call_rcu() with CONFIG_RCU_NOCB_CPU (which Android uses for power saving purposes) is already bad enough to show up in the benchmarks and that's why I had to add call_rcu() batching in https://lore.kernel.org/all/20230109205336.3665937-40-sur...@google.com. > > -- > Michal Hocko > SUSE Labs
Re: [PATCH v6 4/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
[ Adding a few more x86 and arm64 maintainers - while linux-arch is the right mailing list, I'm not convinced people actually follow it all that closely ] On Wed, Jan 18, 2023 at 12:00 AM Nicholas Piggin wrote: > > On a 16-socket 192-core POWER8 system, a context switching benchmark > with as many software threads as CPUs (so each switch will go in and > out of idle), upstream can achieve a rate of about 1 million context > switches per second, due to contention on the mm refcount. > > 64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable > the option. This increases the above benchmark to 118 million context > switches per second. Well, the 1M -> 118M change does seem like a good reason for this series. The patches certainly don't look offensive to me, so Ack as far as I'm concerned, but honestly, it's been some time since I've personally been active on the idle and lazy TLB code, so that ack is probably largely worthless. If anything, my main reaction to this all is to wonder whether the config option is a good idea - maybe we could do this unconditionally, and make the source code (and logic) simpler to follow when you don't have to worry about the CONFIG_MMU_LAZY_TLB_REFCOUNT option. I wouldn't be surprised to hear that x86 can have the same issue where the mm_struct refcount is a bigger issue than the possibility of an extra TLB shootdown at the final exit time. But having the config options as a way to switch people over gradually (and perhaps then removing it later) doesn't sound wrong to me either. And I personally find the argument in patch 3/5 fairly convincing: Shootdown IPIs cost could be an issue, but they have not been observed to be a serious problem with this scheme, because short-lived processes tend not to migrate CPUs much, therefore they don't get much chance to leave lazy tlb mm references on remote CPUs. Andy? PeterZ? Catalin? 
Nick - it might be good to link to the actual benchmark, and let people who have access to big machines perhaps just try it out on non-powerpc platforms... Linus
Re: [PATCH v9 00/10] phy: Add support for Lynx 10G SerDes
On 17-01-23, 11:46, Sean Anderson wrote: > > I noticed that this series is marked "changes requested" on patchwork. > However, I have received only automated feedback. I have done my best > effort to address feedback I have received on prior revisions. I would > appreciate getting another round of review before resending this series. Looking at the series, it looks like the kernel bot sent some warnings, so I was expecting an updated series for review -- ~Vinod
Re: [PATCH] of: Make of framebuffer devices unique
On Tue, 17 Jan 2023 17:58:04 +0100, Michal Suchanek wrote: > Since Linux 5.19 this error is observed: > > sysfs: cannot create duplicate filename '/devices/platform/of-display' > > This is because multiple devices with the same name 'of-display' are > created on the same bus. > > Update the code to create numbered device names for the non-boot > display. > > cc: linuxppc-dev@lists.ozlabs.org > References: https://bugzilla.kernel.org/show_bug.cgi?id=216095 > Fixes: 52b1b46c39ae ("of: Create platform devices for OF framebuffers") > Reported-by: Erhard F. > Suggested-by: Thomas Zimmermann > Signed-off-by: Michal Suchanek > --- > drivers/of/platform.c | 8 +++- > 1 file changed, 7 insertions(+), 1 deletion(-) > Applied, thanks!
[PATCH v5 7/7] drm/i915/gt: use __xchg instead of internal helper
Prefer core helper if available. Signed-off-by: Andrzej Hajda --- drivers/gpu/drm/i915/gt/intel_engine_cs.c| 2 +- drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 4 ++-- drivers/gpu/drm/i915/gt/intel_execlists_submission.c | 4 ++-- drivers/gpu/drm/i915/gt/intel_ggtt.c | 4 ++-- drivers/gpu/drm/i915/gt/intel_gsc.c | 2 +- drivers/gpu/drm/i915/gt/intel_gt.c | 4 ++-- drivers/gpu/drm/i915/gt/intel_gt_pm.c| 2 +- drivers/gpu/drm/i915/gt/intel_lrc.c | 6 +++--- drivers/gpu/drm/i915/gt/intel_migrate.c | 2 +- drivers/gpu/drm/i915/gt/intel_rc6.c | 2 +- drivers/gpu/drm/i915/gt/intel_rps.c | 2 +- drivers/gpu/drm/i915/gt/selftest_context.c | 2 +- drivers/gpu/drm/i915/gt/selftest_ring_submission.c | 2 +- drivers/gpu/drm/i915/gt/selftest_timeline.c | 2 +- drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c| 2 +- drivers/gpu/drm/i915/gt/uc/intel_uc.c| 2 +- drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c | 2 +- drivers/gpu/drm/i915/i915_utils.h| 1 + 18 files changed, 24 insertions(+), 23 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c index 922f1bb22dc685..9712bfc2c6523d 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c @@ -1042,7 +1042,7 @@ static void cleanup_status_page(struct intel_engine_cs *engine) /* Prevent writes into HWSP after returning the page to the system */ intel_engine_set_hwsp_writemask(engine, ~0u); - vma = fetch_and_zero(&engine->status_page.vma); + vma = __xchg(&engine->status_page.vma, 0); if (!vma) return; diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c index 9a527e1f5be655..09befcc6a84fa1 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c @@ -229,7 +229,7 @@ static void heartbeat(struct work_struct *wrk) mutex_unlock(&ce->timeline->mutex); out: if (!engine->i915->params.enable_hangcheck || !next_heartbeat(engine)) - 
i915_request_put(fetch_and_zero(&engine->heartbeat.systole)); + i915_request_put(__xchg(&engine->heartbeat.systole, 0)); intel_engine_pm_put(engine); } @@ -244,7 +244,7 @@ void intel_engine_unpark_heartbeat(struct intel_engine_cs *engine) void intel_engine_park_heartbeat(struct intel_engine_cs *engine) { if (cancel_delayed_work(&engine->heartbeat.work)) - i915_request_put(fetch_and_zero(&engine->heartbeat.systole)); + i915_request_put(__xchg(&engine->heartbeat.systole, 0)); } void intel_gt_unpark_heartbeats(struct intel_gt *gt) diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c index 18ffe55282e594..5c985e6fa1be2f 100644 --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c @@ -3199,7 +3199,7 @@ static void execlists_reset_cancel(struct intel_engine_cs *engine) RB_CLEAR_NODE(rb); spin_lock(&ve->base.sched_engine->lock); - rq = fetch_and_zero(&ve->request); + rq = __xchg(&ve->request, NULL); if (rq) { if (i915_request_mark_eio(rq)) { rq->engine = engine; @@ -3604,7 +3604,7 @@ static void rcu_virtual_context_destroy(struct work_struct *wrk) spin_lock_irq(&ve->base.sched_engine->lock); - old = fetch_and_zero(&ve->request); + old = __xchg(&ve->request, NULL); if (old) { GEM_BUG_ON(!__i915_request_is_complete(old)); __i915_request_submit(old); diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c index fe64c13fd3b4aa..6f441c3d3d1cef 100644 --- a/drivers/gpu/drm/i915/gt/intel_ggtt.c +++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c @@ -684,7 +684,7 @@ static void fini_aliasing_ppgtt(struct i915_ggtt *ggtt) { struct i915_ppgtt *ppgtt; - ppgtt = fetch_and_zero(&ggtt->alias); + ppgtt = __xchg(&ggtt->alias, NULL); if (!ppgtt) return; @@ -1238,7 +1238,7 @@ bool i915_ggtt_resume_vm(struct i915_address_space *vm) was_bound); if (obj) { /* only used during resume => exclusive access */ - write_domain_objs |= 
fetch_and_zero(&obj->write_domain); + write_domain_objs |= __xchg(&obj->write_domain, 0); obj->read_domains |= I915_GEM_DOMAIN_GTT; } } diff --git a/drivers/gpu/drm/i915/gt/int
[PATCH v5 6/7] qed: use __xchg if possible
Recently introduced helper simplifies the code. Signed-off-by: Andrzej Hajda --- include/linux/qed/qed_chain.h | 19 +++ 1 file changed, 7 insertions(+), 12 deletions(-) diff --git a/include/linux/qed/qed_chain.h b/include/linux/qed/qed_chain.h index a84063492c71ae..6355d558cf01b2 100644 --- a/include/linux/qed/qed_chain.h +++ b/include/linux/qed/qed_chain.h @@ -11,6 +11,7 @@ #include #include #include +#include #include #include #include @@ -368,7 +369,7 @@ static inline void qed_chain_return_produced(struct qed_chain *p_chain) */ static inline void *qed_chain_produce(struct qed_chain *p_chain) { - void *p_ret = NULL, *p_prod_idx, *p_prod_page_idx; + void *p_prod_idx, *p_prod_page_idx; if (is_chain_u16(p_chain)) { if ((p_chain->u.chain16.prod_idx & @@ -390,11 +391,8 @@ static inline void *qed_chain_produce(struct qed_chain *p_chain) p_chain->u.chain32.prod_idx++; } - p_ret = p_chain->p_prod_elem; - p_chain->p_prod_elem = (void *)(((u8 *)p_chain->p_prod_elem) + - p_chain->elem_size); - - return p_ret; + return __xchg(&p_chain->p_prod_elem, + (u8 *)p_chain->p_prod_elem + p_chain->elem_size); } /** @@ -439,7 +437,7 @@ static inline void qed_chain_recycle_consumed(struct qed_chain *p_chain) */ static inline void *qed_chain_consume(struct qed_chain *p_chain) { - void *p_ret = NULL, *p_cons_idx, *p_cons_page_idx; + void *p_cons_idx, *p_cons_page_idx; if (is_chain_u16(p_chain)) { if ((p_chain->u.chain16.cons_idx & @@ -461,11 +459,8 @@ static inline void *qed_chain_consume(struct qed_chain *p_chain) p_chain->u.chain32.cons_idx++; } - p_ret = p_chain->p_cons_elem; - p_chain->p_cons_elem = (void *)(((u8 *)p_chain->p_cons_elem) + - p_chain->elem_size); - - return p_ret; + return __xchg(&p_chain->p_cons_elem, + (u8 *)p_chain->p_cons_elem + p_chain->elem_size); } /** -- 2.34.1
[PATCH v5 5/7] io_uring: use __xchg if possible
Recently introduced helper simplifies the code. Signed-off-by: Andrzej Hajda --- io_uring/io_uring.c | 7 ++- io_uring/slist.h| 6 ++ 2 files changed, 4 insertions(+), 9 deletions(-) diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 2ac1cd8d23ea62..2b46a692d69022 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -54,6 +54,7 @@ #include #include #include +#include #include #include #include @@ -1095,8 +1096,6 @@ static void __io_req_find_next_prep(struct io_kiocb *req) static inline struct io_kiocb *io_req_find_next(struct io_kiocb *req) { - struct io_kiocb *nxt; - /* * If LINK is set, we have dependent requests in this chain. If we * didn't fail this request, queue the first one up, moving any other @@ -1105,9 +1104,7 @@ static inline struct io_kiocb *io_req_find_next(struct io_kiocb *req) */ if (unlikely(req->flags & IO_DISARM_MASK)) __io_req_find_next_prep(req); - nxt = req->link; - req->link = NULL; - return nxt; + return __xchg(&req->link, NULL); } static void ctx_flush_and_put(struct io_ring_ctx *ctx, bool *locked) diff --git a/io_uring/slist.h b/io_uring/slist.h index f27601fa46607b..d3a743a1adc626 100644 --- a/io_uring/slist.h +++ b/io_uring/slist.h @@ -2,6 +2,7 @@ #define INTERNAL_IO_SLIST_H #include +#include #define wq_list_for_each(pos, prv, head) \ for (pos = (head)->first, prv = NULL; pos; prv = pos, pos = (pos)->next) @@ -121,10 +122,7 @@ static inline void wq_list_del(struct io_wq_work_list *list, static inline struct io_wq_work_node *wq_stack_extract(struct io_wq_work_node *stack) { - struct io_wq_work_node *node = stack->next; - - stack->next = node->next; - return node; + return __xchg(&stack->next, stack->next->next); } static inline struct io_wq_work *wq_next_work(struct io_wq_work *work) -- 2.34.1
[PATCH v5 4/7] llist: simplify __llist_del_all
llist_del_all uses xchg, let's use __xchg here. Signed-off-by: Andrzej Hajda --- include/linux/llist.h | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/include/linux/llist.h b/include/linux/llist.h index 85bda2d02d65be..4dc1d185ea98ab 100644 --- a/include/linux/llist.h +++ b/include/linux/llist.h @@ -50,6 +50,7 @@ #include #include +#include #include #include @@ -241,10 +242,7 @@ static inline struct llist_node *llist_del_all(struct llist_head *head) static inline struct llist_node *__llist_del_all(struct llist_head *head) { - struct llist_node *first = head->first; - - head->first = NULL; - return first; + return __xchg(&head->first, NULL); } extern struct llist_node *llist_del_first(struct llist_head *head); -- 2.34.1
[PATCH v5 3/7] arch/*/uprobes: simplify arch_uretprobe_hijack_return_addr
In all architectures, except x86, arch_uretprobe_hijack_return_addr is just __xchg. Signed-off-by: Andrzej Hajda --- arch/arm/probes/uprobes/core.c | 8 ++-- arch/arm64/kernel/probes/uprobes.c | 9 ++--- arch/csky/kernel/probes/uprobes.c | 9 ++--- arch/mips/kernel/uprobes.c | 10 ++ arch/powerpc/kernel/uprobes.c | 10 ++ arch/riscv/kernel/probes/uprobes.c | 9 ++--- arch/s390/kernel/uprobes.c | 7 ++- arch/sparc/kernel/uprobes.c| 7 ++- 8 files changed, 16 insertions(+), 53 deletions(-) diff --git a/arch/arm/probes/uprobes/core.c b/arch/arm/probes/uprobes/core.c index f5f790c6e5f896..77ce8ae431376d 100644 --- a/arch/arm/probes/uprobes/core.c +++ b/arch/arm/probes/uprobes/core.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include @@ -61,12 +62,7 @@ unsigned long arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs *regs) { - unsigned long orig_ret_vaddr; - - orig_ret_vaddr = regs->ARM_lr; - /* Replace the return addr with trampoline addr */ - regs->ARM_lr = trampoline_vaddr; - return orig_ret_vaddr; + return __xchg(®s->ARM_lr, trampoline_vaddr); } int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, diff --git a/arch/arm64/kernel/probes/uprobes.c b/arch/arm64/kernel/probes/uprobes.c index d49aef2657cdf7..d7171d30ea6681 100644 --- a/arch/arm64/kernel/probes/uprobes.c +++ b/arch/arm64/kernel/probes/uprobes.c @@ -3,6 +3,7 @@ * Copyright (C) 2014-2016 Pratyush Anand */ #include +#include #include #include #include @@ -150,13 +151,7 @@ unsigned long arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs *regs) { - unsigned long orig_ret_vaddr; - - orig_ret_vaddr = procedure_link_pointer(regs); - /* Replace the return addr with trampoline addr */ - procedure_link_pointer_set(regs, trampoline_vaddr); - - return orig_ret_vaddr; + return __xchg(&procedure_link_pointer(regs), trampoline_vaddr); } int arch_uprobe_exception_notify(struct notifier_block *self, diff --git 
a/arch/csky/kernel/probes/uprobes.c b/arch/csky/kernel/probes/uprobes.c index 2d31a12e46cfee..775fe88b5f0016 100644 --- a/arch/csky/kernel/probes/uprobes.c +++ b/arch/csky/kernel/probes/uprobes.c @@ -3,6 +3,7 @@ * Copyright (C) 2014-2016 Pratyush Anand */ #include +#include #include #include #include @@ -123,13 +124,7 @@ unsigned long arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs *regs) { - unsigned long ra; - - ra = regs->lr; - - regs->lr = trampoline_vaddr; - - return ra; + return __xchg(&regs->lr, trampoline_vaddr); } int arch_uprobe_exception_notify(struct notifier_block *self, diff --git a/arch/mips/kernel/uprobes.c b/arch/mips/kernel/uprobes.c index 6c063aa188e626..2b1f375c407b87 100644 --- a/arch/mips/kernel/uprobes.c +++ b/arch/mips/kernel/uprobes.c @@ -2,6 +2,7 @@ #include #include #include +#include #include #include #include @@ -197,14 +198,7 @@ void arch_uprobe_abort_xol(struct arch_uprobe *aup, unsigned long arch_uretprobe_hijack_return_addr( unsigned long trampoline_vaddr, struct pt_regs *regs) { - unsigned long ra; - - ra = regs->regs[31]; - - /* Replace the return address with the trampoline address */ - regs->regs[31] = trampoline_vaddr; - - return ra; + return __xchg(&regs->regs[31], trampoline_vaddr); } /** diff --git a/arch/powerpc/kernel/uprobes.c b/arch/powerpc/kernel/uprobes.c index 95a41ae9dfa755..3c15c320315e6c 100644 --- a/arch/powerpc/kernel/uprobes.c +++ b/arch/powerpc/kernel/uprobes.c @@ -7,6 +7,7 @@ * Adapted from the x86 port by Ananth N Mavinakayanahalli */ #include +#include #include #include #include @@ -197,14 +198,7 @@ bool arch_uprobe_skip_sstep(struct arch_uprobe *auprobe, struct pt_regs *regs) unsigned long arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs *regs) { - unsigned long orig_ret_vaddr; - - orig_ret_vaddr = regs->link; - - /* Replace the return addr with trampoline addr */ - regs->link = trampoline_vaddr; - - return orig_ret_vaddr; + return __xchg(&regs->link,
trampoline_vaddr); } bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx, diff --git a/arch/riscv/kernel/probes/uprobes.c b/arch/riscv/kernel/probes/uprobes.c index c976a21cd4bd5b..5c8415e216536d 100644 --- a/arch/riscv/kernel/probes/uprobes.c +++ b/arch/riscv/kernel/probes/uprobes.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0-only #include +#include #include #include @@ -122,13 +123,7 @@ unsigned long arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr,
[PATCH v5 2/7] linux/include: add non-atomic version of xchg
The pattern of setting variable with new value and returning old one is very common in kernel. Usually atomicity of the operation is not required, so xchg seems to be suboptimal and confusing in such cases. Signed-off-by: Andrzej Hajda Reviewed-by: Andy Shevchenko --- include/linux/non-atomic/xchg.h | 19 +++ 1 file changed, 19 insertions(+) create mode 100644 include/linux/non-atomic/xchg.h diff --git a/include/linux/non-atomic/xchg.h b/include/linux/non-atomic/xchg.h new file mode 100644 index 00..f7fa5dd746f37d --- /dev/null +++ b/include/linux/non-atomic/xchg.h @@ -0,0 +1,19 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_NON_ATOMIC_XCHG_H +#define _LINUX_NON_ATOMIC_XCHG_H + +/** + * __xchg - set variable pointed by @ptr to @val, return old value + * @ptr: pointer to affected variable + * @val: value to be written + * + * This is non-atomic variant of xchg. + */ +#define __xchg(ptr, val) ({\ + __auto_type __ptr = ptr;\ + __auto_type __t = *__ptr; \ + *__ptr = (val); \ + __t;\ +}) + +#endif -- 2.34.1
[PATCH v5 1/7] arch: rename all internal names __xchg to __arch_xchg
__xchg will be used for non-atomic xchg macro. Signed-off-by: Andrzej Hajda Reviewed-by: Arnd Bergmann Acked-by: Geert Uytterhoeven [m68k] Acked-by: Palmer Dabbelt [riscv] --- v2: squashed all arch patches into one v3: fixed alpha/xchg_local, thx to l...@intel.com v4: adjusted indentation (Heiko) --- arch/alpha/include/asm/cmpxchg.h | 10 +- arch/arc/include/asm/cmpxchg.h | 4 ++-- arch/arm/include/asm/cmpxchg.h | 7 --- arch/arm64/include/asm/cmpxchg.h | 7 +++ arch/hexagon/include/asm/cmpxchg.h | 10 +- arch/ia64/include/asm/cmpxchg.h | 2 +- arch/ia64/include/uapi/asm/cmpxchg.h | 4 ++-- arch/loongarch/include/asm/cmpxchg.h | 4 ++-- arch/m68k/include/asm/cmpxchg.h | 6 +++--- arch/mips/include/asm/cmpxchg.h | 4 ++-- arch/openrisc/include/asm/cmpxchg.h | 10 +- arch/parisc/include/asm/cmpxchg.h| 4 ++-- arch/powerpc/include/asm/cmpxchg.h | 4 ++-- arch/riscv/include/asm/atomic.h | 2 +- arch/riscv/include/asm/cmpxchg.h | 4 ++-- arch/s390/include/asm/cmpxchg.h | 8 arch/sh/include/asm/cmpxchg.h| 4 ++-- arch/sparc/include/asm/cmpxchg_32.h | 4 ++-- arch/sparc/include/asm/cmpxchg_64.h | 6 +++--- arch/xtensa/include/asm/cmpxchg.h| 4 ++-- 20 files changed, 54 insertions(+), 54 deletions(-) diff --git a/arch/alpha/include/asm/cmpxchg.h b/arch/alpha/include/asm/cmpxchg.h index 6e0a850aa9d38c..91d4a4d9258cd2 100644 --- a/arch/alpha/include/asm/cmpxchg.h +++ b/arch/alpha/include/asm/cmpxchg.h @@ -6,15 +6,15 @@ * Atomic exchange routines. */ -#define xchg(type, args...)__xchg ## type ## _local(args) +#define xchg(type, args...)__arch_xchg ## type ## _local(args) #define cmpxchg(type, args...) 
__cmpxchg ## type ## _local(args) #include #define xchg_local(ptr, x) \ ({ \ __typeof__(*(ptr)) _x_ = (x); \ - (__typeof__(*(ptr))) __xchg_local((ptr), (unsigned long)_x_,\ - sizeof(*(ptr))); \ + (__typeof__(*(ptr))) __arch_xchg_local((ptr), (unsigned long)_x_,\ + sizeof(*(ptr))); \ }) #define arch_cmpxchg_local(ptr, o, n) \ @@ -34,7 +34,7 @@ #undef xchg #undef cmpxchg -#define xchg(type, args...)__xchg ##type(args) +#define xchg(type, args...)__arch_xchg ##type(args) #define cmpxchg(type, args...) __cmpxchg ##type(args) #include @@ -48,7 +48,7 @@ __typeof__(*(ptr)) _x_ = (x); \ smp_mb(); \ __ret = (__typeof__(*(ptr)))\ - __xchg((ptr), (unsigned long)_x_, sizeof(*(ptr))); \ + __arch_xchg((ptr), (unsigned long)_x_, sizeof(*(ptr))); \ smp_mb(); \ __ret; \ }) diff --git a/arch/arc/include/asm/cmpxchg.h b/arch/arc/include/asm/cmpxchg.h index c5b544a5fe8106..e138fde067dea5 100644 --- a/arch/arc/include/asm/cmpxchg.h +++ b/arch/arc/include/asm/cmpxchg.h @@ -85,7 +85,7 @@ */ #ifdef CONFIG_ARC_HAS_LLSC -#define __xchg(ptr, val) \ +#define __arch_xchg(ptr, val) \ ({ \ __asm__ __volatile__( \ " ex %0, [%1]\n" /* set new value */ \ @@ -102,7 +102,7 @@ \ switch(sizeof(*(_p_))) {\ case 4: \ - _val_ = __xchg(_p_, _val_); \ + _val_ = __arch_xchg(_p_, _val_);\ break; \ default:\ BUILD_BUG();\ diff --git a/arch/arm/include/asm/cmpxchg.h b/arch/arm/include/asm/cmpxchg.h index 4dfe538dfc689b..44667bdb4707a9 100644 --- a/arch/arm/include/asm/cmpxchg.h +++ b/arch/arm/include/asm/cmpxchg.h @@ -25,7 +25,8 @@ #define swp_is_buggy #endif -static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size) +static inline unsigned long +__arch_xchg(unsigned long x, volatile void *ptr, int size) { extern void __bad_xchg(volatile void *, int); unsigned long r
[PATCH v5 0/7] Introduce __xchg, non-atomic xchg
Hi all, The helper is tiny and there have been suggestions that we could live without it, so I want to present a few arguments why it would be good to have: 1. Code readability/simplification/number of lines: - decreases the number of lines, - often eliminates local variables, - for real examples see patches 3+. 2. Presence of similar helpers in other related languages/libraries: a) Rust[1]: 'replace' from the std::mem module; there is also a 'take' helper (__xchg(&x, 0)), which matches a private helper in i915 - fetch_and_zero, see the latest patch. b) C++[2]: 'exchange' from the utility header. If the idea is OK there are still 2 questions to answer: 1. Name of the helper: __xchg follows kernel conventions, but the Rust names would also be fine with me. 2. Where to put the helper: a) include/linux/non-atomic/xchg.h, as in this patchset, proposed by Andy Shevchenko, b) include/linux/utils.h? Any better name? Some kind of container for simple helpers. All __xchg conversions were performed using a cocci script, then manually adjusted where necessary. There are many more places it could be used; I have just chosen some of them. I can provide the cocci script to detect others (not all), if necessary. 
Changes: v2: squashed all __xchg -> __arch_xchg into one patch (Arnd) v3: fixed alpha/xchg_local (l...@intel.com) v4: adjusted indentation (Heiko) v5: added more __xchg conversions - patches 3-6, added tags [1]: https://doc.rust-lang.org/std/mem/index.html [2]: https://en.cppreference.com/w/cpp/header/utility Regards Andrzej Andrzej Hajda (7): arch: rename all internal names __xchg to __arch_xchg linux/include: add non-atomic version of xchg arch/*/uprobes: simplify arch_uretprobe_hijack_return_addr llist: simplify __llist_del_all io_uring: use __xchg if possible qed: use __xchg if possible drm/i915/gt: use __xchg instead of internal helper arch/alpha/include/asm/cmpxchg.h | 10 +- arch/arc/include/asm/cmpxchg.h| 4 ++-- arch/arm/include/asm/cmpxchg.h| 7 --- arch/arm/probes/uprobes/core.c| 8 ++-- arch/arm64/include/asm/cmpxchg.h | 7 +++ arch/arm64/kernel/probes/uprobes.c| 9 ++--- arch/csky/kernel/probes/uprobes.c | 9 ++--- arch/hexagon/include/asm/cmpxchg.h| 10 +- arch/ia64/include/asm/cmpxchg.h | 2 +- arch/ia64/include/uapi/asm/cmpxchg.h | 4 ++-- arch/loongarch/include/asm/cmpxchg.h | 4 ++-- arch/m68k/include/asm/cmpxchg.h | 6 +++--- arch/mips/include/asm/cmpxchg.h | 4 ++-- arch/mips/kernel/uprobes.c| 10 ++ arch/openrisc/include/asm/cmpxchg.h | 10 +- arch/parisc/include/asm/cmpxchg.h | 4 ++-- arch/powerpc/include/asm/cmpxchg.h| 4 ++-- arch/powerpc/kernel/uprobes.c | 10 ++ arch/riscv/include/asm/atomic.h | 2 +- arch/riscv/include/asm/cmpxchg.h | 4 ++-- arch/riscv/kernel/probes/uprobes.c| 9 ++--- arch/s390/include/asm/cmpxchg.h | 8 arch/s390/kernel/uprobes.c| 7 ++- arch/sh/include/asm/cmpxchg.h | 4 ++-- arch/sparc/include/asm/cmpxchg_32.h | 4 ++-- arch/sparc/include/asm/cmpxchg_64.h | 6 +++--- arch/sparc/kernel/uprobes.c | 7 ++- arch/xtensa/include/asm/cmpxchg.h | 4 ++-- drivers/gpu/drm/i915/gt/intel_engine_cs.c | 2 +- .../gpu/drm/i915/gt/intel_engine_heartbeat.c | 4 ++-- .../drm/i915/gt/intel_execlists_submission.c | 4 ++-- drivers/gpu/drm/i915/gt/intel_ggtt.c | 4 ++-- 
drivers/gpu/drm/i915/gt/intel_gsc.c | 2 +- drivers/gpu/drm/i915/gt/intel_gt.c| 4 ++-- drivers/gpu/drm/i915/gt/intel_gt_pm.c | 2 +- drivers/gpu/drm/i915/gt/intel_lrc.c | 6 +++--- drivers/gpu/drm/i915/gt/intel_migrate.c | 2 +- drivers/gpu/drm/i915/gt/intel_rc6.c | 2 +- drivers/gpu/drm/i915/gt/intel_rps.c | 2 +- drivers/gpu/drm/i915/gt/selftest_context.c| 2 +- .../drm/i915/gt/selftest_ring_submission.c| 2 +- drivers/gpu/drm/i915/gt/selftest_timeline.c | 2 +- drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c | 2 +- drivers/gpu/drm/i915/gt/uc/intel_uc.c | 2 +- drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c | 2 +- drivers/gpu/drm/i915/i915_utils.h | 1 + include/linux/llist.h | 6 ++ include/linux/non-atomic/xchg.h | 19 +++ include/linux/qed/qed_chain.h | 19 +++ io_uring/io_uring.c | 7 ++- io_uring/slist.h | 6 ++ 51 files changed, 126 insertions(+), 155 deletions(-) create mode 100644 include/linux/non-atomic/xchg.h -- 2.34.1
Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
On Wed 18-01-23 14:23:32, Jann Horn wrote: > On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko wrote: > > On Tue 17-01-23 19:02:55, Jann Horn wrote: > > > +locking maintainers > > > > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan > > > wrote: > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling > > > > instead of mmap_lock. Because there are cases when multiple VMAs need > > > > to be exclusively locked during VMA tree modifications, instead of the > > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When > > > > mmap_write_lock holder is done with all modifications and drops > > > > mmap_lock, > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as > > > > locked. > > > [...] > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma) > > > > +{ > > > > + up_read(&vma->lock); > > > > +} > > > > > > One thing that might be gnarly here is that I think you might not be > > > allowed to use up_read() to fully release ownership of an object - > > > from what I remember, I think that up_read() (unlike something like > > > spin_unlock()) can access the lock object after it's already been > > > acquired by someone else. > > > > Yes, I think you are right. From a look into the code it seems that > > the UAF is quite unlikely as there is a ton of work to be done between > > vma_write_lock used to prepare vma for removal and actual removal. > > That doesn't make it less of a problem though. > > > > > So if you want to protect against concurrent > > > deletion, this might have to be something like: > > > > > > rcu_read_lock(); /* keeps vma alive */ > > > up_read(&vma->lock); > > > rcu_read_unlock(); > > > > > > But I'm not entirely sure about that, the locking folks might know better. 
> > > > I am not a locking expert but to me it looks like this should work > > because the final cleanup would have to happen rcu_read_unlock. > > > > Thanks, I have completely missed this aspect of the locking when looking > > into the code. > > > > Btw. looking at this again I have fully realized how hard it is actually > > to see that vm_area_free is guaranteed to sync up with ongoing readers. > > vma manipulation functions like __adjust_vma make my head spin. Would it > > make more sense to have a rcu style synchronization point in > > vm_area_free directly before call_rcu? This would add an overhead of > > uncontended down_write of course. > > Something along those lines might be a good idea, but I think that > rather than synchronizing the removal, it should maybe be something > that splats (and bails out?) if it detects pending readers. If we get > to vm_area_free() on a VMA that has pending readers, we might already > be in a lot of trouble because the concurrent readers might have been > traversing page tables while we were tearing them down or fun stuff > like that. > > I think maybe Suren was already talking about something like that in > another part of this patch series but I don't remember... This http://lkml.kernel.org/r/20230109205336.3665937-27-sur...@google.com? -- Michal Hocko SUSE Labs
Re: [PATCH 1/2] tools/perf: Fix the file mode with copyfile while adding file to build-id cache
Em Mon, Jan 16, 2023 at 10:31:30AM +0530, Athira Rajeev escreveu: > The test "build id cache operations" fails on powerpc > As below: > > Adding 5a0fd882b53084224ba47b624c55a469 ./tests/shell/../pe-file.exe: Ok > build id: 5a0fd882b53084224ba47b624c55a469 > link: /tmp/perf.debug.ZTu/.build-id/5a/0fd882b53084224ba47b624c55a469 > file: > /tmp/perf.debug.ZTu/.build-id/5a/../../root/linux/tools/perf/tests/pe-file.exe/5a0fd882b53084224ba47b624c55a469/elf > failed: file > /tmp/perf.debug.ZTu/.build-id/5a/../../root/linux/tools/perf/tests/pe-file.exe/5a0fd882b53084224ba47b624c55a469/elf > does not exist > test child finished with -1 > end > build id cache operations: FAILED! > > The failing test is when trying to add pe-file.exe to > build id cache. > > Perf buildid-cache can be used to add/remove/manage > files from the build-id cache. "-a" option is used to > add a file to the build-id cache. > > Simple command to do so for a PE exe file: > # ls -ltr tests/pe-file.exe > -rw-r--r--. 1 root root 75595 Jan 10 23:35 tests/pe-file.exe > The file is in home directory. > > # mkdir /tmp/perf.debug.TeY1 > # perf --buildid-dir /tmp/perf.debug.TeY1 buildid-cache -v >-a tests/pe-file.exe > > The above will create ".build-id" folder in build id > directory, which is /tmp/perf.debug.TeY1. Also adds file > to this folder under build id. Example: > > # ls -ltr /tmp/perf.debug.TeY1/.build-id/5a/0fd882b53084224ba47b624c55a469/ > total 76 > -rw-r--r--. 1 root root 0 Jan 11 00:38 probes > -rwxr-xr-x. 1 root root 75595 Jan 11 00:38 elf > > We can see in the results that file mode for original > file and file in build id directory is different. ie, > build id file has executable permission whereas original > file doesn’t have. > > The code path and function ( build_id_cache__add ) to > add file to cache is in "util/build-id.c". In > build_id_cache__add() function, it first attempts to link > the original file to destination cache folder. 
If linking > the file fails ( which can happen if the destination and > source is on a different mount points ), it will copy the > file to destination. Here copyfile() routine explicitly uses > mode as "755" and hence file in the destination will have > executable permission. > > Code snippet: > > if (link(realname, filename) && errno != EEXIST && >copyfile(name, filename)) > > Strace logs: > > 172285 link("/home//linux/tools/perf/tests/pe-file.exe", > "/tmp/perf.debug.TeY1/home//linux/tools/perf/tests/pe-file.exe/5a0fd882b53084224ba47b624c55a469/elf") > = -1 EXDEV (Invalid cross-device link) > 172285 newfstatat(AT_FDCWD, "tests/pe-file.exe", {st_mode=S_IFREG|0644, > st_size=75595, ...}, 0) = 0 > 172285 openat(AT_FDCWD, > "/tmp/perf.debug.TeY1/home//linux/tools/perf/tests/pe-file.exe/5a0fd882b53084224ba47b624c55a469/.elf.KbAnsl", > O_RDWR|O_CREAT|O_EXCL, 0600) = 3 > 172285 fchmod(3, 0755) = 0 > 172285 openat(AT_FDCWD, "tests/pe-file.exe", O_RDONLY) = 4 > 172285 mmap(NULL, 75595, PROT_READ, MAP_PRIVATE, 4, 0) = 0x7fffa5cd > 172285 pwrite64(3, > "MZ\220\0\3\0\0\0\4\0\0\0\377\377\0\0\270\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., > 75595, 0) = 75595 > > Whereas if the link succeeds, it succeeds in the first > attempt itself and the file in the build-id dir will > have same permission as original file. > > Example, above uses /tmp. Instead if we use > "--buildid-dir /home/build", linking will work here > since mount points are same. Hence the destination file > will not have executable permission. > > Since the testcase "tests/shell/buildid.sh" always looks > for executable file, test fails in powerpc environment > when test is run from /root. > > The patch adds a change in build_id_cache__add to use > copyfile_mode which also passes the file’s original mode as > argument. This way the destination file mode also will > be same as original file. > > Signed-off-by: Athira Rajeev Thanks, applied both patches, IIRC there were an Acked-by from Ian for this one? 
Or was that reference a cut'n'paste error with that other series for the #/bin/bash changes? - Arnaldo > --- > tools/perf/util/build-id.c | 10 +++--- > 1 file changed, 7 insertions(+), 3 deletions(-) > > diff --git a/tools/perf/util/build-id.c b/tools/perf/util/build-id.c > index a839b30c981b..ea9c083ab1e3 100644 > --- a/tools/perf/util/build-id.c > +++ b/tools/perf/util/build-id.c > @@ -715,9 +715,13 @@ build_id_cache__add(const char *sbuild_id, const char > *name, const char *realnam > } else if (nsi && nsinfo__need_setns(nsi)) { > if (copyfile_ns(name, filename, nsi)) > goto out_free; > - } else if (link(realname, filename) && errno != EEXIST && > - copyfile(name, filename)) > - goto out_free; > + } else if (link
Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko wrote: > On Tue 17-01-23 19:02:55, Jann Horn wrote: > > +locking maintainers > > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan wrote: > > > Introduce a per-VMA rw_semaphore to be used during page fault handling > > > instead of mmap_lock. Because there are cases when multiple VMAs need > > > to be exclusively locked during VMA tree modifications, instead of the > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When > > > mmap_write_lock holder is done with all modifications and drops mmap_lock, > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as > > > locked. > > [...] > > > +static inline void vma_read_unlock(struct vm_area_struct *vma) > > > +{ > > > + up_read(&vma->lock); > > > +} > > > > One thing that might be gnarly here is that I think you might not be > > allowed to use up_read() to fully release ownership of an object - > > from what I remember, I think that up_read() (unlike something like > > spin_unlock()) can access the lock object after it's already been > > acquired by someone else. > > Yes, I think you are right. From a look into the code it seems that > the UAF is quite unlikely as there is a ton of work to be done between > vma_write_lock used to prepare vma for removal and actual removal. > That doesn't make it less of a problem though. > > > So if you want to protect against concurrent > > deletion, this might have to be something like: > > > > rcu_read_lock(); /* keeps vma alive */ > > up_read(&vma->lock); > > rcu_read_unlock(); > > > > But I'm not entirely sure about that, the locking folks might know better. > > I am not a locking expert but to me it looks like this should work > because the final cleanup would have to happen rcu_read_unlock. > > Thanks, I have completely missed this aspect of the locking when looking > into the code. > > Btw. 
looking at this again I have fully realized how hard it is actually > to see that vm_area_free is guaranteed to sync up with ongoing readers. > vma manipulation functions like __adjust_vma make my head spin. Would it > make more sense to have a rcu style synchronization point in > vm_area_free directly before call_rcu? This would add an overhead of > uncontended down_write of course. Something along those lines might be a good idea, but I think that rather than synchronizing the removal, it should maybe be something that splats (and bails out?) if it detects pending readers. If we get to vm_area_free() on a VMA that has pending readers, we might already be in a lot of trouble because the concurrent readers might have been traversing page tables while we were tearing them down or fun stuff like that. I think maybe Suren was already talking about something like that in another part of this patch series but I don't remember...
RE: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
... > > One thing that might be gnarly here is that I think you might not be > > allowed to use up_read() to fully release ownership of an object - > > from what I remember, I think that up_read() (unlike something like > > spin_unlock()) can access the lock object after it's already been > > acquired by someone else. > > Yes, I think you are right. From a look into the code it seems that > the UAF is quite unlikely as there is a ton of work to be done between > vma_write_lock used to prepare vma for removal and actual removal. > That doesn't make it less of a problem though. All it takes is a hardware interrupt. Especially if the softint code can also run. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Re: [PATCH v3 22/24] powerpc/pseries: Implement secvars for dynamic secure boot
On 1/18/23 01:10, Andrew Donnellan wrote: + +// PLPKS dynamic secure boot doesn't give us a format string in the same way OPAL does. +// Instead, report the format using the SB_VERSION variable in the keystore. +static ssize_t plpks_secvar_format(char *buf) Ideally there would be a size_t accompanying this buffer... +{ + struct plpks_var var = {0}; + ssize_t ret; + + var.component = NULL; + // Only the signed variables have null bytes in their names, this one doesn't + var.name = "SB_VERSION"; + var.namelen = 10; + var.datalen = 1; + var.data = kzalloc(1, GFP_KERNEL); NULL pointer check? + + // Unlike the other vars, SB_VERSION is owned by firmware instead of the OS + ret = plpks_read_fw_var(&var); + if (ret) { + if (ret == -ENOENT) { + ret = snprintf(buf, SECVAR_MAX_FORMAT_LEN, "ibm,plpks-sb-unknown"); + } else { + pr_err("Error %ld reading SB_VERSION from firmware\n", ret); + ret = -EIO; + } + goto err; + } + + // This string is made up by us - the hypervisor doesn't provide us + // with a format string in the way that OPAL firmware does. Hypervisor + // defines SB_VERSION as a "1 byte unsigned integer value". + ret = snprintf(buf, SECVAR_MAX_FORMAT_LEN, "ibm,plpks-sb-v%hhu", var.data[0]); + +err: + kfree(var.data); + return ret; +} +
Re: [PATCH 27/41] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration
On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan wrote: > Page fault handlers might need to fire MMU notifications while a new > notifier is being registered. Modify mm_take_all_locks to write-lock all > VMAs and prevent this race with fault handlers that would hold VMA locks. > VMAs are locked before i_mmap_rwsem and anon_vma to keep the same > locking order as in page fault handlers. > > Signed-off-by: Suren Baghdasaryan > --- > mm/mmap.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/mm/mmap.c b/mm/mmap.c > index 30c7d1c5206e..a256deca0bc0 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3566,6 +3566,7 @@ static void vm_lock_mapping(struct mm_struct *mm, > struct address_space *mapping) > * of mm/rmap.c: > * - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for > * hugetlb mapping); > + * - all vmas marked locked The existing comment above says that this is an *ordered* listing of which locks are taken. > * - all i_mmap_rwsem locks; > * - all anon_vma->rwseml > * > @@ -3591,6 +3592,7 @@ int mm_take_all_locks(struct mm_struct *mm) > mas_for_each(&mas, vma, ULONG_MAX) { > if (signal_pending(current)) > goto out_unlock; > + vma_write_lock(vma); > if (vma->vm_file && vma->vm_file->f_mapping && > is_vm_hugetlb_page(vma)) > vm_lock_mapping(mm, vma->vm_file->f_mapping); Note that multiple VMAs can have the same ->f_mapping, so with this, the lock ordering between VMA locks and the mapping locks of hugetlb VMAs is mixed: If you have two adjacent hugetlb VMAs with the same ->f_mapping, then the following operations happen: 1. lock VMA 1 2. lock mapping of VMAs 1 and 2 3. lock VMA 2 4. [second vm_lock_mapping() is a no-op] So for VMA 1, we ended up taking the VMA lock first, but for VMA 2, we took the mapping lock first. The existing code has one loop per lock type to ensure that the locks really are taken in the specified order, even when some of the locks are associated with multiple VMAs. 
If we don't care about the ordering between these two, maybe that's fine and you just have to adjust the comment; but it would be clearer to add a separate loop for the VMA locks. > @@ -3677,6 +3679,7 @@ void mm_drop_all_locks(struct mm_struct *mm) > if (vma->vm_file && vma->vm_file->f_mapping) > vm_unlock_mapping(vma->vm_file->f_mapping); > } > + vma_write_unlock_mm(mm); > > mutex_unlock(&mm_all_locks_mutex); > } > -- > 2.39.0 >
Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page
On Wed, Jan 18, 2023 at 10:40 AM Michal Hocko wrote: > On Tue 17-01-23 21:28:06, Jann Horn wrote: > > On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko wrote: > > > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote: > > > > Protect VMA from concurrent page fault handler while collapsing a huge > > > > page. Page fault handler needs a stable PMD to use PTL and relies on > > > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(), > > > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will > > > > not be detected by a page fault handler without proper locking. > > > > > > I am struggling with this changelog. Maybe because my recollection of > > > the THP collapsing subtleties is weak. But aren't you just trying to say > > > that the current #PF handling and THP collapsing need to be mutually > > > exclusive currently so in order to keep that assumption you have mark > > > the vma write locked? > > > > > > Also it is not really clear to me how that handles other vmas which can > > > share the same thp? > > > > It's not about the hugepage itself, it's about how the THP collapse > > operation frees page tables. > > > > Before this series, page tables can be walked under any one of the > > mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged > > unlinks and frees page tables, it must ensure that all of those either > > are locked or don't exist. This series adds a fourth lock under which > > page tables can be traversed, and so khugepaged must also lock out that one. > > > > There is a codepath in khugepaged that iterates through all mappings > > of a file to zap page tables (retract_page_tables()), which locks each > > visited mm with mmap_write_trylock() and now also does > > vma_write_lock(). > > OK, I see. This would be a great addendum to the changelog. 
> > > I think one aspect of this patch that might cause trouble later on, if > > support for non-anonymous VMAs is added, is that retract_page_tables() > > now does vma_write_lock() while holding the mapping lock; the page > > fault handling path would probably take the locks the other way > > around, leading to a deadlock? So the vma_write_lock() in > > retract_page_tables() might have to become a trylock later on. > > This, right? > #PF retract_page_tables > vma_read_lock > i_mmap_lock_write > i_mmap_lock_read > vma_write_lock > > > I might be missing something but I have only found huge_pmd_share to be > called from the #PF path. That one should be safe as it cannot be a > target for THP. Not that it would matter much because such a dependency > chain would be really subtle. Oops, yeah. Now that I'm looking closer I also don't see a path from the #PF path to i_mmap_lock_read. Sorry for sending you on a wild goose chase.
Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
On Tue 17-01-23 19:02:55, Jann Horn wrote: > +locking maintainers > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan wrote: > > Introduce a per-VMA rw_semaphore to be used during page fault handling > > instead of mmap_lock. Because there are cases when multiple VMAs need > > to be exclusively locked during VMA tree modifications, instead of the > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When > > mmap_write_lock holder is done with all modifications and drops mmap_lock, > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as > > locked. > [...] > > +static inline void vma_read_unlock(struct vm_area_struct *vma) > > +{ > > + up_read(&vma->lock); > > +} > > One thing that might be gnarly here is that I think you might not be > allowed to use up_read() to fully release ownership of an object - > from what I remember, I think that up_read() (unlike something like > spin_unlock()) can access the lock object after it's already been > acquired by someone else. Yes, I think you are right. From a look into the code it seems that the UAF is quite unlikely as there is a ton of work to be done between vma_write_lock used to prepare vma for removal and actual removal. That doesn't make it less of a problem though. > So if you want to protect against concurrent > deletion, this might have to be something like: > > rcu_read_lock(); /* keeps vma alive */ > up_read(&vma->lock); > rcu_read_unlock(); > > But I'm not entirely sure about that, the locking folks might know better. I am not a locking expert but to me it looks like this should work because the final cleanup would have to happen rcu_read_unlock. Thanks, I have completely missed this aspect of the locking when looking into the code. Btw. looking at this again I have fully realized how hard it is actually to see that vm_area_free is guaranteed to sync up with ongoing readers. 
vma manipulation functions like __adjust_vma make my head spin. Would it make more sense to have a rcu style synchronization point in vm_area_free directly before call_rcu? This would add an overhead of uncontended down_write of course. -- Michal Hocko SUSE Labs
Re: [PATCH v3 21/24] powerpc/pseries: Pass PLPKS password on kexec
On Wed, 2023-01-18 at 17:10 +1100, Andrew Donnellan wrote: > > struct umem_info { > u64 *buf; /* data buffer for usable-memory > property */ > @@ -1155,7 +1156,7 @@ int setup_new_fdt_ppc64(const struct kimage > *image, void *fdt, > unsigned long initrd_len, const char > *cmdline) > { > struct crash_mem *umem = NULL, *rmem = NULL; > - int i, nr_ranges, ret; > + int i, nr_ranges, ret, chosen_node; rip, we broke the build when CONFIG_PSERIES_PLPKS=n. v4 incoming shortly, after we wait a little while longer for comments. -- Andrew DonnellanOzLabs, ADL Canberra a...@linux.ibm.com IBM Australia Limited
Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote: > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko wrote: > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote: > > > call_rcu() can take a long time when callback offloading is enabled. > > > Its use in the vm_area_free can cause regressions in the exit path when > > > multiple VMAs are being freed. > > > > What kind of regressions. > > > > > To minimize that impact, place VMAs into > > > a list and free them in groups using one call_rcu() call per group. > > > > Please add some data to justify this additional complexity. > > Sorry, should have done that in the first place. A 4.3% regression was > noticed when running execl test from unixbench suite. spawn test also > showed 1.6% regression. Profiling revealed that vma freeing was taking > longer due to call_rcu() which is slow when RCU callback offloading is > enabled. Could you be more specific? vma freeing is async with the RCU so how come this has resulted in a regression? Is there any heavy rcu_synchronize in the exec path? That would be an interesting information. -- Michal Hocko SUSE Labs
Re: [PATCH 26/41] kernel/fork: assert no VMA readers during its destruction
On Tue 17-01-23 17:53:00, Suren Baghdasaryan wrote: > On Tue, Jan 17, 2023 at 7:42 AM 'Michal Hocko' via kernel-team > wrote: > > > > On Mon 09-01-23 12:53:21, Suren Baghdasaryan wrote: > > > Assert there are no holders of VMA lock for reading when it is about to be > > > destroyed. > > > > > > Signed-off-by: Suren Baghdasaryan > > > --- > > > include/linux/mm.h | 8 > > > kernel/fork.c | 2 ++ > > > 2 files changed, 10 insertions(+) > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > index 594e835bad9c..c464fc8a514c 100644 > > > --- a/include/linux/mm.h > > > +++ b/include/linux/mm.h > > > @@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct > > > vm_area_struct *vma) > > > VM_BUG_ON_VMA(vma->vm_lock_seq != > > > READ_ONCE(vma->vm_mm->mm_lock_seq), vma); > > > } > > > > > > +static inline void vma_assert_no_reader(struct vm_area_struct *vma) > > > +{ > > > + VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) && > > > + vma->vm_lock_seq != > > > READ_ONCE(vma->vm_mm->mm_lock_seq), > > > + vma); > > > > Do we really need to check for vm_lock_seq? rwsem_is_locked should tell > > us something is wrong on its own, no? This could be somebody racing with > > the vma destruction and using the write lock. Unlikely but I do not see > > why to narrow debugging scope. > > I wanted to ensure there are no page fault handlers (read-lockers) > when we are destroying the VMA and rwsem_is_locked(&vma->lock) alone > could trigger if someone is concurrently calling vma_write_lock(). But > I don't think we expect someone to be write-locking the VMA while we That would be UAF, no? > are destroying it, so you are right, I'm overcomplicating things here. > I think I can get rid of vma_assert_no_reader() and add > VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock)) directly in > __vm_area_free(). WDYT? Yes, that adds some debugging. Not sure it is really necessary but it is VM_BUG_ON so why not. -- Michal Hocko SUSE Labs
Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page
On Tue 17-01-23 21:28:06, Jann Horn wrote: > On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko wrote: > > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote: > > > Protect VMA from concurrent page fault handler while collapsing a huge > > > page. Page fault handler needs a stable PMD to use PTL and relies on > > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(), > > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will > > > not be detected by a page fault handler without proper locking. > > > > I am struggling with this changelog. Maybe because my recollection of > > the THP collapsing subtleties is weak. But aren't you just trying to say > > that the current #PF handling and THP collapsing need to be mutually > > exclusive currently so in order to keep that assumption you have to mark > > the vma write locked? > > > > Also it is not really clear to me how that handles other vmas which can > > share the same thp? > > It's not about the hugepage itself, it's about how the THP collapse > operation frees page tables. > > Before this series, page tables can be walked under any one of the > mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged > unlinks and frees page tables, it must ensure that all of those either > are locked or don't exist. This series adds a fourth lock under which > page tables can be traversed, and so khugepaged must also lock out that one. > > There is a codepath in khugepaged that iterates through all mappings > of a file to zap page tables (retract_page_tables()), which locks each > visited mm with mmap_write_trylock() and now also does > vma_write_lock(). OK, I see. This would be a great addendum to the changelog.
> I think one aspect of this patch that might cause trouble later on, if > support for non-anonymous VMAs is added, is that retract_page_tables() > now does vma_write_lock() while holding the mapping lock; the page > fault handling path would probably take the locks the other way > around, leading to a deadlock? So the vma_write_lock() in > retract_page_tables() might have to become a trylock later on. This, right?

#PF				retract_page_tables
vma_read_lock
				i_mmap_lock_write
i_mmap_lock_read
				vma_write_lock

I might be missing something but I have only found huge_pmd_share to be called from the #PF path. That one should be safe as it cannot be a target for THP. Not that it would matter much because such a dependency chain would be really subtle. -- Michal Hocko SUSE Labs
Re: crypto: p10-aes-gcm - Add asm markings necessary for kernel code
On Wed, Jan 18, 2023 at 02:04:44PM +1100, Stephen Rothwell wrote: > > arch/powerpc/crypto/aesp8-ppc.o: warning: objtool: > aes_p8_set_encrypt_key+0x44: unannotated intra-function call > > from the powerpc pseries_le_defconfig build (which is otherwise ok). Thanks Stephen. I've reverted these changes completely. Thanks, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call
On Tue 17-01-23 18:01:01, Suren Baghdasaryan wrote: > On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko wrote: > > > > On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote: > > > Move VMA flag modification (which now implies VMA locking) before > > > anon_vma_lock_write to match the locking order of page fault handler. > > > > Does this changelog assume per vma locking in the #PF? > > Hmm, you are right. Page fault handlers do not use per-vma locks yet > but the changelog already talks about that. Maybe I should change it > to simply: > ``` > Move VMA flag modification (which now implies VMA locking) before > vma_adjust_trans_huge() to ensure the modifications are done after VMA > has been locked. Because Without that additional reasoning it is not really clear why that is needed and seems arbitrary. -- Michal Hocko SUSE Labs
Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it
On Tue 17-01-23 21:54:58, Matthew Wilcox wrote: > On Tue, Jan 17, 2023 at 01:21:47PM -0800, Suren Baghdasaryan wrote: > > On Tue, Jan 17, 2023 at 7:12 AM Michal Hocko wrote: > > > > > > On Tue 17-01-23 16:04:26, Michal Hocko wrote: > > > > On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote: > > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling > > > > > instead of mmap_lock. Because there are cases when multiple VMAs need > > > > > to be exclusively locked during VMA tree modifications, instead of the > > > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA > > > > > lock > > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. > > > > > When > > > > > mmap_write_lock holder is done with all modifications and drops > > > > > mmap_lock, > > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked > > > > > as > > > > > locked. > > > > > > > > I have to say I was struggling a bit with the above and only understood > > > > what you mean by reading the patch several times. I would phrase it like > > > > this (feel free to use if you consider this to be an improvement). > > > > > > > > Introduce a per-VMA rw_semaphore. The lock implementation relies on > > > > per-vma and per-mm sequence counters to note exclusive locking: > > > > - read lock - (implemented by vma_read_trylock) requires the > > > > vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to > > > > differ. If they match then there must be a vma exclusive lock > > > > held somewhere. > > > > - read unlock - (implemented by vma_read_unlock) is a trivial > > > > vma->lock unlock. > > > > - write lock - (vma_write_lock) requires the mmap_lock to be > > > > held exclusively and the current mm counter is noted to the > > > > vma > > > > side. This will allow multiple vmas to be locked under a > > > > single > > > > mmap_lock write lock (e.g. during vma merging).
The vma > > > > counter > > > > is modified under exclusive vma lock. > > > > > > Didn't realize one more thing. > > > Unlike standard write lock this implementation allows to be > > > called multiple times under a single mmap_lock. In a sense > > > it is more of mark_vma_potentially_modified than a lock. > > > > In the RFC it was called vma_mark_locked() originally and renames were > > discussed in the email thread ending here: > > https://lore.kernel.org/all/621612d7-c537-3971-9520-a3dec7b43...@suse.cz/. > > If other names are preferable I'm open to changing them. > > I don't want to bikeshed this, but rather than locking it seems to be > more: > > vma_start_read() > vma_end_read() > vma_start_write() > vma_end_write() > vma_downgrade_write() > > ... and that these are _implemented_ with locks (in part) is an > implementation detail? Agreed! > Would that reduce people's confusion? Yes I believe that naming it less like a locking primitive will clarify it. vma_{start,end}_[try]read is better indeed. I am wondering about the write side of things because that is where things get confusing. There is no explicit write lock nor unlock. vma_start_write sounds better than the vma_write_lock but it still lacks that pairing with vma_end_write which is never the right thing to call. Wouldn't vma_mark_modified and vma_publish_changes describe the scheme better? Downgrade case is probably the least interesting one because that is just one off thing that can be completely hidden from any code besides mmap_write_downgrade so I wouldn't be too concerned about that one. But as you say, no need to bikeshed this too much. Great naming is hard and if the scheme is documented properly we can live with a suboptimal naming as well. -- Michal Hocko SUSE Labs
Re: [PATCH 4/8] perf/core: Add perf_sample_save_brstack() helper
> On 18-Jan-2023, at 11:35 AM, Namhyung Kim wrote: > > When it saves the branch stack to the perf sample data, it needs to > update the sample flags and the dynamic size. To make sure this, > add the perf_sample_save_brstack() helper and convert all call sites. > > Cc: linuxppc-dev@lists.ozlabs.org > Cc: x...@kernel.org > Suggested-by: Peter Zijlstra > Acked-by: Jiri Olsa > Tested-by: Jiri Olsa > Signed-off-by: Namhyung Kim Hi Namhyung, Changes looks good to me. Acked-by: Athira Rajeev Thanks Athira > --- > arch/powerpc/perf/core-book3s.c | 3 +- > arch/x86/events/amd/core.c | 6 +-- > arch/x86/events/intel/core.c| 6 +-- > arch/x86/events/intel/ds.c | 9 ++--- > include/linux/perf_event.h | 66 - > kernel/events/core.c| 16 +++- > 6 files changed, 53 insertions(+), 53 deletions(-) > > diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c > index bf318dd9b709..8c1f7def596e 100644 > --- a/arch/powerpc/perf/core-book3s.c > +++ b/arch/powerpc/perf/core-book3s.c > @@ -2313,8 +2313,7 @@ static void record_and_restart(struct perf_event > *event, unsigned long val, > struct cpu_hw_events *cpuhw; > cpuhw = this_cpu_ptr(&cpu_hw_events); > power_pmu_bhrb_read(event, cpuhw); > - data.br_stack = &cpuhw->bhrb_stack; > - data.sample_flags |= PERF_SAMPLE_BRANCH_STACK; > + perf_sample_save_brstack(&data, event, > &cpuhw->bhrb_stack); > } > > if (event->attr.sample_type & PERF_SAMPLE_DATA_SRC && > diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c > index d6f3703e4119..463f3eb8bbd7 100644 > --- a/arch/x86/events/amd/core.c > +++ b/arch/x86/events/amd/core.c > @@ -928,10 +928,8 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs) > if (!x86_perf_event_set_period(event)) > continue; > > - if (has_branch_stack(event)) { > - data.br_stack = &cpuc->lbr_stack; > - data.sample_flags |= PERF_SAMPLE_BRANCH_STACK; > - } > + if (has_branch_stack(event)) > + perf_sample_save_brstack(&data, event, > &cpuc->lbr_stack); > > if (perf_event_overflow(event, 
&data, regs)) > x86_pmu_stop(event, 0); > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c > index 29d2d0411caf..14f0a746257d 100644 > --- a/arch/x86/events/intel/core.c > +++ b/arch/x86/events/intel/core.c > @@ -3036,10 +3036,8 @@ static int handle_pmi_common(struct pt_regs *regs, u64 > status) > > perf_sample_data_init(&data, 0, event->hw.last_period); > > - if (has_branch_stack(event)) { > - data.br_stack = &cpuc->lbr_stack; > - data.sample_flags |= PERF_SAMPLE_BRANCH_STACK; > - } > + if (has_branch_stack(event)) > + perf_sample_save_brstack(&data, event, > &cpuc->lbr_stack); > > if (perf_event_overflow(event, &data, regs)) > x86_pmu_stop(event, 0); > diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c > index 158cf845fc80..07c8a2cdc3ee 100644 > --- a/arch/x86/events/intel/ds.c > +++ b/arch/x86/events/intel/ds.c > @@ -1720,10 +1720,8 @@ static void setup_pebs_fixed_sample_data(struct > perf_event *event, > data->sample_flags |= PERF_SAMPLE_TIME; > } > > - if (has_branch_stack(event)) { > - data->br_stack = &cpuc->lbr_stack; > - data->sample_flags |= PERF_SAMPLE_BRANCH_STACK; > - } > + if (has_branch_stack(event)) > + perf_sample_save_brstack(data, event, &cpuc->lbr_stack); > } > > static void adaptive_pebs_save_regs(struct pt_regs *regs, > @@ -1883,8 +1881,7 @@ static void setup_pebs_adaptive_sample_data(struct > perf_event *event, > > if (has_branch_stack(event)) { > intel_pmu_store_pebs_lbrs(lbr); > - data->br_stack = &cpuc->lbr_stack; > - data->sample_flags |= PERF_SAMPLE_BRANCH_STACK; > + perf_sample_save_brstack(data, event, &cpuc->lbr_stack); > } > } > > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > index 569dfac5887f..7db0e9cc2682 100644 > --- a/include/linux/perf_event.h > +++ b/include/linux/perf_event.h > @@ -1102,6 +1102,31 @@ extern u64 perf_event_read_value(struct perf_event > *event, > > extern struct perf_callchain_entry *perf_callchain(struct perf_event *event, > struct pt_regs 
*regs); > > +static inline bool branch_sample_no_flags(const struct perf_event *event) > +{ > + return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_NO_FLAGS; > +} > + > +static inline bool branch_sample_no_cycles(const struct perf_
Re: Memory transaction instructions
Linus Torvalds wrote: > And for the kernel, where we don't have bad locking, and where we > actually use fine-grained locks that are _near_ the data that we are > locking (the lockref of the dcache is obviously one example of that, > but the skbuff queue you mention is almost certainly exactly the same > situation): the lock is right by the data that the lock protects, and > the "shared lock cacheline" model simply does not work. You'll bounce > the data, and most likely you'll also touch the same lock cacheline > too. Yeah. The reason I was actually wondering about them was if it would be possible to avoid the requirement to disable interrupts/softirqs to, say, modify the skbuff queue. On some arches actually disabling irqs is quite a heavy operation (I think this is/was true on ppc64, for example; it certainly was on frv) and it was necessary to "emulate" the disablement. David
Re: crypto: p10-aes-gcm - Add asm markings necessary for kernel code
Hi Stephen, On 18/01/23 08:34, Stephen Rothwell wrote: Hi Herbert, On Tue, 17 Jan 2023 15:26:24 +0800 Herbert Xu wrote: On Tue, Jan 17, 2023 at 02:47:47PM +1100, Stephen Rothwell wrote: Hi all, After merging the crypto tree, today's linux-next build (powerpc pseries_le_defconfig) failed like this: arch/powerpc/crypto/p10_aes_gcm.o: warning: objtool: .text+0x884: unannotated intra-function call arch/powerpc/crypto/aesp8-ppc.o: warning: objtool: aes_p8_set_encrypt_key+0x44: unannotated intra-function call ld: arch/powerpc/crypto/p10_aes_gcm.o: ABI version 1 is not compatible with ABI version 2 output ld: failed to merge target specific data of file arch/powerpc/crypto/p10_aes_gcm.o Caused by commit ca68a96c37eb ("crypto: p10-aes-gcm - An accelerated AES/GCM stitched implementation") I have applied the following hack for today. Thanks Stephen, I'm going to update the previous fix as follows: I still get: arch/powerpc/crypto/aesp8-ppc.o: warning: objtool: aes_p8_set_encrypt_key+0x44: unannotated intra-function call from the powerpc pseries_le_defconfig build (which is otherwise ok). Warnings [1], [2], and [3] are seen with pseries_le_defconfig. [1] - arch/powerpc/crypto/aesp8-ppc.o: warning: objtool: aes_p8_set_encrypt_key+0x44: unannotated intra-function call [2] - arch/powerpc/crypto/aesp8-ppc.o: warning: objtool: .text+0x2448: unannotated intra-function call [3] - arch/powerpc/crypto/aesp8-ppc.o: warning: objtool: .text+0x2d68: unannotated intra-function call Given that there are no calls to _mcount, one way to address this warning, is by skipping objtool from running on arch/powerpc/crypto/aesp8-ppc.o file. The below diff works for me. 
= diff --git a/arch/powerpc/crypto/Makefile b/arch/powerpc/crypto/Makefile index 5b8252013abd..d00664c8d761 100644 --- a/arch/powerpc/crypto/Makefile +++ b/arch/powerpc/crypto/Makefile @@ -31,3 +31,5 @@ targets += aesp8-ppc.S ghashp8-ppc.S $(obj)/aesp8-ppc.S $(obj)/ghashp8-ppc.S: $(obj)/%.S: $(src)/%.pl FORCE $(call if_changed,perl) + +OBJECT_FILES_NON_STANDARD_aesp8-ppc.o := y = The other way to fix these warnings is by using ANNOTATE_INTRA_FUNCTION_CALL macro to indicate that the branch target is valid. And, by annotating symbols with SYM_FUNC_START_LOCAL and SYM_FUNC_END macros. The below diff works for me: = diff --git a/arch/powerpc/crypto/aesp8-ppc.pl b/arch/powerpc/crypto/aesp8-ppc.pl index cdbcf6e13efc..355e0036869a 100644 --- a/arch/powerpc/crypto/aesp8-ppc.pl +++ b/arch/powerpc/crypto/aesp8-ppc.pl @@ -80,6 +80,9 @@ # POWER8[le] 3.96/0.72 0.741.1 # POWER8[be] 3.75/0.65 0.661.0 +print "#include \n"; +print "#include \n"; + $flavour = shift; if ($flavour =~ /64/) { @@ -185,7 +188,8 @@ Lset_encrypt_key: lis r0,0xfff0 mfspr $vrsave,256 mtspr 256,r0 - + + ANNOTATE_INTRA_FUNCTION_CALL bl Lconsts mtlrr11 @@ -3039,7 +3043,7 @@ Lxts_enc6x_ret: .long 0 .align 5 -_aesp8_xts_enc5x: +SYM_FUNC_START_LOCAL(_aesp8_xts_enc5x) vcipher $out0,$out0,v24 vcipher $out1,$out1,v24 vcipher $out2,$out2,v24 @@ -3121,6 +3125,7 @@ _aesp8_xts_enc5x: blr .long 0 .byte 0,12,0x14,0,0,0,0,0 +SYM_FUNC_END(_aesp8_xts_enc5x) .align 5 _aesp8_xts_decrypt6x: @@ -3727,7 +3732,7 @@ Lxts_dec6x_ret: .long 0 .align 5 -_aesp8_xts_dec5x: +SYM_FUNC_START_LOCAL(_aesp8_xts_dec5x) vncipher$out0,$out0,v24 vncipher$out1,$out1,v24 vncipher$out2,$out2,v24 @@ -3809,6 +3814,7 @@ _aesp8_xts_dec5x: blr .long 0 .byte 0,12,0x14,0,0,0,0,0 +SYM_FUNC_END(_aesp8_xts_dec5x) ___ }} }}} = Thanks, Sathvika
[PATCH v6 5/5] powerpc/64s/radix: combine final TLB flush and lazy tlb mm shootdown IPIs
** Not for merge ** CONFIG_MMU_LAZY_TLB_SHOOTDOWN requires IPIs to clear the "lazy tlb" references to an mm that is being freed. With the radix MMU, the final userspace exit TLB flush can be performed with IPIs, and those IPIs can also clear lazy tlb mm references, which mostly eliminates the final IPIs required by MMU_LAZY_TLB_SHOOTDOWN. This does mean the final TLB flush is not done with TLBIE, which can be faster than IPI+TLBIEL, but we would have to do those IPIs for lazy shootdown so using TLBIEL should be a win. The final cpumask test and possible IPIs are still needed to clean up some rare race cases. We could prevent those entirely (e.g., prevent new lazy tlb mm references if userspace has gone away, or move the final TLB flush later), but I'd have to see actual numbers that matter before adding any more complexity for it. I can't imagine it would ever be worthwhile. This takes lazy tlb mm shootdown IPI interrupts from 314 to 3 on a 144 CPU system doing a kernel compile. It also takes care of the one potential problem workload which is a short-lived process with multiple CPU-bound threads that want to be spread to other CPUs, because the mm exit happens after the process is back to single-threaded. --- arch/powerpc/mm/book3s64/radix_tlb.c | 26 +- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c index 282359ab525b..f34b78cb4c7d 100644 --- a/arch/powerpc/mm/book3s64/radix_tlb.c +++ b/arch/powerpc/mm/book3s64/radix_tlb.c @@ -1303,7 +1303,31 @@ void radix__tlb_flush(struct mmu_gather *tlb) * See the comment for radix in arch_exit_mmap(). */ if (tlb->fullmm || tlb->need_flush_all) { - __flush_all_mm(mm, true); + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) { + /* +* Shootdown based lazy tlb mm refcounting means we +* have to IPI everyone in the mm_cpumask anyway soon +* when the mm goes away, so might as well do it as +* part of the final flush now.
+* +* If lazy shootdown was improved to reduce IPIs (e.g., +* by batching), then it may end up being better to use +* tlbies here instead. +*/ + smp_mb(); /* see radix__flush_tlb_mm */ + exit_flush_lazy_tlbs(mm); + _tlbiel_pid(mm->context.id, RIC_FLUSH_ALL); + + /* +* It should not be possible to have coprocessors still +* attached here. +*/ + if (WARN_ON_ONCE(atomic_read(&mm->context.copros) > 0)) + __flush_all_mm(mm, true); + } else { + __flush_all_mm(mm, true); + } + } else if ( (psize = radix_get_mmu_psize(page_size)) == -1) { if (!tlb->freed_tables) radix__flush_tlb_mm(mm); -- 2.37.2
[PATCH v6 4/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
On a 16-socket 192-core POWER8 system, a context switching benchmark with as many software threads as CPUs (so each switch will go in and out of idle), upstream can achieve a rate of about 1 million context switches per second, due to contention on the mm refcount. 64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable the option. This increases the above benchmark to 118 million context switches per second. This generates 314 additional IPI interrupts on a 144 CPU system doing a kernel compile, which is in the noise in terms of kernel cycles. Signed-off-by: Nicholas Piggin --- arch/powerpc/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index b8c4ac56bddc..600ace5a7f1a 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -265,6 +265,7 @@ config PPC select MMU_GATHER_PAGE_SIZE select MMU_GATHER_RCU_TABLE_FREE select MMU_GATHER_MERGE_VMAS + select MMU_LAZY_TLB_SHOOTDOWN if PPC_BOOK3S_64 select MODULES_USE_ELF_RELA select NEED_DMA_MAP_STATE if PPC64 || NOT_COHERENT_CACHE select NEED_PER_CPU_EMBED_FIRST_CHUNK if PPC64 -- 2.37.2
[PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme
On big systems, the mm refcount can become highly contended when doing a lot of context switching with threaded applications (particularly switching between the idle thread and an application thread). Abandoning lazy tlb slows switching down quite a bit in the important user->idle->user cases, so instead implement a non-refcounted scheme that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down any remaining lazy ones. Shootdown IPI cost could be an issue, but they have not been observed to be a serious problem with this scheme, because short-lived processes tend not to migrate CPUs much, therefore they don't get much chance to leave lazy tlb mm references on remote CPUs. There are a lot of options to reduce them if necessary. Signed-off-by: Nicholas Piggin --- arch/Kconfig | 15 kernel/fork.c | 65 +++ 2 files changed, 80 insertions(+) diff --git a/arch/Kconfig b/arch/Kconfig index b07d36f08fea..f7da34e4bc62 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -481,6 +481,21 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM # already). config MMU_LAZY_TLB_REFCOUNT def_bool y + depends on !MMU_LAZY_TLB_SHOOTDOWN + +# This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an +# mm as a lazy tlb beyond its last reference count, by shooting down these +# users before the mm is deallocated. __mmdrop() first IPIs all CPUs that may +# be using the mm as a lazy tlb, so that they may switch themselves to using +# init_mm for their active mm. mm_cpumask(mm) is used to determine which CPUs +# may be using mm as a lazy tlb mm. +# +# To implement this, an arch *must*: +# - At the time of the final mmdrop of the mm, ensure mm_cpumask(mm) contains +# at least all possible CPUs in which the mm is lazy. +# - It must meet the requirements for MMU_LAZY_TLB_REFCOUNT=n (see above).
+config MMU_LAZY_TLB_SHOOTDOWN + bool config ARCH_HAVE_NMI_SAFE_CMPXCHG bool diff --git a/kernel/fork.c b/kernel/fork.c index 9f7fe3541897..263660e78c2a 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -780,6 +780,67 @@ static void check_mm(struct mm_struct *mm) #define allocate_mm() (kmem_cache_alloc(mm_cachep, GFP_KERNEL)) #define free_mm(mm)(kmem_cache_free(mm_cachep, (mm))) +static void do_check_lazy_tlb(void *arg) +{ + struct mm_struct *mm = arg; + + WARN_ON_ONCE(current->active_mm == mm); +} + +static void do_shoot_lazy_tlb(void *arg) +{ + struct mm_struct *mm = arg; + + if (current->active_mm == mm) { + WARN_ON_ONCE(current->mm); + current->active_mm = &init_mm; + switch_mm(mm, &init_mm, current); + } +} + +static void cleanup_lazy_tlbs(struct mm_struct *mm) +{ + if (!IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) { + /* +* In this case, lazy tlb mms are refounted and would not reach +* __mmdrop until all CPUs have switched away and mmdrop()ed. +*/ + return; + } + + /* +* Lazy TLB shootdown does not refcount "lazy tlb mm" usage, rather it +* requires lazy mm users to switch to another mm when the refcount +* drops to zero, before the mm is freed. This requires IPIs here to +* switch kernel threads to init_mm. +* +* archs that use IPIs to flush TLBs can piggy-back that lazy tlb mm +* switch with the final userspace teardown TLB flush which leaves the +* mm lazy on this CPU but no others, reducing the need for additional +* IPIs here. There are cases where a final IPI is still required here, +* such as the final mmdrop being performed on a different CPU than the +* one exiting, or kernel threads using the mm when userspace exits. 
+* +* IPI overheads have not found to be expensive, but they could be +* reduced in a number of possible ways, for example (roughly +* increasing order of complexity): +* - The last lazy reference created by exit_mm() could instead switch +* to init_mm, however it's probable this will run on the same CPU +* immediately afterwards, so this may not reduce IPIs much. +* - A batch of mms requiring IPIs could be gathered and freed at once. +* - CPUs store active_mm where it can be remotely checked without a +* lock, to filter out false-positives in the cpumask. +* - After mm_users or mm_count reaches zero, switching away from the +* mm could clear mm_cpumask to reduce some IPIs, perhaps together +* with some batching or delaying of the final IPIs. +* - A delayed freeing and RCU-like quiescing sequence based on mm +* switching to avoid IPIs completely. +*/ + on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1); + if (IS_ENABLED(CONFIG_DEBUG_VM)) + on_each_cpu(do_check_lazy
[PATCH v6 2/5] lazy tlb: allow lazy tlb mm refcounting to be configurable
Add CONFIG_MMU_LAZY_TLB_REFCOUNT which enables refcounting of the lazy tlb mm when it is context switched. This can be disabled by architectures that don't require this refcounting if they clean up lazy tlb mms when the last refcount is dropped. Currently this is always enabled, which is what existing code does, so the patch is effectively a no-op. Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is. Signed-off-by: Nicholas Piggin --- Documentation/mm/active_mm.rst | 6 ++ arch/Kconfig | 17 + include/linux/sched/mm.h | 18 +++--- kernel/sched/core.c| 22 ++ kernel/sched/sched.h | 4 +++- 5 files changed, 59 insertions(+), 8 deletions(-) diff --git a/Documentation/mm/active_mm.rst b/Documentation/mm/active_mm.rst index 6f8269c284ed..2b0d08332400 100644 --- a/Documentation/mm/active_mm.rst +++ b/Documentation/mm/active_mm.rst @@ -4,6 +4,12 @@ Active MM = +Note, the mm_count refcount may no longer include the "lazy" users +(running tasks with ->active_mm == mm && ->mm == NULL) on kernels +with CONFIG_MMU_LAZY_TLB_REFCOUNT=n. Taking and releasing these lazy +references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb() +helpers which abstracts this config option. + :: List: linux-kernel diff --git a/arch/Kconfig b/arch/Kconfig index 12e3ddabac9d..b07d36f08fea 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -465,6 +465,23 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM irqs disabled over activate_mm. Architectures that do IPI based TLB shootdowns should enable this. +# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references. +# MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching +# to/from kernel threads when the same mm is running on a lot of CPUs (a large +# multi-threaded application), by reducing contention on the mm refcount. +# +# This can be disabled if the architecture ensures no CPUs are using an mm as a +# "lazy tlb" beyond its final refcount (i.e., by the time __mmdrop frees the mm +# or its kernel page tables).
This could be arranged by arch_exit_mmap(), or +# final exit(2) TLB flush, for example. +# +# To implement this, an arch *must*: +# Ensure the _lazy_tlb variants of mmgrab/mmdrop are used when dropping the +# lazy reference of a kthread's ->active_mm (non-arch code has been converted +# already). +config MMU_LAZY_TLB_REFCOUNT + def_bool y + config ARCH_HAVE_NMI_SAFE_CMPXCHG bool diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 5376caf6fcf3..68bbe8d90c2e 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -82,17 +82,29 @@ static inline void mmdrop_sched(struct mm_struct *mm) /* Helpers for lazy TLB mm refcounting */ static inline void mmgrab_lazy_tlb(struct mm_struct *mm) { - mmgrab(mm); + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) + mmgrab(mm); } static inline void mmdrop_lazy_tlb(struct mm_struct *mm) { - mmdrop(mm); + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) { + mmdrop(mm); + } else { + /* +* mmdrop_lazy_tlb must provide a full memory barrier, see the +* membarrier comment finish_task_switch which relies on this. 
+*/ + smp_mb(); + } } static inline void mmdrop_lazy_tlb_sched(struct mm_struct *mm) { - mmdrop_sched(mm); + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) + mmdrop_sched(mm); + else + smp_mb(); // see above } /** diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 26aaa974ee6d..1ea14d849a0d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5081,7 +5081,7 @@ static struct rq *finish_task_switch(struct task_struct *prev) __releases(rq->lock) { struct rq *rq = this_rq(); - struct mm_struct *mm = rq->prev_mm; + struct mm_struct *mm = NULL; unsigned int prev_state; /* @@ -5100,7 +5100,10 @@ static struct rq *finish_task_switch(struct task_struct *prev) current->comm, current->pid, preempt_count())) preempt_count_set(FORK_PREEMPT_COUNT); - rq->prev_mm = NULL; +#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT + mm = rq->prev_lazy_mm; + rq->prev_lazy_mm = NULL; +#endif /* * A task struct has one reference for the use as "current". @@ -5231,9 +5234,20 @@ context_switch(struct rq *rq, struct task_struct *prev, lru_gen_use_mm(next->mm); if (!prev->mm) {// from kernel - /* will mmdrop_lazy_tlb() in finish_task_switch(). */ - rq->prev_mm = prev->active_mm; +#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT + /* Will mmdrop_lazy_tlb() in finish_task_switch(). */ + rq->prev_lazy_m
[PATCH v6 1/5] lazy tlb: introduce lazy tlb mm refcount helper functions
Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting. This
makes the lazy tlb mm references more obvious, and allows the refcounting
scheme to be modified in later changes.

The only functional change is in kthread_use_mm/kthread_unuse_mm, because they
are clever with refcounting: if the kthread's lazy tlb mm (active_mm) happens
to be the same as the mm to be used, the code doesn't touch the refcount but
rather transfers the lazy refcount to the used-mm refcount. If the lazy tlb mm
refcount is no longer equivalent to the regular refcount, this trick can not
be used: mmgrab a regular reference on the mm to use, and mmdrop_lazy_tlb the
previous active_mm.

Signed-off-by: Nicholas Piggin
---
 arch/arm/mach-rpc/ecard.c            |  2 +-
 arch/powerpc/kernel/smp.c            |  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c |  4 ++--
 fs/exec.c                            |  2 +-
 include/linux/sched/mm.h             | 16
 kernel/cpu.c                         |  2 +-
 kernel/exit.c                        |  2 +-
 kernel/kthread.c                     | 21 +
 kernel/sched/core.c                  | 15 ---
 9 files changed, 44 insertions(+), 22 deletions(-)

diff --git a/arch/arm/mach-rpc/ecard.c b/arch/arm/mach-rpc/ecard.c
index 53813f9464a2..c30df1097c52 100644
--- a/arch/arm/mach-rpc/ecard.c
+++ b/arch/arm/mach-rpc/ecard.c
@@ -253,7 +253,7 @@ static int ecard_init_mm(void)
 	current->mm = mm;
 	current->active_mm = mm;
 	activate_mm(active_mm, mm);
-	mmdrop(active_mm);
+	mmdrop_lazy_tlb(active_mm);
 	ecard_init_pgtables(mm);
 	return 0;
 }
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6b90f10a6c81..7db6b3faea65 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1611,7 +1611,7 @@ void start_secondary(void *unused)
 	if (IS_ENABLED(CONFIG_PPC32))
 		setup_kup();
 
-	mmgrab(&init_mm);
+	mmgrab_lazy_tlb(&init_mm);
 	current->active_mm = &init_mm;
 
 	smp_store_cpu_info(cpu);
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 4e29b619578c..282359ab525b 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -794,10
+794,10 @@ void exit_lazy_flush_tlb(struct mm_struct *mm, bool always_flush)
 	if (current->active_mm == mm) {
 		WARN_ON_ONCE(current->mm != NULL);
 		/* Is a kernel thread and is using mm as the lazy tlb */
-		mmgrab(&init_mm);
+		mmgrab_lazy_tlb(&init_mm);
 		current->active_mm = &init_mm;
 		switch_mm_irqs_off(mm, &init_mm, current);
-		mmdrop(mm);
+		mmdrop_lazy_tlb(mm);
 	}
 
 	/*
diff --git a/fs/exec.c b/fs/exec.c
index ab913243a367..1a32a88db173 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1033,7 +1033,7 @@ static int exec_mmap(struct mm_struct *mm)
 		mmput(old_mm);
 		return 0;
 	}
-	mmdrop(active_mm);
+	mmdrop_lazy_tlb(active_mm);
 	return 0;
 }
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2a243616f222..5376caf6fcf3 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -79,6 +79,22 @@ static inline void mmdrop_sched(struct mm_struct *mm)
 }
 #endif
 
+/* Helpers for lazy TLB mm refcounting */
+static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
+{
+	mmgrab(mm);
+}
+
+static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
+{
+	mmdrop(mm);
+}
+
+static inline void mmdrop_lazy_tlb_sched(struct mm_struct *mm)
+{
+	mmdrop_sched(mm);
+}
+
 /**
  * mmget() - Pin the address space associated with a &struct mm_struct.
  * @mm: The address space to pin.
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 6c0a92ca6bb5..189895288d9d 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -623,7 +623,7 @@ static int finish_cpu(unsigned int cpu)
 	 */
 	if (mm != &init_mm)
 		idle->active_mm = &init_mm;
-	mmdrop(mm);
+	mmdrop_lazy_tlb(mm);
 	return 0;
 }
diff --git a/kernel/exit.c b/kernel/exit.c
index 15dc2ec80c46..1a4608d765e4 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -537,7 +537,7 @@ static void exit_mm(void)
 		return;
 	sync_mm_rss(mm);
 	mmap_read_lock(mm);
-	mmgrab(mm);
+	mmgrab_lazy_tlb(mm);
 	BUG_ON(mm != current->active_mm);
 	/* more a memory barrier than a real lock */
 	task_lock(current);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index f97fd01a2932..691b213e578f 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1410,14 +1410,19 @@ void kthread_use_mm(struct mm_struct *mm)
 	WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
 	WARN_ON_ONCE(tsk->mm);
 
+	/*
+	 * It's possible that tsk->active_mm == mm here, but we must
+	 * still mmgrab(mm) and mmdrop_lazy_
[PATCH v6 0/5] shoot lazy tlbs
It's time for that annual flamewar. Nothing really changed in core code to clean things up or make it better for x86 last year, so I think we can table that objection. IIRC the situation left off with Andy proposing a different approach, and Linus preferring to shoot the lazies at exit time (piggybacking on the TLB flush IPI), which is what this series allows an arch to do. Discussion thread here:

https://lore.kernel.org/linux-arch/7c9c388c388df8e88bb5d14828053ac0cb11cf69.1641659630.git.l...@kernel.org/

I don't think there has been any movement on this or other alternatives, or code cleanups since then, but correct me if I'm wrong.

Since v5 of this series, there has just been a minor rebase to upstream, and some tweaking of comments and code style. No functional changes.

Also included is patch 5, which is the optimisation that combines the final TLB shootdown with the lazy tlb mm shootdown IPIs. It is included because Linus expected to see it. It works fine, but I have some other powerpc changes I would like to go ahead of it, so I would like to take those through the powerpc tree. And actually giving it a release cycle without that optimisation will help stress test the final IPI cleanup path too, which I would like.

Even without the last patch, the additional IPIs caused by shoot lazy are down in the noise, so I'm not too concerned about it.
Thanks,
Nick

Nicholas Piggin (5):
  lazy tlb: introduce lazy tlb mm refcount helper functions
  lazy tlb: allow lazy tlb mm refcounting to be configurable
  lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling
    scheme
  powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
  powerpc/64s/radix: combine final TLB flush and lazy tlb mm shootdown
    IPIs

 Documentation/mm/active_mm.rst       |  6 +++
 arch/Kconfig                         | 32 ++
 arch/arm/mach-rpc/ecard.c            |  2 +-
 arch/powerpc/Kconfig                 |  1 +
 arch/powerpc/kernel/smp.c            |  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c | 30 +++--
 fs/exec.c                            |  2 +-
 include/linux/sched/mm.h             | 28
 kernel/cpu.c                         |  2 +-
 kernel/exit.c                        |  2 +-
 kernel/fork.c                        | 65
 kernel/kthread.c                     | 21 +
 kernel/sched/core.c                  | 35 ++-
 kernel/sched/sched.h                 |  4 +-
 14 files changed, 205 insertions(+), 27 deletions(-)

-- 
2.37.2