Re: [PATCH 04/17] arch, mm: move definition of node_data to generic code
On Tue, 16 Jul 2024, Mike Rapoport wrote: From: "Mike Rapoport (Microsoft)" Every architecture that supports NUMA defines node_data in the same way: struct pglist_data *node_data[MAX_NUMNODES]; No reason to keep multiple copies of this definition and its forward declarations, especially when such a forward declaration is the only thing in include/asm/mmzone.h for many architectures. Add the definition and declaration of node_data to generic code and drop the architecture-specific versions. Signed-off-by: Mike Rapoport (Microsoft) Nice cleanup. Acked-by: Davidlohr Bueso
Re: [PATCH v3 00/35] Per-VMA locks
On Fri, 24 Feb 2023, freak07 wrote: Here are some measurements from a Pixel 7 Pro that's running a kernel either with the Per-VMA locks patchset or without. If there's interest I can provide results of other specific apps as well. Results are from consecutive cold app launches issued with "am start" command spawning the main activity of Slack Android app. https://docs.google.com/spreadsheets/d/1ktujfcyDmIJoQMWsoizGIE-0A_jMS9VMw_seehUY9s0/edit?usp=sharing There's quite a noticeable improvement, as can be seen in the graph. The results were reproducible. Thanks for sharing. I am not surprised - after all, per-vma locks narrow some of the performance gaps between vanilla and speculative pfs, with less complexity (albeit this is now a 35 patch series :). Thanks, Davidlohr
Re: [PATCH v4 3/7] mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK
On Thu, 26 Jan 2023, Suren Baghdasaryan wrote: To simplify the usage of VM_LOCKED_CLEAR_MASK in vm_flags_clear(), replace it with VM_LOCKED_MASK bitmask and convert all users. Might be good to mention explicitly no change in semantics, but otherwise lgtm Reviewed-by: Davidlohr Bueso
Re: [PATCH v3 2/7] mm: introduce vma->vm_flags wrapper functions
On Wed, 25 Jan 2023, Andrew Morton wrote: On Wed, 25 Jan 2023 15:35:49 -0800 Suren Baghdasaryan wrote: vm_flags are among VMA attributes which affect decisions like VMA merging and splitting. Therefore all vm_flags modifications are performed after taking exclusive mmap_lock to prevent vm_flags updates racing with such operations. Introduce modifier functions for vm_flags to be used whenever flags are updated. This way we can better check and control correct locking behavior during these updates. ... --- a/include/linux/mm.h +++ b/include/linux/mm.h +static inline void init_vm_flags(struct vm_area_struct *vma, +static inline void reset_vm_flags(struct vm_area_struct *vma, +static inline void set_vm_flags(struct vm_area_struct *vma, +static inline void clear_vm_flags(struct vm_area_struct *vma, +static inline void mod_vm_flags(struct vm_area_struct *vma, vm_flags_init(), vm_flags_reset(), etc? This would be more idiomatic and I do think the most-significant-first naming style is preferable. I tend to prefer this naming yes, but lgtm regardless. Reviewed-by: Davidlohr Bueso
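For context, the shape these helpers eventually took is simply an assertion plus the update. A minimal sketch, using the vm_flags_set()/vm_flags_clear() naming suggested here (the merged versions also hook into the per-VMA write lock, elided below):

static inline void vm_flags_set(struct vm_area_struct *vma, vm_flags_t flags)
{
	mmap_assert_write_locked(vma->vm_mm);	/* exclusive mmap_lock held */
	vma->vm_flags |= flags;
}

static inline void vm_flags_clear(struct vm_area_struct *vma, vm_flags_t flags)
{
	mmap_assert_write_locked(vma->vm_mm);
	vma->vm_flags &= ~flags;
}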
Re: [PATCH 13/41] mm: introduce vma->vm_flags modifier functions
On Wed, 11 Jan 2023, Suren Baghdasaryan wrote: On Wed, Jan 11, 2023 at 8:13 AM Davidlohr Bueso wrote: On Mon, 09 Jan 2023, Suren Baghdasaryan wrote: >To keep vma locking correctness when vm_flags are modified, add modifier >functions to be used whenever flags are updated. How about moving this patch and the ones that follow out of this series, into a preliminary patchset? It would reduce the amount of noise in the per-vma lock changes, which would then only be adding the needed vma_write_lock()ing. How about moving those prerequisite patches to the beginning of the patchset (before maple_tree RCU changes)? I feel like they do belong in the patchset because as a standalone patchset it would be unclear why I'm adding all these accessor functions and introducing this churn. Would that be acceptable? imo the abstraction of vm_flags handling is worth being standalone and is easier to be picked up before a more complex locking scheme change. But either way, it's up to you. Thanks, Davidlohr
Re: [PATCH 13/41] mm: introduce vma->vm_flags modifier functions
On Mon, 09 Jan 2023, Suren Baghdasaryan wrote: To keep vma locking correctness when vm_flags are modified, add modifier functions to be used whenever flags are updated. How about moving this patch and the ones that follow out of this series, into a preliminary patchset? It would reduce the amount of noise in the per-vma lock changes, which would then only be adding the needed vma_write_lock()ing. Thanks, Davidlohr
Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK
On Mon, 09 Jan 2023, Suren Baghdasaryan wrote: This configuration variable will be used to build the support for VMA locking during page fault handling. This is enabled by default on supported architectures with SMP and MMU set. The architecture support is needed since the page fault handler is called from the architecture's page faulting code which needs modifications to handle faults under VMA lock. I don't think that per-vma locking should be something that is user-configurable. It should just be dependent on the arch. So maybe just remove CONFIG_PER_VMA_LOCK? Thanks, Davidlohr
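For reference, the option did end up non-user-visible; the merged mm/Kconfig entry looks roughly like the following (a sketch from memory, wording of the help text may differ):

config PER_VMA_LOCK
	def_bool y
	depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
	help
	  Allow per-vma locking during page fault handling.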
Re: [PATCH 08/10] scsi/ibmvfc: Replace tasklet with work
On Thu, 09 Jun 2022, Sebastian Andrzej Siewior wrote: On 2022-05-30 16:15:10 [-0700], Davidlohr Bueso wrote: diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c index d0eab5700dc5..31b1900489e7 100644 --- a/drivers/scsi/ibmvscsi/ibmvfc.c +++ b/drivers/scsi/ibmvscsi/ibmvfc.c @@ -891,7 +891,7 @@ static void ibmvfc_release_crq_queue(struct ibmvfc_host *vhost) ibmvfc_dbg(vhost, "Releasing CRQ\n"); free_irq(vdev->irq, vhost); - tasklet_kill(&vhost->tasklet); + cancel_work_sync(&vhost->work); s/ {8}/\t/ Is there a reason not to use threaded interrupts? I went with a workqueue here because the resume from suspend also schedules async processing, so threaded irqs didn't seem like a good fit. This is also similar to patch 2 (but in that case I overlooked the resume caller which you pointed out). Thanks, Davidlohr
[PATCH 08/10] scsi/ibmvfc: Replace tasklet with work
Tasklets have long been deprecated as being too heavy on the system by running in irq context - and this is not a performance critical path. If a higher priority process wants to run, it must wait for the tasklet to finish before doing so. Use a workqueue instead and run in task context - albeit with the increased concurrency (tasklets are serialized against themselves), the handler runs under both the vhost's host_lock + crq.q_lock so it should be safe. Cc: Tyrel Datwyler Cc: Michael Ellerman --- drivers/scsi/ibmvscsi/ibmvfc.c | 21 ++++++++++++--------- drivers/scsi/ibmvscsi/ibmvfc.h | 3 ++- 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c index d0eab5700dc5..31b1900489e7 100644 --- a/drivers/scsi/ibmvscsi/ibmvfc.c +++ b/drivers/scsi/ibmvscsi/ibmvfc.c @@ -891,7 +891,7 @@ static void ibmvfc_release_crq_queue(struct ibmvfc_host *vhost) ibmvfc_dbg(vhost, "Releasing CRQ\n"); free_irq(vdev->irq, vhost); - tasklet_kill(&vhost->tasklet); + cancel_work_sync(&vhost->work); do { if (rc) msleep(100); @@ -3689,22 +3689,22 @@ static irqreturn_t ibmvfc_interrupt(int irq, void *dev_instance) spin_lock_irqsave(vhost->host->host_lock, flags); vio_disable_interrupts(to_vio_dev(vhost->dev)); - tasklet_schedule(&vhost->tasklet); + schedule_work(&vhost->work); spin_unlock_irqrestore(vhost->host->host_lock, flags); return IRQ_HANDLED; } /** - * ibmvfc_tasklet - Interrupt handler tasklet + * ibmvfc_work - work handler * @data: ibmvfc host struct * * Returns: * Nothing **/ -static void ibmvfc_tasklet(void *data) +static void ibmvfc_workfn(struct work_struct *work) { - struct ibmvfc_host *vhost = data; - struct vio_dev *vdev = to_vio_dev(vhost->dev); + struct ibmvfc_host *vhost; + struct vio_dev *vdev; struct ibmvfc_crq *crq; struct ibmvfc_async_crq *async; struct ibmvfc_event *evt, *temp; @@ -3712,6 +3712,9 @@ static void ibmvfc_tasklet(void *data) int done = 0; LIST_HEAD(evt_doneq); + vhost = container_of(work, struct ibmvfc_host, work); + vdev = to_vio_dev(vhost->dev); + spin_lock_irqsave(vhost->host->host_lock, flags); spin_lock(vhost->crq.q_lock); while (!done) { @@ -5722,7 +5725,7 @@ static int ibmvfc_init_crq(struct ibmvfc_host *vhost) retrc = 0; - tasklet_init(&vhost->tasklet, (void *)ibmvfc_tasklet, (unsigned long)vhost); + INIT_WORK(&vhost->work, ibmvfc_workfn); if ((rc = request_irq(vdev->irq, ibmvfc_interrupt, 0, IBMVFC_NAME, vhost))) { dev_err(dev, "Couldn't register irq 0x%x. rc=%d\n", vdev->irq, rc); @@ -5738,7 +5741,7 @@ static int ibmvfc_init_crq(struct ibmvfc_host *vhost) return retrc; req_irq_failed: - tasklet_kill(&vhost->tasklet); + cancel_work_sync(&vhost->work); do { rc = plpar_hcall_norets(H_FREE_CRQ, vdev->unit_address); } while (rc == H_BUSY || H_IS_LONG_BUSY(rc)); @@ -6213,7 +6216,7 @@ static int ibmvfc_resume(struct device *dev) spin_lock_irqsave(vhost->host->host_lock, flags); vio_disable_interrupts(vdev); - tasklet_schedule(&vhost->tasklet); + schedule_work(&vhost->work); spin_unlock_irqrestore(vhost->host->host_lock, flags); return 0; }

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.h b/drivers/scsi/ibmvscsi/ibmvfc.h index 3718406e0988..7eca3622a2fa 100644 --- a/drivers/scsi/ibmvscsi/ibmvfc.h +++ b/drivers/scsi/ibmvscsi/ibmvfc.h @@ -12,6 +12,7 @@ #include #include +#include <linux/workqueue.h> #include #define IBMVFC_NAME "ibmvfc" @@ -892,7 +893,7 @@ struct ibmvfc_host { char partition_name[97]; void (*job_step) (struct ibmvfc_host *); struct task_struct *work_thread; - struct tasklet_struct tasklet; + struct work_struct work; struct work_struct rport_add_work_q; wait_queue_head_t init_wait_q; wait_queue_head_t work_wait_q; -- 2.36.1
[PATCH 06/10] scsi/ibmvscsi_tgt: Replace work tasklet with threaded irq
Tasklets have long been deprecated as being too heavy on the system by running in irq context - and this is not a performance critical path. If a higher priority process wants to run, it must wait for the tasklet to finish before doing so. A more suitable equivalent is to convert to a threaded irq instead and deal with the async work in task context. Cc: Michael Cyr Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Davidlohr Bueso --- drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c | 17 +++++++---------- drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.h | 1 - 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c b/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c index eee1a24f7e15..fafadb7158a3 100644 --- a/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c +++ b/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c @@ -2948,9 +2948,8 @@ static irqreturn_t ibmvscsis_interrupt(int dummy, void *data) struct scsi_info *vscsi = data; vio_disable_interrupts(vscsi->dma_dev); - tasklet_schedule(&vscsi->work_task); - return IRQ_HANDLED; + return IRQ_WAKE_THREAD; } /** @@ -3317,7 +3316,7 @@ static int ibmvscsis_rdma(struct ibmvscsis_cmd *cmd, struct scatterlist *sg, * * Note: this is an edge triggered interrupt. It can not be shared. */ -static void ibmvscsis_handle_crq(unsigned long data) +static irqreturn_t ibmvscsis_handle_crq(int irq, void *data) { struct scsi_info *vscsi = (struct scsi_info *)data; struct viosrp_crq *crq; @@ -3340,7 +3339,7 @@ static void ibmvscsis_handle_crq(unsigned long data) dev_dbg(&vscsi->dev, "handle_crq, don't process: flags 0x%x, state 0x%hx\n", vscsi->flags, vscsi->state); spin_unlock_bh(&vscsi->intr_lock); - return; + goto done; } rc = vscsi->flags & SCHEDULE_DISCONNECT; @@ -3417,6 +3416,8 @@ static void ibmvscsis_handle_crq(unsigned long data) vscsi->state); spin_unlock_bh(&vscsi->intr_lock); +done: + return IRQ_HANDLED; } static int ibmvscsis_probe(struct vio_dev *vdev, @@ -3530,9 +3531,6 @@ static int ibmvscsis_probe(struct vio_dev *vdev, dev_dbg(&vscsi->dev, "probe hrc %ld, client partition num %d\n", hrc, vscsi->client_data.partition_number); - tasklet_init(&vscsi->work_task, ibmvscsis_handle_crq, - (unsigned long)vscsi); - init_completion(&vscsi->wait_idle); init_completion(&vscsi->unconfig); @@ -3544,7 +3542,8 @@ static int ibmvscsis_probe(struct vio_dev *vdev, goto unmap_buf; } - rc = request_irq(vdev->irq, ibmvscsis_interrupt, 0, "ibmvscsis", vscsi); + rc = request_threaded_irq(vdev->irq, ibmvscsis_interrupt, + ibmvscsis_handle_crq, 0, "ibmvscsis", vscsi); if (rc) { rc = -EPERM; dev_err(&vscsi->dev, "probe: request_irq failed, rc %d\n", rc); @@ -3565,7 +3564,6 @@ static int ibmvscsis_probe(struct vio_dev *vdev, free_buf: kfree(vscsi->map_buf); destroy_queue: - tasklet_kill(&vscsi->work_task); ibmvscsis_unregister_command_q(vscsi); ibmvscsis_destroy_command_q(vscsi); free_timer: @@ -3602,7 +3600,6 @@ static void ibmvscsis_remove(struct vio_dev *vdev) dma_unmap_single(&vdev->dev, vscsi->map_ioba, PAGE_SIZE, DMA_BIDIRECTIONAL); kfree(vscsi->map_buf); - tasklet_kill(&vscsi->work_task); ibmvscsis_destroy_command_q(vscsi); ibmvscsis_freetimer(vscsi); ibmvscsis_free_cmds(vscsi);

diff --git a/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.h b/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.h index 7ae074e5d7a1..b66c982b8b00 100644 --- a/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.h +++ b/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.h @@ -295,7 +295,6 @@ struct scsi_info { struct vio_dev *dma_dev; struct srp_target target; struct ibmvscsis_tport tport; - struct tasklet_struct work_task; struct work_struct proc_work; }; -- 2.36.1
[PATCH 09/10] scsi/ibmvscsi: Replace srp tasklet with work
Tasklets have long been deprecated as being too heavy on the system by running in irq context - and this is not a performance critical path. If a higher priority process wants to run, it must wait for the tasklet to finish before doing so. Process srps asynchronously in process context in a dedicated single threaded workqueue. Cc: Tyrel Datwyler Cc: Michael Ellerman --- drivers/scsi/ibmvscsi/ibmvscsi.c | 38 ++++++++++++++++++++++++---------- drivers/scsi/ibmvscsi/ibmvscsi.h | 3 ++- 2 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c index 63f32f843e75..37cbea8bb0af 100644 --- a/drivers/scsi/ibmvscsi/ibmvscsi.c +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c @@ -86,6 +86,8 @@ static DEFINE_SPINLOCK(ibmvscsi_driver_lock); static struct scsi_transport_template *ibmvscsi_transport_template; +static struct workqueue_struct *ibmvscsi_wq; + #define IBMVSCSI_VERSION "1.5.9" MODULE_DESCRIPTION("IBM Virtual SCSI"); @@ -117,7 +119,7 @@ static void ibmvscsi_handle_crq(struct viosrp_crq *crq, * @irq: number of irq to handle, not used * @dev_instance: ibmvscsi_host_data of host that received interrupt * - * Disables interrupts and schedules srp_task + * Disables interrupts and schedules srp_work * Always returns IRQ_HANDLED */ static irqreturn_t ibmvscsi_handle_event(int irq, void *dev_instance) @@ -125,7 +127,7 @@ static irqreturn_t ibmvscsi_handle_event(int irq, void *dev_instance) struct ibmvscsi_host_data *hostdata = (struct ibmvscsi_host_data *)dev_instance; vio_disable_interrupts(to_vio_dev(hostdata->dev)); - tasklet_schedule(&hostdata->srp_task); + queue_work(ibmvscsi_wq, &hostdata->srp_work); return IRQ_HANDLED; } @@ -145,7 +147,7 @@ static void ibmvscsi_release_crq_queue(struct crq_queue *queue, long rc = 0; struct vio_dev *vdev = to_vio_dev(hostdata->dev); free_irq(vdev->irq, (void *)hostdata); - tasklet_kill(&hostdata->srp_task); + cancel_work_sync(&hostdata->srp_work); do { if (rc) msleep(100); @@ -206,16 +208,19 @@ static int ibmvscsi_send_crq(struct ibmvscsi_host_data *hostdata, } /** - * ibmvscsi_task: - Process srps asynchronously + * ibmvscsi_workfn: - Process srps asynchronously * @data: ibmvscsi_host_data of host */ -static void ibmvscsi_task(void *data) +static void ibmvscsi_workfn(struct work_struct *work) { - struct ibmvscsi_host_data *hostdata = (struct ibmvscsi_host_data *)data; - struct vio_dev *vdev = to_vio_dev(hostdata->dev); + struct ibmvscsi_host_data *hostdata; + struct vio_dev *vdev; struct viosrp_crq *crq; int done = 0; + hostdata = container_of(work, struct ibmvscsi_host_data, srp_work); + vdev = to_vio_dev(hostdata->dev); + while (!done) { /* Pull all the valid messages off the CRQ */ while ((crq = crq_queue_next_crq(&hostdata->queue)) != NULL) { @@ -367,8 +372,7 @@ static int ibmvscsi_init_crq_queue(struct crq_queue *queue, queue->cur = 0; spin_lock_init(&queue->lock); - tasklet_init(&hostdata->srp_task, (void *)ibmvscsi_task, - (unsigned long)hostdata); + INIT_WORK(&hostdata->srp_work, ibmvscsi_workfn); if (request_irq(vdev->irq, ibmvscsi_handle_event, @@ -387,7 +391,7 @@ static int ibmvscsi_init_crq_queue(struct crq_queue *queue, return retrc; req_irq_failed: - tasklet_kill(&hostdata->srp_task); + cancel_work_sync(&hostdata->srp_work); rc = 0; do { if (rc) @@ -2371,7 +2375,7 @@ static int ibmvscsi_resume(struct device *dev) { struct ibmvscsi_host_data *hostdata = dev_get_drvdata(dev); vio_disable_interrupts(to_vio_dev(hostdata->dev)); - tasklet_schedule(&hostdata->srp_task); + queue_work(ibmvscsi_wq, &hostdata->srp_work); return 0; } @@ -2418,15 +2422,25 @@ static int __init ibmvscsi_module_init(void) if (!ibmvscsi_transport_template) return -ENOMEM; + ibmvscsi_wq = alloc_ordered_workqueue("ibmvscsi_wq", 0); + if (!ibmvscsi_wq) { + srp_release_transport(ibmvscsi_transport_template); + return -ENOMEM; + } + ret = vio_register_driver(&ibmvscsi_driver); - if (ret) + if (ret) { + destroy_workqueue(ibmvscsi_wq); srp_release_transport(ibmvscsi_transport_template); + } + return ret; } static void __exit ibmvscsi_module_exit(void) { vio_unregister_driver(&ibmvscsi_driver); + destroy_workqueue(ibmvscsi_wq); srp_release_transport(ibmvscsi_transport_template); }

diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.h b/drivers/scsi/ibmvscsi/ibmvscsi.h index e60916ef7a49..f7c52744a206 100644 --- a/drivers/scsi/ibmvscsi/ibmvscsi.h +++ b/drivers/scsi/ibmvscsi/ibmvscsi.h @@ -18,6 +18,7 @@ #include #include #include +#include <linux/workqueue.h> #include #include
Re: [PATCH] xsysace: Remove SYSACE driver
Hi, On Mon, 09 Nov 2020, Michal Simek wrote: Sysace IP is no longer used on Xilinx PowerPC 405/440 and Microblaze systems. The driver is not regularly tested and very likely has not been working for quite a long time, that's why remove it. Is there a reason this patch was never merged? Can the driver be removed? I ran into this as a potential tasklet user that can be replaced/removed. Thanks, Davidlohr Signed-off-by: Michal Simek --- Based on discussion https://lore.kernel.org/linux-arm-kernel/5ab9a2a1-20e3-c7b2-f666-2034df436...@kernel.dk/ I have grepped the kernel and found an old ppc platform. I have included it in this patch to have a discussion about it. --- MAINTAINERS | 1 - arch/microblaze/boot/dts/system.dts | 8 - arch/powerpc/boot/dts/icon.dts | 7 - arch/powerpc/configs/44x/icon_defconfig | 1 - drivers/block/Kconfig | 6 - drivers/block/Makefile | 1 - drivers/block/xsysace.c | 1273 --- 7 files changed, 1297 deletions(-) delete mode 100644 drivers/block/xsysace.c

diff --git a/MAINTAINERS b/MAINTAINERS index cba8ddf87a08..38556c009758 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -2741,7 +2741,6 @@ T: git https://github.com/Xilinx/linux-xlnx.git F: Documentation/devicetree/bindings/i2c/cdns,i2c-r1p10.yaml F: Documentation/devicetree/bindings/i2c/xlnx,xps-iic-2.00.a.yaml F: arch/arm/mach-zynq/ -F: drivers/block/xsysace.c F: drivers/clocksource/timer-cadence-ttc.c F: drivers/cpuidle/cpuidle-zynq.c F: drivers/edac/synopsys_edac.c

diff --git a/arch/microblaze/boot/dts/system.dts b/arch/microblaze/boot/dts/system.dts index 5b236527176e..b7ee1056779e 100644 --- a/arch/microblaze/boot/dts/system.dts +++ b/arch/microblaze/boot/dts/system.dts @@ -310,14 +310,6 @@ RS232_Uart_1: serial@84000000 { xlnx,odd-parity = <0x0>; xlnx,use-parity = <0x0>; } ; - SysACE_CompactFlash: sysace@83600000 { - compatible = "xlnx,xps-sysace-1.00.a"; - interrupt-parent = <&xps_intc_0>; - interrupts = < 4 2 >; - reg = < 0x83600000 0x10000 >; - xlnx,family = "virtex5"; - xlnx,mem-width = <0x10>; - } ; debug_module: debug@84400000 { compatible = "xlnx,mdm-1.00.d"; reg = < 0x84400000 0x10000 >;

diff --git a/arch/powerpc/boot/dts/icon.dts b/arch/powerpc/boot/dts/icon.dts index fbaa60b8f87a..4fd7a4fbb4fb 100644 --- a/arch/powerpc/boot/dts/icon.dts +++ b/arch/powerpc/boot/dts/icon.dts @@ -197,13 +197,6 @@ partition@fa { reg = <0x00fa 0x0006>; }; }; - - SysACE_CompactFlash: sysace@1,0 { - compatible = "xlnx,sysace"; - interrupt-parent = <>; - interrupts = <24 0x4>; - reg = <0x0001 0x 0x1>; - }; }; UART0: serial@f200 {

diff --git a/arch/powerpc/configs/44x/icon_defconfig b/arch/powerpc/configs/44x/icon_defconfig index 930948a1da76..fb9a15573546 100644 --- a/arch/powerpc/configs/44x/icon_defconfig +++ b/arch/powerpc/configs/44x/icon_defconfig @@ -28,7 +28,6 @@ CONFIG_MTD_CFI_AMDSTD=y CONFIG_MTD_PHYSMAP_OF=y CONFIG_BLK_DEV_RAM=y CONFIG_BLK_DEV_RAM_SIZE=35000 -CONFIG_XILINX_SYSACE=y CONFIG_SCSI=y CONFIG_BLK_DEV_SD=y CONFIG_SCSI_CONSTANTS=y

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index ecceaaa1a66f..9cb02861298d 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -388,12 +388,6 @@ config SUNVDC source "drivers/s390/block/Kconfig" -config XILINX_SYSACE - tristate "Xilinx SystemACE support" - depends on 4xx || MICROBLAZE - help - Include support for the Xilinx SystemACE CompactFlash interface - config XEN_BLKDEV_FRONTEND tristate "Xen virtual block device support" depends on XEN

diff --git a/drivers/block/Makefile b/drivers/block/Makefile index e1f63117ee94..5ddd9370972a 100644 --- a/drivers/block/Makefile +++ b/drivers/block/Makefile @@ -19,7 +19,6 @@ obj-$(CONFIG_ATARI_FLOPPY) += ataflop.o obj-$(CONFIG_AMIGA_Z2RAM) += z2ram.o obj-$(CONFIG_BLK_DEV_RAM) += brd.o obj-$(CONFIG_BLK_DEV_LOOP) += loop.o -obj-$(CONFIG_XILINX_SYSACE) += xsysace.o obj-$(CONFIG_CDROM_PKTCDVD) += pktcdvd.o obj-$(CONFIG_SUNVDC) += sunvdc.o obj-$(CONFIG_BLK_DEV_SKD) += skd.o

diff --git a/drivers/block/xsysace.c b/drivers/block/xsysace.c deleted file mode 100644 index eb8ef65778c3..
[PATCH v2] powerpc/qspinlock: Use generic smp_cond_load_relaxed
49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added busy-waiting pausing with a preferred SMT priority pattern, lowering the priority (reducing decode cycles) during the whole loop slowpath. However, data shows that while this pattern works well with simple spinlocks, queued spinlocks benefit more being kept in medium priority, with a cpu_relax() instead, being a low+medium combo on powerpc. Data is from three benchmarks on a Power9: 9008-22L 64 CPUs with 2 sockets and 8 threads per core.

1. locktorture. This is data for the lowest and most artificial/pathological level, with increasing thread counts pounding on the lock. Metrics are total ops/minute. Despite some small hits in the 4-8 range, scenarios are either neutral or favorable to this patch.

+=========+==========+==========+=======+
| # tasks | vanilla  | dirty    | %diff |
+=========+==========+==========+=======+
|       2 | 46718565 | 48751350 |  4.35 |
+---------+----------+----------+-------+
|       4 | 51740198 | 50369082 | -2.65 |
+---------+----------+----------+-------+
|       8 | 63756510 | 62568821 | -1.86 |
+---------+----------+----------+-------+
|      16 | 67824531 | 70966546 |  4.63 |
+---------+----------+----------+-------+
|      32 | 53843519 | 61155508 | 13.58 |
+---------+----------+----------+-------+
|      64 | 53005778 | 53104412 |  0.18 |
+---------+----------+----------+-------+
|     128 | 53331980 | 54606910 |  2.39 |
+=========+==========+==========+=======+

2. sockperf (tcp throughput) Here a client will do one-way throughput tests to a localhost server, with increasing message sizes, dealing with the sk_lock. This patch shows to put the performance of the qspinlock back to par with that of the simple lock:

                simple-spinlock         vanilla                 dirty
Hmean     14     73.50 (   0.00%)      54.44 * -25.93%*      73.45 *  -0.07%*
Hmean     100   654.47 (   0.00%)     385.61 * -41.08%*     771.43 *  17.87%*
Hmean     300  2719.39 (   0.00%)    2181.67 * -19.77%*    2666.50 *  -1.94%*
Hmean     500  4400.59 (   0.00%)    3390.77 * -22.95%*    4322.14 *  -1.78%*
Hmean     850  6726.21 (   0.00%)    5264.03 * -21.74%*    6863.12 *   2.04%*

3. dbench (tmpfs) Configured to run with up to ncpusx8 clients, it shows both latency and throughput metrics. For the latency, with the exception of the 64 case, there is really nothing to go by:

                          vanilla              dirty
Amean     latency-1        1.67 (   0.00%)     1.67 *   0.09%*
Amean     latency-2        2.15 (   0.00%)     2.08 *   3.36%*
Amean     latency-4        2.50 (   0.00%)     2.56 *  -2.27%*
Amean     latency-8        2.49 (   0.00%)     2.48 *   0.31%*
Amean     latency-16       2.69 (   0.00%)     2.72 *  -1.37%*
Amean     latency-32       2.96 (   0.00%)     3.04 *  -2.60%*
Amean     latency-64       7.78 (   0.00%)     8.17 *  -5.07%*
Amean     latency-512    186.91 (   0.00%)   186.41 *   0.27%*

For the dbench4 Throughput (misleading but traditional) there's a small but rather constant improvement:

                          vanilla              dirty
Hmean     1          849.13 (   0.00%)     851.51 *   0.28%*
Hmean     2         1664.03 (   0.00%)    1663.94 *  -0.01%*
Hmean     4         3073.70 (   0.00%)    3104.29 *   1.00%*
Hmean     8         5624.02 (   0.00%)    5694.16 *   1.25%*
Hmean     16        9169.49 (   0.00%)    9324.43 *   1.69%*
Hmean     32       11969.37 (   0.00%)   12127.09 *   1.32%*
Hmean     64       15021.12 (   0.00%)   15243.14 *   1.48%*
Hmean     512      14891.27 (   0.00%)   15162.11 *   1.82%*

Measuring the dbench4 Per-VFS Operation latency, shows some very minor differences within the noise level, around the 0-1% ranges. Fixes: 49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) Acked-by: Nicholas Piggin Signed-off-by: Davidlohr Bueso --- Changes from v1: Added a small description and label for smp_cond_load_relaxed, as requested by Nick. Added Nick's ack.

arch/powerpc/include/asm/barrier.h | 16 ---------------- arch/powerpc/include/asm/qspinlock.h | 7 +++++++ 2 files changed, 7 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h index aecfde829d5d..7ae29cfb06c0 100644 --- a/arch/powerpc/include/asm/barrier.h +++ b/arch/powerpc/include/asm/barrier.h @@ -80,22 +80,6 @@ do { \ ___p1; \ }) -#ifdef CONFIG_PPC64 -#define smp_cond_load_relaxed(ptr, cond_expr) ({ \ - typeof(ptr) __PTR = (ptr); \ - __unqual_scalar_typeof(*ptr) VAL; \ - VAL = READ_ONCE(*__PTR);\ - if (unlikely(!(cond_expr))) { \ - spin_begin
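For reference, with the ppc64 override removed, powerpc picks up the generic definition, which busy-waits with cpu_relax() (paraphrased from include/asm-generic/barrier.h):

#ifndef smp_cond_load_relaxed
#define smp_cond_load_relaxed(ptr, cond_expr) ({	\
	typeof(ptr) __PTR = (ptr);			\
	__unqual_scalar_typeof(*ptr) VAL;		\
	for (;;) {					\
		VAL = READ_ONCE(*__PTR);		\
		if (cond_expr)				\
			break;				\
		cpu_relax();				\
	}						\
	(typeof(*ptr))VAL;				\
})
#endif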
Re: [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed
On Tue, 16 Mar 2021, Nicholas Piggin wrote: One request, could you add a comment in place that references smp_cond_load_relaxed() so this commit can be found again if someone looks at it? Something like this /* * smp_cond_load_relaxed was found to have performance problems if * implemented with spin_begin()/spin_end(). */ Sure, let me see where I can fit that in and send out a v2. Similarly, but unrelated to this patch, is there any chance we could remove the whole spin_until_cond() machinery and make it specific to powerpc? This was introduced in 2017 and doesn't really have any users outside of powerpc, except for these: drivers/firmware/arm_scmi/driver.c: spin_until_cond(scmi_xfer_done_no_timeout(cinfo, xfer, stop)); drivers/firmware/arm_scmi/shmem.c: spin_until_cond(ioread32(&shmem->channel_status) & drivers/net/ethernet/xilinx/ll_temac_main.c: spin_until_cond(hard_acs_rdy_or_timeout(lp, timeout)); ... which afaict only the xilinx one can actually build on powerpc. Regardless, these could be converted to smp_cond_load_relaxed(), being the more standard way to do optimized busy-waiting, caring more about the family of barriers than ad-hoc SMT priorities. Of course, I have no way of testing any of these changes. I wonder if it should have a Fixes: tag to the original commit as well. I'm not sure either. I've actually been informed recently of other workloads that benefit from the revert on large Power9 boxes. So I'll go ahead and add it. Otherwise, Acked-by: Nicholas Piggin Thanks, Davidlohr
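For context, spin_until_cond() is defined in include/linux/processor.h as a thin wrapper around the SMT priority primitives:

#ifndef spin_until_cond
#define spin_until_cond(cond)				\
do {							\
	if (unlikely(!(cond))) {			\
		spin_begin();				\
		do {					\
			spin_cpu_relax();		\
		} while (!(cond));			\
		spin_end();				\
	}						\
} while (0)
#endif

On architectures other than powerpc, spin_begin()/spin_cpu_relax()/spin_end() fall back to cpu_relax(), which is why smp_cond_load_relaxed() can express the same busy-wait with better-defined semantics.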
Re: [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed
On Tue, 09 Mar 2021, Michal Suchánek wrote: On Mon, Mar 08, 2021 at 05:59:50PM -0800, Davidlohr Bueso wrote: 49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added busy-waiting pausing with a preferred SMT priority pattern, lowering the priority (reducing decode cycles) during the whole loop slowpath. However, data shows that while this pattern works well with simple ^^ spinlocks, queued spinlocks benefit more being kept in medium priority, with a cpu_relax() instead, being a low+medium combo on powerpc. ... diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h index aecfde829d5d..7ae29cfb06c0 100644 --- a/arch/powerpc/include/asm/barrier.h +++ b/arch/powerpc/include/asm/barrier.h @@ -80,22 +80,6 @@ do { \ ___p1; \ }) -#ifdef CONFIG_PPC64 Maybe it should be kept for the simple spinlock case then? It is kept, note that simple spinlocks don't use smp_cond_load_relaxed, but instead deal with the priorities in arch_spin_lock(), so it will spin in low priority until it sees a chance to take the lock, where it switches back to medium. Thanks, Davidlohr
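To illustrate the low+medium combo, the simple spinlock slowpath drops to low SMT priority while the lock is busy and returns to medium before retrying the acquisition (paraphrased from arch/powerpc/include/asm/simple_spinlock.h):

static inline void arch_spin_lock(arch_spinlock_t *lock)
{
	while (1) {
		if (likely(__arch_spin_trylock(lock) == 0))
			break;
		do {
			HMT_low();	/* low priority while spinning */
			if (is_shared_processor())
				splpar_spin_yield(lock);
		} while (unlikely(lock->slock != 0));
		HMT_medium();	/* back to medium before retrying */
	}
}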
[PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed
49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added busy-waiting pausing with a preferred SMT priority pattern, lowering the priority (reducing decode cycles) during the whole loop slowpath. However, data shows that while this pattern works well with simple spinlocks, queued spinlocks benefit more being kept in medium priority, with a cpu_relax() instead, being a low+medium combo on powerpc. Data is from three benchmarks on a Power9: 9008-22L 64 CPUs with 2 sockets and 8 threads per core.

1. locktorture. This is data for the lowest and most artificial/pathological level, with increasing thread counts pounding on the lock. Metrics are total ops/minute. Despite some small hits in the 4-8 range, scenarios are either neutral or favorable to this patch.

+=========+==========+==========+=======+
| # tasks | vanilla  | dirty    | %diff |
+=========+==========+==========+=======+
|       2 | 46718565 | 48751350 |  4.35 |
+---------+----------+----------+-------+
|       4 | 51740198 | 50369082 | -2.65 |
+---------+----------+----------+-------+
|       8 | 63756510 | 62568821 | -1.86 |
+---------+----------+----------+-------+
|      16 | 67824531 | 70966546 |  4.63 |
+---------+----------+----------+-------+
|      32 | 53843519 | 61155508 | 13.58 |
+---------+----------+----------+-------+
|      64 | 53005778 | 53104412 |  0.18 |
+---------+----------+----------+-------+
|     128 | 53331980 | 54606910 |  2.39 |
+=========+==========+==========+=======+

2. sockperf (tcp throughput) Here a client will do one-way throughput tests to a localhost server, with increasing message sizes, dealing with the sk_lock. This patch shows to put the performance of the qspinlock back to par with that of the simple lock:

                simple-spinlock         vanilla                 dirty
Hmean     14     73.50 (   0.00%)      54.44 * -25.93%*      73.45 *  -0.07%*
Hmean     100   654.47 (   0.00%)     385.61 * -41.08%*     771.43 *  17.87%*
Hmean     300  2719.39 (   0.00%)    2181.67 * -19.77%*    2666.50 *  -1.94%*
Hmean     500  4400.59 (   0.00%)    3390.77 * -22.95%*    4322.14 *  -1.78%*
Hmean     850  6726.21 (   0.00%)    5264.03 * -21.74%*    6863.12 *   2.04%*

3. dbench (tmpfs) Configured to run with up to ncpusx8 clients, it shows both latency and throughput metrics. For the latency, with the exception of the 64 case, there is really nothing to go by:

                          vanilla              dirty
Amean     latency-1        1.67 (   0.00%)     1.67 *   0.09%*
Amean     latency-2        2.15 (   0.00%)     2.08 *   3.36%*
Amean     latency-4        2.50 (   0.00%)     2.56 *  -2.27%*
Amean     latency-8        2.49 (   0.00%)     2.48 *   0.31%*
Amean     latency-16       2.69 (   0.00%)     2.72 *  -1.37%*
Amean     latency-32       2.96 (   0.00%)     3.04 *  -2.60%*
Amean     latency-64       7.78 (   0.00%)     8.17 *  -5.07%*
Amean     latency-512    186.91 (   0.00%)   186.41 *   0.27%*

For the dbench4 Throughput (misleading but traditional) there's a small but rather constant improvement:

                          vanilla              dirty
Hmean     1          849.13 (   0.00%)     851.51 *   0.28%*
Hmean     2         1664.03 (   0.00%)    1663.94 *  -0.01%*
Hmean     4         3073.70 (   0.00%)    3104.29 *   1.00%*
Hmean     8         5624.02 (   0.00%)    5694.16 *   1.25%*
Hmean     16        9169.49 (   0.00%)    9324.43 *   1.69%*
Hmean     32       11969.37 (   0.00%)   12127.09 *   1.32%*
Hmean     64       15021.12 (   0.00%)   15243.14 *   1.48%*
Hmean     512      14891.27 (   0.00%)   15162.11 *   1.82%*

Measuring the dbench4 Per-VFS Operation latency, shows some very minor differences within the noise level, around the 0-1% ranges.

Signed-off-by: Davidlohr Bueso --- arch/powerpc/include/asm/barrier.h | 16 ---------------- 1 file changed, 16 deletions(-) diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h index aecfde829d5d..7ae29cfb06c0 100644 --- a/arch/powerpc/include/asm/barrier.h +++ b/arch/powerpc/include/asm/barrier.h @@ -80,22 +80,6 @@ do { \ ___p1; \ }) -#ifdef CONFIG_PPC64 -#define smp_cond_load_relaxed(ptr, cond_expr) ({ \ - typeof(ptr) __PTR = (ptr); \ - __unqual_scalar_typeof(*ptr) VAL; \ - VAL = READ_ONCE(*__PTR);\ - if (unlikely(!(cond_expr))) { \ - spin_begin(); \ - do {\ - VAL = READ_ONCE(*__PTR);\ - } while (!(cond_expr)); \ - spin_end
[PATCH 2/3] powerpc/spinlock: Unserialize spin_is_locked
c6f5d02b6a0f (locking/spinlocks/arm64: Remove smp_mb() from arch_spin_is_locked()) made it pretty official that the call semantics do not imply any sort of barriers, and any user that gets creative must explicitly do any serialization. This creativity, however, is nowadays pretty limited: 1. spin_unlock_wait() has been removed from the kernel in favor of a lock/unlock combo. Furthermore, queued spinlocks have now for a number of years no longer relied on _Q_LOCKED_VAL for the call, but any non-zero value to indicate a locked state. There were cases where the delayed locked store could lead to breaking mutual exclusion with crossed locking; such as with sysv ipc and netfilter being the most extreme. 2. The auditing Andrea did verified that the remaining spin_is_locked() callers no longer rely on such semantics. Most callers just use it to assert a lock is taken, in a debug nature. The only user that gets cute is NOLOCK qdisc, as of: 96009c7d500e (sched: replace __QDISC_STATE_RUNNING bit with a spin lock) ... which ironically went in the next day after c6f5d02b6a0f. This change replaces test_bit() with spin_is_locked() to know whether to take the busylock heuristic to reduce contention on the main qdisc lock. So any races against spin_is_locked() for archs that use LL/SC for spin_lock() will be benign and not break any mutual exclusion; furthermore, both the seqlock and busylock have the same scope. Cc: parri.and...@gmail.com Cc: pab...@redhat.com Signed-off-by: Davidlohr Bueso --- arch/powerpc/include/asm/qspinlock.h | 12 ------------ arch/powerpc/include/asm/simple_spinlock.h | 3 +-- 2 files changed, 1 insertion(+), 14 deletions(-) diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h index 3ce1a0bee4fe..b052b0624816 100644 --- a/arch/powerpc/include/asm/qspinlock.h +++ b/arch/powerpc/include/asm/qspinlock.h @@ -44,18 +44,6 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock) } #define queued_spin_lock queued_spin_lock -static __always_inline int queued_spin_is_locked(struct qspinlock *lock) -{ - /* -* This barrier was added to simple spinlocks by commit 51d7d5205d338, -* but it should now be possible to remove it, asm arm64 has done with -* commit c6f5d02b6a0f. -*/ - smp_mb(); - return atomic_read(&lock->val); -} -#define queued_spin_is_locked queued_spin_is_locked - #ifdef CONFIG_PARAVIRT_SPINLOCKS #define SPIN_THRESHOLD (1<<15) /* not tuned */ diff --git a/arch/powerpc/include/asm/simple_spinlock.h b/arch/powerpc/include/asm/simple_spinlock.h index 3e87258f73b1..1b935396522a 100644 --- a/arch/powerpc/include/asm/simple_spinlock.h +++ b/arch/powerpc/include/asm/simple_spinlock.h @@ -38,8 +38,7 @@ static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock) static inline int arch_spin_is_locked(arch_spinlock_t *lock) { - smp_mb(); - return !arch_spin_value_unlocked(*lock); + return !arch_spin_value_unlocked(READ_ONCE(*lock)); } /* -- 2.26.2
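For context, the qdisc usage mentioned above boils down to a contention hint; a sketch of the post-96009c7d500e helper (paraphrased from include/net/sch_generic.h of that era):

static inline bool qdisc_is_running(struct Qdisc *qdisc)
{
	if (qdisc->flags & TCQ_F_NOLOCK)
		return spin_is_locked(&qdisc->seqlock);
	return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
}

The caller in __dev_xmit_skb() only uses the result to decide whether to take the busylock, so a racy read cannot break mutual exclusion.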
[PATCH 1/3] powerpc/spinlock: Define smp_mb__after_spinlock only once
Instead of both queued and simple spinlocks doing it. Move it into the arch's spinlock.h. Signed-off-by: Davidlohr Bueso --- arch/powerpc/include/asm/qspinlock.h | 2 -- arch/powerpc/include/asm/simple_spinlock.h | 3 --- arch/powerpc/include/asm/spinlock.h | 3 +++ 3 files changed, 3 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h index b752d34517b3..3ce1a0bee4fe 100644 --- a/arch/powerpc/include/asm/qspinlock.h +++ b/arch/powerpc/include/asm/qspinlock.h @@ -44,8 +44,6 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock) } #define queued_spin_lock queued_spin_lock -#define smp_mb__after_spinlock() smp_mb() - static __always_inline int queued_spin_is_locked(struct qspinlock *lock) { /* diff --git a/arch/powerpc/include/asm/simple_spinlock.h b/arch/powerpc/include/asm/simple_spinlock.h index 9c3c30534333..3e87258f73b1 100644 --- a/arch/powerpc/include/asm/simple_spinlock.h +++ b/arch/powerpc/include/asm/simple_spinlock.h @@ -282,7 +282,4 @@ static inline void arch_write_unlock(arch_rwlock_t *rw) #define arch_read_relax(lock) rw_yield(lock) #define arch_write_relax(lock) rw_yield(lock) -/* See include/linux/spinlock.h */ -#define smp_mb__after_spinlock() smp_mb() - #endif /* _ASM_POWERPC_SIMPLE_SPINLOCK_H */ diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h index 6ec72282888d..bd75872a6334 100644 --- a/arch/powerpc/include/asm/spinlock.h +++ b/arch/powerpc/include/asm/spinlock.h @@ -10,6 +10,9 @@ #include <asm/simple_spinlock.h> #endif +/* See include/linux/spinlock.h */ +#define smp_mb__after_spinlock() smp_mb() + #ifndef CONFIG_PARAVIRT_SPINLOCKS static inline void pv_spinlocks_init(void) { } #endif -- 2.26.2
[PATCH 0/3] powerpc/qspinlock: Some tuning updates
Hi, A few updates while going through the powerpc port of the qspinlock. Patches 1 and 2 are straightforward, while patch 3 can be considered more of an rfc as I've only tested on a single machine, and there could be an alternative way if it doesn't end up being nearly a universal performance win. Thanks! Davidlohr Bueso (3): powerpc/spinlock: Define smp_mb__after_spinlock only once powerpc/spinlock: Unserialize spin_is_locked powerpc/qspinlock: Use generic smp_cond_load_relaxed arch/powerpc/include/asm/barrier.h | 16 arch/powerpc/include/asm/qspinlock.h | 14 -- arch/powerpc/include/asm/simple_spinlock.h | 6 +- arch/powerpc/include/asm/spinlock.h| 3 +++ 4 files changed, 4 insertions(+), 35 deletions(-) -- 2.26.2
Re: [PATCH 18/15] kvm: Replace vcpu->swait with rcuwait
On Fri, 20 Mar 2020, Peter Zijlstra wrote: On Fri, Mar 20, 2020 at 01:55:26AM -0700, Davidlohr Bueso wrote: - swait_event_interruptible_exclusive(*wq, ((!vcpu->arch.power_off) && - (!vcpu->arch.pause))); + rcuwait_wait_event(*wait, + (!vcpu->arch.power_off) && (!vcpu->arch.pause), + TASK_INTERRUPTIBLE); - for (;;) { - prepare_to_swait_exclusive(&vcpu->wq, &wait, TASK_INTERRUPTIBLE); - - if (kvm_vcpu_check_block(vcpu) < 0) - break; - - waited = true; - schedule(); - } - - finish_swait(&vcpu->wq, &wait); + rcuwait_wait_event(&vcpu->wait, + (block_check = kvm_vcpu_check_block(vcpu)) < 0, + TASK_INTERRUPTIBLE); Are these yet more instances that really want to be TASK_IDLE? Hmm probably, as it makes sense for a blocked vcpu not to be contributing to the loadavg. So if this is the only reason to use interruptible, then yes we ought to change it. However, I'll make this a separate patch, given this (ab)use isn't as obvious as the PS3 case, which is a kthread and therefore signals are masked. Thanks, Davidlohr
Re: [patch V3 00/20] Lock ordering documentation and annotation for lockdep
On Sat, 21 Mar 2020, Thomas Gleixner wrote: This is the third and hopefully final version of this work. The second one can be found here: Would you rather I send in a separate series with the kvm changes, or should I just send a v2 with the fixes here again? Thanks, Davidlohr
Re: [patch V2 11/15] completion: Use simple wait queues
On Wed, 18 Mar 2020, Thomas Gleixner wrote: From: Thomas Gleixner completion uses a wait_queue_head_t to enqueue waiters. wait_queue_head_t contains a spinlock_t to protect the list of waiters which excludes it from being used in truly atomic context on a PREEMPT_RT enabled kernel. The spinlock in the wait queue head cannot be replaced by a raw_spinlock because: - wait queues can have custom wakeup callbacks, which acquire other spinlock_t locks and have potentially long execution times - wake_up() walks an unbounded number of list entries during the wake up and may wake an unbounded number of waiters. For simplicity and performance reasons complete() should be usable on PREEMPT_RT enabled kernels. completions do not use custom wakeup callbacks and are usually single waiter, except for a few corner cases. Replace the wait queue in the completion with a simple wait queue (swait), which uses a raw_spinlock_t for protecting the waiter list and therefore is safe to use inside truly atomic regions on PREEMPT_RT. There is no semantical or functional change: - completions use the exclusive wait mode which is what swait provides - complete() wakes one exclusive waiter - complete_all() wakes all waiters while holding the lock which protects the wait queue against newly incoming waiters. The conversion to swait preserves this behaviour. complete_all() might cause unbound latencies with a large number of waiters being woken at once, but most complete_all() usage sites are either in testing or initialization code or have only a really small number of concurrent waiters which for now does not cause a latency problem. Keep it simple for now. The fixup of the warning check in the USB gadget driver is just a straight forward conversion of the lockless waiter check from one waitqueue type to the other. Signed-off-by: Thomas Gleixner Cc: Arnd Bergmann Reviewed-by: Davidlohr Bueso
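For reference, the resulting structure is simply (from include/linux/completion.h after this change):

struct completion {
	unsigned int done;
	struct swait_queue_head wait;
};

complete() then wakes one waiter via swake_up_locked() with the raw lock held, and complete_all() wakes all waiters under that same lock.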
Re: [patch V2 06/15] rcuwait: Add @state argument to rcuwait_wait_event()
On Fri, 20 Mar 2020, Sebastian Andrzej Siewior wrote: I thought that v2 had it fixed with the previous commit (acpi: Remove header dependency). The kbot just reported that everything is fine. Let me look… Nah my bad, that build did not have the full series applied :) Sorry for the noise. Thanks, Davidlohr
[PATCH 19/15] sched/swait: Reword some of the main description
With both the increased use of swait and kvm no longer using it, we can reword some of the comments. While removing Linus' valid rant, I've also cared to explicitly mention that swait is very different than regular wait. In addition, it now warns against using swait in favor of the regular flavor. Signed-off-by: Davidlohr Bueso --- include/linux/swait.h | 23 +++++------------------ 1 file changed, 5 insertions(+), 18 deletions(-) diff --git a/include/linux/swait.h b/include/linux/swait.h index 73e06e9986d4..6e5b5d0e64fd 100644 --- a/include/linux/swait.h +++ b/include/linux/swait.h @@ -9,23 +9,10 @@ #include /* - * BROKEN wait-queues. - * - * These "simple" wait-queues are broken garbage, and should never be - * used. The comments below claim that they are "similar" to regular - * wait-queues, but the semantics are actually completely different, and - * every single user we have ever had has been buggy (or pointless). - * - * A "swake_up_one()" only wakes up _one_ waiter, which is not at all what - * "wake_up()" does, and has led to problems. In other cases, it has - * been fine, because there's only ever one waiter (kvm), but in that - * case gthe whole "simple" wait-queue is just pointless to begin with, - * since there is no "queue". Use "wake_up_process()" with a direct - * pointer instead. - * - * While these are very similar to regular wait queues (wait.h) the most - * important difference is that the simple waitqueue allows for deterministic - * behaviour -- IOW it has strictly bounded IRQ and lock hold times. + * Simple waitqueues are semantically very different to regular wait queues + * (wait.h). The most important difference is that the simple waitqueue allows + * for deterministic behaviour -- IOW it has strictly bounded IRQ and lock hold + * times. * * Mainly, this is accomplished by two things. Firstly not allowing swake_up_all * from IRQ disabled, and dropping the lock upon every wakeup, giving a higher @@ -39,7 +26,7 @@ *sleeper state. * * - the !exclusive mode; because that leads to O(n) wakeups, everything is - *exclusive. + *exclusive. As such swake_up_one() will only ever awake _one_ waiter. * * - custom wake callback functions; because you cannot give any guarantees *about random code. This also allows swait to be used in RT, such that -- 2.16.4
[PATCH 18/15] kvm: Replace vcpu->swait with rcuwait
The use of any sort of waitqueue (simple or regular) for wait/waking vcpus has always been an overkill and semantically wrong. Because this is per-vcpu (which is blocked) there is only ever a single waiting vcpu, thus no need for any sort of queue. As such, make use of the rcuwait primitive, with the following considerations: - rcuwait already provides the proper barriers that serialize concurrent waiter and waker. - Task wakeup is done in rcu read critical region, with a stable task pointer. - Because there is no concurrency among waiters, we need not worry about rcuwait_wait_event() calls corrupting the wait->task. As a consequence, this saves the locking done in swait when adding to the queue. The x86-tscdeadline_latency test mentioned in 8577370fb0cb ("KVM: Use simple waitqueue for vcpu->wq") shows that, on avg, latency is reduced by around 15% with this change. Cc: Paolo Bonzini Signed-off-by: Davidlohr Bueso --- Only compiled and tested on x86. arch/powerpc/include/asm/kvm_host.h | 2 +- arch/powerpc/kvm/book3s_hv.c | 10 -- arch/x86/kvm/lapic.c | 2 +- include/linux/kvm_host.h | 10 +- virt/kvm/arm/arch_timer.c | 2 +- virt/kvm/arm/arm.c | 9 + virt/kvm/async_pf.c | 3 +-- virt/kvm/kvm_main.c | 33 + 8 files changed, 31 insertions(+), 40 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 6e8b8ffd06ad..e2b4a1e3fb7d 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -752,7 +752,7 @@ struct kvm_vcpu_arch { u8 irq_pending; /* Used by XIVE to signal pending guest irqs */ u32 last_inst; - struct swait_queue_head *wqp; + struct rcuwait *waitp; struct kvmppc_vcore *vcore; int ret; int trap;

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 2cefd071b848..c7cbc4bd06e9 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -231,13 +231,11 @@ static bool kvmppc_ipi_thread(int cpu) static void kvmppc_fast_vcpu_kick_hv(struct kvm_vcpu *vcpu) { int cpu; - struct swait_queue_head *wqp; + struct rcuwait *wait; - wqp = kvm_arch_vcpu_wq(vcpu); - if (swq_has_sleeper(wqp)) { - swake_up_one(wqp); + wait = kvm_arch_vcpu_get_wait(vcpu); + if (rcuwait_wake_up(wait)) ++vcpu->stat.halt_wakeup; - } cpu = READ_ONCE(vcpu->arch.thread_cpu); if (cpu >= 0 && kvmppc_ipi_thread(cpu)) @@ -4274,7 +4272,7 @@ static int kvmppc_vcpu_run_hv(struct kvm_run *run, struct kvm_vcpu *vcpu) } user_vrsave = mfspr(SPRN_VRSAVE); - vcpu->arch.wqp = &vcpu->arch.vcore->wq; + vcpu->arch.waitp = &vcpu->arch.vcore->wait; vcpu->arch.pgdir = kvm->mm->pgd; vcpu->arch.state = KVMPPC_VCPU_BUSY_IN_HOST;

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index e3099c642fec..a4420c26dfbc 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1815,7 +1815,7 @@ void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu) /* If the preempt notifier has already run, it also called apic_timer_expired */ if (!apic->lapic_timer.hv_timer_in_use) goto out; - WARN_ON(swait_active(&vcpu->wq)); + WARN_ON(rcu_dereference(vcpu->wait.task)); cancel_hv_timer(apic); apic_timer_expired(apic);

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index bcb9b2ac0791..b5694429aede 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -23,7 +23,7 @@ #include #include #include -#include <linux/swait.h> +#include <linux/rcuwait.h> #include #include #include @@ -277,7 +277,7 @@ struct kvm_vcpu { struct mutex mutex; struct kvm_run *run; - struct swait_queue_head wq; + struct rcuwait wait; struct pid __rcu *pid; int sigset_active; sigset_t sigset; @@ -952,12 +952,12 @@ static inline bool kvm_arch_has_assigned_device(struct kvm *kvm) } #endif -static inline struct swait_queue_head *kvm_arch_vcpu_wq(struct kvm_vcpu *vcpu) +static inline struct rcuwait *kvm_arch_vcpu_get_wait(struct kvm_vcpu *vcpu) { #ifdef __KVM_HAVE_ARCH_WQP - return vcpu->arch.wqp; + return vcpu->arch.waitp; #else - return &vcpu->wq; + return &vcpu->wait; #endif }

diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c index 0d9438e9de2a..4be71cb58691 100644 --- a/virt/kvm/arm/arch_timer.c +++ b/virt/kvm/arm/arch_timer.c @@ -593,7 +593,7 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu) if (map.emul_ptimer) soft_timer_cancel(&map.emul_ptimer->hrtimer); - if (swait_active(kvm_arch_vcpu_wq(vcpu))) + if (rcu_dereference(kvm_arch_v
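For context, the waiting side is roughly the following (simplified from include/linux/rcuwait.h as of this series; the real macro also bails out with -EINTR via signal_pending_state() for interruptible sleeps):

#define rcuwait_wait_event(w, condition, state)			\
({								\
	/* publish the waiter; pairs with rcuwait_wake_up() */	\
	rcu_assign_pointer((w)->task, current);			\
	for (;;) {						\
		set_current_state(state);			\
		if (condition)					\
			break;					\
		schedule();					\
	}							\
	rcu_assign_pointer((w)->task, NULL);			\
	__set_current_state(TASK_RUNNING);			\
})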
[PATCH 17/15] rcuwait: Inform rcuwait_wake_up() users if a wakeup was attempted
Let the caller know if wake_up_process() was actually called or not; some users can use this information for ad-hoc purposes. Of course returning true does not guarantee that wake_up_process() actually woke anything up. Signed-off-by: Davidlohr Bueso --- include/linux/rcuwait.h | 2 +- kernel/exit.c | 10 ++++++++-- 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/include/linux/rcuwait.h b/include/linux/rcuwait.h index 6e8798458091..3f83b9a12ad3 100644 --- a/include/linux/rcuwait.h +++ b/include/linux/rcuwait.h @@ -24,7 +24,7 @@ static inline void rcuwait_init(struct rcuwait *w) w->task = NULL; } -extern void rcuwait_wake_up(struct rcuwait *w); +extern bool rcuwait_wake_up(struct rcuwait *w); /* * The caller is responsible for locking around rcuwait_wait_event(), diff --git a/kernel/exit.c b/kernel/exit.c index 6cc6cc485d07..b0bb0a8ec4b1 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -234,9 +234,10 @@ void release_task(struct task_struct *p) goto repeat; } -void rcuwait_wake_up(struct rcuwait *w) +bool rcuwait_wake_up(struct rcuwait *w) { struct task_struct *task; + bool ret = false; rcu_read_lock(); @@ -254,10 +255,15 @@ void rcuwait_wake_up(struct rcuwait *w) smp_mb(); /* (B) */ task = rcu_dereference(w->task); - if (task) + if (task) { wake_up_process(task); + ret = true; + } rcu_read_unlock(); + + return ret; } +EXPORT_SYMBOL_GPL(rcuwait_wake_up); /* * Determine if a process group is "orphaned", according to the POSIX -- 2.16.4
[PATCH 16/15] rcuwait: Get rid of stale name comment
The 'trywake' name was renamed to simply 'wake', update the comment. Signed-off-by: Davidlohr Bueso --- kernel/exit.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/exit.c b/kernel/exit.c index 0b81b26a872a..6cc6cc485d07 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -243,7 +243,7 @@ void rcuwait_wake_up(struct rcuwait *w) /* * Order condition vs @task, such that everything prior to the load * of @task is visible. This is the condition as to why the user called -* rcuwait_trywake() in the first place. Pairs with set_current_state() +* rcuwait_wake() in the first place. Pairs with set_current_state() * barrier (A) in rcuwait_wait_event(). * *WAITWAKE -- 2.16.4
Re: [patch V2 00/15] Lock ordering documentation and annotation for lockdep
On Wed, 18 Mar 2020, Thomas Gleixner wrote: The PS3 one got converted by Peter Zijlstra to rcuwait. While at it, I think it makes sense to finally convert the kvm vcpu swait to rcuwait (patch 6/15 starts the necessary api changes). I'm sending some patches on top of this patchset. Thanks, Davidlohr
Re: [patch V2 06/15] rcuwait: Add @state argument to rcuwait_wait_event()
On Wed, 18 Mar 2020, Thomas Gleixner wrote: --- a/include/linux/rcuwait.h +++ b/include/linux/rcuwait.h @@ -3,6 +3,7 @@ #define _LINUX_RCUWAIT_H_ #include +#include So this is causing build to fail for me: CC arch/x86/boot/compressed/cmdline.o arch/x86/boot/compressed/cmdline.c:5:20: error: conflicting types for 'set_fs' static inline void set_fs(unsigned long seg) ^~ In file included from ./include/linux/uaccess.h:11:0, from ./include/linux/sched/task.h:11, from ./include/linux/sched/signal.h:9, from ./include/linux/rcuwait.h:6, from ./include/linux/percpu-rwsem.h:8, from ./include/linux/fs.h:34, from ./include/linux/proc_fs.h:9, from ./include/acpi/acpi_bus.h:83, from ./include/linux/acpi.h:32, from arch/x86/boot/compressed/misc.h:28, from arch/x86/boot/compressed/cmdline.c:2: ./arch/x86/include/asm/uaccess.h:29:20: note: previous definition of 'set_fs' was here static inline void set_fs(mm_segment_t fs) ^~ make[2]: *** [scripts/Makefile.build:268: arch/x86/boot/compressed/cmdline.o] Error 1 make[1]: *** [arch/x86/boot/Makefile:113: arch/x86/boot/compressed/vmlinux] Error 2 make: *** [arch/x86/Makefile:285: bzImage] Error 2 Right now I'm not sure what the proper fix should be. Thanks, Davidlohr
Re: [patch V2 07/15] powerpc/ps3: Convert half completion to rcuwait
On Wed, 18 Mar 2020, Thomas Gleixner wrote: AFAICT the kthread uses TASK_INTERRUPTIBLE to not increase loadavg, kthreads cannot receive signals by default and this one doesn't look different. Use TASK_IDLE instead. Hmm it seems in general this needs to be done kernel-wide. This kthread abuse of TASK_INTERRUPTIBLE seems to be a common thing. There's also the users doing schedule_timeout_interruptible()... Thanks, Davidlohr
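For reference, TASK_IDLE exists precisely for this case - an uninterruptible sleep that does not contribute to loadavg (include/linux/sched.h):

#define TASK_IDLE	(TASK_UNINTERRUPTIBLE | TASK_NOLOAD)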
Re: [patch V2 08/15] Documentation: Add lock ordering and nesting documentation
On Wed, 18 Mar 2020, Thomas Gleixner wrote: +Owner semantics +=== + +Most lock types in the Linux kernel have strict owner semantics, i.e. the +context (task) which acquires a lock has to release it. + +There are two exceptions: + + - semaphores + - rwsems + +semaphores have no strict owner semantics for historical reasons. They are I would rephrase this to: semaphores have no owner semantics for historical reasons, and as such trylock and release operations can be called from interrupt context. They are ... +often used for both serialization and waiting purposes. That's generally +discouraged and should be replaced by separate serialization and wait +mechanisms. ^ , such as mutexes or completions. + +rwsems have grown interfaces which allow non owner release for special +purposes. This usage is problematic on PREEMPT_RT because PREEMPT_RT +substitutes all locking primitives except semaphores with RT-mutex based +implementations to provide priority inheritance for all lock types except +the truly spinning ones. Priority inheritance on ownerless locks is +obviously impossible. + +For now the rwsem non-owner release excludes code which utilizes it from +being used on PREEMPT_RT enabled kernels. In some cases this can be +mitigated by disabling portions of the code, in other cases the complete +functionality has to be disabled until a workable solution has been found. Thanks, Davidlohr
Re: [PATCH 5/6] powerpc/mmu: drop mmap_sem now that locked_vm is atomic
On Tue, 23 Apr 2019, Bueso wrote: On Wed, 03 Apr 2019, Daniel Jordan wrote: On Wed, Apr 03, 2019 at 06:58:45AM +0200, Christophe Leroy wrote: On 02/04/2019 at 22:41, Daniel Jordan wrote: With locked_vm now an atomic, there is no need to take mmap_sem as writer. Delete and refactor accordingly. Could you please detail the change? Ok, I'll be more specific in the next version, using some of your language in fact. :) It looks like this is not the only change. I'm wondering what the consequences are. Before we did: - lock - calculate future value - check the future value is acceptable - update value if future value acceptable - return error if future value non acceptable - unlock Now we do: - atomic update with future (possibly too high) value - check the new value is acceptable - atomic update back with older value if new value not acceptable and return error So if a concurrent action wants to increase locked_vm with an acceptable step while another one has temporarily set it too high, it will now fail. I think we should keep the previous approach and do a cmpxchg after validating the new value. Wouldn't the cmpxchg alternative also be exposed to locked_vm changing between validating the new value and the cmpxchg(), and we'd bogusly fail even when there is still room, just because the value changed (I'm assuming we don't hold any locks, otherwise all this is pointless). current_locked = atomic_read(&mm->locked_vm); new_locked = current_locked + npages; if (new_locked < lock_limit) if (cmpxchg(&mm->locked_vm, current_locked, new_locked) == current_locked) Err, this being != of course.
Re: [PATCH 5/6] powerpc/mmu: drop mmap_sem now that locked_vm is atomic
On Wed, 03 Apr 2019, Daniel Jordan wrote: On Wed, Apr 03, 2019 at 06:58:45AM +0200, Christophe Leroy wrote: On 02/04/2019 at 22:41, Daniel Jordan wrote: > With locked_vm now an atomic, there is no need to take mmap_sem as > writer. Delete and refactor accordingly. Could you please detail the change? Ok, I'll be more specific in the next version, using some of your language in fact. :) It looks like this is not the only change. I'm wondering what the consequences are. Before we did: - lock - calculate future value - check the future value is acceptable - update value if future value acceptable - return error if future value non acceptable - unlock Now we do: - atomic update with future (possibly too high) value - check the new value is acceptable - atomic update back with older value if new value not acceptable and return error So if a concurrent action wants to increase locked_vm with an acceptable step while another one has temporarily set it too high, it will now fail. I think we should keep the previous approach and do a cmpxchg after validating the new value. Wouldn't the cmpxchg alternative also be exposed to locked_vm changing between validating the new value and the cmpxchg(), and we'd bogusly fail even when there is still room, just because the value changed (I'm assuming we don't hold any locks, otherwise all this is pointless). current_locked = atomic_read(&mm->locked_vm); new_locked = current_locked + npages; if (new_locked < lock_limit) if (cmpxchg(&mm->locked_vm, current_locked, new_locked) == current_locked) /* ENOMEM */ That's a good idea, and especially worth doing considering that an arbitrary number of threads that charge a low amount of locked_vm can fail just because one thread charges lots of it. Yeah but the window for this is quite small, I doubt it would be a real issue. What if before doing the atomic_add_return(), we first did the racy new_locked check for ENOMEM, then do the speculative add and cleanup, if necessary. This would further reduce the scope of the window where false ENOMEM can occur. pinned_vm appears to be broken the same way, so I can fix it too unless someone beats me to it. This should not be a surprise for the rdma folks. Cc'ing Jason nonetheless. Thanks, Davidlohr
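A sketch of the cmpxchg approach being discussed, with a retry loop so a concurrent update does not cause a bogus failure (hypothetical helper name; assumes the atomic64_t locked_vm from patch 1 of the series):

static int try_charge_locked_vm(struct mm_struct *mm, long npages,
				s64 lock_limit)
{
	s64 current_locked, new_locked;

	do {
		current_locked = atomic64_read(&mm->locked_vm);
		new_locked = current_locked + npages;
		if (new_locked > lock_limit)
			return -ENOMEM;	/* would exceed RLIMIT_MEMLOCK */
	} while (atomic64_cmpxchg(&mm->locked_vm, current_locked,
				  new_locked) != current_locked);
	return 0;
}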
Re: [PATCH v12 21/31] mm: Introduce find_vma_rcu()
On Tue, 23 Apr 2019, Peter Zijlstra wrote: Also; the initial motivation was prefaulting large VMAs and the contention on mmap was killing things; but similarly, the contention on the refcount (I did try that) killed things just the same. Right, this is just like what can happen with per-vma locking. Thanks, Davidlohr
Re: [PATCH 1/6] mm: change locked_vm's type from unsigned long to atomic64_t
On Tue, 02 Apr 2019, Andrew Morton wrote: Also, we didn't remove any down_write(mmap_sem)s from core code so I'm thinking that the benefit of removing a few mmap_sem-takings from a few obscure drivers (sorry ;)) is pretty small. afaik porting the remaining incorrect users of locked_vm to pinned_vm was the next step before this one, which made converting locked_vm to atomic hardly worth it. Daniel? Thanks, Davidlohr
Re: [PATCH v5 1/3] locking/rwsem: Remove arch specific rwsem files
On Fri, 22 Mar 2019, Linus Torvalds wrote: Some of them _might_ be performance-critical. There's the one on mmap_sem in the fault handling path, for example. And yes, I'd expect the normal case to very much be "no other readers or writers" for that one. Yeah, the mmap_sem case in the fault path is really expecting an unlocked state. To the point that four archs have added branch predictions, ie:

  92181f190b6 (x86: optimise x86's do_page_fault (C entry point for the page fault path))
  b15021d994f (powerpc/mm: Add a bunch of (un)likely annotations to do_page_fault)

And using PROFILE_ANNOTATED_BRANCHES shows pretty clearly (without resetting the counters):

    correct   incorrect  %  Function            File     Line
    -------   ---------  -  --------            ----     ----
    4603685          34  0  do_user_addr_fault  fault.c  1416  (bootup)
  382327745         449  0  do_user_addr_fault  fault.c  1416  (kernel build)
  399446159         461  0  do_user_addr_fault  fault.c  1416  (redis benchmark)

It probably wouldn't hurt to do the unlikely() for all archs, or alternatively, add likely() to the atomic_long_try_cmpxchg_acquire() in patch 3 and do it implicitly, but maybe that would be less flexible(?) Thanks, Davidlohr
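For illustration, annotating the read-trylock fast path might look something like the sketch below. This is loosely modeled on the rwsem-xadd code of that era; RWSEM_ACTIVE_READ_BIAS and the exact loop shape are assumptions here, not the patch's code:

	/* Illustrative only: predict the uncontended case for the fault path. */
	static inline int __down_read_trylock(struct rw_semaphore *sem)
	{
		long tmp = atomic_long_read(&sem->count);

		/* count >= 0 means no active writer; expect that case. */
		while (likely(tmp >= 0)) {
			if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
						tmp + RWSEM_ACTIVE_READ_BIAS))
				return 1;
			/* tmp was refreshed by the failed cmpxchg; re-test it. */
		}
		return 0;
	}

Annotating inside the primitive makes every caller benefit implicitly, at the cost of not being able to override the prediction per call site, which is the flexibility trade-off alluded to above.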
Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
On Fri, 08 Feb 2019, Waiman Long wrote: I am planning to run more performance test and post the data sometimes next week. Davidlohr is also going to run some of his rwsem performance test on this patchset. So I ran this series on a 40-core, 2-socket IB box with various workloads in mmtests. Below are some of the interesting ones; full numbers and curves at https://linux-scalability.org/rwsem-reader-spinner/ All workloads are with increasing number of threads.

-- pagefault timings: pft is an artificial pf benchmark (thus reader stress). metric is faults/cpu and faults/sec

                                v5.0-rc6                v5.0-rc6 dirty
Hmean  faults/cpu-1    624224.9815 (   0.00%)   618847.5201 *  -0.86%*
Hmean  faults/cpu-4    539550.3509 (   0.00%)   547407.5738 *   1.46%*
Hmean  faults/cpu-7    401470.3461 (   0.00%)   381157.9830 *  -5.06%*
Hmean  faults/cpu-12   267617.0353 (   0.00%)   271098.5441 *   1.30%*
Hmean  faults/cpu-21   176194.4641 (   0.00%)   175151.3256 *  -0.59%*
Hmean  faults/cpu-30   119927.3862 (   0.00%)   120610.1348 *   0.57%*
Hmean  faults/cpu-40    91203.6820 (   0.00%)    91832.7489 *   0.69%*
Hmean  faults/sec-1    623292.3467 (   0.00%)   617992.0795 *  -0.85%*
Hmean  faults/sec-4   2113364.6898 (   0.00%)  2140254.8238 *   1.27%*
Hmean  faults/sec-7   2557378.4385 (   0.00%)  2450945.7060 *  -4.16%*
Hmean  faults/sec-12  2696509.8975 (   0.00%)  2747968.9819 *   1.91%*
Hmean  faults/sec-21  2902892.5639 (   0.00%)  2905923.3881 *   0.10%*
Hmean  faults/sec-30  2956696.5793 (   0.00%)  2990583.5147 *   1.15%*
Hmean  faults/sec-40  3422806.4806 (   0.00%)  3352970.3082 *  -2.04%*
Stddev faults/cpu-1      2949.5159 (   0.00%)     2802.2712 (   4.99%)
Stddev faults/cpu-4     24165.9454 (   0.00%)    15841.1232 (  34.45%)
Stddev faults/cpu-7     20914.8351 (   0.00%)    22744.3294 (  -8.75%)
Stddev faults/cpu-12    11274.3490 (   0.00%)    14733.3152 ( -30.68%)
Stddev faults/cpu-21     2500.1950 (   0.00%)     2200.9518 (  11.97%)
Stddev faults/cpu-30     1599.5346 (   0.00%)     1414.0339 (  11.60%)
Stddev faults/cpu-40     1473.0181 (   0.00%)     3004.1209 (-103.94%)
Stddev faults/sec-1      2655.2581 (   0.00%)     2405.1625 (   9.42%)
Stddev faults/sec-4     84042.7234 (   0.00%)    57996.7158 (  30.99%)
Stddev faults/sec-7    123656.7901 (   0.00%)   135591.1087 (  -9.65%)
Stddev faults/sec-12    97135.6091 (   0.00%)   127054.4926 ( -30.80%)
Stddev faults/sec-21    69564.6264 (   0.00%)    65922.6381 (   5.24%)
Stddev faults/sec-30    51524.4027 (   0.00%)    56109.4159 (  -8.90%)
Stddev faults/sec-40   101927.5280 (   0.00%)   160117.0093 ( -57.09%)

With the exception of the hiccup at 7 threads, things are pretty much in the noise region for both metrics.

-- git checkout First metric is total runtime for runs with incremental threads.

            v5.0-rc6    v5.0-rc6 dirty
User          218.95        219.07
System        149.29        146.82
Elapsed      1574.10       1427.08

In this case there's a non-trivial improvement (11%) in overall elapsed time.

-- reaim (which is always susceptible to rwsem changes for both mmap_sem and i_mmap)

                           v5.0-rc6            v5.0-rc6 dirty
Hmean  compute-1      6674.01 (   0.00%)    6544.28 *  -1.94%*
Hmean  compute-21    85294.91 (   0.00%)   85524.20 *   0.27%*
Hmean  compute-41   149674.70 (   0.00%)  149494.58 *  -0.12%*
Hmean  compute-61   177721.15 (   0.00%)  170507.38 *  -4.06%*
Hmean  compute-81   181531.07 (   0.00%)  180463.24 *  -0.59%*
Hmean  compute-101  189024.09 (   0.00%)  187288.86 *  -0.92%*
Hmean  compute-121  200673.24 (   0.00%)  195327.65 *  -2.66%*
Hmean  compute-141  213082.29 (   0.00%)  211290.80 *  -0.84%*
Hmean  compute-161  207764.06 (   0.00%)  204626.68 *  -1.51%*

The 'compute' workload overall takes a small hit.
Hmean  new_dbase-1       60.48 (   0.00%)      60.63 *   0.25%*
Hmean  new_dbase-21    6590.49 (   0.00%)    6671.81 *   1.23%*
Hmean  new_dbase-41   14202.91 (   0.00%)   14470.59 *   1.88%*
Hmean  new_dbase-61   21207.24 (   0.00%)   21067.40 *  -0.66%*
Hmean  new_dbase-81   25542.40 (   0.00%)   25542.40 *   0.00%*
Hmean  new_dbase-101  30165.28 (   0.00%)   30046.21 *  -0.39%*
Hmean  new_dbase-121  33638.33 (   0.00%)   33219.90 *  -1.24%*
Hmean  new_dbase-141  36723.70 (   0.00%)   37504.52 *   2.13%*
Hmean  new_dbase-161  42242.51 (   0.00%)   42117.34 *  -0.30%*
Hmean  shared-1          76.54 (   0.00%)      76.09 *  -0.59%*
Hmean  shared-21       7535.51 (   0.00%)    5518.75 * -26.76%*
Hmean  shared-41      17207.81 (   0.00%)   14651.94 * -14.85%*
Hmean  shared-61
Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
On Thu, 07 Feb 2019, Waiman Long wrote: 30 files changed, 1197 insertions(+), 1594 deletions(-) Performance numbers on numerous workloads, pretty please. I'll go and throw this at my mmap_sem intensive workloads I've collected. Thanks, Davidlohr
Re: [PATCH] powerpc/mm: Fix missing mmap_sem release
On Thu, 07 Sep 2017, Laurent Dufour wrote: The commit b5c8f0fd595d ("powerpc/mm: Rework mm_fault_error()") reviewed the way the error path is managed in __do_page_fault() but it was a bit too aggressive when handling a case by returning without releasing the mmap_sem. By the way, this also replaces current->mm->mmap_sem with mm->mmap_sem, as mm is set to current->mm.

Fixes: b5c8f0fd595d ("powerpc/mm: Rework mm_fault_error()")
Cc: Benjamin Herrenschmidt
Signed-off-by: Laurent Dufour
---
 arch/powerpc/mm/fault.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 4797d08581ce..f799ccf37d27 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c

But... here:

	/*
	 * If we need to retry the mmap_sem has already been released,
	 * and if there is a fatal signal pending there is no guarantee
	 * that we made any progress. Handle this case first.
	 */

@@ -521,10 +521,11 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
	 * User mode? Just return to handle the fatal exception otherwise
	 * return to bad_page_fault
	 */
+	up_read(&mm->mmap_sem);
	return is_user ? 0 : SIGBUS;
 }

Per the above comment, for that case handle_mm_fault() has already released mmap_sem. The same occurs in x86, for example. Thanks, Davidlohr
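To make the convention explicit, here is a hedged sketch of the pattern being referenced, modeled on the x86 handler of that era; the names and structure are illustrative, not the actual powerpc code:

	int fault = handle_mm_fault(vma, address, flags);

	/*
	 * Sketch: if VM_FAULT_RETRY is set and a fatal signal is
	 * pending, handle_mm_fault() has *already* dropped mmap_sem,
	 * so this bail-out path must not call up_read() again.
	 */
	if (unlikely(fault & VM_FAULT_RETRY)) {
		if (fatal_signal_pending(current))
			/* mmap_sem already released; no up_read() here */
			return is_user ? 0 : SIGBUS;
		/* otherwise retry the fault, reacquiring mmap_sem (not shown) */
	}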
Re: [PATCH v6 02/11] locking/osq: Drop the overload of osq_lock()
On Fri, 28 Oct 2016, Pan Xinhui wrote:

 /*
  * If we need to reschedule bail... so we can block.
+ * Use vcpu_is_preempted to detech lock holder preemption issue
                           ^^ detect
+ * and break.

Could you please remove the rest of this comment? It's just noise to point out that vcpu_is_preempted is a macro defined by arch/false. This is standard protocol in the kernel. Same goes for all locks you change with this. Thanks, Davidlohr

+ * vcpu_is_preempted is a macro defined by false if
+ * arch does not support vcpu preempted check,
  */
-	if (need_resched())
+	if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
 		goto unqueue;

 	cpu_relax_lowlatency();
--
2.4.11
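(For reference, the "defined by false" fallback being talked about is the generic stub, which ends up looking roughly like this:)

	/*
	 * Archs that can expose lock-holder preemption (e.g. paravirt
	 * guests) provide their own vcpu_is_preempted(); everyone else
	 * gets a stub that always reports "not preempted".
	 */
	#ifndef vcpu_is_preempted
	#define vcpu_is_preempted(cpu)	false
	#endif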
Re: [PATCH powerpc/next v6 0/4] atomics: powerpc: Implement relaxed/acquire/release variants
On Wed, 23 Dec 2015, Boqun Feng wrote: There is one thing we should be aware of, that is the bug: http://lkml.kernel.org/r/5669d5f2.5050...@caviumnetworks.com which has since been fixed by: http://lkml.kernel.org/r/20151217160549.gh6...@twins.programming.kicks-ass.net Right, and fwiw the testing I did included this fix.
Re: [PATCH powerpc/next v6 0/4] atomics: powerpc: Implement relaxed/acquire/release variants
I've left this series testing overnight on a power7 box and so far so good, nothing has broken.
Re: [PATCH v2] arch: use WRITE_ONCE/READ_ONCE in smp_store_release/smp_load_acquire
On Sun, 2015-08-02 at 17:11 +0200, Andrey Konovalov wrote: Replace the ACCESS_ONCE() macro in smp_store_release() and smp_load_acquire() with WRITE_ONCE() and READ_ONCE() on x86, arm, arm64, ia64, metag, mips, powerpc, s390, sparc and asm-generic, since ACCESS_ONCE does not work reliably on non-scalar types. WRITE_ONCE() and READ_ONCE() were introduced in commits 230fa253df63 ("kernel: Provide READ_ONCE and ASSIGN_ONCE") and 43239cbe79fc ("kernel: Change ASSIGN_ONCE(val, x) to WRITE_ONCE(x, val)"). Signed-off-by: Andrey Konovalov andreyk...@google.com Acked-by: Davidlohr Bueso dbu...@suse.de
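For context, the asm-generic forms the patch converts end up looking roughly like this (a simplified sketch; the per-arch versions use lighter barriers than smp_mb() where possible):

	#define smp_store_release(p, v)					\
	do {								\
		compiletime_assert_atomic_type(*p);			\
		smp_mb();						\
		WRITE_ONCE(*p, v);					\
	} while (0)

	#define smp_load_acquire(p)					\
	({								\
		typeof(*p) ___p1 = READ_ONCE(*p);			\
		compiletime_assert_atomic_type(*p);			\
		smp_mb();						\
		___p1;							\
	})

Since *p may be a non-scalar type here, the macros need the READ_ONCE()/WRITE_ONCE() semantics rather than the old ACCESS_ONCE() volatile cast, which is exactly the motivation stated above.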
[PATCH 3/3] powerpc/oprofile: reduce mmap_sem hold for exe_file
In the future mm->exe_file will be done without mmap_sem serialization, thus isolate and reorganize the related code to make the transition easier. Good users will make use of the more standard get_mm_exe_file(), requiring only holding the mmap_sem to read the value, and relying on reference counting to make sure that the exe file won't disappear underneath us while getting the dcookie.

Cc: Arnd Bergmann a...@arndb.de
Cc: Robert Richter r...@kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: cbe-oss-...@lists.ozlabs.org
Cc: oprofile-l...@lists.sourceforge.net
Signed-off-by: Davidlohr Bueso dbu...@suse.de
---
Completely untested.

 arch/powerpc/oprofile/cell/spu_task_sync.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/oprofile/cell/spu_task_sync.c b/arch/powerpc/oprofile/cell/spu_task_sync.c
index 1c27831..ed7b097 100644
--- a/arch/powerpc/oprofile/cell/spu_task_sync.c
+++ b/arch/powerpc/oprofile/cell/spu_task_sync.c
@@ -22,6 +22,7 @@
 #include <linux/kref.h>
 #include <linux/mm.h>
 #include <linux/fs.h>
+#include <linux/file.h>
 #include <linux/module.h>
 #include <linux/notifier.h>
 #include <linux/numa.h>
@@ -322,18 +323,20 @@ get_exec_dcookie_and_offset(struct spu *spu, unsigned int *offsetp,
 	unsigned long app_cookie = 0;
 	unsigned int my_offset = 0;
 	struct vm_area_struct *vma;
+	struct file *exe_file;
 	struct mm_struct *mm = spu->mm;

 	if (!mm)
 		goto out;

-	down_read(&mm->mmap_sem);
-
-	if (mm->exe_file) {
-		app_cookie = fast_get_dcookie(&mm->exe_file->f_path);
-		pr_debug("got dcookie for %pD\n", mm->exe_file);
+	exe_file = get_mm_exe_file(mm);
+	if (exe_file) {
+		app_cookie = fast_get_dcookie(&exe_file->f_path);
+		pr_debug("got dcookie for %pD\n", exe_file);
+		fput(exe_file);
 	}

+	down_read(&mm->mmap_sem);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		if (vma->vm_start > spu_ref || vma->vm_end <= spu_ref)
 			continue;
--
2.1.4
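For reference, a simplified sketch of the helper the patch switches to, modeled on kernel/fork.c of that era:

	struct file *get_mm_exe_file(struct mm_struct *mm)
	{
		struct file *exe_file;

		/* Hold mmap_sem only long enough to grab a reference. */
		down_read(&mm->mmap_sem);
		exe_file = mm->exe_file;
		if (exe_file)
			get_file(exe_file);
		up_read(&mm->mmap_sem);

		return exe_file;
	}

The caller then uses the file without mmap_sem held and drops the reference with fput(), which is what lets the dcookie lookup above move out of the locked region.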
Re: [3.13.y.z extended stable] Patch locking/mutex: Disable optimistic spinning on some architectures has been added to staging queue
On Wed, 2014-08-06 at 17:25 -0400, Andev wrote: On Wed, Aug 6, 2014 at 4:54 PM, Kamal Mostafa ka...@canonical.com wrote: This is a note to let you know that I have just added a patch titled locking/mutex: Disable optimistic spinning on some architectures to the linux-3.13.y-queue branch of the 3.13.y.z extended stable tree which can be found at: http://kernel.ubuntu.com/git?p=ubuntu/linux.git;a=shortlog;h=refs/heads/linux-3.13.y-queue This patch is scheduled to be released in version 3.13.11.6. This patch is specifically for parisc, sparc32, tile32, metag-lock1, arc-!llsc and hexagon, none of which Ubuntu supports. Why are you backporting this patch? Well, 3.13.y.z isn't an Ubuntu kernel, it's upstream, and those archs *are* supported.
Re: Tasks stuck in futex code (in 3.14-rc6)
On Sat, 2014-03-22 at 07:57 +0530, Srikar Dronamraju wrote: So reverting and applying v3 3/4 and 4/4 patches works for me. Ok, I verified that the above ends up resulting in the same tree as the minimal patch I sent out, modulo (a) some comments and (b) an #ifdef CONFIG_SMP in futex_get_mm() that doesn't really matter. So I committed the minimal patch with your tested-by. Just to be sure, I have verified with latest mainline with HEAD having commit 08edb33c4 ("Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux"), which also has commit 11d4616bd0 ("futex: revert back to the explicit waiter counting code"). Thanks for double checking.
Re: Tasks stuck in futex code (in 3.14-rc6)
On Thu, 2014-03-20 at 15:38 +0530, Srikar Dronamraju wrote: This problem suggests that we missed a wakeup for a task that was adding itself to the queue in a wait path. And the only place that can happen is with the hb spinlock check for any pending waiters. Just in case we missed some assumption about checking the hash bucket spinlock as a way of detecting any waiters (powerpc?), could you revert this commit and try the original atomic operations variant: https://lkml.org/lkml/2013/12/19/630 I think the above url and the commit id that I reverted i.e. git://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b0c29f79ecea0b6fbcefc999 are the same. Or am I missing something? No, please take a closer look, they are different approaches to the same end.

diff --git a/kernel/futex.c b/kernel/futex.c
index 34ecd9d..35ff697 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -203,6 +203,7 @@ static const struct futex_q futex_q_init = {
  * waiting on a futex.
  */
 struct futex_hash_bucket {
+	atomic_t waiters;
 	spinlock_t lock;
 	struct plist_head chain;
 } ____cacheline_aligned_in_smp;
@@ -211,6 +212,53 @@ static unsigned long __read_mostly futex_hashsize;

 static struct futex_hash_bucket *futex_queues;

+static inline void futex_get_mm(union futex_key *key)
+{
+	atomic_inc(&key->private.mm->mm_count);
+#ifdef CONFIG_SMP
+	/*
+	 * Ensure futex_get_mm() implies a full barrier such that
+	 * get_futex_key() implies a full barrier. This is relied upon
+	 * as full barrier (B), see the ordering comment above.
+	 */
+	smp_mb__after_atomic_inc();
+#endif
+}
+
+/*
+ * Reflects a new waiter being added to the waitqueue.
+ */
+static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	atomic_inc(&hb->waiters);
+	/*
+	 * Full barrier (A), see the ordering comment above.
+	 */
+	smp_mb__after_atomic_inc();
+#endif
+}
+
+/*
+ * Reflects a waiter being removed from the waitqueue by wakeup
+ * paths.
+ */
+static inline void hb_waiters_dec(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	atomic_dec(&hb->waiters);
+#endif
+}
+
+static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	return atomic_read(&hb->waiters);
+#else
+	return 1;
+#endif
+}
+
 /*
  * We hash on the keys returned from get_futex_key (see below).
  */
@@ -245,10 +293,10 @@ static void get_futex_key_refs(union futex_key *key)

 	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
 	case FUT_OFF_INODE:
-		ihold(key->shared.inode);
+		ihold(key->shared.inode); /* implies MB */
 		break;
 	case FUT_OFF_MMSHARED:
-		atomic_inc(&key->private.mm->mm_count);
+		futex_get_mm(key); /* implies MB */
 		break;
 	}
 }
@@ -322,7 +370,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
 	if (!fshared) {
 		key->private.mm = mm;
 		key->private.address = address;
-		get_futex_key_refs(key);
+		get_futex_key_refs(key); /* implies MB (B) */
 		return 0;
 	}
@@ -429,7 +477,7 @@ again:
 		key->shared.pgoff = basepage_index(page);
 	}

-	get_futex_key_refs(key);
+	get_futex_key_refs(key); /* implies MB (B) */

 out:
 	unlock_page(page_head);
@@ -893,6 +941,7 @@ static void __unqueue_futex(struct futex_q *q)
 	hb = container_of(q->lock_ptr, struct futex_hash_bucket, lock);

 	plist_del(&q->list, &hb->chain);
+	hb_waiters_dec(hb);
 }

 /*
@@ -1052,6 +1101,11 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 		goto out;

 	hb = hash_futex(&key);
+
+	/* Make sure we really have tasks to wakeup */
+	if (!hb_waiters_pending(hb))
+		goto out_put_key;
+
 	spin_lock(&hb->lock);

 	plist_for_each_entry_safe(this, next, &hb->chain, list) {
@@ -1072,6 +1126,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	}

 	spin_unlock(&hb->lock);
+out_put_key:
 	put_futex_key(&key);
 out:
 	return ret;
@@ -1190,7 +1245,9 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
 	 */
 	if (likely(&hb1->chain != &hb2->chain)) {
 		plist_del(&q->list, &hb1->chain);
+		hb_waiters_dec(hb1);
 		plist_add(&q->list, &hb2->chain);
+		hb_waiters_inc(hb2);
 		q->lock_ptr = &hb2->lock;
 	}
 	get_futex_key_refs(key2);
@@ -1533,6 +1590,17 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 	struct futex_hash_bucket *hb;

 	hb = hash_futex(&q->key);
+
+	/*
+	 * Increment the counter before taking the lock so that
+	 * a potential waker won't miss a to-be-slept task that
Re: Tasks stuck in futex code (in 3.14-rc6)
On Wed, 2014-03-19 at 22:56 -0700, Davidlohr Bueso wrote: On Thu, 2014-03-20 at 11:03 +0530, Srikar Dronamraju wrote: Joy,.. let me look at that with ppc in mind. OK; so while pretty much all the comments from that patch are utter nonsense (what was I thinking), I cannot actually find a real bug. But could you try the below which replaces a control dependency with a full barrier. The control flow is plenty convoluted that I think the control barrier isn't actually valid anymore and that might indeed explain the fail. Unfortunately the patch didn't help. Still seeing tasks stuck:

 # ps -Ao pid,tt,user,fname,tmout,f,wchan | grep futex
 14680 pts/0 root java - 0 futex_wait_queue_me
 14797 pts/0 root java - 0 futex_wait_queue_me
 # :> /var/log/messages
 # echo t > /proc/sysrq-trigger
 # grep futex_wait_queue_me /var/log/messages | wc -l
 334

 [ 6904.211478] Call Trace:
 [ 6904.211481] [c00fa1f1b4d0] [0020] 0x20 (unreliable)
 [ 6904.211486] [c00fa1f1b6a0] [c0015208] .__switch_to+0x1e8/0x330
 [ 6904.211491] [c00fa1f1b750] [c0702f00] .__schedule+0x360/0x8b0
 [ 6904.211495] [c00fa1f1b9d0] [c0147348] .futex_wait_queue_me+0xf8/0x1a0
 [ 6904.211500] [c00fa1f1ba60] [c01486dc] .futex_wait+0x17c/0x2a0
 [ 6904.211505] [c00fa1f1bc10] [c014a614] .do_futex+0x254/0xd80
 [ 6904.211510] [c00fa1f1bd60] [c014b25c] .SyS_futex+0x11c/0x1d0
 [ 6904.238874] [c00fa1f1be30] [c000a0fc] syscall_exit+0x0/0x7c
 [ 6904.238879] java S 3fff825f6044 0 14682 14076 0x0080

Is there any other information that I can provide that would help? This problem suggests that we missed a wakeup for a task that was adding itself to the queue in a wait path. And the only place that can happen is with the hb spinlock check for any pending waiters. Just in case we missed some assumption about checking the hash bucket spinlock as a way of detecting any waiters (powerpc?), could you revert this commit and try the original atomic operations variant: https://lkml.org/lkml/2013/12/19/630 hmmm, looking at the ppc spinlock code, it seems that it doesn't have ticket spinlocks -- in fact Torsten Duwe has been trying to get them upstream very recently. Since we rely on the counter for detecting waiters, this might explain the issue. Could someone confirm this spinlock implementation difference?
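To illustrate the difference in question, a hedged sketch of a ticket-lock arch_spin_is_locked() (modeled on x86 of that era): with tickets, head != tail whenever the lock is held *or* has waiters spinning on it, which is exactly what the futex heuristic leaned on; a plain test-and-set lock only reflects the current holder.

	/* Illustrative sketch only: x86-style ticket lock of that era. */
	static inline int arch_spin_is_locked(arch_spinlock_t *lock)
	{
		struct __raw_tickets t = READ_ONCE(lock->tickets);

		/* Nonzero while anyone holds *or* waits for the lock. */
		return t.head != t.tail;
	}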
Re: Tasks stuck in futex code (in 3.14-rc6)
On Thu, 2014-03-20 at 09:41 -0700, Linus Torvalds wrote: On Wed, Mar 19, 2014 at 10:56 PM, Davidlohr Bueso davidl...@hp.com wrote: This problem suggests that we missed a wakeup for a task that was adding itself to the queue in a wait path. And the only place that can happen is with the hb spinlock check for any pending waiters. Ok, so thinking about hb_waiters_pending():

 - spin_is_locked() test
 - read barrier
 - plist_head_empty() test

seems to be broken after all. The race is against futex_wait() that does

 - futex_wait_setup():
   - queue_lock()
   - get_futex_value_locked()
 - futex_wait_queue_me()
   - queue_me()
     - plist_add()

right? Yep. It strikes me that the spin_is_locked() test has no barriers wrt the writing of the new futex value on the wake path. And the read barrier obviously does nothing wrt the write either. Or am I missing something? So the write that actually released the futex might be almost arbitrarily delayed on the waking side. So the waiting side may not see the new value, even though the waker assumes it does due to the ordering of it doing the write first. Aha, that would certainly violate the ordering guarantees. I feared _something_ like that when we originally discussed your suggestion as opposed to the atomics one, but didn't have any case for it either. So maybe we need a memory barrier in hb_waiters_pending() just to make sure that the new value is written. But at that point, I suspect that Davidlohr's original patch that just had explicit waiting counts is just as well. The whole point of the head-empty test was to emulate "do we have waiters" without having to actually add the atomics, but a memory barrier is really no worse. The attached is a TOTALLY UNTESTED interdiff that adds back Davidlohr's atomic count. It may be terminally broken, I literally didn't test it at all. Comparing with the patch I sent earlier this morning, it looks equivalent, and fwiw, it passes my initial qemu bootup, which is the first way of detecting anything stupid going on. So, Srikar, please try this patch out, as opposed to mine, you don't have to first revert the commit in question.
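For readers following along, the check Linus enumerates can be sketched as below; this is a reconstruction from his description (the reverted commit is b0c29f79), with the comments paraphrased rather than quoted:

	static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
	{
		/* Tasks inside the critical region count as potential waiters. */
		if (spin_is_locked(&hb->lock))
			return true;

		/* Order the lock-state read before the queue read. */
		smp_rmb();

		return !plist_head_empty(&hb->chain);
	}

The race discussed above is that neither read is ordered against the waker's store of the new futex value, so a sufficiently delayed store can make the waiter miss the update while the waker sees an apparently empty bucket.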
Re: Tasks stuck in futex code (in 3.14-rc6)
On Thu, 2014-03-20 at 10:42 -0700, Linus Torvalds wrote: On Thu, Mar 20, 2014 at 10:18 AM, Davidlohr Bueso davidl...@hp.com wrote: It strikes me that the spin_is_locked() test has no barriers wrt the writing of the new futex value on the wake path. And the read barrier obviously does nothing wrt the write either. Or am I missing something? So the write that actually released the futex might be almost arbitrarily delayed on the waking side. So the waiting side may not see the new value, even though the waker assumes it does due to the ordering of it doing the write first. Aha, that would certainly violate the ordering guarantees. I feared _something_ like that when we originally discussed your suggestion as opposed to the atomics one, but didn't have any case for it either. Actually, looking closer, we have the memory barrier in get_futex_key_refs() (called by get_futex_key()) so that's not it. In fact, your atomic_read(&hb->waiters) doesn't have any more serialization than the spin_is_locked() test had. But the spin_is_locked() and queue-empty tests are two separate memory reads, and maybe there is some ordering wrt those two that we missed, so the waiters patch is worth trying anyway. Well, imho we would have seen something wrong much, much earlier. This patch has been very heavily tested (including with the java workload used by Srikar). I still wonder about ppc and spinlocks (no ticketing!!)... sure, the waiters patch might fix the problem just because we explicitly count the members of the plist. And I guess if we cannot rely on all archs having an equivalent spinlock implementation, we simply cannot use this technique for futexes. Thanks, Davidlohr
Re: Tasks stuck in futex code (in 3.14-rc6)
On Thu, 2014-03-20 at 11:36 -0700, Linus Torvalds wrote: On Thu, Mar 20, 2014 at 10:18 AM, Davidlohr Bueso davidl...@hp.com wrote: Comparing with the patch I sent earlier this morning, it looks equivalent, and fwiw, it passes my initial qemu bootup, which is the first way of detecting anything stupid going on. So, Srikar, please try this patch out, as opposed to mine, you don't have to first revert the commit in question. Ok, so it boots for me too, so hopefully it isn't totally broken. However, since it's just closing a race, and since getting the counts wrong should easily result in it *working* but always taking the slow path (for example), I'd really like people to also verify that it fixes the actual performance issue (ie assuming it fixes powerpc behavior for Srikar, I'd like to get it double-checked that it also avoids the spinlock in the common case). Oh, it does. This atomics technique was tested at a customer's site and ready for upstream. To refresh, we were originally seeing massive contention on the hb->lock and enormous amounts of 0 returns from futex_wake, indicating that spinners were piling up just to realize that the plist was empty! While I don't have any official numbers, I can confirm that perf showed that this issue was addressed with the atomics variant. Yes, such pathological behavior shows problems in the userspace locking primitives design/implementation, but allowing the kernel not to be affected by suboptimal uses of futexes is definitely a plus. As tglx suggested at the time, I also made sure that adding the barriers when doing the key refcounting didn't impose any serious restrictions to performance either. Now, what at the time required re-testing everything was when you suggested replacing this approach with the more elegant spin-is-locked test. Both approaches showed pretty much identical performance (and correctness, at least on x86), and to this day show *significantly* less time spent in kernel space dealing with futexes. Because if the increment/decrement pairings end up being wrong, we could have a situation where the waiter count just ends up bogus, and it all works from a correctness standpoint but not from the intended performance optimization. No way I can test that sanely on my single-socket machine. Davidlohr? Not this patch, no :( -- we could never blindly reproduce the customer's workload. The only patch that I was able to create test cases for is the larger hash table one, which simply alleviates collisions. This is now part of perf-bench. Thanks, Davidlohr
Re: Tasks stuck in futex code (in 3.14-rc6)
On Thu, 2014-03-20 at 12:25 -0700, Linus Torvalds wrote: On Thu, Mar 20, 2014 at 12:08 PM, Davidlohr Bueso davidl...@hp.com wrote: Oh, it does. This atomics technique was tested at a customer's site and ready for upstream. I'm not worried about the *original* patch. I'm worried about the incremental one. Your original patch never applied to my tree - I think it was based on -mm or something. So I couldn't verify my let's go back to the explicit 'waiters' incremental patch against reverting and re-applying the original patch. Ok, so a big reason why this patch doesn't apply cleanly after reverting is because *most* of the changes were done at the top of the file with regards to documenting the ordering guarantees, the actual code changes are quite minimal. I reverted commits 99b60ce6 (documentation) and b0c29f79 (the offending commit), and then I cleanly applied the equivalent ones from v3 of the series (which was already *tested* and ready for upstream until you suggested looking into the alternative spinlock approach): https://lkml.org/lkml/2013/12/19/624 https://lkml.org/lkml/2013/12/19/630 Assuming the atomics solves the issue, would you be willing to take this path? Any pending documentation fixes can be added afterwards. The important thing is that the actual code is well tested. Thanks, Davidlohr
Re: Tasks stuck in futex code (in 3.14-rc6)
On Wed, 2014-03-19 at 18:08 +0100, Peter Zijlstra wrote: On Wed, Mar 19, 2014 at 04:47:05PM +0100, Peter Zijlstra wrote: I reverted b0c29f79ecea0b6fbcefc999e70f2843ae8306db on top of v3.14-rc6 and confirmed that reverting the commit solved the problem. Joy,.. let me look at that with ppc in mind. errr... just sat down to check email this morning. CC'ing Paul for any subtle barrier issues.
Re: Tasks stuck in futex code (in 3.14-rc6)
On Thu, 2014-03-20 at 11:03 +0530, Srikar Dronamraju wrote: Joy,.. let me look at that with ppc in mind. OK; so while pretty much all the comments from that patch are utter nonsense (what was I thinking), I cannot actually find a real bug. But could you try the below which replaces a control dependency with a full barrier. The control flow is plenty convoluted that I think the control barrier isn't actually valid anymore and that might indeed explain the fail. Unfortunately the patch didn't help. Still seeing tasks stuck:

 # ps -Ao pid,tt,user,fname,tmout,f,wchan | grep futex
 14680 pts/0 root java - 0 futex_wait_queue_me
 14797 pts/0 root java - 0 futex_wait_queue_me
 # :> /var/log/messages
 # echo t > /proc/sysrq-trigger
 # grep futex_wait_queue_me /var/log/messages | wc -l
 334

 [ 6904.211478] Call Trace:
 [ 6904.211481] [c00fa1f1b4d0] [0020] 0x20 (unreliable)
 [ 6904.211486] [c00fa1f1b6a0] [c0015208] .__switch_to+0x1e8/0x330
 [ 6904.211491] [c00fa1f1b750] [c0702f00] .__schedule+0x360/0x8b0
 [ 6904.211495] [c00fa1f1b9d0] [c0147348] .futex_wait_queue_me+0xf8/0x1a0
 [ 6904.211500] [c00fa1f1ba60] [c01486dc] .futex_wait+0x17c/0x2a0
 [ 6904.211505] [c00fa1f1bc10] [c014a614] .do_futex+0x254/0xd80
 [ 6904.211510] [c00fa1f1bd60] [c014b25c] .SyS_futex+0x11c/0x1d0
 [ 6904.238874] [c00fa1f1be30] [c000a0fc] syscall_exit+0x0/0x7c
 [ 6904.238879] java S 3fff825f6044 0 14682 14076 0x0080

Is there any other information that I can provide that would help? This problem suggests that we missed a wakeup for a task that was adding itself to the queue in a wait path. And the only place that can happen is with the hb spinlock check for any pending waiters. Just in case we missed some assumption about checking the hash bucket spinlock as a way of detecting any waiters (powerpc?), could you revert this commit and try the original atomic operations variant: https://lkml.org/lkml/2013/12/19/630