Re: [PATCH 04/17] arch, mm: move definition of node_data to generic code

2024-07-22 Thread Davidlohr Bueso

On Tue, 16 Jul 2024, Mike Rapoport wrote:

From: "Mike Rapoport (Microsoft)" 

Every architecture that supports NUMA defines node_data in the same way:

struct pglist_data *node_data[MAX_NUMNODES];

No reason to keep multiple copies of this definition and its forward
declarations, especially when such forward declaration is the only thing
in include/asm/mmzone.h for many architectures.

Add definition and declaration of node_data to generic code and drop
architecture-specific versions.
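
For reference, the consolidated form amounts to roughly the following
(a sketch, not the exact hunks of the patch; the chosen header and any
annotations such as __read_mostly may differ):

    /* generic header, e.g. include/linux/numa.h */
    extern struct pglist_data *node_data[];

    /* generic mm code, e.g. mm/numa.c */
    struct pglist_data *node_data[MAX_NUMNODES];
    EXPORT_SYMBOL(node_data);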

Signed-off-by: Mike Rapoport (Microsoft) 


Nice cleanup.

Acked-by: Davidlohr Bueso 


Re: [PATCH v3 00/35] Per-VMA locks

2023-02-27 Thread Davidlohr Bueso

On Fri, 24 Feb 2023, freak07 wrote:


Here are some measurements from a Pixel 7 Pro that's running a kernel either 
with the Per-VMA locks patchset or without.
If there's interest I can provide results of other specific apps as well.

Results are from consecutive cold app launches issued with "am start" command 
spawning the main activity of Slack Android app.

https://docs.google.com/spreadsheets/d/1ktujfcyDmIJoQMWsoizGIE-0A_jMS9VMw_seehUY9s0/edit?usp=sharing

There's quite a noticeable improvement, as can be seen in the graph. The 
results were reproducible.


Thanks for sharing. I am not surprised - after all, per-vma locks narrow some 
of the performance gaps
between vanilla and speculative pfs, with less complexity (albeit this is now a 
35 patch series :).

Thanks,
Davidlohr


Re: [PATCH v4 3/7] mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK

2023-01-27 Thread Davidlohr Bueso

On Thu, 26 Jan 2023, Suren Baghdasaryan wrote:


To simplify the usage of VM_LOCKED_CLEAR_MASK in vm_flags_clear(),
replace it with VM_LOCKED_MASK bitmask and convert all users.


Might be good to explicitly mention that there is no change in semantics, but
otherwise lgtm

Reviewed-by: Davidlohr Bueso 


Re: [PATCH v3 2/7] mm: introduce vma->vm_flags wrapper functions

2023-01-26 Thread Davidlohr Bueso

On Wed, 25 Jan 2023, Andrew Morton wrote:


On Wed, 25 Jan 2023 15:35:49 -0800 Suren Baghdasaryan  wrote:


vm_flags are among VMA attributes which affect decisions like VMA merging
and splitting. Therefore all vm_flags modifications are performed after
taking exclusive mmap_lock to prevent vm_flags updates racing with such
operations. Introduce modifier functions for vm_flags to be used whenever
flags are updated. This way we can better check and control correct
locking behavior during these updates.

...

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
+static inline void init_vm_flags(struct vm_area_struct *vma,
+static inline void reset_vm_flags(struct vm_area_struct *vma,
+static inline void set_vm_flags(struct vm_area_struct *vma,
+static inline void clear_vm_flags(struct vm_area_struct *vma,
+static inline void mod_vm_flags(struct vm_area_struct *vma,
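
The hunk above quotes only the wrapper signatures; filled in, one of
them looks roughly like this (a sketch using the patch's naming; the
exact body in the series may differ slightly):

    static inline void set_vm_flags(struct vm_area_struct *vma,
                                    unsigned long flags)
    {
            /* all vm_flags updates must hold the mmap_lock exclusively */
            mmap_assert_write_locked(vma->vm_mm);
            vma->vm_flags |= flags;
    }

so an open-coded "vma->vm_flags |= VM_LOCKED;" becomes
set_vm_flags(vma, VM_LOCKED), giving a single place to assert the
locking rules.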


vm_flags_init(), vm_flags_reset(), etc?

This would be more idiomatic and I do think the most-significant-first
naming style is preferable.


I tend to prefer this naming yes, but lgtm regardless.

Reviewed-by: Davidlohr Bueso 


Re: [PATCH 13/41] mm: introduce vma->vm_flags modifier functions

2023-01-11 Thread Davidlohr Bueso

On Wed, 11 Jan 2023, Suren Baghdasaryan wrote:


On Wed, Jan 11, 2023 at 8:13 AM Davidlohr Bueso  wrote:


On Mon, 09 Jan 2023, Suren Baghdasaryan wrote:

>To keep vma locking correctness when vm_flags are modified, add modifier
>functions to be used whenever flags are updated.

How about moving this patch and the ones that follow out of this series,
into a preliminary patchset? It would reduce the amount of noise in the
per-vma lock changes, which would then only be adding the needed
vma_write_lock()ing.


How about moving those prerequisite patches to the beginning of the
patchset (before maple_tree RCU changes)? I feel like they do belong
in the patchset because as a standalone patchset it would be unclear
why I'm adding all these accessor functions and introducing this
churn. Would that be acceptable?


imo the abstraction of vm_flags handling is worth being standalone and is
easier to pick up before a more complex locking scheme change. But
either way, it's up to you.

Thanks,
Davidlohr


Re: [PATCH 13/41] mm: introduce vma->vm_flags modifier functions

2023-01-11 Thread Davidlohr Bueso

On Mon, 09 Jan 2023, Suren Baghdasaryan wrote:


To keep vma locking correctness when vm_flags are modified, add modifier
functions to be used whenever flags are updated.


How about moving this patch and the ones that follow out of this series,
into a preliminary patchset? It would reduce the amount of noise in the
per-vma lock changes, which would then only be adding the needed
vma_write_lock()ing.

Thanks,
Davidlohr


Re: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK

2023-01-10 Thread Davidlohr Bueso

On Mon, 09 Jan 2023, Suren Baghdasaryan wrote:


This configuration variable will be used to build the support for VMA
locking during page fault handling.

This is enabled by default on supported architectures with SMP and MMU
set.

The architecture support is needed since the page fault handler is called
from the architecture's page faulting code which needs modifications to
handle faults under VMA lock.


I don't think that per-vma locking should be something that is
user-configurable. It should just be dependent on the arch. So maybe
just remove CONFIG_PER_VMA_LOCK?

Thanks,
Davidlohr


Re: [PATCH 08/10] scsi/ibmvfc: Replace tasklet with work

2022-06-28 Thread Davidlohr Bueso

On Thu, 09 Jun 2022, Sebastian Andrzej Siewior wrote:


On 2022-05-30 16:15:10 [-0700], Davidlohr Bueso wrote:

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index d0eab5700dc5..31b1900489e7 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -891,7 +891,7 @@ static void ibmvfc_release_crq_queue(struct ibmvfc_host 
*vhost)

ibmvfc_dbg(vhost, "Releasing CRQ\n");
free_irq(vdev->irq, vhost);
-   tasklet_kill(&vhost->tasklet);
+cancel_work_sync(&vhost->work);

s/ {8}/\t/

is there a reason not to use threaded interrupts?


I went with a workqueue here because the resume from suspend also schedules
async processing, so threaded irqs didn't seem like a good fit. This is also
similar to patch 2 (but in that case I overlooked the resume caller which you
pointed out).

Thanks,
Davidlohr


[PATCH 08/10] scsi/ibmvfc: Replace tasklet with work

2022-05-30 Thread Davidlohr Bueso
Tasklets have long been deprecated as being too heavy on the system
by running in irq context - and this is not a performance critical
path. If a higher priority process wants to run, it must wait for
the tasklet to finish before doing so. Use a workqueue instead and
run in task context. This does increase concurrency (tasklets are
serialized against themselves, work items are not), but the handler
runs under both the vhost's host_lock + crq.q_lock so it should be safe.

Cc: Tyrel Datwyler 
Cc: Michael Ellerman 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 21 -
 drivers/scsi/ibmvscsi/ibmvfc.h |  3 ++-
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index d0eab5700dc5..31b1900489e7 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -891,7 +891,7 @@ static void ibmvfc_release_crq_queue(struct ibmvfc_host 
*vhost)
 
ibmvfc_dbg(vhost, "Releasing CRQ\n");
free_irq(vdev->irq, vhost);
-   tasklet_kill(&vhost->tasklet);
+cancel_work_sync(&vhost->work);
do {
if (rc)
msleep(100);
@@ -3689,22 +3689,22 @@ static irqreturn_t ibmvfc_interrupt(int irq, void 
*dev_instance)
 
spin_lock_irqsave(vhost->host->host_lock, flags);
vio_disable_interrupts(to_vio_dev(vhost->dev));
-   tasklet_schedule(&vhost->tasklet);
+   schedule_work(&vhost->work);
spin_unlock_irqrestore(vhost->host->host_lock, flags);
return IRQ_HANDLED;
 }
 
 /**
- * ibmvfc_tasklet - Interrupt handler tasklet
+ * ibmvfc_work - work handler
  * @data:  ibmvfc host struct
  *
  * Returns:
  * Nothing
  **/
-static void ibmvfc_tasklet(void *data)
+static void ibmvfc_workfn(struct work_struct *work)
 {
-   struct ibmvfc_host *vhost = data;
-   struct vio_dev *vdev = to_vio_dev(vhost->dev);
+   struct ibmvfc_host *vhost;
+   struct vio_dev *vdev;
struct ibmvfc_crq *crq;
struct ibmvfc_async_crq *async;
struct ibmvfc_event *evt, *temp;
@@ -3712,6 +3712,9 @@ static void ibmvfc_tasklet(void *data)
int done = 0;
LIST_HEAD(evt_doneq);
 
+   vhost = container_of(work, struct ibmvfc_host, work);
+   vdev = to_vio_dev(vhost->dev);
+
spin_lock_irqsave(vhost->host->host_lock, flags);
spin_lock(vhost->crq.q_lock);
while (!done) {
@@ -5722,7 +5725,7 @@ static int ibmvfc_init_crq(struct ibmvfc_host *vhost)
 
retrc = 0;
 
-   tasklet_init(&vhost->tasklet, (void *)ibmvfc_tasklet, (unsigned 
long)vhost);
+   INIT_WORK(&vhost->work, ibmvfc_workfn);
 
if ((rc = request_irq(vdev->irq, ibmvfc_interrupt, 0, IBMVFC_NAME, 
vhost))) {
dev_err(dev, "Couldn't register irq 0x%x. rc=%d\n", vdev->irq, 
rc);
@@ -5738,7 +5741,7 @@ static int ibmvfc_init_crq(struct ibmvfc_host *vhost)
return retrc;
 
 req_irq_failed:
-   tasklet_kill(&vhost->tasklet);
+   cancel_work_sync(&vhost->work);
do {
rc = plpar_hcall_norets(H_FREE_CRQ, vdev->unit_address);
} while (rc == H_BUSY || H_IS_LONG_BUSY(rc));
@@ -6213,7 +6216,7 @@ static int ibmvfc_resume(struct device *dev)
 
spin_lock_irqsave(vhost->host->host_lock, flags);
vio_disable_interrupts(vdev);
-   tasklet_schedule(&vhost->tasklet);
+   schedule_work(&vhost->work);
spin_unlock_irqrestore(vhost->host->host_lock, flags);
return 0;
 }
diff --git a/drivers/scsi/ibmvscsi/ibmvfc.h b/drivers/scsi/ibmvscsi/ibmvfc.h
index 3718406e0988..7eca3622a2fa 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.h
+++ b/drivers/scsi/ibmvscsi/ibmvfc.h
@@ -12,6 +12,7 @@
 
 #include 
 #include 
+#include <linux/workqueue.h>
 #include 
 
 #define IBMVFC_NAME"ibmvfc"
@@ -892,7 +893,7 @@ struct ibmvfc_host {
char partition_name[97];
void (*job_step) (struct ibmvfc_host *);
struct task_struct *work_thread;
-   struct tasklet_struct tasklet;
+   struct work_struct work;
struct work_struct rport_add_work_q;
wait_queue_head_t init_wait_q;
wait_queue_head_t work_wait_q;
-- 
2.36.1



[PATCH 06/10] scsi/ibmvscsi_tgt: Replace work tasklet with threaded irq

2022-05-30 Thread Davidlohr Bueso
Tasklets have long been deprecated as being too heavy on the system
by running in irq context - and this is not a performance critical
path. If a higher priority process wants to run, it must wait for
the tasklet to finish before doing so. A more suitable equivalent
is to convert to a threaded irq instead and deal with the async
work in task context.

Cc: Michael Cyr 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Davidlohr Bueso 
---
 drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c | 17 +++--
 drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.h |  1 -
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c 
b/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c
index eee1a24f7e15..fafadb7158a3 100644
--- a/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c
+++ b/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c
@@ -2948,9 +2948,8 @@ static irqreturn_t ibmvscsis_interrupt(int dummy, void 
*data)
struct scsi_info *vscsi = data;
 
vio_disable_interrupts(vscsi->dma_dev);
-   tasklet_schedule(&vscsi->work_task);
 
-   return IRQ_HANDLED;
+   return IRQ_WAKE_THREAD;
 }
 
 /**
@@ -3317,7 +3316,7 @@ static int ibmvscsis_rdma(struct ibmvscsis_cmd *cmd, 
struct scatterlist *sg,
  *
  * Note: this is an edge triggered interrupt. It can not be shared.
  */
-static void ibmvscsis_handle_crq(unsigned long data)
+static irqreturn_t ibmvscsis_handle_crq(int irq, void *data)
 {
struct scsi_info *vscsi = (struct scsi_info *)data;
struct viosrp_crq *crq;
@@ -3340,7 +3339,7 @@ static void ibmvscsis_handle_crq(unsigned long data)
dev_dbg(&vscsi->dev, "handle_crq, don't process: flags 0x%x, 
state 0x%hx\n",
vscsi->flags, vscsi->state);
spin_unlock_bh(&vscsi->intr_lock);
-   return;
+   goto done;
}
 
rc = vscsi->flags & SCHEDULE_DISCONNECT;
@@ -3417,6 +3416,8 @@ static void ibmvscsis_handle_crq(unsigned long data)
vscsi->state);
 
spin_unlock_bh(&vscsi->intr_lock);
+done:
+   return IRQ_HANDLED;
 }
 
 static int ibmvscsis_probe(struct vio_dev *vdev,
@@ -3530,9 +3531,6 @@ static int ibmvscsis_probe(struct vio_dev *vdev,
dev_dbg(&vscsi->dev, "probe hrc %ld, client partition num %d\n",
hrc, vscsi->client_data.partition_number);
 
-   tasklet_init(&vscsi->work_task, ibmvscsis_handle_crq,
-(unsigned long)vscsi);
-
init_completion(&vscsi->wait_idle);
init_completion(&vscsi->unconfig);
 
@@ -3544,7 +3542,8 @@ static int ibmvscsis_probe(struct vio_dev *vdev,
goto unmap_buf;
}
 
-   rc = request_irq(vdev->irq, ibmvscsis_interrupt, 0, "ibmvscsis", vscsi);
+   rc = request_threaded_irq(vdev->irq, ibmvscsis_interrupt,
+ ibmvscsis_handle_crq, 0, "ibmvscsis", vscsi);
if (rc) {
rc = -EPERM;
dev_err(&vscsi->dev, "probe: request_irq failed, rc %d\n", rc);
@@ -3565,7 +3564,6 @@ static int ibmvscsis_probe(struct vio_dev *vdev,
 free_buf:
kfree(vscsi->map_buf);
 destroy_queue:
-   tasklet_kill(&vscsi->work_task);
ibmvscsis_unregister_command_q(vscsi);
ibmvscsis_destroy_command_q(vscsi);
 free_timer:
@@ -3602,7 +3600,6 @@ static void ibmvscsis_remove(struct vio_dev *vdev)
dma_unmap_single(&vdev->dev, vscsi->map_ioba, PAGE_SIZE,
 DMA_BIDIRECTIONAL);
kfree(vscsi->map_buf);
-   tasklet_kill(&vscsi->work_task);
ibmvscsis_destroy_command_q(vscsi);
ibmvscsis_freetimer(vscsi);
ibmvscsis_free_cmds(vscsi);
diff --git a/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.h 
b/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.h
index 7ae074e5d7a1..b66c982b8b00 100644
--- a/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.h
+++ b/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.h
@@ -295,7 +295,6 @@ struct scsi_info {
struct vio_dev *dma_dev;
struct srp_target target;
struct ibmvscsis_tport tport;
-   struct tasklet_struct work_task;
struct work_struct proc_work;
 };
 
-- 
2.36.1



[PATCH 09/10] scsi/ibmvscsi: Replace srp tasklet with work

2022-05-30 Thread Davidlohr Bueso
Tasklets have long been deprecated as being too heavy on the system
by running in irq context - and this is not a performance critical
path. If a higher priority process wants to run, it must wait for
the tasklet to finish before doing so.

Process srps asynchronously in process context in a dedicated
single threaded workqueue.

Cc: Tyrel Datwyler 
Cc: Michael Ellerman 
---
 drivers/scsi/ibmvscsi/ibmvscsi.c | 38 ++--
 drivers/scsi/ibmvscsi/ibmvscsi.h |  3 ++-
 2 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
index 63f32f843e75..37cbea8bb0af 100644
--- a/drivers/scsi/ibmvscsi/ibmvscsi.c
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
@@ -86,6 +86,8 @@ static DEFINE_SPINLOCK(ibmvscsi_driver_lock);
 
 static struct scsi_transport_template *ibmvscsi_transport_template;
 
+static struct workqueue_struct *ibmvscsi_wq;
+
 #define IBMVSCSI_VERSION "1.5.9"
 
 MODULE_DESCRIPTION("IBM Virtual SCSI");
@@ -117,7 +119,7 @@ static void ibmvscsi_handle_crq(struct viosrp_crq *crq,
  * @irq:   number of irq to handle, not used
  * @dev_instance: ibmvscsi_host_data of host that received interrupt
  *
- * Disables interrupts and schedules srp_task
+ * Disables interrupts and schedules srp_work
  * Always returns IRQ_HANDLED
  */
 static irqreturn_t ibmvscsi_handle_event(int irq, void *dev_instance)
@@ -125,7 +127,7 @@ static irqreturn_t ibmvscsi_handle_event(int irq, void 
*dev_instance)
struct ibmvscsi_host_data *hostdata =
(struct ibmvscsi_host_data *)dev_instance;
vio_disable_interrupts(to_vio_dev(hostdata->dev));
-   tasklet_schedule(&hostdata->srp_task);
+   queue_work(ibmvscsi_wq, &hostdata->srp_work);
return IRQ_HANDLED;
 }
 
@@ -145,7 +147,7 @@ static void ibmvscsi_release_crq_queue(struct crq_queue 
*queue,
long rc = 0;
struct vio_dev *vdev = to_vio_dev(hostdata->dev);
free_irq(vdev->irq, (void *)hostdata);
-   tasklet_kill(&hostdata->srp_task);
+   cancel_work_sync(&hostdata->srp_work);
do {
if (rc)
msleep(100);
@@ -206,16 +208,19 @@ static int ibmvscsi_send_crq(struct ibmvscsi_host_data 
*hostdata,
 }
 
 /**
- * ibmvscsi_task: - Process srps asynchronously
+ * ibmvscsi_workfn: - Process srps asynchronously
  * @data:  ibmvscsi_host_data of host
  */
-static void ibmvscsi_task(void *data)
+static void ibmvscsi_workfn(struct work_struct *work)
 {
-   struct ibmvscsi_host_data *hostdata = (struct ibmvscsi_host_data *)data;
-   struct vio_dev *vdev = to_vio_dev(hostdata->dev);
+   struct ibmvscsi_host_data *hostdata;
+   struct vio_dev *vdev;
struct viosrp_crq *crq;
int done = 0;
 
+   hostdata = container_of(work, struct ibmvscsi_host_data, srp_work);
+   vdev = to_vio_dev(hostdata->dev);
+
while (!done) {
/* Pull all the valid messages off the CRQ */
while ((crq = crq_queue_next_crq(&hostdata->queue)) != NULL) {
@@ -367,8 +372,7 @@ static int ibmvscsi_init_crq_queue(struct crq_queue *queue,
queue->cur = 0;
spin_lock_init(&queue->lock);
 
-   tasklet_init(&hostdata->srp_task, (void *)ibmvscsi_task,
-(unsigned long)hostdata);
+   INIT_WORK(&hostdata->srp_work, ibmvscsi_workfn);
 
if (request_irq(vdev->irq,
ibmvscsi_handle_event,
@@ -387,7 +391,7 @@ static int ibmvscsi_init_crq_queue(struct crq_queue *queue,
return retrc;
 
   req_irq_failed:
-   tasklet_kill(&hostdata->srp_task);
+   cancel_work_sync(&hostdata->srp_work);
rc = 0;
do {
if (rc)
@@ -2371,7 +2375,7 @@ static int ibmvscsi_resume(struct device *dev)
 {
struct ibmvscsi_host_data *hostdata = dev_get_drvdata(dev);
vio_disable_interrupts(to_vio_dev(hostdata->dev));
-   tasklet_schedule(&hostdata->srp_task);
+   queue_work(ibmvscsi_wq, &hostdata->srp_work);
 
return 0;
 }
@@ -2418,15 +2422,25 @@ static int __init ibmvscsi_module_init(void)
if (!ibmvscsi_transport_template)
return -ENOMEM;
 
+   ibmvscsi_wq = alloc_ordered_workqueue("ibmvscsi_wq", 0);
+   if (!ibmvscsi_wq) {
+   srp_release_transport(ibmvscsi_transport_template);
+   return -ENOMEM;
+   }
+
ret = vio_register_driver(&ibmvscsi_driver);
-   if (ret)
+   if (ret) {
+   destroy_workqueue(ibmvscsi_wq);
srp_release_transport(ibmvscsi_transport_template);
+   }
+
return ret;
 }
 
 static void __exit ibmvscsi_module_exit(void)
 {
vio_unregister_driver(&ibmvscsi_driver);
+   destroy_workqueue(ibmvscsi_wq);
srp_release_transport(ibmvscsi_transport_template);
 }
 
diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.h b/drivers/scsi/ibmvscsi/ibmvscsi.h
index e60916ef7a49..f7c52744a206 100644
--- a/drivers/scsi/ibmvscsi/ibmvscsi.h
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.h
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include <linux/workqueue.h>
 #include 
 #include 

Re: [PATCH] xsysace: Remove SYSACE driver

2021-03-22 Thread Davidlohr Bueso

Hi,

On Mon, 09 Nov 2020, Michal Simek wrote:


Sysace IP is no longer used on Xilinx PowerPC 405/440 and Microblaze
systems. The driver is not regularly tested and very likely not working for
quite a long time that's why remove it.


Is there a reason this patch was never merged? Can the driver be removed? I ran
into this as a potential tasklet user that can be replaced/removed.

Thanks,
Davidlohr



Signed-off-by: Michal Simek 
---

Based on discussion
https://lore.kernel.org/linux-arm-kernel/5ab9a2a1-20e3-c7b2-f666-2034df436...@kernel.dk/

I have grepped the kernel and found an old ppc platform. I have included
it in this patch to have a discussion about it.

---
MAINTAINERS |1 -
arch/microblaze/boot/dts/system.dts |8 -
arch/powerpc/boot/dts/icon.dts  |7 -
arch/powerpc/configs/44x/icon_defconfig |1 -
drivers/block/Kconfig   |6 -
drivers/block/Makefile  |1 -
drivers/block/xsysace.c | 1273 ---
7 files changed, 1297 deletions(-)
delete mode 100644 drivers/block/xsysace.c

diff --git a/MAINTAINERS b/MAINTAINERS
index cba8ddf87a08..38556c009758 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2741,7 +2741,6 @@ T:git https://github.com/Xilinx/linux-xlnx.git
F:  Documentation/devicetree/bindings/i2c/cdns,i2c-r1p10.yaml
F:  Documentation/devicetree/bindings/i2c/xlnx,xps-iic-2.00.a.yaml
F:  arch/arm/mach-zynq/
-F: drivers/block/xsysace.c
F:  drivers/clocksource/timer-cadence-ttc.c
F:  drivers/cpuidle/cpuidle-zynq.c
F:  drivers/edac/synopsys_edac.c
diff --git a/arch/microblaze/boot/dts/system.dts 
b/arch/microblaze/boot/dts/system.dts
index 5b236527176e..b7ee1056779e 100644
--- a/arch/microblaze/boot/dts/system.dts
+++ b/arch/microblaze/boot/dts/system.dts
@@ -310,14 +310,6 @@ RS232_Uart_1: serial@8400 {
xlnx,odd-parity = <0x0>;
xlnx,use-parity = <0x0>;
} ;
-   SysACE_CompactFlash: sysace@8360 {
-   compatible = "xlnx,xps-sysace-1.00.a";
-   interrupt-parent = <_intc_0>;
-   interrupts = < 4 2 >;
-   reg = < 0x8360 0x1 >;
-   xlnx,family = "virtex5";
-   xlnx,mem-width = <0x10>;
-   } ;
debug_module: debug@8440 {
compatible = "xlnx,mdm-1.00.d";
reg = < 0x8440 0x1 >;
diff --git a/arch/powerpc/boot/dts/icon.dts b/arch/powerpc/boot/dts/icon.dts
index fbaa60b8f87a..4fd7a4fbb4fb 100644
--- a/arch/powerpc/boot/dts/icon.dts
+++ b/arch/powerpc/boot/dts/icon.dts
@@ -197,13 +197,6 @@ partition@fa {
reg = <0x00fa 0x0006>;
};
};
-
-   SysACE_CompactFlash: sysace@1,0 {
-   compatible = "xlnx,sysace";
-   interrupt-parent = <>;
-   interrupts = <24 0x4>;
-   reg = <0x0001 0x 0x1>;
-   };
};

UART0: serial@f200 {
diff --git a/arch/powerpc/configs/44x/icon_defconfig 
b/arch/powerpc/configs/44x/icon_defconfig
index 930948a1da76..fb9a15573546 100644
--- a/arch/powerpc/configs/44x/icon_defconfig
+++ b/arch/powerpc/configs/44x/icon_defconfig
@@ -28,7 +28,6 @@ CONFIG_MTD_CFI_AMDSTD=y
CONFIG_MTD_PHYSMAP_OF=y
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_SIZE=35000
-CONFIG_XILINX_SYSACE=y
CONFIG_SCSI=y
CONFIG_BLK_DEV_SD=y
CONFIG_SCSI_CONSTANTS=y
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index ecceaaa1a66f..9cb02861298d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -388,12 +388,6 @@ config SUNVDC

source "drivers/s390/block/Kconfig"

-config XILINX_SYSACE
-   tristate "Xilinx SystemACE support"
-   depends on 4xx || MICROBLAZE
-   help
- Include support for the Xilinx SystemACE CompactFlash interface
-
config XEN_BLKDEV_FRONTEND
tristate "Xen virtual block device support"
depends on XEN
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index e1f63117ee94..5ddd9370972a 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -19,7 +19,6 @@ obj-$(CONFIG_ATARI_FLOPPY)+= ataflop.o
obj-$(CONFIG_AMIGA_Z2RAM)   += z2ram.o
obj-$(CONFIG_BLK_DEV_RAM)   += brd.o
obj-$(CONFIG_BLK_DEV_LOOP)  += loop.o
-obj-$(CONFIG_XILINX_SYSACE)+= xsysace.o
obj-$(CONFIG_CDROM_PKTCDVD) += pktcdvd.o
obj-$(CONFIG_SUNVDC)+= sunvdc.o
obj-$(CONFIG_BLK_DEV_SKD)   += skd.o
diff --git a/drivers/block/xsysace.c b/drivers/block/xsysace.c
deleted file mode 100644
index eb8ef65778c3..

[PATCH v2] powerpc/qspinlock: Use generic smp_cond_load_relaxed

2021-03-18 Thread Davidlohr Bueso
49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
busy-waiting pausing with a preferred SMT priority pattern, lowering
the priority (reducing decode cycles) during the whole loop slowpath.

However, data shows that while this pattern works well with simple
spinlocks, queued spinlocks benefit more being kept in medium priority,
with a cpu_relax() instead, being a low+medium combo on powerpc.
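
For context, the generic smp_cond_load_relaxed() that powerpc now falls
back to is essentially a READ_ONCE()/cpu_relax() polling loop; a sketch
of the asm-generic fallback (the exact macro may differ by kernel
version):

    #ifndef smp_cond_load_relaxed
    #define smp_cond_load_relaxed(ptr, cond_expr) ({        \
            typeof(ptr) __PTR = (ptr);                      \
            __unqual_scalar_typeof(*ptr) VAL;               \
            for (;;) {                                      \
                    VAL = READ_ONCE(*__PTR);                \
                    if (cond_expr)                          \
                            break;                          \
                    cpu_relax();                            \
            }                                               \
            (typeof(*ptr))VAL;                              \
    })
    #endif

On powerpc, cpu_relax() is the low+medium HMT combination mentioned
above, so the busy-wait no longer sits in low priority for the whole
loop.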

Data is from three benchmarks on a Power9: 9008-22L 64 CPUs with
2 sockets and 8 threads per core.

1. locktorture.

This is data for the lowest and most artificial/pathological level,
with increasing thread counts pounding on the lock. Metrics are total
ops/minute. Despite some small hits in the 4-8 range, scenarios are
either neutral or favorable to this patch.

+=+==+==+===+
| # tasks | vanilla  | dirty| %diff |
+=+==+==+===+
| 2   | 46718565 | 48751350 | 4.35  |
+-+--+--+---+
| 4   | 51740198 | 50369082 | -2.65 |
+-+--+--+---+
| 8   | 63756510 | 62568821 | -1.86 |
+-+--+--+---+
| 16  | 67824531 | 70966546 | 4.63  |
+-+--+--+---+
| 32  | 53843519 | 61155508 | 13.58 |
+-+--+--+---+
| 64  | 53005778 | 53104412 | 0.18  |
+-+--+--+---+
| 128 | 53331980 | 54606910 | 2.39  |
+=+==+==+===+

2. sockperf (tcp throughput)

Here a client will do one-way throughput tests to a localhost server, with
increasing message sizes, dealing with the sk_lock. This patch shows to put
the performance of the qspinlock back to par with that of the simple lock:

              simple-spinlock      vanilla              dirty
Hmean 14       73.50 (   0.00%)    54.44 * -25.93%*    73.45 *  -0.07%*
Hmean 100     654.47 (   0.00%)   385.61 * -41.08%*   771.43 *  17.87%*
Hmean 300    2719.39 (   0.00%)  2181.67 * -19.77%*  2666.50 *  -1.94%*
Hmean 500    4400.59 (   0.00%)  3390.77 * -22.95%*  4322.14 *  -1.78%*
Hmean 850    6726.21 (   0.00%)  5264.03 * -21.74%*  6863.12 *   2.04%*

3. dbench (tmpfs)

Configured to run with up to ncpusx8 clients, it shows both latency and
throughput metrics. For the latency, with the exception of the 64 case,
there is really nothing to go by:
                      vanilla              dirty
Amean latency-1        1.67 (   0.00%)       1.67 *   0.09%*
Amean latency-2        2.15 (   0.00%)       2.08 *   3.36%*
Amean latency-4        2.50 (   0.00%)       2.56 *  -2.27%*
Amean latency-8        2.49 (   0.00%)       2.48 *   0.31%*
Amean latency-16       2.69 (   0.00%)       2.72 *  -1.37%*
Amean latency-32       2.96 (   0.00%)       3.04 *  -2.60%*
Amean latency-64       7.78 (   0.00%)       8.17 *  -5.07%*
Amean latency-512    186.91 (   0.00%)     186.41 *   0.27%*

For the dbench4 Throughput (misleading but traditional) there's a small
but rather constant improvement:

                  vanilla               dirty
Hmean 1      849.13 (   0.00%)    851.51 *   0.28%*
Hmean 2     1664.03 (   0.00%)   1663.94 *  -0.01%*
Hmean 4     3073.70 (   0.00%)   3104.29 *   1.00%*
Hmean 8     5624.02 (   0.00%)   5694.16 *   1.25%*
Hmean 16    9169.49 (   0.00%)   9324.43 *   1.69%*
Hmean 32   11969.37 (   0.00%)  12127.09 *   1.32%*
Hmean 64   15021.12 (   0.00%)  15243.14 *   1.48%*
Hmean 512  14891.27 (   0.00%)  15162.11 *   1.82%*

Measuring the dbench4 Per-VFS Operation latency, shows some very minor
differences within the noise level, around the 0-1% ranges.

Fixes: 49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed())
Acked-by: Nicholas Piggin 
Signed-off-by: Davidlohr Bueso 
---
Changes from v1:
Added small description and labeling smp_cond_load_relaxed requested by Nick.
Added Nick's ack.

 arch/powerpc/include/asm/barrier.h   | 16 
 arch/powerpc/include/asm/qspinlock.h |  7 +++
 2 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/barrier.h 
b/arch/powerpc/include/asm/barrier.h
index aecfde829d5d..7ae29cfb06c0 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -80,22 +80,6 @@ do { 
\
___p1;  \
 })
 
-#ifdef CONFIG_PPC64
-#define smp_cond_load_relaxed(ptr, cond_expr) ({   \
-   typeof(ptr) __PTR = (ptr);  \
-   __unqual_scalar_typeof(*ptr) VAL;   \
-   VAL = READ_ONCE(*__PTR);\
-   if (unlikely(!(cond_expr))) {   \
-   spin_begin

Re: [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed

2021-03-18 Thread Davidlohr Bueso

On Tue, 16 Mar 2021, Nicholas Piggin wrote:


One request, could you add a comment in place that references
smp_cond_load_relaxed() so this commit can be found again if
someone looks at it? Something like this

/*
* smp_cond_load_relaxed was found to have performance problems if
* implemented with spin_begin()/spin_end().
*/


Sure, let me see where I can fit that in and send out a v2.

Similarly, but unrelated to this patch, is there any chance we could
remove the whole spin_until_cond() machinery and make it specific to
powerpc? This was introduced in 2017 and doesn't really have any users
outside of powerpc, except for these:

drivers/firmware/arm_scmi/driver.c: 
spin_until_cond(scmi_xfer_done_no_timeout(cinfo, xfer, stop));
drivers/firmware/arm_scmi/shmem.c:  
spin_until_cond(ioread32(&shmem->channel_status) &
drivers/net/ethernet/xilinx/ll_temac_main.c:
spin_until_cond(hard_acs_rdy_or_timeout(lp, timeout));

... which afaict only the xilinx one can actually build on powerpc.
Regardless, these could be converted to smp_cond_load_relaxed(), being
the more standard way to do optimized busy-waiting, caring more about
the family of barriers than ad-hoc SMT priorities. Of course, I have
no way of testing any of these changes.
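
A purely illustrative shape of such a conversion (not a tested patch;
"flag" is a made-up variable polling ordinary memory):

    /* before */
    spin_until_cond(READ_ONCE(*flag));

    /* after */
    smp_cond_load_relaxed(flag, VAL);

The MMIO pollers above would need more care, since smp_cond_load_relaxed()
reads the location directly rather than through ioread32().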


I wonder if it should have a Fixes: tag to the original commit as
well.


I'm not sure either. I've actually been informed recently of other
workloads that benefit from the revert on large Power9 boxes. So I'll
go ahead and add it.



Otherwise,

Acked-by: Nicholas Piggin 


Thanks,
Davidlohr


Re: [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed

2021-03-09 Thread Davidlohr Bueso

On Tue, 09 Mar 2021, Michal Suchánek wrote:


On Mon, Mar 08, 2021 at 05:59:50PM -0800, Davidlohr Bueso wrote:

49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
busy-waiting pausing with a preferred SMT priority pattern, lowering
the priority (reducing decode cycles) during the whole loop slowpath.

However, data shows that while this pattern works well with simple

 ^^

spinlocks, queued spinlocks benefit more being kept in medium priority,
with a cpu_relax() instead, being a low+medium combo on powerpc.

...


diff --git a/arch/powerpc/include/asm/barrier.h 
b/arch/powerpc/include/asm/barrier.h
index aecfde829d5d..7ae29cfb06c0 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -80,22 +80,6 @@ do { 
\
___p1;  \
 })

-#ifdef CONFIG_PPC64

Maybe it should be kept for the simple spinlock case then?


It is kept, note that simple spinlocks don't use smp_cond_load_relaxed,
but instead deal with the priorities in arch_spin_lock(), so it will
spin in low priority until it sees a chance to take the lock, where
it switches back to medium.

Thanks,
Davidlohr


[PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed

2021-03-08 Thread Davidlohr Bueso
49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
busy-waiting pausing with a preferred SMT priority pattern, lowering
the priority (reducing decode cycles) during the whole loop slowpath.

However, data shows that while this pattern works well with simple
spinlocks, queued spinlocks benefit more being kept in medium priority,
with a cpu_relax() instead, being a low+medium combo on powerpc.

Data is from three benchmarks on a Power9: 9008-22L 64 CPUs with
2 sockets and 8 threads per core.

1. locktorture.

This is data for the lowest and most artificial/pathological level,
with increasing thread counts pounding on the lock. Metrics are total
ops/minute. Despite some small hits in the 4-8 range, scenarios are
either neutral or favorable to this patch.

+=+==+==+===+
| # tasks | vanilla  | dirty| %diff |
+=+==+==+===+
| 2   | 46718565 | 48751350 | 4.35  |
+-+--+--+---+
| 4   | 51740198 | 50369082 | -2.65 |
+-+--+--+---+
| 8   | 63756510 | 62568821 | -1.86 |
+-+--+--+---+
| 16  | 67824531 | 70966546 | 4.63  |
+-+--+--+---+
| 32  | 53843519 | 61155508 | 13.58 |
+-+--+--+---+
| 64  | 53005778 | 53104412 | 0.18  |
+-+--+--+---+
| 128 | 53331980 | 54606910 | 2.39  |
+=+==+==+===+

2. sockperf (tcp throughput)

Here a client will do one-way throughput tests to a localhost server, with
increasing message sizes, dealing with the sk_lock. This patch shows to put
the performance of the qspinlock back to par with that of the simple lock:

              simple-spinlock      vanilla              dirty
Hmean 14       73.50 (   0.00%)    54.44 * -25.93%*    73.45 *  -0.07%*
Hmean 100     654.47 (   0.00%)   385.61 * -41.08%*   771.43 *  17.87%*
Hmean 300    2719.39 (   0.00%)  2181.67 * -19.77%*  2666.50 *  -1.94%*
Hmean 500    4400.59 (   0.00%)  3390.77 * -22.95%*  4322.14 *  -1.78%*
Hmean 850    6726.21 (   0.00%)  5264.03 * -21.74%*  6863.12 *   2.04%*

3. dbench (tmpfs)

Configured to run with up to ncpusx8 clients, it shows both latency and
throughput metrics. For the latency, with the exception of the 64 case,
there is really nothing to go by:
                      vanilla              dirty
Amean latency-1        1.67 (   0.00%)       1.67 *   0.09%*
Amean latency-2        2.15 (   0.00%)       2.08 *   3.36%*
Amean latency-4        2.50 (   0.00%)       2.56 *  -2.27%*
Amean latency-8        2.49 (   0.00%)       2.48 *   0.31%*
Amean latency-16       2.69 (   0.00%)       2.72 *  -1.37%*
Amean latency-32       2.96 (   0.00%)       3.04 *  -2.60%*
Amean latency-64       7.78 (   0.00%)       8.17 *  -5.07%*
Amean latency-512    186.91 (   0.00%)     186.41 *   0.27%*

For the dbench4 Throughput (misleading but traditional) there's a small
but rather constant improvement:

                  vanilla               dirty
Hmean 1      849.13 (   0.00%)    851.51 *   0.28%*
Hmean 2     1664.03 (   0.00%)   1663.94 *  -0.01%*
Hmean 4     3073.70 (   0.00%)   3104.29 *   1.00%*
Hmean 8     5624.02 (   0.00%)   5694.16 *   1.25%*
Hmean 16    9169.49 (   0.00%)   9324.43 *   1.69%*
Hmean 32   11969.37 (   0.00%)  12127.09 *   1.32%*
Hmean 64   15021.12 (   0.00%)  15243.14 *   1.48%*
Hmean 512  14891.27 (   0.00%)  15162.11 *   1.82%*

Measuring the dbench4 Per-VFS Operation latency, shows some very minor
differences within the noise level, around the 0-1% ranges.

Signed-off-by: Davidlohr Bueso 
---
 arch/powerpc/include/asm/barrier.h | 16 
 1 file changed, 16 deletions(-)

diff --git a/arch/powerpc/include/asm/barrier.h 
b/arch/powerpc/include/asm/barrier.h
index aecfde829d5d..7ae29cfb06c0 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -80,22 +80,6 @@ do { 
\
___p1;  \
 })
 
-#ifdef CONFIG_PPC64
-#define smp_cond_load_relaxed(ptr, cond_expr) ({   \
-   typeof(ptr) __PTR = (ptr);  \
-   __unqual_scalar_typeof(*ptr) VAL;   \
-   VAL = READ_ONCE(*__PTR);\
-   if (unlikely(!(cond_expr))) {   \
-   spin_begin();   \
-   do {\
-   VAL = READ_ONCE(*__PTR);\
-   } while (!(cond_expr)); \
-   spin_end

[PATCH 2/3] powerpc/spinlock: Unserialize spin_is_locked

2021-03-08 Thread Davidlohr Bueso
c6f5d02b6a0f (locking/spinlocks/arm64: Remove smp_mb() from
arch_spin_is_locked()) made it pretty official that the call
semantics do not imply any sort of barriers, and any user that
gets creative must explicitly do any serialization.

This creativity, however, is nowadays pretty limited:

1. spin_unlock_wait() has been removed from the kernel in favor
of a lock/unlock combo. Furthermore, queued spinlocks have now
for a number of years no longer relied on _Q_LOCKED_VAL for the
call, but any non-zero value to indicate a locked state. There
were cases where the delayed locked store could lead to breaking
mutual exclusion with crossed locking; such as with sysv ipc and
netfilter being the most extreme.

2. The auditing Andrea did in verified that remaining spin_is_locked()
no longer rely on such semantics. Most callers just use it to assert
a lock is taken, in a debug nature. The only user that gets cute is
NOLOCK qdisc, as of:

   96009c7d500e (sched: replace __QDISC_STATE_RUNNING bit with a spin lock)

... which ironically went in the next day after c6f5d02b6a0f. This
change replaces test_bit() with spin_is_locked() to know whether
to take the busylock heuristic to reduce contention on the main
qdisc lock. So any races against spin_is_locked() for archs that
use LL/SC for spin_lock() will be benign and not break any mutual
exclusion; furthermore, both the seqlock and busylock have the same
scope.
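
As an illustration (hypothetical names, not part of the patch): a
caller that really does want spin_is_locked() ordered against a prior
store now has to spell the barrier out itself, e.g.:

    WRITE_ONCE(x, 1);
    smp_mb();       /* order the store above vs. the lock-state read */
    if (spin_is_locked(&some_lock))
            do_something();

The assertion-style users mentioned above need nothing extra.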

Cc: parri.and...@gmail.com
Cc: pab...@redhat.com
Signed-off-by: Davidlohr Bueso 
---
 arch/powerpc/include/asm/qspinlock.h   | 12 
 arch/powerpc/include/asm/simple_spinlock.h |  3 +--
 2 files changed, 1 insertion(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h 
b/arch/powerpc/include/asm/qspinlock.h
index 3ce1a0bee4fe..b052b0624816 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -44,18 +44,6 @@ static __always_inline void queued_spin_lock(struct 
qspinlock *lock)
 }
 #define queued_spin_lock queued_spin_lock
 
-static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
-{
-   /*
-* This barrier was added to simple spinlocks by commit 51d7d5205d338,
-* but it should now be possible to remove it, asm arm64 has done with
-* commit c6f5d02b6a0f.
-*/
-   smp_mb();
-   return atomic_read(&lock->val);
-}
-#define queued_spin_is_locked queued_spin_is_locked
-
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 #define SPIN_THRESHOLD (1<<15) /* not tuned */
 
diff --git a/arch/powerpc/include/asm/simple_spinlock.h 
b/arch/powerpc/include/asm/simple_spinlock.h
index 3e87258f73b1..1b935396522a 100644
--- a/arch/powerpc/include/asm/simple_spinlock.h
+++ b/arch/powerpc/include/asm/simple_spinlock.h
@@ -38,8 +38,7 @@ static __always_inline int 
arch_spin_value_unlocked(arch_spinlock_t lock)
 
 static inline int arch_spin_is_locked(arch_spinlock_t *lock)
 {
-   smp_mb();
-   return !arch_spin_value_unlocked(*lock);
+   return !arch_spin_value_unlocked(READ_ONCE(*lock));
 }
 
 /*
-- 
2.26.2



[PATCH 1/3] powerpc/spinlock: Define smp_mb__after_spinlock only once

2021-03-08 Thread Davidlohr Bueso
Instead of both queued and simple spinlocks doing it. Move
it into the arch's spinlock.h.

Signed-off-by: Davidlohr Bueso 
---
 arch/powerpc/include/asm/qspinlock.h   | 2 --
 arch/powerpc/include/asm/simple_spinlock.h | 3 ---
 arch/powerpc/include/asm/spinlock.h| 3 +++
 3 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/qspinlock.h 
b/arch/powerpc/include/asm/qspinlock.h
index b752d34517b3..3ce1a0bee4fe 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -44,8 +44,6 @@ static __always_inline void queued_spin_lock(struct qspinlock 
*lock)
 }
 #define queued_spin_lock queued_spin_lock
 
-#define smp_mb__after_spinlock()   smp_mb()
-
 static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
/*
diff --git a/arch/powerpc/include/asm/simple_spinlock.h 
b/arch/powerpc/include/asm/simple_spinlock.h
index 9c3c30534333..3e87258f73b1 100644
--- a/arch/powerpc/include/asm/simple_spinlock.h
+++ b/arch/powerpc/include/asm/simple_spinlock.h
@@ -282,7 +282,4 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
 #define arch_read_relax(lock)  rw_yield(lock)
 #define arch_write_relax(lock) rw_yield(lock)
 
-/* See include/linux/spinlock.h */
-#define smp_mb__after_spinlock()   smp_mb()
-
 #endif /* _ASM_POWERPC_SIMPLE_SPINLOCK_H */
diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 6ec72282888d..bd75872a6334 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -10,6 +10,9 @@
 #include 
 #endif
 
+/* See include/linux/spinlock.h */
+#define smp_mb__after_spinlock()   smp_mb()
+
 #ifndef CONFIG_PARAVIRT_SPINLOCKS
 static inline void pv_spinlocks_init(void) { }
 #endif
-- 
2.26.2



[PATCH 0/3] powerpc/qspinlock: Some tuning updates

2021-03-08 Thread Davidlohr Bueso
Hi,

A few updates while going through the powerpc port of the qspinlock.

Patches 1 and 2 are straightforward, while patch 3 can be considered
more of an rfc as I've only tested on a single machine, and there
could be an alternative way if it doesn't end up being nearly a
universal performance win.

Thanks!

Davidlohr Bueso (3):
  powerpc/spinlock: Define smp_mb__after_spinlock only once
  powerpc/spinlock: Unserialize spin_is_locked
  powerpc/qspinlock: Use generic smp_cond_load_relaxed

 arch/powerpc/include/asm/barrier.h | 16 
 arch/powerpc/include/asm/qspinlock.h   | 14 --
 arch/powerpc/include/asm/simple_spinlock.h |  6 +-
 arch/powerpc/include/asm/spinlock.h|  3 +++
 4 files changed, 4 insertions(+), 35 deletions(-)

--
2.26.2



Re: [PATCH 18/15] kvm: Replace vcpu->swait with rcuwait

2020-03-22 Thread Davidlohr Bueso

On Fri, 20 Mar 2020, Peter Zijlstra wrote:


On Fri, Mar 20, 2020 at 01:55:26AM -0700, Davidlohr Bueso wrote:

-   swait_event_interruptible_exclusive(*wq, ((!vcpu->arch.power_off) &&
-  (!vcpu->arch.pause)));
+   rcuwait_wait_event(*wait,
+  (!vcpu->arch.power_off) && (!vcpu->arch.pause),
+  TASK_INTERRUPTIBLE);



-   for (;;) {
-   prepare_to_swait_exclusive(&vcpu->wq, &wait, 
TASK_INTERRUPTIBLE);
-
-   if (kvm_vcpu_check_block(vcpu) < 0)
-   break;
-
-   waited = true;
-   schedule();
-   }
-
-   finish_swait(&vcpu->wq, &wait);
+   rcuwait_wait_event(&vcpu->wait,
+  (block_check = kvm_vcpu_check_block(vcpu)) < 0,
+  TASK_INTERRUPTIBLE);


Are these yet more instances that really want to be TASK_IDLE ?


Hmm probably as it makes sense for a blocked vcpu not to be contributing to
the loadavg. So if this is the only reason to use interruptible, then yes we
ought to change it.

However, I'll make this a separate patch, given this (ab)use isn't as obvious
as the PS3 case, which is a kthread and therefore signals are masked.

Thanks,
Davidlohr


Re: [patch V3 00/20] Lock ordering documentation and annotation for lockdep

2020-03-21 Thread Davidlohr Bueso

On Sat, 21 Mar 2020, Thomas Gleixner wrote:


This is the third and hopefully final version of this work. The second one
can be found here:


Would you rather I send in a separate series with the kvm changes, or
should I just send a v2 with the fixes here again?

Thanks,
Davidlohr


Re: [patch V2 11/15] completion: Use simple wait queues

2020-03-20 Thread Davidlohr Bueso

On Wed, 18 Mar 2020, Thomas Gleixner wrote:


From: Thomas Gleixner 

completion uses a wait_queue_head_t to enqueue waiters.

wait_queue_head_t contains a spinlock_t to protect the list of waiters
which excludes it from being used in truly atomic context on a PREEMPT_RT
enabled kernel.

The spinlock in the wait queue head cannot be replaced by a raw_spinlock
because:

 - wait queues can have custom wakeup callbacks, which acquire other
   spinlock_t locks and have potentially long execution times

 - wake_up() walks an unbounded number of list entries during the wake up
   and may wake an unbounded number of waiters.

For simplicity and performance reasons complete() should be usable on
PREEMPT_RT enabled kernels.

completions do not use custom wakeup callbacks and are usually single
waiter, except for a few corner cases.

Replace the wait queue in the completion with a simple wait queue (swait),
which uses a raw_spinlock_t for protecting the waiter list and therefore is
safe to use inside truly atomic regions on PREEMPT_RT.

There is no semantical or functional change:

 - completions use the exclusive wait mode which is what swait provides

 - complete() wakes one exclusive waiter

 - complete_all() wakes all waiters while holding the lock which protects
   the wait queue against newly incoming waiters. The conversion to swait
   preserves this behaviour.

complete_all() might cause unbound latencies with a large number of waiters
being woken at once, but most complete_all() usage sites are either in
testing or initialization code or have only a really small number of
concurrent waiters which for now does not cause a latency problem. Keep it
simple for now.

The fixup of the warning check in the USB gadget driver is just a straight
forward conversion of the lockless waiter check from one waitqueue type to
the other.

Signed-off-by: Thomas Gleixner 
Cc: Arnd Bergmann 


Reviewed-by: Davidlohr Bueso 


Re: [patch V2 06/15] rcuwait: Add @state argument to rcuwait_wait_event()

2020-03-20 Thread Davidlohr Bueso

On Fri, 20 Mar 2020, Sebastian Andrzej Siewior wrote:


I thought that v2 had it fixed with the previous commit (acpi: Remove
header dependency). The kbot just reported that everything is fine.
Let me look…


Nah my bad, that build did not have the full series applied :)

Sorry for the noise.

Thanks,
Davidlohr


[PATCH 19/15] sched/swait: Reword some of the main description

2020-03-20 Thread Davidlohr Bueso
With the increased use of swait, and kvm no longer using it, we
can reword some of the comments. While removing Linus' valid rant,
I've also made sure to explicitly mention that swait is very
different from regular wait. In addition, the comment now advises
against using swait in favor of the regular flavor.

Signed-off-by: Davidlohr Bueso 
---
 include/linux/swait.h | 23 +--
 1 file changed, 5 insertions(+), 18 deletions(-)

diff --git a/include/linux/swait.h b/include/linux/swait.h
index 73e06e9986d4..6e5b5d0e64fd 100644
--- a/include/linux/swait.h
+++ b/include/linux/swait.h
@@ -9,23 +9,10 @@
 #include 
 
 /*
- * BROKEN wait-queues.
- *
- * These "simple" wait-queues are broken garbage, and should never be
- * used. The comments below claim that they are "similar" to regular
- * wait-queues, but the semantics are actually completely different, and
- * every single user we have ever had has been buggy (or pointless).
- *
- * A "swake_up_one()" only wakes up _one_ waiter, which is not at all what
- * "wake_up()" does, and has led to problems. In other cases, it has
- * been fine, because there's only ever one waiter (kvm), but in that
- * case gthe whole "simple" wait-queue is just pointless to begin with,
- * since there is no "queue". Use "wake_up_process()" with a direct
- * pointer instead.
- *
- * While these are very similar to regular wait queues (wait.h) the most
- * important difference is that the simple waitqueue allows for deterministic
- * behaviour -- IOW it has strictly bounded IRQ and lock hold times.
+ * Simple waitqueues are semantically very different to regular wait queues
+ * (wait.h). The most important difference is that the simple waitqueue allows
+ * for deterministic behaviour -- IOW it has strictly bounded IRQ and lock hold
+ * times.
  *
  * Mainly, this is accomplished by two things. Firstly not allowing 
swake_up_all
  * from IRQ disabled, and dropping the lock upon every wakeup, giving a higher
@@ -39,7 +26,7 @@
  *sleeper state.
  *
  *  - the !exclusive mode; because that leads to O(n) wakeups, everything is
- *exclusive.
+ *exclusive. As such swait_wake_up_one will only ever awake _one_ waiter.
  *
  *  - custom wake callback functions; because you cannot give any guarantees
  *about random code. This also allows swait to be used in RT, such that
-- 
2.16.4



[PATCH 18/15] kvm: Replace vcpu->swait with rcuwait

2020-03-20 Thread Davidlohr Bueso
The use of any sort of waitqueue (simple or regular) for
wait/waking vcpus has always been an overkill and semantically
wrong. Because this is per-vcpu (which is blocked) there is
only ever a single waiting vcpu, thus no need for any sort of
queue.

As such, make use of the rcuwait primitive, with the following
considerations:

  - rcuwait already provides the proper barriers that serialize
  concurrent waiter and waker.

  - Task wakeup is done in rcu read critical region, with a
  stable task pointer.

  - Because there is no concurrency among waiters, we need
  not worry about rcuwait_wait_event() calls corrupting
  the wait->task. As a consequence, this saves the locking
  done in swait when adding to the queue.

The x86-tscdeadline_latency test mentioned in 8577370fb0cb
("KVM: Use simple waitqueue for vcpu->wq") shows that, on avg,
latency is reduced by around 15% with this change.
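
A minimal usage sketch of the primitive (hypothetical structure and
stat counter, not code from this patch):

    struct vcpu_like {
            struct rcuwait wait;    /* rcuwait_init() at creation */
            bool runnable;
    };

    /* waiter side: only ever one task blocks here */
    rcuwait_wait_event(&v->wait, READ_ONCE(v->runnable),
                       TASK_INTERRUPTIBLE);

    /* waker side */
    WRITE_ONCE(v->runnable, true);
    if (rcuwait_wake_up(&v->wait))  /* bool return, see patch 17/15 */
            ++halt_wakeups;         /* hypothetical stat */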

Cc: Paolo Bonzini 
Signed-off-by: Davidlohr Bueso 
---

Only compiled and tested on x86.

 arch/powerpc/include/asm/kvm_host.h |  2 +-
 arch/powerpc/kvm/book3s_hv.c| 10 --
 arch/x86/kvm/lapic.c|  2 +-
 include/linux/kvm_host.h| 10 +-
 virt/kvm/arm/arch_timer.c   |  2 +-
 virt/kvm/arm/arm.c  |  9 +
 virt/kvm/async_pf.c |  3 +--
 virt/kvm/kvm_main.c | 33 +
 8 files changed, 31 insertions(+), 40 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 6e8b8ffd06ad..e2b4a1e3fb7d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -752,7 +752,7 @@ struct kvm_vcpu_arch {
u8 irq_pending; /* Used by XIVE to signal pending guest irqs */
u32 last_inst;
 
-   struct swait_queue_head *wqp;
+   struct rcuwait *waitp;
struct kvmppc_vcore *vcore;
int ret;
int trap;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2cefd071b848..c7cbc4bd06e9 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -231,13 +231,11 @@ static bool kvmppc_ipi_thread(int cpu)
 static void kvmppc_fast_vcpu_kick_hv(struct kvm_vcpu *vcpu)
 {
int cpu;
-   struct swait_queue_head *wqp;
+   struct rcuwait *wait;
 
-   wqp = kvm_arch_vcpu_wq(vcpu);
-   if (swq_has_sleeper(wqp)) {
-   swake_up_one(wqp);
+   wait = kvm_arch_vcpu_get_wait(vcpu);
+   if (rcuwait_wake_up(wait))
++vcpu->stat.halt_wakeup;
-   }
 
cpu = READ_ONCE(vcpu->arch.thread_cpu);
if (cpu >= 0 && kvmppc_ipi_thread(cpu))
@@ -4274,7 +4272,7 @@ static int kvmppc_vcpu_run_hv(struct kvm_run *run, struct 
kvm_vcpu *vcpu)
}
user_vrsave = mfspr(SPRN_VRSAVE);
 
-   vcpu->arch.wqp = &vcpu->arch.vcore->wq;
+   vcpu->arch.waitp = &vcpu->arch.vcore->wait;
vcpu->arch.pgdir = kvm->mm->pgd;
vcpu->arch.state = KVMPPC_VCPU_BUSY_IN_HOST;
 
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e3099c642fec..a4420c26dfbc 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1815,7 +1815,7 @@ void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu)
/* If the preempt notifier has already run, it also called 
apic_timer_expired */
if (!apic->lapic_timer.hv_timer_in_use)
goto out;
-   WARN_ON(swait_active(&vcpu->wq));
+   WARN_ON(rcu_dereference(vcpu->wait.task));
cancel_hv_timer(apic);
apic_timer_expired(apic);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bcb9b2ac0791..b5694429aede 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -23,7 +23,7 @@
 #include 
 #include 
 #include 
-#include <linux/swait.h>
+#include <linux/rcuwait.h>
 #include 
 #include 
 #include 
@@ -277,7 +277,7 @@ struct kvm_vcpu {
struct mutex mutex;
struct kvm_run *run;
 
-   struct swait_queue_head wq;
+   struct rcuwait wait;
struct pid __rcu *pid;
int sigset_active;
sigset_t sigset;
@@ -952,12 +952,12 @@ static inline bool kvm_arch_has_assigned_device(struct 
kvm *kvm)
 }
 #endif
 
-static inline struct swait_queue_head *kvm_arch_vcpu_wq(struct kvm_vcpu *vcpu)
+static inline struct rcuwait *kvm_arch_vcpu_get_wait(struct kvm_vcpu *vcpu)
 {
 #ifdef __KVM_HAVE_ARCH_WQP
-   return vcpu->arch.wqp;
+   return vcpu->arch.wait;
 #else
-   return &vcpu->wq;
+   return &vcpu->wait;
 #endif
 }
 
diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
index 0d9438e9de2a..4be71cb58691 100644
--- a/virt/kvm/arm/arch_timer.c
+++ b/virt/kvm/arm/arch_timer.c
@@ -593,7 +593,7 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
if (map.emul_ptimer)
soft_timer_cancel(&map.emul_ptimer->hrtimer);
 
-   if (swait_active(kvm_arch_vcpu_wq(vcpu)))
+   if (rcu_dereference(kvm_arch_v

[PATCH 17/15] rcuwait: Inform rcuwait_wake_up() users if a wakeup was attempted

2020-03-20 Thread Davidlohr Bueso
Let the caller know if wake_up_process() was actually called or not;
some users can use this information for ad-hoc decisions. Of course returning
true does not guarantee that wake_up_process() actually woke anything
up.

Signed-off-by: Davidlohr Bueso 
---
 include/linux/rcuwait.h |  2 +-
 kernel/exit.c   | 10 --
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/linux/rcuwait.h b/include/linux/rcuwait.h
index 6e8798458091..3f83b9a12ad3 100644
--- a/include/linux/rcuwait.h
+++ b/include/linux/rcuwait.h
@@ -24,7 +24,7 @@ static inline void rcuwait_init(struct rcuwait *w)
w->task = NULL;
 }
 
-extern void rcuwait_wake_up(struct rcuwait *w);
+extern bool rcuwait_wake_up(struct rcuwait *w);
 
 /*
  * The caller is responsible for locking around rcuwait_wait_event(),
diff --git a/kernel/exit.c b/kernel/exit.c
index 6cc6cc485d07..b0bb0a8ec4b1 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -234,9 +234,10 @@ void release_task(struct task_struct *p)
goto repeat;
 }
 
-void rcuwait_wake_up(struct rcuwait *w)
+bool rcuwait_wake_up(struct rcuwait *w)
 {
struct task_struct *task;
+   bool ret = false;
 
rcu_read_lock();
 
@@ -254,10 +255,15 @@ void rcuwait_wake_up(struct rcuwait *w)
smp_mb(); /* (B) */
 
task = rcu_dereference(w->task);
-   if (task)
+   if (task) {
wake_up_process(task);
+   ret = true;
+   }
rcu_read_unlock();
+
+   return ret;
 }
+EXPORT_SYMBOL_GPL(rcuwait_wake_up);
 
 /*
  * Determine if a process group is "orphaned", according to the POSIX
-- 
2.16.4



[PATCH 16/15] rcuwait: Get rid of stale name comment

2020-03-20 Thread Davidlohr Bueso
The 'trywake' name was renamed to simply 'wake',
update the comment.

Signed-off-by: Davidlohr Bueso 
---
 kernel/exit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 0b81b26a872a..6cc6cc485d07 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -243,7 +243,7 @@ void rcuwait_wake_up(struct rcuwait *w)
/*
 * Order condition vs @task, such that everything prior to the load
 * of @task is visible. This is the condition as to why the user called
-* rcuwait_trywake() in the first place. Pairs with set_current_state()
+* rcuwait_wake() in the first place. Pairs with set_current_state()
 * barrier (A) in rcuwait_wait_event().
 *
 *WAITWAKE
-- 
2.16.4



Re: [patch V2 00/15] Lock ordering documentation and annotation for lockdep

2020-03-20 Thread Davidlohr Bueso

On Wed, 18 Mar 2020, Thomas Gleixner wrote:

   The PS3 one got converted by Peter Zijlstra to rcu_wait().


While at it, I think it makes sense to finally convert the kvm vcpu swait
to rcuwait (patch 6/15 starts the necessary api changes). I'm sending
some patches on top of this patchset.

Thanks,
Davidlohr


Re: [patch V2 06/15] rcuwait: Add @state argument to rcuwait_wait_event()

2020-03-19 Thread Davidlohr Bueso

On Wed, 18 Mar 2020, Thomas Gleixner wrote:


--- a/include/linux/rcuwait.h
+++ b/include/linux/rcuwait.h
@@ -3,6 +3,7 @@
#define _LINUX_RCUWAIT_H_

#include 
+#include <linux/sched/signal.h>


So this is causing build to fail for me:

 CC  arch/x86/boot/compressed/cmdline.o
arch/x86/boot/compressed/cmdline.c:5:20: error: conflicting types for 
‘set_fs’
static inline void set_fs(unsigned long seg)
   ^~
In file included from ./include/linux/uaccess.h:11:0,
from ./include/linux/sched/task.h:11,
from ./include/linux/sched/signal.h:9,
from ./include/linux/rcuwait.h:6,
from ./include/linux/percpu-rwsem.h:8,
from ./include/linux/fs.h:34,
from ./include/linux/proc_fs.h:9,
from ./include/acpi/acpi_bus.h:83,
from ./include/linux/acpi.h:32,
from arch/x86/boot/compressed/misc.h:28,
from arch/x86/boot/compressed/cmdline.c:2:
./arch/x86/include/asm/uaccess.h:29:20: note: previous definition of 
‘set_fs’ was here
static inline void set_fs(mm_segment_t fs)
   ^~
make[2]: *** [scripts/Makefile.build:268: arch/x86/boot/compressed/cmdline.o] 
Error 1
make[1]: *** [arch/x86/boot/Makefile:113: arch/x86/boot/compressed/vmlinux] 
Error 2
make: *** [arch/x86/Makefile:285: bzImage] Error 2

Right now I'm not sure what the proper fix should be.

Thanks,
Davidlohr


Re: [patch V2 07/15] powerpc/ps3: Convert half completion to rcuwait

2020-03-19 Thread Davidlohr Bueso

On Wed, 18 Mar 2020, Thomas Gleixner wrote:

AFAICT the kthread uses TASK_INTERRUPTIBLE to not increase loadavg, kthreads
cannot receive signals by default and this one doesn't look different. Use
TASK_IDLE instead.


Hmm, it seems in general this needs to be done kernel-wide. This kthread abuse of
TASK_INTERRUPTIBLE seems to be a common thing. There are also the users doing
schedule_timeout_interruptible()...
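
For the plain timed sleeps, the conversion is usually a one-liner
(illustrative only):

    schedule_timeout_interruptible(msecs_to_jiffies(100));

becomes

    schedule_timeout_idle(msecs_to_jiffies(100));

since schedule_timeout_idle() sleeps in TASK_IDLE, which neither
reacts to signals nor contributes to loadavg.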

Thanks,
Davidlohr


Re: [patch V2 08/15] Documentation: Add lock ordering and nesting documentation

2020-03-19 Thread Davidlohr Bueso

On Wed, 18 Mar 2020, Thomas Gleixner wrote:

+Owner semantics
+===
+
+Most lock types in the Linux kernel have strict owner semantics, i.e. the
+context (task) which acquires a lock has to release it.
+
+There are two exceptions:
+
+  - semaphores
+  - rwsems
+
+semaphores have no strict owner semantics for historical reasons. They are


I would rephrase this to:

semaphores have no owner semantics for historical reasons, and as such
trylock and release operations can be called from interrupt context. They
are ...


+often used for both serialization and waiting purposes. That's generally
+discouraged and should be replaced by separate serialization and wait
+mechanisms.

   ^ , such as mutexes or completions.


+
+rwsems have grown interfaces which allow non owner release for special
+purposes. This usage is problematic on PREEMPT_RT because PREEMPT_RT
+substitutes all locking primitives except semaphores with RT-mutex based
+implementations to provide priority inheritance for all lock types except
+the truly spinning ones. Priority inheritance on ownerless locks is
+obviously impossible.
+
+For now the rwsem non-owner release excludes code which utilizes it from
+being used on PREEMPT_RT enabled kernels. In same cases this can be
+mitigated by disabling portions of the code, in other cases the complete
+functionality has to be disabled until a workable solution has been found.


Thanks,
Davidlohr


Re: [PATCH 5/6] powerpc/mmu: drop mmap_sem now that locked_vm is atomic

2019-04-23 Thread Davidlohr Bueso

On Tue, 23 Apr 2019, Bueso wrote:


On Wed, 03 Apr 2019, Daniel Jordan wrote:


On Wed, Apr 03, 2019 at 06:58:45AM +0200, Christophe Leroy wrote:

On 02/04/2019 at 22:41, Daniel Jordan wrote:

With locked_vm now an atomic, there is no need to take mmap_sem as
writer.  Delete and refactor accordingly.


Could you please detail the change ?


Ok, I'll be more specific in the next version, using some of your language in
fact.  :)


It looks like this is not the only
change. I'm wondering what the consequences are.

Before we did:
- lock
- calculate future value
- check the future value is acceptable
- update value if future value acceptable
- return error if future value non acceptable
- unlock

Now we do:
- atomic update with future (possibly too high) value
- check the new value is acceptable
- atomic update back with older value if new value not acceptable and return
error

So if a concurrent action wants to increase locked_vm with an acceptable
step while another one has temporarily set it too high, it will now fail.

I think we should keep the previous approach and do a cmpxchg after
validating the new value.


Wouldn't the cmpxchg alternative also be exposed to locked_vm changing between
validating the new value and the cmpxchg(), and we'd bogusly fail even when there
is still room, just because the value changed (I'm assuming we don't hold any locks,
otherwise all this is pointless).

     current_locked = atomic_read(&mm->locked_vm);
     new_locked = current_locked + npages;
     if (new_locked < lock_limit)
        if (cmpxchg(&mm->locked_vm, current_locked, new_locked) == current_locked)


Err, this being != of course.


Re: [PATCH 5/6] powerpc/mmu: drop mmap_sem now that locked_vm is atomic

2019-04-23 Thread Davidlohr Bueso

On Wed, 03 Apr 2019, Daniel Jordan wrote:


On Wed, Apr 03, 2019 at 06:58:45AM +0200, Christophe Leroy wrote:

Le 02/04/2019 à 22:41, Daniel Jordan a écrit :
> With locked_vm now an atomic, there is no need to take mmap_sem as
> writer.  Delete and refactor accordingly.

Could you please detail the change ?


Ok, I'll be more specific in the next version, using some of your language in
fact.  :)


It looks like this is not the only
change. I'm wondering what the consequences are.

Before we did:
- lock
- calculate future value
- check the future value is acceptable
- update value if future value acceptable
- return error if future value non acceptable
- unlock

Now we do:
- atomic update with future (possibly too high) value
- check the new value is acceptable
- atomic update back with older value if new value not acceptable and return
error

So if a concurrent action wants to increase locked_vm with an acceptable
step while another one has temporarily set it too high, it will now fail.

I think we should keep the previous approach and do a cmpxchg after
validating the new value.


Wouldn't the cmpxchg alternative also be exposed to locked_vm changing between
validating the new value and the cmpxchg(), and we'd bogusly fail even when there
is still room, just because the value changed (I'm assuming we don't hold any locks,
otherwise all this is pointless).

      current_locked = atomic_read(&mm->locked_vm);
      new_locked = current_locked + npages;
      if (new_locked < lock_limit)
         if (cmpxchg(&mm->locked_vm, current_locked, new_locked) == current_locked)
              /* ENOMEM */



That's a good idea, and especially worth doing considering that an arbitrary
number of threads that charge a low amount of locked_vm can fail just because
one thread charges lots of it.


Yeah, but the window for this is quite small; I doubt it would be a real issue.

What if, before doing the atomic_add_return(), we first did the racy new_locked
check for ENOMEM, and then did the speculative add and cleanup, if necessary? This
would further reduce the scope of the window where a false ENOMEM can occur.
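
A minimal sketch of that idea (hypothetical helper, illustrative names, and it
assumes the series' atomic64_t locked_vm; not code from the series itself):

#include <linux/atomic.h>
#include <linux/errno.h>
#include <linux/mm_types.h>

static int example_charge_locked_vm(struct mm_struct *mm, unsigned long npages,
				    unsigned long lock_limit)
{
	s64 locked;

	/* Racy early check: cheaply avoid the add/undo dance when clearly over. */
	if (atomic64_read(&mm->locked_vm) + npages > lock_limit)
		return -ENOMEM;

	/* Speculative charge... */
	locked = atomic64_add_return(npages, &mm->locked_vm);
	if (locked > (s64)lock_limit) {
		/* ...overshot due to a concurrent charge: undo and fail. */
		atomic64_sub(npages, &mm->locked_vm);
		return -ENOMEM;
	}
	return 0;
}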


pinned_vm appears to be broken the same way, so I can fix it too unless someone
beats me to it.


This should not be a surprise for the rdma folks. Cc'ing Jason nonetheless.

Thanks,
Davidlohr


Re: [PATCH v12 21/31] mm: Introduce find_vma_rcu()

2019-04-23 Thread Davidlohr Bueso

On Tue, 23 Apr 2019, Peter Zijlstra wrote:


Also; the initial motivation was prefaulting large VMAs and the
contention on mmap was killing things; but similarly, the contention on
the refcount (I did try that) killed things just the same.


Right, this is just like what can happen with per-vma locking.

Thanks,
Davidlohr


Re: [PATCH 1/6] mm: change locked_vm's type from unsigned long to atomic64_t

2019-04-02 Thread Davidlohr Bueso

On Tue, 02 Apr 2019, Andrew Morton wrote:


Also, we didn't remove any down_write(mmap_sem)s from core code so I'm
thinking that the benefit of removing a few mmap_sem-takings from a few
obscure drivers (sorry ;)) is pretty small.


afaik porting the remaining incorrect users of locked_vm to pinned_vm was
the next step before this one, which made converting locked_vm to atomic
hardly worth it. Daniel?

Thanks,
Davidlohr


Re: [PATCH v5 1/3] locking/rwsem: Remove arch specific rwsem files

2019-03-22 Thread Davidlohr Bueso

On Fri, 22 Mar 2019, Linus Torvalds wrote:

Some of them _might_ be performance-critical. There's the one on
mmap_sem in the fault handling path, for example. And yes, I'd expect
the normal case to very much be "no other readers or writers" for that
one.


Yeah, the mmap_sem case in the fault path is really expecting an unlocked
state. To the point that four archs have added branch predictions, ie:

92181f190b6 (x86: optimise x86's do_page_fault (C entry point for the page 
fault path))
b15021d994f (powerpc/mm: Add a bunch of (un)likely annotations to do_page_fault)

And using PROFILE_ANNOTATED_BRANCHES shows pretty clearly:
(without resetting the counters)

   correct  incorrect  %  Function            File     Line
 ---------  ---------  -  ------------------  -------  ----
   4603685         34  0  do_user_addr_fault  fault.c  1416  (bootup)
 382327745        449  0  do_user_addr_fault  fault.c  1416  (kernel build)
 399446159        461  0  do_user_addr_fault  fault.c  1416  (redis benchmark)

It probably wouldn't harm doing the unlikely() for all archs, or,
alternatively, add likely() to the atomic_long_try_cmpxchg_acquire in
patch 3 and do it implicitly, but maybe that would be less flexible(?)
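
For reference, the kind of annotation those arch commits add in the fault path
looks roughly like this (a simplified sketch using the era's mmap_sem naming,
not an exact upstream hunk):

#include <linux/compiler.h>
#include <linux/mm_types.h>
#include <linux/rwsem.h>

static void example_fault_lock_mmap_sem(struct mm_struct *mm)
{
	/* The trylock is expected to succeed; only the contended case is cold. */
	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
		/* Contended mmap_sem: the rare case in the fault path. */
		down_read(&mm->mmap_sem);
	}
}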

Thanks,
Davidlohr


Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-02-14 Thread Davidlohr Bueso

On Fri, 08 Feb 2019, Waiman Long wrote:

I am planning to run more performance test and post the data sometimes
next week. Davidlohr is also going to run some of his rwsem performance
test on this patchset.


So I ran this series on a 40-core IB 2-socket box with various workloads in
mmtests. Below are some of the interesting ones; full numbers and curves
at https://linux-scalability.org/rwsem-reader-spinner/

All workloads are with increasing number of threads.

-- pagefault timings: pft is an artificial pf benchmark (thus reader stress).
metric is faults/cpu and faults/sec
                                   v5.0-rc6               v5.0-rc6
                                                             dirty
Hmean     faults/cpu-1     624224.9815 (   0.00%)   618847.5201 *  -0.86%*
Hmean     faults/cpu-4     539550.3509 (   0.00%)   547407.5738 *   1.46%*
Hmean     faults/cpu-7     401470.3461 (   0.00%)   381157.9830 *  -5.06%*
Hmean     faults/cpu-12    267617.0353 (   0.00%)   271098.5441 *   1.30%*
Hmean     faults/cpu-21    176194.4641 (   0.00%)   175151.3256 *  -0.59%*
Hmean     faults/cpu-30    119927.3862 (   0.00%)   120610.1348 *   0.57%*
Hmean     faults/cpu-40     91203.6820 (   0.00%)    91832.7489 *   0.69%*
Hmean     faults/sec-1     623292.3467 (   0.00%)   617992.0795 *  -0.85%*
Hmean     faults/sec-4    2113364.6898 (   0.00%)  2140254.8238 *   1.27%*
Hmean     faults/sec-7    2557378.4385 (   0.00%)  2450945.7060 *  -4.16%*
Hmean     faults/sec-12   2696509.8975 (   0.00%)  2747968.9819 *   1.91%*
Hmean     faults/sec-21   2902892.5639 (   0.00%)  2905923.3881 *   0.10%*
Hmean     faults/sec-30   2956696.5793 (   0.00%)  2990583.5147 *   1.15%*
Hmean     faults/sec-40   3422806.4806 (   0.00%)  3352970.3082 *  -2.04%*
Stddev    faults/cpu-1       2949.5159 (   0.00%)     2802.2712 (   4.99%)
Stddev    faults/cpu-4      24165.9454 (   0.00%)    15841.1232 (  34.45%)
Stddev    faults/cpu-7      20914.8351 (   0.00%)    22744.3294 (  -8.75%)
Stddev    faults/cpu-12     11274.3490 (   0.00%)    14733.3152 ( -30.68%)
Stddev    faults/cpu-21      2500.1950 (   0.00%)     2200.9518 (  11.97%)
Stddev    faults/cpu-30      1599.5346 (   0.00%)     1414.0339 (  11.60%)
Stddev    faults/cpu-40      1473.0181 (   0.00%)     3004.1209 (-103.94%)
Stddev    faults/sec-1       2655.2581 (   0.00%)     2405.1625 (   9.42%)
Stddev    faults/sec-4      84042.7234 (   0.00%)    57996.7158 (  30.99%)
Stddev    faults/sec-7     123656.7901 (   0.00%)   135591.1087 (  -9.65%)
Stddev    faults/sec-12     97135.6091 (   0.00%)   127054.4926 ( -30.80%)
Stddev    faults/sec-21     69564.6264 (   0.00%)    65922.6381 (   5.24%)
Stddev    faults/sec-30     51524.4027 (   0.00%)    56109.4159 (  -8.90%)
Stddev    faults/sec-40    101927.5280 (   0.00%)   160117.0093 ( -57.09%)

With the exception of the hiccup at 7 threads, things are pretty much in
the noise region for both metrics.

-- git checkout

First metric is total runtime for runs with incremental threads.

             v5.0-rc6    v5.0-rc6
                            dirty
User           218.95      219.07
System         149.29      146.82
Elapsed       1574.10     1427.08

In this case there's a non trivial improvement (11%) in overall elapsed time.

-- reaim (which is always susceptible to rwsem changes for both mmap_sem and
i_mmap)
                             v5.0-rc6             v5.0-rc6
                                                     dirty
Hmean     compute-1        6674.01 (   0.00%)    6544.28 *  -1.94%*
Hmean     compute-21      85294.91 (   0.00%)   85524.20 *   0.27%*
Hmean     compute-41     149674.70 (   0.00%)  149494.58 *  -0.12%*
Hmean     compute-61     177721.15 (   0.00%)  170507.38 *  -4.06%*
Hmean     compute-81     181531.07 (   0.00%)  180463.24 *  -0.59%*
Hmean     compute-101    189024.09 (   0.00%)  187288.86 *  -0.92%*
Hmean     compute-121    200673.24 (   0.00%)  195327.65 *  -2.66%*
Hmean     compute-141    213082.29 (   0.00%)  211290.80 *  -0.84%*
Hmean     compute-161    207764.06 (   0.00%)  204626.68 *  -1.51%*

The 'compute' workload overall takes a small hit.

Hmean     new_dbase-1       60.48 (   0.00%)     60.63 *   0.25%*
Hmean     new_dbase-21    6590.49 (   0.00%)   6671.81 *   1.23%*
Hmean     new_dbase-41   14202.91 (   0.00%)  14470.59 *   1.88%*
Hmean     new_dbase-61   21207.24 (   0.00%)  21067.40 *  -0.66%*
Hmean     new_dbase-81   25542.40 (   0.00%)  25542.40 *   0.00%*
Hmean     new_dbase-101  30165.28 (   0.00%)  30046.21 *  -0.39%*
Hmean     new_dbase-121  33638.33 (   0.00%)  33219.90 *  -1.24%*
Hmean     new_dbase-141  36723.70 (   0.00%)  37504.52 *   2.13%*
Hmean     new_dbase-161  42242.51 (   0.00%)  42117.34 *  -0.30%*
Hmean     shared-1          76.54 (   0.00%)     76.09 *  -0.59%*
Hmean     shared-21       7535.51 (   0.00%)   5518.75 * -26.76%*
Hmean     shared-41      17207.81 (   0.00%)  14651.94 * -14.85%*
Hmean     shared-61

Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-02-07 Thread Davidlohr Bueso

On Thu, 07 Feb 2019, Waiman Long wrote:

30 files changed, 1197 insertions(+), 1594 deletions(-)


Performance numbers on numerous workloads, pretty please.

I'll go and throw this at my mmap_sem intensive workloads
I've collected.

Thanks,
Davidlohr


Re: [PATCH] powerpc/mm: Fix missing mmap_sem release

2017-09-07 Thread Davidlohr Bueso

On Thu, 07 Sep 2017, Laurent Dufour wrote:


The commit b5c8f0fd595d ("powerpc/mm: Rework mm_fault_error()") reviewed
the way the error path is managed in __do_page_fault() but it was a bit too
agressive when handling a case by returning without releasing the mmap_sem.

By the way, replacing current->mm->mmap_sem by mm->mmap_sem as mm is set to
current->mm.

Fixes: b5c8f0fd595d ("powerpc/mm: Rework mm_fault_error()")
Cc: Benjamin Herrenschmidt 
Signed-off-by: Laurent Dufour 
---
arch/powerpc/mm/fault.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 4797d08581ce..f799ccf37d27 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c


But... here:

/*
 * If we need to retry the mmap_sem has already been released,
 * and if there is a fatal signal pending there is no guarantee
 * that we made any progress. Handle this case first.
 */


@@ -521,10 +521,11 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
 * User mode? Just return to handle the fatal exception 
otherwise
 * return to bad_page_fault
 */
+   up_read(&mm->mmap_sem);
return is_user ? 0 : SIGBUS;
}


Per the above comment, for that case handle_mm_fault()
has already released mmap_sem. The same occurs in x86,
for example.

Thanks,
Davidlohr


Re: [PATCH v6 02/11] locking/osq: Drop the overload of osq_lock()

2016-10-29 Thread Davidlohr Bueso

On Fri, 28 Oct 2016, Pan Xinhui wrote:

/*
 * If we need to reschedule bail... so we can block.
+* Use vcpu_is_preempted to detech lock holder preemption issue

   ^^ detect
+		 * and break. 


Could you please remove the rest of this comment? It's just noise to point out
that vcpu_is_preempted is a macro defined by arch/false. This is standard
protocol in the kernel.

Same goes for all locks you change with this.

Thanks,
Davidlohr


   * vcpu_is_preempted is a macro defined by false if
+* arch does not support vcpu preempted check,
 */
-   if (need_resched())
+   if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
goto unqueue;

cpu_relax_lowlatency();
--
2.4.11



Re: [PATCH powerpc/next v6 0/4] atomics: powerpc: Implement relaxed/acquire/release variants

2015-12-23 Thread Davidlohr Bueso

On Wed, 23 Dec 2015, Boqun Feng wrote:

There is one thing we should be aware of, that is the bug:

http://lkml.kernel.org/r/5669d5f2.5050...@caviumnetworks.com

which though has been fixed by:

http://lkml.kernel.org/r/20151217160549.gh6...@twins.programming.kicks-ass.net


Right, and fwiw the testing I did included this fix.


Re: [PATCH powerpc/next v6 0/4] atomics: powerpc: Implement relaxed/acquire/release variants

2015-12-18 Thread Davidlohr Bueso

I've left this series testing overnight on a power7 box and so far so good,
nothing has broken.

Re: [PATCH v2] arch: use WRITE_ONCE/READ_ONCE in smp_store_release/smp_load_acquire

2015-08-02 Thread Davidlohr Bueso
On Sun, 2015-08-02 at 17:11 +0200, Andrey Konovalov wrote:
 Replace ACCESS_ONCE() macro in smp_store_release() and smp_load_acquire()
 with WRITE_ONCE() and READ_ONCE() on x86, arm, arm64, ia64, metag, mips,
 powerpc, s390, sparc and asm-generic since ACCESS_ONCE does not work
 reliably on non-scalar types.
 
 WRITE_ONCE() and READ_ONCE() were introduced in the commits 230fa253df63
 (kernel: Provide READ_ONCE and ASSIGN_ONCE) and 43239cbe79fc (kernel:
 Change ASSIGN_ONCE(val, x) to WRITE_ONCE(x, val)).
 
 Signed-off-by: Andrey Konovalov andreyk...@google.com

Acked-by: Davidlohr Bueso dbu...@suse.de
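
(For context, the conversion being acked amounts to definitions along these
lines -- a simplified sketch of the asm-generic flavor, not an exact copy of
any one arch:)

#define smp_store_release(p, v)						\
do {									\
	compiletime_assert_atomic_type(*p);				\
	smp_mb();	/* order prior accesses before the release store */	\
	WRITE_ONCE(*p, v);						\
} while (0)

#define smp_load_acquire(p)						\
({									\
	typeof(*p) ___p1 = READ_ONCE(*p);				\
	compiletime_assert_atomic_type(*p);				\
	smp_mb();	/* order the acquire load before later accesses */	\
	___p1;								\
})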


[PATCH 3/3] powerpc/oprofile: reduce mmap_sem hold for exe_file

2015-02-25 Thread Davidlohr Bueso
In the future mm->exe_file will be done without mmap_sem
serialization, thus isolate and reorganize the related
code to make the transition easier. Good users will make
use of the more standard get_mm_exe_file(), requiring only
holding the mmap_sem to read the value, and relying on reference
counting to make sure that the exe file won't disappear
underneath us while getting the dcookie.

Cc: Arnd Bergmann a...@arndb.de
Cc: Robert Richter r...@kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: cbe-oss-...@lists.ozlabs.org
Cc: oprofile-l...@lists.sourceforge.net
Signed-off-by: Davidlohr Bueso dbu...@suse.de
---

Completely untested.

 arch/powerpc/oprofile/cell/spu_task_sync.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/oprofile/cell/spu_task_sync.c 
b/arch/powerpc/oprofile/cell/spu_task_sync.c
index 1c27831..ed7b097 100644
--- a/arch/powerpc/oprofile/cell/spu_task_sync.c
+++ b/arch/powerpc/oprofile/cell/spu_task_sync.c
@@ -22,6 +22,7 @@
 #include <linux/kref.h>
 #include <linux/mm.h>
 #include <linux/fs.h>
+#include <linux/file.h>
 #include <linux/module.h>
 #include <linux/notifier.h>
 #include <linux/numa.h>
@@ -322,18 +323,20 @@ get_exec_dcookie_and_offset(struct spu *spu, unsigned int 
*offsetp,
unsigned long app_cookie = 0;
unsigned int my_offset = 0;
struct vm_area_struct *vma;
+   struct file *exe_file;
struct mm_struct *mm = spu->mm;
 
if (!mm)
goto out;
 
-   down_read(&mm->mmap_sem);
-
-   if (mm->exe_file) {
-   app_cookie = fast_get_dcookie(&mm->exe_file->f_path);
-   pr_debug("got dcookie for %pD\n", mm->exe_file);
+   exe_file = get_mm_exe_file(mm);
+   if (exe_file) {
+   app_cookie = fast_get_dcookie(&exe_file->f_path);
+   pr_debug("got dcookie for %pD\n", exe_file);
+   fput(exe_file);
}
 
+   down_read(&mm->mmap_sem);
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (vma->vm_start > spu_ref || vma->vm_end <= spu_ref)
continue;
-- 
2.1.4


Re: [3.13.y.z extended stable] Patch locking/mutex: Disable optimistic spinning on some architectures has been added to staging queue

2014-08-06 Thread Davidlohr Bueso
On Wed, 2014-08-06 at 17:25 -0400, Andev wrote:
 On Wed, Aug 6, 2014 at 4:54 PM, Kamal Mostafa ka...@canonical.com wrote:
  This is a note to let you know that I have just added a patch titled
 
  locking/mutex: Disable optimistic spinning on some architectures
 
  to the linux-3.13.y-queue branch of the 3.13.y.z extended stable tree
  which can be found at:
 
   
  http://kernel.ubuntu.com/git?p=ubuntu/linux.git;a=shortlog;h=refs/heads/linux-3.13.y-queue
 
  This patch is scheduled to be released in version 3.13.11.6.
 
 
 This patch is specifically for parisc, sparc32, tile32, metag-lock1,
 arc-!llsc and hexagon none of which Ubuntu supports.
 
 Why are you backporting this patch?

Well, 3.13.y.z isn't an Ubuntu kernel, it's upstream, and those archs
*are* supported.


Re: Tasks stuck in futex code (in 3.14-rc6)

2014-03-21 Thread Davidlohr Bueso
On Sat, 2014-03-22 at 07:57 +0530, Srikar Dronamraju wrote:
   So reverting and applying v3 3/4 and 4/4 patches works for me.
  
  Ok, I verified that the above ends up resulting in the same tree as
  the minimal patch I sent out, modulo (a) some comments and (b) an
  #ifdef CONFIG_SMP in futex_get_mm() that doesn't really matter.
  
  So I committed the minimal patch with your tested-by.
  
 
 Just to be sure, I have verified with latest mainline with HEAD having
 commit 08edb33c4 (Merge branch 'drm-fixes' of
 git://people.freedesktop.org/~airlied/linux) which also has the commit
 11d4616bd0 (futex: revert back to the explicit waiter counting code).

Thanks for double checking.


Re: Tasks stuck in futex code (in 3.14-rc6)

2014-03-20 Thread Davidlohr Bueso
On Thu, 2014-03-20 at 15:38 +0530, Srikar Dronamraju wrote:
  This problem suggests that we missed a wakeup for a task that was adding
  itself to the queue in a wait path. And the only place that can happen
  is with the hb spinlock check for any pending waiters. Just in case we
  missed some assumption about checking the hash bucket spinlock as a way
  of detecting any waiters (powerpc?), could you revert this commit and
  try the original atomic operations variant:
  
  https://lkml.org/lkml/2013/12/19/630
 
 I think the above url and the commit id that I reverted i.e
 git://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b0c29f79ecea0b6fbcefc999
 are the same.
 
 Or am I missing something? 

No, please take a closer look, it is a different approach to the same
end.

diff --git a/kernel/futex.c b/kernel/futex.c
index 34ecd9d..35ff697 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -203,6 +203,7 @@ static const struct futex_q futex_q_init = {
  * waiting on a futex.
  */
 struct futex_hash_bucket {
+   atomic_t waiters;
spinlock_t lock;
struct plist_head chain;
 } cacheline_aligned_in_smp;
@@ -211,6 +212,53 @@ static unsigned long __read_mostly futex_hashsize;
 
 static struct futex_hash_bucket *futex_queues;
 
+static inline void futex_get_mm(union futex_key *key)
+{
+   atomic_inc(&key->private.mm->mm_count);
+#ifdef CONFIG_SMP
+   /*
+* Ensure futex_get_mm() implies a full barrier such that
+* get_futex_key() implies a full barrier. This is relied upon
+* as full barrier (B), see the ordering comment above.
+*/
+   smp_mb__after_atomic_inc();
+#endif
+}
+
+/*
+ * Reflects a new waiter being added to the waitqueue.
+ */
+static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+   atomic_inc(&hb->waiters);
+   /*
+* Full barrier (A), see the ordering comment above.
+*/
+   smp_mb__after_atomic_inc();
+#endif
+}
+
+/*
+ * Reflects a waiter being removed from the waitqueue by wakeup
+ * paths.
+ */
+static inline void hb_waiters_dec(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+   atomic_dec(&hb->waiters);
+#endif
+}
+
+static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+   return atomic_read(&hb->waiters);
+#else
+   return 1;
+#endif
+}
+
 /*
  * We hash on the keys returned from get_futex_key (see below).
  */
@@ -245,10 +293,10 @@ static void get_futex_key_refs(union futex_key *key)
 
switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
case FUT_OFF_INODE:
-   ihold(key->shared.inode);
+   ihold(key->shared.inode); /* implies MB */
break;
case FUT_OFF_MMSHARED:
-   atomic_inc(&key->private.mm->mm_count);
+   futex_get_mm(key); /* implies MB */
break;
}
 }
@@ -322,7 +370,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union 
futex_key *key, int rw)
if (!fshared) {
key->private.mm = mm;
key->private.address = address;
-   get_futex_key_refs(key);
+   get_futex_key_refs(key); /* implies MB (B) */
return 0;
}
 
@@ -429,7 +477,7 @@ again:
key->shared.pgoff = basepage_index(page);
}
 
-   get_futex_key_refs(key);
+   get_futex_key_refs(key); /* implies MB (B) */
 
 out:
unlock_page(page_head);
@@ -893,6 +941,7 @@ static void __unqueue_futex(struct futex_q *q)
 
hb = container_of(q->lock_ptr, struct futex_hash_bucket, lock);
plist_del(&q->list, &hb->chain);
+   hb_waiters_dec(hb);
 }
 
 /*
@@ -1052,6 +1101,11 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int 
nr_wake, u32 bitset)
goto out;
 
hb = hash_futex(&key);
+
+   /* Make sure we really have tasks to wakeup */
+   if (!hb_waiters_pending(hb))
+   goto out_put_key;
+
spin_lock(&hb->lock);
 
plist_for_each_entry_safe(this, next, &hb->chain, list) {
@@ -1072,6 +1126,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int 
nr_wake, u32 bitset)
}
 
spin_unlock(&hb->lock);
+out_put_key:
put_futex_key(&key);
 out:
return ret;
@@ -1190,7 +1245,9 @@ void requeue_futex(struct futex_q *q, struct 
futex_hash_bucket *hb1,
 */
if (likely(&hb1->chain != &hb2->chain)) {
plist_del(&q->list, &hb1->chain);
+   hb_waiters_dec(hb1);
plist_add(&q->list, &hb2->chain);
+   hb_waiters_inc(hb2);
q->lock_ptr = &hb2->lock;
}
get_futex_key_refs(key2);
@@ -1533,6 +1590,17 @@ static inline struct futex_hash_bucket 
*queue_lock(struct futex_q *q)
struct futex_hash_bucket *hb;
 
hb = hash_futex(&q->key);
+
+   /*
+* Increment the counter before taking the lock so that
+* a potential waker won't miss a to-be-slept task that 
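
For context, the resulting queue_lock() with the counter looks roughly like
this (a sketch based on the hunks above, not a verbatim copy of the committed
code):

static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
{
	struct futex_hash_bucket *hb;

	hb = hash_futex(&q->key);

	/*
	 * Bump the waiter count before taking hb->lock so that a
	 * concurrent waker checking hb_waiters_pending() cannot miss
	 * a task that is about to block on the spinlock.
	 */
	hb_waiters_inc(hb);

	q->lock_ptr = &hb->lock;
	spin_lock(&hb->lock);
	return hb;
}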

Re: Tasks stuck in futex code (in 3.14-rc6)

2014-03-20 Thread Davidlohr Bueso
On Wed, 2014-03-19 at 22:56 -0700, Davidlohr Bueso wrote:
 On Thu, 2014-03-20 at 11:03 +0530, Srikar Dronamraju wrote:
Joy,.. let me look at that with ppc in mind.
   
   OK; so while pretty much all the comments from that patch are utter
   nonsense (what was I thinking), I cannot actually find a real bug.
   
   But could you try the below which replaces a control dependency with a
   full barrier. The control flow is plenty convoluted that I think the
   control barrier isn't actually valid anymore and that might indeed
   explain the fail.
   
  
  Unfortunately the patch didnt help. Still seeing tasks stuck
  
  # ps -Ao pid,tt,user,fname,tmout,f,wchan | grep futex
  14680 pts/0root java - 0 futex_wait_queue_me
  14797 pts/0root java - 0 futex_wait_queue_me
  # : /var/log/messages
  # echo t  /proc/sysrq-trigger 
  # grep futex_wait_queue_me /var/log/messages | wc -l 
  334
  #
  
  [ 6904.211478] Call Trace:
  [ 6904.211481] [c00fa1f1b4d0] [0020] 0x20 (unreliable)
  [ 6904.211486] [c00fa1f1b6a0] [c0015208] 
  .__switch_to+0x1e8/0x330
  [ 6904.211491] [c00fa1f1b750] [c0702f00] .__schedule+0x360/0x8b0
  [ 6904.211495] [c00fa1f1b9d0] [c0147348] 
  .futex_wait_queue_me+0xf8/0x1a0
  [ 6904.211500] [c00fa1f1ba60] [c01486dc] .futex_wait+0x17c/0x2a0
  [ 6904.211505] [c00fa1f1bc10] [c014a614] .do_futex+0x254/0xd80
  [ 6904.211510] [c00fa1f1bd60] [c014b25c] .SyS_futex+0x11c/0x1d0
  [ 6904.238874] [c00fa1f1be30] [c000a0fc] syscall_exit+0x0/0x7c
  [ 6904.238879] javaS 3fff825f6044 0 14682  14076 
  0x0080
  
  Is there any other information that I provide that can help?
 
 This problem suggests that we missed a wakeup for a task that was adding
 itself to the queue in a wait path. And the only place that can happen
 is with the hb spinlock check for any pending waiters. Just in case we
 missed some assumption about checking the hash bucket spinlock as a way
 of detecting any waiters (powerpc?), could you revert this commit and
 try the original atomic operations variant:
 
 https://lkml.org/lkml/2013/12/19/630

hmmm looking at ppc spinlock code, it seems that it doesn't have ticket
spinlocks -- in fact Torsten Duwe has been trying to get them upstream
very recently. Since we rely on the counter for detecting waiters, this
might explain the issue. Could someone confirm this spinlock
implementation difference? 


Re: Tasks stuck in futex code (in 3.14-rc6)

2014-03-20 Thread Davidlohr Bueso
On Thu, 2014-03-20 at 09:41 -0700, Linus Torvalds wrote:
 On Wed, Mar 19, 2014 at 10:56 PM, Davidlohr Bueso davidl...@hp.com wrote:
 
  This problem suggests that we missed a wakeup for a task that was adding
  itself to the queue in a wait path. And the only place that can happen
  is with the hb spinlock check for any pending waiters.
 
 Ok, so thinking about hb_waiters_pending():
 
  - spin_is_locked() test
  - read barrier
  - plist_head_empty() test
 
 seems to be broken after all.
 
 The race is against futex_wait() that does
 
  - futex_wait_setup():
- queue_lock()
- get_futex_value_locked()
  - futex_wait_queue_me()
- queue_me()
  - plist_add()
 
 right?

Yep.

 
 It strikes me that the spin_is_locked() test has no barriers wrt the
 writing of the new futex value on the wake path. And the read barrier
 obviously does nothing wrt the write either. Or am I missing
 something? So the write that actually released the futex might be
 almost arbitrarily delayed on the waking side. So the waiting side may
 not see the new value, even though the waker assumes it does due to
 the ordering of it doing the write first.

Aha, that would certainly violate the ordering guarantees. I feared
_something_ like that when we originally discussed your suggestion as
opposed to the atomics one, but didn't have any case for it either.

 So maybe we need a memory barrier in hb_waiters_pending() just to make
 sure that the new value is written.
 
 But at that point, I suspect that Davidlohrs original patch that just
 had explicit waiting counts is just as well. The whole point with the
 head empty test was to emulate that do we have waiters without
 having to actually add the atomics, but a memory barrier is really no
 worse.
 
 The attached is a TOTALLY UNTESTED interdiff that adds back Davidlohrs
 atomic count. It may be terminally broken, I literally didn't test it
 at all.

Comparing with the patch I sent earlier this morning, looks equivalent,
and fwiw, passes my initial qemu bootup, which is the first way of
detecting anything stupid going on.

So, Srikar, please try this patch out, as opposed to mine, you don't
have to first revert the commit in question.
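
For reference, the lockless check being debated looks roughly like this (a
sketch following the spin_is_locked/rmb/plist-empty description quoted above,
not necessarily the exact committed code):

static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
{
	/*
	 * A locked hash bucket means someone is in the middle of
	 * queueing or waking, so report waiters as pending.
	 */
	if (spin_is_locked(&hb->lock))
		return true;
	/*
	 * Order the lock check against the plist read; whether this is
	 * sufficient is exactly what is being questioned here.
	 */
	smp_rmb();
	return !plist_head_empty(&hb->chain);
}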


Re: Tasks stuck in futex code (in 3.14-rc6)

2014-03-20 Thread Davidlohr Bueso
On Thu, 2014-03-20 at 10:42 -0700, Linus Torvalds wrote:
 On Thu, Mar 20, 2014 at 10:18 AM, Davidlohr Bueso davidl...@hp.com wrote:
  It strikes me that the spin_is_locked() test has no barriers wrt the
  writing of the new futex value on the wake path. And the read barrier
  obviously does nothing wrt the write either. Or am I missing
  something? So the write that actually released the futex might be
  almost arbitrarily delayed on the waking side. So the waiting side may
  not see the new value, even though the waker assumes it does due to
  the ordering of it doing the write first.
 
  Aha, that would certainly violate the ordering guarantees. I feared
  _something_ like that when we originally discussed your suggestion as
  opposed to the atomics one, but didn't have any case for it either.
 
 Actually, looking closer, we have the memory barrier in
 get_futex_key_refs() (called by get_futex_key()) so that's not it.
 In fact, your atomic_read(hb-waiters) doesn't have any more
 serialization than the spin_is_locked() test had.
 
 But the spin_is_locked() and queue-empty tests are two separate memory
 reads, and maybe there is some ordering wrt those two that we missed,
 so the waiters patch is worth trying anyway.

Well, imho we would have seen something wrong much, much earlier. This
patch has been very heavily tested (including with the java workload
used by Srikar).

I still wonder about ppc and spinlocks (no ticketing!!) ... sure the
waiters patch might fix the problem just because we explicitly count
the members of the plist. And I guess if we cannot rely on all archs
having an equivalent spinlock implementation, we simply cannot use this
technique for futexes.

Thanks,
Davidlohr


Re: Tasks stuck in futex code (in 3.14-rc6)

2014-03-20 Thread Davidlohr Bueso
On Thu, 2014-03-20 at 11:36 -0700, Linus Torvalds wrote:
 On Thu, Mar 20, 2014 at 10:18 AM, Davidlohr Bueso davidl...@hp.com wrote:
 
  Comparing with the patch I sent earlier this morning, looks equivalent,
  and fwiw, passes my initial qemu bootup, which is the first way of
  detecting anything stupid going on.
 
  So, Srikar, please try this patch out, as opposed to mine, you don't
  have to first revert the commit in question.
 
 Ok, so it boots for me too, so hopefully it isn't totally broken.
 
 However, since it's just closing a race, and since getting the counts
 wrong should easily result in it *working* but always taking the slow
 path (for example), I'd really like people to also verify that it
 fixes the actual performance issue (ie assuming it fixes powerpc
 behavior for Srikar, I'd like to get it double-checked that it also
 avoids the spinlock in the common case). 

Oh, it does. This atomics technique was tested at a customer's site and
ready for upstream. To refresh, we were originally seeing massive
contention on the hb->lock and an enormous amount of 0 returns from
futex_wake, indicating that spinners were piling up just to realize
that the plist was empty! While I don't have any official numbers, I can
confirm that perf showed that this issue was addressed with the atomics
variant. Yes, such pathological behavior shows problems in the userspace
locking primitives' design/implementation, but allowing the kernel not to
be affected by suboptimal uses of futexes is definitely a plus.

As tglx suggested at the time, I also made sure that adding the barriers
when doing the key refcounting didn't impose any serious restrictions to
performance either.

Now, what at the time required re-testing everything was when you
suggested replacing this approach with the more elegant spin_is_locked()
test. Both approaches showed pretty much identical performance (and
correctness, at least on x86). And to this day it shows *significantly* less
time spent in kernel space dealing with futexes.


 Because if the
 increment/decrement pairings end up being wrong, we could have a
 situation where the waiter count just ends up bogus, and it all works
 from a correctness standpoint but not from the intended performance
 optimization.
 
 No way I can test that sanely on my single-socket machine. Davidlohr?

Not this patch, no :( -- we could never blindly reproduce the customer's
workload. The only patch that I was able to create test cases for is the
larger hash table one, which simply alleviates collisions. This is now
part of perf-bench.

Thanks,
Davidlohr


Re: Tasks stuck in futex code (in 3.14-rc6)

2014-03-20 Thread Davidlohr Bueso
On Thu, 2014-03-20 at 12:25 -0700, Linus Torvalds wrote:
 On Thu, Mar 20, 2014 at 12:08 PM, Davidlohr Bueso davidl...@hp.com wrote:
 
  Oh, it does. This atomics technique was tested at a customer's site and
  ready for upstream.
 
 I'm not worried about the *original* patch. I'm worried about the
 incremental one.
 
 Your original patch never applied to my tree - I think it was based on
 -mm or something. So I couldn't verify my let's go back to the
 explicit 'waiters' incremental patch against reverting and
 re-applying the original patch.

Ok, so a big reason why this patch doesn't apply cleanly after reverting
is that *most* of the changes were done at the top of the file with
regards to documenting the ordering guarantees; the actual code changes
are quite minimal.

I reverted commits 99b60ce6 (documentation) and b0c29f79 (the offending
commit), and then I cleanly applied the equivalent ones from v3 of the
series (which was already *tested* and ready for upstream until you
suggested looking into the alternative spinlock approach):

https://lkml.org/lkml/2013/12/19/624
https://lkml.org/lkml/2013/12/19/630

Assuming the atomics solves the issue, would you be willing to take this
path? Any pending documentation fixes can be added afterwards. The
important thing is that the actual code is well tested.

Thanks,
Davidlohr


Re: Tasks stuck in futex code (in 3.14-rc6)

2014-03-19 Thread Davidlohr Bueso
On Wed, 2014-03-19 at 18:08 +0100, Peter Zijlstra wrote:
 On Wed, Mar 19, 2014 at 04:47:05PM +0100, Peter Zijlstra wrote:
   I reverted b0c29f79ecea0b6fbcefc999e70f2843ae8306db on top of v3.14-rc6 
   and confirmed that
   reverting the commit solved the problem.
  
  Joy,.. let me look at that with ppc in mind.

Errr... just sat down to check email this morning. CC'ing Paul for
any subtle barrier issues.


Re: Tasks stuck in futex code (in 3.14-rc6)

2014-03-19 Thread Davidlohr Bueso
On Thu, 2014-03-20 at 11:03 +0530, Srikar Dronamraju wrote:
   Joy,.. let me look at that with ppc in mind.
  
  OK; so while pretty much all the comments from that patch are utter
  nonsense (what was I thinking), I cannot actually find a real bug.
  
  But could you try the below which replaces a control dependency with a
  full barrier. The control flow is plenty convoluted that I think the
  control barrier isn't actually valid anymore and that might indeed
  explain the fail.
  
 
 Unfortunately the patch didnt help. Still seeing tasks stuck
 
 # ps -Ao pid,tt,user,fname,tmout,f,wchan | grep futex
 14680 pts/0root java - 0 futex_wait_queue_me
 14797 pts/0root java - 0 futex_wait_queue_me
 # : /var/log/messages
 # echo t  /proc/sysrq-trigger 
 # grep futex_wait_queue_me /var/log/messages | wc -l 
 334
 #
 
 [ 6904.211478] Call Trace:
 [ 6904.211481] [c00fa1f1b4d0] [0020] 0x20 (unreliable)
 [ 6904.211486] [c00fa1f1b6a0] [c0015208] .__switch_to+0x1e8/0x330
 [ 6904.211491] [c00fa1f1b750] [c0702f00] .__schedule+0x360/0x8b0
 [ 6904.211495] [c00fa1f1b9d0] [c0147348] 
 .futex_wait_queue_me+0xf8/0x1a0
 [ 6904.211500] [c00fa1f1ba60] [c01486dc] .futex_wait+0x17c/0x2a0
 [ 6904.211505] [c00fa1f1bc10] [c014a614] .do_futex+0x254/0xd80
 [ 6904.211510] [c00fa1f1bd60] [c014b25c] .SyS_futex+0x11c/0x1d0
 [ 6904.238874] [c00fa1f1be30] [c000a0fc] syscall_exit+0x0/0x7c
 [ 6904.238879] javaS 3fff825f6044 0 14682  14076 
 0x0080
 
 Is there any other information that I provide that can help?

This problem suggests that we missed a wakeup for a task that was adding
itself to the queue in a wait path. And the only place that can happen
is with the hb spinlock check for any pending waiters. Just in case we
missed some assumption about checking the hash bucket spinlock as a way
of detecting any waiters (powerpc?), could you revert this commit and
try the original atomic operations variant:

https://lkml.org/lkml/2013/12/19/630
