Re: [PATCH 4/7] blk-mq: allow the driver to pass in an affinity mask

2016-09-07 Thread Thomas Gleixner
On Tue, 6 Sep 2016, Christoph Hellwig wrote:

> [adding Thomas as it's about the affinity_mask he (we) added to the
>  IRQ core]
> 
> On Tue, Sep 06, 2016 at 10:39:28AM -0400, Keith Busch wrote:
> > > Always the previous one.  Below is a patch to get us back to the
> > > previous behavior:
> > 
> > No, that's not right.
> > 
> > Here's my topology info:
> > 
> >   # numactl --hardware
> >   available: 2 nodes (0-1)
> >   node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> >   node 0 size: 15745 MB
> >   node 0 free: 15319 MB
> >   node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> >   node 1 size: 16150 MB
> >   node 1 free: 15758 MB
> >   node distances:
> >   node   0   1
> > 0:  10  21
> > 1:  21  10
> 
> How do you get that mapping?  Does this CPU use Hyperthreading and
> thus expose siblings using topology_sibling_cpumask?  As that's the
> only thing the old code used for any sort of special casing.

That's a normal Intel mapping with two sockets and HT enabled. The cpu
enumeration is

Socket0 - physical cores
Socket1 - physical cores

Socket0 - HT siblings
Socket1 - HT siblings
 
> I'll need to see if I can find a system with such a mapping to reproduce.

Any 2 socket Intel with HT enabled will do. If you need access to one let
me know.
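
For reference, a minimal sketch (not from this thread; it just uses the
topology_sibling_cpumask() helper mentioned above) of how such an enumeration
collapses onto the physical cores when only the first sibling is considered:

#include <linux/cpumask.h>
#include <linux/topology.h>

/*
 * Illustrative only: map a CPU to the first CPU in its HT sibling mask.
 * With the enumeration above, CPUs 16-23 collapse onto CPUs 0-7 and
 * CPUs 24-31 collapse onto CPUs 8-15.
 */
static unsigned int first_sibling(unsigned int cpu)
{
	return cpumask_first(topology_sibling_cpumask(cpu));
}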

> > If I have 16 vectors, the affinity_mask generated by what you're doing
> > covers CPUs 0-15. So the first 16 bits are set since each
> > of those is the first unique CPU, getting a unique vector just like you
> > wanted. If an unset bit just means share with the previous, then all of
> > my thread siblings (CPUs 16-31) get to share with CPU 15. That's awful!
> > 
> > What we want for my CPU topology is the 16th CPU to pair with CPU 0,
> > 17 pairs with 1, 18 with 2, and so on. You can't convey that information
> > with this scheme. We need affinity_masks per vector.
> 
> We actually have per-vector masks, but they are hidden inside the IRQ
> core and awkward to use.  We could do the get_first_sibling magic
> in the block-mq queue mapping (and in fact with the current code I guess
> we need to).  Or take a step back from trying to emulate the old code
> and instead look at NUMA nodes instead of siblings which some folks
> suggested a while ago.

I think you want both.

NUMA nodes are certainly the first decision factor. You split the number of
vectors to the nodes:

  vecs_per_node = num_vector / num_nodes;

Then you spread the number of vectors per node by the number of cpus per
node.

  cpus_per_vec = cpus_on(node) / vecs_per_node;

If the number of cpus per vector is <= 1 you just use a round robin
scheme. If not, you need to look at siblings.
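
A rough sketch of that two-step split (illustrative only, not the RFC series
mentioned below; remainder handling and the actual mask assignment are left
out):

#include <linux/cpumask.h>
#include <linux/kernel.h>
#include <linux/topology.h>

static void sketch_spread(unsigned int num_vecs, unsigned int num_nodes)
{
	unsigned int node, vecs_per_node, cpus_per_vec, cpus_on_node;

	/* Step 1: split the vectors across the NUMA nodes */
	vecs_per_node = max(1U, num_vecs / num_nodes);

	for (node = 0; node < num_nodes; node++) {
		/* Step 2: split this node's CPUs across its vectors */
		cpus_on_node = cpumask_weight(cpumask_of_node(node));
		cpus_per_vec = cpus_on_node / vecs_per_node;

		if (cpus_per_vec <= 1) {
			/* more vectors than CPUs on this node: round robin */
			continue;
		}
		/*
		 * Otherwise group cpus_per_vec CPUs per vector, keeping
		 * HT siblings in the same group where possible.
		 */
	}
}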

Looking at the whole thing, I think we need to be more clever when setting
up the msi descriptor affinity masks.

I'll send an RFC series soon.

Thanks,

tglx


[patch 2/3] blk/mq/cpu-notif: Convert to hotplug state machine

2016-09-19 Thread Thomas Gleixner
From: Sebastian Andrzej Siewior 

Install the callbacks via the state machine so we can phase out the cpu
hotplug notifiers.

Signed-off-by: Sebastian Andrzej Siewior 
Cc: Peter Zijlstra 
Cc: Jens Axboe 
Cc: r...@linutronix.de
Signed-off-by: Thomas Gleixner 

---

 block/blk-mq-cpu.c |   15 +++
 block/blk-mq.c |   21 +
 block/blk-mq.h |2 +-
 include/linux/blk-mq.h |2 +-
 4 files changed, 14 insertions(+), 26 deletions(-)

--- a/block/blk-mq-cpu.c
+++ b/block/blk-mq-cpu.c
@@ -18,18 +18,16 @@
 static LIST_HEAD(blk_mq_cpu_notify_list);
 static DEFINE_RAW_SPINLOCK(blk_mq_cpu_notify_lock);
 
-static int blk_mq_main_cpu_notify(struct notifier_block *self,
- unsigned long action, void *hcpu)
+static int blk_mq_cpu_dead(unsigned int cpu)
 {
-   unsigned int cpu = (unsigned long) hcpu;
struct blk_mq_cpu_notifier *notify;
-   int ret = NOTIFY_OK;
+   int ret;
 
raw_spin_lock(&blk_mq_cpu_notify_lock);
 
list_for_each_entry(notify, &blk_mq_cpu_notify_list, list) {
-   ret = notify->notify(notify->data, action, cpu);
-   if (ret != NOTIFY_OK)
+   ret = notify->notify(notify->data, cpu);
+   if (ret)
break;
}
 
@@ -54,7 +52,7 @@ void blk_mq_unregister_cpu_notifier(stru
 }
 
 void blk_mq_init_cpu_notifier(struct blk_mq_cpu_notifier *notifier,
- int (*fn)(void *, unsigned long, unsigned int),
+ int (*fn)(void *, unsigned int),
  void *data)
 {
notifier->notify = fn;
@@ -63,5 +61,6 @@ void blk_mq_init_cpu_notifier(struct blk
 
 void __init blk_mq_cpu_init(void)
 {
-   hotcpu_notifier(blk_mq_main_cpu_notify, 0);
+   cpuhp_setup_state_nocalls(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
+ blk_mq_cpu_dead);
 }
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1590,30 +1590,19 @@ static int blk_mq_hctx_cpu_offline(struc
spin_unlock(&ctx->lock);
 
if (list_empty(&tmp))
-   return NOTIFY_OK;
+   return 0;
 
spin_lock(&hctx->lock);
list_splice_tail_init(&tmp, &hctx->dispatch);
spin_unlock(&hctx->lock);
 
blk_mq_run_hw_queue(hctx, true);
-   return NOTIFY_OK;
+   return 0;
 }
 
-static int blk_mq_hctx_notify(void *data, unsigned long action,
- unsigned int cpu)
+static int blk_mq_hctx_notify_dead(void *hctx, unsigned int cpu)
 {
-   struct blk_mq_hw_ctx *hctx = data;
-
-   if (action == CPU_DEAD || action == CPU_DEAD_FROZEN)
-   return blk_mq_hctx_cpu_offline(hctx, cpu);
-
-   /*
-* In case of CPU online, tags may be reallocated
-* in blk_mq_map_swqueue() after mapping is updated.
-*/
-
-   return NOTIFY_OK;
+   return blk_mq_hctx_cpu_offline(hctx, cpu);
 }
 
 /* hctx->ctxs will be freed in queue's release handler */
@@ -1681,7 +1670,7 @@ static int blk_mq_init_hctx(struct reque
hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
blk_mq_init_cpu_notifier(&hctx->cpu_notifier,
-   blk_mq_hctx_notify, hctx);
+   blk_mq_hctx_notify_dead, hctx);
blk_mq_register_cpu_notifier(&hctx->cpu_notifier);
 
hctx->tags = set->tags[hctx_idx];
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -34,7 +34,7 @@ void blk_mq_wake_waiters(struct request_
  */
 struct blk_mq_cpu_notifier;
 void blk_mq_init_cpu_notifier(struct blk_mq_cpu_notifier *notifier,
- int (*fn)(void *, unsigned long, unsigned int),
+ int (*fn)(void *, unsigned int),
  void *data);
 void blk_mq_register_cpu_notifier(struct blk_mq_cpu_notifier *notifier);
 void blk_mq_unregister_cpu_notifier(struct blk_mq_cpu_notifier *notifier);
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -10,7 +10,7 @@ struct blk_flush_queue;
 struct blk_mq_cpu_notifier {
struct list_head list;
void *data;
-   int (*notify)(void *data, unsigned long action, unsigned int cpu);
+   int (*notify)(void *data, unsigned int cpu);
 };
 
 struct blk_mq_hw_ctx {




[patch 0/3] block/mq: Convert to the new hotplug state machine

2016-09-19 Thread Thomas Gleixner
The following series converts block/mq to the new hotplug state
machine. Patch 1/3 reserves the states for the block layer and is already 
applied to

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp/for-block

to avoid merge conflicts. This branch can be pulled into the block tree
instead of applying patch 1/3 manually.

Thanks,

tglx
---
 include/linux/blk-mq.h |2 
 block/blk-mq-cpu.c |   15 ++
 block/blk-mq.c |  108 -
 block/blk-mq.h |2 
 include/linux/cpuhotplug.h |2 
 5 files changed, 59 insertions(+), 70 deletions(-)






[patch 1/3] blk/mq: Reserve hotplug states for block multiqueue

2016-09-19 Thread Thomas Gleixner
This patch only reserves two CPU hotplug states for block/mq so the block tree
can apply the conversion patches.

Signed-off-by: Sebastian Andrzej Siewior 
Cc: Peter Zijlstra 
Cc: Jens Axboe 
Cc: r...@linutronix.de
Signed-off-by: Thomas Gleixner 

---
 include/linux/cpuhotplug.h |2 ++
 1 file changed, 2 insertions(+)

--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -14,6 +14,7 @@ enum cpuhp_state {
CPUHP_PERF_SUPERH,
CPUHP_X86_HPET_DEAD,
CPUHP_X86_APB_DEAD,
+   CPUHP_BLK_MQ_DEAD,
CPUHP_WORKQUEUE_PREP,
CPUHP_POWER_NUMA_PREPARE,
CPUHP_HRTIMERS_PREPARE,
@@ -22,6 +23,7 @@ enum cpuhp_state {
CPUHP_SMPCFD_PREPARE,
CPUHP_RCUTREE_PREP,
CPUHP_NOTIFY_PREPARE,
+   CPUHP_BLK_MQ_PREPARE,
CPUHP_TIMERS_DEAD,
CPUHP_BRINGUP_CPU,
CPUHP_AP_IDLE_DEAD,




[patch 3/3] blk/mq: Convert to hotplug state machine

2016-09-19 Thread Thomas Gleixner
From: Sebastian Andrzej Siewior 

Install the callbacks via the state machine so we can phase out the cpu
hotplug notifiers mess.

Signed-off-by: Sebastian Andrzej Siewior 
Cc: Peter Zijlstra 
Cc: Jens Axboe 
Cc: r...@linutronix.de
Signed-off-by: Thomas Gleixner 

---

 block/blk-mq.c |   87 -
 1 file changed, 43 insertions(+), 44 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2090,50 +2090,18 @@ static void blk_mq_queue_reinit(struct r
blk_mq_sysfs_register(q);
 }
 
-static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
- unsigned long action, void *hcpu)
+/*
+ * New online cpumask which is going to be set in this hotplug event.
+ * Declare this cpumasks as global as cpu-hotplug operation is invoked
+ * one-by-one and dynamically allocating this could result in a failure.
+ */
+static struct cpumask cpuhp_online_new;
+
+static void blk_mq_queue_reinit_work(void)
 {
struct request_queue *q;
-   int cpu = (unsigned long)hcpu;
-   /*
-* New online cpumask which is going to be set in this hotplug event.
-* Declare this cpumasks as global as cpu-hotplug operation is invoked
-* one-by-one and dynamically allocating this could result in a failure.
-*/
-   static struct cpumask online_new;
-
-   /*
-* Before hotadded cpu starts handling requests, new mappings must
-* be established.  Otherwise, these requests in hw queue might
-* never be dispatched.
-*
-* For example, there is a single hw queue (hctx) and two CPU queues
-* (ctx0 for CPU0, and ctx1 for CPU1).
-*
-* Now CPU1 is just onlined and a request is inserted into
-* ctx1->rq_list and set bit0 in pending bitmap as ctx1->index_hw is
-* still zero.
-*
-* And then while running hw queue, flush_busy_ctxs() finds bit0 is
-* set in pending bitmap and tries to retrieve requests in
-* hctx->ctxs[0]->rq_list.  But htx->ctxs[0] is a pointer to ctx0,
-* so the request in ctx1->rq_list is ignored.
-*/
-   switch (action & ~CPU_TASKS_FROZEN) {
-   case CPU_DEAD:
-   case CPU_UP_CANCELED:
-   cpumask_copy(&online_new, cpu_online_mask);
-   break;
-   case CPU_UP_PREPARE:
-   cpumask_copy(&online_new, cpu_online_mask);
-   cpumask_set_cpu(cpu, &online_new);
-   break;
-   default:
-   return NOTIFY_OK;
-   }
 
mutex_lock(&all_q_mutex);
-
/*
 * We need to freeze and reinit all existing queues.  Freezing
 * involves synchronous wait for an RCU grace period and doing it
@@ -2154,13 +2122,43 @@ static int blk_mq_queue_reinit_notify(st
}
 
list_for_each_entry(q, &all_q_list, all_q_node)
-   blk_mq_queue_reinit(q, &online_new);
+   blk_mq_queue_reinit(q, &cpuhp_online_new);
 
list_for_each_entry(q, &all_q_list, all_q_node)
blk_mq_unfreeze_queue(q);
 
mutex_unlock(&all_q_mutex);
-   return NOTIFY_OK;
+}
+
+static int blk_mq_queue_reinit_dead(unsigned int cpu)
+{
+   cpumask_clear_cpu(cpu, &cpuhp_online_new);
+   blk_mq_queue_reinit_work();
+   return 0;
+}
+
+/*
+ * Before hotadded cpu starts handling requests, new mappings must be
+ * established.  Otherwise, these requests in hw queue might never be
+ * dispatched.
+ *
+ * For example, there is a single hw queue (hctx) and two CPU queues (ctx0
+ * for CPU0, and ctx1 for CPU1).
+ *
+ * Now CPU1 is just onlined and a request is inserted into ctx1->rq_list
+ * and set bit0 in pending bitmap as ctx1->index_hw is still zero.
+ *
+ * And then while running hw queue, flush_busy_ctxs() finds bit0 is set in
+ * pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
+ * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list
+ * is ignored.
+ */
+static int blk_mq_queue_reinit_prepare(unsigned int cpu)
+{
+   cpumask_copy(&cpuhp_online_new, cpu_online_mask);
+   cpumask_set_cpu(cpu, &cpuhp_online_new);
+   blk_mq_queue_reinit_work();
+   return 0;
 }
 
 static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
@@ -2381,8 +2379,9 @@ static int __init blk_mq_init(void)
 {
blk_mq_cpu_init();
 
-   hotcpu_notifier(blk_mq_queue_reinit_notify, 0);
-
+   cpuhp_setup_state_nocalls(CPUHP_BLK_MQ_PREPARE, "block/mq:prepare",
+ blk_mq_queue_reinit_prepare,
+ blk_mq_queue_reinit_dead);
return 0;
 }
 subsys_initcall(blk_mq_init);




Re: [patch 2/3] blk/mq/cpu-notif: Convert to hotplug state machine

2016-09-19 Thread Thomas Gleixner
On Tue, 20 Sep 2016, Christoph Hellwig wrote:

> On Mon, Sep 19, 2016 at 09:28:20PM -0000, Thomas Gleixner wrote:
> > From: Sebastian Andrzej Siewior 
> > 
> > Install the callbacks via the state machine so we can phase out the cpu
> > hotplug notifiers.
> 
> Didn't Sebastian come up with a version of the hotplug state callbacks
> that can be per-object?  This seems to be a perfect candidate for that.

Indeed. I wrote that myself and forgot about it already. :(

So yes, we can use that and get rid of blk-mq-cpu.c completely.
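
For context, a sketch of what the per-object (multi instance) variant looks
like; this is roughly what the V2 series below ends up doing, and the
cpuhp_dead hlist_node member name is taken from that series:

#include <linux/cpuhotplug.h>

/* One-time setup: register the state with a per-instance "dead" callback */
static int __init blk_mq_cpuhp_setup(void)
{
	return cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead",
				       NULL, blk_mq_hctx_notify_dead);
}

/* Per hardware context: add/remove its hlist_node instance */
static void blk_mq_add_cpuhp(struct blk_mq_hw_ctx *hctx)
{
	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
}

static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
{
	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
					    &hctx->cpuhp_dead);
}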

Thanks,

tglx


Re: [PATCH 01/13] genirq/msi: Add cpumask allocation to alloc_msi_entry

2016-09-20 Thread Thomas Gleixner
On Tue, 20 Sep 2016, Alexander Gordeev wrote:
> On Mon, Sep 19, 2016 at 03:50:07PM +0200, Christoph Hellwig wrote:
> > On Mon, Sep 19, 2016 at 09:30:58AM +0200, Alexander Gordeev wrote:
> > > > INIT_LIST_HEAD(&desc->list);
> > > > desc->dev = dev;
> > > > +   desc->nvec_used = nvec;
> 
> (*)
> 
> > > > +   if (affinity) {
> > > > +   desc->affinity = kmemdup(affinity,
> > > > +   nvec * sizeof(*desc->affinity), GFP_KERNEL);
> > > > +   if (!desc->affinity) {
> > > > +   kfree(desc);
> > > > +   return NULL;
> > > > +   }
> > > > +   }
> > > 
> > > nit - should not "desc" initialization follow "desc->affinity" allocation?
> > 
> > I can't parse that sentence.  Do you mean the desc->nvec_used setup?
> 
> Yes, the inits above (*) would be useless if desc->affinity allocation failed.

And that matters how?

Thanks,

tglx
 


[patch V2 0/2] block/mq: Convert to new hotplug state machine

2016-09-20 Thread Thomas Gleixner
The following series converts block/mq to the new hotplug state
machine. The series is against block.git/for-next and depends on

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp/for-block

This branch contains the necessary infrastructure for multi-instance
callbacks which allows us to remove blk-mq-cpu.c completely. It can be
pulled into the block tree.

Changes vs. V1:

  Use the multi instance callbacks and remove the private notifier handling
  in the block code.

Thanks,

tglx
---
 a/block/blk-mq-cpu.c |   67 -
 b/include/linux/blk-mq.h |8 ---
 block/Makefile   |2 
 block/blk-mq.c   |  123 +--
 block/blk-mq.h   |7 --
 5 files changed, 58 insertions(+), 149 deletions(-)





[patch V2 2/2] blk/mq: Convert to hotplug state machine

2016-09-20 Thread Thomas Gleixner
From: Sebastian Andrzej Siewior 

Install the callbacks via the state machine so we can phase out the cpu
hotplug notifiers mess.


Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Thomas Gleixner 
Cc: Jens Axboe 
Cc: Peter Zijlstra 
Cc: linux-block@vger.kernel.org
Cc: r...@linutronix.de
Cc: Christoph Hellwig 
Link: http://lkml.kernel.org/r/20160919212601.180033...@linutronix.de
Signed-off-by: Thomas Gleixner 

---

 block/blk-mq.c |   87 -
 1 file changed, 43 insertions(+), 44 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2090,50 +2090,18 @@ static void blk_mq_queue_reinit(struct r
blk_mq_sysfs_register(q);
 }
 
-static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
- unsigned long action, void *hcpu)
+/*
+ * New online cpumask which is going to be set in this hotplug event.
+ * Declare this cpumasks as global as cpu-hotplug operation is invoked
+ * one-by-one and dynamically allocating this could result in a failure.
+ */
+static struct cpumask cpuhp_online_new;
+
+static void blk_mq_queue_reinit_work(void)
 {
struct request_queue *q;
-   int cpu = (unsigned long)hcpu;
-   /*
-* New online cpumask which is going to be set in this hotplug event.
-* Declare this cpumasks as global as cpu-hotplug operation is invoked
-* one-by-one and dynamically allocating this could result in a failure.
-*/
-   static struct cpumask online_new;
-
-   /*
-* Before hotadded cpu starts handling requests, new mappings must
-* be established.  Otherwise, these requests in hw queue might
-* never be dispatched.
-*
-* For example, there is a single hw queue (hctx) and two CPU queues
-* (ctx0 for CPU0, and ctx1 for CPU1).
-*
-* Now CPU1 is just onlined and a request is inserted into
-* ctx1->rq_list and set bit0 in pending bitmap as ctx1->index_hw is
-* still zero.
-*
-* And then while running hw queue, flush_busy_ctxs() finds bit0 is
-* set in pending bitmap and tries to retrieve requests in
-* hctx->ctxs[0]->rq_list.  But htx->ctxs[0] is a pointer to ctx0,
-* so the request in ctx1->rq_list is ignored.
-*/
-   switch (action & ~CPU_TASKS_FROZEN) {
-   case CPU_DEAD:
-   case CPU_UP_CANCELED:
-   cpumask_copy(&online_new, cpu_online_mask);
-   break;
-   case CPU_UP_PREPARE:
-   cpumask_copy(&online_new, cpu_online_mask);
-   cpumask_set_cpu(cpu, &online_new);
-   break;
-   default:
-   return NOTIFY_OK;
-   }
 
mutex_lock(&all_q_mutex);
-
/*
 * We need to freeze and reinit all existing queues.  Freezing
 * involves synchronous wait for an RCU grace period and doing it
@@ -2154,13 +2122,43 @@ static int blk_mq_queue_reinit_notify(st
}
 
list_for_each_entry(q, &all_q_list, all_q_node)
-   blk_mq_queue_reinit(q, &online_new);
+   blk_mq_queue_reinit(q, &cpuhp_online_new);
 
list_for_each_entry(q, &all_q_list, all_q_node)
blk_mq_unfreeze_queue(q);
 
mutex_unlock(&all_q_mutex);
-   return NOTIFY_OK;
+}
+
+static int blk_mq_queue_reinit_dead(unsigned int cpu)
+{
+   cpumask_clear_cpu(cpu, &cpuhp_online_new);
+   blk_mq_queue_reinit_work();
+   return 0;
+}
+
+/*
+ * Before hotadded cpu starts handling requests, new mappings must be
+ * established.  Otherwise, these requests in hw queue might never be
+ * dispatched.
+ *
+ * For example, there is a single hw queue (hctx) and two CPU queues (ctx0
+ * for CPU0, and ctx1 for CPU1).
+ *
+ * Now CPU1 is just onlined and a request is inserted into ctx1->rq_list
+ * and set bit0 in pending bitmap as ctx1->index_hw is still zero.
+ *
+ * And then while running hw queue, flush_busy_ctxs() finds bit0 is set in
+ * pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
+ * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list
+ * is ignored.
+ */
+static int blk_mq_queue_reinit_prepare(unsigned int cpu)
+{
+   cpumask_copy(&cpuhp_online_new, cpu_online_mask);
+   cpumask_set_cpu(cpu, &cpuhp_online_new);
+   blk_mq_queue_reinit_work();
+   return 0;
 }
 
 static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
@@ -2382,8 +2380,9 @@ static int __init blk_mq_init(void)
cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
blk_mq_hctx_notify_dead);
 
-   hotcpu_notifier(blk_mq_queue_reinit_notify, 0);
-
+   cpuhp_setup_state_nocalls(CPUHP_BLK_MQ_PREPARE, "block/mq:prepare",
+ blk_mq_queue_reinit_prepare,
+   

[patch V2 1/2] blk/mq/cpu-notif: Convert to hotplug state machine

2016-09-20 Thread Thomas Gleixner
Replace the block-mq notifier list management with the multi instance
facility in the cpu hotplug state machine.

Signed-off-by: Thomas Gleixner 
Cc: Jens Axboe 
Cc: Peter Zijlstra 
Cc: linux-block@vger.kernel.org
Cc: r...@linutronix.de
Cc: Christoph Hellwig 

---

 block/Makefile |2 -
 block/blk-mq-cpu.c |   67 -
 block/blk-mq.c |   36 +-
 block/blk-mq.h |7 -
 include/linux/blk-mq.h |8 -
 5 files changed, 15 insertions(+), 105 deletions(-)

--- a/block/Makefile
+++ b/block/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
blk-lib.o blk-mq.o blk-mq-tag.o \
-   blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
+   blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
badblocks.o partitions/
 
--- a/block/blk-mq-cpu.c
+++ /dev/null
@@ -1,67 +0,0 @@
-/*
- * CPU notifier helper code for blk-mq
- *
- * Copyright (C) 2013-2014 Jens Axboe
- */
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include 
-#include "blk-mq.h"
-
-static LIST_HEAD(blk_mq_cpu_notify_list);
-static DEFINE_RAW_SPINLOCK(blk_mq_cpu_notify_lock);
-
-static int blk_mq_main_cpu_notify(struct notifier_block *self,
- unsigned long action, void *hcpu)
-{
-   unsigned int cpu = (unsigned long) hcpu;
-   struct blk_mq_cpu_notifier *notify;
-   int ret = NOTIFY_OK;
-
-   raw_spin_lock(&blk_mq_cpu_notify_lock);
-
-   list_for_each_entry(notify, &blk_mq_cpu_notify_list, list) {
-   ret = notify->notify(notify->data, action, cpu);
-   if (ret != NOTIFY_OK)
-   break;
-   }
-
-   raw_spin_unlock(&blk_mq_cpu_notify_lock);
-   return ret;
-}
-
-void blk_mq_register_cpu_notifier(struct blk_mq_cpu_notifier *notifier)
-{
-   BUG_ON(!notifier->notify);
-
-   raw_spin_lock(&blk_mq_cpu_notify_lock);
-   list_add_tail(¬ifier->list, &blk_mq_cpu_notify_list);
-   raw_spin_unlock(&blk_mq_cpu_notify_lock);
-}
-
-void blk_mq_unregister_cpu_notifier(struct blk_mq_cpu_notifier *notifier)
-{
-   raw_spin_lock(&blk_mq_cpu_notify_lock);
-   list_del(¬ifier->list);
-   raw_spin_unlock(&blk_mq_cpu_notify_lock);
-}
-
-void blk_mq_init_cpu_notifier(struct blk_mq_cpu_notifier *notifier,
- int (*fn)(void *, unsigned long, unsigned int),
- void *data)
-{
-   notifier->notify = fn;
-   notifier->data = data;
-}
-
-void __init blk_mq_cpu_init(void)
-{
-   hotcpu_notifier(blk_mq_main_cpu_notify, 0);
-}
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1575,11 +1575,13 @@ static struct blk_mq_tags *blk_mq_init_r
  * software queue to the hw queue dispatch list, and ensure that it
  * gets run.
  */
-static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
+static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 {
+   struct blk_mq_hw_ctx *hctx;
struct blk_mq_ctx *ctx;
LIST_HEAD(tmp);
 
+   hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 
spin_lock(&ctx->lock);
@@ -1590,30 +1592,20 @@ static int blk_mq_hctx_cpu_offline(struc
spin_unlock(&ctx->lock);
 
if (list_empty(&tmp))
-   return NOTIFY_OK;
+   return 0;
 
spin_lock(&hctx->lock);
list_splice_tail_init(&tmp, &hctx->dispatch);
spin_unlock(&hctx->lock);
 
blk_mq_run_hw_queue(hctx, true);
-   return NOTIFY_OK;
+   return 0;
 }
 
-static int blk_mq_hctx_notify(void *data, unsigned long action,
- unsigned int cpu)
+static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 {
-   struct blk_mq_hw_ctx *hctx = data;
-
-   if (action == CPU_DEAD || action == CPU_DEAD_FROZEN)
-   return blk_mq_hctx_cpu_offline(hctx, cpu);
-
-   /*
-* In case of CPU online, tags may be reallocated
-* in blk_mq_map_swqueue() after mapping is updated.
-*/
-
-   return NOTIFY_OK;
+   cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
+   &hctx->cpuhp_dead);
 }
 
 /* hctx->ctxs will be freed in queue's release handler */
@@ -1633,7 +1625,7 @@ static void blk_mq_exit_hctx(struct requ
if (set->ops->exit_hctx)
set->ops->exit_hctx(hctx, hctx_idx);
 
-   blk_mq_unregister_cpu_notif

Re: [PATCH 02/13] genirq/affinity: Provide smarter irq spreading infrastructure

2016-09-22 Thread Thomas Gleixner
Alexander,

On Wed, 21 Sep 2016, Alexander Gordeev wrote:
> On Wed, Sep 14, 2016 at 04:18:48PM +0200, Christoph Hellwig wrote:
> > +/**
> > + * irq_calc_affinity_vectors - Calculate to optimal number of vectors for a given affinity mask
> > + * @affinity:	The affinity mask to spread. If NULL cpu_online_mask
> > + *		is used
> > + * @maxvec:	The maximum number of vectors available
> > + */
> > +int irq_calc_affinity_vectors(const struct cpumask *affinity, int maxvec)
> > +{
> > +   int cpus, ret;
> > +
> > +   /* Stabilize the cpumasks */
> > +   get_online_cpus();
> > +   /* If the supplied affinity mask is NULL, use cpu online mask */
> > +   if (!affinity)
> > +   affinity = cpu_online_mask;
> > +
> > +   cpus = cpumask_weight(affinity);
> 
> Should not we consider the result of AND of affinity and cpu_online_mask?

That's a good question.

The argument against it is the increased usage of cpu (soft)hotplug for
power-management. The driver might well want to set the mapping even for an
offline cpu and as long as the interrupt is not requested for that
particular queue, it will stay (in software) associated to that cpu. So
once the CPU is brought up again the driver can request the interrupt and
work with the associated queue.

I'm aware that there are arguments against it, but let's see how it works
out.
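
For comparison, a sketch of the AND variant being suggested (illustrative
only; as argued above, the current code deliberately counts CPUs that happen
to be offline at the moment):

#include <linux/cpu.h>
#include <linux/cpumask.h>
#include <linux/kernel.h>
#include <linux/slab.h>

/* Illustrative alternative: only count CPUs that are currently online */
static int calc_online_vectors(const struct cpumask *affinity, int maxvec)
{
	cpumask_var_t tmp;
	int cpus;

	if (!zalloc_cpumask_var(&tmp, GFP_KERNEL))
		return maxvec;

	get_online_cpus();
	cpumask_and(tmp, affinity ? affinity : cpu_online_mask,
		    cpu_online_mask);
	cpus = cpumask_weight(tmp);
	put_online_cpus();

	free_cpumask_var(tmp);
	return min(cpus, maxvec);
}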

Thanks,

tglx


Re: [PATCH 1/2] irq,pci: allow drivers to specify complex affinity requirement

2016-11-07 Thread Thomas Gleixner
On Sun, 6 Nov 2016, Christoph Hellwig wrote:
>  drivers/pci/msi.c | 61 
>  include/linux/interrupt.h | 26 ---
>  include/linux/pci.h   | 14 ++
>  kernel/irq/affinity.c | 65 
> ---

Can you please split that up into bits and pieces. This change-it-all patch
is not fun to review.

>  4 files changed, 98 insertions(+), 68 deletions(-)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index bfdd074..c312535 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -551,14 +551,15 @@ static int populate_msi_sysfs(struct pci_dev *pdev)
>  }
>  
>  static struct msi_desc *
> -msi_setup_entry(struct pci_dev *dev, int nvec, bool affinity)
> +msi_setup_entry(struct pci_dev *dev, int nvec, bool affinity,
> + struct irq_affinity *desc)

This should be 'const struct '. And can we please name this *affd or
something like that?

>  {
>   struct cpumask *masks = NULL;
>   struct msi_desc *entry;
>   u16 control;
>  
>   if (affinity) {

If we do it right, then we can get rid of the bool and depend on that irq
affinity pointer. We just have to push down the flag evaluation to
pci_alloc_irq_vectors[_affinity].
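
A rough sketch of what pushing the flag evaluation down could look like
(hypothetical helper name; PCI_IRQ_AFFINITY is the existing flag and
struct irq_affinity is the one introduced by this patch; not the actual
change):

#include <linux/interrupt.h>
#include <linux/pci.h>

/* All-zero default: no pre/post vectors excluded from spreading */
static const struct irq_affinity msi_default_affd;

/*
 * Hypothetical helper: evaluate PCI_IRQ_AFFINITY once at the
 * pci_alloc_irq_vectors_affinity() level so the lower layers only
 * have to look at the descriptor pointer instead of a bool.
 */
static const struct irq_affinity *
resolve_affd(unsigned int flags, const struct irq_affinity *affd)
{
	if (!(flags & PCI_IRQ_AFFINITY))
		return NULL;			/* no spreading requested */
	return affd ? affd : &msi_default_affd;	/* fall back to the default */
}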

> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -232,6 +232,24 @@ struct irq_affinity_notify {
>   void (*release)(struct kref *ref);
>  };
>  
> +/*
> + * This structure allows drivers to optionally specify complex IRQ affinity
> + * requirements.
> + */
> +struct irq_affinity {
> + /*
> +  * Number of vectors before the main affinity vectors that should not
> +  * have have any CPU affinity:
> +  */

Please make that kernel doc.

> + int pre_vectors;
> +
> + /*
> +  * Number of vectors after the main affinity vectors that should not
> +  * have have any CPU affinity:
> +  */
> + int post_vectors;
> +};
> +

> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 0e49f70..be0abd2 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
...  
> +#define pci_alloc_irq_vectors(dev, min_vecs, max_vecs, flags) \
> + pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs, flags, NULL)

static inline please

>  #endif /* LINUX_PCI_H */
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index 17f51d63..151b285 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -51,16 +51,16 @@ static int get_nodes_in_cpumask(const struct cpumask *mask, nodemask_t *nodemsk)
>  
>  /**
>   * irq_create_affinity_masks - Create affinity masks for multiqueue spreading
> - * @affinity:	The affinity mask to spread. If NULL cpu_online_mask
> - *		is used
> + * @desc:	description of the affinity requirements

@affd and sentence starts with an uppercase letter, please.

>   * @nvecs:   The number of vectors

If you rename the argument then you want to rename this as well.

>   *
>   * Returns the masks pointer or NULL if allocation failed.
>   */
> -struct cpumask *irq_create_affinity_masks(const struct cpumask *affinity,
> -   int nvec)
> +struct cpumask *irq_create_affinity_masks(struct irq_affinity *desc,
> +   int total_nvec)

Thanks,

tglx


Re: [PATCH 2/7] genirq/affinity: Handle pre/post vectors in irq_calc_affinity_vectors()

2016-11-08 Thread Thomas Gleixner
On Tue, 8 Nov 2016, Hannes Reinecke wrote:
> Shouldn't you check for NULL affd here?

No. The introduction of the default affinity struct should happen in that
patch and it should be handed down instead of NULL. Ditto for the next
patch.

Thanks,

tglx


Re: [RFC PATCH] pci: add pci_irq_get_affinity_vector()

2016-11-08 Thread Thomas Gleixner
On Tue, 8 Nov 2016, Hannes Reinecke wrote:

> Add a reverse-mapping function to return the interrupt vector for
> any CPU if interrupt affinity is enabled.
> 
> Signed-off-by: Hannes Reinecke 
> ---
>  drivers/pci/msi.c   | 36 
>  include/linux/pci.h |  1 +
>  2 files changed, 37 insertions(+)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index bfdd074..de5ed32 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -1302,6 +1302,42 @@ const struct cpumask *pci_irq_get_affinity(struct pci_dev *dev, int nr)
>  }
>  EXPORT_SYMBOL(pci_irq_get_affinity);
>  
> +/**
> + * pci_irq_get_affinity_vector - return the vector number for a given CPU
> + * @dev: PCI device to operate on
> + * @cpu: cpu number
> + *
> + * Returns the vector number for CPU @cpu or a negative error number
> + * if interrupt affinity is not set.
> + */
> +int pci_irq_get_affinity_vector(struct pci_dev *dev, int cpu)
> +{
> + if (dev->msix_enabled) {
> + struct msi_desc *entry;
> +
> + for_each_pci_msi_entry(entry, dev) {
> + if (cpumask_test_cpu(cpu, entry->affinity))

entry->affinity can be NULL
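
i.e. the lookup needs a guard along these lines (a sketch standing in for the
loop inside pci_irq_get_affinity_vector(); returning the Linux irq number is
an assumption about what the reverse mapping should hand back):

#include <linux/errno.h>
#include <linux/msi.h>
#include <linux/pci.h>

static int msix_vector_for_cpu(struct pci_dev *dev, int cpu)
{
	struct msi_desc *entry;

	for_each_pci_msi_entry(entry, dev) {
		/* entries without a spread affinity mask must be skipped */
		if (entry->affinity && cpumask_test_cpu(cpu, entry->affinity))
			return entry->irq;
	}
	return -EINVAL;
}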



Re: [PATCH 3/7] genirq/affinity: Handle pre/post vectors in irq_create_affinity_masks()

2016-11-08 Thread Thomas Gleixner
On Tue, 8 Nov 2016, Christoph Hellwig wrote:

> On Tue, Nov 08, 2016 at 03:59:16PM +0100, Hannes Reinecke wrote:
> >
> > Which you don't in this patch:
> 
> True.  We will always in the end, but the split isn't right, we'll
> need to pass the non-NULL argument starting in this patch.

No, in the previous one.
 


Re: support for partial irq affinity assignment V3

2016-11-08 Thread Thomas Gleixner
On Tue, 8 Nov 2016, Jens Axboe wrote:

> On 11/08/2016 06:15 PM, Christoph Hellwig wrote:
> > This series adds support for automatic interrupt assignment to devices
> > that have a few vectors that are set aside for admin or config purposes
> > and thus should not fall into the general per-cpu assignment pool.
> > 
> > The first patch adds that support to the core IRQ and PCI/msi code,
> > and the second is a small tweak to a block layer helper to make use
> > of it.  I'd love to have both go into the same tree so that consumers
> > of this (e.g. the virtio, scsi and rdma trees) only need to pull in
> > one of these trees as dependency.
> 
> Series looks good to me, you can add my Acked-by to all of them.

It's available from

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/for-block

for you to pull into the block tree so you can apply the block changes.

Thanks,

tglx



Re: support for partial irq affinity assignment V3

2016-11-09 Thread Thomas Gleixner
On Wed, 9 Nov 2016, Christoph Hellwig wrote:
> On Wed, Nov 09, 2016 at 08:51:35AM +0100, Thomas Gleixner wrote:
> > It's available from
> > 
> >   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/for-block
> > 
> > for you to pull into the block tree so you can apply the block changes.
> 
> I don't actually need them in the block tree, but in the SCSI tree..
> So the block tree is a bad place for this, that's why I initially
> wanted you to pick it up.  But thinking about it the SCSI tree might be
> a better idea as we could avoid a non-bisectability with that.

Right.
 
> I'll coordinate it with Jens and Martin and we'll leave you off the
> hook.  Thanks for all the help!

You're welcome!

   tglx
 


Re: [PATCH] block/laptop_mode: Convert timers to use timer_setup()

2017-10-05 Thread Thomas Gleixner
On Thu, 5 Oct 2017, Jens Axboe wrote:
> On 10/05/2017 11:49 AM, Kees Cook wrote:
> > Yes, totally true. tglx and I ended up meeting face-to-face at the
> > Kernel Recipes conference and we solved some outstanding design issues
> > with the conversion. The timing meant the new API went into -rc3,
> > which seemed better than missing an entire release cycle, or carrying
> > deltas against maintainer trees that would drift. (This is actually my
> > second massive refactoring of these changes...)
> 
> Honestly, I think the change should have waited for 4.15 in that case.
> Why the rush? It wasn't ready for the merge window.

Come on. You know very well that a prerequisite for global changes which is
not yet used in Linus tree can get merged post merge window in order to
avoid massive inter maintainer tree dependencies. We've done that before.

There are only a few options we have for such global changes:

   - Delay everything for a full release and keep hunting the ever changing
 code and the new users of old interfaces, which is a pain in the
 butt. I know what I'm talking about.

   - Apply everything in a central tree, which is prone to merge conflicts
 in next when maintainer trees contain changes in the same area. So
 getting it through the maintainers is the best option for this kind of
 stuff.

   - Create a separate branch for other maintainers to pull, which I did
 often enough in the past to avoid merge dependencies.  I decided not
 to offer that branch this time, because it would have been necessary to
 pull it into a gazillion of trees. So we decided that Linus tree as a
 dependency is good enough.

 Just for the record:

 Last time I did this for block you did not complain that you got
 something based on -rc3 from me because you wanted to have these
 changes in your tree so you could apply the depending multi-queue
 patches. tip irq/for-block exists for a reason.

 So what's the difference to pull -rc3 this time? Nothing, just the
 fact that you are not interested in the change.

 So please stop this hypocritical ranting right here.

Thanks,

tglx


Re: [PATCH] block/laptop_mode: Convert timers to use timer_setup()

2017-10-05 Thread Thomas Gleixner
Jens,

On Thu, 5 Oct 2017, Jens Axboe wrote:
> On 10/05/2017 01:23 PM, Thomas Gleixner wrote:
> > Come on. You know very well that a prerequisite for global changes which is
> > not yet used in Linus tree can get merged post merge windew in order to
> > avoid massive inter maintainer tree dependencies. We've done that before.
> 
> My point is that doing it this late makes things harder than they should
> have been. If this was in for -rc1, it would have made things a lot
> easier. Or even -rc2. I try and wait to fork off the block tree for as
> long as I can, -rc2 is generally where that happens.

Well, yes. I know it's about habits. There is no real technical reason not
to merge -rc3 or later into your devel/next branch. I actually do that for
various reasons, one being that I prefer to have halfway testable
branches, which is often not the case when they are based off rc1, which is
especially true in this 4.14 cycle. The other is to pick up stuff which
went into Linus tree via an urgent branch or even got applied from mail
directly.

> I'm not judging this based on whether I find it interesting or not, but
> rather if it's something that's generally important to get in. Maybe it
> is, but I don't see any justification for that at all. So just looking
> at the isolated change, it does not strike me as something that's
> important enough to warrant special treatment (and the pain associated
> with that). I don't care about the core change, it's obviously trivial.
> Expecting maintainers to pick up this dependency asap mid cycle is what
> sucks.

I'm really not getting the 'pain' point.

'git merge linus' is not really a pain and it does not break workflows
assuming that you do that on a branch which has immutable state. If you want
to keep your branches open for rebasing due to some wreckage in the middle
of it, that's a different story.

> Please stop accusing me of being hypocritical. I'm questioning the
> timing of the change, that should be possible without someone resorting
> to ad hominem attacks.

Well, it seemed hypocritical to me for a hopefully understandable reason. I
didn't want to attack or offend you in any way.

I just know from repeated experience how painful it is to do full tree
overhauls and sit on large patch queues for a long time. There is some
point where you need to get things going and I really appreciate the work
of people doing that. Refactoring the kernel to get rid of legacy burdens
and in this case to address a popular attack vector is definitely useful
for everybody and should be supported. We tried to make it easy by pushing
this to Linus and I really did not expect that merging Linus -rc3 into a
devel/next branch is painful work to do.

As Kees said already, we can set that particular patch aside and push it
along with the rest of the ignored ones around 4.15-rc1 time so we can remove
the old interfaces. Though we hopefully won't end up with a gazillion of
ignored or considered too painful ones.

Thanks,

tglx


Re: [RESEND PATCH 1/3] completion: Add support for initializing completion with lockdep_map

2017-10-19 Thread Thomas Gleixner
On Thu, 19 Oct 2017, Bart Van Assche wrote:
> On Wed, 2017-10-18 at 18:38 +0900, Byungchul Park wrote:
> > Sometimes, we want to initialize completions with separate lockdep maps
> > to assign lock classes under control. For example, the workqueue code
> > manages lockdep maps, as it can classify lockdep maps properly.
> > Provided a function for that purpose.
> > 
> > Signed-off-by: Byungchul Park 
> > ---
> >  include/linux/completion.h | 8 
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/include/linux/completion.h b/include/linux/completion.h
> > index cae5400..182d56e 100644
> > --- a/include/linux/completion.h
> > +++ b/include/linux/completion.h
> > @@ -49,6 +49,13 @@ static inline void complete_release_commit(struct completion *x)
> > lock_commit_crosslock((struct lockdep_map *)&x->map);
> >  }
> >  
> > +#define init_completion_with_map(x, m)				\
> > +do {								\
> > +	lockdep_init_map_crosslock((struct lockdep_map *)&(x)->map,	\
> > +				   (m)->name, (m)->key, 0);		\
> > +	__init_completion(x);					\
> > +} while (0)
> 
> Are there any completion objects for which the cross-release checking is
> useful?

All of them by definition.

> Are there any wait_for_completion() callers that hold a mutex or
> other locking object?

Yes, there are also cross completion dependencies. There have been such
bugs and I expect more to be unearthed.
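
For illustration, a hypothetical pattern of such a cross dependency (not
taken from a real report): the waiter holds a mutex that the completing side
also needs, so the completion can never be signalled, and cross-release
lockdep can flag exactly this kind of chain.

#include <linux/completion.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(m);
static DECLARE_COMPLETION(done);

/* Thread A: waits for the completion while holding the mutex */
static void thread_a(void)
{
	mutex_lock(&m);
	wait_for_completion(&done);	/* blocks while holding m */
	mutex_unlock(&m);
}

/* Thread B: needs the same mutex before it can signal the completion */
static void thread_b(void)
{
	mutex_lock(&m);			/* deadlocks: m is held by thread A */
	complete(&done);
	mutex_unlock(&m);
}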

I really have to ask what your motivation is to fight the lockdep coverage
of synchronization objects tooth and nail?

Thanks,

tglx


Re: [PATCH 1/2] genirq/affinity: assign vectors to all possible CPUs

2018-01-12 Thread Thomas Gleixner
On Fri, 12 Jan 2018, Ming Lei wrote:

> From: Christoph Hellwig 
> 
> Currently we assign managed interrupt vectors to all present CPUs.  This
> works fine for systems where we only online/offline CPUs.  But in case of
> systems that support physical CPU hotplug (or the virtualized version of
> it) this means the additional CPUs covered for in the ACPI tables or on
> the command line are not catered for.  To fix this we'd either need to
> introduce new hotplug CPU states just for this case, or we can start
> assigning vectors to possible but not present CPUs.
> 
> Reported-by: Christian Borntraeger 
> Tested-by: Christian Borntraeger 
> Tested-by: Stefan Haberland 
> Cc: linux-ker...@vger.kernel.org
> Cc: Thomas Gleixner 

FWIW, Acked-by: Thomas Gleixner 



Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assigned to irq vector

2018-01-15 Thread Thomas Gleixner
On Tue, 16 Jan 2018, Ming Lei wrote:
> These two patches fix the IO hang issue reported by Laurence.
> 
> 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> may cause one irq vector assigned to all offline CPUs, then this vector
> can't handle irq any more.
> 
> The 1st patch moves irq vectors spread into one function, and prepares
> for the fix done in 2nd patch.
> 
> The 2nd patch fixes the issue by trying to make sure online CPUs assigned
> to irq vector.

Which means it's completely undoing the intent and mechanism of managed
interrupts. Not going to happen.

Which driver is that which abuses managed interrupts and does not keep its
queues properly sorted on cpu hotplug?

Thanks,

tglx


Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assigned to irq vector

2018-01-16 Thread Thomas Gleixner
On Tue, 16 Jan 2018, Ming Lei wrote:

> On Mon, Jan 15, 2018 at 09:40:36AM -0800, Christoph Hellwig wrote:
> > On Tue, Jan 16, 2018 at 12:03:43AM +0800, Ming Lei wrote:
> > > Hi,
> > > 
> > > These two patches fix the IO hang issue reported by Laurence.
> > > 
> > > 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > > may cause one irq vector assigned to all offline CPUs, then this vector
> > > can't handle irq any more.
> > 
> > Well, that very much was the intention of managed interrupts.  Why
> > does the device raise an interrupt for a queue that has no online
> > cpu assigned to it?
> 
> It is because of irq_create_affinity_masks().

That still does not answer the question. If the interrupt for a queue is
assigned to an offline CPU, then the queue should not be used and never
raise an interrupt. That's how managed interrupts have been designed.

Thanks,

tglx


Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-03-08 Thread Thomas Gleixner
On Thu, 8 Mar 2018, Ming Lei wrote:
> Actually, it isn't a real fix, the real one is in the following two:
> 
>   0c20244d458e scsi: megaraid_sas: fix selection of reply queue
>   ed6d043be8cd scsi: hpsa: fix selection of reply queue

Where are these commits? Neither Linus tree nor -next knows anything about
them.

> This patchset can't guarantee that all IRQ vectors are assigned by one
> online CPU, for example, in a quad-socket system, if only one processor
> is present, then some of vectors are still assigned by all offline CPUs,
> and it is a valid case, but still may cause io hang if drivers(hpsa,
> megaraid_sas) select reply queue in current way.

So my understanding is that these irq patches are enhancements and not bug
fixes. I'll queue them for 4.17 then.

Thanks,

tglx


Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-03-09 Thread Thomas Gleixner
On Fri, 9 Mar 2018, Ming Lei wrote:
> On Fri, Mar 09, 2018 at 12:20:09AM +0100, Thomas Gleixner wrote:
> > On Thu, 8 Mar 2018, Ming Lei wrote:
> > > Actually, it isn't a real fix, the real one is in the following two:
> > > 
> > >   0c20244d458e scsi: megaraid_sas: fix selection of reply queue
> > >   ed6d043be8cd scsi: hpsa: fix selection of reply queue
> > 
> > Where are these commits? Neither Linus tree nor -next knows anything about
> > them
> 
> Both aren't merged yet, but they should land in v4.16, IMO.
> 
> > 
> > > This patchset can't guarantee that all IRQ vectors are assigned by one
> > > online CPU, for example, in a quad-socket system, if only one processor
> > > is present, then some of vectors are still assigned by all offline CPUs,
> > > and it is a valid case, but still may cause io hang if drivers(hpsa,
> > > megaraid_sas) select reply queue in current way.
> > 
> > So my understanding is that these irq patches are enhancements and not bug
> > fixes. I'll queue them for 4.17 then.
> 
> Wrt. this IO hang issue, these patches shouldn't be a bug fix, but they may
> fix a performance regression[1] for some systems caused by 84676c1f21
> ("genirq/affinity: assign vectors to all possible CPUs").
> 
> [1] https://marc.info/?l=linux-block&m=152050347831149&w=2

Hmm. The patches are rather large for urgent and possibly for backporting. Is
there a simpler way to address that performance issue?

Thanks,

tglx


Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-03-09 Thread Thomas Gleixner
On Fri, 9 Mar 2018, Ming Lei wrote:
> On Fri, Mar 09, 2018 at 11:08:54AM +0100, Thomas Gleixner wrote:
> > > > So my understanding is that these irq patches are enhancements and not
> > > > bug fixes. I'll queue them for 4.17 then.
> > > 
> > > Wrt. this IO hang issue, these patches shouldn't be a bug fix, but they may
> > > fix a performance regression[1] for some systems caused by 84676c1f21
> > > ("genirq/affinity: assign vectors to all possible CPUs").
> > > 
> > > [1] https://marc.info/?l=linux-block&m=152050347831149&w=2
> > 
> > Hmm. The patches are rather large for urgent and evtl. backporting. Is
> > there a simpler way to address that performance issue?
> 
> Not thought of a simpler solution. The problem is that the number of active
> msix vectors is decreased a lot by commit 84676c1f21.

It's reduced in cases where the number of possible CPUs is way larger than
the number of online CPUs.

Now, if you look at the number of present CPUs on such systems it's
probably the same as the number of online CPUs.

It only differs on machines which support physical hotplug, but that's not
the normal case. Those systems are more special and less widespread.

So the obvious simple fix for this regression issue is to spread out the
vectors across present CPUs and not across possible CPUs.

I'm not sure if there is a clear indicator whether physical hotplug is
supported or not, but the ACPI folks (x86) and architecture maintainers
should be able to answer that question. I have a machine which says:

   smpboot: Allowing 128 CPUs, 96 hotplug CPUs

There is definitely no way to hotplug anything on that machine and sure the
existing spread algorithm will waste vectors to no end.

Sure then there is virt, which can pretend to have a gazillion of possible
hotpluggable CPUs, but virt is an insanity on its own. Though someone might
come up with reasonable heuristics for that as well.

Thoughts?

Thanks,

tglx


Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-04-03 Thread Thomas Gleixner
Ming,

On Fri, 30 Mar 2018, Ming Lei wrote:
> On Fri, Mar 09, 2018 at 04:08:19PM +0100, Thomas Gleixner wrote:
> > Thoughts?
> 
> Given this patchset doesn't have effect on normal machines without
> supporting physical CPU hotplug, it can fix performance regression on
> machines which might support physical CPU hotplug(cpu_present_mask !=
> cpu_possible_mask) with some extra memory allocation cost.
>
> So is there any chance to make it in v4.17?

Sorry, that thing fell through the cracks. I'll queue it now and try to do
a pull request late in the merge window.

Thanks,

tglx




Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-04-03 Thread Thomas Gleixner
On Thu, 8 Mar 2018, Ming Lei wrote:
> 1) before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
>   irq 39, cpu list 0
>   irq 40, cpu list 1
>   irq 41, cpu list 2
>   irq 42, cpu list 3
> 
> 2) after 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
>   irq 39, cpu list 0-2
>   irq 40, cpu list 3-4,6
>   irq 41, cpu list 5
>   irq 42, cpu list 7
> 
> 3) after applying this patch against V4.15+:
>   irq 39, cpu list 0,4
>   irq 40, cpu list 1,6
>   irq 41, cpu list 2,5
>   irq 42, cpu list 3,7

That's more or less window dressing. If the device is already in use when
the offline CPUs get hot plugged, then the interrupts still stay on cpu 0-3
because the effective affinity of interrupts on X86 (and other
architectures) is always a single CPU.

So this only might move interrupts to the hotplugged CPUs when the device
is initialized after CPU hotplug and the actual vector allocation moves an
interrupt out to the higher numbered CPUs if they have less vectors
allocated than the lower numbered ones.

Probably not an issue for the majority of machines where ACPI stupidly
claims that extra CPUs are possible while there is absolutely no support for
physical hotplug.

Though in scenarios where "physical" hotplug is possible, e.g. virt, this
might come as a surprise.

Thanks,

tglx


Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-04-04 Thread Thomas Gleixner
On Wed, 4 Apr 2018, Ming Lei wrote:
> On Tue, Apr 03, 2018 at 03:32:21PM +0200, Thomas Gleixner wrote:
> > On Thu, 8 Mar 2018, Ming Lei wrote:
> > > 1) before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > >   irq 39, cpu list 0
> > >   irq 40, cpu list 1
> > >   irq 41, cpu list 2
> > >   irq 42, cpu list 3
> > > 
> > > 2) after 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > >   irq 39, cpu list 0-2
> > >   irq 40, cpu list 3-4,6
> > >   irq 41, cpu list 5
> > >   irq 42, cpu list 7
> > > 
> > > 3) after applying this patch against V4.15+:
> > >   irq 39, cpu list 0,4
> > >   irq 40, cpu list 1,6
> > >   irq 41, cpu list 2,5
> > >   irq 42, cpu list 3,7
> > 
> > That's more or less window dressing. If the device is already in use when
> > the offline CPUs get hot plugged, then the interrupts still stay on cpu 0-3
> > because the effective affinity of interrupts on X86 (and other
> > architectures) is always a single CPU.
> > 
> > So this only might move interrupts to the hotplugged CPUs when the device
> > is initialized after CPU hotplug and the actual vector allocation moves an
> > interrupt out to the higher numbered CPUs if they have less vectors
> > allocated than the lower numbered ones.
> 
> It works for blk-mq devices, such as NVMe.
> 
> Now NVMe driver creates num_possible_cpus() hw queues, and each
> hw queue is assigned one msix irq vector.
> 
> Storage is Client/Server model, that means the interrupt is only
> delivered to CPU after one IO request is submitted to hw queue and
> it is completed by this hw queue.
> 
> When CPUs is hotplugged, and there will be IO submitted from these
> CPUs, then finally IOs complete and irq events are generated from
> hw queues, and notify these submission CPU by IRQ finally.

I'm aware how that hw-queue stuff works. But that only works if the
spreading algorithm makes the interrupts affine to offline/not-present CPUs
when the block device is initialized.

In the example above:

> > >   irq 39, cpu list 0,4
> > >   irq 40, cpu list 1,6
> > >   irq 41, cpu list 2,5
> > >   irq 42, cpu list 3,7

and assuming that at driver init time only CPU 0-3 are online, then the
hotplug of CPU 4-7 will not result in any interrupt delivered to CPU 4-7.

So the extra assignment to CPU 4-7 in the affinity mask has no effect
whatsoever and even if the spreading result is 'perfect' it just looks
perfect as it is not making any difference versus the original result:

> > >   irq 39, cpu list 0
> > >   irq 40, cpu list 1
> > >   irq 41, cpu list 2
> > >   irq 42, cpu list 3

Thanks,

tglx




Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-04-04 Thread Thomas Gleixner
On Wed, 4 Apr 2018, Thomas Gleixner wrote:
> I'm aware how that hw-queue stuff works. But that only works if the
> spreading algorithm makes the interrupts affine to offline/not-present CPUs
> when the block device is initialized.
> 
> In the example above:
> 
> > > > irq 39, cpu list 0,4
> > > > irq 40, cpu list 1,6
> > > > irq 41, cpu list 2,5
> > > > irq 42, cpu list 3,7
> 
> and assuming that at driver init time only CPU 0-3 are online, then the
> hotplug of CPU 4-7 will not result in any interrupt delivered to CPU 4-7.
> 
> So the extra assignment to CPU 4-7 in the affinity mask has no effect
> whatsoever and even if the spreading result is 'perfect' it just looks
> perfect as it is not making any difference versus the original result:
> 
> > > >   irq 39, cpu list 0
> > > >   irq 40, cpu list 1
> > > >   irq 41, cpu list 2
> > > >   irq 42, cpu list 3

And looking deeper into the changes, I think that the first spreading step
has to use cpu_present_mask and not cpu_online_mask.

Assume the following scenario:

Machine with 8 present CPUs is booted, the last 4 CPUs are
unplugged. Device with 4 queues is initialized.

The resulting spread is going to be exactly your example:

irq 39, cpu list 0,4
irq 40, cpu list 1,6
irq 41, cpu list 2,5
irq 42, cpu list 3,7

Now the 4 offline CPUs are plugged in again. These CPUs won't ever get an
interrupt as all interrupts stay on CPU 0-3 unless one of these CPUs is
unplugged. Using cpu_present_mask the spread would be:

irq 39, cpu list 0,1
irq 40, cpu list 2,3
irq 41, cpu list 4,5
irq 42, cpu list 6,7

while on a machine where CPU 4-7 are NOT present, but advertised as
possible the spread would be:

irq 39, cpu list 0,4
irq 40, cpu list 1,6
irq 41, cpu list 2,5
irq 42, cpu list 3,7

Hmm?

Thanks,

tglx





Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-04-04 Thread Thomas Gleixner
On Wed, 4 Apr 2018, Ming Lei wrote:
> On Wed, Apr 04, 2018 at 10:25:16AM +0200, Thomas Gleixner wrote:
> > In the example above:
> > 
> > > > >   irq 39, cpu list 0,4
> > > > >   irq 40, cpu list 1,6
> > > > >   irq 41, cpu list 2,5
> > > > >   irq 42, cpu list 3,7
> > 
> > and assumed that at driver init time only CPU 0-3 are online then the
> > hotplug of CPU 4-7 will not result in any interrupt delivered to CPU 4-7.
> 
> Indeed, and I just tested this case, and found that no interrupts are
> delivered to CPU 4-7.
> 
> In theory, the affinity has been assigned to these irq vectors, and
> programmed to interrupt controller, I understand it should work.
> 
> Could you explain it a bit why interrupts aren't delivered to CPU 4-7?

As I explained before:

"If the device is already in use when the offline CPUs get hot plugged, then
 the interrupts still stay on cpu 0-3 because the effective affinity of
 interrupts on X86 (and other architectures) is always a single CPU."

IOW. If you set the affinity mask so it contains more than one CPU then the
kernel selects a single CPU as target. The selected CPU must be online and
if there is more than one online CPU in the mask then the kernel picks the
one which has the least number of interrupts targeted at it. This selected
CPU target is programmed into the corresponding interrupt chip
(IOAPIC/MSI/MSIX) and it stays that way until the selected target CPU
goes offline or the affinity mask changes.

The reasons why we use single target delivery on X86 are:

   1) Not all X86 systems support multi target delivery

   2) If a system supports multi target delivery then the interrupt is
  preferably delivered to the CPU with the lowest APIC ID (which
  usually corresponds to the lowest CPU number) due to hardware magic
  and only a very small percentage of interrupts are delivered to the
  other CPUs in the multi target set. So the benefit is rather dubious
  and extensive performance testing did not show any significant
  difference.

   3) The management of multi targets on the software side is painful as
  the same low level vector number has to be allocated on all possible
  target CPUs. That's making a lot of things including hotplug more
  complex for very little - if at all - benefit.

So at some point we ripped out the multi target support on X86 and moved
everything to single target delivery mode.

Other architectures never supported multi target delivery either due to
hardware restrictions or for similar reasons why X86 dropped it. There
might be a few architectures which support it, but I have no overview at
the moment.

The information is in procfs

# cat /proc/irq/9/smp_affinity_list 
0-3
# cat /proc/irq/9/effective_affinity_list 
1

# cat /proc/irq/10/smp_affinity_list 
0-3
# cat /proc/irq/10/effective_affinity_list 
2

smp_affinity[_list] is the affinity which is set either by the kernel or by
writing to /proc/irq/$N/smp_affinity[_list]

effective_affinity[_list] is the affinity which is effective, i.e. the
single target CPU to which the interrupt is affine at this point.

As you can see in the above examples the target CPU is selected from the
given possible target set and the internal spreading of the low level x86
vector allocation code picks a CPU which has the lowest number of
interrupts targeted at it.
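Condensed into code, the selection step looks roughly like this. This is a
sketch only, not the real x86 vector allocation code; irq_count_on() is a
made up stand-in for the per CPU vector accounting the allocator really uses:

static unsigned int pick_effective_target(const struct cpumask *affinity)
{
        unsigned int cpu, best = nr_cpu_ids, best_cnt = UINT_MAX;

        /* Only online CPUs in the requested mask are valid targets */
        for_each_cpu_and(cpu, affinity, cpu_online_mask) {
                unsigned int cnt = irq_count_on(cpu);   /* made up helper */

                if (cnt < best_cnt) {
                        best_cnt = cnt;
                        best = cpu;
                }
        }
        /* nr_cpu_ids means: no online CPU in the mask */
        return best;
}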

Let's assume for the example below

# cat /proc/irq/10/smp_affinity_list 
0-3
# cat /proc/irq/10/effective_affinity_list 
2

that CPU 3 was offline when the device was initialized. So there was no way
to select it and when CPU 3 comes online there is no reason to change the
affinity of that interrupt, at least not from the kernel POV. Actually we
don't even have a mechanism to do so automagically.

If I offline CPU 2 after onlining CPU 3 then the kernel has to move the
interrupt away from CPU 2, so it selects CPU 3 as it's the one with the
lowest number of interrupts targeted at it.

Now this is a bit different if you use affinity managed interrupts like
NVME and other devices do.

Many of these devices create one queue per possible CPU, so the spreading
is simple; One interrupt per possible cpu. Pretty boring.

When the device has less queues than possible CPUs, then stuff gets more
interesting. The queues and therefore the interrupts must be targeted at
multiple CPUs. There is some logic which spreads them over the numa nodes
and takes siblings into account when Hyperthreading is enabled.
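The spreading itself boils down to something like the sketch below. Rough
illustration only; the real code lives in kernel/irq/affinity.c and handles
the corner cases properly:

static void sketch_spread(unsigned int nvecs)
{
        unsigned int node, vecs_per_node, cpus_per_vec;

        vecs_per_node = max(1U, nvecs / num_online_nodes());

        for_each_online_node(node) {
                const struct cpumask *nodemask = cpumask_of_node(node);

                /* Number of CPUs which have to share one interrupt on this node */
                cpus_per_vec = DIV_ROUND_UP(cpumask_weight(nodemask),
                                            vecs_per_node);
                /*
                 * If more than one CPU shares an interrupt, group the HT
                 * siblings (topology_sibling_cpumask()) into the same
                 * interrupt so a core and its siblings end up on the same
                 * queue.
                 */
                pr_info("node %u: %u CPUs share each interrupt\n",
                        node, cpus_per_vec);
        }
}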

In both cases the managed interrupts are handled over CPU soft
hotplug/unplug:

  1) If a CPU is soft unplugged and an interrupt is targeted at the CPU
 then the interrupt is either moved to a still online CPU in the
 affinity mask or if the outgoing CPU is the last one in the affinity
 mask it is shut down.

  2) If a CPU is soft plugged then the

Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-04-05 Thread Thomas Gleixner
On Wed, 4 Apr 2018, Ming Lei wrote:
> On Wed, Apr 04, 2018 at 02:45:18PM +0200, Thomas Gleixner wrote:
> > Now the 4 offline CPUs are plugged in again. These CPUs won't ever get an
> > interrupt as all interrupts stay on CPU 0-3 unless one of these CPUs is
> > unplugged. Using cpu_present_mask the spread would be:
> > 
> > irq 39, cpu list 0,1
> > irq 40, cpu list 2,3
> > irq 41, cpu list 4,5
> > irq 42, cpu list 6,7
> 
> Given physical CPU hotplug isn't common, this way will make only irq 39
> and irq 40 active most of times, so performance regression is caused just
> as Kashyap reported.

That is only true, if CPU 4-7 are in the present mask at boot time. I
seriously doubt that this is the case for Kashyaps scenario. Grrr, if you
would have included him into the Reported-by: tags then I could have asked
him myself.

In the physical hotplug case, the physically (or virtually) not available
CPUs are not in the present mask. They are solely in the possible mask.

The above is about soft hotplug where the CPUs are physically there and
therefore in the present mask and can be onlined without interaction from
the outside (mechanical or virt config).

If nobody objects, I'll make that change and queue the stuff tomorrow
morning so it can brew a few days in next before I send it off to Linus.

Thanks,

tglx


Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-04-06 Thread Thomas Gleixner
On Fri, 6 Apr 2018, Ming Lei wrote:
> 
> I will post V4 soon by using cpu_present_mask in the 1st stage irq spread.
> And it should work fine for Kashyap's case in normal cases.

No need to resend. I've changed it already and will push it out after
lunch.

Thanks,

tglx


Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-04-06 Thread Thomas Gleixner
On Fri, 6 Apr 2018, Thomas Gleixner wrote:

> On Fri, 6 Apr 2018, Ming Lei wrote:
> > 
> > I will post V4 soon by using cpu_present_mask in the 1st stage irq spread.
> > And it should work fine for Kashyap's case in normal cases.
> 
> No need to resend. I've changed it already and will push it out after
> lunch.

No. Lunch did not last that long :)

I pushed out the lot to

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core

Please double check the modifications I did. The first related commit fixes
an existing error handling bug.

Thanks,

tglx


Re: [PATCH 3/6] genirq/affinity: update CPU affinity for CPU hotplug events

2017-02-10 Thread Thomas Gleixner
On Fri, 3 Feb 2017, Christoph Hellwig wrote:
> @@ -127,6 +127,7 @@ enum cpuhp_state {
>   CPUHP_AP_ONLINE_IDLE,
>   CPUHP_AP_SMPBOOT_THREADS,
>   CPUHP_AP_X86_VDSO_VMA_ONLINE,
> + CPUHP_AP_IRQ_AFFINIY_ONLINE,

s/AFFINIY/AFFINITY/ perhaps?

> +static void __irq_affinity_set(unsigned int irq, struct irq_desc *desc,
> + cpumask_t *mask)

static void __irq_affinity_set(unsigned int irq, struct irq_desc *desc,
   cpumask_t *mask)

Please

> +{
> + struct irq_data *data = irq_desc_get_irq_data(desc);
> + struct irq_chip *chip = irq_data_get_irq_chip(data);
> + int ret;
> +
> + if (!irqd_can_move_in_process_context(data) && chip->irq_mask)
> + chip->irq_mask(data);
> + ret = chip->irq_set_affinity(data, mask, true);
> + WARN_ON_ONCE(ret);
> +
> + /*
> +  * We unmask if the irq was not marked masked by the core code.
> +  * That respects the lazy irq disable behaviour.
> +  */
> + if (!irqd_can_move_in_process_context(data) &&
> + !irqd_irq_masked(data) && chip->irq_unmask)
> + chip->irq_unmask(data);
> +}

This looks very familiar. arch/x86/kernel/irq.c comes to mind

> +
> +static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc,
> + unsigned int cpu)
> +{
> + const struct cpumask *affinity;
> + struct irq_data *data;
> + struct irq_chip *chip;
> + unsigned long flags;
> + cpumask_var_t mask;
> +
> + if (!desc)
> + return;
> +
> + raw_spin_lock_irqsave(&desc->lock, flags);
> +
> + data = irq_desc_get_irq_data(desc);
> + affinity = irq_data_get_affinity_mask(data);
> + if (!irqd_affinity_is_managed(data) ||
> + !irq_has_action(irq) ||
> + !cpumask_test_cpu(cpu, affinity))
> + goto out_unlock;
> +
> + /*
> +  * The interrupt descriptor might have been cleaned up
> +  * already, but it is not yet removed from the radix tree
> +  */
> + chip = irq_data_get_irq_chip(data);
> + if (!chip)
> + goto out_unlock;
> +
> + if (WARN_ON_ONCE(!chip->irq_set_affinity))
> + goto out_unlock;
> +
> + if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {

You really want to allocate that _BEFORE_ locking desc->lock. GFP_KERNEL
inside the lock held region is wrong and shows that this was never tested :)

And no, we don't want GFP_ATOMIC here. You can allocate it once at the call
site and hand it in, so you avoid the alloc/free dance when iterating over
a large number of descriptors.

> + pr_err("failed to allocate memory for cpumask\n");
> + goto out_unlock;
> + }
> +
> + cpumask_and(mask, affinity, cpu_online_mask);
> + cpumask_set_cpu(cpu, mask);
> + if (irqd_has_set(data, IRQD_AFFINITY_SUSPENDED)) {
> + irq_startup(desc, false);
> + irqd_clear(data, IRQD_AFFINITY_SUSPENDED);
> + } else {
> + __irq_affinity_set(irq, desc, mask);
> + }
> +
> + free_cpumask_var(mask);
> +out_unlock:
> + raw_spin_unlock_irqrestore(&desc->lock, flags);
> +}



> +int irq_affinity_online_cpu(unsigned int cpu)
> +{
> + struct irq_desc *desc;
> + unsigned int irq;
> +
> + for_each_irq_desc(irq, desc)
> + irq_affinity_online_irq(irq, desc, cpu);

That lacks protection against concurrent irq setup/teardown. Wants to be
protected with irq_lock_sparse()
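Something like the following covers both the single allocation mentioned
above and the sparse locking. Sketch only; the extra 'mask' argument to
irq_affinity_online_irq() is the hand-in suggested above, not the current
signature:

int irq_affinity_online_cpu(unsigned int cpu)
{
        struct irq_desc *desc;
        cpumask_var_t mask;
        unsigned int irq;

        if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
                return -ENOMEM;

        /* Protect against concurrent irq descriptor setup/teardown */
        irq_lock_sparse();
        for_each_irq_desc(irq, desc)
                irq_affinity_online_irq(irq, desc, cpu, mask);
        irq_unlock_sparse();

        free_cpumask_var(mask);
        return 0;
}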

> + return 0;
> +}
> +
> +static void irq_affinity_offline_irq(unsigned int irq, struct irq_desc *desc,
> + unsigned int cpu)
> +{
> + const struct cpumask *affinity;
> + struct irq_data *data;
> + struct irq_chip *chip;
> + unsigned long flags;
> + cpumask_var_t mask;
> +
> + if (!desc)
> + return;
> +
> + raw_spin_lock_irqsave(&desc->lock, flags);
> +
> + data = irq_desc_get_irq_data(desc);
> + affinity = irq_data_get_affinity_mask(data);
> + if (!irqd_affinity_is_managed(data) ||
> + !irq_has_action(irq) ||
> + irqd_has_set(data, IRQD_AFFINITY_SUSPENDED) ||
> + !cpumask_test_cpu(cpu, affinity))
> + goto out_unlock;
> +
> + /*
> +  * Complete the irq move. This cpu is going down and for
> +  * non intr-remapping case, we can't wait till this interrupt
> +  * arrives at this cpu before completing the irq move.
> +  */
> + irq_force_complete_move(desc);

Hmm. That's what we do in x86 when the cpu is really dying, i.e. before it
really goes away. It's the last resort we have.

So if a move is pending, then you force it here and then you call
__irq_affinity_set() further down, which queues another pending move, which
then gets cleaned up in the cpu dying code.

If a move is pending, then you should first verify whether the pending
affinity mask is different from the one you are going to set. If it's the
same, you can just let the final cleanup code do its job. If not, then you
need to

[BUG] Potential deadlock in the block layer

2017-02-13 Thread Thomas Gleixner
Gabriel reported the lockdep splat below while investigating something
different.

Explanation for the splat is in the function comment above
del_timer_sync().

I can reproduce it as well and it's clearly broken.

Thanks,

tglx

---

[   81.518032] ==
[   81.520151] [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
[   81.522276] 4.10.0-rc8-debug-dirty #1 Tainted: G  I
[   81.524445] --
[   81.526560] (systemd)/1725 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[   81.528672]  (((&rwb->window_timer))){+.-...}, at: [] 
del_timer_sync+0x0/0xd0
[   81.530973]
and this task is already holding:
[   81.535399]  (&(&q->__queue_lock)->rlock){-.-...}, at: [] 
cfq_set_request+0x80/0x2c0
[   81.537637] which would create a new lock dependency:
[   81.539870]  (&(&q->__queue_lock)->rlock){-.-...} -> 
(((&rwb->window_timer))){+.-...}
[   81.542145]
but this new dependency connects a HARDIRQ-irq-safe lock:
[   81.546717]  (&(&q->__queue_lock)->rlock){-.-...}
[   81.546720]
... which became HARDIRQ-irq-safe at:
[   81.553643]   __lock_acquire+0x24f/0x19e0
[   81.555970]   lock_acquire+0xa5/0xd0
[   81.558302]   _raw_spin_lock_irqsave+0x54/0x90
[   81.560667]   ata_qc_schedule_eh+0x56/0x80 [libata]
[   81.563035]   ata_qc_complete+0x99/0x1c0 [libata]
[   81.565410]   ata_do_link_abort+0x93/0xd0 [libata]
[   81.567786]   ata_port_abort+0xb/0x10 [libata]
[   81.570150]   ahci_handle_port_interrupt+0x41e/0x570 [libahci]
[   81.572533]   ahci_handle_port_intr+0x74/0xb0 [libahci]
[   81.574918]   ahci_single_level_irq_intr+0x3b/0x60 [libahci]
[   81.577311]   __handle_irq_event_percpu+0x35/0x100
[   81.579696]   handle_irq_event_percpu+0x1e/0x50
[   81.582065]   handle_irq_event+0x34/0x60
[   81.584427]   handle_edge_irq+0x112/0x140
[   81.586776]   handle_irq+0x18/0x20
[   81.589109]   do_IRQ+0x87/0x110
[   81.591403]   ret_from_intr+0x0/0x20
[   81.593666]   cpuidle_enter_state+0x102/0x230
[   81.595939]   cpuidle_enter+0x12/0x20
[   81.598202]   call_cpuidle+0x38/0x40
[   81.600443]   do_idle+0x16e/0x1e0
[   81.602670]   cpu_startup_entry+0x5d/0x60
[   81.604890]   rest_init+0x12c/0x140
[   81.607080]   start_kernel+0x45f/0x46c
[   81.609252]   x86_64_start_reservations+0x2a/0x2c
[   81.611399]   x86_64_start_kernel+0xeb/0xf8
[   81.613505]   verify_cpu+0x0/0xfc
[   81.615574]
to a HARDIRQ-irq-unsafe lock:
[   81.619611]  (((&rwb->window_timer))){+.-...}
[   81.619614]
... which became HARDIRQ-irq-unsafe at:
[   81.625562] ...
[   81.625565]   __lock_acquire+0x2ba/0x19e0
[   81.629504]   lock_acquire+0xa5/0xd0
[   81.631467]   call_timer_fn+0x74/0x110
[   81.633397]   expire_timers+0xaa/0xd0
[   81.635287]   run_timer_softirq+0x72/0x140
[   81.637145]   __do_softirq+0x143/0x290
[   81.638974]   irq_exit+0x6a/0xd0
[   81.640818]   smp_trace_apic_timer_interrupt+0x79/0x90
[   81.642698]   smp_apic_timer_interrupt+0x9/0x10
[   81.644572]   apic_timer_interrupt+0x93/0xa0
[   81.646445]   cpuidle_enter_state+0x102/0x230
[   81.648332]   cpuidle_enter+0x12/0x20
[   81.650211]   call_cpuidle+0x38/0x40
[   81.652096]   do_idle+0x16e/0x1e0
[   81.653984]   cpu_startup_entry+0x5d/0x60
[   81.655855]   start_secondary+0x150/0x180
[   81.657696]   verify_cpu+0x0/0xfc
[   81.659497]
other info that might help us debug this:

[   81.664802]  Possible interrupt unsafe locking scenario:

[   81.668296]        CPU0                    CPU1
[   81.670010]        ----                    ----
[   81.671685]   lock(((&rwb->window_timer)));
[   81.673358]                                local_irq_disable();
[   81.675018]                                lock(&(&q->__queue_lock)->rlock);
[   81.676659]                                lock(((&rwb->window_timer)));
[   81.678275]   <Interrupt>
[   81.679845]     lock(&(&q->__queue_lock)->rlock);
[   81.681419]
 *** DEADLOCK ***

[   81.685989] 1 lock held by (systemd)/1725:
[   81.687518]  #0:  (&(&q->__queue_lock)->rlock){-.-...}, at: 
[] cfq_set_request+0x80/0x2c0
[   81.689125]
the dependencies between HARDIRQ-irq-safe lock and the holding 
lock:
[   81.692327] -> (&(&q->__queue_lock)->rlock){-.-...} ops: 70140 {
[   81.693971]IN-HARDIRQ-W at:
[   81.695606] __lock_acquire+0x24f/0x19e0
[   81.697262] lock_acquire+0xa5/0xd0
[   81.698906] _raw_spin_lock_irqsave+0x54/0x90
[   81.700568] ata_qc_schedule_eh+0x56/0x80 [libata]
[   81.702229] ata_qc_complete+0x99/0x1c0 [libata]
[   81.703884] ata_do_link_abort+0x93/0xd0 [libata]
[   81.705540] ata_port_abort+0xb/0x10 [libata]
[   81.707199] ahci_handle_port_interrupt+0x41e/0x570 
[libahci]
[   81.708881] ahci_handle_port_intr+0x74/

Re: [BUG] Potential deadlock in the block layer

2017-02-16 Thread Thomas Gleixner
On Mon, 13 Feb 2017, Jens Axboe wrote:

> On 02/13/2017 07:14 AM, Thomas Gleixner wrote:
> > Gabriel reported the lockdep splat below while investigating something
> > different.
> > 
> > Explanation for the splat is in the function comment above
> > del_timer_sync().
> > 
> > I can reproduce it as well and it's clearly broken.
> 
> Agree, the disable logic is broken. The below isn't super pretty, but it
> should do the trick for 4.10.

Cures the splat. Is that a new 4.10 'feature'? That wbt stuff was merged a
few releases ago.

Thanks,

tglx


Re: [PATCH 2/7] genirq/affinity: assign vectors to all present CPUs

2017-05-21 Thread Thomas Gleixner
On Fri, 19 May 2017, Christoph Hellwig wrote:
> - /* Stabilize the cpumasks */
> - get_online_cpus();

How is that protected against physical CPU hotplug? Physical CPU hotplug
manipulates the present mask.

> - nodes = get_nodes_in_cpumask(cpu_online_mask, &nodemsk);
> + nodes = get_nodes_in_cpumask(cpu_present_mask, &nodemsk);
> +static int __init irq_build_cpumap(void)
> +{
> + int node, cpu;
> +
> + for (node = 0; node < nr_node_ids; node++) {
> + if (!zalloc_cpumask_var(&node_to_present_cpumask[node],
> + GFP_KERNEL))
> + panic("can't allocate early memory\n");
> + }
>  
> - return min(cpus, vecs) + resv;
> + for_each_present_cpu(cpu) {
> + node = cpu_to_node(cpu);
> + cpumask_set_cpu(cpu, node_to_present_cpumask[node]);
> + }

This mask needs updating on physical hotplug as well.

Thanks,

tglx


Re: [PATCH 3/7] genirq/affinity: factor out a irq_affinity_set helper

2017-05-21 Thread Thomas Gleixner
On Fri, 19 May 2017, Christoph Hellwig wrote:

> Factor out code from the x86 cpu hot plug code to program the affinity
> for a vector for a hot plug / hot unplug event.
> +bool irq_affinity_set(int irq, struct irq_desc *desc, const cpumask_t *mask)
> +{
> + struct irq_data *data = irq_desc_get_irq_data(desc);
> + struct irq_chip *chip = irq_data_get_irq_chip(data);
> + bool ret = false;
> +
> + if (!irqd_can_move_in_process_context(data) && chip->irq_mask)

Using that inline directly will make this function useless on architectures
which have GENERIC_PENDING_IRQS not set.

kernel/irq/manage.c has that wrapped. We need to move those wrappers to the
internal header file, so that it gets compiled out on other platforms.
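The pattern in question is roughly this. Illustrative sketch of the
internals.h wrappers, exact naming aside; the stub for the case where
CONFIG_GENERIC_PENDING_IRQ is not set is what makes the caller work
everywhere:

/* kernel/irq/internals.h */
#ifdef CONFIG_GENERIC_PENDING_IRQ
static inline bool irq_can_move_pcntxt(struct irq_data *data)
{
        return irqd_can_move_in_process_context(data);
}
#else
/* No pending move machinery: affinity can always be set in process context */
static inline bool irq_can_move_pcntxt(struct irq_data *data)
{
        return true;
}
#endif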

Thanks,

tglx


Re: spread MSI(-X) vectors to all possible CPUs V2

2017-06-16 Thread Thomas Gleixner
On Fri, 16 Jun 2017, Christoph Hellwig wrote:
> can you take a look at the generic patches as they are the required
> base for the block work?

It's next on my ever-growing todo list.



Re: [PATCH 3/8] genirq/affinity: factor out a irq_affinity_set helper

2017-06-16 Thread Thomas Gleixner
On Sat, 3 Jun 2017, Christoph Hellwig wrote:
> +
> +bool irq_affinity_set(int irq, struct irq_desc *desc, const cpumask_t *mask)

This should be named irq_affinity_force() because it circumvents the 'move
in irq context' mechanism. I'll do that myself. No need to resend.

Thanks,

tglx



Re: [PATCH 5/8] genirq/affinity: update CPU affinity for CPU hotplug events

2017-06-16 Thread Thomas Gleixner
On Sat, 3 Jun 2017, Christoph Hellwig wrote:
> +static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc,
> + unsigned int cpu)
> +{
> +
> + cpumask_and(mask, affinity, cpu_online_mask);
> + cpumask_set_cpu(cpu, mask);
> + if (irqd_has_set(data, IRQD_AFFINITY_SUSPENDED)) {
> + irq_startup(desc, false);
> + irqd_clear(data, IRQD_AFFINITY_SUSPENDED);
> + } else {
> + irq_affinity_set(irq, desc, mask);

I don't think this should be forced. In that case we really can use the
regular mechanism.

Thanks,

tglx


Re: [PATCH 5/8] genirq/affinity: update CPU affinity for CPU hotplug events

2017-06-16 Thread Thomas Gleixner
On Sat, 3 Jun 2017, Christoph Hellwig wrote:
> +static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc,
> + unsigned int cpu)
> +{
> + const struct cpumask *affinity;
> + struct irq_data *data;
> + struct irq_chip *chip;
> + unsigned long flags;
> + cpumask_var_t mask;
> +
> + if (!desc)
> + return;
> + if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
> + return;

That's silly. Why would you want to do that for each irq descriptor?
Either allocate it once in irq_affinity_[on|off]line_cpu() or just make it
a static cpumask. Lemme fix that up.
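A sketch of the static variant, just to illustrate the idea (not the actual
fixup). The cpuhp callbacks are serialized by the hotplug locking, so one
static scratch mask is sufficient:

static struct cpumask irq_affinity_scratch;

static void irq_affinity_online_irq(unsigned int irq, struct irq_desc *desc,
                                    unsigned int cpu)
{
        struct cpumask *mask = &irq_affinity_scratch;
        const struct cpumask *affinity;
        struct irq_data *data;
        unsigned long flags;

        if (!desc)
                return;

        raw_spin_lock_irqsave(&desc->lock, flags);
        data = irq_desc_get_irq_data(desc);
        affinity = irq_data_get_affinity_mask(data);

        cpumask_and(mask, affinity, cpu_online_mask);
        cpumask_set_cpu(cpu, mask);
        /* ... proceed as in the patch, minus the allocation and free path ... */
        raw_spin_unlock_irqrestore(&desc->lock, flags);
}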

Thanks,

tglx




Re: [PATCH 3/8] genirq/affinity: factor out a irq_affinity_set helper

2017-06-16 Thread Thomas Gleixner
On Sat, 3 Jun 2017, Christoph Hellwig wrote:
> +bool irq_affinity_set(int irq, struct irq_desc *desc, const cpumask_t *mask)
> +{
> + struct irq_data *data = irq_desc_get_irq_data(desc);
> + struct irq_chip *chip = irq_data_get_irq_chip(data);
> + bool ret = false;
> +
> + if (!irq_can_move_pcntxt(data) && chip->irq_mask)
> + chip->irq_mask(data);
> +
> + if (chip->irq_set_affinity) {
> + if (chip->irq_set_affinity(data, mask, true) == -ENOSPC)
> + pr_crit("IRQ %d set affinity failed because there are 
> no available vectors.  The device assigned to this IRQ is unstable.\n", irq);
> + ret = true;
> + }
> +
> + /*
> +  * We unmask if the irq was not marked masked by the core code.
> +  * That respects the lazy irq disable behaviour.
> +  */
> + if (!irq_can_move_pcntxt(data) &&
> + !irqd_irq_masked(data) && chip->irq_unmask)
> + chip->irq_unmask(data);

There is another issue with this. Nothing updates the affinity mask in
irq_desc, when we just invoke the chip callback. Let me have a look.
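For reference, keeping the core's copy in sync would look roughly like what
irq_do_set_affinity() does after the chip callback. Sketch only:

static int chip_set_affinity_and_sync(struct irq_desc *desc,
                                      const struct cpumask *mask)
{
        struct irq_data *data = irq_desc_get_irq_data(desc);
        struct irq_chip *chip = irq_data_get_irq_chip(data);
        int ret;

        ret = chip->irq_set_affinity(data, mask, true);
        switch (ret) {
        case IRQ_SET_MASK_OK:
        case IRQ_SET_MASK_OK_DONE:
                /* Keep the affinity mask in irq_desc up to date */
                cpumask_copy(desc->irq_common_data.affinity, mask);
                ret = 0;
                break;
        }
        return ret;
}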

Thanks,

tglx






Re: [PATCH 3/8] genirq/affinity: factor out a irq_affinity_set helper

2017-06-16 Thread Thomas Gleixner
On Fri, 16 Jun 2017, Thomas Gleixner wrote:
> On Sat, 3 Jun 2017, Christoph Hellwig wrote:
> > +bool irq_affinity_set(int irq, struct irq_desc *desc, const cpumask_t 
> > *mask)
> > +{
> > +   struct irq_data *data = irq_desc_get_irq_data(desc);
> > +   struct irq_chip *chip = irq_data_get_irq_chip(data);
> > +   bool ret = false;
> > +
> > +   if (!irq_can_move_pcntxt(data) && chip->irq_mask)
> > +   chip->irq_mask(data);
> > +
> > +   if (chip->irq_set_affinity) {
> > +   if (chip->irq_set_affinity(data, mask, true) == -ENOSPC)
> > +   pr_crit("IRQ %d set affinity failed because there are 
> > no available vectors.  The device assigned to this IRQ is unstable.\n", 
> > irq);
> > +   ret = true;
> > +   }
> > +
> > +   /*
> > +* We unmask if the irq was not marked masked by the core code.
> > +* That respects the lazy irq disable behaviour.
> > +*/
> > +   if (!irq_can_move_pcntxt(data) &&
> > +   !irqd_irq_masked(data) && chip->irq_unmask)
> > +   chip->irq_unmask(data);
> 
> There is another issue with this. Nothing updates the affinity mask in
> irq_desc, when we just invoke the chip callback. Let me have a look.

Indeed. So that magic you do in the next patches (the hotplug callbacks)
only works properly for affinity masks with a single cpu set.

The problem is that we don't have a distinction between the 'possible'
(e.g. set by /proc/irq/affinity) and the effective affinity mask.

Needs more thought.

Thanks,

tglx


Re: [PATCH 3/8] genirq/affinity: factor out a irq_affinity_set helper

2017-06-17 Thread Thomas Gleixner
On Sat, 3 Jun 2017, Christoph Hellwig wrote:
> +
> +bool irq_affinity_set(int irq, struct irq_desc *desc, const cpumask_t *mask)
> +{
> + struct irq_data *data = irq_desc_get_irq_data(desc);
> + struct irq_chip *chip = irq_data_get_irq_chip(data);
> + bool ret = false;
> +
> + if (!irq_can_move_pcntxt(data) && chip->irq_mask)
> + chip->irq_mask(data);
> +
> + if (chip->irq_set_affinity) {
> + if (chip->irq_set_affinity(data, mask, true) == -ENOSPC)
> + pr_crit("IRQ %d set affinity failed because there are 
> no available vectors.  The device assigned to this IRQ is unstable.\n", irq);
> + ret = true;
> + }
> +
> + /*
> +  * We unmask if the irq was not marked masked by the core code.
> +  * That respects the lazy irq disable behaviour.
> +  */
> + if (!irq_can_move_pcntxt(data) &&
> + !irqd_irq_masked(data) && chip->irq_unmask)
> + chip->irq_unmask(data);
> +
> + return ret;
> +}

That needs even more care as this does not include the handling for move in
progress, which will make your affinity setting fail.

Ideally we include that managed shutdown magic into fixup_irqs(), but
that's arch specific. So that needs the long dragged out update to the
generic irq cpuhotplug code to handle the x86'isms proper.

Thanks,

tglx


Re: [PATCH 1/8] genirq: allow assigning affinity to present but not online CPUs

2017-06-17 Thread Thomas Gleixner
On Sat, 3 Jun 2017, Christoph Hellwig wrote:

> This will allow us to spread MSI/MSI-X affinity over all present CPUs and
> thus better deal with systems where cpus are take on and offline all the
> time.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  kernel/irq/manage.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 070be980c37a..5c25d4a5dc46 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -361,17 +361,17 @@ static int setup_affinity(struct irq_desc *desc, struct 
> cpumask *mask)
>   if (irqd_affinity_is_managed(&desc->irq_data) ||
>   irqd_has_set(&desc->irq_data, IRQD_AFFINITY_SET)) {
>   if (cpumask_intersects(desc->irq_common_data.affinity,
> -cpu_online_mask))
> +cpu_present_mask))
>   set = desc->irq_common_data.affinity;
>   else
>   irqd_clear(&desc->irq_data, IRQD_AFFINITY_SET);
>   }
>  
> - cpumask_and(mask, cpu_online_mask, set);
> + cpumask_and(mask, cpu_present_mask, set);
>   if (node != NUMA_NO_NODE) {
>   const struct cpumask *nodemask = cpumask_of_node(node);
>  
> - /* make sure at least one of the cpus in nodemask is online */
> + /* make sure at least one of the cpus in nodemask is present */
>   if (cpumask_intersects(mask, nodemask))
>   cpumask_and(mask, mask, nodemask);
>   }

This is a dangerous one. It might break existing setups subtly. Assume the
AFFINITY_SET flag is set, then this tries to preserve the user supplied
affinity mask. So that might end up with some random mask which does not
contain any online CPU. Not what we want.

We really need to separate the handling of the managed interrupts from the
regular ones. Otherwise we end up with hard to debug issues. Cramming stuff
into the existing code does not solve the problem, it creates new ones.

Thanks,

tglx







[patch 55/55] genirq/affinity: Assign vectors to all present CPUs

2017-06-19 Thread Thomas Gleixner
From: Christoph Hellwig 

Currently the irq vector spread algorithm is restricted to online CPUs,
which ties the IRQ mapping to the currently online devices and doesn't deal
nicely with the fact that CPUs could come and go rapidly due to e.g. power
management.

Instead assign vectors to all present CPUs to avoid this churn.

Build a map of all possible CPUs for a given node, as the architectures
only provide a map of all online CPUs. Do this dynamically on each call
for the vector assignments, which is a bit suboptimal and could be
optimized in the future by providing a mapping from the arch code.

Signed-off-by: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: Jens Axboe 
Cc: Keith Busch 
Cc: linux-block@vger.kernel.org
Cc: linux-n...@lists.infradead.org
Link: http://lkml.kernel.org/r/20170603140403.27379-5-...@lst.de
Signed-off-by: Thomas Gleixner 

---
 kernel/irq/affinity.c |   76 +-
 1 file changed, 63 insertions(+), 13 deletions(-)

--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -1,4 +1,7 @@
-
+/*
+ * Copyright (C) 2016 Thomas Gleixner.
+ * Copyright (C) 2016-2017 Christoph Hellwig.
+ */
 #include 
 #include 
 #include 
@@ -35,13 +38,54 @@ static void irq_spread_init_one(struct c
}
 }
 
-static int get_nodes_in_cpumask(const struct cpumask *mask, nodemask_t 
*nodemsk)
+static cpumask_var_t *alloc_node_to_present_cpumask(void)
+{
+   cpumask_var_t *masks;
+   int node;
+
+   masks = kcalloc(nr_node_ids, sizeof(cpumask_var_t), GFP_KERNEL);
+   if (!masks)
+   return NULL;
+
+   for (node = 0; node < nr_node_ids; node++) {
+   if (!zalloc_cpumask_var(&masks[node], GFP_KERNEL))
+   goto out_unwind;
+   }
+
+   return masks;
+
+out_unwind:
+   while (--node >= 0)
+   free_cpumask_var(masks[node]);
+   kfree(masks);
+   return NULL;
+}
+
+static void free_node_to_present_cpumask(cpumask_var_t *masks)
+{
+   int node;
+
+   for (node = 0; node < nr_node_ids; node++)
+   free_cpumask_var(masks[node]);
+   kfree(masks);
+}
+
+static void build_node_to_present_cpumask(cpumask_var_t *masks)
+{
+   int cpu;
+
+   for_each_present_cpu(cpu)
+   cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
+}
+
+static int get_nodes_in_cpumask(cpumask_var_t *node_to_present_cpumask,
+   const struct cpumask *mask, nodemask_t *nodemsk)
 {
int n, nodes = 0;
 
/* Calculate the number of nodes in the supplied affinity mask */
-   for_each_online_node(n) {
-   if (cpumask_intersects(mask, cpumask_of_node(n))) {
+   for_each_node(n) {
+   if (cpumask_intersects(mask, node_to_present_cpumask[n])) {
node_set(n, *nodemsk);
nodes++;
}
@@ -64,7 +108,7 @@ irq_create_affinity_masks(int nvecs, con
int last_affv = affv + affd->pre_vectors;
nodemask_t nodemsk = NODE_MASK_NONE;
struct cpumask *masks;
-   cpumask_var_t nmsk;
+   cpumask_var_t nmsk, *node_to_present_cpumask;
 
if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
return NULL;
@@ -73,13 +117,19 @@ irq_create_affinity_masks(int nvecs, con
if (!masks)
goto out;
 
+   node_to_present_cpumask = alloc_node_to_present_cpumask();
+   if (!node_to_present_cpumask)
+   goto out;
+
/* Fill out vectors at the beginning that don't need affinity */
for (curvec = 0; curvec < affd->pre_vectors; curvec++)
cpumask_copy(masks + curvec, irq_default_affinity);
 
/* Stabilize the cpumasks */
get_online_cpus();
-   nodes = get_nodes_in_cpumask(cpu_online_mask, &nodemsk);
+   build_node_to_present_cpumask(node_to_present_cpumask);
+   nodes = get_nodes_in_cpumask(node_to_present_cpumask, cpu_present_mask,
+&nodemsk);
 
/*
 * If the number of nodes in the mask is greater than or equal the
@@ -87,7 +137,8 @@ irq_create_affinity_masks(int nvecs, con
 */
if (affv <= nodes) {
for_each_node_mask(n, nodemsk) {
-   cpumask_copy(masks + curvec, cpumask_of_node(n));
+   cpumask_copy(masks + curvec,
+node_to_present_cpumask[n]);
if (++curvec == last_affv)
break;
}
@@ -101,7 +152,7 @@ irq_create_affinity_masks(int nvecs, con
vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
 
/* Get the cpus on this node which are in the mask */
-   cpumask_and(nmsk, cpu_online_mask, cpumask_of_node(n));
+   cpumask_and(nmsk, cpu_present_mask, node_to_present_cpumask[n]);
 

[patch 1/9] block: Cleanup license notice

2019-01-17 Thread Thomas Gleixner
Remove the imprecise and sloppy:

  "This files is licensed under the GPL."

license notice in the top level comment.

1) The file already contains a SPDX license identifier which clearly
   states that the license of the file is GPL V2 only

2) The notice resolves to GPL v1 or later for scanners, which is just
   contrary to the intent of SPDX identifiers to provide clear and
   unambiguous license information. Aside from that, the value add of this
   notice is below zero.

Fixes: 6a5ac9846508 ("block: Make struct request_queue smaller for 
CONFIG_BLK_DEV_ZONED=n")
Signed-off-by: Thomas Gleixner 
Cc: Bart Van Assche 
Cc: Damien Le Moal 
Cc: Matias Bjorling 
Cc: Christoph Hellwig 
Cc: Jens Axboe 
Cc: linux-block@vger.kernel.org
---

P.S.: This patch is part of a larger cleanup, but independent of other
  patches and is intended to be picked up by the maintainer directly.

 block/blk-mq-debugfs-zoned.c |2 --
 1 file changed, 2 deletions(-)

--- a/block/blk-mq-debugfs-zoned.c
+++ b/block/blk-mq-debugfs-zoned.c
@@ -1,8 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
  * Copyright (C) 2017 Western Digital Corporation or its affiliates.
- *
- * This file is released under the GPL.
  */
 
 #include 




Re: [PATCH] genirq/affinity: Assign default affinity to pre/post vectors

2019-01-22 Thread Thomas Gleixner
Chen,

On Fri, 18 Jan 2019, Huacai Chen wrote:
> > > I did not say that you removed all NULL returns. I said that this function
> > > can return NULL for other reasons and then the same situation will happen.
> > >
> > > If the masks pointer returned is NULL then the calling code or any
> > > subsequent usage needs to handle it properly. Yes, I understand that this
> > > change makes the warning go away for that particular case, but that's not
> > > making it any more correct.
> 
> Hi, Thomas,
> 
> I don't think "nvecs == affd->pre_vectors + affd->post_vectors" is an ERROR,
> so it should be different with "return NULL for other reasons" to the caller. 
> If
> the caller fallback from MSI-X to MSI, it is probably "nvecs=1,pre_vectors=1,
> post_vectors=0". The caller can work perfectly, if pre/post vectors are filled
> with the default affinity.

This is not about 'works'. This is about correctness. So again:

The semantics of that function are that it returns NULL on error. The
reason for this NULL return is entirely irrelevant for the moment.

  If the calling code or any subsequent code proceeds as if nothing
  happened and later complains about it being NULL, then that logic at the
  calling or subsequent code is broken.

  And just making one particular error case not return NULL does not make
  it less broken because the function still can return NULL. So that
  proposed 'fix' is sunshine programming at best.

Now for the change you are proposing. It's semantically wrong in the face
of multiqueue devices. You are trying to make exactly one particular corner
case "work" by some dubious definition of work:

 nvecs=1,pre_vectors=1,post_vectors=0

If pre + post != 1 then this still returns NULL and the same wreckage
happens again.

The point is that if there are not enough vectors to have at least one
queue vector aside from pre and post then the whole queue management logic
does not make any sense. This needs to be fixed elsewhere and not duct-taped
over in the core logic with the argument 'works for me'.

Thanks,

tglx




Re: Question on handling managed IRQs when hotplugging CPUs

2019-02-01 Thread Thomas Gleixner
On Fri, 1 Feb 2019, Hannes Reinecke wrote:
> Thing is, if we have _managed_ CPU hotplug (ie if the hardware provides some
> means of quiescing the CPU before hotplug) then the whole thing is trivial;
> disable SQ and wait for all outstanding commands to complete.
> Then trivially all requests are completed and the issue is resolved.
> Even with todays infrastructure.
> 
> And I'm not sure if we can handle surprise CPU hotplug at all, given all the
> possible race conditions.
> But then I might be wrong.

The kernel would completely fall apart when a CPU would vanish by surprise,
i.e. uncontrolled by the kernel. Then the SCSI driver exploding would be
the least of our problems.

Thanks,

tglx


Re: [PATCH 05/19] Add io_uring IO interface

2019-02-10 Thread Thomas Gleixner
On Sat, 9 Feb 2019, Jens Axboe wrote:
> +static void io_commit_cqring(struct io_ring_ctx *ctx)
> +{
> + struct io_cq_ring *ring = ctx->cq_ring;
> +
> + if (ctx->cached_cq_tail != READ_ONCE(ring->r.tail)) {
> + /* order cqe stores with ring update */

This lacks a reference to the matching rmb()

> + smp_wmb();
> + WRITE_ONCE(ring->r.tail, ctx->cached_cq_tail);
> + /* write side barrier of tail update, app has read side */

That's a bit meager. Can you please document the barriers which are paired
with user space barriers more elaborately?

> + smp_wmb();
> +
> + if (wq_has_sleeper(&ctx->cq_wait)) {
> + wake_up_interruptible(&ctx->cq_wait);
> + kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
> + }
> + }
> +}
> +
> +static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
> +{
> + struct io_cq_ring *ring = ctx->cq_ring;
> + unsigned tail;
> +
> + tail = ctx->cached_cq_tail;
> + smp_rmb();

Undocumented barrier

> + if (tail + 1 == READ_ONCE(ring->r.head))
> + return NULL;

> +static void io_commit_sqring(struct io_ring_ctx *ctx)
> +{
> + struct io_sq_ring *ring = ctx->sq_ring;
> +
> + if (ctx->cached_sq_head != READ_ONCE(ring->r.head)) {
> + WRITE_ONCE(ring->r.head, ctx->cached_sq_head);
> + /* write side barrier of head update, app has read side */

See above.

> + smp_wmb();
> + }
> +}


> +static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
> +{
> + struct io_sq_ring *ring = ctx->sq_ring;
> + unsigned head;
> +
> + /*
> +  * The cached sq head (or cq tail) serves two purposes:
> +  *
> +  * 1) allows us to batch the cost of updating the user visible
> +  *head updates.
> +  * 2) allows the kernel side to track the head on its own, even
> +  *though the application is the one updating it.
> +  */
> + head = ctx->cached_sq_head;
> + smp_rmb();

Undocumented barrier

> + if (head == READ_ONCE(ring->r.tail))
> + return false;
> +
> + head = READ_ONCE(ring->array[head & ctx->sq_mask]);
> + if (head < ctx->sq_entries) {
> + s->index = head;
> + s->sqe = &ctx->sq_sqes[head];
> + ctx->cached_sq_head++;
> + return true;
> + }
> +
> + /* drop invalid entries */
> + ctx->cached_sq_head++;
> + ring->dropped++;
> + smp_wmb();

Undocumented barrier

> + return false;
> +}
> +

> +static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> +   const sigset_t __user *sig, size_t sigsz)
> +{
> + struct io_cq_ring *ring = ctx->cq_ring;
> + sigset_t ksigmask, sigsaved;
> + DEFINE_WAIT(wait);
> + int ret = 0;
> +
> + smp_rmb();
> + if (READ_ONCE(ring->r.head) != READ_ONCE(ring->r.tail))
> + return 0;
> + if (!min_events)
> + return 0;
> +
> + if (sig) {
> + ret = set_user_sigmask(sig, &ksigmask, &sigsaved, sigsz);
> + if (ret)
> + return ret;
> + }
> +
> + do {
> + prepare_to_wait(&ctx->wait, &wait, TASK_INTERRUPTIBLE);
> +
> + ret = 0;
> + smp_rmb();

Undocumented barrier

> + if (READ_ONCE(ring->r.head) != READ_ONCE(ring->r.tail))
> + break;
> +

There are undocumented smp_wmb()'s in 'io_uring: Add submission polling' as
well. It's really hard to tell where the corresponding barriers are and
what they are supposed to order.

Especially the barriers which are paired with user space barriers need some
careful documentation. What are the side effects if user space is missing a
barrier? Just user space seeing inconsistent data or is there something
which goes the other way round and might cause havoc in the kernel?
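As an example of the level of detail I'm asking for, reusing the structures
from the patch and trimmed to the barrier in question (sketch only, the
wording of the comment is what matters, not the code):

static void io_commit_cqring(struct io_ring_ctx *ctx)
{
        struct io_cq_ring *ring = ctx->cq_ring;

        if (ctx->cached_cq_tail != READ_ONCE(ring->r.tail)) {
                /*
                 * Order the CQE stores with the tail store. Pairs with
                 * the read barrier which user space has to issue between
                 * loading ring->r.tail and loading the CQEs, otherwise it
                 * can observe the new tail but stale CQE contents.
                 */
                smp_wmb();
                WRITE_ONCE(ring->r.tail, ctx->cached_cq_tail);
        }
}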

Thanks,

tglx


Re: [PATCH 2/5] genirq/affinity: allow driver to setup managed IRQ's affinity

2019-02-10 Thread Thomas Gleixner
Ming,

On Fri, 25 Jan 2019, Ming Lei wrote:

> This patch introduces callback of .setup_affinity into 'struct
> irq_affinity', so that:

Please see Documentation/process/submitting-patches.rst. Search for 'This
patch' 

> 
> 1) allow drivers to customize the affinity for managed IRQ, for
> example, now NVMe has special requirement for read queues & poll
> queues

That's nothing new and already handled today.

> 2) 6da4b3ab9a6e9 ("genirq/affinity: Add support for allocating interrupt 
> sets")
> makes pci_alloc_irq_vectors_affinity() a bit difficult to use for
> allocating interrupt sets: 'max_vecs' is required to same with 'min_vecs'.

So it's a bit difficult, but you fail to explain why it's not sufficient.

> With this patch, driver can implement their own .setup_affinity to
> customize the affinity, then the above thing can be solved easily.

Well, I don't really understand what is solved easily and you are merely
describing the fact that the new callback allows drivers to customize
something. What's the rationale? If it's just the 'bit difficult' part,
then what is the reason for not making the core functionality easier to use
instead of moving stuff into driver space again?

NVME is not special and all this achieves is that all driver writers will
claim that their device is special and needs its own affinity setter
routine. The whole point of having the generic code is to exactly avoid
that. If it has shortcomings, then they need to be addressed, but not
worked around with random driver callbacks.

Thanks,

tglx


Re: [PATCH 4/5] nvme-pci: simplify nvme_setup_irqs() via .setup_affinity callback

2019-02-10 Thread Thomas Gleixner
On Fri, 25 Jan 2019, Ming Lei wrote:

> Use the callback of .setup_affinity() to re-caculate number
> of queues, and build irqs affinity with help of irq_build_affinity().
> 
> Then nvme_setup_irqs() gets simplified a lot.

I'm pretty sure you can achieve the same by reworking the core code without
that callback.
 
> + /* Fill out vectors at the beginning that don't need affinity */
> + for (curvec = 0; curvec < affd->pre_vectors; curvec++)
> + cpumask_copy(&masks[curvec].mask, cpu_possible_mask);

cpu_possible_mask is wrong.  Why are you deliberately trying to make this
special? There is absolutely no reason to do so.

These interrupts are not managed and therefore the initial affinity has to
be irq_default_affinity. Setting them to cpu_possible_mask can and probably
will evade a well thought out default affinity mask, which was set to
isolate a set of cores from general purpose interrupts.

This is exactly the thing which happens with driver special stuff and which
needs to be avoided. There is nothing special about this NVME setup and
yes, I can see why the current core code is a bit tedious to work with, but
that does not justify that extra driver magic by any means.

Thanks,

tglx



Re: [PATCH 4/5] nvme-pci: simplify nvme_setup_irqs() via .setup_affinity callback

2019-02-10 Thread Thomas Gleixner
On Fri, 25 Jan 2019, Ming Lei wrote:
> +static int nvme_setup_affinity(const struct irq_affinity *affd,
> +struct irq_affinity_desc *masks,
> +unsigned int nmasks)
> +{
> + struct nvme_dev *dev = affd->priv;
> + int affvecs = nmasks - affd->pre_vectors - affd->post_vectors;
> + int curvec, usedvecs;
> + int i;
> +
> + nvme_calc_io_queues(dev, nmasks);

So this is the only NVME specific information. Everything else can be done
in generic code. So what you really want is:

struct affd {
...
+   calc_sets(struct affd *, unsigned int nvecs);
...
}

And sets want to be actually inside of the affinity descriptor structure:

unsigned int num_sets;
unsigned int set_vectors[MAX_SETS];

We surely can define a sensible maximum of sets for now. If that ever turns
out to be insufficient, then struct affd might become too large for the
stack, but for now, using e.g. 8, there is no need to do so.

So then the logic in the generic code becomes exactly the same as what you
added to nvme_setup_affinity():

if (affd->calc_sets) {
affd->calc_sets(affd, nvecs);
} else if (!affd->num_sets) {
affd->num_sets = 1;
affd->set_vectors[0] = affvecs;
}

for (i = 0; i < affd->num_sets; i++) {

}

See?
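And the loop body then just consumes the sets one by one. Sketch only;
spread_one_set() is a placeholder for the per-set spreading code which
already exists in irq_create_affinity_masks():

        for (i = 0, curvec = affd->pre_vectors; i < affd->num_sets; i++) {
                unsigned int this_vecs = affd->set_vectors[i];

                /* Spread the interrupts of set i over the CPUs */
                spread_one_set(affd, masks + curvec, this_vecs);
                curvec += this_vecs;
        }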

Thanks,

tglx


Re: [PATCH 2/5] genirq/affinity: allow driver to setup managed IRQ's affinity

2019-02-11 Thread Thomas Gleixner
Ming,

On Mon, 11 Feb 2019, Bjorn Helgaas wrote:

> On Mon, Feb 11, 2019 at 11:54:00AM +0800, Ming Lei wrote:
> > On Sun, Feb 10, 2019 at 05:30:41PM +0100, Thomas Gleixner wrote:
> > > On Fri, 25 Jan 2019, Ming Lei wrote:
> > > 
> > > > This patch introduces callback of .setup_affinity into 'struct
> > > > irq_affinity', so that:
> > > 
> > > Please see Documentation/process/submitting-patches.rst. Search for 'This
> > > patch' 
> > 
> > Sorry for that, because I am not a native English speaker and it looks a bit
> > difficult for me to understand the subtle difference.

Sorry I was a bit terse.

> I think Thomas is saying that instead of "This patch introduces
> callback ...", you could say "Introduce callback of ...".
> 
> The changelog is *part* of the patch, so the context is obvious and
> there's no need to include the words "This patch".
> 
> I make the same changes to patches I receive.  In fact, I would go
> even further and say "Add callback .setup_affinity() ..." because "add"
> means the same as "introduce" but is shorter and simpler.

Thanks for the explanation, Bjorn!

There is another point here. It's not only the 'This patch introduces ...'
part. It's also good practice to structure the changelog so it provides
context and reasoning first and then tells what the change is, e.g.:

   The current handling of multiple interrupt sets in the core interrupt
   affinity logic, requires the driver to do ...  This is necessary
   because 

   This handling should be in the core code, but the core implementation
   has no way to recompute the interrupt sets for a given number of
   vectors.

   Add an optional callback to struct affd, which lets the driver recompute
   the interrupt set before the interrupt affinity spreading takes place.

The first paragraph provides context, i.e. the status quo. The second
paragraph provides the reasoning about what is missing and the last one
tells what's done to solve it.

That's pretty much the same for all changelogs, even if you fix a bug. Just
in the bug case, the second paragraph describes the details of the bug and
the possible consequences.

You really need to look at the changelog as a standalone information
source. That's important when you look at a commit as an outsider or even
if you look at your own patch 6 months down the road when you have already
paged out all the details.

Hope that clarifies it.

Thanks,

tglx


Re: [PATCH V2 1/4] genirq/affinity: store irq set vectors in 'struct irq_affinity'

2019-02-12 Thread Thomas Gleixner
On Tue, 12 Feb 2019, Ming Lei wrote:

> Currently the array of irq set vectors is provided by driver.
> 
> irq_create_affinity_masks() can be simplied a bit by treating the
> non-irq-set case as single irq set.
> 
> So move this array into 'struct irq_affinity', and pre-define the max
> set number as 4, which should be enough for normal cases.

You really want to add some sanity check which makes sure that nr_sets is
<= IRQ_MAX_SETS.
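Something as simple as this at the top of irq_create_affinity_masks() would
do (sketch, keeping the IRQ_MAX_SETS name used in this series):

        if (WARN_ON_ONCE(affd->nr_sets > IRQ_MAX_SETS))
                return NULL;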

Thanks,

tglx


Re: [PATCH V2 2/4] genirq/affinity: add new callback for caculating set vectors

2019-02-12 Thread Thomas Gleixner
On Tue, 12 Feb 2019, Ming Lei wrote:

> Currently pre-caculated set vectors are provided by driver for
> allocating & spread vectors. This way only works when drivers passes
> same 'max_vecs' and 'min_vecs' to pci_alloc_irq_vectors_affinity(),
> also requires driver to retry the allocating & spread.
> 
> As Bjorn and Keith mentioned, the current usage & interface for irq sets
> is a bit awkward because the retrying should have been avoided by providing
> one resonable 'min_vecs'. However, if 'min_vecs' isn't same with
> 'max_vecs', number of the allocated vectors is unknown before calling
> pci_alloc_irq_vectors_affinity(), then each set's vectors can't be
> pre-caculated.
> 
> Add a new callback of .calc_sets into 'struct irq_affinity' so that
> driver can caculate set vectors after IRQ vector is allocated and
> before spread IRQ vectors. Add 'priv' so that driver may retrieve
> its private data via the 'struct irq_affinity'.

Nice changelog!

>* Spread on present CPUs starting from affd->pre_vectors. If we
>* have multiple sets, build each sets affinity mask separately.
>*/
> - nr_sets = affd->nr_sets;
> - if (!nr_sets) {
> + if (affd->calc_sets) {
> + affd->calc_sets(affd, nvecs);
> + nr_sets = affd->nr_sets;
> + } else if (!affd->nr_sets) {
>   nr_sets = 1;
>   affd->set_vectors[0] = affvecs;
> - }
> + } else
> + nr_sets = affd->nr_sets;

Lacks curly brackets.

Aside from that, it might make sense to get rid of the local variable, but
that should be done in patch 1 if at all. No strong opinion though.
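For reference, the quoted hunk with balanced brackets (and the local
variable kept as-is) would read:

        if (affd->calc_sets) {
                affd->calc_sets(affd, nvecs);
                nr_sets = affd->nr_sets;
        } else if (!affd->nr_sets) {
                nr_sets = 1;
                affd->set_vectors[0] = affvecs;
        } else {
                nr_sets = affd->nr_sets;
        }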

Thanks,

tglx


Re: [PATCH V2 0/4] genirq/affinity: add .calc_sets for improving IRQ allocation & spread

2019-02-12 Thread Thomas Gleixner
On Tue, 12 Feb 2019, Ming Lei wrote:

> Hi,
> 
> Currently pre-caculated set vectors are provided by driver for
> allocating & spread vectors. This way only works when drivers passes
> same 'max_vecs' and 'min_vecs' to pci_alloc_irq_vectors_affinity(),
> also requires driver to retry the allocating & spread.
> 
> As Bjorn and Keith mentioned, the current usage & interface for irq sets
> is a bit awkward because the retrying should have been avoided by providing
> one resonable 'min_vecs'. However, if 'min_vecs' isn't same with
> 'max_vecs', number of the allocated vectors is unknown before calling
> pci_alloc_irq_vectors_affinity(), then each set's vectors can't be
> pre-caculated.
> 
> Add a new callback of .calc_sets into 'struct irq_affinity' so that
> driver can caculate set vectors after IRQ vector is allocated and
> before spread IRQ vectors. Add 'priv' so that driver may retrieve
> its private data via the 'struct irq_affinity'.
> 
> 
> V2:
>   - add .calc_sets instead of .setup_affinity() which is easy to
>   be abused by drivers

This looks really well done. If you can address the minor nitpicks, then
this is good to go, unless someone has objections.

Thanks,

tglx


Re: [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const

2019-02-13 Thread Thomas Gleixner
On Wed, 13 Feb 2019, Bjorn Helgaas wrote:

> On Wed, Feb 13, 2019 at 06:50:37PM +0800, Ming Lei wrote:
> > Currently all parameters in 'affd' are read-only, so 'affd' is marked
> > as const in both pci_alloc_irq_vectors_affinity() and 
> > irq_create_affinity_masks().
> 
> s/all parameters in 'affd'/the contents of '*affd'/
> 
> > We have to ask driver to re-caculate set vectors after the whole IRQ
> > vectors are allocated later, and the result needs to be stored in 'affd'.
> > Also both the two interfaces are core APIs, which should be trusted.
> 
> s/re-caculate/recalculate/
> s/stored in 'affd'/stored in '*affd'/
> s/both the two/both/
> 
> This is a little confusing because you're talking about both "IRQ
> vectors" and these other "set vectors", which I think are different
> things.  I assume the "set vectors" are cpumasks showing the affinity
> of the IRQ vectors with some CPUs?

I think we should drop the whole vector wording completely.

The driver does not care about vectors, it only cares about a block of
interrupt numbers. These numbers are kernel managed and the interrupts just
happen to have a CPU vector assigned at some point. Depending on the CPU
architecture the underlying mechanism might not even be named vector.

> AFAICT, *this* patch doesn't add anything that writes to *affd.  I
> think the removal of "const" should be in the same patch that makes
> the removal necessary.

So this should be:

   The interrupt affinity spreading mechanism supports spreading out
   affinities for one or more interrupt sets. An interrupt set contains one
   or more interrupts. Each set is mapped to a specific functionality of a
   device, e.g. general I/O queues and read I/O queues of multiqueue block
   devices.

   The number of interrupts per set is defined by the driver. It depends on
   the total number of available interrupts for the device, which is
   determined by the PCI capabilities and the availability of underlying CPU
   resources, and the number of queues which the device provides and the
   driver wants to instantiate.

   The driver passes initial configuration for the interrupt allocation via
   a pointer to struct affinity_desc.

   Right now the allocation mechanism is complex as it requires a loop in
   the driver to determine the maximum number of interrupts which
   are provided by the PCI capabilities and the underlying CPU resources.
   This loop would have to be replicated in every driver which wants to
   utilize this mechanism. That's unwanted code duplication and error
   prone.

   In order to move this into generic facilities it is required to have a
   mechanism, which allows the recalculation of the interrupt sets and
   their size, in the core code. As the core code does not have any
   knowledge about the underlying device, a driver specific callback will
   be added to struct affinity_desc, which will be invoked by the core
   code. The callback will get the number of available interrupts as an
   argument, so the driver can calculate the corresponding number and size
   of interrupt sets.

   To support this, two modifications for the handling of struct
   affinity_desc are required:

   1) The (optional) interrupt sets size information is contained in a
  separate array of integers and struct affinity_desc contains a
  pointer to it.

  This is cumbersome and as the maximum number of interrupt sets is
  small, there is no reason to have separate storage. Moving the size
  array into struct affinity_desc avoids indirections and makes the code
  simpler.

   2) At the moment the struct affinity_desc pointer which is handed in from
  the driver and passed through to several core functions is marked
  'const'.

  With the upcoming callback to recalculate the number and size of
  interrupt sets, it's necessary to remove the 'const'
  qualifier. Otherwise the callback would not be able to update the
  data.

   Move the set size array into struct affinity_desc as a first preparatory
   step. The removal of the 'const' qualifier will be done when adding the
   callback.

IOW, the first patch moves the set array into the struct itself.

The second patch introduces the callback and removes the 'const'
qualifier. I wouldn't mind to have the same changelog duplicated (+/- the
last two paragraphs which need some update of course).

Thanks,

tglx


Re: [PATCH V3 3/5] genirq/affinity: add new callback for caculating set vectors

2019-02-13 Thread Thomas Gleixner
On Wed, 13 Feb 2019, Bjorn Helgaas wrote:
> > Add a new callback of .calc_sets into 'struct irq_affinity' so that
> > driver can caculate set vectors after IRQ vector is allocated and
> > before spread IRQ vectors. Add 'priv' so that driver may retrieve
> > its private data via the 'struct irq_affinity'.
> 
> s/caculate/calculate/
> 
> I *think* "set vectors" here has something to do with an affinity
> cpumask?  "Set vectors" doesn't seem like quite the right terminology.

See previous mail. Let's stay with 'interrupt sets' and get rid of the
vector naming completely.

Thanks,

tglx


Re: [PATCH V3 4/5] nvme-pci: avoid irq allocation retrying via .calc_sets

2019-02-13 Thread Thomas Gleixner
On Wed, 13 Feb 2019, Bjorn Helgaas wrote:

> On Wed, Feb 13, 2019 at 06:50:40PM +0800, Ming Lei wrote:
> > Currently pre-caculate each set vectors, and this way requires same
> > 'max_vecs' and 'min_vecs' passed to pci_alloc_irq_vectors_affinity(),
> > then nvme_setup_irqs() has to retry in case of allocation failure.
> 
> s/pre-caculate/precalculate/
> My usual "set vectors" question as on other patches.
> 
> > This usage & interface is a bit awkward because the retry should have
> > been avoided by providing one reasonable 'min_vecs'.
> > 
> > Implement the callback of .calc_sets, so that 
> > pci_alloc_irq_vectors_affinity()
> > can calculate each set's vector after IRQ vectors is allocated and
> > before spread IRQ, then NVMe's retry in case of irq allocation failure
> > can be removed.
> 
> s/irq/IRQ/

Let me rephrase that thing as well

  Subject: nvme-pci: Simplify interrupt allocation

  The NVME PCI driver contains a tedious mechanism for interrupt
  allocation, which is necessary to adjust the number and size of interrupt
  sets to the maximum available number of interrupts which depends on the
  underlying PCI capabilities and the available CPU resources.

  It works around the former shortcomings of the PCI and core interrupt
  allocation mechanisms in combination with interrupt sets.

  The PCI interrupt allocation function allows the caller to provide a maximum and a
  minimum number of interrupts to be allocated and tries to allocate as
  many as possible. This worked without driver interaction as long as there
  was only a single set of interrupts to handle.

  With the addition of support for multiple interrupt sets in the generic
  affinity spreading logic, which is invoked from the PCI interrupt
  allocation, the adaptive loop in the PCI interrupt allocation did not
  work for multiple interrupt sets. The reason is that depending on the
  total number of interrupts which the PCI allocation adaptive loop tries
  to allocate in each step, the number and the size of the interrupt sets
  need to be adapted as well. Due to the way the interrupt sets support was
  implemented there was no way for the PCI interrupt allocation code or the
  core affinity spreading mechanism to invoke a driver specific function
  for adapting the interrupt sets configuration.

  As a consequence the driver had to implement another adaptive loop around
  the PCI interrupt allocation function and calling that with maximum and
  minimum interrupts set to the same value. This ensured that the
  allocation either succeeded or immediately failed without any attempt to
  adjust the number of interrupts in the PCI code.

  The core code now allows drivers to provide a callback to recalculate the
  number and the size of interrupt sets during PCI interrupt allocation,
  which in turn allows the PCI interrupt allocation function to be called
  in the same way as with a single set of interrupts. The PCI code handles
  the adaptive loop and the interrupt affinity spreading mechanism invokes
  the driver callback to adapt the interrupt set configuration to the
  current loop value. This replaces the adaptive loop in the driver
  completely.

  Implement the NVME specific callback which adjusts the interrupt sets
  configuration and remove the adaptive allocation loop.

Thanks,

tglx




Re: [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const

2019-02-13 Thread Thomas Gleixner
On Wed, 13 Feb 2019, Keith Busch wrote:
> On Wed, Feb 13, 2019 at 09:56:36PM +0100, Thomas Gleixner wrote:
> > On Wed, 13 Feb 2019, Bjorn Helgaas wrote:
> > > On Wed, Feb 13, 2019 at 06:50:37PM +0800, Ming Lei wrote:
> > > > We have to ask driver to re-caculate set vectors after the whole IRQ
> > > > vectors are allocated later, and the result needs to be stored in 
> > > > 'affd'.
> > > > Also both the two interfaces are core APIs, which should be trusted.
> > > 
> > > s/re-caculate/recalculate/
> > > s/stored in 'affd'/stored in '*affd'/
> > > s/both the two/both/
> > > 
> > > This is a little confusing because you're talking about both "IRQ
> > > vectors" and these other "set vectors", which I think are different
> > > things.  I assume the "set vectors" are cpumasks showing the affinity
> > > of the IRQ vectors with some CPUs?
> > 
> > I think we should drop the whole vector wording completely.
> > 
> > The driver does not care about vectors, it only cares about a block of
> > interrupt numbers. These numbers are kernel managed and the interrupts just
> > happen to have a CPU vector assigned at some point. Depending on the CPU
> > architecture the underlying mechanism might not even be named vector.
> 
> Perhaps longer term we could move affinity mask creation from the irq
> subsystem into a more generic library. Interrupts aren't the only
> resource that want to spread across CPUs. For example, blk-mq has its
> own implementation for polled queues, so I think a non-irq specific
> implementation would be a nice addition to the kernel lib.

Agreed. There is nothing interrupt specific in that code aside from some
name choices.

Btw, while I have your attention: an issue related to that affinity logic
popped up recently.

The current implementation fails when:

/*
 * If there aren't any vectors left after applying the pre/post
 * vectors don't bother with assigning affinity.
 */
if (nvecs == affd->pre_vectors + affd->post_vectors)
return NULL;

Now the discussion arose that in that case the affinity sets are not
allocated and filled in for the pre/post vectors, but somehow the
underlying device still works and later on triggers the warning in the
blk-mq code because the MSI entries do not have affinity information
attached.

Sure, we could make that work, but there are several issues:

1) irq_create_affinity_masks() has another reason to return NULL:
   memory allocation fails.

2) Does it make sense at all?

Right now the PCI allocator ignores the NULL return and proceeds without
setting any affinities. As a consequence nothing is managed and everything
happens to work.

But that it happens to work is more by chance than by design, and the warning
is bogus if this is an expected mode of operation.

We should address these points in some way.

Thanks,

tglx





Re: [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const

2019-02-14 Thread Thomas Gleixner
On Wed, 13 Feb 2019, Keith Busch wrote:

Cc+ Huacai Chen

> On Wed, Feb 13, 2019 at 10:41:55PM +0100, Thomas Gleixner wrote:
> > Btw, while I have your attention: an issue related to that affinity logic
> > popped up recently.
> > 
> > The current implementation fails when:
> > 
> > /*
> >  * If there aren't any vectors left after applying the pre/post
> >  * vectors don't bother with assigning affinity.
> >  */
> > if (nvecs == affd->pre_vectors + affd->post_vectors)
> > return NULL;
> > 
> > Now the discussion arose that in that case the affinity sets are not
> > allocated and filled in for the pre/post vectors, but somehow the
> > underlying device still works and later on triggers the warning in the
> > blk-mq code because the MSI entries do not have affinity information
> > attached.
> >
> > Sure, we could make that work, but there are several issues:
> > 
> > 1) irq_create_affinity_masks() has another reason to return NULL:
> >memory allocation fails.
> > 
> > 2) Does it make sense at all?
> > 
> > Right now the PCI allocator ignores the NULL return and proceeds without
> > setting any affinities. As a consequence nothing is managed and everything
> > happens to work.
> > 
> > But that it happens to work is more by chance than by design, and the warning
> > is bogus if this is an expected mode of operation.
> > 
> > We should address these points in some way.
> 
> Ah, yes, that's a mistake in the nvme driver. It is assuming IO queues are
> always on managed interrupts, but that's not true when only 1 vector
> could be allocated. This should be an appropriate fix to the warning:

Looks correct. Chen, can you please test that?

> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 022ea1ee63f8..f2ccebe1c926 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -506,7 +506,7 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
>* affinity), so use the regular blk-mq cpu mapping
>*/
>   map->queue_offset = qoff;
> - if (i != HCTX_TYPE_POLL)
> + if (i != HCTX_TYPE_POLL && dev->num_vecs > 1)
>   blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), 
> offset);
>   else
>   blk_mq_map_queues(map);
> --
> 


Re: [PATCH V4 3/4] nvme-pci: Simplify interrupt allocation

2019-02-14 Thread Thomas Gleixner
On Thu, 14 Feb 2019, Ming Lei wrote:
> +static void nvme_calc_irq_sets(struct irq_affinity *affd, int nvecs)
> +{
> + struct nvme_dev *dev = affd->priv;
> +
> + nvme_calc_io_queues(dev, nvecs);
> +
> + affd->set_size[HCTX_TYPE_DEFAULT] = dev->io_queues[HCTX_TYPE_DEFAULT];
> + affd->set_size[HCTX_TYPE_READ] = dev->io_queues[HCTX_TYPE_READ];
> + affd->nr_sets = 2;

That's not really correct.

> - nvme_calc_io_queues(dev, irq_queues);
> - irq_sets[0] = dev->io_queues[HCTX_TYPE_DEFAULT];
> - irq_sets[1] = dev->io_queues[HCTX_TYPE_READ];
> - if (!irq_sets[1])
> - affd.nr_sets = 1;

---^^

That happens when the number of interrupts becomes too small to have split
queues. I'll fix that up.
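
Presumably something along the lines of

	affd->nr_sets = affd->set_size[HCTX_TYPE_READ] ? 2 : 1;

at the end of the callback, so that a vanishing read set collapses the
configuration back to a single set.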

Thanks,

tglx




Re: [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const

2019-02-14 Thread Thomas Gleixner
On Thu, 14 Feb 2019, 陈华才 (Huacai Chen) wrote:

Please do not top post

> I'll test next week, but 4.19 has the same problem, how to fix that for 4.19?

By applying the very same patch perhaps?

Thanks,

tglx

Re: [PATCH V4 1/4] genirq/affinity: store interrupt sets size in 'struct irq_affinity'

2019-02-14 Thread Thomas Gleixner
On Thu, 14 Feb 2019, Ming Lei wrote:
>  /**
>   * struct irq_affinity - Description for automatic irq affinity assignements
>   * @pre_vectors: Don't apply affinity to @pre_vectors at beginning of
> @@ -266,13 +268,13 @@ struct irq_affinity_notify {
>   * @post_vectors:Don't apply affinity to @post_vectors at end of
>   *   the MSI(-X) vector space
>   * @nr_sets: Length of passed in *sets array
> - * @sets:Number of affinitized sets
> + * @set_size:Number of affinitized sets

Both nr_sets and set_size comments are wrong ...

>   nr_sets = affd->nr_sets;
> - if (!nr_sets)
> + if (!nr_sets) {
>   nr_sets = 1;
> + set_size[0] = affvecs;
> + } else {
> + memcpy(set_size, affd->set_size,
> + IRQ_AFFINITY_MAX_SETS * sizeof(int));

Uuurgh. No. This needs to be nr_sets * sizeof(int) otherwise you copy
beyond the size of the source. nr_sets is already verified to be less than
IRQ_AFFINITY_MAX_SETS.
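
I.e. something like:

	memcpy(set_size, affd->set_size, nr_sets * sizeof(int));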

Fixed it up.

Thanks,

tglx




Re: [PATCH V4 2/4] genirq/affinity: add new callback for calculating interrupt sets size

2019-02-14 Thread Thomas Gleixner
On Thu, 14 Feb 2019, Ming Lei wrote:
> + if (affd->calc_sets) {
> + affd->calc_sets(affd, nvecs);
> + } else if (!affd->nr_sets) {
> + affd->nr_sets = 1;
> + affd->set_size[0] = affvecs;

Hrmpf. I suggested that to you to get rid of the nr_sets local variable,
but that's actually broken. The reason is that on the first invocation from
the PCI code, which is usually with maxvecs, the size is stored, and if that
allocation fails, the subsequent invocation with maxvecs - 1 will not
update set_size[0] because affd->nr_sets == 1.

/me scratches head and stares at the code some more...

Thanks,

tglx




Re: [PATCH V4 1/4] genirq/affinity: store interrupt sets size in 'struct irq_affinity'

2019-02-14 Thread Thomas Gleixner
On Thu, 14 Feb 2019, Ming Lei wrote:
> The driver passes initial configuration for the interrupt allocation via
> a pointer to struct affinity_desc.

Btw, blindly copying a suggestion without proofreading is a bad idea. That
wants to be 'struct irq_affinity', obviously. A tired brain confused them
last night.

Thanks,

tglx


[patch V5 2/8] genirq/affinity: Store interrupt sets size in struct irq_affinity

2019-02-14 Thread Thomas Gleixner
From: Ming Lei 

The interrupt affinity spreading mechanism supports spreading out
affinities for one or more interrupt sets. An interrupt set contains one
or more interrupts. Each set is mapped to a specific functionality of a
device, e.g. general I/O queues and read I/O queues of multiqueue block
devices.

The number of interrupts per set is defined by the driver. It depends on
the total number of available interrupts for the device, which is
determined by the PCI capabilities and the availability of underlying CPU
resources, and the number of queues which the device provides and the
driver wants to instantiate.

The driver passes initial configuration for the interrupt allocation via
a pointer to struct irq_affinity.

Right now the allocation mechanism is complex as it requires a
loop in the driver to determine the maximum number of interrupts which
are provided by the PCI capabilities and the underlying CPU resources.
This loop would have to be replicated in every driver which wants to
utilize this mechanism. That's unwanted code duplication and error
prone.

In order to move this into generic facilities it is required to have a
mechanism, which allows the recalculation of the interrupt sets and
their size, in the core code. As the core code does not have any
knowledge about the underlying device, a driver specific callback will
be added to struct irq_affinity, which will be invoked by the core
code. The callback will get the number of available interrupts as an
argument, so the driver can calculate the corresponding number and size
of interrupt sets.

To support this, two modifications for the handling of struct irq_affinity
are required:

1) The (optional) interrupt sets size information is contained in a
   separate array of integers and struct irq_affinity contains a
   pointer to it.

   This is cumbersome and as the maximum number of interrupt sets is small,
   there is no reason to have separate storage. Moving the size array into
   struct irq_affinity avoids indirections and makes the code simpler.

2) At the moment the struct irq_affinity pointer which is handed in from
   the driver and passed through to several core functions is marked
   'const'.

   With the upcoming callback to recalculate the number and size of
   interrupt sets, it's necessary to remove the 'const'
   qualifier. Otherwise the callback would not be able to update the data.

Implement #1 and store the interrupt sets size in 'struct irq_affinity'.

No functional change.

[ tglx: Fixed the memcpy() size so it won't copy beyond the size of the
source. Fixed the kernel doc comments for struct irq_affinity and
de-'This patch'-ed the changelog ]

Signed-off-by: Ming Lei 
Signed-off-by: Thomas Gleixner 

---
 drivers/nvme/host/pci.c   |7 +++
 include/linux/interrupt.h |9 ++---
 kernel/irq/affinity.c |   16 
 3 files changed, 21 insertions(+), 11 deletions(-)

--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2081,12 +2081,11 @@ static void nvme_calc_io_queues(struct n
 static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
 {
struct pci_dev *pdev = to_pci_dev(dev->dev);
-   int irq_sets[2];
struct irq_affinity affd = {
-   .pre_vectors = 1,
-   .nr_sets = ARRAY_SIZE(irq_sets),
-   .sets = irq_sets,
+   .pre_vectors= 1,
+   .nr_sets= 2,
};
+   unsigned int *irq_sets = affd.set_size;
int result = 0;
unsigned int irq_queues, this_p_queues;
 
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -241,20 +241,23 @@ struct irq_affinity_notify {
void (*release)(struct kref *ref);
 };
 
+#define IRQ_AFFINITY_MAX_SETS	4
+
 /**
  * struct irq_affinity - Description for automatic irq affinity assignements
  * @pre_vectors:   Don't apply affinity to @pre_vectors at beginning of
  * the MSI(-X) vector space
  * @post_vectors:  Don't apply affinity to @post_vectors at end of
  * the MSI(-X) vector space
- * @nr_sets:   Length of passed in *sets array
- * @sets:  Number of affinitized sets
+ * @nr_sets:   The number of interrupt sets for which affinity
+ * spreading is required
+ * @set_size:  Array holding the size of each interrupt set
  */
 struct irq_affinity {
 unsigned int	pre_vectors;
 unsigned int	post_vectors;
 unsigned int	nr_sets;
-	unsigned int	*sets;
+	unsigned int	set_size[IRQ_AFFINITY_MAX_SETS];
 };
 
 /**
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -238,9 +238,10 @@ static int irq_build_affinity_masks(cons
  * Returns the irq_affinity_desc pointer or NULL if allocation failed.
  */
 struct irq_affinity_desc *
-irq_create_affinity_masks(unsigned int nvecs, 

[patch V5 4/8] nvme-pci: Simplify interrupt allocation

2019-02-14 Thread Thomas Gleixner
From: Ming Lei 

The NVME PCI driver contains a tedious mechanism for interrupt
allocation, which is necessary to adjust the number and size of interrupt
sets to the maximum available number of interrupts which depends on the
underlying PCI capabilities and the available CPU resources.

It works around the former shortcomings of the PCI and core interrupt
allocation mechanisms in combination with interrupt sets.

The PCI interrupt allocation function allows to provide a maximum and a
minimum number of interrupts to be allocated and tries to allocate as
many as possible. This worked without driver interaction as long as there
was only a single set of interrupts to handle.

With the addition of support for multiple interrupt sets in the generic
affinity spreading logic, which is invoked from the PCI interrupt
allocation, the adaptive loop in the PCI interrupt allocation did not
work for multiple interrupt sets. The reason is that depending on the
total number of interrupts which the PCI allocation adaptive loop tries
to allocate in each step, the number and the size of the interrupt sets
need to be adapted as well. Due to the way the interrupt sets support was
implemented there was no way for the PCI interrupt allocation code or the
core affinity spreading mechanism to invoke a driver specific function
for adapting the interrupt sets configuration.

As a consequence the driver had to implement another adaptive loop around
the PCI interrupt allocation function, calling it with maximum and
minimum interrupts set to the same value. This ensured that the
allocation either succeeded or immediately failed without any attempt to
adjust the number of interrupts in the PCI code.

The core code now allows drivers to provide a callback to recalculate the
number and the size of interrupt sets during PCI interrupt allocation,
which in turn allows the PCI interrupt allocation function to be called
in the same way as with a single set of interrupts. The PCI code handles
the adaptive loop and the interrupt affinity spreading mechanism invokes
the driver callback to adapt the interrupt set configuration to the
current loop value. This replaces the adaptive loop in the driver
completely.

Implement the NVME specific callback which adjusts the interrupt sets
configuration and remove the adaptive allocation loop.

[ tglx: Simplify the callback further and restore the dropped adjustment of
number of sets ]

Signed-off-by: Ming Lei 
Signed-off-by: Thomas Gleixner 

---
 drivers/nvme/host/pci.c |  108 
 1 file changed, 28 insertions(+), 80 deletions(-)

--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2041,41 +2041,32 @@ static int nvme_setup_host_mem(struct nv
return ret;
 }
 
-/* irq_queues covers admin queue */
-static void nvme_calc_io_queues(struct nvme_dev *dev, unsigned int irq_queues)
+/*
+ * nirqs is the number of interrupts available for write and read
+ * queues. The core already reserved an interrupt for the admin queue.
+ */
+static void nvme_calc_irq_sets(struct irq_affinity *affd, unsigned int nrirqs)
 {
-   unsigned int this_w_queues = write_queues;
-
-   WARN_ON(!irq_queues);
-
-   /*
-* Setup read/write queue split, assign admin queue one independent
-* irq vector if irq_queues is > 1.
-*/
-   if (irq_queues <= 2) {
-   dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
-   dev->io_queues[HCTX_TYPE_READ] = 0;
-   return;
-   }
+   struct nvme_dev *dev = affd->priv;
+   unsigned int nr_read_queues;
 
/*
-* If 'write_queues' is set, ensure it leaves room for at least
-* one read queue and one admin queue
-*/
-   if (this_w_queues >= irq_queues)
-   this_w_queues = irq_queues - 2;
-
-   /*
-* If 'write_queues' is set to zero, reads and writes will share
-* a queue set.
-*/
-   if (!this_w_queues) {
-   dev->io_queues[HCTX_TYPE_DEFAULT] = irq_queues - 1;
-   dev->io_queues[HCTX_TYPE_READ] = 0;
-   } else {
-   dev->io_queues[HCTX_TYPE_DEFAULT] = this_w_queues;
-   dev->io_queues[HCTX_TYPE_READ] = irq_queues - this_w_queues - 1;
-   }
+* If only one interrupt is available, combine write and read
+* queues. If 'write_queues' is set, ensure it leaves room for at
+* least one read queue.
+*/
+   if (nrirqs == 1)
+   nr_read_queues = 0;
+   else if (write_queues >= nrirqs)
+   nr_read_queues = nrirqs - 1;
+   else
+   nr_read_queues = nrirqs - write_queues;
+
+   dev->io_queues[HCTX_TYPE_DEFAULT] = nrirqs - nr_read_queues;
+   affd->set_size[HCTX_TYPE_DEFAULT] = nrirqs - nr_read_queues;
+   dev->io_queues[HCTX_TYPE_READ] = nr_read_queues;
+   affd->set_size[HCTX_

[patch V5 1/8] genirq/affinity: Code consolidation

2019-02-14 Thread Thomas Gleixner
All information and calculations in the interrupt affinity spreading code
are strictly unsigned int, yet the code uses int all over the place.

Convert it over to unsigned int.

Signed-off-by: Thomas Gleixner 
---
 include/linux/interrupt.h |   20 +---
 kernel/irq/affinity.c |   56 ++
 2 files changed, 38 insertions(+), 38 deletions(-)

--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -251,10 +251,10 @@ struct irq_affinity_notify {
  * @sets:  Number of affinitized sets
  */
 struct irq_affinity {
-   int pre_vectors;
-   int post_vectors;
-   int nr_sets;
-   int *sets;
+	unsigned int	pre_vectors;
+	unsigned int	post_vectors;
+	unsigned int	nr_sets;
+	unsigned int	*sets;
 };
 
 /**
@@ -314,9 +314,10 @@ extern int
 irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify 
*notify);
 
 struct irq_affinity_desc *
-irq_create_affinity_masks(int nvec, const struct irq_affinity *affd);
+irq_create_affinity_masks(unsigned int nvec, const struct irq_affinity *affd);
 
-int irq_calc_affinity_vectors(int minvec, int maxvec, const struct 
irq_affinity *affd);
+unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int 
maxvec,
+  const struct irq_affinity *affd);
 
 #else /* CONFIG_SMP */
 
@@ -350,13 +351,14 @@ irq_set_affinity_notifier(unsigned int i
 }
 
 static inline struct irq_affinity_desc *
-irq_create_affinity_masks(int nvec, const struct irq_affinity *affd)
+irq_create_affinity_masks(unsigned int nvec, const struct irq_affinity *affd)
 {
return NULL;
 }
 
-static inline int
-irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity 
*affd)
+static inline unsigned int
+irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec,
+ const struct irq_affinity *affd)
 {
return maxvec;
 }
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -9,7 +9,7 @@
 #include 
 
 static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
-   int cpus_per_vec)
+   unsigned int cpus_per_vec)
 {
const struct cpumask *siblmsk;
int cpu, sibl;
@@ -95,15 +95,17 @@ static int get_nodes_in_cpumask(cpumask_
 }
 
 static int __irq_build_affinity_masks(const struct irq_affinity *affd,
- int startvec, int numvecs, int firstvec,
+ unsigned int startvec,
+ unsigned int numvecs,
+ unsigned int firstvec,
  cpumask_var_t *node_to_cpumask,
  const struct cpumask *cpu_mask,
  struct cpumask *nmsk,
  struct irq_affinity_desc *masks)
 {
-   int n, nodes, cpus_per_vec, extra_vecs, done = 0;
-   int last_affv = firstvec + numvecs;
-   int curvec = startvec;
+   unsigned int n, nodes, cpus_per_vec, extra_vecs, done = 0;
+   unsigned int last_affv = firstvec + numvecs;
+   unsigned int curvec = startvec;
nodemask_t nodemsk = NODE_MASK_NONE;
 
if (!cpumask_weight(cpu_mask))
@@ -117,18 +119,16 @@ static int __irq_build_affinity_masks(co
 */
if (numvecs <= nodes) {
for_each_node_mask(n, nodemsk) {
-   cpumask_or(&masks[curvec].mask,
-   &masks[curvec].mask,
-   node_to_cpumask[n]);
+   cpumask_or(&masks[curvec].mask, &masks[curvec].mask,
+  node_to_cpumask[n]);
if (++curvec == last_affv)
curvec = firstvec;
}
-   done = numvecs;
-   goto out;
+   return numvecs;
}
 
for_each_node_mask(n, nodemsk) {
-   int ncpus, v, vecs_to_assign, vecs_per_node;
+   unsigned int ncpus, v, vecs_to_assign, vecs_per_node;
 
/* Spread the vectors per node */
vecs_per_node = (numvecs - (curvec - firstvec)) / nodes;
@@ -163,8 +163,6 @@ static int __irq_build_affinity_masks(co
curvec = firstvec;
--nodes;
}
-
-out:
return done;
 }
 
@@ -174,13 +172,14 @@ static int __irq_build_affinity_masks(co
  * 2) spread other possible CPUs on these vectors
  */
 static int irq_build_affinity_masks(const struct irq_affinity *affd,
-   int startvec, int numvecs, int firstvec,
+   unsigned int startvec, unsigned int numvecs,
+   un

[patch V5 7/8] genirq/affinity: Set is_managed in the spreading function

2019-02-14 Thread Thomas Gleixner
Some drivers need an extra set of interrupts which are not marked managed,
but should get initial interrupt spreading.

To achieve this it is simpler to set the is_managed bit of the affinity
descriptor in the spreading function instead of having yet another loop and
tons of conditionals.

No functional change.

Signed-off-by: Thomas Gleixner 
---
 kernel/irq/affinity.c |   18 --
 1 file changed, 8 insertions(+), 10 deletions(-)

--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -98,6 +98,7 @@ static int __irq_build_affinity_masks(co
  unsigned int startvec,
  unsigned int numvecs,
  unsigned int firstvec,
+ bool managed,
  cpumask_var_t *node_to_cpumask,
  const struct cpumask *cpu_mask,
  struct cpumask *nmsk,
@@ -154,6 +155,7 @@ static int __irq_build_affinity_masks(co
}
irq_spread_init_one(&masks[curvec].mask, nmsk,
cpus_per_vec);
+   masks[curvec].is_managed = managed;
}
 
done += v;
@@ -173,7 +175,7 @@ static int __irq_build_affinity_masks(co
  */
 static int irq_build_affinity_masks(const struct irq_affinity *affd,
unsigned int startvec, unsigned int numvecs,
-   unsigned int firstvec,
+   unsigned int firstvec, bool managed,
struct irq_affinity_desc *masks)
 {
unsigned int curvec = startvec, nr_present, nr_others;
@@ -197,8 +199,8 @@ static int irq_build_affinity_masks(cons
build_node_to_cpumask(node_to_cpumask);
 
/* Spread on present CPUs starting from affd->pre_vectors */
-   nr_present = __irq_build_affinity_masks(affd, curvec, numvecs,
-   firstvec, node_to_cpumask,
+   nr_present = __irq_build_affinity_masks(affd, curvec, numvecs, firstvec,
+   managed, node_to_cpumask,
cpu_present_mask, nmsk, masks);
 
/*
@@ -212,8 +214,8 @@ static int irq_build_affinity_masks(cons
else
curvec = firstvec + nr_present;
cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
-   nr_others = __irq_build_affinity_masks(affd, curvec, numvecs,
-  firstvec, node_to_cpumask,
+   nr_others = __irq_build_affinity_masks(affd, curvec, numvecs, firstvec,
+  managed, node_to_cpumask,
   npresmsk, nmsk, masks);
put_online_cpus();
 
@@ -290,7 +292,7 @@ irq_create_affinity_masks(unsigned int n
int ret;
 
ret = irq_build_affinity_masks(affd, curvec, this_vecs,
-  curvec, masks);
+  curvec, true, masks);
if (ret) {
kfree(masks);
return NULL;
@@ -307,10 +309,6 @@ irq_create_affinity_masks(unsigned int n
for (; curvec < nvecs; curvec++)
cpumask_copy(&masks[curvec].mask, irq_default_affinity);
 
-   /* Mark the managed interrupts */
-   for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++)
-   masks[i].is_managed = 1;
-
return masks;
 }
 




[patch V5 3/8] genirq/affinity: Add new callback for (re)calculating interrupt sets

2019-02-14 Thread Thomas Gleixner
From: Ming Lei 

The interrupt affinity spreading mechanism supports spreading out
affinities for one or more interrupt sets. An interrupt set contains one or
more interrupts. Each set is mapped to a specific functionality of a
device, e.g. general I/O queues and read I/O queues of multiqueue block
devices.

The number of interrupts per set is defined by the driver. It depends on
the total number of available interrupts for the device, which is
determined by the PCI capabilities and the availability of underlying CPU
resources, and the number of queues which the device provides and the
driver wants to instantiate.

The driver passes initial configuration for the interrupt allocation via a
pointer to struct irq_affinity.

Right now the allocation mechanism is complex as it requires a loop
in the driver to determine the maximum number of interrupts which are
provided by the PCI capabilities and the underlying CPU resources.  This
loop would have to be replicated in every driver which wants to utilize
this mechanism. That's unwanted code duplication and error prone.

In order to move this into generic facilities it is required to have a
mechanism, which allows the recalculation of the interrupt sets and their
size, in the core code. As the core code does not have any knowledge about
the underlying device, a driver specific callback will be added to struct
irq_affinity, which will be invoked by the core code. The callback will get
the number of available interrupts as an argument, so the driver can
calculate the corresponding number and size of interrupt sets.

At the moment the struct irq_affinity pointer which is handed in from the
driver and passed through to several core functions is marked 'const', but
for the callback to be able to modify the data in the struct it's required
to remove the 'const' qualifier.

Add the optional callback to struct irq_affinity, which allows drivers to
recalculate the number and size of interrupt sets and remove the 'const'
qualifier.

For simple invocations, which do not supply a callback, a default callback
is installed, which just sets nr_sets to 1 and transfers the number of
spreadable vectors to the set_size array at index 0.
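
In code this amounts to roughly the following (sketch):

	static void default_calc_sets(struct irq_affinity *affd, unsigned int affvecs)
	{
		/* No driver callback: treat all spreadable vectors as a single set */
		affd->nr_sets = 1;
		affd->set_size[0] = affvecs;
	}

which is installed whenever no calc_sets() callback is supplied.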

This is for now guarded by a check for nr_sets != 0 to keep the NVME driver
working until it is converted to the callback mechanism.

[ tglx: Fixed the simple case (no sets required). Moved the sanity check
for nr_sets after the invocation of the callback so it catches
broken drivers. Fixed the kernel doc comments for struct
irq_affinity and de-'This patch'-ed the changelog ]

Signed-off-by: Ming Lei 
Signed-off-by: Thomas Gleixner 

---
 drivers/pci/msi.c   |   18 
 drivers/scsi/be2iscsi/be_main.c |2 -
 include/linux/interrupt.h   |   10 +++-
 include/linux/pci.h |4 +--
 kernel/irq/affinity.c   |   45 +++-
 5 files changed, 51 insertions(+), 28 deletions(-)

--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -532,7 +532,7 @@ static int populate_msi_sysfs(struct pci
 }
 
 static struct msi_desc *
-msi_setup_entry(struct pci_dev *dev, int nvec, const struct irq_affinity *affd)
+msi_setup_entry(struct pci_dev *dev, int nvec, struct irq_affinity *affd)
 {
struct irq_affinity_desc *masks = NULL;
struct msi_desc *entry;
@@ -597,7 +597,7 @@ static int msi_verify_entries(struct pci
  * which could have been allocated.
  */
 static int msi_capability_init(struct pci_dev *dev, int nvec,
-  const struct irq_affinity *affd)
+  struct irq_affinity *affd)
 {
struct msi_desc *entry;
int ret;
@@ -669,7 +669,7 @@ static void __iomem *msix_map_region(str
 
 static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
  struct msix_entry *entries, int nvec,
- const struct irq_affinity *affd)
+ struct irq_affinity *affd)
 {
struct irq_affinity_desc *curmsk, *masks = NULL;
struct msi_desc *entry;
@@ -736,7 +736,7 @@ static void msix_program_entries(struct
  * requested MSI-X entries with allocated irqs or non-zero for otherwise.
  **/
 static int msix_capability_init(struct pci_dev *dev, struct msix_entry 
*entries,
-   int nvec, const struct irq_affinity *affd)
+   int nvec, struct irq_affinity *affd)
 {
int ret;
u16 control;
@@ -932,7 +932,7 @@ int pci_msix_vec_count(struct pci_dev *d
 EXPORT_SYMBOL(pci_msix_vec_count);
 
 static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
-int nvec, const struct irq_affinity *affd)
+int nvec, struct irq_affinity *affd)
 {
int nr_entries;
int i, j;
@@ -1018,7 +1018,7 @@ int pci_msi_enabled(vo

[patch V5 5/8] genirq/affinity: Remove the leftovers of the original set support

2019-02-14 Thread Thomas Gleixner
Now that the NVME driver is converted over to the calc_sets() callback, the
workarounds of the original set support can be removed.

Signed-off-by: Thomas Gleixner 
---
 kernel/irq/affinity.c |   17 -
 1 file changed, 4 insertions(+), 13 deletions(-)

--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -258,21 +258,17 @@ irq_create_affinity_masks(unsigned int n
 
/*
 * Simple invocations do not provide a calc_sets()
-* callback. Install the generic one. The check for affd->nr_sets
-* is a temporary workaround and will be removed after the NVME
-* driver is converted over.
+* callback. Install the generic one.
 */
-   if (!affd->nr_sets && !affd->calc_sets)
+   if (!affd->calc_sets)
affd->calc_sets = default_calc_sets;
 
/*
 * If the device driver provided a calc_sets() callback let it
-* recalculate the number of sets and their size. The check will go
-* away once the NVME driver is converted over.
+* recalculate the number of sets and their size.
 */
affvecs = nvecs - affd->pre_vectors - affd->post_vectors;
-   if (affd->calc_sets)
-   affd->calc_sets(affd, affvecs);
+   affd->calc_sets(affd, affvecs);
 
if (WARN_ON_ONCE(affd->nr_sets > IRQ_AFFINITY_MAX_SETS))
return NULL;
@@ -335,11 +331,6 @@ unsigned int irq_calc_affinity_vectors(u
 
if (affd->calc_sets) {
set_vecs = maxvec - resv;
-   } else if (affd->nr_sets) {
-   unsigned int i;
-
-   for (i = 0, set_vecs = 0;  i < affd->nr_sets; i++)
-   set_vecs += affd->set_size[i];
} else {
get_online_cpus();
set_vecs = cpumask_weight(cpu_possible_mask);




[patch V5 6/8] PCI/MSI: Remove obsolete sanity checks for multiple interrupt sets

2019-02-14 Thread Thomas Gleixner
Multiple interrupt sets for affinity spreading are now handled in the core
code and the number of sets and their size is recalculated via a driver
supplied callback.

That avoids the requirement to invoke pci_alloc_irq_vectors_affinity() with
the arguments minvecs and maxvecs set to the same value and the callsite
handling the ENOSPC situation.

Remove the now obsolete sanity checks and the related comments.

Signed-off-by: Thomas Gleixner 

---
 drivers/pci/msi.c |   14 --
 1 file changed, 14 deletions(-)

--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -1035,13 +1035,6 @@ static int __pci_enable_msi_range(struct
if (maxvec < minvec)
return -ERANGE;
 
-   /*
-* If the caller is passing in sets, we can't support a range of
-* vectors. The caller needs to handle that.
-*/
-   if (affd && affd->nr_sets && minvec != maxvec)
-   return -EINVAL;
-
if (WARN_ON_ONCE(dev->msi_enabled))
return -EINVAL;
 
@@ -1093,13 +1086,6 @@ static int __pci_enable_msix_range(struc
if (maxvec < minvec)
return -ERANGE;
 
-   /*
-* If the caller is passing in sets, we can't support a range of
-* supported vectors. The caller needs to handle that.
-*/
-   if (affd && affd->nr_sets && minvec != maxvec)
-   return -EINVAL;
-
if (WARN_ON_ONCE(dev->msix_enabled))
return -EINVAL;
 




[patch V5 8/8] genirq/affinity: Add support for non-managed affinity sets

2019-02-14 Thread Thomas Gleixner
Some drivers need an extra set of interrupts which should not be marked
managed, but should get initial interrupt spreading.

Add a bitmap to struct irq_affinity which allows the driver to mark a
particular set of interrupts as non managed. Check the bitmap during
spreading and use the result to mark the interrupts in the sets
accordingly.

The unmanaged interrupts get initial spreading, but user space can change
their affinity later on. For the managed sets, i.e. those where the corresponding
bit in the mask is not set, there is no change in behaviour.

Usage example:

struct irq_affinity affd = {
.pre_vectors= 2,
.unmanaged_sets = 0x02,
.calc_sets  = drv_calc_sets,
};


For both interrupt sets the interrupts are properly spread out, but the
second set is not marked managed.

Signed-off-by: Thomas Gleixner 
---
 include/linux/interrupt.h |2 ++
 kernel/irq/affinity.c |5 -
 2 files changed, 6 insertions(+), 1 deletion(-)

--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -251,6 +251,7 @@ struct irq_affinity_notify {
  * the MSI(-X) vector space
  * @nr_sets:   The number of interrupt sets for which affinity
  * spreading is required
+ * @unmanaged_sets:	Bitmap to mark entries in the @set_size array unmanaged
  * @set_size:  Array holding the size of each interrupt set
  * @calc_sets: Callback for calculating the number and size
  * of interrupt sets
@@ -261,6 +262,7 @@ struct irq_affinity {
 unsigned int	pre_vectors;
 unsigned int	post_vectors;
 unsigned int	nr_sets;
+	unsigned int	unmanaged_sets;
 unsigned int	set_size[IRQ_AFFINITY_MAX_SETS];
 void		(*calc_sets)(struct irq_affinity *, unsigned int nvecs);
 void		*priv;
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -251,6 +251,8 @@ irq_create_affinity_masks(unsigned int n
unsigned int affvecs, curvec, usedvecs, i;
struct irq_affinity_desc *masks = NULL;
 
+   BUILD_BUG_ON(IRQ_AFFINITY_MAX_SETS > sizeof(affd->unmanaged_sets) * 8);
+
/*
 * If there aren't any vectors left after applying the pre/post
 * vectors don't bother with assigning affinity.
@@ -288,11 +290,12 @@ irq_create_affinity_masks(unsigned int n
 * have multiple sets, build each sets affinity mask separately.
 */
for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
+	bool managed = !(affd->unmanaged_sets & (1U << i));	/* bit set => unmanaged */
unsigned int this_vecs = affd->set_size[i];
int ret;
 
ret = irq_build_affinity_masks(affd, curvec, this_vecs,
-  curvec, true, masks);
+  curvec, managed, masks);
if (ret) {
kfree(masks);
return NULL;




[patch V5 0/8] genirq/affinity: Overhaul the multiple interrupt sets support

2019-02-14 Thread Thomas Gleixner
This is a follow-up on Ming's V4 patch series, which addresses the
shortcomings of multiple interrupt sets in the core code:

  https://lkml.kernel.org/r/20190214122347.17372-1-ming@redhat.com

The series applies against:

 git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master

and is also available from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 

The changes vs. Ming's v4 are:

  1) Do a cleanup as a first step to convert all the integer logic over to
 use unsigned int. I did this because I tripped over the plain integer
 calculation at some point and there is really no reason why any of
 this should be signed. In hindsight I should have asked for that when
 the whole stuff got introduced but for some reason I totally missed
 that.

  2) Fix the memcpy() in the preparatory patch which moves the set_size
 array into struct irq_affinity. Fixed up the kerneldoc comments

  3) Fixed the case for simple invocations (no sets, no callback) by
 installing a default callback which just sets nr_sets to 1 and
 transfers the number of spreadable vectors to the set_size array at
 index 0. That allows multiple consecutive invocations from the PCI
 code without having conditionals and corner case handling in the
 affinity spreading logic. IOW, it's just a variant of set handling.

 Moved the sanity check for the number of sets after the callback
 invocation so broken driver callback code is caught properly.

 The callback is now invoked with the number of spreadable interrupts
 instead of the total vectors, so the callback does not have to worry
 about the pre/post_vector reservation at all.

  4) Simplified the NVME callback logic further and brought the adjustments
 of the number of sets back, which got dropped accidentally.

  5) Remove all workarounds and leftovers of the old set support because
 from now on multiple interrupt sets can only be supported when a
 driver callback is supplied. Checking irq_affinity::nr_sets and the
 callback does not make any sense now.

On top of that I added the two patches which I postponed due to Ming's
work. They add support for marking a set unmanaged. This was asked for
the MegaSaS folks (Cc'ed) so they can request one managed set for the
normal multi queue logic and one unmanaged set for special driver specific
interrupts. The unmanaged set is spread out in the usual way, but not
marked managed and therefore the interrupts are treated as regular device
interrupts like the pre/post vectors.

Ming, thanks for the great work and your patience. I picked that up and
fixed up the missing bits only because my deadline for 5.1 feature patches
is basically tomorrow and not because I'm disappointed with your
work. Quite the contrary!

As I dropped the reviewed/acked-by's due to some fundamental changes,
can I ask everyone to take another look at the set please? Especially
the NVME callback needs some scrutiny, it looks way too simple now :)

Some testing would be appreciated as well.

Thanks,

tglx

8<
 drivers/nvme/host/pci.c |  111 ++--
 drivers/pci/msi.c   |   32 +++
 drivers/scsi/be2iscsi/be_main.c |2 
 include/linux/interrupt.h   |   35 
 include/linux/pci.h |4 -
 kernel/irq/affinity.c   |  107 +-
 6 files changed, 126 insertions(+), 165 deletions(-)




Re: [patch V5 4/8] nvme-pci: Simplify interrupt allocation

2019-02-14 Thread Thomas Gleixner
On Fri, 15 Feb 2019, Ming Lei wrote:
> > +* If only one interrupt is available, combine write and read
> > +* queues. If 'write_queues' is set, ensure it leaves room for at
> > +* least one read queue.
> > +*/
> > +   if (nrirqs == 1)
> > +   nr_read_queues = 0;
> > +   else if (write_queues >= nrirqs)
> > +   nr_read_queues = nrirqs - 1;
> > +   else
> > +   nr_read_queues = nrirqs - write_queues;
> > +
> > +   dev->io_queues[HCTX_TYPE_DEFAULT] = nrirqs - nr_read_queues;
> > +   affd->set_size[HCTX_TYPE_DEFAULT] = nrirqs - nr_read_queues;
> > +   dev->io_queues[HCTX_TYPE_READ] = nr_read_queues;
> > +   affd->set_size[HCTX_TYPE_READ] = nr_read_queues;
> > +   affd->nr_sets = nr_read_queues ? 2 : 1;
> >  }
> 
> .calc_sets is called only if more than .pre_vectors is available,
> then dev->io_queues[HCTX_TYPE_DEFAULT] may not be set in case of
> (nvecs == affd->pre_vectors + affd->post_vectors).

Hmm, good catch. The delta patch below should fix that, but I have to go
through all the possible cases in pci_alloc_irq_vectors_affinity() once
more with brain awake.

Thanks,

tglx

8<-

--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2092,6 +2092,10 @@ static int nvme_setup_irqs(struct nvme_d
}
dev->io_queues[HCTX_TYPE_POLL] = this_p_queues;
 
+   /* Initialize for the single interrupt case */
+   dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
+   dev->io_queues[HCTX_TYPE_READ] = 0;
+
return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues,
  PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
 }


Re: [patch V5 4/8] nvme-pci: Simplify interrupt allocation

2019-02-15 Thread Thomas Gleixner
On Fri, 15 Feb 2019, Marc Zyngier wrote:
> On Thu, 14 Feb 2019 20:47:59 +,
> Thomas Gleixner  wrote:
> >  drivers/nvme/host/pci.c |  108 
> > 
> >  1 file changed, 28 insertions(+), 80 deletions(-)
> > 
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -2041,41 +2041,32 @@ static int nvme_setup_host_mem(struct nv
> > return ret;
> >  }
> >  
> > -/* irq_queues covers admin queue */
> > -static void nvme_calc_io_queues(struct nvme_dev *dev, unsigned int 
> > irq_queues)
> > +/*
> > + * nirqs is the number of interrupts available for write and read
> > + * queues. The core already reserved an interrupt for the admin queue.
> > + */
> > +static void nvme_calc_irq_sets(struct irq_affinity *affd, unsigned int 
> > nrirqs)
> >  {
> > -   unsigned int this_w_queues = write_queues;
> > -
> > -   WARN_ON(!irq_queues);
> > -
> > -   /*
> > -* Setup read/write queue split, assign admin queue one independent
> > -* irq vector if irq_queues is > 1.
> > -*/
> > -   if (irq_queues <= 2) {
> > -   dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
> > -   dev->io_queues[HCTX_TYPE_READ] = 0;
> > -   return;
> > -   }
> > +   struct nvme_dev *dev = affd->priv;
> > +   unsigned int nr_read_queues;
> >  
> > /*
> > -* If 'write_queues' is set, ensure it leaves room for at least
> > -* one read queue and one admin queue
> > -*/
> > -   if (this_w_queues >= irq_queues)
> > -   this_w_queues = irq_queues - 2;
> > -
> > -   /*
> > -* If 'write_queues' is set to zero, reads and writes will share
> > -* a queue set.
> > -*/
> > -   if (!this_w_queues) {
> > -   dev->io_queues[HCTX_TYPE_DEFAULT] = irq_queues - 1;
> > -   dev->io_queues[HCTX_TYPE_READ] = 0;
> > -   } else {
> > -   dev->io_queues[HCTX_TYPE_DEFAULT] = this_w_queues;
> > -   dev->io_queues[HCTX_TYPE_READ] = irq_queues - this_w_queues - 1;
> > -   }
> > +* If only one interrupt is available, combine write and read
> > +* queues. If 'write_queues' is set, ensure it leaves room for at
> > +* least one read queue.
> 
> [Full disclaimer: I only have had two coffees this morning, and it is
> only at the fourth that my brain is able to kick in...]
> 
> I don't know much about NVME, but I feel like there is a small
> disconnect between the code and the above comment, which says "leave
> room for at least one read queue"...
> 
> > +*/
> > +   if (nrirqs == 1)
> > +   nr_read_queues = 0;
> > +   else if (write_queues >= nrirqs)
> > +   nr_read_queues = nrirqs - 1;
> 
> ... while this seems to ensure that we carve out one write queue out of
> the irq set. It looks like a departure from the original code, which
> would set nr_read_queues to 1 in that particular case.

Bah. right you are.



Re: [patch V5 4/8] nvme-pci: Simplify interrupt allocation

2019-02-15 Thread Thomas Gleixner
On Fri, 15 Feb 2019, Thomas Gleixner wrote:
> On Fri, 15 Feb 2019, Marc Zyngier wrote:
> > > +  */
> > > + if (nrirqs == 1)
> > > + nr_read_queues = 0;
> > > + else if (write_queues >= nrirqs)
> > > + nr_read_queues = nrirqs - 1;
> > 
> > ... while this seems to ensure that we carve out one write queue out of
> > the irq set. It looks like a departure from the original code, which
> > would set nr_read_queues to 1 in that particular case.
> 
> Bah. right you are.

That needs to be:

 nr_read_queues = 1;

obviously.

/me blushes.
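
I.e. the read/write split becomes (sketch of the corrected branches):

	if (nrirqs == 1)
		nr_read_queues = 0;
	else if (write_queues >= nrirqs)
		nr_read_queues = 1;	/* leave room for at least one read queue */
	else
		nr_read_queues = nrirqs - write_queues;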


Re: [patch V5 4/8] nvme-pci: Simplify interrupt allocation

2019-02-15 Thread Thomas Gleixner
On Fri, 15 Feb 2019, Thomas Gleixner wrote:

> On Fri, 15 Feb 2019, Ming Lei wrote:
> > > +  * If only one interrupt is available, combine write and read
> > > +  * queues. If 'write_queues' is set, ensure it leaves room for at
> > > +  * least one read queue.
> > > +  */
> > > + if (nrirqs == 1)
> > > + nr_read_queues = 0;
> > > + else if (write_queues >= nrirqs)
> > > + nr_read_queues = nrirqs - 1;
> > > + else
> > > + nr_read_queues = nrirqs - write_queues;
> > > +
> > > + dev->io_queues[HCTX_TYPE_DEFAULT] = nrirqs - nr_read_queues;
> > > + affd->set_size[HCTX_TYPE_DEFAULT] = nrirqs - nr_read_queues;
> > > + dev->io_queues[HCTX_TYPE_READ] = nr_read_queues;
> > > + affd->set_size[HCTX_TYPE_READ] = nr_read_queues;
> > > + affd->nr_sets = nr_read_queues ? 2 : 1;
> > >  }
> > 
> > .calc_sets is called only if more than .pre_vectors is available,
> > then dev->io_queues[HCTX_TYPE_DEFAULT] may not be set in case of
> > (nvecs == affd->pre_vectors + affd->post_vectors).
> 
> Hmm, good catch. The delta patch below should fix that, but I have to go
> through all the possible cases in pci_alloc_irq_vectors_affinity() once
> more with brain awake.

The existing logic in the driver is somewhat strange. If there is only a
single interrupt available, i.e. no separation of admin and I/O queue, then
dev->io_queues[HCTX_TYPE_DEFAULT] is still set to 1.

Now with the callback scheme (independent of my changes) that breaks in
various ways:

  1) irq_create_affinity_masks() bails out early:

if (nvecs == affd->pre_vectors + affd->post_vectors)
return NULL;

 So the callback won't be invoked and the last content of
 dev->io_queues is preserved. By chance this might end up with

dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
dev->io_queues[HCTX_TYPE_READ] = 0;

but that does not happen by design.


 2) pci_alloc_irq_vectors_affinity() has the following flow:

__pci_enable_msix_range()
  for (...) {
__pci_enable_msix()
  }

If that fails because MSIX is not supported, then the affinity
spreading code has not been called yet, and neither has the callback.
Now it goes on and tries MSI, which is the same as the above.

When that fails because MSI is not supported, same situation as above
and the code tries to allocate the legacy interrupt.

Now with the initial initialization that case is covered, but not the
following error case:

  Assume MSIX is enabled, but __pci_enable_msix() fails with -ENOMEM
  after having called irq_create_affinity_masks() and subsequently the
  callback with maxvecs.

  Then pci_alloc_irq_vectors_affinity() will try MSI and fail _BEFORE_
  the callback is invoked.

  The next step is to install the legacy interrupt, but that does not
  invoke the affinity code, nor the callback.

  At the end dev->io_queues[*] are still initialized from the failed
  attempt at enabling MSIX based on maxvecs.

  Result: inconsistent state ...

I have an idea how to fix that, but that's again a task for an awake
brain. I will take care of it tomorrow morning.

Thanks,

tglx





 





[patch v6 2/7] genirq/affinity: Store interrupt sets size in struct irq_affinity

2019-02-16 Thread Thomas Gleixner
From: Ming Lei 

The interrupt affinity spreading mechanism supports spreading out
affinities for one or more interrupt sets. An interrupt set contains one
or more interrupts. Each set is mapped to a specific functionality of a
device, e.g. general I/O queues and read I/O queues of multiqueue block
devices.

The number of interrupts per set is defined by the driver. It depends on
the total number of available interrupts for the device, which is
determined by the PCI capabilities and the availability of underlying CPU
resources, and the number of queues which the device provides and the
driver wants to instantiate.

The driver passes initial configuration for the interrupt allocation via
a pointer to struct irq_affinity.

Right now the allocation mechanism is complex as it requires a
loop in the driver to determine the maximum number of interrupts which
are provided by the PCI capabilities and the underlying CPU resources.
This loop would have to be replicated in every driver which wants to
utilize this mechanism. That's unwanted code duplication and error
prone.

In order to move this into generic facilities it is required to have a
mechanism, which allows the recalculation of the interrupt sets and
their size, in the core code. As the core code does not have any
knowledge about the underlying device, a driver specific callback will
be added to struct irq_affinity, which will be invoked by the core
code. The callback will get the number of available interrupts as an
argument, so the driver can calculate the corresponding number and size
of interrupt sets.

To support this, two modifications for the handling of struct irq_affinity
are required:

1) The (optional) interrupt sets size information is contained in a
   separate array of integers and struct irq_affinity contains a
   pointer to it.

   This is cumbersome and as the maximum number of interrupt sets is small,
   there is no reason to have separate storage. Moving the size array into
   struct irq_affinity avoids indirections and makes the code simpler.

2) At the moment the struct irq_affinity pointer which is handed in from
   the driver and passed through to several core functions is marked
   'const'.

   With the upcoming callback to recalculate the number and size of
   interrupt sets, it's necessary to remove the 'const'
   qualifier. Otherwise the callback would not be able to update the data.

Implement #1 and store the interrupt sets size in 'struct irq_affinity'.

No functional change.

[ tglx: Fixed the memcpy() size so it won't copy beyond the size of the
source. Fixed the kernel doc comments for struct irq_affinity and
de-'This patch'-ed the changelog ]

Signed-off-by: Ming Lei 
Signed-off-by: Thomas Gleixner 

---
 drivers/nvme/host/pci.c   |7 +++
 include/linux/interrupt.h |9 ++---
 kernel/irq/affinity.c |   16 
 3 files changed, 21 insertions(+), 11 deletions(-)

--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2081,12 +2081,11 @@ static void nvme_calc_io_queues(struct n
 static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
 {
struct pci_dev *pdev = to_pci_dev(dev->dev);
-   int irq_sets[2];
struct irq_affinity affd = {
-   .pre_vectors = 1,
-   .nr_sets = ARRAY_SIZE(irq_sets),
-   .sets = irq_sets,
+   .pre_vectors= 1,
+   .nr_sets= 2,
};
+   unsigned int *irq_sets = affd.set_size;
int result = 0;
unsigned int irq_queues, this_p_queues;
 
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -241,20 +241,23 @@ struct irq_affinity_notify {
void (*release)(struct kref *ref);
 };
 
+#define IRQ_AFFINITY_MAX_SETS	4
+
 /**
  * struct irq_affinity - Description for automatic irq affinity assignements
  * @pre_vectors:   Don't apply affinity to @pre_vectors at beginning of
  * the MSI(-X) vector space
  * @post_vectors:  Don't apply affinity to @post_vectors at end of
  * the MSI(-X) vector space
- * @nr_sets:   Length of passed in *sets array
- * @sets:  Number of affinitized sets
+ * @nr_sets:   The number of interrupt sets for which affinity
+ * spreading is required
+ * @set_size:  Array holding the size of each interrupt set
  */
 struct irq_affinity {
 unsigned int	pre_vectors;
 unsigned int	post_vectors;
 unsigned int	nr_sets;
-	unsigned int	*sets;
+	unsigned int	set_size[IRQ_AFFINITY_MAX_SETS];
 };
 
 /**
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -238,9 +238,10 @@ static int irq_build_affinity_masks(cons
  * Returns the irq_affinity_desc pointer or NULL if allocation failed.
  */
 struct irq_affinity_desc *
-irq_create_affinity_masks(unsigned int nvecs, 

[patch v6 0/7] genirq/affinity: Overhaul the multiple interrupt sets support

2019-02-16 Thread Thomas Gleixner
This is the final update to the series, with a few corner case fixes
vs. V5, which can be found here:

   https://lkml.kernel.org/r/20190214204755.819014...@linutronix.de

The series applies against:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master

and is also available from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.irq

Changes vs. V5:

  - Change the invocation for the driver callback so it is invoked even
when there are no interrupts left to spread out. That ensures that the
driver can adjust to the situation (in case of NVME a single interrupt)

  - Make sure the callback is invoked in the legacy irq fallback case so
the driver is not in a stale state from a failed MSI(-X) allocation
attempt.

  - Fix the adjustment logic in the NVME driver as pointed out by Ming and
Marc, plus another corner case I found during testing.

  - Simplify the unmanaged set support

Thanks,

tglx

8<-
 drivers/nvme/host/pci.c |  117 +---
 drivers/pci/msi.c   |   39 +
 drivers/scsi/be2iscsi/be_main.c |2 
 include/linux/interrupt.h   |   35 ---
 include/linux/pci.h |4 -
 kernel/irq/affinity.c   |  116 ---
 6 files changed, 153 insertions(+), 160 deletions(-)




[patch v6 3/7] genirq/affinity: Add new callback for (re)calculating interrupt sets

2019-02-16 Thread Thomas Gleixner
From: Ming Lei 

The interrupt affinity spreading mechanism supports spreading out
affinities for one or more interrupt sets. An interrupt set contains one or
more interrupts. Each set is mapped to a specific functionality of a
device, e.g. general I/O queues and read I/O queues of multiqueue block
devices.

The number of interrupts per set is defined by the driver. It depends on
the total number of available interrupts for the device, which is
determined by the PCI capabilities and the availability of underlying CPU
resources, and the number of queues which the device provides and the
driver wants to instantiate.

The driver passes initial configuration for the interrupt allocation via a
pointer to struct irq_affinity.

Right now the allocation mechanism is complex as it requires a loop
in the driver to determine the maximum number of interrupts which are
provided by the PCI capabilities and the underlying CPU resources.  This
loop would have to be replicated in every driver which wants to utilize
this mechanism. That's unwanted code duplication and error prone.

In order to move this into generic facilities it is required to have a
mechanism, which allows the recalculation of the interrupt sets and their
size, in the core code. As the core code does not have any knowledge about the
underlying device, a driver specific callback is required in struct
irq_affinity, which can be invoked by the core code. The callback gets the
number of available interrupts as an argument, so the driver can calculate the
corresponding number and size of interrupt sets.

At the moment the struct irq_affinity pointer which is handed in from the
driver and passed through to several core functions is marked 'const', but for
the callback to be able to modify the data in the struct it's required to
remove the 'const' qualifier.

Add the optional callback to struct irq_affinity, which allows drivers to
recalculate the number and size of interrupt sets and remove the 'const'
qualifier.

For simple invocations, which do not supply a callback, a default callback
is installed, which just sets nr_sets to 1 and transfers the number of
spreadable vectors to the set_size array at index 0.

This is for now guarded by a check for nr_sets != 0 to keep the NVME driver
working until it is converted to the callback mechanism.

To make sure that the driver configuration is correct under all circumstances
the callback is invoked even when there are no interrupts for queues left,
i.e. the pre/post requirements already exhaust the number of available
interrupts.
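
In irq_create_affinity_masks() this means roughly the following (sketch):

	if (nvecs > affd->pre_vectors + affd->post_vectors)
		affvecs = nvecs - affd->pre_vectors - affd->post_vectors;
	else
		affvecs = 0;

	/*
	 * Install the generic callback for simple invocations and let the
	 * driver adjust its configuration even when affvecs ends up 0.
	 */
	if (!affd->calc_sets)
		affd->calc_sets = default_calc_sets;
	affd->calc_sets(affd, affvecs);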

At the PCI layer irq_create_affinity_masks() has to be invoked even for the
case where the legacy interrupt is used. That ensures that the callback is
invoked and the device driver can adjust to that situation.
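
For the legacy fallback in pci_alloc_irq_vectors_affinity() this amounts to
something like (sketch):

	if (flags & PCI_IRQ_LEGACY) {
		if (min_vecs == 1 && dev->irq) {
			/* Let the driver adjust to the single interrupt case */
			if (affd)
				irq_create_affinity_masks(1, affd);
			pci_intx(dev, 1);
			return 1;
		}
	}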

[ tglx: Fixed the simple case (no sets required). Moved the sanity check
for nr_sets after the invocation of the callback so it catches
broken drivers. Fixed the kernel doc comments for struct
irq_affinity and de-'This patch'-ed the changelog ]

Signed-off-by: Ming Lei 
Signed-off-by: Thomas Gleixner 

---
 drivers/pci/msi.c   |   25 ++--
 drivers/scsi/be2iscsi/be_main.c |2 -
 include/linux/interrupt.h   |   10 +-
 include/linux/pci.h |4 +-
 kernel/irq/affinity.c   |   62 
 5 files changed, 71 insertions(+), 32 deletions(-)

Index: b/drivers/pci/msi.c
===
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -532,7 +532,7 @@ static int populate_msi_sysfs(struct pci
 }
 
 static struct msi_desc *
-msi_setup_entry(struct pci_dev *dev, int nvec, const struct irq_affinity *affd)
+msi_setup_entry(struct pci_dev *dev, int nvec, struct irq_affinity *affd)
 {
struct irq_affinity_desc *masks = NULL;
struct msi_desc *entry;
@@ -597,7 +597,7 @@ static int msi_verify_entries(struct pci
  * which could have been allocated.
  */
 static int msi_capability_init(struct pci_dev *dev, int nvec,
-  const struct irq_affinity *affd)
+  struct irq_affinity *affd)
 {
struct msi_desc *entry;
int ret;
@@ -669,7 +669,7 @@ static void __iomem *msix_map_region(str
 
 static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
  struct msix_entry *entries, int nvec,
- const struct irq_affinity *affd)
+ struct irq_affinity *affd)
 {
struct irq_affinity_desc *curmsk, *masks = NULL;
struct msi_desc *entry;
@@ -736,7 +736,7 @@ static void msix_program_entries(struct
  * requested MSI-X entries with allocated irqs or non-zero for otherwise.
  **/
 static int msix_capability_init(struct pci_dev *dev, struct msix_entry 
*entries,
-  

[patch v6 6/7] PCI/MSI: Remove obsolete sanity checks for multiple interrupt sets

2019-02-16 Thread Thomas Gleixner
Multiple interrupt sets for affinity spreading are now handled in the core
code, and the number of sets and their sizes are recalculated via a driver
supplied callback.

That avoids the requirement to invoke pci_alloc_irq_vectors_affinity() with
the arguments minvecs and maxvecs set to the same value, and it removes the
need for the call site to handle the ENOSPC situation itself.
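
In other words, a driver using interrupt sets can simply request a range
again, along the lines of (illustrative only; affd is assumed to carry the
driver's calc_sets() callback):

	nr_vecs = pci_alloc_irq_vectors_affinity(pdev, min_vecs, max_vecs,
				PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
	if (nr_vecs < 0)
		return nr_vecs;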

Remove the now obsolete sanity checks and the related comments.

Signed-off-by: Thomas Gleixner 

---
 drivers/pci/msi.c |   14 --
 1 file changed, 14 deletions(-)

--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -1035,13 +1035,6 @@ static int __pci_enable_msi_range(struct
if (maxvec < minvec)
return -ERANGE;
 
-   /*
-* If the caller is passing in sets, we can't support a range of
-* vectors. The caller needs to handle that.
-*/
-   if (affd && affd->nr_sets && minvec != maxvec)
-   return -EINVAL;
-
if (WARN_ON_ONCE(dev->msi_enabled))
return -EINVAL;
 
@@ -1093,13 +1086,6 @@ static int __pci_enable_msix_range(struc
if (maxvec < minvec)
return -ERANGE;
 
-   /*
-* If the caller is passing in sets, we can't support a range of
-* supported vectors. The caller needs to handle that.
-*/
-   if (affd && affd->nr_sets && minvec != maxvec)
-   return -EINVAL;
-
if (WARN_ON_ONCE(dev->msix_enabled))
return -EINVAL;
 




[patch v6 4/7] nvme-pci: Simplify interrupt allocation

2019-02-16 Thread Thomas Gleixner
From: Ming Lei 

The NVME PCI driver contains a tedious mechanism for interrupt
allocation, which is necessary to adjust the number and size of the
interrupt sets to the maximum available number of interrupts, which in
turn depends on the underlying PCI capabilities and the available CPU
resources.

It works around the former shortcomings of the PCI and core interrupt
allocation mechanisms in combination with interrupt sets.

The PCI interrupt allocation function allows the caller to provide a maximum
and a minimum number of interrupts to be allocated and tries to allocate as
many as possible. This worked without driver interaction as long as there
was only a single set of interrupts to handle.

With the addition of support for multiple interrupt sets in the generic
affinity spreading logic, which is invoked from the PCI interrupt
allocation, the adaptive loop in the PCI interrupt allocation did not
work for multiple interrupt sets. The reason is that depending on the
total number of interrupts which the PCI allocation adaptive loop tries
to allocate in each step, the number and the size of the interrupt sets
need to be adapted as well. Due to the way the interrupt sets support was
implemented there was no way for the PCI interrupt allocation code or the
core affinity spreading mechanism to invoke a driver specific function
for adapting the interrupt sets configuration.

As a consequence the driver had to implement another adaptive loop around
the PCI interrupt allocation function and calling that with maximum and
minimum interrupts set to the same value. This ensured that the
allocation either succeeded or immediately failed without any attempt to
adjust the number of interrupts in the PCI code.
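
The pattern in question looked roughly like this (illustrative sketch, not
the literal nvme code; drv_recalc_sets() is a stand-in for the driver's set
size recalculation):

	for (nr_irqs = max_irqs; nr_irqs >= min_irqs; nr_irqs--) {
		drv_recalc_sets(dev, nr_irqs);
		ret = pci_alloc_irq_vectors_affinity(pdev, nr_irqs, nr_irqs,
				PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
		if (ret > 0)
			break;
	}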

The core code now allows drivers to provide a callback to recalculate the
number and the size of interrupt sets during PCI interrupt allocation,
which in turn allows the PCI interrupt allocation function to be called
in the same way as with a single set of interrupts. The PCI code handles
the adaptive loop and the interrupt affinity spreading mechanism invokes
the driver callback to adapt the interrupt set configuration to the
current loop value. This replaces the adaptive loop in the driver
completely.

Implement the NVME specific callback which adjusts the interrupt sets
configuration and remove the adaptive allocation loop.
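
With the callback in place, the allocation side of the driver boils down to
roughly the following (sketched here; the corresponding hunk is further down
in the full patch):

	struct irq_affinity affd = {
		.pre_vectors	= 1,	/* admin queue */
		.calc_sets	= nvme_calc_irq_sets,
		.priv		= dev,
	};

	return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues,
			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);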

[ tglx: Simplify the callback further and restore the dropped adjustment of
number of sets ]

Signed-off-by: Ming Lei 
Signed-off-by: Thomas Gleixner 

---
 drivers/nvme/host/pci.c |  116 
 1 file changed, 39 insertions(+), 77 deletions(-)

Index: b/drivers/nvme/host/pci.c
===
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2041,41 +2041,42 @@ static int nvme_setup_host_mem(struct nv
return ret;
 }
 
-/* irq_queues covers admin queue */
-static void nvme_calc_io_queues(struct nvme_dev *dev, unsigned int irq_queues)
+/*
+ * nrirqs is the number of interrupts available for write and read
+ * queues. The core already reserved an interrupt for the admin queue.
+ */
+static void nvme_calc_irq_sets(struct irq_affinity *affd, unsigned int nrirqs)
 {
-   unsigned int this_w_queues = write_queues;
-
-   WARN_ON(!irq_queues);
-
-   /*
-* Setup read/write queue split, assign admin queue one independent
-* irq vector if irq_queues is > 1.
-*/
-   if (irq_queues <= 2) {
-   dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
-   dev->io_queues[HCTX_TYPE_READ] = 0;
-   return;
-   }
-
-   /*
-* If 'write_queues' is set, ensure it leaves room for at least
-* one read queue and one admin queue
-*/
-   if (this_w_queues >= irq_queues)
-   this_w_queues = irq_queues - 2;
+   struct nvme_dev *dev = affd->priv;
+   unsigned int nr_read_queues;
 
/*
-* If 'write_queues' is set to zero, reads and writes will share
-* a queue set.
-*/
-   if (!this_w_queues) {
-   dev->io_queues[HCTX_TYPE_DEFAULT] = irq_queues - 1;
-   dev->io_queues[HCTX_TYPE_READ] = 0;
+* If there is no interrupt available for queues, ensure that
+* the default queue is set to 1. The affinity set size is
+* also set to one, but the irq core ignores it for this case.
+*
+* If only one interrupt is available or 'write_queues' == 0, combine
+* write and read queues.
+*
+* If 'write_queues' > 0, ensure it leaves room for at least one read
+* queue.
+*/
+   if (!nrirqs) {
+   nrirqs = 1;
+   nr_read_queues = 0;
+   } else if (nrirqs == 1 || !write_queues) {
+   nr_read_queues = 0;
+   } else if (write_queues >= nrirqs) {
+   nr_read_queues 

[patch v6 7/7] genirq/affinity: Add support for non-managed affinity sets

2019-02-16 Thread Thomas Gleixner
Some drivers need an extra set of interrupts which should not be marked
managed, but should get initial interrupt spreading.

Add a bitmap to struct irq_affinity which allows the driver to mark a
particular set of interrupts as non managed. Check the bitmap during
spreading and use the result to mark the interrupts in the sets
accordingly.

The unmanaged interrupts get initial spreading, but user space can change
their affinity later on. For the managed sets, i.e. those whose
corresponding bit in the mask is not set, there is no change in behaviour.

Usage example:

	struct irq_affinity affd = {
		.pre_vectors	= 2,
		.unmanaged_sets	= 0x02,
		.calc_sets	= drv_calc_sets,
	};


For both interrupt sets the interrupts are properly spread out, but the
second set is not marked managed.

Signed-off-by: Thomas Gleixner 
---
 include/linux/interrupt.h |2 ++
 kernel/irq/affinity.c |   16 +++-
 2 files changed, 13 insertions(+), 5 deletions(-)

Index: b/include/linux/interrupt.h
===
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -251,6 +251,7 @@ struct irq_affinity_notify {
  * the MSI(-X) vector space
  * @nr_sets:   The number of interrupt sets for which affinity
  * spreading is required
+ * @unmanaged_sets: Bitmap to mark entries in the @set_size array unmanaged
  * @set_size:  Array holding the size of each interrupt set
  * @calc_sets: Callback for calculating the number and size
  * of interrupt sets
@@ -261,6 +262,7 @@ struct irq_affinity {
	unsigned int	pre_vectors;
	unsigned int	post_vectors;
	unsigned int	nr_sets;
+	unsigned int	unmanaged_sets;
	unsigned int	set_size[IRQ_AFFINITY_MAX_SETS];
	void		(*calc_sets)(struct irq_affinity *, unsigned int nvecs);
	void		*priv;
Index: b/kernel/irq/affinity.c
===
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -249,6 +249,8 @@ irq_create_affinity_masks(unsigned int n
unsigned int affvecs, curvec, usedvecs, i;
struct irq_affinity_desc *masks = NULL;
 
+   BUILD_BUG_ON(IRQ_AFFINITY_MAX_SETS > sizeof(affd->unmanaged_sets) * 8);
+
/*
 * Determine the number of vectors which need interrupt affinities
 * assigned. If the pre/post request exhausts the available vectors
@@ -292,7 +294,8 @@ irq_create_affinity_masks(unsigned int n
 * have multiple sets, build each sets affinity mask separately.
 */
for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
-   unsigned int this_vecs = affd->set_size[i];
+   bool managed = affd->unmanaged_sets & (1U << i) ? true : false;
+   unsigned int idx, this_vecs = affd->set_size[i];
int ret;
 
ret = irq_build_affinity_masks(affd, curvec, this_vecs,
@@ -301,8 +304,15 @@ irq_create_affinity_masks(unsigned int n
kfree(masks);
return NULL;
}
+
+   idx = curvec;
curvec += this_vecs;
usedvecs += this_vecs;
+   if (managed) {
+   /* Mark the managed interrupts */
+   for (; idx < curvec; idx++)
+   masks[idx].is_managed = 1;
+   }
}
 
/* Fill out vectors at the end that don't need affinity */
@@ -313,10 +323,6 @@ irq_create_affinity_masks(unsigned int n
for (; curvec < nvecs; curvec++)
cpumask_copy(&masks[curvec].mask, irq_default_affinity);
 
-   /* Mark the managed interrupts */
-   for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++)
-   masks[i].is_managed = 1;
-
return masks;
 }
 




[patch v6 5/7] genirq/affinity: Remove the leftovers of the original set support

2019-02-16 Thread Thomas Gleixner
Now that the NVME driver is converted over to the calc_sets() callback, the
workarounds of the original set support can be removed.

Signed-off-by: Thomas Gleixner 
---
 kernel/irq/affinity.c |   20 
 1 file changed, 4 insertions(+), 16 deletions(-)

Index: b/kernel/irq/affinity.c
===
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -264,20 +264,13 @@ irq_create_affinity_masks(unsigned int n
 
/*
 * Simple invocations do not provide a calc_sets() callback. Install
-* the generic one. The check for affd->nr_sets is a temporary
-* workaround and will be removed after the NVME driver is converted
-* over.
+* the generic one.
 */
-   if (!affd->nr_sets && !affd->calc_sets)
+   if (!affd->calc_sets)
affd->calc_sets = default_calc_sets;
 
-   /*
-* If the device driver provided a calc_sets() callback let it
-* recalculate the number of sets and their size. The check will go
-* away once the NVME driver is converted over.
-*/
-   if (affd->calc_sets)
-   affd->calc_sets(affd, affvecs);
+   /* Recalculate the sets */
+   affd->calc_sets(affd, affvecs);
 
if (WARN_ON_ONCE(affd->nr_sets > IRQ_AFFINITY_MAX_SETS))
return NULL;
@@ -344,11 +337,6 @@ unsigned int irq_calc_affinity_vectors(u
 
if (affd->calc_sets) {
set_vecs = maxvec - resv;
-   } else if (affd->nr_sets) {
-   unsigned int i;
-
-   for (i = 0, set_vecs = 0;  i < affd->nr_sets; i++)
-   set_vecs += affd->set_size[i];
} else {
get_online_cpus();
set_vecs = cpumask_weight(cpu_possible_mask);




[patch v6 1/7] genirq/affinity: Code consolidation

2019-02-16 Thread Thomas Gleixner
All information and calculations in the interrupt affinity spreading code
are strictly unsigned int, though the code uses int all over the place.

Convert it over to unsigned int.

Signed-off-by: Thomas Gleixner 
---
 include/linux/interrupt.h |   20 +---
 kernel/irq/affinity.c |   56 ++
 2 files changed, 38 insertions(+), 38 deletions(-)

--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -251,10 +251,10 @@ struct irq_affinity_notify {
  * @sets:  Number of affinitized sets
  */
 struct irq_affinity {
-   int pre_vectors;
-   int post_vectors;
-   int nr_sets;
-   int *sets;
+	unsigned int	pre_vectors;
+	unsigned int	post_vectors;
+	unsigned int	nr_sets;
+	unsigned int	*sets;
 };
 
 /**
@@ -314,9 +314,10 @@ extern int
 irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify);
 
 struct irq_affinity_desc *
-irq_create_affinity_masks(int nvec, const struct irq_affinity *affd);
+irq_create_affinity_masks(unsigned int nvec, const struct irq_affinity *affd);
 
-int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity *affd);
+unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec,
+  const struct irq_affinity *affd);
 
 #else /* CONFIG_SMP */
 
@@ -350,13 +351,14 @@ irq_set_affinity_notifier(unsigned int i
 }
 
 static inline struct irq_affinity_desc *
-irq_create_affinity_masks(int nvec, const struct irq_affinity *affd)
+irq_create_affinity_masks(unsigned int nvec, const struct irq_affinity *affd)
 {
return NULL;
 }
 
-static inline int
-irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity *affd)
+static inline unsigned int
+irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec,
+ const struct irq_affinity *affd)
 {
return maxvec;
 }
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -9,7 +9,7 @@
 #include 
 
 static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
-   int cpus_per_vec)
+   unsigned int cpus_per_vec)
 {
const struct cpumask *siblmsk;
int cpu, sibl;
@@ -95,15 +95,17 @@ static int get_nodes_in_cpumask(cpumask_
 }
 
 static int __irq_build_affinity_masks(const struct irq_affinity *affd,
- int startvec, int numvecs, int firstvec,
+ unsigned int startvec,
+ unsigned int numvecs,
+ unsigned int firstvec,
  cpumask_var_t *node_to_cpumask,
  const struct cpumask *cpu_mask,
  struct cpumask *nmsk,
  struct irq_affinity_desc *masks)
 {
-   int n, nodes, cpus_per_vec, extra_vecs, done = 0;
-   int last_affv = firstvec + numvecs;
-   int curvec = startvec;
+   unsigned int n, nodes, cpus_per_vec, extra_vecs, done = 0;
+   unsigned int last_affv = firstvec + numvecs;
+   unsigned int curvec = startvec;
nodemask_t nodemsk = NODE_MASK_NONE;
 
if (!cpumask_weight(cpu_mask))
@@ -117,18 +119,16 @@ static int __irq_build_affinity_masks(co
 */
if (numvecs <= nodes) {
for_each_node_mask(n, nodemsk) {
-   cpumask_or(&masks[curvec].mask,
-   &masks[curvec].mask,
-   node_to_cpumask[n]);
+   cpumask_or(&masks[curvec].mask, &masks[curvec].mask,
+  node_to_cpumask[n]);
if (++curvec == last_affv)
curvec = firstvec;
}
-   done = numvecs;
-   goto out;
+   return numvecs;
}
 
for_each_node_mask(n, nodemsk) {
-   int ncpus, v, vecs_to_assign, vecs_per_node;
+   unsigned int ncpus, v, vecs_to_assign, vecs_per_node;
 
/* Spread the vectors per node */
vecs_per_node = (numvecs - (curvec - firstvec)) / nodes;
@@ -163,8 +163,6 @@ static int __irq_build_affinity_masks(co
curvec = firstvec;
--nodes;
}
-
-out:
return done;
 }
 
@@ -174,13 +172,14 @@ static int __irq_build_affinity_masks(co
  * 2) spread other possible CPUs on these vectors
  */
 static int irq_build_affinity_masks(const struct irq_affinity *affd,
-   int startvec, int numvecs, int firstvec,
+   unsigned int startvec, unsigned int numvecs,
+   un

Re: [patch v6 7/7] genirq/affinity: Add support for non-managed affinity sets

2019-02-17 Thread Thomas Gleixner
On Sun, 17 Feb 2019, Ming Lei wrote:
> On Sat, Feb 16, 2019 at 06:13:13PM +0100, Thomas Gleixner wrote:
> > Some drivers need an extra set of interrupts which should not be marked
> > managed, but should get initial interrupt spreading.
> 
> Could you share the drivers and their use case?

You were Cc'ed on that old discussion:

 https://lkml.kernel.org/r/300d6fef733ca76ced581f8c6304b...@mail.gmail.com

> > For both interrupt sets the interrupts are properly spread out, but the
> > second set is not marked managed.
> 
> Given drivers only care the managed vs non-managed interrupt numbers,
> just wondering why this case can't be covered by .pre_vectors &
> .post_vectors?

Well, yes, but post/pre are not subject to spreading and I really don't
want to go there.

> Also this kind of usage may break blk-mq easily, in which the following
> rule needs to be respected:
> 
> 1) all CPUs are required to spread among each interrupt set
> 
> 2) no any CPU is shared between two IRQs in same set.

I don't see how that would break blk-mq. The unmanaged set is not used by
the blk-mq stuff, that's some driver internal voodoo. So blk-mq still gets
a perfectly spread and managed interrupt set for the queues.

> > for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
> > -   unsigned int this_vecs = affd->set_size[i];
> > +   bool managed = affd->unmanaged_sets & (1U << i) ? true : false;
> 
> The above check is inverted.

Doh. Stupid me.

Thanks,

tglx


Re: [patch v6 7/7] genirq/affinity: Add support for non-managed affinity sets

2019-02-17 Thread Thomas Gleixner
Ming,

On Mon, 18 Feb 2019, Ming Lei wrote:
> On Sun, Feb 17, 2019 at 08:17:05PM +0100, Thomas Gleixner wrote:
> > I don't see how that would break blk-mq. The unmanaged set is not used by
> > the blk-mq stuff, that's some driver internal voodoo. So blk-mq still gets
> > a perfectly spread and managed interrupt set for the queues.
> 
> From the discussion above, the use case is for megaraid_sas. And one of the
> two interrupt sets(managed and non-managed) will be chosen according to
> workloads runtime.
> 
> Each interrupt set actually defines one blk-mq queue mapping, and the
> queue mapping needs to respect the rule I mentioned now. However,
> non-managed affinity can be changed to any way anytime by user-space.
> 
> Recently HPSA tried to add one module parameter to use non-managed
> IRQ[1].
> 
> Also NVMe RDMA uses non-managed interrupts, and at least one CPU hotplug
> issue is never fixed yet[2]. 
> 
> [1] https://marc.info/?t=15438766521&r=1&w=2
> [2] https://www.spinics.net/lists/linux-block/msg24140.html

Interesting. I misread the description which the megasas folks provided
then. I'll drop that patch and get the other lot merged. Thanks for your
work and help with that!

Thanks,

tglx


Re: [PATCH V3 1/5] genirq/affinity: don't mark 'affd' as const

2019-02-18 Thread Thomas Gleixner
On Tue, 19 Feb 2019, 陈华才 wrote:
> 
> I've tested, this patch can fix the nvme problem, but it can't be applied
> to 4.19 because of different context. And, I still think my original solution
> (genirq/affinity: Assign default affinity to pre/post vectors) is correct.
> There may be similar problems except nvme.

As I explained to you at great length, it is not correct because it's just
papering over the underlying problem.

Please stop advertising semantically broken duct tape solutions.

Thanks,

tglx

Re: [PATCH v3 4/5] genirq/affinity: allow driver's discontigous affinity set

2019-06-24 Thread Thomas Gleixner
On Mon, 24 Jun 2019, Weiping Zhang wrote:

> The driver may implement multiple affinity set, and some of
> are empty, for this case we just skip them.

Why? What's the point of creating empty sets? 'Just because' is not a really
good justification.

Leaving the patch for Ming.

Thanks,

tglx

> Signed-off-by: Weiping Zhang 
> ---
>  kernel/irq/affinity.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index f18cd5aa33e8..6d964fe0fbd8 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -295,6 +295,10 @@ irq_create_affinity_masks(unsigned int nvecs, struct 
> irq_affinity *affd)
>   unsigned int this_vecs = affd->set_size[i];
>   int ret;
>  
> + /* skip empty affinity set */
> + if (this_vecs == 0)
> + continue;
> +
>   ret = irq_build_affinity_masks(affd, curvec, this_vecs,
>  curvec, masks);
>   if (ret) {
> -- 
> 2.14.1
> 
> 


Re: [PATCH v3 4/5] genirq/affinity: allow driver's discontigous affinity set

2019-06-24 Thread Thomas Gleixner
Ming,

On Tue, 25 Jun 2019, Ming Lei wrote:
> On Mon, Jun 24, 2019 at 05:42:39PM +0200, Thomas Gleixner wrote:
> > On Mon, 24 Jun 2019, Weiping Zhang wrote:
> > 
> > > The driver may implement multiple affinity set, and some of
> > > are empty, for this case we just skip them.
> > 
> > Why? What's the point of creating empty sets? 'Just because' is not a really
> > good justification.
> 
> Patch 5 will add 4 new sets for supporting NVMe's weighted round robin
> arbitration. It can be a headache to manage so many irq sets(now the total
> sets can become 6) dynamically since size of anyone in the new 4 sets can
> be zero, so each particular set is assigned one static index for avoiding
> the management trouble, then empty set will be seen by
> irq_create_affinity_masks().
> 
> So looks skipping the empty set makes sense because the API will become
> easier to use than before.

That makes sense, but at least some of that information wants to be in the
change log and not some uninformative description of what the patch does.

I was not Cc'ed on the rest of the patches so I had exactly zero context.

Thanks,

tglx


Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs

2018-10-29 Thread Thomas Gleixner
Jens,

On Mon, 29 Oct 2018, Jens Axboe wrote:

> A driver may have a need to allocate multiple sets of MSI/MSI-X
> interrupts, and have them appropriately affinitized. Add support for
> defining a number of sets in the irq_affinity structure, of varying
> sizes, and get each set affinitized correctly across the machine.
> 
> Cc: Thomas Gleixner 
> Cc: linux-ker...@vger.kernel.org
> Reviewed-by: Hannes Reinecke 
> Signed-off-by: Jens Axboe 

This looks good.

Vs. merge logistics: I'm expecting some other changes in that area as per
discussion with megasas (IIRC) folks. So I'd like to apply that myself
right after -rc1 and provide it to you as a single commit to pull from so
we can avoid collisions in next and the merge window.

Thanks,

tglx


Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs

2018-10-30 Thread Thomas Gleixner
Jens,

On Tue, 30 Oct 2018, Jens Axboe wrote:
> On 10/30/18 10:02 AM, Keith Busch wrote:
> > pci_alloc_irq_vectors_affinity() starts at the provided max_vecs. If
> > that doesn't work, it will iterate down to min_vecs without returning to
> > the caller. The caller doesn't have a chance to adjust its sets between
> > iterations when you provide a range.
> > 
> > The 'masks' overrun problem happens if the caller provides min_vecs
> > as a smaller value than the sum of the set (plus any reserved).
> > 
> > If it's up to the caller to ensure that doesn't happen, then min and
> > max must both be the same value, and that value must also be the same as
> > the set sum + reserved vectors. The range just becomes redundant since
> > it is already bounded by the set.
> > 
> > Using the nvme example, it would need something like this to prevent the
> > 'masks' overrun:
> 
> OK, now I hear what you are saying. And you are right, the callers needs
> to provide minvec == maxvec for sets, and then have a loop around that
> to adjust as needed.

But then we should enforce it in the core code, right?

Thanks,

tglx


Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs

2018-10-30 Thread Thomas Gleixner
On Tue, 30 Oct 2018, Jens Axboe wrote:
> On 10/30/18 11:25 AM, Thomas Gleixner wrote:
> > Jens,
> > 
> > On Tue, 30 Oct 2018, Jens Axboe wrote:
> >> On 10/30/18 10:02 AM, Keith Busch wrote:
> >>> pci_alloc_irq_vectors_affinity() starts at the provided max_vecs. If
> >>> that doesn't work, it will iterate down to min_vecs without returning to
> >>> the caller. The caller doesn't have a chance to adjust its sets between
> >>> iterations when you provide a range.
> >>>
> >>> The 'masks' overrun problem happens if the caller provides min_vecs
> >>> as a smaller value than the sum of the set (plus any reserved).
> >>>
> >>> If it's up to the caller to ensure that doesn't happen, then min and
> >>> max must both be the same value, and that value must also be the same as
> >>> the set sum + reserved vectors. The range just becomes redundant since
> >>> it is already bounded by the set.
> >>>
> >>> Using the nvme example, it would need something like this to prevent the
> >>> 'masks' overrun:
> >>
> >> OK, now I hear what you are saying. And you are right, the callers needs
> >> to provide minvec == maxvec for sets, and then have a loop around that
> >> to adjust as needed.
> > 
> > But then we should enforce it in the core code, right?
> 
> Yes, I was going to ask you if you want a followup patch for that, or
> an updated version of the original?

Updated combo patch would be nice :)

Thanks

lazytglx


Re: [PATCH 0/4] irq: fix support for allocating sets of IRQs

2018-11-04 Thread Thomas Gleixner
Jens,

On Sat, 3 Nov 2018, Jens Axboe wrote:

> On 11/2/18 8:59 AM, Ming Lei wrote:
> > Hi Jens,
> > 
> > As I mentioned, there are at least two issues in the patch of '
> > irq: add support for allocating (and affinitizing) sets of IRQs':
> > 
> > 1) it is wrong to pass 'mask + usedvec' to irq_build_affinity_masks()
> > 
> > 2) we should spread all possible CPUs in 2-stage way on each set of IRQs
> > 
> > The fix isn't trivial, and I introduce two extra patches as preparation,
> > then the implementation can be more clean.
> > 
> > The patchset is against mq-maps branch of block tree, feel free to
> > integrate into the whole patchset of multiple queue maps.
> 
> Thanks Ming, I ran this through my testing, and I end up with the
> same maps and affinities for all the cases I cared about. I'm going
> to drop my initial version, and add the three.

So I assume that I can pick up Ming's series instead.

There is another patch pending affecting the irq affinity spreading. Can
you folks please have a look at it?

https://lkml.kernel.org/r/20181102180248.13583-1-lon...@linuxonhyperv.com

Thanks,

tglx




Re: [PATCH 0/4] irq: fix support for allocating sets of IRQs

2018-11-04 Thread Thomas Gleixner
On Sun, 4 Nov 2018, Jens Axboe wrote:

Cc'ing Long with a hopefully working E-Mail address. The previous one
bounced because I stupidly copied the wrong one...

> On 11/4/18 5:02 AM, Thomas Gleixner wrote:
> > Jens,
> > 
> > On Sat, 3 Nov 2018, Jens Axboe wrote:
> > 
> >> On 11/2/18 8:59 AM, Ming Lei wrote:
> >>> Hi Jens,
> >>>
> >>> As I mentioned, there are at least two issues in the patch of '
> >>> irq: add support for allocating (and affinitizing) sets of IRQs':
> >>>
> >>> 1) it is wrong to pass 'mask + usedvec' to irq_build_affinity_masks()
> >>>
> >>> 2) we should spread all possible CPUs in 2-stage way on each set of IRQs
> >>>
> >>> The fix isn't trivial, and I introduce two extra patches as preparation,
> >>> then the implementation can be more clean.
> >>>
> >>> The patchset is against mq-maps branch of block tree, feel free to
> >>> integrate into the whole patchset of multiple queue maps.
> >>
> >> Thanks Ming, I ran this through my testing, and I end up with the
> >> same maps and affinities for all the cases I cared about. I'm going
> >> to drop my initial version, and add the three.
> > 
> > So I assume that I can pick up Ming's series instead.
> 
> Yes, let's do that.
> 
> > There is another patch pending affecting the irq affinity spreading. Can
> > you folks please have a look at it?
> > 
> > https://lkml.kernel.org/r/20181102180248.13583-1-lon...@linuxonhyperv.com
> 
> I'll give that a look and test. Thanks for the heads-up.
> 
> -- 
> Jens Axboe
> 
> 

