[PATCH v2] bcache: move closure debug file into debug directory

2018-03-04 Thread Chengguang Xu
In the current code the closure debug file is created outside of the
debug directory, and there is no matching removal when the module is
unloaded, so creation fails when trying to reload the module.

This patch moves the closure debug file into the "bcache" debug
directory so that the file is removed properly when the directory is.
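
For context, here is a minimal sketch (illustrative only; the made-up
example_* names stand in for the real symbols in debug.c and closure.c)
of the debugfs create-under-parent pattern this patch relies on:

#include <linux/debugfs.h>
#include <linux/err.h>

static struct dentry *bcache_dir;	/* the "bcache" debug directory */

static int __init example_debug_init(const struct file_operations *fops)
{
	bcache_dir = debugfs_create_dir("bcache", NULL);
	if (IS_ERR_OR_NULL(bcache_dir))
		return -ENODEV;

	/* The file now inherits the directory's lifetime. */
	debugfs_create_file("closures", 0400, bcache_dir, NULL, fops);
	return 0;
}

static void example_debug_exit(void)
{
	/* One call removes the directory and everything under it. */
	debugfs_remove_recursive(bcache_dir);
}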

Signed-off-by: Chengguang Xu 
---
Changes since v1:
- Move closure debug file into "bcache" debug directory instead of
deleting it individually.
- Change Signed-off-by mail address.

 drivers/md/bcache/closure.c | 9 +++++----
 drivers/md/bcache/closure.h | 5 +++--
 drivers/md/bcache/debug.c   | 2 +-
 drivers/md/bcache/super.c   | 3 +--
 4 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
index 7f12920..64b123c 100644
--- a/drivers/md/bcache/closure.c
+++ b/drivers/md/bcache/closure.c
@@ -157,7 +157,7 @@ void closure_debug_destroy(struct closure *cl)
 }
 EXPORT_SYMBOL(closure_debug_destroy);
 
-static struct dentry *debug;
+static struct dentry *closure_debug;
 
 static int debug_seq_show(struct seq_file *f, void *data)
 {
@@ -199,11 +199,12 @@ static int debug_seq_open(struct inode *inode, struct file *file)
.release= single_release
 };
 
-void __init closure_debug_init(void)
+int __init closure_debug_init(void)
 {
-	debug = debugfs_create_file("closures", 0400, NULL, NULL, &debug_ops);
+   closure_debug = debugfs_create_file("closures",
+			0400, debug, NULL, &debug_ops);
+   return IS_ERR_OR_NULL(closure_debug);
 }
-
 #endif
 
 MODULE_AUTHOR("Kent Overstreet ");
diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h
index 3b9dfc9..0fb704d 100644
--- a/drivers/md/bcache/closure.h
+++ b/drivers/md/bcache/closure.h
@@ -105,6 +105,7 @@
 struct closure;
 struct closure_syncer;
 typedef void (closure_fn) (struct closure *);
+extern struct dentry *debug;
 
 struct closure_waitlist {
struct llist_head   list;
@@ -185,13 +186,13 @@ static inline void closure_sync(struct closure *cl)
 
 #ifdef CONFIG_BCACHE_CLOSURES_DEBUG
 
-void closure_debug_init(void);
+int closure_debug_init(void);
 void closure_debug_create(struct closure *cl);
 void closure_debug_destroy(struct closure *cl);
 
 #else
 
-static inline void closure_debug_init(void) {}
+static inline int closure_debug_init(void) { return 0; }
 static inline void closure_debug_create(struct closure *cl) {}
 static inline void closure_debug_destroy(struct closure *cl) {}
 
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index af89408..5db02de 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -17,7 +17,7 @@
 #include <linux/random.h>
 #include <linux/seq_file.h>
 
-static struct dentry *debug;
+struct dentry *debug;
 
 #ifdef CONFIG_BCACHE_DEBUG
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 1a9fdab..b784292 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2133,7 +2133,6 @@ static int __init bcache_init(void)
	mutex_init(&bch_register_lock);
	init_waitqueue_head(&unregister_wait);
	register_reboot_notifier(&reboot);
-   closure_debug_init();
 
bcache_major = register_blkdev(0, "bcache");
if (bcache_major < 0) {
@@ -2145,7 +2144,7 @@ static int __init bcache_init(void)
if (!(bcache_wq = alloc_workqueue("bcache", WQ_MEM_RECLAIM, 0)) ||
!(bcache_kobj = kobject_create_and_add("bcache", fs_kobj)) ||
bch_request_init() ||
-   bch_debug_init(bcache_kobj) ||
+   bch_debug_init(bcache_kobj) || closure_debug_init() ||
sysfs_create_files(bcache_kobj, files))
goto err;
 
-- 
1.8.3.1



RE: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue

2018-03-04 Thread Kashyap Desai
> -Original Message-
> From: Laurence Oberman [mailto:lober...@redhat.com]
> Sent: Saturday, March 3, 2018 3:23 AM
> To: Don Brace; Ming Lei
> Cc: Jens Axboe; linux-block@vger.kernel.org; Christoph Hellwig; Mike
> Snitzer;
> linux-s...@vger.kernel.org; Hannes Reinecke; Arun Easi; Omar Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Kashyap Desai;
> Peter
> Rivera; Meelis Roos
> Subject: Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue
>
> On Fri, 2018-03-02 at 15:03 +, Don Brace wrote:
> > > -Original Message-
> > > From: Laurence Oberman [mailto:lober...@redhat.com]
> > > Sent: Friday, March 02, 2018 8:09 AM
> > > To: Ming Lei 
> > > Cc: Don Brace ; Jens Axboe  > > k>;
> > > linux-block@vger.kernel.org; Christoph Hellwig ;
> > > Mike Snitzer ; linux-s...@vger.kernel.org;
> > > Hannes Reinecke ; Arun Easi ;
> > > Omar Sandoval ; Martin K . Petersen
> > > ; James Bottomley
> > > ; Christoph Hellwig
> > > ; Kashyap Desai ; Peter
> > > Rivera ; Meelis Roos 
> > > Subject: Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue
> > >
> > > EXTERNAL EMAIL
> > >
> > >
> > > On Fri, 2018-03-02 at 10:16 +0800, Ming Lei wrote:
> > > > On Thu, Mar 01, 2018 at 04:19:34PM -0500, Laurence Oberman wrote:
> > > > > On Thu, 2018-03-01 at 14:01 -0500, Laurence Oberman wrote:
> > > > > > On Thu, 2018-03-01 at 16:18 +, Don Brace wrote:
> > > > > > > > -Original Message-
> > > > > > > > From: Ming Lei [mailto:ming@redhat.com]
> > > > > > > > Sent: Tuesday, February 27, 2018 4:08 AM
> > > > > > > > To: Jens Axboe ; linux-block@vger.kernel
> > > > > > > > .org ; Christoph Hellwig ; Mike Snitzer
> > > > > > > >  > > > > > > > >
> > > > > > > >
> > > > > > > > Cc: linux-s...@vger.kernel.org; Hannes Reinecke  > > > > > > > e.de
> > > > > > > > > ;
> > > > > > > >
> > > > > > > > Arun Easi
> > > > > > > > ; Omar Sandoval ;
> > > > > > > > Martin K .
> > > > > > > > Petersen ; James Bottomley
> > > > > > > > ; Christoph Hellwig
> > > > > > > > ; Don Brace ;
> > > > > > > > Kashyap Desai ; Peter Rivera
> > > > > > > >  > > > > > > > om>;
> > > > > > > > Laurence Oberman ; Ming Lei
> > > > > > > > ; Meelis Roos 
> > > > > > > > Subject: [PATCH V3 1/8] scsi: hpsa: fix selection of reply
> > > > > > > > queue
> > > > > > > >
> > > >
> > > > Seems Don ran into an IO failure without blk-mq; could you run your
> > > > tests again in legacy mode?
> > > >
> > > > Thanks,
> > > > Ming
> > >
> > > Hello Ming
> > > I ran multiple passes on Legacy and still see no issues in my test
> > > bed
> > >
> > > BOOT_IMAGE=/vmlinuz-4.16.0-rc2.ming+ root=UUID=43f86d71-b1bf-4789-
> > > a28e-
> > > 21c6ddc90195 ro crashkernel=256M@64M log_buf_len=64M
> > > console=ttyS1,115200n8
> > >
> > > HEAD of the git kernel I am using
> > >
> > > 694e16f scsi: megaraid: improve scsi_mq performance via .host_tagset
> > > 793686c scsi: hpsa: improve scsi_mq performance via .host_tagset
> > > 60d5b36 block: null_blk: introduce module parameter of 'g_host_tags'
> > > 8847067 scsi: Add template flag 'host_tagset'
> > > a8fbdd6 blk-mq: introduce BLK_MQ_F_HOST_TAGS 4710fab blk-mq:
> > > introduce 'start_tag' field to 'struct blk_mq_tags'
> > > 09bb153 scsi: megaraid_sas: fix selection of reply queue
> > > 52700d8 scsi: hpsa: fix selection of reply queue
> >
> > I checked out Linus's tree (4.16.0-rc3+) and re-applied the above
> > patches, and have been running for 24 hours with no issues.
> > Evidently my forked copy was corrupted.
> >
> > So, my I/O testing has gone well.
> >
> > I'll run some performance numbers next.
> >
> > Thanks,
> > Don
>
> Unless Kashyap is unhappy, we need to consider getting this in to
> Linus now, because we are seeing HPE servers that keep hanging with
> the original commit now upstream.
>
> Kashyap, are you good with the v3 patchset, or still concerned about
> performance? I was getting pretty good IOPS/sec to individual SSD
> drives set up as JBOD devices on the megaraid_sas.

Laurence -
Did you find a difference with/without the patch? What were the IOPS
numbers with and without the patch?
This is not an urgent feature, so I would like to take some time to get
BRCM's performance team involved, do a full analysis of the performance
runs, and find the pros/cons.

Kashyap
>
> With larger I/O sizes like 1MB I was getting good MB/sec and not seeing a
> measurable performance impact.
>

Re: [PATCH] bcache: remove closure debug file when unloading module

2018-03-04 Thread Chengguang Xu


> On 2 Mar 2018, at 14:34, tang.jun...@zte.com.cn wrote:
> 
> From: Tang Junhui 
> 
> Hello Chengguang
> 
>> When unloading the bcache module there is no removal operation
>> for the closure debug file, so creation fails when trying to
>> reload the module.
>> 
> 
> Yes, this issue is real.
> Actually, the original code tries to remove the closure debug file
> via bch_debug_exit(), which removes all the debug files in the bcache
> directory; the closure debug file is expected to be one of the files
> in the bcache debug directory.
> 
> But in the current code, closure_debug_init() is called to create the
> closure debug file before the bcache debug directory is created in
> bch_debug_init(), so the closure debug file is created outside the
> bcache directory. Then when bch_debug_exit() is called, the bcache
> directory is removed, but the closure debug file is not.
> 
> So the best way to resolve this issue is not to remove the closure
> debug file separately, but to place the closure debug file under the
> bcache directory in debugfs.

Yes, that looks better; I'll modify it per your suggestion in v2. Thanks
for your review.

> 
>> This fix introduces closure_debug_exit to handle removing
>> operation properly.
>> 
>> Signed-off-by: Chengguang Xu 
>> ---
>> drivers/md/bcache/closure.c | 5 +++++
>> drivers/md/bcache/closure.h | 2 ++
>> drivers/md/bcache/super.c   | 2 ++
>> 3 files changed, 9 insertions(+)
>> 
>> diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
>> index 7f12920..8fcd737 100644
>> --- a/drivers/md/bcache/closure.c
>> +++ b/drivers/md/bcache/closure.c
>> @@ -204,6 +204,11 @@ void __init closure_debug_init(void)
>>	debug = debugfs_create_file("closures", 0400, NULL, NULL, &debug_ops);
>> }
>> 
>> +void closure_debug_exit(void)
>> +{
>> +debugfs_remove(debug);
>> +}
>> +
>> #endif
>> 
>> MODULE_AUTHOR("Kent Overstreet ");
>> diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h
>> index 3b9dfc9..1aa0f7e 100644
>> --- a/drivers/md/bcache/closure.h
>> +++ b/drivers/md/bcache/closure.h
>> @@ -186,12 +186,14 @@ static inline void closure_sync(struct closure *cl)
>> #ifdef CONFIG_BCACHE_CLOSURES_DEBUG
>> 
>> void closure_debug_init(void);
>> +void closure_debug_exit(void);
>> void closure_debug_create(struct closure *cl);
>> void closure_debug_destroy(struct closure *cl);
>> 
>> #else
>> 
>> static inline void closure_debug_init(void) {}
>> +static inline void closure_debug_exit(void) {}
>> static inline void closure_debug_create(struct closure *cl) {}
>> static inline void closure_debug_destroy(struct closure *cl) {}
>> 
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index 1a9fdab..38e2e21 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -2118,6 +2118,7 @@ static void bcache_exit(void)
>>	destroy_workqueue(bcache_wq);
>>	if (bcache_major)
>>		unregister_blkdev(bcache_major, "bcache");
>> +	closure_debug_exit();
>>	unregister_reboot_notifier(&reboot);
>>	mutex_destroy(&bch_register_lock);
>> }
>> @@ -2137,6 +2138,7 @@ static int __init bcache_init(void)
>> 
>>	bcache_major = register_blkdev(0, "bcache");
>>	if (bcache_major < 0) {
>> +		closure_debug_exit();
>>		unregister_reboot_notifier(&reboot);
>>		mutex_destroy(&bch_register_lock);
>>		return bcache_major;
>> -- 
>> 1.8.3.1
> 
> Thanks
> Tang Junhui



[PATCH V2 3/5] genirq/affinity: move actual irq vector spread into one helper

2018-03-04 Thread Ming Lei
No functional change; this just prepares for converting to a 2-stage
irq vector spread.

Cc: Thomas Gleixner 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 kernel/irq/affinity.c | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------------------------
 1 file changed, 55 insertions(+), 42 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 9f49d6ef0dc8..256adf92ec62 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -94,50 +94,19 @@ static int get_nodes_in_cpumask(const cpumask_var_t *node_to_cpumask,
return nodes;
 }
 
-/**
- * irq_create_affinity_masks - Create affinity masks for multiqueue spreading
- * @nvecs: The total number of vectors
- * @affd:  Description of the affinity requirements
- *
- * Returns the masks pointer or NULL if allocation failed.
- */
-struct cpumask *
-irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
+static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
+   const cpumask_var_t *node_to_cpumask,
+   const struct cpumask *cpu_mask,
+   struct cpumask *nmsk,
+   struct cpumask *masks)
 {
-   int n, nodes, cpus_per_vec, extra_vecs, curvec;
int affv = nvecs - affd->pre_vectors - affd->post_vectors;
int last_affv = affv + affd->pre_vectors;
+   int curvec = affd->pre_vectors;
nodemask_t nodemsk = NODE_MASK_NONE;
-   struct cpumask *masks;
-   cpumask_var_t nmsk, *node_to_cpumask;
-
-   /*
-* If there aren't any vectors left after applying the pre/post
-* vectors don't bother with assigning affinity.
-*/
-   if (!affv)
-   return NULL;
-
-	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
-   return NULL;
-
-   masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
-   if (!masks)
-   goto out;
+   int n, nodes, cpus_per_vec, extra_vecs;
 
-   node_to_cpumask = alloc_node_to_cpumask();
-   if (!node_to_cpumask)
-   goto out;
-
-   /* Fill out vectors at the beginning that don't need affinity */
-   for (curvec = 0; curvec < affd->pre_vectors; curvec++)
-   cpumask_copy(masks + curvec, irq_default_affinity);
-
-   /* Stabilize the cpumasks */
-   get_online_cpus();
-   build_node_to_cpumask(node_to_cpumask);
-   nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_possible_mask,
-				     &nodemsk);
+	nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);
 
/*
 * If the number of nodes in the mask is greater than or equal the
@@ -150,7 +119,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
if (++curvec == last_affv)
break;
}
-   goto done;
+   goto out;
}
 
for_each_node_mask(n, nodemsk) {
@@ -160,7 +129,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
 
/* Get the cpus on this node which are in the mask */
-   cpumask_and(nmsk, cpu_possible_mask, node_to_cpumask[n]);
+   cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);
 
/* Calculate the number of cpus per vector */
ncpus = cpumask_weight(nmsk);
@@ -186,7 +155,51 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
--nodes;
}
 
-done:
+out:
+   return curvec - affd->pre_vectors;
+}
+
+/**
+ * irq_create_affinity_masks - Create affinity masks for multiqueue spreading
+ * @nvecs: The total number of vectors
+ * @affd:  Description of the affinity requirements
+ *
+ * Returns the masks pointer or NULL if allocation failed.
+ */
+struct cpumask *
+irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
+{
+   int curvec;
+   struct cpumask *masks;
+   cpumask_var_t nmsk, *node_to_cpumask;
+
+   /*
+* If there aren't any vectors left after applying the pre/post
+* vectors don't bother with assigning affinity.
+*/
+   if (nvecs == affd->pre_vectors + affd->post_vectors)
+   return NULL;
+
+	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
+   return NULL;
+
+   masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
+   if (!masks)
+   goto out;
+
+   node_to_cpumask = alloc_node_to_cpumask();
+   if (!node_to_cpumask)
+   goto out;
+
+   /* Fill out vectors at the beginning that don't need affinity */
+   for (curvec = 0; curvec < affd->pre_vectors; curvec++)
+   cpumask_copy(masks + curvec, irq_default_affinity);
+
+   /* Stabilize the cpumasks */
+   

[PATCH V2 2/5] genirq/affinity: mark 'node_to_cpumask' as const for get_nodes_in_cpumask()

2018-03-04 Thread Ming Lei
Inside irq_create_affinity_masks(), once 'node_to_cpumask' is created,
it is accessed read-only, so mark it as const for
get_nodes_in_cpumask().

Cc: Thomas Gleixner 
Cc: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 kernel/irq/affinity.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 4b1c4763212d..9f49d6ef0dc8 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -79,7 +79,7 @@ static void build_node_to_cpumask(cpumask_var_t *masks)
cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
 }
 
-static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
+static int get_nodes_in_cpumask(const cpumask_var_t *node_to_cpumask,
const struct cpumask *mask, nodemask_t *nodemsk)
 {
int n, nodes = 0;
-- 
2.9.5



[PATCH V2 4/5] genirq/affinity: support to do irq vectors spread starting from any vector

2018-03-04 Thread Ming Lei
Two parameters (start_vec, affv) are introduced to
irq_build_affinity_masks() so that this helper can build the affinity
of each irq vector starting from the vector 'start_vec' and handle at
most 'affv' vectors.

This is required to do the 2-stage irq vector spread among all
possible CPUs.
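
As a rough userspace model (assumption: simplified plain C, not the
kernel code) of the wrap-around this introduces: each node gets the
next vector, and curvec wraps back to pre_vectors once it reaches
last_affv, so at most 'affv' vectors are handled:

static int spread_one_per_node(int start_vec, int affv, int pre_vectors,
			       int nnodes, int *vec_of_node)
{
	int last_affv = affv + pre_vectors;
	int curvec = start_vec, done = 0, n;

	for (n = 0; n < nnodes; n++) {
		vec_of_node[n] = curvec;
		if (++done == affv)
			break;
		if (++curvec == last_affv)
			curvec = pre_vectors;	/* wrap, as in this patch */
	}
	return done;	/* number of vectors actually handled */
}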

Cc: Thomas Gleixner 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 kernel/irq/affinity.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 256adf92ec62..a8c5d07890a6 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -94,17 +94,17 @@ static int get_nodes_in_cpumask(const cpumask_var_t *node_to_cpumask,
return nodes;
 }
 
-static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
+static int irq_build_affinity_masks(const struct irq_affinity *affd,
+   const int start_vec, const int affv,
const cpumask_var_t *node_to_cpumask,
const struct cpumask *cpu_mask,
struct cpumask *nmsk,
struct cpumask *masks)
 {
-   int affv = nvecs - affd->pre_vectors - affd->post_vectors;
int last_affv = affv + affd->pre_vectors;
-   int curvec = affd->pre_vectors;
+   int curvec = start_vec;
nodemask_t nodemsk = NODE_MASK_NONE;
-   int n, nodes, cpus_per_vec, extra_vecs;
+   int n, nodes, cpus_per_vec, extra_vecs, done = 0;
 
	nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);
 
@@ -116,8 +116,10 @@ static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
for_each_node_mask(n, nodemsk) {
		cpumask_copy(masks + curvec, node_to_cpumask[n]);
-   if (++curvec == last_affv)
+   if (++done == affv)
break;
+   if (++curvec == last_affv)
+   curvec = affd->pre_vectors;
}
goto out;
}
@@ -150,13 +152,16 @@ static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
irq_spread_init_one(masks + curvec, nmsk, cpus_per_vec);
}
 
-   if (curvec >= last_affv)
+   done += v;
+   if (done >= affv)
break;
+   if (curvec >= last_affv)
+   curvec = affd->pre_vectors;
--nodes;
}
 
 out:
-   return curvec - affd->pre_vectors;
+   return done;
 }
 
 /**
@@ -169,6 +174,7 @@ static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
 struct cpumask *
 irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 {
+   int affv = nvecs - affd->pre_vectors - affd->post_vectors;
int curvec;
struct cpumask *masks;
cpumask_var_t nmsk, *node_to_cpumask;
@@ -198,7 +204,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
/* Stabilize the cpumasks */
get_online_cpus();
build_node_to_cpumask(node_to_cpumask);
-   curvec += irq_build_affinity_masks(nvecs, affd, node_to_cpumask,
+   curvec += irq_build_affinity_masks(affd, curvec, affv,
+  node_to_cpumask,
   cpu_possible_mask, nmsk, masks);
put_online_cpus();
 
-- 
2.9.5



[PATCH V2 5/5] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-03-04 Thread Ming Lei
84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
may cause irq vector assigned to all offline CPUs, and this kind of
assignment may cause much less irq vectors mapped to online CPUs, and
performance may get hurt.

For example, in an 8-core system with CPUs 0~3 online and 4~7
offline/not present, 'lscpu' shows:

[ming@box]$lscpu
Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):4
On-line CPU(s) list:   0-3
Thread(s) per core:1
Core(s) per socket:2
Socket(s): 2
NUMA node(s):  2
...
NUMA node0 CPU(s): 0-3
NUMA node1 CPU(s):
...

For example, one device has 4 queues:

1) before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
irq 39, cpu list 0
irq 40, cpu list 1
irq 41, cpu list 2
irq 42, cpu list 3

2) after 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
irq 39, cpu list 0-2
irq 40, cpu list 3-4,6
irq 41, cpu list 5
irq 42, cpu list 7

3) after applying this patch against V4.15+:
irq 39, cpu list 0,4
irq 40, cpu list 1,6
irq 41, cpu list 2,5
irq 42, cpu list 3,7

This patch tries to spread irq vectors among online CPUs as far as
possible by doing the spread in 2 stages.

The assignment in 3) above isn't the optimal result from a NUMA point
of view, but it maps more irq vectors to online CPUs; given that in
reality one CPU should be enough to handle one irq vector, it is better
to do it this way.

Cc: Thomas Gleixner 
Reviewed-by: Christoph Hellwig 
Reported-by: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 kernel/irq/affinity.c | 35 +++++++++++++++++++++++++++++------
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index a8c5d07890a6..aa2635416fc5 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -106,6 +106,9 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd,
nodemask_t nodemsk = NODE_MASK_NONE;
int n, nodes, cpus_per_vec, extra_vecs, done = 0;
 
+   if (!cpumask_weight(cpu_mask))
+   return 0;
+
	nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);
 
/*
@@ -175,9 +178,9 @@ struct cpumask *
 irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 {
int affv = nvecs - affd->pre_vectors - affd->post_vectors;
-   int curvec;
+   int curvec, vecs_offline, vecs_online;
struct cpumask *masks;
-   cpumask_var_t nmsk, *node_to_cpumask;
+   cpumask_var_t nmsk, cpu_mask, *node_to_cpumask;
 
/*
 * If there aren't any vectors left after applying the pre/post
@@ -193,9 +196,12 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
if (!masks)
goto out;
 
+	if (!alloc_cpumask_var(&cpu_mask, GFP_KERNEL))
+   goto out;
+
node_to_cpumask = alloc_node_to_cpumask();
if (!node_to_cpumask)
-   goto out;
+   goto out_free_cpu_mask;
 
/* Fill out vectors at the beginning that don't need affinity */
for (curvec = 0; curvec < affd->pre_vectors; curvec++)
@@ -204,15 +210,32 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
/* Stabilize the cpumasks */
get_online_cpus();
build_node_to_cpumask(node_to_cpumask);
-   curvec += irq_build_affinity_masks(affd, curvec, affv,
-  node_to_cpumask,
-  cpu_possible_mask, nmsk, masks);
+	/* spread on online CPUs starting from the vector of affd->pre_vectors */
+   vecs_online = irq_build_affinity_masks(affd, curvec, affv,
+  node_to_cpumask,
+  cpu_online_mask, nmsk, masks);
+
+   /* spread on offline CPUs starting from the next vector to be handled */
+   if (vecs_online >= affv)
+   curvec = affd->pre_vectors;
+   else
+   curvec = affd->pre_vectors + vecs_online;
+   cpumask_andnot(cpu_mask, cpu_possible_mask, cpu_online_mask);
+   vecs_offline = irq_build_affinity_masks(affd, curvec, affv,
+   node_to_cpumask,
+   cpu_mask, nmsk, masks);
put_online_cpus();
 
/* Fill out vectors at the end that don't need affinity */
+   if (vecs_online + vecs_offline >= affv)
+   curvec = affv + affd->pre_vectors;
+   else
+   curvec = affd->pre_vectors + vecs_online + vecs_offline;
for (; curvec < nvecs; curvec++)
cpumask_copy(masks + curvec, 

[PATCH V2 0/5] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-03-04 Thread Ming Lei
Hi,

This patchset tries to spread irq vectors among online CPUs as far as
possible, so that we avoid allocating too few irq vectors that have
online CPUs mapped.

For example, in an 8-core system where 4 CPU cores (4~7) are
offline/not present, on a device with 4 queues:

1) before this patchset
irq 39, cpu list 0-2
irq 40, cpu list 3-4,6
irq 41, cpu list 5
irq 42, cpu list 7

2) after this patchset
irq 39, cpu list 0,4
irq 40, cpu list 1,6
irq 41, cpu list 2,5
irq 42, cpu list 3,7

Without this patchset, only two vectors (39 and 40) can be active, but
all 4 irq vectors can be active after applying this patchset.

One disadvantage is that CPUs from different NUMA nodes can be mapped
to the same irq vector. Given that generally one CPU should be enough
to handle one irq vector, it shouldn't be a big deal. The alternative
is to allocate more vectors; otherwise performance can be hurt by the
current assignment.
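
The 2-stage idea can be modeled in plain C (assumption: a simplified
round-robin model; the real code walks NUMA nodes, so exact pairings
differ):

#include <stdio.h>

#define NVECS 4
#define NCPUS 8

int main(void)
{
	int online[NCPUS] = { 1, 1, 1, 1, 0, 0, 0, 0 };
	int vec_of_cpu[NCPUS];
	int curvec = 0, cpu;

	/* Stage 1: spread vectors over online CPUs first. */
	for (cpu = 0; cpu < NCPUS; cpu++)
		if (online[cpu])
			vec_of_cpu[cpu] = curvec++ % NVECS;

	/*
	 * Stage 2: spread the remaining (possible but offline) CPUs
	 * over the same vectors, wrapping around.
	 */
	for (cpu = 0; cpu < NCPUS; cpu++)
		if (!online[cpu])
			vec_of_cpu[cpu] = curvec++ % NVECS;

	for (cpu = 0; cpu < NCPUS; cpu++)
		printf("cpu %d -> irq vector %d\n", cpu, vec_of_cpu[cpu]);
	return 0;
}

Every vector ends up with one online CPU (0~3) plus one offline CPU
(4~7), which matches the shape of assignment 2) above.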

V2:
- address comments from Christoph
- mark irq_build_affinity_masks as static
- move constification of get_nodes_in_cpumask's parameter into one
  prep patch
- add Reviewed-by tag

Thanks
Ming

Ming Lei (5):
  genirq/affinity: rename *node_to_possible_cpumask as *node_to_cpumask
  genirq/affinity: mark 'node_to_cpumask' as const for
get_nodes_in_cpumask()
  genirq/affinity: move actual irq vector spread into one helper
  genirq/affinity: support to do irq vectors spread starting from any
vector
  genirq/affinity: irq vector spread among online CPUs as far as
possible

 kernel/irq/affinity.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------
 1 file changed, 94 insertions(+), 51 deletions(-)

-- 
2.9.5



Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue

2018-03-04 Thread Ming Lei
On Fri, Mar 02, 2018 at 04:53:21PM -0500, Laurence Oberman wrote:
> On Fri, 2018-03-02 at 15:03 +, Don Brace wrote:
> > > -Original Message-
> > > From: Laurence Oberman [mailto:lober...@redhat.com]
> > > Sent: Friday, March 02, 2018 8:09 AM
> > > To: Ming Lei 
> > > Cc: Don Brace ; Jens Axboe  > > k>;
> > > linux-block@vger.kernel.org; Christoph Hellwig ;
> > > Mike
> > > Snitzer ; linux-s...@vger.kernel.org; Hannes
> > > Reinecke
> > > ; Arun Easi ; Omar Sandoval
> > > ; Martin K . Petersen ;
> > > James
> > > Bottomley ; Christoph
> > > Hellwig
> > > ; Kashyap Desai ; Peter
> > > Rivera
> > > ; Meelis Roos 
> > > Subject: Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply
> > > queue
> > > 
> > > EXTERNAL EMAIL
> > > 
> > > 
> > > On Fri, 2018-03-02 at 10:16 +0800, Ming Lei wrote:
> > > > On Thu, Mar 01, 2018 at 04:19:34PM -0500, Laurence Oberman wrote:
> > > > > On Thu, 2018-03-01 at 14:01 -0500, Laurence Oberman wrote:
> > > > > > On Thu, 2018-03-01 at 16:18 +, Don Brace wrote:
> > > > > > > > -Original Message-
> > > > > > > > From: Ming Lei [mailto:ming@redhat.com]
> > > > > > > > Sent: Tuesday, February 27, 2018 4:08 AM
> > > > > > > > To: Jens Axboe ; linux-block@vger.kernel
> > > > > > > > .org
> > > > > > > > ;
> > > > > > > > Christoph
> > > > > > > > Hellwig ; Mike Snitzer  > > > > > > > .com
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Cc: linux-s...@vger.kernel.org; Hannes Reinecke  > > > > > > > e.de
> > > > > > > > > ;
> > > > > > > > 
> > > > > > > > Arun Easi
> > > > > > > > ; Omar Sandoval ;
> > > > > > > > Martin K
> > > > > > > > .
> > > > > > > > Petersen ; James Bottomley
> > > > > > > > ; Christoph
> > > > > > > > Hellwig  > > > > > > > ch@l
> > > > > > > > st
> > > > > > > > .de>;
> > > > > > > > Don Brace ; Kashyap Desai
> > > > > > > > ; Peter Rivera  > > > > > > > broa
> > > > > > > > dcom
> > > > > > > > .c
> > > > > > > > om>;
> > > > > > > > Laurence Oberman ; Ming Lei
> > > > > > > > ; Meelis Roos 
> > > > > > > > Subject: [PATCH V3 1/8] scsi: hpsa: fix selection of
> > > > > > > > reply
> > > > > > > > queue
> > > > > > > > 
> > > > 
> > > > Seems Don ran into an IO failure without blk-mq; could you run
> > > > your tests again in legacy mode?
> > > > 
> > > > Thanks,
> > > > Ming
> > > 
> > > Hello Ming
> > > I ran multiple passes on Legacy and still see no issues in my test
> > > bed
> > > 
> > > BOOT_IMAGE=/vmlinuz-4.16.0-rc2.ming+ root=UUID=43f86d71-b1bf-4789-
> > > a28e-
> > > 21c6ddc90195 ro crashkernel=256M@64M log_buf_len=64M
> > > console=ttyS1,115200n8
> > > 
> > > HEAD of the git kernel I am using
> > > 
> > > 694e16f scsi: megaraid: improve scsi_mq performance via
> > > .host_tagset
> > > 793686c scsi: hpsa: improve scsi_mq performance via .host_tagset
> > > 60d5b36 block: null_blk: introduce module parameter of
> > > 'g_host_tags'
> > > 8847067 scsi: Add template flag 'host_tagset'
> > > a8fbdd6 blk-mq: introduce BLK_MQ_F_HOST_TAGS
> > > 4710fab blk-mq: introduce 'start_tag' field to 'struct blk_mq_tags'
> > > 09bb153 scsi: megaraid_sas: fix selection of reply queue
> > > 52700d8 scsi: hpsa: fix selection of reply queue
> > 
> > I checked out Linus's tree (4.16.0-rc3+) and re-applied the above
> > patches, and have been running for 24 hours with no issues.
> > Evidently my forked copy was corrupted.
> > 
> > So, my I/O testing has gone well. 
> > 
> > I'll run some performance numbers next.
> > 
> > Thanks,
> > Don
> 
> Unless Kashyap is unhappy, we need to consider getting this in to
> Linus now, because we are seeing HPE servers that keep hanging with
> the original commit now upstream.

Hi Martin,

Given that both Don and Laurence have verified that patches 1 and 2
do fix the IO hang, could you consider merging those two first?

Thanks,
Ming


Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-03-04 Thread Oliver
On Thu, Mar 1, 2018 at 10:40 AM, Logan Gunthorpe  wrote:
> Register the CMB buffer as p2pmem and use the appropriate allocation
> functions to create and destroy the IO SQ.
>
> If the CMB supports WDS and RDS, publish it for use as p2p memory
> by other devices.
>
> Signed-off-by: Logan Gunthorpe 
> ---
>  drivers/nvme/host/pci.c | 75 +++++++++++++++++++++++++++++++++++++++++----------------------------------
>  1 file changed, 41 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 73036d2fbbd5..56ca79be8476 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -29,6 +29,7 @@
>  #include <linux/types.h>
>  #include <linux/io-64-nonatomic-lo-hi.h>
>  #include <linux/sed-opal.h>
> +#include <linux/pci-p2pdma.h>
>
>  #include "nvme.h"
>
> @@ -91,9 +92,8 @@ struct nvme_dev {
> struct work_struct remove_work;
> struct mutex shutdown_lock;
> bool subsystem;
> -   void __iomem *cmb;
> -   pci_bus_addr_t cmb_bus_addr;
> u64 cmb_size;
> +   bool cmb_use_sqes;
> u32 cmbsz;
> u32 cmbloc;
> struct nvme_ctrl ctrl;
> @@ -148,7 +148,7 @@ struct nvme_queue {
> struct nvme_dev *dev;
> spinlock_t q_lock;
> struct nvme_command *sq_cmds;
> -   struct nvme_command __iomem *sq_cmds_io;
> +   bool sq_cmds_is_io;
> volatile struct nvme_completion *cqes;
> struct blk_mq_tags **tags;
> dma_addr_t sq_dma_addr;
> @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
>  {
> u16 tail = nvmeq->sq_tail;

> -   if (nvmeq->sq_cmds_io)
> -   memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
> -   else
> -   memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
> +   memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));

Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC
the _toio() variant enforces alignment, does the copy with 4 byte
stores, and has a full barrier after the copy. In comparison our
regular memcpy() does none of those things and may use unaligned and
vector load/stores. For normal (cacheable) memory that is perfectly
fine, but they can cause alignment faults when targeted at MMIO
(cache-inhibited) memory.

I think in this particular case it might be ok since we know SQEs are
aligned to 64 byte boundaries and the copy is too small to use our
vectorised memcpy(). I'll assume we don't need explicit ordering
between writes of SQEs since the existing code doesn't seem to care
unless the doorbell is being rung, so you're probably fine there too.
That said, I still think this is a little bit sketchy and at the very
least you should add a comment explaining what's going on when the CMB
is being used. If someone more familiar with the NVMe driver could
chime in I would appreciate it.
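
For what it's worth, here is a sketch of one way to keep the
distinction explicit (illustrative and untested; the field names follow
this patch's struct nvme_queue):

static void nvme_write_sqe(struct nvme_queue *nvmeq,
			   struct nvme_command *cmd, u16 tail)
{
	if (nvmeq->sq_cmds_is_io)
		/* CMB: aligned 4-byte stores plus a barrier on PPC */
		memcpy_toio((void __iomem *)&nvmeq->sq_cmds[tail],
			    cmd, sizeof(*cmd));
	else
		memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
}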

> if (++tail == nvmeq->q_depth)
> tail = 0;
> @@ -1286,9 +1283,18 @@ static void nvme_free_queue(struct nvme_queue *nvmeq)
>  {
> dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
> (void *)nvmeq->cqes, nvmeq->cq_dma_addr);
> -   if (nvmeq->sq_cmds)
> -   dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
> -   nvmeq->sq_cmds, nvmeq->sq_dma_addr);
> +
> +   if (nvmeq->sq_cmds) {
> +   if (nvmeq->sq_cmds_is_io)
> +   pci_free_p2pmem(to_pci_dev(nvmeq->q_dmadev),
> +   nvmeq->sq_cmds,
> +   SQ_SIZE(nvmeq->q_depth));
> +   else
> +   dma_free_coherent(nvmeq->q_dmadev,
> + SQ_SIZE(nvmeq->q_depth),
> + nvmeq->sq_cmds,
> + nvmeq->sq_dma_addr);
> +   }
>  }
>
>  static void nvme_free_queues(struct nvme_dev *dev, int lowest)
> @@ -1368,12 +1374,21 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int nr_io_queues,
>  static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
> int qid, int depth)
>  {
> -   /* CMB SQEs will be mapped before creation */
> -   if (qid && dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS))
> -   return 0;
> +   struct pci_dev *pdev = to_pci_dev(dev->dev);
> +
> +   if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
> +   nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
> +   nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
> +   nvmeq->sq_cmds);
> +   nvmeq->sq_cmds_is_io = true;
> +   }
> +
> +   if (!nvmeq->sq_cmds) {
> +   nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
> +   &nvmeq->sq_dma_addr, GFP_KERNEL);
> +   nvmeq->sq_cmds_is_io = false;
> +   }
>
> -   

Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir

2018-03-04 Thread Joseph Qi


On 18/3/5 04:23, Tejun Heo wrote:
> Hello, Joseph.
> 
> Sorry about late reply.
> 
> On Wed, Feb 28, 2018 at 02:52:10PM +0800, Joseph Qi wrote:
>> In current code, I'm afraid pd_offline_fn() as well as the rest
>> destruction have to be called together under the same blkcg->lock and
>> q->queue_lock.
>> For example, if we split the pd_offline_fn() and radix_tree_delete()
>> into 2 phases, it may introduce a race between blkcg_deactivate_policy()
>> when exit queue and blkcg_css_free(), which will result in
>> pd_offline_fn() to be called twice.
> 
> So, yeah, the sync scheme around blkg is pretty brittle and we'd need
> some restructuring to separate out blkg offlining and release, but it
> looks like that'd be the right thing to do, no?
> 
Agreed. Apart from the restriction above, I haven't found any other
issues so far. I'll try to fix it the way you suggested and post v3.
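
A possible shape for the split (the names here are illustrative only,
not the actual blkcg API; assumes <linux/kref.h> and <linux/slab.h>):

struct blkg_like {
	struct kref refcount;
	bool offlined;		/* guarded by blkcg->lock + q->queue_lock */
};

static void blkg_like_offline(struct blkg_like *blkg)
{
	/* Phase 1: runs under both locks, so this happens exactly once. */
	if (!blkg->offlined) {
		blkg->offlined = true;
		/* pd_offline_fn() would be called here, exactly once. */
	}
}

/* Phase 2: invoked via kref_put(&blkg->refcount, blkg_like_release). */
static void blkg_like_release(struct kref *ref)
{
	/* Last reference gone; free without holding the locks. */
	kfree(container_of(ref, struct blkg_like, refcount));
}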

Thanks,
Joseph



Re: [PATCH] bcache: don't attach backing with duplicate UUID

2018-03-04 Thread tang . junhui

Hello Mike

I sent this email from my personal mailbox (110950...@qq.com), but it
may have failed, so I am resending it from my office mailbox. Below is
the content of the mail I sent previously.



I am Tang Junhui (tang.jun...@zte.com.cn). This email comes from my
personal mailbox, since I am not in the office today.

> > From: Tang Junhui 
> > 
> > Hello, Mike
> > 
> > This patch looks good, but has some conflicts with this patch:
> > bcache: fix for data collapse after re-attaching an attached device
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=73ac105be390c1de42a2f21643c9778a5e002930
> > Could you modify your fix based on the previous patch?

> That doesn't make sense.  This patch was generated from a current tree
> where it's applied on top of that. (It's based on next when it should
> really be based on Linus's tree, but it doesn't matter for patch
> application because there are no changes in next right now to bcache
> that aren't in Linus's tree.)

Originally, I did not mean merge conflicts but logical conflicts in the
code, since the previous patch adds a new input parameter, set_uuid, to
bch_cached_dev_attach(); if set_uuid is not NULL we use it as the cache
set uuid, otherwise we use dc->sb.set_uuid as the cache set uuid.

But now, having read your patch again, I realize that you did not use
dc->sb.set_uuid but dc->sb.uuid to judge whether the device is a
duplicate backing device, so it's OK with me.

> May I add your reviewed-by so I can send this (and your fix) upstream?
Reviewed-by: Tang Junhui 

Thanks,
Tang Junhui


Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir

2018-03-04 Thread Tejun Heo
Hello, Joseph.

Sorry about late reply.

On Wed, Feb 28, 2018 at 02:52:10PM +0800, Joseph Qi wrote:
> In current code, I'm afraid pd_offline_fn() as well as the rest
> destruction have to be called together under the same blkcg->lock and
> q->queue_lock.
> For example, if we split the pd_offline_fn() and radix_tree_delete()
> into 2 phases, it may introduce a race between blkcg_deactivate_policy()
> when exit queue and blkcg_css_free(), which will result in
> pd_offline_fn() to be called twice.

So, yeah, the sync scheme around blkg is pretty brittle and we'd need
some restructuring to separate out blkg offlining and release, but it
looks like that'd be the right thing to do, no?

Thanks.

-- 
tejun


Re: vgdisplay hang on iSCSI session

2018-03-04 Thread Bart Van Assche
On Sun, 2018-03-04 at 20:01 +0100, Jean-Louis Dupond wrote:
> I'm indeed running CentOS 6 with the Virt SIG kernels. I already
> updated to 4.9.75, but recently hit the problem again.
> 
> The first PID that was in D-state (root 27157  0.0  0.0 127664 5196
> ?  D  06:19  0:00 \_ vgdisplay -c --ignorelockingfailure) had the
> following stack:
> # cat /proc/27157/stack
> [] blk_mq_freeze_queue_wait+0x6f/0xd0
> [] blk_freeze_queue+0x1e/0x30
> [] blk_mq_freeze_queue+0xe/0x10
> [] loop_switch+0x1e/0xd0
> [] lo_release+0x7a/0x80
> [] __blkdev_put+0x1a7/0x200
> [] blkdev_put+0x56/0x140
> [] blkdev_close+0x24/0x30
> [] __fput+0xc8/0x240
> [] fput+0xe/0x10
> [] task_work_run+0x68/0xa0
> [] exit_to_usermode_loop+0xc6/0xd0
> [] do_syscall_64+0x185/0x240
> [] entry_SYSCALL64_slow_path+0x25/0x25
> [] 0x
> 
> Other procs show the following:
> # cat /proc/7803/stack
> [] __blkdev_get+0x6c/0x3f0
> [] blkdev_get+0x5c/0x1c0
> [] blkdev_open+0x62/0x80
> [] do_dentry_open+0x22a/0x340
> [] vfs_open+0x51/0x80
> [] do_last+0x435/0x7a0
> [] path_openat+0x87/0x1c0
> [] do_filp_open+0x85/0xe0
> [] do_sys_open+0x11c/0x210
> [] SyS_open+0x1e/0x20
> [] do_syscall_64+0x7a/0x240
> [] entry_SYSCALL64_slow_path+0x25/0x25
> [] 0x
> 
> An strace hangs again on loop0 open:
> stat("/dev/loop0", {st_mode=S_IFBLK|0660, st_rdev=makedev(7, 0), ...}) = 
> 0
> open("/dev/loop0", O_RDONLY|O_DIRECT|O_NOATIME
> 
> And it seems like a lot is indeed hanging on loop0:
> # cat /sys/block/loop0/mq/0/queued
> 5957

Hello Jean-Louis,

Is the system still in this state? If so, can you provide the output of the
following command (as an attachment):

find /sys/kernel/debug/block/ -type f \! \( -name poll_stat -o -name dispatched -o -name merged -o -name completed \)

Thanks,

Bart.