[PATCH v2] bcache: move closure debug file into debug directory
In the current code the closure debug file is created outside of the debug
directory, and there is no matching removal of the closure debug file when
the module is unloaded, so creating it fails when trying to reload the
module. This patch moves the closure debug file into the "bcache" debug
directory so that the file gets deleted properly.

Signed-off-by: Chengguang Xu
---
Changes since v1:
- Move closure debug file into "bcache" debug directory instead of
  deleting it individually.
- Change Signed-off-by mail address.

 drivers/md/bcache/closure.c | 9 +
 drivers/md/bcache/closure.h | 5 +++--
 drivers/md/bcache/debug.c   | 2 +-
 drivers/md/bcache/super.c   | 3 +--
 4 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
index 7f12920..64b123c 100644
--- a/drivers/md/bcache/closure.c
+++ b/drivers/md/bcache/closure.c
@@ -157,7 +157,7 @@ void closure_debug_destroy(struct closure *cl)
 }
 EXPORT_SYMBOL(closure_debug_destroy);

-static struct dentry *debug;
+static struct dentry *closure_debug;

 static int debug_seq_show(struct seq_file *f, void *data)
 {
@@ -199,11 +199,12 @@ static int debug_seq_open(struct inode *inode, struct file *file)
	.release	= single_release
 };

-void __init closure_debug_init(void)
+int __init closure_debug_init(void)
 {
-	debug = debugfs_create_file("closures", 0400, NULL, NULL, _ops);
+	closure_debug = debugfs_create_file("closures",
+			0400, debug, NULL, _ops);
+	return IS_ERR_OR_NULL(closure_debug);
 }
-
 #endif

 MODULE_AUTHOR("Kent Overstreet ");
diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h
index 3b9dfc9..0fb704d 100644
--- a/drivers/md/bcache/closure.h
+++ b/drivers/md/bcache/closure.h
@@ -105,6 +105,7 @@ struct closure;
 struct closure_syncer;
 typedef void (closure_fn) (struct closure *);
+extern struct dentry *debug;

 struct closure_waitlist {
	struct llist_head list;
@@ -185,13 +186,13 @@ static inline void closure_sync(struct closure *cl)

 #ifdef CONFIG_BCACHE_CLOSURES_DEBUG

-void closure_debug_init(void);
+int closure_debug_init(void);
 void closure_debug_create(struct closure *cl);
 void closure_debug_destroy(struct closure *cl);

 #else

-static inline void closure_debug_init(void) {}
+static inline int closure_debug_init(void) { return 0; }
 static inline void closure_debug_create(struct closure *cl) {}
 static inline void closure_debug_destroy(struct closure *cl) {}

diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index af89408..5db02de 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -17,7 +17,7 @@
 #include
 #include

-static struct dentry *debug;
+struct dentry *debug;

 #ifdef CONFIG_BCACHE_DEBUG

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 1a9fdab..b784292 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2133,7 +2133,6 @@ static int __init bcache_init(void)
	mutex_init(_register_lock);
	init_waitqueue_head(_wait);
	register_reboot_notifier();
-	closure_debug_init();

	bcache_major = register_blkdev(0, "bcache");
	if (bcache_major < 0) {
@@ -2145,7 +2144,7 @@ static int __init bcache_init(void)
	if (!(bcache_wq = alloc_workqueue("bcache", WQ_MEM_RECLAIM, 0)) ||
	    !(bcache_kobj = kobject_create_and_add("bcache", fs_kobj)) ||
	    bch_request_init() ||
-	    bch_debug_init(bcache_kobj) ||
+	    bch_debug_init(bcache_kobj) || closure_debug_init() ||
	    sysfs_create_files(bcache_kobj, files))
		goto err;
--
1.8.3.1
RE: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue
> -Original Message- > From: Laurence Oberman [mailto:lober...@redhat.com] > Sent: Saturday, March 3, 2018 3:23 AM > To: Don Brace; Ming Lei > Cc: Jens Axboe; linux-block@vger.kernel.org; Christoph Hellwig; Mike > Snitzer; > linux-s...@vger.kernel.org; Hannes Reinecke; Arun Easi; Omar Sandoval; > Martin K . Petersen; James Bottomley; Christoph Hellwig; Kashyap Desai; > Peter > Rivera; Meelis Roos > Subject: Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue > > On Fri, 2018-03-02 at 15:03 +, Don Brace wrote: > > > -Original Message- > > > From: Laurence Oberman [mailto:lober...@redhat.com] > > > Sent: Friday, March 02, 2018 8:09 AM > > > To: Ming Lei> > > Cc: Don Brace ; Jens Axboe > > k>; > > > linux-block@vger.kernel.org; Christoph Hellwig ; > > > Mike Snitzer ; linux-s...@vger.kernel.org; > > > Hannes Reinecke ; Arun Easi ; > > > Omar Sandoval ; Martin K . Petersen > > > ; James Bottomley > > > ; Christoph Hellwig > > > ; Kashyap Desai ; Peter > > > Rivera ; Meelis Roos > > > Subject: Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue > > > > > > EXTERNAL EMAIL > > > > > > > > > On Fri, 2018-03-02 at 10:16 +0800, Ming Lei wrote: > > > > On Thu, Mar 01, 2018 at 04:19:34PM -0500, Laurence Oberman wrote: > > > > > On Thu, 2018-03-01 at 14:01 -0500, Laurence Oberman wrote: > > > > > > On Thu, 2018-03-01 at 16:18 +, Don Brace wrote: > > > > > > > > -Original Message- > > > > > > > > From: Ming Lei [mailto:ming@redhat.com] > > > > > > > > Sent: Tuesday, February 27, 2018 4:08 AM > > > > > > > > To: Jens Axboe ; linux-block@vger.kernel > > > > > > > > .org ; Christoph Hellwig ; Mike Snitzer > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Cc: linux-s...@vger.kernel.org; Hannes Reinecke > > > > > > > e.de > > > > > > > > > ; > > > > > > > > > > > > > > > > Arun Easi > > > > > > > > ; Omar Sandoval ; > > > > > > > > Martin K . 
> > > > > > > > Petersen ; James Bottomley > > > > > > > > ; Christoph Hellwig > > > > > > > > ; Don Brace ; > > > > > > > > Kashyap Desai ; Peter Rivera > > > > > > > > > > > > > > > om>; > > > > > > > > Laurence Oberman ; Ming Lei > > > > > > > > ; Meelis Roos > > > > > > > > Subject: [PATCH V3 1/8] scsi: hpsa: fix selection of reply > > > > > > > > queue > > > > > > > > > > > > > > > > Seems Don run into IO failure without blk-mq, could you run your > > > > tests again in legacy mode? > > > > > > > > Thanks, > > > > Ming > > > > > > Hello Ming > > > I ran multiple passes on Legacy and still see no issues in my test > > > bed > > > > > > BOOT_IMAGE=/vmlinuz-4.16.0-rc2.ming+ root=UUID=43f86d71-b1bf-4789- > > > a28e- > > > 21c6ddc90195 ro crashkernel=256M@64M log_buf_len=64M > > > console=ttyS1,115200n8 > > > > > > HEAD of the git kernel I am using > > > > > > 694e16f scsi: megaraid: improve scsi_mq performance via .host_tagset > > > 793686c scsi: hpsa: improve scsi_mq performance via .host_tagset > > > 60d5b36 block: null_blk: introduce module parameter of 'g_host_tags' > > > 8847067 scsi: Add template flag 'host_tagset' > > > a8fbdd6 blk-mq: introduce BLK_MQ_F_HOST_TAGS 4710fab blk-mq: > > > introduce 'start_tag' field to 'struct blk_mq_tags' > > > 09bb153 scsi: megaraid_sas: fix selection of reply queue > > > 52700d8 scsi: hpsa: fix selection of reply queue > > > > I checkout out Linus's tree (4.16.0-rc3+) and re-applied the above > > patches. > > I and have been running 24 hours with no issues. > > Evidently my forked copy was corrupted. > > > > So, my I/O testing has gone well. > > > > I'll run some performance numbers next. > > > > Thanks, > > Don > > Unless Kashyap is not happy we need to consider getting this in to Linus > now > because we are seeing HPE servers that keep hanging now with the original > commit now upstream. > > Kashyap, are you good with the v3 patchset or still concerned with > performance. 
> I was getting pretty good IOPS/sec to individual SSD drives set
> up as jbod devices on the megaraid_sas.

Laurence - Did you find a difference with/without the patch? What was the
IOPS number with and without the patch?

This is not an urgent feature, so I would like to take some time to get
BRCM's performance team involved, do a full analysis of the performance
runs, and find the pros/cons.

Kashyap

> With larger I/O sizes like 1MB I was getting good MB/sec and not seeing a
> measurable performance impact.
Re: [PATCH] bcache: remove closure debug file when unloading module
> On 2 Mar 2018, at 14:34, tang.jun...@zte.com.cn wrote:
>
> From: Tang Junhui
>
> Hello Chengguang
>
>> When unloading the bcache module there is no removal
>> operation for the closure debug file, so recreating it
>> fails when trying to reload the module.
>>
>
> Yes, this issue is real.
> Actually, the original code tries to remove the closure debug file
> via bch_debug_exit(), which removes all the debug files in the
> bcache directory; the closure debug file is expected to be
> one of the debug files in the bcache debug directory.
>
> But in the current code, closure_debug_init() is called to create
> the closure debug file before the bcache debug directory is created
> in bch_debug_init(), so the closure debug file is created outside
> the bcache directory. Then when bch_debug_exit() is called, the
> bcache directory is removed, but the closure debug file is not.
>
> So the best way to resolve this issue is not to remove the
> closure debug file separately, but to place the closure debug file
> under the bcache directory in debugfs.

Yes, that looks better; I'll modify it as you suggest in v2.
Thanks for your review.

>
>> This fix introduces closure_debug_exit to handle the removal
>> properly.
>>
>> Signed-off-by: Chengguang Xu
>> ---
>>  drivers/md/bcache/closure.c | 5 +
>>  drivers/md/bcache/closure.h | 2 ++
>>  drivers/md/bcache/super.c   | 2 ++
>>  3 files changed, 9 insertions(+)
>>
>> diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
>> index 7f12920..8fcd737 100644
>> --- a/drivers/md/bcache/closure.c
>> +++ b/drivers/md/bcache/closure.c
>> @@ -204,6 +204,11 @@ void __init closure_debug_init(void)
>>  	debug = debugfs_create_file("closures", 0400, NULL, NULL, _ops);
>>  }
>>
>> +void closure_debug_exit(void)
>> +{
>> +	debugfs_remove(debug);
>> +}
>> +
>>  #endif
>>
>>  MODULE_AUTHOR("Kent Overstreet ");
>> diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h
>> index 3b9dfc9..1aa0f7e 100644
>> --- a/drivers/md/bcache/closure.h
>> +++ b/drivers/md/bcache/closure.h
>> @@ -186,12 +186,14 @@ static inline void closure_sync(struct closure *cl)
>>  #ifdef CONFIG_BCACHE_CLOSURES_DEBUG
>>
>>  void closure_debug_init(void);
>> +void closure_debug_exit(void);
>>  void closure_debug_create(struct closure *cl);
>>  void closure_debug_destroy(struct closure *cl);
>>
>>  #else
>>
>>  static inline void closure_debug_init(void) {}
>> +static inline void closure_debug_exit(void) {}
>>  static inline void closure_debug_create(struct closure *cl) {}
>>  static inline void closure_debug_destroy(struct closure *cl) {}
>>
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index 1a9fdab..38e2e21 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -2118,6 +2118,7 @@ static void bcache_exit(void)
>>  	destroy_workqueue(bcache_wq);
>>  	if (bcache_major)
>>  		unregister_blkdev(bcache_major, "bcache");
>> +	closure_debug_exit();
>>  	unregister_reboot_notifier();
>>  	mutex_destroy(_register_lock);
>>  }
>> @@ -2137,6 +2138,7 @@ static int __init bcache_init(void)
>>
>>  	bcache_major = register_blkdev(0, "bcache");
>>  	if (bcache_major < 0) {
>> +		closure_debug_exit();
>>  		unregister_reboot_notifier();
>>  		mutex_destroy(_register_lock);
>>  		return bcache_major;
>> --
>> 1.8.3.1

> Thanks
> Tang Junhui
[PATCH V2 3/5] genirq/affinity: move actual irq vector spread into one helper
No functional change, just prepare for converting to 2-stage irq vector spread. Cc: Thomas GleixnerReviewed-by: Christoph Hellwig Signed-off-by: Ming Lei --- kernel/irq/affinity.c | 97 +-- 1 file changed, 55 insertions(+), 42 deletions(-) diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index 9f49d6ef0dc8..256adf92ec62 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -94,50 +94,19 @@ static int get_nodes_in_cpumask(const cpumask_var_t *node_to_cpumask, return nodes; } -/** - * irq_create_affinity_masks - Create affinity masks for multiqueue spreading - * @nvecs: The total number of vectors - * @affd: Description of the affinity requirements - * - * Returns the masks pointer or NULL if allocation failed. - */ -struct cpumask * -irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) +static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd, + const cpumask_var_t *node_to_cpumask, + const struct cpumask *cpu_mask, + struct cpumask *nmsk, + struct cpumask *masks) { - int n, nodes, cpus_per_vec, extra_vecs, curvec; int affv = nvecs - affd->pre_vectors - affd->post_vectors; int last_affv = affv + affd->pre_vectors; + int curvec = affd->pre_vectors; nodemask_t nodemsk = NODE_MASK_NONE; - struct cpumask *masks; - cpumask_var_t nmsk, *node_to_cpumask; - - /* -* If there aren't any vectors left after applying the pre/post -* vectors don't bother with assigning affinity. 
-*/ - if (!affv) - return NULL; - - if (!zalloc_cpumask_var(, GFP_KERNEL)) - return NULL; - - masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL); - if (!masks) - goto out; + int n, nodes, cpus_per_vec, extra_vecs; - node_to_cpumask = alloc_node_to_cpumask(); - if (!node_to_cpumask) - goto out; - - /* Fill out vectors at the beginning that don't need affinity */ - for (curvec = 0; curvec < affd->pre_vectors; curvec++) - cpumask_copy(masks + curvec, irq_default_affinity); - - /* Stabilize the cpumasks */ - get_online_cpus(); - build_node_to_cpumask(node_to_cpumask); - nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_possible_mask, -); + nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, ); /* * If the number of nodes in the mask is greater than or equal the @@ -150,7 +119,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) if (++curvec == last_affv) break; } - goto done; + goto out; } for_each_node_mask(n, nodemsk) { @@ -160,7 +129,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes; /* Get the cpus on this node which are in the mask */ - cpumask_and(nmsk, cpu_possible_mask, node_to_cpumask[n]); + cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]); /* Calculate the number of cpus per vector */ ncpus = cpumask_weight(nmsk); @@ -186,7 +155,51 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) --nodes; } -done: +out: + return curvec - affd->pre_vectors; +} + +/** + * irq_create_affinity_masks - Create affinity masks for multiqueue spreading + * @nvecs: The total number of vectors + * @affd: Description of the affinity requirements + * + * Returns the masks pointer or NULL if allocation failed. 
+ */ +struct cpumask * +irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) +{ + int curvec; + struct cpumask *masks; + cpumask_var_t nmsk, *node_to_cpumask; + + /* +* If there aren't any vectors left after applying the pre/post +* vectors don't bother with assigning affinity. +*/ + if (nvecs == affd->pre_vectors + affd->post_vectors) + return NULL; + + if (!zalloc_cpumask_var(, GFP_KERNEL)) + return NULL; + + masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL); + if (!masks) + goto out; + + node_to_cpumask = alloc_node_to_cpumask(); + if (!node_to_cpumask) + goto out; + + /* Fill out vectors at the beginning that don't need affinity */ + for (curvec = 0; curvec < affd->pre_vectors; curvec++) + cpumask_copy(masks + curvec, irq_default_affinity); + + /* Stabilize the cpumasks */ +
[PATCH V2 2/5] genirq/affinity: mark 'node_to_cpumask' as const for get_nodes_in_cpumask()
Inside irq_create_affinity_masks(), once 'node_to_cpumask' is created it is
accessed read-only, so mark it as const for get_nodes_in_cpumask().

Cc: Thomas Gleixner
Cc: Christoph Hellwig
Signed-off-by: Ming Lei
---
 kernel/irq/affinity.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 4b1c4763212d..9f49d6ef0dc8 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -79,7 +79,7 @@ static void build_node_to_cpumask(cpumask_var_t *masks)
		cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
 }

-static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
+static int get_nodes_in_cpumask(const cpumask_var_t *node_to_cpumask,
				const struct cpumask *mask, nodemask_t *nodemsk)
 {
	int n, nodes = 0;
--
2.9.5
[PATCH V2 4/5] genirq/affinity: support to do irq vectors spread starting from any vector
Now two parameters(start_vec, affv) are introduced to irq_build_affinity_masks(), then this helper can build the affinity of each irq vector starting from the irq vector of 'start_vec', and handle at most 'affv' vectors. This way is required to do 2-stages irq vectors spread among all possible CPUs. Cc: Thomas GleixnerReviewed-by: Christoph Hellwig Signed-off-by: Ming Lei --- kernel/irq/affinity.c | 23 +++ 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index 256adf92ec62..a8c5d07890a6 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -94,17 +94,17 @@ static int get_nodes_in_cpumask(const cpumask_var_t *node_to_cpumask, return nodes; } -static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd, +static int irq_build_affinity_masks(const struct irq_affinity *affd, + const int start_vec, const int affv, const cpumask_var_t *node_to_cpumask, const struct cpumask *cpu_mask, struct cpumask *nmsk, struct cpumask *masks) { - int affv = nvecs - affd->pre_vectors - affd->post_vectors; int last_affv = affv + affd->pre_vectors; - int curvec = affd->pre_vectors; + int curvec = start_vec; nodemask_t nodemsk = NODE_MASK_NONE; - int n, nodes, cpus_per_vec, extra_vecs; + int n, nodes, cpus_per_vec, extra_vecs, done = 0; nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, ); @@ -116,8 +116,10 @@ static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd, for_each_node_mask(n, nodemsk) { cpumask_copy(masks + curvec, node_to_cpumask[n]); - if (++curvec == last_affv) + if (++done == affv) break; + if (++curvec == last_affv) + curvec = affd->pre_vectors; } goto out; } @@ -150,13 +152,16 @@ static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd, irq_spread_init_one(masks + curvec, nmsk, cpus_per_vec); } - if (curvec >= last_affv) + done += v; + if (done >= affv) break; + if (curvec >= last_affv) + curvec = affd->pre_vectors; --nodes; } out: - 
return curvec - affd->pre_vectors; + return done; } /** @@ -169,6 +174,7 @@ static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd, struct cpumask * irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) { + int affv = nvecs - affd->pre_vectors - affd->post_vectors; int curvec; struct cpumask *masks; cpumask_var_t nmsk, *node_to_cpumask; @@ -198,7 +204,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) /* Stabilize the cpumasks */ get_online_cpus(); build_node_to_cpumask(node_to_cpumask); - curvec += irq_build_affinity_masks(nvecs, affd, node_to_cpumask, + curvec += irq_build_affinity_masks(affd, curvec, affv, + node_to_cpumask, cpu_possible_mask, nmsk, masks); put_online_cpus(); -- 2.9.5
[PATCH V2 5/5] genirq/affinity: irq vector spread among online CPUs as far as possible
84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs") may
cause irq vectors to be assigned to CPUs that are all offline, and this
kind of assignment can leave far fewer irq vectors mapped to online CPUs,
so performance may get hurt.

For example, in an 8-core system with CPUs 0-3 online and 4-7 offline/not
present, see 'lscpu':

	[ming@box]$ lscpu
	Architecture:          x86_64
	CPU op-mode(s):        32-bit, 64-bit
	Byte Order:            Little Endian
	CPU(s):                4
	On-line CPU(s) list:   0-3
	Thread(s) per core:    1
	Core(s) per socket:    2
	Socket(s):             2
	NUMA node(s):          2
	...
	NUMA node0 CPU(s):     0-3
	NUMA node1 CPU(s):
	...

For example, one device has 4 queues:

1) before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
	irq 39, cpu list 0
	irq 40, cpu list 1
	irq 41, cpu list 2
	irq 42, cpu list 3

2) after 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
	irq 39, cpu list 0-2
	irq 40, cpu list 3-4,6
	irq 41, cpu list 5
	irq 42, cpu list 7

3) after applying this patch against v4.15+:
	irq 39, cpu list 0,4
	irq 40, cpu list 1,6
	irq 41, cpu list 2,5
	irq 42, cpu list 3,7

This patch tries to spread irq vectors among online CPUs as far as
possible by doing the spread in two stages. The assignment in 3) isn't
optimal from a NUMA point of view, but it leaves more irq vectors with an
online CPU mapped; given that in reality one CPU should be enough to
handle one irq vector, it is better to do it this way.
Cc: Thomas GleixnerReviewed-by: Christoph Hellwig Reported-by: Laurence Oberman Signed-off-by: Ming Lei --- kernel/irq/affinity.c | 35 +-- 1 file changed, 29 insertions(+), 6 deletions(-) diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index a8c5d07890a6..aa2635416fc5 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -106,6 +106,9 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd, nodemask_t nodemsk = NODE_MASK_NONE; int n, nodes, cpus_per_vec, extra_vecs, done = 0; + if (!cpumask_weight(cpu_mask)) + return 0; + nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, ); /* @@ -175,9 +178,9 @@ struct cpumask * irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) { int affv = nvecs - affd->pre_vectors - affd->post_vectors; - int curvec; + int curvec, vecs_offline, vecs_online; struct cpumask *masks; - cpumask_var_t nmsk, *node_to_cpumask; + cpumask_var_t nmsk, cpu_mask, *node_to_cpumask; /* * If there aren't any vectors left after applying the pre/post @@ -193,9 +196,12 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) if (!masks) goto out; + if (!alloc_cpumask_var(_mask, GFP_KERNEL)) + goto out; + node_to_cpumask = alloc_node_to_cpumask(); if (!node_to_cpumask) - goto out; + goto out_free_cpu_mask; /* Fill out vectors at the beginning that don't need affinity */ for (curvec = 0; curvec < affd->pre_vectors; curvec++) @@ -204,15 +210,32 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) /* Stabilize the cpumasks */ get_online_cpus(); build_node_to_cpumask(node_to_cpumask); - curvec += irq_build_affinity_masks(affd, curvec, affv, - node_to_cpumask, - cpu_possible_mask, nmsk, masks); + /* spread on online CPUs starting from the vector of affd->pre_vectors */ + vecs_online = irq_build_affinity_masks(affd, curvec, affv, + node_to_cpumask, + cpu_online_mask, nmsk, masks); + + /* spread on offline CPUs starting from the next vector to be handled */ + if (vecs_online 
>= affv) + curvec = affd->pre_vectors; + else + curvec = affd->pre_vectors + vecs_online; + cpumask_andnot(cpu_mask, cpu_possible_mask, cpu_online_mask); + vecs_offline = irq_build_affinity_masks(affd, curvec, affv, + node_to_cpumask, + cpu_mask, nmsk, masks); put_online_cpus(); /* Fill out vectors at the end that don't need affinity */ + if (vecs_online + vecs_offline >= affv) + curvec = affv + affd->pre_vectors; + else + curvec = affd->pre_vectors + vecs_online + vecs_offline; for (; curvec < nvecs; curvec++) cpumask_copy(masks + curvec,
[PATCH V2 0/5] genirq/affinity: irq vector spread among online CPUs as far as possible
Hi,

This patchset tries to spread irq vectors among online CPUs as far as
possible, so that we avoid allocating too few irq vectors with online
CPUs mapped.

For example, in an 8-core system where cpu cores 4-7 are offline/not
present, on a device with 4 queues:

1) before this patchset
	irq 39, cpu list 0-2
	irq 40, cpu list 3-4,6
	irq 41, cpu list 5
	irq 42, cpu list 7

2) after this patchset
	irq 39, cpu list 0,4
	irq 40, cpu list 1,6
	irq 41, cpu list 2,5
	irq 42, cpu list 3,7

Without this patchset, only two vectors (39 and 40) can be active, but
after applying it all 4 irq vectors can be active.

One disadvantage is that CPUs from different NUMA nodes can be mapped to
the same irq vector. Given that one CPU is generally enough to handle one
irq vector, this shouldn't be a big deal — especially since otherwise more
vectors would have to be allocated, or performance would be hurt by the
current assignment.

V2:
	- address comments from Christoph
	- mark irq_build_affinity_masks as static
	- move constification of get_nodes_in_cpumask's parameter into one
	  prep patch
	- add Reviewed-by tag

Thanks
Ming

Ming Lei (5):
  genirq/affinity: rename *node_to_possible_cpumask as *node_to_cpumask
  genirq/affinity: mark 'node_to_cpumask' as const for
    get_nodes_in_cpumask()
  genirq/affinity: move actual irq vector spread into one helper
  genirq/affinity: support to do irq vectors spread starting from any
    vector
  genirq/affinity: irq vector spread among online CPUs as far as
    possible

 kernel/irq/affinity.c | 145 --
 1 file changed, 94 insertions(+), 51 deletions(-)

--
2.9.5
Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue
On Fri, Mar 02, 2018 at 04:53:21PM -0500, Laurence Oberman wrote: > On Fri, 2018-03-02 at 15:03 +, Don Brace wrote: > > > -Original Message- > > > From: Laurence Oberman [mailto:lober...@redhat.com] > > > Sent: Friday, March 02, 2018 8:09 AM > > > To: Ming Lei> > > Cc: Don Brace ; Jens Axboe > > k>; > > > linux-block@vger.kernel.org; Christoph Hellwig ; > > > Mike > > > Snitzer ; linux-s...@vger.kernel.org; Hannes > > > Reinecke > > > ; Arun Easi ; Omar Sandoval > > > ; Martin K . Petersen ; > > > James > > > Bottomley ; Christoph > > > Hellwig > > > ; Kashyap Desai ; Peter > > > Rivera > > > ; Meelis Roos > > > Subject: Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply > > > queue > > > > > > EXTERNAL EMAIL > > > > > > > > > On Fri, 2018-03-02 at 10:16 +0800, Ming Lei wrote: > > > > On Thu, Mar 01, 2018 at 04:19:34PM -0500, Laurence Oberman wrote: > > > > > On Thu, 2018-03-01 at 14:01 -0500, Laurence Oberman wrote: > > > > > > On Thu, 2018-03-01 at 16:18 +, Don Brace wrote: > > > > > > > > -Original Message- > > > > > > > > From: Ming Lei [mailto:ming@redhat.com] > > > > > > > > Sent: Tuesday, February 27, 2018 4:08 AM > > > > > > > > To: Jens Axboe ; linux-block@vger.kernel > > > > > > > > .org > > > > > > > > ; > > > > > > > > Christoph > > > > > > > > Hellwig ; Mike Snitzer > > > > > > > .com > > > > > > > > > > > > > > > > > > > > > > > > > Cc: linux-s...@vger.kernel.org; Hannes Reinecke > > > > > > > e.de > > > > > > > > > ; > > > > > > > > > > > > > > > > Arun Easi > > > > > > > > ; Omar Sandoval ; > > > > > > > > Martin K > > > > > > > > . 
> > > > > > > > Petersen ; James Bottomley > > > > > > > > ; Christoph > > > > > > > > Hellwig > > > > > > > ch@l > > > > > > > > st > > > > > > > > .de>; > > > > > > > > Don Brace ; Kashyap Desai > > > > > > > > ; Peter Rivera > > > > > > > broa > > > > > > > > dcom > > > > > > > > .c > > > > > > > > om>; > > > > > > > > Laurence Oberman ; Ming Lei > > > > > > > > ; Meelis Roos > > > > > > > > Subject: [PATCH V3 1/8] scsi: hpsa: fix selection of > > > > > > > > reply > > > > > > > > queue > > > > > > > > > > > > > > > > Seems Don run into IO failure without blk-mq, could you run your > > > > tests again > > > > in legacy mode? > > > > > > > > Thanks, > > > > Ming > > > > > > Hello Ming > > > I ran multiple passes on Legacy and still see no issues in my test > > > bed > > > > > > BOOT_IMAGE=/vmlinuz-4.16.0-rc2.ming+ root=UUID=43f86d71-b1bf-4789- > > > a28e- > > > 21c6ddc90195 ro crashkernel=256M@64M log_buf_len=64M > > > console=ttyS1,115200n8 > > > > > > HEAD of the git kernel I am using > > > > > > 694e16f scsi: megaraid: improve scsi_mq performance via > > > .host_tagset > > > 793686c scsi: hpsa: improve scsi_mq performance via .host_tagset > > > 60d5b36 block: null_blk: introduce module parameter of > > > 'g_host_tags' > > > 8847067 scsi: Add template flag 'host_tagset' > > > a8fbdd6 blk-mq: introduce BLK_MQ_F_HOST_TAGS > > > 4710fab blk-mq: introduce 'start_tag' field to 'struct blk_mq_tags' > > > 09bb153 scsi: megaraid_sas: fix selection of reply queue > > > 52700d8 scsi: hpsa: fix selection of reply queue > > > > I checkout out Linus's tree (4.16.0-rc3+) and re-applied the above > > patches. > > I and have been running 24 hours with no issues. > > Evidently my forked copy was corrupted. > > > > So, my I/O testing has gone well. > > > > I'll run some performance numbers next. 
> > Thanks,
> > Don
>
> Unless Kashyap is not happy we need to consider getting this in to
> Linus now because we are seeing HPE servers that keep hanging now with
> the original commit now upstream.

Hi Martin,

Given that both Don and Laurence have verified that patch 1 and patch 2 do
fix the IO hang, could you consider merging those two first?

Thanks,
Ming
Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
On Thu, Mar 1, 2018 at 10:40 AM, Logan Gunthorpewrote: > Register the CMB buffer as p2pmem and use the appropriate allocation > functions to create and destroy the IO SQ. > > If the CMB supports WDS and RDS, publish it for use as p2p memory > by other devices. > > Signed-off-by: Logan Gunthorpe > --- > drivers/nvme/host/pci.c | 75 > +++-- > 1 file changed, 41 insertions(+), 34 deletions(-) > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index 73036d2fbbd5..56ca79be8476 100644 > --- a/drivers/nvme/host/pci.c > +++ b/drivers/nvme/host/pci.c > @@ -29,6 +29,7 @@ > #include > #include > #include > +#include > > #include "nvme.h" > > @@ -91,9 +92,8 @@ struct nvme_dev { > struct work_struct remove_work; > struct mutex shutdown_lock; > bool subsystem; > - void __iomem *cmb; > - pci_bus_addr_t cmb_bus_addr; > u64 cmb_size; > + bool cmb_use_sqes; > u32 cmbsz; > u32 cmbloc; > struct nvme_ctrl ctrl; > @@ -148,7 +148,7 @@ struct nvme_queue { > struct nvme_dev *dev; > spinlock_t q_lock; > struct nvme_command *sq_cmds; > - struct nvme_command __iomem *sq_cmds_io; > + bool sq_cmds_is_io; > volatile struct nvme_completion *cqes; > struct blk_mq_tags **tags; > dma_addr_t sq_dma_addr; > @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq, > { > u16 tail = nvmeq->sq_tail; > - if (nvmeq->sq_cmds_io) > - memcpy_toio(>sq_cmds_io[tail], cmd, sizeof(*cmd)); > - else > - memcpy(>sq_cmds[tail], cmd, sizeof(*cmd)); > + memcpy(>sq_cmds[tail], cmd, sizeof(*cmd)); Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC the _toio() variant enforces alignment, does the copy with 4 byte stores, and has a full barrier after the copy. In comparison our regular memcpy() does none of those things and may use unaligned and vector load/stores. For normal (cacheable) memory that is perfectly fine, but they can cause alignment faults when targeted at MMIO (cache-inhibited) memory. 
I think in this particular case it might be ok since we know SEQs are aligned to 64 byte boundaries and the copy is too small to use our vectorised memcpy(). I'll assume we don't need explicit ordering between writes of SEQs since the existing code doesn't seem to care unless the doorbell is being rung, so you're probably fine there too. That said, I still think this is a little bit sketchy and at the very least you should add a comment explaining what's going on when the CMB is being used. If someone more familiar with the NVMe driver could chime in I would appreciate it. > if (++tail == nvmeq->q_depth) > tail = 0; > @@ -1286,9 +1283,18 @@ static void nvme_free_queue(struct nvme_queue *nvmeq) > { > dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth), > (void *)nvmeq->cqes, nvmeq->cq_dma_addr); > - if (nvmeq->sq_cmds) > - dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth), > - nvmeq->sq_cmds, nvmeq->sq_dma_addr); > + > + if (nvmeq->sq_cmds) { > + if (nvmeq->sq_cmds_is_io) > + pci_free_p2pmem(to_pci_dev(nvmeq->q_dmadev), > + nvmeq->sq_cmds, > + SQ_SIZE(nvmeq->q_depth)); > + else > + dma_free_coherent(nvmeq->q_dmadev, > + SQ_SIZE(nvmeq->q_depth), > + nvmeq->sq_cmds, > + nvmeq->sq_dma_addr); > + } > } > > static void nvme_free_queues(struct nvme_dev *dev, int lowest) > @@ -1368,12 +1374,21 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int > nr_io_queues, > static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq, > int qid, int depth) > { > - /* CMB SQEs will be mapped before creation */ > - if (qid && dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) > - return 0; > + struct pci_dev *pdev = to_pci_dev(dev->dev); > + > + if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) { > + nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth)); > + nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev, > + nvmeq->sq_cmds); > + nvmeq->sq_cmds_is_io = true; > + } > + > + if (!nvmeq->sq_cmds) { > + nvmeq->sq_cmds = 
> +			dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
> +					   &nvmeq->sq_dma_addr, GFP_KERNEL);
> +		nvmeq->sq_cmds_is_io = false;
> +	}
>
> -
Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir
On 18/3/5 04:23, Tejun Heo wrote:
> Hello, Joseph.
>
> Sorry about the late reply.
>
> On Wed, Feb 28, 2018 at 02:52:10PM +0800, Joseph Qi wrote:
>> In the current code, I'm afraid pd_offline_fn() as well as the rest of
>> the destruction have to be called together under the same blkcg->lock
>> and q->queue_lock.
>> For example, if we split pd_offline_fn() and radix_tree_delete()
>> into 2 phases, it may introduce a race between blkcg_deactivate_policy()
>> at queue exit and blkcg_css_free(), which will result in
>> pd_offline_fn() being called twice.
>
> So, yeah, the sync scheme around blkg is pretty brittle and we'd need
> some restructuring to separate out blkg offlining and release, but it
> looks like that'd be the right thing to do, no?
>
Agreed. Apart from the restriction above, I haven't found any others so far. I'll try to fix it the way you suggested and post v3.

Thanks,
Joseph
Re: [PATCH] bcache: don't attach backing with duplicate UUID
Hello Mike,

I sent the earlier email from my personal mailbox (110950...@qq.com); it may have failed to get through, so I am resending it from my office mailbox. Below is the mail content I sent previously.

I am Tang Junhui (tang.jun...@zte.com.cn); that email came from my personal mailbox, since I was not in the office that day.

>> From: Tang Junhui
>>
>> Hello, Mike
>>
>> This patch looks good, but has some conflicts with this patch:
>> bcache: fix for data collapse after re-attaching an attached device
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=73ac105be390c1de42a2f21643c9778a5e002930
>> Could you modify your fix based on the previous patch?
>
> That doesn't make sense. This patch was generated from a current tree
> where it's applied on top of that: (It's based on next when it should
> really be based on Linus's tree, but it doesn't matter for patch
> application because there's no changes in next right now to bcache that
> aren't in Linus's tree).

Originally, I did not mean merge conflicts but a logical conflict in the code: the previous patch added a new input parameter, set_uuid, to bch_cached_dev_attach(); if set_uuid is not NULL we use it as the cache set uuid, otherwise we use dc->sb.set_uuid. But now that I have read your patch again, I realize you did not use dc->sb.set_uuid but dc->sb.uuid to judge whether the device is a duplicate backing device, so it's OK for me.

> May I add your reviewed-by so I can send this (and your fix) upstream?

Reviewed-by: Tang Junhui

Thanks,
Tang Junhui
Re: [PATCH v2] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir
Hello, Joseph.

Sorry about the late reply.

On Wed, Feb 28, 2018 at 02:52:10PM +0800, Joseph Qi wrote:
> In the current code, I'm afraid pd_offline_fn() as well as the rest of
> the destruction have to be called together under the same blkcg->lock
> and q->queue_lock.
> For example, if we split pd_offline_fn() and radix_tree_delete()
> into 2 phases, it may introduce a race between blkcg_deactivate_policy()
> at queue exit and blkcg_css_free(), which will result in
> pd_offline_fn() being called twice.

So, yeah, the sync scheme around blkg is pretty brittle and we'd need some restructuring to separate out blkg offlining and release, but it looks like that'd be the right thing to do, no?

Thanks.

--
tejun
Re: vgdisplay hang on iSCSI session
On Sun, 2018-03-04 at 20:01 +0100, Jean-Louis Dupond wrote:
> I'm indeed running CentOS 6 with the Virt SIG kernels. Already updated
> to 4.9.75, but recently hit the problem again.
>
> The first PID that was in D-state (root 27157 0.0 0.0 127664 5196 ? D 06:19 0:00 \_ vgdisplay -c --ignorelockingfailure) had the following stack:
> # cat /proc/27157/stack
> [] blk_mq_freeze_queue_wait+0x6f/0xd0
> [] blk_freeze_queue+0x1e/0x30
> [] blk_mq_freeze_queue+0xe/0x10
> [] loop_switch+0x1e/0xd0
> [] lo_release+0x7a/0x80
> [] __blkdev_put+0x1a7/0x200
> [] blkdev_put+0x56/0x140
> [] blkdev_close+0x24/0x30
> [] __fput+0xc8/0x240
> [] fput+0xe/0x10
> [] task_work_run+0x68/0xa0
> [] exit_to_usermode_loop+0xc6/0xd0
> [] do_syscall_64+0x185/0x240
> [] entry_SYSCALL64_slow_path+0x25/0x25
> [] 0x
>
> Other procs show the following:
> # cat /proc/7803/stack
> [] __blkdev_get+0x6c/0x3f0
> [] blkdev_get+0x5c/0x1c0
> [] blkdev_open+0x62/0x80
> [] do_dentry_open+0x22a/0x340
> [] vfs_open+0x51/0x80
> [] do_last+0x435/0x7a0
> [] path_openat+0x87/0x1c0
> [] do_filp_open+0x85/0xe0
> [] do_sys_open+0x11c/0x210
> [] SyS_open+0x1e/0x20
> [] do_syscall_64+0x7a/0x240
> [] entry_SYSCALL64_slow_path+0x25/0x25
> [] 0x
>
> An strace hangs again on the loop0 open:
> stat("/dev/loop0", {st_mode=S_IFBLK|0660, st_rdev=makedev(7, 0), ...}) = 0
> open("/dev/loop0", O_RDONLY|O_DIRECT|O_NOATIME
>
> And it seems like indeed a lot is hanging on loop0:
> # cat /sys/block/loop0/mq/0/queued
> 5957

Hello Jean-Louis,

Is the system still in this state? If so, can you provide the output of the following command (as an attachment):

find /sys/kernel/debug/block/ -type f \! \( -name poll_stat -o -name dispatched -o -name merged -o -name completed \)

Thanks,

Bart.