Re: [PATCH 12/17] bcache: fix code comments style
On 2018/8/10 11:10 AM, shenghui wrote:
>
>
> On 08/09/2018 02:43 PM, Coly Li wrote:
>> This patch fixes 3 style issues warned by checkpatch.pl,
>> - Comment lines are not aligned
>> - Comments use "/*" on subsequent lines
>> - Comment lines use a trailing "*/"
>>
>> Signed-off-by: Coly Li
>> ---
>>  drivers/md/bcache/super.c | 22 +-
>>  1 file changed, 13 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index eac0d5d1ada3..a8b751ba3ebc 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -465,8 +465,8 @@ static struct uuid_entry *uuid_find_empty(struct cache_set *c)
>>   * Bucket priorities/gens:
>>   *
>>   * For each bucket, we store on disk its
>> - *    8 bit gen
>> - *    16 bit priority
>> + * 8 bit gen
>> + * 16 bit priority
>>   *
>>   * See alloc.c for an explanation of the gen. The priority is used to implement
>>   * lru (and in the future other) cache replacement policies; for most purposes
>> @@ -934,8 +934,10 @@ void bch_cached_dev_run(struct cached_dev *dc)
>>
>>      add_disk(d->disk);
>>      bd_link_disk_holder(dc->bdev, dc->disk.disk);
>> -    /* won't show up in the uevent file, use udevadm monitor -e instead
>> -     * only class / kset properties are persistent */
>> +    /*
>> +     * won't show up in the uevent file, use udevadm monitor -e instead
>> +     * only class / kset properties are persistent
>> +     */
>>      kobject_uevent_env(&disk_to_dev(d->disk)->kobj, KOBJ_CHANGE, env);
>>      kfree(env[1]);
>>      kfree(env[2]);
>> @@ -1104,8 +1106,9 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
>>          }
>>      }
>>
>> -    /* Deadlocks since we're called via sysfs...
>> -    sysfs_remove_file(&dc->kobj, &sysfs_attach);
>> +    /*
>> +     * Deadlocks since we're called via sysfs...
>> +     * sysfs_remove_file(&dc->kobj, &sysfs_attach);
>>      */
>>
>>      if (bch_is_zero(u->uuid, 16)) {
>> @@ -1468,9 +1471,10 @@ bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...)
>>      if (test_and_set_bit(CACHE_SET_IO_DISABLE, &c->flags))
>>          pr_info("CACHE_SET_IO_DISABLE already set");
>>
>> -    /* XXX: we can be called from atomic context
>> -    acquire_console_sem();
>> -    */
>> +    /*
>> +     * XXX: we can be called from atomic context
>> +     * acquire_console_sem();
>> +     */
>>
>>      pr_err("bcache: error on %pU: ", c->sb.set_uuid);
>>
>
> Hi Coly,
>
> I noticed some missing:
> bset.c:405:/* I have no idea why this code works... and I'm the one who
> wrote it

This is a single line comment, IMHO it is OK.

> bset.c:796:     /* k is the key we just inserted; we need to find the
> entry in the
> bset.c:804:     /* Adjust all the lookup table entries, and find a new
> key for any that
> writeback.c:471:/* We've acquired a semaphore
> for the maximum

The above lines should be fixed, I will update them in the v2 series.

> And some comments start with '/**', like:
> 85 /**
> 86  * bch_hprint - formats @v to human readable string for sysfs.

I see this kind of comment used for library routines; the original author
may just have treated them as library code. So it's OK for me.

Thanks for your review.

Coly Li
Re: [RFC PATCH 1/5] block: move call of scheduler's ->completed_request() hook
Hi Omar

On 08/10/2018 04:26 AM, Omar Sandoval wrote:
> @@ -524,6 +524,9 @@ inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
>               blk_stat_add(rq, now);
>       }
>
> +     if (rq->internal_tag != -1)
> +             blk_mq_sched_completed_request(rq, now);
> +

Is it OK to move the io scheduler's completed callback into
__blk_mq_end_request()?

There is a relatively long distance between __blk_mq_complete_request()
and __blk_mq_end_request(), especially between the driver's
mq_ops.complete and bio_endio.

Thanks
Jianchao
Re: [PATCH v6 08/12] block, scsi: Introduce blk_pm_runtime_exit()
Hi Bart

On 08/10/2018 03:41 AM, Bart Van Assche wrote:
> +void blk_pm_runtime_exit(struct request_queue *q)
> +{
> +     if (!q->dev)
> +             return;
> +
> +     pm_runtime_get_sync(q->dev);
> +     q->dev = NULL;
> +}
> +EXPORT_SYMBOL(blk_pm_runtime_exit);
> +
>  /**
>   * blk_pre_runtime_suspend - Pre runtime suspend check
>   * @q: the queue of the device
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 69ab459abb98..5537762dfcfd 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3420,12 +3420,11 @@ static int sd_probe(struct device *dev)
>   **/
>  static int sd_remove(struct device *dev)
>  {
> -     struct scsi_disk *sdkp;
> -     dev_t devt;
> +     struct scsi_disk *sdkp = dev_get_drvdata(dev);
> +     struct scsi_device *sdp = sdkp->device;
> +     dev_t devt = disk_devt(sdkp->disk);
>
> -     sdkp = dev_get_drvdata(dev);
> -     devt = disk_devt(sdkp->disk);
> -     scsi_autopm_get_device(sdkp->device);
> +     blk_pm_runtime_exit(sdp->request_queue)

Based on __scsi_device_remove, sd_remove will be invoked before
blk_cleanup_queue, so it should not be safe to set q->dev to NULL here.

A following patch in this series has these operations:

+static inline void blk_pm_mark_last_busy(struct request *rq)
+{
+     if (rq->q->dev && !(rq->rq_flags & RQF_PM))
+             pm_runtime_mark_last_busy(rq->q->dev);
+}

Thanks
Jianchao
Fix for 84676c1f (b5b6e8c8) missing in 4.14.y
Hi Ming (and all),

Your series "scsi: virtio_scsi: fix IO hang caused by irq vector automatic
affinity", which forces virtio-scsi to use blk-mq, fixes an issue introduced
by 84676c1f. We noticed that this bug also exists in 4.14.y (as
ef86f3a72adb), but your series was not backported to that stable branch.

Are there any plans to do that? At least CoreOS is using 4.14 and showing
issues on AHV (which provides an mq virtio-scsi controller).

Thanks,
Felipe
Re: Fix for 84676c1f (b5b6e8c8) missing in 4.14.y
On Fri, Aug 10, 2018 at 02:09:01AM +, Felipe Franciosi wrote:
> Hi Ming (and all),
>
> Your series "scsi: virtio_scsi: fix IO hang caused by irq vector automatic
> affinity" which forces virtio-scsi to use blk-mq fixes an issue introduced
> by 84676c1f. We noticed that this bug also exists in 4.14.y (as
> ef86f3a72adb), but your series was not backported to that stable branch.
>
> Are there any plans to do that? At least CoreOS is using 4.14 and showing
> issues on AHV (which provides an mq virtio-scsi controller).
>

Hi Felipe,

It looks like the following 4 patches should have been marked for stable;
sorry for missing that:

b5b6e8c8d3b4 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
2f31115e940c scsi: core: introduce force_blk_mq
adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
8b834bff1b73 scsi: hpsa: fix selection of reply queue

Usually this backporting is done by our stable guys, so I will CC stable
and leave it to them, but I am happy to provide any help with addressing
conflicts or that sort of thing.

Thanks,
Ming
Re: [PATCH v6 10/12] block: Change the runtime power management approach (1/2)
Hi Bart

On 08/10/2018 03:41 AM, Bart Van Assche wrote:
> Instead of scheduling runtime resume of a request queue after a
> request has been queued, schedule asynchronous resume during request
> allocation. The new pm_request_resume() calls occur after
> blk_queue_enter() has increased the q_usage_counter request queue
                                      ^^^
> member. This change is needed for a later patch that will make request
> allocation block while the queue status is not RPM_ACTIVE.

Is it "after getting q->q_usage_counter fails"?

And also, this blk_pm_request_resume() will not affect the normal path. ;)

Thanks
Jianchao
Re: [PATCH v6 11/12] block: Change the runtime power management approach (2/2)
Hi Bart

On 08/10/2018 03:41 AM, Bart Van Assche wrote:
> +
> +     blk_set_pm_only(q);
> +     /*
> +      * This function only gets called if the most recent
> +      * pm_request_resume() call occurred at least autosuspend_delay_ms
> +      * ago. Since blk_queue_enter() is called by the request allocation
> +      * code before pm_request_resume(), if no requests have a tag assigned
> +      * it is safe to suspend the device.
> +      */
> +     ret = -EBUSY;
> +     if (blk_requests_in_flight(q) == 0) {
> +             /*
> +              * Call synchronize_rcu() such that later blk_queue_enter()
> +              * calls see the preempt-only state. See also
> +              * http://lwn.net/Articles/573497/.
> +              */
> +             synchronize_rcu();
> +             if (blk_requests_in_flight(q) == 0)
> +                     ret = 0;
> +     }

I still think blk_set_pm_only() should be moved after the
blk_requests_in_flight() check. Otherwise, normal I/O will be blocked for
a little while if there are still busy requests.

Thanks
Jianchao
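The reordering suggested in this review can be sketched as a small
user-space model: check for in-flight requests before gating the queue, so
that a suspend attempt that is going to fail never blocks normal I/O. All
names below are illustrative stand-ins, not the kernel block-layer API:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-ins for the request queue state involved. */
struct model_queue {
        int in_flight;  /* stands in for blk_requests_in_flight() */
        bool pm_only;   /* stands in for the blk_set_pm_only() gate */
};

static int model_pre_suspend(struct model_queue *q)
{
        if (q->in_flight > 0)
                return -EBUSY;          /* bail out before gating normal I/O */

        q->pm_only = true;              /* gate new submitters */
        /* ... a synchronize_rcu() would sit here in the kernel ... */
        if (q->in_flight > 0) {         /* re-check once the gate is visible */
                q->pm_only = false;
                return -EBUSY;
        }
        return 0;                       /* idle: safe to suspend */
}
```

With this ordering, a busy queue returns -EBUSY without the gate ever being
raised, which is exactly the "don't block normal I/O for a little while"
property argued for above.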
Re: [PATCH v6 05/12] block, scsi: Rename QUEUE_FLAG_PREEMPT_ONLY into DV_ONLY and introduce PM_ONLY
Hi Bart

On 08/10/2018 03:41 AM, Bart Van Assche wrote:
> +/*
> + * Whether or not blk_queue_enter() should proceed. RQF_PM requests are
> + * always allowed. RQF_DV requests are allowed if the PM_ONLY queue flag
> + * has not been set. Other requests are only allowed if neither PM_ONLY
> + * nor DV_ONLY has been set.
> + */
> +static inline bool blk_enter_allowed(struct request_queue *q,
> +                                    blk_mq_req_flags_t flags)
> +{
> +     return flags & BLK_MQ_REQ_PM ||
> +            (!blk_queue_pm_only(q) &&
> +             (flags & BLK_MQ_REQ_DV || !blk_queue_dv_only(q)));
> +}

If a new state is indeed necessary, this kind of checking in the hot path
is inefficient. How about introducing a new member in the request_queue,
such as request_queue->gate_state? Set the PM_ONLY and DV_ONLY bits in
this state; then we could just check request_queue->gate_state > 0 before
doing any further checking.

Thanks
Jianchao
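The single "gate state" word suggested in this reply can be sketched as
follows. The bit and flag names are hypothetical; only the semantics mirror
the quoted blk_enter_allowed() (PM requests always pass, DV requests pass
unless PM_ONLY is set, normal requests pass only when no gate bit is set):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical gate bits kept in one word in the queue. */
#define GATE_PM_ONLY    (1u << 0)
#define GATE_DV_ONLY    (1u << 1)

/* Hypothetical request flags. */
#define REQ_PM          (1u << 0)
#define REQ_DV          (1u << 1)

static bool gate_enter_allowed(unsigned int gate_state, unsigned int flags)
{
        if (gate_state == 0)            /* hot path: one load, one test */
                return true;
        if (flags & REQ_PM)             /* PM requests always allowed */
                return true;
        if (flags & REQ_DV)             /* DV allowed unless PM_ONLY set */
                return !(gate_state & GATE_PM_ONLY);
        return false;                   /* any gate bit blocks normal I/O */
}
```

The common case (no gating active) is decided by a single comparison
against zero, which is the efficiency argument being made above.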
Re: [PATCH v6 02/12] scsi: Alter handling of RQF_DV requests
On Thu, Aug 09, 2018 at 12:41:39PM -0700, Bart Van Assche wrote:
> Process all requests in state SDEV_CREATED instead of only RQF_DV
> requests. This does not change the behavior of the SCSI core because
> the SCSI device state is modified into another state before SCSI
> devices become visible in sysfs and before any device nodes are
> created in /dev. Do not process RQF_DV requests in state SDEV_CANCEL
> because only power management requests should be processed in this
> state.

Could you explain a bit why only PM requests should be processed in
SDEV_CANCEL?

Thanks,
Ming
[RFC PATCH 4/5] kyber: implement improved heuristics
From: Omar Sandoval

Kyber's current heuristics have a few flaws:

- It's based on the mean latency, but p99 latency tends to be more
  meaningful to anyone who cares about latency. The mean can also be
  skewed by rare outliers that the scheduler can't do anything about.
- The statistics calculations are purely time-based with a short window.
  This works for steady, high load, but is more sensitive to outliers
  with bursty workloads.
- It only considers the latency once an I/O has been submitted to the
  device, but the user cares about the time spent in the kernel, as
  well.

These are shortcomings of the generic blk-stat code which doesn't quite
fit the ideal use case for Kyber. So, this replaces the statistics with
a histogram used to calculate percentiles of total latency and I/O
latency, which we then use to adjust depths in a slightly more
intelligent manner:

- Sync and async writes are now the same domain.
- Discards are a separate domain.
- Domain queue depths are scaled by the ratio of the p99 total latency
  to the target latency (e.g., if the p99 latency is double the target
  latency, we will double the queue depth; if the p99 latency is half
  of the target latency, we can halve the queue depth).
- We use the I/O latency to determine whether we should scale queue
  depths down: we will only scale down if any domain's I/O latency
  exceeds the target latency, which is an indicator of congestion in
  the device.

These new heuristics are just as scalable as the heuristics they
replace.

Signed-off-by: Omar Sandoval
---
 block/kyber-iosched.c | 497 --
 1 file changed, 279 insertions(+), 218 deletions(-)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 08eb5295c18d..adc8e6393829 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -29,13 +29,16 @@
 #include "blk-mq-debugfs.h"
 #include "blk-mq-sched.h"
 #include "blk-mq-tag.h"
-#include "blk-stat.h"

-/* Scheduling domains. */
+/*
+ * Scheduling domains: the device is divided into multiple domains based on the
+ * request type.
+ */
 enum {
        KYBER_READ,
-       KYBER_SYNC_WRITE,
-       KYBER_OTHER, /* Async writes, discard, etc. */
+       KYBER_WRITE,
+       KYBER_DISCARD,
+       KYBER_OTHER,
        KYBER_NUM_DOMAINS,
 };

@@ -49,25 +52,82 @@ enum {
 };

 /*
- * Initial device-wide depths for each scheduling domain.
+ * Maximum device-wide depth for each scheduling domain.
  *
- * Even for fast devices with lots of tags like NVMe, you can saturate
- * the device with only a fraction of the maximum possible queue depth.
- * So, we cap these to a reasonable value.
+ * Even for fast devices with lots of tags like NVMe, you can saturate the
+ * device with only a fraction of the maximum possible queue depth. So, we cap
+ * these to a reasonable value.
  */
 static const unsigned int kyber_depth[] = {
        [KYBER_READ] = 256,
-       [KYBER_SYNC_WRITE] = 128,
-       [KYBER_OTHER] = 64,
+       [KYBER_WRITE] = 128,
+       [KYBER_DISCARD] = 64,
+       [KYBER_OTHER] = 16,
 };

 /*
- * Scheduling domain batch sizes. We favor reads.
+ * Default latency targets for each scheduling domain.
+ */
+static const u64 kyber_latency_targets[] = {
+       [KYBER_READ] = 2 * NSEC_PER_MSEC,
+       [KYBER_WRITE] = 10 * NSEC_PER_MSEC,
+       [KYBER_DISCARD] = 5 * NSEC_PER_SEC,
+};
+
+/*
+ * Batch size (number of requests we'll dispatch in a row) for each scheduling
+ * domain.
  */
 static const unsigned int kyber_batch_size[] = {
        [KYBER_READ] = 16,
-       [KYBER_SYNC_WRITE] = 8,
-       [KYBER_OTHER] = 8,
+       [KYBER_WRITE] = 8,
+       [KYBER_DISCARD] = 1,
+       [KYBER_OTHER] = 1,
+};
+
+/*
+ * Requests latencies are recorded in a histogram with buckets defined relative
+ * to the target latency:
+ *
+ * <= 1/4 * target latency
+ * <= 1/2 * target latency
+ * <= 3/4 * target latency
+ * <= target latency
+ * <= 1 1/4 * target latency
+ * <= 1 1/2 * target latency
+ * <= 1 3/4 * target latency
+ * > 1 3/4 * target latency
+ */
+enum {
+       /*
+        * The width of the latency histogram buckets is
+        * 1 / (1 << KYBER_LATENCY_SHIFT) * target latency.
+        */
+       KYBER_LATENCY_SHIFT = 2,
+       /*
+        * The first (1 << KYBER_LATENCY_SHIFT) buckets are <= target latency,
+        * thus, "good".
+        */
+       KYBER_GOOD_BUCKETS = 1 << KYBER_LATENCY_SHIFT,
+       /* There are also (1 << KYBER_LATENCY_SHIFT) "bad" buckets. */
+       KYBER_LATENCY_BUCKETS = 2 << KYBER_LATENCY_SHIFT,
+};
+
+/*
+ * We measure both the total latency and the I/O latency (i.e., latency after
+ * submitting to the device).
+ */
+enum {
+       KYBER_TOTAL_LATENCY,
+       KYBER_IO_LATENCY,
+};
+
+/*
+ * Per-cpu latency histograms: total latency and I/O latency for each scheduling
+ * domain except for KYBER_OTHER.
+ */
+struct kyber_cpu_latency {
+       atomic_t buckets[KYBER_OTHER][2][KYBER_LATENCY_BUCKETS];
 };

 /*
@@ -84,14 +144,9 @@ struct kyber
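The bucket scheme described in the commit message above can be modeled in
user space. The arithmetic below is a simplified sketch of the heuristic —
buckets a quarter of the target latency wide, a percentile lookup over the
histogram, and depth scaling by target/p99 — not the exact kernel
implementation; constant names only mirror the ones in the patch:

```c
#include <assert.h>

#define SHIFT           2                       /* KYBER_LATENCY_SHIFT */
#define GOOD_BUCKETS    (1u << SHIFT)           /* 4 buckets <= target */
#define NUM_BUCKETS     (2u << SHIFT)           /* 8 buckets total */

/* Map a latency sample to a bucket; bucket width is target / 4. */
static unsigned int latency_bucket(unsigned long long latency,
                                   unsigned long long target)
{
        unsigned long long width = target >> SHIFT;
        unsigned long long b = latency ? (latency - 1) / width : 0;

        return b < NUM_BUCKETS - 1 ? (unsigned int)b : NUM_BUCKETS - 1;
}

/* Smallest bucket index covering at least `percentile` of the samples. */
static int percentile_bucket(const unsigned int buckets[NUM_BUCKETS],
                             unsigned int percentile)
{
        unsigned int total = 0, seen = 0, i;

        for (i = 0; i < NUM_BUCKETS; i++)
                total += buckets[i];
        if (!total)
                return -1;                      /* no samples yet */
        for (i = 0; i < NUM_BUCKETS - 1; i++) {
                seen += buckets[i];
                if (seen * 100ull >= (unsigned long long)total * percentile)
                        break;
        }
        return (int)i;
}

/*
 * Bucket b means p99 is roughly (b + 1)/4 of the target, so scale the depth
 * by target / p99: new depth = old depth * 4 / (b + 1).
 */
static unsigned int scale_depth(unsigned int depth, int p99_bucket)
{
        unsigned int d = depth * GOOD_BUCKETS / (unsigned int)(p99_bucket + 1);

        return d ? d : 1;
}
```

For example, a p99 that lands in the worst bucket (more than 1 3/4 of the
target) halves the depth, while a p99 at half the target doubles it —
matching the scaling rule stated in the commit message.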
[RFC PATCH 5/5] kyber: add tracepoints
From: Omar Sandoval

When debugging Kyber, it's really useful to know what latencies we've
been having and how the domain depths have been adjusted. Add two
tracepoints, kyber_latency and kyber_adjust, to record that.

Signed-off-by: Omar Sandoval
---
 block/kyber-iosched.c        | 46 +-
 include/trace/events/kyber.h | 76 
 2 files changed, 104 insertions(+), 18 deletions(-)
 create mode 100644 include/trace/events/kyber.h

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index adc8e6393829..d8ddcb4f2435 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -30,6 +30,9 @@
 #include "blk-mq-sched.h"
 #include "blk-mq-tag.h"

+#define CREATE_TRACE_POINTS
+#include
+
 /*
  * Scheduling domains: the device is divided into multiple domains based on the
  * request type.
@@ -42,6 +45,13 @@ enum {
        KYBER_NUM_DOMAINS,
 };

+static const char *kyber_domain_names[] = {
+       [KYBER_READ] = "READ",
+       [KYBER_WRITE] = "WRITE",
+       [KYBER_DISCARD] = "DISCARD",
+       [KYBER_OTHER] = "OTHER",
+};
+
 enum {
        /*
         * In order to prevent starvation of synchronous requests by a flood of
@@ -122,6 +132,11 @@ enum {
        KYBER_IO_LATENCY,
 };

+static const char *kyber_latency_type_names[] = {
+       [KYBER_TOTAL_LATENCY] = "total",
+       [KYBER_IO_LATENCY] = "I/O",
+};
+
 /*
  * Per-cpu latency histograms: total latency and I/O latency for each scheduling
  * domain except for KYBER_OTHER.
@@ -144,6 +159,8 @@ struct kyber_ctx_queue {
 } ____cacheline_aligned_in_smp;

 struct kyber_queue_data {
+       struct request_queue *q;
+
        /*
         * Each scheduling domain has a limited number of in-flight requests
         * device-wide, limited by these tokens.
@@ -249,6 +266,10 @@ static int calculate_percentile(struct kyber_queue_data *kqd,
        }
        memset(buckets, 0, sizeof(kqd->latency_buckets[sched_domain][type]));

+       trace_kyber_latency(kqd->q, kyber_domain_names[sched_domain],
+                           kyber_latency_type_names[type], percentile,
+                           bucket + 1, 1 << KYBER_LATENCY_SHIFT, samples);
+
        return bucket;
 }

@@ -256,8 +277,11 @@ static void kyber_resize_domain(struct kyber_queue_data *kqd,
                                unsigned int sched_domain, unsigned int depth)
 {
        depth = clamp(depth, 1U, kyber_depth[sched_domain]);
-       if (depth != kqd->domain_tokens[sched_domain].sb.depth)
+       if (depth != kqd->domain_tokens[sched_domain].sb.depth) {
                sbitmap_queue_resize(&kqd->domain_tokens[sched_domain], depth);
+               trace_kyber_adjust(kqd->q, kyber_domain_names[sched_domain],
+                                  depth);
+       }
 }

 static void kyber_timer_fn(struct timer_list *t)
@@ -360,6 +384,8 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
        if (!kqd)
                goto err;

+       kqd->q = q;
+
        kqd->cpu_latency = alloc_percpu_gfp(struct kyber_cpu_latency,
                                            GFP_KERNEL | __GFP_ZERO);
        if (!kqd->cpu_latency)
@@ -944,23 +970,7 @@ static int kyber_cur_domain_show(void *data, struct seq_file *m)
        struct blk_mq_hw_ctx *hctx = data;
        struct kyber_hctx_data *khd = hctx->sched_data;

-       switch (khd->cur_domain) {
-       case KYBER_READ:
-               seq_puts(m, "READ\n");
-               break;
-       case KYBER_WRITE:
-               seq_puts(m, "WRITE\n");
-               break;
-       case KYBER_DISCARD:
-               seq_puts(m, "DISCARD\n");
-               break;
-       case KYBER_OTHER:
-               seq_puts(m, "OTHER\n");
-               break;
-       default:
-               seq_printf(m, "%u\n", khd->cur_domain);
-               break;
-       }
+       seq_printf(m, "%s\n", kyber_domain_names[khd->cur_domain]);

        return 0;
 }
diff --git a/include/trace/events/kyber.h b/include/trace/events/kyber.h
new file mode 100644
index ..9bf85dbef492
--- /dev/null
+++ b/include/trace/events/kyber.h
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM kyber
+
+#if !defined(_TRACE_KYBER_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_KYBER_H
+
+#include
+#include
+
+#define DOMAIN_LEN             16
+#define LATENCY_TYPE_LEN       8
+
+TRACE_EVENT(kyber_latency,
+
+       TP_PROTO(struct request_queue *q, const char *domain, const char *type,
+                unsigned int percentile, unsigned int numerator,
+                unsigned int denominator, unsigned int samples),
+
+       TP_ARGS(q, domain, type, percentile, numerator, denominator, samples),
+
+       TP_STRUCT__entry(
+               __field(dev_t, dev)
+               __array(char, domain, DOMAIN_LEN)
+               __array(char, type, LATENCY_TYPE_LEN
[RFC PATCH 3/5] kyber: don't make domain token sbitmap larger than necessary
From: Omar Sandoval

The domain token sbitmaps are currently initialized to the device queue
depth or 256, whichever is larger, and immediately resized to the
maximum depth for that domain (256, 128, or 64 for read, write, and
other, respectively). The sbitmap is never resized larger than that, so
it's unnecessary to allocate a bitmap larger than the maximum depth.
Let's just allocate it to the maximum depth to begin with. This will use
marginally less memory, and more importantly, give us a more appropriate
number of bits per sbitmap word.

Signed-off-by: Omar Sandoval
---
 block/kyber-iosched.c | 15 ++-
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 95d062c07c61..08eb5295c18d 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -40,8 +40,6 @@ enum {
 };

 enum {
-       KYBER_MIN_DEPTH = 256,
-
        /*
         * In order to prevent starvation of synchronous requests by a flood of
         * asynchronous requests, we reserve 25% of requests for synchronous
@@ -305,7 +303,6 @@ static int kyber_bucket_fn(const struct request *rq)
 static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
 {
        struct kyber_queue_data *kqd;
-       unsigned int max_tokens;
        unsigned int shift;
        int ret = -ENOMEM;
        int i;
@@ -320,25 +317,17 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
        if (!kqd->cb)
                goto err_kqd;

-       /*
-        * The maximum number of tokens for any scheduling domain is at least
-        * the queue depth of a single hardware queue. If the hardware doesn't
-        * have many tags, still provide a reasonable number.
-        */
-       max_tokens = max_t(unsigned int, q->tag_set->queue_depth,
-                          KYBER_MIN_DEPTH);
        for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
                WARN_ON(!kyber_depth[i]);
                WARN_ON(!kyber_batch_size[i]);
                ret = sbitmap_queue_init_node(&kqd->domain_tokens[i],
-                                             max_tokens, -1, false, GFP_KERNEL,
-                                             q->node);
+                                             kyber_depth[i], -1, false,
+                                             GFP_KERNEL, q->node);
                if (ret) {
                        while (--i >= 0)
                                sbitmap_queue_free(&kqd->domain_tokens[i]);
                        goto err_cb;
                }
-               sbitmap_queue_resize(&kqd->domain_tokens[i], kyber_depth[i]);
        }

        shift = kyber_sched_tags_shift(kqd);
-- 
2.18.0
[RFC PATCH 1/5] block: move call of scheduler's ->completed_request() hook
From: Omar Sandoval

Commit 4bc6339a583c ("block: move blk_stat_add() to
__blk_mq_end_request()") consolidated some calls using ktime_get() so
we'd only need to call it once. Kyber's ->completed_request() hook also
calls ktime_get(), so let's move it to the same place, too.

Signed-off-by: Omar Sandoval
---
 block/blk-mq-sched.h     | 4 ++--
 block/blk-mq.c           | 5 +++--
 block/kyber-iosched.c    | 5 ++---
 include/linux/elevator.h | 2 +-
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 0cb8f938dff9..74fb6ff9a30d 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -54,12 +54,12 @@ blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
        return true;
 }

-static inline void blk_mq_sched_completed_request(struct request *rq)
+static inline void blk_mq_sched_completed_request(struct request *rq, u64 now)
 {
        struct elevator_queue *e = rq->q->elevator;

        if (e && e->type->ops.mq.completed_request)
-               e->type->ops.mq.completed_request(rq);
+               e->type->ops.mq.completed_request(rq, now);
 }

 static inline void blk_mq_sched_started_request(struct request *rq)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 654b0dc7e001..0c5be0001d0f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -524,6 +524,9 @@ inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
                blk_stat_add(rq, now);
        }

+       if (rq->internal_tag != -1)
+               blk_mq_sched_completed_request(rq, now);
+
        blk_account_io_done(rq, now);

        if (rq->end_io) {
@@ -560,8 +563,6 @@ static void __blk_mq_complete_request(struct request *rq)
        if (!blk_mq_mark_complete(rq))
                return;

-       if (rq->internal_tag != -1)
-               blk_mq_sched_completed_request(rq);

        if (!test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags)) {
                rq->q->softirq_done_fn(rq);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index a1660bafc912..95d062c07c61 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -558,12 +558,12 @@ static void kyber_finish_request(struct request *rq)
        rq_clear_domain_token(kqd, rq);
 }

-static void kyber_completed_request(struct request *rq)
+static void kyber_completed_request(struct request *rq, u64 now)
 {
        struct request_queue *q = rq->q;
        struct kyber_queue_data *kqd = q->elevator->elevator_data;
        unsigned int sched_domain;
-       u64 now, latency, target;
+       u64 latency, target;

        /*
         * Check if this request met our latency goal. If not, quickly gather
@@ -585,7 +585,6 @@ static void kyber_completed_request(struct request *rq)
        if (blk_stat_is_active(kqd->cb))
                return;

-       now = ktime_get_ns();
        if (now < rq->io_start_time_ns)
                return;

diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index a02deea30185..015bb59c0331 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -111,7 +111,7 @@ struct elevator_mq_ops {
        void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, bool);
        struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);
        bool (*has_work)(struct blk_mq_hw_ctx *);
-       void (*completed_request)(struct request *);
+       void (*completed_request)(struct request *, u64);
        void (*started_request)(struct request *);
        void (*requeue_request)(struct request *);
        struct request *(*former_request)(struct request_queue *, struct request *);
-- 
2.18.0
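The signature change in the patch above — sample the clock once in the
completion path and hand the timestamp to the scheduler hook instead of
letting the hook call ktime_get() again — can be modeled in user space.
All names here are illustrative, not the kernel API:

```c
#include <assert.h>

/* Toy elevator ops: the hook now receives the completion timestamp. */
struct model_elevator {
        void (*completed_request)(unsigned long long now);
};

static unsigned long long last_seen_now;

static void model_sched_completed(unsigned long long now)
{
        last_seen_now = now;    /* Kyber would bucket a latency using `now` */
}

static void model_end_request(struct model_elevator *e, unsigned long long now)
{
        /* the stats code (blk_stat_add equivalent) would consume `now` here */
        if (e && e->completed_request)
                e->completed_request(now);      /* one clock sample, shared */
}
```

The design point is simply that both consumers of the timestamp agree on a
single reading, and the hook can compare it against the request's start
times without a second clock read.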
[RFC PATCH 0/5] kyber: better heuristics
From: Omar Sandoval

Hello,

I've spent the past few weeks experimenting with different heuristics
for Kyber in order to deal with some edge cases we've hit here. This
series is my progress so far, implementing less handwavy heuristics
while keeping the same basic mechanisms.

Patches 1 and 2 are preparation. Patch 3 is a minor optimization. Patch
4 is the main change, and includes a detailed description of the new
heuristics. Patch 5 adds tracepoints for debugging.

Please, take a look and try it out.

Thanks!

Omar Sandoval (5):
  block: move call of scheduler's ->completed_request() hook
  block: export blk_stat_enable_accounting()
  kyber: don't make domain token sbitmap larger than necessary
  kyber: implement improved heuristics
  kyber: add tracepoints

 block/blk-mq-sched.h         |   4 +-
 block/blk-mq.c               |   5 +-
 block/blk-stat.c             |   1 +
 block/kyber-iosched.c        | 541 +++
 include/linux/elevator.h     |   2 +-
 include/trace/events/kyber.h |  76 +
 6 files changed, 383 insertions(+), 246 deletions(-)
 create mode 100644 include/trace/events/kyber.h

-- 
2.18.0
[RFC PATCH 2/5] block: export blk_stat_enable_accounting()
From: Omar Sandoval

Kyber will need this in a future change if it is built as a module.

Signed-off-by: Omar Sandoval
---
 block/blk-stat.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-stat.c b/block/blk-stat.c
index 175c143ac5b9..d98f3ad6794e 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -190,6 +190,7 @@ void blk_stat_enable_accounting(struct request_queue *q)
        blk_queue_flag_set(QUEUE_FLAG_STATS, q);
        spin_unlock(&q->stats->lock);
 }
+EXPORT_SYMBOL_GPL(blk_stat_enable_accounting);

 struct blk_queue_stats *blk_alloc_queue_stats(void)
 {
-- 
2.18.0
Re: [PATCH v5 1/3] blkcg: Introduce blkg_root_lookup()
On 8/9/18 2:17 PM, Bart Van Assche wrote:
> On Thu, 2018-08-09 at 12:56 -0700, Tejun Heo wrote:
>> Hello,
>>
>> On Thu, Aug 09, 2018 at 07:53:36AM -0700, Bart Van Assche wrote:
>>> +/**
>>> + * blkg_lookup - look up blkg for the specified request queue
>>> + * @q: request_queue of interest
>>> + *
>>> + * Lookup blkg for @q at the root level. See also blkg_lookup().
>>> + */
>>> +static inline struct blkcg_gq *blkg_root_lookup(struct request_queue *q)
>>> +{
>>> +    struct blkcg_gq *blkg;
>>> +
>>> +    rcu_read_lock();
>>> +    blkg = blkg_lookup(&blkcg_root, q);
>>> +    rcu_read_unlock();
>>> +
>>> +    return blkg;
>>> +}
>>
>> So, root blkg is always available as long as the request_queue is
>> alive. Sth like the following would be simpler?
>>
>> static inline struct blkcg_gq *blk_queue_root_blkg(struct request_queue *q)
>> {
>>     return q->root_blkg;
>> }
>
> That would not only be simpler, it would also avoid that
> blk_queue_root_blkg() returns NULL if bypass mode is enabled and the
> request queue is still associated with the block cgroup controller.
>
> How do you want to proceed with this change?

Do a followup patch?

-- 
Jens Axboe
Re: [PATCH v5 1/3] blkcg: Introduce blkg_root_lookup()
On Thu, 2018-08-09 at 12:56 -0700, Tejun Heo wrote:
> Hello,
>
> On Thu, Aug 09, 2018 at 07:53:36AM -0700, Bart Van Assche wrote:
> > +/**
> > + * blkg_lookup - look up blkg for the specified request queue
> > + * @q: request_queue of interest
> > + *
> > + * Lookup blkg for @q at the root level. See also blkg_lookup().
> > + */
> > +static inline struct blkcg_gq *blkg_root_lookup(struct request_queue *q)
> > +{
> > +    struct blkcg_gq *blkg;
> > +
> > +    rcu_read_lock();
> > +    blkg = blkg_lookup(&blkcg_root, q);
> > +    rcu_read_unlock();
> > +
> > +    return blkg;
> > +}
>
> So, root blkg is always available as long as the request_queue is
> alive. Sth like the following would be simpler?
>
> static inline struct blkcg_gq *blk_queue_root_blkg(struct request_queue *q)
> {
>     return q->root_blkg;
> }

That would not only be simpler, it would also avoid that
blk_queue_root_blkg() returns NULL if bypass mode is enabled and the
request queue is still associated with the block cgroup controller.

How do you want to proceed with this change?

Bart.
Re: [PATCH v5 1/3] blkcg: Introduce blkg_root_lookup()
Hello,

On Thu, Aug 09, 2018 at 07:53:36AM -0700, Bart Van Assche wrote:
> +/**
> + * blkg_lookup - look up blkg for the specified request queue
> + * @q: request_queue of interest
> + *
> + * Lookup blkg for @q at the root level. See also blkg_lookup().
> + */
> +static inline struct blkcg_gq *blkg_root_lookup(struct request_queue *q)
> +{
> +    struct blkcg_gq *blkg;
> +
> +    rcu_read_lock();
> +    blkg = blkg_lookup(&blkcg_root, q);
> +    rcu_read_unlock();
> +
> +    return blkg;
> +}

So, root blkg is always available as long as the request_queue is
alive. Sth like the following would be simpler?

static inline struct blkcg_gq *blk_queue_root_blkg(struct request_queue *q)
{
    return q->root_blkg;
}

-- 
tejun
[PATCH v6 01/12] block, scsi: Introduce request flag RQF_DV
Instead of marking all power management, SCSI domain validation and IDE
preempt requests with RQF_PREEMPT, only mark IDE preempt requests with
RQF_PREEMPT. Use RQF_DV to mark requests submitted by scsi_execute() and
RQF_PM to mark power management requests. Most but not all power
management requests already have the RQF_PM flag set.

Signed-off-by: Bart Van Assche
Cc: Martin K. Petersen
Cc: David S. Miller
Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Jianchao Wang
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
Cc: Alan Stern
---
 block/blk-core.c        | 11 +--
 block/blk-mq-debugfs.c  |  1 +
 block/blk-mq.c          |  2 --
 drivers/ide/ide-pm.c    |  3 ++-
 drivers/scsi/scsi_lib.c | 11 ---
 include/linux/blk-mq.h  |  6 --
 include/linux/blkdev.h  |  5 +++--
 7 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 12550340418d..c4ff58491758 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -913,11 +913,11 @@ EXPORT_SYMBOL(blk_alloc_queue);
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
- * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
+ * @flags: BLK_MQ_REQ_NOWAIT, BLK_MQ_REQ_PM and/or BLK_MQ_REQ_DV
  */
 int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
-       const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
+       const bool preempt = flags & (BLK_MQ_REQ_PM | BLK_MQ_REQ_DV);

        while (true) {
                bool success = false;
@@ -1436,8 +1436,6 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
        blk_rq_set_rl(rq, rl);
        rq->cmd_flags = op;
        rq->rq_flags = rq_flags;
-       if (flags & BLK_MQ_REQ_PREEMPT)
-               rq->rq_flags |= RQF_PREEMPT;

        /* init elvpriv */
        if (rq_flags & RQF_ELVPRIV) {
@@ -1575,7 +1573,7 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
        goto retry;
 }

-/* flags: BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_NOWAIT. */
+/* flags: BLK_MQ_REQ_NOWAIT, BLK_MQ_REQ_PM and/or BLK_MQ_REQ_DV. */
 static struct request *blk_old_get_request(struct request_queue *q,
                                unsigned int op, blk_mq_req_flags_t flags)
 {
@@ -1618,7 +1616,8 @@ struct request *blk_get_request(struct request_queue *q, unsigned int op,
        struct request *req;

        WARN_ON_ONCE(op & REQ_NOWAIT);
-       WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT));
+       WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PM |
+                              BLK_MQ_REQ_DV));

        if (q->mq_ops) {
                req = blk_mq_alloc_request(q, op, flags);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index cb1e6cf7ac48..b4e722311ae1 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -330,6 +330,7 @@ static const char *const rqf_name[] = {
        RQF_NAME(SPECIAL_PAYLOAD),
        RQF_NAME(ZONE_WRITE_LOCKED),
        RQF_NAME(MQ_POLL_SLEPT),
+       RQF_NAME(DV),
 };
 #undef RQF_NAME

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 72a0033ccee9..2a0eb058ba5a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -300,8 +300,6 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
        rq->rq_flags = rq_flags;
        rq->cpu = -1;
        rq->cmd_flags = op;
-       if (data->flags & BLK_MQ_REQ_PREEMPT)
-               rq->rq_flags |= RQF_PREEMPT;
        if (blk_queue_io_stat(data->q))
                rq->rq_flags |= RQF_IO_STAT;
        INIT_LIST_HEAD(&rq->queuelist);
diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
index 59217aa1d1fb..8bf378164aee 100644
--- a/drivers/ide/ide-pm.c
+++ b/drivers/ide/ide-pm.c
@@ -90,8 +90,9 @@ int generic_ide_resume(struct device *dev)
        }

        memset(&rqpm, 0, sizeof(rqpm));
-       rq = blk_get_request(drive->queue, REQ_OP_DRV_IN, BLK_MQ_REQ_PREEMPT);
+       rq = blk_get_request(drive->queue, REQ_OP_DRV_IN, BLK_MQ_REQ_PM);
        ide_req(rq)->type = ATA_PRIV_PM_RESUME;
+       rq->rq_flags |= RQF_PM;
        rq->special = &rqpm;
        rqpm.pm_step = IDE_PM_START_RESUME;
        rqpm.pm_state = PM_EVENT_ON;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9cb9a166fa0c..a65a03e2bcc4 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -263,11 +263,16 @@ int __scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 {
        struct request *req;
        struct scsi_request *rq;
+       blk_mq_req_flags_t blk_mq_req_flags = 0;
        int ret = DRIVER_ERROR << 24;

+       if (rq_flags & RQF_PM)
+               blk_mq_req_flags |= BLK_MQ_REQ_PM;
+       rq_flags |= RQF_DV;
+       blk_mq_req_flags |= BLK_MQ_REQ_DV;
        req = blk_get_request(sdev->request_queue,
                        data_direction == DMA_TO_DEVICE ?
-                       REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, BLK_MQ_REQ_PREEMPT);
+                       REQ_OP_SCSI_OUT : REQ_
[PATCH v6 06/12] scsi: Reallow SPI domain validation during system suspend
Now that SCSI power management and SPI domain validation use different mechanisms for blocking SCSI command execution, remove the mechanism again that prevents system suspend during SPI domain validation. This patch reverts 203f8c250e21 ("block, scsi: Fix race between SPI domain validation and system suspend"). Signed-off-by: Bart Van Assche Cc: Martin K. Petersen Cc: Woody Suwalski Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- drivers/scsi/scsi_lib.c | 37 +-- drivers/scsi/scsi_sysfs.c | 1 + drivers/scsi/scsi_transport_spi.c | 15 ++--- include/scsi/scsi_device.h| 1 + 4 files changed, 39 insertions(+), 15 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index d466f846043f..89f790e73ed6 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1366,6 +1366,7 @@ scsi_prep_state_check(struct scsi_device *sdev, struct request *req) ret = BLKPREP_DEFER; break; case SDEV_SUSPENDED: + case SDEV_QUIESCED_SUSPENDED: /* Process RQF_PM requests only. 
*/ if (!(req->rq_flags & RQF_PM)) ret = BLKPREP_DEFER; @@ -2683,6 +2684,17 @@ scsi_device_set_state(struct scsi_device *sdev, enum scsi_device_state state) break; case SDEV_QUIESCE: + switch (oldstate) { + case SDEV_RUNNING: + case SDEV_QUIESCED_SUSPENDED: + case SDEV_OFFLINE: + case SDEV_TRANSPORT_OFFLINE: + break; + default: + goto illegal; + } + break; + case SDEV_SUSPENDED: switch (oldstate) { case SDEV_RUNNING: @@ -2694,6 +2706,15 @@ scsi_device_set_state(struct scsi_device *sdev, enum scsi_device_state state) } break; + case SDEV_QUIESCED_SUSPENDED: + switch (oldstate) { + case SDEV_QUIESCE: + break; + default: + goto illegal; + } + break; + case SDEV_OFFLINE: case SDEV_TRANSPORT_OFFLINE: switch (oldstate) { @@ -2994,8 +3015,13 @@ scsi_device_suspend(struct scsi_device *sdev) struct request_queue *q = sdev->request_queue; int err; - if (sdev->sdev_state == SDEV_SUSPENDED) + switch (sdev->sdev_state) { + case SDEV_SUSPENDED: + case SDEV_QUIESCED_SUSPENDED: return 0; + default: + break; + } blk_set_pm_only(q); @@ -3011,6 +3037,8 @@ scsi_device_suspend(struct scsi_device *sdev) mutex_lock(&sdev->state_mutex); err = scsi_device_set_state(sdev, SDEV_SUSPENDED); + if (err) + err = scsi_device_set_state(sdev, SDEV_QUIESCED_SUSPENDED); if (err) blk_clear_pm_only(q); mutex_unlock(&sdev->state_mutex); @@ -3031,8 +3059,13 @@ void scsi_device_unsuspend(struct scsi_device *sdev) { mutex_lock(&sdev->state_mutex); blk_clear_pm_only(sdev->request_queue); - if (sdev->sdev_state == SDEV_SUSPENDED) + switch (sdev->sdev_state) { + case SDEV_SUSPENDED: + case SDEV_QUIESCED_SUSPENDED: scsi_device_set_state(sdev, SDEV_RUNNING); + default: + break; + } mutex_unlock(&sdev->state_mutex); } EXPORT_SYMBOL(scsi_device_unsuspend); diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c index 496c5eff4859..8c152e35ff77 100644 --- a/drivers/scsi/scsi_sysfs.c +++ b/drivers/scsi/scsi_sysfs.c @@ -37,6 +37,7 @@ static const struct { { SDEV_DEL, "deleted" }, { SDEV_QUIESCE, "quiesce" 
}, { SDEV_SUSPENDED, "suspended" }, + { SDEV_QUIESCED_SUSPENDED, "quiesced-suspended" }, { SDEV_OFFLINE, "offline" }, { SDEV_TRANSPORT_OFFLINE, "transport-offline" }, { SDEV_BLOCK, "blocked" }, diff --git a/drivers/scsi/scsi_transport_spi.c b/drivers/scsi/scsi_transport_spi.c index 16bec4884249..582b18efc3fb 100644 --- a/drivers/scsi/scsi_transport_spi.c +++ b/drivers/scsi/scsi_transport_spi.c @@ -26,7 +26,6 @@ #include #include #include -#include #include #include "scsi_priv.h" #include @@ -1008,19 +1007,11 @@ spi_dv_device(struct scsi_device *sdev) u8 *buffer; const int len = SPI_MAX_ECHO_BUFFER_SIZE*2; - /* -* Because this function and the power management code both call -* scsi_device_quiesce(), it is not safe to perform domain validation -* while suspend or resume is in progress. Hence the -* lock/unlock_system_sleep() calls. -*/ - lock_system_sleep(); - if (unlikely(starget->spi_dv_context)) - got
[PATCH v6 03/12] scsi: Only set RQF_DV for requests used for domain validation
Instead of setting RQF_DV for all requests submitted by scsi_execute(), only set that flag for requests that are used for domain validation. Move the SCSI Parallel Interface (SPI) domain validation status from the transport data to struct scsi_target such that this status information can be accessed easily from inside scsi_execute(). This patch prevents that e.g. event checking can occur during domain validation. Signed-off-by: Bart Van Assche Cc: Martin K. Petersen Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- drivers/scsi/scsi_lib.c | 6 -- drivers/scsi/scsi_transport_spi.c | 9 +++-- include/scsi/scsi_device.h| 1 + include/scsi/scsi_transport_spi.h | 1 - 4 files changed, 8 insertions(+), 9 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 8685704f6c8b..4d7411a7985f 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -268,8 +268,10 @@ int __scsi_execute(struct scsi_device *sdev, const unsigned char *cmd, if (rq_flags & RQF_PM) blk_mq_req_flags |= BLK_MQ_REQ_PM; - rq_flags |= RQF_DV; - blk_mq_req_flags |= BLK_MQ_REQ_DV; + if (scsi_target(sdev)->spi_dv_context == current) { + rq_flags |= RQF_DV; + blk_mq_req_flags |= BLK_MQ_REQ_DV; + } req = blk_get_request(sdev->request_queue, data_direction == DMA_TO_DEVICE ? 
REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, blk_mq_req_flags); diff --git a/drivers/scsi/scsi_transport_spi.c b/drivers/scsi/scsi_transport_spi.c index 2ca150b16764..bf6b18768e79 100644 --- a/drivers/scsi/scsi_transport_spi.c +++ b/drivers/scsi/scsi_transport_spi.c @@ -66,7 +66,6 @@ static struct { }; /* Private data accessors (keep these out of the header file) */ -#define spi_dv_in_progress(x) (((struct spi_transport_attrs *)&(x)->starget_data)->dv_in_progress) #define spi_dv_mutex(x) (((struct spi_transport_attrs *)&(x)->starget_data)->dv_mutex) struct spi_internal { @@ -268,7 +267,6 @@ static int spi_setup_transport_attrs(struct transport_container *tc, spi_pcomp_en(starget) = 0; spi_hold_mcs(starget) = 0; spi_dv_pending(starget) = 0; - spi_dv_in_progress(starget) = 0; spi_initial_dv(starget) = 0; mutex_init(&spi_dv_mutex(starget)); @@ -1018,14 +1016,12 @@ spi_dv_device(struct scsi_device *sdev) */ lock_system_sleep(); - if (unlikely(spi_dv_in_progress(starget))) + if (unlikely(starget->spi_dv_context)) goto unlock; if (unlikely(scsi_device_get(sdev))) goto unlock; - spi_dv_in_progress(starget) = 1; - buffer = kzalloc(len, GFP_KERNEL); if (unlikely(!buffer)) @@ -1043,7 +1039,9 @@ spi_dv_device(struct scsi_device *sdev) starget_printk(KERN_INFO, starget, "Beginning Domain Validation\n"); + starget->spi_dv_context = current; spi_dv_device_internal(sdev, buffer); + starget->spi_dv_context = NULL; starget_printk(KERN_INFO, starget, "Ending Domain Validation\n"); @@ -1057,7 +1055,6 @@ spi_dv_device(struct scsi_device *sdev) out_free: kfree(buffer); out_put: - spi_dv_in_progress(starget) = 0; scsi_device_put(sdev); unlock: unlock_system_sleep(); diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h index 202f4d6a4342..440834f4252e 100644 --- a/include/scsi/scsi_device.h +++ b/include/scsi/scsi_device.h @@ -310,6 +310,7 @@ struct scsi_target { charscsi_level; enum scsi_target_state state; + struct task_struct *spi_dv_context; void*hostdata; /* available to 
low-level driver */ unsigned long starget_data[0]; /* for the transport */ /* starget_data must be the last element */ diff --git a/include/scsi/scsi_transport_spi.h b/include/scsi/scsi_transport_spi.h index a4fa52b4d5c5..36934ac0a5ca 100644 --- a/include/scsi/scsi_transport_spi.h +++ b/include/scsi/scsi_transport_spi.h @@ -56,7 +56,6 @@ struct spi_transport_attrs { unsigned int support_qas; /* supports quick arbitration and selection */ /* Private Fields */ unsigned int dv_pending:1; /* Internal flag: DV Requested */ - unsigned int dv_in_progress:1; /* Internal: DV started */ struct mutex dv_mutex; /* semaphore to serialise dv */ }; -- 2.18.0
[PATCH v6 04/12] scsi: Introduce the SDEV_SUSPENDED device status
Instead of using the SDEV_QUIESCE state for both SCSI domain validation and runtime suspend, use the SDEV_QUIESCE state only for SCSI domain validation. Keep using scsi_device_quiesce() and scsi_device_unquiesce() for SCSI domain validation. Add new functions scsi_device_suspend() and scsi_device_unsuspend() for power management. Rename scsi_device_resume() into scsi_device_unquiesce() to avoid confusion. Signed-off-by: Bart Van Assche Cc: Martin K. Petersen Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- drivers/scsi/scsi_lib.c | 138 -- drivers/scsi/scsi_pm.c| 6 +- drivers/scsi/scsi_sysfs.c | 1 + drivers/scsi/scsi_transport_spi.c | 2 +- include/scsi/scsi_device.h| 13 +-- 5 files changed, 125 insertions(+), 35 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 4d7411a7985f..eb914d8e17fd 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1365,6 +1365,11 @@ scsi_prep_state_check(struct scsi_device *sdev, struct request *req) if (!(req->rq_flags & (RQF_PM | RQF_DV))) ret = BLKPREP_DEFER; break; + case SDEV_SUSPENDED: + /* Process RQF_PM requests only. */ + if (!(req->rq_flags & RQF_PM)) + ret = BLKPREP_DEFER; + break; case SDEV_CANCEL: /* Only allow RQF_PM requests. */ if (!(req->rq_flags & RQF_PM)) @@ -2669,6 +2674,7 @@ scsi_device_set_state(struct scsi_device *sdev, enum scsi_device_state state) case SDEV_OFFLINE: case SDEV_TRANSPORT_OFFLINE: case SDEV_QUIESCE: + case SDEV_SUSPENDED: case SDEV_BLOCK: break; default: @@ -2677,6 +2683,7 @@ scsi_device_set_state(struct scsi_device *sdev, enum scsi_device_state state) break; case SDEV_QUIESCE: + case SDEV_SUSPENDED: switch (oldstate) { case SDEV_RUNNING: case SDEV_OFFLINE: @@ -2970,19 +2977,103 @@ static void scsi_wait_for_queuecommand(struct scsi_device *sdev) } /** - * scsi_device_quiesce - Block user issued commands. - * @sdev: scsi device to quiesce. 
+ * scsi_device_suspend - only process RQF_PM requests + * @sdev: scsi device to suspend. + * + * This works by trying to transition to the SDEV_SUSPENDED state (which must be + * a legal transition). When the device is in this state, only RQF_DV + * requests will be accepted, all others will be deferred. + * + * Must be called with user context, may sleep. + * + * Returns zero if unsuccessful or an error if not. + */ +int +scsi_device_suspend(struct scsi_device *sdev) +{ + struct request_queue *q = sdev->request_queue; + int err; + + if (sdev->sdev_state == SDEV_SUSPENDED) + return 0; + + blk_set_preempt_only(q); + + blk_mq_freeze_queue(q); + /* +* Ensure that the effect of blk_set_preempt_only() will be visible +* for percpu_ref_tryget() callers that occur after the queue +* unfreeze even if the queue was already frozen before this function +* was called. See also https://lwn.net/Articles/573497/. +*/ + synchronize_rcu(); + blk_mq_unfreeze_queue(q); + + mutex_lock(&sdev->state_mutex); + err = scsi_device_set_state(sdev, SDEV_SUSPENDED); + if (err) + blk_clear_preempt_only(q); + mutex_unlock(&sdev->state_mutex); + + return err; +} +EXPORT_SYMBOL(scsi_device_suspend); + +/** + * scsi_device_unsuspend - unsuspend processing non-RQF_DV requests + * @sdev: scsi device to unsuspend. * - * This works by trying to transition to the SDEV_QUIESCE state - * (which must be a legal transition). When the device is in this - * state, only special requests will be accepted, all others will - * be deferred. Since special requests may also be requeued requests, - * a successful return doesn't guarantee the device will be - * totally quiescent. + * Moves the device from suspended back to running and restarts the queues. * - * Must be called with user context, may sleep. + * Must be called with user context, may sleep. 
+ */ +void scsi_device_unsuspend(struct scsi_device *sdev) +{ + mutex_lock(&sdev->state_mutex); + blk_clear_preempt_only(sdev->request_queue); + if (sdev->sdev_state == SDEV_SUSPENDED) + scsi_device_set_state(sdev, SDEV_RUNNING); + mutex_unlock(&sdev->state_mutex); +} +EXPORT_SYMBOL(scsi_device_unsuspend); + +static void +device_suspend_fn(struct scsi_device *sdev, void *data) +{ + scsi_device_suspend(sdev); +} + +void +scsi_target_suspend(struct scsi_target *starget) +{ + starget_for_each_device(starget, NUL
[PATCH v6 08/12] block, scsi: Introduce blk_pm_runtime_exit()
Since it is possible to unbind a SCSI ULD and since unbinding removes the
association between a request queue and struct device, the q->dev pointer
has to be reset during unbind. Hence introduce a function in the block
layer that clears request_queue.dev.

Signed-off-by: Bart Van Assche
Cc: Martin K. Petersen
Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Jianchao Wang
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
Cc: Alan Stern
---
 block/blk-pm.c         | 18 ++
 drivers/scsi/sd.c      |  9 -
 drivers/scsi/sr.c      |  3 ++-
 include/linux/blk-pm.h |  2 ++
 4 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/block/blk-pm.c b/block/blk-pm.c
index 9b636960d285..bf8532da952d 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -40,6 +40,24 @@ void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
 }
 EXPORT_SYMBOL(blk_pm_runtime_init);

+/**
+ * blk_pm_runtime_exit - runtime PM exit routine
+ * @q: the queue of the device
+ *
+ * This function should be called from the device_driver.remove() callback
+ * function to avoid that further calls to runtime power management functions
+ * occur.
+ */
+void blk_pm_runtime_exit(struct request_queue *q)
+{
+	if (!q->dev)
+		return;
+
+	pm_runtime_get_sync(q->dev);
+	q->dev = NULL;
+}
+EXPORT_SYMBOL(blk_pm_runtime_exit);
+
 /**
  * blk_pre_runtime_suspend - Pre runtime suspend check
  * @q: the queue of the device
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 69ab459abb98..5537762dfcfd 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3420,12 +3420,11 @@ static int sd_probe(struct device *dev)
  **/
 static int sd_remove(struct device *dev)
 {
-	struct scsi_disk *sdkp;
-	dev_t devt;
+	struct scsi_disk *sdkp = dev_get_drvdata(dev);
+	struct scsi_device *sdp = sdkp->device;
+	dev_t devt = disk_devt(sdkp->disk);

-	sdkp = dev_get_drvdata(dev);
-	devt = disk_devt(sdkp->disk);
-	scsi_autopm_get_device(sdkp->device);
+	blk_pm_runtime_exit(sdp->request_queue);

 	async_synchronize_full_domain(&scsi_sd_pm_domain);
 	async_synchronize_full_domain(&scsi_sd_probe_domain);
diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index de4413e66eca..476987f7ed48 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -1002,8 +1002,9 @@ static void sr_kref_release(struct kref *kref)
 static int sr_remove(struct device *dev)
 {
 	struct scsi_cd *cd = dev_get_drvdata(dev);
+	struct scsi_device *sdev = cd->device;

-	scsi_autopm_get_device(cd->device);
+	blk_pm_runtime_exit(sdev->request_queue);

 	del_gendisk(cd->disk);
 	dev_set_drvdata(dev, NULL);
diff --git a/include/linux/blk-pm.h b/include/linux/blk-pm.h
index b80c65aba249..6d654f41acbf 100644
--- a/include/linux/blk-pm.h
+++ b/include/linux/blk-pm.h
@@ -11,6 +11,7 @@ struct request_queue;
  */
 #ifdef CONFIG_PM
 extern void blk_pm_runtime_init(struct request_queue *q, struct device *dev);
+extern void blk_pm_runtime_exit(struct request_queue *q);
 extern int blk_pre_runtime_suspend(struct request_queue *q);
 extern void blk_post_runtime_suspend(struct request_queue *q, int err);
 extern void blk_pre_runtime_resume(struct request_queue *q);
@@ -19,6 +20,7 @@ extern void blk_set_runtime_active(struct request_queue *q);
 #else
 static inline void blk_pm_runtime_init(struct request_queue *q,
				       struct device *dev) {}
+static inline void blk_pm_runtime_exit(struct request_queue *q) {}
 #endif

 #endif /* _BLK_PM_H_ */
--
2.18.0
[PATCH v6 05/12] block, scsi: Rename QUEUE_FLAG_PREEMPT_ONLY into DV_ONLY and introduce PM_ONLY
Instead of having one queue state (PREEMPT_ONLY) in which both power management and SCSI domain validation requests are processed, rename the PREEMPT_ONLY state into DV_ONLY and introduce a new queue flag QUEUE_FLAG_PM_ONLY. Provide the new functions blk_set_pm_only() and blk_clear_pm_only() for power management and blk_set_dv_only() and blk_clear_dv_only() functions for SCSI domain validation purposes. Signed-off-by: Bart Van Assche Cc: Martin K. Petersen Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c| 65 ++--- block/blk-mq-debugfs.c | 3 +- drivers/scsi/scsi_lib.c | 16 +- include/linux/blkdev.h | 13 + 4 files changed, 66 insertions(+), 31 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index c4ff58491758..401f4927a8db 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -421,24 +421,44 @@ void blk_sync_queue(struct request_queue *q) EXPORT_SYMBOL(blk_sync_queue); /** - * blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY + * blk_set_dv_only - set QUEUE_FLAG_DV_ONLY * @q: request queue pointer * - * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not + * Returns the previous value of the DV_ONLY flag - 0 if the flag was not * set and 1 if the flag was already set. 
*/ -int blk_set_preempt_only(struct request_queue *q) +int blk_set_dv_only(struct request_queue *q) { - return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q); + return blk_queue_flag_test_and_set(QUEUE_FLAG_DV_ONLY, q); } -EXPORT_SYMBOL_GPL(blk_set_preempt_only); +EXPORT_SYMBOL_GPL(blk_set_dv_only); -void blk_clear_preempt_only(struct request_queue *q) +void blk_clear_dv_only(struct request_queue *q) { - blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q); + blk_queue_flag_clear(QUEUE_FLAG_DV_ONLY, q); wake_up_all(&q->mq_freeze_wq); } -EXPORT_SYMBOL_GPL(blk_clear_preempt_only); +EXPORT_SYMBOL_GPL(blk_clear_dv_only); + +/** + * blk_set_pm_only - set QUEUE_FLAG_PM_ONLY + * @q: request queue pointer + * + * Returns the previous value of the PM_ONLY flag - 0 if the flag was not + * set and 1 if the flag was already set. + */ +int blk_set_pm_only(struct request_queue *q) +{ + return blk_queue_flag_test_and_set(QUEUE_FLAG_PM_ONLY, q); +} +EXPORT_SYMBOL_GPL(blk_set_pm_only); + +void blk_clear_pm_only(struct request_queue *q) +{ + blk_queue_flag_clear(QUEUE_FLAG_PM_ONLY, q); + wake_up_all(&q->mq_freeze_wq); +} +EXPORT_SYMBOL_GPL(blk_clear_pm_only); /** * __blk_run_queue_uncond - run a queue whether or not it has been stopped @@ -910,6 +930,20 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask) } EXPORT_SYMBOL(blk_alloc_queue); +/* + * Whether or not blk_queue_enter() should proceed. RQF_PM requests are always + * allowed. RQF_DV requests are allowed if the PM_ONLY queue flag has not been + * set. Other requests are only allowed if neither PM_ONLY nor DV_ONLY has been + * set. 
+ */ +static inline bool blk_enter_allowed(struct request_queue *q, +blk_mq_req_flags_t flags) +{ + return flags & BLK_MQ_REQ_PM || + (!blk_queue_pm_only(q) && +(flags & BLK_MQ_REQ_DV || !blk_queue_dv_only(q))); +} + /** * blk_queue_enter() - try to increase q->q_usage_counter * @q: request queue pointer @@ -917,23 +951,20 @@ EXPORT_SYMBOL(blk_alloc_queue); */ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) { - const bool preempt = flags & (BLK_MQ_REQ_PM | BLK_MQ_REQ_DV); - while (true) { bool success = false; rcu_read_lock(); if (percpu_ref_tryget_live(&q->q_usage_counter)) { /* -* The code that sets the PREEMPT_ONLY flag is -* responsible for ensuring that that flag is globally -* visible before the queue is unfrozen. +* The code that sets the PM_ONLY or DV_ONLY queue +* flags is responsible for ensuring that that flag is +* globally visible before the queue is unfrozen. */ - if (preempt || !blk_queue_preempt_only(q)) { + if (blk_enter_allowed(q, flags)) success = true; - } else { + else percpu_ref_put(&q->q_usage_counter); - } } rcu_read_unlock(); @@ -954,7 +985,7 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) wait_event(q->mq_freeze_wq, (atomic_read(&q->mq_freeze_depth) == 0 && - (preempt || !blk_queue_preempt_only(q))) || +
[PATCH v6 07/12] block: Move power management code into a new source file
Move the code for runtime power management from blk-core.c into the new source file blk-pm.c. Move the corresponding declarations from into . For CONFIG_PM=n, leave out the declarations of the functions that are not used in that mode. This patch not only reduces the number of #ifdefs in the block layer core code but also reduces the size of header file and hence should help to reduce the build time of the Linux kernel if CONFIG_PM is not defined. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/Kconfig | 5 ++ block/Makefile | 1 + block/blk-core.c | 196 + block/blk-pm.c | 188 +++ block/blk-pm.h | 43 + block/elevator.c | 22 + drivers/scsi/scsi_pm.c | 1 + drivers/scsi/sd.c | 1 + drivers/scsi/sr.c | 1 + include/linux/blk-pm.h | 24 + include/linux/blkdev.h | 23 - 11 files changed, 266 insertions(+), 239 deletions(-) create mode 100644 block/blk-pm.c create mode 100644 block/blk-pm.h create mode 100644 include/linux/blk-pm.h diff --git a/block/Kconfig b/block/Kconfig index 1f2469a0123c..e213d90a5e64 100644 --- a/block/Kconfig +++ b/block/Kconfig @@ -228,4 +228,9 @@ config BLK_MQ_RDMA depends on BLOCK && INFINIBAND default y +config BLK_PM + bool + depends on BLOCK && PM + default y + source block/Kconfig.iosched diff --git a/block/Makefile b/block/Makefile index 572b33f32c07..27eac600474f 100644 --- a/block/Makefile +++ b/block/Makefile @@ -37,3 +37,4 @@ obj-$(CONFIG_BLK_WBT) += blk-wbt.o obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o obj-$(CONFIG_BLK_SED_OPAL) += sed-opal.o +obj-$(CONFIG_BLK_PM) += blk-pm.o diff --git a/block/blk-core.c b/block/blk-core.c index 401f4927a8db..3770a36e85be 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -42,6 +42,7 @@ #include "blk.h" #include "blk-mq.h" #include "blk-mq-sched.h" +#include "blk-pm.h" #include "blk-rq-qos.h" #ifdef CONFIG_DEBUG_FS @@ -1757,16 +1758,6 @@ 
void part_round_stats(struct request_queue *q, int cpu, struct hd_struct *part) } EXPORT_SYMBOL_GPL(part_round_stats); -#ifdef CONFIG_PM -static void blk_pm_put_request(struct request *rq) -{ - if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending) - pm_runtime_mark_last_busy(rq->q->dev); -} -#else -static inline void blk_pm_put_request(struct request *rq) {} -#endif - void __blk_put_request(struct request_queue *q, struct request *req) { req_flags_t rq_flags = req->rq_flags; @@ -3783,191 +3774,6 @@ void blk_finish_plug(struct blk_plug *plug) } EXPORT_SYMBOL(blk_finish_plug); -#ifdef CONFIG_PM -/** - * blk_pm_runtime_init - Block layer runtime PM initialization routine - * @q: the queue of the device - * @dev: the device the queue belongs to - * - * Description: - *Initialize runtime-PM-related fields for @q and start auto suspend for - *@dev. Drivers that want to take advantage of request-based runtime PM - *should call this function after @dev has been initialized, and its - *request queue @q has been allocated, and runtime PM for it can not happen - *yet(either due to disabled/forbidden or its usage_count > 0). In most - *cases, driver should call this function before any I/O has taken place. - * - *This function takes care of setting up using auto suspend for the device, - *the autosuspend delay is set to -1 to make runtime suspend impossible - *until an updated value is either set by user or by driver. Drivers do - *not need to touch other autosuspend settings. - * - *The block layer runtime PM is request based, so only works for drivers - *that use request as their IO unit instead of those directly use bio's. 
- */ -void blk_pm_runtime_init(struct request_queue *q, struct device *dev) -{ - /* Don't enable runtime PM for blk-mq until it is ready */ - if (q->mq_ops) { - pm_runtime_disable(dev); - return; - } - - q->dev = dev; - q->rpm_status = RPM_ACTIVE; - pm_runtime_set_autosuspend_delay(q->dev, -1); - pm_runtime_use_autosuspend(q->dev); -} -EXPORT_SYMBOL(blk_pm_runtime_init); - -/** - * blk_pre_runtime_suspend - Pre runtime suspend check - * @q: the queue of the device - * - * Description: - *This function will check if runtime suspend is allowed for the device - *by examining if there are any requests pending in the queue. If there - *are requests pending, the device can not be runtime suspended; otherwise, - *the queue's status will be updated to SUSPENDING and the driver can - *proceed to suspend the device. - * - *For the not allowed case, we mark last busy for the device so that - *run
[PATCH v6 02/12] scsi: Alter handling of RQF_DV requests
Process all requests in state SDEV_CREATED instead of only RQF_DV requests. This does not change the behavior of the SCSI core because the SCSI device state is modified into another state before SCSI devices become visible in sysfs and before any device nodes are created in /dev. Do not process RQF_DV requests in state SDEV_CANCEL because only power management requests should be processed in this state. Handle all SCSI device states explicitly in scsi_prep_state_check() instead of using a default case in the switch/case statement in scsi_prep_state_check(). This allows the compiler to verify whether all states have been handled. Signed-off-by: Bart Van Assche Cc: Martin K. Petersen Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- drivers/scsi/scsi_lib.c | 78 +++-- 1 file changed, 36 insertions(+), 42 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index a65a03e2bcc4..8685704f6c8b 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1331,49 +1331,43 @@ scsi_prep_state_check(struct scsi_device *sdev, struct request *req) * If the device is not in running state we will reject some * or all commands. */ - if (unlikely(sdev->sdev_state != SDEV_RUNNING)) { - switch (sdev->sdev_state) { - case SDEV_OFFLINE: - case SDEV_TRANSPORT_OFFLINE: - /* -* If the device is offline we refuse to process any -* commands. The device must be brought online -* before trying any recovery commands. -*/ - sdev_printk(KERN_ERR, sdev, - "rejecting I/O to offline device\n"); - ret = BLKPREP_KILL; - break; - case SDEV_DEL: - /* -* If the device is fully deleted, we refuse to -* process any commands as well. 
-*/ - sdev_printk(KERN_ERR, sdev, - "rejecting I/O to dead device\n"); - ret = BLKPREP_KILL; - break; - case SDEV_BLOCK: - case SDEV_CREATED_BLOCK: + switch (sdev->sdev_state) { + case SDEV_RUNNING: + case SDEV_CREATED: + break; + case SDEV_OFFLINE: + case SDEV_TRANSPORT_OFFLINE: + /* +* If the device is offline we refuse to process any commands. +* The device must be brought online before trying any +* recovery commands. +*/ + sdev_printk(KERN_ERR, sdev, + "rejecting I/O to offline device\n"); + ret = BLKPREP_KILL; + break; + case SDEV_DEL: + /* +* If the device is fully deleted, we refuse to process any +* commands as well. +*/ + sdev_printk(KERN_ERR, sdev, "rejecting I/O to dead device\n"); + ret = BLKPREP_KILL; + break; + case SDEV_BLOCK: + case SDEV_CREATED_BLOCK: + ret = BLKPREP_DEFER; + break; + case SDEV_QUIESCE: + /* Only allow RQF_PM and RQF_DV requests. */ + if (!(req->rq_flags & (RQF_PM | RQF_DV))) ret = BLKPREP_DEFER; - break; - case SDEV_QUIESCE: - /* -* If the devices is blocked we defer normal commands. -*/ - if (req && !(req->rq_flags & (RQF_PM | RQF_DV))) - ret = BLKPREP_DEFER; - break; - default: - /* -* For any other not fully online state we only allow -* special commands. In particular any user initiated -* command is not allowed. -*/ - if (req && !(req->rq_flags & (RQF_PM | RQF_DV))) - ret = BLKPREP_KILL; - break; - } + break; + case SDEV_CANCEL: + /* Only allow RQF_PM requests. */ + if (!(req->rq_flags & RQF_PM)) + ret = BLKPREP_KILL; + break; } return ret; } -- 2.18.0
[PATCH v6 12/12] blk-mq: Enable support for runtime power management
Now that the blk-mq core processes power management requests (marked with
RQF_PREEMPT) in other states than RPM_ACTIVE, enable runtime power
management for blk-mq.

Signed-off-by: Bart Van Assche
Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Jianchao Wang
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
Cc: Alan Stern
---
 block/blk-pm.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/block/blk-pm.c b/block/blk-pm.c
index 977beffdccd2..e76d85dcdd67 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -30,12 +30,6 @@
  */
 void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
 {
-	/* Don't enable runtime PM for blk-mq until it is ready */
-	if (q->mq_ops) {
-		pm_runtime_disable(dev);
-		return;
-	}
-
 	q->dev = dev;
 	q->rpm_status = RPM_ACTIVE;
 	pm_runtime_set_autosuspend_delay(q->dev, -1);
--
2.18.0
[PATCH v6 11/12] block: Change the runtime power management approach (2/2)
Instead of allowing requests that are not power management requests to enter the queue in runtime suspended status (RPM_SUSPENDED), make the blk_get_request() caller block. This change fixes a starvation issue: it is now guaranteed that power management requests will be executed no matter how many blk_get_request() callers are waiting. Instead of maintaining the q->nr_pending counter, rely on q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a request finishes instead of only if the queue depth drops to zero. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c | 37 ++--- block/blk-pm.c | 72 2 files changed, 75 insertions(+), 34 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index f30545fb2de2..b0bb6b5320fe 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2772,30 +2772,6 @@ void blk_account_io_done(struct request *req, u64 now) } } -#ifdef CONFIG_PM -/* - * Don't process normal requests when queue is suspended - * or in the process of suspending/resuming - */ -static bool blk_pm_allow_request(struct request *rq) -{ - switch (rq->q->rpm_status) { - case RPM_RESUMING: - case RPM_SUSPENDING: - return rq->rq_flags & RQF_PM; - case RPM_SUSPENDED: - return false; - default: - return true; - } -} -#else -static bool blk_pm_allow_request(struct request *rq) -{ - return true; -} -#endif - void blk_account_io_start(struct request *rq, bool new_io) { struct hd_struct *part; @@ -2841,11 +2817,14 @@ static struct request *elv_next_request(struct request_queue *q) while (1) { list_for_each_entry(rq, &q->queue_head, queuelist) { - if (blk_pm_allow_request(rq)) - return rq; - - if (rq->rq_flags & RQF_SOFTBARRIER) - break; +#ifdef CONFIG_PM + /* +* If a request gets queued in state RPM_SUSPENDED +* then that's a kernel bug. 
+*/ + WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED); +#endif + return rq; } /* diff --git a/block/blk-pm.c b/block/blk-pm.c index bf8532da952d..977beffdccd2 100644 --- a/block/blk-pm.c +++ b/block/blk-pm.c @@ -1,8 +1,11 @@ // SPDX-License-Identifier: GPL-2.0 +#include #include #include #include +#include "blk-mq.h" +#include "blk-mq-tag.h" /** * blk_pm_runtime_init - Block layer runtime PM initialization routine @@ -58,6 +61,36 @@ void blk_pm_runtime_exit(struct request_queue *q) } EXPORT_SYMBOL(blk_pm_runtime_exit); +struct in_flight_data { + struct request_queue*q; + int in_flight; +}; + +static void blk_count_in_flight(struct blk_mq_hw_ctx *hctx, struct request *rq, + void *priv, bool reserved) +{ + struct in_flight_data *in_flight = priv; + + if (rq->q == in_flight->q) + in_flight->in_flight++; +} + +/* + * Count the number of requests that are in flight for request queue @q. Use + * @q->nr_pending for legacy queues. Iterate over the tag set for blk-mq + * queues. Use blk_mq_queue_tag_busy_iter() instead of + * blk_mq_tagset_busy_iter() because the latter only considers requests that + * have already been started. + */ +static int blk_requests_in_flight(struct request_queue *q) +{ + struct in_flight_data in_flight = { .q = q }; + + if (q->mq_ops) + blk_mq_queue_tag_busy_iter(q, blk_count_in_flight, &in_flight); + return q->nr_pending + in_flight.in_flight; +} + /** * blk_pre_runtime_suspend - Pre runtime suspend check * @q: the queue of the device @@ -86,14 +119,38 @@ int blk_pre_runtime_suspend(struct request_queue *q) if (!q->dev) return ret; + WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE); + + blk_set_pm_only(q); + /* +* This function only gets called if the most recent +* pm_request_resume() call occurred at least autosuspend_delay_ms +* ago. Since blk_queue_enter() is called by the request allocation +* code before pm_request_resume(), if no requests have a tag assigned +* it is safe to suspend the device. 
+*/ + ret = -EBUSY; + if (blk_requests_in_flight(q) == 0) { + /* +* Call synchronize_rcu() such that later blk_queue_enter() +* calls see the preempt-only state. See also +* http://lwn.net/Articles/573497/. +*/ + synchronize_rcu(); + if (blk_requests_in_flight(q) == 0) + ret = 0; + } +
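The suspend decision this patch adds — count in-flight requests, synchronize, then re-check — can be sketched as a userspace toy model. The struct and function names below (`pm_queue`, `pm_pre_runtime_suspend`, etc.) are illustrative assumptions, not the kernel's data structures:

```c
#include <assert.h>

#define EBUSY 16

/*
 * Toy model of the blk_pre_runtime_suspend() logic in this patch:
 * runtime suspend is only allowed when no request is in flight, and the
 * count is re-checked after a (here elided) synchronize_rcu() so that
 * later blk_queue_enter() calls observe the pm-only state.
 */
struct pm_queue {
	int nr_pending;		/* legacy request count */
	int mq_in_flight;	/* blk-mq requests with a tag assigned */
};

static int pm_requests_in_flight(const struct pm_queue *q)
{
	return q->nr_pending + q->mq_in_flight;
}

static int pm_pre_runtime_suspend(const struct pm_queue *q)
{
	int ret = -EBUSY;

	if (pm_requests_in_flight(q) == 0) {
		/* in the kernel, synchronize_rcu() would run here */
		if (pm_requests_in_flight(q) == 0)
			ret = 0;
	}
	return ret;
}
```

A queue with any pending request yields -EBUSY and suspend is aborted; only a fully idle queue allows the transition to proceed.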
[PATCH v6 09/12] block: Split blk_pm_add_request() and blk_pm_put_request()
Move the pm_request_resume() and pm_runtime_mark_last_busy() calls into two new functions. Signed-off-by: Bart Van Assche Cc: Martin K. Petersen Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c | 1 + block/blk-pm.h | 36 +++- block/elevator.c | 1 + 3 files changed, 33 insertions(+), 5 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 3770a36e85be..59dd98585eb0 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1774,6 +1774,7 @@ void __blk_put_request(struct request_queue *q, struct request *req) blk_req_zone_write_unlock(req); blk_pm_put_request(req); + blk_pm_mark_last_busy(req); elv_completed_request(q, req); diff --git a/block/blk-pm.h b/block/blk-pm.h index 1ffc8ef203ec..a8564ea72a41 100644 --- a/block/blk-pm.h +++ b/block/blk-pm.h @@ -6,8 +6,23 @@ #include #ifdef CONFIG_PM +static inline void blk_pm_request_resume(struct request_queue *q) +{ + if (q->dev && (q->rpm_status == RPM_SUSPENDED || + q->rpm_status == RPM_SUSPENDING)) + pm_request_resume(q->dev); +} + +static inline void blk_pm_mark_last_busy(struct request *rq) +{ + if (rq->q->dev && !(rq->rq_flags & RQF_PM)) + pm_runtime_mark_last_busy(rq->q->dev); +} + static inline void blk_pm_requeue_request(struct request *rq) { + lockdep_assert_held(rq->q->queue_lock); + if (rq->q->dev && !(rq->rq_flags & RQF_PM)) rq->q->nr_pending--; } @@ -15,17 +30,28 @@ static inline void blk_pm_requeue_request(struct request *rq) static inline void blk_pm_add_request(struct request_queue *q, struct request *rq) { - if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 && - (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING)) - pm_request_resume(q->dev); + lockdep_assert_held(q->queue_lock); + + if (q->dev && !(rq->rq_flags & RQF_PM)) + q->nr_pending++; } static inline void blk_pm_put_request(struct request *rq) { - if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending) - 
pm_runtime_mark_last_busy(rq->q->dev); + lockdep_assert_held(rq->q->queue_lock); + + if (rq->q->dev && !(rq->rq_flags & RQF_PM)) + --rq->q->nr_pending; } #else +static inline void blk_pm_request_resume(struct request_queue *q) +{ +} + +static inline void blk_pm_mark_last_busy(struct request *rq) +{ +} + static inline void blk_pm_requeue_request(struct request *rq) { } diff --git a/block/elevator.c b/block/elevator.c index 4c15f0240c99..00c5d8dbce16 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -601,6 +601,7 @@ void __elv_add_request(struct request_queue *q, struct request *rq, int where) trace_block_rq_insert(q, rq); blk_pm_add_request(q, rq); + blk_pm_request_resume(q); rq->q = q; -- 2.18.0
[PATCH v6 10/12] block: Change the runtime power management approach (1/2)
Instead of scheduling runtime resume of a request queue after a request has been queued, schedule asynchronous resume during request allocation. The new pm_request_resume() calls occur after blk_queue_enter() has increased the q_usage_counter request queue member. This change is needed for a later patch that will make request allocation block while the queue status is not RPM_ACTIVE. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c | 2 ++ block/blk-mq.c | 2 ++ block/elevator.c | 1 - 3 files changed, 4 insertions(+), 1 deletion(-) diff --git a/block/blk-core.c b/block/blk-core.c index 59dd98585eb0..f30545fb2de2 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -972,6 +972,8 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) if (success) return 0; + blk_pm_request_resume(q); + if (flags & BLK_MQ_REQ_NOWAIT) return -EBUSY; diff --git a/block/blk-mq.c b/block/blk-mq.c index 2a0eb058ba5a..24439735f20b 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -33,6 +33,7 @@ #include "blk-mq.h" #include "blk-mq-debugfs.h" #include "blk-mq-tag.h" +#include "blk-pm.h" #include "blk-stat.h" #include "blk-mq-sched.h" #include "blk-rq-qos.h" @@ -473,6 +474,7 @@ static void __blk_mq_free_request(struct request *rq) struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu); const int sched_tag = rq->internal_tag; + blk_pm_mark_last_busy(rq); if (rq->tag != -1) blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag); if (sched_tag != -1) diff --git a/block/elevator.c b/block/elevator.c index 00c5d8dbce16..4c15f0240c99 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -601,7 +601,6 @@ void __elv_add_request(struct request_queue *q, struct request *rq, int where) trace_block_rq_insert(q, rq); blk_pm_add_request(q, rq); - blk_pm_request_resume(q); rq->q = q; -- 2.18.0
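The ordering this patch establishes — resume is requested from inside blk_queue_enter(), i.e. after q_usage_counter has been increased — can be sketched as a small model. All names here (`enter_queue`, `model_*`) are illustrative, not kernel API:

```c
#include <assert.h>

/* Runtime-PM states, modeled after the RPM_* values in <linux/pm.h> */
enum rpm_state { RPM_ACTIVE, RPM_SUSPENDING, RPM_SUSPENDED };

/*
 * Illustrative model (not kernel code): a request allocator first takes
 * a q_usage_counter reference, then schedules an asynchronous resume if
 * the queue is suspended or suspending.
 */
struct enter_queue {
	enum rpm_state rpm_status;
	int q_usage_counter;
	int resume_requested;
};

static void model_pm_request_resume(struct enter_queue *q)
{
	if (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING)
		q->resume_requested = 1;	/* pm_request_resume(q->dev) */
}

static int model_blk_queue_enter(struct enter_queue *q)
{
	q->q_usage_counter++;		/* reference taken first ... */
	model_pm_request_resume(q);	/* ... then resume is scheduled */
	return 0;
}
```

With this ordering the PM core can never observe a suspended queue that has a usage reference but no resume scheduled, which is what the later "(2/2)" patch relies on.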
[PATCH v6 00/12] blk-mq: Implement runtime power management
Hello Jens,

This patch series not only implements runtime power management for blk-mq but also fixes a starvation issue in the power management code for the legacy block layer. Please consider this patch series for the upstream kernel.

Thanks,

Bart.

Changes compared to v5:
- Introduced a new flag RQF_DV that replaces RQF_PREEMPT for SCSI domain validation.
- Introduced a new request queue state QUEUE_FLAG_DV_ONLY for SCSI domain validation.
- Instead of using SDEV_QUIESCE for both runtime suspend and SCSI domain validation, use that state for domain validation only and introduce a new state for runtime suspend, namely SDEV_SUSPENDED.
- Reallow system suspend during SCSI domain validation.
- Moved the runtime resume call from the request allocation code into blk_queue_enter().
- Instead of relying on q_usage_counter, iterate over the tag set to determine whether or not any requests are in flight.

Changes compared to v4:
- Dropped the patches "Give RQF_PREEMPT back its original meaning" and "Serialize queue freezing and blk_pre_runtime_suspend()".
- Replaced "percpu_ref_read()" with "percpu_ref_is_in_use()".
- Inserted pm_request_resume() calls in the block layer request allocation code such that the context that submits a request no longer has to call pm_runtime_get().

Changes compared to v3:
- Avoid adverse interactions between system-wide suspend/resume and runtime power management by changing the PREEMPT_ONLY flag into a counter.
- Give RQF_PREEMPT back its original meaning, namely that it is only set for ide_preempt requests.
- Remove the flag BLK_MQ_REQ_PREEMPT.
- Removed the pm_request_resume() call.

Changes compared to v2:
- Fixed the build for CONFIG_BLOCK=n.
- Added a patch that introduces percpu_ref_read() in the percpu-counter code.
- Added a patch that makes it easier to detect missing pm_runtime_get*() calls.
- Addressed Jianchao's feedback including the comment about runtime overhead of switching a per-cpu counter to atomic mode.
Changes compared to v1:
- Moved the runtime power management code into a separate file.
- Addressed Ming's feedback.

Bart Van Assche (12):
  block, scsi: Introduce request flag RQF_DV
  scsi: Alter handling of RQF_DV requests
  scsi: Only set RQF_DV for requests used for domain validation
  scsi: Introduce the SDEV_SUSPENDED device status
  block, scsi: Rename QUEUE_FLAG_PREEMPT_ONLY into DV_ONLY and introduce PM_ONLY
  scsi: Reallow SPI domain validation during system suspend
  block: Move power management code into a new source file
  block, scsi: Introduce blk_pm_runtime_exit()
  block: Split blk_pm_add_request() and blk_pm_put_request()
  block: Change the runtime power management approach (1/2)
  block: Change the runtime power management approach (2/2)
  blk-mq: Enable support for runtime power management

 block/Kconfig | 5 +
 block/Makefile | 1 +
 block/blk-core.c | 310 ++
 block/blk-mq-debugfs.c | 4 +-
 block/blk-mq.c | 4 +-
 block/blk-pm.c | 262 +
 block/blk-pm.h | 69 +++
 block/elevator.c | 22 +--
 drivers/ide/ide-pm.c | 3 +-
 drivers/scsi/scsi_lib.c | 266 ++---
 drivers/scsi/scsi_pm.c | 7 +-
 drivers/scsi/scsi_sysfs.c | 2 +
 drivers/scsi/scsi_transport_spi.c | 26 +--
 drivers/scsi/sd.c | 10 +-
 drivers/scsi/sr.c | 4 +-
 include/linux/blk-mq.h | 6 +-
 include/linux/blk-pm.h | 26 +++
 include/linux/blkdev.h | 41 ++--
 include/scsi/scsi_device.h | 15 +-
 include/scsi/scsi_transport_spi.h | 1 -
 20 files changed, 673 insertions(+), 411 deletions(-)
 create mode 100644 block/blk-pm.c
 create mode 100644 block/blk-pm.h
 create mode 100644 include/linux/blk-pm.h

-- 
2.18.0
Re: [PATCH RESEND] Blk-throttle: reduce tail io latency when iops limit is enforced
On 8/9/18 11:47 AM, Liu Bo wrote: > When an application's iops has exceeded its cgroup's iops limit, surely it > is throttled and kernel will set a timer for dispatching, thus IO latency > includes the delay. > > However, the dispatch delay which is calculated by the limit and the > elapsed jiffies is suboptimal. As the dispatch delay is only calculated > once the application's iops is (iops limit + 1), it doesn't need to wait > any longer than the remaining time of the current slice. Applied, thanks. -- Jens Axboe
[PATCH RESEND] Blk-throttle: reduce tail io latency when iops limit is enforced
When an application's iops has exceeded its cgroup's iops limit, surely it is throttled and kernel will set a timer for dispatching, thus IO latency includes the delay. However, the dispatch delay which is calculated by the limit and the elapsed jiffies is suboptimal. As the dispatch delay is only calculated once the application's iops is (iops limit + 1), it doesn't need to wait any longer than the remaining time of the current slice. The difference can be proved by the following fio job and cgroup iops setting, - $ echo 4 > /mnt/config/nullb/disk1/mbps# limit nullb's bandwidth to 4MB/s for testing. $ echo "253:1 riops=100 rbps=max" > /sys/fs/cgroup/unified/cg1/io.max $ cat r2.job [global] name=fio-rand-read filename=/dev/nullb1 rw=randread bs=4k direct=1 numjobs=1 time_based=1 runtime=60 group_reporting=1 [file1] size=4G ioengine=libaio iodepth=1 rate_iops=5 norandommap=1 thinktime=4ms - wo patch: file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1 fio-3.7-66-gedfc Starting 1 process read: IOPS=99, BW=400KiB/s (410kB/s)(23.4MiB/60001msec) slat (usec): min=10, max=336, avg=27.71, stdev=17.82 clat (usec): min=2, max=28887, avg=5929.81, stdev=7374.29 lat (usec): min=24, max=28901, avg=5958.73, stdev=7366.22 clat percentiles (usec): | 1.00th=[4], 5.00th=[4], 10.00th=[4], 20.00th=[4], | 30.00th=[4], 40.00th=[4], 50.00th=[6], 60.00th=[11731], | 70.00th=[11863], 80.00th=[11994], 90.00th=[12911], 95.00th=[22676], | 99.00th=[23725], 99.50th=[23987], 99.90th=[23987], 99.95th=[25035], | 99.99th=[28967] w/ patch: file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1 fio-3.7-66-gedfc Starting 1 process read: IOPS=100, BW=400KiB/s (410kB/s)(23.4MiB/60005msec) slat (usec): min=10, max=155, avg=23.24, stdev=16.79 clat (usec): min=2, max=12393, avg=5961.58, stdev=5959.25 lat (usec): min=23, max=12412, avg=5985.91, stdev=5951.92 clat percentiles (usec): | 1.00th=[3], 
5.00th=[3], 10.00th=[4], 20.00th=[4], | 30.00th=[4], 40.00th=[5], 50.00th=[ 47], 60.00th=[11863], | 70.00th=[11994], 80.00th=[11994], 90.00th=[11994], 95.00th=[11994], | 99.00th=[11994], 99.50th=[11994], 99.90th=[12125], 99.95th=[12125], | 99.99th=[12387] Signed-off-by: Liu Bo --- Rebase against branch for-4.19/block. block/blk-throttle.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/block/blk-throttle.c b/block/blk-throttle.c index caaabbe8a7a5..a3eede00d302 100644 --- a/block/blk-throttle.c +++ b/block/blk-throttle.c @@ -922,12 +922,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio, } /* Calc approx time to dispatch */ - jiffy_wait = ((tg->io_disp[rw] + 1) * HZ) / tg_iops_limit(tg, rw) + 1; - - if (jiffy_wait > jiffy_elapsed) - jiffy_wait = jiffy_wait - jiffy_elapsed; - else - jiffy_wait = 1; + jiffy_wait = jiffy_elapsed_rnd - jiffy_elapsed; if (wait) *wait = jiffy_wait; -- 1.8.3.1
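The before/after wait computations in this patch can be checked with a small userspace model. HZ and the helper names below are assumptions of the model, not kernel API; the formulas mirror the diff:

```c
#include <assert.h>

#define HZ 1000	/* assumed jiffies-per-second for the model */

/*
 * Userspace model (not kernel code) of the wait computation changed by
 * this patch.  io_disp, iops_limit, jiffy_elapsed and jiffy_elapsed_rnd
 * mirror the variables in tg_with_in_iops_limit().
 */

/* Old behaviour: derive the wait from the number of IOs dispatched so
 * far, which can wait far past the end of the current slice. */
static unsigned long old_iops_wait(unsigned int io_disp,
				   unsigned int iops_limit,
				   unsigned long jiffy_elapsed)
{
	unsigned long jiffy_wait = ((io_disp + 1UL) * HZ) / iops_limit + 1;

	return jiffy_wait > jiffy_elapsed ? jiffy_wait - jiffy_elapsed : 1;
}

/* New behaviour: wait only for the remainder of the current slice
 * (jiffy_elapsed_rnd is jiffy_elapsed rounded up to the slice length). */
static unsigned long new_iops_wait(unsigned long jiffy_elapsed_rnd,
				   unsigned long jiffy_elapsed)
{
	return jiffy_elapsed_rnd - jiffy_elapsed;
}
```

For example, with a 100-jiffy slice, 100 IOs already dispatched against a 100-IOPS limit and 10 jiffies elapsed, the old formula waits 1001 jiffies while the new one waits only the 90 jiffies left in the slice — which is the tail-latency reduction the fio numbers show.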
Re: [PATCH] Blk-throttle: reduce tail io latency when iops limit is enforced
On Thu, Aug 9, 2018 at 10:22 AM, Jens Axboe wrote: > On 7/20/18 6:29 PM, Liu Bo wrote: >> When an application's iops has exceeded its cgroup's iops limit, surely it >> is throttled and kernel will set a timer for dispatching, thus IO latency >> includes the delay. >> >> However, the dispatch delay which is calculated by the limit and the >> elapsed jiffies is suboptimal. As the dispatch delay is only calculated >> once the application's iops is (iops limit + 1), it doesn't need to wait >> any longer than the remaining time of the current slice. > > Looks good to me - can you resend against my for-4.19/block branch, as > it doesn't apply. > Sure, I'll rebase against it. thanks, liubo
Re: [PATCH] Blk-throttle: reduce tail io latency when iops limit is enforced
On 7/20/18 6:29 PM, Liu Bo wrote: > When an application's iops has exceeded its cgroup's iops limit, surely it > is throttled and kernel will set a timer for dispatching, thus IO latency > includes the delay. > > However, the dispatch delay which is calculated by the limit and the > elapsed jiffies is suboptimal. As the dispatch delay is only calculated > once the application's iops is (iops limit + 1), it doesn't need to wait > any longer than the remaining time of the current slice. Looks good to me - can you resend against my for-4.19/block branch, as it doesn't apply. -- Jens Axboe
Re: [PATCH] Blk-throttle: reduce tail io latency when iops limit is enforced
ping? On Fri, Jul 20, 2018 at 5:29 PM, Liu Bo wrote: > When an application's iops has exceeded its cgroup's iops limit, surely it > is throttled and kernel will set a timer for dispatching, thus IO latency > includes the delay. > > However, the dispatch delay which is calculated by the limit and the > elapsed jiffies is suboptimal. As the dispatch delay is only calculated > once the application's iops is (iops limit + 1), it doesn't need to wait > any longer than the remaining time of the current slice. > > The difference can be proved by the following fio job and cgroup iops > setting, > - > $ echo 4 > /mnt/config/nullb/disk1/mbps# limit nullb's bandwidth to 4MB/s > for testing. > $ echo "253:1 riops=100 rbps=max" > /sys/fs/cgroup/unified/cg1/io.max > $ cat r2.job > [global] > name=fio-rand-read > filename=/dev/nullb1 > rw=randread > bs=4k > direct=1 > numjobs=1 > time_based=1 > runtime=60 > group_reporting=1 > > [file1] > size=4G > ioengine=libaio > iodepth=1 > rate_iops=5 > norandommap=1 > thinktime=4ms > - > > wo patch: > file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) > 4096B-4096B, ioengine=libaio, iodepth=1 > fio-3.7-66-gedfc > Starting 1 process > >read: IOPS=99, BW=400KiB/s (410kB/s)(23.4MiB/60001msec) > slat (usec): min=10, max=336, avg=27.71, stdev=17.82 > clat (usec): min=2, max=28887, avg=5929.81, stdev=7374.29 > lat (usec): min=24, max=28901, avg=5958.73, stdev=7366.22 > clat percentiles (usec): > | 1.00th=[4], 5.00th=[4], 10.00th=[4], 20.00th=[4], > | 30.00th=[4], 40.00th=[4], 50.00th=[6], 60.00th=[11731], > | 70.00th=[11863], 80.00th=[11994], 90.00th=[12911], 95.00th=[22676], > | 99.00th=[23725], 99.50th=[23987], 99.90th=[23987], 99.95th=[25035], > | 99.99th=[28967] > > w/ patch: > file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) > 4096B-4096B, ioengine=libaio, iodepth=1 > fio-3.7-66-gedfc > Starting 1 process > >read: IOPS=100, BW=400KiB/s (410kB/s)(23.4MiB/60005msec) > slat (usec): min=10, max=155, avg=23.24, 
stdev=16.79 > clat (usec): min=2, max=12393, avg=5961.58, stdev=5959.25 > lat (usec): min=23, max=12412, avg=5985.91, stdev=5951.92 > clat percentiles (usec): > | 1.00th=[3], 5.00th=[3], 10.00th=[4], 20.00th=[4], > | 30.00th=[4], 40.00th=[5], 50.00th=[ 47], 60.00th=[11863], > | 70.00th=[11994], 80.00th=[11994], 90.00th=[11994], 95.00th=[11994], > | 99.00th=[11994], 99.50th=[11994], 99.90th=[12125], 99.95th=[12125], > | 99.99th=[12387] > > Signed-off-by: Liu Bo > --- > block/blk-throttle.c | 7 +-- > 1 file changed, 1 insertion(+), 6 deletions(-) > > diff --git a/block/blk-throttle.c b/block/blk-throttle.c > index 26ddbe5286f8..3ca9a98748be 100644 > --- a/block/blk-throttle.c > +++ b/block/blk-throttle.c > @@ -956,12 +956,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, > struct bio *bio, > return false; > > /* Calc approx time to dispatch */ > - jiffy_wait = ((tg->io_disp[rw] + 1) * HZ) / tg_iops_limit(tg, rw) + 1; > - > - if (jiffy_wait > jiffy_elapsed) > - jiffy_wait = jiffy_wait - jiffy_elapsed; > - else > - jiffy_wait = 1; > + jiffy_wait = jiffy_elapsed_rnd - jiffy_elapsed; > > *wait = jiffy_wait; > return false; > -- > 1.8.3.1 >
Re: [PATCH v5 5/9] block: Change the runtime power management approach (1/2)
On Thu, 2018-08-09 at 10:52 +0800, Ming Lei wrote: > On Wed, Aug 08, 2018 at 05:28:43PM +, Bart Van Assche wrote: > > Some but not all blk_queue_enter() calls are related to request allocation > > so > > The only one I remember is scsi_ioctl_reset(), in which scsi_autopm_get_host() > is called before allocating this request, that means it is enough to put > host up for handling RESET. Also this request shouldn't have entered the block > request queue. > > Or are there other cases in which request allocation isn't related with > blk_queue_enter()? What I was referring to in my e-mail is the q_usage_counter manipulations in blk_mq_timeout_work(). However, blk_mq_timeout_work() manipulates that counter directly. So it's only the blk_queue_exit() call in that function that is not related to request allocation. All blk_queue_enter() calls I know of are related to request allocation. Bart.
Re: [PATCH v5 0/3] Ensure that a request queue is dissociated from the cgroup controller
On 8/9/18 8:53 AM, Bart Van Assche wrote: > Hello Jens, > > Several block drivers call alloc_disk() followed by put_disk() if something > fails before device_add_disk() is called without calling blk_cleanup_queue(). > Make sure that also for this scenario a request queue is dissociated from the > cgroup controller. This patch avoids that loading the parport_pc, paride and > pf drivers trigger a kernel crash. Since this patch series fixes a regression, > please consider these patches for kernel v4.19. Thanks, applied for 4.19. -- Jens Axboe
[PATCH v5 3/3] block: Ensure that a request queue is dissociated from the cgroup controller
Several block drivers call alloc_disk() followed by put_disk() if something fails before device_add_disk() is called without calling blk_cleanup_queue(). Make sure that also for this scenario a request queue is dissociated from the cgroup controller. This patch avoids that loading the parport_pc, paride and pf drivers triggers the following kernel crash: BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride] Read of size 4 at addr 0008 by task modprobe/744 Call Trace: dump_stack+0x9a/0xeb kasan_report+0x139/0x350 pi_init+0x42e/0x580 [paride] pf_init+0x2bb/0x1000 [pf] do_one_initcall+0x8e/0x405 do_init_module+0xd9/0x2f2 load_module+0x3ab4/0x4700 SYSC_finit_module+0x176/0x1a0 do_syscall_64+0xee/0x2b0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 Reported-by: Alexandru Moise <00moses.alexande...@gmail.com> Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the block cgroup controller") # v4.17 Signed-off-by: Bart Van Assche Tested-by: Alexandru Moise <00moses.alexande...@gmail.com> Reviewed-by: Johannes Thumshirn Cc: Tejun Heo Cc: Christoph Hellwig Cc: Ming Lei Cc: Alexandru Moise <00moses.alexande...@gmail.com> Cc: Joseph Qi Cc: --- block/blk-sysfs.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index ca1984ecbdeb..fcadea471779 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -802,6 +802,21 @@ static void __blk_release_queue(struct work_struct *work) blk_stat_remove_callback(q, q->poll_cb); blk_stat_free_callback(q->poll_cb); + if (!blk_queue_dead(q)) { + /* +* Last reference was dropped without having called +* blk_cleanup_queue(). 
+*/ + WARN_ONCE(blk_queue_init_done(q), + "request queue %p has been registered but blk_cleanup_queue() has not been called for that queue\n", + q); + blk_exit_queue(q); + } + + WARN(blkg_root_lookup(q), +"request queue %p is being released but it has not yet been removed from the blkcg controller\n", +q); + blk_free_queue_stats(q->stats); blk_exit_rl(q, &q->root_rl); -- 2.18.0
[PATCH v5 0/3] Ensure that a request queue is dissociated from the cgroup controller
Hello Jens,

Several block drivers call alloc_disk() followed by put_disk() if something fails before device_add_disk() is called, without calling blk_cleanup_queue(). Make sure that also for this scenario a request queue is dissociated from the cgroup controller. This patch series avoids that loading the parport_pc, paride and pf drivers triggers a kernel crash. Since this patch series fixes a regression, please consider these patches for kernel v4.19.

Thanks,

Bart.

Changes between v4 and v5:
- Instead of using #ifdef CONFIG_CGROUP / #endif, introduced a new function in the cgroups code.

Changes between v3 and v4:
- Added "Cc: stable" tags.

Changes between v2 and v3:
- Avoid code duplication by introducing a new helper function.

Changes between v1 and v2:
- Fixed the build for CONFIG_BLK_CGROUP=n.

Bart Van Assche (3):
  blkcg: Introduce blkg_root_lookup()
  block: Introduce blk_exit_queue()
  block: Ensure that a request queue is dissociated from the cgroup controller

 block/blk-core.c | 54 +-
 block/blk-sysfs.c | 15 +++
 block/blk.h | 1 +
 include/linux/blk-cgroup.h | 18 +
 4 files changed, 64 insertions(+), 24 deletions(-)

-- 
2.18.0
[PATCH v5 1/3] blkcg: Introduce blkg_root_lookup()
This new function will be used in a later patch to verify whether a queue has been dissociated from the cgroup controller before being released.

Signed-off-by: Bart Van Assche
Cc: Tejun Heo
Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Omar Sandoval
Cc: Johannes Thumshirn
Cc: Alexandru Moise <00moses.alexande...@gmail.com>
Cc: Joseph Qi
Cc:
---
 include/linux/blk-cgroup.h | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index f7b910768306..1361cfc9b878 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -341,6 +341,23 @@ static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg,
 	return __blkg_lookup(blkcg, q, false);
 }
 
+/**
+ * blkg_root_lookup - look up blkg for the specified request queue
+ * @q: request_queue of interest
+ *
+ * Lookup blkg for @q at the root level. See also blkg_lookup().
+ */
+static inline struct blkcg_gq *blkg_root_lookup(struct request_queue *q)
+{
+	struct blkcg_gq *blkg;
+
+	rcu_read_lock();
+	blkg = blkg_lookup(&blkcg_root, q);
+	rcu_read_unlock();
+
+	return blkg;
+}
+
 /**
  * blkg_to_pdata - get policy private data
  * @blkg: blkg of interest
@@ -864,6 +881,7 @@ static inline bool blk_cgroup_congested(void) { return false; }
 static inline void blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay) { }
 static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
+static inline struct blkcg_gq *blkg_root_lookup(struct request_queue *q) { return NULL; }
 static inline int blkcg_init_queue(struct request_queue *q) { return 0; }
 static inline void blkcg_drain_queue(struct request_queue *q) { }
 static inline void blkcg_exit_queue(struct request_queue *q) { }
-- 
2.18.0
[PATCH v5 2/3] block: Introduce blk_exit_queue()
This patch does not change any functionality. Signed-off-by: Bart Van Assche Reviewed-by: Johannes Thumshirn Cc: Christoph Hellwig Cc: Ming Lei Cc: Omar Sandoval Cc: Alexandru Moise <00moses.alexande...@gmail.com> Cc: Joseph Qi Cc: --- block/blk-core.c | 54 +++- block/blk.h | 1 + 2 files changed, 31 insertions(+), 24 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 2e054b65de42..55bc22ef2934 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -719,6 +719,35 @@ void blk_set_queue_dying(struct request_queue *q) } EXPORT_SYMBOL_GPL(blk_set_queue_dying); +/* Unconfigure the I/O scheduler and dissociate from the cgroup controller. */ +void blk_exit_queue(struct request_queue *q) +{ + /* +* Since the I/O scheduler exit code may access cgroup information, +* perform I/O scheduler exit before disassociating from the block +* cgroup controller. +*/ + if (q->elevator) { + ioc_clear_queue(q); + elevator_exit(q, q->elevator); + q->elevator = NULL; + } + + /* +* Remove all references to @q from the block cgroup controller before +* restoring @q->queue_lock to avoid that restoring this pointer causes +* e.g. blkcg_print_blkgs() to crash. +*/ + blkcg_exit_queue(q); + + /* +* Since the cgroup code may dereference the @q->backing_dev_info +* pointer, only decrease its reference count after having removed the +* association with the block cgroup controller. +*/ + bdi_put(q->backing_dev_info); +} + /** * blk_cleanup_queue - shutdown a request queue * @q: request queue to shutdown @@ -788,30 +817,7 @@ void blk_cleanup_queue(struct request_queue *q) */ WARN_ON_ONCE(q->kobj.state_in_sysfs); - /* -* Since the I/O scheduler exit code may access cgroup information, -* perform I/O scheduler exit before disassociating from the block -* cgroup controller. 
-*/ - if (q->elevator) { - ioc_clear_queue(q); - elevator_exit(q, q->elevator); - q->elevator = NULL; - } - - /* -* Remove all references to @q from the block cgroup controller before -* restoring @q->queue_lock to avoid that restoring this pointer causes -* e.g. blkcg_print_blkgs() to crash. -*/ - blkcg_exit_queue(q); - - /* -* Since the cgroup code may dereference the @q->backing_dev_info -* pointer, only decrease its reference count after having removed the -* association with the block cgroup controller. -*/ - bdi_put(q->backing_dev_info); + blk_exit_queue(q); if (q->mq_ops) blk_mq_free_queue(q); diff --git a/block/blk.h b/block/blk.h index 6adae8f94279..aeb8026f4d7b 100644 --- a/block/blk.h +++ b/block/blk.h @@ -130,6 +130,7 @@ void blk_free_flush_queue(struct blk_flush_queue *q); int blk_init_rl(struct request_list *rl, struct request_queue *q, gfp_t gfp_mask); void blk_exit_rl(struct request_queue *q, struct request_list *rl); +void blk_exit_queue(struct request_queue *q); void blk_rq_bio_prep(struct request_queue *q, struct request *rq, struct bio *bio); void blk_queue_bypass_start(struct request_queue *q); -- 2.18.0
[PATCH] block: Remove two superfluous #include directives
Commit 12f5b9314545 ("blk-mq: Remove generation seqeunce") removed the only seqcount_t and u64_stats_sync instances from <linux/blkdev.h> but did not remove the corresponding #include directives. Since these include directives are no longer needed, remove them.

Signed-off-by: Bart Van Assche
Cc: Christoph Hellwig
Cc: Keith Busch
Cc: Ming Lei
Cc: Jianchao Wang
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 include/linux/blkdev.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 050d599f5ea9..d6869e0e2b64 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -27,8 +27,6 @@
 #include
 #include
 #include
-#include <linux/seqlock.h>
-#include <linux/u64_stats_sync.h>
 
 struct module;
 struct scsi_ioctl_command;
-- 
2.18.0
Re: [PATCH v5 4/9] percpu-refcount: Introduce percpu_ref_is_in_use()
On Wed, 2018-08-08 at 08:23 -0700, Tejun Heo wrote: > On Tue, Aug 07, 2018 at 03:51:28PM -0700, Bart Van Assche wrote: > > Introduce a function that allows to determine whether a per-cpu refcount > > is in use. This function will be used in a later patch to determine > > whether or not any block layer requests are being executed. > > I thought about it a bit and am having a bit of difficulty convincing > myself this is necessary. Switching a percpu_ref to atomic mode isn't > expensive - it's one spinlock cycle, a rcu wait and one sweep of the > percpu counters. The most expensive part - the percpu sweep - needs > to be there even with optimization, the wait doesn't really matter as > all it'll do is slightly delaying timer based PM operation and can be > overlayed with the propagation of set_pm_only() anyway. > > So, how about just doing the simple thing? Switch it to atomic mode > and check the counter and switch back to percpu mode afterwards. If > we see any issues with that, we can try to optimize it later but that > seems unlikely to me. Hello Tejun, Switching to atomic mode would require me to add a serialization mechanism against blk_freeze_queue_start() and blk_mq_unfreeze_queue() since these functions call percpu_ref_kill() and percpu_ref_reinit(). That is easy but requires additional code. I will see whether I can implement an alternative approach using blk_mq_queue_tag_busy_iter(). Thanks, Bart.
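The mechanism discussed in this exchange — reading a per-cpu refcount by switching it to atomic mode and folding the per-cpu deltas — can be sketched as a minimal userspace model. This is illustrative only (`ref_model` and friends are invented names); the real percpu_ref code additionally needs a spinlock, an RCU grace period and a bias value:

```c
#include <assert.h>

#define NR_CPUS 4	/* arbitrary for the model */

/*
 * Minimal model: while in percpu mode each CPU increments its own
 * counter; the total is only known after switching to atomic mode and
 * summing the per-cpu deltas into a single counter.
 */
struct ref_model {
	long percpu_count[NR_CPUS];
	long atomic_count;
	int atomic_mode;
};

static void ref_model_get(struct ref_model *ref, int cpu)
{
	if (ref->atomic_mode)
		ref->atomic_count++;
	else
		ref->percpu_count[cpu]++;
}

/* Switch to atomic mode and return the folded count. */
static long ref_model_switch_to_atomic(struct ref_model *ref)
{
	int cpu;

	ref->atomic_mode = 1;	/* kernel: spinlock + RCU wait */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		ref->atomic_count += ref->percpu_count[cpu];
		ref->percpu_count[cpu] = 0;
	}
	return ref->atomic_count;
}
```

The sweep over per-cpu counters is the unavoidable cost Tejun refers to; Bart's objection is that kill/reinit elsewhere would race with such a mode switch, which is why the series moved to blk_mq_queue_tag_busy_iter() instead.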
Re: [PATCH] block: bvec_nr_vecs() returns value for wrong slab
On 8/8/18 1:27 PM, Greg Edwards wrote: > In commit ed996a52c868 ("block: simplify and cleanup bvec pool > handling"), the value of the slab index is incremented by one in > bvec_alloc() after the allocation is done to indicate an index value of > 0 does not need to be later freed. > > bvec_nr_vecs() was not updated accordingly, and thus returns the wrong > value. Decrement idx before performing the lookup. Applied, thanks. -- Jens Axboe
Re: [GIT PULL] last round of nvme updates for 4.19
On 8/9/18 2:07 AM, Christoph Hellwig wrote: > Hi Jens, > > this should be the last round of NVMe updates before the 4.19 merge > window opens. It conains support for write protected (aka read-only) > namespaces from Chaitanya, two ANA fixes from Hannes and a fabrics > fix from Tal Shorer. Thanks, pulled. -- Jens Axboe
Re: [PATCH 00/10] bcache patches for 4.19, 2nd wave
On 2018/8/9 10:21 PM, Jens Axboe wrote: > On 8/9/18 1:48 AM, Coly Li wrote: >> Hi Jens, >> >> Here are 2nd wave bcache patches for 4.19. >> >> The patches from me were either verified by other people or posted >> for quite long time. Except for the debugfs_create_dir() fix and >> "set max writeback rate" fix, rested patches are simple or trivial >> IMHO. >> >> Our new bcache developer Shenghui Wang contributes two patches in >> this run. The first one fixes a misleading error message, and the >> second one is a code style clean up. >> >> Thanks in advance for picking them. > > Applied, fixed up a typo in your patch #4. > Oh, thanks! Coly Li
Re: [PATCH 00/10] bcache patches for 4.19, 2nd wave
On 8/9/18 1:48 AM, Coly Li wrote: > Hi Jens, > > Here are 2nd wave bcache patches for 4.19. > > The patches from me were either verified by other people or posted > for quite long time. Except for the debugfs_create_dir() fix and > "set max writeback rate" fix, rested patches are simple or trivial > IMHO. > > Our new bcache developer Shenghui Wang contributes two patches in > this run. The first one fixes a misleading error message, and the > second one is a code style clean up. > > Thanks in advance for picking them. Applied, fixed up a typo in your patch #4. -- Jens Axboe
[GIT PULL] last round of nvme updates for 4.19
Hi Jens,

this should be the last round of NVMe updates before the 4.19 merge window opens. It contains support for write protected (aka read-only) namespaces from Chaitanya, two ANA fixes from Hannes and a fabrics fix from Tal Shorer.

The following changes since commit f10fe9d85dc0802b54519c917716e6f0092b4ce7:

  lightnvm: remove minor version check for 2.0 (2018-08-05 19:36:09 -0600)

are available in the Git repository at:

  git://git.infradead.org/nvme.git nvme-4.19

for you to fetch changes up to 66414e80245e1e73222f67ee711951c7f4bdedab:

  nvme-fabrics: fix ctrl_loss_tmo < 0 to reconnect forever (2018-08-08 12:01:49 +0200)

Chaitanya Kulkarni (3):
  nvme.h: add support for ns write protect definitions
  nvme: set gendisk read only based on nsattr
  nvmet: add ns write protect support

Hannes Reinecke (2):
  nvme: fixup crash on failed discovery
  nvme.h: fixup ANA group descriptor format

Tal Shorer (1):
  nvme-fabrics: fix ctrl_loss_tmo < 0 to reconnect forever

 drivers/nvme/host/core.c | 6 
 drivers/nvme/host/fabrics.c | 2 +-
 drivers/nvme/host/multipath.c | 2 +-
 drivers/nvme/target/admin-cmd.c | 76 +++
 drivers/nvme/target/core.c | 20 ++-
 drivers/nvme/target/io-cmd-bdev.c | 7 
 drivers/nvme/target/io-cmd-file.c | 12 ---
 drivers/nvme/target/nvmet.h | 4 +++
 include/linux/nvme.h | 19 --
 9 files changed, 138 insertions(+), 10 deletions(-)
[PATCH 10/10] bcache: trivial - remove trailing backslash in macro BTREE_FLAG
From: Shenghui Wang Remove the trailing backslash in macro BTREE_FLAG in btree.h Signed-off-by: Shenghui Wang Signed-off-by: Coly Li --- drivers/md/bcache/btree.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h index d211e2c25b6b..68e9d926134d 100644 --- a/drivers/md/bcache/btree.h +++ b/drivers/md/bcache/btree.h @@ -152,7 +152,7 @@ static inline bool btree_node_ ## flag(struct btree *b) \ { return test_bit(BTREE_NODE_ ## flag, &b->flags); } \ \ static inline void set_btree_node_ ## flag(struct btree *b)\ -{ set_bit(BTREE_NODE_ ## flag, &b->flags); } \ +{ set_bit(BTREE_NODE_ ## flag, &b->flags); } enum btree_flags { BTREE_NODE_io_error, -- 2.18.0
[PATCH 09/10] bcache: make the pr_err statement used for ENOENT only in sysfs_attach section
From: Shenghui Wang The pr_err statement in the code for the sysfs_attach section would run for various error codes, which may be confusing. E.g., run the command twice: echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \ /sys/block/bcache0/bcache/attach [the backing dev got attached on the first run] echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \ /sys/block/bcache0/bcache/attach In dmesg, after the command runs twice, we can get: bcache: bch_cached_dev_attach() Can't attach sda6: already attached bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\ a8df5e8be891 : cache set not found The first statement in the message was right, but the second was confusing. bch_cached_dev_attach has various pr_ statements for various error codes, except ENOENT. After the change, rerun the above command twice: echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \ /sys/block/bcache0/bcache/attach echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \ /sys/block/bcache0/bcache/attach In dmesg we only got: bcache: bch_cached_dev_attach() Can't attach sda6: already attached No confusing "cache set not found" message anymore. And for a nonexistent SET-UUID: echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be898 > \ /sys/block/bcache0/bcache/attach In dmesg we can get: bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\ a8df5e8be898 : cache set not found Signed-off-by: Shenghui Wang Signed-off-by: Coly Li --- drivers/md/bcache/sysfs.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index 6e88142514fb..22f8565d2bf1 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -351,8 +351,8 @@ STORE(__cached_dev) if (!v) return size; } - - pr_err("Can't attach %s: cache set not found", buf); + if (v == -ENOENT) + pr_err("Can't attach %s: cache set not found", buf); return v; } -- 2.18.0
[PATCH 08/10] bcache: set max writeback rate when I/O request is idle
Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") allows the writeback rate to be faster if there is no I/O request on a bcache device. It works well if there is only one bcache device attached to the cache set. If there are many bcache devices attached to a cache set, it may introduce performance regression because multiple faster writeback threads of the idle bcache devices will compete for the btree level locks with the bcache devices which have I/O requests coming. This patch fixes the above issue by only permitting fast writeback when all bcache devices attached to the cache set are idle. And if one of the bcache devices has a new I/O request coming, all writeback throughput is minimized immediately and the PI controller __update_writeback_rate() decides the upcoming writeback rate for each bcache device. Also when all bcache devices are idle, limiting the writeback rate to a small number is a waste of throughput, especially when backing devices are slower non-rotational devices (e.g. SATA SSD). This patch sets a max writeback rate for each backing device if the whole cache set is idle. A faster writeback rate in idle time means new I/Os may have more available space for dirty data, and people may observe better write performance then. Please note bcache may change its cache mode at run time, and this patch still works if the cache mode is switched from writeback mode and there is still dirty data on the cache. 
Fixes: b1092c9af9ed ("bcache: allow quick writeback when backing idle") Cc: sta...@vger.kernel.org #4.16+ Signed-off-by: Coly Li Tested-by: Kai Krakow Tested-by: Stefan Priebe Cc: Michael Lyle --- drivers/md/bcache/bcache.h| 10 ++-- drivers/md/bcache/request.c | 59 ++- drivers/md/bcache/super.c | 4 ++ drivers/md/bcache/sysfs.c | 15 -- drivers/md/bcache/util.c | 2 +- drivers/md/bcache/util.h | 2 +- drivers/md/bcache/writeback.c | 91 +++ 7 files changed, 138 insertions(+), 45 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index b393b3fd06b6..05f82ff6f016 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -328,13 +328,6 @@ struct cached_dev { */ atomic_thas_dirty; - /* -* Set to zero by things that touch the backing volume-- except -* writeback. Incremented by writeback. Used to determine when to -* accelerate idle writeback. -*/ - atomic_tbacking_idle; - struct bch_ratelimitwriteback_rate; struct delayed_work writeback_rate_update; @@ -515,6 +508,8 @@ struct cache_set { struct cache_accounting accounting; unsigned long flags; + atomic_tidle_counter; + atomic_tat_max_writeback_rate; struct cache_sb sb; @@ -524,6 +519,7 @@ struct cache_set { struct bcache_device**devices; unsigneddevices_max_used; + atomic_tattached_dev_nr; struct list_headcached_devs; uint64_tcached_dev_sectors; atomic_long_t flash_dev_dirty_sectors; diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index 914d501ad1e0..7dbe8b6316a0 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -1103,6 +1103,44 @@ static void detached_dev_do_request(struct bcache_device *d, struct bio *bio) generic_make_request(bio); } +static void quit_max_writeback_rate(struct cache_set *c, + struct cached_dev *this_dc) +{ + int i; + struct bcache_device *d; + struct cached_dev *dc; + + /* +* mutex bch_register_lock may compete with other parallel requesters, +* or attach/detach operations on other backing devices. 
Waiting for +* the mutex lock may increase I/O request latency for seconds or more. +* To avoid such a situation, if mutex_trylock() fails, only the writeback +* rate of the current cached device is set to 1, and __update_writeback_rate() +* will decide the writeback rate of other cached devices (remember now +* c->idle_counter is 0 already). +*/ + if (mutex_trylock(&bch_register_lock)) { + for (i = 0; i < c->devices_max_used; i++) { + if (!c->devices[i]) + continue; + + if (UUID_FLASH_ONLY(&c->uuids[i])) + continue; + + d = c->devices[i]; + dc = container_of(d, struct cached_dev, disk); + /* +* set writeback rate to default minimum value, +* then let update_writeback_rate() decide the +* upcoming rate. +*/
[PATCH 07/10] bcache: add code comments for bset.c
This patch tries to add code comments in bset.c, to make some tricky code and design more comprehensible. Most information in this patch comes from the discussion between Kent and me; he offered very informative details. If there is any mistake in the ideas behind the code, no doubt it is my misrepresentation. Signed-off-by: Coly Li --- drivers/md/bcache/bset.c | 63 1 file changed, 63 insertions(+) diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c index f3403b45bc28..596c93b44e9b 100644 --- a/drivers/md/bcache/bset.c +++ b/drivers/md/bcache/bset.c @@ -366,6 +366,10 @@ EXPORT_SYMBOL(bch_btree_keys_init); /* Binary tree stuff for auxiliary search trees */ +/* + * return the array index next to j when doing an in-order traversal + * of a binary tree which is stored in a linear array + */ static unsigned inorder_next(unsigned j, unsigned size) { if (j * 2 + 1 < size) { @@ -379,6 +383,10 @@ static unsigned inorder_next(unsigned j, unsigned size) return j; } +/* + * return the array index previous to j when doing an in-order traversal + * of a binary tree which is stored in a linear array + */ static unsigned inorder_prev(unsigned j, unsigned size) { if (j * 2 < size) { @@ -421,6 +429,10 @@ static unsigned __to_inorder(unsigned j, unsigned size, unsigned extra) return j; } +/* + * Return the cacheline index in bset_tree->data, where j is the index + * from a linear array which stores the auxiliary binary tree + */ static unsigned to_inorder(unsigned j, struct bset_tree *t) { return __to_inorder(j, t->size, t->extra); @@ -441,6 +453,10 @@ static unsigned __inorder_to_tree(unsigned j, unsigned size, unsigned extra) return j; } +/* + * Return an index from a linear array which stores the auxiliary binary + * tree, j is the cacheline index of t->data. 
+ */ static unsigned inorder_to_tree(unsigned j, struct bset_tree *t) { return __inorder_to_tree(j, t->size, t->extra); } @@ -546,6 +562,20 @@ static inline uint64_t shrd128(uint64_t high, uint64_t low, uint8_t shift) return low; } +/* + * Calculate mantissa value for struct bkey_float. + * If the most significant bit of f->exponent is not set, then + * - f->exponent >> 6 is 0 + * - p[0] points to bkey->low + * - p[-1] borrows bits from KEY_INODE() of bkey->high + * If the most significant bit of f->exponent is set, then + * - f->exponent >> 6 is 1 + * - p[0] points to bits from KEY_INODE() of bkey->high + * - p[-1] points to other bits from KEY_INODE() of + *bkey->high too. + * See make_bfloat() to check when the most significant bit of f->exponent + * is set or not. + */ static inline unsigned bfloat_mantissa(const struct bkey *k, struct bkey_float *f) { @@ -570,6 +600,16 @@ static void make_bfloat(struct bset_tree *t, unsigned j) BUG_ON(m < l || m > r); BUG_ON(bkey_next(p) != m); + /* +* If l and r have different KEY_INODE values (different backing +* device), f->exponent records how many least significant bits +* are different in KEY_INODE values and sets most significant +* bits to 1 (by +64). +* If l and r have the same KEY_INODE value, f->exponent records +* how many different bits in least significant bits of bkey->low. +* See bfloat_mantissa() for how the most significant bit of +* f->exponent is used to calculate the bfloat mantissa value. +*/ if (KEY_INODE(l) != KEY_INODE(r)) f->exponent = fls64(KEY_INODE(r) ^ KEY_INODE(l)) + 64; else @@ -633,6 +673,15 @@ void bch_bset_init_next(struct btree_keys *b, struct bset *i, uint64_t magic) } EXPORT_SYMBOL(bch_bset_init_next); +/* + * Build auxiliary binary tree 'struct bset_tree *t', this tree is used to + * accelerate bkey search in a btree node (pointed by bset_tree->data in + * memory). 
After searching in the auxiliary tree by calling bset_search_tree(), + * a struct bset_search_iter is returned which indicates the range [l, r] from + * bset_tree->data where the bkey being searched for might be. Then a following + * linear comparison does the exact search, see __bch_bset_search() for how + * the auxiliary tree is used. + */ void bch_bset_build_written_tree(struct btree_keys *b) { struct bset_tree *t = bset_tree_last(b); @@ -898,6 +947,17 @@ static struct bset_search_iter bset_search_tree(struct bset_tree *t, unsigned inorder, j, n = 1; do { + /* +* A bit trick here. +* If p < t->size, (int)(p - t->size) is a negative value and +* the most significant bit is set, right shifting 31 bits +* yields an all-ones mask. If p >= t->size, the most significant bit is +* not set, right shifting 31 bits yields 0. +* So the following 2 lines are equivalent to +* if (p >= t->size) +* p =
[PATCH 06/10] bcache: fix mistaken comments in request.c
This patch updates the code comment in bch_keylist_realloc() by fixing incorrect function names, to make the code more comprehensible. Signed-off-by: Coly Li --- drivers/md/bcache/request.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index 43af905920f5..914d501ad1e0 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -107,7 +107,7 @@ static int bch_keylist_realloc(struct keylist *l, unsigned u64s, /* * The journalling code doesn't handle the case where the keys to insert * is bigger than an empty write: If we just return -ENOMEM here, -* bio_insert() and bio_invalidate() will insert the keys created so far +* bch_data_insert_keys() will insert the keys created so far * and finish the rest when the keylist is empty. */ if (newsize * sizeof(uint64_t) > block_bytes(c) - sizeof(struct jset)) -- 2.18.0
[PATCH 04/10] bcache: add a comment in super.c
This patch adds a line of code comment in super.c:register_bdev(), to make the code more comprehensible. Signed-off-by: Coly Li --- drivers/md/bcache/super.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index c7ffa6ef3f82..f517d7d1fa10 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1291,6 +1291,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page, pr_info("registered backing device %s", dc->backing_dev_name); list_add(&dc->list, &uncached_devices); + /* attach to a matched cache set if it exists */ list_for_each_entry(c, &bch_cache_sets, list) bch_cached_dev_attach(dc, c, NULL); -- 2.18.0
[PATCH 03/10] bcache: avoid unnecessary cache prefetch in bch_btree_node_get()
In bch_btree_node_get() the read-in btree node will be partially prefetched into L1 cache for the following bset iteration (if there is any). But if the btree node read fails, the prefetch operations will waste L1 cache space. This patch checks whether the read operation succeeded, and only does the cache prefetch when the read I/O succeeded. Signed-off-by: Coly Li --- drivers/md/bcache/btree.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 475008fbbaab..c19f7716df88 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -1011,6 +1011,13 @@ struct btree *bch_btree_node_get(struct cache_set *c, struct btree_op *op, BUG_ON(b->level != level); } + if (btree_node_io_error(b)) { + rw_unlock(write, b); + return ERR_PTR(-EIO); + } + + BUG_ON(!b->written); + b->parent = parent; b->accessed = 1; @@ -1022,13 +1029,6 @@ struct btree *bch_btree_node_get(struct cache_set *c, struct btree_op *op, for (; i <= b->keys.nsets; i++) prefetch(b->keys.set[i].data); - if (btree_node_io_error(b)) { - rw_unlock(write, b); - return ERR_PTR(-EIO); - } - - BUG_ON(!b->written); - return b; } -- 2.18.0
[PATCH 05/10] bcache: fix mistaken code comments in bcache.h
This patch updates the code comment in struct cache with correct array names, to make the code to be more comprehensible. Signed-off-by: Coly Li --- drivers/md/bcache/bcache.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 0a3e82b0876d..b393b3fd06b6 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -423,9 +423,9 @@ struct cache { /* * When allocating new buckets, prio_write() gets first dibs - since we * may not be allocate at all without writing priorities and gens. -* prio_buckets[] contains the last buckets we wrote priorities to (so -* gc can mark them as metadata), prio_next[] contains the buckets -* allocated for the next prio write. +* prio_last_buckets[] contains the last buckets we wrote priorities to +* (so gc can mark them as metadata), prio_buckets[] contains the +* buckets allocated for the next prio write. */ uint64_t*prio_buckets; uint64_t*prio_last_buckets; -- 2.18.0
[PATCH 02/10] bcache: display rate debug parameters to 0 when writeback is not running
When writeback is not running, the writeback rate should be 0; any other value is misleading. And the following dynamic writeback rate debug parameters should be 0 too: rate, proportional, integral, change; otherwise they are misleading when writeback is not running. Signed-off-by: Coly Li --- drivers/md/bcache/sysfs.c | 26 -- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index 225b15aa0340..3e9d3459a224 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -149,6 +149,7 @@ SHOW(__bch_cached_dev) struct cached_dev *dc = container_of(kobj, struct cached_dev, disk.kobj); const char *states[] = { "no cache", "clean", "dirty", "inconsistent" }; + int wb = dc->writeback_running; #define var(stat) (dc->stat) @@ -170,7 +171,7 @@ SHOW(__bch_cached_dev) var_printf(writeback_running, "%i"); var_print(writeback_delay); var_print(writeback_percent); - sysfs_hprint(writeback_rate,dc->writeback_rate.rate << 9); + sysfs_hprint(writeback_rate,wb ? dc->writeback_rate.rate << 9 : 0); sysfs_hprint(io_errors, atomic_read(&dc->io_errors)); sysfs_printf(io_error_limit,"%i", dc->error_limit); sysfs_printf(io_disable,"%i", dc->io_disable); @@ -188,15 +189,20 @@ SHOW(__bch_cached_dev) char change[20]; s64 next_io; - bch_hprint(rate,dc->writeback_rate.rate << 9); - bch_hprint(dirty, bcache_dev_sectors_dirty(&dc->disk) << 9); - bch_hprint(target, dc->writeback_rate_target << 9); - bch_hprint(proportional,dc->writeback_rate_proportional << 9); - bch_hprint(integral,dc->writeback_rate_integral_scaled << 9); - bch_hprint(change, dc->writeback_rate_change << 9); - - next_io = div64_s64(dc->writeback_rate.next - local_clock(), - NSEC_PER_MSEC); + /* +* Except for dirty and target, other values should +* be 0 if writeback is not running. +*/ + bch_hprint(rate, wb ? 
dc->writeback_rate.rate << 9 : 0); + bch_hprint(dirty, bcache_dev_sectors_dirty(&dc->disk) << 9); + bch_hprint(target, dc->writeback_rate_target << 9); + bch_hprint(proportional, + wb ? dc->writeback_rate_proportional << 9 : 0); + bch_hprint(integral, + wb ? dc->writeback_rate_integral_scaled << 9 : 0); + bch_hprint(change, wb ? dc->writeback_rate_change << 9 : 0); + next_io = wb ? div64_s64(dc->writeback_rate.next-local_clock(), +NSEC_PER_MSEC) : 0; return sprintf(buf, "rate:\t\t%s/sec\n" -- 2.18.0
[PATCH 00/10] bcache patches for 4.19, 2nd wave
Hi Jens, Here are the 2nd wave bcache patches for 4.19. The patches from me were either verified by other people or posted for quite a long time. Except for the debugfs_create_dir() fix and "set max writeback rate" fix, the rest of the patches are simple or trivial IMHO. Our new bcache developer Shenghui Wang contributes two patches in this run. The first one fixes a misleading error message, and the second one is a code style clean up. Thanks in advance for picking them. Coly Li --- Coly Li (8): bcache: do not check return value of debugfs_create_dir() bcache: display rate debug parameters to 0 when writeback is not running bcache: avoid unnecessary cache prefetch in bch_btree_node_get() bcache: add a comment in super.c bcache: fix mistaken code comments in bcache.h bcache: fix mistaken comments in request.c bcache: add code comments for bset.c bcache: set max writeback rate when I/O request is idle Shenghui Wang (2): bcache: make the pr_err statement used for ENOENT only in sysfs_attach section bcache: trivial - remove trailing backslash in macro BTREE_FLAG drivers/md/bcache/bcache.h| 18 +++ drivers/md/bcache/bset.c | 63 drivers/md/bcache/btree.c | 14 +++--- drivers/md/bcache/btree.h | 2 +- drivers/md/bcache/closure.c | 13 +++-- drivers/md/bcache/closure.h | 4 +- drivers/md/bcache/debug.c | 11 +++-- drivers/md/bcache/request.c | 61 +-- drivers/md/bcache/super.c | 9 +++- drivers/md/bcache/sysfs.c | 41 ++-- drivers/md/bcache/util.c | 2 +- drivers/md/bcache/util.h | 2 +- drivers/md/bcache/writeback.c | 91 +++ 13 files changed, 251 insertions(+), 80 deletions(-) -- 2.18.0
[PATCH 01/10] bcache: do not check return value of debugfs_create_dir()
Greg KH suggests that normal code should not care about debugfs. Therefore no matter whether debugfs_create_dir() succeeds or fails, it is unnecessary to check its return value. There are two functions that call debugfs_create_dir() and check the return value: bch_debug_init() and closure_debug_init(). This patch changes these two functions from int to void type, and ignores the return values of debugfs_create_dir(). This patch does not fix an exact bug, it just makes things work as they should. Signed-off-by: Coly Li Suggested-by: Greg Kroah-Hartman Cc: sta...@vger.kernel.org Cc: Kai Krakow Cc: Kent Overstreet --- drivers/md/bcache/bcache.h | 2 +- drivers/md/bcache/closure.c | 13 + drivers/md/bcache/closure.h | 4 ++-- drivers/md/bcache/debug.c | 11 ++- drivers/md/bcache/super.c | 4 +++- 5 files changed, 21 insertions(+), 13 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 872ef4d67711..0a3e82b0876d 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -1001,7 +1001,7 @@ void bch_open_buckets_free(struct cache_set *); int bch_cache_allocator_start(struct cache *ca); void bch_debug_exit(void); -int bch_debug_init(struct kobject *); +void bch_debug_init(struct kobject *kobj); void bch_request_exit(void); int bch_request_init(void); diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c index 0e14969182c6..618253683d40 100644 --- a/drivers/md/bcache/closure.c +++ b/drivers/md/bcache/closure.c @@ -199,11 +199,16 @@ static const struct file_operations debug_ops = { .release= single_release }; -int __init closure_debug_init(void) +void __init closure_debug_init(void) { - closure_debug = debugfs_create_file("closures", - 0400, bcache_debug, NULL, &debug_ops); - return IS_ERR_OR_NULL(closure_debug); + if (!IS_ERR_OR_NULL(bcache_debug)) + /* +* it is unnecessary to check return value of +* debugfs_create_file(), we should not care +* about this. 
+*/ + closure_debug = debugfs_create_file( + "closures", 0400, bcache_debug, NULL, &debug_ops); } #endif diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h index 71427eb5fdae..7c2c5bc7c88b 100644 --- a/drivers/md/bcache/closure.h +++ b/drivers/md/bcache/closure.h @@ -186,13 +186,13 @@ static inline void closure_sync(struct closure *cl) #ifdef CONFIG_BCACHE_CLOSURES_DEBUG -int closure_debug_init(void); +void closure_debug_init(void); void closure_debug_create(struct closure *cl); void closure_debug_destroy(struct closure *cl); #else -static inline int closure_debug_init(void) { return 0; } +static inline void closure_debug_init(void) {} static inline void closure_debug_create(struct closure *cl) {} static inline void closure_debug_destroy(struct closure *cl) {} diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c index 04d146711950..12034c07257b 100644 --- a/drivers/md/bcache/debug.c +++ b/drivers/md/bcache/debug.c @@ -252,11 +252,12 @@ void bch_debug_exit(void) debugfs_remove_recursive(bcache_debug); } -int __init bch_debug_init(struct kobject *kobj) +void __init bch_debug_init(struct kobject *kobj) { - if (!IS_ENABLED(CONFIG_DEBUG_FS)) - return 0; - + /* +* it is unnecessary to check return value of +* debugfs_create_dir(), we should not care +* about this. +*/ bcache_debug = debugfs_create_dir("bcache", NULL); - return IS_ERR_OR_NULL(bcache_debug); } diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index e0a92104ca23..c7ffa6ef3f82 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -2345,10 +2345,12 @@ static int __init bcache_init(void) goto err; if (bch_request_init() || - bch_debug_init(bcache_kobj) || closure_debug_init() || sysfs_create_files(bcache_kobj, files)) goto err; + bch_debug_init(bcache_kobj); + closure_debug_init(); + return 0; err: bcache_exit(); -- 2.18.0