Re: [PATCH 12/17] bcache: fix code comments style
On 2018/8/10 11:10 AM, shenghui wrote:
>
>
> On 08/09/2018 02:43 PM, Coly Li wrote:
>> This patch fixes 3 style issues warned by checkpatch.pl,
>> - Comment lines are not aligned
>> - Comments use "/*" on subsequent lines
>> - Comment lines use a trailing "*/"
>>
>> Signed-off-by: Coly Li
>> ---
>>  drivers/md/bcache/super.c | 22 +-
>>  1 file changed, 13 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index eac0d5d1ada3..a8b751ba3ebc 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -465,8 +465,8 @@ static struct uuid_entry *uuid_find_empty(struct cache_set *c)
>>   * Bucket priorities/gens:
>>   *
>>   * For each bucket, we store on disk its
>> - *    8 bit gen
>> - *    16 bit priority
>> + * 8 bit gen
>> + * 16 bit priority
>>   *
>>   * See alloc.c for an explanation of the gen. The priority is used to implement
>>   * lru (and in the future other) cache replacement policies; for most purposes
>> @@ -934,8 +934,10 @@ void bch_cached_dev_run(struct cached_dev *dc)
>>
>>      add_disk(d->disk);
>>      bd_link_disk_holder(dc->bdev, dc->disk.disk);
>> -    /* won't show up in the uevent file, use udevadm monitor -e instead
>> -     * only class / kset properties are persistent */
>> +    /*
>> +     * won't show up in the uevent file, use udevadm monitor -e instead
>> +     * only class / kset properties are persistent
>> +     */
>>      kobject_uevent_env(&disk_to_dev(d->disk)->kobj, KOBJ_CHANGE, env);
>>      kfree(env[1]);
>>      kfree(env[2]);
>> @@ -1104,8 +1106,9 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
>>          }
>>      }
>>
>> -    /* Deadlocks since we're called via sysfs...
>> -    sysfs_remove_file(&dc->kobj, &sysfs_attach);
>> +    /*
>> +     * Deadlocks since we're called via sysfs...
>> +     * sysfs_remove_file(&dc->kobj, &sysfs_attach);
>>      */
>>
>>      if (bch_is_zero(u->uuid, 16)) {
>> @@ -1468,9 +1471,10 @@ bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...)
>>      if (test_and_set_bit(CACHE_SET_IO_DISABLE, &c->flags))
>>          pr_info("CACHE_SET_IO_DISABLE already set");
>>
>> -    /* XXX: we can be called from atomic context
>> -    acquire_console_sem();
>> -    */
>> +    /*
>> +     * XXX: we can be called from atomic context
>> +     * acquire_console_sem();
>> +     */
>>
>>      pr_err("bcache: error on %pU: ", c->sb.set_uuid);
>>
>
> Hi Coly,
>
> I noticed some missing:
> bset.c:405:/* I have no idea why this code works... and I'm the one who
> wrote it

This is a single line comment, IMHO it is OK.

> bset.c:796:     /* k is the key we just inserted; we need to find the
> entry in the
> bset.c:804:     /* Adjust all the lookup table entries, and find a new
> key for any that
> writeback.c:471:/* We've acquired a semaphore
> for the maximum

The above lines should be fixed, I will update them in the v2 series.

> And some comments start with '/**', like:
> 85 /**
> 86  * bch_hprint - formats @v to human readable string for sysfs.

I see this kind of comment used for library routines; the original author
may just have treated them as library code. So it's OK for me.

Thanks for your review.

Coly Li
Re: [RFC PATCH 1/5] block: move call of scheduler's ->completed_request() hook
Hi Omar

On 08/10/2018 04:26 AM, Omar Sandoval wrote:
> @@ -524,6 +524,9 @@ inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
>               blk_stat_add(rq, now);
>       }
>
> +     if (rq->internal_tag != -1)
> +             blk_mq_sched_completed_request(rq, now);
> +

Is it OK to move the io scheduler's completed callback into
__blk_mq_end_request()?

There is a relatively long distance between __blk_mq_complete_request()
and __blk_mq_end_request(), especially between the driver's
mq_ops.complete and bio_endio.

Thanks
Jianchao
Re: [PATCH v6 08/12] block, scsi: Introduce blk_pm_runtime_exit()
Hi Bart

On 08/10/2018 03:41 AM, Bart Van Assche wrote:
> +void blk_pm_runtime_exit(struct request_queue *q)
> +{
> +     if (!q->dev)
> +             return;
> +
> +     pm_runtime_get_sync(q->dev);
> +     q->dev = NULL;
> +}
> +EXPORT_SYMBOL(blk_pm_runtime_exit);
> +
>  /**
>   * blk_pre_runtime_suspend - Pre runtime suspend check
>   * @q: the queue of the device
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 69ab459abb98..5537762dfcfd 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3420,12 +3420,11 @@ static int sd_probe(struct device *dev)
>   **/
>  static int sd_remove(struct device *dev)
>  {
> -     struct scsi_disk *sdkp;
> -     dev_t devt;
> +     struct scsi_disk *sdkp = dev_get_drvdata(dev);
> +     struct scsi_device *sdp = sdkp->device;
> +     dev_t devt = disk_devt(sdkp->disk);
>
> -     sdkp = dev_get_drvdata(dev);
> -     devt = disk_devt(sdkp->disk);
> -     scsi_autopm_get_device(sdkp->device);
> +     blk_pm_runtime_exit(sdp->request_queue)

Based on __scsi_device_remove, sd_remove will be invoked before
blk_cleanup_queue, so it should not be safe to set q->dev to NULL here.

A following patch in this series has these operations:

+static inline void blk_pm_mark_last_busy(struct request *rq)
+{
+     if (rq->q->dev && !(rq->rq_flags & RQF_PM))
+             pm_runtime_mark_last_busy(rq->q->dev);
+}

Thanks
Jianchao
Fix for 84676c1f (b5b6e8c8) missing in 4.14.y
Hi Ming (and all),

Your series "scsi: virtio_scsi: fix IO hang caused by irq vector automatic
affinity", which forces virtio-scsi to use blk-mq, fixes an issue introduced
by 84676c1f. We noticed that this bug also exists in 4.14.y (as
ef86f3a72adb), but your series was not backported to that stable branch.

Are there any plans to do that? At least CoreOS is using 4.14 and showing
issues on AHV (which provides an mq virtio-scsi controller).

Thanks,
Felipe
Re: Fix for 84676c1f (b5b6e8c8) missing in 4.14.y
On Fri, Aug 10, 2018 at 02:09:01AM +, Felipe Franciosi wrote:
> Hi Ming (and all),
>
> Your series "scsi: virtio_scsi: fix IO hang caused by irq vector automatic
> affinity" which forces virtio-scsi to use blk-mq fixes an issue introduced
> by 84676c1f. We noticed that this bug also exists in 4.14.y (as
> ef86f3a72adb), but your series was not backported to that stable branch.
>
> Are there any plans to do that? At least CoreOS is using 4.14 and showing
> issues on AHV (which provides an mq virtio-scsi controller).
>

Hi Felipe,

It looks like the following 4 patches should have been marked for stable;
sorry for missing that:

b5b6e8c8d3b4 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
2f31115e940c scsi: core: introduce force_blk_mq
adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
8b834bff1b73 scsi: hpsa: fix selection of reply queue

Usually this backporting is done by our stable guys, so I will CC stable
and leave it to them, but I am happy to provide any help with addressing
conflicts or that sort of thing.

Thanks,
Ming
Re: [PATCH v6 10/12] block: Change the runtime power management approach (1/2)
Hi Bart

On 08/10/2018 03:41 AM, Bart Van Assche wrote:
> Instead of scheduling runtime resume of a request queue after a
> request has been queued, schedule asynchronous resume during request
> allocation. The new pm_request_resume() calls occur after
> blk_queue_enter() has increased the q_usage_counter request queue
                                      ^^^
> member. This change is needed for a later patch that will make request
> allocation block while the queue status is not RPM_ACTIVE.

Is it "after getting q->q_usage_counter fails"?

And also, this blk_pm_request_resume() will not affect the normal path. ;)

Thanks
Jianchao
Re: [PATCH v6 11/12] block: Change the runtime power management approach (2/2)
Hi Bart

On 08/10/2018 03:41 AM, Bart Van Assche wrote:
> +
> +     blk_set_pm_only(q);
> +     /*
> +      * This function only gets called if the most recent
> +      * pm_request_resume() call occurred at least autosuspend_delay_ms
> +      * ago. Since blk_queue_enter() is called by the request allocation
> +      * code before pm_request_resume(), if no requests have a tag assigned
> +      * it is safe to suspend the device.
> +      */
> +     ret = -EBUSY;
> +     if (blk_requests_in_flight(q) == 0) {
> +             /*
> +              * Call synchronize_rcu() such that later blk_queue_enter()
> +              * calls see the preempt-only state. See also
> +              * http://lwn.net/Articles/573497/.
> +              */
> +             synchronize_rcu();
> +             if (blk_requests_in_flight(q) == 0)
> +                     ret = 0;
> +     }

I still think blk_set_pm_only() should be moved after the
blk_requests_in_flight() check. Otherwise, normal I/O will be blocked for
a little while if there are still busy requests.

Thanks
Jianchao
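The reordering suggested in this review can be sketched as a small
user-space model: check for in-flight requests before gating the queue, so
that a suspend attempt that is going to fail never blocks normal I/O. All
names below are illustrative stand-ins, not the kernel block-layer API:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-ins for the request queue state involved. */
struct model_queue {
        int in_flight;  /* stands in for blk_requests_in_flight() */
        bool pm_only;   /* stands in for the blk_set_pm_only() gate */
};

static int model_pre_suspend(struct model_queue *q)
{
        if (q->in_flight > 0)
                return -EBUSY;          /* bail out before gating normal I/O */

        q->pm_only = true;              /* gate new submitters */
        /* ... a synchronize_rcu() would sit here in the kernel ... */
        if (q->in_flight > 0) {         /* re-check once the gate is visible */
                q->pm_only = false;
                return -EBUSY;
        }
        return 0;                       /* idle: safe to suspend */
}
```

With this ordering, a busy queue returns -EBUSY without the gate ever being
raised, which is exactly the "don't block normal I/O for a little while"
property argued for above.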
Re: [PATCH v6 05/12] block, scsi: Rename QUEUE_FLAG_PREEMPT_ONLY into DV_ONLY and introduce PM_ONLY
Hi Bart

On 08/10/2018 03:41 AM, Bart Van Assche wrote:
> +/*
> + * Whether or not blk_queue_enter() should proceed. RQF_PM requests are
> + * always allowed. RQF_DV requests are allowed if the PM_ONLY queue flag
> + * has not been set. Other requests are only allowed if neither PM_ONLY
> + * nor DV_ONLY has been set.
> + */
> +static inline bool blk_enter_allowed(struct request_queue *q,
> +                                    blk_mq_req_flags_t flags)
> +{
> +     return flags & BLK_MQ_REQ_PM ||
> +            (!blk_queue_pm_only(q) &&
> +             (flags & BLK_MQ_REQ_DV || !blk_queue_dv_only(q)));
> +}

If a new state is indeed necessary, this kind of checking in the hot path
is inefficient. How about introducing a new member in the request_queue,
such as request_queue->gate_state? Set the PM_ONLY and DV_ONLY bits in
this state; then we could just check request_queue->gate_state > 0 before
doing any further checking.

Thanks
Jianchao
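The single "gate state" word suggested in this reply can be sketched as
follows. The bit and flag names are hypothetical; only the semantics mirror
the quoted blk_enter_allowed() (PM requests always pass, DV requests pass
unless PM_ONLY is set, normal requests pass only when no gate bit is set):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical gate bits kept in one word in the queue. */
#define GATE_PM_ONLY    (1u << 0)
#define GATE_DV_ONLY    (1u << 1)

/* Hypothetical request flags. */
#define REQ_PM          (1u << 0)
#define REQ_DV          (1u << 1)

static bool gate_enter_allowed(unsigned int gate_state, unsigned int flags)
{
        if (gate_state == 0)            /* hot path: one load, one test */
                return true;
        if (flags & REQ_PM)             /* PM requests always allowed */
                return true;
        if (flags & REQ_DV)             /* DV allowed unless PM_ONLY set */
                return !(gate_state & GATE_PM_ONLY);
        return false;                   /* any gate bit blocks normal I/O */
}
```

The common case (no gating active) is decided by a single comparison
against zero, which is the efficiency argument being made above.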
Re: [PATCH v6 02/12] scsi: Alter handling of RQF_DV requests
On Thu, Aug 09, 2018 at 12:41:39PM -0700, Bart Van Assche wrote:
> Process all requests in state SDEV_CREATED instead of only RQF_DV
> requests. This does not change the behavior of the SCSI core because
> the SCSI device state is modified into another state before SCSI
> devices become visible in sysfs and before any device nodes are
> created in /dev. Do not process RQF_DV requests in state SDEV_CANCEL
> because only power management requests should be processed in this
> state.

Could you explain a bit why only PM requests should be processed in
SDEV_CANCEL?

Thanks,
Ming
[RFC PATCH 4/5] kyber: implement improved heuristics
From: Omar Sandoval

Kyber's current heuristics have a few flaws:

- It's based on the mean latency, but p99 latency tends to be more
  meaningful to anyone who cares about latency. The mean can also be
  skewed by rare outliers that the scheduler can't do anything about.
- The statistics calculations are purely time-based with a short window.
  This works for steady, high load, but is more sensitive to outliers
  with bursty workloads.
- It only considers the latency once an I/O has been submitted to the
  device, but the user cares about the time spent in the kernel, as
  well.

These are shortcomings of the generic blk-stat code which doesn't quite
fit the ideal use case for Kyber. So, this replaces the statistics with
a histogram used to calculate percentiles of total latency and I/O
latency, which we then use to adjust depths in a slightly more
intelligent manner:

- Sync and async writes are now the same domain.
- Discards are a separate domain.
- Domain queue depths are scaled by the ratio of the p99 total latency
  to the target latency (e.g., if the p99 latency is double the target
  latency, we will double the queue depth; if the p99 latency is half
  of the target latency, we can halve the queue depth).
- We use the I/O latency to determine whether we should scale queue
  depths down: we will only scale down if any domain's I/O latency
  exceeds the target latency, which is an indicator of congestion in
  the device.

These new heuristics are just as scalable as the heuristics they
replace.

Signed-off-by: Omar Sandoval
---
 block/kyber-iosched.c | 497 --
 1 file changed, 279 insertions(+), 218 deletions(-)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 08eb5295c18d..adc8e6393829 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -29,13 +29,16 @@
 #include "blk-mq-debugfs.h"
 #include "blk-mq-sched.h"
 #include "blk-mq-tag.h"
-#include "blk-stat.h"

-/* Scheduling domains. */
+/*
+ * Scheduling domains: the device is divided into multiple domains based on the
+ * request type.
+ */
 enum {
        KYBER_READ,
-       KYBER_SYNC_WRITE,
-       KYBER_OTHER, /* Async writes, discard, etc. */
+       KYBER_WRITE,
+       KYBER_DISCARD,
+       KYBER_OTHER,
        KYBER_NUM_DOMAINS,
 };

@@ -49,25 +52,82 @@ enum {
 };

 /*
- * Initial device-wide depths for each scheduling domain.
+ * Maximum device-wide depth for each scheduling domain.
  *
- * Even for fast devices with lots of tags like NVMe, you can saturate
- * the device with only a fraction of the maximum possible queue depth.
- * So, we cap these to a reasonable value.
+ * Even for fast devices with lots of tags like NVMe, you can saturate the
+ * device with only a fraction of the maximum possible queue depth. So, we cap
+ * these to a reasonable value.
  */
 static const unsigned int kyber_depth[] = {
        [KYBER_READ] = 256,
-       [KYBER_SYNC_WRITE] = 128,
-       [KYBER_OTHER] = 64,
+       [KYBER_WRITE] = 128,
+       [KYBER_DISCARD] = 64,
+       [KYBER_OTHER] = 16,
 };

 /*
- * Scheduling domain batch sizes. We favor reads.
+ * Default latency targets for each scheduling domain.
+ */
+static const u64 kyber_latency_targets[] = {
+       [KYBER_READ] = 2 * NSEC_PER_MSEC,
+       [KYBER_WRITE] = 10 * NSEC_PER_MSEC,
+       [KYBER_DISCARD] = 5 * NSEC_PER_SEC,
+};
+
+/*
+ * Batch size (number of requests we'll dispatch in a row) for each scheduling
+ * domain.
  */
 static const unsigned int kyber_batch_size[] = {
        [KYBER_READ] = 16,
-       [KYBER_SYNC_WRITE] = 8,
-       [KYBER_OTHER] = 8,
+       [KYBER_WRITE] = 8,
+       [KYBER_DISCARD] = 1,
+       [KYBER_OTHER] = 1,
+};
+
+/*
+ * Requests latencies are recorded in a histogram with buckets defined relative
+ * to the target latency:
+ *
+ * <= 1/4 * target latency
+ * <= 1/2 * target latency
+ * <= 3/4 * target latency
+ * <= target latency
+ * <= 1 1/4 * target latency
+ * <= 1 1/2 * target latency
+ * <= 1 3/4 * target latency
+ * > 1 3/4 * target latency
+ */
+enum {
+       /*
+        * The width of the latency histogram buckets is
+        * 1 / (1 << KYBER_LATENCY_SHIFT) * target latency.
+        */
+       KYBER_LATENCY_SHIFT = 2,
+       /*
+        * The first (1 << KYBER_LATENCY_SHIFT) buckets are <= target latency,
+        * thus, "good".
+        */
+       KYBER_GOOD_BUCKETS = 1 << KYBER_LATENCY_SHIFT,
+       /* There are also (1 << KYBER_LATENCY_SHIFT) "bad" buckets. */
+       KYBER_LATENCY_BUCKETS = 2 << KYBER_LATENCY_SHIFT,
+};
+
+/*
+ * We measure both the total latency and the I/O latency (i.e., latency after
+ * submitting to the device).
+ */
+enum {
+       KYBER_TOTAL_LATENCY,
+       KYBER_IO_LATENCY,
+};
+
+/*
+ * Per-cpu latency histograms: total latency and I/O latency for each scheduling
+ * domain except for KYBER_OTHER.
+ */
+struct kyber_cpu_latency {
+       atomic_t buckets[KYBER_OTHER][2][KYBER_LATENCY_BUCKETS];
 };

 /*
@@ -84,14 +144,9 @@ struct kyber
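The bucket scheme described in the commit message above can be modeled in
user space. The arithmetic below is a simplified sketch of the heuristic —
buckets a quarter of the target latency wide, a percentile lookup over the
histogram, and depth scaling by target/p99 — not the exact kernel
implementation; constant names only mirror the ones in the patch:

```c
#include <assert.h>

#define SHIFT           2                       /* KYBER_LATENCY_SHIFT */
#define GOOD_BUCKETS    (1u << SHIFT)           /* 4 buckets <= target */
#define NUM_BUCKETS     (2u << SHIFT)           /* 8 buckets total */

/* Map a latency sample to a bucket; bucket width is target / 4. */
static unsigned int latency_bucket(unsigned long long latency,
                                   unsigned long long target)
{
        unsigned long long width = target >> SHIFT;
        unsigned long long b = latency ? (latency - 1) / width : 0;

        return b < NUM_BUCKETS - 1 ? (unsigned int)b : NUM_BUCKETS - 1;
}

/* Smallest bucket index covering at least `percentile` of the samples. */
static int percentile_bucket(const unsigned int buckets[NUM_BUCKETS],
                             unsigned int percentile)
{
        unsigned int total = 0, seen = 0, i;

        for (i = 0; i < NUM_BUCKETS; i++)
                total += buckets[i];
        if (!total)
                return -1;                      /* no samples yet */
        for (i = 0; i < NUM_BUCKETS - 1; i++) {
                seen += buckets[i];
                if (seen * 100ull >= (unsigned long long)total * percentile)
                        break;
        }
        return (int)i;
}

/*
 * Bucket b means p99 is roughly (b + 1)/4 of the target, so scale the depth
 * by target / p99: new depth = old depth * 4 / (b + 1).
 */
static unsigned int scale_depth(unsigned int depth, int p99_bucket)
{
        unsigned int d = depth * GOOD_BUCKETS / (unsigned int)(p99_bucket + 1);

        return d ? d : 1;
}
```

For example, a p99 that lands in the worst bucket (more than 1 3/4 of the
target) halves the depth, while a p99 at half the target doubles it —
matching the scaling rule stated in the commit message.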
[RFC PATCH 5/5] kyber: add tracepoints
From: Omar Sandoval

When debugging Kyber, it's really useful to know what latencies we've
been having and how the domain depths have been adjusted. Add two
tracepoints, kyber_latency and kyber_adjust, to record that.

Signed-off-by: Omar Sandoval
---
 block/kyber-iosched.c        | 46 +-
 include/trace/events/kyber.h | 76 
 2 files changed, 104 insertions(+), 18 deletions(-)
 create mode 100644 include/trace/events/kyber.h

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index adc8e6393829..d8ddcb4f2435 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -30,6 +30,9 @@
 #include "blk-mq-sched.h"
 #include "blk-mq-tag.h"

+#define CREATE_TRACE_POINTS
+#include
+
 /*
  * Scheduling domains: the device is divided into multiple domains based on the
  * request type.
@@ -42,6 +45,13 @@ enum {
        KYBER_NUM_DOMAINS,
 };

+static const char *kyber_domain_names[] = {
+       [KYBER_READ] = "READ",
+       [KYBER_WRITE] = "WRITE",
+       [KYBER_DISCARD] = "DISCARD",
+       [KYBER_OTHER] = "OTHER",
+};
+
 enum {
        /*
         * In order to prevent starvation of synchronous requests by a flood of
@@ -122,6 +132,11 @@ enum {
        KYBER_IO_LATENCY,
 };

+static const char *kyber_latency_type_names[] = {
+       [KYBER_TOTAL_LATENCY] = "total",
+       [KYBER_IO_LATENCY] = "I/O",
+};
+
 /*
  * Per-cpu latency histograms: total latency and I/O latency for each scheduling
  * domain except for KYBER_OTHER.
@@ -144,6 +159,8 @@ struct kyber_ctx_queue {
 } ____cacheline_aligned_in_smp;

 struct kyber_queue_data {
+       struct request_queue *q;
+
        /*
         * Each scheduling domain has a limited number of in-flight requests
         * device-wide, limited by these tokens.
@@ -249,6 +266,10 @@ static int calculate_percentile(struct kyber_queue_data *kqd,
        }
        memset(buckets, 0, sizeof(kqd->latency_buckets[sched_domain][type]));

+       trace_kyber_latency(kqd->q, kyber_domain_names[sched_domain],
+                           kyber_latency_type_names[type], percentile,
+                           bucket + 1, 1 << KYBER_LATENCY_SHIFT, samples);
+
        return bucket;
 }

@@ -256,8 +277,11 @@ static void kyber_resize_domain(struct kyber_queue_data *kqd,
                                unsigned int sched_domain, unsigned int depth)
 {
        depth = clamp(depth, 1U, kyber_depth[sched_domain]);
-       if (depth != kqd->domain_tokens[sched_domain].sb.depth)
+       if (depth != kqd->domain_tokens[sched_domain].sb.depth) {
                sbitmap_queue_resize(&kqd->domain_tokens[sched_domain], depth);
+               trace_kyber_adjust(kqd->q, kyber_domain_names[sched_domain],
+                                  depth);
+       }
 }

 static void kyber_timer_fn(struct timer_list *t)
@@ -360,6 +384,8 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
        if (!kqd)
                goto err;

+       kqd->q = q;
+
        kqd->cpu_latency = alloc_percpu_gfp(struct kyber_cpu_latency,
                                            GFP_KERNEL | __GFP_ZERO);
        if (!kqd->cpu_latency)
@@ -944,23 +970,7 @@ static int kyber_cur_domain_show(void *data, struct seq_file *m)
        struct blk_mq_hw_ctx *hctx = data;
        struct kyber_hctx_data *khd = hctx->sched_data;

-       switch (khd->cur_domain) {
-       case KYBER_READ:
-               seq_puts(m, "READ\n");
-               break;
-       case KYBER_WRITE:
-               seq_puts(m, "WRITE\n");
-               break;
-       case KYBER_DISCARD:
-               seq_puts(m, "DISCARD\n");
-               break;
-       case KYBER_OTHER:
-               seq_puts(m, "OTHER\n");
-               break;
-       default:
-               seq_printf(m, "%u\n", khd->cur_domain);
-               break;
-       }
+       seq_printf(m, "%s\n", kyber_domain_names[khd->cur_domain]);

        return 0;
 }
diff --git a/include/trace/events/kyber.h b/include/trace/events/kyber.h
new file mode 100644
index ..9bf85dbef492
--- /dev/null
+++ b/include/trace/events/kyber.h
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM kyber
+
+#if !defined(_TRACE_KYBER_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_KYBER_H
+
+#include
+#include
+
+#define DOMAIN_LEN             16
+#define LATENCY_TYPE_LEN       8
+
+TRACE_EVENT(kyber_latency,
+
+       TP_PROTO(struct request_queue *q, const char *domain, const char *type,
+                unsigned int percentile, unsigned int numerator,
+                unsigned int denominator, unsigned int samples),
+
+       TP_ARGS(q, domain, type, percentile, numerator, denominator, samples),
+
+       TP_STRUCT__entry(
+               __field(dev_t, dev)
+               __array(char, domain, DOMAIN_LEN)
+               __array(char, type, LATENCY_TYPE_LEN
[RFC PATCH 3/5] kyber: don't make domain token sbitmap larger than necessary
From: Omar Sandoval

The domain token sbitmaps are currently initialized to the device queue
depth or 256, whichever is larger, and immediately resized to the
maximum depth for that domain (256, 128, or 64 for read, write, and
other, respectively). The sbitmap is never resized larger than that, so
it's unnecessary to allocate a bitmap larger than the maximum depth.
Let's just allocate it to the maximum depth to begin with. This will use
marginally less memory, and more importantly, give us a more appropriate
number of bits per sbitmap word.

Signed-off-by: Omar Sandoval
---
 block/kyber-iosched.c | 15 ++-
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 95d062c07c61..08eb5295c18d 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -40,8 +40,6 @@ enum {
 };

 enum {
-       KYBER_MIN_DEPTH = 256,
-
        /*
         * In order to prevent starvation of synchronous requests by a flood of
         * asynchronous requests, we reserve 25% of requests for synchronous
@@ -305,7 +303,6 @@ static int kyber_bucket_fn(const struct request *rq)
 static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
 {
        struct kyber_queue_data *kqd;
-       unsigned int max_tokens;
        unsigned int shift;
        int ret = -ENOMEM;
        int i;
@@ -320,25 +317,17 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
        if (!kqd->cb)
                goto err_kqd;

-       /*
-        * The maximum number of tokens for any scheduling domain is at least
-        * the queue depth of a single hardware queue. If the hardware doesn't
-        * have many tags, still provide a reasonable number.
-        */
-       max_tokens = max_t(unsigned int, q->tag_set->queue_depth,
-                          KYBER_MIN_DEPTH);
        for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
                WARN_ON(!kyber_depth[i]);
                WARN_ON(!kyber_batch_size[i]);
                ret = sbitmap_queue_init_node(&kqd->domain_tokens[i],
-                                             max_tokens, -1, false, GFP_KERNEL,
-                                             q->node);
+                                             kyber_depth[i], -1, false,
+                                             GFP_KERNEL, q->node);
                if (ret) {
                        while (--i >= 0)
                                sbitmap_queue_free(&kqd->domain_tokens[i]);
                        goto err_cb;
                }
-               sbitmap_queue_resize(&kqd->domain_tokens[i], kyber_depth[i]);
        }

        shift = kyber_sched_tags_shift(kqd);
-- 
2.18.0
[RFC PATCH 1/5] block: move call of scheduler's ->completed_request() hook
From: Omar Sandoval

Commit 4bc6339a583c ("block: move blk_stat_add() to
__blk_mq_end_request()") consolidated some calls using ktime_get() so
we'd only need to call it once. Kyber's ->completed_request() hook also
calls ktime_get(), so let's move it to the same place, too.

Signed-off-by: Omar Sandoval
---
 block/blk-mq-sched.h     | 4 ++--
 block/blk-mq.c           | 5 +++--
 block/kyber-iosched.c    | 5 ++---
 include/linux/elevator.h | 2 +-
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 0cb8f938dff9..74fb6ff9a30d 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -54,12 +54,12 @@ blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
        return true;
 }

-static inline void blk_mq_sched_completed_request(struct request *rq)
+static inline void blk_mq_sched_completed_request(struct request *rq, u64 now)
 {
        struct elevator_queue *e = rq->q->elevator;

        if (e && e->type->ops.mq.completed_request)
-               e->type->ops.mq.completed_request(rq);
+               e->type->ops.mq.completed_request(rq, now);
 }

 static inline void blk_mq_sched_started_request(struct request *rq)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 654b0dc7e001..0c5be0001d0f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -524,6 +524,9 @@ inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
                blk_stat_add(rq, now);
        }

+       if (rq->internal_tag != -1)
+               blk_mq_sched_completed_request(rq, now);
+
        blk_account_io_done(rq, now);

        if (rq->end_io) {
@@ -560,8 +563,6 @@ static void __blk_mq_complete_request(struct request *rq)
        if (!blk_mq_mark_complete(rq))
                return;

-       if (rq->internal_tag != -1)
-               blk_mq_sched_completed_request(rq);

        if (!test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags)) {
                rq->q->softirq_done_fn(rq);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index a1660bafc912..95d062c07c61 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -558,12 +558,12 @@ static void kyber_finish_request(struct request *rq)
        rq_clear_domain_token(kqd, rq);
 }

-static void kyber_completed_request(struct request *rq)
+static void kyber_completed_request(struct request *rq, u64 now)
 {
        struct request_queue *q = rq->q;
        struct kyber_queue_data *kqd = q->elevator->elevator_data;
        unsigned int sched_domain;
-       u64 now, latency, target;
+       u64 latency, target;

        /*
         * Check if this request met our latency goal. If not, quickly gather
@@ -585,7 +585,6 @@ static void kyber_completed_request(struct request *rq)
        if (blk_stat_is_active(kqd->cb))
                return;

-       now = ktime_get_ns();
        if (now < rq->io_start_time_ns)
                return;

diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index a02deea30185..015bb59c0331 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -111,7 +111,7 @@ struct elevator_mq_ops {
        void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, bool);
        struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);
        bool (*has_work)(struct blk_mq_hw_ctx *);
-       void (*completed_request)(struct request *);
+       void (*completed_request)(struct request *, u64);
        void (*started_request)(struct request *);
        void (*requeue_request)(struct request *);
        struct request *(*former_request)(struct request_queue *, struct request *);
-- 
2.18.0
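The signature change in the patch above — sample the clock once in the
completion path and hand the timestamp to the scheduler hook instead of
letting the hook call ktime_get() again — can be modeled in user space.
All names here are illustrative, not the kernel API:

```c
#include <assert.h>

/* Toy elevator ops: the hook now receives the completion timestamp. */
struct model_elevator {
        void (*completed_request)(unsigned long long now);
};

static unsigned long long last_seen_now;

static void model_sched_completed(unsigned long long now)
{
        last_seen_now = now;    /* Kyber would bucket a latency using `now` */
}

static void model_end_request(struct model_elevator *e, unsigned long long now)
{
        /* the stats code (blk_stat_add equivalent) would consume `now` here */
        if (e && e->completed_request)
                e->completed_request(now);      /* one clock sample, shared */
}
```

The design point is simply that both consumers of the timestamp agree on a
single reading, and the hook can compare it against the request's start
times without a second clock read.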
[RFC PATCH 0/5] kyber: better heuristics
From: Omar Sandoval

Hello,

I've spent the past few weeks experimenting with different heuristics
for Kyber in order to deal with some edge cases we've hit here. This
series is my progress so far, implementing less handwavy heuristics
while keeping the same basic mechanisms.

Patches 1 and 2 are preparation. Patch 3 is a minor optimization. Patch
4 is the main change, and includes a detailed description of the new
heuristics. Patch 5 adds tracepoints for debugging.

Please, take a look and try it out.

Thanks!

Omar Sandoval (5):
  block: move call of scheduler's ->completed_request() hook
  block: export blk_stat_enable_accounting()
  kyber: don't make domain token sbitmap larger than necessary
  kyber: implement improved heuristics
  kyber: add tracepoints

 block/blk-mq-sched.h         |   4 +-
 block/blk-mq.c               |   5 +-
 block/blk-stat.c             |   1 +
 block/kyber-iosched.c        | 541 +++
 include/linux/elevator.h     |   2 +-
 include/trace/events/kyber.h |  76 +
 6 files changed, 383 insertions(+), 246 deletions(-)
 create mode 100644 include/trace/events/kyber.h

-- 
2.18.0
[RFC PATCH 2/5] block: export blk_stat_enable_accounting()
From: Omar Sandoval

Kyber will need this in a future change if it is built as a module.

Signed-off-by: Omar Sandoval
---
 block/blk-stat.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-stat.c b/block/blk-stat.c
index 175c143ac5b9..d98f3ad6794e 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -190,6 +190,7 @@ void blk_stat_enable_accounting(struct request_queue *q)
        blk_queue_flag_set(QUEUE_FLAG_STATS, q);
        spin_unlock(&q->stats->lock);
 }
+EXPORT_SYMBOL_GPL(blk_stat_enable_accounting);

 struct blk_queue_stats *blk_alloc_queue_stats(void)
 {
-- 
2.18.0
Re: [PATCH v5 1/3] blkcg: Introduce blkg_root_lookup()
On 8/9/18 2:17 PM, Bart Van Assche wrote:
> On Thu, 2018-08-09 at 12:56 -0700, Tejun Heo wrote:
>> Hello,
>>
>> On Thu, Aug 09, 2018 at 07:53:36AM -0700, Bart Van Assche wrote:
>>> +/**
>>> + * blkg_lookup - look up blkg for the specified request queue
>>> + * @q: request_queue of interest
>>> + *
>>> + * Lookup blkg for @q at the root level. See also blkg_lookup().
>>> + */
>>> +static inline struct blkcg_gq *blkg_root_lookup(struct request_queue *q)
>>> +{
>>> +    struct blkcg_gq *blkg;
>>> +
>>> +    rcu_read_lock();
>>> +    blkg = blkg_lookup(&blkcg_root, q);
>>> +    rcu_read_unlock();
>>> +
>>> +    return blkg;
>>> +}
>>
>> So, root blkg is always available as long as the request_queue is
>> alive. Sth like the following would be simpler?
>>
>> static inline struct blkcg_gq *blk_queue_root_blkg(struct request_queue *q)
>> {
>>     return q->root_blkg;
>> }
>
> That would not only be simpler, it would also avoid that
> blk_queue_root_blkg() returns NULL if bypass mode is enabled and the
> request queue is still associated with the block cgroup controller.
>
> How do you want to proceed with this change?

Do a followup patch?

-- 
Jens Axboe
Re: [PATCH v5 1/3] blkcg: Introduce blkg_root_lookup()
On Thu, 2018-08-09 at 12:56 -0700, Tejun Heo wrote:
> Hello,
>
> On Thu, Aug 09, 2018 at 07:53:36AM -0700, Bart Van Assche wrote:
> > +/**
> > + * blkg_lookup - look up blkg for the specified request queue
> > + * @q: request_queue of interest
> > + *
> > + * Lookup blkg for @q at the root level. See also blkg_lookup().
> > + */
> > +static inline struct blkcg_gq *blkg_root_lookup(struct request_queue *q)
> > +{
> > +    struct blkcg_gq *blkg;
> > +
> > +    rcu_read_lock();
> > +    blkg = blkg_lookup(&blkcg_root, q);
> > +    rcu_read_unlock();
> > +
> > +    return blkg;
> > +}
>
> So, root blkg is always available as long as the request_queue is
> alive. Sth like the following would be simpler?
>
> static inline struct blkcg_gq *blk_queue_root_blkg(struct request_queue *q)
> {
>     return q->root_blkg;
> }

That would not only be simpler, it would also avoid that
blk_queue_root_blkg() returns NULL if bypass mode is enabled and the
request queue is still associated with the block cgroup controller.

How do you want to proceed with this change?

Bart.
Re: [PATCH v5 1/3] blkcg: Introduce blkg_root_lookup()
Hello,

On Thu, Aug 09, 2018 at 07:53:36AM -0700, Bart Van Assche wrote:
> +/**
> + * blkg_lookup - look up blkg for the specified request queue
> + * @q: request_queue of interest
> + *
> + * Lookup blkg for @q at the root level. See also blkg_lookup().
> + */
> +static inline struct blkcg_gq *blkg_root_lookup(struct request_queue *q)
> +{
> +    struct blkcg_gq *blkg;
> +
> +    rcu_read_lock();
> +    blkg = blkg_lookup(&blkcg_root, q);
> +    rcu_read_unlock();
> +
> +    return blkg;
> +}

So, root blkg is always available as long as the request_queue is
alive. Sth like the following would be simpler?

static inline struct blkcg_gq *blk_queue_root_blkg(struct request_queue *q)
{
    return q->root_blkg;
}

-- 
tejun
[PATCH v6 01/12] block, scsi: Introduce request flag RQF_DV
Instead of marking all power management, SCSI domain validation and IDE
preempt requests with RQF_PREEMPT, only mark IDE preempt requests with
RQF_PREEMPT. Use RQF_DV to mark requests submitted by scsi_execute() and
RQF_PM to mark power management requests. Most but not all power
management requests already have the RQF_PM flag set.

Signed-off-by: Bart Van Assche
Cc: Martin K. Petersen
Cc: David S. Miller
Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Jianchao Wang
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
Cc: Alan Stern
---
 block/blk-core.c        | 11 +--
 block/blk-mq-debugfs.c  |  1 +
 block/blk-mq.c          |  2 --
 drivers/ide/ide-pm.c    |  3 ++-
 drivers/scsi/scsi_lib.c | 11 ---
 include/linux/blk-mq.h  |  6 --
 include/linux/blkdev.h  |  5 +++--
 7 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 12550340418d..c4ff58491758 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -913,11 +913,11 @@ EXPORT_SYMBOL(blk_alloc_queue);
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
- * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
+ * @flags: BLK_MQ_REQ_NOWAIT, BLK_MQ_REQ_PM and/or BLK_MQ_REQ_DV
  */
 int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
-       const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
+       const bool preempt = flags & (BLK_MQ_REQ_PM | BLK_MQ_REQ_DV);

        while (true) {
                bool success = false;
@@ -1436,8 +1436,6 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
        blk_rq_set_rl(rq, rl);
        rq->cmd_flags = op;
        rq->rq_flags = rq_flags;
-       if (flags & BLK_MQ_REQ_PREEMPT)
-               rq->rq_flags |= RQF_PREEMPT;

        /* init elvpriv */
        if (rq_flags & RQF_ELVPRIV) {
@@ -1575,7 +1573,7 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
        goto retry;
 }

-/* flags: BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_NOWAIT. */
+/* flags: BLK_MQ_REQ_NOWAIT, BLK_MQ_REQ_PM and/or BLK_MQ_REQ_DV. */
 static struct request *blk_old_get_request(struct request_queue *q,
                                unsigned int op, blk_mq_req_flags_t flags)
 {
@@ -1618,7 +1616,8 @@ struct request *blk_get_request(struct request_queue *q, unsigned int op,
        struct request *req;

        WARN_ON_ONCE(op & REQ_NOWAIT);
-       WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT));
+       WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PM |
+                              BLK_MQ_REQ_DV));

        if (q->mq_ops) {
                req = blk_mq_alloc_request(q, op, flags);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index cb1e6cf7ac48..b4e722311ae1 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -330,6 +330,7 @@ static const char *const rqf_name[] = {
        RQF_NAME(SPECIAL_PAYLOAD),
        RQF_NAME(ZONE_WRITE_LOCKED),
        RQF_NAME(MQ_POLL_SLEPT),
+       RQF_NAME(DV),
 };
 #undef RQF_NAME

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 72a0033ccee9..2a0eb058ba5a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -300,8 +300,6 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
        rq->rq_flags = rq_flags;
        rq->cpu = -1;
        rq->cmd_flags = op;
-       if (data->flags & BLK_MQ_REQ_PREEMPT)
-               rq->rq_flags |= RQF_PREEMPT;
        if (blk_queue_io_stat(data->q))
                rq->rq_flags |= RQF_IO_STAT;
        INIT_LIST_HEAD(&rq->queuelist);
diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
index 59217aa1d1fb..8bf378164aee 100644
--- a/drivers/ide/ide-pm.c
+++ b/drivers/ide/ide-pm.c
@@ -90,8 +90,9 @@ int generic_ide_resume(struct device *dev)
        }

        memset(&rqpm, 0, sizeof(rqpm));
-       rq = blk_get_request(drive->queue, REQ_OP_DRV_IN, BLK_MQ_REQ_PREEMPT);
+       rq = blk_get_request(drive->queue, REQ_OP_DRV_IN, BLK_MQ_REQ_PM);
        ide_req(rq)->type = ATA_PRIV_PM_RESUME;
+       rq->rq_flags |= RQF_PM;
        rq->special = &rqpm;
        rqpm.pm_step = IDE_PM_START_RESUME;
        rqpm.pm_state = PM_EVENT_ON;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9cb9a166fa0c..a65a03e2bcc4 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -263,11 +263,16 @@ int __scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 {
        struct request *req;
        struct scsi_request *rq;
+       blk_mq_req_flags_t blk_mq_req_flags = 0;
        int ret = DRIVER_ERROR << 24;

+       if (rq_flags & RQF_PM)
+               blk_mq_req_flags |= BLK_MQ_REQ_PM;
+       rq_flags |= RQF_DV;
+       blk_mq_req_flags |= BLK_MQ_REQ_DV;
        req = blk_get_request(sdev->request_queue,
                        data_direction == DMA_TO_DEVICE ?
-                       REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, BLK_MQ_REQ_PREEMPT);
+                       REQ_OP_SCSI_OUT : REQ_
[PATCH v6 06/12] scsi: Reallow SPI domain validation during system suspend
Now that SCSI power management and SPI domain validation use different mechanisms for blocking SCSI command execution, remove the mechanism again that prevents system suspend during SPI domain validation. This patch reverts 203f8c250e21 ("block, scsi: Fix race between SPI domain validation and system suspend"). Signed-off-by: Bart Van Assche Cc: Martin K. Petersen Cc: Woody Suwalski Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- drivers/scsi/scsi_lib.c | 37 +-- drivers/scsi/scsi_sysfs.c | 1 + drivers/scsi/scsi_transport_spi.c | 15 ++--- include/scsi/scsi_device.h| 1 + 4 files changed, 39 insertions(+), 15 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index d466f846043f..89f790e73ed6 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1366,6 +1366,7 @@ scsi_prep_state_check(struct scsi_device *sdev, struct request *req) ret = BLKPREP_DEFER; break; case SDEV_SUSPENDED: + case SDEV_QUIESCED_SUSPENDED: /* Process RQF_PM requests only. 
*/ if (!(req->rq_flags & RQF_PM)) ret = BLKPREP_DEFER; @@ -2683,6 +2684,17 @@ scsi_device_set_state(struct scsi_device *sdev, enum scsi_device_state state) break; case SDEV_QUIESCE: + switch (oldstate) { + case SDEV_RUNNING: + case SDEV_QUIESCED_SUSPENDED: + case SDEV_OFFLINE: + case SDEV_TRANSPORT_OFFLINE: + break; + default: + goto illegal; + } + break; + case SDEV_SUSPENDED: switch (oldstate) { case SDEV_RUNNING: @@ -2694,6 +2706,15 @@ scsi_device_set_state(struct scsi_device *sdev, enum scsi_device_state state) } break; + case SDEV_QUIESCED_SUSPENDED: + switch (oldstate) { + case SDEV_QUIESCE: + break; + default: + goto illegal; + } + break; + case SDEV_OFFLINE: case SDEV_TRANSPORT_OFFLINE: switch (oldstate) { @@ -2994,8 +3015,13 @@ scsi_device_suspend(struct scsi_device *sdev) struct request_queue *q = sdev->request_queue; int err; - if (sdev->sdev_state == SDEV_SUSPENDED) + switch (sdev->sdev_state) { + case SDEV_SUSPENDED: + case SDEV_QUIESCED_SUSPENDED: return 0; + default: + break; + } blk_set_pm_only(q); @@ -3011,6 +3037,8 @@ scsi_device_suspend(struct scsi_device *sdev) mutex_lock(&sdev->state_mutex); err = scsi_device_set_state(sdev, SDEV_SUSPENDED); + if (err) + err = scsi_device_set_state(sdev, SDEV_QUIESCED_SUSPENDED); if (err) blk_clear_pm_only(q); mutex_unlock(&sdev->state_mutex); @@ -3031,8 +3059,13 @@ void scsi_device_unsuspend(struct scsi_device *sdev) { mutex_lock(&sdev->state_mutex); blk_clear_pm_only(sdev->request_queue); - if (sdev->sdev_state == SDEV_SUSPENDED) + switch (sdev->sdev_state) { + case SDEV_SUSPENDED: + case SDEV_QUIESCED_SUSPENDED: scsi_device_set_state(sdev, SDEV_RUNNING); + default: + break; + } mutex_unlock(&sdev->state_mutex); } EXPORT_SYMBOL(scsi_device_unsuspend); diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c index 496c5eff4859..8c152e35ff77 100644 --- a/drivers/scsi/scsi_sysfs.c +++ b/drivers/scsi/scsi_sysfs.c @@ -37,6 +37,7 @@ static const struct { { SDEV_DEL, "deleted" }, { SDEV_QUIESCE, "quiesce" 
}, { SDEV_SUSPENDED, "suspended" }, + { SDEV_QUIESCED_SUSPENDED, "quiesced-suspended" }, { SDEV_OFFLINE, "offline" }, { SDEV_TRANSPORT_OFFLINE, "transport-offline" }, { SDEV_BLOCK, "blocked" }, diff --git a/drivers/scsi/scsi_transport_spi.c b/drivers/scsi/scsi_transport_spi.c index 16bec4884249..582b18efc3fb 100644 --- a/drivers/scsi/scsi_transport_spi.c +++ b/drivers/scsi/scsi_transport_spi.c @@ -26,7 +26,6 @@ #include #include #include -#include #include #include "scsi_priv.h" #include @@ -1008,19 +1007,11 @@ spi_dv_device(struct scsi_device *sdev) u8 *buffer; const int len = SPI_MAX_ECHO_BUFFER_SIZE*2; - /* -* Because this function and the power management code both call -* scsi_device_quiesce(), it is not safe to perform domain validation -* while suspend or resume is in progress. Hence the -* lock/unlock_system_sleep() calls. -*/ - lock_system_sleep(); - if (unlikely(starget->spi_dv_context)) - got
[PATCH v6 03/12] scsi: Only set RQF_DV for requests used for domain validation
Instead of setting RQF_DV for all requests submitted by scsi_execute(), only set that flag for requests that are used for domain validation. Move the SCSI Parallel Interface (SPI) domain validation status from the transport data to struct scsi_target such that this status information can be accessed easily from inside scsi_execute(). This patch prevents that e.g. event checking can occur during domain validation. Signed-off-by: Bart Van Assche Cc: Martin K. Petersen Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- drivers/scsi/scsi_lib.c | 6 -- drivers/scsi/scsi_transport_spi.c | 9 +++-- include/scsi/scsi_device.h| 1 + include/scsi/scsi_transport_spi.h | 1 - 4 files changed, 8 insertions(+), 9 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 8685704f6c8b..4d7411a7985f 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -268,8 +268,10 @@ int __scsi_execute(struct scsi_device *sdev, const unsigned char *cmd, if (rq_flags & RQF_PM) blk_mq_req_flags |= BLK_MQ_REQ_PM; - rq_flags |= RQF_DV; - blk_mq_req_flags |= BLK_MQ_REQ_DV; + if (scsi_target(sdev)->spi_dv_context == current) { + rq_flags |= RQF_DV; + blk_mq_req_flags |= BLK_MQ_REQ_DV; + } req = blk_get_request(sdev->request_queue, data_direction == DMA_TO_DEVICE ? 
REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, blk_mq_req_flags); diff --git a/drivers/scsi/scsi_transport_spi.c b/drivers/scsi/scsi_transport_spi.c index 2ca150b16764..bf6b18768e79 100644 --- a/drivers/scsi/scsi_transport_spi.c +++ b/drivers/scsi/scsi_transport_spi.c @@ -66,7 +66,6 @@ static struct { }; /* Private data accessors (keep these out of the header file) */ -#define spi_dv_in_progress(x) (((struct spi_transport_attrs *)&(x)->starget_data)->dv_in_progress) #define spi_dv_mutex(x) (((struct spi_transport_attrs *)&(x)->starget_data)->dv_mutex) struct spi_internal { @@ -268,7 +267,6 @@ static int spi_setup_transport_attrs(struct transport_container *tc, spi_pcomp_en(starget) = 0; spi_hold_mcs(starget) = 0; spi_dv_pending(starget) = 0; - spi_dv_in_progress(starget) = 0; spi_initial_dv(starget) = 0; mutex_init(&spi_dv_mutex(starget)); @@ -1018,14 +1016,12 @@ spi_dv_device(struct scsi_device *sdev) */ lock_system_sleep(); - if (unlikely(spi_dv_in_progress(starget))) + if (unlikely(starget->spi_dv_context)) goto unlock; if (unlikely(scsi_device_get(sdev))) goto unlock; - spi_dv_in_progress(starget) = 1; - buffer = kzalloc(len, GFP_KERNEL); if (unlikely(!buffer)) @@ -1043,7 +1039,9 @@ spi_dv_device(struct scsi_device *sdev) starget_printk(KERN_INFO, starget, "Beginning Domain Validation\n"); + starget->spi_dv_context = current; spi_dv_device_internal(sdev, buffer); + starget->spi_dv_context = NULL; starget_printk(KERN_INFO, starget, "Ending Domain Validation\n"); @@ -1057,7 +1055,6 @@ spi_dv_device(struct scsi_device *sdev) out_free: kfree(buffer); out_put: - spi_dv_in_progress(starget) = 0; scsi_device_put(sdev); unlock: unlock_system_sleep(); diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h index 202f4d6a4342..440834f4252e 100644 --- a/include/scsi/scsi_device.h +++ b/include/scsi/scsi_device.h @@ -310,6 +310,7 @@ struct scsi_target { charscsi_level; enum scsi_target_state state; + struct task_struct *spi_dv_context; void*hostdata; /* available to 
low-level driver */ unsigned long starget_data[0]; /* for the transport */ /* starget_data must be the last element */ diff --git a/include/scsi/scsi_transport_spi.h b/include/scsi/scsi_transport_spi.h index a4fa52b4d5c5..36934ac0a5ca 100644 --- a/include/scsi/scsi_transport_spi.h +++ b/include/scsi/scsi_transport_spi.h @@ -56,7 +56,6 @@ struct spi_transport_attrs { unsigned int support_qas; /* supports quick arbitration and selection */ /* Private Fields */ unsigned int dv_pending:1; /* Internal flag: DV Requested */ - unsigned int dv_in_progress:1; /* Internal: DV started */ struct mutex dv_mutex; /* semaphore to serialise dv */ }; -- 2.18.0
[PATCH v6 04/12] scsi: Introduce the SDEV_SUSPENDED device status
Instead of using the SDEV_QUIESCE state for both SCSI domain validation and runtime suspend, use the SDEV_QUIESCE state only for SCSI domain validation. Keep using scsi_device_quiesce() and scsi_device_unquiesce() for SCSI domain validation. Add new functions scsi_device_suspend() and scsi_device_unsuspend() for power management. Rename scsi_device_resume() into scsi_device_unquiesce() to avoid confusion. Signed-off-by: Bart Van Assche Cc: Martin K. Petersen Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- drivers/scsi/scsi_lib.c | 138 -- drivers/scsi/scsi_pm.c| 6 +- drivers/scsi/scsi_sysfs.c | 1 + drivers/scsi/scsi_transport_spi.c | 2 +- include/scsi/scsi_device.h| 13 +-- 5 files changed, 125 insertions(+), 35 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 4d7411a7985f..eb914d8e17fd 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1365,6 +1365,11 @@ scsi_prep_state_check(struct scsi_device *sdev, struct request *req) if (!(req->rq_flags & (RQF_PM | RQF_DV))) ret = BLKPREP_DEFER; break; + case SDEV_SUSPENDED: + /* Process RQF_PM requests only. */ + if (!(req->rq_flags & RQF_PM)) + ret = BLKPREP_DEFER; + break; case SDEV_CANCEL: /* Only allow RQF_PM requests. */ if (!(req->rq_flags & RQF_PM)) @@ -2669,6 +2674,7 @@ scsi_device_set_state(struct scsi_device *sdev, enum scsi_device_state state) case SDEV_OFFLINE: case SDEV_TRANSPORT_OFFLINE: case SDEV_QUIESCE: + case SDEV_SUSPENDED: case SDEV_BLOCK: break; default: @@ -2677,6 +2683,7 @@ scsi_device_set_state(struct scsi_device *sdev, enum scsi_device_state state) break; case SDEV_QUIESCE: + case SDEV_SUSPENDED: switch (oldstate) { case SDEV_RUNNING: case SDEV_OFFLINE: @@ -2970,19 +2977,103 @@ static void scsi_wait_for_queuecommand(struct scsi_device *sdev) } /** - * scsi_device_quiesce - Block user issued commands. - * @sdev: scsi device to quiesce. 
+ * scsi_device_suspend - only process RQF_PM requests + * @sdev: scsi device to suspend. + * + * This works by trying to transition to the SDEV_SUSPENDED state (which must be + * a legal transition). When the device is in this state, only RQF_DV + * requests will be accepted, all others will be deferred. + * + * Must be called with user context, may sleep. + * + * Returns zero if unsuccessful or an error if not. + */ +int +scsi_device_suspend(struct scsi_device *sdev) +{ + struct request_queue *q = sdev->request_queue; + int err; + + if (sdev->sdev_state == SDEV_SUSPENDED) + return 0; + + blk_set_preempt_only(q); + + blk_mq_freeze_queue(q); + /* +* Ensure that the effect of blk_set_preempt_only() will be visible +* for percpu_ref_tryget() callers that occur after the queue +* unfreeze even if the queue was already frozen before this function +* was called. See also https://lwn.net/Articles/573497/. +*/ + synchronize_rcu(); + blk_mq_unfreeze_queue(q); + + mutex_lock(&sdev->state_mutex); + err = scsi_device_set_state(sdev, SDEV_SUSPENDED); + if (err) + blk_clear_preempt_only(q); + mutex_unlock(&sdev->state_mutex); + + return err; +} +EXPORT_SYMBOL(scsi_device_suspend); + +/** + * scsi_device_unsuspend - unsuspend processing non-RQF_DV requests + * @sdev: scsi device to unsuspend. * - * This works by trying to transition to the SDEV_QUIESCE state - * (which must be a legal transition). When the device is in this - * state, only special requests will be accepted, all others will - * be deferred. Since special requests may also be requeued requests, - * a successful return doesn't guarantee the device will be - * totally quiescent. + * Moves the device from suspended back to running and restarts the queues. * - * Must be called with user context, may sleep. + * Must be called with user context, may sleep. 
+ */ +void scsi_device_unsuspend(struct scsi_device *sdev) +{ + mutex_lock(&sdev->state_mutex); + blk_clear_preempt_only(sdev->request_queue); + if (sdev->sdev_state == SDEV_SUSPENDED) + scsi_device_set_state(sdev, SDEV_RUNNING); + mutex_unlock(&sdev->state_mutex); +} +EXPORT_SYMBOL(scsi_device_unsuspend); + +static void +device_suspend_fn(struct scsi_device *sdev, void *data) +{ + scsi_device_suspend(sdev); +} + +void +scsi_target_suspend(struct scsi_target *starget) +{ + starget_for_each_device(starget, NUL
[PATCH v6 08/12] block, scsi: Introduce blk_pm_runtime_exit()
Since it is possible to unbind a SCSI ULD and since unbinding removes the
association between a request queue and struct device, the q->dev pointer
has to be reset during unbind. Hence introduce a function in the block
layer that clears request_queue.dev.

Signed-off-by: Bart Van Assche
Cc: Martin K. Petersen
Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Jianchao Wang
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
Cc: Alan Stern
---
 block/blk-pm.c         | 18 ++
 drivers/scsi/sd.c      |  9 -
 drivers/scsi/sr.c      |  3 ++-
 include/linux/blk-pm.h |  2 ++
 4 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/block/blk-pm.c b/block/blk-pm.c
index 9b636960d285..bf8532da952d 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -40,6 +40,24 @@ void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
 }
 EXPORT_SYMBOL(blk_pm_runtime_init);

+/**
+ * blk_pm_runtime_exit - runtime PM exit routine
+ * @q: the queue of the device
+ *
+ * This function should be called from the device_driver.remove() callback
+ * function to avoid that further calls to runtime power management functions
+ * occur.
+ */
+void blk_pm_runtime_exit(struct request_queue *q)
+{
+	if (!q->dev)
+		return;
+
+	pm_runtime_get_sync(q->dev);
+	q->dev = NULL;
+}
+EXPORT_SYMBOL(blk_pm_runtime_exit);
+
 /**
  * blk_pre_runtime_suspend - Pre runtime suspend check
  * @q: the queue of the device
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 69ab459abb98..5537762dfcfd 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3420,12 +3420,11 @@ static int sd_probe(struct device *dev)
  **/
 static int sd_remove(struct device *dev)
 {
-	struct scsi_disk *sdkp;
-	dev_t devt;
+	struct scsi_disk *sdkp = dev_get_drvdata(dev);
+	struct scsi_device *sdp = sdkp->device;
+	dev_t devt = disk_devt(sdkp->disk);

-	sdkp = dev_get_drvdata(dev);
-	devt = disk_devt(sdkp->disk);
-	scsi_autopm_get_device(sdkp->device);
+	blk_pm_runtime_exit(sdp->request_queue);

 	async_synchronize_full_domain(&scsi_sd_pm_domain);
 	async_synchronize_full_domain(&scsi_sd_probe_domain);
diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index de4413e66eca..476987f7ed48 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -1002,8 +1002,9 @@ static void sr_kref_release(struct kref *kref)
 static int sr_remove(struct device *dev)
 {
 	struct scsi_cd *cd = dev_get_drvdata(dev);
+	struct scsi_device *sdev = cd->device;

-	scsi_autopm_get_device(cd->device);
+	blk_pm_runtime_exit(sdev->request_queue);

 	del_gendisk(cd->disk);
 	dev_set_drvdata(dev, NULL);
diff --git a/include/linux/blk-pm.h b/include/linux/blk-pm.h
index b80c65aba249..6d654f41acbf 100644
--- a/include/linux/blk-pm.h
+++ b/include/linux/blk-pm.h
@@ -11,6 +11,7 @@ struct request_queue;
  */
 #ifdef CONFIG_PM
 extern void blk_pm_runtime_init(struct request_queue *q, struct device *dev);
+extern void blk_pm_runtime_exit(struct request_queue *q);
 extern int blk_pre_runtime_suspend(struct request_queue *q);
 extern void blk_post_runtime_suspend(struct request_queue *q, int err);
 extern void blk_pre_runtime_resume(struct request_queue *q);
@@ -19,6 +20,7 @@ extern void blk_set_runtime_active(struct request_queue *q);
 #else
 static inline void blk_pm_runtime_init(struct request_queue *q,
				       struct device *dev) {}
+static inline void blk_pm_runtime_exit(struct request_queue *q) {}
 #endif

 #endif /* _BLK_PM_H_ */
--
2.18.0
[PATCH v6 05/12] block, scsi: Rename QUEUE_FLAG_PREEMPT_ONLY into DV_ONLY and introduce PM_ONLY
Instead of having one queue state (PREEMPT_ONLY) in which both power management and SCSI domain validation requests are processed, rename the PREEMPT_ONLY state into DV_ONLY and introduce a new queue flag QUEUE_FLAG_PM_ONLY. Provide the new functions blk_set_pm_only() and blk_clear_pm_only() for power management and blk_set_dv_only() and blk_clear_dv_only() functions for SCSI domain validation purposes. Signed-off-by: Bart Van Assche Cc: Martin K. Petersen Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c| 65 ++--- block/blk-mq-debugfs.c | 3 +- drivers/scsi/scsi_lib.c | 16 +- include/linux/blkdev.h | 13 + 4 files changed, 66 insertions(+), 31 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index c4ff58491758..401f4927a8db 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -421,24 +421,44 @@ void blk_sync_queue(struct request_queue *q) EXPORT_SYMBOL(blk_sync_queue); /** - * blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY + * blk_set_dv_only - set QUEUE_FLAG_DV_ONLY * @q: request queue pointer * - * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not + * Returns the previous value of the DV_ONLY flag - 0 if the flag was not * set and 1 if the flag was already set. 
*/ -int blk_set_preempt_only(struct request_queue *q) +int blk_set_dv_only(struct request_queue *q) { - return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q); + return blk_queue_flag_test_and_set(QUEUE_FLAG_DV_ONLY, q); } -EXPORT_SYMBOL_GPL(blk_set_preempt_only); +EXPORT_SYMBOL_GPL(blk_set_dv_only); -void blk_clear_preempt_only(struct request_queue *q) +void blk_clear_dv_only(struct request_queue *q) { - blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q); + blk_queue_flag_clear(QUEUE_FLAG_DV_ONLY, q); wake_up_all(&q->mq_freeze_wq); } -EXPORT_SYMBOL_GPL(blk_clear_preempt_only); +EXPORT_SYMBOL_GPL(blk_clear_dv_only); + +/** + * blk_set_pm_only - set QUEUE_FLAG_PM_ONLY + * @q: request queue pointer + * + * Returns the previous value of the PM_ONLY flag - 0 if the flag was not + * set and 1 if the flag was already set. + */ +int blk_set_pm_only(struct request_queue *q) +{ + return blk_queue_flag_test_and_set(QUEUE_FLAG_PM_ONLY, q); +} +EXPORT_SYMBOL_GPL(blk_set_pm_only); + +void blk_clear_pm_only(struct request_queue *q) +{ + blk_queue_flag_clear(QUEUE_FLAG_PM_ONLY, q); + wake_up_all(&q->mq_freeze_wq); +} +EXPORT_SYMBOL_GPL(blk_clear_pm_only); /** * __blk_run_queue_uncond - run a queue whether or not it has been stopped @@ -910,6 +930,20 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask) } EXPORT_SYMBOL(blk_alloc_queue); +/* + * Whether or not blk_queue_enter() should proceed. RQF_PM requests are always + * allowed. RQF_DV requests are allowed if the PM_ONLY queue flag has not been + * set. Other requests are only allowed if neither PM_ONLY nor DV_ONLY has been + * set. 
+ */ +static inline bool blk_enter_allowed(struct request_queue *q, +blk_mq_req_flags_t flags) +{ + return flags & BLK_MQ_REQ_PM || + (!blk_queue_pm_only(q) && +(flags & BLK_MQ_REQ_DV || !blk_queue_dv_only(q))); +} + /** * blk_queue_enter() - try to increase q->q_usage_counter * @q: request queue pointer @@ -917,23 +951,20 @@ EXPORT_SYMBOL(blk_alloc_queue); */ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) { - const bool preempt = flags & (BLK_MQ_REQ_PM | BLK_MQ_REQ_DV); - while (true) { bool success = false; rcu_read_lock(); if (percpu_ref_tryget_live(&q->q_usage_counter)) { /* -* The code that sets the PREEMPT_ONLY flag is -* responsible for ensuring that that flag is globally -* visible before the queue is unfrozen. +* The code that sets the PM_ONLY or DV_ONLY queue +* flags is responsible for ensuring that that flag is +* globally visible before the queue is unfrozen. */ - if (preempt || !blk_queue_preempt_only(q)) { + if (blk_enter_allowed(q, flags)) success = true; - } else { + else percpu_ref_put(&q->q_usage_counter); - } } rcu_read_unlock(); @@ -954,7 +985,7 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) wait_event(q->mq_freeze_wq, (atomic_read(&q->mq_freeze_depth) == 0 && - (preempt || !blk_queue_preempt_only(q))) || +
[PATCH v6 07/12] block: Move power management code into a new source file
Move the code for runtime power management from blk-core.c into the new source file blk-pm.c. Move the corresponding declarations from into . For CONFIG_PM=n, leave out the declarations of the functions that are not used in that mode. This patch not only reduces the number of #ifdefs in the block layer core code but also reduces the size of header file and hence should help to reduce the build time of the Linux kernel if CONFIG_PM is not defined. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/Kconfig | 5 ++ block/Makefile | 1 + block/blk-core.c | 196 + block/blk-pm.c | 188 +++ block/blk-pm.h | 43 + block/elevator.c | 22 + drivers/scsi/scsi_pm.c | 1 + drivers/scsi/sd.c | 1 + drivers/scsi/sr.c | 1 + include/linux/blk-pm.h | 24 + include/linux/blkdev.h | 23 - 11 files changed, 266 insertions(+), 239 deletions(-) create mode 100644 block/blk-pm.c create mode 100644 block/blk-pm.h create mode 100644 include/linux/blk-pm.h diff --git a/block/Kconfig b/block/Kconfig index 1f2469a0123c..e213d90a5e64 100644 --- a/block/Kconfig +++ b/block/Kconfig @@ -228,4 +228,9 @@ config BLK_MQ_RDMA depends on BLOCK && INFINIBAND default y +config BLK_PM + bool + depends on BLOCK && PM + default y + source block/Kconfig.iosched diff --git a/block/Makefile b/block/Makefile index 572b33f32c07..27eac600474f 100644 --- a/block/Makefile +++ b/block/Makefile @@ -37,3 +37,4 @@ obj-$(CONFIG_BLK_WBT) += blk-wbt.o obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o obj-$(CONFIG_BLK_SED_OPAL) += sed-opal.o +obj-$(CONFIG_BLK_PM) += blk-pm.o diff --git a/block/blk-core.c b/block/blk-core.c index 401f4927a8db..3770a36e85be 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -42,6 +42,7 @@ #include "blk.h" #include "blk-mq.h" #include "blk-mq-sched.h" +#include "blk-pm.h" #include "blk-rq-qos.h" #ifdef CONFIG_DEBUG_FS @@ -1757,16 +1758,6 @@ 
void part_round_stats(struct request_queue *q, int cpu, struct hd_struct *part) } EXPORT_SYMBOL_GPL(part_round_stats); -#ifdef CONFIG_PM -static void blk_pm_put_request(struct request *rq) -{ - if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending) - pm_runtime_mark_last_busy(rq->q->dev); -} -#else -static inline void blk_pm_put_request(struct request *rq) {} -#endif - void __blk_put_request(struct request_queue *q, struct request *req) { req_flags_t rq_flags = req->rq_flags; @@ -3783,191 +3774,6 @@ void blk_finish_plug(struct blk_plug *plug) } EXPORT_SYMBOL(blk_finish_plug); -#ifdef CONFIG_PM -/** - * blk_pm_runtime_init - Block layer runtime PM initialization routine - * @q: the queue of the device - * @dev: the device the queue belongs to - * - * Description: - *Initialize runtime-PM-related fields for @q and start auto suspend for - *@dev. Drivers that want to take advantage of request-based runtime PM - *should call this function after @dev has been initialized, and its - *request queue @q has been allocated, and runtime PM for it can not happen - *yet(either due to disabled/forbidden or its usage_count > 0). In most - *cases, driver should call this function before any I/O has taken place. - * - *This function takes care of setting up using auto suspend for the device, - *the autosuspend delay is set to -1 to make runtime suspend impossible - *until an updated value is either set by user or by driver. Drivers do - *not need to touch other autosuspend settings. - * - *The block layer runtime PM is request based, so only works for drivers - *that use request as their IO unit instead of those directly use bio's. 
- */ -void blk_pm_runtime_init(struct request_queue *q, struct device *dev) -{ - /* Don't enable runtime PM for blk-mq until it is ready */ - if (q->mq_ops) { - pm_runtime_disable(dev); - return; - } - - q->dev = dev; - q->rpm_status = RPM_ACTIVE; - pm_runtime_set_autosuspend_delay(q->dev, -1); - pm_runtime_use_autosuspend(q->dev); -} -EXPORT_SYMBOL(blk_pm_runtime_init); - -/** - * blk_pre_runtime_suspend - Pre runtime suspend check - * @q: the queue of the device - * - * Description: - *This function will check if runtime suspend is allowed for the device - *by examining if there are any requests pending in the queue. If there - *are requests pending, the device can not be runtime suspended; otherwise, - *the queue's status will be updated to SUSPENDING and the driver can - *proceed to suspend the device. - * - *For the not allowed case, we mark last busy for the device so that - *run
[PATCH v6 02/12] scsi: Alter handling of RQF_DV requests
Process all requests in state SDEV_CREATED instead of only RQF_DV requests. This does not change the behavior of the SCSI core because the SCSI device state is modified into another state before SCSI devices become visible in sysfs and before any device nodes are created in /dev. Do not process RQF_DV requests in state SDEV_CANCEL because only power management requests should be processed in this state. Handle all SCSI device states explicitly in scsi_prep_state_check() instead of using a default case in the switch/case statement in scsi_prep_state_check(). This allows the compiler to verify whether all states have been handled. Signed-off-by: Bart Van Assche Cc: Martin K. Petersen Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- drivers/scsi/scsi_lib.c | 78 +++-- 1 file changed, 36 insertions(+), 42 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index a65a03e2bcc4..8685704f6c8b 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1331,49 +1331,43 @@ scsi_prep_state_check(struct scsi_device *sdev, struct request *req) * If the device is not in running state we will reject some * or all commands. */ - if (unlikely(sdev->sdev_state != SDEV_RUNNING)) { - switch (sdev->sdev_state) { - case SDEV_OFFLINE: - case SDEV_TRANSPORT_OFFLINE: - /* -* If the device is offline we refuse to process any -* commands. The device must be brought online -* before trying any recovery commands. -*/ - sdev_printk(KERN_ERR, sdev, - "rejecting I/O to offline device\n"); - ret = BLKPREP_KILL; - break; - case SDEV_DEL: - /* -* If the device is fully deleted, we refuse to -* process any commands as well. 
-*/ - sdev_printk(KERN_ERR, sdev, - "rejecting I/O to dead device\n"); - ret = BLKPREP_KILL; - break; - case SDEV_BLOCK: - case SDEV_CREATED_BLOCK: + switch (sdev->sdev_state) { + case SDEV_RUNNING: + case SDEV_CREATED: + break; + case SDEV_OFFLINE: + case SDEV_TRANSPORT_OFFLINE: + /* +* If the device is offline we refuse to process any commands. +* The device must be brought online before trying any +* recovery commands. +*/ + sdev_printk(KERN_ERR, sdev, + "rejecting I/O to offline device\n"); + ret = BLKPREP_KILL; + break; + case SDEV_DEL: + /* +* If the device is fully deleted, we refuse to process any +* commands as well. +*/ + sdev_printk(KERN_ERR, sdev, "rejecting I/O to dead device\n"); + ret = BLKPREP_KILL; + break; + case SDEV_BLOCK: + case SDEV_CREATED_BLOCK: + ret = BLKPREP_DEFER; + break; + case SDEV_QUIESCE: + /* Only allow RQF_PM and RQF_DV requests. */ + if (!(req->rq_flags & (RQF_PM | RQF_DV))) ret = BLKPREP_DEFER; - break; - case SDEV_QUIESCE: - /* -* If the devices is blocked we defer normal commands. -*/ - if (req && !(req->rq_flags & (RQF_PM | RQF_DV))) - ret = BLKPREP_DEFER; - break; - default: - /* -* For any other not fully online state we only allow -* special commands. In particular any user initiated -* command is not allowed. -*/ - if (req && !(req->rq_flags & (RQF_PM | RQF_DV))) - ret = BLKPREP_KILL; - break; - } + break; + case SDEV_CANCEL: + /* Only allow RQF_PM requests. */ + if (!(req->rq_flags & RQF_PM)) + ret = BLKPREP_KILL; + break; } return ret; } -- 2.18.0
[PATCH v6 12/12] blk-mq: Enable support for runtime power management
Now that the blk-mq core processes power management requests (marked with
RQF_PREEMPT) in other states than RPM_ACTIVE, enable runtime power
management for blk-mq.

Signed-off-by: Bart Van Assche
Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Jianchao Wang
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
Cc: Alan Stern
---
 block/blk-pm.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/block/blk-pm.c b/block/blk-pm.c
index 977beffdccd2..e76d85dcdd67 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -30,12 +30,6 @@
  */
 void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
 {
-	/* Don't enable runtime PM for blk-mq until it is ready */
-	if (q->mq_ops) {
-		pm_runtime_disable(dev);
-		return;
-	}
-
 	q->dev = dev;
 	q->rpm_status = RPM_ACTIVE;
 	pm_runtime_set_autosuspend_delay(q->dev, -1);
--
2.18.0
[PATCH v6 11/12] block: Change the runtime power management approach (2/2)
Instead of allowing requests that are not power management requests to enter the queue in runtime suspended status (RPM_SUSPENDED), make the blk_get_request() caller block. This change fixes a starvation issue: it is now guaranteed that power management requests will be executed no matter how many blk_get_request() callers are waiting. Instead of maintaining the q->nr_pending counter, rely on q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a request finishes instead of only if the queue depth drops to zero. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c | 37 ++--- block/blk-pm.c | 72 2 files changed, 75 insertions(+), 34 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index f30545fb2de2..b0bb6b5320fe 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2772,30 +2772,6 @@ void blk_account_io_done(struct request *req, u64 now) } } -#ifdef CONFIG_PM -/* - * Don't process normal requests when queue is suspended - * or in the process of suspending/resuming - */ -static bool blk_pm_allow_request(struct request *rq) -{ - switch (rq->q->rpm_status) { - case RPM_RESUMING: - case RPM_SUSPENDING: - return rq->rq_flags & RQF_PM; - case RPM_SUSPENDED: - return false; - default: - return true; - } -} -#else -static bool blk_pm_allow_request(struct request *rq) -{ - return true; -} -#endif - void blk_account_io_start(struct request *rq, bool new_io) { struct hd_struct *part; @@ -2841,11 +2817,14 @@ static struct request *elv_next_request(struct request_queue *q) while (1) { list_for_each_entry(rq, &q->queue_head, queuelist) { - if (blk_pm_allow_request(rq)) - return rq; - - if (rq->rq_flags & RQF_SOFTBARRIER) - break; +#ifdef CONFIG_PM + /* +* If a request gets queued in state RPM_SUSPENDED +* then that's a kernel bug. 
+*/ + WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED); +#endif + return rq; } /* diff --git a/block/blk-pm.c b/block/blk-pm.c index bf8532da952d..977beffdccd2 100644 --- a/block/blk-pm.c +++ b/block/blk-pm.c @@ -1,8 +1,11 @@ // SPDX-License-Identifier: GPL-2.0 +#include #include #include #include +#include "blk-mq.h" +#include "blk-mq-tag.h" /** * blk_pm_runtime_init - Block layer runtime PM initialization routine @@ -58,6 +61,36 @@ void blk_pm_runtime_exit(struct request_queue *q) } EXPORT_SYMBOL(blk_pm_runtime_exit); +struct in_flight_data { + struct request_queue*q; + int in_flight; +}; + +static void blk_count_in_flight(struct blk_mq_hw_ctx *hctx, struct request *rq, + void *priv, bool reserved) +{ + struct in_flight_data *in_flight = priv; + + if (rq->q == in_flight->q) + in_flight->in_flight++; +} + +/* + * Count the number of requests that are in flight for request queue @q. Use + * @q->nr_pending for legacy queues. Iterate over the tag set for blk-mq + * queues. Use blk_mq_queue_tag_busy_iter() instead of + * blk_mq_tagset_busy_iter() because the latter only considers requests that + * have already been started. + */ +static int blk_requests_in_flight(struct request_queue *q) +{ + struct in_flight_data in_flight = { .q = q }; + + if (q->mq_ops) + blk_mq_queue_tag_busy_iter(q, blk_count_in_flight, &in_flight); + return q->nr_pending + in_flight.in_flight; +} + /** * blk_pre_runtime_suspend - Pre runtime suspend check * @q: the queue of the device @@ -86,14 +119,38 @@ int blk_pre_runtime_suspend(struct request_queue *q) if (!q->dev) return ret; + WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE); + + blk_set_pm_only(q); + /* +* This function only gets called if the most recent +* pm_request_resume() call occurred at least autosuspend_delay_ms +* ago. Since blk_queue_enter() is called by the request allocation +* code before pm_request_resume(), if no requests have a tag assigned +* it is safe to suspend the device. 
+*/ + ret = -EBUSY; + if (blk_requests_in_flight(q) == 0) { + /* +* Call synchronize_rcu() such that later blk_queue_enter() +* calls see the preempt-only state. See also +* http://lwn.net/Articles/573497/. +*/ + synchronize_rcu(); + if (blk_requests_in_flight(q) == 0) + ret = 0; + } +
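The suspend decision this patch adds — count in-flight requests, synchronize, then re-check — can be sketched as a userspace toy model. The struct and function names below (`pm_queue`, `pm_pre_runtime_suspend`, etc.) are illustrative assumptions, not the kernel's data structures:

```c
#include <assert.h>

#define EBUSY 16

/*
 * Toy model of the blk_pre_runtime_suspend() logic in this patch:
 * runtime suspend is only allowed when no request is in flight, and the
 * count is re-checked after a (here elided) synchronize_rcu() so that
 * later blk_queue_enter() calls observe the pm-only state.
 */
struct pm_queue {
	int nr_pending;		/* legacy request count */
	int mq_in_flight;	/* blk-mq requests with a tag assigned */
};

static int pm_requests_in_flight(const struct pm_queue *q)
{
	return q->nr_pending + q->mq_in_flight;
}

static int pm_pre_runtime_suspend(const struct pm_queue *q)
{
	int ret = -EBUSY;

	if (pm_requests_in_flight(q) == 0) {
		/* in the kernel, synchronize_rcu() would run here */
		if (pm_requests_in_flight(q) == 0)
			ret = 0;
	}
	return ret;
}
```

A queue with any pending request yields -EBUSY and suspend is aborted; only a fully idle queue allows the transition to proceed.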
[PATCH v6 09/12] block: Split blk_pm_add_request() and blk_pm_put_request()
Move the pm_request_resume() and pm_runtime_mark_last_busy() calls into two new functions. Signed-off-by: Bart Van Assche Cc: Martin K. Petersen Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c | 1 + block/blk-pm.h | 36 +++- block/elevator.c | 1 + 3 files changed, 33 insertions(+), 5 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 3770a36e85be..59dd98585eb0 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1774,6 +1774,7 @@ void __blk_put_request(struct request_queue *q, struct request *req) blk_req_zone_write_unlock(req); blk_pm_put_request(req); + blk_pm_mark_last_busy(req); elv_completed_request(q, req); diff --git a/block/blk-pm.h b/block/blk-pm.h index 1ffc8ef203ec..a8564ea72a41 100644 --- a/block/blk-pm.h +++ b/block/blk-pm.h @@ -6,8 +6,23 @@ #include #ifdef CONFIG_PM +static inline void blk_pm_request_resume(struct request_queue *q) +{ + if (q->dev && (q->rpm_status == RPM_SUSPENDED || + q->rpm_status == RPM_SUSPENDING)) + pm_request_resume(q->dev); +} + +static inline void blk_pm_mark_last_busy(struct request *rq) +{ + if (rq->q->dev && !(rq->rq_flags & RQF_PM)) + pm_runtime_mark_last_busy(rq->q->dev); +} + static inline void blk_pm_requeue_request(struct request *rq) { + lockdep_assert_held(rq->q->queue_lock); + if (rq->q->dev && !(rq->rq_flags & RQF_PM)) rq->q->nr_pending--; } @@ -15,17 +30,28 @@ static inline void blk_pm_requeue_request(struct request *rq) static inline void blk_pm_add_request(struct request_queue *q, struct request *rq) { - if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 && - (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING)) - pm_request_resume(q->dev); + lockdep_assert_held(q->queue_lock); + + if (q->dev && !(rq->rq_flags & RQF_PM)) + q->nr_pending++; } static inline void blk_pm_put_request(struct request *rq) { - if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending) - 
pm_runtime_mark_last_busy(rq->q->dev); + lockdep_assert_held(rq->q->queue_lock); + + if (rq->q->dev && !(rq->rq_flags & RQF_PM)) + --rq->q->nr_pending; } #else +static inline void blk_pm_request_resume(struct request_queue *q) +{ +} + +static inline void blk_pm_mark_last_busy(struct request *rq) +{ +} + static inline void blk_pm_requeue_request(struct request *rq) { } diff --git a/block/elevator.c b/block/elevator.c index 4c15f0240c99..00c5d8dbce16 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -601,6 +601,7 @@ void __elv_add_request(struct request_queue *q, struct request *rq, int where) trace_block_rq_insert(q, rq); blk_pm_add_request(q, rq); + blk_pm_request_resume(q); rq->q = q; -- 2.18.0
[PATCH v6 10/12] block: Change the runtime power management approach (1/2)
Instead of scheduling runtime resume of a request queue after a request has been queued, schedule asynchronous resume during request allocation. The new pm_request_resume() calls occur after blk_queue_enter() has increased the q_usage_counter request queue member. This change is needed for a later patch that will make request allocation block while the queue status is not RPM_ACTIVE. Signed-off-by: Bart Van Assche Cc: Christoph Hellwig Cc: Ming Lei Cc: Jianchao Wang Cc: Hannes Reinecke Cc: Johannes Thumshirn Cc: Alan Stern --- block/blk-core.c | 2 ++ block/blk-mq.c | 2 ++ block/elevator.c | 1 - 3 files changed, 4 insertions(+), 1 deletion(-) diff --git a/block/blk-core.c b/block/blk-core.c index 59dd98585eb0..f30545fb2de2 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -972,6 +972,8 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) if (success) return 0; + blk_pm_request_resume(q); + if (flags & BLK_MQ_REQ_NOWAIT) return -EBUSY; diff --git a/block/blk-mq.c b/block/blk-mq.c index 2a0eb058ba5a..24439735f20b 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -33,6 +33,7 @@ #include "blk-mq.h" #include "blk-mq-debugfs.h" #include "blk-mq-tag.h" +#include "blk-pm.h" #include "blk-stat.h" #include "blk-mq-sched.h" #include "blk-rq-qos.h" @@ -473,6 +474,7 @@ static void __blk_mq_free_request(struct request *rq) struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu); const int sched_tag = rq->internal_tag; + blk_pm_mark_last_busy(rq); if (rq->tag != -1) blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag); if (sched_tag != -1) diff --git a/block/elevator.c b/block/elevator.c index 00c5d8dbce16..4c15f0240c99 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -601,7 +601,6 @@ void __elv_add_request(struct request_queue *q, struct request *rq, int where) trace_block_rq_insert(q, rq); blk_pm_add_request(q, rq); - blk_pm_request_resume(q); rq->q = q; -- 2.18.0
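The ordering this patch establishes — resume is requested from inside blk_queue_enter(), i.e. after q_usage_counter has been increased — can be sketched as a small model. All names here (`enter_queue`, `model_*`) are illustrative, not kernel API:

```c
#include <assert.h>

/* Runtime-PM states, modeled after the RPM_* values in <linux/pm.h> */
enum rpm_state { RPM_ACTIVE, RPM_SUSPENDING, RPM_SUSPENDED };

/*
 * Illustrative model (not kernel code): a request allocator first takes
 * a q_usage_counter reference, then schedules an asynchronous resume if
 * the queue is suspended or suspending.
 */
struct enter_queue {
	enum rpm_state rpm_status;
	int q_usage_counter;
	int resume_requested;
};

static void model_pm_request_resume(struct enter_queue *q)
{
	if (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING)
		q->resume_requested = 1;	/* pm_request_resume(q->dev) */
}

static int model_blk_queue_enter(struct enter_queue *q)
{
	q->q_usage_counter++;		/* reference taken first ... */
	model_pm_request_resume(q);	/* ... then resume is scheduled */
	return 0;
}
```

With this ordering the PM core can never observe a suspended queue that has a usage reference but no resume scheduled, which is what the later "(2/2)" patch relies on.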
[PATCH v6 00/12] blk-mq: Implement runtime power management
Hello Jens,

This patch series not only implements runtime power management for blk-mq but also fixes a starvation issue in the power management code for the legacy block layer. Please consider this patch series for the upstream kernel.

Thanks,

Bart.

Changes compared to v5:
- Introduced a new flag RQF_DV that replaces RQF_PREEMPT for SCSI domain validation.
- Introduced a new request queue state QUEUE_FLAG_DV_ONLY for SCSI domain validation.
- Instead of using SDEV_QUIESCE for both runtime suspend and SCSI domain validation, use that state for domain validation only and introduce a new state for runtime suspend, namely SDEV_SUSPENDED.
- Reallow system suspend during SCSI domain validation.
- Moved the runtime resume call from the request allocation code into blk_queue_enter().
- Instead of relying on q_usage_counter, iterate over the tag set to determine whether or not any requests are in flight.

Changes compared to v4:
- Dropped the patches "Give RQF_PREEMPT back its original meaning" and "Serialize queue freezing and blk_pre_runtime_suspend()".
- Replaced "percpu_ref_read()" with "percpu_ref_is_in_use()".
- Inserted pm_request_resume() calls in the block layer request allocation code such that the context that submits a request no longer has to call pm_runtime_get().

Changes compared to v3:
- Avoid adverse interactions between system-wide suspend/resume and runtime power management by changing the PREEMPT_ONLY flag into a counter.
- Give RQF_PREEMPT back its original meaning, namely that it is only set for ide_preempt requests.
- Remove the flag BLK_MQ_REQ_PREEMPT.
- Removed the pm_request_resume() call.

Changes compared to v2:
- Fixed the build for CONFIG_BLOCK=n.
- Added a patch that introduces percpu_ref_read() in the percpu-counter code.
- Added a patch that makes it easier to detect missing pm_runtime_get*() calls.
- Addressed Jianchao's feedback including the comment about runtime overhead of switching a per-cpu counter to atomic mode.
Changes compared to v1:
- Moved the runtime power management code into a separate file.
- Addressed Ming's feedback.

Bart Van Assche (12):
  block, scsi: Introduce request flag RQF_DV
  scsi: Alter handling of RQF_DV requests
  scsi: Only set RQF_DV for requests used for domain validation
  scsi: Introduce the SDEV_SUSPENDED device status
  block, scsi: Rename QUEUE_FLAG_PREEMPT_ONLY into DV_ONLY and introduce PM_ONLY
  scsi: Reallow SPI domain validation during system suspend
  block: Move power management code into a new source file
  block, scsi: Introduce blk_pm_runtime_exit()
  block: Split blk_pm_add_request() and blk_pm_put_request()
  block: Change the runtime power management approach (1/2)
  block: Change the runtime power management approach (2/2)
  blk-mq: Enable support for runtime power management

 block/Kconfig | 5 +
 block/Makefile | 1 +
 block/blk-core.c | 310 ++
 block/blk-mq-debugfs.c | 4 +-
 block/blk-mq.c | 4 +-
 block/blk-pm.c | 262 +
 block/blk-pm.h | 69 +++
 block/elevator.c | 22 +--
 drivers/ide/ide-pm.c | 3 +-
 drivers/scsi/scsi_lib.c | 266 ++---
 drivers/scsi/scsi_pm.c | 7 +-
 drivers/scsi/scsi_sysfs.c | 2 +
 drivers/scsi/scsi_transport_spi.c | 26 +--
 drivers/scsi/sd.c | 10 +-
 drivers/scsi/sr.c | 4 +-
 include/linux/blk-mq.h | 6 +-
 include/linux/blk-pm.h | 26 +++
 include/linux/blkdev.h | 41 ++--
 include/scsi/scsi_device.h | 15 +-
 include/scsi/scsi_transport_spi.h | 1 -
 20 files changed, 673 insertions(+), 411 deletions(-)
 create mode 100644 block/blk-pm.c
 create mode 100644 block/blk-pm.h
 create mode 100644 include/linux/blk-pm.h

-- 
2.18.0
Re: [PATCH RESEND] Blk-throttle: reduce tail io latency when iops limit is enforced
On 8/9/18 11:47 AM, Liu Bo wrote: > When an application's iops has exceeded its cgroup's iops limit, surely it > is throttled and kernel will set a timer for dispatching, thus IO latency > includes the delay. > > However, the dispatch delay which is calculated by the limit and the > elapsed jiffies is suboptimal. As the dispatch delay is only calculated > once the application's iops is (iops limit + 1), it doesn't need to wait > any longer than the remaining time of the current slice. Applied, thanks. -- Jens Axboe
[PATCH RESEND] Blk-throttle: reduce tail io latency when iops limit is enforced
When an application's iops has exceeded its cgroup's iops limit, surely it is throttled and kernel will set a timer for dispatching, thus IO latency includes the delay. However, the dispatch delay which is calculated by the limit and the elapsed jiffies is suboptimal. As the dispatch delay is only calculated once the application's iops is (iops limit + 1), it doesn't need to wait any longer than the remaining time of the current slice. The difference can be proved by the following fio job and cgroup iops setting, - $ echo 4 > /mnt/config/nullb/disk1/mbps# limit nullb's bandwidth to 4MB/s for testing. $ echo "253:1 riops=100 rbps=max" > /sys/fs/cgroup/unified/cg1/io.max $ cat r2.job [global] name=fio-rand-read filename=/dev/nullb1 rw=randread bs=4k direct=1 numjobs=1 time_based=1 runtime=60 group_reporting=1 [file1] size=4G ioengine=libaio iodepth=1 rate_iops=5 norandommap=1 thinktime=4ms - wo patch: file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1 fio-3.7-66-gedfc Starting 1 process read: IOPS=99, BW=400KiB/s (410kB/s)(23.4MiB/60001msec) slat (usec): min=10, max=336, avg=27.71, stdev=17.82 clat (usec): min=2, max=28887, avg=5929.81, stdev=7374.29 lat (usec): min=24, max=28901, avg=5958.73, stdev=7366.22 clat percentiles (usec): | 1.00th=[4], 5.00th=[4], 10.00th=[4], 20.00th=[4], | 30.00th=[4], 40.00th=[4], 50.00th=[6], 60.00th=[11731], | 70.00th=[11863], 80.00th=[11994], 90.00th=[12911], 95.00th=[22676], | 99.00th=[23725], 99.50th=[23987], 99.90th=[23987], 99.95th=[25035], | 99.99th=[28967] w/ patch: file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1 fio-3.7-66-gedfc Starting 1 process read: IOPS=100, BW=400KiB/s (410kB/s)(23.4MiB/60005msec) slat (usec): min=10, max=155, avg=23.24, stdev=16.79 clat (usec): min=2, max=12393, avg=5961.58, stdev=5959.25 lat (usec): min=23, max=12412, avg=5985.91, stdev=5951.92 clat percentiles (usec): | 1.00th=[3], 
5.00th=[3], 10.00th=[4], 20.00th=[4], | 30.00th=[4], 40.00th=[5], 50.00th=[ 47], 60.00th=[11863], | 70.00th=[11994], 80.00th=[11994], 90.00th=[11994], 95.00th=[11994], | 99.00th=[11994], 99.50th=[11994], 99.90th=[12125], 99.95th=[12125], | 99.99th=[12387] Signed-off-by: Liu Bo --- Rebase against branch for-4.19/block. block/blk-throttle.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/block/blk-throttle.c b/block/blk-throttle.c index caaabbe8a7a5..a3eede00d302 100644 --- a/block/blk-throttle.c +++ b/block/blk-throttle.c @@ -922,12 +922,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio, } /* Calc approx time to dispatch */ - jiffy_wait = ((tg->io_disp[rw] + 1) * HZ) / tg_iops_limit(tg, rw) + 1; - - if (jiffy_wait > jiffy_elapsed) - jiffy_wait = jiffy_wait - jiffy_elapsed; - else - jiffy_wait = 1; + jiffy_wait = jiffy_elapsed_rnd - jiffy_elapsed; if (wait) *wait = jiffy_wait; -- 1.8.3.1
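The before/after wait computations in this patch can be checked with a small userspace model. HZ and the helper names below are assumptions of the model, not kernel API; the formulas mirror the diff:

```c
#include <assert.h>

#define HZ 1000	/* assumed jiffies-per-second for the model */

/*
 * Userspace model (not kernel code) of the wait computation changed by
 * this patch.  io_disp, iops_limit, jiffy_elapsed and jiffy_elapsed_rnd
 * mirror the variables in tg_with_in_iops_limit().
 */

/* Old behaviour: derive the wait from the number of IOs dispatched so
 * far, which can wait far past the end of the current slice. */
static unsigned long old_iops_wait(unsigned int io_disp,
				   unsigned int iops_limit,
				   unsigned long jiffy_elapsed)
{
	unsigned long jiffy_wait = ((io_disp + 1UL) * HZ) / iops_limit + 1;

	return jiffy_wait > jiffy_elapsed ? jiffy_wait - jiffy_elapsed : 1;
}

/* New behaviour: wait only for the remainder of the current slice
 * (jiffy_elapsed_rnd is jiffy_elapsed rounded up to the slice length). */
static unsigned long new_iops_wait(unsigned long jiffy_elapsed_rnd,
				   unsigned long jiffy_elapsed)
{
	return jiffy_elapsed_rnd - jiffy_elapsed;
}
```

For example, with a 100-jiffy slice, 100 IOs already dispatched against a 100-IOPS limit and 10 jiffies elapsed, the old formula waits 1001 jiffies while the new one waits only the 90 jiffies left in the slice — which is the tail-latency reduction the fio numbers show.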
Re: [PATCH] Blk-throttle: reduce tail io latency when iops limit is enforced
On Thu, Aug 9, 2018 at 10:22 AM, Jens Axboe wrote: > On 7/20/18 6:29 PM, Liu Bo wrote: >> When an application's iops has exceeded its cgroup's iops limit, surely it >> is throttled and kernel will set a timer for dispatching, thus IO latency >> includes the delay. >> >> However, the dispatch delay which is calculated by the limit and the >> elapsed jiffies is suboptimal. As the dispatch delay is only calculated >> once the application's iops is (iops limit + 1), it doesn't need to wait >> any longer than the remaining time of the current slice. > > Looks good to me - can you resend against my for-4.19/block branch, as > it doesn't apply. > Sure, I'll rebase against it. thanks, liubo
Re: [PATCH] Blk-throttle: reduce tail io latency when iops limit is enforced
On 7/20/18 6:29 PM, Liu Bo wrote: > When an application's iops has exceeded its cgroup's iops limit, surely it > is throttled and kernel will set a timer for dispatching, thus IO latency > includes the delay. > > However, the dispatch delay which is calculated by the limit and the > elapsed jiffies is suboptimal. As the dispatch delay is only calculated > once the application's iops is (iops limit + 1), it doesn't need to wait > any longer than the remaining time of the current slice. Looks good to me - can you resend against my for-4.19/block branch, as it doesn't apply. -- Jens Axboe
Re: [PATCH] Blk-throttle: reduce tail io latency when iops limit is enforced
ping? On Fri, Jul 20, 2018 at 5:29 PM, Liu Bo wrote: > When an application's iops has exceeded its cgroup's iops limit, surely it > is throttled and kernel will set a timer for dispatching, thus IO latency > includes the delay. > > However, the dispatch delay which is calculated by the limit and the > elapsed jiffies is suboptimal. As the dispatch delay is only calculated > once the application's iops is (iops limit + 1), it doesn't need to wait > any longer than the remaining time of the current slice. > > The difference can be proved by the following fio job and cgroup iops > setting, > - > $ echo 4 > /mnt/config/nullb/disk1/mbps# limit nullb's bandwidth to 4MB/s > for testing. > $ echo "253:1 riops=100 rbps=max" > /sys/fs/cgroup/unified/cg1/io.max > $ cat r2.job > [global] > name=fio-rand-read > filename=/dev/nullb1 > rw=randread > bs=4k > direct=1 > numjobs=1 > time_based=1 > runtime=60 > group_reporting=1 > > [file1] > size=4G > ioengine=libaio > iodepth=1 > rate_iops=5 > norandommap=1 > thinktime=4ms > - > > wo patch: > file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) > 4096B-4096B, ioengine=libaio, iodepth=1 > fio-3.7-66-gedfc > Starting 1 process > >read: IOPS=99, BW=400KiB/s (410kB/s)(23.4MiB/60001msec) > slat (usec): min=10, max=336, avg=27.71, stdev=17.82 > clat (usec): min=2, max=28887, avg=5929.81, stdev=7374.29 > lat (usec): min=24, max=28901, avg=5958.73, stdev=7366.22 > clat percentiles (usec): > | 1.00th=[4], 5.00th=[4], 10.00th=[4], 20.00th=[4], > | 30.00th=[4], 40.00th=[4], 50.00th=[6], 60.00th=[11731], > | 70.00th=[11863], 80.00th=[11994], 90.00th=[12911], 95.00th=[22676], > | 99.00th=[23725], 99.50th=[23987], 99.90th=[23987], 99.95th=[25035], > | 99.99th=[28967] > > w/ patch: > file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) > 4096B-4096B, ioengine=libaio, iodepth=1 > fio-3.7-66-gedfc > Starting 1 process > >read: IOPS=100, BW=400KiB/s (410kB/s)(23.4MiB/60005msec) > slat (usec): min=10, max=155, avg=23.24, 
stdev=16.79 > clat (usec): min=2, max=12393, avg=5961.58, stdev=5959.25 > lat (usec): min=23, max=12412, avg=5985.91, stdev=5951.92 > clat percentiles (usec): > | 1.00th=[3], 5.00th=[3], 10.00th=[4], 20.00th=[4], > | 30.00th=[4], 40.00th=[5], 50.00th=[ 47], 60.00th=[11863], > | 70.00th=[11994], 80.00th=[11994], 90.00th=[11994], 95.00th=[11994], > | 99.00th=[11994], 99.50th=[11994], 99.90th=[12125], 99.95th=[12125], > | 99.99th=[12387] > > Signed-off-by: Liu Bo > --- > block/blk-throttle.c | 7 +-- > 1 file changed, 1 insertion(+), 6 deletions(-) > > diff --git a/block/blk-throttle.c b/block/blk-throttle.c > index 26ddbe5286f8..3ca9a98748be 100644 > --- a/block/blk-throttle.c > +++ b/block/blk-throttle.c > @@ -956,12 +956,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, > struct bio *bio, > return false; > > /* Calc approx time to dispatch */ > - jiffy_wait = ((tg->io_disp[rw] + 1) * HZ) / tg_iops_limit(tg, rw) + 1; > - > - if (jiffy_wait > jiffy_elapsed) > - jiffy_wait = jiffy_wait - jiffy_elapsed; > - else > - jiffy_wait = 1; > + jiffy_wait = jiffy_elapsed_rnd - jiffy_elapsed; > > *wait = jiffy_wait; > return false; > -- > 1.8.3.1 >
Re: [PATCH v5 5/9] block: Change the runtime power management approach (1/2)
On Thu, 2018-08-09 at 10:52 +0800, Ming Lei wrote: > On Wed, Aug 08, 2018 at 05:28:43PM +, Bart Van Assche wrote: > > Some but not all blk_queue_enter() calls are related to request allocation > > so > > The only one I remember is scsi_ioctl_reset(), in which scsi_autopm_get_host() > is called before allocating this request, that means it is enough to put > host up for handling RESET. Also this request shouldn't have entered the block > request queue. > > Or are there other cases in which request allocation isn't related with > blk_queue_enter()? What I was referring to in my e-mail is the q_usage_counter manipulations in blk_mq_timeout_work(). However, blk_mq_timeout_work() manipulates that counter directly. So it's only the blk_queue_exit() call in that function that is not related to request allocation. All blk_queue_enter() calls I know of are related to request allocation. Bart.
Re: [PATCH v5 0/3] Ensure that a request queue is dissociated from the cgroup controller
On 8/9/18 8:53 AM, Bart Van Assche wrote: > Hello Jens, > > Several block drivers call alloc_disk() followed by put_disk() if something > fails before device_add_disk() is called without calling blk_cleanup_queue(). > Make sure that also for this scenario a request queue is dissociated from the > cgroup controller. This patch avoids that loading the parport_pc, paride and > pf drivers trigger a kernel crash. Since this patch series fixes a regression, > please consider these patches for kernel v4.19. Thanks, applied for 4.19. -- Jens Axboe
[PATCH v5 3/3] block: Ensure that a request queue is dissociated from the cgroup controller
Several block drivers call alloc_disk() followed by put_disk() if something fails before device_add_disk() is called without calling blk_cleanup_queue(). Make sure that also for this scenario a request queue is dissociated from the cgroup controller. This patch avoids that loading the parport_pc, paride and pf drivers triggers the following kernel crash: BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride] Read of size 4 at addr 0008 by task modprobe/744 Call Trace: dump_stack+0x9a/0xeb kasan_report+0x139/0x350 pi_init+0x42e/0x580 [paride] pf_init+0x2bb/0x1000 [pf] do_one_initcall+0x8e/0x405 do_init_module+0xd9/0x2f2 load_module+0x3ab4/0x4700 SYSC_finit_module+0x176/0x1a0 do_syscall_64+0xee/0x2b0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 Reported-by: Alexandru Moise <00moses.alexande...@gmail.com> Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the block cgroup controller") # v4.17 Signed-off-by: Bart Van Assche Tested-by: Alexandru Moise <00moses.alexande...@gmail.com> Reviewed-by: Johannes Thumshirn Cc: Tejun Heo Cc: Christoph Hellwig Cc: Ming Lei Cc: Alexandru Moise <00moses.alexande...@gmail.com> Cc: Joseph Qi Cc: --- block/blk-sysfs.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index ca1984ecbdeb..fcadea471779 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -802,6 +802,21 @@ static void __blk_release_queue(struct work_struct *work) blk_stat_remove_callback(q, q->poll_cb); blk_stat_free_callback(q->poll_cb); + if (!blk_queue_dead(q)) { + /* +* Last reference was dropped without having called +* blk_cleanup_queue(). 
+*/ + WARN_ONCE(blk_queue_init_done(q), + "request queue %p has been registered but blk_cleanup_queue() has not been called for that queue\n", + q); + blk_exit_queue(q); + } + + WARN(blkg_root_lookup(q), +"request queue %p is being released but it has not yet been removed from the blkcg controller\n", +q); + blk_free_queue_stats(q->stats); blk_exit_rl(q, &q->root_rl); -- 2.18.0
[PATCH v5 0/3] Ensure that a request queue is dissociated from the cgroup controller
Hello Jens,

Several block drivers call alloc_disk() followed by put_disk() if something fails before device_add_disk() is called, without calling blk_cleanup_queue(). Make sure that also for this scenario a request queue is dissociated from the cgroup controller. This patch series avoids that loading the parport_pc, paride and pf drivers triggers a kernel crash. Since this patch series fixes a regression, please consider these patches for kernel v4.19.

Thanks,

Bart.

Changes between v4 and v5:
- Instead of using #ifdef CONFIG_CGROUP / #endif, introduced a new function in the cgroups code.

Changes between v3 and v4:
- Added "Cc: stable" tags.

Changes between v2 and v3:
- Avoid code duplication by introducing a new helper function.

Changes between v1 and v2:
- Fixed the build for CONFIG_BLK_CGROUP=n.

Bart Van Assche (3):
  blkcg: Introduce blkg_root_lookup()
  block: Introduce blk_exit_queue()
  block: Ensure that a request queue is dissociated from the cgroup controller

 block/blk-core.c | 54 +-
 block/blk-sysfs.c | 15 +++
 block/blk.h | 1 +
 include/linux/blk-cgroup.h | 18 +
 4 files changed, 64 insertions(+), 24 deletions(-)

-- 
2.18.0
[PATCH v5 1/3] blkcg: Introduce blkg_root_lookup()
This new function will be used in a later patch to verify whether a queue has been dissociated from the cgroup controller before being released.

Signed-off-by: Bart Van Assche
Cc: Tejun Heo
Cc: Christoph Hellwig
Cc: Ming Lei
Cc: Omar Sandoval
Cc: Johannes Thumshirn
Cc: Alexandru Moise <00moses.alexande...@gmail.com>
Cc: Joseph Qi
Cc:
---
 include/linux/blk-cgroup.h | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index f7b910768306..1361cfc9b878 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -341,6 +341,23 @@ static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg,
 	return __blkg_lookup(blkcg, q, false);
 }
 
+/**
+ * blkg_root_lookup - look up blkg for the specified request queue
+ * @q: request_queue of interest
+ *
+ * Lookup blkg for @q at the root level. See also blkg_lookup().
+ */
+static inline struct blkcg_gq *blkg_root_lookup(struct request_queue *q)
+{
+	struct blkcg_gq *blkg;
+
+	rcu_read_lock();
+	blkg = blkg_lookup(&blkcg_root, q);
+	rcu_read_unlock();
+
+	return blkg;
+}
+
 /**
  * blkg_to_pdata - get policy private data
  * @blkg: blkg of interest
@@ -864,6 +881,7 @@ static inline bool blk_cgroup_congested(void) { return false; }
 static inline void blkcg_schedule_throttle(struct request_queue *q, bool use_memdelay) { }
 static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
+static inline struct blkcg_gq *blkg_root_lookup(struct request_queue *q) { return NULL; }
 static inline int blkcg_init_queue(struct request_queue *q) { return 0; }
 static inline void blkcg_drain_queue(struct request_queue *q) { }
 static inline void blkcg_exit_queue(struct request_queue *q) { }
-- 
2.18.0
[PATCH v5 2/3] block: Introduce blk_exit_queue()
This patch does not change any functionality. Signed-off-by: Bart Van Assche Reviewed-by: Johannes Thumshirn Cc: Christoph Hellwig Cc: Ming Lei Cc: Omar Sandoval Cc: Alexandru Moise <00moses.alexande...@gmail.com> Cc: Joseph Qi Cc: --- block/blk-core.c | 54 +++- block/blk.h | 1 + 2 files changed, 31 insertions(+), 24 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 2e054b65de42..55bc22ef2934 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -719,6 +719,35 @@ void blk_set_queue_dying(struct request_queue *q) } EXPORT_SYMBOL_GPL(blk_set_queue_dying); +/* Unconfigure the I/O scheduler and dissociate from the cgroup controller. */ +void blk_exit_queue(struct request_queue *q) +{ + /* +* Since the I/O scheduler exit code may access cgroup information, +* perform I/O scheduler exit before disassociating from the block +* cgroup controller. +*/ + if (q->elevator) { + ioc_clear_queue(q); + elevator_exit(q, q->elevator); + q->elevator = NULL; + } + + /* +* Remove all references to @q from the block cgroup controller before +* restoring @q->queue_lock to avoid that restoring this pointer causes +* e.g. blkcg_print_blkgs() to crash. +*/ + blkcg_exit_queue(q); + + /* +* Since the cgroup code may dereference the @q->backing_dev_info +* pointer, only decrease its reference count after having removed the +* association with the block cgroup controller. +*/ + bdi_put(q->backing_dev_info); +} + /** * blk_cleanup_queue - shutdown a request queue * @q: request queue to shutdown @@ -788,30 +817,7 @@ void blk_cleanup_queue(struct request_queue *q) */ WARN_ON_ONCE(q->kobj.state_in_sysfs); - /* -* Since the I/O scheduler exit code may access cgroup information, -* perform I/O scheduler exit before disassociating from the block -* cgroup controller. 
-*/ - if (q->elevator) { - ioc_clear_queue(q); - elevator_exit(q, q->elevator); - q->elevator = NULL; - } - - /* -* Remove all references to @q from the block cgroup controller before -* restoring @q->queue_lock to avoid that restoring this pointer causes -* e.g. blkcg_print_blkgs() to crash. -*/ - blkcg_exit_queue(q); - - /* -* Since the cgroup code may dereference the @q->backing_dev_info -* pointer, only decrease its reference count after having removed the -* association with the block cgroup controller. -*/ - bdi_put(q->backing_dev_info); + blk_exit_queue(q); if (q->mq_ops) blk_mq_free_queue(q); diff --git a/block/blk.h b/block/blk.h index 6adae8f94279..aeb8026f4d7b 100644 --- a/block/blk.h +++ b/block/blk.h @@ -130,6 +130,7 @@ void blk_free_flush_queue(struct blk_flush_queue *q); int blk_init_rl(struct request_list *rl, struct request_queue *q, gfp_t gfp_mask); void blk_exit_rl(struct request_queue *q, struct request_list *rl); +void blk_exit_queue(struct request_queue *q); void blk_rq_bio_prep(struct request_queue *q, struct request *rq, struct bio *bio); void blk_queue_bypass_start(struct request_queue *q); -- 2.18.0
[PATCH] block: Remove two superfluous #include directives
Commit 12f5b9314545 ("blk-mq: Remove generation seqeunce") removed the only seqcount_t and u64_stats_sync instances from <linux/blkdev.h> but did not remove the corresponding #include directives. Since these include directives are no longer needed, remove them.

Signed-off-by: Bart Van Assche
Cc: Christoph Hellwig
Cc: Keith Busch
Cc: Ming Lei
Cc: Jianchao Wang
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 include/linux/blkdev.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 050d599f5ea9..d6869e0e2b64 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -27,8 +27,6 @@
 #include
 #include
 #include
-#include <linux/seqlock.h>
-#include <linux/u64_stats_sync.h>
 
 struct module;
 struct scsi_ioctl_command;
-- 
2.18.0
Re: [PATCH v5 4/9] percpu-refcount: Introduce percpu_ref_is_in_use()
On Wed, 2018-08-08 at 08:23 -0700, Tejun Heo wrote: > On Tue, Aug 07, 2018 at 03:51:28PM -0700, Bart Van Assche wrote: > > Introduce a function that allows to determine whether a per-cpu refcount > > is in use. This function will be used in a later patch to determine > > whether or not any block layer requests are being executed. > > I thought about it a bit and am having a bit of difficulty convincing > myself this is necessary. Switching a percpu_ref to atomic mode isn't > expensive - it's one spinlock cycle, a rcu wait and one sweep of the > percpu counters. The most expensive part - the percpu sweep - needs > to be there even with optimization, the wait doesn't really matter as > all it'll do is slightly delaying timer based PM operation and can be > overlayed with the propagation of set_pm_only() anyway. > > So, how about just doing the simple thing? Switch it to atomic mode > and check the counter and switch back to percpu mode afterwards. If > we see any issues with that, we can try to optimize it later but that > seems unlikely to me. Hello Tejun, Switching to atomic mode would require me to add a serialization mechanism against blk_freeze_queue_start() and blk_mq_unfreeze_queue() since these functions call percpu_ref_kill() and percpu_ref_reinit(). That is easy but requires additional code. I will see whether I can implement an alternative approach using blk_mq_queue_tag_busy_iter(). Thanks, Bart.
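The mechanism discussed in this exchange — reading a per-cpu refcount by switching it to atomic mode and folding the per-cpu deltas — can be sketched as a minimal userspace model. This is illustrative only (`ref_model` and friends are invented names); the real percpu_ref code additionally needs a spinlock, an RCU grace period and a bias value:

```c
#include <assert.h>

#define NR_CPUS 4	/* arbitrary for the model */

/*
 * Minimal model: while in percpu mode each CPU increments its own
 * counter; the total is only known after switching to atomic mode and
 * summing the per-cpu deltas into a single counter.
 */
struct ref_model {
	long percpu_count[NR_CPUS];
	long atomic_count;
	int atomic_mode;
};

static void ref_model_get(struct ref_model *ref, int cpu)
{
	if (ref->atomic_mode)
		ref->atomic_count++;
	else
		ref->percpu_count[cpu]++;
}

/* Switch to atomic mode and return the folded count. */
static long ref_model_switch_to_atomic(struct ref_model *ref)
{
	int cpu;

	ref->atomic_mode = 1;	/* kernel: spinlock + RCU wait */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		ref->atomic_count += ref->percpu_count[cpu];
		ref->percpu_count[cpu] = 0;
	}
	return ref->atomic_count;
}
```

The sweep over per-cpu counters is the unavoidable cost Tejun refers to; Bart's objection is that kill/reinit elsewhere would race with such a mode switch, which is why the series moved to blk_mq_queue_tag_busy_iter() instead.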
Re: [PATCH] block: bvec_nr_vecs() returns value for wrong slab
On 8/8/18 1:27 PM, Greg Edwards wrote: > In commit ed996a52c868 ("block: simplify and cleanup bvec pool > handling"), the value of the slab index is incremented by one in > bvec_alloc() after the allocation is done to indicate an index value of > 0 does not need to be later freed. > > bvec_nr_vecs() was not updated accordingly, and thus returns the wrong > value. Decrement idx before performing the lookup. Applied, thanks. -- Jens Axboe
Re: [GIT PULL] last round of nvme updates for 4.19
On 8/9/18 2:07 AM, Christoph Hellwig wrote: > Hi Jens, > > this should be the last round of NVMe updates before the 4.19 merge > window opens. It conains support for write protected (aka read-only) > namespaces from Chaitanya, two ANA fixes from Hannes and a fabrics > fix from Tal Shorer. Thanks, pulled. -- Jens Axboe
Re: [PATCH 00/10] bcache patches for 4.19, 2nd wave
On 2018/8/9 10:21 PM, Jens Axboe wrote: > On 8/9/18 1:48 AM, Coly Li wrote: >> Hi Jens, >> >> Here are 2nd wave bcache patches for 4.19. >> >> The patches from me were either verified by other people or posted >> for quite long time. Except for the debugfs_create_dir() fix and >> "set max writeback rate" fix, rested patches are simple or trivial >> IMHO. >> >> Our new bcache developer Shenghui Wang contributes two patches in >> this run. The first one fixes a misleading error message, and the >> second one is a code style clean up. >> >> Thanks in advance for picking them. > > Applied, fixed up a typo in your patch #4. > Oh, thanks! Coly Li
Re: [PATCH 00/10] bcache patches for 4.19, 2nd wave
On 8/9/18 1:48 AM, Coly Li wrote: > Hi Jens, > > Here are 2nd wave bcache patches for 4.19. > > The patches from me were either verified by other people or posted > for quite long time. Except for the debugfs_create_dir() fix and > "set max writeback rate" fix, rested patches are simple or trivial > IMHO. > > Our new bcache developer Shenghui Wang contributes two patches in > this run. The first one fixes a misleading error message, and the > second one is a code style clean up. > > Thanks in advance for picking them. Applied, fixed up a typo in your patch #4. -- Jens Axboe
[GIT PULL] last round of nvme updates for 4.19
Hi Jens,

this should be the last round of NVMe updates before the 4.19 merge window opens. It contains support for write protected (aka read-only) namespaces from Chaitanya, two ANA fixes from Hannes and a fabrics fix from Tal Shorer.

The following changes since commit f10fe9d85dc0802b54519c917716e6f0092b4ce7:

  lightnvm: remove minor version check for 2.0 (2018-08-05 19:36:09 -0600)

are available in the Git repository at:

  git://git.infradead.org/nvme.git nvme-4.19

for you to fetch changes up to 66414e80245e1e73222f67ee711951c7f4bdedab:

  nvme-fabrics: fix ctrl_loss_tmo < 0 to reconnect forever (2018-08-08 12:01:49 +0200)

Chaitanya Kulkarni (3):
  nvme.h: add support for ns write protect definitions
  nvme: set gendisk read only based on nsattr
  nvmet: add ns write protect support

Hannes Reinecke (2):
  nvme: fixup crash on failed discovery
  nvme.h: fixup ANA group descriptor format

Tal Shorer (1):
  nvme-fabrics: fix ctrl_loss_tmo < 0 to reconnect forever

 drivers/nvme/host/core.c | 6 
 drivers/nvme/host/fabrics.c | 2 +-
 drivers/nvme/host/multipath.c | 2 +-
 drivers/nvme/target/admin-cmd.c | 76 +++
 drivers/nvme/target/core.c | 20 ++-
 drivers/nvme/target/io-cmd-bdev.c | 7 
 drivers/nvme/target/io-cmd-file.c | 12 ---
 drivers/nvme/target/nvmet.h | 4 +++
 include/linux/nvme.h | 19 --
 9 files changed, 138 insertions(+), 10 deletions(-)
[PATCH 10/10] bcache: trivial - remove trailing backslash in macro BTREE_FLAG
From: Shenghui Wang Remove the trailing backslash in macro BTREE_FLAG in btree.h Signed-off-by: Shenghui Wang Signed-off-by: Coly Li --- drivers/md/bcache/btree.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h index d211e2c25b6b..68e9d926134d 100644 --- a/drivers/md/bcache/btree.h +++ b/drivers/md/bcache/btree.h @@ -152,7 +152,7 @@ static inline bool btree_node_ ## flag(struct btree *b) \ { return test_bit(BTREE_NODE_ ## flag, &b->flags); } \ \ static inline void set_btree_node_ ## flag(struct btree *b)\ -{ set_bit(BTREE_NODE_ ## flag, &b->flags); } \ +{ set_bit(BTREE_NODE_ ## flag, &b->flags); } enum btree_flags { BTREE_NODE_io_error, -- 2.18.0
[PATCH 09/10] bcache: make the pr_err statement used for ENOENT only in sysfs_attach section
From: Shenghui Wang The pr_err statement in the code for the sysfs_attach section would run for various error codes, which may be confusing. E.g., run the command twice: echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \ /sys/block/bcache0/bcache/attach [the backing dev got attached on the first run] echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \ /sys/block/bcache0/bcache/attach In dmesg, after the command runs twice, we can get: bcache: bch_cached_dev_attach() Can't attach sda6: already attached bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\ a8df5e8be891 : cache set not found The first statement in the message was right, but the second was confusing. bch_cached_dev_attach has various pr_ statements for various error codes, except ENOENT. After the change, rerun the above command twice: echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \ /sys/block/bcache0/bcache/attach echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \ /sys/block/bcache0/bcache/attach In dmesg we only got: bcache: bch_cached_dev_attach() Can't attach sda6: already attached No confusing "cache set not found" message anymore. And for a nonexistent SET-UUID: echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be898 > \ /sys/block/bcache0/bcache/attach In dmesg we can get: bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\ a8df5e8be898 : cache set not found Signed-off-by: Shenghui Wang Signed-off-by: Coly Li --- drivers/md/bcache/sysfs.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index 6e88142514fb..22f8565d2bf1 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -351,8 +351,8 @@ STORE(__cached_dev) if (!v) return size; } - - pr_err("Can't attach %s: cache set not found", buf); + if (v == -ENOENT) + pr_err("Can't attach %s: cache set not found", buf); return v; } -- 2.18.0
[PATCH 08/10] bcache: set max writeback rate when I/O request is idle
Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") allows the writeback rate to be faster if there is no I/O request on a bcache device. It works well if there is only one bcache device attached to the cache set. If there are many bcache devices attached to a cache set, it may introduce performance regression because multiple faster writeback threads of the idle bcache devices will compete for the btree level locks with the bcache devices which have I/O requests coming. This patch fixes the above issue by only permitting fast writeback when all bcache devices attached to the cache set are idle. And if one of the bcache devices has a new I/O request coming, all writeback throughput is minimized immediately and the PI controller __update_writeback_rate() decides the upcoming writeback rate for each bcache device. Also when all bcache devices are idle, limiting the writeback rate to a small number is a waste of throughput, especially when backing devices are slower non-rotational devices (e.g. SATA SSD). This patch sets a max writeback rate for each backing device if the whole cache set is idle. A faster writeback rate in idle time means new I/Os may have more available space for dirty data, and people may observe better write performance then. Please note bcache may change its cache mode at run time, and this patch still works if the cache mode is switched from writeback mode and there is still dirty data on the cache. 
Fixes: b1092c9af9ed ("bcache: allow quick writeback when backing idle") Cc: sta...@vger.kernel.org #4.16+ Signed-off-by: Coly Li Tested-by: Kai Krakow Tested-by: Stefan Priebe Cc: Michael Lyle --- drivers/md/bcache/bcache.h| 10 ++-- drivers/md/bcache/request.c | 59 ++- drivers/md/bcache/super.c | 4 ++ drivers/md/bcache/sysfs.c | 15 -- drivers/md/bcache/util.c | 2 +- drivers/md/bcache/util.h | 2 +- drivers/md/bcache/writeback.c | 91 +++ 7 files changed, 138 insertions(+), 45 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index b393b3fd06b6..05f82ff6f016 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -328,13 +328,6 @@ struct cached_dev { */ atomic_thas_dirty; - /* -* Set to zero by things that touch the backing volume-- except -* writeback. Incremented by writeback. Used to determine when to -* accelerate idle writeback. -*/ - atomic_tbacking_idle; - struct bch_ratelimitwriteback_rate; struct delayed_work writeback_rate_update; @@ -515,6 +508,8 @@ struct cache_set { struct cache_accounting accounting; unsigned long flags; + atomic_tidle_counter; + atomic_tat_max_writeback_rate; struct cache_sb sb; @@ -524,6 +519,7 @@ struct cache_set { struct bcache_device**devices; unsigneddevices_max_used; + atomic_tattached_dev_nr; struct list_headcached_devs; uint64_tcached_dev_sectors; atomic_long_t flash_dev_dirty_sectors; diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index 914d501ad1e0..7dbe8b6316a0 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -1103,6 +1103,44 @@ static void detached_dev_do_request(struct bcache_device *d, struct bio *bio) generic_make_request(bio); } +static void quit_max_writeback_rate(struct cache_set *c, + struct cached_dev *this_dc) +{ + int i; + struct bcache_device *d; + struct cached_dev *dc; + + /* +* mutex bch_register_lock may compete with other parallel requesters, +* or attach/detach operations on other backing devices. 
Waiting for +* the mutex lock may increase I/O request latency for seconds or more. +* To avoid such a situation, if mutex_trylock() fails, only the writeback +* rate of the current cached device is set to 1, and __update_writeback_rate() +* will decide the writeback rate of other cached devices (remember now +* c->idle_counter is 0 already). +*/ + if (mutex_trylock(&bch_register_lock)) { + for (i = 0; i < c->devices_max_used; i++) { + if (!c->devices[i]) + continue; + + if (UUID_FLASH_ONLY(&c->uuids[i])) + continue; + + d = c->devices[i]; + dc = container_of(d, struct cached_dev, disk); + /* +* set writeback rate to default minimum value, +* then let update_writeback_rate() decide the +* upcoming rate. +*/
[PATCH 07/10] bcache: add code comments for bset.c
This patch tries to add code comments in bset.c, to make some tricky code and design more comprehensible. Most information in this patch comes from the discussion between Kent and me; he offered very informative details. If there is any mistake in the ideas behind the code, no doubt it is my misrepresentation. Signed-off-by: Coly Li --- drivers/md/bcache/bset.c | 63 1 file changed, 63 insertions(+) diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c index f3403b45bc28..596c93b44e9b 100644 --- a/drivers/md/bcache/bset.c +++ b/drivers/md/bcache/bset.c @@ -366,6 +366,10 @@ EXPORT_SYMBOL(bch_btree_keys_init); /* Binary tree stuff for auxiliary search trees */ +/* + * return the array index next to j when doing an in-order traversal + * of a binary tree which is stored in a linear array + */ static unsigned inorder_next(unsigned j, unsigned size) { if (j * 2 + 1 < size) { @@ -379,6 +383,10 @@ static unsigned inorder_next(unsigned j, unsigned size) return j; } +/* + * return the array index previous to j when doing an in-order traversal + * of a binary tree which is stored in a linear array + */ static unsigned inorder_prev(unsigned j, unsigned size) { if (j * 2 < size) { @@ -421,6 +429,10 @@ static unsigned __to_inorder(unsigned j, unsigned size, unsigned extra) return j; } +/* + * Return the cacheline index in bset_tree->data, where j is the index + * from a linear array which stores the auxiliary binary tree + */ static unsigned to_inorder(unsigned j, struct bset_tree *t) { return __to_inorder(j, t->size, t->extra); @@ -441,6 +453,10 @@ static unsigned __inorder_to_tree(unsigned j, unsigned size, unsigned extra) return j; } +/* + * Return an index from a linear array which stores the auxiliary binary + * tree, j is the cacheline index of t->data. 
+ */ static unsigned inorder_to_tree(unsigned j, struct bset_tree *t) { return __inorder_to_tree(j, t->size, t->extra); } @@ -546,6 +562,20 @@ static inline uint64_t shrd128(uint64_t high, uint64_t low, uint8_t shift) return low; } +/* + * Calculate mantissa value for struct bkey_float. + * If the most significant bit of f->exponent is not set, then + * - f->exponent >> 6 is 0 + * - p[0] points to bkey->low + * - p[-1] borrows bits from KEY_INODE() of bkey->high + * If the most significant bit of f->exponent is set, then + * - f->exponent >> 6 is 1 + * - p[0] points to bits from KEY_INODE() of bkey->high + * - p[-1] points to other bits from KEY_INODE() of + *bkey->high too. + * See make_bfloat() to check when the most significant bit of f->exponent + * is set or not. + */ static inline unsigned bfloat_mantissa(const struct bkey *k, struct bkey_float *f) { @@ -570,6 +600,16 @@ static void make_bfloat(struct bset_tree *t, unsigned j) BUG_ON(m < l || m > r); BUG_ON(bkey_next(p) != m); + /* +* If l and r have different KEY_INODE values (different backing +* device), f->exponent records how many least significant bits +* are different in KEY_INODE values and sets most significant +* bits to 1 (by +64). +* If l and r have the same KEY_INODE value, f->exponent records +* how many different bits in least significant bits of bkey->low. +* See bfloat_mantissa() for how the most significant bit of +* f->exponent is used to calculate the bfloat mantissa value. +*/ if (KEY_INODE(l) != KEY_INODE(r)) f->exponent = fls64(KEY_INODE(r) ^ KEY_INODE(l)) + 64; else @@ -633,6 +673,15 @@ void bch_bset_init_next(struct btree_keys *b, struct bset *i, uint64_t magic) } EXPORT_SYMBOL(bch_bset_init_next); +/* + * Build auxiliary binary tree 'struct bset_tree *t', this tree is used to + * accelerate bkey search in a btree node (pointed by bset_tree->data in + * memory). 
After searching in the auxiliary tree by calling bset_search_tree(), + * a struct bset_search_iter is returned which indicates the range [l, r] from + * bset_tree->data where the bkey being searched for might be. Then a following + * linear comparison does the exact search, see __bch_bset_search() for how + * the auxiliary tree is used. + */ void bch_bset_build_written_tree(struct btree_keys *b) { struct bset_tree *t = bset_tree_last(b); @@ -898,6 +947,17 @@ static struct bset_search_iter bset_search_tree(struct bset_tree *t, unsigned inorder, j, n = 1; do { + /* +* A bit trick here. +* If p < t->size, (int)(p - t->size) is a negative value and +* the most significant bit is set, right shifting 31 bits +* yields an all-ones mask. If p >= t->size, the most significant bit is +* not set, right shifting 31 bits yields 0. +* So the following 2 lines are equivalent to +* if (p >= t->size) +* p =
[PATCH 06/10] bcache: fix mistaken comments in request.c
This patch updates the code comment in bch_keylist_realloc() by fixing incorrect function names, to make the code more comprehensible. Signed-off-by: Coly Li --- drivers/md/bcache/request.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index 43af905920f5..914d501ad1e0 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -107,7 +107,7 @@ static int bch_keylist_realloc(struct keylist *l, unsigned u64s, /* * The journalling code doesn't handle the case where the keys to insert * is bigger than an empty write: If we just return -ENOMEM here, -* bio_insert() and bio_invalidate() will insert the keys created so far +* bch_data_insert_keys() will insert the keys created so far * and finish the rest when the keylist is empty. */ if (newsize * sizeof(uint64_t) > block_bytes(c) - sizeof(struct jset)) -- 2.18.0
[PATCH 04/10] bcache: add a comment in super.c
This patch adds a line of code comment in super.c:register_bdev(), to make the code more comprehensible. Signed-off-by: Coly Li --- drivers/md/bcache/super.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index c7ffa6ef3f82..f517d7d1fa10 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1291,6 +1291,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page, pr_info("registered backing device %s", dc->backing_dev_name); list_add(&dc->list, &uncached_devices); + /* attach to a matched cache set if it exists */ list_for_each_entry(c, &bch_cache_sets, list) bch_cached_dev_attach(dc, c, NULL); -- 2.18.0
[PATCH 03/10] bcache: avoid unnecessary cache prefetch in bch_btree_node_get()
In bch_btree_node_get() the read-in btree node will be partially prefetched into L1 cache for the following bset iteration (if there is any). But if the btree node read fails, the prefetch operations will waste L1 cache space. This patch checks whether the read operation succeeded, and only does the cache prefetch when the read I/O succeeded. Signed-off-by: Coly Li --- drivers/md/bcache/btree.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 475008fbbaab..c19f7716df88 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -1011,6 +1011,13 @@ struct btree *bch_btree_node_get(struct cache_set *c, struct btree_op *op, BUG_ON(b->level != level); } + if (btree_node_io_error(b)) { + rw_unlock(write, b); + return ERR_PTR(-EIO); + } + + BUG_ON(!b->written); + b->parent = parent; b->accessed = 1; @@ -1022,13 +1029,6 @@ struct btree *bch_btree_node_get(struct cache_set *c, struct btree_op *op, for (; i <= b->keys.nsets; i++) prefetch(b->keys.set[i].data); - if (btree_node_io_error(b)) { - rw_unlock(write, b); - return ERR_PTR(-EIO); - } - - BUG_ON(!b->written); - return b; } -- 2.18.0
[PATCH 05/10] bcache: fix mistaken code comments in bcache.h
This patch updates the code comment in struct cache with correct array names, to make the code to be more comprehensible. Signed-off-by: Coly Li --- drivers/md/bcache/bcache.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 0a3e82b0876d..b393b3fd06b6 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -423,9 +423,9 @@ struct cache { /* * When allocating new buckets, prio_write() gets first dibs - since we * may not be allocate at all without writing priorities and gens. -* prio_buckets[] contains the last buckets we wrote priorities to (so -* gc can mark them as metadata), prio_next[] contains the buckets -* allocated for the next prio write. +* prio_last_buckets[] contains the last buckets we wrote priorities to +* (so gc can mark them as metadata), prio_buckets[] contains the +* buckets allocated for the next prio write. */ uint64_t*prio_buckets; uint64_t*prio_last_buckets; -- 2.18.0
[PATCH 02/10] bcache: display rate debug parameters to 0 when writeback is not running
When writeback is not running, the writeback rate should be 0; any other value is misleading. And the following dynamic writeback rate debug parameters should be 0 too: rate, proportional, integral, change; otherwise they are misleading when writeback is not running. Signed-off-by: Coly Li --- drivers/md/bcache/sysfs.c | 26 -- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index 225b15aa0340..3e9d3459a224 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -149,6 +149,7 @@ SHOW(__bch_cached_dev) struct cached_dev *dc = container_of(kobj, struct cached_dev, disk.kobj); const char *states[] = { "no cache", "clean", "dirty", "inconsistent" }; + int wb = dc->writeback_running; #define var(stat) (dc->stat) @@ -170,7 +171,7 @@ SHOW(__bch_cached_dev) var_printf(writeback_running, "%i"); var_print(writeback_delay); var_print(writeback_percent); - sysfs_hprint(writeback_rate,dc->writeback_rate.rate << 9); + sysfs_hprint(writeback_rate,wb ? dc->writeback_rate.rate << 9 : 0); sysfs_hprint(io_errors, atomic_read(&dc->io_errors)); sysfs_printf(io_error_limit,"%i", dc->error_limit); sysfs_printf(io_disable,"%i", dc->io_disable); @@ -188,15 +189,20 @@ SHOW(__bch_cached_dev) char change[20]; s64 next_io; - bch_hprint(rate,dc->writeback_rate.rate << 9); - bch_hprint(dirty, bcache_dev_sectors_dirty(&dc->disk) << 9); - bch_hprint(target, dc->writeback_rate_target << 9); - bch_hprint(proportional,dc->writeback_rate_proportional << 9); - bch_hprint(integral,dc->writeback_rate_integral_scaled << 9); - bch_hprint(change, dc->writeback_rate_change << 9); - - next_io = div64_s64(dc->writeback_rate.next - local_clock(), - NSEC_PER_MSEC); + /* +* Except for dirty and target, other values should +* be 0 if writeback is not running. +*/ + bch_hprint(rate, wb ? 
dc->writeback_rate.rate << 9 : 0); + bch_hprint(dirty, bcache_dev_sectors_dirty(&dc->disk) << 9); + bch_hprint(target, dc->writeback_rate_target << 9); + bch_hprint(proportional, + wb ? dc->writeback_rate_proportional << 9 : 0); + bch_hprint(integral, + wb ? dc->writeback_rate_integral_scaled << 9 : 0); + bch_hprint(change, wb ? dc->writeback_rate_change << 9 : 0); + next_io = wb ? div64_s64(dc->writeback_rate.next-local_clock(), +NSEC_PER_MSEC) : 0; return sprintf(buf, "rate:\t\t%s/sec\n" -- 2.18.0
[PATCH 00/10] bcache patches for 4.19, 2nd wave
Hi Jens, Here are the 2nd wave bcache patches for 4.19. The patches from me were either verified by other people or posted for quite a long time. Except for the debugfs_create_dir() fix and "set max writeback rate" fix, the rest of the patches are simple or trivial IMHO. Our new bcache developer Shenghui Wang contributes two patches in this run. The first one fixes a misleading error message, and the second one is a code style clean up. Thanks in advance for picking them. Coly Li --- Coly Li (8): bcache: do not check return value of debugfs_create_dir() bcache: display rate debug parameters to 0 when writeback is not running bcache: avoid unnecessary cache prefetch in bch_btree_node_get() bcache: add a comment in super.c bcache: fix mistaken code comments in bcache.h bcache: fix mistaken comments in request.c bcache: add code comments for bset.c bcache: set max writeback rate when I/O request is idle Shenghui Wang (2): bcache: make the pr_err statement used for ENOENT only in sysfs_attach section bcache: trivial - remove trailing backslash in macro BTREE_FLAG drivers/md/bcache/bcache.h| 18 +++ drivers/md/bcache/bset.c | 63 drivers/md/bcache/btree.c | 14 +++--- drivers/md/bcache/btree.h | 2 +- drivers/md/bcache/closure.c | 13 +++-- drivers/md/bcache/closure.h | 4 +- drivers/md/bcache/debug.c | 11 +++-- drivers/md/bcache/request.c | 61 +-- drivers/md/bcache/super.c | 9 +++- drivers/md/bcache/sysfs.c | 41 ++-- drivers/md/bcache/util.c | 2 +- drivers/md/bcache/util.h | 2 +- drivers/md/bcache/writeback.c | 91 +++ 13 files changed, 251 insertions(+), 80 deletions(-) -- 2.18.0
[PATCH 01/10] bcache: do not check return value of debugfs_create_dir()
Greg KH suggests that normal code should not care about debugfs. Therefore no matter whether debugfs_create_dir() succeeds or fails, it is unnecessary to check its return value. There are two functions that call debugfs_create_dir() and check the return value: bch_debug_init() and closure_debug_init(). This patch changes these two functions from int to void type, and ignores the return values of debugfs_create_dir(). This patch does not fix an exact bug, it just makes things work as they should. Signed-off-by: Coly Li Suggested-by: Greg Kroah-Hartman Cc: sta...@vger.kernel.org Cc: Kai Krakow Cc: Kent Overstreet --- drivers/md/bcache/bcache.h | 2 +- drivers/md/bcache/closure.c | 13 + drivers/md/bcache/closure.h | 4 ++-- drivers/md/bcache/debug.c | 11 ++- drivers/md/bcache/super.c | 4 +++- 5 files changed, 21 insertions(+), 13 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 872ef4d67711..0a3e82b0876d 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -1001,7 +1001,7 @@ void bch_open_buckets_free(struct cache_set *); int bch_cache_allocator_start(struct cache *ca); void bch_debug_exit(void); -int bch_debug_init(struct kobject *); +void bch_debug_init(struct kobject *kobj); void bch_request_exit(void); int bch_request_init(void); diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c index 0e14969182c6..618253683d40 100644 --- a/drivers/md/bcache/closure.c +++ b/drivers/md/bcache/closure.c @@ -199,11 +199,16 @@ static const struct file_operations debug_ops = { .release= single_release }; -int __init closure_debug_init(void) +void __init closure_debug_init(void) { - closure_debug = debugfs_create_file("closures", - 0400, bcache_debug, NULL, &debug_ops); - return IS_ERR_OR_NULL(closure_debug); + if (!IS_ERR_OR_NULL(bcache_debug)) + /* +* it is unnecessary to check return value of +* debugfs_create_file(), we should not care +* about this. 
+*/ + closure_debug = debugfs_create_file( + "closures", 0400, bcache_debug, NULL, &debug_ops); } #endif diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h index 71427eb5fdae..7c2c5bc7c88b 100644 --- a/drivers/md/bcache/closure.h +++ b/drivers/md/bcache/closure.h @@ -186,13 +186,13 @@ static inline void closure_sync(struct closure *cl) #ifdef CONFIG_BCACHE_CLOSURES_DEBUG -int closure_debug_init(void); +void closure_debug_init(void); void closure_debug_create(struct closure *cl); void closure_debug_destroy(struct closure *cl); #else -static inline int closure_debug_init(void) { return 0; } +static inline void closure_debug_init(void) {} static inline void closure_debug_create(struct closure *cl) {} static inline void closure_debug_destroy(struct closure *cl) {} diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c index 04d146711950..12034c07257b 100644 --- a/drivers/md/bcache/debug.c +++ b/drivers/md/bcache/debug.c @@ -252,11 +252,12 @@ void bch_debug_exit(void) debugfs_remove_recursive(bcache_debug); } -int __init bch_debug_init(struct kobject *kobj) +void __init bch_debug_init(struct kobject *kobj) { - if (!IS_ENABLED(CONFIG_DEBUG_FS)) - return 0; - + /* +* it is unnecessary to check return value of +* debugfs_create_dir(), we should not care +* about this. +*/ bcache_debug = debugfs_create_dir("bcache", NULL); - return IS_ERR_OR_NULL(bcache_debug); } diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index e0a92104ca23..c7ffa6ef3f82 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -2345,10 +2345,12 @@ static int __init bcache_init(void) goto err; if (bch_request_init() || - bch_debug_init(bcache_kobj) || closure_debug_init() || sysfs_create_files(bcache_kobj, files)) goto err; + bch_debug_init(bcache_kobj); + closure_debug_init(); + return 0; err: bcache_exit(); -- 2.18.0