Re: [PATCH v8 6/8] block: Schedule runtime resume earlier

2018-09-18 Thread Ming Lei
On Tue, Sep 18, 2018 at 01:59:01PM -0700, Bart Van Assche wrote:
> Instead of scheduling runtime resume of a request queue after a
> request has been queued, schedule asynchronous resume during request
> allocation. The new pm_request_resume() calls occur after
> blk_queue_enter() has increased the q_usage_counter request queue
> member. This change is needed for a later patch that will make request
> allocation block while the queue status is not RPM_ACTIVE.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 
> Cc: Ming Lei 
> Cc: Jianchao Wang 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 
> Cc: Alan Stern 
> ---
>  block/blk-core.c | 2 ++
>  block/elevator.c | 1 -
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index fd91e9bf2893..18b874d5c9c9 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -942,6 +942,8 @@ int blk_queue_enter(struct request_queue *q, 
> blk_mq_req_flags_t flags)
>   if (success)
>   return 0;
>  
> + blk_pm_request_resume(q);


It looks like this patch may introduce the following race between queue freeze and
runtime suspend:

---------------------------------------------------------------------------
CPU0                    CPU1                        CPU2
---------------------------------------------------------------------------

blk_freeze_queue()

                        blk_mq_alloc_request()
                        blk_queue_enter()
                        blk_pm_request_resume()
                        wait_event()

                                                    blk_pre_runtime_suspend()
                                                    ->blk_set_pm_only
                                                    ...
                                                    blk_post_runtime_suspend()

...
blk_mq_unfreeze_queue()
---------------------------------------------------------------------------
- CPU0: the queue is frozen.

- CPU1: a new request comes in and sees that the queue is frozen, but the queue
isn't runtime-suspended yet, so blk_pm_request_resume() does nothing and the
allocation blocks in wait_event().

- CPU2: runtime suspend kicks in and the queue becomes runtime-suspended.

- CPU0: the queue is unfrozen, but the request allocation on CPU1 may never
complete because the queue is now runtime-suspended and wait_event() never
returns (see the sketch below). The expected result is that the queue becomes
active again and the allocation on CPU1 completes immediately.
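
To make the blocked path concrete, here is a condensed sketch of blk_queue_enter()
as it looks with patches 4/8 and 6/8 of this series applied (simplified from the
hunks quoted in this thread, not a verbatim copy of the final code). The comments
mark where the allocation on CPU1 gets stuck:

int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
{
	const bool pm = flags & BLK_MQ_REQ_PREEMPT;

	while (true) {
		bool success = false;

		rcu_read_lock();
		/* On CPU1 this fails because CPU0 has frozen the queue. */
		if (percpu_ref_tryget_live(&q->q_usage_counter)) {
			if (pm || !blk_queue_pm_only(q))
				success = true;
			else
				percpu_ref_put(&q->q_usage_counter);
		}
		rcu_read_unlock();

		if (success)
			return 0;

		/*
		 * Runs before CPU2 suspends: rpm_status is still RPM_ACTIVE,
		 * so blk_pm_request_resume() (patch 5/8) schedules nothing.
		 */
		blk_pm_request_resume(q);

		if (flags & BLK_MQ_REQ_NOWAIT)
			return -EBUSY;

		/*
		 * Once CPU2 has suspended the queue, pm_only stays non-zero
		 * and nothing ever resumes the device, so for a non-PM
		 * allocation (pm == false) this condition never becomes true
		 * even after CPU0 unfreezes the queue.
		 */
		wait_event(q->mq_freeze_wq,
			   (atomic_read(&q->mq_freeze_depth) == 0 &&
			    (pm || !blk_queue_pm_only(q))) ||
			   blk_queue_dying(q));
		if (blk_queue_dying(q))
			return -ENODEV;
	}
}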

Thanks,
Ming


Re: [PATCH v7 5/6] block: Make blk_get_request() block for non-PM requests while suspended

2018-09-18 Thread jianchao.wang
Hi Bart

On 09/19/2018 01:44 AM, Bart Van Assche wrote:
>>> + * ago. Since blk_queue_enter() is called by the request allocation
>>> + * code before pm_request_resume(), if no requests have a tag assigned
>>> + * it is safe to suspend the device.
>>> + */
>>> +    ret = -EBUSY;
>>> +    if (blk_requests_in_flight(q) == 0) {
>>> +    /*
>>> + * Call synchronize_rcu() such that later blk_queue_enter()
>>> + * calls see the pm-only state. See also
>>> + * http://lwn.net/Articles/573497/.
>>> + */
>>> +    synchronize_rcu();
>>> +    if (blk_requests_in_flight(q) == 0)
>>
>> Seems not safe here.
>>
>> For blk-mq path:
>> Someone may have escaped the preempt check, missed the
>> blk_pm_request_resume() there, and entered generic_make_request(),
>> but not yet allocated a request or occupied any tag.
>>
>> There could be a similar scenario for the blk-legacy path, where
>> q->nr_pending is only increased when the request is queued.
>>
>> So I guess the q_usage_counter checking is still needed here.
> 
> There is only one blk_pm_request_resume() call and that call is inside 
> blk_queue_enter() after the pm_only counter has been checked.
> 
> For the legacy block layer, nr_pending is increased after the 
> blk_queue_enter() call from inside blk_old_get_request() succeeded.
> 
> So I don't see how blk_pm_request_resume() or q->nr_pending++ could escape 
> from the preempt checking?

For example (see also the sketch below):

blk_pre_runtime_suspend              generic_make_request
                                       blk_queue_enter    // succeeds here, so no
                                                          // blk_pm_request_resume
                                       blk_mq_make_request
blk_set_pm_only

if (blk_requests_in_flight(q) == 0) {
  synchronize_rcu();
  if (blk_requests_in_flight(q) == 0)
    ret = 0;
}
                                       blk_mq_get_request
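
Condensing the blk_queue_enter() and blk_pre_runtime_suspend() hunks from this
series (simplified, not a literal copy), the window looks roughly like this,
which is why the q_usage_counter check still appears to be needed:

/* Submitter: generic_make_request() -> blk_queue_enter(), before the
 * suspend side has called blk_set_pm_only(): */
rcu_read_lock();
if (percpu_ref_tryget_live(&q->q_usage_counter)) {
	if (pm || !blk_queue_pm_only(q))	/* pm_only still 0, so a normal  */
		success = true;			/* request gets through ...      */
	else
		percpu_ref_put(&q->q_usage_counter);
}
rcu_read_unlock();
/* ... but blk_mq_get_request() has not run yet, so no tag is held. */

/* Suspend side: blk_pre_runtime_suspend() */
blk_set_pm_only(q);
ret = -EBUSY;
if (blk_requests_in_flight(q) == 0) {		/* counts tagged requests only   */
	synchronize_rcu();
	if (blk_requests_in_flight(q) == 0)	/* still 0: the submitter above  */
		ret = 0;			/* holds no tag, suspend proceeds */
}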

Thanks
Jianchao


Re: [PATCH blktests 0/3] Add NVMeOF multipath tests

2018-09-18 Thread Omar Sandoval
On Tue, Sep 18, 2018 at 05:02:47PM -0700, Bart Van Assche wrote:
> On 9/18/18 4:24 PM, Omar Sandoval wrote:
> > On Tue, Sep 18, 2018 at 02:20:59PM -0700, Bart Van Assche wrote:
> > > Can you have a look at the updated master branch of
> > > https://github.com/bvanassche/blktests? That code should no longer fail if
> > > unloading the nvme kernel module fails. Please note that you will need
> > > kernel v4.18 to test these scripts - a KASAN complaint appears if I run
> > > these tests against kernel v4.19-rc4.
> > 
> > Thanks, these pass now. Is it expected that my nvme device gets a
> > multipath device configured after running these tests?
> > 
> > $ lsblk
> > NAME MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
> > vda  254:00  16G  0 disk
> > └─vda1   254:10  16G  0 part  /
> > vdb  254:16   0   8G  0 disk
> > vdc  254:32   0   8G  0 disk
> > vdd  254:48   0   8G  0 disk
> > nvme0n1  259:00   8G  0 disk
> > └─mpatha 253:00   8G  0 mpath
> 
> No, all multipath devices that were created during a test should be removed
> before that test finishes. I will look into this.
> 
> > Also, can you please fix:
> > 
> > _have_kernel_option NVME_MULTIPATH && exit 1
> > 
> > to not exit on failure? It should use SKIP_REASON and return 1. You
> > might need to add something like _dont_have_kernel_option to properly
> > handle the case where the config is not found.
> 
> OK, I will change this.
> 
> > Side note which isn't a blocker for merging is that there's a lot of
> > duplicated code between these helpers and the srp helpers. How hard
> > would it be to refactor that?
> 
> Are you perhaps referring to the code that is shared between the
> tests/srp/rc and tests/nvmeof-mp/rc shell scripts?

Yes, those.

> The hardest part is probably
> to choose a location to store these functions. Should I create a file
> with common code under common/, under tests/srp/, under tests/nvmeof-mp/ or
> perhaps somewhere else?

Just put it under common.

Thanks!


Re: [PATCH blktests 0/3] Add NVMeOF multipath tests

2018-09-18 Thread Bart Van Assche

On 9/18/18 4:24 PM, Omar Sandoval wrote:

On Tue, Sep 18, 2018 at 02:20:59PM -0700, Bart Van Assche wrote:

Can you have a look at the updated master branch of
https://github.com/bvanassche/blktests? That code should no longer fail if
unloading the nvme kernel module fails. Please note that you will need
kernel v4.18 to test these scripts - a KASAN complaint appears if I run
these tests against kernel v4.19-rc4.


Thanks, these pass now. Is it expected that my nvme device gets a
multipath device configured after running these tests?

$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
vda  254:00  16G  0 disk
└─vda1   254:10  16G  0 part  /
vdb  254:16   0   8G  0 disk
vdc  254:32   0   8G  0 disk
vdd  254:48   0   8G  0 disk
nvme0n1  259:00   8G  0 disk
└─mpatha 253:00   8G  0 mpath


No, all multipath devices that were created during a test should be 
removed before that test finishes. I will look into this.



Also, can you please fix:

_have_kernel_option NVME_MULTIPATH && exit 1

to not exit on failure? It should use SKIP_REASON and return 1. You
might need to add something like _dont_have_kernel_option to properly
handle the case where the config is not found.


OK, I will change this.


Side note which isn't a blocker for merging is that there's a lot of
duplicated code between these helpers and the srp helpers. How hard
would it be to refactor that?


Are you perhaps referring to the code that is shared between the 
tests/srp/rc and tests/nvmeof-mp/rc shell scripts? The hardest part is 
probably to choose a location to store these functions. Should I 
create a file with common code under common/, under tests/srp/, under 
tests/nvmeof-mp/ or perhaps somewhere else?


Thanks,

Bart.




Re: [PATCH blktests 0/3] Add NVMeOF multipath tests

2018-09-18 Thread Omar Sandoval
On Tue, Sep 18, 2018 at 02:20:59PM -0700, Bart Van Assche wrote:
> On 8/23/18 5:21 PM, Omar Sandoval wrote:
> > On Thu, Aug 23, 2018 at 01:53:33AM +, Bart Van Assche wrote:
> > > On Tue, 2018-08-21 at 08:46 +0200, Johannes Thumshirn wrote:
> > > > On Mon, Aug 20, 2018 at 03:46:45PM +, Bart Van Assche wrote:
> > > > > Moving these tests into the nvme directory is possible but will make 
> > > > > it
> > > > > harder to run the NVMeOF multipath tests separately. Are you fine 
> > > > > with this?
> > > > 
> > > > Both ways have their upsides and downsides, I agree.
> > > > 
> > > > Having two distinct groups requires running './check nvme nvmeof-mp' to
> > > > get full coverage with nvme.
> > > > 
> > > > Having it all in one group would require running './check nvme 18 19 20
> > > > 21 22 23 24 ...' to get only the dm-mpath ones.
> > > > 
> > > > Honestly I hate both, but yours (the two distinct groups) is probably
> > > > easier to handle in the end, I have to admit.
> > > 
> > > Omar, do you have a preference for one of the two aforementioned 
> > > approaches?
> > 
> > Let's keep it in a separate category, since lots of people running nvme
> > tests probably aren't interested in testing multipath.
> > 
> > A bunch of the tests failed with
> > 
> > modprobe: FATAL: Module nvme is in use.
> > 
> > Maybe related to my test VM having an nvme device?
> 
> Hello Omar,
> 
> Can you have a look at the updated master branch of
> https://github.com/bvanassche/blktests? That code should no longer fail if
> unloading the nvme kernel module fails. Please note that you will need
> kernel v4.18 to test these scripts - a KASAN complaint appears if I run
> these tests against kernel v4.19-rc4.

Thanks, these pass now. Is it expected that my nvme device gets a
multipath device configured after running these tests?

$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
vda  254:00  16G  0 disk
└─vda1   254:10  16G  0 part  /
vdb  254:16   0   8G  0 disk
vdc  254:32   0   8G  0 disk
vdd  254:48   0   8G  0 disk
nvme0n1  259:00   8G  0 disk
└─mpatha 253:00   8G  0 mpath

Also, can you please fix:

_have_kernel_option NVME_MULTIPATH && exit 1

to not exit on failure? It should use SKIP_REASON and return 1. You
might need to add something like _dont_have_kernel_option to properly
handle the case where the config is not found.

Side note which isn't a blocker for merging is that there's a lot of
duplicated code between these helpers and the srp helpers. How hard
would it be to refactor that?


Re: [PATCH blktests 0/3] Add NVMeOF multipath tests

2018-09-18 Thread Bart Van Assche

On 8/23/18 5:21 PM, Omar Sandoval wrote:

On Thu, Aug 23, 2018 at 01:53:33AM +, Bart Van Assche wrote:

On Tue, 2018-08-21 at 08:46 +0200, Johannes Thumshirn wrote:

On Mon, Aug 20, 2018 at 03:46:45PM +, Bart Van Assche wrote:

Moving these tests into the nvme directory is possible but will make it
harder to run the NVMeOF multipath tests separately. Are you fine with this?


Both ways have their upsides and downsides, I agree.

Having two distinct groups requires running './check nvme nvmeof-mp' to
get full coverage with nvme.

Having it all in one group would require running './check nvme 18 19 20
21 22 23 24 ...' to get only the dm-mpath ones.

Honestly I hate both, but yours (the two distinct groups) is probably
easier to handle in the end, I have to admit.


Omar, do you have a preference for one of the two aforementioned approaches?


Let's keep it in a separate category, since lots of people running nvme
tests probably aren't interested in testing multipath.

A bunch of the tests failed with

modprobe: FATAL: Module nvme is in use.

Maybe related to my test VM having an nvme device?


Hello Omar,

Can you have a look at the updated master branch of 
https://github.com/bvanassche/blktests? That code should no longer fail 
if unloading the nvme kernel module fails. Please note that you will 
need kernel v4.18 to test these scripts - a KASAN complaint appears if I 
run these tests against kernel v4.19-rc4.


Thanks,

Bart.



[PATCH v8 7/8] block: Make blk_get_request() block for non-PM requests while suspended

2018-09-18 Thread Bart Van Assche
Instead of allowing requests that are not power management requests
to enter the queue in runtime suspended status (RPM_SUSPENDED), make
the blk_get_request() caller block. This change fixes a starvation
issue: it is now guaranteed that power management requests will be
executed no matter how many blk_get_request() callers are waiting.
For blk-mq, instead of maintaining the q->nr_pending counter, rely
on q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a
request finishes instead of only if the queue depth drops to zero.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c | 37 ++---
 block/blk-pm.c   | 70 
 2 files changed, 73 insertions(+), 34 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 18b874d5c9c9..ae092ca121d5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2747,30 +2747,6 @@ void blk_account_io_done(struct request *req, u64 now)
}
 }
 
-#ifdef CONFIG_PM
-/*
- * Don't process normal requests when queue is suspended
- * or in the process of suspending/resuming
- */
-static bool blk_pm_allow_request(struct request *rq)
-{
-   switch (rq->q->rpm_status) {
-   case RPM_RESUMING:
-   case RPM_SUSPENDING:
-   return rq->rq_flags & RQF_PM;
-   case RPM_SUSPENDED:
-   return false;
-   default:
-   return true;
-   }
-}
-#else
-static bool blk_pm_allow_request(struct request *rq)
-{
-   return true;
-}
-#endif
-
 void blk_account_io_start(struct request *rq, bool new_io)
 {
struct hd_struct *part;
@@ -2816,11 +2792,14 @@ static struct request *elv_next_request(struct 
request_queue *q)
 
while (1) {
list_for_each_entry(rq, &q->queue_head, queuelist) {
-   if (blk_pm_allow_request(rq))
-   return rq;
-
-   if (rq->rq_flags & RQF_SOFTBARRIER)
-   break;
+#ifdef CONFIG_PM
+   /*
+* If a request gets queued in state RPM_SUSPENDED
+* then that's a kernel bug.
+*/
+   WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED);
+#endif
+   return rq;
}
 
/*
diff --git a/block/blk-pm.c b/block/blk-pm.c
index 9b636960d285..dc9bc45b0db5 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -1,8 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#include 
 #include 
 #include 
 #include 
+#include "blk-mq.h"
+#include "blk-mq-tag.h"
 
 /**
  * blk_pm_runtime_init - Block layer runtime PM initialization routine
@@ -40,6 +43,34 @@ void blk_pm_runtime_init(struct request_queue *q, struct 
device *dev)
 }
 EXPORT_SYMBOL(blk_pm_runtime_init);
 
+struct in_flight_data {
+   struct request_queue*q;
+   int in_flight;
+};
+
+static void blk_count_in_flight(struct blk_mq_hw_ctx *hctx, struct request *rq,
+   void *priv, bool reserved)
+{
+   struct in_flight_data *in_flight = priv;
+
+   if (rq->q == in_flight->q)
+   in_flight->in_flight++;
+}
+
+/*
+ * Count the number of requests that are in flight for request queue @q. Use
+ * @q->nr_pending for legacy queues. Iterate over the tag set for blk-mq
+ * queues.
+ */
+static int blk_requests_in_flight(struct request_queue *q)
+{
+   struct in_flight_data in_flight = { .q = q };
+
+   if (q->mq_ops)
+   blk_mq_queue_rq_iter(q, blk_count_in_flight, &in_flight);
+   return q->nr_pending + in_flight.in_flight;
+}
+
 /**
  * blk_pre_runtime_suspend - Pre runtime suspend check
  * @q: the queue of the device
@@ -68,14 +99,38 @@ int blk_pre_runtime_suspend(struct request_queue *q)
if (!q->dev)
return ret;
 
+   WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
+
+   blk_set_pm_only(q);
+   /*
+* This function only gets called if the most recent request activity
+* occurred at least autosuspend_delay_ms ago. Since blk_queue_enter()
+* is called by the request allocation code before
+* pm_request_resume(), if no requests have a tag assigned it is safe
+* to suspend the device.
+*/
+   ret = -EBUSY;
+   if (blk_requests_in_flight(q) == 0) {
+   /*
+* Call synchronize_rcu() such that later blk_queue_enter()
+* calls see the pm-only state. See also
+* http://lwn.net/Articles/573497/.
+*/
+   synchronize_rcu();
+   if (blk_requests_in_flight(q) == 0)
+   ret = 0;
+   }
+
spin_lock_irq(q->queue_lock);
-   if (q->nr_pending) {
-   ret = -EBUSY;
+   if (ret < 0)

[PATCH v8 8/8] blk-mq: Enable support for runtime power management

2018-09-18 Thread Bart Van Assche
Now that the blk-mq core processes power management requests
(marked with RQF_PREEMPT) in other states than RPM_ACTIVE, enable
runtime power management for blk-mq.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-mq.c | 2 ++
 block/blk-pm.c | 6 --
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 85a1c1a59c72..20fdd78b75c7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -33,6 +33,7 @@
 #include "blk-mq.h"
 #include "blk-mq-debugfs.h"
 #include "blk-mq-tag.h"
+#include "blk-pm.h"
 #include "blk-stat.h"
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
@@ -475,6 +476,7 @@ static void __blk_mq_free_request(struct request *rq)
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
const int sched_tag = rq->internal_tag;
 
+   blk_pm_mark_last_busy(rq);
if (rq->tag != -1)
blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
if (sched_tag != -1)
diff --git a/block/blk-pm.c b/block/blk-pm.c
index dc9bc45b0db5..2a7a9c0f5b99 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -30,12 +30,6 @@
  */
 void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
 {
-   /* Don't enable runtime PM for blk-mq until it is ready */
-   if (q->mq_ops) {
-   pm_runtime_disable(dev);
-   return;
-   }
-
q->dev = dev;
q->rpm_status = RPM_ACTIVE;
pm_runtime_set_autosuspend_delay(q->dev, -1);
-- 
2.18.0
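
For context, a driver that wants request-based runtime PM opts in through
blk_pm_runtime_init(), whose documentation is moved into block/blk-pm.c by
patch 3/8. The following is a hypothetical driver-side sketch; the helper name
and the 5-second delay are invented for illustration:

#include <linux/blk-pm.h>
#include <linux/pm_runtime.h>

static void example_enable_runtime_pm(struct request_queue *q,
				      struct device *dev)
{
	/*
	 * blk_pm_runtime_init() sets q->dev, marks the queue RPM_ACTIVE and
	 * enables autosuspend with a delay of -1, i.e. runtime suspend stays
	 * impossible until a real delay is configured by the driver or by
	 * user space.
	 */
	blk_pm_runtime_init(q, dev);

	/* Pick an idle delay so that runtime suspend can actually happen. */
	pm_runtime_set_autosuspend_delay(dev, 5000);
}

User space can alternatively set the delay through the device's
power/autosuspend_delay_ms attribute.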



[PATCH v8 6/8] block: Schedule runtime resume earlier

2018-09-18 Thread Bart Van Assche
Instead of scheduling runtime resume of a request queue after a
request has been queued, schedule asynchronous resume during request
allocation. The new pm_request_resume() calls occur after
blk_queue_enter() has increased the q_usage_counter request queue
member. This change is needed for a later patch that will make request
allocation block while the queue status is not RPM_ACTIVE.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c | 2 ++
 block/elevator.c | 1 -
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fd91e9bf2893..18b874d5c9c9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -942,6 +942,8 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
if (success)
return 0;
 
+   blk_pm_request_resume(q);
+
if (flags & BLK_MQ_REQ_NOWAIT)
return -EBUSY;
 
diff --git a/block/elevator.c b/block/elevator.c
index 1c992bf6cfb1..e18ac68626e3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -601,7 +601,6 @@ void __elv_add_request(struct request_queue *q, struct 
request *rq, int where)
trace_block_rq_insert(q, rq);
 
blk_pm_add_request(q, rq);
-   blk_pm_request_resume(q);
 
rq->q = q;
 
-- 
2.18.0



[PATCH v8 3/8] block: Move power management code into a new source file

2018-09-18 Thread Bart Van Assche
Move the code for runtime power management from blk-core.c into the
new source file blk-pm.c. Move the corresponding declarations from
 into . For CONFIG_PM=n, leave out
the declarations of the functions that are not used in that mode.
This patch not only reduces the number of #ifdefs in the block layer
core code but also reduces the size of header file 
and hence should help to reduce the build time of the Linux kernel
if CONFIG_PM is not defined.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/Kconfig  |   3 +
 block/Makefile |   1 +
 block/blk-core.c   | 196 +
 block/blk-pm.c | 188 +++
 block/blk-pm.h |  43 +
 block/elevator.c   |  22 +
 drivers/scsi/scsi_pm.c |   1 +
 drivers/scsi/sd.c  |   1 +
 drivers/scsi/sr.c  |   1 +
 include/linux/blk-pm.h |  24 +
 include/linux/blkdev.h |  23 -
 11 files changed, 264 insertions(+), 239 deletions(-)
 create mode 100644 block/blk-pm.c
 create mode 100644 block/blk-pm.h
 create mode 100644 include/linux/blk-pm.h

diff --git a/block/Kconfig b/block/Kconfig
index 1f2469a0123c..85263e7bded6 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -228,4 +228,7 @@ config BLK_MQ_RDMA
depends on BLOCK && INFINIBAND
default y
 
+config BLK_PM
+   def_bool BLOCK && PM
+
 source block/Kconfig.iosched
diff --git a/block/Makefile b/block/Makefile
index 572b33f32c07..27eac600474f 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -37,3 +37,4 @@ obj-$(CONFIG_BLK_WBT) += blk-wbt.o
 obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o
 obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o
 obj-$(CONFIG_BLK_SED_OPAL) += sed-opal.o
+obj-$(CONFIG_BLK_PM)   += blk-pm.o
diff --git a/block/blk-core.c b/block/blk-core.c
index 4dbc93f43b38..6d4dd176bd9d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -42,6 +42,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
+#include "blk-pm.h"
 #include "blk-rq-qos.h"
 
 #ifdef CONFIG_DEBUG_FS
@@ -1726,16 +1727,6 @@ void part_round_stats(struct request_queue *q, int cpu, 
struct hd_struct *part)
 }
 EXPORT_SYMBOL_GPL(part_round_stats);
 
-#ifdef CONFIG_PM
-static void blk_pm_put_request(struct request *rq)
-{
-   if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
-   pm_runtime_mark_last_busy(rq->q->dev);
-}
-#else
-static inline void blk_pm_put_request(struct request *rq) {}
-#endif
-
 void __blk_put_request(struct request_queue *q, struct request *req)
 {
req_flags_t rq_flags = req->rq_flags;
@@ -3757,191 +3748,6 @@ void blk_finish_plug(struct blk_plug *plug)
 }
 EXPORT_SYMBOL(blk_finish_plug);
 
-#ifdef CONFIG_PM
-/**
- * blk_pm_runtime_init - Block layer runtime PM initialization routine
- * @q: the queue of the device
- * @dev: the device the queue belongs to
- *
- * Description:
- *Initialize runtime-PM-related fields for @q and start auto suspend for
- *@dev. Drivers that want to take advantage of request-based runtime PM
- *should call this function after @dev has been initialized, and its
- *request queue @q has been allocated, and runtime PM for it can not happen
- *yet(either due to disabled/forbidden or its usage_count > 0). In most
- *cases, driver should call this function before any I/O has taken place.
- *
- *This function takes care of setting up using auto suspend for the device,
- *the autosuspend delay is set to -1 to make runtime suspend impossible
- *until an updated value is either set by user or by driver. Drivers do
- *not need to touch other autosuspend settings.
- *
- *The block layer runtime PM is request based, so only works for drivers
- *that use request as their IO unit instead of those directly use bio's.
- */
-void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
-{
-   /* Don't enable runtime PM for blk-mq until it is ready */
-   if (q->mq_ops) {
-   pm_runtime_disable(dev);
-   return;
-   }
-
-   q->dev = dev;
-   q->rpm_status = RPM_ACTIVE;
-   pm_runtime_set_autosuspend_delay(q->dev, -1);
-   pm_runtime_use_autosuspend(q->dev);
-}
-EXPORT_SYMBOL(blk_pm_runtime_init);
-
-/**
- * blk_pre_runtime_suspend - Pre runtime suspend check
- * @q: the queue of the device
- *
- * Description:
- *This function will check if runtime suspend is allowed for the device
- *by examining if there are any requests pending in the queue. If there
- *are requests pending, the device can not be runtime suspended; otherwise,
- *the queue's status will be updated to SUSPENDING and the driver can
- *proceed to suspend the device.
- *
- *For the not allowed case, we mark last busy for the device so that
- *runtime PM core will try to 

[PATCH v8 1/8] blk-mq: Document the functions that iterate over requests

2018-09-18 Thread Bart Van Assche
Make it easier to understand the purpose of the functions that iterate
over requests by documenting their purpose. Fix two minor spelling
mistakes in comments in these functions.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-mq-tag.c | 28 ++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 94e1ed667b6e..ef3acb4a80e0 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -232,13 +232,19 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int 
bitnr, void *data)
 
/*
 * We can hit rq == NULL here, because the tagging functions
-* test and set the bit before assining ->rqs[].
+* test and set the bit before assigning ->rqs[].
 */
if (rq && rq->q == hctx->queue)
iter_data->fn(hctx, rq, iter_data->data, reserved);
return true;
 }
 
+/*
+ * Call function @fn(@hctx, rq, @data, @reserved) for each request queued on
+ * @hctx that has been assigned a driver tag. @reserved indicates whether @bt
+ * is the breserved_tags member or the bitmap_tags member of struct
+ * blk_mq_tags.
+ */
 static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
busy_iter_fn *fn, void *data, bool reserved)
 {
@@ -280,6 +286,11 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned 
int bitnr, void *data)
return true;
 }
 
+/*
+ * Call function @fn(rq, @data, @reserved) for each request in @tags that has
+ * been started. @reserved indicates whether @bt is the breserved_tags member
+ * or the bitmap_tags member of struct blk_mq_tags.
+ */
 static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue 
*bt,
 busy_tag_iter_fn *fn, void *data, bool reserved)
 {
@@ -294,6 +305,10 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, 
struct sbitmap_queue *bt,
sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data);
 }
 
+/*
+ * Call @fn(rq, @priv, reserved) for each started request in @tags. 'reserved'
+ * indicates whether or not 'rq' is a reserved request.
+ */
 static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
busy_tag_iter_fn *fn, void *priv)
 {
@@ -302,6 +317,10 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags 
*tags,
bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
 }
 
+/*
+ * Call @fn(rq, @priv, reserved) for each request in @tagset. 'reserved'
+ * indicates whether or not 'rq' is a reserved request.
+ */
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
busy_tag_iter_fn *fn, void *priv)
 {
@@ -314,6 +333,11 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
 
+/*
+ * Call @fn(rq, @priv, reserved) for each request associated with request
+ * queue @q or any queue it shares tags with and that has been assigned a
+ * driver tag. 'reserved' indicates whether or not 'rq' is a reserved request.
+ */
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
void *priv)
 {
@@ -337,7 +361,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, 
busy_iter_fn *fn,
struct blk_mq_tags *tags = hctx->tags;
 
/*
-* If not software queues are currently mapped to this
+* If no software queues are currently mapped to this
 * hardware queue, there's nothing to check
 */
if (!blk_mq_hw_queue_mapped(hctx))
-- 
2.18.0



[PATCH v8 5/8] block: Split blk_pm_add_request() and blk_pm_put_request()

2018-09-18 Thread Bart Van Assche
Move the pm_request_resume() and pm_runtime_mark_last_busy() calls into
two new functions and thereby separate legacy block layer code from code
that works for both the legacy block layer and blk-mq. A later patch will
add calls to the new functions in the blk-mq code.

Signed-off-by: Bart Van Assche 
Cc: Martin K. Petersen 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c |  1 +
 block/blk-pm.h   | 36 +++-
 block/elevator.c |  1 +
 3 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1a691f5269bb..fd91e9bf2893 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1744,6 +1744,7 @@ void __blk_put_request(struct request_queue *q, struct 
request *req)
 
blk_req_zone_write_unlock(req);
blk_pm_put_request(req);
+   blk_pm_mark_last_busy(req);
 
elv_completed_request(q, req);
 
diff --git a/block/blk-pm.h b/block/blk-pm.h
index 1ffc8ef203ec..a8564ea72a41 100644
--- a/block/blk-pm.h
+++ b/block/blk-pm.h
@@ -6,8 +6,23 @@
 #include 
 
 #ifdef CONFIG_PM
+static inline void blk_pm_request_resume(struct request_queue *q)
+{
+   if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
+  q->rpm_status == RPM_SUSPENDING))
+   pm_request_resume(q->dev);
+}
+
+static inline void blk_pm_mark_last_busy(struct request *rq)
+{
+   if (rq->q->dev && !(rq->rq_flags & RQF_PM))
+   pm_runtime_mark_last_busy(rq->q->dev);
+}
+
 static inline void blk_pm_requeue_request(struct request *rq)
 {
+   lockdep_assert_held(rq->q->queue_lock);
+
if (rq->q->dev && !(rq->rq_flags & RQF_PM))
rq->q->nr_pending--;
 }
@@ -15,17 +30,28 @@ static inline void blk_pm_requeue_request(struct request 
*rq)
 static inline void blk_pm_add_request(struct request_queue *q,
  struct request *rq)
 {
-   if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 &&
-   (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
-   pm_request_resume(q->dev);
+   lockdep_assert_held(q->queue_lock);
+
+   if (q->dev && !(rq->rq_flags & RQF_PM))
+   q->nr_pending++;
 }
 
 static inline void blk_pm_put_request(struct request *rq)
 {
-   if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
-   pm_runtime_mark_last_busy(rq->q->dev);
+   lockdep_assert_held(rq->q->queue_lock);
+
+   if (rq->q->dev && !(rq->rq_flags & RQF_PM))
+   --rq->q->nr_pending;
 }
 #else
+static inline void blk_pm_request_resume(struct request_queue *q)
+{
+}
+
+static inline void blk_pm_mark_last_busy(struct request *rq)
+{
+}
+
 static inline void blk_pm_requeue_request(struct request *rq)
 {
 }
diff --git a/block/elevator.c b/block/elevator.c
index e18ac68626e3..1c992bf6cfb1 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -601,6 +601,7 @@ void __elv_add_request(struct request_queue *q, struct 
request *rq, int where)
trace_block_rq_insert(q, rq);
 
blk_pm_add_request(q, rq);
+   blk_pm_request_resume(q);
 
rq->q = q;
 
-- 
2.18.0



[PATCH v8 4/8] block, scsi: Change the preempt-only flag into a counter

2018-09-18 Thread Bart Van Assche
The RQF_PREEMPT flag is used for three purposes:
- In the SCSI core, for making sure that power management requests
  are executed even if a device is in the "quiesced" state.
- For domain validation by SCSI drivers that use the parallel port.
- In the IDE driver, for IDE preempt requests.
Rename "preempt-only" into "pm-only" because the primary purpose of
this mode is power management. Since the power management core may
but does not have to resume a runtime suspended device before
performing system-wide suspend and since a later patch will set
"pm-only" mode as long as a block device is runtime suspended, make
it possible to set "pm-only" mode from more than one context. Since
with this change scsi_device_quiesce() is no longer idempotent, make
that function return early if it is called for a quiesced queue.

Signed-off-by: Bart Van Assche 
Cc: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-core.c| 35 ++-
 block/blk-mq-debugfs.c  | 10 +-
 drivers/scsi/scsi_lib.c | 11 +++
 include/linux/blkdev.h  | 14 +-
 4 files changed, 43 insertions(+), 27 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6d4dd176bd9d..1a691f5269bb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -422,24 +422,25 @@ void blk_sync_queue(struct request_queue *q)
 EXPORT_SYMBOL(blk_sync_queue);
 
 /**
- * blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY
+ * blk_set_pm_only - increment pm_only counter
  * @q: request queue pointer
- *
- * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not
- * set and 1 if the flag was already set.
  */
-int blk_set_preempt_only(struct request_queue *q)
+void blk_set_pm_only(struct request_queue *q)
 {
-   return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q);
+   atomic_inc(&q->pm_only);
 }
-EXPORT_SYMBOL_GPL(blk_set_preempt_only);
+EXPORT_SYMBOL_GPL(blk_set_pm_only);
 
-void blk_clear_preempt_only(struct request_queue *q)
+void blk_clear_pm_only(struct request_queue *q)
 {
-   blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
-   wake_up_all(&q->mq_freeze_wq);
+   int pm_only;
+
+   pm_only = atomic_dec_return(&q->pm_only);
+   WARN_ON_ONCE(pm_only < 0);
+   if (pm_only == 0)
+   wake_up_all(&q->mq_freeze_wq);
 }
-EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
+EXPORT_SYMBOL_GPL(blk_clear_pm_only);
 
 /**
  * __blk_run_queue_uncond - run a queue whether or not it has been stopped
@@ -918,7 +919,7 @@ EXPORT_SYMBOL(blk_alloc_queue);
  */
 int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
-   const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
+   const bool pm = flags & BLK_MQ_REQ_PREEMPT;
 
while (true) {
bool success = false;
@@ -926,11 +927,11 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
rcu_read_lock();
if (percpu_ref_tryget_live(&q->q_usage_counter)) {
/*
-* The code that sets the PREEMPT_ONLY flag is
-* responsible for ensuring that that flag is globally
-* visible before the queue is unfrozen.
+* The code that increments the pm_only counter is
+* responsible for ensuring that that counter is
+* globally visible before the queue is unfrozen.
 */
-   if (preempt || !blk_queue_preempt_only(q)) {
+   if (pm || !blk_queue_pm_only(q)) {
success = true;
} else {
percpu_ref_put(&q->q_usage_counter);
@@ -955,7 +956,7 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
 
wait_event(q->mq_freeze_wq,
   (atomic_read(&q->mq_freeze_depth) == 0 &&
-   (preempt || !blk_queue_preempt_only(q))) ||
+   (pm || !blk_queue_pm_only(q))) ||
   blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index cb1e6cf7ac48..a5ea86835fcb 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -102,6 +102,14 @@ static int blk_flags_show(struct seq_file *m, const 
unsigned long flags,
return 0;
 }
 
+static int queue_pm_only_show(void *data, struct seq_file *m)
+{
+   struct request_queue *q = data;
+
+   seq_printf(m, "%d\n", atomic_read(&q->pm_only));
+   return 0;
+}
+
 #define QUEUE_FLAG_NAME(name) [QUEUE_FLAG_##name] = #name
 static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(QUEUED),
@@ -132,7 +140,6 @@ static const char *const 

[PATCH v8 2/8] blk-mq: Introduce blk_mq_queue_rq_iter()

2018-09-18 Thread Bart Van Assche
This function will be used in the patch "Make blk_get_request() block for
non-PM requests while suspended".

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Jianchao Wang 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
---
 block/blk-mq-tag.c | 44 
 block/blk-mq-tag.h |  2 ++
 2 files changed, 46 insertions(+)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index ef3acb4a80e0..cf8537017f78 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -374,6 +374,50 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, 
busy_iter_fn *fn,
rcu_read_unlock();
 }
 
+/*
+ * Call @fn(rq, @priv, reserved) for each request associated with request
+ * queue @q or any queue that it shares tags with and that has been assigned a
+ * tag. 'reserved' indicates whether or not 'rq' is a reserved request. In
+ * contrast to blk_mq_queue_tag_busy_iter(), if an I/O scheduler has been
+ * associated with @q, this function also iterates over requests that have
+ * been assigned a scheduler tag but that have not yet been assigned a driver
+ * tag.
+ */
+void blk_mq_queue_rq_iter(struct request_queue *q, busy_iter_fn *fn, void 
*priv)
+{
+   struct blk_mq_hw_ctx *hctx;
+   int i;
+
+   /*
+* __blk_mq_update_nr_hw_queues() updates nr_hw_queues and queue_hw_ctx
+* after freezing the queue, so q_usage_counter can be used to avoid
+* racing with it. __blk_mq_update_nr_hw_queues() uses synchronize_rcu()
+* to ensure that all users leave the critical section below and see a
+* zeroed q_usage_counter.
+*/
+   rcu_read_lock();
+   if (percpu_ref_is_zero(&q->q_usage_counter)) {
+   rcu_read_unlock();
+   return;
+   }
+
+   queue_for_each_hw_ctx(q, hctx, i) {
+   struct blk_mq_tags *tags = hctx->sched_tags ? : hctx->tags;
+
+   /*
+* If no software queues are currently mapped to this
+* hardware queue, there's nothing to check
+*/
+   if (!blk_mq_hw_queue_mapped(hctx))
+   continue;
+
+   if (tags->nr_reserved_tags)
+   bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
+   bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
+   }
+   rcu_read_unlock();
+}
+
 static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
bool round_robin, int node)
 {
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..25e62997ed6c 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -35,6 +35,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
void *priv);
+void blk_mq_queue_rq_iter(struct request_queue *q, busy_iter_fn *fn,
+ void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 struct blk_mq_hw_ctx *hctx)
-- 
2.18.0



[PATCH v8 0/8] blk-mq: Implement runtime power management

2018-09-18 Thread Bart Van Assche
Hello Jens,

One of the pieces that is missing before blk-mq can be made the default is
implementing runtime power management support for blk-mq.  This patch series
not only implements runtime power management for blk-mq but also fixes a
starvation issue in the power management code for the legacy block
layer. Please consider this patch series for the upstream kernel.

Thanks,

Bart.

Changes compared to v7:
- Addressed Jianchao's feedback about patch "Make blk_get_request() block for
  non-PM requests while suspended".
- Added two new patches - one that documents the functions that iterate over
  requests and one that introduces a new function that iterates over all
  requests associated with a queue.

Changes compared to v6:
- Left out the patches that split RQF_PREEMPT in three flags.
- Left out the patch that introduces the SCSI device state SDEV_SUSPENDED.
- Left out the patch that introduces blk_pm_runtime_exit().
- Restored the patch that changes the PREEMPT_ONLY flag into a counter.

Changes compared to v5:
- Introduced a new flag RQF_DV that replaces RQF_PREEMPT for SCSI domain
  validation.
- Introduced a new request queue state QUEUE_FLAG_DV_ONLY for SCSI domain
  validation.
- Instead of using SDEV_QUIESCE for both runtime suspend and SCSI domain
  validation, use that state for domain validation only and introduce a new
  state for runtime suspend, namely SDEV_SUSPENDED.
- Reallow system suspend during SCSI domain validation.
- Moved the runtime resume call from the request allocation code into
  blk_queue_enter().
- Instead of relying on q_usage_counter, iterate over the tag set to determine
  whether or not any requests are in flight.

Changes compared to v4:
- Dropped the patches "Give RQF_PREEMPT back its original meaning" and
  "Serialize queue freezing and blk_pre_runtime_suspend()".
- Replaced "percpu_ref_read()" with "percpu_is_in_use()".
- Inserted pm_request_resume() calls in the block layer request allocation
  code such that the context that submits a request no longer has to call
  pm_runtime_get().

Changes compared to v3:
- Avoid adverse interactions between system-wide suspend/resume and runtime
  power management by changing the PREEMPT_ONLY flag into a counter.
- Give RQF_PREEMPT back its original meaning, namely that it is only set for
  ide_preempt requests.
- Remove the flag BLK_MQ_REQ_PREEMPT.
- Removed the pm_request_resume() call.

Changes compared to v2:
- Fixed the build for CONFIG_BLOCK=n.
- Added a patch that introduces percpu_ref_read() in the percpu-counter code.
- Added a patch that makes it easier to detect missing pm_runtime_get*() calls.
- Addressed Jianchao's feedback including the comment about runtime overhead
  of switching a per-cpu counter to atomic mode.

Changes compared to v1:
- Moved the runtime power management code into a separate file.
- Addressed Ming's feedback.

Bart Van Assche (6):
  block: Move power management code into a new source file
  block, scsi: Change the preempt-only flag into a counter
  block: Split blk_pm_add_request() and blk_pm_put_request()
  block: Schedule runtime resume earlier
  block: Make blk_get_request() block for non-PM requests while
suspended
  blk-mq: Enable support for runtime power management

Bart Van Assche (8):
  blk-mq: Document the functions that iterate over requests
  blk-mq: Introduce blk_mq_queue_rq_iter()
  block: Move power management code into a new source file
  block, scsi: Change the preempt-only flag into a counter
  block: Split blk_pm_add_request() and blk_pm_put_request()
  block: Schedule runtime resume earlier
  block: Make blk_get_request() block for non-PM requests while
suspended
  blk-mq: Enable support for runtime power management

 block/Kconfig   |   3 +
 block/Makefile  |   1 +
 block/blk-core.c| 271 +---
 block/blk-mq-debugfs.c  |  10 +-
 block/blk-mq-tag.c  |  72 ++-
 block/blk-mq-tag.h  |   2 +
 block/blk-mq.c  |   2 +
 block/blk-pm.c  | 242 +++
 block/blk-pm.h  |  69 ++
 block/elevator.c|  22 +---
 drivers/scsi/scsi_lib.c |  11 +-
 drivers/scsi/scsi_pm.c  |   1 +
 drivers/scsi/sd.c   |   1 +
 drivers/scsi/sr.c   |   1 +
 include/linux/blk-pm.h  |  24 
 include/linux/blkdev.h  |  37 ++
 16 files changed, 472 insertions(+), 297 deletions(-)
 create mode 100644 block/blk-pm.c
 create mode 100644 block/blk-pm.h
 create mode 100644 include/linux/blk-pm.h

-- 
2.18.0



Re: [PATCH v7 5/6] block: Make blk_get_request() block for non-PM requests while suspended

2018-09-18 Thread Bart Van Assche

On 9/17/18 7:39 PM, jianchao.wang wrote:

On 09/18/2018 04:15 AM, Bart Van Assche wrote:

Instead of allowing requests that are not power management requests
to enter the queue in runtime suspended status (RPM_SUSPENDED), make
the blk_get_request() caller block. This change fixes a starvation
issue: it is now guaranteed that power management requests will be
executed no matter how many blk_get_request() callers are waiting.
Instead of maintaining the q->nr_pending counter, rely on
q->q_usage_counter.


Looks like we still depend on this nr_pending for blk-legacy.


That's right. I will update the commit message.


blk_mq_queue_tag_busy_iter() only accounts for driver tags. This will only work
without an I/O scheduler.


+
  /**
   * blk_pre_runtime_suspend - Pre runtime suspend check
   * @q: the queue of the device
@@ -68,14 +101,38 @@ int blk_pre_runtime_suspend(struct request_queue *q)
if (!q->dev)
return ret;
  
+	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);

+
+   blk_set_pm_only(q);
+   /*
+* This function only gets called if the most recent
+* pm_request_resume() call occurred at least autosuspend_delay_ms

^^^
pm_runtime_mark_last_busy ?


Since every pm_request_resume() call from the block layer is followed by 
a pm_runtime_mark_last_busy() call and since the latter is called later 
I think you are right. I will update the comment.



+* ago. Since blk_queue_enter() is called by the request allocation
+* code before pm_request_resume(), if no requests have a tag assigned
+* it is safe to suspend the device.
+*/
+   ret = -EBUSY;
+   if (blk_requests_in_flight(q) == 0) {
+   /*
+* Call synchronize_rcu() such that later blk_queue_enter()
+* calls see the pm-only state. See also
+* http://lwn.net/Articles/573497/.
+*/
+   synchronize_rcu();
+   if (blk_requests_in_flight(q) == 0)


Seems not safe here.

For blk-mq path:
Someone may have escaped the preempt check, missed the
blk_pm_request_resume() there, and entered generic_make_request(),
but not yet allocated a request or occupied any tag.

There could be a similar scenario for the blk-legacy path, where
q->nr_pending is only increased when the request is queued.

So I guess the q_usage_counter checking is still needed here.


There is only one blk_pm_request_resume() call and that call is inside 
blk_queue_enter() after the pm_only counter has been checked.


For the legacy block layer, nr_pending is increased after the 
blk_queue_enter() call from inside blk_old_get_request() succeeded.


So I don't see how blk_pm_request_resume() or q->nr_pending++ could 
escape from the preempt checking?


Thanks,

Bart.


Re: [RFC PATCH 0/6] lightnvm: pblk: Introduce RAIL to enforce low tail read latency

2018-09-18 Thread Heiner Litz
Hi Hans,
thanks a lot for your comments! I will send you a git repo to test. I
have a patch which enables/disables RAIL via ioctl and will send that
as well.

Heiner
On Tue, Sep 18, 2018 at 4:46 AM Hans Holmberg
 wrote:
>
> On Mon, Sep 17, 2018 at 7:29 AM Heiner Litz  wrote:
> >
> > Hi All,
> > this patchset introduces RAIL, a mechanism to enforce low tail read latency 
> > for
> > lightnvm OCSSD devices. RAIL leverages redundancy to guarantee that reads 
> > are
> > always served from LUNs that do not serve a high latency operation such as a
> > write or erase. This avoids reads becoming serialized behind these
> > operations, reducing tail latency by ~10x. In particular, in the absence of ECC read 
> > errors,
> > it provides 99.99 percentile read latencies of below 500us. RAIL introduces
> > capacity overheads (7%-25%) due to RAID-5 like striping (providing fault
> > tolerance) and reduces the maximum write bandwidth to 110K IOPS on CNEX SSD.
> >
> > This patch is based on pblk/core and requires two additional patches from 
> > Javier
> > to be applicable (let me know if you want me to rebase):
>
> As the patches do not apply, could you make a branch available so I
> can get hold of the code in it's present state?
> That would make reviewing and testing so much easier.
>
> I have some concerns regarding recovery and write error handling, but
> I have not found anything that can't be fixed.
> I also believe that rail on/off and stride width should not be
> configured at build-time, but instead be part of the create IOCTL.
>
> See my comments on the individual patches for details.
>
> >
> > The 1st patch exposes some existing APIs so they can be used by RAIL
> > The 2nd patch introduces a configurable sector mapping function
> > The 3rd patch refactors the write path so the end_io_fn can be specified 
> > when
> > setting up the request
> > The 4th patch adds a new submit io function that acquires the write 
> > semaphore
> > The 5th patch introduces the RAIL feature and its API
> > The 6th patch integrates RAIL into pblk's read and write path
> >
> >


Re: [RFC PATCH 5/6] lightnvm: pblk: Add RAIL interface

2018-09-18 Thread Heiner Litz
On Tue, Sep 18, 2018 at 4:28 AM Hans Holmberg
 wrote:
>
> On Mon, Sep 17, 2018 at 7:30 AM Heiner Litz  wrote:
> >
> > In preparation for supporting RAIL, add the RAIL API.
> >
> > Signed-off-by: Heiner Litz 
> > ---
> >  drivers/lightnvm/pblk-rail.c | 808 +++
> >  drivers/lightnvm/pblk.h  |  63 ++-
> >  2 files changed, 870 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/lightnvm/pblk-rail.c
> >
> > diff --git a/drivers/lightnvm/pblk-rail.c b/drivers/lightnvm/pblk-rail.c
> > new file mode 100644
> > index ..a48ed31a0ba9
> > --- /dev/null
> > +++ b/drivers/lightnvm/pblk-rail.c
> > @@ -0,0 +1,808 @@
> > +/*
> > + * Copyright (C) 2018 Heiner Litz
> > + * Initial release: Heiner Litz 
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License version
> > + * 2 as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful, but
> > + * WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * General Public License for more details.
> > + *
> > + * pblk-rail.c - pblk's RAIL path
> > + */
> > +
> > +#include "pblk.h"
> > +
> > +#define PBLK_RAIL_EMPTY ~0x0
> This constant is not being used.

thanks, will remove

> > +#define PBLK_RAIL_PARITY_WRITE 0x8000
> Where does this magic number come from? Please document.

ok , will document

>
> > +
> > +/* RAIL auxiliary functions */
> > +static unsigned int pblk_rail_nr_parity_luns(struct pblk *pblk)
> > +{
> > +   struct pblk_line_meta *lm = &pblk->lm;
> > +
> > +   return lm->blk_per_line / PBLK_RAIL_STRIDE_WIDTH;
> > +}
> > +
> > +static unsigned int pblk_rail_nr_data_luns(struct pblk *pblk)
> > +{
> > +   struct pblk_line_meta *lm = &pblk->lm;
> > +
> > +   return lm->blk_per_line - pblk_rail_nr_parity_luns(pblk);
> > +}
> > +
> > +static unsigned int pblk_rail_sec_per_stripe(struct pblk *pblk)
> > +{
> > +   struct pblk_line_meta *lm = &pblk->lm;
> > +
> > +   return lm->blk_per_line * pblk->min_write_pgs;
> > +}
> > +
> > +static unsigned int pblk_rail_psec_per_stripe(struct pblk *pblk)
> > +{
> > +   return pblk_rail_nr_parity_luns(pblk) * pblk->min_write_pgs;
> > +}
> > +
> > +static unsigned int pblk_rail_dsec_per_stripe(struct pblk *pblk)
> > +{
> > +   return pblk_rail_sec_per_stripe(pblk) - 
> > pblk_rail_psec_per_stripe(pblk);
> > +}
> > +
> > +static unsigned int pblk_rail_wrap_lun(struct pblk *pblk, unsigned int lun)
> > +{
> > +   struct pblk_line_meta *lm = &pblk->lm;
> > +
> > +   return (lun & (lm->blk_per_line - 1));
> > +}
> > +
> > +bool pblk_rail_meta_distance(struct pblk_line *data_line)
> > +{
> > +   return (data_line->meta_distance % PBLK_RAIL_STRIDE_WIDTH) == 0;
> > +}
> > +
> > +/* Notify readers that LUN is serving high latency operation */
> > +static void pblk_rail_notify_reader_down(struct pblk *pblk, int lun)
> > +{
> > +   WARN_ON(test_and_set_bit(lun, pblk->rail.busy_bitmap));
> > +   /* Make sure that busy bit is seen by reader before proceeding */
> > +   smp_mb__after_atomic();
> > +}
> > +
> > +static void pblk_rail_notify_reader_up(struct pblk *pblk, int lun)
> > +{
> > +   /* Make sure that write is completed before releasing busy bit */
> > +   smp_mb__before_atomic();
> > +   WARN_ON(!test_and_clear_bit(lun, pblk->rail.busy_bitmap));
> > +}
> > +
> > +int pblk_rail_lun_busy(struct pblk *pblk, struct ppa_addr ppa)
> > +{
> > +   struct nvm_tgt_dev *dev = pblk->dev;
> > +   struct nvm_geo *geo = &dev->geo;
> > +   int lun_pos = pblk_ppa_to_pos(geo, ppa);
> > +
> > +   return test_bit(lun_pos, pblk->rail.busy_bitmap);
> > +}
> > +
> > +/* Enforces one writer per stride */
> > +int pblk_rail_down_stride(struct pblk *pblk, int lun_pos, int timeout)
> > +{
> > +   struct pblk_lun *rlun;
> > +   int strides = pblk_rail_nr_parity_luns(pblk);
> > +   int stride = lun_pos % strides;
> > +   int ret;
> > +
> > +   rlun = &pblk->luns[stride];
> > +   ret = down_timeout(&rlun->wr_sem, timeout);
> > +   pblk_rail_notify_reader_down(pblk, lun_pos);
> > +
> > +   return ret;
> > +}
> > +
> > +void pblk_rail_up_stride(struct pblk *pblk, int lun_pos)
> > +{
> > +   struct pblk_lun *rlun;
> > +   int strides = pblk_rail_nr_parity_luns(pblk);
> > +   int stride = lun_pos % strides;
> > +
> > +   pblk_rail_notify_reader_up(pblk, lun_pos);
> > +   rlun = &pblk->luns[stride];
> > +   up(&rlun->wr_sem);
> > +}
> > +
> > +/* Determine whether a sector holds data, meta or is bad*/
> > +bool pblk_rail_valid_sector(struct pblk *pblk, struct pblk_line *line, int 
> > pos)
> > +{
> > +   struct pblk_line_meta *lm = &pblk->lm;
> > +   struct nvm_tgt_dev *dev = pblk->dev;
> > +   struct nvm_geo *geo = &dev->geo;
> > +   struct ppa_addr ppa;
> > +   int lun;
> > +
> > +   

Re: [RFC PATCH 0/6] lightnvm: pblk: Introduce RAIL to enforce low tail read latency

2018-09-18 Thread Hans Holmberg
On Mon, Sep 17, 2018 at 7:29 AM Heiner Litz  wrote:
>
> Hi All,
> this patchset introduces RAIL, a mechanism to enforce low tail read latency 
> for
> lightnvm OCSSD devices. RAIL leverages redundancy to guarantee that reads are
> always served from LUNs that do not serve a high latency operation such as a
> write or erase. This avoids reads becoming serialized behind these
> operations, reducing tail latency by ~10x. In particular, in the absence of ECC read 
> errors,
> it provides 99.99 percentile read latencies of below 500us. RAIL introduces
> capacity overheads (7%-25%) due to RAID-5 like striping (providing fault
> tolerance) and reduces the maximum write bandwidth to 110K IOPS on CNEX SSD.
>
> This patch is based on pblk/core and requires two additional patches from 
> Javier
> to be applicable (let me know if you want me to rebase):

As the patches do not apply, could you make a branch available so I
can get hold of the code in it's present state?
That would make reviewing and testing so much easier.

I have some concerns regarding recovery and write error handling, but
I have not found anything that can't be fixed.
I also believe that rail on/off and stride width should not be
configured at build-time, but instead be part of the create IOCTL.

See my comments on the individual patches for details.

>
> The 1st patch exposes some existing APIs so they can be used by RAIL
> The 2nd patch introduces a configurable sector mapping function
> The 3rd patch refactors the write path so the end_io_fn can be specified when
> setting up the request
> The 4th patch adds a new submit io function that acquires the write semaphore
> The 5th patch introduces the RAIL feature and its API
> The 6th patch integrates RAIL into pblk's read and write path
>
>


Re: [RFC PATCH 6/6] lightnvm: pblk: Integrate RAIL

2018-09-18 Thread Hans Holmberg
On Mon, Sep 17, 2018 at 7:30 AM Heiner Litz  wrote:
>
> Integrate Redundant Array of Independent Luns (RAIL) into lightnvm. RAIL
> enforces low tail read latency by guaranteeing that reads are never
> serialized behind writes and erases to the same LUN. Whenever LUNs serve a
> high latency operation, reads are performed by recomputing the original
> utilizing redundant parity information.
> Rail trades-off read latency for capacity (redundancy) which, however, can
> be leveraged for fault tolerance.
>
> On FIO, with the kyber scheduler set to a target read latency of 500us,
> RAIL reduces tail latency percentiles (us) as follows:
>
>        Avg    90%    99%    99.9%  99.95%  99.99%
> pblk   90     1000   2200   3000   6000
> RAIL   85     100    250    400    500
>
> Signed-off-by: Heiner Litz 
> ---
>  drivers/lightnvm/Kconfig  | 10 ++
>  drivers/lightnvm/Makefile |  1 +
>  drivers/lightnvm/pblk-core.c  | 36 ++-
>  drivers/lightnvm/pblk-init.c  | 17 +
>  drivers/lightnvm/pblk-rail.c  |  1 +
>  drivers/lightnvm/pblk-rb.c|  6 ++
>  drivers/lightnvm/pblk-read.c  |  9 +
>  drivers/lightnvm/pblk-write.c |  9 +
>  drivers/lightnvm/pblk.h   |  5 +
>  9 files changed, 93 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
> index a872cd720967..165d5a29acc3 100644
> --- a/drivers/lightnvm/Kconfig
> +++ b/drivers/lightnvm/Kconfig
> @@ -35,6 +35,16 @@ config NVM_PBLK_DEBUG
>   vocal error messages, and extra tracking fields in the pblk sysfs
>   entries.
>
> +config NVM_PBLK_RAIL
> +   bool "Pblk RAIL Support"
> +   default n
> +   help
> + Enables RAIL for pblk. RAIL enforces tail read latency guarantees by
> +eliminating reads being serialized behind writes to the same LUN.
> +RAIL partitions LUNs into strides and enforces that only one LUN per
> +stride is written at a time. Reads can bypass busy LUNs by recomputing
> +requested data using parity redundancy.
> +
>  endif # NVM_PBLK_DEBUG

Having a compile-time option forces the user (or even worse,
distribution provider) to pick the rail- OR non-rail version of pblk.
It's also a pain having to re-compile and re-provision the kernel when testing.

I see no reason why this should not be dynamically handled within pblk
(rail on/off and stride width could be supplied via the create ioctl).
One would want to configure stride-width to fit a given workload in any case.

nvm_ioctl_create_extended has 16 reserved bits, so we have room for
adding RAIL parameters.
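
A purely hypothetical sketch of that idea, packing rail on/off and the stride
width into those 16 reserved bits, could look like the following. The macro and
helper names are invented for illustration and are not part of the lightnvm uapi:

#include <linux/types.h>

#define EXAMPLE_RAIL_ENABLE		(1U << 15)	/* rail on/off          */
#define EXAMPLE_RAIL_STRIDE_MASK	0x00ffU		/* stride width, e.g. 4 */

static inline __u16 example_rail_pack(bool enable, unsigned int stride_width)
{
	return (enable ? EXAMPLE_RAIL_ENABLE : 0) |
	       (stride_width & EXAMPLE_RAIL_STRIDE_MASK);
}

static inline unsigned int example_rail_stride(__u16 rsv)
{
	return rsv & EXAMPLE_RAIL_STRIDE_MASK;
}

Since one LUN per stride holds parity (see pblk_rail_nr_parity_luns() above),
the capacity overhead is 1/stride_width, which presumably is where the 7%-25%
range in the cover letter comes from; making the stride width a create-time
parameter would let users pick that trade-off per target.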

>
>  endif # NVM
> diff --git a/drivers/lightnvm/Makefile b/drivers/lightnvm/Makefile
> index 97d9d7c71550..92f4376428cc 100644
> --- a/drivers/lightnvm/Makefile
> +++ b/drivers/lightnvm/Makefile
> @@ -5,6 +5,7 @@
>
>  obj-$(CONFIG_NVM)  := core.o
>  obj-$(CONFIG_NVM_PBLK) += pblk.o
> +obj-$(CONFIG_NVM_PBLK_RAIL)+= pblk-rail.o
>  pblk-y := pblk-init.o pblk-core.o pblk-rb.o \
>pblk-write.o pblk-cache.o pblk-read.o \
>pblk-gc.o pblk-recovery.o pblk-map.o \
> diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
> index a31bf359f905..ca74d7763fa9 100644
> --- a/drivers/lightnvm/pblk-core.c
> +++ b/drivers/lightnvm/pblk-core.c
> @@ -113,6 +113,12 @@ static void pblk_end_io_erase(struct nvm_rq *rqd)
>  {
> struct pblk *pblk = rqd->private;
>
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +   struct ppa_addr *ppa_list = nvm_rq_to_ppa_list(rqd);
> +
> +   pblk_up_chunk(pblk, ppa_list[0]);
> +#endif
> +
> __pblk_end_io_erase(pblk, rqd);
> mempool_free(rqd, &pblk->e_rq_pool);
>  }
> @@ -940,7 +946,11 @@ static int pblk_blk_erase_sync(struct pblk *pblk, struct 
> ppa_addr ppa)
> /* The write thread schedules erases so that it minimizes disturbances
>  * with writes. Thus, there is no need to take the LUN semaphore.
>  */
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +   ret = pblk_submit_io_sync_sem(pblk, &rqd);
> +#else
> ret = pblk_submit_io_sync(pblk, &rqd);
> +#endif
> rqd.private = pblk;
> __pblk_end_io_erase(pblk, &rqd);
>
> @@ -1754,7 +1764,11 @@ int pblk_blk_erase_async(struct pblk *pblk, struct 
> ppa_addr ppa)
> /* The write thread schedules erases so that it minimizes disturbances
>  * with writes. Thus, there is no need to take the LUN semaphore.
>  */
> +#ifdef CONFIG_NVM_PBLK_RAIL
> +   err = pblk_submit_io_sem(pblk, rqd);
> +#else
> err = pblk_submit_io(pblk, rqd);
> +#endif
> if (err) {
> struct nvm_tgt_dev *dev = pblk->dev;
> struct nvm_geo *geo = &dev->geo;
> @@ -1909,6 +1923,10 @@ void pblk_line_close_ws(struct work_struct *work)
> if (w_err_gc->has_write_err)
> pblk_save_lba_list(pblk, line);
>
> +#ifdef 

Re: [RFC PATCH 5/6] lightnvm: pblk: Add RAIL interface

2018-09-18 Thread Hans Holmberg
On Mon, Sep 17, 2018 at 7:30 AM Heiner Litz  wrote:
>
> In preparation for supporting RAIL, add the RAIL API.
>
> Signed-off-by: Heiner Litz 
> ---
>  drivers/lightnvm/pblk-rail.c | 808 +++
>  drivers/lightnvm/pblk.h  |  63 ++-
>  2 files changed, 870 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/lightnvm/pblk-rail.c
>
> diff --git a/drivers/lightnvm/pblk-rail.c b/drivers/lightnvm/pblk-rail.c
> new file mode 100644
> index ..a48ed31a0ba9
> --- /dev/null
> +++ b/drivers/lightnvm/pblk-rail.c
> @@ -0,0 +1,808 @@
> +/*
> + * Copyright (C) 2018 Heiner Litz
> + * Initial release: Heiner Litz 
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * pblk-rail.c - pblk's RAIL path
> + */
> +
> +#include "pblk.h"
> +
> +#define PBLK_RAIL_EMPTY ~0x0
This constant is not being used.

> +#define PBLK_RAIL_PARITY_WRITE 0x8000
Where does this magic number come from? Please document.
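
If it is meant as a flag bit, spelling that out would already help, e.g.
something along these lines (the comment is my guess at the intent,
please correct it):

	/* marks a request as a RAIL parity write (bit 15 of the flags?) */
	#define PBLK_RAIL_PARITY_WRITE (1 << 15)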

> +
> +/* RAIL auxiliary functions */
> +static unsigned int pblk_rail_nr_parity_luns(struct pblk *pblk)
> +{
> +   struct pblk_line_meta *lm = &pblk->lm;
> +
> +   return lm->blk_per_line / PBLK_RAIL_STRIDE_WIDTH;
> +}
> +
> +static unsigned int pblk_rail_nr_data_luns(struct pblk *pblk)
> +{
> +   struct pblk_line_meta *lm = &pblk->lm;
> +
> +   return lm->blk_per_line - pblk_rail_nr_parity_luns(pblk);
> +}
> +
> +static unsigned int pblk_rail_sec_per_stripe(struct pblk *pblk)
> +{
> +   struct pblk_line_meta *lm = &pblk->lm;
> +
> +   return lm->blk_per_line * pblk->min_write_pgs;
> +}
> +
> +static unsigned int pblk_rail_psec_per_stripe(struct pblk *pblk)
> +{
> +   return pblk_rail_nr_parity_luns(pblk) * pblk->min_write_pgs;
> +}
> +
> +static unsigned int pblk_rail_dsec_per_stripe(struct pblk *pblk)
> +{
> +   return pblk_rail_sec_per_stripe(pblk) - 
> pblk_rail_psec_per_stripe(pblk);
> +}
> +
> +static unsigned int pblk_rail_wrap_lun(struct pblk *pblk, unsigned int lun)
> +{
> +   struct pblk_line_meta *lm = &pblk->lm;
> +
> +   return (lun & (lm->blk_per_line - 1));
> +}
> +
> +bool pblk_rail_meta_distance(struct pblk_line *data_line)
> +{
> +   return (data_line->meta_distance % PBLK_RAIL_STRIDE_WIDTH) == 0;
> +}
> +
> +/* Notify readers that LUN is serving high latency operation */
> +static void pblk_rail_notify_reader_down(struct pblk *pblk, int lun)
> +{
> +   WARN_ON(test_and_set_bit(lun, pblk->rail.busy_bitmap));
> +   /* Make sure that busy bit is seen by reader before proceeding */
> +   smp_mb__after_atomic();
> +}
> +
> +static void pblk_rail_notify_reader_up(struct pblk *pblk, int lun)
> +{
> +   /* Make sure that write is completed before releasing busy bit */
> +   smp_mb__before_atomic();
> +   WARN_ON(!test_and_clear_bit(lun, pblk->rail.busy_bitmap));
> +}
> +
> +int pblk_rail_lun_busy(struct pblk *pblk, struct ppa_addr ppa)
> +{
> +   struct nvm_tgt_dev *dev = pblk->dev;
> +   struct nvm_geo *geo = &dev->geo;
> +   int lun_pos = pblk_ppa_to_pos(geo, ppa);
> +
> +   return test_bit(lun_pos, pblk->rail.busy_bitmap);
> +}
> +
> +/* Enforces one writer per stride */
> +int pblk_rail_down_stride(struct pblk *pblk, int lun_pos, int timeout)
> +{
> +   struct pblk_lun *rlun;
> +   int strides = pblk_rail_nr_parity_luns(pblk);
> +   int stride = lun_pos % strides;
> +   int ret;
> +
> +   rlun = &pblk->luns[stride];
> +   ret = down_timeout(&rlun->wr_sem, timeout);
> +   pblk_rail_notify_reader_down(pblk, lun_pos);
> +
> +   return ret;
> +}
> +
> +void pblk_rail_up_stride(struct pblk *pblk, int lun_pos)
> +{
> +   struct pblk_lun *rlun;
> +   int strides = pblk_rail_nr_parity_luns(pblk);
> +   int stride = lun_pos % strides;
> +
> +   pblk_rail_notify_reader_up(pblk, lun_pos);
> +   rlun = &pblk->luns[stride];
> +   up(&rlun->wr_sem);
> +}
> +
> +/* Determine whether a sector holds data, meta, or is bad */
> +bool pblk_rail_valid_sector(struct pblk *pblk, struct pblk_line *line, int 
> pos)
> +{
> +   struct pblk_line_meta *lm = &pblk->lm;
> +   struct nvm_tgt_dev *dev = pblk->dev;
> +   struct nvm_geo *geo = &dev->geo;
> +   struct ppa_addr ppa;
> +   int lun;
> +
> +   if (pos >= line->smeta_ssec && pos < (line->smeta_ssec + 
> lm->smeta_sec))
> +   return false;
> +
> +   if (pos >= line->emeta_ssec &&
> +   pos < (line->emeta_ssec + lm->emeta_sec[0]))
> +   return false;
> +
> +   ppa = addr_to_gen_ppa(pblk, pos, line->id);
> +   lun = pblk_ppa_to_pos(geo, ppa);
> +
> +   return !test_bit(lun, 

Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM

2018-09-18 Thread jianchao.wang
Hi Ming

On 09/18/2018 03:42 PM, Ming Lei wrote:
> On Tue, Sep 18, 2018 at 09:17:12AM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> On 09/17/2018 08:07 PM, Ming Lei wrote:
>>>>> This way will delay runtime pm or system suspend until the queue is
>>>>> unfrozen,
>>>>> but it isn't reasonable.
>>>> This interface is for the __scsi_execute only, before we call into
>>>> function, we should have
>>>> get the device resumed synchronously.
>>> I mean when the queue is frozen, it is reasonable to runtime suspend the 
>>> queue. However,
>>> blk_queue_preempt_enter() is still waiting for queue becoming unfreezing 
>>> first.
>>
>> We don't freeze the queue, but set preempt-only mode when suspend the queue. 
>> :)
> 
> But the queue can be frozen by other paths. Even though it is frozen, the 
> queue
> should be allowed to suspend too.
> 

Yes, the race between freeze and preempt-only is a tricky issue.

Thanks
Jianchao


Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM

2018-09-18 Thread Ming Lei
On Tue, Sep 18, 2018 at 09:17:12AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 09/17/2018 08:07 PM, Ming Lei wrote:
> >>> This way will delay runtime pm or system suspend until the queue is 
> >>> unfrozen,
> >>> but it isn't reasonable.
> >> This interface is for the __scsi_execute only, before we call into 
> >> function, we should have
> >> get the device resumed synchronously.
> > I mean when the queue is frozen, it is reasonable to runtime suspend the 
> > queue. However,
> > blk_queue_preempt_enter() is still waiting for queue becoming unfreezing 
> > first.
> 
> We don't freeze the queue, but set preempt-only mode when suspend the queue. 
> :)

But the queue can be frozen by other paths. Even though it is frozen, the queue
should be allowed to suspend too.
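
For example (just my reading of the concern, illustration only, not
actual code):

	CPU0                                CPU1
	elevator_switch()                   runtime PM wants to suspend q;
	  blk_mq_freeze_queue(q)            __scsi_execute() issues the PM
	  ... queue stays frozen              command and ends up in
	      for a while ...                 blk_queue_preempt_enter(q),
	                                      which waits for the unfreeze,
	                                      so the suspend is held off the
	                                      whole time

The suspend should not have to wait for CPU0 to finish here.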

Thanks,
Ming