Re: [PATCH 8/8] scsi: Introduce ZBC disk I/O scheduler

2017-09-01 Thread Bart Van Assche
On Fri, 2017-09-01 at 20:36 +0900, Damien Le Moal wrote:
> +static struct scsi_disk *__zoned_scsi_disk(struct request_queue *q)
> +{
> +   struct scsi_device *sdp;
> +   struct scsi_disk *sdkp;
> +
> +   if (!blk_queue_is_zoned(q)) {
> +   pr_err("zoned: Not a zoned block device\n");
> +   return NULL;
> +   }
> +
> +   sdp = scsi_device_from_queue(q);
> +   if (!sdp) {
> +   pr_err("zoned: Not a SCSI device\n");
> +   return NULL;
> +   }
> +
> +   sdkp = dev_get_drvdata(&sdp->sdev_gendev);
> +   if (WARN_ON(sdkp->disk->queue != q))
> +   return NULL;
> +
> +   return sdkp;
> +}

Hello Damien,

Can reading sdkp->disk->queue cause a kernel crash if sdp does not point at
a SCSI device that is associated with a SCSI disk? How about using something
like the code below to convert a request queue pointer into a SCSI disk
pointer?

static int lookup_disk(struct device *dev, void *data)
{
struct scsi_disk **sdkp = data;

if (!*sdkp && dev->class == &sd_disk_class)
*sdkp = to_scsi_disk(dev);

return 0;
}

static struct scsi_disk *q_to_sdkp(struct request_queue *q)
{
struct scsi_device *sdp = scsi_device_from_queue(q);
struct scsi_disk *sdkp = NULL;

if (sdp)
device_for_each_child(&sdp->sdev_gendev, &sdkp, lookup_disk);
return sdkp;
}

Thanks,

Bart.

[PATCH 8/8] scsi: Introduce ZBC disk I/O scheduler

2017-09-01 Thread Damien Le Moal
The zoned I/O scheduler is mostly identical to mq-deadline and retains
the same configuration attributes. The main difference is that the
zoned scheduler will ensure that at any time at most only one write
request (command) per sequential zone is in flight (has been issued to
the disk) in order to protect against sequential write reordering
potentially resulting from the concurrent execution of request dispatch
by multiple contexts.

This is achieved similarly to the legacy SCSI I/O path by write locking
zones when a write requests is issued. Subsequent writes to the same
zone have to wait for the completion of the previously issued write
before being in turn dispatched to the disk. This ensures that
sequential writes are processed in the correct order without needing
any modification to the execution model of blk-mq. In addition, this
also protects against write reordering at the HBA level (e.g. AHCI).

This zoned scheduler code is added under the drivers/scsi directory so
that information managed using the scsi_disk structure can be directly
referenced. Doing so, cluttering the block layer with device type
specific code is avoided.

Signed-off-by: Damien Le Moal 
---
 Documentation/block/zoned-iosched.txt |  48 ++
 block/Kconfig.iosched |  12 +
 drivers/scsi/Makefile |   1 +
 drivers/scsi/sd.h |   3 +-
 drivers/scsi/sd_zbc.c |  16 +-
 drivers/scsi/sd_zbc.h |  11 +-
 drivers/scsi/sd_zbc_iosched.c | 934 ++
 7 files changed, 1009 insertions(+), 16 deletions(-)
 create mode 100644 Documentation/block/zoned-iosched.txt
 create mode 100644 drivers/scsi/sd_zbc_iosched.c

diff --git a/Documentation/block/zoned-iosched.txt 
b/Documentation/block/zoned-iosched.txt
new file mode 100644
index ..b269125bdc61
--- /dev/null
+++ b/Documentation/block/zoned-iosched.txt
@@ -0,0 +1,48 @@
+Zoned I/O scheduler
+===
+
+The Zoned I/O scheduler solves zoned block devices write ordering problems
+inherent to the absence of a global request queue lock in the blk-mq
+infrastructure. Multiple contexts may try to dispatch simultaneously write
+requests to the same sequential zone of a zoned block device, doing so
+potentially breaking the sequential write order imposed by the device.
+
+The Zoned I/O scheduler is based on the mq-deadline scheduler. It shares the
+same tunables and behaves in a comparable manner. The main difference 
introduced
+with the zoned scheduler is handling of write batches. Whereas mq-deadline will
+keep dispatching write requests to the device as long as the batching size
+allows and reads are not starved, the zoned scheduler introduces additional
+constraints:
+1) At most only one write request can be issued to a sequential zone of the
+device. This ensures that no reordering of sequential writes to a sequential
+zone can happen once the write request leaves the scheduler internal queue (rb
+tree).
+2) The number of sequential zones that can be simultaneously written is limited
+to the device advertized maximum number of open zones. This additional 
condition
+avoids performance degradation due to excessive open zone resource use at the
+device level.
+
+These conditions do not apply to write requests targeting conventional zones.
+For these, the zoned scheduler behaves exactly like the mq-deadline scheduler.
+
+The zoned I/O scheduler cannot be used with regular block devices. It can only
+be used with host-managed or host-aware zoned block devices.
+Using the zoned I/O scheduler is mandatory with host-managed disks unless the
+disk user tightly controls itself write sequencing to sequential zones. The
+zoned scheduler will treat host-aware disks exactly the same way as 
host-managed
+devices. That is, eventhough host aware disks can be randomly written, the 
zoned
+scheduler will still impose the limit to one write per zone so that sequential
+writes sequences are preserved.
+
+For host-managed disks, automating the used of the zoned scheduler can be done
+simply with a udev rule. An example of such rule is shown below.
+
+# Set zoned scheduler for host-managed zoned block devices
+ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/zoned}=="host-managed", \
+   ATTR{queue/scheduler}="zoned"
+
+Zoned I/O scheduler tunables
+
+
+Tunables of the Zoned I/O scheduler are identical to those of the deadline
+I/O scheduler. See Documentation/block/deadline-iosched.txt for details.
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index fd2cefa47d35..b87c67dbf1f6 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -70,6 +70,18 @@ config MQ_IOSCHED_DEADLINE
---help---
  MQ version of the deadline IO scheduler.
 
+config MQ_IOSCHED_ZONED
+   tristate "Zoned I/O scheduler"
+   depends on BLK_DEV_ZONED
+   depends on SCSI_MOD
+   depends on BLK_DEV_SD
+   default y
+   ---h