Re: [RFD] I/O scheduling in blk-mq
Hi Omar,
have you had a chance to look at these last questions of mine?

Thanks,
Paolo

> On 31 Aug 2016, at 17:20, Paolo Valente wrote:
>
> On 8 Aug 2016, at 22:09, Omar Sandoval wrote:
>
>> On Mon, Aug 08, 2016 at 04:09:56PM +0200, Paolo wrote:
>>> Hi Jens, Tejun, Christoph, all,
>>> AFAIK blk-mq does not yet feature I/O schedulers. In particular, there
>>> is no scheduler providing strong guarantees in terms of
>>> responsiveness, latency for time-sensitive applications, and bandwidth
>>> distribution.
>>>
>>> For this reason, I'm trying to port BFQ to blk-mq, or to develop
>>> something simpler if even a reduced version of BFQ proves to be too
>>> heavy (this project is supported by Linaro). If you are willing to
>>> provide some feedback in this respect, I would like to ask for
>>> opinions/suggestions on the following two matters, and possibly to
>>> open a more general discussion on I/O scheduling in blk-mq.
>>>
>>> 1) My idea is to have an independent instance of BFQ, or in general of
>>> the I/O scheduler, executed for each software queue. There would then
>>> be no global scheduling. The drawback of no global scheduling is that
>>> each process cannot get more than 1/M of the total throughput of the
>>> device, if M is the number of software queues. But, if I'm not
>>> mistaken, it is in any case infeasible to give a process more than
>>> 1/M of the total throughput without lowering the throughput itself.
>>> In fact, giving a process more than 1/M of the total throughput
>>> implies serving its software queue, say Q, more than the others. The
>>> only way to do that is to periodically stop serving the other
>>> software queues and dispatch only the requests in Q. But this would
>>> reduce parallelism, which is the main way blk-mq achieves very high
>>> throughput. Are these considerations, and, in particular, one
>>> independent I/O scheduler per software queue, sensible?
>>>
>>> 2) To provide per-process service guarantees, an I/O scheduler must
>>> create per-process internal queues. BFQ and CFQ use I/O contexts to
>>> achieve this goal. Is something like that (or exactly the same)
>>> available also in blk-mq? If so, do you have any suggestion, or link
>>> to documentation/code, on how to use what is available in blk-mq?
>>>
>>> Thanks,
>>> Paolo
>>
>> Hi, Paolo,
>>
>> I've been working on I/O scheduling for blk-mq with Jens for the past
>> few months (splitting time with other small projects), and we're making
>> good progress. Like you noticed, the hard part isn't really grafting a
>> scheduler interface onto blk-mq, it's maintaining good scalability while
>> providing adequate fairness.
>>
>> We're working towards a scheduler more like deadline and getting the
>> architectural issues worked out. The goal is some sort of fairness
>> across all queues.
>
> If I'm not mistaken, the requests of a process (the bios, after your
> patch) end up in a given software queue basically by chance, i.e.,
> because the process happens to be executed on the core with which that
> queue is associated. If this is true, then the scheduler cannot
> control to which queue a request is sent. So, how do you imagine the
> scheduler controlling the global request service order, exactly? By
> stopping the service of some queues and letting only the head-of-line
> request(s) of some other queue(s) be dispatched?
>
> In this respect, I guess that, as of now, it is again chance that
> determines from which software queue the next request to dispatch is
> picked, i.e., it depends on the core on which the dispatch functions
> happen to be executed. Is that correct?
>
>> The scheduler-per-software-queue model won't hold up
>> so well if we have a slower device with an I/O-hungry process on one CPU
>> and an interactive process on another CPU.
>
> So, the problem would be that the hungry process eats all the
> bandwidth, and the interactive one never gets served.
>
> What about the case where both processes are on the same CPU, i.e.,
> where the requests of both processes are in the same software queue?
> How does the scheduler you envisage guarantee a good latency to the
> interactive process in this case? By properly reordering requests
> inside the software queue?
>
> I'm sorry if my questions are quite silly, or do not make much sense.
>
> Thanks,
> Paolo
>
>
>> The issue I'm working through now is that on blk-mq, we only have as
>> many `struct request`s as the hardware has tags, so on a device with a
>> limited queue depth, it's really hard to do any sort of intelligent
>> scheduling. The solution for that is switching over to working with
>> `struct bio`s in the software queues instead, which abstracts away the
>> hardware capabilities. I have some work in progress at
>> https://github.com/osandov/linux/tree/blk-mq-iosched, but it's not yet
>> at feature-parity.
>>
>> After that, I'll be back to working on the scheduling itself. The vague
>> idea
[PATCH v5 2/7] blk-sysfs: Add 'chunk_sectors' to sysfs attributes
From: Hannes Reinecke

The queue limits already have a 'chunk_sectors' setting, so we should
be presenting it via sysfs.

Signed-off-by: Hannes Reinecke
[Damien: Updated Documentation/ABI/testing/sysfs-block]
Signed-off-by: Damien Le Moal
Reviewed-by: Christoph Hellwig
Reviewed-by: Martin K. Petersen
Reviewed-by: Shaun Tancheff
Tested-by: Shaun Tancheff
---
 Documentation/ABI/testing/sysfs-block | 13 +
 block/blk-sysfs.c                     | 11 +++
 2 files changed, 24 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index 75a5055..ee2d5cd 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -251,3 +251,16 @@ Description:
 		since drive-managed zoned block devices do not support
 		zone commands, they will be treated as regular block
 		devices and zoned will report "none".
+
+What:		/sys/block/<disk>/queue/chunk_sectors
+Date:		September 2016
+Contact:	Hannes Reinecke
+Description:
+		chunk_sectors has a different meaning depending on the
+		type of the disk. For a RAID device (dm-raid),
+		chunk_sectors indicates the size in 512B sectors of the
+		RAID volume stripe segment. For a zoned block device,
+		either host-aware or host-managed, chunk_sectors
+		indicates the size in 512B sectors of the zones of the
+		device, with the possible exception of the last zone of
+		the device, which may be smaller.
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index ff9cd9c..488c2e2 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -130,6 +130,11 @@ static ssize_t queue_physical_block_size_show(struct request_queue *q, char *pag return queue_var_show(queue_physical_block_size(q), page); } +static ssize_t queue_chunk_sectors_show(struct request_queue *q, char *page) +{ + return queue_var_show(q->limits.chunk_sectors, page); +} + static ssize_t queue_io_min_show(struct request_queue *q, char *page) { return queue_var_show(queue_io_min(q), page); @@ -455,6 +460,11 @@ static struct queue_sysfs_entry queue_physical_block_size_entry = { .show = queue_physical_block_size_show, }; +static struct queue_sysfs_entry queue_chunk_sectors_entry = { + .attr = {.name = "chunk_sectors", .mode = S_IRUGO }, + .show = queue_chunk_sectors_show, +}; + static struct queue_sysfs_entry queue_io_min_entry = { .attr = {.name = "minimum_io_size", .mode = S_IRUGO }, .show = queue_io_min_show, @@ -555,6 +565,7 @@ static struct attribute *default_attrs[] = { &queue_hw_sector_size_entry.attr, &queue_logical_block_size_entry.attr, &queue_physical_block_size_entry.attr, + &queue_chunk_sectors_entry.attr, &queue_io_min_entry.attr, &queue_io_opt_entry.attr, &queue_discard_granularity_entry.attr, -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
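As a side note for readers of the new attribute: the chunk boundary arithmetic that a fixed chunk_sectors enables can be sketched in plain userspace C. The helper names below are hypothetical and not part of the patch; they just illustrate the semantics documented above.

```c
#include <assert.h>

/* Which chunk (RAID stripe segment or zone) does a 512B sector fall in,
 * given a device split into equal chunks of chunk_sectors each? */
static unsigned long long chunk_index(unsigned long long sector,
				      unsigned int chunk_sectors)
{
	return sector / chunk_sectors;
}

/* Does an I/O of nr_sectors starting at sector cross a chunk boundary?
 * Crossing a zone boundary is what zoned-device-aware code must avoid. */
static int crosses_chunk(unsigned long long sector, unsigned int nr_sectors,
			 unsigned int chunk_sectors)
{
	return chunk_index(sector, chunk_sectors) !=
	       chunk_index(sector + nr_sectors - 1, chunk_sectors);
}
```

For a 256MB zone (524288 sectors), a 16-sector write starting 8 sectors before the boundary crosses it, while a write filling exactly one zone does not.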
[PATCH v5 1/7] block: Add 'zoned' queue limit
Add the zoned queue limit to indicate the zoning model of a block
device. Defined values are 0 (BLK_ZONED_NONE) for regular block
devices, 1 (BLK_ZONED_HA) for host-aware zoned block devices and
2 (BLK_ZONED_HM) for host-managed zoned block devices. The
drive-managed model defined by the standards is not represented here,
since such block devices do not provide any command for accessing zone
information. Drive-managed devices will be reported as BLK_ZONED_NONE.

The helper functions blk_queue_zoned_model and bdev_zoned_model return
the zoned limit, and the functions blk_queue_is_zoned and bdev_is_zoned
return a boolean for callers to test if a block device is zoned.

The zoned attribute is also exported as a string to applications via
sysfs. BLK_ZONED_NONE shows as "none", BLK_ZONED_HA as "host-aware"
and BLK_ZONED_HM as "host-managed".

Signed-off-by: Damien Le Moal
Reviewed-by: Christoph Hellwig
Reviewed-by: Martin K. Petersen
Reviewed-by: Shaun Tancheff
Tested-by: Shaun Tancheff
---
 Documentation/ABI/testing/sysfs-block | 16
 block/blk-settings.c                  |  1 +
 block/blk-sysfs.c                     | 18 ++
 include/linux/blkdev.h                | 47 +++
 4 files changed, 82 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index 71d184d..75a5055 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -235,3 +235,19 @@ Description:
 		write_same_max_bytes is 0, write same is not supported
 		by the device.
+
+What:		/sys/block/<disk>/queue/zoned
+Date:		September 2016
+Contact:	Damien Le Moal
+Description:
+		zoned indicates if the device is a zoned block device
+		and the zone model of the device if it is indeed zoned.
+		The possible values indicated by zoned are "none" for
+		regular block devices and "host-aware" or "host-managed"
+		for zoned block devices. The characteristics of
+		host-aware and host-managed zoned block devices are
+		described in the ZBC (Zoned Block Commands) and ZAC
+		(Zoned Device ATA Command Set) standards. These standards
+		also define the "drive-managed" zone model. However,
+		since drive-managed zoned block devices do not support
+		zone commands, they will be treated as regular block
+		devices and zoned will report "none".
diff --git a/block/blk-settings.c b/block/blk-settings.c
index f679ae1..b1d5b7f 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -107,6 +107,7 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->io_opt = 0;
 	lim->misaligned = 0;
 	lim->cluster = 1;
+	lim->zoned = BLK_ZONED_NONE;
 }
 EXPORT_SYMBOL(blk_set_default_limits);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 9cc8d7c..ff9cd9c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -257,6 +257,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
 QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
 #undef QUEUE_SYSFS_BIT_FNS
 
+static ssize_t queue_zoned_show(struct request_queue *q, char *page)
+{
+	switch (blk_queue_zoned_model(q)) {
+	case BLK_ZONED_HA:
+		return sprintf(page, "host-aware\n");
+	case BLK_ZONED_HM:
+		return sprintf(page, "host-managed\n");
+	default:
+		return sprintf(page, "none\n");
+	}
+}
+
 static ssize_t queue_nomerges_show(struct request_queue *q, char *page)
 {
 	return queue_var_show((blk_queue_nomerges(q) << 1) |
@@ -485,6 +497,11 @@ static struct queue_sysfs_entry queue_nonrot_entry = {
 	.store = queue_store_nonrot,
 };
 
+static struct queue_sysfs_entry queue_zoned_entry = {
+	.attr = {.name = "zoned", .mode = S_IRUGO },
+	.show = queue_zoned_show,
+};
+
 static struct queue_sysfs_entry queue_nomerges_entry = {
 	.attr = {.name = "nomerges", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_nomerges_show,
@@ -546,6 +563,7 @@ static struct attribute *default_attrs[] = {
 	&queue_discard_zeroes_data_entry.attr,
 	&queue_write_same_max_entry.attr,
 	&queue_nonrot_entry.attr,
+	&queue_zoned_entry.attr,
 	&queue_nomerges_entry.attr,
 	&queue_rq_affinity_entry.attr,
 	&queue_iostats_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c47c358..f19e16b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -261,6 +261,15 @@ struct blk_queue_tag {
 #define BLK_SCSI_MAX_CMDS (256)
 #define BLK_SCSI_CMD_PER_LONG (BLK_SCSI_MAX_CMDS / (sizeof(long) * 8))
 
+/*
+ * Zoned block device models (zoned limit).
+ */
+enum blk_zoned_model {
+	BLK_ZONED_NONE,	/* Regular block device */
+	BLK_ZONED_HA,	/* Host-aware zoned block device */
+	BLK_ZONED_HM,	/* Host-managed
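The model-to-string mapping used by queue_zoned_show() in the patch can be exercised standalone with a local copy of the enum. This is a sketch: the real definitions live in include/linux/blkdev.h and block/blk-sysfs.c, and zoned_model_name is a hypothetical helper, not kernel code.

```c
#include <assert.h>
#include <string.h>

/* Local copy of the zoned model enum introduced by this patch. */
enum blk_zoned_model {
	BLK_ZONED_NONE,	/* Regular block device */
	BLK_ZONED_HA,	/* Host-aware zoned block device */
	BLK_ZONED_HM,	/* Host-managed zoned block device */
};

/* Mirrors the sysfs string mapping of queue_zoned_show(): anything
 * other than the two zoned models reports "none". */
static const char *zoned_model_name(enum blk_zoned_model model)
{
	switch (model) {
	case BLK_ZONED_HA:
		return "host-aware";
	case BLK_ZONED_HM:
		return "host-managed";
	default:
		return "none";
	}
}
```

These are exactly the three strings an application will read back from /sys/block/&lt;disk&gt;/queue/zoned.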
[PATCH v5 3/7] block: update chunk_sectors in blk_stack_limits()
From: Hannes Reinecke

Signed-off-by: Hannes Reinecke
Signed-off-by: Damien Le Moal
Reviewed-by: Christoph Hellwig
Reviewed-by: Martin K. Petersen
Reviewed-by: Shaun Tancheff
Tested-by: Shaun Tancheff
---
 block/blk-settings.c | 4
 1 file changed, 4 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index b1d5b7f..55369a6 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -631,6 +631,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 			t->discard_granularity;
 	}
 
+	if (b->chunk_sectors)
+		t->chunk_sectors = min_not_zero(t->chunk_sectors,
+						b->chunk_sectors);
+
 	return ret;
 }
 EXPORT_SYMBOL(blk_stack_limits);
-- 
2.7.4
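The stacking rule added above can be restated in standalone C. Here min_not_zero is a local re-implementation mirroring the kernel macro of the same name, and stack_chunk_sectors is a hypothetical wrapper for the hunk's logic: a stacked device inherits the smallest non-zero chunk_sectors of its components.

```c
#include <assert.h>

/* Userspace mirror of the kernel's min_not_zero(): the minimum of two
 * values, ignoring zeroes (zero means "no limit set"). */
static unsigned int min_not_zero(unsigned int a, unsigned int b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/* The rule from the hunk: only merge in the bottom device's
 * chunk_sectors when it actually has one. */
static unsigned int stack_chunk_sectors(unsigned int top, unsigned int bottom)
{
	if (bottom)
		top = min_not_zero(top, bottom);
	return top;
}
```

So stacking a zoned drive (chunk_sectors = 524288) under a device with no chunk limit yields 524288, and two different limits yield the smaller one.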
[PATCH v5 6/7] sd: Implement support for ZBC devices
From: Hannes Reinecke

Implement ZBC support functions to set up zoned disks, both
host-managed and host-aware models. Only zoned disks that satisfy the
following conditions are supported:
1) All zones are the same size, with the possible exception of a
   smaller runt zone at the end.
2) For host-managed disks, reads are unrestricted (reads are not
   failed due to zone or write pointer alignment constraints).
Zoned disks that do not satisfy these 2 conditions are set up with a
capacity of 0 to prevent their use.

The function sd_zbc_read_zones, called from sd_revalidate_disk, checks
that the device satisfies the above two constraints. This function may
also change the disk capacity previously set by sd_read_capacity for
devices reporting only the capacity of conventional zones at the
beginning of the LBA range (i.e. devices reporting rc_basis set to 0).
The capacity message output was moved out of sd_read_capacity into a
new function sd_print_capacity to include this possible capacity
change by sd_zbc_read_zones. This new function also includes a call to
sd_zbc_print_zones to display the number of zones and the zone size of
the device.
Signed-off-by: Hannes Reinecke
[Damien: * Removed zone cache support
	 * Removed mapping of discard to reset write pointer command
	 * Modified sd_zbc_read_zones to include checks that the device
	   satisfies the kernel constraints
	 * Implemented REPORT ZONES setup and post-processing based on
	   code from Shaun Tancheff
	 * Removed confusing use of 512B sector units in function
	   interfaces]
Signed-off-by: Damien Le Moal
Reviewed-by: Christoph Hellwig
Reviewed-by: Shaun Tancheff
Tested-by: Shaun Tancheff
---
 drivers/scsi/Makefile     |   1 +
 drivers/scsi/sd.c         | 141 ---
 drivers/scsi/sd.h         |  67 +
 drivers/scsi/sd_zbc.c     | 627 ++
 include/scsi/scsi_proto.h |  17 ++
 5 files changed, 820 insertions(+), 33 deletions(-)
 create mode 100644 drivers/scsi/sd_zbc.c

diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index fc0d9b8..350513c 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -180,6 +180,7 @@ hv_storvsc-y := storvsc_drv.o
 sd_mod-objs := sd.o
 sd_mod-$(CONFIG_BLK_DEV_INTEGRITY) += sd_dif.o
+sd_mod-$(CONFIG_BLK_DEV_ZONED) += sd_zbc.o
 sr_mod-objs := sr.o sr_ioctl.o sr_vendor.o
 ncr53c8xx-flags-$(CONFIG_SCSI_ZALON) \
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 51e5629..20889b5 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -93,6 +93,7 @@ MODULE_ALIAS_BLOCKDEV_MAJOR(SCSI_DISK15_MAJOR);
 MODULE_ALIAS_SCSI_DEVICE(TYPE_DISK);
 MODULE_ALIAS_SCSI_DEVICE(TYPE_MOD);
 MODULE_ALIAS_SCSI_DEVICE(TYPE_RBC);
+MODULE_ALIAS_SCSI_DEVICE(TYPE_ZBC);
 #if !defined(CONFIG_DEBUG_BLOCK_EXT_DEVT)
 #define SD_MINORS 16
@@ -163,7 +164,7 @@ cache_type_store(struct device *dev, struct device_attribute *attr,
 	static const char temp[] = "temporary ";
 	int len;
 
-	if (sdp->type != TYPE_DISK)
+	if (sdp->type != TYPE_DISK && sdp->type != TYPE_ZBC)
 		/* no cache control on RBC devices; theoretically they
 		 * can do it, but there's probably so many exceptions
 		 * it's not worth the risk */
@@ -262,7 +263,7 @@ allow_restart_store(struct device *dev, struct device_attribute *attr,
 	if (!capable(CAP_SYS_ADMIN))
 		return -EACCES;
 
-	if (sdp->type != TYPE_DISK)
+	if (sdp->type != TYPE_DISK && sdp->type != TYPE_ZBC)
 		return -EINVAL;
 
 	sdp->allow_restart = simple_strtoul(buf, NULL, 10);
@@ -392,6 +393,11 @@ provisioning_mode_store(struct device *dev, struct device_attribute *attr,
 	if (!capable(CAP_SYS_ADMIN))
 		return -EACCES;
 
+	if (sd_is_zoned(sdkp)) {
+		sd_config_discard(sdkp, SD_LBP_DISABLE);
+		return count;
+	}
+
 	if (sdp->type != TYPE_DISK)
 		return -EINVAL;
@@ -459,7 +465,7 @@ max_write_same_blocks_store(struct device *dev, struct device_attribute *attr,
 	if (!capable(CAP_SYS_ADMIN))
 		return -EACCES;
 
-	if (sdp->type != TYPE_DISK)
+	if (sdp->type != TYPE_DISK && sdp->type != TYPE_ZBC)
 		return -EINVAL;
 
 	err = kstrtoul(buf, 10, &max);
@@ -844,6 +850,12 @@ static int sd_setup_write_same_cmnd(struct scsi_cmnd *cmd)
 	BUG_ON(bio_offset(bio) || bio_iovec(bio).bv_len != sdp->sector_size);
 
+	if (sd_is_zoned(sdkp)) {
+		ret = sd_zbc_setup_read_write(cmd);
+		if (ret != BLKPREP_OK)
+			return ret;
+	}
+
 	sector >>= ilog2(sdp->sector_size) - 9;
 	nr_sectors >>= ilog2(sdp->sector_size) - 9;
@@ -963,6 +975,12 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd *SCpnt)
 	SCSI_LOG_HLQUEUE(2, scmd_printk(KERN_INFO, SCpnt,
 					"block=%llu\n",
[PATCH v5 7/7] blk-zoned: implement ioctls
From: Shaun Tancheff Adds the new BLKREPORTZONE and BLKRESETZONE ioctls for respectively obtaining the zone configuration of a zoned block device and resetting the write pointer of sequential zones of a zoned block device. The BLKREPORTZONE ioctl maps directly to a single call of the function blkdev_report_zones. The zone information result is passed as an array of struct blk_zone identical to the structure used internally for processing the REQ_OP_ZONE_REPORT operation. The BLKRESETZONE ioctl maps to a single call of the blkdev_reset_zones function. Signed-off-by: Shaun Tancheff Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Reviewed-by: Martin K. Petersen --- block/blk-zoned.c | 93 +++ block/ioctl.c | 4 ++ include/linux/blkdev.h| 21 ++ include/uapi/linux/blkzoned.h | 40 +++ include/uapi/linux/fs.h | 4 ++ 5 files changed, 162 insertions(+) diff --git a/block/blk-zoned.c b/block/blk-zoned.c index 1603573..667f95d 100644 --- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -255,3 +255,96 @@ int blkdev_reset_zones(struct block_device *bdev, return 0; } EXPORT_SYMBOL_GPL(blkdev_reset_zones); + +/** + * BLKREPORTZONE ioctl processing. + * Called from blkdev_ioctl. 
+ */
+int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
+			      unsigned int cmd, unsigned long arg)
+{
+	void __user *argp = (void __user *)arg;
+	struct request_queue *q;
+	struct blk_zone_report rep;
+	struct blk_zone *zones;
+	int ret;
+
+	if (!argp)
+		return -EINVAL;
+
+	q = bdev_get_queue(bdev);
+	if (!q)
+		return -ENXIO;
+
+	if (!blk_queue_is_zoned(q))
+		return -ENOTTY;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	if (copy_from_user(&rep, argp, sizeof(struct blk_zone_report)))
+		return -EFAULT;
+
+	if (!rep.nr_zones)
+		return -EINVAL;
+
+	zones = kcalloc(rep.nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
+	if (!zones)
+		return -ENOMEM;
+
+	ret = blkdev_report_zones(bdev, rep.sector,
+				  zones, &rep.nr_zones,
+				  GFP_KERNEL);
+	if (ret)
+		goto out;
+
+	if (copy_to_user(argp, &rep, sizeof(struct blk_zone_report))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (rep.nr_zones) {
+		if (copy_to_user(argp + sizeof(struct blk_zone_report), zones,
+				 sizeof(struct blk_zone) * rep.nr_zones))
+			ret = -EFAULT;
+	}
+
+ out:
+	kfree(zones);
+
+	return ret;
+}
+
+/**
+ * BLKRESETZONE ioctl processing.
+ * Called from blkdev_ioctl.
+ */
+int blkdev_reset_zones_ioctl(struct block_device *bdev, fmode_t mode,
+			     unsigned int cmd, unsigned long arg)
+{
+	void __user *argp = (void __user *)arg;
+	struct request_queue *q;
+	struct blk_zone_range zrange;
+
+	if (!argp)
+		return -EINVAL;
+
+	q = bdev_get_queue(bdev);
+	if (!q)
+		return -ENXIO;
+
+	if (!blk_queue_is_zoned(q))
+		return -ENOTTY;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	if (!(mode & FMODE_WRITE))
+		return -EBADF;
+
+	if (copy_from_user(&zrange, argp, sizeof(struct blk_zone_range)))
+		return -EFAULT;
+
+	return blkdev_reset_zones(bdev, zrange.sector, zrange.nr_sectors,
+				  GFP_KERNEL);
+}
diff --git a/block/ioctl.c b/block/ioctl.c
index ed2397f..448f78a 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -513,6 +513,10 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 				BLKDEV_DISCARD_SECURE);
 	case BLKZEROOUT:
 		return blk_ioctl_zeroout(bdev, mode, arg);
+	case BLKREPORTZONE:
+		return blkdev_report_zones_ioctl(bdev, mode, cmd, arg);
+	case BLKRESETZONE:
+		return blkdev_reset_zones_ioctl(bdev, mode, cmd, arg);
 	case HDIO_GETGEO:
 		return blkdev_getgeo(bdev, argp);
 	case BLKRAGET:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 252043f..90097dd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -316,6 +316,27 @@ extern int blkdev_report_zones(struct block_device *bdev,
 extern int blkdev_reset_zones(struct block_device *bdev, sector_t sectors,
 			      sector_t nr_sectors, gfp_t gfp_mask);
+extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
+				     unsigned int cmd, unsigned long arg);
+extern int blkdev_reset_zones_ioctl(struct block_device *bdev, fmode_
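From userspace, a BLKREPORTZONE caller has to allocate a single buffer holding the report header immediately followed by the zone array the kernel fills in. The structures below are simplified stand-ins with an illustrative field layout, not the exact uapi definitions from include/uapi/linux/blkzoned.h; only the buffer-sizing logic is the point here, since actually issuing the ioctl needs a real zoned device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct blk_zone (real definition in
 * include/uapi/linux/blkzoned.h). */
struct blk_zone {
	uint64_t start;		/* zone start sector */
	uint64_t len;		/* zone length in sectors */
	uint64_t wp;		/* zone write pointer position */
	uint8_t type;		/* zone type */
	uint8_t cond;		/* zone condition */
	uint8_t reserved[38];	/* illustrative padding only */
};

/* Simplified stand-in for struct blk_zone_report. */
struct blk_zone_report {
	uint64_t sector;	/* in: first sector to report */
	uint32_t nr_zones;	/* in: array size; out: zones reported */
	uint8_t reserved[4];
	/* struct blk_zone zones[] immediately follows */
};

/* Size of the single buffer passed as the BLKREPORTZONE argument:
 * header plus nr_zones zone descriptors. */
static size_t report_buffer_size(uint32_t nr_zones)
{
	return sizeof(struct blk_zone_report) +
	       (size_t)nr_zones * sizeof(struct blk_zone);
}
```

A caller would malloc report_buffer_size(n), set sector and nr_zones in the header, and pass the buffer to ioctl(fd, BLKREPORTZONE, buf); on return, nr_zones holds the number of entries actually filled.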
[PATCH v5 5/7] block: Implement support for zoned block devices
From: Hannes Reinecke

Implement zoned block device zone information reporting and reset.
Zone information is reported as struct blk_zone. This implementation
does not differentiate between host-aware and host-managed device
models and is valid for both. Two functions are provided:
blkdev_report_zones for discovering the zone configuration of a zoned
block device, and blkdev_reset_zones for resetting the write pointer
of sequential zones. The helper functions blk_queue_zone_size and
bdev_zone_size are also provided for, as their names suggest,
obtaining the zone size (in 512B sectors) of the zones of the device.

Signed-off-by: Hannes Reinecke
[Damien: * Removed the zone cache
	 * Implement report zones operation based on earlier proposal
	   by Shaun Tancheff]
Signed-off-by: Damien Le Moal
Reviewed-by: Christoph Hellwig
Reviewed-by: Martin K. Petersen
Reviewed-by: Shaun Tancheff
Tested-by: Shaun Tancheff
---
 block/Kconfig                 |   8 ++
 block/Makefile                |   1 +
 block/blk-zoned.c             | 257 ++
 include/linux/blkdev.h        |  31 +
 include/uapi/linux/Kbuild     |   1 +
 include/uapi/linux/blkzoned.h | 103 +
 6 files changed, 401 insertions(+)
 create mode 100644 block/blk-zoned.c
 create mode 100644 include/uapi/linux/blkzoned.h

diff --git a/block/Kconfig b/block/Kconfig
index 1d4d624..6b0ad08 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -89,6 +89,14 @@ config BLK_DEV_INTEGRITY
 	  T10/SCSI Data Integrity Field or the T13/ATA External Path
 	  Protection. If in doubt, say N.
 
+config BLK_DEV_ZONED
+	bool "Zoned block device support"
+	---help---
+	Block layer zoned block device support. This option enables
+	support for ZAC/ZBC host-managed and host-aware zoned block
+	devices.
+
+	Say yes here if you have a ZAC or ZBC storage device.
+ config BLK_DEV_THROTTLING bool "Block layer bio throttling support" depends on BLK_CGROUP=y diff --git a/block/Makefile b/block/Makefile index 36acdd7..9371bc7 100644 --- a/block/Makefile +++ b/block/Makefile @@ -22,4 +22,5 @@ obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o obj-$(CONFIG_BLK_CMDLINE_PARSER) += cmdline-parser.o obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o +obj-$(CONFIG_BLK_DEV_ZONED)+= blk-zoned.o obj-$(CONFIG_BLK_MQ_PCI) += blk-mq-pci.o diff --git a/block/blk-zoned.c b/block/blk-zoned.c new file mode 100644 index 000..1603573 --- /dev/null +++ b/block/blk-zoned.c @@ -0,0 +1,257 @@ +/* + * Zoned block device handling + * + * Copyright (c) 2015, Hannes Reinecke + * Copyright (c) 2015, SUSE Linux GmbH + * + * Copyright (c) 2016, Damien Le Moal + * Copyright (c) 2016, Western Digital + */ + +#include +#include +#include +#include + +static inline sector_t blk_zone_start(struct request_queue *q, + sector_t sector) +{ + sector_t zone_mask = blk_queue_zone_size(q) - 1; + + return sector & ~zone_mask; +} + +/* + * Check that a zone report belongs to the partition. + * If yes, fix its start sector and write pointer, copy it in the + * zone information array and return true. Return false otherwise. 
+ */ +static bool blkdev_report_zone(struct block_device *bdev, + struct blk_zone *rep, + struct blk_zone *zone) +{ + sector_t offset = get_start_sect(bdev); + + if (rep->start < offset) + return false; + + rep->start -= offset; + if (rep->start + rep->len > bdev->bd_part->nr_sects) + return false; + + if (rep->type == BLK_ZONE_TYPE_CONVENTIONAL) + rep->wp = rep->start + rep->len; + else + rep->wp -= offset; + memcpy(zone, rep, sizeof(struct blk_zone)); + + return true; +} + +/** + * blkdev_report_zones - Get zones information + * @bdev: Target block device + * @sector:Sector from which to report zones + * @zones: Array of zone structures where to return the zones information + * @nr_zones: Number of zone structures in the zone array + * @gfp_mask: Memory allocation flags (for bio_alloc) + * + * Description: + *Get zone information starting from the zone containing @sector. + *The number of zone information reported may be less than the number + *requested by @nr_zones. The number of zones actually reported is + *returned in @nr_zones. + */ +int blkdev_report_zones(struct block_device *bdev, + sector_t sector, + struct blk_zone *zones, + unsigned int *nr_zones, + gfp_t gfp_mask) +{ + struct request_queue *q = bdev_get_queue(bdev); + struct blk_zone_report_hdr *hdr; + unsigned int nrz = *nr_zones; + struct page *page; + unsigned int nr_rep; +
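The partition filtering performed by blkdev_report_zone() above can be restated as a standalone function. The struct and function names below are hypothetical simplifications of the kernel code: device-reported zones are device-relative, so for a partition they must be shifted by the partition start sector and dropped when they fall outside it, and conventional zones get a synthetic write pointer one past their end.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified zone descriptor for illustration. */
struct zone {
	uint64_t start;		/* zone start, device-relative on input */
	uint64_t len;		/* zone length in sectors */
	uint64_t wp;		/* write pointer, device-relative on input */
	int conventional;	/* conventional (non-sequential) zone? */
};

/* Mirror of the blkdev_report_zone() logic: returns 1 and rewrites the
 * zone to partition-relative sectors if it lies inside the partition
 * [part_start, part_start + part_len), 0 otherwise. */
static int remap_zone(struct zone *z, uint64_t part_start, uint64_t part_len)
{
	if (z->start < part_start)
		return 0;

	z->start -= part_start;
	if (z->start + z->len > part_len)
		return 0;

	if (z->conventional)
		z->wp = z->start + z->len;	/* no write pointer: point past the zone */
	else
		z->wp -= part_start;

	return 1;
}
```

For example, a sequential zone starting at device sector 1000 with its write pointer at 1050, seen through a partition starting at sector 1000, remaps to start 0 and write pointer 50.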
[PATCH v5 4/7] block: Define zoned block device operations
From: Shaun Tancheff

Define REQ_OP_ZONE_REPORT and REQ_OP_ZONE_RESET for handling zones of
host-managed and host-aware zoned block devices. With these two new
operations, the total number of operations defined reaches 8 and still
fits within the 3-bit definition of REQ_OP_BITS.

Signed-off-by: Shaun Tancheff
Signed-off-by: Damien Le Moal
Reviewed-by: Christoph Hellwig
Reviewed-by: Martin K. Petersen
---
 block/blk-core.c          | 4
 include/linux/blk_types.h | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 14d7c07..e4eda5d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1941,6 +1941,10 @@ generic_make_request_checks(struct bio *bio)
 	case REQ_OP_WRITE_SAME:
 		if (!bdev_write_same(bio->bi_bdev))
 			goto not_supported;
+	case REQ_OP_ZONE_REPORT:
+	case REQ_OP_ZONE_RESET:
+		if (!bdev_is_zoned(bio->bi_bdev))
+			goto not_supported;
 		break;
 	default:
 		break;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cd395ec..dd50dce 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -243,6 +243,8 @@ enum req_op {
 	REQ_OP_SECURE_ERASE,	/* request to securely erase sectors */
 	REQ_OP_WRITE_SAME,	/* write same block many times */
 	REQ_OP_FLUSH,		/* request for cache flush */
+	REQ_OP_ZONE_REPORT,	/* Get zone information */
+	REQ_OP_ZONE_RESET,	/* Reset a zone write pointer */
 };
 
 #define REQ_OP_BITS 3
-- 
2.7.4
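The "reaches 8 and still fits in 3 bits" claim from the commit message is easy to check with a local copy of the enum. The entries before REQ_OP_SECURE_ERASE are assumed from context (the hunk does not show them), and REQ_OP_NR is a local counting sentinel added here, not part of the patch.

```c
#include <assert.h>

/* Local copy of enum req_op after this patch; the first entries are
 * assumed, since the diff hunk starts at REQ_OP_SECURE_ERASE. */
enum req_op {
	REQ_OP_READ,
	REQ_OP_WRITE,
	REQ_OP_DISCARD,		/* request to discard sectors */
	REQ_OP_SECURE_ERASE,	/* request to securely erase sectors */
	REQ_OP_WRITE_SAME,	/* write same block many times */
	REQ_OP_FLUSH,		/* request for cache flush */
	REQ_OP_ZONE_REPORT,	/* Get zone information */
	REQ_OP_ZONE_RESET,	/* Reset a zone write pointer */
	REQ_OP_NR,		/* local sentinel: number of operations */
};

#define REQ_OP_BITS 3	/* bits reserved for the op code in bi_opf */
```

Eight op codes fit exactly in the 2^3 = 8 values that 3 bits encode, so adding a ninth operation would require widening REQ_OP_BITS.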
[PATCH v5 0/7] ZBC / Zoned block device support
This series introduces support for zoned block devices. It integrates
earlier submissions by Hannes Reinecke and Shaun Tancheff. Compared to
the previous version of the series, the code was significantly
simplified by limiting support to zoned devices satisfying the
following conditions:
1) All zones of the device are the same size, with the possible
   exception of a smaller runt zone at the end of the device.
2) For host-managed disks, reads must be unrestricted (read commands
   do not fail due to zone or write pointer alignment constraints).
Zoned disks that do not satisfy these 2 conditions are ignored.

These 2 conditions allowed dropping the zone information cache
implemented in the previous version. This simplifies the code and also
reduces the memory consumption at run time. Support for zoned devices
now only requires one bit per zone (less than 8KB in total). This bit
field is used to write-lock zones and prevent the concurrent execution
of multiple write commands in the same zone. This avoids write
ordering problems at dispatch time, for both the simple queue and
scsi-mq settings.

The set of new operations introduced to support zone manipulation was
reduced to the two main ZBC/ZAC defined commands: REPORT ZONES
(REQ_OP_ZONE_REPORT) and RESET WRITE POINTER (REQ_OP_ZONE_RESET).
This brings the total number of operations defined to 8, which fits in
the 3 bits (REQ_OP_BITS) reserved for the operation code in
bio->bi_opf and req->cmd_flags.

Most of the ZBC specific code is kept out of sd.c and implemented in
the new file sd_zbc.c. Similarly, at the block layer, most of the
zoned block device code is implemented in the new file blk-zoned.c.

For host-managed zoned block devices, the sequential write constraint
of write pointer zones is exposed to the user. Users of the disk
(applications, file systems or device mappers) must sequentially write
to zones.
This means that for raw block device accesses from applications,
buffered writes are unreliable and direct I/Os must be used (or
buffered writes with O_SYNC). Access to zone manipulation operations
is also provided to applications through a set of new ioctls. This
allows applications operating on raw block devices (e.g. mkfs.xxx) to
discover a device zone layout and manipulate zone state.

Changes from v4:
* Changed interface of sd_zbc_setup_read_write

Changes from v3:
* Fixed several typos and tabs/spaces
* Added description of the zoned and chunk_sectors queue attributes in
  Documentation/ABI/testing/sysfs-block
* Fixed sd_read_capacity call in sd.c to avoid missing information on
  the first pass of a disk scan
* Fixed scsi_disk zone related fields to use logical block size units
  instead of 512B sector units

Changes from v2:
* Use kcalloc to allocate zone information array for ioctl
* Export GPL the functions blkdev_report_zones and blkdev_reset_zones
* Shuffled uapi definitions from patch 7 into patch 5

Damien Le Moal (1):
  block: Add 'zoned' queue limit

Hannes Reinecke (4):
  blk-sysfs: Add 'chunk_sectors' to sysfs attributes
  block: update chunk_sectors in blk_stack_limits()
  block: Implement support for zoned block devices
  sd: Implement support for ZBC devices

Shaun Tancheff (2):
  block: Define zoned block device operations
  blk-zoned: implement ioctls

 Documentation/ABI/testing/sysfs-block |  29 ++
 block/Kconfig                         |   8 +
 block/Makefile                        |   1 +
 block/blk-core.c                      |   4 +
 block/blk-settings.c                  |   5 +
 block/blk-sysfs.c                     |  29 ++
 block/blk-zoned.c                     | 350 +++
 block/ioctl.c                         |   4 +
 drivers/scsi/Makefile                 |   1 +
 drivers/scsi/sd.c                     | 141 ++--
 drivers/scsi/sd.h                     |  67
 drivers/scsi/sd_zbc.c                 | 627 ++
 include/linux/blk_types.h             |   2 +
 include/linux/blkdev.h                |  99 ++
 include/scsi/scsi_proto.h             |  17 +
 include/uapi/linux/Kbuild             |   1 +
 include/uapi/linux/blkzoned.h         | 143
 include/uapi/linux/fs.h               |   4 +
 18 files changed, 1499 insertions(+), 33 deletions(-)
 create mode 100644 block/blk-zoned.c
 create mode 100644 drivers/scsi/sd_zbc.c
 create mode 100644 include/uapi/linux/blkzoned.h
-- 
2.7.4
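A quick back-of-the-envelope check of the cover letter's "one bit per zone, less than 8KB in total" claim, using an assumed 8TB disk with 256MB zones (typical SMR figures; both values are assumptions, not taken from the series):

```c
#include <assert.h>

/* Size in bytes of a write-lock bitmap holding one bit per zone, for a
 * disk of capacity_bytes split into zones of zone_bytes each (the last
 * zone may be a smaller runt, hence the round-up). */
static unsigned long long zone_bitmap_bytes(unsigned long long capacity_bytes,
					    unsigned long long zone_bytes)
{
	unsigned long long nr_zones =
		(capacity_bytes + zone_bytes - 1) / zone_bytes;

	return (nr_zones + 7) / 8;	/* one bit per zone, rounded up */
}
```

An 8TB disk with 256MB zones has 32768 zones, so the bitmap is 4KB, comfortably under the 8KB mentioned in the cover letter.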
Re: [PATCH v4 6/7] sd: Implement support for ZBC devices
> "Damien" == Damien Le Moal writes:

Damien,

Almost there! And A-OK on the read capacity changes. However:

@@ -844,6 +850,13 @@ static int sd_setup_write_same_cmnd(struct scsi_cmnd *cmd)
 	BUG_ON(bio_offset(bio) || bio_iovec(bio).bv_len != sdp->sector_size);

+	if (sd_is_zoned(sdkp)) {
+		/* sd_zbc_setup_read_write uses block layer sector units */

That comment really says: "I am doing confusing stuff that doesn't follow the normal calling convention in the driver". Plus it's another case of using block layer sectors where they shouldn't be.

Please just pass the scsi_cmnd to sd_zbc_setup_read_write() like it's done for sd_zbc_setup_reset_cmnd() and the regular sd_setup_* calls. And then no commentary is necessary...

+		ret = sd_zbc_setup_read_write(sdkp, rq, sector, nr_sectors);
+		if (ret != BLKPREP_OK)
+			return ret;
+	}
+

@@ -963,6 +976,13 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd *SCpnt)
 	SCSI_LOG_HLQUEUE(2, scmd_printk(KERN_INFO, SCpnt,
 					"block=%llu\n", (unsigned long long)block));

+	if (sd_is_zoned(sdkp)) {
+		/* sd_zbc_setup_read_write uses block layer sector units */
+		ret = sd_zbc_setup_read_write(sdkp, rq, block, this_count);
+		if (ret != BLKPREP_OK)
+			goto out;
+	}
+

Thanks!

-- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH v4 5/7] block: Implement support for zoned block devices
> "Damien" == Damien Le Moal writes: Damien> Implement zoned block device zone information reporting and Damien> reset. Zone information is reported as struct blk_zone. This Damien> implementation does not differentiate between host-aware and Damien> host-managed device models and is valid for both. Two functions Damien> are provided: blkdev_report_zones for discovering the zone Damien> configuration of a zoned block device, and blkdev_reset_zones Damien> for resetting the write pointer of sequential zones. The helper Damien> functions blk_queue_zone_size and bdev_zone_size are also Damien> provided for, as the names suggest, obtaining the zone size (in Damien> 512B sectors) of the zones of the device. Reviewed-by: Martin K. Petersen -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH v4 4/7] block: Define zoned block device operations
> "Damien" == Damien Le Moal writes: Reviewed-by: Martin K. Petersen -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH v4 3/7] block: update chunk_sectors in blk_stack_limits()
> "Damien" == Damien Le Moal writes: Reviewed-by: Martin K. Petersen -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH v4 2/7] blk-sysfs: Add 'chunk_sectors' to sysfs attributes
> "Damien" == Damien Le Moal writes: Damien> From: Hannes Reinecke The queue limits already Damien> have a 'chunk_sectors' setting, so we should be presenting it Damien> via sysfs. Reviewed-by: Martin K. Petersen -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH v4 1/7] block: Add 'zoned' queue limit
> "Damien" == Damien Le Moal writes: Damien> Add the zoned queue limit to indicate the zoning model of a Damien> block device. Defined values are 0 (BLK_ZONED_NONE) for regular Damien> block devices, 1 (BLK_ZONED_HA) for host-aware zoned block Damien> devices and 2 (BLK_ZONED_HM) for host-managed zoned block Damien> devices. The standards-defined drive-managed model is not Damien> defined here since these block devices do not provide any Damien> command for accessing zone information. Drive-managed devices Damien> will be reported as BLK_ZONED_NONE. Damien> The helper functions blk_queue_zoned_model and bdev_zoned_model Damien> return the zoned limit and the functions blk_queue_is_zoned and Damien> bdev_is_zoned return a boolean for callers to test if a block Damien> device is zoned. Damien> The zoned attribute is also exported as a string to applications Damien> via sysfs. BLK_ZONED_NONE shows as "none", BLK_ZONED_HA as Damien> "host-aware" and BLK_ZONED_HM as "host-managed". Reviewed-by: Martin K. Petersen -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH v2 4/7] blk-mq: Introduce blk_quiesce_queue() and blk_resume_queue()
On Thu, Sep 29, 2016 at 7:59 AM, Bart Van Assche wrote: > blk_quiesce_queue() prevents that new queue_rq() invocations blk_mq_quiesce_queue() > occur and waits until ongoing invocations have finished. This > function does *not* wait until all outstanding requests have I guess it still may wait finally since this way may exhaust tag space easily, :-) > finished (this means invocation of request.end_io()). > blk_resume_queue() resumes normal I/O processing. > > Signed-off-by: Bart Van Assche > Cc: Hannes Reinecke > Cc: Johannes Thumshirn > --- > block/blk-mq.c | 137 > - > include/linux/blk-mq.h | 2 + > include/linux/blkdev.h | 5 ++ > 3 files changed, 131 insertions(+), 13 deletions(-) > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index d8c45de..f5c71ad 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -58,15 +58,23 @@ static void blk_mq_hctx_clear_pending(struct > blk_mq_hw_ctx *hctx, > sbitmap_clear_bit(&hctx->ctx_map, ctx->index_hw); > } > > -void blk_mq_freeze_queue_start(struct request_queue *q) > +static bool __blk_mq_freeze_queue_start(struct request_queue *q, > + bool kill_percpu_ref) > { > int freeze_depth; > > freeze_depth = atomic_inc_return(&q->mq_freeze_depth); > if (freeze_depth == 1) { > - percpu_ref_kill(&q->q_usage_counter); > + if (kill_percpu_ref) > + percpu_ref_kill(&q->q_usage_counter); > blk_mq_run_hw_queues(q, false); > } > + return freeze_depth == 1; > +} > + > +void blk_mq_freeze_queue_start(struct request_queue *q) > +{ > + __blk_mq_freeze_queue_start(q, true); > } > EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start); > > @@ -102,19 +110,90 @@ void blk_mq_freeze_queue(struct request_queue *q) > } > EXPORT_SYMBOL_GPL(blk_mq_freeze_queue); > > -void blk_mq_unfreeze_queue(struct request_queue *q) > +static bool __blk_mq_unfreeze_queue(struct request_queue *q, > + bool reinit_percpu_ref) > { > int freeze_depth; > > freeze_depth = atomic_dec_return(&q->mq_freeze_depth); > WARN_ON_ONCE(freeze_depth < 0); > if (!freeze_depth) { > - 
percpu_ref_reinit(&q->q_usage_counter); > + if (reinit_percpu_ref) > + percpu_ref_reinit(&q->q_usage_counter); > wake_up_all(&q->mq_freeze_wq); > } > + return freeze_depth == 0; > +} > + > +void blk_mq_unfreeze_queue(struct request_queue *q) > +{ > + __blk_mq_unfreeze_queue(q, true); > } > EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue); > > +/** > + * blk_mq_quiesce_queue() - wait until all pending queue_rq calls have > finished > + * > + * Prevent that new I/O requests are queued and wait until all pending > + * queue_rq() calls have finished. Must not be called if the queue has > already > + * been frozen. Additionally, freezing the queue after having quiesced the > + * queue and before resuming the queue is not allowed. > + * > + * Note: this function does not prevent that the struct request end_io() > + * callback function is invoked. > + */ > +void blk_mq_quiesce_queue(struct request_queue *q) > +{ > + struct blk_mq_hw_ctx *hctx; > + unsigned int i; > + bool res, rcu = false; > + > + spin_lock_irq(q->queue_lock); > + WARN_ON_ONCE(blk_queue_quiescing(q)); > + queue_flag_set(QUEUE_FLAG_QUIESCING, q); > + spin_unlock_irq(q->queue_lock); > + > + res = __blk_mq_freeze_queue_start(q, false); The implementation looks a bit tricky and complicated: if the percpu usage counter isn't killed, it isn't necessary to touch .mq_freeze_depth, since you use QUEUE_FLAG_QUIESCING to set/get this status of the queue. Then using synchronize_rcu() and rcu_read_lock()/rcu_read_unlock() together with the QUIESCING flag may be enough to wait for completion of ongoing invocations of .queue_rq() and to avoid running new ones, right? > + WARN_ON_ONCE(!res); > + queue_for_each_hw_ctx(q, hctx, i) { > + if (hctx->flags & BLK_MQ_F_BLOCKING) { > + mutex_lock(&hctx->queue_rq_mutex); > + mutex_unlock(&hctx->queue_rq_mutex); Could you explain a bit why the BLOCKING case is treated so specially?
And that flag just means the hw queue always needs to be scheduled asynchronously, and of course even when the flag isn't set, it may still be run asynchronously too. So it looks like it should be enough to just use RCU. > + } else { > + rcu = true; > + } > + } > + if (rcu) > + synchronize_rcu(); > + > + spin_lock_irq(q->queue_lock); > + WARN_ON_ONCE(!blk_queue_quiescing(q)); > + queue_flag_clear(QUEUE_FLAG_QUIESCING, q); > + spin_unlock_irq(q->queue_lock); > +} > +EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue); > + > +/** > +
[PATCH 3/3] block: implement (some of) fallocate for block devices
After much discussion, it seems that the fallocate feature flag FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been whitelisted for zeroing SCSI UNMAP. Punch still requires that FALLOC_FL_KEEP_SIZE is set. A length that goes past the end of the device will be clamped to the device size if KEEP_SIZE is set; or will return -EINVAL if not. Both start and length must be aligned to the device's logical block size. Since the semantics of fallocate are fairly well established already, wire up the two pieces. The other fallocate variants (collapse range, insert range, and allocate blocks) are not supported. Signed-off-by: Darrick J. Wong Reviewed-by: Hannes Reinecke Reviewed-by: Bart Van Assche --- fs/block_dev.c | 77 fs/open.c | 3 +- 2 files changed, 79 insertions(+), 1 deletion(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 08ae993..777fd9b 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -30,6 +30,7 @@ #include #include #include +#include #include #include "internal.h" @@ -1787,6 +1788,81 @@ static const struct address_space_operations def_blk_aops = { .is_dirty_writeback = buffer_check_dirty_writeback, }; +#define BLKDEV_FALLOC_FL_SUPPORTED \ + (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | \ +FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE) + +static long blkdev_fallocate(struct file *file, int mode, loff_t start, +loff_t len) +{ + struct block_device *bdev = I_BDEV(bdev_file_inode(file)); + struct request_queue *q = bdev_get_queue(bdev); + struct address_space *mapping; + loff_t end = start + len - 1; + loff_t isize; + int error; + + /* Fail if we don't recognize the flags. */ + if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED) + return -EOPNOTSUPP; + + /* Don't go off the end of the device. 
*/ + isize = i_size_read(bdev->bd_inode); + if (start >= isize) + return -EINVAL; + if (end >= isize) { + if (mode & FALLOC_FL_KEEP_SIZE) { + len = isize - start; + end = start + len - 1; + } else + return -EINVAL; + } + + /* +* Don't allow IO that isn't aligned to logical block size. +*/ + if ((start | len) & (bdev_logical_block_size(bdev) - 1)) + return -EINVAL; + + /* Invalidate the page cache, including dirty pages. */ + mapping = bdev->bd_inode->i_mapping; + truncate_inode_pages_range(mapping, start, end); + + switch (mode) { + case FALLOC_FL_ZERO_RANGE: + case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE: + error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9, + GFP_KERNEL, false); + break; + case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE: + /* Only punch if the device can do zeroing discard. */ + if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data) + return -EOPNOTSUPP; + error = blkdev_issue_discard(bdev, start >> 9, len >> 9, +GFP_KERNEL, 0); + break; + case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE: + if (!blk_queue_discard(q)) + return -EOPNOTSUPP; + error = blkdev_issue_discard(bdev, start >> 9, len >> 9, +GFP_KERNEL, 0); + break; + default: + return -EOPNOTSUPP; + } + if (error) + return error; + + /* +* Invalidate again; if someone wandered in and dirtied a page, +* the caller will be given -EBUSY. The third argument is +* inclusive, so the rounding here is safe. 
+*/ + return invalidate_inode_pages2_range(mapping, +start >> PAGE_SHIFT, +end >> PAGE_SHIFT); +} + const struct file_operations def_blk_fops = { .open = blkdev_open, .release= blkdev_close, @@ -1801,6 +1877,7 @@ const struct file_operations def_blk_fops = { #endif .splice_read= generic_file_splice_read, .splice_write = iter_file_splice_write, + .fallocate = blkdev_fallocate, }; int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg) diff --git a/fs/open.c b/fs/open.c index 4fd6e25..01b6092 100644 --- a/fs/open.c +++ b/fs/open.c @@ -289,7 +289,8 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) * Let individual file system decide if it s
[PATCH 2/3] block: require write_same and discard requests align to logical block size
Make sure that the offset and length arguments that we're using to construct WRITE SAME and DISCARD requests are actually aligned to the logical block size. Failure to do this causes other errors in other parts of the block layer or the SCSI layer because disks don't support partial logical block writes. Signed-off-by: Darrick J. Wong Reviewed-by: Christoph Hellwig Reviewed-by: Bart Van Assche Reviewed-by: Martin K. Petersen Reviewed-by: Hannes Reinecke --- block/blk-lib.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/block/blk-lib.c b/block/blk-lib.c index 083e56f..46fe924 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -31,6 +31,7 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, unsigned int granularity; enum req_op op; int alignment; + sector_t bs_mask; if (!q) return -ENXIO; @@ -50,6 +51,10 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, op = REQ_OP_DISCARD; } + bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1; + if ((sector | nr_sects) & bs_mask) + return -EINVAL; + /* Zero-sector (unknown) and one-sector granularities are the same. 
*/ granularity = max(q->limits.discard_granularity >> 9, 1U); alignment = (bdev_discard_alignment(bdev) >> 9) % granularity; @@ -150,10 +155,15 @@ int blkdev_issue_write_same(struct block_device *bdev, sector_t sector, unsigned int max_write_same_sectors; struct bio *bio = NULL; int ret = 0; + sector_t bs_mask; if (!q) return -ENXIO; + bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1; + if ((sector | nr_sects) & bs_mask) + return -EINVAL; + /* Ensure that max_write_same_sectors doesn't overflow bi_size */ max_write_same_sectors = UINT_MAX >> 9; @@ -202,6 +212,11 @@ static int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, int ret; struct bio *bio = NULL; unsigned int sz; + sector_t bs_mask; + + bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1; + if ((sector | nr_sects) & bs_mask) + return -EINVAL; while (nr_sects != 0) { bio = next_bio(bio, min(nr_sects, (sector_t)BIO_MAX_PAGES),
[PATCH 1/3] block: invalidate the page cache when issuing BLKZEROOUT.
Invalidate the page cache (as a regular O_DIRECT write would do) to avoid returning stale cache contents at a later time. Signed-off-by: Darrick J. Wong Reviewed-by: Christoph Hellwig Reviewed-by: Martin K. Petersen Reviewed-by: Bart Van Assche Reviewed-by: Hannes Reinecke --- block/ioctl.c | 18 -- 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/block/ioctl.c b/block/ioctl.c index ed2397f..755119c 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -225,7 +225,8 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode, unsigned long arg) { uint64_t range[2]; - uint64_t start, len; + struct address_space *mapping; + uint64_t start, end, len; if (!(mode & FMODE_WRITE)) return -EBADF; @@ -235,18 +236,23 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode, start = range[0]; len = range[1]; + end = start + len - 1; if (start & 511) return -EINVAL; if (len & 511) return -EINVAL; - start >>= 9; - len >>= 9; - - if (start + len > (i_size_read(bdev->bd_inode) >> 9)) + if (end >= (uint64_t)i_size_read(bdev->bd_inode)) + return -EINVAL; + if (end < start) return -EINVAL; - return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL, false); + /* Invalidate the page cache, including dirty pages */ + mapping = bdev->bd_inode->i_mapping; + truncate_inode_pages_range(mapping, start, end); + + return blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL, + false); } static int put_ushort(unsigned long arg, unsigned short val)
[PATCH v11 0/3] fallocate for block devices
Hi Andrew & others, This is a resend of the patchset to fix page cache coherency with BLKZEROOUT and implement fallocate for block devices. This time I'm sending it direct to Andrew for inclusion because the block layer maintainer has not been responsive over the past year of submissions. Can this please go upstream for 4.9? The first patch is still a fix to the existing BLKZEROOUT ioctl to invalidate the page cache if the zeroing command to the underlying device succeeds. Without this patch we still have the pagecache coherence bug that's been in the kernel forever. The second patch changes the internal block device functions to reject attempts to discard or zeroout that are not aligned to the logical block size. Previously, we only checked that the start/len parameters were 512-byte aligned, which caused kernel BUG_ONs for unaligned IOs to 4k-LBA devices. The third patch creates an fallocate handler for block devices, wires up the FALLOC_FL_PUNCH_HOLE flag to zeroing-discard, and connects FALLOC_FL_ZERO_RANGE to write-same so that we can have a consistent fallocate interface between files and block devices. It also allows the combination of PUNCH_HOLE and NO_HIDE_STALE to invoke non-zeroing discard. Test cases for the new block device fallocate are now in xfstests as generic/349-351. Comments and questions are, as always, welcome. Patches are against 4.8-rc8. v7: Strengthen parameter checking and fix various code issues pointed out by Linus and Christoph. v8: More code rearranging, rebase to 4.6-rc3, and dig into alignment issues. v9: Forward port to 4.7. v10: Forward port to 4.8. Remove the extra call to invalidate_inode_pages2_range per Bart Van Assche's request. v11: Minor fallocate code cleanups suggested by Bart Van Assche. --D
Re: [PATCH v2 3/3] block: implement (some of) fallocate for block devices
On Thu, Sep 29, 2016 at 01:08:57PM -0700, Bart Van Assche wrote: > On 09/28/2016 07:19 PM, Darrick J. Wong wrote: > >After much discussion, it seems that the fallocate feature flag > >FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature > >FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been > >whitelisted for zeroing SCSI UNMAP. Punch still requires that > >FALLOC_FL_KEEP_SIZE is set. A length that goes past the end of the > >device will be clamped to the device size if KEEP_SIZE is set; or will > >return -EINVAL if not. Both start and length must be aligned to the > >device's logical block size. > > > >Since the semantics of fallocate are fairly well established already, > >wire up the two pieces. The other fallocate variants (collapse range, > >insert range, and allocate blocks) are not supported. > > For the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE > case, it's probably safer not to try to send a discard to block devices that > do not support discard in order not to hit block driver bugs. But that's > something we can still discuss later. Hence: I'll just change it to check the queue flags and post a new revision. At this point I might as well repost the whole thing to reflect the reviewed-bys. --D > Reviewed-by: Bart Van Assche
Re: [PATCH v2 3/3] block: implement (some of) fallocate for block devices
On 09/28/2016 07:19 PM, Darrick J. Wong wrote: After much discussion, it seems that the fallocate feature flag FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been whitelisted for zeroing SCSI UNMAP. Punch still requires that FALLOC_FL_KEEP_SIZE is set. A length that goes past the end of the device will be clamped to the device size if KEEP_SIZE is set; or will return -EINVAL if not. Both start and length must be aligned to the device's logical block size. Since the semantics of fallocate are fairly well established already, wire up the two pieces. The other fallocate variants (collapse range, insert range, and allocate blocks) are not supported. For the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE case, it's probably safer not to try to send a discard to block devices that do not support discard in order not to hit block driver bugs. But that's something we can still discuss later. Hence: Reviewed-by: Bart Van Assche
Re: [PATCH 4/6] vfio_pci: use pci_irq_allocate_vectors
On Thu, 29 Sep 2016 21:24:04 +0200 Christoph Hellwig wrote: > On Thu, Sep 29, 2016 at 01:21:01PM -0600, Alex Williamson wrote: > > Sorry for the delay, slipped by me. Overall a really nice cleanup. > > One tiny nit, the commit log mis-names the function as > > pci_irq_allocate_vectors instead of pci_alloc_irq_vectors. With that, > > > > Acked-by: Alex Williamson > > > > Let me know if you're wanting me to pull this through my tree, I'm > > assuming not. Thanks, > > Please pull in through your tree. If you can also just fix up that > typo that'd be even better. Will do. Thanks, Alex
Re: [PATCH 4/6] vfio_pci: use pci_irq_allocate_vectors
On Thu, Sep 29, 2016 at 01:21:01PM -0600, Alex Williamson wrote: > Sorry for the delay, slipped by me. Overall a really nice cleanup. > One tiny nit, the commit log mis-names the function as > pci_irq_allocate_vectors instead of pci_alloc_irq_vectors. With that, > > Acked-by: Alex Williamson > > Let me know if you're wanting me to pull this through my tree, I'm > assuming not. Thanks, Please pull in through your tree. If you can also just fix up that typo that'd be even better. Thanks a lot!
Re: [PATCH 4/6] vfio_pci: use pci_irq_allocate_vectors
On Sun, 11 Sep 2016 15:31:26 +0200 Christoph Hellwig wrote: > Simplify the interrupt setup by using the new PCI layer helpers. > > Signed-off-by: Christoph Hellwig > --- > drivers/vfio/pci/vfio_pci_intrs.c | 45 > + > drivers/vfio/pci/vfio_pci_private.h | 1 - > 2 files changed, 10 insertions(+), 36 deletions(-) Sorry for the delay, slipped by me. Overall a really nice cleanup. One tiny nit, the commit log mis-names the function as pci_irq_allocate_vectors instead of pci_alloc_irq_vectors. With that, Acked-by: Alex Williamson Let me know if you're wanting me to pull this through my tree, I'm assuming not. Thanks, Alex > diff --git a/drivers/vfio/pci/vfio_pci_intrs.c > b/drivers/vfio/pci/vfio_pci_intrs.c > index 152b438..a1d283e 100644 > --- a/drivers/vfio/pci/vfio_pci_intrs.c > +++ b/drivers/vfio/pci/vfio_pci_intrs.c > @@ -250,6 +250,7 @@ static irqreturn_t vfio_msihandler(int irq, void *arg) > static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix) > { > struct pci_dev *pdev = vdev->pdev; > + unsigned int flag = msix ? 
PCI_IRQ_MSIX : PCI_IRQ_MSI; > int ret; > > if (!is_irq_none(vdev)) > @@ -259,35 +260,13 @@ static int vfio_msi_enable(struct vfio_pci_device > *vdev, int nvec, bool msix) > if (!vdev->ctx) > return -ENOMEM; > > - if (msix) { > - int i; > - > - vdev->msix = kzalloc(nvec * sizeof(struct msix_entry), > - GFP_KERNEL); > - if (!vdev->msix) { > - kfree(vdev->ctx); > - return -ENOMEM; > - } > - > - for (i = 0; i < nvec; i++) > - vdev->msix[i].entry = i; > - > - ret = pci_enable_msix_range(pdev, vdev->msix, 1, nvec); > - if (ret < nvec) { > - if (ret > 0) > - pci_disable_msix(pdev); > - kfree(vdev->msix); > - kfree(vdev->ctx); > - return ret; > - } > - } else { > - ret = pci_enable_msi_range(pdev, 1, nvec); > - if (ret < nvec) { > - if (ret > 0) > - pci_disable_msi(pdev); > - kfree(vdev->ctx); > - return ret; > - } > + /* return the number of supported vectors if we can't get all: */ > + ret = pci_alloc_irq_vectors(pdev, 1, nvec, flag); > + if (ret < nvec) { > + if (ret > 0) > + pci_free_irq_vectors(pdev); > + kfree(vdev->ctx); > + return ret; > } > > vdev->num_ctx = nvec; > @@ -315,7 +294,7 @@ static int vfio_msi_set_vector_signal(struct > vfio_pci_device *vdev, > if (vector < 0 || vector >= vdev->num_ctx) > return -EINVAL; > > - irq = msix ? 
vdev->msix[vector].vector : pdev->irq + vector; > + irq = pci_irq_vector(pdev, vector); > > if (vdev->ctx[vector].trigger) { > free_irq(irq, vdev->ctx[vector].trigger); > @@ -408,11 +387,7 @@ static void vfio_msi_disable(struct vfio_pci_device > *vdev, bool msix) > > vfio_msi_set_block(vdev, 0, vdev->num_ctx, NULL, msix); > > - if (msix) { > - pci_disable_msix(vdev->pdev); > - kfree(vdev->msix); > - } else > - pci_disable_msi(pdev); > + pci_free_irq_vectors(pdev); > > vdev->irq_type = VFIO_PCI_NUM_IRQS; > vdev->num_ctx = 0; > diff --git a/drivers/vfio/pci/vfio_pci_private.h > b/drivers/vfio/pci/vfio_pci_private.h > index 2128de8..f561ac1 100644 > --- a/drivers/vfio/pci/vfio_pci_private.h > +++ b/drivers/vfio/pci/vfio_pci_private.h > @@ -72,7 +72,6 @@ struct vfio_pci_device { > struct perm_bits *msi_perm; > spinlock_t irqlock; > struct mutex igate; > - struct msix_entry *msix; > struct vfio_pci_irq_ctx *ctx; > int num_ctx; > int irq_type;
Re: [PATCH 5/6] genwqe: use pci_irq_allocate_vectors
Christoph Hellwig writes: > On Thu, Sep 29, 2016 at 03:45:29PM -0300, Gabriel Krisman Bertazi wrote: >> I'm stepping up to assist with the genwqe_card driver just now, since we >> (ibm) missed some of the last patches that went in. I'll add myself to >> maintainers file. > > Can you forward it to Greg together with whatever other changes are > pending for the driver? sure, will do. -- Gabriel Krisman Bertazi
Re: [PATCH 5/6] genwqe: use pci_irq_allocate_vectors
On Thu, Sep 29, 2016 at 03:45:29PM -0300, Gabriel Krisman Bertazi wrote: > I'm stepping up to assist with the genwqe_card driver just now, since we > (ibm) missed some of the last patches that went in. I'll add myself to > maintainers file. Can you forward it to Greg together with whatever other changes are pending for the driver?
Re: [PATCH 5/6] genwqe: use pci_irq_allocate_vectors
Christoph Hellwig writes: > On Thu, Sep 29, 2016 at 03:28:02PM -0300, Gabriel Krisman Bertazi wrote: >> Christoph Hellwig writes: >> >> > Simplify the interrupt setup by using the new PCI layer helpers. >> >> Good clean up. Tested and: >> >> Acked-by: Gabriel Krisman Bertazi > > Which tree should this go in through? I'd say Greg's char-misc tree. I'm stepping up to assist with the genwqe_card driver just now, since we (ibm) missed some of the last patches that went in. I'll add myself to maintainers file. -- Gabriel Krisman Bertazi
Re: [PATCH 3/6] skd: use pci_alloc_irq_vectors
On Sun, Sep 11, 2016 at 03:31:25PM +0200, Christoph Hellwig wrote: > Switch the skd driver to use pci_alloc_irq_vectors. We need two calls to > pci_alloc_irq_vectors as skd only supports multiple MSI-X vectors, but not > multiple MSI vectors. > > Otherwise this cleans up a lot of cruft and allows a lot more common code. > > Signed-off-by: Christoph Hellwig Jens, is this something you could pick up for 4.9? I'd like to get rid of all users of pci_enable_msi_range in this merge window.
Re: [PATCH 5/6] genwqe: use pci_irq_allocate_vectors
On Thu, Sep 29, 2016 at 03:28:02PM -0300, Gabriel Krisman Bertazi wrote: > Christoph Hellwig writes: > > > Simplify the interrupt setup by using the new PCI layer helpers. > > Good clean up. Tested and: > > Acked-by: Gabriel Krisman Bertazi Which tree should this go in through?
Re: [PATCH 5/6] genwqe: use pci_irq_allocate_vectors
Christoph Hellwig writes: > Simplify the interrupt setup by using the new PCI layer helpers. Good clean up. Tested and: Acked-by: Gabriel Krisman Bertazi > One odd thing about this driver is that it looks like it could request > multiple MSI vectors, but it will then only ever use a single one. I'll take a look at this. -- Gabriel Krisman Bertazi
Re: [Nbd] [PATCH][V3] nbd: add multi-connection support
On 09/29/2016 12:41 PM, Wouter Verhelst wrote: On Thu, Sep 29, 2016 at 10:03:50AM -0400, Josef Bacik wrote: So think of it like normal disks with multiple channels. We don't send flushes down all the hwq's to make sure they are clear, we leave that decision up to the application (usually a FS of course). Well, when I asked earlier, Christoph said[1] that blk-mq assumes that when a FLUSH is sent over one channel, and the reply comes in, that all commands which have been received, regardless of which channel they were received over, have reached disk. [1] Message-ID: <20160915122304.ga15...@infradead.org> It is impossible for nbd to make such a guarantee, due to head-of-line blocking on TCP. Huh, I missed that. Yeah, that's not possible for us for sure. I think my option idea is the less awful way forward if we want to address that limitation. Thanks, Josef
Re: [Nbd] [PATCH][V3] nbd: add multi-connection support
On Thu, Sep 29, 2016 at 10:03:50AM -0400, Josef Bacik wrote:
> So think of it like normal disks with multiple channels. We don't send
> flushes down all the hwq's to make sure they are clear, we leave that
> decision up to the application (usually a FS of course).

Well, when I asked earlier, Christoph said[1] that blk-mq assumes that when
a FLUSH is sent over one channel, and the reply comes in, that all commands
which have been received, regardless of which channel they were received
over, have reached disk.

[1] Message-ID: <20160915122304.ga15...@infradead.org>

It is impossible for nbd to make such a guarantee, due to head-of-line
blocking on TCP.

[...]

> perhaps we could add a flag later that says send all the flushes down
> all the connections for the paranoid, it should be relatively
> straightforward to do. Thanks,

That's not an unreasonable approach, I guess.

--
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
  -- #debian-devel, OFTC, 2016-02-12
Re: [PATCH] blkcg: Unlock blkcg_pol_mutex once if cpd == NULL
On 09/29/2016 03:03 AM, Tejun Heo wrote:
> Hello,
>
> On Mon, Sep 26, 2016 at 03:36:25PM -0700, Bart Van Assche wrote:
> > Unlocking a mutex twice is wrong. Hence modify blkcg_policy_register()
> > such that blkcg_pol_mutex is unlocked once if cpd == NULL. This patch
> > avoids that smatch reports the following error:
> >
> > block/blk-cgroup.c:1378: blkcg_policy_register() error: double unlock
> > 'mutex:&blkcg_pol_mutex'
> >
> > Signed-off-by: Bart Van Assche
> > Cc: Tejun Heo
> > Cc:
> > ---
> >  block/blk-cgroup.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index dd38e5c..cdbca1c 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -1327,8 +1327,10 @@ int blkcg_policy_register(struct blkcg_policy *pol)
> >  	for (i = 0; i < BLKCG_MAX_POLS; i++)
> >  		if (!blkcg_policy[i])
> >  			break;
> > -	if (i >= BLKCG_MAX_POLS)
> > +	if (i >= BLKCG_MAX_POLS) {
> > +		mutex_unlock(&blkcg_pol_mutex);
> >  		goto err_unlock;
> > +	}
>
> Wouldn't it be better to drop the explicit mutex_unlock(&blkcg_pol_mutex)
> on "if (!cpd)"?

Agreed. I will rework this patch accordingly. Thanks for the feedback.

Bart.
[PATCH v2] blkcg: Unlock blkcg_pol_mutex once if cpd == NULL
Unlocking a mutex twice is wrong. Hence modify blkcg_policy_register()
such that blkcg_pol_mutex is unlocked once if cpd == NULL. This patch
avoids that smatch reports the following error:

block/blk-cgroup.c:1378: blkcg_policy_register() error: double unlock
'mutex:&blkcg_pol_mutex'

Fixes: 06b285bd1125 ("blkcg: fix blkcg_policy_data allocation bug")
Signed-off-by: Bart Van Assche
Cc: Tejun Heo
Cc:
---
 block/blk-cgroup.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index dd38e5c..b08ccbb 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1340,10 +1340,8 @@ int blkcg_policy_register(struct blkcg_policy *pol)
 		struct blkcg_policy_data *cpd;

 		cpd = pol->cpd_alloc_fn(GFP_KERNEL);
-		if (!cpd) {
-			mutex_unlock(&blkcg_pol_mutex);
+		if (!cpd)
 			goto err_free_cpds;
-		}

 		blkcg->cpd[pol->plid] = cpd;
 		cpd->blkcg = blkcg;
--
2.10.0
Re: [PATCH 2/6] ipr: use pci_irq_allocate_vectors
On Thu, Sep 29, 2016 at 09:01:44AM -0500, Brian King wrote:
> Thanks Christoph. Very nice. As I was reviewing the patch, I noticed
> the additional PCI_IRQ_AFFINITY flag, which is currently not being set
> in this patch. Is the intention to set that globally by default, or
> should I follow up with a one liner to add that to the ipr driver
> in the next patch set I send out?

Hi Brian,

PCI_IRQ_AFFINITY seems useful for ipr, especially if you also increase
the number of vectors above the current default of 2. And yes, please
make it a separate patch.
Re: [Nbd] [PATCH][V3] nbd: add multi-connection support
On 09/29/2016 05:52 AM, Wouter Verhelst wrote:
> Hi Josef,
>
> On Wed, Sep 28, 2016 at 04:01:32PM -0400, Josef Bacik wrote:
> > NBD can become contended on its single connection. We have to serialize
> > all writes and we can only process one read response at a time. Fix this
> > by allowing userspace to provide multiple connections to a single nbd
> > device. This coupled with block-mq drastically increases performance in
> > multi-process cases. Thanks,
>
> This reminds me: I've been pondering this for a while, and I think there
> is no way we can guarantee the correct ordering of FLUSH replies in the
> face of multiple connections, since a WRITE reply on one connection may
> arrive before a FLUSH reply on another which it does not cover, even if
> the server has no cache coherency issues otherwise.
>
> Having said that, there can certainly be cases where that is not a
> problem, and where performance considerations are more important than
> reliability guarantees; so once this patch lands in the kernel (and the
> necessary support patch lands in the userland utilities), I think I'll
> just update the documentation to mention the problems that might ensue,
> and be done with it.
>
> I can see only a few ways in which to potentially solve this problem:
> - Kernel-side nbd-client could send a FLUSH command over every channel,
>   and only report successful completion once all replies have been
>   received. This might negate some of the performance benefits, however.
> - Multiplexing commands over a single connection (perhaps an SCTP one,
>   rather than TCP); this would require some effort though, as you said,
>   and would probably complicate the protocol significantly.

So think of it like normal disks with multiple channels. We don't send
flushes down all the hwq's to make sure they are clear, we leave that
decision up to the application (usually a FS of course). So what we're
doing here is no worse than what every real disk on the planet does; our
hw queues just have a lot longer transfer times and are more error
prone ;).

I definitely think documenting the behavior is important so that people
don't expect magic to happen, and perhaps we could add a flag later that
says send all the flushes down all the connections for the paranoid; it
should be relatively straightforward to do.

Thanks,

Josef
Re: [PATCH 2/6] ipr: use pci_irq_allocate_vectors
Thanks Christoph. Very nice. As I was reviewing the patch, I noticed
the additional PCI_IRQ_AFFINITY flag, which is currently not being set
in this patch. Is the intention to set that globally by default, or
should I follow up with a one liner to add that to the ipr driver
in the next patch set I send out?

Acked-by: Brian King

Thanks,

Brian

--
Brian King
Power Linux I/O
IBM Linux Technology Center
[BUG] block: bdi_register_owner() failure cause NULL pointer dereference
Hi, Folks

We observed a hard lockup while trying a raid assemble with sas3ircu. It
started with a failure inside bdi_register_owner() on a duplicated kobj
path, then a NULL pointer dereference came up, after which the system hung
and we saw the hard lockup on screen.

The duplication issue could be in the scsi controller driver and we are
going to upgrade it anyway, but my question is why we don't do some error
handling like:

diff --git a/block/genhd.c b/block/genhd.c
index a178c8e..318bc63 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -614,7 +614,15 @@ void device_add_disk(struct device *parent, struct gendisk *disk)

 	/* Register BDI before referencing it from bdev */
 	bdi = &disk->queue->backing_dev_info;
-	bdi_register_owner(bdi, disk_to_dev(disk));
+	if (bdi_register_owner(bdi, disk_to_dev(disk))) {
+		disk_release_events(disk);
+		blk_free_devt(devt);
+		disk->ev = NULL;
+		disk->first_minor = 0;
+		disk->major = 0;
+		WARN_ON(1);
+		return;
+	}

 	blk_register_region(disk_devt(disk), disk->minors, NULL,
 			    exact_match, exact_lock, disk);

to prevent the following NULL pointer dereference and hard lockup?

Regards,
Michael Wang

Sep 29 09:53:28 st401b-3 systemd[1]: Starting Update UTMP about System Runlevel Changes...
Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen and drop on 1 v6wildcard :: UDP 123
Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 2 lo 127.0.0.1 UDP 123
Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 3 eth0 10.41.12.3 UDP 123
Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 4 lo ::1 UDP 123
Sep 29 09:53:28 st401b-3 ntpd[4970]: Listen normally on 5 eth0 fe80::ec4:7aff:feab:6b0 UDP 123
Sep 29 09:53:28 st401b-3 ntpd[4970]: peers refreshed
Sep 29 09:53:28 st401b-3 ntpd[4970]: Listening on routing socket on fd #22 for interface updates
Sep 29 09:53:28 st401b-3 systemd[1]: Started Update UTMP about System Runlevel Changes.
Sep 29 09:53:28 st401b-3 systemd[1]: Startup finished in 18.720s (kernel) + 39.513s (userspace) = 58.233s.
Sep 29 09:55:01 st401b-3 CRON[5433]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep 29 09:55:01 st401b-3 CRON[5434]: (root) CMD (test -x /opt/profitbricks/bin/check_memory && /opt/profitbricks/bin/check_memory)
Sep 29 09:55:17 st401b-3 kernel: [  167.693658] scsi 0:1:0:0: Direct-Access     LSI      Logical Volume   3000 PQ: 0 ANSI: 6
Sep 29 09:55:17 st401b-3 kernel: [  167.693795] scsi 0:1:0:0: RAID1: handle(0x0143), wwid(0x02f7ec7949091b05), pd_count(2), type(SSP)
Sep 29 09:55:17 st401b-3 kernel: [  167.694042] sd 0:1:0:0: [sdam] 5859373056 512-byte logical blocks: (3.00 TB/2.73 TiB)
Sep 29 09:55:17 st401b-3 kernel: [  167.694044] sd 0:1:0:0: [sdam] 4096-byte physical blocks
Sep 29 09:55:17 st401b-3 kernel: [  167.694057] sd 0:1:0:0: Attached scsi generic sg40 type 0
Sep 29 09:55:17 st401b-3 kernel: [  167.694129] sd 0:1:0:0: [sdam] Write Protect is off
Sep 29 09:55:17 st401b-3 kernel: [  167.694131] sd 0:1:0:0: [sdam] Mode Sense: 03 00 00 08
Sep 29 09:55:17 st401b-3 kernel: [  167.694166] sd 0:1:0:0: [sdam] No Caching mode page found
Sep 29 09:55:17 st401b-3 kernel: [  167.694282] sd 0:0:4:0: hidding raid component
Sep 29 09:55:17 st401b-3 kernel: [  167.694589] sd 0:1:0:0: [sdam] Assuming drive cache: write through
Sep 29 09:55:17 st401b-3 kernel: [  167.703346] sd 0:1:0:0: [sdam] Attached SCSI disk
Sep 29 09:55:17 st401b-3 kernel: [  167.703653] sd 0:0:5:0: hidding raid component
Sep 29 09:55:39 st401b-3 check_backup_lvm_push: critical: local git command rev-parse HEAD failed, retval: 0, fatal: Not a git repository: '/var/lib/backup-lvm/.git/'
Sep 29 09:56:03 st401b-3 kernel: [  213.684812] scsi 0:1:1:0: Direct-Access     LSI      Logical Volume   3000 PQ: 0 ANSI: 6
Sep 29 09:56:03 st401b-3 kernel: [  213.684946] scsi 0:1:1:0: RAID1: handle(0x0142), wwid(0x0ab3eca651cd1b58), pd_count(2), type(SSP)
Sep 29 09:56:03 st401b-3 kernel: [  213.685189] sd 0:1:1:0: [sde] 5859373056 512-byte logical blocks: (3.00 TB/2.73 TiB)
Sep 29 09:56:03 st401b-3 kernel: [  213.685192] sd 0:1:1:0: [sde] 4096-byte physical blocks
Sep 29 09:56:03 st401b-3 kernel: [  213.685204] sd 0:1:1:0: Attached scsi generic sg41 type 0
Sep 29 09:56:03 st401b-3 kernel: [  213.685275] sd 0:1:1:0: [sde] Write Protect is off
Sep 29 09:56:03 st401b-3 kernel: [  213.685277] sd 0:1:1:0: [sde] Mode Sense: 03 00 00 08
Sep 29 09:56:03 st401b-3 kernel: [  213.685307] sd 0:1:1:0: [sde] No Caching mode page found
Sep 29 09:56:03 st401b-3 kernel: [  213.685423] sd 0:0:6:0: hidding raid component
Sep 29 09:56:03 st401b-3 kernel: [  213.685698] sd 0:1:1:0: [sde] Assuming drive cache: write through
Sep 29 09:56:03 st401b-3 kernel: [  213.686226] [ cut here ]
Sep 29 09:56:03 st401b-3 kernel: [  213.686232] WARNING: CPU: 2 PID: 2061 at fs/sysfs/dir.c:31 sysfs_
Re: [PATCH] blkcg: Unlock blkcg_pol_mutex once if cpd == NULL
Hello,

On Mon, Sep 26, 2016 at 03:36:25PM -0700, Bart Van Assche wrote:
> Unlocking a mutex twice is wrong. Hence modify blkcg_policy_register()
> such that blkcg_pol_mutex is unlocked once if cpd == NULL. This patch
> avoids that smatch reports the following error:
>
> block/blk-cgroup.c:1378: blkcg_policy_register() error: double unlock
> 'mutex:&blkcg_pol_mutex'
>
> Signed-off-by: Bart Van Assche
> Cc: Tejun Heo
> Cc:
> ---
>  block/blk-cgroup.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index dd38e5c..cdbca1c 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1327,8 +1327,10 @@ int blkcg_policy_register(struct blkcg_policy *pol)
>  	for (i = 0; i < BLKCG_MAX_POLS; i++)
>  		if (!blkcg_policy[i])
>  			break;
> -	if (i >= BLKCG_MAX_POLS)
> +	if (i >= BLKCG_MAX_POLS) {
> +		mutex_unlock(&blkcg_pol_mutex);
>  		goto err_unlock;
> +	}

Wouldn't it be better to drop the explicit mutex_unlock(&blkcg_pol_mutex)
on "if (!cpd)"?

Thanks.

--
tejun
Re: [Nbd] [PATCH][V3] nbd: add multi-connection support
Hi Josef,

On Wed, Sep 28, 2016 at 04:01:32PM -0400, Josef Bacik wrote:
> NBD can become contended on its single connection. We have to serialize
> all writes and we can only process one read response at a time. Fix this
> by allowing userspace to provide multiple connections to a single nbd
> device. This coupled with block-mq drastically increases performance in
> multi-process cases. Thanks,

This reminds me: I've been pondering this for a while, and I think there is
no way we can guarantee the correct ordering of FLUSH replies in the face
of multiple connections, since a WRITE reply on one connection may arrive
before a FLUSH reply on another which it does not cover, even if the server
has no cache coherency issues otherwise.

Having said that, there can certainly be cases where that is not a problem,
and where performance considerations are more important than reliability
guarantees; so once this patch lands in the kernel (and the necessary
support patch lands in the userland utilities), I think I'll just update
the documentation to mention the problems that might ensue, and be done
with it.

I can see only a few ways in which to potentially solve this problem:
- Kernel-side nbd-client could send a FLUSH command over every channel,
  and only report successful completion once all replies have been
  received. This might negate some of the performance benefits, however.
- Multiplexing commands over a single connection (perhaps an SCTP one,
  rather than TCP); this would require some effort though, as you said,
  and would probably complicate the protocol significantly.

Regards,
Re: [PATCH 0/3] Enabling ATA Command Priorities
Hello,

On Wed, Sep 28, 2016 at 03:43:32AM +0000, Adam Manzanares wrote:
> I prefer having the feature conditional so you can use the CFQ
> scheduler with I/O priorities as is. If you decide to enable the
> feature then the priorities will be passed down to the drive in
> addition to the work that the CFQ scheduler does. Since this feature
> may change the user-perceived performance of the device, I want to
> make sure they know what they are getting into.

Yeah, I prefer to have it behind an explicit flag given the history of
optional ATA features. The feature is unlikely to matter to a lot of
people and is almost bound to break existing RT prio usages.

Thanks.

--
tejun
Re: [PATCH 2/3] ata: Enabling ATA Command Priorities
Hello,

On Tue, Sep 27, 2016 at 11:14:55AM -0700, Adam Manzanares wrote:
> +/**
> + * ata_ncq_prio_enabled - Test whether NCQ prio is enabled
> + * @dev: ATA device to test for
> + *
> + * LOCKING:
> + * spin_lock_irqsave(host lock)
> + *
> + * RETURNS:
> + * 1 if NCQ prio is enabled for @dev, 0 otherwise.
> + */
> +static inline int ata_ncq_prio_enabled(struct ata_device *dev)
> +{
> +	return (dev->flags & (ATA_DFLAG_PIO | ATA_DFLAG_NCQ_OFF |
> +	                      ATA_DFLAG_NCQ_PRIO)) == ATA_DFLAG_NCQ_PRIO;

I'm not sure this needs to test PIO and NCQ_OFF. This function pretty
much can assume that it'd be only called in NCQ context, no?

Thanks.

--
tejun
Re: [PATCH 1/3] block: Add iocontext priority to request
Hello,

On Tue, Sep 27, 2016 at 11:14:54AM -0700, Adam Manzanares wrote:
> Patch adds association between iocontext and a request.
>
> Signed-off-by: Adam Manzanares

Can you please describe how this may impact existing usages?

Thanks.

--
tejun