Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
Hannes, Shaun,

Let me add some more comments.

> On Aug 2, 2016, at 23:35, Hannes Reinecke wrote:
>
> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig wrote:
>>>
>>> Can you please integrate this with Hannes series so that it uses
>>> his cache of the zone information?
>>
>> Adding Hannes and Damien to Cc.
>>
>> Christoph,
>>
>> I can make a patch that marshals Hannes' RB-Tree into a block report;
>> that is quite simple. I can even have the open/close/reset zone
>> commands update the RB-Tree .. the non-private parts anyway. I would
>> prefer to do this around the CONFIG_SD_ZBC support, offering the
>> existing type of patch for setups that do not need the RB-Tree to
>> function with zoned media.
>>
>> I do still have concerns with the approach which I have shared in
>> smaller forums, but perhaps I have to bring them to this group.
>>
>> First is the memory consumption. This isn't really much of a concern
>> for large servers with few drives, but I think the embedded NAS market
>> will grumble, as will the large data pods trying to stuff 300+ drives
>> in a chassis.
>>
>> As of now the RB-Tree needs to hold ~30,000 zones.
>> sizeof() reports struct blk_zone to use 120 bytes on x86_64. This
>> yields around 3.5 MB per zoned drive attached. Which is fine if it is
>> really needed, but most of it is fixed information and it can be
>> significantly condensed (I have proposed 8 bytes per zone held in an
>> array as more than adequate). Worse is that the crucial piece of
>> information, the current wp needed for scheduling the next write, is
>> mostly out of date because it is updated only after the write
>> completes, while zones being actively written to must work off of the
>> last location / size that was submitted, not completed. The workaround
>> is for that tracking to be handled in the private_data member. I am
>> not saying that updating the wp on completing a write isn't important;
>> I am saying that the bi_end_io hook is the existing hook that works
>> just fine.
>>
> Which _actually_ is not true; with my patches I'll update the write
> pointer prior to submitting the I/O (on the reasoning that most of the
> time I/O will succeed) and re-read the zone information if an I/O
> failed. (Which I'll have to do anyway, as after an I/O failure the
> write pointer status is not clearly defined.)
>
> I have thought about condensing the RB tree information, but then I
> figured that for 'real' SMR handling we cannot assume all zones are of
> fixed size, and hence we need all the information there.
> Any condensing method would assume a given structure of the zones,
> which the standard just doesn't provide.
> Or am I missing something here?

Indeed, the standards do not mandate any particular zone configuration,
constant zone size, etc. So writing the code so that such layouts can be
handled is certainly the right way of doing things. However, if we
decide to go forward with mapping the RESET WRITE POINTER command to
DISCARD, then at least a constant zone size (minus the last zone, as you
said) must be assumed, and that information can be removed from the
entries in the RB tree (as it will be saved for the sysfs "zone_size"
file anyway). Adding a little code to handle that eventual last runt
zone with a different size is not a big problem.

> As for write pointer handling: yes, the write pointer on the zones is
> not really useful for upper-level usage.
> Where we do need it is to detect I/O which is crossing the write
> pointer (eg when doing reads over the entire zone).
> As per spec you will be getting an I/O error here, so we need to split
> the I/O on the write pointer to get valid results back.

To be precise here, the I/O splitting will be handled by the block layer
thanks to the "chunk_sectors" setting. But that relies on a constant
zone size assumption too.

The RB-tree here is most useful for reads over or after the write
pointer, as this can have different behavior on different drives
(URSWRZ bit). The RB-tree allows us to hide these differences from
upper layers and simplify support at those levels.

I too agree that the write pointer value is not very useful to upper
layers (e.g. FS). What matters most of the time for these layers is an
"allocation pointer" or "submission pointer" which indicates the LBA to
use for the next BIO submission. That LBA will most of the time be in
advance of the zone WP because of request queueing in the block
scheduler.

Note that from what we have done already, in many cases upper layers do
not even need this (e.g. F2FS, btrfs) and work fine without needing
*any* zone->private_data, because that "allocation pointer" information
generally already exists in a different form (e.g. a bit in a bitmap).
For these cases, not having the RB-tree of zones would force a lot more
code to be added to file systems for SMR support.

>> This all tails into domain responsibility. With the RB-Tree doing
>> half of the work and the 'responsible' domain handling the active
>> path via private_data, why have the split at all? It seems to be
>> double work to have a second object tracking the first so that I/O
>> scheduling can function.
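For illustration, the "8 bytes per zone" idea raised in this thread
could take roughly the following shape. This is only a sketch: the field
names and bit widths are assumptions made here for illustration, not a
layout proposed on the list, and it presumes a constant zone size so
that per-zone start LBAs need not be stored.

    #include <linux/types.h>

    /* One condensed zone descriptor: 8 bytes instead of the 120-byte
     * struct blk_zone, indexed by zone number in a flat array. */
    struct zone_entry {
            u64 wp_offset : 40;     /* write pointer, relative to zone start */
            u64 type      :  4;     /* conventional / seq-required / seq-preferred */
            u64 cond      :  4;     /* zone condition: empty, open, full, ... */
            u64 reserved  : 16;
    };

    /* ~30,000 zones * 8 bytes is ~240 KB per drive, versus the ~3.5 MB
     * quoted above for 30,000 * 120-byte struct blk_zone entries. */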
[PATCH 0039/1285] Replace numeric parameter like 0444 with macro
I find that developers often just specify the numeric value when calling
a macro which is defined with a parameter for access permissions. These
numeric permission values have corresponding macros, and using the
macros improves the robustness and readability of the code, so I suggest
replacing the numeric parameter with the macros.

Signed-off-by: Chuansheng Liu
Signed-off-by: Baole Ni
---
 block/genhd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index 9f42526..1079e623 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1807,7 +1807,7 @@ static const struct kernel_param_ops disk_events_dfl_poll_msecs_param_ops = {
 #define MODULE_PARAM_PREFIX	"block."
 module_param_cb(events_dfl_poll_msecs, &disk_events_dfl_poll_msecs_param_ops,
-		&disk_events_dfl_poll_msecs, 0644);
+		&disk_events_dfl_poll_msecs, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
 
 /*
  * disk_{alloc|add|del|release}_events - initialize and destroy disk_events.
-- 
2.9.2
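For reference, the replacement is bit-for-bit identical to the octal
constant: S_IRUSR is 0400, S_IWUSR is 0200, S_IRGRP is 0040 and S_IROTH
is 0004, so their OR is 0644. A trivial userspace check:

    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
            /* owner read+write, group read, other read */
            printf("%o\n", S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH); /* prints 644 */
            return 0;
    }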
Re: Fixup direct bi_rw modifiers
On 08/02/2016 05:32 AM, Christoph Hellwig wrote:
> On Mon, Aug 01, 2016 at 01:55:36PM -0600, Jens Axboe wrote:
>> Set of three patches, where the target one is an actual bug fix...
>> Temporary branch, I'll rebase it once -rc1 is out, if more
>> changes/fixups need to be made in the next week until that happens.
>>
>> http://git.kernel.dk/cgit/linux-block/log/?h=for-4.8/bi_rwf
>
> Thanks. Any chance to have a slightly better name for the field?
> E.g. bi_opf, or bi_op_flags if we want to be a bit more verbose.

bi_opf is fine with me as well; I don't care too strongly about bi_rwf
vs bi_opf. I don't like the longer variants.

-- 
Jens Axboe
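For readers outside the thread: the field in question carries both the
request operation and its modifier flags in a single word, hence the
"opf" suggestion. A generic illustration of that kind of packing is
sketched below; the shift, widths and names here are invented for the
example and do not reflect the kernel's actual REQ_OP_* layout, which
has also changed across releases.

    #include <stdio.h>

    #define OP_SHIFT 29                     /* hypothetical: top bits hold the op */
    #define OP_MASK  (~0u << OP_SHIFT)

    enum demo_op { DEMO_READ, DEMO_WRITE, DEMO_DISCARD };

    /* Combine an operation and its flags into one 32-bit word. */
    static unsigned int pack_opf(enum demo_op op, unsigned int flags)
    {
            return ((unsigned int)op << OP_SHIFT) | (flags & ~OP_MASK);
    }

    int main(void)
    {
            unsigned int opf = pack_opf(DEMO_WRITE, 0x1 /* e.g. a sync flag */);
            printf("op=%u flags=%#x\n", opf >> OP_SHIFT, opf & ~OP_MASK);
            return 0;
    }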
[PATCH 0310/1285] Replace numeric parameter like 0444 with macro
I find that developers often just specify the numeric value when calling
a macro which is defined with a parameter for access permissions. These
numeric permission values have corresponding macros, and using the
macros improves the robustness and readability of the code, so I suggest
replacing the numeric parameter with the macros.

Signed-off-by: Chuansheng Liu
Signed-off-by: Baole Ni
---
 drivers/lightnvm/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 160c1a6..8dec614 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -1032,7 +1032,7 @@ static const struct kernel_param_ops nvm_configure_by_str_event_param_ops = {
 #define MODULE_PARAM_PREFIX	"lnvm."
 module_param_cb(configure_debug, &nvm_configure_by_str_event_param_ops, NULL,
-		0644);
+		S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
 
 #endif /* CONFIG_NVM_DEBUG */
-- 
2.9.2
Re: [BUG] Deadlock in blk_mq_register_disk error path
On 08/02/2016 06:58 AM, Jinpu Wang wrote:
> Hi Jens,
>
> I found that blk_mq_register_disk calls blk_mq_disable_hotplug, which
> in turn takes mutex_lock(&all_q_mutex);
>
>	queue_for_each_hw_ctx(q, hctx, i) {
>		ret = blk_mq_register_hctx(hctx);
>		if (ret)
>			break;	/* if the above errors out, we call unregister below */
>	}
>
>	if (ret)
>		blk_mq_unregister_disk(disk);
>
> In blk_mq_unregister_disk, we will try to disable hotplug again, which
> leads to deadlock.
>
> Did I miss anything?

Nope, your analysis looks correct. This should fix it:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=6316338a94b2319abe9d3790eb9cdc56ef81ac1a

-- 
Jens Axboe
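The deadlock pattern itself is the classic double acquisition of a
non-recursive mutex on an error path. A minimal userspace sketch of the
shape of the bug (names mirror the thread but the code is simplified,
not the actual kernel source):

    #include <pthread.h>

    static pthread_mutex_t all_q_mutex = PTHREAD_MUTEX_INITIALIZER;

    static void unregister_disk(void)
    {
            pthread_mutex_lock(&all_q_mutex);   /* "disable hotplug" */
            /* ... tear down hctx sysfs entries ... */
            pthread_mutex_unlock(&all_q_mutex);
    }

    static void register_disk(void)
    {
            int ret;

            pthread_mutex_lock(&all_q_mutex);   /* "disable hotplug" */
            ret = -1;                           /* assume a hctx registration failed */
            if (ret)
                    unregister_disk();          /* relocks all_q_mutex */
            pthread_mutex_unlock(&all_q_mutex);
    }

    int main(void)
    {
            register_disk();    /* blocks forever on a default mutex: the bug */
            return 0;
    }

A common cure for this pattern is to factor the teardown into a variant
that assumes the lock is already held and call that from the error path.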
[RFC PATCH] blk-mq: Prevent round-robin from scheduling dead cpus
Hi,

I'm not completely sure I got the cause for this one completely right.
Still, it does look like the correct fix and a good improvement overall,
so I'm making it an RFC for now to gather some feedback. Let me hear
your thoughts.

-- >8 --

When notifying blk-mq about CPU removals while running IO, we risk
racing the hctx->cpumask update with blk_mq_hctx_next_cpu, and end up
scheduling a dead cpu to execute hctx->run_{,delayed_}work. As a result,
kblockd_schedule_delayed_work_on() may schedule another cpu outside of
hctx->cpumask, which triggers the following warning at
__blk_mq_run_hw_queue:

	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));

This patch makes the issue much more unlikely to happen, as it makes
blk_mq_hctx_next_cpu aware of dead cpus and triggers the round-robin
code regardless of the remaining batch processing time. Thus, in case we
offline a cpu in the middle of its batch processing time, we no longer
waste time scheduling to it here, and just move through to the next cpu
in the mask.

The warning may still be triggered, though, since this is not the only
case that may cause the queue to schedule on a dead cpu. But this fixes
the common case, which is the remaining batch processing time of a
suddenly dead cpu.

Signed-off-by: Gabriel Krisman Bertazi
Cc: Brian King
Cc: linux-block@vger.kernel.org
Cc: linux-s...@vger.kernel.org
---
 block/blk-mq.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c27bb37..a2cb64c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -858,7 +858,8 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 	if (hctx->queue->nr_hw_queues == 1)
 		return WORK_CPU_UNBOUND;
 
-	if (--hctx->next_cpu_batch <= 0) {
+	if (--hctx->next_cpu_batch <= 0 ||
+	    !cpumask_test_cpu(hctx->next_cpu, cpu_online_mask)) {
 		int cpu = hctx->next_cpu, next_cpu;
 
 		next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
@@ -868,7 +869,8 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 		hctx->next_cpu = next_cpu;
 		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
 
-		return cpu;
+		return (cpumask_test_cpu(cpu, cpu_online_mask)) ?
+			cpu : blk_mq_hctx_next_cpu(hctx);
 	}
 
 	return hctx->next_cpu;
-- 
2.7.4
Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig wrote:
>>
>> Can you please integrate this with Hannes series so that it uses
>> his cache of the zone information?
>
> Adding Hannes and Damien to Cc.
>
> Christoph,
>
> I can make a patch that marshals Hannes' RB-Tree into a block report;
> that is quite simple. I can even have the open/close/reset zone
> commands update the RB-Tree .. the non-private parts anyway. I would
> prefer to do this around the CONFIG_SD_ZBC support, offering the
> existing type of patch for setups that do not need the RB-Tree to
> function with zoned media.
>
> I do still have concerns with the approach which I have shared in
> smaller forums, but perhaps I have to bring them to this group.
>
> First is the memory consumption. This isn't really much of a concern
> for large servers with few drives, but I think the embedded NAS market
> will grumble, as will the large data pods trying to stuff 300+ drives
> in a chassis.
>
> As of now the RB-Tree needs to hold ~30,000 zones.
> sizeof() reports struct blk_zone to use 120 bytes on x86_64. This
> yields around 3.5 MB per zoned drive attached. Which is fine if it is
> really needed, but most of it is fixed information and it can be
> significantly condensed (I have proposed 8 bytes per zone held in an
> array as more than adequate). Worse is that the crucial piece of
> information, the current wp needed for scheduling the next write, is
> mostly out of date because it is updated only after the write
> completes, while zones being actively written to must work off of the
> last location / size that was submitted, not completed. The workaround
> is for that tracking to be handled in the private_data member. I am
> not saying that updating the wp on completing a write isn't important;
> I am saying that the bi_end_io hook is the existing hook that works
> just fine.
>
Which _actually_ is not true; with my patches I'll update the write
pointer prior to submitting the I/O (on the reasoning that most of the
time I/O will succeed) and re-read the zone information if an I/O
failed. (Which I'll have to do anyway, as after an I/O failure the
write pointer status is not clearly defined.)

I have thought about condensing the RB tree information, but then I
figured that for 'real' SMR handling we cannot assume all zones are of
fixed size, and hence we need all the information there.
Any condensing method would assume a given structure of the zones,
which the standard just doesn't provide.
Or am I missing something here?

As for write pointer handling: yes, the write pointer on the zones is
not really useful for upper-level usage.
Where we do need it is to detect I/O which is crossing the write
pointer (eg when doing reads over the entire zone).
As per spec you will be getting an I/O error here, so we need to split
the I/O on the write pointer to get valid results back.

> This all tails into domain responsibility. With the RB-Tree doing
> half of the work and the 'responsible' domain handling the active
> path via private_data, why have the split at all? It seems to be
> double work to have a second object tracking the first so that I/O
> scheduling can function.
>
Tracking the zone states via RB trees is mainly to handle host-managed
drives seamlessly. Without an RB tree we will be subjected to I/O
errors during boot, as eg with new drives we inevitably will try to
read beyond the write pointer, which in turn will generate I/O errors
on certain drives.
I do agree that there is no strict need to set up an RB tree for
host-aware drives; I can easily add an attribute/flag to disable RB
trees for those. However, tracking zone information in the RB tree
_dramatically_ speeds up device initialisation; issuing a
blkdev_discard() for the entire drive will take _ages_ without it.

> Finally there is the error handling path: when the RB-Tree encounters
> an error it attempts to requery the drive topology, virtually
> guaranteeing that the private_data is now out-of-sync with the
> RB-Tree. Again this is something that can be better encapsulated in
> the bi_end_io, to be informed of the failed I/O and schedule the
> appropriate recovery (including re-querying the zone information of
> the affected zone(s)).
>
Well, if we have an RB tree with write pointers of course we need to
re-read the zone information on failure (and I thought I did that?).
Plus the error information will be propagated via the usual mechanism,
so they need to take care of updating any information in ->private_data.

I'm perfectly fine with integrating your patches for the various
blkdev_XX callouts and associated ioctls; Jens favours that approach,
too, so we should be combining those efforts.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB
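For context, the zone cache that makes such queries cheap is a standard
sector-keyed RB-tree walk. A sketch using the kernel rbtree API follows;
the struct fields are assumptions for illustration, not the actual
layout from Hannes' series:

    #include <linux/rbtree.h>
    #include <linux/types.h>

    struct zone_info {                  /* assumed fields, for illustration */
            sector_t start;
            sector_t len;
            sector_t wp;
            struct rb_node node;
    };

    /* Find the zone containing @sector without touching the device. */
    static struct zone_info *zone_lookup(struct rb_root *root, sector_t sector)
    {
            struct rb_node *n = root->rb_node;

            while (n) {
                    struct zone_info *z = rb_entry(n, struct zone_info, node);

                    if (sector < z->start)
                            n = n->rb_left;
                    else if (sector >= z->start + z->len)
                            n = n->rb_right;
                    else
                            return z;
            }
            return NULL;
    }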
Re: Fixup direct bi_rw modifiers
On Mon, Aug 01, 2016 at 01:55:36PM -0600, Jens Axboe wrote:
>
> Set of three patches, where the target one is an actual bug fix...
> Temporary branch, I'll rebase it once -rc1 is out, if more
> changes/fixups need to be made in the next week until that happens.
>
> http://git.kernel.dk/cgit/linux-block/log/?h=for-4.8/bi_rwf

Thanks. Any chance to have a slightly better name for the field?
E.g. bi_opf, or bi_op_flags if we want to be a bit more verbose.
[PATCH 1/1] blkio: update cgroup's document path
The cgroup documentation path has changed to "cgroup-v1"; update the
reference.

Signed-off-by: seokhoon.yoon
---
 block/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/Kconfig b/block/Kconfig
index 0363cd7..5665ffe 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -111,7 +111,7 @@ config BLK_DEV_THROTTLING
 	one needs to mount and use blkio cgroup controller for creating
 	cgroups and specifying per device IO rate policies.
 
-	See Documentation/cgroups/blkio-controller.txt for more information.
+	See Documentation/cgroup-v1/blkio-controller.txt for more information.
 
 config BLK_CMDLINE_PARSER
 	bool "Block device command line partition parser"
-- 
1.9.1
[BUG] Deadlock in blk_mq_register_disk error path
Hi Jens,

I found that blk_mq_register_disk calls blk_mq_disable_hotplug, which
in turn takes mutex_lock(&all_q_mutex);

	queue_for_each_hw_ctx(q, hctx, i) {
		ret = blk_mq_register_hctx(hctx);
		if (ret)
			break;	/* if the above errors out, we call unregister below */
	}

	if (ret)
		blk_mq_unregister_disk(disk);

In blk_mq_unregister_disk, we will try to disable hotplug again, which
leads to deadlock.

Did I miss anything?

-- 
Best Regards,

Jack Wang
Linux Kernel Developer Storage
ProfitBricks GmbH
The IaaS-Company.
Re: [PATCH linux-4.7-rc7] blk_stack_limits() setting discard_granularity
Hi Martin,

I totally agree that it would be better to have common block layer
infrastructure to handle such discard misfit cases. But, for now, I do
not have a good idea of how to aggregate discard chunks (<
discard_granularity) in the block layer and later issue only a big one
(== discard_granularity) to the underlying block device in a generic
and persistent fashion.

For me, the current handling of discards by the block layer
[blk_stack_limits() + blk_bio_discard_split()] seems to be inconsistent
with the handling of normal (rd/wr) IO. It makes the life of block
driver developers harder, as they cannot rely on blk_queue_split()
doing its job on discard bio's.

Regards,
Florian

On Tue, Aug 2, 2016 at 4:08 AM, Martin K. Petersen wrote:
>> Florian-Ewald Müller writes:
>
> Florian-Ewald,
>
>> Now my experiments show that, at least, dm-cache doesn't complain
>> about nor reject discards smaller than its discard_granularity, but
>> possibly turns them into a noop (?).
>
> Correct. Anything smaller than (an aligned) multiple of the discard
> granularity will effectively be ignored.
>
> In practice this means that your device should allocate things in
> aligned units of the underlying device's discard granularity.
>
>> Maybe the needed functionality of accumulating small discards into a
>> big one covering its own granularity (similar to SSD block erasure)
>> should be done at the driver level.
>
> Do you allocate blocks in a predictable pattern between your nodes?
>
> For MD RAID0, for instance, we issue many small discard requests. But
> for I/Os that are bigger than the stripe width we'll wrap around and
> do merging so that, for instance, blocks 0, n, 2*n, 3*n, etc. become
> part of the same discard request sent to the device.
>
> If you want discards smaller than the underlying granularity to have
> an effect then I'm afraid the burden is on you to maintain a bitmap of
> each granularity-sized region, and then issue a deferred discard when
> all blocks inside that region have been discarded by the application
> or filesystem above.
>
> If you want to pursue partial block tracking it would be good to come
> up with a common block layer infrastructure for it. dm-thin could
> benefit from it as well...
>
> --
> Martin K. Petersen	Oracle Linux Engineering

-- 
Florian-Ewald Mueller
Architecture Board
ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:	+49 30 577 008 331
Fax:	+49 30 577 008 598
Email:	florian-ewald.muel...@profitbricks.com
URL:	http://www.profitbricks.de

Registered office: Berlin.
Registry court: Amtsgericht Charlottenburg, HRB 125506 B.
Managing directors: Andreas Gauger, Achim Weiss.

Please consider the environment before printing this email.
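Martin's bitmap suggestion could look roughly like the sketch below: one
bit per logical block within each discard_granularity-sized region, with
a full-region discard issued downstream once every bit is set. The
structure and function names are assumptions for illustration, not an
existing kernel API:

    #include <linux/bitmap.h>
    #include <linux/types.h>

    struct discard_region {
            unsigned long *map;         /* one bit per block in the region */
            unsigned int nr_blocks;     /* blocks per discard_granularity region */
    };

    /*
     * Record a small discard of @count blocks starting at @first within
     * the region; returns true once the whole region may be discarded
     * with a single, granularity-aligned request.
     */
    static bool region_mark_discarded(struct discard_region *r,
                                      unsigned int first, unsigned int count)
    {
            bitmap_set(r->map, first, count);
            return bitmap_full(r->map, r->nr_blocks);
    }

As Florian notes, the tracking would also have to be persistent across
reboots, which is the hard part this sketch leaves out.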