Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
On Tue, Aug 9, 2016 at 11:38 PM, Damien Le Moal wrote: > Shaun, > > On 8/10/16 12:58, Shaun Tancheff wrote: >> >> On Tue, Aug 9, 2016 at 3:09 AM, Damien Le Moal >> wrote: On Aug 9, 2016, at 15:47, Hannes Reinecke wrote: >> >> >> [trim] >> > Since disk type == 0 for everything that isn't HM, I would prefer the > sysfs 'zoned' file just report if the drive is HA or HM. > Okay. So let's put in the 'zoned' attribute the device type: 'host-managed', 'host-aware', or 'device managed'. >>> >>> >>> I hacked your patches and simply put a "0" or "1" in the sysfs zoned >>> file. >>> Any drive that has ZBC/ZAC command support gets a "1", "0" for everything >>> else. This means that drive managed models are not exposed as zoned block >>> devices. For HM vs HA differentiation, an application can look at the >>> device type file since it is already present. >>> >>> We could indeed set the "zoned" file to the device type, but HM drives >>> and >>> regular drives will both have "0" in it, so no differentiation possible. >>> The other choice could be the "zoned" bits defined by ZBC, but these >>> do not define a value for host managed drives, and the drive managed >>> value >>> being not "0" could be confusing too. So I settled for a simple 0/1 >>> boolean. >> >> >> This seems good to me. > > > Another option I forgot is for the "zoned" file to indicate the total number > of zones of the device, and 0 for a non zoned regular block device. That > would work as well. Clearly either is sufficient. > [...] >>> >>> Done: I hacked Shaun's ioctl code and added finish zone too. The >>> difference with Shaun's initial code is that the ioctls are propagated down >>> to >>> the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there is no need >>> for >>> BIO request definition for the zone operations. So a lot less code added. >> >> >> The purpose of the BIO flags is not to enable the ioctls so much as >> the other way round. 
Creating BIO op's is to enable issuing ZBC >> commands from device mapper targets and file systems without some >> heinous ioctl hacks. >> Making the resulting block layer interfaces available via ioctls is just a >> reasonable way to exercise the code ... or that was my intent. > > > Yes, I understood your code. However, since (or if) we keep the zone > information in the RB-tree cache, there is no need for the report zone > operation BIO interface. Same for reset write pointer by keeping the mapping > to discard. blk_lookup_zone can be used in kernel as a report zone BIO > replacement and works as well for the report zone ioctl implementation. For > reset, there is blkdev_issue_discard in kernel, and the reset zone ioctl > becomes equivalent to BLKDISCARD ioctl. These are simple. Open, close and > finish zone remain. For these, adding the BIO interface seemed like overkill. > Hence my choice of propagating the ioctl to the driver. > This is debatable of course, and adding an in-kernel interface is not hard: > we can implement blk_open_zone, blk_close_zone and blk_finish_zone using > __blkdev_driver_ioctl. That looks clean to me. Uh. I would call that "heinous" ioctl hacks myself. Kernel -> User API -> Kernel is not really a good design IMO. > Overall, my concern with the BIO based interface for the ZBC commands is > that it adds one flag for each command, which is not really the philosophy > of the interface and potentially opens the door for more such > implementations in the future with new standards and new commands coming up. > Clearly that is not a sustainable path. So I think that a more specific > interface for these zone operations is a better choice. That is consistent > with what happens with the tons of ATA and SCSI commands not actually doing > data I/Os (mode sense, log pages, SMART, etc). All these do not use BIOs and > are processed as REQ_TYPE_BLOCK_PC requests. 
Part of the reason for following on Mike Christie's bio op/flags cleanup was to make these op's. The advantage of being added as ops is that there is only 1 extra bit needed (not 4 or 5 bits for flags). The other reason for being promoted into the block layer as commands is because it seems to me to make sense that these abstractions could be allowed to be passed through a DM layer and be handled by a file system. >>> The ioctls do not mimic exactly the ZBC standard. For instance, there are >>> no >>> reporting options for report zones, nor is the "all" bit supported for >>> open, >>> close or finish zone commands. But the information provided on zones is >>> complete >>> and maps to the standard definitions. >> >> >> For the reporting options I have planned to reuse the stream_id in >> struct bio when that is formalized. There are certainly other places in >> struct bio to stuff a few extra bits ... Sorry I was confused here. I was under the impression you were talking about one of my patches when you seem to have been talking about your hacking thereof. > We could add reporting options to blk_loo
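Damien's point above, that a reset write pointer maps onto discard so the reset zone ioctl becomes equivalent to BLKDISCARD, can be sketched from userspace. This is a minimal illustration, not code from the patch set: the helper names are mine, and it assumes the zone start/length are already known (e.g. from a prior report zones call). Per the thread, the sd mapping only acts on zone-aligned ranges:

```c
#include <fcntl.h>
#include <linux/fs.h>	/* BLKDISCARD */
#include <stdint.h>
#include <sys/ioctl.h>

/* A discard maps to a reset write pointer only when it covers whole zones;
 * sub-zone ranges are ignored by the sd mapping described in the thread. */
int is_zone_aligned(uint64_t start, uint64_t len, uint64_t zone_size)
{
	return zone_size && len &&
	       (start % zone_size) == 0 && (len % zone_size) == 0;
}

/* Reset one zone by discarding exactly its extent (offsets in bytes).
 * Returns the ioctl result, or -1 if the range is not zone aligned. */
int reset_zone_via_discard(int fd, uint64_t zone_start, uint64_t zone_len,
			   uint64_t zone_size)
{
	uint64_t range[2] = { zone_start, zone_len };

	if (!is_zone_aligned(zone_start, zone_len, zone_size))
		return -1;
	return ioctl(fd, BLKDISCARD, &range);
}
```

The alignment check mirrors the "specify less than a zone and nothing will be done" semantic from the discussion.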
Re: [Query] increased latency observed in cpu hotplug path
On 8/5/2016 12:49 PM, Khan, Imran wrote: > On 8/1/2016 2:58 PM, Khan, Imran wrote: >> On 7/30/2016 7:54 AM, Akinobu Mita wrote: >>> 2016-07-28 22:18 GMT+09:00 Khan, Imran : Hi, Recently we have observed some increased latency in the CPU hotplug event, in the CPU online path. For online latency we see that the block layer is executing its notification handler for the CPU_UP_PREPARE event and this in turn waits for an RCU grace period, resulting (sometimes) in an execution time of 15-20 ms for this notification handler. This change was not there in the 3.18 kernel but is present in the 4.4 kernel and was introduced by the following commit: commit 5778322e67ed34dc9f391a4a5cbcbb856071ceba Author: Akinobu Mita Date: Sun Sep 27 02:09:23 2015 +0900 blk-mq: avoid inserting requests before establishing new mapping >>> >>> ... >>> Upon reverting this commit I could see an improvement of 15-20 ms in online latency. So I am looking for some help in analyzing the effects of reverting this, or whether some other approach to reducing the online latency should be taken. >>> >>> Can you observe the difference in online latency by removing >>> get_online_cpus() and put_online_cpus() pair in >>> blk_mq_init_allocated_queue() >>> instead of fully reverting the commit? >>> >> Hi Akinobu, >> I tried your suggestion but could not achieve any improvement. Actually the >> snippet that is causing the change in latency is the following one : >> >> list_for_each_entry(q, &all_q_list, all_q_node) { >> blk_mq_freeze_queue_wait(q); >> >> /* >> * timeout handler can't touch hw queue during the >> * reinitialization >> */ >> del_timer_sync(&q->timeout); >> } >> >> I understand that this is getting executed now for CPU_UP_PREPARE as well, resulting in >> increased latency in the cpu online path. I am trying to reduce this latency >> while keeping the >> purpose of this commit intact. I would welcome further suggestions/feedback >> in this regard. 
>> > Hi Akinobu, > > I am not able to reduce the cpu online latency with this patch, could you > please let me know what > functionality will be broken, if we avoid this patch in our kernel. Also if > you have some other > suggestions towards improving this patch please let me know. > After moving the remapping of queues to block layer's kworker I see that online latency has improved while offline latency remains the same. As the freezing of queues happens in the context of block layer's worker, I think it would be better to do the remapping in the same context and then go ahead with freezing. In this regard I have made following change: commit b2131b86eeef4c5b1f8adaf7a53606301aa6b624 Author: Imran Khan Date: Fri Aug 12 19:59:47 2016 +0530 blk-mq: Move block queue remapping from cpu hotplug path During a cpu hotplug, the hardware and software contexts mappings need to be updated in order to take into account requests submitted for the hotadded CPU. But if this mapping is done in hotplug notifier, it deteriorates the hotplug latency. So move the block queue remapping to block layer worker which results in significant improvements in hotplug latency. Change-Id: I01ac83178ce95c3a4e3b7b1b286eda65ff34e8c4 Signed-off-by: Imran Khan diff --git a/block/blk-mq.c b/block/blk-mq.c index 6d6f8fe..06fcf89 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -22,7 +22,11 @@ #include #include #include - +#include +#include +#include +#include +#include #include #include @@ -32,10 +36,18 @@ static DEFINE_MUTEX(all_q_mutex); static LIST_HEAD(all_q_list); - static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx); /* + * New online cpumask which is going to be set in this hotplug event. + * Declare this cpumasks as global as cpu-hotplug operation is invoked + * one-by-one and dynamically allocating this could result in a failure. 
+ */ +static struct cpumask online_new; + +static struct work_struct blk_mq_remap_work; + +/* * Check if any of the ctx's have pending work in this hardware queue */ static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx) @@ -2125,14 +2137,7 @@ static void blk_mq_queue_reinit(struct request_queue *q, static int blk_mq_queue_reinit_notify(struct notifier_block *nb, unsigned long action, void *hcpu) { - struct request_queue *q; int cpu = (unsigned long)hcpu; - /* -* New online cpumask which is going to be set in this hotplug event. -* Declare this cpumasks as global as cpu-hotplug operation is invoked -* one-by-one and dynamically allocating this could result in a failure. -*/ - static struct cpumask online_new; /* * Before hotadded cpu starts handling requests, new mappings must @@ -2155,43 +2160,17 @@ static int
Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
On Mon, Aug 15, 2016 at 11:00 PM, Damien Le Moal wrote: > > Shaun, > >> On Aug 14, 2016, at 09:09, Shaun Tancheff wrote: > […] >>> No, surely not. >>> But one of the _big_ advantages for the RB tree is blkdev_discard(). >>> Without the RB tree any mkfs program will issue a 'discard' for every >>> sector. We will be able to coalesce those into one discard per zone, but >>> we still need to issue one for _every_ zone. >> >> How can you make coalesce work transparently in the >> sd layer _without_ keeping some sort of a discard cache along >> with the zone cache? >> >> Currently the block layer's blkdev_issue_discard() is breaking >> large discards into nice granular and aligned chunks but it is >> not preventing small discards nor coalescing them. >> >> In the sd layer would there be a way to persist or purge an >> overly large discard cache? What about honoring >> discard_zeroes_data? Once the discard is completed with >> discard_zeroes_data you have to return zeroes whenever >> a discarded sector is read. Isn't that a lot more than just >> tracking a write pointer? Couldn't a zone have dozens of holes? > > My understanding of the standards regarding discard is that it is not > mandatory and that it is a hint to the drive. The drive can completely > ignore it if it thinks that is a better choice. I may be wrong on this > though. Need to check again. But you are currently setting discard_zeroes_data=1 in your current patches. I believe that setting discard_zeroes_data=1 effectively promotes discards to being mandatory. I have a follow on patch to my SCT Write Same series that handles the CMR zone case in the sd_zbc_setup_discard() handler. > For reset write pointer, the mapping to discard requires that the calls > to blkdev_issue_discard be zone aligned for anything to happen. Specify > less than a zone and nothing will be done. This, I think, preserves the > discard semantic. Oh. If that is the intent then there is just a bug in the handler. 
I have pointed out where I believe it to be in my response to the zone cache patch being posted. > As for the “discard_zeroes_data” thing, I also think that is a drive > feature, not mandatory. Drives may have it or not, which is consistent > with the ZBC/ZAC standards regarding reading after write pointer (nothing > says that zeros have to be returned). In any case, discard of CMR zones > will be a nop, so for SMR drives, discard_zeroes_data=0 may be a better > choice. However I am still curious about discards being coalesced. >>> Which is (as indicated) really slow, and easily takes several minutes. >>> With the RB tree we can short-circuit discards to empty zones, and speed >>> up processing time dramatically. >>> Sure we could be moving the logic into mkfs and friends, but that would >>> require us to change the programs and agree on a library (libzbc?) which >>> should be handling that. >> >> F2FS's mkfs.f2fs is already reading the zone topology via SG_IO ... >> so I'm not sure your argument is valid here. > > This initial SMR support patch is just that: a first try. Jaegeuk > used SG_IO (in fact copy-paste of parts of libzbc) because the current > ZBC patch-set has no ioctl API for zone information manipulation. We > will fix this mkfs.f2fs once we agree on an ioctl interface. Which again is my point. If mkfs.f2fs wants to speed up its discard pass by _not_ sending unnecessary Reset WP commands for zones that are already empty, it has all the information it needs to do so. Here it seems to me that the zone cache is _at_best_ doing double work. At worst the zone cache could be doing the wrong thing _if_ the zone cache got out of sync. It is certainly possible (however unlikely) that someone was doing some raw sg activity that is not seen by the sd path. All I am trying to do is have a discussion about the reasons for and against having a zone cache. 
Where it works and where it breaks should be entirely technical, but I understand that we have all spent a lot of time _not_ discussing this for various non-technical reasons. So far the only reason I've been able to ascertain is that Host Managed drives really don't like being stuck with the URSWRZ and would like to have a software hack to return MUD rather than ship drives with some weird out-of-the-box config where the last zone is marked as FINISH'd, thereby returning MUD on reads as per spec. I understand that it would be a strange state to see on first boot and likely people would just do a ResetWP and have weird boot errors, which would probably just make matters worse. I just would rather the workaround be a bit cleaner and/or use less memory. I would also like a path available that does not require SD_ZBC or BLK_ZONED for Host Aware drives to work, hence this set of patches and me begging for a single bit in struct bio. >> >> [..] >> > 3) Try to condense the blkzone data structure to save memory: > I think that we can at the very least remove the zone length, and also > may be t
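The argument that mkfs already has the information it needs can be made concrete: once the zone topology is in hand (via SG_IO report zones, as mkfs.f2fs does), skipping Reset WP for empty zones is a one-line check against the write pointer. Illustrative only; the struct below is a stand-in, not a kernel or libzbc type:

```c
#include <stdint.h>

/* Stand-in for one entry of a REPORT ZONES reply (not a real libzbc type). */
struct zone_info {
	uint64_t start;	/* zone start LBA */
	uint64_t wp;	/* current write pointer LBA */
};

/* A zone whose write pointer sits at its start is already empty, so a
 * mkfs-style discard pass can skip issuing Reset WP for it. */
int zone_needs_reset(const struct zone_info *z)
{
	return z->wp != z->start;
}
```

This is the same short-circuit the RB-tree cache would provide, done in userspace from data the tool already fetched.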
Re: [RFC] libata-scsi: make sure Maximum Write Same Length is not too large
> "Tom" == Tom Yan writes: Tom, Tom> The thing is, as of ACS-4, blocks that carry DSM/TRIM LBA Range Tom> Entries are always 512-byte. Lovely. And SAT conveniently ignores this entirely. Tom> Honestly, I have no idea how that would work on a 4Kn SSD, if it is Tom> / will ever be a thing. Highly unlikely. But I guess you would have to report 8 512-byte ranges in the 4Kn case. And then rely on the patch we talked about to clamp the number of sectors based on the block layer limit. -- Martin K. Petersen Oracle Linux Engineering -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
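For reference, the 512-byte payload blocks Tom mentions are made of 8-byte DSM/TRIM LBA Range Entries: a 48-bit LBA plus a 16-bit sector count, so one block carries 64 entries. A sketch of the packing, matching the layout libata builds for the TRIM payload (the function name is mine):

```c
#include <stdint.h>

/* One DSM/TRIM LBA Range Entry: low 48 bits LBA, high 16 bits range length,
 * per the ACS definition of the DATA SET MANAGEMENT payload. */
uint64_t trim_range_entry(uint64_t lba, uint16_t nsectors)
{
	return (lba & 0x0000FFFFFFFFFFFFULL) | ((uint64_t)nsectors << 48);
}

/* A 512-byte payload block holds 512 / 8 = 64 such entries. */
enum { TRIM_ENTRIES_PER_BLOCK = 512 / sizeof(uint64_t) };
```

In Martin's hypothetical 4Kn case, "8 512-byte ranges" would mean packing eight of these blocks' worth of structure into one 4096-byte logical block.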
Re: [PATCH v2 2/2] sd: check BLK_DEF_MAX_SECTORS against max_dev_sectors
> "Tom" == Tom Yan writes: Tom, >> It would be pretty unusual for a device that is smart enough to >> report a transfer length limit to be constrained to 1 MB and change. Tom> Well, it is done pretty much for libata's SATL. But why? >> rw_max = min(BLK_DEF_MAX_SECTORS, q->limits.max_dev_sectors); Tom> That won't really work. min_t() would, though the line is gonna be Tom> a bit long; not sure if I can/should simply cast the type (unsigned Tom> int) to BLK_DEF_MAX_SECTORS. And which braces are you referring to? I'd rather have a split line than double lines with braces. -- Martin K. Petersen Oracle Linux Engineering
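For context on why min_t() comes up: the kernel's min() refuses mismatched types, and min_t() resolves that by casting both arguments to one named type. A userspace stand-in of the pattern under discussion; treat the BLK_DEF_MAX_SECTORS value as an assumption (2560 in kernels of this era, check blkdev.h):

```c
/* Userspace stand-in for the kernel's min_t(): cast both sides to 'type'.
 * Uses a GCC/clang statement expression, as the kernel macro does. */
#define min_t(type, a, b) \
	({ type __a = (a); type __b = (b); __a < __b ? __a : __b; })

#define BLK_DEF_MAX_SECTORS 2560u	/* assumed value; check blkdev.h */

/* The line from the thread, written with min_t() so the constant and the
 * unsigned int queue limit compare without a type mismatch. */
unsigned int clamp_rw_max(unsigned int max_dev_sectors)
{
	return min_t(unsigned int, BLK_DEF_MAX_SECTORS, max_dev_sectors);
}
```

The cast lands on both operands, so it does not matter which side "is" the enum/constant.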
Re: [RFC] sd: dynamically adjust SD_MAX_WS16_BLOCKS as per the actual logical block size
> "Tom" == Tom Yan writes: Tom, >> 0x7f, the maximum number of block layer sectors that can be >> expressed in a single bio. Tom> Hmm, so when we queue any of the limits, we convert a certain Tom> maximum number of logical sectors (which we have already been doing) Tom> to its corresponding maximum number of block layer sectors, and Tom> then make sure that number does not exceed 0x7f, right? Yep. -- Martin K. Petersen Oracle Linux Engineering
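The conversion Tom is describing can be sketched as below: take a limit expressed in device logical blocks, convert it to 512-byte block layer sectors, then clamp to the single-bio maximum. The exact clamp value is truncated in the quote above, so it is left as a parameter here; the function name is mine:

```c
#include <stdint.h>

/* Convert a limit given in device logical blocks to 512-byte block layer
 * sectors, then clamp it to the largest value a single bio can express.
 * lb_size must be a power of two >= 512, as sd assumes. */
uint32_t to_block_layer_sectors(uint32_t lb_count, uint32_t lb_size,
				uint32_t bio_max_sectors)
{
	uint64_t sectors = (uint64_t)lb_count * (lb_size >> 9);

	return sectors > bio_max_sectors ? bio_max_sectors
					 : (uint32_t)sectors;
}
```

For a 4Kn device each logical block is 8 block layer sectors, which is why a limit that fits in logical blocks can still overflow the bio limit after conversion.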
Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands
Shaun, > On Aug 14, 2016, at 09:09, Shaun Tancheff wrote: […] >>> >> No, surely not. >> But one of the _big_ advantages for the RB tree is blkdev_discard(). >> Without the RB tree any mkfs program will issue a 'discard' for every >> sector. We will be able to coalesce those into one discard per zone, but >> we still need to issue one for _every_ zone. > > How can you make coalesce work transparently in the > sd layer _without_ keeping some sort of a discard cache along > with the zone cache? > > Currently the block layer's blkdev_issue_discard() is breaking > large discards into nice granular and aligned chunks but it is > not preventing small discards nor coalescing them. > > In the sd layer would there be a way to persist or purge an > overly large discard cache? What about honoring > discard_zeroes_data? Once the discard is completed with > discard_zeroes_data you have to return zeroes whenever > a discarded sector is read. Isn't that a lot more than just > tracking a write pointer? Couldn't a zone have dozens of holes? My understanding of the standards regarding discard is that it is not mandatory and that it is a hint to the drive. The drive can completely ignore it if it thinks that is a better choice. I may be wrong on this though. Need to check again. For reset write pointer, the mapping to discard requires that the calls to blkdev_issue_discard be zone aligned for anything to happen. Specify less than a zone and nothing will be done. This, I think, preserves the discard semantic. As for the “discard_zeroes_data” thing, I also think that is a drive feature, not mandatory. Drives may have it or not, which is consistent with the ZBC/ZAC standards regarding reading after write pointer (nothing says that zeros have to be returned). In any case, discard of CMR zones will be a nop, so for SMR drives, discard_zeroes_data=0 may be a better choice. > >> Which is (as indicated) really slow, and easily takes several minutes. 
>> With the RB tree we can short-circuit discards to empty zones, and speed >> up processing time dramatically. >> Sure we could be moving the logic into mkfs and friends, but that would >> require us to change the programs and agree on a library (libzbc?) which >> should be handling that. > > F2FS's mkfs.f2fs is already reading the zone topology via SG_IO ... > so I'm not sure your argument is valid here. This initial SMR support patch is just that: a first try. Jaegeuk used SG_IO (in fact copy-paste of parts of libzbc) because the current ZBC patch-set has no ioctl API for zone information manipulation. We will fix this mkfs.f2fs once we agree on an ioctl interface. > > [..] > 3) Try to condense the blkzone data structure to save memory: I think that we can at the very least remove the zone length, and also may be the per zone spinlock too (a single spinlock and proper state flags can be used). >>> >>> I have a variant that is an array of descriptors that roughly mimics the >>> api from blk-zoned.c that I did a few months ago as an example. >>> I should be able to update that to the current kernel + patches. >>> >> Okay. If we restrict the in-kernel SMR drive handling to devices with >> identical zone sizes of course we can remove the zone length. >> And we can do away with the per-zone spinlock and use a global one instead. > > I don't think dropping the zone length is a reasonable thing to do. > > What I propose is an array of _descriptors_: it doesn't drop _any_ > of the zone information that you are holding in an RB tree, it is > just a condensed format that _mostly_ plugs into your existing > API. I do not agree. The Seagate drive already has one zone (the last one) that is not the same length as the other zones. Sure, since it is the last one, we can add “if (last zone)” all over the place and make it work. But that is really ugly. Keeping the length field makes the code generic and following the standard, which has no restriction on the zone sizes. 
We could do some memory optimisation using different types of blk_zone structs, the types mapping to the SAME value: drives with constant zone size can use a blk_zone type without the length field, others use a different type that includes the field. Accessor functions can hide the different types in the zone manipulation code. Best regards. -- Damien Le Moal, Ph.D. Sr. Manager, System Software Group, HGST Research, HGST, a Western Digital brand damien.lem...@hgst.com (+81) 0466-98-3593 (ext. 513593) 1 kirihara-cho, Fujisawa, Kanagawa, 252-0888 Japan www.hgst.com
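Damien's two-layout idea can be sketched as follows. All names here are hypothetical (none of this is from the patch set): two descriptor layouts keyed by whether the drive reports constant-size zones, with an accessor hiding the difference from the zone manipulation code:

```c
#include <stdint.h>

/* Hypothetical sketch: per-device flag saying whether all zones share one
 * length, so the per-zone length field can be dropped for that case. */
enum zone_model { ZONES_SAME_SIZE, ZONES_VARIABLE_SIZE };

struct zone_fixed { uint64_t start, wp; };	/* no length field */
struct zone_var   { uint64_t start, wp, len; };	/* explicit length */

struct zone_dev {
	enum zone_model model;
	uint64_t same_len;	/* valid when model == ZONES_SAME_SIZE */
};

/* Accessor: callers never care which descriptor layout the device uses. */
uint64_t zone_length(const struct zone_dev *dev, const void *zone)
{
	if (dev->model == ZONES_SAME_SIZE)
		return dev->same_len;
	return ((const struct zone_var *)zone)->len;
}
```

This keeps the code generic per the standard while saving 8 bytes per zone on drives that report uniform zone sizes.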
Re: Time to make dynamically allocated devt the default for scsi disks?
On 08/14/2016 11:23 AM, Dan Williams wrote: [ adding Bart back to the cc ] On Sun, Aug 14, 2016 at 11:08 AM, Dan Williams wrote: On Sun, Aug 14, 2016 at 10:20 AM, James Bottomley wrote: [..] I like it. I still think the bdi registration code should be in charge of taking the extra reference on the disk device's parent to isolate / make clear why the extra reference is being acquired, but I won't lose sleep if Jens takes your smaller change. Thanks James! Bart, do you have a test configuration already set up for this case? Can you give the 2 patches from James a try? https://patchwork.kernel.org/patch/9278201/ https://patchwork.kernel.org/patch/9279513/ Hello Dan, The "sysfs: cannot create duplicate filename" issue is something I ran into sporadically before I started using my patch that fixes this issue. I have not yet found a systematic way to trigger this behavior. Anyway, I will drop my patch and will start testing James' two patches. It will take a few days to test these patches thoroughly. Bart.
Re: [PATCH v3] block: make sure big bio is splitted into at most 256 bvecs
On Mon, Aug 15, 2016 at 11:23:28AM -0700, Christoph Hellwig wrote: > On Mon, Aug 15, 2016 at 11:11:22PM +0800, Ming Lei wrote: > > After arbitrary bio size is supported, the incoming bio may > > be very big. We have to split the bio into small bios so that > > each holds at most BIO_MAX_PAGES bvecs for safety reason, such > > as bio_clone(). > > I still think working around a rough driver submitting too large > I/O is a bad thing until we've done a full audit of all consuming > bios through ->make_request, and we've enabled it for the common > path as well. bcache originally had workaround code to split too-large bios when it first went upstream - that was dropped only after the patches to make generic_make_request() handle arbitrary size bios went in. So to do what you're suggesting would mean reverting that bcache patch and bringing that code back, which from my perspective would be a step in the wrong direction. I just want to get this over and done with. re: interactions with other drivers - bio_clone() has already been changed to only clone biovecs that are live for current bi_iter, so there shouldn't be any safety issues. A driver would have to be intentionally doing its own open coded bio cloning that clones all of bi_io_vec, not just the active ones - but if they're doing that, they're already broken because a driver isn't allowed to look at bi_vcnt if it isn't a bio that it owns - bi_vcnt is 0 on bios that don't own their biovec (i.e. that were created by bio_clone_fast). And the cloning and bi_vcnt usage stuff I audited very thoroughly back when I was working on immutable biovecs and such back in the day, and I had to do a fair amount of cleanup/refactoring before that stuff could go in. 
> > > bool do_split = true; > > struct bio *new = NULL; > > const unsigned max_sectors = get_max_io_size(q, bio); > > + unsigned bvecs = 0; > > + > > + *no_merge = true; > > > > bio_for_each_segment(bv, bio, iter) { > > /* > > +* With arbitrary bio size, the incoming bio may be very > > +* big. We have to split the bio into small bios so that > > +* each holds at most BIO_MAX_PAGES bvecs because > > +* bio_clone() can fail to allocate big bvecs. > > +* > > +* It should have been better to apply the limit per > > +* request queue in which bio_clone() is involved, > > +* instead of globally. The biggest blocker is > > +* bio_clone() in bio bounce. > > +* > > +* If bio is splitted by this reason, we should allow > > +* to continue bios merging. > > +* > > +* TODO: deal with bio bounce's bio_clone() gracefully > > +* and convert the global limit into per-queue limit. > > +*/ > > + if (bvecs++ >= BIO_MAX_PAGES) { > > + *no_merge = false; > > + goto split; > > + } > > That being said this simple if check here is simple enough that it's > probably fine. But I see no need to uglify the whole code path > with that no_merge flag. Please drop it for now, and if we start > caring for this path in common code we should just move the > REQ_NOMERGE setting into the actual blk_bio_*_split helpers. Agreed about the no_merge thing.
Re: [PATCH v3] block: make sure big bio is splitted into at most 256 bvecs
On Mon, Aug 15, 2016 at 11:11:22PM +0800, Ming Lei wrote: > After arbitrary bio size is supported, the incoming bio may > be very big. We have to split the bio into small bios so that > each holds at most BIO_MAX_PAGES bvecs for safety reason, such > as bio_clone(). I still think working around a rough driver submitting too large I/O is a bad thing until we've done a full audit of all consuming bios through ->make_request, and we've enabled it for the common path as well. > bool do_split = true; > struct bio *new = NULL; > const unsigned max_sectors = get_max_io_size(q, bio); > + unsigned bvecs = 0; > + > + *no_merge = true; > > bio_for_each_segment(bv, bio, iter) { > /* > + * With arbitrary bio size, the incoming bio may be very > + * big. We have to split the bio into small bios so that > + * each holds at most BIO_MAX_PAGES bvecs because > + * bio_clone() can fail to allocate big bvecs. > + * > + * It should have been better to apply the limit per > + * request queue in which bio_clone() is involved, > + * instead of globally. The biggest blocker is > + * bio_clone() in bio bounce. > + * > + * If bio is splitted by this reason, we should allow > + * to continue bios merging. > + * > + * TODO: deal with bio bounce's bio_clone() gracefully > + * and convert the global limit into per-queue limit. > + */ > + if (bvecs++ >= BIO_MAX_PAGES) { > + *no_merge = false; > + goto split; > + } That being said this simple if check here is simple enough that it's probably fine. But I see no need to uglify the whole code path with that no_merge flag. Please drop it for now, and if we start caring for this path in common code we should just move the REQ_NOMERGE setting into the actual blk_bio_*_split helpers.
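The arithmetic behind the split in this thread is simple: with the 256-bvec cap from the patch title, an n-segment bio becomes ceil(n / 256) smaller bios. A trivial sketch of that bound (helper name is mine):

```c
#define BIO_MAX_PAGES 256	/* the 256-bvec cap discussed in the thread */

/* Number of bios produced by splitting an n-bvec bio at BIO_MAX_PAGES,
 * i.e. ceil(nr_bvecs / BIO_MAX_PAGES) for nr_bvecs > 0. */
unsigned int nr_split_bios(unsigned int nr_bvecs)
{
	return (nr_bvecs + BIO_MAX_PAGES - 1) / BIO_MAX_PAGES;
}
```

This is why the `bvecs++ >= BIO_MAX_PAGES` check in the quoted hunk triggers the `split` path exactly once per 256 segments.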
Re: [PATCH] block: Fix secure erase
On Mon, Aug 15, 2016 at 12:16:30PM -0600, Jens Axboe wrote: >> This really should be a: >> >> if (req_op(rq) != req_op(pos)) >> >> I'll leave it up to Jens if he wants that in this patch or not, otherwise >> I'll send an incremental patch. > > Let's get a v2 with that fixed up, it makes a big readability > difference. It's not just readability, it's also a potential correctness issue. Not sure if we'd hit it for other request types at the moment, but we'd surely hit it if people add more.
Re: [PATCH] block: Fix secure erase
On 08/15/2016 12:13 PM, Christoph Hellwig wrote: --- a/block/elevator.c +++ b/block/elevator.c @@ -366,7 +366,10 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq) list_for_each_prev(entry, &q->queue_head) { struct request *pos = list_entry_rq(entry); - if ((req_op(rq) == REQ_OP_DISCARD) != (req_op(pos) == REQ_OP_DISCARD)) + if ((req_op(rq) == REQ_OP_DISCARD || +req_op(rq) == REQ_OP_SECURE_ERASE) != + (req_op(pos) == REQ_OP_DISCARD || +req_op(pos) == REQ_OP_SECURE_ERASE)) break; This really should be a: if (req_op(rq) != req_op(pos)) I'll leave it up to Jens if he wants that in this patch or not, otherwise I'll send an incremental patch. Let's get a v2 with that fixed up, it makes a big readability difference. -- Jens Axboe
Re: [PATCH] block: Fix secure erase
On Mon, Aug 15, 2016 at 11:43:12AM -0500, Shaun Tancheff wrote: > Hmm ... Since REQ_SECURE implied REQ_DISCARD, doesn't this > mean that we should include REQ_OP_SECURE_ERASE checking > wherever REQ_OP_DISCARD is being checked now in drivers/scsi/sd.c ? > > (It's only in 3 spots so it's a quickie patch) SCSI doesn't support secure erase operations. Only MMC really supports it, plus the usual cargo culting in Xen blkfront that's probably never been tested.
Re: [PATCH] block: Fix secure erase
> --- a/block/elevator.c > +++ b/block/elevator.c > @@ -366,7 +366,10 @@ void elv_dispatch_sort(struct request_queue *q, struct > request *rq) > list_for_each_prev(entry, &q->queue_head) { > struct request *pos = list_entry_rq(entry); > > - if ((req_op(rq) == REQ_OP_DISCARD) != (req_op(pos) == > REQ_OP_DISCARD)) > + if ((req_op(rq) == REQ_OP_DISCARD || > + req_op(rq) == REQ_OP_SECURE_ERASE) != > + (req_op(pos) == REQ_OP_DISCARD || > + req_op(pos) == REQ_OP_SECURE_ERASE)) > break; This really should be a: if (req_op(rq) != req_op(pos)) I'll leave it up to Jens if he wants that in this patch or not, otherwise I'll send an incremental patch. Otherwise this looks fine: Reviewed-by: Christoph Hellwig
Re: [BUG] Deadlock in blk_mq_register_disk error path
On 08/15/2016 10:15 AM, Jens Axboe wrote:
> Can you reproduce at will? Would be nice to know if it hit the error case, which is where it would hang.

Hello Jens,

Unfortunately this hang is only triggered sporadically by my tests. Over the past four weeks I have triggered several thousand scsi_remove_host() calls with my https://github.com/bvanassche/srp-test software. This morning was the first time that I ran into a blk-mq related hang.

Thanks,

Bart.
Re: [BUG] Deadlock in blk_mq_register_disk error path
On 08/15/2016 09:53 AM, Bart Van Assche wrote:
> On 08/02/2016 10:21 AM, Jens Axboe wrote:
>> On 08/02/2016 06:58 AM, Jinpu Wang wrote:
>>> Hi Jens,
>>>
>>> I found that in blk_mq_register_disk we call blk_mq_disable_hotplug, which in turn takes mutex_lock(&all_q_mutex);
>>>
>>> 	queue_for_each_hw_ctx(q, hctx, i) {
>>> 		ret = blk_mq_register_hctx(hctx);
>>> 		if (ret)
>>> 			break;	/* if we error out here, we will call unregister below */
>>> 	}
>>>
>>> 	if (ret)
>>> 		blk_mq_unregister_disk(disk);
>>>
>>> In blk_mq_unregister_disk, we will try to disable_hotplug again, which leads to a deadlock.
>>>
>>> Did I miss anything?
>>
>> Nope, your analysis looks correct. This should fix it:
>>
>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=6316338a94b2319abe9d3790eb9cdc56ef81ac1a
>
> Hi Jens,
>
> Will that patch be included in stable kernels? I just encountered a deadlock with kernel v4.7 that looks similar.

Sure, we can push it to stable, it's a pretty straightforward patch. Can you reproduce at will? Would be nice to know if it hit the error case, which is where it would hang.

-- 
Jens Axboe
Re: [PATCH] block: Fix secure erase
On Mon, Aug 15, 2016 at 9:07 AM, Adrian Hunter wrote: > Commit 288dab8a35a0 ("block: add a separate operation type for secure > erase") split REQ_OP_SECURE_ERASE from REQ_OP_DISCARD without considering > all the places REQ_OP_DISCARD was being used to mean either. Fix those. > > Signed-off-by: Adrian Hunter > Fixes: 288dab8a35a0 ("block: add a separate operation type for secure erase") > --- > block/bio.c | 21 +++-- > block/blk-merge.c| 33 +++-- > block/elevator.c | 5 - > drivers/mmc/card/block.c | 1 + > drivers/mmc/card/queue.c | 3 ++- > drivers/mmc/card/queue.h | 4 +++- > include/linux/bio.h | 10 -- > include/linux/blkdev.h | 6 -- > kernel/trace/blktrace.c | 2 +- > 9 files changed, 53 insertions(+), 32 deletions(-) Hmm ... Since REQ_SECURE implied REQ_DISCARD doesn't this mean that we should include REQ_OP_SECURE_ERASE checking wherever REQ_OP_DISCARD is being checked now in drivers/scsi/sd.c ? (It's only in 3 spots so it's a quickie patch) > diff --git a/block/bio.c b/block/bio.c > index f39477538fef..aa7354088008 100644 > --- a/block/bio.c > +++ b/block/bio.c > @@ -667,18 +667,19 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t > gfp_mask, > bio->bi_iter.bi_sector = bio_src->bi_iter.bi_sector; > bio->bi_iter.bi_size= bio_src->bi_iter.bi_size; > > - if (bio_op(bio) == REQ_OP_DISCARD) > - goto integrity_clone; > - > - if (bio_op(bio) == REQ_OP_WRITE_SAME) { > + switch (bio_op(bio)) { > + case REQ_OP_DISCARD: > + case REQ_OP_SECURE_ERASE: > + break; > + case REQ_OP_WRITE_SAME: > bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0]; > - goto integrity_clone; > + break; > + default: > + bio_for_each_segment(bv, bio_src, iter) > + bio->bi_io_vec[bio->bi_vcnt++] = bv; > + break; > } > > - bio_for_each_segment(bv, bio_src, iter) > - bio->bi_io_vec[bio->bi_vcnt++] = bv; > - > -integrity_clone: > if (bio_integrity(bio_src)) { > int ret; > > @@ -1788,7 +1789,7 @@ struct bio *bio_split(struct bio *bio, int sectors, > * Discards need a mutable bio_vec to 
accommodate the payload > * required by the DSM TRIM and UNMAP commands. > */ > - if (bio_op(bio) == REQ_OP_DISCARD) > + if (bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == > REQ_OP_SECURE_ERASE) > split = bio_clone_bioset(bio, gfp, bs); > else > split = bio_clone_fast(bio, gfp, bs); > diff --git a/block/blk-merge.c b/block/blk-merge.c > index 3eec75a9e91d..72627e3cf91e 100644 > --- a/block/blk-merge.c > +++ b/block/blk-merge.c > @@ -172,12 +172,18 @@ void blk_queue_split(struct request_queue *q, struct > bio **bio, > struct bio *split, *res; > unsigned nsegs; > > - if (bio_op(*bio) == REQ_OP_DISCARD) > + switch (bio_op(*bio)) { > + case REQ_OP_DISCARD: > + case REQ_OP_SECURE_ERASE: > split = blk_bio_discard_split(q, *bio, bs, &nsegs); > - else if (bio_op(*bio) == REQ_OP_WRITE_SAME) > + break; > + case REQ_OP_WRITE_SAME: > split = blk_bio_write_same_split(q, *bio, bs, &nsegs); > - else > + break; > + default: > split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs); > + break; > + } > > /* physical segments can be figured out during splitting */ > res = split ? 
split : *bio; > @@ -213,7 +219,7 @@ static unsigned int __blk_recalc_rq_segments(struct > request_queue *q, > * This should probably be returning 0, but blk_add_request_payload() > * (Christoph) > */ > - if (bio_op(bio) == REQ_OP_DISCARD) > + if (bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == > REQ_OP_SECURE_ERASE) > return 1; > > if (bio_op(bio) == REQ_OP_WRITE_SAME) > @@ -385,7 +391,9 @@ static int __blk_bios_map_sg(struct request_queue *q, > struct bio *bio, > nsegs = 0; > cluster = blk_queue_cluster(q); > > - if (bio_op(bio) == REQ_OP_DISCARD) { > + switch (bio_op(bio)) { > + case REQ_OP_DISCARD: > + case REQ_OP_SECURE_ERASE: > /* > * This is a hack - drivers should be neither modifying the > * biovec, nor relying on bi_vcnt - but because of > @@ -393,19 +401,16 @@ static int __blk_bios_map_sg(struct request_queue *q, > struct bio *bio, > * a payload we need to set up here (thank you Christoph) and > * bi_vcnt is really the only way of telling if we need to. > */ > - > - if (bio->bi_vcnt) > - goto single_segment; > - > - return
Re: [BUG] Deadlock in blk_mq_register_disk error path
On 08/15/2016 09:01 AM, Jinpu Wang wrote:
> It's more likely you hit another bug; my colleague Roman fixed that:
> http://www.spinics.net/lists/linux-block/msg04552.html

Hello Jinpu,

Interesting. However, I see that the patch description mentions the following: "Firstly this wrong sequence raises two kernel warnings: 1st. WARNING at lib/percpu-recount.c:309 percpu_ref_kill_and_confirm called more than once 2nd. WARNING at lib/percpu-refcount.c:331". I haven't seen any of these kernel warnings ...

Thanks,

Bart.
Re: [BUG] Deadlock in blk_mq_register_disk error path
Hi Bart,

>> Nope, your analysis looks correct. This should fix it:
>>
>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=6316338a94b2319abe9d3790eb9cdc56ef81ac1a
>
> Hi Jens,
>
> Will that patch be included in stable kernels? I just encountered a
> deadlock with kernel v4.7 that looks similar.
>
> Thank you,
>
> Bart.
>
> INFO: task kworker/u64:6:136 blocked for more than 480 seconds.
>       Tainted: G W 4.7.0-dbg+ #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/u64:6 D 88016f677bb0 0 136 2 0x
> Workqueue: events_unbound async_run_entry_fn
> Call Trace:
>  [] schedule+0x37/0x90
>  [] schedule_preempt_disabled+0x10/0x20
>  [] mutex_lock_nested+0x144/0x350
>  [] blk_mq_disable_hotplug+0x12/0x20
>  [] blk_mq_register_disk+0x29/0x120
>  [] blk_register_queue+0xb6/0x160
>  [] add_disk+0x219/0x4a0
>  [] sd_probe_async+0x100/0x1b0
>  [] async_run_entry_fn+0x45/0x140
>  [] process_one_work+0x1f9/0x6a0
>  [] worker_thread+0x49/0x490
>  [] kthread+0xea/0x100
>  [] ret_from_fork+0x1f/0x40
> 3 locks held by kworker/u64:6/136:
>  #0: ("events_unbound"){.+.+.+}, at: [] process_one_work+0x17a/0x6a0
>  #1: ((&entry->work)){+.+.+.}, at: [] process_one_work+0x17a/0x6a0
>  #2: (all_q_mutex){+.+.+.}, at: [] blk_mq_disable_hotplug+0x12/0x20
> INFO: task 02:8101 blocked for more than 480 seconds.
>       Tainted: G W 4.7.0-dbg+ #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 02 D 88039b747968 0 8101 1 0x0004
> Call Trace:
>  [] schedule+0x37/0x90
>  [] blk_mq_freeze_queue_wait+0x51/0xb0
>  [] blk_mq_update_tag_set_depth+0x3a/0xb0
>  [] blk_mq_init_allocated_queue+0x432/0x450
>  [] blk_mq_init_queue+0x35/0x60
>  [] scsi_mq_alloc_queue+0x17/0x50
>  [] scsi_alloc_sdev+0x2b9/0x350
>  [] scsi_probe_and_add_lun+0x98b/0xe50
>  [] __scsi_scan_target+0x5ca/0x6b0
>  [] scsi_scan_target+0xe1/0xf0
>  [] srp_create_target+0xf06/0x13d4 [ib_srp]
>  [] dev_attr_store+0x13/0x20
>  [] sysfs_kf_write+0x40/0x50
>  [] kernfs_fop_write+0x137/0x1c0
>  [] __vfs_write+0x23/0x140
>  [] vfs_write+0xb0/0x190
>  [] SyS_write+0x44/0xa0
>  [] entry_SYSCALL_64_fastpath+0x18/0xa8
> 8 locks held by 02/8101:
>  #0: (sb_writers#4){.+.+.+}, at: [] __sb_start_write+0xb2/0xf0
>  #1: (&of->mutex){+.+.+.}, at: [] kernfs_fop_write+0x101/0x1c0
>  #2: (s_active#363){.+.+.+}, at: [] kernfs_fop_write+0x10a/0x1c0
>  #3: (&host->add_target_mutex){+.+.+.}, at: [] srp_create_target+0x134/0x13d4 [ib_srp]
>  #4: (&shost->scan_mutex){+.+.+.}, at: [] scsi_scan_target+0x8d/0xf0
>  #5: (cpu_hotplug.lock){++}, at: [] get_online_cpus+0x2d/0x80
>  #6: (all_q_mutex){+.+.+.}, at: [] blk_mq_init_allocated_queue+0x34a/0x450
>  #7: (&set->tag_list_lock){+.+...}, at: [] blk_mq_init_allocated_queue+0x37a/0x450

It's more likely you hit another bug; my colleague Roman fixed that:
http://www.spinics.net/lists/linux-block/msg04552.html
It would be great if you could test it and see whether it works for you!

-- 
Mit freundlichen Grüßen,
Best Regards,

Jack Wang
Linux Kernel Developer Storage
ProfitBricks GmbH
The IaaS-Company.

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin
Tel: +49 30 5770083-42
Fax: +49 30 5770085-98
Email: jinpu.w...@profitbricks.com
URL: http://www.profitbricks.de

Sitz der Gesellschaft: Berlin.
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B.
Geschäftsführer: Andreas Gauger, Achim Weiss.
Re: [BUG] Deadlock in blk_mq_register_disk error path
On 08/02/2016 10:21 AM, Jens Axboe wrote:
> On 08/02/2016 06:58 AM, Jinpu Wang wrote:
>> Hi Jens,
>>
>> I found that in blk_mq_register_disk we call blk_mq_disable_hotplug, which in
>> turn takes mutex_lock(&all_q_mutex);
>>
>> 	queue_for_each_hw_ctx(q, hctx, i) {
>> 		ret = blk_mq_register_hctx(hctx);
>> 		if (ret)
>> 			break;	/* if we error out here, we will call unregister below */
>> 	}
>>
>> 	if (ret)
>> 		blk_mq_unregister_disk(disk);
>>
>> In blk_mq_unregister_disk, we will try to disable_hotplug again, which
>> leads to a deadlock.
>>
>> Did I miss anything?
>
> Nope, your analysis looks correct. This should fix it:
>
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=6316338a94b2319abe9d3790eb9cdc56ef81ac1a

Hi Jens,

Will that patch be included in stable kernels? I just encountered a deadlock with kernel v4.7 that looks similar.

Thank you,

Bart.

INFO: task kworker/u64:6:136 blocked for more than 480 seconds.
      Tainted: G W 4.7.0-dbg+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u64:6 D 88016f677bb0 0 136 2 0x
Workqueue: events_unbound async_run_entry_fn
Call Trace:
 [] schedule+0x37/0x90
 [] schedule_preempt_disabled+0x10/0x20
 [] mutex_lock_nested+0x144/0x350
 [] blk_mq_disable_hotplug+0x12/0x20
 [] blk_mq_register_disk+0x29/0x120
 [] blk_register_queue+0xb6/0x160
 [] add_disk+0x219/0x4a0
 [] sd_probe_async+0x100/0x1b0
 [] async_run_entry_fn+0x45/0x140
 [] process_one_work+0x1f9/0x6a0
 [] worker_thread+0x49/0x490
 [] kthread+0xea/0x100
 [] ret_from_fork+0x1f/0x40
3 locks held by kworker/u64:6/136:
 #0: ("events_unbound"){.+.+.+}, at: [] process_one_work+0x17a/0x6a0
 #1: ((&entry->work)){+.+.+.}, at: [] process_one_work+0x17a/0x6a0
 #2: (all_q_mutex){+.+.+.}, at: [] blk_mq_disable_hotplug+0x12/0x20
INFO: task 02:8101 blocked for more than 480 seconds.
      Tainted: G W 4.7.0-dbg+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
02 D 88039b747968 0 8101 1 0x0004
Call Trace:
 [] schedule+0x37/0x90
 [] blk_mq_freeze_queue_wait+0x51/0xb0
 [] blk_mq_update_tag_set_depth+0x3a/0xb0
 [] blk_mq_init_allocated_queue+0x432/0x450
 [] blk_mq_init_queue+0x35/0x60
 [] scsi_mq_alloc_queue+0x17/0x50
 [] scsi_alloc_sdev+0x2b9/0x350
 [] scsi_probe_and_add_lun+0x98b/0xe50
 [] __scsi_scan_target+0x5ca/0x6b0
 [] scsi_scan_target+0xe1/0xf0
 [] srp_create_target+0xf06/0x13d4 [ib_srp]
 [] dev_attr_store+0x13/0x20
 [] sysfs_kf_write+0x40/0x50
 [] kernfs_fop_write+0x137/0x1c0
 [] __vfs_write+0x23/0x140
 [] vfs_write+0xb0/0x190
 [] SyS_write+0x44/0xa0
 [] entry_SYSCALL_64_fastpath+0x18/0xa8
8 locks held by 02/8101:
 #0: (sb_writers#4){.+.+.+}, at: [] __sb_start_write+0xb2/0xf0
 #1: (&of->mutex){+.+.+.}, at: [] kernfs_fop_write+0x101/0x1c0
 #2: (s_active#363){.+.+.+}, at: [] kernfs_fop_write+0x10a/0x1c0
 #3: (&host->add_target_mutex){+.+.+.}, at: [] srp_create_target+0x134/0x13d4 [ib_srp]
 #4: (&shost->scan_mutex){+.+.+.}, at: [] scsi_scan_target+0x8d/0xf0
 #5: (cpu_hotplug.lock){++}, at: [] get_online_cpus+0x2d/0x80
 #6: (all_q_mutex){+.+.+.}, at: [] blk_mq_init_allocated_queue+0x34a/0x450
 #7: (&set->tag_list_lock){+.+...}, at: [] blk_mq_init_allocated_queue+0x37a/0x450
[PATCH v3] block: make sure a big bio is split into at most 256 bvecs
After arbitrary bio sizes became supported, the incoming bio may be very big. We have to split the bio into small bios so that each holds at most BIO_MAX_PAGES bvecs, for safety reasons such as bio_clone().

This patch fixes the following kernel crash:

> [  172.660142] BUG: unable to handle kernel NULL pointer dereference at 0028
> [  172.660229] IP: [] bio_trim+0xf/0x2a
> [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> [  172.660399] Oops: [#1] SMP
> [...]
> [  172.664780] Call Trace:
> [  172.664813] [] ? raid1_make_request+0x2e8/0xad7 [raid1]
> [  172.664846] [] ? blk_queue_split+0x377/0x3d4
> [  172.664880] [] ? md_make_request+0xf6/0x1e9 [md_mod]
> [  172.664912] [] ? generic_make_request+0xb5/0x155
> [  172.664947] [] ? prio_io+0x85/0x95 [bcache]
> [  172.664981] [] ? register_cache_set+0x355/0x8d0 [bcache]
> [  172.665016] [] ? register_bcache+0x1006/0x1174 [bcache]

The issue can be reproduced by the following steps:
- create one raid1 over two virtio-blk devices
- build a bcache device over the above raid1 and another cache device, with the bucket size set to 2 Mbytes
- set the cache mode to writeback
- run random writes over ext4 on the bcache device

Fixes: 54efd50 ("block: make generic_make_request handle arbitrarily sized bios")
Reported-by: Sebastian Roesner
Reported-by: Eric Wheeler
Cc: sta...@vger.kernel.org (4.3+)
Cc: Shaohua Li
Acked-by: Kent Overstreet
Signed-off-by: Ming Lei
---
V3:
	- rebase against v4.8-rc1 since .bi_rw of bio is renamed as .bi_opf
V2:
	- don't mark as REQ_NOMERGE in case the bio is split for reaching the bvec count limit
V1:
	- Kent pointed out that using the max io size can't cover the case of non-full bvecs/pages

 block/blk-merge.c | 35 ---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3eec75a..c44d3e9 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -85,7 +85,8 @@ static inline unsigned get_max_io_size(struct request_queue *q,
 static struct bio *blk_bio_segment_split(struct request_queue *q,
 					 struct bio *bio,
 					 struct bio_set *bs,
-					 unsigned *segs)
+					 unsigned *segs,
+					 bool *no_merge)
 {
 	struct bio_vec bv, bvprv, *bvprvp = NULL;
 	struct bvec_iter iter;
@@ -94,9 +95,34 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	bool do_split = true;
 	struct bio *new = NULL;
 	const unsigned max_sectors = get_max_io_size(q, bio);
+	unsigned bvecs = 0;
+
+	*no_merge = true;
 
 	bio_for_each_segment(bv, bio, iter) {
 		/*
+		 * With arbitrary bio size, the incoming bio may be very
+		 * big. We have to split the bio into small bios so that
+		 * each holds at most BIO_MAX_PAGES bvecs because
+		 * bio_clone() can fail to allocate big bvecs.
+		 *
+		 * It should have been better to apply the limit per
+		 * request queue in which bio_clone() is involved,
+		 * instead of globally. The biggest blocker is
+		 * bio_clone() in bio bounce.
+		 *
+		 * If bio is splitted by this reason, we should allow
+		 * to continue bios merging.
+		 *
+		 * TODO: deal with bio bounce's bio_clone() gracefully
+		 * and convert the global limit into per-queue limit.
+		 */
+		if (bvecs++ >= BIO_MAX_PAGES) {
+			*no_merge = false;
+			goto split;
+		}
+
+		/*
 		 * If the queue doesn't support SG gaps and adding this
 		 * offset would create a gap, disallow it.
 		 */
@@ -171,13 +197,15 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
 {
 	struct bio *split, *res;
 	unsigned nsegs;
+	bool no_merge_for_split = true;
 
 	if (bio_op(*bio) == REQ_OP_DISCARD)
 		split = blk_bio_discard_split(q, *bio, bs, &nsegs);
 	else if (bio_op(*bio) == REQ_OP_WRITE_SAME)
 		split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
 	else
-		split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs);
+		split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs,
+					      &no_merge_for_split);
 
 	/* physical segments can be figured out during splitting */
 	res = split ? split : *bio;
@@ -186,7 +214,8 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
 
 	if (split) {
 		/* there isn't chance to merge the splitted bio */
-		split->b
[PATCH] block: Fix secure erase
Commit 288dab8a35a0 ("block: add a separate operation type for secure erase") split REQ_OP_SECURE_ERASE from REQ_OP_DISCARD without considering all the places REQ_OP_DISCARD was being used to mean either. Fix those.

Signed-off-by: Adrian Hunter
Fixes: 288dab8a35a0 ("block: add a separate operation type for secure erase")
---
 block/bio.c              | 21 +++--
 block/blk-merge.c        | 33 +++--
 block/elevator.c         |  5 -
 drivers/mmc/card/block.c |  1 +
 drivers/mmc/card/queue.c |  3 ++-
 drivers/mmc/card/queue.h |  4 +++-
 include/linux/bio.h      | 10 --
 include/linux/blkdev.h   |  6 --
 kernel/trace/blktrace.c  |  2 +-
 9 files changed, 53 insertions(+), 32 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index f39477538fef..aa7354088008 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -667,18 +667,19 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
 	bio->bi_iter.bi_sector = bio_src->bi_iter.bi_sector;
 	bio->bi_iter.bi_size = bio_src->bi_iter.bi_size;
 
-	if (bio_op(bio) == REQ_OP_DISCARD)
-		goto integrity_clone;
-
-	if (bio_op(bio) == REQ_OP_WRITE_SAME) {
+	switch (bio_op(bio)) {
+	case REQ_OP_DISCARD:
+	case REQ_OP_SECURE_ERASE:
+		break;
+	case REQ_OP_WRITE_SAME:
 		bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
-		goto integrity_clone;
+		break;
+	default:
+		bio_for_each_segment(bv, bio_src, iter)
+			bio->bi_io_vec[bio->bi_vcnt++] = bv;
+		break;
 	}
 
-	bio_for_each_segment(bv, bio_src, iter)
-		bio->bi_io_vec[bio->bi_vcnt++] = bv;
-
-integrity_clone:
 	if (bio_integrity(bio_src)) {
 		int ret;
 
@@ -1788,7 +1789,7 @@ struct bio *bio_split(struct bio *bio, int sectors,
 	 * Discards need a mutable bio_vec to accommodate the payload
 	 * required by the DSM TRIM and UNMAP commands.
 	 */
-	if (bio_op(bio) == REQ_OP_DISCARD)
+	if (bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == REQ_OP_SECURE_ERASE)
 		split = bio_clone_bioset(bio, gfp, bs);
 	else
 		split = bio_clone_fast(bio, gfp, bs);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3eec75a9e91d..72627e3cf91e 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -172,12 +172,18 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
 	struct bio *split, *res;
 	unsigned nsegs;
 
-	if (bio_op(*bio) == REQ_OP_DISCARD)
+	switch (bio_op(*bio)) {
+	case REQ_OP_DISCARD:
+	case REQ_OP_SECURE_ERASE:
 		split = blk_bio_discard_split(q, *bio, bs, &nsegs);
-	else if (bio_op(*bio) == REQ_OP_WRITE_SAME)
+		break;
+	case REQ_OP_WRITE_SAME:
 		split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
-	else
+		break;
+	default:
 		split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs);
+		break;
+	}
 
 	/* physical segments can be figured out during splitting */
 	res = split ? split : *bio;
@@ -213,7 +219,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	 * This should probably be returning 0, but blk_add_request_payload()
 	 * (Christoph)
 	 */
-	if (bio_op(bio) == REQ_OP_DISCARD)
+	if (bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == REQ_OP_SECURE_ERASE)
 		return 1;
 
 	if (bio_op(bio) == REQ_OP_WRITE_SAME)
@@ -385,7 +391,9 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
 	nsegs = 0;
 	cluster = blk_queue_cluster(q);
 
-	if (bio_op(bio) == REQ_OP_DISCARD) {
+	switch (bio_op(bio)) {
+	case REQ_OP_DISCARD:
+	case REQ_OP_SECURE_ERASE:
 		/*
 		 * This is a hack - drivers should be neither modifying the
 		 * biovec, nor relying on bi_vcnt - but because of
@@ -393,19 +401,16 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
 		 * a payload we need to set up here (thank you Christoph) and
 		 * bi_vcnt is really the only way of telling if we need to.
 		 */
-
-		if (bio->bi_vcnt)
-			goto single_segment;
-
-		return 0;
-	}
-
-	if (bio_op(bio) == REQ_OP_WRITE_SAME) {
-single_segment:
+		if (!bio->bi_vcnt)
+			return 0;
+		/* Fall through */
+	case REQ_OP_WRITE_SAME:
 		*sg = sglist;
 		bvec = bio_iovec(bio);
 		sg_set_page(*sg, bvec.bv_page, bvec.bv_len, bvec.bv_offset);
 		return 1;
+	default:
+		break;
 	}
 
 	for_each_bio(bio)
diff --git a/block/elevator.c b/block/elevator.c