Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-15 Thread Shaun Tancheff
On Tue, Aug 9, 2016 at 11:38 PM, Damien Le Moal  wrote:
> Shaun,
>
> On 8/10/16 12:58, Shaun Tancheff wrote:
>>
>> On Tue, Aug 9, 2016 at 3:09 AM, Damien Le Moal 
>> wrote:

 On Aug 9, 2016, at 15:47, Hannes Reinecke  wrote:
>>
>>
>> [trim]
>>
> Since disk type == 0 for everything that isn't HM, I would prefer the
> sysfs 'zoned' file to just report whether the drive is HA or HM.
>
 Okay. So let's put in the 'zoned' attribute the device type:
 'host-managed', 'host-aware', or 'device managed'.
>>>
>>>
>>> I hacked your patches and simply put a "0" or "1" in the sysfs zoned
>>> file. Any drive that has ZBC/ZAC command support gets a "1", "0" for
>>> everything else. This means that drive managed models are not exposed
>>> as zoned block devices. For HM vs HA differentiation, an application
>>> can look at the device type file since it is already present.
>>>
>>> We could indeed set the "zoned" file to the device type, but HM drives
>>> and regular drives will both have "0" in it, so no differentiation
>>> possible. The other choice could be the "zoned" bits defined by ZBC,
>>> but these do not define a value for host managed drives, and the
>>> drive managed value being not "0" could be confusing too. So I settled
>>> for a simple 0/1 boolean.
>>
>>
>> This seems good to me.
>
>
> Another option I forgot is for the "zoned" file to indicate the total
> number of zones of the device, and 0 for a non-zoned regular block
> device. That would work as well.

Clearly either is sufficient.
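
(A minimal sketch of the boolean variant as an sd sysfs attribute, just
to make it concrete; sd_is_zoned() is a hypothetical helper and the
attribute registration is omitted:)

static ssize_t zoned_show(struct device *dev, struct device_attribute *attr,
			  char *buf)
{
	struct scsi_disk *sdkp = to_scsi_disk(dev);

	/* "1" for any ZBC/ZAC-capable drive, "0" for everything else */
	return snprintf(buf, PAGE_SIZE, "%u\n", sd_is_zoned(sdkp) ? 1 : 0);
}
static DEVICE_ATTR_RO(zoned);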

> [...]
>>>
>>> Done: I hacked Shaun's ioctl code and added finish zone too. The
>>> difference with Shaun's initial code is that the ioctls are propagated
>>> down to the driver (__blkdev_driver_ioctl -> sd_ioctl) so that there
>>> is no need for a BIO request definition for the zone operations. So a
>>> lot less code added.
>>
>>
>> The purpose of the BIO flags is not to enable the ioctls so much as
>> the other way round. Creating BIO op's is to enable issuing ZBC
>> commands from device mapper targets and file systems without some
>> heinous ioctl hacks.
>> Making the resulting block layer interfaces available via ioctls is just a
>> reasonable way to exercise the code ... or that was my intent.
>
>
> Yes, I understood your code. However, since (or if) we keep the zone
> information in the RB-tree cache, there is no need for the report zone
> operation BIO interface. Same for reset write pointer by keeping the mapping
> to discard. blk_lookup_zone can be used in kernel as a report zone BIO
> replacement and works as well for the report zone ioctl implementation. For
> reset, there is blkdev_issue_discard in kernel, and the reset zone ioctl
> becomes equivalent to the BLKDISCARD ioctl. These are simple. Open, close
> and finish zone remain. For these, adding the BIO interface seemed like
> overkill.
> Hence my choice of propagating the ioctl to the driver.
> This is debatable of course, and adding an in-kernel interface is not hard:
> we can implement blk_open_zone, blk_close_zone and blk_finish_zone using
> __blkdev_driver_ioctl. That looks clean to me.

Uh. I would call that "heinous" ioctl hacks myself. Kernel -> User API ->
Kernel is not really a good design IMO.
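
(For reference, the in-kernel wrappers being discussed would look
roughly like this; BLKOPENZONE and the calling convention are
assumptions, not an agreed ABI:)

static int blk_open_zone(struct block_device *bdev, sector_t sector)
{
	/* funnel the zone op down to the driver via the ioctl path */
	return __blkdev_driver_ioctl(bdev, FMODE_READ | FMODE_WRITE,
				     BLKOPENZONE, (unsigned long)sector);
}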

> Overall, my concern with the BIO based interface for the ZBC commands is
> that it adds one flag for each command, which is not really the philosophy
> of the interface and potentially opens the door for more such
> implementations in the future with new standards and new commands coming up.
> Clearly that is not a sustainable path. So I think that a more specific
> interface for these zone operations is a better choice. That is consistent
> with what happens with the tons of ATA and SCSI commands not actually doing
> data I/Os (mode sense, log pages, SMART, etc). All these do not use BIOs and
> are processed as request REQ_TYPE_BLOCK_PC.

Part of the reason for following on Mike Christie's bio op/flags cleanup was
to make these ops. The advantage of being added as ops is that only one
extra bit is needed (not 4 or 5 bits for flags). The other reason for being
promoted into the block layer as commands is that it seems to me to make
sense for these abstractions to be allowed to pass through a DM layer and
be handled by a file system.
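
(To make the bit-count argument concrete, a sketch; the names and
ordering below are illustrative, not the actual enum:)

enum req_op {
	REQ_OP_READ,
	REQ_OP_WRITE,
	REQ_OP_DISCARD,
	REQ_OP_SECURE_ERASE,
	REQ_OP_WRITE_SAME,
	REQ_OP_FLUSH,
	/* proposed zone ops consume new op codes, not new flag bits */
	REQ_OP_ZONE_REPORT,
	REQ_OP_ZONE_OPEN,
	REQ_OP_ZONE_CLOSE,
	REQ_OP_ZONE_FINISH,
	REQ_OP_ZONE_RESET,
};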

>>> The ioctls do not mimic the ZBC standard exactly. For instance, there
>>> are no reporting options for report zones, nor is the "all" bit
>>> supported for open, close or finish zone commands. But the information
>>> provided on zones is complete and maps to the standard definitions.
>>
>>
>> For the reporting options I have planned to reuse the stream_id in
>> struct bio when that is formalized. There are certainly other places in
>> struct bio to stuff a few extra bits ...

Sorry I was confused here. I was under the impression you were talking
about one of my patches when you seem to have been talking about
your hacking thereof.

> We could add reporting options to blk_loo

Re: [Query] increased latency observed in cpu hotplug path

2016-08-15 Thread Khan, Imran
On 8/5/2016 12:49 PM, Khan, Imran wrote:
> On 8/1/2016 2:58 PM, Khan, Imran wrote:
>> On 7/30/2016 7:54 AM, Akinobu Mita wrote:
>>> 2016-07-28 22:18 GMT+09:00 Khan, Imran :

 Hi,

 Recently we have observed increased latency in CPU hotplug
 events, in the CPU online path. For online latency we see that
 the block layer executes a notification handler for the
 CPU_UP_PREPARE event, and this in turn waits for an RCU grace
 period, resulting (sometimes) in an execution time of 15-20 ms
 for this notification handler. This behavior was not present in
 the 3.18 kernel but is in the 4.4 kernel, introduced by the
 following commit:


 commit 5778322e67ed34dc9f391a4a5cbcbb856071ceba
 Author: Akinobu Mita 
 Date:   Sun Sep 27 02:09:23 2015 +0900

 blk-mq: avoid inserting requests before establishing new mapping
>>>
>>> ...
>>>
 Upon reverting this commit I could see an improvement of 15-20 ms in
 online latency. So I am looking for some help in analyzing the effects
 of reverting it, or whether some other approach to reducing the online
 latency should be taken.
>>>
>>> Can you observe the difference in online latency by removing the
>>> get_online_cpus() and put_online_cpus() pair in
>>> blk_mq_init_allocated_queue() instead of fully reverting the commit?
>>>
>> Hi Akinobu,
>> I tried your suggestion but could not achieve any improvement. Actually
>> the snippet that is causing the change in latency is the following one:
>>
>> list_for_each_entry(q, &all_q_list, all_q_node) {
>> blk_mq_freeze_queue_wait(q);
>>
>> /*
>>  * timeout handler can't touch hw queue during the
>>  * reinitialization
>>  */
>> del_timer_sync(&q->timeout);
>>  }
>>
>> I understand that this is now getting executed for CPU_UP_PREPARE as
>> well, resulting in increased latency in the cpu online path. I am
>> trying to reduce this latency while keeping the purpose of this commit
>> intact. I would welcome further suggestions/feedback in this regard.
>>
> Hi Akinobu,
> 
> I am not able to reduce the cpu online latency with this patch. Could you
> please let me know what functionality would break if we left this patch
> out of our kernel? Also, if you have other suggestions for improving this
> patch, please let me know.
> 
After moving the remapping of queues to the block layer's kworker I see
that online latency has improved while offline latency remains the same.
As the freezing of queues happens in the context of the block layer's
worker, I think it would be better to do the remapping in the same context
and then go ahead with the freezing. In this regard I have made the
following change:

commit b2131b86eeef4c5b1f8adaf7a53606301aa6b624
Author: Imran Khan 
Date:   Fri Aug 12 19:59:47 2016 +0530

blk-mq: Move block queue remapping from cpu hotplug path

During a cpu hotplug, the hardware and software context mappings
need to be updated in order to take into account requests
submitted for the hotadded CPU. But if this remapping is done
in the hotplug notifier, it increases the hotplug latency.
So move the block queue remapping to the block layer worker,
which results in a significant improvement in hotplug latency.

Change-Id: I01ac83178ce95c3a4e3b7b1b286eda65ff34e8c4
Signed-off-by: Imran Khan 

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6d6f8fe..06fcf89 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -22,7 +22,11 @@
 #include 
 #include 
 #include 
-
+#include 
+#include 
+#include 
+#include 
+#include 
 #include 

 #include 
@@ -32,10 +36,18 @@

 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
-
 static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx);

 /*
+ * New online cpumask which is going to be set in this hotplug event.
+ * Declare this cpumask as global as cpu-hotplug operations are invoked
+ * one-by-one and dynamically allocating it could result in a failure.
+ */
+static struct cpumask online_new;
+
+static struct work_struct blk_mq_remap_work;
+
+/*
  * Check if any of the ctx's have pending work in this hardware queue
  */
 static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
@@ -2125,14 +2137,7 @@ static void blk_mq_queue_reinit(struct request_queue *q,
 static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
  unsigned long action, void *hcpu)
 {
-   struct request_queue *q;
int cpu = (unsigned long)hcpu;
-   /*
-* New online cpumask which is going to be set in this hotplug event.
-* Declare this cpumasks as global as cpu-hotplug operation is invoked
-* one-by-one and dynamically allocating this could result in a failure.
-*/
-   static struct cpumask online_new;

/*
 * Before hotadded cpu starts handling requests, new mappings must
@@ -2155,43 +2160,17 @@ static int

Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-15 Thread Shaun Tancheff
On Mon, Aug 15, 2016 at 11:00 PM, Damien Le Moal  wrote:
>
> Shaun,
>
>> On Aug 14, 2016, at 09:09, Shaun Tancheff  wrote:
> […]

>>> No, surely not.
>>> But one of the _big_ advantages for the RB tree is blkdev_discard().
>>> Without the RB tree any mkfs program will issue a 'discard' for every
>>> sector. We will be able to coalesce those into one discard per zone, but
>>> we still need to issue one for _every_ zone.
>>
>> How can you make coalescing work transparently in the
>> sd layer _without_ keeping some sort of a discard cache along
>> with the zone cache?
>>
>> Currently the block layer's blkdev_issue_discard() is breaking
>> large discard's into nice granular and aligned chunks but it is
>> not preventing small discards nor coalescing them.
>>
>> In the sd layer would there be way to persist or purge an
>> overly large discard cache? What about honoring
>> discard_zeroes_data? Once the discard is completed with
>> discard_zeroes_data you have to return zeroes whenever
>> a discarded sector is read. Isn't that a lot more than just
>> tracking a write pointer? Couldn't a zone have dozens of holes?
>
> My understanding of the standards regarding discard is that it is not
> mandatory and that it is a hint to the drive. The drive can completely
> ignore it if it thinks that is a better choice. I may be wrong on this
> though. Need to check again.

But you are currently setting discard_zeroes_data=1 in your
current patches. I believe that setting discard_zeroes_data=1
effectively promotes discards to being mandatory.
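
(The knob in question is the queue limit below; a sketch of what the
current patches appear to be doing:)

	q->limits.discard_zeroes_data = 1;	/* reads of discarded LBAs must return zeroes */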

I have a follow-on patch to my SCT Write Same series that
handles the CMR zone case in the sd_zbc_setup_discard() handler.

> For reset write pointer, the mapping to discard requires that the calls
> to blkdev_issue_discard be zone aligned for anything to happen. Specify
> less than a zone and nothing will be done. This, I think, preserves the
> discard semantics.

Oh. If that is the intent then there is just a bug in the handler.
I have pointed out where I believe it to be in my response to
the zone cache patch being posted.

> As for the “discard_zeroes_data” thing, I also think that is a drive
> feature, not mandatory. Drives may have it or not, which is consistent
> with the ZBC/ZAC standards regarding reading after write pointer (nothing
> says that zeros have to be returned). In any case, discard of CMR zones
> will be a nop, so for SMR drives, discard_zeroes_data=0 may be a better
> choice.

However I am still curious about discards being coalesced.

>>> Which is (as indicated) really slow, and easily takes several minutes.
>>> With the RB tree we can short-circuit discards to empty zones, and speed
>>> up processing time dramatically.
>>> Sure we could be moving the logic into mkfs and friends, but that would
>>> require us to change the programs and agree on a library (libzbc?) which
>>> should be handling that.
>>
>> F2FS's mkfs.f2fs is already reading the zone topology via SG_IO ...
>> so I'm not sure your argument is valid here.
>
> This initial SMR support patch is just that: a first try. Jaegeuk
> used SG_IO (in fact copy-paste of parts of libzbc) because the current
> ZBC patch-set has no ioctl API for zone information manipulation. We
> will fix mkfs.f2fs once we agree on an ioctl interface.

Which again is my point. If mkfs.f2fs wants to speed up its
discard pass by _not_ sending unnecessary
Reset WP for zones that are already empty, it has all the
information it needs to do so.
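
(A sketch of that mkfs-side pass; struct zone_desc and reset_zone_wp()
are hypothetical stand-ins for whatever ioctl interface gets agreed on:)

struct zone_desc {			/* filled from a report-zones call */
	__u64 start;			/* zone start LBA */
	__u64 wp;			/* current write pointer LBA */
};

static void reset_nonempty_zones(int fd, const struct zone_desc *zones,
				 unsigned int nr_zones)
{
	unsigned int i;

	for (i = 0; i < nr_zones; i++) {
		if (zones[i].wp == zones[i].start)	/* zone is empty: skip */
			continue;
		reset_zone_wp(fd, zones[i].start);	/* hypothetical wrapper */
	}
}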

Here it seems to me that the zone cache is _at_best_
doing double work. At worst the zone cache could be
doing the wrong thing _if_ the zone cache got out of sync.
It is certainly possible (however unlikely) that someone was
doing some raw sg activity that is not seen by the sd path.

All I am trying to do is have a discussion about the reasons for
and against having a zone cache. Where it works and where it breaks
should be an entirely technical question, but I understand that we
have all spent a lot of time _not_ discussing this for various
non-technical reasons.

So far the only reason I've been able to ascertain is that
Host Managed drives really don't like being stuck with the
URSWRZ and would like to have a software hack to return
MUD rather than ship drives with some weird out-of-the-box
config where the last zone is marked as FINISH'd, thereby
returning MUD on reads as per spec.

I understand that it would be a strange state to see on first
boot, and likely people would just do a ResetWP and have
weird boot errors, which would probably just make matters
worse.

I just would rather the workaround be a bit cleaner and/or
use less memory. I would also like a path available that
does not require SD_ZBC or BLK_ZONED for Host Aware
drives to work, hence this set of patches and me begging
for a single bit in struct bio.

>>
>> [..]
>>
> 3) Try to condense the blkzone data structure to save memory:
> I think that we can at the very least remove the zone length, and also
> may be t

Re: [RFC] libata-scsi: make sure Maximum Write Same Length is not too large

2016-08-15 Thread Martin K. Petersen
> "Tom" == Tom Yan  writes:

Tom,

Tom> The thing is, as of ACS-4, blocks that carry DSM/TRIM LBA Range
Tom> Entries are always 512-byte.

Lovely. And SAT conveniently ignores this entirely.

Tom> Honestly, I have no idea how that would work on a 4Kn SSD, if it is
Tom> / will ever be a thing.

Highly unlikely.

But I guess you would have to report 8 512-byte ranges in the 4Kn
case. And then rely on the patch we talked about to clamp the number of
sectors based on the block layer limit.
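
(Back of the envelope, assuming the SAT payload keeps the ACS layout of
8-byte range entries; a sketch, not anything in the patch:)

#define TRIM_ENTRY_BYTES	8
#define TRIM_RANGES_PER_512B	(512 / TRIM_ENTRY_BYTES)	/* 64 entries */
#define TRIM_RANGE_BLOCKS_4KN	(4096 / 512)			/* 8 range blocks per 4Kn sector */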

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v2 2/2] sd: check BLK_DEF_MAX_SECTORS against max_dev_sectors

2016-08-15 Thread Martin K. Petersen
> "Tom" == Tom Yan  writes:

Tom,

>> It would be pretty unusual for a device that is smart enough to
>> report a transfer length limit to be constrained to 1 MB and change.

Tom> Well, it is done pretty much for libata's SATL.

But why?

>> rw_max = min(BLK_DEF_MAX_SECTORS, q->limits.max_dev_sectors);

Tom> That won't really work. min_t() would, though the line is gonna be
Tom> a bit long; not sure if I can/should simply cast the type (unsigned
Tom> int) to BLK_DEF_MAX_SECTORS. And which braces are you referring to?

I'd rather have a split line than double lines with braces.
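
(Concretely, something like the following; min_t() is needed because
BLK_DEF_MAX_SECTORS is an enum constant, so plain min() trips the type
check:)

	rw_max = min_t(unsigned int, BLK_DEF_MAX_SECTORS,
		       q->limits.max_dev_sectors);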

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [RFC] sd: dynamically adjust SD_MAX_WS16_BLOCKS as per the actual logical block size

2016-08-15 Thread Martin K. Petersen
> "Tom" == Tom Yan  writes:

Tom,

>> 0x7fffff, the maximum number of block layer sectors that can be
>> expressed in a single bio.

Tom> Hmm, so when we queue any of the limits, we convert a certain
Tom> maximum number of physical sectors (which we have already been
Tom> doing)

logical sectors

Tom> to its corresponding maximum number of block layer sectors, and
Tom> then make sure that number does not exceed 0x7f, right?

Yep.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-15 Thread Damien Le Moal

Shaun,

> On Aug 14, 2016, at 09:09, Shaun Tancheff  wrote:
[…]
>>> 
>> No, surely not.
>> But one of the _big_ advantages for the RB tree is blkdev_discard().
>> Without the RB tree any mkfs program will issue a 'discard' for every
>> sector. We will be able to coalesce those into one discard per zone, but
>> we still need to issue one for _every_ zone.
> 
> How can you make coalescing work transparently in the
> sd layer _without_ keeping some sort of a discard cache along
> with the zone cache?
> 
> Currently the block layer's blkdev_issue_discard() is breaking
> large discard's into nice granular and aligned chunks but it is
> not preventing small discards nor coalescing them.
> 
> In the sd layer would there be way to persist or purge an
> overly large discard cache? What about honoring
> discard_zeroes_data? Once the discard is completed with
> discard_zeroes_data you have to return zeroes whenever
> a discarded sector is read. Isn't that a lot more than just
> tracking a write pointer? Couldn't a zone have dozens of holes?

My understanding of the standards regarding discard is that it is not
mandatory and that it is a hint to the drive. The drive can completely
ignore it if it thinks that is a better choice. I may be wrong on this
though. Need to check again.
For reset write pointer, the mapping to discard requires that the calls
to blkdev_issue_discard be zone aligned for anything to happen. Specify
less than a zone and nothing will be done. This, I think, preserves the
discard semantics.
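
(In other words, something like the sketch below; the helper name is
made up for illustration:)

static bool discard_maps_to_reset_wp(sector_t sector, sector_t nr_sects,
				     sector_t zone_start, sector_t zone_len)
{
	/* only a discard covering exactly one whole zone maps to
	 * RESET WRITE POINTER; anything else is ignored as a hint */
	return sector == zone_start && nr_sects == zone_len;
}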

As for the “discard_zeroes_data” thing, I also think that is a drive
feature, not mandatory. Drives may have it or not, which is consistent
with the ZBC/ZAC standards regarding reading after write pointer (nothing
says that zeros have to be returned). In any case, discard of CMR zones
will be a nop, so for SMR drives, discard_zeroes_data=0 may be a better
choice.

> 
>> Which is (as indicated) really slow, and easily takes several minutes.
>> With the RB tree we can short-circuit discards to empty zones, and speed
>> up processing time dramatically.
>> Sure we could be moving the logic into mkfs and friends, but that would
>> require us to change the programs and agree on a library (libzbc?) which
>> should be handling that.
> 
> F2FS's mkfs.f2fs is already reading the zone topology via SG_IO ...
> so I'm not sure your argument is valid here.

This initial SMR support patch is just that: a first try. Jaegeuk
used SG_IO (in fact copy-paste of parts of libzbc) because the current
ZBC patch-set has no ioctl API for zone information manipulation. We
will fix mkfs.f2fs once we agree on an ioctl interface.

> 
> [..]
> 
 3) Try to condense the blkzone data structure to save memory:
 I think that we can at the very least remove the zone length, and
 maybe the per zone spinlock too (a single spinlock and proper state
 flags can be used).
>>> 
>>> I have a variant that is an array of descriptors that roughly mimics the
>>> api from blk-zoned.c that I did a few months ago as an example.
>>> I should be able to update that to the current kernel + patches.
>>> 
>> Okay. If we restrict the in-kernel SMR drive handling to devices with
>> identical zone sizes of course we can remove the zone length.
>> And we can do away with the per-zone spinlock and use a global one instead.
> 
> I don't think dropping the zone length is a reasonable thing to do.
> 
> What I propose is an array of _descriptors_; it doesn't drop _any_
> of the zone information that you are holding in an RB tree, it is
> just a condensed format that _mostly_ plugs into your existing
> API.

I do not agree. The Seagate drive already has one zone (the last one)
that is not the same length as the other zones. Sure, since it is the
last one, we can add "if (last zone)" all over the place and make it
work. But that is really ugly. Keeping the length field makes the code
generic and following the standard, which has no restriction on the
zone sizes. We could do some memory optimisation using different types
of blk_zone structs, the types mapping to the SAME value: drives with
a constant zone size can use a blk_zone type without the length field,
others use a different type that includes the field. Accessor functions
can hide the different types in the zone manipulation code.
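
(A sketch of that idea; all names are hypothetical, and the choice of
type would key off the REPORT ZONES "SAME" value:)

struct blk_zone_fixed {			/* constant zone size: length implied */
	sector_t start;
	sector_t wp;
	unsigned int cond;
};

struct blk_zone_full {			/* general case keeps the per-zone length */
	sector_t start;
	sector_t len;
	sector_t wp;
	unsigned int cond;
};

/* accessor hides which representation a given drive uses */
static inline sector_t blk_zone_len(bool fixed_layout, sector_t chunk_sectors,
				    const struct blk_zone_full *z)
{
	return fixed_layout ? chunk_sectors : z->len;
}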

Best regards.


-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital brand
damien.lem...@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, 
Kanagawa, 252-0888 Japan
www.hgst.com


Re: Time to make dynamically allocated devt the default for scsi disks?

2016-08-15 Thread Bart Van Assche

On 08/14/2016 11:23 AM, Dan Williams wrote:

[ adding Bart back to the cc ]

On Sun, Aug 14, 2016 at 11:08 AM, Dan Williams  wrote:

On Sun, Aug 14, 2016 at 10:20 AM, James Bottomley
 wrote:

[..]

I like it.  I still think the bdi registration code should be in
charge of taking the extra reference on the disk device's parent to
isolate / make clear why the extra reference is being acquired, but I
won't lose sleep if Jens takes your smaller change.

Thanks James!


Bart, do you have a test configuration already set up for this case?
Can you give the two patches from James a try?

https://patchwork.kernel.org/patch/9278201/
https://patchwork.kernel.org/patch/9279513/


Hello Dan,

The "sysfs: cannot create duplicate filename" issue is something I ran 
into sporadically before I started using my patch that fixes this issue. 
I have not yet found a systematic way to trigger this behavior. Anyway, 
I will drop my patch and will start testing James' two patches. It will 
take a few days to test these patches thoroughly.


Bart.


Re: [PATCH v3] block: make sure big bio is splitted into at most 256 bvecs

2016-08-15 Thread Kent Overstreet
On Mon, Aug 15, 2016 at 11:23:28AM -0700, Christoph Hellwig wrote:
> On Mon, Aug 15, 2016 at 11:11:22PM +0800, Ming Lei wrote:
> > After arbitrary bio size is supported, the incoming bio may
> > be very big. We have to split the bio into small bios so that
> > each holds at most BIO_MAX_PAGES bvecs for safety reasons, such
> > as bio_clone().
> 
> I still think working around a rough driver submitting too large
> I/O is a bad thing until we've done a full audit of all consuming
> bios through ->make_request, and we've enabled it for the common
> path as well.

bcache originally had workaround code to split too-large bios when it first went
upstream - that was dropped only after the patches to make
generic_make_request() handle arbitrary size bios went in. So to do what you're
suggesting would mean reverting that bcache patch and bringing that code back,
which from my perspective would be a step in the wrong direction. I just want to
get this over and done with.

re: interactions with other drivers - bio_clone() has already been changed to
only clone biovecs that are live for the current bi_iter, so there shouldn't be any
safety issues. A driver would have to be intentionally doing its own open coded
bio cloning that clones all of bi_io_vec, not just the active ones - but if
they're doing that, they're already broken because a driver isn't allowed to
look at bi_vcnt if it isn't a bio that it owns - bi_vcnt is 0 on bios that don't
own their biovec (i.e. that were created by bio_clone_fast).
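
(To illustrate the rule: a driver that does not own a bio walks the
live biovecs through bi_iter instead of indexing bi_io_vec[0..bi_vcnt);
process_page() is a stand-in:)

	struct bio_vec bv;
	struct bvec_iter iter;

	bio_for_each_segment(bv, bio, iter)
		process_page(bv.bv_page, bv.bv_offset, bv.bv_len);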

And the cloning and bi_vcnt usage stuff I audited very thoroughly back when I
was working on immutable biovecs and such back in the day, and I had to do a
fair amount of cleanup/refactoring before that stuff could go in.

> 
> > bool do_split = true;
> > struct bio *new = NULL;
> > const unsigned max_sectors = get_max_io_size(q, bio);
> > +   unsigned bvecs = 0;
> > +
> > +   *no_merge = true;
> >  
> > bio_for_each_segment(bv, bio, iter) {
> > /*
> > +* With arbitrary bio size, the incoming bio may be very
> > +* big. We have to split the bio into small bios so that
> > +* each holds at most BIO_MAX_PAGES bvecs because
> > +* bio_clone() can fail to allocate big bvecs.
> > +*
> > +* It would have been better to apply the limit per
> > +* request queue in which bio_clone() is involved,
> > +* instead of globally. The biggest blocker is
> > +* bio_clone() in bio bounce.
> > +*
> > +* If a bio is split for this reason, we should allow
> > +* bios merging to continue.
> > +*
> > +* TODO: deal with bio bounce's bio_clone() gracefully
> > +* and convert the global limit into per-queue limit.
> > +*/
> > +   if (bvecs++ >= BIO_MAX_PAGES) {
> > +   *no_merge = false;
> > +   goto split;
> > +   }
> 
> That being said this simple if check here is simple enough that it's
> probably fine.  But I see no need to uglify the whole code path
> with that no_merge flag.  Please drop it for now, and if we start
> caring for this path in common code we should just move the
> REQ_NOMERGE setting into the actual blk_bio_*_split helpers.

Agreed about the no_merge thing.


Re: [PATCH v3] block: make sure big bio is splitted into at most 256 bvecs

2016-08-15 Thread Christoph Hellwig
On Mon, Aug 15, 2016 at 11:11:22PM +0800, Ming Lei wrote:
> After arbitrary bio size is supported, the incoming bio may
> be very big. We have to split the bio into small bios so that
> each holds at most BIO_MAX_PAGES bvecs for safety reasons, such
> as bio_clone().

I still think working around a rough driver submitting too large
I/O is a bad thing until we've done a full audit of all consuming
bios through ->make_request, and we've enabled it for the common
path as well.

>   bool do_split = true;
>   struct bio *new = NULL;
>   const unsigned max_sectors = get_max_io_size(q, bio);
> + unsigned bvecs = 0;
> +
> + *no_merge = true;
>  
>   bio_for_each_segment(bv, bio, iter) {
>   /*
> +  * With arbitrary bio size, the incoming bio may be very
> +  * big. We have to split the bio into small bios so that
> +  * each holds at most BIO_MAX_PAGES bvecs because
> +  * bio_clone() can fail to allocate big bvecs.
> +  *
> +  * It would have been better to apply the limit per
> +  * request queue in which bio_clone() is involved,
> +  * instead of globally. The biggest blocker is
> +  * bio_clone() in bio bounce.
> +  *
> +  * If a bio is split for this reason, we should allow
> +  * bios merging to continue.
> +  *
> +  * TODO: deal with bio bounce's bio_clone() gracefully
> +  * and convert the global limit into per-queue limit.
> +  */
> + if (bvecs++ >= BIO_MAX_PAGES) {
> + *no_merge = false;
> + goto split;
> + }

That being said this simple if check here is simple enough that it's
probably fine.  But I see no need to uglify the whole code path
with that no_merge flag.  Please drop it for now, and if we start
caring for this path in common code we should just move the
REQ_NOMERGE setting into the actual blk_bio_*_split helpers.


Re: [PATCH] block: Fix secure erase

2016-08-15 Thread Christoph Hellwig
On Mon, Aug 15, 2016 at 12:16:30PM -0600, Jens Axboe wrote:
>> This really should be a:
>>
>>  if (req_op(rq) != req_op(pos))
>>
>> I'll leave it up to Jens if he wants that in this patch or not, otherwise
>> I'll send an incremental patch.
>
> Let's get a v2 with that fixed up, it makes a big readability
> difference.

It's not just readability, it's also a potential correctness issue.
Not sure if we'd hit it for other request types at the moment, but
we'd surely hit it if people add more...


Re: [PATCH] block: Fix secure erase

2016-08-15 Thread Jens Axboe

On 08/15/2016 12:13 PM, Christoph Hellwig wrote:

--- a/block/elevator.c
+++ b/block/elevator.c
@@ -366,7 +366,10 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
 	list_for_each_prev(entry, &q->queue_head) {
 		struct request *pos = list_entry_rq(entry);
 
-		if ((req_op(rq) == REQ_OP_DISCARD) != (req_op(pos) == REQ_OP_DISCARD))
+		if ((req_op(rq) == REQ_OP_DISCARD ||
+		     req_op(rq) == REQ_OP_SECURE_ERASE) !=
+		    (req_op(pos) == REQ_OP_DISCARD ||
+		     req_op(pos) == REQ_OP_SECURE_ERASE))
 			break;


This really should be a:

if (req_op(rq) != req_op(pos))

I'll leave it up to Jens if he wants that in this patch or not, otherwise
I'll send an incremental patch.


Let's get a v2 with that fixed up, it makes a big readability
difference.

--
Jens Axboe


Re: [PATCH] block: Fix secure erase

2016-08-15 Thread Christoph Hellwig
On Mon, Aug 15, 2016 at 11:43:12AM -0500, Shaun Tancheff wrote:
> Hmm ... Since REQ_SECURE implied REQ_DISCARD doesn't this
> mean that we should include REQ_OP_SECURE_ERASE checking
> wherever REQ_OP_DISCARD is being checked now in drivers/scsi/sd.c ?
> 
> (It's only in 3 spots so it's a quickie patch)

SCSI doesn't support secure erase operations.  Only MMC really
supports it, plus the usual cargo culting in Xen blkfront that's
probably never been tested..


Re: [PATCH] block: Fix secure erase

2016-08-15 Thread Christoph Hellwig
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -366,7 +366,10 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
>  	list_for_each_prev(entry, &q->queue_head) {
>  		struct request *pos = list_entry_rq(entry);
>  
> -		if ((req_op(rq) == REQ_OP_DISCARD) != (req_op(pos) == REQ_OP_DISCARD))
> +		if ((req_op(rq) == REQ_OP_DISCARD ||
> +		     req_op(rq) == REQ_OP_SECURE_ERASE) !=
> +		    (req_op(pos) == REQ_OP_DISCARD ||
> +		     req_op(pos) == REQ_OP_SECURE_ERASE))
>  			break;

This really should be a:

if (req_op(rq) != req_op(pos))

I'll leave it up to Jens if he wants that in this patch or not, otherwise
I'll send an incremental patch.

Otherwise this looks fine:

Reviewed-by: Christoph Hellwig 


Re: [BUG] Deadlock in blk_mq_register_disk error path

2016-08-15 Thread Bart Van Assche

On 08/15/2016 10:15 AM, Jens Axboe wrote:

Can you reproduce at will? Would be nice to know if it hit the error case,
which is where it would hang.


Hello Jens,

Unfortunately this hang is only triggered sporadically by my tests.
Over the past four weeks I have triggered several thousand
scsi_remove_host() calls with my https://github.com/bvanassche/srp-test
software. This morning was the first time that I ran into a blk-mq
related hang.


Thanks,

Bart.




Re: [BUG] Deadlock in blk_mq_register_disk error path

2016-08-15 Thread Jens Axboe

On 08/15/2016 09:53 AM, Bart Van Assche wrote:

On 08/02/2016 10:21 AM, Jens Axboe wrote:

On 08/02/2016 06:58 AM, Jinpu Wang wrote:

Hi Jens,

I found in blk_mq_register_disk, we blk_mq_disable_hotplug which in
turn mutex_lock(&all_q_mutex);
  queue_for_each_hw_ctx(q, hctx, i) {
ret = blk_mq_register_hctx(hctx);
if (ret)
break; /// if we are about to error out, we will call
unregister below
}

if (ret)
blk_mq_unregister_disk(disk);

In blk_mq_unregister_disk, we will try to disable_hotplug again, which
leads to dead lock.

Did I miss anything?


Nope, your analysis looks correct. This should fix it:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=6316338a94b2319abe9d3790eb9cdc56ef81ac1a


Hi Jens,

Will that patch be included in stable kernels? I just encountered a
deadlock with kernel v4.7 that looks similar.


Sure, we can push it to stable, it's a pretty straightforward patch. Can
you reproduce at will? Would be nice to know if it hit the error case,
which is where it would hang.
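
For reference, the fix takes roughly this shape (a sketch; the actual
commit may differ in detail):

	blk_mq_disable_hotplug();
	queue_for_each_hw_ctx(q, hctx, i) {
		ret = blk_mq_register_hctx(hctx);
		if (ret)
			break;
	}
	if (ret)
		__blk_mq_unregister_disk(disk);	/* assumes lock already held */
	blk_mq_enable_hotplug();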

--
Jens Axboe


Re: [PATCH] block: Fix secure erase

2016-08-15 Thread Shaun Tancheff
On Mon, Aug 15, 2016 at 9:07 AM, Adrian Hunter  wrote:
> Commit 288dab8a35a0 ("block: add a separate operation type for secure
> erase") split REQ_OP_SECURE_ERASE from REQ_OP_DISCARD without considering
> all the places REQ_OP_DISCARD was being used to mean either. Fix those.
>
> Signed-off-by: Adrian Hunter 
> Fixes: 288dab8a35a0 ("block: add a separate operation type for secure erase")
> ---
>  block/bio.c  | 21 +++--
>  block/blk-merge.c| 33 +++--
>  block/elevator.c |  5 -
>  drivers/mmc/card/block.c |  1 +
>  drivers/mmc/card/queue.c |  3 ++-
>  drivers/mmc/card/queue.h |  4 +++-
>  include/linux/bio.h  | 10 --
>  include/linux/blkdev.h   |  6 --
>  kernel/trace/blktrace.c  |  2 +-
>  9 files changed, 53 insertions(+), 32 deletions(-)

Hmm ... Since REQ_SECURE implied REQ_DISCARD doesn't this
mean that we should include REQ_OP_SECURE_ERASE checking
wherever REQ_OP_DISCARD is being checked now in drivers/scsi/sd.c ?

(It's only in 3 spots so it's a quickie patch)

> diff --git a/block/bio.c b/block/bio.c
> index f39477538fef..aa7354088008 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -667,18 +667,19 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t 
> gfp_mask,
> bio->bi_iter.bi_sector  = bio_src->bi_iter.bi_sector;
> bio->bi_iter.bi_size= bio_src->bi_iter.bi_size;
>
> -   if (bio_op(bio) == REQ_OP_DISCARD)
> -   goto integrity_clone;
> -
> -   if (bio_op(bio) == REQ_OP_WRITE_SAME) {
> +   switch (bio_op(bio)) {
> +   case REQ_OP_DISCARD:
> +   case REQ_OP_SECURE_ERASE:
> +   break;
> +   case REQ_OP_WRITE_SAME:
> bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
> -   goto integrity_clone;
> +   break;
> +   default:
> +   bio_for_each_segment(bv, bio_src, iter)
> +   bio->bi_io_vec[bio->bi_vcnt++] = bv;
> +   break;
> }
>
> -   bio_for_each_segment(bv, bio_src, iter)
> -   bio->bi_io_vec[bio->bi_vcnt++] = bv;
> -
> -integrity_clone:
> if (bio_integrity(bio_src)) {
> int ret;
>
> @@ -1788,7 +1789,7 @@ struct bio *bio_split(struct bio *bio, int sectors,
>  * Discards need a mutable bio_vec to accommodate the payload
>  * required by the DSM TRIM and UNMAP commands.
>  */
> -   if (bio_op(bio) == REQ_OP_DISCARD)
> +   if (bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == 
> REQ_OP_SECURE_ERASE)
> split = bio_clone_bioset(bio, gfp, bs);
> else
> split = bio_clone_fast(bio, gfp, bs);
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 3eec75a9e91d..72627e3cf91e 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -172,12 +172,18 @@ void blk_queue_split(struct request_queue *q, struct 
> bio **bio,
> struct bio *split, *res;
> unsigned nsegs;
>
> -   if (bio_op(*bio) == REQ_OP_DISCARD)
> +   switch (bio_op(*bio)) {
> +   case REQ_OP_DISCARD:
> +   case REQ_OP_SECURE_ERASE:
> split = blk_bio_discard_split(q, *bio, bs, &nsegs);
> -   else if (bio_op(*bio) == REQ_OP_WRITE_SAME)
> +   break;
> +   case REQ_OP_WRITE_SAME:
> split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
> -   else
> +   break;
> +   default:
> split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs);
> +   break;
> +   }
>
> /* physical segments can be figured out during splitting */
> res = split ? split : *bio;
> @@ -213,7 +219,7 @@ static unsigned int __blk_recalc_rq_segments(struct 
> request_queue *q,
>  * This should probably be returning 0, but blk_add_request_payload()
>  * (Christoph)
>  */
> -   if (bio_op(bio) == REQ_OP_DISCARD)
> +   if (bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == 
> REQ_OP_SECURE_ERASE)
> return 1;
>
> if (bio_op(bio) == REQ_OP_WRITE_SAME)
> @@ -385,7 +391,9 @@ static int __blk_bios_map_sg(struct request_queue *q, 
> struct bio *bio,
> nsegs = 0;
> cluster = blk_queue_cluster(q);
>
> -   if (bio_op(bio) == REQ_OP_DISCARD) {
> +   switch (bio_op(bio)) {
> +   case REQ_OP_DISCARD:
> +   case REQ_OP_SECURE_ERASE:
> /*
>  * This is a hack - drivers should be neither modifying the
>  * biovec, nor relying on bi_vcnt - but because of
> @@ -393,19 +401,16 @@ static int __blk_bios_map_sg(struct request_queue *q, 
> struct bio *bio,
>  * a payload we need to set up here (thank you Christoph) and
>  * bi_vcnt is really the only way of telling if we need to.
>  */
> -
> -   if (bio->bi_vcnt)
> -   goto single_segment;
> -
> -   return

Re: [BUG] Deadlock in blk_mq_register_disk error path

2016-08-15 Thread Bart Van Assche

On 08/15/2016 09:01 AM, Jinpu Wang wrote:

It's more likely you hit another bug; my colleague Roman fixed that:

http://www.spinics.net/lists/linux-block/msg04552.html


Hello Jinpu,

Interesting. However, I see that Roman wrote the following: "Firstly this
wrong sequence raises two kernel warnings: 1st. WARNING at
lib/percpu-refcount.c:309 percpu_ref_kill_and_confirm called more than
once 2nd. WARNING at lib/percpu-refcount.c:331". I haven't seen any of
these kernel warnings ...


Thanks,

Bart.



Re: Re: [BUG] Deadlock in blk_mq_register_disk error path

2016-08-15 Thread Jinpu Wang
Hi Bart,

>>
>> Nope, your analysis looks correct. This should fix it:
>>
>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=6316338a94b2319abe9d3790eb9cdc56ef81ac1a
>
> Hi Jens,
>
> Will that patch be included in stable kernels? I just encountered a
> deadlock with kernel v4.7 that looks similar.
>
> Thank you,
>
> Bart.
>
> INFO: task kworker/u64:6:136 blocked for more than 480 seconds.
>   Tainted: GW   4.7.0-dbg+ #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/u64:6   D 88016f677bb0 0   136  2 0x
> Workqueue: events_unbound async_run_entry_fn
> Call Trace:
>  [] schedule+0x37/0x90
>  [] schedule_preempt_disabled+0x10/0x20
>  [] mutex_lock_nested+0x144/0x350
>  [] blk_mq_disable_hotplug+0x12/0x20
>  [] blk_mq_register_disk+0x29/0x120
>  [] blk_register_queue+0xb6/0x160
>  [] add_disk+0x219/0x4a0
>  [] sd_probe_async+0x100/0x1b0
>  [] async_run_entry_fn+0x45/0x140
>  [] process_one_work+0x1f9/0x6a0
>  [] worker_thread+0x49/0x490
>  [] kthread+0xea/0x100
>  [] ret_from_fork+0x1f/0x40
> 3 locks held by kworker/u64:6/136:
>  #0:  ("events_unbound"){.+.+.+}, at: [] 
> process_one_work+0x17a/0x6a0
>  #1:  ((&entry->work)){+.+.+.}, at: [] 
> process_one_work+0x17a/0x6a0
>  #2:  (all_q_mutex){+.+.+.}, at: [] 
> blk_mq_disable_hotplug+0x12/0x20
> INFO: task 02:8101 blocked for more than 480 seconds.
>   Tainted: GW   4.7.0-dbg+ #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 02  D 88039b747968 0  8101  1 0x0004
> Call Trace:
>  [] schedule+0x37/0x90
>  [] blk_mq_freeze_queue_wait+0x51/0xb0
>  [] blk_mq_update_tag_set_depth+0x3a/0xb0
>  [] blk_mq_init_allocated_queue+0x432/0x450
>  [] blk_mq_init_queue+0x35/0x60
>  [] scsi_mq_alloc_queue+0x17/0x50
>  [] scsi_alloc_sdev+0x2b9/0x350
>  [] scsi_probe_and_add_lun+0x98b/0xe50
>  [] __scsi_scan_target+0x5ca/0x6b0
>  [] scsi_scan_target+0xe1/0xf0
>  [] srp_create_target+0xf06/0x13d4 [ib_srp]
>  [] dev_attr_store+0x13/0x20
>  [] sysfs_kf_write+0x40/0x50
>  [] kernfs_fop_write+0x137/0x1c0
>  [] __vfs_write+0x23/0x140
>  [] vfs_write+0xb0/0x190
>  [] SyS_write+0x44/0xa0
>  [] entry_SYSCALL_64_fastpath+0x18/0xa8
> 8 locks held by 02/8101:
>  #0:  (sb_writers#4){.+.+.+}, at: [] 
> __sb_start_write+0xb2/0xf0
>  #1:  (&of->mutex){+.+.+.}, at: [] 
> kernfs_fop_write+0x101/0x1c0
>  #2:  (s_active#363){.+.+.+}, at: [] 
> kernfs_fop_write+0x10a/0x1c0
>  #3:  (&host->add_target_mutex){+.+.+.}, at: [] 
> srp_create_target+0x134/0x13d4 [ib_srp]
>  #4:  (&shost->scan_mutex){+.+.+.}, at: [] 
> scsi_scan_target+0x8d/0xf0
>  #5:  (cpu_hotplug.lock){++}, at: [] 
> get_online_cpus+0x2d/0x80
>  #6:  (all_q_mutex){+.+.+.}, at: [] 
> blk_mq_init_allocated_queue+0x34a/0x450
>  #7:  (&set->tag_list_lock){+.+...}, at: [] 
> blk_mq_init_allocated_queue+0x37a/0x450
>

It's more likely you hit another bug; my colleague Roman fixed that:

http://www.spinics.net/lists/linux-block/msg04552.html

It would be great if you could test it and see if it works for you!

-- 
Mit freundlichen Grüßen,
Best Regards,

Jack Wang

Linux Kernel Developer Storage
ProfitBricks GmbH  The IaaS-Company.

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin
Tel: +49 30 5770083-42
Fax: +49 30 5770085-98
Email: jinpu.w...@profitbricks.com
URL: http://www.profitbricks.de

Registered office: Berlin.
Register court: Amtsgericht Charlottenburg, HRB 125506 B.
Managing directors: Andreas Gauger, Achim Weiss.


Re: Re: [BUG] Deadlock in blk_mq_register_disk error path

2016-08-15 Thread Bart Van Assche
On 08/02/2016 10:21 AM, Jens Axboe wrote:
> On 08/02/2016 06:58 AM, Jinpu Wang wrote:
>> Hi Jens,
>>
>> I found in blk_mq_register_disk, we blk_mq_disable_hotplug which in
>> turn mutex_lock(&all_q_mutex);
>>   queue_for_each_hw_ctx(q, hctx, i) {
>> ret = blk_mq_register_hctx(hctx);
>> if (ret)
>> break; /// if we are about to error out, we will call
>> unregister below
>> }
>>
>> if (ret)
>> blk_mq_unregister_disk(disk);
>>
>> In blk_mq_unregister_disk, we will try to disable_hotplug again, which
>> leads to dead lock.
>>
>> Did I miss anything?
> 
> Nope, your analysis looks correct. This should fix it:
> 
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=6316338a94b2319abe9d3790eb9cdc56ef81ac1a

Hi Jens,

Will that patch be included in stable kernels? I just encountered a
deadlock with kernel v4.7 that looks similar.

Thank you,

Bart.

INFO: task kworker/u64:6:136 blocked for more than 480 seconds.
  Tainted: GW   4.7.0-dbg+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u64:6   D 88016f677bb0 0   136  2 0x
Workqueue: events_unbound async_run_entry_fn
Call Trace:
 [] schedule+0x37/0x90
 [] schedule_preempt_disabled+0x10/0x20
 [] mutex_lock_nested+0x144/0x350
 [] blk_mq_disable_hotplug+0x12/0x20
 [] blk_mq_register_disk+0x29/0x120
 [] blk_register_queue+0xb6/0x160
 [] add_disk+0x219/0x4a0
 [] sd_probe_async+0x100/0x1b0
 [] async_run_entry_fn+0x45/0x140
 [] process_one_work+0x1f9/0x6a0
 [] worker_thread+0x49/0x490
 [] kthread+0xea/0x100
 [] ret_from_fork+0x1f/0x40
3 locks held by kworker/u64:6/136:
 #0:  ("events_unbound"){.+.+.+}, at: [] 
process_one_work+0x17a/0x6a0
 #1:  ((&entry->work)){+.+.+.}, at: [] 
process_one_work+0x17a/0x6a0
 #2:  (all_q_mutex){+.+.+.}, at: [] 
blk_mq_disable_hotplug+0x12/0x20
INFO: task 02:8101 blocked for more than 480 seconds.
  Tainted: GW   4.7.0-dbg+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
02  D 88039b747968 0  8101  1 0x0004
Call Trace:
 [] schedule+0x37/0x90
 [] blk_mq_freeze_queue_wait+0x51/0xb0
 [] blk_mq_update_tag_set_depth+0x3a/0xb0
 [] blk_mq_init_allocated_queue+0x432/0x450
 [] blk_mq_init_queue+0x35/0x60
 [] scsi_mq_alloc_queue+0x17/0x50
 [] scsi_alloc_sdev+0x2b9/0x350
 [] scsi_probe_and_add_lun+0x98b/0xe50
 [] __scsi_scan_target+0x5ca/0x6b0
 [] scsi_scan_target+0xe1/0xf0
 [] srp_create_target+0xf06/0x13d4 [ib_srp]
 [] dev_attr_store+0x13/0x20
 [] sysfs_kf_write+0x40/0x50
 [] kernfs_fop_write+0x137/0x1c0
 [] __vfs_write+0x23/0x140
 [] vfs_write+0xb0/0x190
 [] SyS_write+0x44/0xa0
 [] entry_SYSCALL_64_fastpath+0x18/0xa8
8 locks held by 02/8101:
 #0:  (sb_writers#4){.+.+.+}, at: [] 
__sb_start_write+0xb2/0xf0
 #1:  (&of->mutex){+.+.+.}, at: [] 
kernfs_fop_write+0x101/0x1c0
 #2:  (s_active#363){.+.+.+}, at: [] 
kernfs_fop_write+0x10a/0x1c0
 #3:  (&host->add_target_mutex){+.+.+.}, at: [] 
srp_create_target+0x134/0x13d4 [ib_srp]
 #4:  (&shost->scan_mutex){+.+.+.}, at: [] 
scsi_scan_target+0x8d/0xf0
 #5:  (cpu_hotplug.lock){++}, at: [] 
get_online_cpus+0x2d/0x80
 #6:  (all_q_mutex){+.+.+.}, at: [] 
blk_mq_init_allocated_queue+0x34a/0x450
 #7:  (&set->tag_list_lock){+.+...}, at: [] 
blk_mq_init_allocated_queue+0x37a/0x450



[PATCH v3] block: make sure big bio is splitted into at most 256 bvecs

2016-08-15 Thread Ming Lei
After arbitrary bio size is supported, the incoming bio may
be very big. We have to split the bio into small bios so that
each holds at most BIO_MAX_PAGES bvecs for safety reasons, such
as bio_clone().

This patch fixes the following kernel crash:

> [  172.660142] BUG: unable to handle kernel NULL pointer dereference at 
> 0028
> [  172.660229] IP: [] bio_trim+0xf/0x2a
> [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> [  172.660399] Oops:  [#1] SMP
> [...]
> [  172.664780] Call Trace:
> [  172.664813]  [] ? raid1_make_request+0x2e8/0xad7 [raid1]
> [  172.664846]  [] ? blk_queue_split+0x377/0x3d4
> [  172.664880]  [] ? md_make_request+0xf6/0x1e9 [md_mod]
> [  172.664912]  [] ? generic_make_request+0xb5/0x155
> [  172.664947]  [] ? prio_io+0x85/0x95 [bcache]
> [  172.664981]  [] ? register_cache_set+0x355/0x8d0 [bcache]
> [  172.665016]  [] ? register_bcache+0x1006/0x1174 [bcache]

The issue can be reproduced by the following steps:
- create one raid1 over two virtio-blk
- build bcache device over the above raid1 and another cache device
and bucket size is set as 2Mbytes
- set cache mode as writeback
- run random write over ext4 on the bcache device

Fixes: 54efd50 ("block: make generic_make_request handle arbitrarily sized bios")
Reported-by: Sebastian Roesner 
Reported-by: Eric Wheeler 
Cc: sta...@vger.kernel.org (4.3+)
Cc: Shaohua Li 
Acked-by: Kent Overstreet 
Signed-off-by: Ming Lei 
---
V3:
- rebase against v4.8-rc1 since .bi_rw of bio is renamed
as .bi_opf

V2:
- don't mark the bio as REQ_NOMERGE in case it is split
for reaching the bvec count limit
V1:
- Kent pointed out that using max io size can't cover
the case of non-full bvecs/pages

 block/blk-merge.c | 35 ---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3eec75a..c44d3e9 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -85,7 +85,8 @@ static inline unsigned get_max_io_size(struct request_queue *q,
 static struct bio *blk_bio_segment_split(struct request_queue *q,
 struct bio *bio,
 struct bio_set *bs,
-unsigned *segs)
+unsigned *segs,
+bool *no_merge)
 {
struct bio_vec bv, bvprv, *bvprvp = NULL;
struct bvec_iter iter;
@@ -94,9 +95,34 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
bool do_split = true;
struct bio *new = NULL;
const unsigned max_sectors = get_max_io_size(q, bio);
+   unsigned bvecs = 0;
+
+   *no_merge = true;
 
bio_for_each_segment(bv, bio, iter) {
/*
+* With arbitrary bio size, the incoming bio may be very
+* big. We have to split the bio into small bios so that
+* each holds at most BIO_MAX_PAGES bvecs because
+* bio_clone() can fail to allocate big bvecs.
+*
+* It would have been better to apply the limit per
+* request queue in which bio_clone() is involved,
+* instead of globally. The biggest blocker is
+* bio_clone() in bio bounce.
+*
+* If a bio is split for this reason, we should allow
+* bios merging to continue.
+*
+* TODO: deal with bio bounce's bio_clone() gracefully
+* and convert the global limit into per-queue limit.
+*/
+   if (bvecs++ >= BIO_MAX_PAGES) {
+   *no_merge = false;
+   goto split;
+   }
+
+   /*
 * If the queue doesn't support SG gaps and adding this
 * offset would create a gap, disallow it.
 */
@@ -171,13 +197,15 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
 {
struct bio *split, *res;
unsigned nsegs;
+   bool no_merge_for_split = true;
 
if (bio_op(*bio) == REQ_OP_DISCARD)
split = blk_bio_discard_split(q, *bio, bs, &nsegs);
else if (bio_op(*bio) == REQ_OP_WRITE_SAME)
split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
else
-   split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs);
+   split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs,
+   &no_merge_for_split);
 
/* physical segments can be figured out during splitting */
res = split ? split : *bio;
@@ -186,7 +214,8 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
 
if (split) {
/* there is no chance to merge the split bio */
-   split->b

[PATCH] block: Fix secure erase

2016-08-15 Thread Adrian Hunter
Commit 288dab8a35a0 ("block: add a separate operation type for secure
erase") split REQ_OP_SECURE_ERASE from REQ_OP_DISCARD without considering
all the places REQ_OP_DISCARD was being used to mean either. Fix those.

Signed-off-by: Adrian Hunter 
Fixes: 288dab8a35a0 ("block: add a separate operation type for secure erase")
---
 block/bio.c  | 21 +++--
 block/blk-merge.c| 33 +++--
 block/elevator.c |  5 -
 drivers/mmc/card/block.c |  1 +
 drivers/mmc/card/queue.c |  3 ++-
 drivers/mmc/card/queue.h |  4 +++-
 include/linux/bio.h  | 10 --
 include/linux/blkdev.h   |  6 --
 kernel/trace/blktrace.c  |  2 +-
 9 files changed, 53 insertions(+), 32 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index f39477538fef..aa7354088008 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -667,18 +667,19 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
bio->bi_iter.bi_sector  = bio_src->bi_iter.bi_sector;
bio->bi_iter.bi_size= bio_src->bi_iter.bi_size;
 
-   if (bio_op(bio) == REQ_OP_DISCARD)
-   goto integrity_clone;
-
-   if (bio_op(bio) == REQ_OP_WRITE_SAME) {
+   switch (bio_op(bio)) {
+   case REQ_OP_DISCARD:
+   case REQ_OP_SECURE_ERASE:
+   break;
+   case REQ_OP_WRITE_SAME:
bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
-   goto integrity_clone;
+   break;
+   default:
+   bio_for_each_segment(bv, bio_src, iter)
+   bio->bi_io_vec[bio->bi_vcnt++] = bv;
+   break;
}
 
-   bio_for_each_segment(bv, bio_src, iter)
-   bio->bi_io_vec[bio->bi_vcnt++] = bv;
-
-integrity_clone:
if (bio_integrity(bio_src)) {
int ret;
 
@@ -1788,7 +1789,7 @@ struct bio *bio_split(struct bio *bio, int sectors,
 * Discards need a mutable bio_vec to accommodate the payload
 * required by the DSM TRIM and UNMAP commands.
 */
-   if (bio_op(bio) == REQ_OP_DISCARD)
+   if (bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == REQ_OP_SECURE_ERASE)
split = bio_clone_bioset(bio, gfp, bs);
else
split = bio_clone_fast(bio, gfp, bs);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3eec75a9e91d..72627e3cf91e 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -172,12 +172,18 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
struct bio *split, *res;
unsigned nsegs;
 
-   if (bio_op(*bio) == REQ_OP_DISCARD)
+   switch (bio_op(*bio)) {
+   case REQ_OP_DISCARD:
+   case REQ_OP_SECURE_ERASE:
split = blk_bio_discard_split(q, *bio, bs, &nsegs);
-   else if (bio_op(*bio) == REQ_OP_WRITE_SAME)
+   break;
+   case REQ_OP_WRITE_SAME:
split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
-   else
+   break;
+   default:
split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs);
+   break;
+   }
 
/* physical segments can be figured out during splitting */
res = split ? split : *bio;
@@ -213,7 +219,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 * This should probably be returning 0, but blk_add_request_payload()
 * (Christoph)
 */
-   if (bio_op(bio) == REQ_OP_DISCARD)
+   if (bio_op(bio) == REQ_OP_DISCARD || bio_op(bio) == REQ_OP_SECURE_ERASE)
return 1;
 
if (bio_op(bio) == REQ_OP_WRITE_SAME)
@@ -385,7 +391,9 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
nsegs = 0;
cluster = blk_queue_cluster(q);
 
-   if (bio_op(bio) == REQ_OP_DISCARD) {
+   switch (bio_op(bio)) {
+   case REQ_OP_DISCARD:
+   case REQ_OP_SECURE_ERASE:
/*
 * This is a hack - drivers should be neither modifying the
 * biovec, nor relying on bi_vcnt - but because of
@@ -393,19 +401,16 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
 * a payload we need to set up here (thank you Christoph) and
 * bi_vcnt is really the only way of telling if we need to.
 */
-
-   if (bio->bi_vcnt)
-   goto single_segment;
-
-   return 0;
-   }
-
-   if (bio_op(bio) == REQ_OP_WRITE_SAME) {
-single_segment:
+   if (!bio->bi_vcnt)
+   return 0;
+   /* Fall through */
+   case REQ_OP_WRITE_SAME:
*sg = sglist;
bvec = bio_iovec(bio);
sg_set_page(*sg, bvec.bv_page, bvec.bv_len, bvec.bv_offset);
return 1;
+   default:
+   break;
}
 
for_each_bio(bio)
diff --git a/block/elevator.c b/block/elevator.c