Re: [PATCH V3 7/8] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen

2017-09-07 Thread Ming Lei
On Tue, Sep 05, 2017 at 10:23:25AM +0800, Ming Lei wrote:
> On Tue, Sep 05, 2017 at 01:40:11AM +, Bart Van Assche wrote:
> > On Tue, 2017-09-05 at 00:08 +0800, Ming Lei wrote:
> > > On Mon, Sep 04, 2017 at 03:40:35PM +, Bart Van Assche wrote:
> > > > Have you considered using the blk-mq "reserved request" mechanism to
> > > > avoid starvation of power management requests instead of making the
> > > > block layer even more complicated than it already is?
> > > 
> > > A reserved request is really a bad idea: it means the reserved request
> > > can't be used for normal I/O. We all know the request/tag space is
> > > precious, and some devices have quite a small tag space, such as SATA.
> > > This will definitely affect performance.
> > 
> > Sorry, but I'm neither convinced that reserving a request for power
> > management would be a bad idea, nor that it would have a significant
> > performance impact, nor that it would be complicated to implement. Have
> > you noticed that the Linux ATA implementation already reserves a request
> > for internal use and thereby reduces the queue depth from 32 to 31 (see
> > also ATA_TAG_INTERNAL)? What I would like to know is whether the
> > performance impact of reserving a request is more or less than 1%.
> 
> Firstly, we really can avoid the reservation: why should we waste one
> precious tag just for PM, which may never happen on a given machine during
> its whole uptime? For SATA, the internal tag is for EH, so there I believe
> the reservation is inevitable.
> 
> Secondly, reserving one tag decreases the concurrent I/O depth by 1,
> which definitely hurts performance, especially for random I/O. Think
> about why NVMe increases its queue depth so much. Not to mention that
> some devices have only one tag (.can_queue is 1); how could you reserve
> a tag on that kind of device?
> 
> Finally, bad results will follow if you reserve one tag for PM only.
> Suppose it is doable to reserve one tag; did you consider how to do
> that? Tags can be shared host-wide, so do you want to reserve one tag
> per request_queue?
>   - If yes, lots of tags may be reserved/wasted for the unusual PM or
>   similar commands; even worse, the whole tag space of the HBA may not
>   be enough for the reservation if there are lots of LUNs in the HBA.
>   
>   - If not, and you just reserve one tag for the whole HBA, then all
>   PM commands share that one reservation. During suspend/resume, all
>   these PM commands have to run exclusively (serialized) for the disks
>   attached to the HBA, which will slow down suspend/resume very much
>   because there may be lots of LUNs in the HBA.
> 
> That is why I said reserving one tag is really bad, isn't it?
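
A back-of-the-envelope sketch of the trade-off argued above (the tag counts
are purely illustrative assumptions, not measurements from any real HBA):

/* Illustrative arithmetic only: hypothetical tag counts, no real HBA. */
#include <stdio.h>

int main(void)
{
	unsigned int host_tags = 32;	/* e.g. a SATA-like tag space */
	unsigned int nr_luns = 64;	/* LUNs sharing the host-wide tags */

	/* Per-request_queue reservation: one tag wasted per LUN. */
	if (nr_luns >= host_tags)
		printf("per-queue: %u reserved tags exceed the %u-tag space\n",
		       nr_luns, host_tags);

	/* Per-HBA reservation: one tag, but PM commands serialize on it. */
	printf("per-HBA: %u LUNs must suspend/resume one at a time\n",
	       nr_luns);
	return 0;
}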

Hi Bart,

It looks like I have replied to or clarified all of your questions and
comments on this patchset.

So could you let me know whether you still object to this approach or
the patchset?

If not, I think we need to move on and fix the real issue.

-- 
Ming


Re: [GIT PULL] First set of block changes for 4.14-rc1

2017-09-07 Thread Ming Lei
On Thu, Sep 07, 2017 at 01:47:44PM -0600, Jens Axboe wrote:
> On 09/07/2017 01:38 PM, Linus Torvalds wrote:
> > On Thu, Sep 7, 2017 at 12:27 PM, Jens Axboe  wrote:
> >>
> >> Which was committed yesterday? It was not from my tree. I try to keep
> >> an eye out for potential conflicts or issues.
> > 
> > It was from Andrew, so I'm assuming it was in linux-next. Not as a git
> > tree, but as the usual akpm branch.
> > 
> > I'm not sure why I didn't see any reports from linux-next about it,
> > though - and I looked.
> > 
> > There was no actual visible merge problem, because the conflict was
> > not a data conflict but a semantic one.
> > 
> > But the issue showed up for a trivial allmodconfig build, so it
> > *should* have been caught automatically.
> > 
> > But it's not even the conflict that annoys me.
> > 
> > It's the *reason* for the conflict - the block layer churn that you
> > said was done. We've had too much of it.
> > 
> >>> We need *stability* by now, after these crazy changes. Make sure
> >>> that happens.
> 
> I might have missed a build notice from linux-next, since I know for a
> fact that all my stuff is in -next, updated daily, and Andrew's tree
> definitely is too.
> 
> >> I am pushing back, but I can't embargo any sort of API change. This one has
> >> good reasoning behind it, which is actually nicely explained in the commit
> >> itself. It's not pointless churn, which I would define as change just for
> >> the sake of change itself. Or pointless cleanups.
> > 
> > You can definitely put your foot down on any more random changes to
> > the bio infrastructure. Including for good reasons.
> 
> And I am, but this one wasn't random. As I said, some of the previous
> releases have had more frivolous changes that should have been pushed
> back harder on. And particularly nvme, where fixes that go in after the
> merge window have been doing pointless cleanups while fixing issues,
> causing unnecessary pain for the next release. I am definitely pushing
> back hard on those, just last week after more of that showed up.
> 
> You'll see exactly one of those when I send in the next merge request,
> which sucks, and which was the reason for the yelling in that area recently.
> 
> > We have had some serious issues in the block layer - and I'm not
> > talking about the merge conflicts. I'm talking about just the
> > collateral damage, with things like SCSI having to revert using blk-mq
> > by default etc.
> 
> I'm a bit puzzled as to why the suspend/resume thing has existed for so
> long, honestly. But some of these issues don't show themselves until you
> flip the switch. A lot of the production setups have been using scsi-mq
> for YEARS without issues, but obviously they don't hit this. Given how
> big of a switch this is, it's hard to avoid minor fallout. We're dealing
> with it.

Hi Jens,

If you mean the I/O hang during suspend/resume reported by Oleksandr,
that isn't blk-mq only. The root cause is in SCSI's quiesce design (the
same hole exists in blk-mq's quiesce, but there it causes no such issue
because no new request is allocated in a context in which the queue is
quiesced).

The issue is a rather long-standing one, since Cathy can reproduce this
kind of I/O hang on a 2.6.32-based kernel.

I am working on fixing that [1], but it looks like the preempt quiesce
still needs to be nested, so I will post V4 with support for nested
preempt quiesce.
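
As an aside, the nesting requirement boils down to refcounting the quiesce
state; a minimal toy model of that idea (all names hypothetical, not the
actual V4 code):

/* Toy model of nested preempt quiesce; hypothetical names throughout. */
#include <stdbool.h>

struct toy_queue {
	int preempt_depth;	/* protected by a queue lock in real code */
	bool preempt_only;	/* only preempt requests may be allocated */
};

static void toy_preempt_quiesce_start(struct toy_queue *q)
{
	if (q->preempt_depth++ == 0)
		q->preempt_only = true;		/* first caller flips the gate */
}

static void toy_preempt_quiesce_end(struct toy_queue *q)
{
	if (--q->preempt_depth == 0)
		q->preempt_only = false;	/* last caller reopens the queue */
}

int main(void)
{
	struct toy_queue q = { 0, false };

	toy_preempt_quiesce_start(&q);	/* e.g. domain suspend */
	toy_preempt_quiesce_start(&q);	/* nested user */
	toy_preempt_quiesce_end(&q);	/* still preempt-only here */
	toy_preempt_quiesce_end(&q);	/* normal allocation resumes */
	return q.preempt_only;		/* 0 on success */
}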

The real SCSI_MQ regression is the 'SCSI-MQ performance regression'
report [2], and the commit 'scsi: default to scsi-mq' was reverted after
that report. The patch for this regression has been posted for several
rounds, and V4 [3] should be ready for merge; it looks like it was
ignored because of the merge timing?


[1] https://marc.info/?l=linux-scsi&m=150435774503165&w=2
[2] https://marc.info/?l=linux-kernel&m=150271934904399&w=2
[3] https://marc.info/?t=15043655572&r=1&w=2

-- 
Ming


Re: [PATCH 2/2] scsi_transport_fc: fix NULL pointer dereference in fc_bsg_job_timeout

2017-09-07 Thread Ming Lei
On Thu, Sep 07, 2017 at 01:54:36PM +0200, Christoph Hellwig wrote:
> bsg-lib now embeds the job structure into the request, and req->special
> can't be used anymore.
> 
> Signed-off-by: Christoph Hellwig 
> Cc: sta...@vger.kernel.org
> ---
>  drivers/scsi/scsi_transport_fc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
> index 892fbd9800d9..bea06de60827 100644
> --- a/drivers/scsi/scsi_transport_fc.c
> +++ b/drivers/scsi/scsi_transport_fc.c
> @@ -3550,7 +3550,7 @@ fc_vport_sched_delete(struct work_struct *work)
>  static enum blk_eh_timer_return
>  fc_bsg_job_timeout(struct request *req)
>  {
> - struct bsg_job *job = (void *) req->special;
> + struct bsg_job *job = blk_mq_rq_to_pdu(req);
>   struct Scsi_Host *shost = fc_bsg_to_shost(job);
>   struct fc_rport *rport = fc_bsg_to_rport(job);
>   struct fc_internal *i = to_fc_internal(shost->transportt);
> -- 
> 2.11.0
> 

Reviewed-by: Ming Lei 

-- 
Ming


[PATCH] block: tolerate tracing of NULL bio

2017-09-07 Thread Greg Thelen
__get_request() can call trace_block_getrq() with bio=NULL which causes
block_get_rq::TP_fast_assign() to deref a NULL pointer and panic.

Syzkaller fuzzer panics with
linux-next (1d53d908b79d7870d89063062584eead4cf83448):
  kasan: GPF could be caused by NULL-ptr deref or user memory access
  general protection fault:  [#1] SMP KASAN
  Modules linked in:
  CPU: 0 PID: 2983 Comm: syzkaller40 Not tainted 4.13.0-rc7-next-20170901+ #13
  task: 8801cf1da000 task.stack: 8801ce44
  RIP: 0010:perf_trace_block_get_rq+0x697/0x970 include/trace/events/block.h:384
  RSP: 0018:8801ce4473f0 EFLAGS: 00010246
  RAX: 8801cf1da000 RBX: 110039c88e84 RCX: 1d184d27
  RDX: dc01 RSI: 11003b643e7a RDI: e8c26938
  RBP: 8801ce447530 R08: 11003b643e6c R09: e8c26964
  R10: 0002 R11: f9184d2d R12: e8c1f890
  R13: e8c26930 R14: 85cad9e0 R15: 
  FS:  02641880() GS:8801db20() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 0043e670 CR3: 0001d1d7a000 CR4: 001406f0
  DR0:  DR1:  DR2: 
  DR3:  DR6: fffe0ff0 DR7: 0400
  Call Trace:
trace_block_getrq include/trace/events/block.h:423 [inline]
__get_request block/blk-core.c:1283 [inline]
get_request+0x1518/0x23b0 block/blk-core.c:1355
blk_old_get_request block/blk-core.c:1402 [inline]
blk_get_request+0x1d8/0x3c0 block/blk-core.c:1427
sg_scsi_ioctl+0x117/0x750 block/scsi_ioctl.c:451
sg_ioctl+0x192d/0x2ed0 drivers/scsi/sg.c:1070
vfs_ioctl fs/ioctl.c:45 [inline]
do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:685
SYSC_ioctl fs/ioctl.c:700 [inline]
SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
entry_SYSCALL_64_fastpath+0x1f/0xbe

block_get_rq::TP_fast_assign() has multiple redundant ->dev assignments.
Only one of them is NULL tolerant.  Favor the NULL tolerant one.

Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
Signed-off-by: Greg Thelen 
---
 include/trace/events/block.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index f815aaaef755..1fd7ff1a46f7 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -397,7 +397,6 @@ DECLARE_EVENT_CLASS(block_get_rq,
 
TP_fast_assign(
	__entry->dev		= bio ? bio_dev(bio) : 0;
-	__entry->dev		= bio_dev(bio);
	__entry->sector		= bio ? bio->bi_iter.bi_sector : 0;
	__entry->nr_sector	= bio ? bio_sectors(bio) : 0;
blk_fill_rwbs(__entry->rwbs,
@@ -414,7 +413,7 @@ DECLARE_EVENT_CLASS(block_get_rq,
 /**
  * block_getrq - get a free request entry in queue for block IO operations
  * @q: queue for operations
- * @bio: pending block IO operation
+ * @bio: pending block IO operation (can be %NULL)
  * @rw: low bit indicates a read (%0) or a write (%1)
  *
  * A request struct for queue @q has been allocated to handle the
@@ -430,7 +429,7 @@ DEFINE_EVENT(block_get_rq, block_getrq,
 /**
  * block_sleeprq - waiting to get a free request entry in queue for block IO 
operation
  * @q: queue for operation
- * @bio: pending block IO operation
+ * @bio: pending block IO operation (can be %NULL)
  * @rw: low bit indicates a read (%0) or a write (%1)
  *
  * In the case where a request struct cannot be provided for queue @q
-- 
2.14.1.581.gf28d330327-goog



BDI_CAP_STABLE_WRITES for stacked device (Re: Enable skip_copy can cause data integrity issue in some storage stack)

2017-09-07 Thread Shaohua Li
On Thu, Sep 07, 2017 at 11:11:24AM +1000, Neil Brown wrote:
> On Wed, Sep 06 2017, Shaohua Li wrote:
> 
> > On Fri, Sep 01, 2017 at 03:26:41PM +0800, alexwu wrote:
> >> Hi,
> >> 
> >> Recently a data integrity issue about skip_copy was found. We are able
> >> to reproduce it and found the root cause. This data integrity issue
> >> might happen if there are other layers between file system and raid5.
> >> 
> >> [How to Reproduce]
> >> 
> >> 1. Create a raid5 named md0 first (with skip_copy enable), and wait md0
> >>resync done which ensures that all data and parity are synchronized
> >> 2. Use lvm tools to create a logical volume named lv-md0 over md0
> >> 3. Format an ext4 file system on lv-md0 and mount on /mnt
> >> 4. Do some db operations (e.g. sqlite insert) to write data through /mnt
> >> 5. When those db operations finished, we do the following command
> >>"echo check > /sys/block/md0/md/sync_action" to check mismatch_cnt,
> >>it is very likely that we got mismatch_cnt != 0 when check finished
> >> 
> >> [Root Cause]
> >> 
> >> After tracing the code and more experiments, it is more accurate to say
> >> that this is a problem with backing_dev_info (bdi) rather than a bug in
> >> skip_copy.
> >> 
> >> We notice that:
> >>    1. skip_copy counts on BDI_CAP_STABLE_WRITES to ensure that a bio's
> >>       pages will not be modified before raid5 completes the I/O, so we
> >>       can skip copying pages from the bio to the stripe cache
> >>    2. The ext4 file system will call wait_for_stable_page() to ask
> >>       whether the mapped bdi requires stable writes
> >> 
> >> The data integrity issue happens because:
> >>    1. When raid5 enables skip_copy, it only sets BDI_CAP_STABLE_WRITES
> >>       on its own bdi, and this information does not propagate to the
> >>       other bdis between the file system and md
> >>    2. When the ext4 file system checks the stable-writes requirement by
> >>       calling wait_for_stable_page(), it can only see the underlying
> >>       bdi's capabilities and cannot see all related bdis
> >> 
> >> Thus, skip_copy works fine if we format the file system directly on md.
> >> But the data integrity issue happens if there are other block layers
> >> (e.g. dm) between the file system and md.
> >> 
> >> [Result]
> >> 
> >> We do more tests on different storage stacks, here are the results.
> >> 
> >> The following settings can pass the test thousand times:
> >>1. raid5 with skip_copy enable + ext4
> >>2. raid5 with skip_copy disable + ext4
> >>3. raid5 with skip_copy disable + lvm + ext4
> >> 
> >> The following setting will fail the test in 10 rounds:
> >>1. raid5 with skip_copy enable + lvm + ext4
> >> 
> >> I think the solution might be to let all bdis communicate through the
> >> different block layers, so that we can propagate the
> >> BDI_CAP_STABLE_WRITES information when we enable skip_copy. But the
> >> current bdi structure does not allow us to do that.
> >>
> >> What would you suggest we do if we want to make skip_copy more reliable?
> >
> > Very interesting problem. It looks like this isn't raid5 specific: a
> > stacked device with block integrity enabled could expose the same issue.
> >
> > I'm cc'ing Jens and Neil to see if they have a good idea. Basically the
> > problem is that for a stacked disk, the lower-layer disk has
> > BDI_CAP_STABLE_WRITES enabled but the upper layer doesn't enable it. The
> > lower-layer disk could be a raid5 with skip_copy enabled, or any disk
> > with block integrity enabled (which will enable BDI_CAP_STABLE_WRITES).
> > Currently we don't propagate the flag to the upper-layer disk.
> >
> > A solution is to have blk_set_stacking_limits propagate the flag at
> > queue setup; we then don't allow the flag to change at runtime later.
> > The raid5 skip_copy interface changes the flag at runtime, which
> > probably isn't safe.
> >
> > Thanks,
> > Shaohua
> 
> It has always been messy to handle dependencies from the bottom of the
> stack affecting things closer to the filesystem.
> We used to have issues with alignments and request sizes, and got rid of
> those problems by requiring all devices to accept all bio sizes, and
> providing support for them to split as necessary.
> This is a similar sort of problem, but also quite different.
> 
> It is different because the STABLE_WRITES requirement doesn't affect the
> sort of bio that is passed down, but instead affect the way it is waited
> for.
> 
> I have thought a few times that it would be useful if the "bd_claim"
> functionality included a callback so the claimed device could notify the
> claimer that things had changed.
> Then raid5 could notice that it has been claimed, and call the callback
> when it changes BDI_CAP_STABLE_WRITES.  If the claimer is a stacked
> device, it can re-run blk_set_stacking_limits and call its own claimer.
> 
> There are probably some interesting locking issues around this but, to
> me, it seems like the most likely path to a robust solution.
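
Neil's callback idea, roughly sketched (every name here is hypothetical; no
such interface exists in the kernel): the claimed device remembers a notifier
supplied by its claimer and invokes it whenever a capability such as
BDI_CAP_STABLE_WRITES changes, so a stacked claimer can re-run its limit
stacking.

/* Hypothetical claim-notification sketch; none of these names are real. */
struct toy_claimed_dev;

typedef void (*toy_claim_notify_fn)(struct toy_claimed_dev *dev,
				    void *claimer_data);

struct toy_claimed_dev {
	toy_claim_notify_fn notify;	/* set by the claimer, may be NULL */
	void *claimer_data;
	unsigned int capabilities;
};

/* e.g. raid5's skip_copy store method could end with something like: */
static void toy_set_capabilities(struct toy_claimed_dev *dev,
				 unsigned int caps)
{
	dev->capabilities = caps;
	if (dev->notify)	/* tell the claimer to restack its limits */
		dev->notify(dev, dev->claimer_data);
}

int main(void)
{
	struct toy_claimed_dev dev = { 0, 0, 0 };

	toy_set_capabilities(&dev, 1);	/* no claimer yet: no notification */
	return 0;
}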

Fixing blk_set_stacking_limits is the first 
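
And for the setup-time half that Shaohua proposes, a minimal userspace sketch
of propagating the flag when stacking (the flag value and helper are
illustrative stand-ins, not the real blk_set_stacking_limits()):

/*
 * Sketch of propagating a stable-writes requirement up a device stack.
 * BDI_CAP_STABLE_WRITES value and the helper are illustrative only.
 */
#include <stdio.h>

#define BDI_CAP_STABLE_WRITES (1 << 0)

struct toy_bdi { unsigned int capabilities; };

/* Called at stacking-queue setup: OR in each component's requirement so
 * wait_for_stable_page() on the top device sees it. */
static void toy_stack_stable_writes(struct toy_bdi *top,
				    struct toy_bdi **parts, int n)
{
	int i;

	for (i = 0; i < n; i++)
		top->capabilities |= parts[i]->capabilities &
				     BDI_CAP_STABLE_WRITES;
}

int main(void)
{
	struct toy_bdi raid5 = { BDI_CAP_STABLE_WRITES }; /* skip_copy on */
	struct toy_bdi lvm = { 0 };
	struct toy_bdi *parts[] = { &raid5 };

	toy_stack_stable_writes(&lvm, parts, 1);
	printf("lvm requires stable pages: %s\n",
	       (lvm.capabilities & BDI_CAP_STABLE_WRITES) ? "yes" : "no");
	return 0;
}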

Re: [PATCH V2 0/3] block/loop: handle discard/zeroout error

2017-09-07 Thread Shaohua Li
On Thu, Sep 07, 2017 at 03:20:01PM +0200, Ilya Dryomov wrote:
> Hi Shaohua,
> 
> You wrote:
> > BTW: blkdev_issue_zeroout retries if we immediately find the device doesn't
> > support zeroout, but it doesn't retry if submit_bio_wait returns 
> > -EOPNOTSUPP.
> > Is this correct behavior?
> 
> I sent a patch for that yesterday, see "[PATCH] block: cope with WRITE
> SAME failing in blkdev_issue_zeroout()" on linux-block.  It checks for
> -EREMOTEIO, because that's what I hit, but I wonder if it should check
> for -EOPNOTSUPP from submit_bio_wait() as well.  Can you think of
> a case where, given bdev_write_zeroes_sectors() != 0, submit_bio_wait()
> would fail with -EOPNOTSUPP and we would want to do explicit zeroing?

The loop block device used to return -EOPNOTSUPP for zeroout; with the
second patch it doesn't anymore, though.
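
The fallback policy being debated, as a standalone sketch
(submit_zeroout_bio() and zero_explicitly() are hypothetical stand-ins for
the offloaded write-zeroes path and the manual zero-page writes):

#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-ins for the offloaded zeroout and the fallback. */
static int submit_zeroout_bio(void) { return -EOPNOTSUPP; } /* simulated */
static int zero_explicitly(void) { return 0; }

static int toy_issue_zeroout(unsigned int write_zeroes_sectors)
{
	if (write_zeroes_sectors) {
		int ret = submit_zeroout_bio();

		/* The device advertised support but rejected the command:
		 * fall back to explicit zeroing instead of failing. */
		if (ret != -EREMOTEIO && ret != -EOPNOTSUPP)
			return ret;
	}
	return zero_explicitly();
}

int main(void)
{
	printf("zeroout result: %d\n", toy_issue_zeroout(1));
	return 0;
}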


Re: [PATCH V2 3/3] block/loop: suppress discard IO error message

2017-09-07 Thread Shaohua Li
On Thu, Sep 07, 2017 at 05:16:21PM +0800, Ming Lei wrote:
> On Thu, Sep 7, 2017 at 8:13 AM, Shaohua Li  wrote:
> > From: Shaohua Li 
> >
> > We don't know if fallocate really supports FALLOC_FL_PUNCH_HOLE until
> > fallocate is called. If it isn't supported, loop will return -EOPNOTSUPP
> > and we see a lot of error messages printed by blk_update_request. Failure
> > of a discard IO isn't a big problem, so we just return 0 in loop, which
> > suppresses the IO error messages.
> 
> Setting RQF_QUIET for discard IO should be a cleaner way to suppress the
> error, and the upper layer can still get the failure notification.

Hmm, I forget why I didn't do this; I did consider it before. Probably
because nobody checks the return value of discard requests. Perhaps we
should skip the error message for all discard IO in the block layer.
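
The RQF_QUIET approach, sketched generically (the flag value and names are
illustrative): the failure still propagates to the caller; only the log line
is skipped.

#include <stdio.h>

#define TOY_RQF_QUIET (1u << 0)	/* illustrative flag value */

struct toy_request {
	unsigned int rq_flags;
	int error;
};

/* Modeled on what blk_update_request() does with RQF_QUIET: the error is
 * still returned upward; only the log line is suppressed. */
static int toy_complete(struct toy_request *rq)
{
	if (rq->error && !(rq->rq_flags & TOY_RQF_QUIET))
		fprintf(stderr, "I/O error %d on discard\n", rq->error);
	return rq->error;	/* upper layers still see the failure */
}

int main(void)
{
	struct toy_request discard = { TOY_RQF_QUIET, -95 /* -EOPNOTSUPP */ };

	toy_complete(&discard);	/* nothing is printed */
	return 0;
}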

I'll repost a patch to fix this issue.

Thanks,
Shaohua


Re: [GIT PULL] First set of block changes for 4.14-rc1

2017-09-07 Thread Jens Axboe
On 09/07/2017 01:38 PM, Linus Torvalds wrote:
> On Thu, Sep 7, 2017 at 12:27 PM, Jens Axboe  wrote:
>>
>> Which was committed yesterday? It was not from my tree. I try to keep
>> an eye out for potential conflicts or issues.
> 
> It was from Andrew, so I'm assuming it was in linux-next. Not as a git
> tree, but as the usual akpm branch.
> 
> I'm not sure why I didn't see any reports from linux-next about it,
> though - and I looked.
> 
> There was no actual visible merge problem, because the conflict was
> not a data conflict but a semantic one.
> 
> But the issue showed up for a trivial allmodconfig build, so it
> *should* have been caught automatically.
> 
> But it's not even the conflict that annoys me.
> 
> It's the *reason* for the conflict - the block layer churn that you
> said was done. We've had too much of it.
> 
>>> We need *stability* by now, after these crazy changes. Make sure
>>> that happens.

I might have missed a build notice from linux-next, since I know for a
fact that all my stuff is in -next, updated daily, and Andrew's tree
definitely is too.

>> I am pushing back, but I can't embargo any sort of API change. This one has
>> good reasoning behind it, which is actually nicely explained in the commit
>> itself. It's not pointless churn, which I would define as change just for
>> the sake of change itself. Or pointless cleanups.
> 
> You can definitely put your foot down on any more random changes to
> the bio infrastructure. Including for good reasons.

And I am, but this one wasn't random. As I said, some of the previous
releases have had more frivolous changes that should have been pushed
back harder on. And particularly nvme, where fixes that go in after the
merge window have been doing pointless cleanups while fixing issues,
causing unnecessary pain for the next release. I am definitely pushing
back hard on those, just last week after more of that showed up.

You'll see exactly one of those when I send in the next merge request,
which sucks, and which was the reason for the yelling in that area recently.

> We have had some serious issues in the block layer - and I'm not
> talking about the merge conflicts. I'm talking about just the
> collateral damage, with things like SCSI having to revert using blk-mq
> by default etc.

I'm a bit puzzled as to why the suspend/resume thing has existed for so
long, honestly. But some of these issues don't show themselves until you
flip the switch. A lot of the production setups have been using scsi-mq
for YEARS without issues, but obviously they don't hit this. Given how
big of a switch this is, it's hard to avoid minor fallout. We're dealing
with it.

> Christ, things like that are pretty *fundamental*, wouldn't you say.
> Get those right before doing more churn. And aim to have one or two
> releases literally just fixing things, with a "no churn" rule.

That's the plan...

-- 
Jens Axboe



Re: [GIT PULL] First set of block changes for 4.14-rc1

2017-09-07 Thread Linus Torvalds
On Thu, Sep 7, 2017 at 12:27 PM, Jens Axboe  wrote:
>
> Which was committed yesterday? It was not from my tree. I try to keep
> an eye out for potential conflicts or issues.

It was from Andrew, so I'm assuming it was in linux-next. Not as a git
tree, but as the usual akpm branch.

I'm not sure why I didn't see any reports from linux-next about it,
though - and I looked.

There was no actual visible merge problem, because the conflict was
not a data conflict but a semantic one.

But the issue showed up for a trivial allmodconfig build, so it
*should* have been caught automatically.

But it's not even the conflict that annoys me.

It's the *reason* for the conflict - the block layer churn that you
said was done. We've had too much of it.

>> We need *stability* by now, after these crazy changes. Make sure that 
>> happens.
>
> I am pushing back, but I can't embargo any sort of API change. This one has
> good reasoning behind it, which is actually nicely explained in the commit
> itself. It's not pointless churn, which I would define as change just for
> the sake of change itself. Or pointless cleanups.

You can definitely put your foot down on any more random changes to
the bio infrastructure. Including for good reasons.

We have had some serious issues in the block layer - and I'm not
talking about the merge conflicts. I'm talking about just the
collateral damage, with things like SCSI having to revert using blk-mq
by default etc.

Christ, things like that are pretty *fundamental*, wouldn't you say.
Get those right before doing more churn. And aim to have one or two
releases literally just fixing things, with a "no churn" rule.

Because the block layer really has been a black spot lately.

  Linus


Re: [PATCH 00/13] bcache: fixes and update for 4.14

2017-09-07 Thread Jens Axboe
On 09/07/2017 12:51 PM, Eddie Chapman wrote:
> On 06/09/17 18:38, Coly Li wrote:
>> On 2017/9/6 11:46 PM, Jens Axboe wrote:
>>> On 09/06/2017 09:41 AM, Coly Li wrote:
> On 2017/9/6 10:20 PM, Jens Axboe wrote:
> On Wed, Sep 06 2017, Coly Li wrote:
>> Hi Jens,
>>
>> Here are 12 patches for bcache fixes and updates; most of them were posted
>> by Eric Wheeler in the 4.13 merge window, but were delayed due to lack of
>> code review.
>
> Next time, please send this _before_ the merge window opens. Not a huge
> problem for this series, since it has been posted numerous times in the
> past and already has good review coverage, but it really should have
> been in linux-next for a week or two before heading upstream.
>
 Hi Jens,

 Copied, I will send patches _before_ the merge window opens.

 But could you please give me a hint: how can I find out when a specific
 merge window will open? I searched LKML and see people sending pull
 requests for the 4.14 merge window, but I don't see any announcement of
 the 4.14 merge window time slot.
>>>
>>> The merge window opens when Linus releases the previous kernel. So you
>>> have to try and keep track of when he expects to do that. That really
>>> isn't that difficult - Linus always cuts a release on Sundays. We always
>>> have 7 or 8 rc releases, so a good time to check is his mail on -rc6 or
>>> -rc7 release, that usually gives a good indication of when he expects to
>>> release the final.
>>>
>>> To keep things simple, if you always have things ready when -rc7 is
>>> released, then you are at least one week out from the merge window,
>>> possibly two if we end up doing an -rc8. That means you don't have to
>>> track anything, you know the exact date of when -rc7 is released since
>>> Linus's schedule is usually reliable.
>>>
>>
>> Hi Jens,
>>
>> Copied, I will follow the above rule in the next cycle.
>>
>> And I have just posted the last patch
>> (0013-bcache-initialize-stripe_sectors_dirty-correctly-for.patch) to you
>> for the 4.14 cycle.
>>
>> This patch is an important fix for the bcache writeback rate calculation;
>> another patch I sent to you
>> (0006-bcache-correct-cache_dirty_target-in-__update_writeb.patch) depends
>> on it. Please have it in 4.14.
>>
>> Thanks in advance.
> 
> Hello Jens,
> 
> FWIW I'd like to support Coly by reporting that patches 0001, 0002, 0006 
> and 0013 (the last one mentioned above) have been tested by me on 2 
> production servers with a good deal of bcache activity for the last 6 
> weeks without any issues having arisen.
> 
> But this may not be much use to you to know, as these are running 
> vanilla kernel.org kernel 4.4.77.  Still, these 4 patches exactly as 
> Coly has sent them to you apply without any changes to the code, so I 
> guess it vouches for the code at least somewhat.

That's all good and fine, and it's better than no testing at all. But it
doesn't change the fact that anyone using bcache on -next would have
been using the changes for a while before release, instead of just one
or two people. Additionally, your base is drastically different than
mainline, since you're running a kernel that's a year and a half old.

Next time I'm not adding anything after the merge window opens, unless
it's addressing a problem that was introduced in the merge window.

-- 
Jens Axboe



Re: [GIT PULL] First set of block changes for 4.14-rc1

2017-09-07 Thread Jens Axboe
On 09/07/2017 01:04 PM, Linus Torvalds wrote:
> On Tue, Sep 5, 2017 at 7:31 AM, Jens Axboe  wrote:
>>
>> Note that you'll hit a conflict in block/bio-integrity.c and
>> mm/page_io.c, since we had fixes later in the 4.13 series for both of
>> those that ended up conflicting with new changes. Both are trivial to
>> fix up, I've included my resolution at the end of this email.
> 
> Actually, the _real_ conflict is with commit 8e654f8fbff5 ("zram: read
> page from backing device") and the preceding code, which added a
> "zram->bdev" that is the backing dev for the zram device.

Which was committed yesterday? It was not from my tree. I try to keep
an eye out for potential conflicts or issues.

> And then it does
> 
>  bio->bi_bdev = zram->bdev;
> 
> but now the block layer has AGAIN changed some stupid internal detail
> for no obvious reason, so now it doesn't work.
> 
> My initial thought was to just do
> 
> bio->bi_disk = zram->disk;
> 
> instead, which *syntactically* seems to make sense, and match things
> up in the obvious way.
> 
> But when I actually thought about it more, I decided that's bullshit.
> "zram->disk" is not the backing device at all, it's the zram interface
> itself.
> 
> The correct resolution seems to be
> 
> bio_set_dev(bio, zram->bdev);
> 
> but *dammit*, Jens, this insane crazy constant "let's randomly change
> block layer details around" needs to STOP.
> 
> This has been going on for too many releases by now. What the f*ck is
> *wrong* with you and Christoph that you can't just leave this thing
> alone?
> 
> Stop the pointless churn already.
> 
> When I saw your comment:
> 
>   It's a quiet series this round, which I think we needed after
>   the churn of the last few series.
> 
> I was like "finally". And then this shows that you were just lying,
> and the pointless churn is still happening.
> 
> Stop it. Really.
> 
> Here's a hint: if some stupid change requires you to mindlessly do 150
> changes using a wrapper helper, it's churn.
> 
> We need *stability* by now, after these crazy changes. Make sure that happens.

I am pushing back, but I can't embargo any sort of API change. This one has
good reasoning behind it, which is actually nicely explained in the commit
itself. It's not pointless churn, which I would define as change just for
the sake of change itself. Or pointless cleanups.

That said, there are no plans to mix anything up for the next release.
We've had more frivolous changes in past releases, I don't expect that
to happen again. 

-- 
Jens Axboe



Re: [GIT PULL] First set of block changes for 4.14-rc1

2017-09-07 Thread Linus Torvalds
On Tue, Sep 5, 2017 at 7:31 AM, Jens Axboe  wrote:
>
> Note that you'll hit a conflict in block/bio-integrity.c and
> mm/page_io.c, since we had fixes later in the 4.13 series for both of
> those that ended up conflicting with new changes. Both are trivial to
> fix up, I've included my resolution at the end of this email.

Actually, the _real_ conflict is with commit 8e654f8fbff5 ("zram: read
page from backing device") and the preceding code, which added a
"zram->bdev" that is the backing dev for the zram device.

And then it does

 bio->bi_bdev = zram->bdev;

but now the block layer has AGAIN changed some stupid internal detail
for no obvious reason, so now it doesn't work.

My initial thought was to just do

bio->bi_disk = zram->disk;

instead, which *syntactically* seems to make sense, and match things
up in the obvious way.

But when I actually thought about it more, I decided that's bullshit.
"zram->disk" is not the backing device at all, it's the zram interface
itself.

The correct resolution seems to be

bio_set_dev(bio, zram->bdev);

but *dammit*, Jens, this insane crazy constant "let's randomly change
block layer details around" needs to STOP.

This has been going on for too many releases by now. What the f*ck is
*wrong* with you and Christoph that you can't just leave this thing
alone?

Stop the pointless churn already.

When I saw your comment:

  It's a quiet series this round, which I think we needed after
  the churn of the last few series.

I was like "finally". And then this shows that you were just lying,
and the pointless churn is still happening.

Stop it. Really.

Here's a hint: if some stupid change requires you to mindlessly do 150
changes using a wrapper helper, it's churn.

We need *stability* by now, after these crazy changes. Make sure that happens.

  Linus


Re: [PATCH 00/13] bcache: fixes and update for 4.14

2017-09-07 Thread Eddie Chapman

On 06/09/17 18:38, Coly Li wrote:

On 2017/9/6 11:46 PM, Jens Axboe wrote:

On 09/06/2017 09:41 AM, Coly Li wrote:

On 2017/9/6 10:20 PM, Jens Axboe wrote:

On Wed, Sep 06 2017, Coly Li wrote:

Hi Jens,

Here are 12 patches for bcache fixes and updates; most of them were posted
by Eric Wheeler in the 4.13 merge window, but were delayed due to lack of
code review.


Next time, please send this _before_ the merge window opens. Not a huge
problem for this series, since it has been posted numerous times in the
past and already has good review coverage, but it really should have
been in linux-next for a week or two before heading upstream.


Hi Jens,

Copied, I will send patches _before_ the merge window opens.

But could you please give me a hint: how can I find out when a specific
merge window will open? I searched LKML and see people sending pull
requests for the 4.14 merge window, but I don't see any announcement of
the 4.14 merge window time slot.


The merge window opens when Linus releases the previous kernel. So you
have to try and keep track of when he expects to do that. That really
isn't that difficult - Linus always cuts a release on Sundays. We always
have 7 or 8 rc releases, so a good time to check is his mail on -rc6 or
-rc7 release, that usually gives a good indication of when he expects to
release the final.

To keep things simple, if you always have things ready when -rc7 is
released, then you are at least one week out from the merge window,
possibly two if we end up doing an -rc8. That means you don't have to
track anything, you know the exact date of when -rc7 is released since
Linus's schedule is usually reliable.



Hi Jens,

Copied, I will follow the above rule in the next cycle.

And I have just posted the last patch
(0013-bcache-initialize-stripe_sectors_dirty-correctly-for.patch) to you
for the 4.14 cycle.

This patch is an important fix for the bcache writeback rate calculation;
another patch I sent to you
(0006-bcache-correct-cache_dirty_target-in-__update_writeb.patch) depends
on it. Please have it in 4.14.

Thanks in advance.


I'm sorry, a correction to my last mail: the patches I have tested should
read 0002, 0003, 0006 and 0013.


Re: [PATCH 00/13] bcache: fixes and update for 4.14

2017-09-07 Thread Eddie Chapman

On 06/09/17 18:38, Coly Li wrote:

On 2017/9/6 11:46 PM, Jens Axboe wrote:

On 09/06/2017 09:41 AM, Coly Li wrote:

On 2017/9/6 10:20 PM, Jens Axboe wrote:

On Wed, Sep 06 2017, Coly Li wrote:

Hi Jens,

Here are 12 patches for bcache fixes and updates; most of them were posted
by Eric Wheeler in the 4.13 merge window, but were delayed due to lack of
code review.


Next time, please send this _before_ the merge window opens. Not a huge
problem for this series, since it has been posted numerous times in the
past and already has good review coverage, but it really should have
been in linux-next for a week or two before heading upstream.


Hi Jens,

Copied, I will send patches _before_ the merge window opens.

But could you please give me a hint: how can I find out when a specific
merge window will open? I searched LKML and see people sending pull
requests for the 4.14 merge window, but I don't see any announcement of
the 4.14 merge window time slot.


The merge window opens when Linus releases the previous kernel. So you
have to try and keep track of when he expects to do that. That really
isn't that difficult - Linus always cuts a release on Sundays. We always
have 7 or 8 rc releases, so a good time to check is his mail on -rc6 or
-rc7 release, that usually gives a good indication of when he expects to
release the final.

To keep things simple, if you always have things ready when -rc7 is
released, then you are at least one week out from the merge window,
possibly two if we end up doing an -rc8. That means you don't have to
track anything, you know the exact date of when -rc7 is released since
Linus's schedule is usually reliable.



Hi Jens,

Copied, I will follow the above rule in the next cycle.

And I have just posted the last patch
(0013-bcache-initialize-stripe_sectors_dirty-correctly-for.patch) to you
for the 4.14 cycle.

This patch is an important fix for the bcache writeback rate calculation;
another patch I sent to you
(0006-bcache-correct-cache_dirty_target-in-__update_writeb.patch) depends
on it. Please have it in 4.14.

Thanks in advance.


Hello Jens,

FWIW I'd like to support Coly by reporting that patches 0001, 0002, 0006 
and 0013 (the last one mentioned above) have been tested by me on 2 
production servers with a good deal of bcache activity for the last 6 
weeks without any issues having arisen.


But this may not be much use to you to know, as these are running 
vanilla kernel.org kernel 4.4.77.  Still, these 4 patches exactly as 
Coly has sent them to you apply without any changes to the code, so I 
guess it vouches for the code at least somewhat.


Eddie


Re: [PATCH]blk-mq-sched: remove the empty entry in token wait list

2017-09-07 Thread Omar Sandoval
On Tue, Aug 29, 2017 at 10:52:13PM +0800, 查斌 wrote:
> From:  Bin Zha 
> 
> 
> When kyber adjusts the sync and other write depths to the minimum (1),
> there is a case that may cause requests to be stalled in the
> kyber_hctx_data list.
> 
> The following is an example I have tested:
> 
> 
> CPU7
> block_rq_insert
> add_wait_queue
> __sbitmap_queue_get  CPU23
> block_rq_issue   block_rq_insert
> block_rq_complete -->  waiting token free
>   block_rq_issue
>   /|\block_rq_complete
>||
>||
>|   \|/
>| CPU29
>|block_rq_insert
>|   waiting token free
>| block_rq_issue
>|-- block_rq_complete
> 
> CPU1
>block_rq_insert
>   waiting token free
> 
> 
> The IO request completion on CPU29 will wake up CPU7, because CPU7 has
> been added to the waitqueue in kyber_get_domain_token(). But it is an
> empty waiter and won't wake up CPU1. If no other requests are issued to
> push things forward, the requests will stall in the kyber_hctx_data
> list.
> 
> 
> Signed-off-by: Bin Zha 
> 
> diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
> index b9faabc..584bfd5 100644
> --- a/block/kyber-iosched.c
> +++ b/block/kyber-iosched.c
> @@ -548,6 +548,10 @@ static int kyber_get_domain_token(struct kyber_queue_data *kqd,
>  * queue.
>  */
> nr = __sbitmap_queue_get(domain_tokens);
> +   if (nr >= 0) {
> +   remove_wait_queue(&ws->wait, wait);
> +   INIT_LIST_HEAD(&wait->task_list);
> +   }
> }
> return nr;
>  }

Hm... I could've sworn I thought about this case when I wrote this code,
but your analysis looks correct. I do remember that I didn't do this
because I was worried about a race like so:

add_wait_queue()
kyber_domain_wake()
\_ list_del_init()
remove_wait_queue()
\_ list_del()

But thinking about it some more, list_del() is probably safe after
list_del_init()?
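
To check that intuition, here is a userspace model of the two deletions,
using simplified list helpers without the kernel's pointer poisoning (the
poisoning does not change the outcome here, because list_del_init() leaves
the entry pointing at itself, so the later deletion only touches the entry):

#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *l) { l->next = l->prev = l; }

/* Simplified list_del(): unlinks the entry, no poisoning. */
static void list_del(struct list_head *e)
{
	e->next->prev = e->prev;
	e->prev->next = e->next;
}

static void list_del_init(struct list_head *e)
{
	list_del(e);
	INIT_LIST_HEAD(e);	/* entry now points at itself */
}

int main(void)
{
	struct list_head head, a;

	INIT_LIST_HEAD(&head);
	/* open-coded list_add(&a, &head) */
	a.next = head.next;
	a.prev = &head;
	head.next->prev = &a;
	head.next = &a;

	list_del_init(&a);	/* waker side: kyber_domain_wake() */
	list_del(&a);		/* racer: remove_wait_queue(); touches only &a */
	printf("head intact: %d\n", head.next == &head && head.prev == &head);
	return 0;
}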

Have you been able to reproduce this? I have the following patch to
force the domain token pools to one token, I imagine with the right
workload we can hit it:

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index f58cab8..b4e1bd7 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -58,9 +58,9 @@ enum {
  * So, we cap these to a reasonable value.
  */
 static const unsigned int kyber_depth[] = {
-   [KYBER_READ] = 256,
-   [KYBER_SYNC_WRITE] = 128,
-   [KYBER_OTHER] = 64,
+   [KYBER_READ] = 1,
+   [KYBER_SYNC_WRITE] = 1,
+   [KYBER_OTHER] = 1,
 };

 /*
@@ -126,6 +126,7 @@ enum {
 #define IS_GOOD(status) ((status) > 0)
 #define IS_BAD(status) ((status) < 0)

+#if 0
 static int kyber_lat_status(struct blk_stat_callback *cb,
unsigned int sched_domain, u64 target)
 {
@@ -243,6 +244,7 @@ static void kyber_adjust_other_depth(struct kyber_queue_data *kqd,
if (depth != orig_depth)
sbitmap_queue_resize(>domain_tokens[KYBER_OTHER], depth);
 }
+#endif

 /*
  * Apply heuristics for limiting queue depths based on gathered latency
@@ -250,6 +252,8 @@ static void kyber_adjust_other_depth(struct kyber_queue_data *kqd,
  */
 static void kyber_stat_timer_fn(struct blk_stat_callback *cb)
 {
+   return;
+#if 0
struct kyber_queue_data *kqd = cb->data;
int read_status, write_status;

@@ -269,6 +273,7 @@ static void kyber_stat_timer_fn(struct blk_stat_callback *cb)
((IS_BAD(read_status) || IS_BAD(write_status) ||
	  kqd->domain_tokens[KYBER_OTHER].sb.depth < kyber_depth[KYBER_OTHER])))
blk_stat_activate_msecs(kqd->cb, 100);
+#endif
 }

 static unsigned int kyber_sched_tags_shift(struct kyber_queue_data *kqd)


[PATCH V2 11/12] scsi: sd: Introduce scsi_disk_from_queue()

2017-09-07 Thread Damien Le Moal
Using scsi_device_from_queue(), return the scsi_disk structure
associated with a request queue if the device is a disk.
Export this function to make it available to modules.

Signed-off-by: Damien Le Moal 
---
 drivers/scsi/sd.c | 32 
 drivers/scsi/sd.h |  2 ++
 2 files changed, 34 insertions(+)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 1cd113df2d3a..ff9fff4de3e6 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -631,6 +631,38 @@ static void scsi_disk_put(struct scsi_disk *sdkp)
	mutex_unlock(&sd_ref_mutex);
 }
 
+static int scsi_disk_lookup(struct device *dev, void *data)
+{
+   struct scsi_disk **sdkp = data;
+
+   if (!*sdkp && dev->class == &sd_disk_class)
+   *sdkp = to_scsi_disk(dev);
+
+   return 0;
+}
+
+/**
+ * scsi_disk_from_queue - return scsi disk associated with a request_queue
+ * @q: The request queue to return the scsi disk from
+ *
+ * Return the struct scsi_disk associated with a request queue or NULL if the
+ * request_queue does not reference a SCSI device or if the device is
+ * not a disk.
+ */
+struct scsi_disk *scsi_disk_from_queue(struct request_queue *q)
+{
+   struct scsi_device *sdev = scsi_device_from_queue(q);
+   struct scsi_disk *sdkp = NULL;
+
+   if (!sdev)
+   return NULL;
+
+   device_for_each_child(&sdev->sdev_gendev, &sdkp, scsi_disk_lookup);
+
+   return sdkp;
+}
+EXPORT_SYMBOL_GPL(scsi_disk_from_queue);
+
 #ifdef CONFIG_BLK_SED_OPAL
 static int sd_sec_submit(void *data, u16 spsp, u8 secp, void *buffer,
size_t len, bool send)
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 3ea82c8cecd5..92113a9e2b20 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -126,6 +126,8 @@ static inline struct scsi_disk *scsi_disk(struct gendisk *disk)
return container_of(disk->private_data, struct scsi_disk, driver);
 }
 
+struct scsi_disk *scsi_disk_from_queue(struct request_queue *q);
+
 #define sd_printk(prefix, sdsk, fmt, a...) \
 (sdsk)->disk ? \
  sdev_prefix_printk(prefix, (sdsk)->device,\
-- 
2.13.5



[PATCH V2 12/12] scsi: Introduce ZBC disk I/O scheduler

2017-09-07 Thread Damien Le Moal
The zoned I/O scheduler is mostly identical to mq-deadline and retains
the same configuration attributes. The main difference is that the
zoned scheduler will ensure that, at any time, at most one write
request (command) per sequential zone is in flight (has been issued to
the disk) in order to protect against sequential write reordering
potentially resulting from the concurrent execution of request dispatch
by multiple contexts.

This is achieved similarly to the legacy SCSI I/O path by write locking
zones when a write request is issued. Subsequent writes to the same
zone have to wait for the completion of the previously issued write
before being in turn dispatched to the disk. This ensures that
sequential writes are processed in the correct order without needing
any modification to the execution model of blk-mq. In addition, this
also protects against write reordering at the HBA level (e.g. AHCI).
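
As an illustration, the dispatch gate can be pictured as a test-and-set
bitmap with one bit per sequential zone. A simplified userspace model only;
the actual code uses atomic test_and_set_bit() on a per-disk bitmap, as in
sd_zbc_write_lock_zone():

#include <stdbool.h>
#include <stdio.h>

static unsigned long zones_wlock;	/* one bit per zone; zones < 64 here */

/* Returns true if the write may be dispatched now; otherwise the request
 * stays queued until the in-flight write to the zone completes. */
static bool toy_zone_write_trylock(unsigned int zno)
{
	unsigned long bit = 1UL << zno;

	if (zones_wlock & bit)
		return false;	/* a write is already in flight */
	zones_wlock |= bit;	/* the kernel uses test_and_set_bit() */
	return true;
}

static void toy_zone_write_unlock(unsigned int zno)
{
	zones_wlock &= ~(1UL << zno);	/* on command completion */
}

int main(void)
{
	printf("first write to zone 3:  %d\n", toy_zone_write_trylock(3));
	printf("second write to zone 3: %d\n", toy_zone_write_trylock(3));
	toy_zone_write_unlock(3);
	printf("after completion:       %d\n", toy_zone_write_trylock(3));
	return 0;
}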

This zoned scheduler code is added under the drivers/scsi directory so
that information managed using the scsi_disk structure can be directly
referenced. Doing so avoids cluttering the block layer with
device-type-specific code.

Signed-off-by: Damien Le Moal 
---
 Documentation/block/zoned-iosched.txt |  48 ++
 block/Kconfig.iosched |  12 +
 drivers/scsi/Makefile |   1 +
 drivers/scsi/sd.h |   3 +-
 drivers/scsi/sd_zbc.c |  16 +-
 drivers/scsi/sd_zbc.h |   8 +-
 drivers/scsi/zoned_iosched.c  | 939 ++
 7 files changed, 1015 insertions(+), 12 deletions(-)
 create mode 100644 Documentation/block/zoned-iosched.txt
 create mode 100644 drivers/scsi/zoned_iosched.c

diff --git a/Documentation/block/zoned-iosched.txt b/Documentation/block/zoned-iosched.txt
new file mode 100644
index ..b269125bdc61
--- /dev/null
+++ b/Documentation/block/zoned-iosched.txt
@@ -0,0 +1,48 @@
+Zoned I/O scheduler
+===
+
+The Zoned I/O scheduler solves the write ordering problems of zoned block
+devices that are inherent to the absence of a global request queue lock in
+the blk-mq infrastructure. Multiple contexts may try to dispatch write
+requests to the same sequential zone of a zoned block device simultaneously,
+potentially breaking the sequential write order imposed by the device.
+
+The Zoned I/O scheduler is based on the mq-deadline scheduler. It shares the
+same tunables and behaves in a comparable manner. The main difference
+introduced with the zoned scheduler is the handling of write batches. Whereas
+mq-deadline will keep dispatching write requests to the device as long as the
+batching size allows and reads are not starved, the zoned scheduler
+introduces additional constraints:
+1) At most one write request can be issued to a sequential zone of the
+device. This ensures that no reordering of sequential writes to a sequential
+zone can happen once a write request leaves the scheduler internal queue (rb
+tree).
+2) The number of sequential zones that can be simultaneously written is
+limited to the device's advertised maximum number of open zones. This
+additional condition avoids performance degradation due to excessive open
+zone resource use at the device level.
+
+These conditions do not apply to write requests targeting conventional zones.
+For these, the zoned scheduler behaves exactly like the mq-deadline scheduler.
+
+The zoned I/O scheduler cannot be used with regular block devices. It can
+only be used with host-managed or host-aware zoned block devices.
+Using the zoned I/O scheduler is mandatory with host-managed disks unless
+the disk user itself tightly controls write sequencing to sequential zones.
+The zoned scheduler treats host-aware disks exactly the same way as
+host-managed devices. That is, even though host-aware disks can be randomly
+written, the zoned scheduler still imposes the limit of one write per zone
+so that sequential write sequences are preserved.
+
+For host-managed disks, automating the use of the zoned scheduler can be
+done simply with a udev rule. An example of such a rule is shown below.
+
+# Set zoned scheduler for host-managed zoned block devices
+ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/zoned}=="host-managed", \
+   ATTR{queue/scheduler}="zoned"
+
+Zoned I/O scheduler tunables
+
+
+Tunables of the Zoned I/O scheduler are identical to those of the deadline
+I/O scheduler. See Documentation/block/deadline-iosched.txt for details.
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index fd2cefa47d35..b87c67dbf1f6 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -70,6 +70,18 @@ config MQ_IOSCHED_DEADLINE
---help---
  MQ version of the deadline IO scheduler.
 
+config MQ_IOSCHED_ZONED
+   tristate "Zoned I/O scheduler"
+   depends on BLK_DEV_ZONED
+   depends on SCSI_MOD
+   depends on BLK_DEV_SD
+   

[PATCH V2 10/12] scsi: sd_zbc: Disable zone write locking with scsi-mq

2017-09-07 Thread Damien Le Moal
In the case of a ZBC disk used with scsi-mq, zone write locking does
not prevent write reordering in sequential zones. Unlike the legacy
case, zone locking can only be done after the command request is
removed from the scheduler dispatch queue. That is, at the time of
zone locking, the write command may already be out of order.

Disable zone write locking in sd_zbc_write_lock_zone() if the disk is
used with scsi-mq. Write order guarantees can be provided by an
adapted I/O scheduler.

Signed-off-by: Damien Le Moal 
Reviewed-by: Bart Van Assche 
---
 drivers/scsi/sd_zbc.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index bcece82a1d8d..6b1a7f9c1e90 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -264,6 +264,10 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
(sector & (zone_sectors - 1)) + blk_rq_sectors(rq) > zone_sectors)
return BLKPREP_KILL;
 
+   /* No write locking with scsi-mq */
+   if (rq->q->mq_ops)
+   return BLKPREP_OK;
+
/*
 * There is no write constraint on conventional zones, but do not issue
 * more than one write at a time per sequential zone. This avoids write
-- 
2.13.5



[PATCH V2 09/12] scsi: sd_zbc: Limit zone write locking to sequential zones

2017-09-07 Thread Damien Le Moal
Zoned block devices have no write constraints for conventional zones
so write locking of conventional zones is not necessary and can hurt
performance. To avoid this, introduce the seq_zones bitmap to indicate
if a zone is a sequential one. Use this information to allow any write
to be issued to conventional zones, locking only sequential zones.

As the zone bitmap allocation for seq_zones is identical to the zones
write lock bitmap, introduce the helper sd_zbc_alloc_zone_bitmap().
Using this helper, wait for the disk capacity and number of zones to
stabilize on the second revalidation pass before allocating and
initializing the bitmaps.

Signed-off-by: Damien Le Moal 
Reviewed-by: Bart Van Assche 
---
 drivers/scsi/sd.h |   1 +
 drivers/scsi/sd_zbc.c | 108 --
 drivers/scsi/sd_zbc.h |  12 ++
 3 files changed, 108 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index c9a627e7eebd..3ea82c8cecd5 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -77,6 +77,7 @@ struct scsi_disk {
unsigned intzone_blocks;
unsigned intzone_shift;
unsigned long   *zones_wlock;
+   unsigned long   *seq_zones;
unsigned intzones_optimal_open;
unsigned intzones_optimal_nonseq;
unsigned intzones_max_open;
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 8b258254a2bb..bcece82a1d8d 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -252,7 +252,7 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
sector_t sector = blk_rq_pos(rq);
sector_t zone_sectors = sd_zbc_zone_sectors(sdkp);
-   unsigned int zno = sd_zbc_zone_no(sdkp, sector);
+   unsigned int zno;
 
/*
 * Note: Checks of the alignment of the write command on
@@ -265,13 +265,15 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
return BLKPREP_KILL;
 
/*
-* Do not issue more than one write at a time per
-* zone. This solves write ordering problems due to
-* the unlocking of the request queue in the dispatch
-* path in the non scsi-mq case.
+* There is no write constraint on conventional zones, but do not issue
+* more than one write at a time per sequential zone. This avoids write
+* ordering problems due to the unlocking of the request queue in the
+* dispatch path of the non scsi-mq (legacy) case.
 */
-   if (sdkp->zones_wlock &&
-   test_and_set_bit(zno, sdkp->zones_wlock))
+   zno = sd_zbc_zone_no(sdkp, sector);
+   if (!test_bit(zno, sdkp->seq_zones))
+   return BLKPREP_OK;
+   if (test_and_set_bit(zno, sdkp->zones_wlock))
return BLKPREP_DEFER;
 
WARN_ON_ONCE(cmd->flags & SCMD_ZONE_WRITE_LOCK);
@@ -289,8 +291,9 @@ void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd)
struct request *rq = cmd->request;
struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 
-   if (sdkp->zones_wlock && cmd->flags & SCMD_ZONE_WRITE_LOCK) {
+   if (cmd->flags & SCMD_ZONE_WRITE_LOCK) {
unsigned int zno = sd_zbc_zone_no(sdkp, blk_rq_pos(rq));
+
WARN_ON_ONCE(!test_bit(zno, sdkp->zones_wlock));
cmd->flags &= ~SCMD_ZONE_WRITE_LOCK;
clear_bit_unlock(zno, sdkp->zones_wlock);
@@ -507,8 +510,67 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
return 0;
 }
 
+/*
+ * Initialize the sequential zone bitmap to allow identifying sequential zones.
+ */
+static int sd_zbc_setup_seq_zones(struct scsi_disk *sdkp)
+{
+   unsigned long *seq_zones;
+   sector_t block = 0;
+   unsigned char *buf;
+   unsigned char *rec;
+   unsigned int buf_len;
+   unsigned int list_length;
+   unsigned int n = 0;
+   int ret = -ENOMEM;
+
+   /* Allocate zone type bitmap */
+   seq_zones = sd_zbc_alloc_zone_bitmap(sdkp);
+   if (!seq_zones)
+   return -ENOMEM;
+
+   buf = kmalloc(SD_ZBC_BUF_SIZE, GFP_KERNEL);
+   if (!buf)
+   goto out;
+
+   while (block < sdkp->capacity) {
+
+   ret = sd_zbc_report_zones(sdkp, buf, SD_ZBC_BUF_SIZE, block);
+   if (ret)
+   goto out;
+
+   /* Parse reported zone descriptors */
+   list_length = get_unaligned_be32(&buf[0]) + 64;
+   rec = buf + 64;
+   buf_len = min(list_length, SD_ZBC_BUF_SIZE);
+   while (rec < buf + buf_len) {
+   if ((rec[0] & 0x0f) != ZBC_ZONE_TYPE_CONV)
+   set_bit(n, seq_zones);
+   block += get_unaligned_be64(&rec[8]);
+   rec += 64;
+   n++;
+   }
+
+   }
+
+   if (n != sdkp->nr_zones)
+  

[PATCH V2 08/12] scsi: sd_zbc: Fix sd_zbc_read_zoned_characteristics()

2017-09-07 Thread Damien Le Moal
The three values starting at byte 8 of the Zoned Block Device
Characteristics VPD page B6h are 32-bit values, not 64-bit. So use
get_unaligned_be32() to retrieve the values, not get_unaligned_be64().

Fixes: 89d947561077 ("sd: Implement support for ZBC devices")
Cc: 

Signed-off-by: Damien Le Moal 
Reviewed-by: Bart Van Assche 
---
 drivers/scsi/sd_zbc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 8a45db8bd6e7..8b258254a2bb 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -358,15 +358,15 @@ static int sd_zbc_read_zoned_characteristics(struct scsi_disk *sdkp,
if (sdkp->device->type != TYPE_ZBC) {
/* Host-aware */
sdkp->urswrz = 1;
-   sdkp->zones_optimal_open = get_unaligned_be64(&buf[8]);
-   sdkp->zones_optimal_nonseq = get_unaligned_be64(&buf[12]);
+   sdkp->zones_optimal_open = get_unaligned_be32(&buf[8]);
+   sdkp->zones_optimal_nonseq = get_unaligned_be32(&buf[12]);
sdkp->zones_max_open = 0;
} else {
/* Host-managed */
sdkp->urswrz = buf[4] & 1;
sdkp->zones_optimal_open = 0;
sdkp->zones_optimal_nonseq = 0;
-   sdkp->zones_max_open = get_unaligned_be64(&buf[16]);
+   sdkp->zones_max_open = get_unaligned_be32(&buf[16]);
}
 
return 0;
-- 
2.13.5



[PATCH V2 06/12] scsi: sd_zbc: Rearrange code

2017-09-07 Thread Damien Le Moal
Move inline functions sd_zbc_zone_sectors() and sd_zbc_zone_no()
declarations to sd_zbc.h and rearrange sd_zbc_setup() to include
use_16_for_rw and use_10_for_rw assignment.
Also move calculation of sdkp->zone_shift together with the assignment
of the verified zone_blocks value in sd_zbc_check_zone_size().

No functional change is introduced by this patch.
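
For reference, the zone-number math being consolidated here works like this
(a worked example with illustrative values: 65536-block zones and 4096-byte
logical blocks):

/* Worked example of the zone-number math; values illustrative only. */
#include <stdio.h>

int main(void)
{
	unsigned long long zone_blocks = 65536;	/* logical blocks per zone */
	unsigned int zone_shift = 16;		/* ilog2(zone_blocks) */
	unsigned long long sector = 1048576;	/* in 512-byte units */
	unsigned int block_shift = 3;		/* 4096-byte blocks: 512 << 3 */

	/* sectors_to_logical(), then >> zone_shift as in sd_zbc_zone_no() */
	unsigned long long lba = sector >> block_shift;

	printf("zone number: %llu\n", lba >> zone_shift);	/* prints 2 */
	return 0;
}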

Signed-off-by: Damien Le Moal 
---
 drivers/scsi/sd_zbc.c | 21 +
 drivers/scsi/sd_zbc.h | 17 +
 2 files changed, 22 insertions(+), 16 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index e0927f8fb829..a25a940243a9 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -206,17 +206,6 @@ static void sd_zbc_report_zones_complete(struct scsi_cmnd *scmd,
local_irq_restore(flags);
 }
 
-static inline sector_t sd_zbc_zone_sectors(struct scsi_disk *sdkp)
-{
-   return logical_to_sectors(sdkp->device, sdkp->zone_blocks);
-}
-
-static inline unsigned int sd_zbc_zone_no(struct scsi_disk *sdkp,
- sector_t sector)
-{
-   return sectors_to_logical(sdkp->device, sector) >> sdkp->zone_shift;
-}
-
 /*
  * Prepare a RESET WRITE POINTER scsi command for a REQ_OP_ZONE_RESET request.
  */
@@ -516,6 +505,7 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
}
 
sdkp->zone_blocks = zone_blocks;
+   sdkp->zone_shift = ilog2(zone_blocks);
 
return 0;
 }
@@ -523,10 +513,13 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
 static int sd_zbc_setup(struct scsi_disk *sdkp)
 {
 
+   /* READ16/WRITE16 is mandatory for ZBC disks */
+   sdkp->device->use_16_for_rw = 1;
+   sdkp->device->use_10_for_rw = 0;
+
/* chunk_sectors indicates the zone size */
blk_queue_chunk_sectors(sdkp->disk->queue,
logical_to_sectors(sdkp->device, sdkp->zone_blocks));
-   sdkp->zone_shift = ilog2(sdkp->zone_blocks);
sdkp->nr_zones = sdkp->capacity >> sdkp->zone_shift;
if (sdkp->capacity & (sdkp->zone_blocks - 1))
sdkp->nr_zones++;
@@ -589,10 +582,6 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
if (ret)
goto err;
 
-   /* READ16/WRITE16 is mandatory for ZBC disks */
-   sdkp->device->use_16_for_rw = 1;
-   sdkp->device->use_10_for_rw = 0;
-
return 0;
 
 err:
diff --git a/drivers/scsi/sd_zbc.h b/drivers/scsi/sd_zbc.h
index 5e7ecc9b2966..b34002f0acbe 100644
--- a/drivers/scsi/sd_zbc.h
+++ b/drivers/scsi/sd_zbc.h
@@ -15,6 +15,23 @@ int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd);
 void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
 struct scsi_sense_hdr *sshdr);
 
+/*
+ * Zone size in number of 512B sectors.
+ */
+static inline sector_t sd_zbc_zone_sectors(struct scsi_disk *sdkp)
+{
+   return logical_to_sectors(sdkp->device, sdkp->zone_blocks);
+}
+
+/*
+ * Zone number of the specified sector.
+ */
+static inline unsigned int sd_zbc_zone_no(struct scsi_disk *sdkp,
+ sector_t sector)
+{
+   return sectors_to_logical(sdkp->device, sector) >> sdkp->zone_shift;
+}
+
 #else /* CONFIG_BLK_DEV_ZONED */
 
 static inline int sd_zbc_read_zones(struct scsi_disk *sdkp,
-- 
2.13.5



[PATCH V2 07/12] scsi: sd_zbc.c: Use well defined macros

2017-09-07 Thread Damien Le Moal
Instead of open coding it, use the min() macro to calculate the report
zones reply buffer length in sd_zbc_check_zone_size() and the round_up()
macro to calculate the number of zones in sd_zbc_setup().

No functional change is introduced by this patch.
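
A quick sanity check that the round_up() form is equivalent to the old
open-coded "shift, then increment on remainder" logic (illustrative values;
the kernel's round_up() uses a power-of-two mask rather than this generic
division form):

#include <stdio.h>

/* Generic division form; the kernel macro masks with (y - 1) instead. */
#define toy_round_up(x, y) ((((x) + (y) - 1) / (y)) * (y))

int main(void)
{
	unsigned long long capacity = 1000000;	/* logical blocks */
	unsigned long long zone_blocks = 65536;
	unsigned int zone_shift = 16;		/* ilog2(zone_blocks) */

	unsigned long long old_nr = capacity >> zone_shift;
	unsigned long long new_nr;

	if (capacity & (zone_blocks - 1))
		old_nr++;

	new_nr = toy_round_up(capacity, zone_blocks) >> zone_shift;

	printf("old=%llu new=%llu\n", old_nr, new_nr);	/* both print 16 */
	return 0;
}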

Signed-off-by: Damien Le Moal 
---
 drivers/scsi/sd_zbc.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index a25a940243a9..8a45db8bd6e7 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -403,7 +403,7 @@ static int sd_zbc_check_capacity(struct scsi_disk *sdkp, unsigned char *buf)
return 0;
 }
 
-#define SD_ZBC_BUF_SIZE 131072
+#define SD_ZBC_BUF_SIZE 131072U
 
 static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
 {
@@ -447,10 +447,7 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
/* Parse REPORT ZONES header */
	list_length = get_unaligned_be32(&buf[0]) + 64;
rec = buf + 64;
-   if (list_length < SD_ZBC_BUF_SIZE)
-   buf_len = list_length;
-   else
-   buf_len = SD_ZBC_BUF_SIZE;
+   buf_len = min(list_length, SD_ZBC_BUF_SIZE);
 
/* Parse zone descriptors */
while (rec < buf + buf_len) {
@@ -520,9 +517,8 @@ static int sd_zbc_setup(struct scsi_disk *sdkp)
/* chunk_sectors indicates the zone size */
blk_queue_chunk_sectors(sdkp->disk->queue,
logical_to_sectors(sdkp->device, sdkp->zone_blocks));
-   sdkp->nr_zones = sdkp->capacity >> sdkp->zone_shift;
-   if (sdkp->capacity & (sdkp->zone_blocks - 1))
-   sdkp->nr_zones++;
+   sdkp->nr_zones =
+   round_up(sdkp->capacity, sdkp->zone_blocks) >> sdkp->zone_shift;
 
if (!sdkp->zones_wlock) {
sdkp->zones_wlock = kcalloc(BITS_TO_LONGS(sdkp->nr_zones),
-- 
2.13.5



[PATCH V2 03/12] scsi: sd_zbc: Move ZBC declarations to scsi_proto.h

2017-09-07 Thread Damien Le Moal
Move standard macro definitions for the zone types and zone conditions
to scsi_proto.h together with the definitions related to the
REPORT ZONES command.

While at it, define all values in the enums explicitly so that the
reserved ranges are clear.

Signed-off-by: Damien Le Moal 
Reviewed-by: Bart Van Assche 
---
 drivers/scsi/sd_zbc.c | 18 --
 include/scsi/scsi_proto.h | 45 ++---
 2 files changed, 34 insertions(+), 29 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 8aa54779aac1..d445a57f99bd 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -37,24 +37,6 @@
 #include "sd.h"
 #include "scsi_priv.h"
 
-enum zbc_zone_type {
-   ZBC_ZONE_TYPE_CONV = 0x1,
-   ZBC_ZONE_TYPE_SEQWRITE_REQ,
-   ZBC_ZONE_TYPE_SEQWRITE_PREF,
-   ZBC_ZONE_TYPE_RESERVED,
-};
-
-enum zbc_zone_cond {
-   ZBC_ZONE_COND_NO_WP,
-   ZBC_ZONE_COND_EMPTY,
-   ZBC_ZONE_COND_IMP_OPEN,
-   ZBC_ZONE_COND_EXP_OPEN,
-   ZBC_ZONE_COND_CLOSED,
-   ZBC_ZONE_COND_READONLY = 0xd,
-   ZBC_ZONE_COND_FULL,
-   ZBC_ZONE_COND_OFFLINE,
-};
-
 /**
  * Convert a zone descriptor to a zone struct.
  */
diff --git a/include/scsi/scsi_proto.h b/include/scsi/scsi_proto.h
index 8c285d9a06d8..39130a9c05bf 100644
--- a/include/scsi/scsi_proto.h
+++ b/include/scsi/scsi_proto.h
@@ -301,19 +301,42 @@ struct scsi_lun {
 
 /* Reporting options for REPORT ZONES */
 enum zbc_zone_reporting_options {
-   ZBC_ZONE_REPORTING_OPTION_ALL = 0,
-   ZBC_ZONE_REPORTING_OPTION_EMPTY,
-   ZBC_ZONE_REPORTING_OPTION_IMPLICIT_OPEN,
-   ZBC_ZONE_REPORTING_OPTION_EXPLICIT_OPEN,
-   ZBC_ZONE_REPORTING_OPTION_CLOSED,
-   ZBC_ZONE_REPORTING_OPTION_FULL,
-   ZBC_ZONE_REPORTING_OPTION_READONLY,
-   ZBC_ZONE_REPORTING_OPTION_OFFLINE,
-   ZBC_ZONE_REPORTING_OPTION_NEED_RESET_WP = 0x10,
-   ZBC_ZONE_REPORTING_OPTION_NON_SEQWRITE,
-   ZBC_ZONE_REPORTING_OPTION_NON_WP = 0x3f,
+   ZBC_ZONE_REPORTING_OPTION_ALL   = 0x00,
+   ZBC_ZONE_REPORTING_OPTION_EMPTY = 0x01,
+   ZBC_ZONE_REPORTING_OPTION_IMPLICIT_OPEN = 0x02,
+   ZBC_ZONE_REPORTING_OPTION_EXPLICIT_OPEN = 0x03,
+   ZBC_ZONE_REPORTING_OPTION_CLOSED= 0x04,
+   ZBC_ZONE_REPORTING_OPTION_FULL  = 0x05,
+   ZBC_ZONE_REPORTING_OPTION_READONLY  = 0x06,
+   ZBC_ZONE_REPORTING_OPTION_OFFLINE   = 0x07,
+   /* 0x08 to 0x0f are reserved */
+   ZBC_ZONE_REPORTING_OPTION_NEED_RESET_WP = 0x10,
+   ZBC_ZONE_REPORTING_OPTION_NON_SEQWRITE  = 0x11,
+   /* 0x12 to 0x3e are reserved */
+   ZBC_ZONE_REPORTING_OPTION_NON_WP= 0x3f,
 };
 
 #define ZBC_REPORT_ZONE_PARTIAL 0x80
 
+/* Zone types of REPORT ZONES zone descriptors */
+enum zbc_zone_type {
+   ZBC_ZONE_TYPE_CONV  = 0x1,
+   ZBC_ZONE_TYPE_SEQWRITE_REQ  = 0x2,
+   ZBC_ZONE_TYPE_SEQWRITE_PREF = 0x3,
+   /* 0x4 to 0xf are reserved */
+};
+
+/* Zone conditions of REPORT ZONES zone descriptors */
+enum zbc_zone_cond {
+   ZBC_ZONE_COND_NO_WP = 0x0,
+   ZBC_ZONE_COND_EMPTY = 0x1,
+   ZBC_ZONE_COND_IMP_OPEN  = 0x2,
+   ZBC_ZONE_COND_EXP_OPEN  = 0x3,
+   ZBC_ZONE_COND_CLOSED= 0x4,
+   /* 0x5 to 0xc are reserved */
+   ZBC_ZONE_COND_READONLY  = 0xd,
+   ZBC_ZONE_COND_FULL  = 0xe,
+   ZBC_ZONE_COND_OFFLINE   = 0xf,
+};
+
 #endif /* _SCSI_PROTO_H_ */
-- 
2.13.5



[PATCH V2 04/12] scsi: sd_zbc: Move zbc disk declarations to sd_zbc.h

2017-09-07 Thread Damien Le Moal
Introduce sd_zbc.h to gather ZBC disk related declarations and avoid
cluttering sd.h.

No functional change is introduced by this patch.

Signed-off-by: Damien Le Moal 
---
 drivers/scsi/sd.c |  1 +
 drivers/scsi/sd.h | 48 ---
 drivers/scsi/sd_zbc.c |  7 +-
 drivers/scsi/sd_zbc.h | 62 +++
 4 files changed, 64 insertions(+), 54 deletions(-)
 create mode 100644 drivers/scsi/sd_zbc.h

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index c5d05b1c58a5..1cd113df2d3a 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -68,6 +68,7 @@
 #include 
 
 #include "sd.h"
+#include "sd_zbc.h"
 #include "scsi_priv.h"
 #include "scsi_logging.h"
 
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 99c4dde9b6bf..c9a627e7eebd 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -277,52 +277,4 @@ static inline int sd_is_zoned(struct scsi_disk *sdkp)
return sdkp->zoned == 1 || sdkp->device->type == TYPE_ZBC;
 }
 
-#ifdef CONFIG_BLK_DEV_ZONED
-
-extern int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buffer);
-extern void sd_zbc_remove(struct scsi_disk *sdkp);
-extern void sd_zbc_print_zones(struct scsi_disk *sdkp);
-extern int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd);
-extern void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd);
-extern int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd);
-extern int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd);
-extern void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
-   struct scsi_sense_hdr *sshdr);
-
-#else /* CONFIG_BLK_DEV_ZONED */
-
-static inline int sd_zbc_read_zones(struct scsi_disk *sdkp,
-   unsigned char *buf)
-{
-   return 0;
-}
-
-static inline void sd_zbc_remove(struct scsi_disk *sdkp) {}
-
-static inline void sd_zbc_print_zones(struct scsi_disk *sdkp) {}
-
-static inline int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
-{
-   /* Let the drive fail requests */
-   return BLKPREP_OK;
-}
-
-static inline void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd) {}
-
-static inline int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd)
-{
-   return BLKPREP_INVALID;
-}
-
-static inline int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd)
-{
-   return BLKPREP_INVALID;
-}
-
-static inline void sd_zbc_complete(struct scsi_cmnd *cmd,
-  unsigned int good_bytes,
-  struct scsi_sense_hdr *sshdr) {}
-
-#endif /* CONFIG_BLK_DEV_ZONED */
-
 #endif /* _SCSI_DISK_H */
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index d445a57f99bd..a62f8e256a26 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -28,14 +28,9 @@
 
 #include 
 #include 
-#include 
-#include 
-#include 
-#include 
-#include 
 
 #include "sd.h"
-#include "scsi_priv.h"
+#include "sd_zbc.h"
 
 /**
  * Convert a zone descriptor to a zone struct.
diff --git a/drivers/scsi/sd_zbc.h b/drivers/scsi/sd_zbc.h
new file mode 100644
index ..5e7ecc9b2966
--- /dev/null
+++ b/drivers/scsi/sd_zbc.h
@@ -0,0 +1,62 @@
+#ifndef _SCSI_DISK_ZBC_H
+#define _SCSI_DISK_ZBC_H
+
+#ifdef CONFIG_BLK_DEV_ZONED
+
+int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buffer);
+void sd_zbc_remove(struct scsi_disk *sdkp);
+void sd_zbc_print_zones(struct scsi_disk *sdkp);
+
+int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd);
+void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd);
+
+int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd);
+int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd);
+void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
+struct scsi_sense_hdr *sshdr);
+
+#else /* CONFIG_BLK_DEV_ZONED */
+
+static inline int sd_zbc_read_zones(struct scsi_disk *sdkp,
+   unsigned char *buf)
+{
+   return 0;
+}
+
+static inline void sd_zbc_remove(struct scsi_disk *sdkp)
+{
+}
+
+static inline void sd_zbc_print_zones(struct scsi_disk *sdkp)
+{
+}
+
+static inline int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
+{
+   /* Let the drive fail requests */
+   return BLKPREP_OK;
+}
+
+static inline void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd)
+{
+}
+
+static inline int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd)
+{
+   return BLKPREP_INVALID;
+}
+
+static inline int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd)
+{
+   return BLKPREP_INVALID;
+}
+
+static inline void sd_zbc_complete(struct scsi_cmnd *cmd,
+  unsigned int good_bytes,
+  struct scsi_sense_hdr *sshdr)
+{
+}
+
+#endif /* CONFIG_BLK_DEV_ZONED */
+
+#endif /* _SCSI_DISK_ZBC_H */
-- 
2.13.5



Re: [PATCH V2 0/3] block/loop: handle discard/zeroout error

2017-09-07 Thread Ilya Dryomov
Hi Shaohua,

You wrote:
> BTW: blkdev_issue_zeroout retries if we immediately find the device doesn't
> support zeroout, but it doesn't retry if submit_bio_wait returns -EOPNOTSUPP.
> Is this correct behavior?

I sent a patch for that yesterday, see "[PATCH] block: cope with WRITE
SAME failing in blkdev_issue_zeroout()" on linux-block.  It checks for
-EREMOTEIO, because that's what I hit, but I wonder if it should check
for -EOPNOTSUPP from submit_bio_wait() as well.  Can you think of
a case where, given bdev_write_zeroes_sectors() != 0, submit_bio_wait()
would fail with -EOPNOTSUPP and we would want to do explicit zeroing?
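For readers following along, the retry logic in that patch is roughly
shaped like this (a simplified sketch, not the exact posted code;
issue_write_zeroes() and issue_zero_pages() are stand-ins for the
__blkdev_issue_* helpers that build and submit the bio chains):

	static int zeroout_sketch(struct block_device *bdev, sector_t sector,
				  sector_t nr_sects, gfp_t gfp_mask,
				  unsigned int flags)
	{
		/* did the device claim write zeroes offload support? */
		bool try_write_zeroes = !!bdev_write_zeroes_sectors(bdev);
		int ret;

	retry:
		if (try_write_zeroes)
			ret = issue_write_zeroes(bdev, sector, nr_sects, gfp_mask);
		else
			ret = issue_zero_pages(bdev, sector, nr_sects, gfp_mask);

		if (ret && try_write_zeroes) {
			if (!(flags & BLKDEV_ZERO_NOFALLBACK)) {
				/* offload failed, redo with explicit zeroing */
				try_write_zeroes = false;
				goto retry;
			}
			ret = -EOPNOTSUPP;
		}
		return ret;
	}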

Please excuse broken threading.

Thanks,

Ilya


[PATCH 2/2] scsi_transport_fc: fix NULL pointer dereference in fc_bsg_job_timeout

2017-09-07 Thread Christoph Hellwig
bsg-lib now embeds the job structure into the request, so req->special
can't be used anymore.
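For background (not part of the patch itself): blk_mq_rq_to_pdu() returns
the per-request payload that is allocated directly behind the request in
the same allocation, which is where bsg-lib now keeps the job:

	/* from include/linux/blk-mq.h: the driver PDU follows the request */
	static inline void *blk_mq_rq_to_pdu(struct request *rq)
	{
		return rq + 1;
	}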

Signed-off-by: Christoph Hellwig 
Cc: sta...@vger.kernel.org
---
 drivers/scsi/scsi_transport_fc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 892fbd9800d9..bea06de60827 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -3550,7 +3550,7 @@ fc_vport_sched_delete(struct work_struct *work)
 static enum blk_eh_timer_return
 fc_bsg_job_timeout(struct request *req)
 {
-   struct bsg_job *job = (void *) req->special;
+   struct bsg_job *job = blk_mq_rq_to_pdu(req);
struct Scsi_Host *shost = fc_bsg_to_shost(job);
struct fc_rport *rport = fc_bsg_to_rport(job);
struct fc_internal *i = to_fc_internal(shost->transportt);
-- 
2.11.0



two small bsg fixes, take 3

2017-09-07 Thread Christoph Hellwig
Now fully awake, with my fair share of coffee and lunch:


Two fixups for the recent bsg-lib fixes; they should go into 4.13 stable
as well.



Re: [PATCH 10/18] lightnvm: pblk: use bio_copy_kern when possible

2017-09-07 Thread Javier González
> On 7 Sep 2017, at 13.08, Christoph Hellwig  wrote:
> 
> On Wed, Sep 06, 2017 at 04:00:56PM +0200, Javier González wrote:
>>> Nope.  You want to loop over vmalloc_to_page and call bio_add_page
>>> for each page,
>> 
>> Yes. This is basically what I did before.
>> 
>>> after taking care of virtually tagged caches instead
>>> of this bounce buffering.
>> 
>> And thus I considered bio_copy_kern to be the better solution, since
>> over time it will take care of doing the vmalloc_to_page conversion
>> correctly for all cases.
> 
> bio_copy_kern copies all the data, so it is generally not a good
> idea.  The cache flushing isn't too hard - take a look at the XFS
> buffer cache for an existing version.
> 
> It would be good to just to do the right thing inside bio_map_kern
> for that so that callers don't need to care if it is vmalloced or
> not.

Yes. That would help. I know md also needs to manually add pages for
vmalloc'ed memory. Probably others do too.
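A minimal sketch of that page-by-page mapping, assuming the bio was
allocated with enough bio_vecs and leaving the cache flushing aside
(the helper name is made up for illustration):

	static int bio_add_vmalloc_buf(struct bio *bio, void *data,
				       unsigned int len)
	{
		while (len) {
			struct page *page = vmalloc_to_page(data);
			unsigned int off = offset_in_page(data);
			unsigned int bytes = min_t(unsigned int, len,
						   PAGE_SIZE - off);

			/* map one vmalloc page at a time */
			if (bio_add_page(bio, page, bytes, off) < bytes)
				return -EINVAL;

			data += bytes;
			len -= bytes;
		}
		return 0;
	}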

> 
>> Ok. So this would mean that targets (e.g., pblk) deal with struct
>> request instead of only dealing with bios and then letting the LightNVM
>> core transforming bios to requests. This way we can directly map to the
>> request. Is this what you mean?
> 
> Yes.
> 
>> Just out of curiosity, why is forming the bio through bio_copy_kern (or
>> manually doing the same) and then transforming it into a request
>> incorrect / worse?
> 
> Because you expose yourself to the details of mapping a bio to a request.
> We had to export blk_init_request_from_bio just for lightnvm to do this,
> and it also has to do various other odd bits with requests.  If you go
> through blk_rq_map_* the block layer takes care of all that for you.

Ok. It makes sense. I'll talk to Matias about it.
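If pblk does move to requests, the mapping could look something like this
(a sketch with made-up variable names, against the 4.13 request API):

	struct request *rq;
	int ret;

	rq = blk_get_request(q, REQ_OP_DRV_IN, GFP_KERNEL);
	if (IS_ERR(rq))
		return PTR_ERR(rq);

	/* the block layer builds and maps the bio for us */
	ret = blk_rq_map_kern(q, rq, kbuf, len, GFP_KERNEL);
	if (ret) {
		blk_put_request(rq);
		return ret;
	}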


Thanks!
Javier




Re: [PATCH 0/3] Move tagset reinit to its only current consumer, nvme-core

2017-09-07 Thread Max Gurtovoy



On 9/6/2017 6:30 PM, Sagi Grimberg wrote:

As suggested by Bart Van Assche, get rid of
blk_mq_reinit_tagset and move it to nvme-core (its
only current consumer).

Instead, introduce a more generic tagset iterator helper.

Sagi Grimberg (3):
   block: introduce blk_mq_tagset_iter
   nvme: introduce nvme_reinit_tagset
   block: remove blk_mq_reinit_tagset

  block/blk-mq-tag.c   | 11 +--
  drivers/nvme/host/core.c | 13 +
  drivers/nvme/host/fc.c   |  3 ++-
  drivers/nvme/host/nvme.h |  2 ++
  drivers/nvme/host/rdma.c |  7 +++
  include/linux/blk-mq.h   |  4 ++--
  6 files changed, 27 insertions(+), 13 deletions(-)



The series looks good to me,

Reviewed-by: Max Gurtovoy 

BTW, since we are talking about reinit_tagset, can you explain the
motivation for deregistering the MR and allocating a new one in the RDMA
transport layer?

We don't do that in iSER/SRP, so I wonder what is special here.
I got some local protection errors in MP tests and under high-load fio
(which caused command timeouts to expire, aborting the commands and then
reconnecting), and I wonder if there is a connection between the two. I'm
also debugging a situation where we might free/dereg/local-invalidate the
MR before the QP has moved to the ERR state.


Re: [PATCH 3/3] block: remove blk_mq_reinit_tagset

2017-09-07 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de                        +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 2/3] nvme: introduce nvme_reinit_tagset

2017-09-07 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de                        +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 1/3] block: introduce blk_mq_tagset_iter

2017-09-07 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de                        +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH V2 1/3] block/loop: don't hijack error number

2017-09-07 Thread Ming Lei
On Thu, Sep 7, 2017 at 8:13 AM, Shaohua Li  wrote:
> From: Shaohua Li 
>
> If the bio returns -EOPNOTSUPP, we shouldn't hijack it and return -EIO
>
> Signed-off-by: Shaohua Li 
> ---
>  drivers/block/loop.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 85de673..715b762 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -460,7 +460,7 @@ static void lo_complete_rq(struct request *rq)
> zero_fill_bio(bio);
> }
>
> -   blk_mq_end_request(rq, cmd->ret < 0 ? BLK_STS_IOERR : BLK_STS_OK);
> +   blk_mq_end_request(rq, errno_to_blk_status(cmd->ret));
>  }
>
>  static void lo_rw_aio_do_completion(struct loop_cmd *cmd)
> @@ -476,7 +476,7 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
>  {
> struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
>
> -   cmd->ret = ret;
> +   cmd->ret = ret > 0 ? 0 : ret;
> lo_rw_aio_do_completion(cmd);
>  }
>
> @@ -1706,7 +1706,7 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
>   failed:
> /* complete non-aio request */
> if (!cmd->use_aio || ret) {
> -   cmd->ret = ret ? -EIO : 0;
> +   cmd->ret = ret;
> blk_mq_complete_request(cmd->rq);
> }
>  }
> --
> 2.9.5
>

Looks fine:

  Reviewed-by: Ming Lei 


-- 
Ming Lei


Re: [PATCH V2 3/3] block/loop: suppress discard IO error message

2017-09-07 Thread Ming Lei
On Thu, Sep 7, 2017 at 8:13 AM, Shaohua Li  wrote:
> From: Shaohua Li 
>
> We don't know if fallocate really supports FALLOC_FL_PUNCH_HOLE until
> fallocate is called. If it isn't supported, loop returns -EOPNOTSUPP
> and we see a lot of error messages printed by blk_update_request. A
> discard IO failure isn't a big problem, so we just return 0 in loop,
> which suppresses the IO error message.

Setting RQF_QUIET for discard IO would be a cleaner way to suppress the
error message, and the upper layers would still get the failure
notification.
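A minimal sketch of what that could look like in loop_queue_rq()
(illustrative and untested):

	/* failures of these ops are expected, don't log them */
	if (req_op(rq) == REQ_OP_DISCARD ||
	    req_op(rq) == REQ_OP_WRITE_ZEROES)
		rq->rq_flags |= RQF_QUIET;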

-- 
Ming Lei


Re: [PATCH V2 2/3] block/loop: use FALLOC_FL_ZERO_RANGE for REQ_OP_WRITE_ZEROES

2017-09-07 Thread Ming Lei
On Thu, Sep 7, 2017 at 8:13 AM, Shaohua Li  wrote:
> From: Shaohua Li 
>
> REQ_OP_WRITE_ZEROES really means zero the data. And in blkdev_fallocate,
> FALLOC_FL_ZERO_RANGE will retry but FALLOC_FL_PUNCH_HOLE will not, even
> when the loop request doesn't have BLKDEV_ZERO_NOFALLBACK set.
>
> Signed-off-by: Shaohua Li 
> ---
>  drivers/block/loop.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 715b762..8934e25 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -426,6 +426,9 @@ static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
> int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
> int ret;
>
> +   if (req_op(rq) == REQ_OP_WRITE_ZEROES)
> +   mode = FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE;
> +
> if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> ret = -EOPNOTSUPP;
> goto out;
> --
> 2.9.5
>

Looks fine,

  Reviewed-by: Ming Lei 


-- 
Ming Lei