Re: [PATCH V3 7/8] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen
On Tue, Sep 05, 2017 at 10:23:25AM +0800, Ming Lei wrote:
> On Tue, Sep 05, 2017 at 01:40:11AM +0000, Bart Van Assche wrote:
> > On Tue, 2017-09-05 at 00:08 +0800, Ming Lei wrote:
> > > On Mon, Sep 04, 2017 at 03:40:35PM +0000, Bart Van Assche wrote:
> > > > Have you considered to use the blk-mq "reserved request" mechanism to
> > > > avoid starvation of power management requests instead of making the
> > > > block layer even more complicated than it already is?
> > >
> > > A reserved request is really a bad idea: it means the reserved request
> > > can't be used for normal I/O. We all know the request/tag space is
> > > precious, and some devices have a quite small tag space, such as SATA.
> > > This will definitely affect performance.
> >
> > Sorry, but I'm neither convinced that reserving a request for power
> > management would be a bad idea, nor that it would have a significant
> > performance impact, nor that it would be complicated to implement. Have
> > you noticed that the Linux ATA implementation already reserves a request
> > for internal use and thereby reduces the queue depth from 32 to 31 (see
> > also ATA_TAG_INTERNAL)? What I would like to know is whether the
> > performance impact of reserving a request is more or less than 1%.
>
> Firstly, we really can avoid the reservation. Why do we have to
> waste one precious tag just for PM, which may never happen on
> a machine during its whole uptime? For SATA, the internal tag is for EH,
> so I believe that reservation is inevitable.
>
> Secondly, reserving one tag may decrease the concurrent I/O depth by 1,
> which definitely hurts performance, especially for random I/O. Think
> about why NVMe increases its queue depth so much. Not to mention
> that there are some devices which have only one tag (.can_queue is 1) --
> how can you reserve one tag on this kind of device?
>
> Finally, bad results will follow if you reserve one tag for PM only.
> Suppose it is doable to reserve one tag: did you consider
> how to do that? Tags can be shared host-wide, so do you want to
> reserve one tag just for one request_queue?
>
> - If yes, lots of tags can be reserved/wasted for the unusual PM
>   or similar commands; even worse, the whole tag space of the HBA may
>   not be enough for the reservation if there are lots of LUNs on
>   this HBA.
>
> - If not, and you just reserve one tag for one HBA, then all
>   PM commands share the one reservation. During suspend/resume, all
>   these PM commands have to run exclusively (serialized) for disks
>   attached to the HBA, and that will slow down suspend/resume very
>   much because there may be lots of LUNs on this HBA.
>
> That is why I said reserving one tag is really bad, isn't it?

Hi Bart,

It looks like I have replied to or clarified all your questions/comments
on this patchset. So could you let me know if you still object to this
approach or patchset? If not, I think we need to move on and fix the real
issue.

--
Ming
Re: [GIT PULL] First set of block changes for 4.14-rc1
On Thu, Sep 07, 2017 at 01:47:44PM -0600, Jens Axboe wrote:
> On 09/07/2017 01:38 PM, Linus Torvalds wrote:
> > On Thu, Sep 7, 2017 at 12:27 PM, Jens Axboe wrote:
> >>
> >> Which was committed yesterday? It was not from my tree. I try to keep
> >> an eye out for potential conflicts or issues.
> >
> > It was from Andrew, so I'm assuming it was in linux-next. Not as a git
> > tree, but as the usual akpm branch.
> >
> > I'm not sure why I didn't see any reports from linux-next about it,
> > though - and I looked.
> >
> > There was no actual visible merge problem, because the conflict was
> > not a data conflict but a semantic one.
> >
> > But the issue showed up for a trivial allmodconfig build, so it
> > *should* have been caught automatically.
> >
> > But it's not even the conflict that annoys me.
> >
> > It's the *reason* for the conflict - the block layer churn that you
> > said was done. We've had too much of it.
> >
> >>> We need *stability* by now, after these crazy changes. Make sure
> >>> that happens.
>
> I might have missed a build notice from linux-next, since I know for a
> fact that all my stuff is in -next, updated daily, and Andrew's tree
> definitely is too.
>
> >> I am pushing back, but I can't embargo any sort of API change. This one
> >> has good reasoning behind it, which is actually nicely explained in the
> >> commit itself. It's not pointless churn, which I would define as change
> >> just for the sake of change itself. Or pointless cleanups.
> >
> > You can definitely put your foot down on any more random changes to
> > the bio infrastructure. Including for good reasons.
>
> And I am, but this one wasn't random. As I said, some of the previous
> releases have had more frivolous changes that should have been pushed
> back harder on. And particularly nvme, where fixes that go in after the
> merge window have been doing pointless cleanups while fixing issues,
> causing unnecessary pain for the next release. I am definitely pushing
> back hard on those, just last week after more of that showed up.
>
> You'll see exactly one of those when I send in the next merge request,
> which sucks, and which was the reason for the yelling in that area
> recently.
>
> > We have had some serious issues in the block layer - and I'm not
> > talking about the merge conflicts. I'm talking about just the
> > collateral damage, with things like SCSI having to revert using blk-mq
> > by default etc.
>
> I'm a bit puzzled as to why the suspend/resume thing has existed for so
> long, honestly. But some of these issues don't show themselves until you
> flip the switch. A lot of the production setups have been using scsi-mq
> for YEARS without issues, but obviously they don't hit this. Given how
> big of a switch this is, it's hard to avoid minor fallout. We're dealing
> with it.

Hi Jens,

If you mean the I/O hang during suspend/resume reported by Oleksandr, that
isn't blk-mq only. The root cause is in SCSI's quiesce design (it exists
even in blk-mq's quiesce, but there is no such issue there because no new
request is allocated in the context in which the queue is quiesced). The
issue is a rather long-standing one, since Cathy can reproduce this kind
of I/O hang on a 2.6.32 based kernel.

I am working on fixing that [1], but it looks like preempt quiesce still
needs to be nested, so I will post V4 with support for nested preempt
quiesce.

The real SCSI_MQ regression is the report of 'SCSI-MQ performance
regression' [2], and the commit 'scsi: default to scsi-mq' was reverted
after this report. A patch for this regression has been posted for several
rounds, and V4 [3] should be ready for merge; it looks like it was ignored
because of the merge timing?

[1] https://marc.info/?l=linux-scsi&m=150435774503165&w=2
[2] https://marc.info/?l=linux-kernel&m=150271934904399&w=2
[3] https://marc.info/?t=15043655572&r=1&w=2

--
Ming
Re: [PATCH 2/2] scsi_transport_fc: fix NULL pointer dereference in fc_bsg_job_timeout
On Thu, Sep 07, 2017 at 01:54:36PM +0200, Christoph Hellwig wrote:
> bsg-lib now embeds the job structure into the request, and req->special
> can't be used anymore.
>
> Signed-off-by: Christoph Hellwig
> Cc: sta...@vger.kernel.org
> ---
>  drivers/scsi/scsi_transport_fc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
> index 892fbd9800d9..bea06de60827 100644
> --- a/drivers/scsi/scsi_transport_fc.c
> +++ b/drivers/scsi/scsi_transport_fc.c
> @@ -3550,7 +3550,7 @@ fc_vport_sched_delete(struct work_struct *work)
>  static enum blk_eh_timer_return
>  fc_bsg_job_timeout(struct request *req)
>  {
> -	struct bsg_job *job = (void *) req->special;
> +	struct bsg_job *job = blk_mq_rq_to_pdu(req);
>  	struct Scsi_Host *shost = fc_bsg_to_shost(job);
>  	struct fc_rport *rport = fc_bsg_to_rport(job);
>  	struct fc_internal *i = to_fc_internal(shost->transportt);
> --
> 2.11.0
>

Reviewed-by: Ming Lei

--
Ming
[PATCH] block: tolerate tracing of NULL bio
__get_request() can call trace_block_getrq() with bio=NULL, which causes
block_get_rq::TP_fast_assign() to deref a NULL pointer and panic.

The syzkaller fuzzer panics with linux-next
(1d53d908b79d7870d89063062584eead4cf83448):

  kasan: GPF could be caused by NULL-ptr deref or user memory access
  general protection fault: [#1] SMP KASAN
  Modules linked in:
  CPU: 0 PID: 2983 Comm: syzkaller40 Not tainted 4.13.0-rc7-next-20170901+ #13
  task: 8801cf1da000 task.stack: 8801ce44
  RIP: 0010:perf_trace_block_get_rq+0x697/0x970 include/trace/events/block.h:384
  RSP: 0018:8801ce4473f0 EFLAGS: 00010246
  RAX: 8801cf1da000 RBX: 110039c88e84 RCX: 1d184d27
  RDX: dc01 RSI: 11003b643e7a RDI: e8c26938
  RBP: 8801ce447530 R08: 11003b643e6c R09: e8c26964
  R10: 0002 R11: f9184d2d R12: e8c1f890
  R13: e8c26930 R14: 85cad9e0 R15:
  FS: 02641880() GS:8801db20() knlGS:
  CS: 0010 DS: ES: CR0: 80050033 CR2: 0043e670 CR3: 0001d1d7a000 CR4: 001406f0
  DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400
  Call Trace:
   trace_block_getrq include/trace/events/block.h:423 [inline]
   __get_request block/blk-core.c:1283 [inline]
   get_request+0x1518/0x23b0 block/blk-core.c:1355
   blk_old_get_request block/blk-core.c:1402 [inline]
   blk_get_request+0x1d8/0x3c0 block/blk-core.c:1427
   sg_scsi_ioctl+0x117/0x750 block/scsi_ioctl.c:451
   sg_ioctl+0x192d/0x2ed0 drivers/scsi/sg.c:1070
   vfs_ioctl fs/ioctl.c:45 [inline]
   do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:685
   SYSC_ioctl fs/ioctl.c:700 [inline]
   SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
   entry_SYSCALL_64_fastpath+0x1f/0xbe

block_get_rq::TP_fast_assign() has multiple redundant ->dev assignments.
Only one of them is NULL tolerant. Favor the NULL tolerant one.

Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
Signed-off-by: Greg Thelen
---
 include/trace/events/block.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index f815aaaef755..1fd7ff1a46f7 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -397,7 +397,6 @@ DECLARE_EVENT_CLASS(block_get_rq,
 	TP_fast_assign(
 		__entry->dev		= bio ? bio_dev(bio) : 0;
-		__entry->dev		= bio_dev(bio);
 		__entry->sector		= bio ? bio->bi_iter.bi_sector : 0;
 		__entry->nr_sector	= bio ? bio_sectors(bio) : 0;
 		blk_fill_rwbs(__entry->rwbs,
@@ -414,7 +413,7 @@ DECLARE_EVENT_CLASS(block_get_rq,
 /**
  * block_getrq - get a free request entry in queue for block IO operations
  * @q: queue for operations
- * @bio: pending block IO operation
+ * @bio: pending block IO operation (can be %NULL)
  * @rw: low bit indicates a read (%0) or a write (%1)
  *
  * A request struct for queue @q has been allocated to handle the
@@ -430,7 +429,7 @@ DEFINE_EVENT(block_get_rq, block_getrq,
 /**
  * block_sleeprq - waiting to get a free request entry in queue for block IO operation
  * @q: queue for operation
- * @bio: pending block IO operation
+ * @bio: pending block IO operation (can be %NULL)
  * @rw: low bit indicates a read (%0) or a write (%1)
  *
  * In the case where a request struct cannot be provided for queue @q
--
2.14.1.581.gf28d330327-goog
BDI_CAP_STABLE_WRITES for stacked device (Re: Enable skip_copy can cause data integrity issue in some storage stack)
On Thu, Sep 07, 2017 at 11:11:24AM +1000, Neil Brown wrote:
> On Wed, Sep 06 2017, Shaohua Li wrote:
>
> > On Fri, Sep 01, 2017 at 03:26:41PM +0800, alexwu wrote:
> >> Hi,
> >>
> >> Recently a data integrity issue about skip_copy was found. We are able
> >> to reproduce it and found the root cause. This data integrity issue
> >> might happen if there are other layers between the file system and raid5.
> >>
> >> [How to Reproduce]
> >>
> >> 1. Create a raid5 named md0 first (with skip_copy enabled), and wait for
> >>    the md0 resync to finish, which ensures that all data and parity are
> >>    synchronized
> >> 2. Use lvm tools to create a logical volume named lv-md0 over md0
> >> 3. Format an ext4 file system on lv-md0 and mount it on /mnt
> >> 4. Do some db operations (e.g. sqlite insert) to write data through /mnt
> >> 5. When those db operations have finished, run
> >>    "echo check > /sys/block/md0/md/sync_action" to check mismatch_cnt;
> >>    it is very likely that we get mismatch_cnt != 0 when the check finishes
> >>
> >> [Root Cause]
> >>
> >> After tracing the code and more experiments, it is more proper to say
> >> that it's a problem with backing_dev_info (bdi) rather than a bug in
> >> skip_copy.
> >>
> >> We notice that:
> >>    1. skip_copy counts on BDI_CAP_STABLE_WRITES to ensure that the bio's
> >>       pages will not be modified before raid5 completes I/O, so we can
> >>       skip copying pages from the bio to the stripe cache
> >>    2. The ext4 file system will call wait_for_stable_page() to ask whether
> >>       the mapped bdi requires stable writes
> >>
> >> The data integrity issue happens because:
> >>    1. When raid5 enables skip_copy, it only sets BDI_CAP_STABLE_WRITES on
> >>       its own bdi, and this information does not propagate to the other
> >>       bdis between the file system and md
> >>    2. When the ext4 file system checks the stable-writes requirement by
> >>       calling wait_for_stable_page(), it can only see the immediately
> >>       underlying bdi's capability and cannot see all related bdis
> >>
> >> Thus, skip_copy works fine if we format the file system directly on md,
> >> but the data integrity issue happens if there are other block layers
> >> (e.g. dm) between the file system and md.
> >>
> >> [Result]
> >>
> >> We did more tests on different storage stacks; here are the results.
> >>
> >> The following settings can pass the test a thousand times:
> >>    1. raid5 with skip_copy enabled + ext4
> >>    2. raid5 with skip_copy disabled + ext4
> >>    3. raid5 with skip_copy disabled + lvm + ext4
> >>
> >> The following setting will fail the test within 10 rounds:
> >>    1. raid5 with skip_copy enabled + lvm + ext4
> >>
> >> I think the solution might be to let all bdis communicate through the
> >> different block layers, so we can pass the BDI_CAP_STABLE_WRITES
> >> information along when we enable skip_copy. But the current bdi
> >> structure does not allow us to do that.
> >>
> >> What would you suggest to do if we want to make skip_copy more reliable?
> >
> > Very interesting problem. Looks like this isn't raid5 specific: a stacked
> > device with block integrity enabled could expose the same issue.
> >
> > I'm cc'ing Jens and Neil to check if they have a good idea. Basically the
> > problem is that for a stacked disk, the lower layer disk has
> > BDI_CAP_STABLE_WRITES enabled, but the upper layer doesn't enable it. The
> > lower layer disk could be a raid5 with skip_copy enabled, or could be any
> > disk with block integrity enabled (which will enable
> > BDI_CAP_STABLE_WRITES). Currently we don't propagate the flag to the
> > upper layer disk.
> >
> > A solution is for blk_set_stacking_limits to propagate the flag at queue
> > setup; we then don't allow the flag to change at runtime later. The raid5
> > skip_copy interface changes the flag at runtime, which probably isn't
> > safe.
> >
> > Thanks,
> > Shaohua
>
> It has always been messy to handle dependencies from the bottom of the
> stack affecting things closer to the filesystem.
> We used to have issues with alignments and request sizes, and got rid of
> those problems by requiring all devices to accept all bio sizes, and
> providing support for them to split as necessary.
> This is a similar sort of problem, but also quite different.
>
> It is different because the STABLE_WRITES requirement doesn't affect the
> sort of bio that is passed down, but instead affects the way it is waited
> for.
>
> I have thought a few times that it would be useful if the "bd_claim"
> functionality included a callback so the claimed device could notify the
> claimer that things had changed.
> Then raid5 could notice that it has been claimed, and call the callback
> when it changes BDI_CAP_STABLE_WRITES. If the claimer is a stacked
> device, it can re-run blk_set_stacking_limits and call its own claimer.
>
> There are probably some interesting locking issues around this but, to
> me, it seems like the most likely path to a robust solution.

Fixing blk_set_stacking_limits is the first
Re: [PATCH V2 0/3] block/loop: handle discard/zeroout error
On Thu, Sep 07, 2017 at 03:20:01PM +0200, Ilya Dryomov wrote:
> Hi Shaohua,
>
> You wrote:
> > BTW: blkdev_issue_zeroout retries if we immediately find the device
> > doesn't support zeroout, but it doesn't retry if submit_bio_wait returns
> > -EOPNOTSUPP. Is this correct behavior?
>
> I sent a patch for that yesterday, see "[PATCH] block: cope with WRITE
> SAME failing in blkdev_issue_zeroout()" on linux-block. It checks for
> -EREMOTEIO, because that's what I hit, but I wonder if it should check
> for -EOPNOTSUPP from submit_bio_wait() as well. Can you think of
> a case where, given bdev_write_zeroes_sectors() != 0, submit_bio_wait()
> would fail with -EOPNOTSUPP and we would want to do explicit zeroing?

The loop block device used to return -EOPNOTSUPP for zeroout. With the
second patch, it doesn't any more, though.
Re: [PATCH V2 3/3] block/loop: suppress discard IO error message
On Thu, Sep 07, 2017 at 05:16:21PM +0800, Ming Lei wrote:
> On Thu, Sep 7, 2017 at 8:13 AM, Shaohua Li wrote:
> > From: Shaohua Li
> >
> > We don't know if fallocate really supports FALLOC_FL_PUNCH_HOLE till
> > fallocate is called. If it doesn't support it, loop will return
> > -EOPNOTSUPP and we see a lot of error messages printed by
> > blk_update_request. Failure of discard IO isn't a big problem, so we
> > just return 0 in loop, which will suppress the IO error message.
>
> Setting RQF_QUIET for discard IO should be a cleaner way of suppressing
> the error, and the upper layer can still get the failure notification.

Hmm, I forget why I didn't do this; I did consider it before. Probably
because nobody checks the return value of a discard request. Probably we
should skip the error message for all discard IO in the block layer. I'll
repost a patch to fix this issue.

Thanks,
Shaohua
Re: [GIT PULL] First set of block changes for 4.14-rc1
On 09/07/2017 01:38 PM, Linus Torvalds wrote:
> On Thu, Sep 7, 2017 at 12:27 PM, Jens Axboe wrote:
>>
>> Which was committed yesterday? It was not from my tree. I try to keep
>> an eye out for potential conflicts or issues.
>
> It was from Andrew, so I'm assuming it was in linux-next. Not as a git
> tree, but as the usual akpm branch.
>
> I'm not sure why I didn't see any reports from linux-next about it,
> though - and I looked.
>
> There was no actual visible merge problem, because the conflict was
> not a data conflict but a semantic one.
>
> But the issue showed up for a trivial allmodconfig build, so it
> *should* have been caught automatically.
>
> But it's not even the conflict that annoys me.
>
> It's the *reason* for the conflict - the block layer churn that you
> said was done. We've had too much of it.
>
>>> We need *stability* by now, after these crazy changes. Make sure
>>> that happens.

I might have missed a build notice from linux-next, since I know for a
fact that all my stuff is in -next, updated daily, and Andrew's tree
definitely is too.

>> I am pushing back, but I can't embargo any sort of API change. This one
>> has good reasoning behind it, which is actually nicely explained in the
>> commit itself. It's not pointless churn, which I would define as change
>> just for the sake of change itself. Or pointless cleanups.
>
> You can definitely put your foot down on any more random changes to
> the bio infrastructure. Including for good reasons.

And I am, but this one wasn't random. As I said, some of the previous
releases have had more frivolous changes that should have been pushed
back harder on. And particularly nvme, where fixes that go in after the
merge window have been doing pointless cleanups while fixing issues,
causing unnecessary pain for the next release. I am definitely pushing
back hard on those, just last week after more of that showed up.

You'll see exactly one of those when I send in the next merge request,
which sucks, and which was the reason for the yelling in that area
recently.

> We have had some serious issues in the block layer - and I'm not
> talking about the merge conflicts. I'm talking about just the
> collateral damage, with things like SCSI having to revert using blk-mq
> by default etc.

I'm a bit puzzled as to why the suspend/resume thing has existed for so
long, honestly. But some of these issues don't show themselves until you
flip the switch. A lot of the production setups have been using scsi-mq
for YEARS without issues, but obviously they don't hit this. Given how
big of a switch this is, it's hard to avoid minor fallout. We're dealing
with it.

> Christ, things like that are pretty *fundamnetal*, wouldn't you say.
> Get those right before doing more churn. And aim to have one or two
> releases literally just fixing things, with a "no churn" rule.

That's the plan...

--
Jens Axboe
Re: [GIT PULL] First set of block changes for 4.14-rc1
On Thu, Sep 7, 2017 at 12:27 PM, Jens Axboe wrote:
>
> Which was committed yesterday? It was not from my tree. I try to keep
> an eye out for potential conflicts or issues.

It was from Andrew, so I'm assuming it was in linux-next. Not as a git
tree, but as the usual akpm branch.

I'm not sure why I didn't see any reports from linux-next about it,
though - and I looked.

There was no actual visible merge problem, because the conflict was
not a data conflict but a semantic one.

But the issue showed up for a trivial allmodconfig build, so it
*should* have been caught automatically.

But it's not even the conflict that annoys me.

It's the *reason* for the conflict - the block layer churn that you
said was done. We've had too much of it.

>> We need *stability* by now, after these crazy changes. Make sure that
>> happens.
>
> I am pushing back, but I can't embargo any sort of API change. This one
> has good reasoning behind it, which is actually nicely explained in the
> commit itself. It's not pointless churn, which I would define as change
> just for the sake of change itself. Or pointless cleanups.

You can definitely put your foot down on any more random changes to
the bio infrastructure. Including for good reasons.

We have had some serious issues in the block layer - and I'm not
talking about the merge conflicts. I'm talking about just the
collateral damage, with things like SCSI having to revert using blk-mq
by default etc.

Christ, things like that are pretty *fundamnetal*, wouldn't you say.
Get those right before doing more churn. And aim to have one or two
releases literally just fixing things, with a "no churn" rule.

Because the block layer really has been a black spot lately.

              Linus
Re: [PATCH 00/13] bcache: fixes and update for 4.14
On 09/07/2017 12:51 PM, Eddie Chapman wrote:
> On 06/09/17 18:38, Coly Li wrote:
>> On 2017/9/6 下午11:46, Jens Axboe wrote:
>>> On 09/06/2017 09:41 AM, Coly Li wrote:
>>>> On 2017/9/6 下午10:20, Jens Axboe wrote:
>>>>> On Wed, Sep 06 2017, Coly Li wrote:
>>>>>> Hi Jens,
>>>>>>
>>>>>> Here are 12 patches for bcache fixes and updates; most of them were
>>>>>> posted by Eric Wheeler in the 4.13 merge window, but were delayed due
>>>>>> to lack of code review.
>>>>>
>>>>> Next time, please send this _before_ the merge window opens. Not a huge
>>>>> problem for this series, since it has been posted numerous times in the
>>>>> past and already has good review coverage, but it really should have
>>>>> been in linux-next for a week or two before heading upstream.
>>>>
>>>> Hi Jens,
>>>>
>>>> Copied, I will send patches _before_ the merge window opens. But could
>>>> you please give me a hint: how can I find out when a specific merge
>>>> window will open? I searched LKML, and I see people sending pull
>>>> requests for the 4.14 merge window, but I don't see any announcement of
>>>> the 4.14 merge window time slot.
>>>
>>> The merge window opens when Linus releases the previous kernel. So you
>>> have to try and keep track of when he expects to do that. That really
>>> isn't that difficult - Linus always cuts a release on Sundays. We always
>>> have 7 or 8 rc releases, so a good time to check is his mail on the -rc6
>>> or -rc7 release; that usually gives a good indication of when he expects
>>> to release the final.
>>>
>>> To keep things simple, if you always have things ready when -rc7 is
>>> released, then you are at least one week out from the merge window,
>>> possibly two if we end up doing an -rc8. That means you don't have to
>>> track anything; you know the exact date of when -rc7 is released since
>>> Linus's schedule is usually reliable.
>>
>> Hi Jens,
>>
>> Copied, I will follow the above rule in the next cycle.
>>
>> And I just posted the last patch
>> (0013-bcache-initialize-stripe_sectors_dirty-correctly-for.patch) to you
>> for the 4.14 cycle.
>>
>> This patch is an important fix for the bcache writeback rate calculation;
>> another patch I sent to you
>> (0006-bcache-correct-cache_dirty_target-in-__update_writeb.patch) depends
>> on it, so please have it in 4.14.
>>
>> Thanks in advance.
>
> Hello Jens,
>
> FWIW I'd like to support Coly by reporting that patches 0001, 0002, 0006
> and 0013 (the last one mentioned above) have been tested by me on 2
> production servers with a good deal of bcache activity for the last 6
> weeks without any issues having arisen.
>
> But this may not be much use to you to know, as these are running
> vanilla kernel.org kernel 4.4.77. Still, these 4 patches exactly as
> Coly has sent them to you apply without any changes to the code, so I
> guess it vouches for the code at least somewhat.

That's all good and fine, and it's better than no testing at all. But it
doesn't change the fact that anyone using bcache on -next would have been
using the changes for a while before release, instead of just one or two
people. Additionally, your base is drastically different from mainline,
since you're running a kernel that's a year and a half old.

Next time I'm not adding anything post merge window open, unless it's
addressing a problem that was added in the merge window.

--
Jens Axboe
Re: [GIT PULL] First set of block changes for 4.14-rc1
On 09/07/2017 01:04 PM, Linus Torvalds wrote:
> On Tue, Sep 5, 2017 at 7:31 AM, Jens Axboe wrote:
>>
>> Note that you'll hit a conflict in block/bio-integrity.c and
>> mm/page_io.c, since we had fixes later in the 4.13 series for both of
>> those that ended up conflicting with new changes. Both are trivial to
>> fix up, I've included my resolution at the end of this email.
>
> Actually, the _real_ conflict is with commit 8e654f8fbff5 ("zram: read
> page from backing device") and the preceding code, which added a
> "zram->bdev" that is the backing dev for the zram device.

Which was committed yesterday? It was not from my tree. I try to keep
an eye out for potential conflicts or issues.

> And then it does
>
>     bio->bi_bdev = zram->bdev;
>
> but now the block layer has AGAIN changed some stupid internal detail
> for no obvious reason, so now it doesn't work.
>
> My initial thought was to just do
>
>     bio->bi_disk = zram->disk;
>
> instead, which *syntactically* seems to make sense, and match things
> up in the obvious way.
>
> But when I actually thought about it more, I decided that's bullshit.
> "zram->disk" is not the backing device at all, it's the zram interface
> itself.
>
> The correct resolution seems to be
>
>     bio_set_dev(bio, zram->bdev);
>
> but *dammit*, Jens, this insane crazy constant "let's randomly change
> block layer details around" needs to STOP.
>
> This has been going on for too many releases by now. What the f*ck is
> *wrong* with you and Christoph that you can't just leave this thing
> alone?
>
> Stop the pointless churn already.
>
> When I saw your comment:
>
>     It's a quiet series this round, which I think we needed after
>     the churn of the last few series.
>
> I was like "finally". And then this shows that you were just lying,
> and the pointless churn is still happening.
>
> Stop it. Really.
>
> Here's a hint: if some stupid change requires you to mindlessly do 150
> changes using a wrapper helper, it's churn.
>
> We need *stability* by now, after these crazy changes. Make sure that
> happens.

I am pushing back, but I can't embargo any sort of API change. This one has
good reasoning behind it, which is actually nicely explained in the commit
itself. It's not pointless churn, which I would define as change just for
the sake of change itself. Or pointless cleanups.

That said, there are no plans to mix anything up for the next release.
We've had more frivolous changes in past releases; I don't expect that to
happen again.

--
Jens Axboe
Re: [GIT PULL] First set of block changes for 4.14-rc1
On Tue, Sep 5, 2017 at 7:31 AM, Jens Axboe wrote:
>
> Note that you'll hit a conflict in block/bio-integrity.c and
> mm/page_io.c, since we had fixes later in the 4.13 series for both of
> those that ended up conflicting with new changes. Both are trivial to
> fix up, I've included my resolution at the end of this email.

Actually, the _real_ conflict is with commit 8e654f8fbff5 ("zram: read
page from backing device") and the preceding code, which added a
"zram->bdev" that is the backing dev for the zram device.

And then it does

    bio->bi_bdev = zram->bdev;

but now the block layer has AGAIN changed some stupid internal detail
for no obvious reason, so now it doesn't work.

My initial thought was to just do

    bio->bi_disk = zram->disk;

instead, which *syntactically* seems to make sense, and match things
up in the obvious way.

But when I actually thought about it more, I decided that's bullshit.
"zram->disk" is not the backing device at all, it's the zram interface
itself.

The correct resolution seems to be

    bio_set_dev(bio, zram->bdev);

but *dammit*, Jens, this insane crazy constant "let's randomly change
block layer details around" needs to STOP.

This has been going on for too many releases by now. What the f*ck is
*wrong* with you and Christoph that you can't just leave this thing
alone?

Stop the pointless churn already.

When I saw your comment:

    It's a quiet series this round, which I think we needed after
    the churn of the last few series.

I was like "finally". And then this shows that you were just lying,
and the pointless churn is still happening.

Stop it. Really.

Here's a hint: if some stupid change requires you to mindlessly do 150
changes using a wrapper helper, it's churn.

We need *stability* by now, after these crazy changes. Make sure that
happens.

              Linus
Re: [PATCH 00/13] bcache: fixes and update for 4.14
On 06/09/17 18:38, Coly Li wrote: On 2017/9/6 下午11:46, Jens Axboe wrote: On 09/06/2017 09:41 AM, Coly Li wrote: On 2017/9/6 下午10:20, Jens Axboe wrote: On Wed, Sep 06 2017, Coly Li wrote: Hi Jens, Here are 12 patchs for bcache fixes and updates, most of them were posted by Eric Wheeler in 4.13 merge window, but delayed due to lack of code review. Next time, please send this _before_ the merge window opens. Not a huge problem for this series, since it has been posted numerous times in the past and already has good review coverage, but it really should have been in linux-next for a week or two before heading upstream. Hi Jens, Copied, send patches _before_ merge window opens. But could you please to give me a hint, how can I find when a specific merge window will open ? I search LKML, and see people send pull requests for 4.14 merge window, but I don't see any announcement for 4.14 merge window time slot. The merge window opens when Linus releases the previous kernel. So you have to try and keep track of when he expects to do that. That really isn't that difficult - Linus always cuts a release on Sundays. We always have 7 or 8 rc releases, so a good time to check is his mail on -rc6 or -rc7 release, that usually gives a good indication of when he expects to release the final. To keep things simple, if you always have things ready when -rc7 is released, then you are at least one week out from the merge window, possibly two if we end up doing an -rc8. That means you don't have to track anything, you know the exact date of when -rc7 is released since Linus's schedule is usually reliable. Hi Jens, Copied, I will follow the above rule in next cycle. And I just post the last patch (0013-bcache-initialize-stripe_sectors_dirty-correctly-for.patch) to you for 4.14 cycle. This patch is an important fix for bcache writeback rate calculation, it is depended by another patch I sent to you (0006-bcache-correct-cache_dirty_target-in-__update_writeb.patch), please have it in 4.14. 
> Thanks in advance.

I'm sorry, a correction to my last mail: the patches I have tested should
read 0002, 0003, 0006 and 0013.
Re: [PATCH 00/13] bcache: fixes and update for 4.14
On 06/09/17 18:38, Coly Li wrote:
> [...]
> Thanks in advance.

Hello Jens,

FWIW I'd like to support Coly by reporting that patches 0001, 0002, 0006 and
0013 (the last one mentioned above) have been tested by me on 2 production
servers with a good deal of bcache activity for the last 6 weeks, without
any issues having arisen. This may not be much use to you to know, as these
servers are running vanilla kernel.org kernel 4.4.77. Still, these 4
patches, exactly as Coly has sent them to you, apply without any changes to
the code, so I guess it vouches for the code at least somewhat.

Eddie
Re: [PATCH] blk-mq-sched: remove the empty entry in token wait list
On Tue, Aug 29, 2017 at 10:52:13PM +0800, 查斌 wrote:
> From: Bin Zha
>
> When kyber adjusts the sync and other write depths to the minimum (1),
> there is a case that may cause requests to be stalled in the
> kyber_hctx_data list.
>
> The following example is one I have tested:
>
>     CPU7
>   block_rq_insert
>   add_wait_queue
>   __sbitmap_queue_get           CPU23
>   block_rq_issue              block_rq_insert
>   block_rq_complete   -->     waiting token free
>      |                        block_rq_issue
>     /|\                       block_rq_complete
>      |                            ||
>      |                            ||
>      |                            \|/
>      |                          CPU29
>      |                        block_rq_insert
>      |                        waiting token free
>      |                        block_rq_issue
>      |----------------------  block_rq_complete
>
>     CPU1
>   block_rq_insert
>   waiting token free
>
> The I/O request completed on CPU29 will wake up CPU7, because CPU7 has
> been added to the waitqueue in kyber_get_domain_token(). But that is an
> empty waiter, and it won't wake up CPU1. If no other requests are issued
> to push it, the requests will stall in the kyber_hctx_data list.
>
> Signed-off-by: Bin Zha
>
> diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
> index b9faabc..584bfd5 100644
> --- a/block/kyber-iosched.c
> +++ b/block/kyber-iosched.c
> @@ -548,6 +548,10 @@ static int kyber_get_domain_token(struct kyber_queue_data *kqd,
>  		 * queue.
>  		 */
>  		nr = __sbitmap_queue_get(domain_tokens);
> +		if (nr >= 0) {
> +			remove_wait_queue(&ws->wait, wait);
> +			INIT_LIST_HEAD(&wait->task_list);
> +		}
>  	}
>  	return nr;
>  }

Hm... I could've sworn I thought about this case when I wrote this code,
but your analysis looks correct. I do remember that I didn't do this
because I was worried about a race like so:

add_wait_queue()
			kyber_domain_wake()
			  \_ list_del_init()
remove_wait_queue()
  \_ list_del()

But thinking about it some more, list_del() is probably safe after
list_del_init()? Have you been able to reproduce this?
I have the following patch to force the domain token pools to one token; I
imagine with the right workload we can hit it:

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index f58cab8..b4e1bd7 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -58,9 +58,9 @@ enum {
  * So, we cap these to a reasonable value.
  */
 static const unsigned int kyber_depth[] = {
-	[KYBER_READ] = 256,
-	[KYBER_SYNC_WRITE] = 128,
-	[KYBER_OTHER] = 64,
+	[KYBER_READ] = 1,
+	[KYBER_SYNC_WRITE] = 1,
+	[KYBER_OTHER] = 1,
 };
 
 /*
@@ -126,6 +126,7 @@ enum {
 #define IS_GOOD(status) ((status) > 0)
 #define IS_BAD(status) ((status) < 0)
 
+#if 0
 static int kyber_lat_status(struct blk_stat_callback *cb,
 			    unsigned int sched_domain, u64 target)
 {
@@ -243,6 +244,7 @@ static void kyber_adjust_other_depth(struct kyber_queue_data *kqd,
 	if (depth != orig_depth)
 		sbitmap_queue_resize(&kqd->domain_tokens[KYBER_OTHER], depth);
 }
+#endif
 
 /*
  * Apply heuristics for limiting queue depths based on gathered latency
@@ -250,6 +252,8 @@ static void kyber_adjust_other_depth(struct kyber_queue_data *kqd,
  */
 static void kyber_stat_timer_fn(struct blk_stat_callback *cb)
 {
+	return;
+#if 0
 	struct kyber_queue_data *kqd = cb->data;
 	int read_status, write_status;
 
@@ -269,6 +273,7 @@ static void kyber_stat_timer_fn(struct blk_stat_callback *cb)
 	    ((IS_BAD(read_status) || IS_BAD(write_status) ||
 	      kqd->domain_tokens[KYBER_OTHER].sb.depth < kyber_depth[KYBER_OTHER])))
 		blk_stat_activate_msecs(kqd->cb, 100);
+#endif
 }
 
 static unsigned int kyber_sched_tags_shift(struct kyber_queue_data *kqd)
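The stall described above is easy to model outside the kernel. Below is a rough userspace sketch (Python stand-ins with names of my own choosing, not kyber's actual API) of a depth-1 token pool with a FIFO wait queue: the single wakeup issued on completion is consumed by the stale "empty waiter" unless the entry is removed after the successful retry, which is exactly what the patch does.

```python
class TokenPool:
    """Toy model of one kyber domain token pool (depth 1) plus its wait
    queue. 'fixed' applies the remove-stale-entry logic from the patch."""

    def __init__(self, fixed):
        self.free = 1
        self.waiters = []      # FIFO wait queue (domain_wait analogue)
        self.fixed = fixed

    def get_token_racy(self, entry):
        # Caller found no free token, added itself to the wait queue, and
        # then the retry in kyber_get_domain_token() succeeds because a
        # token was freed in between.
        self.waiters.append(entry)
        self.free -= 1
        if self.fixed:
            self.waiters.remove(entry)  # the patch's remove_wait_queue()
        return True

    def wait_for_token(self, entry):
        self.waiters.append(entry)

    def put_token(self):
        # Completion path: free the token and wake exactly one waiter.
        # A stale entry whose owner already holds a token eats the wakeup.
        self.free += 1
        if self.waiters:
            self.waiters.pop(0)()


def run(fixed):
    woken = []
    pool = TokenPool(fixed)
    # CPU7 races: it queues itself, then wins the token on the retry.
    pool.get_token_racy(lambda: woken.append("cpu7"))
    # CPU1 now really has to wait for the token.
    pool.wait_for_token(lambda: woken.append("cpu1"))
    # CPU7's request completes: the single wakeup goes to CPU7's stale
    # entry unless it was removed, so CPU1 stalls in the unfixed case.
    pool.put_token()
    return woken


assert run(fixed=False) == ["cpu7"]   # stale entry consumed the wakeup
assert run(fixed=True) == ["cpu1"]    # with the fix, CPU1 is woken
```

`run(fixed=True)` models the behavior with the proposed remove-on-success; the difference between the two assertions is the lost wakeup.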
[PATCH V2 11/12] scsi: sd: Introduce scsi_disk_from_queue()
Using scsi_device_from_queue(), return the scsi_disk structure associated
with a request queue if the device is a disk. Export this function to make
it available to modules.

Signed-off-by: Damien Le Moal
---
 drivers/scsi/sd.c | 32
 drivers/scsi/sd.h |  2 ++
 2 files changed, 34 insertions(+)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 1cd113df2d3a..ff9fff4de3e6 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -631,6 +631,38 @@ static void scsi_disk_put(struct scsi_disk *sdkp)
 	mutex_unlock(&sd_ref_mutex);
 }
 
+static int scsi_disk_lookup(struct device *dev, void *data)
+{
+	struct scsi_disk **sdkp = data;
+
+	if (!*sdkp && dev->class == &sd_disk_class)
+		*sdkp = to_scsi_disk(dev);
+
+	return 0;
+}
+
+/**
+ * scsi_disk_from_queue - return scsi disk associated with a request_queue
+ * @q: The request queue to return the scsi disk from
+ *
+ * Return the struct scsi_disk associated with a request queue, or NULL if
+ * the request_queue does not reference a SCSI device or if the device is
+ * not a disk.
+ */
+struct scsi_disk *scsi_disk_from_queue(struct request_queue *q)
+{
+	struct scsi_device *sdev = scsi_device_from_queue(q);
+	struct scsi_disk *sdkp = NULL;
+
+	if (!sdev)
+		return NULL;
+
+	device_for_each_child(&sdev->sdev_gendev, &sdkp, scsi_disk_lookup);
+
+	return sdkp;
+}
+EXPORT_SYMBOL_GPL(scsi_disk_from_queue);
+
 #ifdef CONFIG_BLK_SED_OPAL
 static int sd_sec_submit(void *data, u16 spsp, u8 secp, void *buffer,
 			 size_t len, bool send)
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 3ea82c8cecd5..92113a9e2b20 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -126,6 +126,8 @@ static inline struct scsi_disk *scsi_disk(struct gendisk *disk)
 	return container_of(disk->private_data, struct scsi_disk, driver);
 }
 
+struct scsi_disk *scsi_disk_from_queue(struct request_queue *q);
+
 #define sd_printk(prefix, sdsk, fmt, a...)				\
 	(sdsk)->disk ?							\
 	      sdev_prefix_printk(prefix, (sdsk)->device,		\
-- 
2.13.5
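For readers unfamiliar with the driver-core idiom used in scsi_disk_from_queue(), here is a loose userspace analogue (Python classes of my own invention, not the real driver model): the lookup callback returns 0 so iteration visits every child, the result travels back through the data argument, and the `!*sdkp` check makes the first match sticky.

```python
class Device:
    """Stand-in for struct device with a class pointer and children."""
    def __init__(self, cls, children=()):
        self.cls = cls
        self.children = list(children)


def device_for_each_child(dev, data, fn):
    # Analogue of the driver-core helper: call fn(child, data) for every
    # child, stopping early only on a non-zero return.
    for child in dev.children:
        ret = fn(child, data)
        if ret:
            return ret
    return 0


def scsi_disk_from_queue(sdev):
    # 'found' plays the role of the struct scsi_disk ** passed as 'data'.
    found = []

    def scsi_disk_lookup(dev, data):
        if not data and dev.cls == "sd_disk_class":
            data.append(dev)
        return 0   # keep iterating; the first match wins via the check

    device_for_each_child(sdev, found, scsi_disk_lookup)
    return found[0] if found else None


sdev = Device("scsi_device", [Device("bsg_class"), Device("sd_disk_class")])
assert scsi_disk_from_queue(sdev).cls == "sd_disk_class"
assert scsi_disk_from_queue(Device("scsi_device")) is None
```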
[PATCH V2 12/12] scsi: Introduce ZBC disk I/O scheduler
The zoned I/O scheduler is mostly identical to mq-deadline and retains the
same configuration attributes. The main difference is that the zoned
scheduler will ensure that at any time at most one write request (command)
per sequential zone is in flight (has been issued to the disk), in order to
protect against sequential write reordering potentially resulting from the
concurrent execution of request dispatch by multiple contexts.

This is achieved similarly to the legacy SCSI I/O path, by write locking
zones when a write request is issued. Subsequent writes to the same zone
have to wait for the completion of the previously issued write before being
in turn dispatched to the disk. This ensures that sequential writes are
processed in the correct order without needing any modification to the
execution model of blk-mq. In addition, this also protects against write
reordering at the HBA level (e.g. AHCI).

This zoned scheduler code is added under the drivers/scsi directory so that
information managed using the scsi_disk structure can be directly
referenced. Doing so avoids cluttering the block layer with device type
specific code.

Signed-off-by: Damien Le Moal
---
 Documentation/block/zoned-iosched.txt |  48 ++
 block/Kconfig.iosched                 |  12 +
 drivers/scsi/Makefile                 |   1 +
 drivers/scsi/sd.h                     |   3 +-
 drivers/scsi/sd_zbc.c                 |  16 +-
 drivers/scsi/sd_zbc.h                 |   8 +-
 drivers/scsi/zoned_iosched.c          | 939 ++
 7 files changed, 1015 insertions(+), 12 deletions(-)
 create mode 100644 Documentation/block/zoned-iosched.txt
 create mode 100644 drivers/scsi/zoned_iosched.c

diff --git a/Documentation/block/zoned-iosched.txt b/Documentation/block/zoned-iosched.txt
new file mode 100644
index 000000000000..b269125bdc61
--- /dev/null
+++ b/Documentation/block/zoned-iosched.txt
@@ -0,0 +1,48 @@
+Zoned I/O scheduler
+===================
+
+The Zoned I/O scheduler solves the zoned block device write ordering
+problems inherent to the absence of a global request queue lock in the
+blk-mq infrastructure. Multiple contexts may try to simultaneously dispatch
+write requests to the same sequential zone of a zoned block device,
+potentially breaking the sequential write order imposed by the device.
+
+The Zoned I/O scheduler is based on the mq-deadline scheduler. It shares
+the same tunables and behaves in a comparable manner. The main difference
+introduced with the zoned scheduler is the handling of write batches.
+Whereas mq-deadline will keep dispatching write requests to the device as
+long as the batching size allows and reads are not starved, the zoned
+scheduler introduces additional constraints:
+1) At most one write request can be issued to a sequential zone of the
+device. This ensures that no reordering of sequential writes to a
+sequential zone can happen once the write request leaves the scheduler
+internal queue (rb tree).
+2) The number of sequential zones that can be simultaneously written is
+limited to the device advertised maximum number of open zones. This
+additional condition avoids performance degradation due to excessive open
+zone resource use at the device level.
+
+These conditions do not apply to write requests targeting conventional
+zones. For these, the zoned scheduler behaves exactly like the mq-deadline
+scheduler.
+
+The zoned I/O scheduler cannot be used with regular block devices. It can
+only be used with host-managed or host-aware zoned block devices.
+Using the zoned I/O scheduler is mandatory with host-managed disks unless
+the disk user itself tightly controls write sequencing to sequential zones.
+The zoned scheduler will treat host-aware disks exactly the same way as
+host-managed devices. That is, even though host-aware disks can be randomly
+written, the zoned scheduler will still impose the limit of one write per
+zone so that sequential write sequences are preserved.
+
+For host-managed disks, automating the use of the zoned scheduler can be
+done simply with a udev rule. An example of such a rule is shown below.
+
+# Set zoned scheduler for host-managed zoned block devices
+ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/zoned}=="host-managed", \
+	ATTR{queue/scheduler}="zoned"
+
+Zoned I/O scheduler tunables
+============================
+
+Tunables of the Zoned I/O scheduler are identical to those of the deadline
+I/O scheduler. See Documentation/block/deadline-iosched.txt for details.
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index fd2cefa47d35..b87c67dbf1f6 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -70,6 +70,18 @@ config MQ_IOSCHED_DEADLINE
 	---help---
 	  MQ version of the deadline IO scheduler.
 
+config MQ_IOSCHED_ZONED
+	tristate "Zoned I/O scheduler"
+	depends on BLK_DEV_ZONED
+	depends on SCSI_MOD
+	depends on BLK_DEV_SD
+
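The dispatch constraint the documentation describes can be sketched in a few lines of userspace code (Python, with invented names; the real scheduler works on rb-tree sorted requests and also enforces the open-zone limit, which this sketch omits): writes to a locked sequential zone are held back, not reordered, while conventional-zone writes and writes to other zones may proceed.

```python
def pick_next_write(pending, seq_zones, wlock):
    """Return the index in 'pending' of the next write the zoned scheduler
    may dispatch: writes to conventional zones always pass; a write to a
    sequential zone passes only if that zone is not already write-locked
    (i.e. no write to it is in flight). Returns None if all are blocked."""
    for i, (zone, _name) in enumerate(pending):
        if zone not in seq_zones:
            return i                 # conventional zone: no constraint
        if zone not in wlock:
            wlock.add(zone)          # write-lock the sequential zone
            return i
    return None


seq_zones = {5}                      # zone 5 is sequential, zone 3 is not
wlock = set()
pending = [(5, "w1"), (5, "w2"), (3, "w3")]

i = pick_next_write(pending, seq_zones, wlock)
assert pending.pop(i) == (5, "w1")   # w1 dispatched, zone 5 locked

i = pick_next_write(pending, seq_zones, wlock)
assert pending.pop(i) == (3, "w3")   # w2 must wait, but w3 may pass

assert pick_next_write(pending, seq_zones, wlock) is None  # zone 5 busy

wlock.discard(5)                     # w1 completes: unlock the zone
i = pick_next_write(pending, seq_zones, wlock)
assert pending.pop(i) == (5, "w2")   # order within zone 5 is preserved
```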
[PATCH V2 10/12] scsi: sd_zbc: Disable zone write locking with scsi-mq
In the case of a ZBC disk used with scsi-mq, zone write locking does not
prevent write reordering in sequential zones. Unlike the legacy case, zone
locking can only be done after the command request is removed from the
scheduler dispatch queue. That is, at the time of zone locking, the write
command may already be out of order.

Disable zone write locking in sd_zbc_write_lock_zone() if the disk is used
with scsi-mq. Write order guarantees can be provided by an adapted I/O
scheduler.

Signed-off-by: Damien Le Moal
Reviewed-by: Bart Van Assche
---
 drivers/scsi/sd_zbc.c | 4
 1 file changed, 4 insertions(+)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index bcece82a1d8d..6b1a7f9c1e90 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -264,6 +264,10 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
 	    (sector & (zone_sectors - 1)) + blk_rq_sectors(rq) > zone_sectors)
 		return BLKPREP_KILL;
 
+	/* No write locking with scsi-mq */
+	if (rq->q->mq_ops)
+		return BLKPREP_OK;
+
 	/*
 	 * There is no write constraint on conventional zones, but do not issue
 	 * more than one write at a time per sequential zone. This avoids write
-- 
2.13.5
[PATCH V2 09/12] scsi: sd_zbc: Limit zone write locking to sequential zones
Zoned block devices have no write constraints for conventional zones, so
write locking of conventional zones is not necessary and can hurt
performance. To avoid this, introduce the seq_zones bitmap to indicate if a
zone is a sequential one. Use this information to allow any write to be
issued to conventional zones, locking only sequential zones.

As the zone bitmap allocation for seq_zones is identical to that of the
zone write lock bitmap, introduce the helper sd_zbc_alloc_zone_bitmap().
Using this helper, wait for the disk capacity and number of zones to
stabilize on the second revalidation pass to allocate and initialize the
bitmaps.

Signed-off-by: Damien Le Moal
Reviewed-by: Bart Van Assche
---
 drivers/scsi/sd.h     |   1 +
 drivers/scsi/sd_zbc.c | 108 --
 drivers/scsi/sd_zbc.h |  12 ++
 3 files changed, 108 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index c9a627e7eebd..3ea82c8cecd5 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -77,6 +77,7 @@ struct scsi_disk {
 	unsigned int	zone_blocks;
 	unsigned int	zone_shift;
 	unsigned long	*zones_wlock;
+	unsigned long	*seq_zones;
 	unsigned int	zones_optimal_open;
 	unsigned int	zones_optimal_nonseq;
 	unsigned int	zones_max_open;
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 8b258254a2bb..bcece82a1d8d 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -252,7 +252,7 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
 	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 	sector_t sector = blk_rq_pos(rq);
 	sector_t zone_sectors = sd_zbc_zone_sectors(sdkp);
-	unsigned int zno = sd_zbc_zone_no(sdkp, sector);
+	unsigned int zno;
 
 	/*
 	 * Note: Checks of the alignment of the write command on
@@ -265,13 +265,15 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
 		return BLKPREP_KILL;
 
 	/*
-	 * Do not issue more than one write at a time per
-	 * zone. This solves write ordering problems due to
-	 * the unlocking of the request queue in the dispatch
-	 * path in the non scsi-mq case.
+	 * There is no write constraint on conventional zones, but do not issue
+	 * more than one write at a time per sequential zone. This avoids write
+	 * ordering problems due to the unlocking of the request queue in the
+	 * dispatch path of the non scsi-mq (legacy) case.
 	 */
-	if (sdkp->zones_wlock &&
-	    test_and_set_bit(zno, sdkp->zones_wlock))
+	zno = sd_zbc_zone_no(sdkp, sector);
+	if (!test_bit(zno, sdkp->seq_zones))
+		return BLKPREP_OK;
+	if (test_and_set_bit(zno, sdkp->zones_wlock))
 		return BLKPREP_DEFER;
 
 	WARN_ON_ONCE(cmd->flags & SCMD_ZONE_WRITE_LOCK);
@@ -289,8 +291,9 @@ void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd)
 	struct request *rq = cmd->request;
 	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 
-	if (sdkp->zones_wlock && cmd->flags & SCMD_ZONE_WRITE_LOCK) {
+	if (cmd->flags & SCMD_ZONE_WRITE_LOCK) {
 		unsigned int zno = sd_zbc_zone_no(sdkp, blk_rq_pos(rq));
+
 		WARN_ON_ONCE(!test_bit(zno, sdkp->zones_wlock));
 		cmd->flags &= ~SCMD_ZONE_WRITE_LOCK;
 		clear_bit_unlock(zno, sdkp->zones_wlock);
@@ -507,8 +510,67 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
 	return 0;
 }
 
+/*
+ * Initialize the sequential zone bitmap to allow identifying sequential
+ * zones.
+ */
+static int sd_zbc_setup_seq_zones(struct scsi_disk *sdkp)
+{
+	unsigned long *seq_zones;
+	sector_t block = 0;
+	unsigned char *buf;
+	unsigned char *rec;
+	unsigned int buf_len;
+	unsigned int list_length;
+	unsigned int n = 0;
+	int ret = -ENOMEM;
+
+	/* Allocate zone type bitmap */
+	seq_zones = sd_zbc_alloc_zone_bitmap(sdkp);
+	if (!seq_zones)
+		return -ENOMEM;
+
+	buf = kmalloc(SD_ZBC_BUF_SIZE, GFP_KERNEL);
+	if (!buf)
+		goto out;
+
+	while (block < sdkp->capacity) {
+
+		ret = sd_zbc_report_zones(sdkp, buf, SD_ZBC_BUF_SIZE, block);
+		if (ret)
+			goto out;
+
+		/* Parse reported zone descriptors */
+		list_length = get_unaligned_be32(&buf[0]) + 64;
+		rec = buf + 64;
+		buf_len = min(list_length, SD_ZBC_BUF_SIZE);
+		while (rec < buf + buf_len) {
+			if ((rec[0] & 0x0f) != ZBC_ZONE_TYPE_CONV)
+				set_bit(n, seq_zones);
+			block += get_unaligned_be64(&rec[8]);
+			rec += 64;
+			n++;
+		}
+
+	}
+
+	if (n != sdkp->nr_zones)
+
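As a cross-check of the parsing loop above, here is a small userspace model (Python, with an invented helper name) of the REPORT ZONES walk: bytes 0-3 of the header give the descriptor list length (big-endian), each descriptor is 64 bytes, the zone type is the low nibble of byte 0, and the zone length is a big-endian 64-bit value at byte 8.

```python
import struct

ZBC_ZONE_TYPE_CONV = 0x1

def parse_seq_zones(buf):
    """Walk REPORT ZONES data the way sd_zbc_setup_seq_zones() does and
    return (set of sequential zone numbers, zone count, total blocks)."""
    seq, n, block = set(), 0, 0
    list_length = struct.unpack_from(">I", buf, 0)[0] + 64
    end = min(list_length, len(buf))
    off = 64                              # skip the 64-byte header
    while off + 64 <= end:
        if (buf[off] & 0x0F) != ZBC_ZONE_TYPE_CONV:
            seq.add(n)                    # anything non-conventional
        block += struct.unpack_from(">Q", buf, off + 8)[0]
        off += 64
        n += 1
    return seq, n, block


# Two zones: one conventional, one sequential-write-required, 64K LBs each.
hdr = struct.pack(">I", 2 * 64) + bytes(60)
conv = bytes([0x1]) + bytes(7) + struct.pack(">Q", 65536) + bytes(48)
seqz = bytes([0x2]) + bytes(7) + struct.pack(">Q", 65536) + bytes(48)

seq, n, blocks = parse_seq_zones(hdr + conv + seqz)
assert seq == {1} and n == 2 and blocks == 131072
```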
[PATCH V2 08/12] scsi: sd_zbc: Fix sd_zbc_read_zoned_characteristics()
The three values starting at byte 8 of the Zoned Block Device
Characteristics VPD page B6h are 32-bit values, not 64-bit, so use
get_unaligned_be32() to retrieve the values, not get_unaligned_be64().

Fixes: 89d947561077 ("sd: Implement support for ZBC devices")
Cc:
Signed-off-by: Damien Le Moal
Reviewed-by: Bart Van Assche
---
 drivers/scsi/sd_zbc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 8a45db8bd6e7..8b258254a2bb 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -358,15 +358,15 @@ static int sd_zbc_read_zoned_characteristics(struct scsi_disk *sdkp,
 	if (sdkp->device->type != TYPE_ZBC) {
 		/* Host-aware */
 		sdkp->urswrz = 1;
-		sdkp->zones_optimal_open = get_unaligned_be64(&buf[8]);
-		sdkp->zones_optimal_nonseq = get_unaligned_be64(&buf[12]);
+		sdkp->zones_optimal_open = get_unaligned_be32(&buf[8]);
+		sdkp->zones_optimal_nonseq = get_unaligned_be32(&buf[12]);
 		sdkp->zones_max_open = 0;
 	} else {
 		/* Host-managed */
 		sdkp->urswrz = buf[4] & 1;
 		sdkp->zones_optimal_open = 0;
 		sdkp->zones_optimal_nonseq = 0;
-		sdkp->zones_max_open = get_unaligned_be64(&buf[16]);
+		sdkp->zones_max_open = get_unaligned_be32(&buf[16]);
 	}
 
 	return 0;
-- 
2.13.5
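The effect of the bug is easy to demonstrate in userspace: the fields at bytes 8 and 12 are adjacent 4-byte fields, so a 64-bit big-endian read at byte 8 folds both into one bogus value. A minimal sketch (Python, with made-up field values):

```python
import struct

# Toy layout of the Zoned Block Device Characteristics VPD page (B6h),
# host-aware case: bytes 8-11 and 12-15 are adjacent 4-byte fields.
vpd = bytearray(64)
struct.pack_into(">I", vpd, 8, 128)    # optimal number of open zones
struct.pack_into(">I", vpd, 12, 16)    # optimal number of non-seq zones

right = struct.unpack_from(">I", vpd, 8)[0]   # get_unaligned_be32(&buf[8])
wrong = struct.unpack_from(">Q", vpd, 8)[0]   # the old be64 read

assert right == 128
assert wrong == (128 << 32) | 16   # be64 folds both fields into one value
```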
[PATCH V2 06/12] scsi: sd_zbc: Rearrange code
Move the inline functions sd_zbc_zone_sectors() and sd_zbc_zone_no()
declarations to sd_zbc.h, and rearrange sd_zbc_setup() to include the
use_16_for_rw and use_10_for_rw assignments. Also move the calculation of
sdkp->zone_shift together with the assignment of the verified zone_blocks
value in sd_zbc_check_zone_size(). No functional change is introduced by
this patch.

Signed-off-by: Damien Le Moal
---
 drivers/scsi/sd_zbc.c | 21 +
 drivers/scsi/sd_zbc.h | 17 +
 2 files changed, 22 insertions(+), 16 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index e0927f8fb829..a25a940243a9 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -206,17 +206,6 @@ static void sd_zbc_report_zones_complete(struct scsi_cmnd *scmd,
 	local_irq_restore(flags);
 }
 
-static inline sector_t sd_zbc_zone_sectors(struct scsi_disk *sdkp)
-{
-	return logical_to_sectors(sdkp->device, sdkp->zone_blocks);
-}
-
-static inline unsigned int sd_zbc_zone_no(struct scsi_disk *sdkp,
-					  sector_t sector)
-{
-	return sectors_to_logical(sdkp->device, sector) >> sdkp->zone_shift;
-}
-
 /*
  * Prepare a RESET WRITE POINTER scsi command for a REQ_OP_ZONE_RESET request.
  */
@@ -516,6 +505,7 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
 	}
 
 	sdkp->zone_blocks = zone_blocks;
+	sdkp->zone_shift = ilog2(zone_blocks);
 
 	return 0;
 }
@@ -523,10 +513,13 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
 
 static int sd_zbc_setup(struct scsi_disk *sdkp)
 {
+	/* READ16/WRITE16 is mandatory for ZBC disks */
+	sdkp->device->use_16_for_rw = 1;
+	sdkp->device->use_10_for_rw = 0;
+
 	/* chunk_sectors indicates the zone size */
 	blk_queue_chunk_sectors(sdkp->disk->queue,
 			logical_to_sectors(sdkp->device, sdkp->zone_blocks));
-	sdkp->zone_shift = ilog2(sdkp->zone_blocks);
 	sdkp->nr_zones = sdkp->capacity >> sdkp->zone_shift;
 	if (sdkp->capacity & (sdkp->zone_blocks - 1))
 		sdkp->nr_zones++;
@@ -589,10 +582,6 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
 	if (ret)
 		goto err;
 
-	/* READ16/WRITE16 is mandatory for ZBC disks */
-	sdkp->device->use_16_for_rw = 1;
-	sdkp->device->use_10_for_rw = 0;
-
 	return 0;
 
 err:
diff --git a/drivers/scsi/sd_zbc.h b/drivers/scsi/sd_zbc.h
index 5e7ecc9b2966..b34002f0acbe 100644
--- a/drivers/scsi/sd_zbc.h
+++ b/drivers/scsi/sd_zbc.h
@@ -15,6 +15,23 @@ int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd);
 void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
 		     struct scsi_sense_hdr *sshdr);
 
+/*
+ * Zone size in number of 512B sectors.
+ */
+static inline sector_t sd_zbc_zone_sectors(struct scsi_disk *sdkp)
+{
+	return logical_to_sectors(sdkp->device, sdkp->zone_blocks);
+}
+
+/*
+ * Zone number of the specified sector.
+ */
+static inline unsigned int sd_zbc_zone_no(struct scsi_disk *sdkp,
+					  sector_t sector)
+{
+	return sectors_to_logical(sdkp->device, sector) >> sdkp->zone_shift;
+}
+
 #else /* CONFIG_BLK_DEV_ZONED */
 
 static inline int sd_zbc_read_zones(struct scsi_disk *sdkp,
-- 
2.13.5
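The two helpers moved to sd_zbc.h are plain power-of-two shift arithmetic. A userspace sketch (Python, assuming for illustration 4096-byte logical blocks and 65536-block zones; these values are mine, not from the patch):

```python
def ilog2(x):
    return x.bit_length() - 1

def sectors_to_logical(sector, lb_size):
    # 512B sectors -> logical blocks (lb_size is a power of two >= 512)
    return sector >> ilog2(lb_size // 512)

def logical_to_sectors(blocks, lb_size):
    return blocks << ilog2(lb_size // 512)

def zone_no(sector, lb_size, zone_blocks):
    # sd_zbc_zone_no() analogue: zone sizes are powers of two, so dividing
    # by the zone size is a shift by zone_shift = ilog2(zone_blocks)
    return sectors_to_logical(sector, lb_size) >> ilog2(zone_blocks)


# 4096B logical blocks, 65536-block (256 MiB) zones => 524288 sectors/zone
assert logical_to_sectors(65536, 4096) == 524288   # sd_zbc_zone_sectors()
assert zone_no(0, 4096, 65536) == 0
assert zone_no(524287, 4096, 65536) == 0           # last sector of zone 0
assert zone_no(524288, 4096, 65536) == 1           # first sector of zone 1
```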
[PATCH V2 07/12] scsi: sd_zbc.c: Use well defined macros
Instead of open coding, use the min() macro to calculate the report zones
reply buffer length in sd_zbc_check_zone_size() and the round_up() macro to
calculate the number of zones in sd_zbc_setup(). No functional change is
introduced by this patch.

Signed-off-by: Damien Le Moal
---
 drivers/scsi/sd_zbc.c | 12
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index a25a940243a9..8a45db8bd6e7 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -403,7 +403,7 @@ static int sd_zbc_check_capacity(struct scsi_disk *sdkp, unsigned char *buf)
 	return 0;
 }
 
-#define SD_ZBC_BUF_SIZE 131072
+#define SD_ZBC_BUF_SIZE 131072U
 
 static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
 {
@@ -447,10 +447,7 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
 		/* Parse REPORT ZONES header */
 		list_length = get_unaligned_be32(&buf[0]) + 64;
 		rec = buf + 64;
-		if (list_length < SD_ZBC_BUF_SIZE)
-			buf_len = list_length;
-		else
-			buf_len = SD_ZBC_BUF_SIZE;
+		buf_len = min(list_length, SD_ZBC_BUF_SIZE);
 
 		/* Parse zone descriptors */
 		while (rec < buf + buf_len) {
@@ -520,9 +517,8 @@ static int sd_zbc_setup(struct scsi_disk *sdkp)
 	/* chunk_sectors indicates the zone size */
 	blk_queue_chunk_sectors(sdkp->disk->queue,
 			logical_to_sectors(sdkp->device, sdkp->zone_blocks));
-	sdkp->nr_zones = sdkp->capacity >> sdkp->zone_shift;
-	if (sdkp->capacity & (sdkp->zone_blocks - 1))
-		sdkp->nr_zones++;
+	sdkp->nr_zones =
+		round_up(sdkp->capacity, sdkp->zone_blocks) >> sdkp->zone_shift;
 
 	if (!sdkp->zones_wlock) {
 		sdkp->zones_wlock = kcalloc(BITS_TO_LONGS(sdkp->nr_zones),
-- 
2.13.5
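The round_up() rewrite is a straight equivalence: rounding the capacity up to a zone boundary before shifting gives the same count as shifting and conditionally adding one. A quick check (Python, using the kernel-style round_up() for power-of-two alignment):

```python
def round_up(x, y):
    # kernel round_up(): y must be a power of two
    return (x + y - 1) & ~(y - 1)

def nr_zones_old(capacity, zone_blocks, zone_shift):
    # the open-coded version being replaced
    n = capacity >> zone_shift
    if capacity & (zone_blocks - 1):
        n += 1
    return n

def nr_zones_new(capacity, zone_blocks, zone_shift):
    # the round_up() version introduced by the patch
    return round_up(capacity, zone_blocks) >> zone_shift


zone_blocks, zone_shift = 65536, 16
for capacity in (0, 1, 65535, 65536, 65537, 131072, 10**9):
    assert (nr_zones_new(capacity, zone_blocks, zone_shift)
            == nr_zones_old(capacity, zone_blocks, zone_shift))
```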
[PATCH V2 03/12] scsi: sd_zbc: Move ZBC declarations to scsi_proto.h
Move the standard definitions for the zone types and zone conditions to
scsi_proto.h, together with the definitions related to the REPORT ZONES
command. While at it, define all values in the enums to be clear.

Signed-off-by: Damien Le Moal
Reviewed-by: Bart Van Assche
---
 drivers/scsi/sd_zbc.c     | 18 --
 include/scsi/scsi_proto.h | 45 ++---
 2 files changed, 34 insertions(+), 29 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 8aa54779aac1..d445a57f99bd 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -37,24 +37,6 @@
 #include "sd.h"
 #include "scsi_priv.h"
 
-enum zbc_zone_type {
-	ZBC_ZONE_TYPE_CONV = 0x1,
-	ZBC_ZONE_TYPE_SEQWRITE_REQ,
-	ZBC_ZONE_TYPE_SEQWRITE_PREF,
-	ZBC_ZONE_TYPE_RESERVED,
-};
-
-enum zbc_zone_cond {
-	ZBC_ZONE_COND_NO_WP,
-	ZBC_ZONE_COND_EMPTY,
-	ZBC_ZONE_COND_IMP_OPEN,
-	ZBC_ZONE_COND_EXP_OPEN,
-	ZBC_ZONE_COND_CLOSED,
-	ZBC_ZONE_COND_READONLY = 0xd,
-	ZBC_ZONE_COND_FULL,
-	ZBC_ZONE_COND_OFFLINE,
-};
-
 /**
  * Convert a zone descriptor to a zone struct.
  */
diff --git a/include/scsi/scsi_proto.h b/include/scsi/scsi_proto.h
index 8c285d9a06d8..39130a9c05bf 100644
--- a/include/scsi/scsi_proto.h
+++ b/include/scsi/scsi_proto.h
@@ -301,19 +301,42 @@ struct scsi_lun {
 
 /* Reporting options for REPORT ZONES */
 enum zbc_zone_reporting_options {
-	ZBC_ZONE_REPORTING_OPTION_ALL = 0,
-	ZBC_ZONE_REPORTING_OPTION_EMPTY,
-	ZBC_ZONE_REPORTING_OPTION_IMPLICIT_OPEN,
-	ZBC_ZONE_REPORTING_OPTION_EXPLICIT_OPEN,
-	ZBC_ZONE_REPORTING_OPTION_CLOSED,
-	ZBC_ZONE_REPORTING_OPTION_FULL,
-	ZBC_ZONE_REPORTING_OPTION_READONLY,
-	ZBC_ZONE_REPORTING_OPTION_OFFLINE,
-	ZBC_ZONE_REPORTING_OPTION_NEED_RESET_WP = 0x10,
-	ZBC_ZONE_REPORTING_OPTION_NON_SEQWRITE,
-	ZBC_ZONE_REPORTING_OPTION_NON_WP = 0x3f,
+	ZBC_ZONE_REPORTING_OPTION_ALL		= 0x00,
+	ZBC_ZONE_REPORTING_OPTION_EMPTY		= 0x01,
+	ZBC_ZONE_REPORTING_OPTION_IMPLICIT_OPEN	= 0x02,
+	ZBC_ZONE_REPORTING_OPTION_EXPLICIT_OPEN	= 0x03,
+	ZBC_ZONE_REPORTING_OPTION_CLOSED	= 0x04,
+	ZBC_ZONE_REPORTING_OPTION_FULL		= 0x05,
+	ZBC_ZONE_REPORTING_OPTION_READONLY	= 0x06,
+	ZBC_ZONE_REPORTING_OPTION_OFFLINE	= 0x07,
+	/* 0x08 to 0x0f are reserved */
+	ZBC_ZONE_REPORTING_OPTION_NEED_RESET_WP	= 0x10,
+	ZBC_ZONE_REPORTING_OPTION_NON_SEQWRITE	= 0x11,
+	/* 0x12 to 0x3e are reserved */
+	ZBC_ZONE_REPORTING_OPTION_NON_WP	= 0x3f,
 };
 
 #define ZBC_REPORT_ZONE_PARTIAL 0x80
 
+/* Zone types of REPORT ZONES zone descriptors */
+enum zbc_zone_type {
+	ZBC_ZONE_TYPE_CONV		= 0x1,
+	ZBC_ZONE_TYPE_SEQWRITE_REQ	= 0x2,
+	ZBC_ZONE_TYPE_SEQWRITE_PREF	= 0x3,
+	/* 0x4 to 0xf are reserved */
+};
+
+/* Zone conditions of REPORT ZONES zone descriptors */
+enum zbc_zone_cond {
+	ZBC_ZONE_COND_NO_WP	= 0x0,
+	ZBC_ZONE_COND_EMPTY	= 0x1,
+	ZBC_ZONE_COND_IMP_OPEN	= 0x2,
+	ZBC_ZONE_COND_EXP_OPEN	= 0x3,
+	ZBC_ZONE_COND_CLOSED	= 0x4,
+	/* 0x5 to 0xc are reserved */
+	ZBC_ZONE_COND_READONLY	= 0xd,
+	ZBC_ZONE_COND_FULL	= 0xe,
+	ZBC_ZONE_COND_OFFLINE	= 0xf,
+};
+
 #endif /* _SCSI_PROTO_H_ */
-- 
2.13.5
[PATCH V2 04/12] scsi: sd_zbc: Move zbc disk declarations to sd_zbc.h
Introduce sd_zbc.h to gather ZBC disk related declarations and avoid
cluttering sd.h. No functional change is introduced by this patch.

Signed-off-by: Damien Le Moal
---
 drivers/scsi/sd.c     |  1 +
 drivers/scsi/sd.h     | 48 ---
 drivers/scsi/sd_zbc.c |  7 +-
 drivers/scsi/sd_zbc.h | 62 +++
 4 files changed, 64 insertions(+), 54 deletions(-)
 create mode 100644 drivers/scsi/sd_zbc.h

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index c5d05b1c58a5..1cd113df2d3a 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -68,6 +68,7 @@
 #include
 
 #include "sd.h"
+#include "sd_zbc.h"
 #include "scsi_priv.h"
 #include "scsi_logging.h"
 
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 99c4dde9b6bf..c9a627e7eebd 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -277,52 +277,4 @@ static inline int sd_is_zoned(struct scsi_disk *sdkp)
 	return sdkp->zoned == 1 || sdkp->device->type == TYPE_ZBC;
 }
 
-#ifdef CONFIG_BLK_DEV_ZONED
-
-extern int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buffer);
-extern void sd_zbc_remove(struct scsi_disk *sdkp);
-extern void sd_zbc_print_zones(struct scsi_disk *sdkp);
-extern int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd);
-extern void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd);
-extern int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd);
-extern int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd);
-extern void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
-			    struct scsi_sense_hdr *sshdr);
-
-#else /* CONFIG_BLK_DEV_ZONED */
-
-static inline int sd_zbc_read_zones(struct scsi_disk *sdkp,
-				    unsigned char *buf)
-{
-	return 0;
-}
-
-static inline void sd_zbc_remove(struct scsi_disk *sdkp) {}
-
-static inline void sd_zbc_print_zones(struct scsi_disk *sdkp) {}
-
-static inline int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
-{
-	/* Let the drive fail requests */
-	return BLKPREP_OK;
-}
-
-static inline void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd) {}
-
-static inline int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd)
-{
-	return BLKPREP_INVALID;
-}
-
-static inline int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd)
-{
-	return BLKPREP_INVALID;
-}
-
-static inline void sd_zbc_complete(struct scsi_cmnd *cmd,
-				   unsigned int good_bytes,
-				   struct scsi_sense_hdr *sshdr) {}
-
-#endif /* CONFIG_BLK_DEV_ZONED */
-
 #endif /* _SCSI_DISK_H */
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index d445a57f99bd..a62f8e256a26 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -28,14 +28,9 @@
 #include
 #include
-#include
-#include
-#include
-#include
-#include
 
 #include "sd.h"
-#include "scsi_priv.h"
+#include "sd_zbc.h"
 
 /**
  * Convert a zone descriptor to a zone struct.
diff --git a/drivers/scsi/sd_zbc.h b/drivers/scsi/sd_zbc.h
new file mode 100644
index 000000000000..5e7ecc9b2966
--- /dev/null
+++ b/drivers/scsi/sd_zbc.h
@@ -0,0 +1,62 @@
+#ifndef _SCSI_DISK_ZBC_H
+#define _SCSI_DISK_ZBC_H
+
+#ifdef CONFIG_BLK_DEV_ZONED
+
+int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buffer);
+void sd_zbc_remove(struct scsi_disk *sdkp);
+void sd_zbc_print_zones(struct scsi_disk *sdkp);
+
+int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd);
+void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd);
+
+int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd);
+int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd);
+void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
+		     struct scsi_sense_hdr *sshdr);
+
+#else /* CONFIG_BLK_DEV_ZONED */
+
+static inline int sd_zbc_read_zones(struct scsi_disk *sdkp,
+				    unsigned char *buf)
+{
+	return 0;
+}
+
+static inline void sd_zbc_remove(struct scsi_disk *sdkp)
+{
+}
+
+static inline void sd_zbc_print_zones(struct scsi_disk *sdkp)
+{
+}
+
+static inline int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
+{
+	/* Let the drive fail requests */
+	return BLKPREP_OK;
+}
+
+static inline void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd)
+{
+}
+
+static inline int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd)
+{
+	return BLKPREP_INVALID;
+}
+
+static inline int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd)
+{
+	return BLKPREP_INVALID;
+}
+
+static inline void sd_zbc_complete(struct scsi_cmnd *cmd,
+				   unsigned int good_bytes,
+				   struct scsi_sense_hdr *sshdr)
+{
+}
+
+#endif /* CONFIG_BLK_DEV_ZONED */
+
+#endif /* _SCSI_DISK_ZBC_H */
-- 
2.13.5
Re: [PATCH V2 0/3] block/loop: handle discard/zeroout error
Hi Shaohua, You wrote: > BTW: blkdev_issue_zeroout retries if we immediately find the device doesn't > support zeroout, but it doesn't retry if submit_bio_wait returns -EOPNOTSUPP. > Is this correct behavior? I sent a patch for that yesterday, see "[PATCH] block: cope with WRITE SAME failing in blkdev_issue_zeroout()" on linux-block. It checks for -EREMOTEIO, because that's what I hit, but I wonder if it should check for -EOPNOTSUPP from submit_bio_wait() as well. Can you think of a case where, given bdev_write_zeroes_sectors() != 0, submit_bio_wait() would fail with -EOPNOTSUPP and we would want to do explicit zeroing? Please excuse broken threading. Thanks, Ilya
[PATCH 2/2] scsi_transport_fc: fix NULL pointer dereference in fc_bsg_job_timeout
bsg-lib now embeds the job structure into the request, and req->special
can't be used anymore.

Signed-off-by: Christoph Hellwig
Cc: sta...@vger.kernel.org
---
 drivers/scsi/scsi_transport_fc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 892fbd9800d9..bea06de60827 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -3550,7 +3550,7 @@ fc_vport_sched_delete(struct work_struct *work)
 static enum blk_eh_timer_return
 fc_bsg_job_timeout(struct request *req)
 {
-	struct bsg_job *job = (void *) req->special;
+	struct bsg_job *job = blk_mq_rq_to_pdu(req);
 	struct Scsi_Host *shost = fc_bsg_to_shost(job);
 	struct fc_rport *rport = fc_bsg_to_rport(job);
 	struct fc_internal *i = to_fc_internal(shost->transportt);
--
2.11.0
two small bsg fixes, take 3
Now fully awake, with my fair share of coffee and lunch:

Two fixups for the recent bsg-lib fixes, should go into 4.13 stable as well.
Re: [PATCH 10/18] lightnvm: pblk: use bio_copy_kern when possible
> On 7 Sep 2017, at 13.08, Christoph Hellwig wrote:
>
> On Wed, Sep 06, 2017 at 04:00:56PM +0200, Javier González wrote:
>>> Nope. You want to loop over vmalloc_to_page and call bio_add_page
>>> for each page,
>>
>> Yes. This is basically what I did before.
>>
>>> after taking care of virtually tagged caches instead
>>> of this bounce buffering.
>>
>> And thus I considered bio_copy_kern to be a better solution, since it
>> will through time take care of doing the vmalloc_to_page correctly for
>> all cases.
>
> bio_copy_kern copies all the data, so it is generally not a good
> idea.  The cache flushing isn't too hard - take a look at the XFS
> buffer cache for an existing version.
>
> It would be good to just do the right thing inside bio_map_kern
> for that so that callers don't need to care if it is vmalloced or
> not.

Yes. That would help. I know md also needs to manually add pages on
vmalloced memory. Probably others do too.

>> Ok. So this would mean that targets (e.g., pblk) deal with struct
>> request instead of only dealing with bios and then letting the LightNVM
>> core transform bios to requests. This way we can directly map to the
>> request. Is this what you mean?
>
> Yes.
>
>> Just out of curiosity, why is forming the bio through bio_copy_kern (or
>> manually doing the same) and then transforming to a request incorrect /
>> worse?
>
> Because you expose yourself to the details of mapping a bio to request.
> We had to export blk_init_request_from_bio just for lightnvm to do this,
> and it also has to do weird other bits about requests.  If you go
> through blk_rq_map_* the block layer takes care of all that for you.

Ok. It makes sense. I'll talk to Matias about it.

Thanks!
Javier
Re: [PATCH 0/3] Move tagset reinit to its only current consumer, nvme-core
On 9/6/2017 6:30 PM, Sagi Grimberg wrote:
> As suggested by Bart Van Assche, get rid of blk_mq_reinit_tagset and
> move it to nvme-core (its only current consumer). Instead, introduce a
> more generic tagset iterator helper.
>
> Sagi Grimberg (3):
>   block: introduce blk_mq_tagset_iter
>   nvme: introduce nvme_reinit_tagset
>   block: remove blk_mq_reinit_tagset
>
>  block/blk-mq-tag.c       | 11 +--
>  drivers/nvme/host/core.c | 13 +
>  drivers/nvme/host/fc.c   |  3 ++-
>  drivers/nvme/host/nvme.h |  2 ++
>  drivers/nvme/host/rdma.c |  7 +++
>  include/linux/blk-mq.h   |  4 ++--
>  6 files changed, 27 insertions(+), 13 deletions(-)

The series looks good to me,

Reviewed-by: Max Gurtovoy

BTW, if we're talking about the reinit_tagset, can you explain the
motivation for dereg_mr and allocating a new mr for the RDMA transport
layer? We don't do it in iSER/SRP, so I wonder what is special here.
I got some local protection errors in MP tests and under high-load FIO
(which caused the cmd TMO to expire, aborting the cmds and then
reconnecting). I wonder if there is a connection between the two. I'm
also debugging a situation where we might free/dereg/local_invalidate
the MR before the QP has moved to ERR state.
Re: [PATCH 3/3] block: remove blk_mq_reinit_tagset
Looks good,
Reviewed-by: Johannes Thumshirn

--
Johannes Thumshirn                                       Storage
jthumsh...@suse.de                             +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 2/3] nvme: introduce nvme_reinit_tagset
Looks good,
Reviewed-by: Johannes Thumshirn
Re: [PATCH 1/3] block: introduce blk_mq_tagset_iter
Looks good,
Reviewed-by: Johannes Thumshirn
Re: [PATCH V2 1/3] block/loop: don't hijack error number
On Thu, Sep 7, 2017 at 8:13 AM, Shaohua Li wrote:
> From: Shaohua Li
>
> If the bio returns -EOPNOTSUPP, we shouldn't hijack it and return -EIO
>
> Signed-off-by: Shaohua Li
> ---
>  drivers/block/loop.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 85de673..715b762 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -460,7 +460,7 @@ static void lo_complete_rq(struct request *rq)
>                         zero_fill_bio(bio);
>         }
>
> -       blk_mq_end_request(rq, cmd->ret < 0 ? BLK_STS_IOERR : BLK_STS_OK);
> +       blk_mq_end_request(rq, errno_to_blk_status(cmd->ret));
>  }
>
>  static void lo_rw_aio_do_completion(struct loop_cmd *cmd)
> @@ -476,7 +476,7 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
>  {
>         struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
>
> -       cmd->ret = ret;
> +       cmd->ret = ret > 0 ? 0 : ret;
>         lo_rw_aio_do_completion(cmd);
>  }
>
> @@ -1706,7 +1706,7 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
>  failed:
>         /* complete non-aio request */
>         if (!cmd->use_aio || ret) {
> -               cmd->ret = ret ? -EIO : 0;
> +               cmd->ret = ret;
>                 blk_mq_complete_request(cmd->rq);
>         }
>  }
> --
> 2.9.5

Looks fine:

Reviewed-by: Ming Lei

--
Ming Lei
Re: [PATCH V2 3/3] block/loop: suppress discard IO error message
On Thu, Sep 7, 2017 at 8:13 AM, Shaohua Li wrote:
> From: Shaohua Li
>
> We don't know if fallocate really supports FALLOC_FL_PUNCH_HOLE till
> fallocate is called. If it doesn't support it, loop will return
> -EOPNOTSUPP and we see a lot of error messages printed by
> blk_update_request. Failure for discard IO isn't a big problem, so we
> just return 0 in loop, which will suppress the IO error message.

Setting RQF_QUIET for discard IO should be a cleaner way to suppress the
error, and the upper layer can still get the failure notification.

--
Ming Lei
Re: [PATCH V2 2/3] block/loop: use FALLOC_FL_ZERO_RANGE for REQ_OP_WRITE_ZEROES
On Thu, Sep 7, 2017 at 8:13 AM, Shaohua Li wrote:
> From: Shaohua Li
>
> REQ_OP_WRITE_ZEROES really means zero the data. And in blkdev_fallocate,
> FALLOC_FL_ZERO_RANGE will retry but FALLOC_FL_PUNCH_HOLE will not, even
> if the loop request doesn't have BLKDEV_ZERO_NOFALLBACK set.
>
> Signed-off-by: Shaohua Li
> ---
>  drivers/block/loop.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 715b762..8934e25 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -426,6 +426,9 @@ static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
>         int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
>         int ret;
>
> +       if (req_op(rq) == REQ_OP_WRITE_ZEROES)
> +               mode = FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE;
> +
>         if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
>                 ret = -EOPNOTSUPP;
>                 goto out;
> --
> 2.9.5

Looks fine,

Reviewed-by: Ming Lei

--
Ming Lei