Re: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]
On Wed, Aug 30, 2017 at 02:20:37PM +0900, Sergey Senozhatsky wrote:
> Byungchul, a quick question.

Hello Sergey,

> have you measured the performance impact? somehow my linux-next is
> notably slower than earlier 4.13 linux-next. (e.g. scrolling in vim
> is irritatingly slow)

Yeah, it might have a performance impact, inevitably.

To Ingo,

I cannot decide if we have to roll back the CONFIG_LOCKDEP_CROSSRELEASE
dependency on CONFIG_PROVE_LOCKING in Kconfig. With them enabled,
lockdep detection becomes stronger but has a performance impact. But
it's a debug option anyway, so IMHO we don't have to take care of the
performance impact. Please let me know your decision.

> `time dmesg' shows some difference, but probably that's not a good
> test.
>
>             !LOCKDEP     LOCKDEP     LOCKDEP
>                                      -CROSSRELEASE
>                                      -COMPLETIONS
> real        0m0.661s     0m2.290s    0m1.920s
> user        0m0.010s     0m0.105s    0m0.000s
> sys         0m0.636s     0m2.224s    0m1.888s
>
> anyone else "sees"/"can confirm" the slow down?
>
> it gets back to "usual normal" when I disable CROSSRELEASE and COMPLETIONS.
>
> ---
>
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index b19c491cbc4e..cdc30ef81c5e 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1091,8 +1091,6 @@ config PROVE_LOCKING
>  	select DEBUG_MUTEXES
>  	select DEBUG_RT_MUTEXES if RT_MUTEXES
>  	select DEBUG_LOCK_ALLOC
> -	select LOCKDEP_CROSSRELEASE
> -	select LOCKDEP_COMPLETIONS
>  	select TRACE_IRQFLAGS
>  	default n
>  	help
>
> ---
>
> -ss
Re: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]
On (08/23/17 09:03), Byungchul Park wrote:
[..]
> > Byungchul, did you add the crosslock checks to lockdep? Can you have a look at
> > the above report? That report namely doesn't make sense to me.
>
> The report is talking about the following lockup:
>
> A work in a worker			A task work on exit to user
> ------------------			---------------------------
> mutex_lock(&bdev->bd_mutex)
> 					mutex_lock(&bdev->bd_mutex)
> blk_execute_rq()
>    wait_for_completion_io_timeout()
> 					complete()
> [..]
> To Peterz,
>
> Anyway I wanted to avoid lockdep reports in the case using a timeout
> interface. Do you think it's still worth reporting the kind of lockup?
> I'm ok if you do.

Byungchul, a quick question.

have you measured the performance impact? somehow my linux-next is
notably slower than earlier 4.13 linux-next. (e.g. scrolling in vim
is irritatingly slow)

`time dmesg' shows some difference, but probably that's not a good
test.

            !LOCKDEP     LOCKDEP     LOCKDEP
                                     -CROSSRELEASE
                                     -COMPLETIONS
real        0m0.661s     0m2.290s    0m1.920s
user        0m0.010s     0m0.105s    0m0.000s
sys         0m0.636s     0m2.224s    0m1.888s

anyone else "sees"/"can confirm" the slow down?

it gets back to "usual normal" when I disable CROSSRELEASE and COMPLETIONS.

---

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index b19c491cbc4e..cdc30ef81c5e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1091,8 +1091,6 @@ config PROVE_LOCKING
 	select DEBUG_MUTEXES
 	select DEBUG_RT_MUTEXES if RT_MUTEXES
 	select DEBUG_LOCK_ALLOC
-	select LOCKDEP_CROSSRELEASE
-	select LOCKDEP_COMPLETIONS
 	select TRACE_IRQFLAGS
 	default n
 	help

---

-ss
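The question Byungchul raises above — whether a wait that uses a timeout can still count as a lockup worth reporting — can be illustrated outside the kernel. Below is a minimal sketch in Python: a `threading.Lock` stands in for `bd_mutex` and a `threading.Event` for the completion; all names are illustrative stand-ins, not kernel API.

```python
import threading
import time

bd_mutex = threading.Lock()      # stands in for bdev->bd_mutex
completion = threading.Event()   # stands in for the struct completion
results = {}

def worker():
    # "A work in a worker": takes the mutex, then waits for a completion
    # that can only be signalled by a task which also needs the mutex.
    with bd_mutex:
        # wait_for_completion_io_timeout() analogue: a bounded wait
        results["completed_in_time"] = completion.wait(timeout=0.2)

def exiting_task():
    # "A task work on exit to user": blocks on the mutex, then completes.
    with bd_mutex:
        completion.set()

t1 = threading.Thread(target=worker)
t2 = threading.Thread(target=exiting_task)
t1.start()
time.sleep(0.05)   # make sure the worker holds the mutex first
t2.start()
t1.join()
t2.join()

# Without the timeout this is a permanent lockup: the worker waits for a
# completion that can only come from a task blocked on the mutex the
# worker holds. With the timeout, the worker gives up, drops the mutex,
# and the second task finally runs complete().
print(results["completed_in_time"], completion.is_set())
```

The timeout turns what would be a hard deadlock into a recoverable slow path, which is exactly why it is debatable whether lockdep should still flag the dependency.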
Re: I/O hangs after resuming from suspend-to-ram
On Wed, Aug 30, 2017 at 10:15:37AM +0800, Ming Lei wrote:
> Hi,
>
> On Tue, Aug 29, 2017 at 05:52:42PM +0200, Oleksandr Natalenko wrote:
> > Hello.
> >
> > Re-tested with v4.13-rc7 + proposed patch and got the same result.
>
> Maybe there is another issue, I didn't use dmcrypt on raid10, will
> test in your way to see if I can reproduce it.
>
> BTW, could you share us which blk-mq scheduler you are using on sata?
> The patch I posted should address one issue on none scheduler.

Can't reproduce it even with dmcrypt on raid10 after applying my patch.
Could you apply the following debug patch and provide the dmesg log
after running the commands below?

	# echo 9 > /proc/sys/kernel/printk
	# echo devices > /sys/power/pm_test
	# echo mem > /sys/power/state

BTW, it would be better to provide the names of the two sata disks
(behind raid10) too.

diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c
index b44c1bb687a2..75b13248ea1c 100644
--- a/drivers/scsi/scsi_pm.c
+++ b/drivers/scsi/scsi_pm.c
@@ -53,17 +53,22 @@ static int scsi_dev_type_suspend(struct device *dev,
 {
 	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
 	int err;
+	struct scsi_device *sdev = to_scsi_device(dev);
 
 	/* flush pending in-flight resume operations, suspend is synchronous */
 	async_synchronize_full_domain(&scsi_sd_pm_domain);
 
-	err = scsi_device_quiesce(to_scsi_device(dev));
+	sdev_printk(KERN_WARNING, sdev, "%s: enter\n", __func__);
+	err = scsi_device_quiesce(sdev);
 	if (err == 0) {
+		sdev_printk(KERN_WARNING, sdev, "%s: before suspend\n", __func__);
 		err = cb(dev, pm);
+		sdev_printk(KERN_WARNING, sdev, "%s: after suspend\n", __func__);
 		if (err)
-			scsi_device_resume(to_scsi_device(dev));
+			scsi_device_resume(sdev);
 	}
 	dev_dbg(dev, "scsi suspend: %d\n", err);
+	sdev_printk(KERN_WARNING, sdev, "%s: exit\n", __func__);
 	return err;
 }
 
@@ -72,9 +77,13 @@ static int scsi_dev_type_resume(struct device *dev,
 {
 	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
 	int err = 0;
+	struct scsi_device *sdev = to_scsi_device(dev);
 
+	sdev_printk(KERN_WARNING, sdev, "%s: enter\n", __func__);
+	sdev_printk(KERN_WARNING, sdev, "%s: before resume\n", __func__);
 	err = cb(dev, pm);
-	scsi_device_resume(to_scsi_device(dev));
+	sdev_printk(KERN_WARNING, sdev, "%s: after resume\n", __func__);
+	scsi_device_resume(sdev);
 	dev_dbg(dev, "scsi resume: %d\n", err);
 
 	if (err == 0) {
@@ -83,6 +92,7 @@ static int scsi_dev_type_resume(struct device *dev,
 		pm_runtime_enable(dev);
 	}
 
+	sdev_printk(KERN_WARNING, sdev, "%s: exit\n", __func__);
 	return err;
 }

-- 
Ming
Re: [RFC] block/loop: make loop cgroup aware
On Tue, Aug 29, 2017 at 08:28:09AM -0700, Tejun Heo wrote:
> Hello, Shaohua.
>
> On Tue, Aug 29, 2017 at 08:22:36AM -0700, Shaohua Li wrote:
> > > Yeah, writeback tracks the most active cgroup and associates writeback
> > > ios with that cgroup.  For buffered loop devices, I think it'd be fine
> > > to make the loop driver forward the cgroup association and let the
> > > writeback layer determine the matching association as usual.
> >
> > Doing this means we must forward cgroup info to page, not bio. I need to check
> > if we can make the mechanism work for memcg.
>
> The association is already coming from the page.  We just need to make
> sure that going through loop driver doesn't get in the way of the
> membership getting propagated to the underlying device.

I think there is confusion. An app writes files in the upper layer fs on
loop. memcg establishes membership for the pages of these files. Then
writeback writes them out, and loop writes these pages to the lower
layer fs. The write is done in the loop thread, and it allocates new
page cache for the lower layer fs files. The issue is that we must
forward the memcg info from the upper layer files' page cache to the
lower layer files' page cache.

> > > Hmm... why do we need double forwarding (original issuer -> aio cmd ->
> > > ios) in loop?  If we need this, doesn't this mean that we're missing
> > > ownership forwarding in aio paths and should make the forwarding
> > > happen there?
> >
> > what do you mean double forwarding?
>
> So, this looks like the loop driver is explicitly forwarding the
> association from the original issuer to the aio command and then from
> the aio command to the ios to the underlying device.  I'm wondering
> whether the right way to do this is making aio forward the association
> by default, instead of the loop driver doing it explicitly.  That way,
> all aio users can benefit from the forwarding instead of just loop.

That's possible. The downside of doing this in aio is that we must audit
all filesystems to make sure all bios have the association. I'd like to
avoid doing this if there is no other loop-like cgroup usage.

Thanks,
Shaohua
Re: [PATCH V2 2/2] block/loop: allow request merge for directio mode
On Wed, Aug 30, 2017 at 10:51:21AM +0800, Ming Lei wrote:
> On Tue, Aug 29, 2017 at 08:13:39AM -0700, Shaohua Li wrote:
> > On Tue, Aug 29, 2017 at 05:56:05PM +0800, Ming Lei wrote:
> > > On Thu, Aug 24, 2017 at 12:24:53PM -0700, Shaohua Li wrote:
> > > > From: Shaohua Li
> > > >
> > > > Currently loop disables merge. While it makes sense for buffer IO mode,
> > > > directio mode can benefit from request merge. Without merge, loop could
> > > > send small size IO to underlayer disk and harm performance.
> > >
> > > Hi Shaohua,
> > >
> > > IMO no matter if merge is used, loop always sends page by page
> > > to VFS in both dio or buffer I/O.
> >
> > Why do you think so?
>
> do_blockdev_direct_IO() still handles page by page from iov_iter, and
> with bigger request, I guess it might be the plug merge working.

This is not true. directio sends big size bios directly, not because of
plug merge. Please at least check the code before you complain.

> > > Also if merge is enabled on loop, that means merge is run
> > > on both loop and low level block driver, and not sure if we
> > > can benefit from that.
> >
> > why does merge still happen in low level block driver?
>
> Because the scheduler is still working on the low level disk. My
> question is why the scheduler on the low level disk doesn't merge now,
> if the scheduler on loop can merge?

The low level disk can still do merge, but since this is directio, the
upper layer already dispatches requests as big as possible. There is
very little chance the requests can be merged again.

> > > So could you provide some performance data about this patch?
> >
> > In my virtual machine, a workload improves from ~20M/s to ~50M/s. And I clearly
> > see the request size becomes bigger.
>
> Could you share us what the low level disk is?

It's a SATA ssd.

Thanks,
Shaohua
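The effect being argued about — adjacent small IOs coalescing into fewer, larger requests before they reach the low level disk — can be sketched generically. This is illustrative Python, not the block layer's actual merge code; the `(offset, length)` tuples are a toy stand-in for bios.

```python
def merge_adjacent(ios):
    """Coalesce (offset, length) IOs that are physically contiguous.

    A toy model of back-merging: an IO whose start offset equals the
    previous IO's end is folded into that previous IO.
    """
    merged = []
    for off, length in sorted(ios):
        if merged and merged[-1][0] + merged[-1][1] == off:
            # contiguous with the previous request: extend it
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((off, length))
    return merged

# Eight sequential 4KiB writes become one 32KiB request.
small = [(i * 4096, 4096) for i in range(8)]
print(merge_adjacent(small))  # [(0, 32768)]
```

This is also why the low-level scheduler finds little left to merge under directio: when the upper layer already dispatches maximally-sized contiguous requests, the second pass sees almost no adjacent pairs.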
Re: [PATCH] block: Call .initialize_rq_fn() also for filesystem requests
On Tue, Aug 29, 2017 at 08:15:23PM +0000, Bart Van Assche wrote:
> On Tue, 2017-08-29 at 19:16 +0800, Ming Lei wrote:
> > Hi Bart,
> >
> > Did you see perf regression on SRP with smaller jobs after applying
> > my patchset V3?
> >
> > I just run the test with 16 jobs (the system has 16 CPU cores) instead of 64,
> > and don't see a perf regression on SRP with v4.13-rc6+blk-next (2nd column)
> > VS. v4.13-rc6+blk-next+patch V3 (3rd column):
> >
> > ---------------------------------------
> >  IOPS(K)   |    NONE     |    NONE
> > ---------------------------------------
> >  read      |   475.83    |   485.88
> > ---------------------------------------
> >  randread  |   142.86    |   141.96
> > ---------------------------------------
> >  write     |   483.9     |   492.39
> > ---------------------------------------
> >  randwrite |   124.83    |   124.53
> > ---------------------------------------
> > [ ... ]
> > ---------------------------------------
> >  LAT(us)   |    NONE     |    NONE
> > ---------------------------------------
> >  read      |    2.15     |    2.11
> > ---------------------------------------
> >  randread  |    7.17     |    7.21
> > ---------------------------------------
> >  write     |    2.11     |    2.08
> > ---------------------------------------
> >  randwrite |    8.2      |    8.22
> > ---------------------------------------
> > [ ... ]
>
> Hello Ming,
>
> Although I would prefer to see measurement data against an SRP target system
> that supports a higher workload (>1M IOPS) and

Hi Bart,

For such a high workload, I guess it is often required to increase
.cmd_per_lun.

> also for a high-end NVMe drive,

My patch won't affect NVMe drives, since the NVMe driver doesn't usually
become busy.

> I think the above data is sufficient to show that the performance impact of
> your patch series is most likely small enough even for high-end SCSI initiator
> drivers.

OK.

-- 
Ming
Re: [PATCH V2 2/2] block/loop: allow request merge for directio mode
On Tue, Aug 29, 2017 at 08:13:39AM -0700, Shaohua Li wrote:
> On Tue, Aug 29, 2017 at 05:56:05PM +0800, Ming Lei wrote:
> > On Thu, Aug 24, 2017 at 12:24:53PM -0700, Shaohua Li wrote:
> > > From: Shaohua Li
> > >
> > > Currently loop disables merge. While it makes sense for buffer IO mode,
> > > directio mode can benefit from request merge. Without merge, loop could
> > > send small size IO to underlayer disk and harm performance.
> >
> > Hi Shaohua,
> >
> > IMO no matter if merge is used, loop always sends page by page
> > to VFS in both dio or buffer I/O.
>
> Why do you think so?

do_blockdev_direct_IO() still handles page by page from iov_iter, and
with bigger request, I guess it might be the plug merge working.

> > Also if merge is enabled on loop, that means merge is run
> > on both loop and low level block driver, and not sure if we
> > can benefit from that.
>
> why does merge still happen in low level block driver?

Because the scheduler is still working on the low level disk. My
question is why the scheduler on the low level disk doesn't merge now,
if the scheduler on loop can merge?

> > So could you provide some performance data about this patch?
>
> In my virtual machine, a workload improves from ~20M/s to ~50M/s. And I clearly
> see the request size becomes bigger.

Could you share us what the low level disk is?

-- 
Ming
Re: I/O hangs after resuming from suspend-to-ram
Hi,

On Tue, Aug 29, 2017 at 05:52:42PM +0200, Oleksandr Natalenko wrote:
> Hello.
>
> Re-tested with v4.13-rc7 + proposed patch and got the same result.

Maybe there is another issue; I didn't use dmcrypt on raid10, and will
test your way to see if I can reproduce it.

BTW, could you share with us which blk-mq scheduler you are using on
sata? The patch I posted should address one issue on the none scheduler.

-- 
Ming
Re: [PATCH 1/2] bfq: Re-enable auto-loading when built as a module
On Tue, 2017-08-29 at 13:57 -0600, Jens Axboe wrote:
> On 08/29/2017 01:49 PM, Ben Hutchings wrote:
> > On Tue, 2017-08-29 at 10:46 -0600, Jens Axboe wrote:
> > > On 08/29/2017 10:34 AM, Ben Hutchings wrote:
> > > > On Tue, 2017-08-29 at 09:53 -0600, Jens Axboe wrote:
> > > > > On 08/29/2017 09:48 AM, Jens Axboe wrote:
> > > > > > On 08/29/2017 09:28 AM, Ben Hutchings wrote:
> > > > > > > On Tue, 2017-08-29 at 08:31 -0600, Jens Axboe wrote:
> > > > > > > > On Sun, Aug 13 2017, Ben Hutchings wrote:
> > > > > > > > > The block core requests modules with the "-iosched" name suffix, but
> > > > > > > > > bfq no longer has that suffix.  Add an alias.
> > > > > > > >
> > > > > > > > I'd apply these two, but both patches are mangled when saved.  It's
> > > > > > > > turning == into =3D and so forth.
> > > > > > > >
> > > > > > > > Care to check your settings and resend?
> > > > > > >
> > > > > > > Just tried saving and applying with 'git am' successfully.  I think the
> > > > > > > problem is at your end.
> > > > > >
> > > > > > Then yours is the only one, I apply patches people send me all day long.
> > > > > > Was the case both in tbird and mutt, both of them showed the diffs
> > > > > > as mangled, and they showed up mangled when saved.
> > > > >
> > > > > Here's your email in the archive:
> > > > >
> > > > > https://marc.info/?l=linux-block&m=150264374920778&q=raw
> > > > >
> > > > > Note this part:
> > > > >
> > > > > Content-Transfer-Encoding: quoted-printable
> > > >
> > > > What about it?  This is used for every mail with a non-ASCII name in
> > > > it, for example.  'git am' understands it.
> > >
> > > What about it?  It screws up the patch.  Maybe git am understands it, but
> > > it's hard/impossible to read manually.
> >
> > Where, other than the 'raw' link above, are you seeing the patch with
> > q-p encoding not decoded?
>
> When I save it and view it.

I don't understand why you wouldn't do that in your mailer.  You're
going to need to switch back to the mailer when you reply, after all.

But if you want to display a quoted-printable file in a terminal, you
can use 'qprint -d' to do that.

> > > I'm not going to apply anything
> > > that I can't personally read/review easily.  Fix your setup, if you are
> > > going to be sending patches.
> >
> > So you won't accept patches from anyone with a non-ASCII name?
>
> You're being ridiculous.

Am I?  It seems like they would trigger the same problem in your current
workflow.

Ben.

-- 
Ben Hutchings
Teamwork is essential - it allows you to blame someone else.
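The `=3D` mangling being discussed is just the quoted-printable wire form shown without decoding: the `=` character is escaped as `=3D` on the wire. Python's standard `quopri` module demonstrates the round trip (a generic illustration of the encoding, unrelated to any particular mailer):

```python
import quopri

# '=' must be escaped in quoted-printable, so "==" goes out as "=3D=3D".
original = b"select LOCKDEP_CROSSRELEASE if PROVE_LOCKING == y"
encoded = quopri.encodestring(original)
print(b"=3D=3D" in encoded)  # True

# Any MIME-aware consumer ('git am', a mailer, or 'qprint -d')
# decodes it back to the original bytes losslessly.
print(quopri.decodestring(encoded) == original)  # True
```

So a patch saved raw from the archive looks mangled, but the content is intact; decoding it (in a mailer, with `git am`, or with `qprint -d`) recovers the exact original diff.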
Re: [PATCH] block: Call .initialize_rq_fn() also for filesystem requests
On 08/29/2017 02:57 PM, Bart Van Assche wrote:
> On Mon, 2017-08-28 at 10:10 +0200, Christoph Hellwig wrote:
>> All that can be done by clearing or setting a flag on the first call to
>> ->queue_rq or ->queuecommand instead.  In NVMe we use RQF_DONTPREP for
>> that, but for SCSI we probably can't use that given that it has more
>> meaning for the old request path.  But how about just adding a new
>> RQD_DRV_INITIALIZED or similar flag?
>
> Hello Christoph,
>
> More changes would have to be made to implement the above proposal
> than just setting a flag at the start of .queue_rq() / .queuecommand()
> for all filesystem requests.  The numerous code paths that lead to a
> request not being executed immediately, e.g. a SCSI host being busy or
> a preparation status != BLKPREP_OK, would have to be inspected and
> code would have to be added to clear the "command initialized" flag to
> ensure that initialization occurs shortly before the first time a
> command is executed.
>
> The choice we have to make is to add more state information and
> complicated code for keeping that state information up-to-date to the
> SCSI core or making a small and easy to maintain change to the block
> layer core that does not involve any new state information.  That's why
> I'm asking you to reconsider the patch at the start of this e-mail
> thread.

I don't like this addition, and honestly I wasn't a huge fan of adding
->init() hooks to the alloc path when we initially added this earlier
this summer.  The fact that we now have to make this more invasive
doesn't improve the situation at all.

So fwiw, I too would much rather see an implementation based on an RQF_
flag instead.

-- 
Jens Axboe
Re: [PATCH] block: Call .initialize_rq_fn() also for filesystem requests
On Mon, 2017-08-28 at 10:10 +0200, Christoph Hellwig wrote:
> All that can be done by clearing or setting a flag on the first call to
> ->queue_rq or ->queuecommand instead.  In NVMe we use RQF_DONTPREP for
> that, but for SCSI we probably can't use that given that it has more
> meaning for the old request path.  But how about just adding a new
> RQD_DRV_INITIALIZED or similar flag?

Hello Christoph,

More changes would have to be made to implement the above proposal than
just setting a flag at the start of .queue_rq() / .queuecommand() for
all filesystem requests.  The numerous code paths that lead to a request
not being executed immediately, e.g. a SCSI host being busy or a
preparation status != BLKPREP_OK, would have to be inspected, and code
would have to be added to clear the "command initialized" flag to ensure
that initialization occurs shortly before the first time a command is
executed.

The choice we have to make is to add more state information and
complicated code for keeping that state information up-to-date to the
SCSI core, or to make a small and easy to maintain change to the block
layer core that does not involve any new state information.  That's why
I'm asking you to reconsider the patch at the start of this e-mail
thread.

Thanks,

Bart.
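The flag-based scheme under discussion — initialize lazily on first dispatch, and clear the flag on every path that un-prepares the request — can be sketched in miniature. This is illustrative Python, not kernel code; `RQF_DRV_INITIALIZED` and the helper names are made-up stand-ins for whatever flag would actually be added.

```python
RQF_DRV_INITIALIZED = 1 << 0  # hypothetical flag bit

class Request:
    def __init__(self):
        self.flags = 0
        self.init_count = 0  # instrumentation for this example only

    def driver_init(self):
        # stands in for the driver's one-time per-request setup
        self.init_count += 1

def queue_rq(rq):
    # First dispatch of this request: do the one-time initialization.
    if not rq.flags & RQF_DRV_INITIALIZED:
        rq.driver_init()
        rq.flags |= RQF_DRV_INITIALIZED
    # ... actually issue the command here ...

def requeue(rq):
    # Any path that un-prepares the request (busy host, BLKPREP defer)
    # must clear the flag so the next dispatch re-initializes. This is
    # exactly the bookkeeping burden Bart points out above.
    rq.flags &= ~RQF_DRV_INITIALIZED

rq = Request()
queue_rq(rq)
queue_rq(rq)   # already initialized: no second init
requeue(rq)
queue_rq(rq)   # re-initialized after the requeue
print(rq.init_count)  # 2
```

The sketch makes the trade-off concrete: the flag itself is one bit, but its correctness depends on finding every requeue-like path and clearing it there.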
Re: [PATCH] block: Call .initialize_rq_fn() also for filesystem requests
On Tue, 2017-08-29 at 19:16 +0800, Ming Lei wrote:
> Hi Bart,
>
> Did you see perf regression on SRP with smaller jobs after applying
> my patchset V3?
>
> I just run the test with 16 jobs(the system has 16 CPU cores) instead of 64,
> looks not see perf regression on SRP about v4.13-rc6+blk-next(2nd column) VS.
> v4.13-rc6+blk-next+patch V3(3rd column):
>
> ---------------------------------------
>  IOPS(K)   |    NONE     |    NONE
> ---------------------------------------
>  read      |   475.83    |   485.88
> ---------------------------------------
>  randread  |   142.86    |   141.96
> ---------------------------------------
>  write     |   483.9     |   492.39
> ---------------------------------------
>  randwrite |   124.83    |   124.53
> ---------------------------------------
> [ ... ]
> ---------------------------------------
>  LAT(us)   |    NONE     |    NONE
> ---------------------------------------
>  read      |    2.15     |    2.11
> ---------------------------------------
>  randread  |    7.17     |    7.21
> ---------------------------------------
>  write     |    2.11     |    2.08
> ---------------------------------------
>  randwrite |    8.2      |    8.22
> ---------------------------------------
> [ ... ]

Hello Ming,

Although I would prefer to see measurement data against an SRP target
system that supports a higher workload (>1M IOPS) and also for a
high-end NVMe drive, I think the above data is sufficient to show that
the performance impact of your patch series is most likely small enough
even for high-end SCSI initiator drivers.

Bart.
Re: [PATCH 1/2] bfq: Re-enable auto-loading when built as a module
On Tue, 2017-08-29 at 10:46 -0600, Jens Axboe wrote:
> On 08/29/2017 10:34 AM, Ben Hutchings wrote:
> > On Tue, 2017-08-29 at 09:53 -0600, Jens Axboe wrote:
> > > On 08/29/2017 09:48 AM, Jens Axboe wrote:
> > > > On 08/29/2017 09:28 AM, Ben Hutchings wrote:
> > > > > On Tue, 2017-08-29 at 08:31 -0600, Jens Axboe wrote:
> > > > > > On Sun, Aug 13 2017, Ben Hutchings wrote:
> > > > > > > The block core requests modules with the "-iosched" name suffix, but
> > > > > > > bfq no longer has that suffix.  Add an alias.
> > > > > >
> > > > > > I'd apply these two, but both patches are mangled when saved.  It's
> > > > > > turning == into =3D and so forth.
> > > > > >
> > > > > > Care to check your settings and resend?
> > > > >
> > > > > Just tried saving and applying with 'git am' successfully.  I think the
> > > > > problem is at your end.
> > > >
> > > > Then yours is the only one, I apply patches people send me all day long.
> > > > Was the case both in tbird and mutt, both of them showed the diffs
> > > > as mangled, and they showed up mangled when saved.
> > >
> > > Here's your email in the archive:
> > >
> > > https://marc.info/?l=linux-block&m=150264374920778&q=raw
> > >
> > > Note this part:
> > >
> > > Content-Transfer-Encoding: quoted-printable
> >
> > What about it?  This is used for every mail with a non-ASCII name in
> > it, for example.  'git am' understands it.
>
> What about it?  It screws up the patch.  Maybe git am understands it, but
> it's hard/impossible to read manually.

Where, other than the 'raw' link above, are you seeing the patch with
q-p encoding not decoded?

> I'm not going to apply anything
> that I can't personally read/review easily.  Fix your setup, if you are
> going to be sending patches.

So you won't accept patches from anyone with a non-ASCII name?

Ben.

> > Problem is definitely at your end.
>
> Or perhaps in the middle?  Anyway, here are the patches again as an
> mbox.

> Feel free to browse other patches on the list, I don't see any that
> are quoted-printable.  And I'd know, since I generally end up applying
> (or at least reviewing) most of them.

-- 
Ben Hutchings
Teamwork is essential - it allows you to blame someone else.
Re: [PATCH] bsg: remove #if 0'ed code
On 08/29/2017 10:48 AM, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig

Applied, thanks.

-- 
Jens Axboe
[PATCH] bsg: remove #if 0'ed code
Signed-off-by: Christoph Hellwig
---
 block/bsg.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/block/bsg.c b/block/bsg.c
index 37663b664666..ee1335c68de7 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -932,15 +932,8 @@ static long bsg_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		return ret;
 	}
 
-	/*
-	 * block device ioctls
-	 */
 	default:
-#if 0
-		return ioctl_by_bdev(bd->bdev, cmd, arg);
-#else
 		return -ENOTTY;
-#endif
 	}
 }
-- 
2.11.0
Re: [PATCH 1/2] bfq: Re-enable auto-loading when built as a module
On 08/29/2017 10:34 AM, Ben Hutchings wrote:
> On Tue, 2017-08-29 at 09:53 -0600, Jens Axboe wrote:
>> On 08/29/2017 09:48 AM, Jens Axboe wrote:
>>> On 08/29/2017 09:28 AM, Ben Hutchings wrote:
>>>> On Tue, 2017-08-29 at 08:31 -0600, Jens Axboe wrote:
>>>>> On Sun, Aug 13 2017, Ben Hutchings wrote:
>>>>>> The block core requests modules with the "-iosched" name suffix, but
>>>>>> bfq no longer has that suffix.  Add an alias.
>>>>>
>>>>> I'd apply these two, but both patches are mangled when saved.  It's
>>>>> turning == into =3D and so forth.
>>>>>
>>>>> Care to check your settings and resend?
>>>>
>>>> Just tried saving and applying with 'git am' successfully.  I think the
>>>> problem is at your end.
>>>
>>> Then yours is the only one, I apply patches people send me all day long.
>>> Was the case both in tbird and mutt, both of them showed the diffs
>>> as mangled, and they showed up mangled when saved.
>>
>> Here's your email in the archive:
>>
>> https://marc.info/?l=linux-block&m=150264374920778&q=raw
>>
>> Note this part:
>>
>> Content-Transfer-Encoding: quoted-printable
>
> What about it?  This is used for every mail with a non-ASCII name in
> it, for example.  'git am' understands it.

What about it?  It screws up the patch.  Maybe git am understands it,
but it's hard/impossible to read manually.  I'm not going to apply
anything that I can't personally read/review easily.  Fix your setup,
if you are going to be sending patches.

>> Problem is definitely at your end.
>
> Or perhaps in the middle?  Anyway, here are the patches again as an
> mbox.

Feel free to browse other patches on the list, I don't see any that
are quoted-printable.  And I'd know, since I generally end up applying
(or at least reviewing) most of them.

-- 
Jens Axboe
Re: [PATCH 1/2] bfq: Re-enable auto-loading when built as a module
On Tue, 2017-08-29 at 09:53 -0600, Jens Axboe wrote:
> On 08/29/2017 09:48 AM, Jens Axboe wrote:
> > On 08/29/2017 09:28 AM, Ben Hutchings wrote:
> > > On Tue, 2017-08-29 at 08:31 -0600, Jens Axboe wrote:
> > > > On Sun, Aug 13 2017, Ben Hutchings wrote:
> > > > > The block core requests modules with the "-iosched" name suffix, but
> > > > > bfq no longer has that suffix.  Add an alias.
> > > >
> > > > I'd apply these two, but both patches are mangled when saved.  It's
> > > > turning == into =3D and so forth.
> > > >
> > > > Care to check your settings and resend?
> > >
> > > Just tried saving and applying with 'git am' successfully.  I think the
> > > problem is at your end.
> >
> > Then yours is the only one, I apply patches people send me all day long.
> > Was the case both in tbird and mutt, both of them showed the diffs
> > as mangled, and they showed up mangled when saved.
>
> Here's your email in the archive:
>
> https://marc.info/?l=linux-block&m=150264374920778&q=raw
>
> Note this part:
>
> Content-Transfer-Encoding: quoted-printable

What about it?  This is used for every mail with a non-ASCII name in
it, for example.  'git am' understands it.

> Problem is definitely at your end.

Or perhaps in the middle?  Anyway, here are the patches again as an
mbox.

Ben.

-- 
Ben Hutchings
Teamwork is essential - it allows you to blame someone else.

[Attachment: block-sched-module-aliases.mbox (application/mbox)]
[PATCH] blkcg: check pol->cpd_free_fn before free cpd
check pol->cpd_free_fn() instead of pol->cpd_alloc_fn() when freeing cpd.

Signed-off-by: weiping zhang
---
 block/blk-cgroup.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 0480892..adcbc3e 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1044,7 +1044,7 @@ static void blkcg_css_free(struct cgroup_subsys_state *css)
 	list_del(&blkcg->all_blkcgs_node);
 
 	for (i = 0; i < BLKCG_MAX_POLS; i++)
-		if (blkcg->cpd[i])
+		if (blkcg->cpd[i] && blkcg_policy[i]->cpd_free_fn)
 			blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
 
 	mutex_unlock(&blkcg_pol_mutex);
@@ -1109,7 +1109,7 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 
 free_pd_blkcg:
 	for (i--; i >= 0; i--)
-		if (blkcg->cpd[i])
+		if (blkcg->cpd[i] && blkcg_policy[i]->cpd_free_fn)
 			blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
 free_blkcg:
 	kfree(blkcg);
@@ -1450,7 +1450,7 @@ int blkcg_policy_register(struct blkcg_policy *pol)
 	return 0;
 
 err_free_cpds:
-	if (pol->cpd_alloc_fn) {
+	if (pol->cpd_free_fn) {
 		list_for_each_entry(blkcg, &all_blkcgs, all_blkcgs_node) {
 			if (blkcg->cpd[pol->plid]) {
 				pol->cpd_free_fn(blkcg->cpd[pol->plid]);
@@ -1490,7 +1490,7 @@ void blkcg_policy_unregister(struct blkcg_policy *pol)
 	/* remove cpds and unregister */
 	mutex_lock(&blkcg_pol_mutex);
 
-	if (pol->cpd_alloc_fn) {
+	if (pol->cpd_free_fn) {
 		list_for_each_entry(blkcg, &all_blkcgs, all_blkcgs_node) {
 			if (blkcg->cpd[pol->plid]) {
 				pol->cpd_free_fn(blkcg->cpd[pol->plid]);
-- 
2.9.4
Re: [PATCH] block: Make blk_dequeue_request() static
On 08/28/2017 08:54 PM, Damien Le Moal wrote:
> The only caller of this function is blk_start_request() in the same
> file. Fix blk_start_request() description accordingly.

Applied, thanks.

-- 
Jens Axboe
Re: [PATCH 1/2] bfq: Re-enable auto-loading when built as a module
On 08/29/2017 09:28 AM, Ben Hutchings wrote:
> On Tue, 2017-08-29 at 08:31 -0600, Jens Axboe wrote:
>> On Sun, Aug 13 2017, Ben Hutchings wrote:
>>> The block core requests modules with the "-iosched" name suffix, but
>>> bfq no longer has that suffix.  Add an alias.
>>
>> I'd apply these two, but both patches are mangled when saved.  It's
>> turning == into =3D and so forth.
>>
>> Care to check your settings and resend?
>
> Just tried saving and applying with 'git am' successfully.  I think the
> problem is at your end.

Then yours is the only one, I apply patches people send me all day long.
Was the case both in tbird and mutt, both of them showed the diffs
as mangled, and they showed up mangled when saved.

-- 
Jens Axboe
Re: [PATCH 0/2] skd: Two additional patches for kernel v4.14
On 08/29/2017 09:32 AM, Bart Van Assche wrote:
> Hello Jens,
>
> Further code review of the skd driver resulted in two additional
> patches. Please consider these for kernel v4.14.

Looks good to me, applied, thanks.

-- 
Jens Axboe
Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
On Tue, Aug 29, 2017 at 04:55:59PM +0200, Christoph Hellwig wrote:
> On Tue, Aug 29, 2017 at 10:54:17AM -0400, Keith Busch wrote:
> > It also looks like new submissions will get a new path only from the
> > fact that the original/primary is being reset. The controller reset
> > itself seems a bit heavy-handed. Can we just set head->current_path to
> > the next active controller in the list?
>
> For ANA we'll have to do that anyway, but if we got a failure
> that clearly indicates a path failure what benefit is there in not
> resetting the controller?  But yeah, maybe we can just switch the
> path for non-ANA controllers and wait for timeouts to do their work.

Okay, sounds reasonable.

Speaking of timeouts, nvme_req_needs_retry will fail the command
immediately rather than try the alternate path if it was cancelled due
to timeout handling. Should we create a new construct for a command's
total time, separate from the recovery timeout, so we may try alternate
paths?
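The "just set head->current_path to the next active controller" idea amounts to a wrap-around scan over the live members of the path list. Here is a toy sketch of that selection policy in Python; the names (`current_path`, the `nvmeX` identifiers) are borrowed from the discussion for illustration, not from the eventual kernel code.

```python
def next_active_path(paths, current, is_live):
    """Pick the next live path after `current`, wrapping around.

    paths   : ordered list of path identifiers (one per controller)
    current : the path that just failed (or None)
    is_live : predicate telling whether a path is currently usable
    """
    if not paths:
        return None
    start = (paths.index(current) + 1) if current in paths else 0
    for i in range(len(paths)):
        candidate = paths[(start + i) % len(paths)]
        if is_live(candidate):
            return candidate
    return None  # no usable path left

paths = ["nvme0", "nvme1", "nvme2"]
live = {"nvme0": False, "nvme1": True, "nvme2": True}
print(next_active_path(paths, "nvme0", lambda p: live[p]))  # nvme1
```

The attraction over a full controller reset is that failover is just this pointer update; the failed controller's recovery (reset, timeouts) can then proceed independently of new submissions.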
[PATCH 1/2] skd: Remove blk_queue_bounce_limit() call
Since sTec s1120 devices support 64-bit DMA it is not necessary to
request data buffer bouncing. Hence remove the blk_queue_bounce_limit()
call.

Suggested-by: Christoph Hellwig
Signed-off-by: Bart Van Assche
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 drivers/block/skd_main.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index 00a86252b3c5..f987ff601a4c 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -2844,7 +2844,6 @@ static int skd_cons_disk(struct skd_device *skdev)
 		rc = PTR_ERR(q);
 		goto err_out;
 	}
-	blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
 	q->queuedata = skdev;
 
 	q->nr_requests = skd_max_queue_depth / 2;
-- 
2.14.1
[PATCH 2/2] skd: Let the block layer core choose .nr_requests
Since blk_mq_init_queue() initializes .nr_requests to the tag set size
and since that value is a good default for the skd driver, do not
overwrite the value set by blk_mq_init_queue(). This change doubles the
default value of .nr_requests.

Signed-off-by: Bart Van Assche
Cc: Christoph Hellwig
Cc: Hannes Reinecke
Cc: Johannes Thumshirn
---
 drivers/block/skd_main.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index f987ff601a4c..7cedb4295e9d 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -2845,7 +2845,6 @@ static int skd_cons_disk(struct skd_device *skdev)
 		goto err_out;
 	}
 	q->queuedata = skdev;
-	q->nr_requests = skd_max_queue_depth / 2;
 
 	skdev->queue = q;
 	disk->queue = q;
-- 
2.14.1
[PATCH 0/2] skd: Two additional patches for kernel v4.14
Hello Jens,

Further code review of the skd driver resulted in two additional
patches. Please consider these for kernel v4.14.

Thanks,

Bart.

Bart Van Assche (2):
  skd: Remove blk_queue_bounce_limit() call
  skd: Let the block layer core choose .nr_requests

 drivers/block/skd_main.c | 2 --
 1 file changed, 2 deletions(-)

-- 
2.14.1
Re: [PATCH] blk-mq: Improvements to the hybrid polling sleep time calculation
>> From: Stephen Bates
>>
>> Hybrid polling currently uses half the average completion time as an
>> estimate of how long to poll for. We can improve upon this by noting
>> that polling before the minimum completion time makes no sense. Add a
>> sysfs entry to use this fact to improve CPU utilization in certain
>> cases.
>>
>> At the same time the minimum is a bit too long to sleep for since we
>> must factor in OS wake time for the thread. For now allow the user to
>> set this via a second sysfs entry (in nanoseconds).
>>
>> Testing this patch on Intel Optane SSDs showed that using the minimum
>> rather than half reduced CPU utilization from 59% to 38%. Tuning
>> this via the wake time adjustment allowed us to trade CPU load for
>> latency. For example
>>
>> io_poll  delay  hyb_use_min  adjust  latency  CPU load
>> 1        -1     N/A          N/A     8.4      100%
>> 1        0      0            N/A     8.4      57%
>> 1        0      1            0       10.3     34%
>> 1        9      1            1000    9.9      37%
>> 1        0      1            2000    8.4      47%
>> 1        0      1            1       8.4      100%
>>
>> Ideally we will extend this to auto-calculate the wake time rather
>> than have it set by the user.
>
> I don't like this, it's another weird knob that will exist but that
> no one will know how to use. For most of the testing I've done
> recently, hybrid is a win over busy polling - hence I think we should
> make that the default. 60% of mean has also, in testing, been shown
> to be a win. So that's an easy fix/change we can consider.

I do agree that this is a hard knob to tune. I am however not happy
that the current hybrid default may mean we are polling well before the
minimum completion time. That just seems like a waste of CPU resources
to me. I do agree that turning on hybrid as the default and perhaps
bumping up the default is a good idea.

> To go beyond that, I'd much rather see us tracking the time waste.
> If we consider the total completion time of an IO to be A+B+C, where:
>
> A  Time needed to go to sleep
> B  Sleep time
> C  Time needed to wake up
>
> then we could feasibly track A+C.
> We already know how long the IO
> will take to complete, as we track that. At that point we'd have
> a full picture of how long we should sleep.

Yes, this is where I was thinking of taking this functionality in the
long term. It seems like tracking C is something other parts of the
kernel might need. Does anyone know of any existing code in this space?

> Bonus points for informing the lower level scheduler of this as
> well. If the CPU is going idle, we'll enter some sort of power
> state in the processor. If we were able to pass in how long we
> expect to sleep, we could be making better decisions here.

Yup. Again, this seems like something more general than just the block
layer. I will do some digging and see if anything is available to
leverage here.

Cheers

Stephen
Re: [PATCH 1/2] bfq: Re-enable auto-loading when built as a module
On Tue, 2017-08-29 at 08:31 -0600, Jens Axboe wrote:
> On Sun, Aug 13 2017, Ben Hutchings wrote:
> > The block core requests modules with the "-iosched" name suffix, but
> > bfq no longer has that suffix. Add an alias.
>
> I'd apply these two, but both patches are mangled when saved. It's
> turning == into =3D and so forth.
>
> Care to check your settings and resend?

Just tried saving and applying with 'git am' successfully. I think the
problem is at your end.

Ben.

-- 
Ben Hutchings
Teamwork is essential - it allows you to blame someone else.
Re: [PATCH] block: Make blk_dequeue_request() static
On Tue, 2017-08-29 at 11:54 +0900, Damien Le Moal wrote: > The only caller of this function is blk_start_request() in the same > file. Fix blk_start_request() description accordingly. Reviewed-by: Bart Van Assche
Re: [RFC] block/loop: make loop cgroup aware
Hello, Shaohua. On Tue, Aug 29, 2017 at 08:22:36AM -0700, Shaohua Li wrote: > > Yeah, writeback tracks the most active cgroup and associates writeback > > ios with that cgroup. For buffered loop devices, I think it'd be fine > > to make the loop driver forward the cgroup association and let the > > writeback layer determine the matching association as usual. > > Doing this means we must forward cgroup info to page, not bio. I need to check > if we can make the mechanism work for memcg. The association is already coming from the page. We just need to make sure that going through loop driver doesn't get in the way of the membership getting propagated to the underlying device. > > Hmm... why do we need double forwarding (original issuer -> aio cmd -> > > ios) in loop? If we need this, doesn't this mean that we're missing > > ownership forwarding in aio paths and should make the forwarding > > happen there? > > what do you mean double forwarding? So, this looks like the loop driver is explicitly forwarding the association from the original issuer to the aio command and then from the aio command to the ios to the underlying device. I'm wondering whether the right way to do this is making aio forward the association by default, instead of the loop driver doing it explicitly. That way, all aio users can benefit from the forwarding instead of just loop. Thanks. -- tejun
Re: [RFC] block/loop: make loop cgroup aware
On Mon, Aug 28, 2017 at 03:54:59PM -0700, Tejun Heo wrote: > Hello, Shaohua. > > On Wed, Aug 23, 2017 at 11:15:15AM -0700, Shaohua Li wrote: > > loop block device handles IO in a separate thread. The actual IO > > dispatched isn't cloned from the IO loop device received, so the > > dispatched IO loses the cgroup context. > > Ah, right. Thanks for spotting this. > > > I'm ignoring buffer IO case now, which is quite complicated. Making the > > loop thread aware of cgroup context doesn't really help. The loop device > > only writes to a single file. In current writeback cgroup > > implementation, the file can only belong to one cgroup. > > Yeah, writeback tracks the most active cgroup and associates writeback > ios with that cgroup. For buffered loop devices, I think it'd be fine > to make the loop driver forward the cgroup association and let the > writeback layer determine the matching association as usual. Doing this means we must forward cgroup info to page, not bio. I need to check if we can make the mechanism work for memcg. > > For direct IO case, we could workaround the issue in theory. For > > example, say we assign cgroup1 5M/s BW for loop device and cgroup2 > > 10M/s. We can create a special cgroup for loop thread and assign at > > least 15M/s for the underlayer disk. In this way, we correctly throttle > > the two cgroups. But this is tricky to setup. > > > > This patch tries to address the issue. When loop thread is handling IO, > > it declares the IO is on behalf of the original task, then in block IO > > we use the original task to find cgroup. The concept probably works for > > other scenarios too, but right now I don't make it generic yet. > > The overall approach looks sound to me. Some comments below. 
> > > @@ -2058,7 +2058,10 @@ int bio_associate_current(struct bio *bio) > > > > get_io_context_active(ioc); > > bio->bi_ioc = ioc; > > - bio->bi_css = task_get_css(current, io_cgrp_id); > > + if (current->cgroup_task) > > + bio->bi_css = task_get_css(current->cgroup_task, io_cgrp_id); > > + else > > + bio->bi_css = task_get_css(current, io_cgrp_id); > > Do we need a full pointer field for this? I think we should be able > to mark lo writers with a flag (PF or whatever) and then to > kthread_data() to get the lo struct which contains the target css. > Oh, let's do csses instead of tasks for consistency, correctness > (please see below) and performance (csses are cheaper to pin). Forwarding css sounds better. > > @@ -502,11 +504,16 @@ static int lo_rw_aio(struct loop_device *lo, struct > > loop_cmd *cmd, > > cmd->iocb.ki_complete = lo_rw_aio_complete; > > cmd->iocb.ki_flags = IOCB_DIRECT; > > > > + if (cmd->cgroup_task) > > + current->cgroup_task = cmd->cgroup_task; > > + > > if (rw == WRITE) > > ret = call_write_iter(file, >iocb, ); > > else > > ret = call_read_iter(file, >iocb, ); > > > > + current->cgroup_task = NULL; > > + > > if (ret != -EIOCBQUEUED) > > cmd->iocb.ki_complete(>iocb, ret, 0); > > return 0; > > Hmm... why do we need double forwarding (original issuer -> aio cmd -> > ios) in loop? If we need this, doesn't this mean that we're missing > ownership forwarding in aio paths and should make the forwarding > happen there? what do you mean double forwarding? > > > @@ -1705,6 +1712,11 @@ static blk_status_t loop_queue_rq(struct > > blk_mq_hw_ctx *hctx, > > break; > > } > > > > + if (cmd->use_aio) { > > + cmd->cgroup_task = current; > > + get_task_struct(current); > > + } else > > + cmd->cgroup_task = NULL; > > What if %current isn't the original issuer of the io? It could be a > writeback worker trying to flush to a loop device and we don't want to > attribute those ios to the writeback worker. We should forward the > bi_css not the current task. 
Makes sense. Thanks, Shaohua
Re: [PATCH V2 2/2] block/loop: allow request merge for directio mode
On Tue, Aug 29, 2017 at 05:56:05PM +0800, Ming Lei wrote: > On Thu, Aug 24, 2017 at 12:24:53PM -0700, Shaohua Li wrote: > > From: Shaohua Li> > > > Currently loop disables merge. While it makes sense for buffer IO mode, > > directio mode can benefit from request merge. Without merge, loop could > > send small size IO to underlayer disk and harm performance. > > Hi Shaohua, > > IMO no matter if merge is used, loop always sends page by page > to VFS in both dio or buffer I/O. Why do you think so? > Also if merge is enabled on loop, that means merge is run > on both loop and low level block driver, and not sure if we > can benefit from that. why does merge still happen in low level block driver? > > So Could you provide some performance data about this patch? In my virtual machine, a workload improves from ~20M/s to ~50M/s. And I clearly see the request size becomes bigger. > > > > Reviewed-by: Omar Sandoval > > Signed-off-by: Shaohua Li > > --- > > drivers/block/loop.c | 66 > > > > drivers/block/loop.h | 1 + > > 2 files changed, 52 insertions(+), 15 deletions(-) > > > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c > > index 428da07..75a8f6e 100644 > > --- a/drivers/block/loop.c > > +++ b/drivers/block/loop.c > > @@ -213,10 +213,13 @@ static void __loop_update_dio(struct loop_device *lo, > > bool dio) > > */ > > blk_mq_freeze_queue(lo->lo_queue); > > lo->use_dio = use_dio; > > - if (use_dio) > > + if (use_dio) { > > + queue_flag_clear_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue); > > lo->lo_flags |= LO_FLAGS_DIRECT_IO; > > - else > > + } else { > > + queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue); > > lo->lo_flags &= ~LO_FLAGS_DIRECT_IO; > > + } > > blk_mq_unfreeze_queue(lo->lo_queue); > > } > > > > @@ -464,6 +467,8 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long > > ret, long ret2) > > { > > struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb); > > > > + kfree(cmd->bvec); > > + cmd->bvec = NULL; > > cmd->ret = ret; > > 
blk_mq_complete_request(cmd->rq); > > } > > @@ -473,22 +478,50 @@ static int lo_rw_aio(struct loop_device *lo, struct > > loop_cmd *cmd, > > { > > struct iov_iter iter; > > struct bio_vec *bvec; > > - struct bio *bio = cmd->rq->bio; > > + struct request *rq = cmd->rq; > > + struct bio *bio = rq->bio; > > struct file *file = lo->lo_backing_file; > > + unsigned int offset; > > + int segments = 0; > > int ret; > > > > - /* nomerge for loop request queue */ > > - WARN_ON(cmd->rq->bio != cmd->rq->biotail); > > + if (rq->bio != rq->biotail) { > > + struct req_iterator iter; > > + struct bio_vec tmp; > > + > > + __rq_for_each_bio(bio, rq) > > + segments += bio_segments(bio); > > + bvec = kmalloc(sizeof(struct bio_vec) * segments, GFP_KERNEL); > > The allocation should have been GFP_NOIO. Sounds good. To make this completely correct isn't easy though, we are calling into the underlayer fs operations which can do allocation. Thanks, Shaohua
Re: [GIT PULL] nvme update for Linux 4.14, take 2
On 08/29/2017 09:05 AM, Christoph Hellwig wrote: > Hi Jens, > > below is the current set of NVMe updates for Linux 4.14, now against your > postmerge branch, and with three more patches. > > The biggest bit comes from Sagi and refactors the RDMA driver to prepare > for more code sharing in the setup and teardown path. But we have various > features and bug fixes from a lot of people as well. Pulled, thanks. -- Jens Axboe
[GIT PULL] nvme update for Linux 4.14, take 2
Hi Jens, below is the current set of NVMe updates for Linux 4.14, now against your postmerge branch, and with three more patches. The biggest bit comes from Sagi and refactors the RDMA driver to prepare for more code sharing in the setup and teardown path. But we have various features and bug fixes from a lot of people as well. The following changes since commit cd996fb47c360320cf25ac9503c16de085ea9cfc: Merge tag 'v4.13-rc7' into for-4.14/block-postmerge (2017-08-28 13:00:44 -0600) are available in the git repository at: git://git.infradead.org/nvme.git nvme-4.14 for you to fetch changes up to 1d5df6af8c7469f9ae3e66e7bed0782cfe4f95db: nvme: don't blindly overwrite identifiers on disk revalidate (2017-08-29 10:23:04 +0200) Arnav Dawn (2): nvme: add support for FW activation without reset nvme: define NVME_NSID_ALL Christoph Hellwig (6): nvmet: use NVME_NSID_ALL nvme: report more detailed status codes to the block layer nvme: allow calling nvme_change_ctrl_state from irq context nvme: remove unused struct nvme_ns fields nvme: remove nvme_revalidate_ns nvme: don't blindly overwrite identifiers on disk revalidate Guan Junxiong (2): nvmet: fix the return error code of target if host is not allowed nvme-fabrics: log a warning if hostid is invalid James Smart (2): nvme-fc: Reattach to localports on re-registration nvmet-fc: simplify sg list handling Jan H. Schönherr (1): nvme: fix uninitialized prp2 value on small transfers Johannes Thumshirn (2): nvmet-fcloop: remove ALL_OPTS define nvme-rdma: remove NVME_RDMA_MAX_SEGMENT_SIZE Jon Derrick (1): nvme: add support for NVMe 1.3 Timestamp Feature Martin K. 
Petersen (1): nvme: honor RTD3 Entry Latency for shutdowns Martin Wilck (2): string.h: add memcpy_and_pad() nvmet: use memcpy_and_pad for identify sn/fr Max Gurtovoy (3): nvme: add symbolic constants for CC identifiers nvme: rename AMS symbolic constants to fit specification nvme-rdma: Use unlikely macro in the fast path Sagi Grimberg (13): nvme-rdma: move nvme_rdma_configure_admin_queue code location nvme: Add admin_tagset pointer to nvme_ctrl nvme-rdma: move tagset allocation to a dedicated routine nvme-rdma: disable the controller on resets nvme-rdma: don't free tagset on resets nvme-rdma: reuse configure/destroy_admin_queue nvme-rdma: introduce configure/destroy io queues nvme-rdma: stop queues instead of simply flipping their state nvme-rdma: rename nvme_rdma_init_queue to nvme_rdma_alloc_queue nvme-rdma: introduce nvme_rdma_start_queue nvme-rdma: cleanup error path in controller reset nvme-rdma: call ops->reg_read64 instead of nvmf_reg_read64 nvme: fix identify namespace logging drivers/nvme/host/core.c| 242 + drivers/nvme/host/fabrics.c | 1 + drivers/nvme/host/fc.c | 145 --- drivers/nvme/host/nvme.h| 10 +- drivers/nvme/host/pci.c | 5 +- drivers/nvme/host/rdma.c| 564 drivers/nvme/target/admin-cmd.c | 16 +- drivers/nvme/target/configfs.c | 2 +- drivers/nvme/target/core.c | 15 +- drivers/nvme/target/fc.c| 48 +--- drivers/nvme/target/fcloop.c| 3 - drivers/nvme/target/loop.c | 1 + include/linux/nvme-fc-driver.h | 2 +- include/linux/nvme.h| 37 ++- include/linux/string.h | 30 +++ 15 files changed, 677 insertions(+), 444 deletions(-)
[PATCH] blk-mq-sched: remove the empty entry in token wait list
From: Bin Zha

When kyber adjusts the sync and other write depth to the minimum (1),
there is a case that may cause requests to be stalled in the
kyber_hctx_data list.

The following example I have tested:

CPU7:  block_rq_insert -> add_wait_queue -> __sbitmap_queue_get
       (waiting for a token to be freed)
CPU23: block_rq_issue -> block_rq_insert -> block_rq_complete ->
       block_rq_issue -> block_rq_complete
CPU29: block_rq_insert -> (waiting for a token to be freed) ->
       block_rq_issue -> block_rq_complete
CPU1:  block_rq_insert -> (waiting for a token to be freed)

The IO request completion on CPU29 will wake up CPU7, because it has
been added to the waitqueue in kyber_get_domain_token. But it is an
empty waiter, and it won't wake up CPU1. If no other requests are
issued to push it, the requests will stall in the kyber_hctx_data list.

Signed-off-by: Bin Zha

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index b9faabc..584bfd5 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -548,6 +548,10 @@ static int kyber_get_domain_token(struct kyber_queue_data *kqd,
 		 * queue.
 		 */
 		nr = __sbitmap_queue_get(domain_tokens);
+		if (nr >= 0) {
+			remove_wait_queue(&ws->wait, wait);
+			INIT_LIST_HEAD(&wait->task_list);
+		}
 	}
 
 	return nr;
 }
Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
On Tue, Aug 29, 2017 at 06:22:50PM +0800, Guan Junxiong wrote: > Does this introduce conflicts with current DM-Multipath used for NVMe/NVMeF > when path IO error occurs? Such IO will be retried not only on the nvme-mpath > internal path, but also on the dm-mpath path. It will not reach back to dm-multipath if we fail over here.
Re: [GIT PULL] nvme update for Linux 4.14
On Tue, Aug 29, 2017 at 08:34:29AM -0600, Jens Axboe wrote: > Let me know when it's ready. I'll send you another pull request in a bit.
Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
On Wed, Aug 23, 2017 at 07:58:15PM +0200, Christoph Hellwig wrote:
> +	/* Anything else could be a path failure, so should be retried */
> +	spin_lock_irqsave(&ns->head->requeue_lock, flags);
> +	blk_steal_bios(&ns->head->requeue_list, req);
> +	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
> +
> +	nvme_reset_ctrl(ns->ctrl);
> +	kblockd_schedule_work(&ns->head->requeue_work);
> +	return true;
> +}

It appears this isn't going to cause the path selection to failover for
the requeued work. The bio's bi_disk is unchanged from the failed path
when the requeue_work submits the bio again, so it will use the same
path, right?

It also looks like new submissions will get a new path only from the
fact that the original/primary is being reset. The controller reset
itself seems a bit heavy-handed. Can we just set head->current_path to
the next active controller in the list?

> +static void nvme_requeue_work(struct work_struct *work)
> +{
> +	struct nvme_ns_head *head =
> +		container_of(work, struct nvme_ns_head, requeue_work);
> +	struct bio *bio, *next;
> +
> +	spin_lock_irq(&head->requeue_lock);
> +	next = bio_list_get(&head->requeue_list);
> +	spin_unlock_irq(&head->requeue_lock);
> +
> +	while ((bio = next) != NULL) {
> +		next = bio->bi_next;
> +		bio->bi_next = NULL;
> +		generic_make_request_fast(bio);
> +	}
> +}

Here, I think we need to reevaluate the path (nvme_find_path) and set
bio->bi_disk accordingly.
Re: [GIT PULL] nvme update for Linux 4.14
On 08/28/2017 01:56 PM, Sagi Grimberg wrote:
>>>> Meh, I don't think that'll work out - the rdma refactoring from
>>>> Sagi deeply depends on the uninit_ctrl split..
>>>
>>> I can do a for-4.14/block-postmerge branch, which is for-4.14/block
>>> with -rc7 pulled in. Should apply on top of that.
>>>
>>> Did that branch, doesn't apply. But should be easier for you to
>>> rebase on top of that branch. Pushed it out.
>>
>> I've pushed out a nvme-4.14-rebase branch with the rebase, but I didn't
>> have time for non-trivial testing yet, that'll have to wait for
>> tomorrow.
>>
>> Sagi: can you double check the resolutions around blk_mq_reinit_tagset?
>
> Thanks for doing it Christoph,
>
> It looks good to me, there is one glitch in patch:
> "nvme-rdma: reuse configure/destroy_admin_queue"
>
> The "nvme_rdma_configure_admin_queue" block needs to
> be squashed into "nvme-rdma: don't free tagset on resets"
>
> I'll do that now.

Let me know when it's ready.

-- 
Jens Axboe
Re: [PATCH 1/2] bfq: Re-enable auto-loading when built as a module
On Sun, Aug 13 2017, Ben Hutchings wrote: > The block core requests modules with the "-iosched" name suffix, but > bfq no longer has that suffix. Add an alias. I'd apply these two, but both patches are mangled when saved. It's turning == into =3D and so forth. Care to check your settings and resend? -- Jens Axboe
non-blocking buffered reads V5
This series resurrects the old patches from Milosz to implement
non-blocking buffered reads. Thanks to the non-blocking AIO code from
Goldwyn the implementation becomes pretty much trivial.

I've also forward ported the test Milosz sent for recent xfsprogs to
verify that this series works properly, but I'll still have to address
the review comments for it.

I'll also volunteer to work with Goldwyn to properly document the
RWF_NOWAIT flag in the man page including this change.

Changes from V4:
 - improve conditionals in generic_file_buffered_read

Changes from V3:
 - forward ported to the latest kernel
 - fixed a compiler warning

Changes from V2:
 - keep returning -EOPNOTSUPP for the not supported buffered write case
 - add block device node support
 - rebase against current Linus' tree, which has all the requirements

Changes from V1:
 - fix btrfs to reject nowait buffered writes
 - tested btrfs and ext4 in addition to xfs this time

Here are additional details from the original cover letter from Milosz,
where the flag was still called RWF_NONBLOCK:

Background:

Using a threadpool to emulate non-blocking operations on regular
buffered files is a common pattern today (samba, libuv, etc...)
Applications split the work between network bound threads (epoll) and
IO threadpool. Not every application can use sendfile syscall (TLS /
post-processing).

This common pattern leads to increased request latency. Latency can be
due to additional synchronization between the threads or fast (cached
data) request stuck behind slow request (large / uncached data).

The preadv2 syscall with RWF_NONBLOCK lets userspace applications
bypass enqueuing operation in the threadpool if it's already available
in the pagecache.
Performance numbers (newer Samba): https://drive.google.com/file/d/0B3maCn0jCvYncndGbXJKbGlhejQ/view?usp=sharing https://docs.google.com/spreadsheets/d/1GGTivi-MfZU0doMzomG4XUo9ioWtRvOGQ5FId042L6s/edit?usp=sharing Performance number (older): Some perf data generated using fio comparing the posix aio engine to a version of the posix AIO engine that attempts to performs "fast" reads before submitting the operations to the queue. This workflow is on ext4 partition on raid0 (test / build-rig.) Simulating our database access patern workload using 16kb read accesses. Our database uses a home-spun posix aio like queue (samba does the same thing.) f1: ~73% rand read over mostly cached data (zipf med-size dataset) f2: ~18% rand read over mostly un-cached data (uniform large-dataset) f3: ~9% seq-read over large dataset before: f1: bw (KB /s): min= 11, max= 9088, per=0.56%, avg=969.54, stdev=827.99 lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48% lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42% f2: bw (KB /s): min=2, max= 1882, per=0.16%, avg=273.28, stdev=220.26 lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56% lat (msec) : >=2000=4.33% f3: bw (KB /s): min=0, max=265568, per=99.95%, avg=174575.10, stdev=34526.89 lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82% lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55% lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22% lat (msec) : 100=0.05%, 250=0.02%, 500=0.01% total: READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s, mint=61msec, maxt=600113msec after (with fast read using preadv2 before submit): f1: bw (KB /s): min=3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39 lat (usec) : 2=70.63%, 4=0.01% lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53% f2: bw (KB /s): min=2, max= 2362, per=0.14%, avg=249.83, stdev=222.00 lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18% lat (msec) : >=2000=9.99% 
f3: bw (KB /s): min=1, max=245448, per=100.00%, avg=177366.50, stdev=35995.60 lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43% lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35% lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22% lat (msec) : 100=0.05%, 250=0.02% total: READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s, mint=600020msec, maxt=600178msec Interpreting the results you can see total bandwidth stays the same but overall request latency is decreased in f1 (random, mostly cached) and f3 (sequential) workloads. There is a slight bump in latency for since it's random data that's unlikely to be cached but we're always trying "fast read". In our application we have starting keeping track of "fast read" hits/misses and for files / requests that have a lot hit ratio we don't do "fast reads" mostly getting rid of extra latency in the uncached cases. In our real world work load we were able to reduce average response time by 20 to 30%
[PATCH 4/4] block_dev: support RWF_NOWAIT on block device nodes
All support is already there in the generic code, we just need to wire
it up.

Signed-off-by: Christoph Hellwig
Reviewed-by: Jan Kara
---
 fs/block_dev.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9941dc8342df..ea21d18d8e79 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1739,6 +1739,8 @@ static int blkdev_open(struct inode * inode, struct file * filp)
 	 */
 	filp->f_flags |= O_LARGEFILE;
 
+	filp->f_mode |= FMODE_NOWAIT;
+
 	if (filp->f_flags & O_NDELAY)
 		filp->f_mode |= FMODE_NDELAY;
 	if (filp->f_flags & O_EXCL)
@@ -1891,6 +1893,9 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (iocb->ki_pos >= size)
 		return -ENOSPC;
 
+	if ((iocb->ki_flags & (IOCB_NOWAIT | IOCB_DIRECT)) == IOCB_NOWAIT)
+		return -EOPNOTSUPP;
+
 	iov_iter_truncate(from, size - iocb->ki_pos);
 
 	blk_start_plug(&plug);
-- 
2.11.0
[PATCH 3/4] fs: support RWF_NOWAIT for buffered reads
This is based on the old idea and code from Milosz Tanski. With the aio
nowait code it becomes mostly trivial now. Buffered writes continue to
return -EOPNOTSUPP if RWF_NOWAIT is passed.

Signed-off-by: Christoph Hellwig
Reviewed-by: Jan Kara
---
 fs/aio.c           |  6 ------
 fs/btrfs/file.c    |  6 +++++-
 fs/ext4/file.c     |  6 +++---
 fs/xfs/xfs_file.c  | 11 +++++++++--
 include/linux/fs.h |  6 +++---
 5 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index dcad3a66748c..d93daa076726 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1593,12 +1593,6 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 		goto out_put_req;
 	}
 
-	if ((req->common.ki_flags & IOCB_NOWAIT) &&
-	    !(req->common.ki_flags & IOCB_DIRECT)) {
-		ret = -EOPNOTSUPP;
-		goto out_put_req;
-	}
-
 	ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
 	if (unlikely(ret)) {
 		pr_debug("EFAULT: aio_key\n");
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9e75d8a39aac..e62dd55b4079 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1886,6 +1886,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	loff_t oldsize;
 	int clean_page = 0;
 
+	if (!(iocb->ki_flags & IOCB_DIRECT) &&
+	    (iocb->ki_flags & IOCB_NOWAIT))
+		return -EOPNOTSUPP;
+
 	if (!inode_trylock(inode)) {
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			return -EAGAIN;
@@ -3105,7 +3109,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
 
 static int btrfs_file_open(struct inode *inode, struct file *filp)
 {
-	filp->f_mode |= FMODE_AIO_NOWAIT;
+	filp->f_mode |= FMODE_NOWAIT;
 	return generic_file_open(inode, filp);
 }
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 0d7cf0cc9b87..f83521337b8f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -223,6 +223,8 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (IS_DAX(inode))
 		return ext4_dax_write_iter(iocb, from);
 #endif
+	if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
+		return -EOPNOTSUPP;
 
 	if (!inode_trylock(inode)) {
 		if (iocb->ki_flags & IOCB_NOWAIT)
@@ -448,9 +450,7 @@ static int ext4_file_open(struct inode * inode, struct file * filp)
 			return ret;
 	}
 
-	/* Set the flags to support nowait AIO */
-	filp->f_mode |= FMODE_AIO_NOWAIT;
-
+	filp->f_mode |= FMODE_NOWAIT;
 	return dquot_file_open(inode, filp);
 }
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c4893e226fd8..1a09104b3eb0 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -259,7 +259,11 @@ xfs_file_buffered_aio_read(
 
 	trace_xfs_file_buffered_read(ip, iov_iter_count(to), iocb->ki_pos);
 
-	xfs_ilock(ip, XFS_IOLOCK_SHARED);
+	if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) {
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
+		xfs_ilock(ip, XFS_IOLOCK_SHARED);
+	}
 	ret = generic_file_read_iter(iocb, to);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
@@ -636,6 +640,9 @@ xfs_file_buffered_aio_write(
 	int			enospc = 0;
 	int			iolock;
 
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		return -EOPNOTSUPP;
+
 write_retry:
 	iolock = XFS_IOLOCK_EXCL;
 	xfs_ilock(ip, iolock);
@@ -912,7 +919,7 @@ xfs_file_open(
 		return -EFBIG;
 	if (XFS_FORCED_SHUTDOWN(XFS_M(inode->i_sb)))
 		return -EIO;
-	file->f_mode |= FMODE_AIO_NOWAIT;
+	file->f_mode |= FMODE_NOWAIT;
 	return 0;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6e1fd5d21248..ded258edc425 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -146,8 +146,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File was opened by fanotify and shouldn't generate fanotify events */
 #define FMODE_NONOTIFY		((__force fmode_t)0x4000000)
 
-/* File is capable of returning -EAGAIN if AIO will block */
-#define FMODE_AIO_NOWAIT	((__force fmode_t)0x8000000)
+/* File is capable of returning -EAGAIN if I/O will block */
+#define FMODE_NOWAIT		((__force fmode_t)0x8000000)
 
 /*
  * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
@@ -3149,7 +3149,7 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, int flags)
 		return -EOPNOTSUPP;
 
 	if (flags & RWF_NOWAIT) {
-		if (!(ki->ki_filp->f_mode & FMODE_AIO_NOWAIT))
+		if (!(ki->ki_filp->f_mode & FMODE_NOWAIT))
 			return -EOPNOTSUPP;
 		ki->ki_flags |= IOCB_NOWAIT;
 	}
-- 
2.11.0
[PATCH 1/4] fs: pass iocb to do_generic_file_read
And rename it to the more descriptive generic_file_buffered_read while
at it.

Signed-off-by: Christoph Hellwig
Reviewed-by: Jan Kara
---
 mm/filemap.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index a49702445ce0..4bcfa74ad802 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1886,9 +1886,8 @@ static void shrink_readahead_size_eio(struct file *filp,
 }
 
 /**
- * do_generic_file_read - generic file read routine
- * @filp:	the file to read
- * @ppos:	current file position
+ * generic_file_buffered_read - generic file read routine
+ * @iocb:	the iocb to read
  * @iter:	data destination
  * @written:	already copied
  *
@@ -1898,12 +1897,14 @@ static void shrink_readahead_size_eio(struct file *filp,
  * This is really ugly. But the goto's actually try to clarify some
  * of the logic when it comes to error handling etc.
  */
-static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
+static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		struct iov_iter *iter, ssize_t written)
 {
+	struct file *filp = iocb->ki_filp;
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
 	struct file_ra_state *ra = &filp->f_ra;
+	loff_t *ppos = &iocb->ki_pos;
 	pgoff_t index;
 	pgoff_t last_index;
 	pgoff_t prev_index;
@@ -2151,14 +2152,14 @@ static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
 ssize_t
 generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
-	struct file *file = iocb->ki_filp;
-	ssize_t retval = 0;
 	size_t count = iov_iter_count(iter);
+	ssize_t retval = 0;
 
 	if (!count)
 		goto out; /* skip atime */
 
 	if (iocb->ki_flags & IOCB_DIRECT) {
+		struct file *file = iocb->ki_filp;
 		struct address_space *mapping = file->f_mapping;
 		struct inode *inode = mapping->host;
 		loff_t size;
@@ -2199,7 +2200,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		goto out;
 	}
 
-	retval = do_generic_file_read(file, &iocb->ki_pos, iter, retval);
+	retval = generic_file_buffered_read(iocb, iter, retval);
 out:
 	return retval;
 }
-- 
2.11.0
[PATCH 2/4] fs: support IOCB_NOWAIT in generic_file_buffered_read
From: Milosz Tanski

Allow generic_file_buffered_read to bail out early instead of waiting
for the page lock or reading a page if IOCB_NOWAIT is specified.

Signed-off-by: Milosz Tanski
Reviewed-by: Christoph Hellwig
Reviewed-by: Jeff Moyer
Acked-by: Sage Weil
---
 mm/filemap.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 4bcfa74ad802..eed394fd331c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1937,6 +1937,8 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 		page = find_get_page(mapping, index);
 		if (!page) {
+			if (iocb->ki_flags & IOCB_NOWAIT)
+				goto would_block;
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
@@ -1950,6 +1952,11 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 					index, last_index - index);
 		}
 		if (!PageUptodate(page)) {
+			if (iocb->ki_flags & IOCB_NOWAIT) {
+				put_page(page);
+				goto would_block;
+			}
+
 			/*
 			 * See comment in do_read_cache_page on why
 			 * wait_on_page_locked is used to avoid unnecessarily
@@ -2131,6 +2138,8 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 			goto readpage;
 		}
 
+would_block:
+	error = -EAGAIN;
 out:
 	ra->prev_pos = prev_index;
 	ra->prev_pos <<= PAGE_SHIFT;
-- 
2.11.0
Re: [PATCH] block: Call .initialize_rq_fn() also for filesystem requests
Hi Bart,

this isn't just about performance - it's about understandability of the
I/O path. The legacy request path already has the prep_fn, which is
intended for exactly this sort of prep work, but even that led to a lot
of confusion. For blk-mq we decided not to add it and to leave the
driver in control instead. I really don't want to move away from that.

The passthrough requests using blk_get_request are a special case, as
the caller allocates the requests, stuffs data into them and only then
hands control to the driver, and thus we need some way to initialize the
request before handing control to the driver in this particular special
case. And if needed we could actually do that with explicit calls
instead of the callback, although you changed it to save a bit of code.
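The trade-off Christoph describes can be sketched in miniature. All names below are illustrative stand-ins, not real block-layer APIs; the point is only where the initialization cost is paid:

```python
class Request:
    def __init__(self):
        self.driver_data = None

def scsi_initialize_rq(rq):
    # Stand-in for driver-private setup of a pass-through request.
    rq.driver_data = {"cmd": None, "sense": bytearray(96)}

# Model 1 (the proposal being rejected): the block core invokes a
# per-request callback for *every* request, adding an indirect call
# to the filesystem-request fast path as well.
def alloc_request_with_callback(initialize_rq_fn):
    rq = Request()
    initialize_rq_fn(rq)      # indirect call, even for fs requests
    return rq

# Model 2 (what the mail argues for): only the pass-through path,
# where the caller owns the request before the driver sees it,
# initializes explicitly -- the fs fast path pays nothing.
def blk_get_request_passthrough():
    rq = Request()
    scsi_initialize_rq(rq)    # explicit call, special case only
    return rq

rq = blk_get_request_passthrough()
print(rq.driver_data is not None)
```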
Re: [PATCH] block: Make blk_dequeue_request() static
On Tue, Aug 29, 2017 at 11:54:37AM +0900, Damien Le Moal wrote:
> The only caller of this function is blk_start_request() in the same
> file. Fix blk_start_request() description accordingly.
> 
> Signed-off-by: Damien Le Moal

Looks good,

Reviewed-by: Christoph Hellwig
Re: [PATCH 0/4] loop: support different logical block sizes
On Thu, Aug 24, 2017 at 12:03:40AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval
> 
> Hi, everyone,
> 
> Here's another attempt at supporting different logical block sizes for
> loop devices. First, some terminology:
> 
> * Logical block size: the lowest possible block size that the storage
>   device can address.
> * Physical block size: the lowest possible sector size that the
>   hardware can operate on without reverting to read-modify-write
>   operations. Always greater than or equal to the logical block size.
> 
> These are characteristics of the device itself tracked in the
> request_queue. For loop devices, both are currently always 512 bytes.
> 
> struct block_device also has another block size, basically a "soft"
> block size which is the preferred I/O size and can be changed from
> userspace. For loop devices, this is PAGE_SIZE if the backing file is a
> regular file or otherwise the soft block size of the backing block
> device.
> 
> Patch 1 simplifies how the soft block size is set. I'm tempted to not
> set it at all and use the default of PAGE_SIZE, but maybe someone is
> depending on inheriting the soft block size from the backing block
> device...
> 
> Patch 2 sets the physical block size of loop devices to a more
> meaningful value for loop, PAGE_SIZE.
> 
> Patch 3 allows us to change the logical block size. It adds a new ioctl
> instead of the previous approach (which extends LOOP_{GET,SET}_STATUS).
> Hannes, I assume you had a specific reason for doing it the way you did,
> care to explain?
> 
> Patch 4 is a cleanup.
> 
> This is based on my patch reverting the original block size support,
> targeting 4.14. Thanks!
> Omar Sandoval (4):
>   loop: get rid of lo_blocksize
>   loop: set physical block size to PAGE_SIZE
>   loop: add ioctl for changing logical block size
>   loop: fold loop_switch() into callers
> 
>  drivers/block/loop.c      | 112 ++++++++++++++-----------------------
>  drivers/block/loop.h      |   1 -
>  include/uapi/linux/loop.h |   1 +
>  3 files changed, 40 insertions(+), 74 deletions(-)

Looks fine,

Reviewed-by: Ming Lei

-- 
Ming
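A LOOP_SET_BLOCK_SIZE-style ioctl (patch 3) has to sanity-check the requested logical block size: it must be a power of two, at least 512 bytes, and no larger than the physical block size that patch 2 advertises (PAGE_SIZE). A small sketch of that validation, as an assumption about what such a check would look like rather than the kernel's actual code:

```python
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")

def valid_loop_block_size(bsize, page_size=PAGE_SIZE):
    # Power of two, >= 512 bytes, and no larger than the page size
    # (the proposed physical block size). Illustrative sketch only.
    return 512 <= bsize <= page_size and bsize & (bsize - 1) == 0

for bs in (512, 1024, 4096, 768):
    print(bs, valid_loop_block_size(bs))
```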
Re: [PATCH] block: Call .initialize_rq_fn() also for filesystem requests
On Mon, Aug 28, 2017 at 06:31:33PM +, Bart Van Assche wrote: > On Mon, 2017-08-28 at 10:10 +0200, Christoph Hellwig wrote: > > I still disagree that we should have an indirect function call like this > > in the fast path. > > > > All that can be done by clearing or setting a flag on the first call to > > ->queue_rq or ->queuecommand instead. In NVMe we use RQF_DONTPREP for > > that, but for SCSI we probably can't use that given that it has more > > meaning for the old request path. But how about just adding a new > > RQD_DRV_INITIALIZED or similar flag? > > Hello Christoph, > > Sorry but I'm not enthusiast about the proposal to introduce a > RQD_DRV_INITIALIZED or similar flag. That approach involves an annoying > behavior difference, namely that .initialize_rq_fn() would be called from > inside blk_get_request() for pass-through requests and from inside the prep > function for filesystem requests. Another disadvantage of that approach is > that the block layer core never clears request.atomic_flags completely but > only sets and clears individual flags. The SCSI core would have to follow > that model and hence code for clearing RQD_DRV_INITIALIZED would have to be > added to all request completion paths in the SCSI core. > > Have you noticed that Ming Lei's patch series introduces several new atomic > operations in the hot path? I'm referring here to the BLK_MQ_S_DISPATCH_BUSY > manipulations. Have you noticed that for SCSI drivers these patches introduce > an overhead between 0.1 and 1.0 microseconds per I/O request in the hot path? > I have derived these numbers from the random write SRP performance numbers > as follows: 1/142460 - 1/142990 = 2.6 microseconds. That number has to be > multiplied with the number of I/O requests processed in parallel. The number > of jobs in Ming Lei's test was 64 but that's probably way higher than the > actual I/O parallelism. Hi Bart, Did you see perf regression on SRP with smaller jobs after applying my patchset V3? 
I just ran the test with 16 jobs (the system has 16 CPU cores) instead of
64, and I don't see a perf regression on SRP with v4.13-rc6+blk-next
(2nd column) vs. v4.13-rc6+blk-next+patch V3 (3rd column):

---------------------------------------
 IOPS(K)   |    NONE    |    NONE
---------------------------------------
 read      |  475.83    |  485.88
 randread  |  142.86    |  141.96
 write     |  483.9     |  492.39
 randwrite |  124.83    |  124.53
---------------------------------------

---------------------------------------
 CPU(%)    |    NONE    |    NONE
---------------------------------------
 read      |   15.43    |   15.81
 randread  |    9.87    |    9.75
 write     |   17.67    |   18.34
 randwrite |   10.96    |   10.56
---------------------------------------

---------------------------------------
 LAT(us)   |    NONE    |    NONE
---------------------------------------
 read      |    2.15    |    2.11
 randread  |    7.17    |    7.21
 write     |    2.11    |    2.08
 randwrite |    8.2     |    8.22
---------------------------------------

---------------------------------------
 MERGE(K)  |    NONE    |    NONE
---------------------------------------
 read      | 8798.59    | 9064.09
 randread  |    0.02    |    0.19
 write     | 8847.73    | 9102.18
 randwrite |    0.03    |    0.13
---------------------------------------

-- 
Ming
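One way to read the latency arithmetic in Bart's quoted mail is via Little's law: at a fixed number of in-flight requests, mean per-request latency is in-flight count divided by throughput. The sketch below applies that to the randread IOPS figures; treating 64 jobs as the effective parallelism is an assumption, since the mail itself notes the true in-flight count was not measured:

```python
def latency_us(iops, outstanding):
    # Little's law: mean per-request latency = in-flight requests / throughput.
    return outstanding / iops * 1e6

# randread IOPS from Bart's SRP runs, before and after the patch series.
before, after = 142990, 142460
delta = latency_us(after, 64) - latency_us(before, 64)
print(f"{delta:.2f} us extra per request at an assumed QD of 64")
```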
Re: [PATCH 2/2] mq-deadline: Enable auto-loading when built as module
On Sun, Aug 13, 2017 at 06:03:15PM +0100, Ben Hutchings wrote:
> The block core requests modules with the "-iosched" name suffix, but
> mq-deadline does not have that suffix. Add an alias.
> 
> Fixes: 945ffb60c11d ("mq-deadline: add blk-mq adaptation of the deadline ...")
> Signed-off-by: Ben Hutchings
> ---
>  block/mq-deadline.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 1b964a387afe..78e7698fa4ed 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -660,6 +660,7 @@ static struct elevator_type mq_deadline = {
>  	.elevator_name = "mq-deadline",
>  	.elevator_owner = THIS_MODULE,
>  };
> +MODULE_ALIAS("mq-deadline-iosched");
>  
>  static int __init deadline_init(void)
>  {

Reviewed-by: Ming Lei

-- 
Ming
Re: [PATCH 1/2] bfq: Re-enable auto-loading when built as a module
On Sun, Aug 13, 2017 at 06:02:19PM +0100, Ben Hutchings wrote:
> The block core requests modules with the "-iosched" name suffix, but
> bfq no longer has that suffix. Add an alias.
> 
> Fixes: ea25da48086d ("block, bfq: split bfq-iosched.c into multiple ...")
> Signed-off-by: Ben Hutchings
> ---
>  block/bfq-iosched.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 436b6ca6b175..3a01563bd564 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -4998,6 +4998,7 @@ static struct elevator_type iosched_bfq_mq = {
>  	.elevator_name = "bfq",
>  	.elevator_owner = THIS_MODULE,
>  };
> +MODULE_ALIAS("bfq-iosched");
>  
>  static int __init bfq_init(void)
>  {

Reviewed-by: Ming Lei

-- 
Ming
Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
On 2017/8/24 1:58, Christoph Hellwig wrote:
> -static inline bool nvme_req_needs_retry(struct request *req)
> +static bool nvme_failover_rq(struct request *req)
>  {
> -	if (blk_noretry_request(req))
> +	struct nvme_ns *ns = req->q->queuedata;
> +	unsigned long flags;
> +
> +	/*
> +	 * Only fail over commands that came in through the multipath
> +	 * aware submissions path.  Note that ns->head might not be set up
> +	 * for commands used during controller initialization, but those
> +	 * must never set REQ_FAILFAST_TRANSPORT.
> +	 */
> +	if (!(req->cmd_flags & REQ_FAILFAST_TRANSPORT))
> +		return false;
> +
> +	switch (nvme_req(req)->status & 0x7ff) {
> +	/*
> +	 * Generic command status:
> +	 */
> +	case NVME_SC_INVALID_OPCODE:
> +	case NVME_SC_INVALID_FIELD:
> +	case NVME_SC_INVALID_NS:
> +	case NVME_SC_LBA_RANGE:
> +	case NVME_SC_CAP_EXCEEDED:
> +	case NVME_SC_RESERVATION_CONFLICT:
> +		return false;
> +
> +	/*
> +	 * I/O command set specific error.  Unfortunately these values are
> +	 * reused for fabrics commands, but those should never get here.
> +	 */
> +	case NVME_SC_BAD_ATTRIBUTES:
> +	case NVME_SC_INVALID_PI:
> +	case NVME_SC_READ_ONLY:
> +	case NVME_SC_ONCS_NOT_SUPPORTED:
> +		WARN_ON_ONCE(nvme_req(req)->cmd->common.opcode ==
> +			nvme_fabrics_command);
> +		return false;
> +
> +	/*
> +	 * Media and Data Integrity Errors:
> +	 */
> +	case NVME_SC_WRITE_FAULT:
> +	case NVME_SC_READ_ERROR:
> +	case NVME_SC_GUARD_CHECK:
> +	case NVME_SC_APPTAG_CHECK:
> +	case NVME_SC_REFTAG_CHECK:
> +	case NVME_SC_COMPARE_FAILED:
> +	case NVME_SC_ACCESS_DENIED:
> +	case NVME_SC_UNWRITTEN_BLOCK:
>  		return false;
> +	}
> +
> +	/* Anything else could be a path failure, so should be retried */
> +	spin_lock_irqsave(&ns->head->requeue_lock, flags);
> +	blk_steal_bios(&ns->head->requeue_list, req);
> +	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
> +
> +	nvme_reset_ctrl(ns->ctrl);
> +	kblockd_schedule_work(&ns->head->requeue_work);
> +	return true;
> +}
> +
> +static inline bool nvme_req_needs_retry(struct request *req)
> +{
>  	if (nvme_req(req)->status & NVME_SC_DNR)
>  		return false;
>  	if (jiffies - req->start_time >= req->timeout)
>  		return false;
>  	if (nvme_req(req)->retries >= nvme_max_retries)
>  		return false;
> +	if (nvme_failover_rq(req))
> +		return false;
> +	if (blk_noretry_request(req))
> +		return false;
>  	return true;
>  }

Does this introduce conflicts with current DM-multipath used for
NVMe/NVMeF when a path I/O error occurs? Such an I/O will be retried not
only on the nvme-mpath internal path, but also on the dm-mpath path. In
general, I wonder whether nvme-mpath can co-exist with DM-multipath in a
well-defined fashion.
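The decision structure of the quoted nvme_failover_rq() can be mirrored in a few lines: commands that did not come in through the multipath-aware path never fail over, statuses that clearly indicate a command or media problem never fail over, and everything else is treated as a possible path failure. The status names below mirror the NVME_SC_* constants in the patch; "TRANSPORT_ERROR" is a made-up stand-in for "anything else":

```python
NON_RETRYABLE = {
    # Generic command status:
    "INVALID_OPCODE", "INVALID_FIELD", "INVALID_NS", "LBA_RANGE",
    "CAP_EXCEEDED", "RESERVATION_CONFLICT",
    # I/O command set specific errors:
    "BAD_ATTRIBUTES", "INVALID_PI", "READ_ONLY", "ONCS_NOT_SUPPORTED",
    # Media and data integrity errors:
    "WRITE_FAULT", "READ_ERROR", "GUARD_CHECK", "APPTAG_CHECK",
    "REFTAG_CHECK", "COMPARE_FAILED", "ACCESS_DENIED", "UNWRITTEN_BLOCK",
}

def should_failover(status, failfast_transport):
    # Only multipath-aware submissions (REQ_FAILFAST_TRANSPORT) may fail
    # over, and only when the status is not clearly a command/media error.
    if not failfast_transport:
        return False
    return status not in NON_RETRYABLE

print(should_failover("TRANSPORT_ERROR", True))   # path failure: retry elsewhere
print(should_failover("WRITE_FAULT", True))       # media error: don't fail over
```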
Re: [PATCH V2 2/2] block/loop: allow request merge for directio mode
On Thu, Aug 24, 2017 at 12:24:53PM -0700, Shaohua Li wrote:
> From: Shaohua Li
> 
> Currently loop disables merge. While it makes sense for buffer IO mode,
> directio mode can benefit from request merge. Without merge, loop could
> send small size IO to underlayer disk and harm performance.

Hi Shaohua,

IMO, no matter whether merge is used, loop always sends pages one by one
to the VFS in both dio and buffered I/O mode. Also, if merge is enabled
on loop, merge is run on both loop and the low-level block driver, and
I'm not sure we can benefit from that. So could you provide some
performance data for this patch?

> 
> Reviewed-by: Omar Sandoval
> Signed-off-by: Shaohua Li
> ---
>  drivers/block/loop.c | 66 ++++++++++++++++++++++++++++++++-----------
>  drivers/block/loop.h |  1 +
>  2 files changed, 52 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 428da07..75a8f6e 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -213,10 +213,13 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
>  	 */
>  	blk_mq_freeze_queue(lo->lo_queue);
>  	lo->use_dio = use_dio;
> -	if (use_dio)
> +	if (use_dio) {
> +		queue_flag_clear_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
>  		lo->lo_flags |= LO_FLAGS_DIRECT_IO;
> -	else
> +	} else {
> +		queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
>  		lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
> +	}
>  	blk_mq_unfreeze_queue(lo->lo_queue);
>  }
>  
> @@ -464,6 +467,8 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
>  {
>  	struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
>  
> +	kfree(cmd->bvec);
> +	cmd->bvec = NULL;
>  	cmd->ret = ret;
>  	blk_mq_complete_request(cmd->rq);
>  }
> @@ -473,22 +478,50 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
>  {
>  	struct iov_iter iter;
>  	struct bio_vec *bvec;
> -	struct bio *bio = cmd->rq->bio;
> +	struct request *rq = cmd->rq;
> +	struct bio *bio = rq->bio;
>  	struct file *file = lo->lo_backing_file;
> +	unsigned int offset;
> +	int segments = 0;
>  	int ret;
>  
> -	/* nomerge for loop request queue */
> -	WARN_ON(cmd->rq->bio != cmd->rq->biotail);
> +	if (rq->bio != rq->biotail) {
> +		struct req_iterator iter;
> +		struct bio_vec tmp;
> +
> +		__rq_for_each_bio(bio, rq)
> +			segments += bio_segments(bio);
> +		bvec = kmalloc(sizeof(struct bio_vec) * segments, GFP_KERNEL);

The allocation should have been GFP_NOIO.

-- 
Ming
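Conceptually, the quoted lo_rw_aio() change counts the segments of every bio in a (now possibly merged) request and copies them into one flat array, so that a single iov_iter can cover the whole request. A rough Python analogue, with (offset, length) tuples standing in for bio_vecs:

```python
def flatten_segments(bios):
    """What the quoted lo_rw_aio() change does conceptually: walk every
    bio of a merged request (__rq_for_each_bio) and build one flat
    segment array for a single iov_iter. Sketch only, not kernel code."""
    return [seg for bio in bios for seg in bio]

# A request whose two bios were merged: segments as (offset, length).
rq = [[(0, 4096), (4096, 4096)], [(8192, 4096)]]
flat = flatten_segments(rq)
print(len(flat), flat[-1])
```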
Re: [PATCH V2 1/2] block/loop: set hw_sectors
On Thu, Aug 24, 2017 at 12:24:52PM -0700, Shaohua Li wrote:
> From: Shaohua Li
> 
> Loop can handle any size of request. Limiting it to 255 sectors just
> burns CPU on bio split and request merge for the underlying disk, and
> also causes bad fs block allocation in directio mode.
> 
> Reviewed-by: Omar Sandoval
> Signed-off-by: Shaohua Li
> ---
>  drivers/block/loop.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index b55a1f8..428da07 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1799,6 +1799,7 @@ static int loop_add(struct loop_device **l, int i)
>  	}
>  	lo->lo_queue->queuedata = lo;
>  
> +	blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
>  	/*
>  	 * It doesn't make sense to enable merge because the I/O
>  	 * submitted to backing file is handled page by page.
> -- 
> 2.9.5

Reviewed-by: Ming Lei

-- 
Ming
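To put numbers on the change: a 255-sector cap bounds each request at roughly 127 KiB, so any larger I/O gets split. Assuming BLK_DEF_MAX_SECTORS was 2560 in kernels of this era (its value is not stated in the mail), the new cap is 1280 KiB:

```python
SECTOR_SIZE = 512            # block-layer sector unit, in bytes
OLD_LOOP_LIMIT = 255         # sectors: the old per-request cap
BLK_DEF_MAX_SECTORS = 2560   # sectors: assumed v4.13-era value

old_max = OLD_LOOP_LIMIT * SECTOR_SIZE
new_max = BLK_DEF_MAX_SECTORS * SECTOR_SIZE
print(old_max, "->", new_max, "bytes per request")
```

So a 1 MiB direct I/O that previously had to be split into nine or so requests can now reach the backing file as a single request.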
Re: [PATCH 6/6] power: supply: make device_attribute const
Hi,

On Mon, Aug 21, 2017 at 05:13:12PM +0530, Bhumika Goyal wrote:
> Make these const as they are only passed as an argument to the
> function device_create_file and device_remove_file and the corresponding
> arguments are of type const.
> Done using Coccinelle.
> 
> Signed-off-by: Bhumika Goyal
> ---

Thanks, queued.

-- Sebastian

> drivers/power/supply/olpc_battery.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/power/supply/olpc_battery.c b/drivers/power/supply/olpc_battery.c
> index fc20ca3..3bc2eea 100644
> --- a/drivers/power/supply/olpc_battery.c
> +++ b/drivers/power/supply/olpc_battery.c
> @@ -559,7 +559,7 @@ static ssize_t olpc_bat_error_read(struct device *dev,
>  	return sprintf(buf, "%d\n", ec_byte);
>  }
>  
> -static struct device_attribute olpc_bat_error = {
> +static const struct device_attribute olpc_bat_error = {
>  	.attr = {
>  		.name = "error",
>  		.mode = S_IRUGO,
> -- 
> 1.9.1
Re: [PATCH 07/10] nvme: track shared namespaces
On Tue, Aug 29, 2017 at 10:42:18AM +0800, Guan Junxiong wrote:
> As for the __nvme_find_ns_head function, can it look up the namespace
> globally, not in the current subsystem?

No.

> Take the hypermetro scenario for

Please define "hypermetro"

> example: two namespaces which should be viewed as the same namespace
> by the database application, but which exist in two different cities.
> Some vendors may specify those two namespaces with the same UUID.

Then these vendors are non-compliant IFF the controllers don't belong to
the same subsystem.

> In addition, could you add a switch to turn on/off finding namespaces
> at a subsystem-wide level or globally?

No.

> Can a namespace be shared between two subsystems?

No - if you share namespace access you are in the same subsystem.
Re: [PATCH 07/10] nvme: track shared namespaces
On Mon, Aug 28, 2017 at 08:41:23PM +0800, Guan Junxiong wrote:
> If a namespace can be accessed by another subsystem, the above line
> will ignore such a namespace.
> 
> Or does the NVMe/NVMf specification constrain that any namespace
> can only be accessed by one subsystem?

A namespace is a part of an NVMe subsystem. You must not reuse the unique
identifier outside the subsystem scope or your implementation will be
non-compliant.
Re: [PATCH 07/10] nvme: track shared namespaces
On Mon, Aug 28, 2017 at 01:21:20PM -0700, J Freyensee wrote: > > > horrible. One idea would be to rename the current struct nvme_ns > > > to struct nvme_ns_link or similar and use the nvme_ns name for the > > > new structure. But that would involve a lot of churn. > > > > maybe nvme_ns_primary? > > Since it looks like it holds all unique identifier values and should hold > other namespace characteristics later, maybe: > > nvme_ns_item? > Or nvme_ns_entry? > Or nvme_ns_element? > Or nvme_ns_unit? > Or nvme_ns_entity? > Or nvme_ns_container? I hate them all (including the current ns_head name :)). I suspect the only way that would make my taste happy is to call this new one nvme_ns, but that would lead to a lot of churn.
Re: [PATCH 07/10] nvme: track shared namespaces
On 2017/8/28 14:51, Sagi Grimberg wrote:
> +static int __nvme_check_ids(struct nvme_subsystem *subsys,
> +		struct nvme_ns_head *new)
> +{
> +	struct nvme_ns_head *h;
> +
> +	lockdep_assert_held(&subsys->lock);
> +
> +	list_for_each_entry(h, &subsys->nsheads, entry) {
> +		if ((!uuid_is_null(&new->uuid) &&
> +		     uuid_equal(&new->uuid, &h->uuid)) ||
> +		    (memchr_inv(new->nguid, 0, sizeof(new->nguid)) &&
> +		     memcmp(&new->nguid, &h->nguid, sizeof(new->nguid))) ||

memcmp() -> !memcmp

> +		    (memchr_inv(new->eui64, 0, sizeof(new->eui64)) &&
> +		     memcmp(&new->eui64, &h->eui64, sizeof(new->eui64))))

memcmp() -> !memcmp

Otherwise in this patch, looks good.

Reviewed-by: Guan Junxiong
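The reviewer's point is that equality must be tested with !memcmp(), since memcmp() returns 0 on a match: as posted, the code would flag *differing* NGUIDs/EUI-64s as duplicates and let true duplicates through. The intended check, sketched in plain Python (a stand-in for the kernel code, with dicts for nvme_ns_head):

```python
def ids_collide(new, existing):
    """Two namespace heads collide when any *non-null* identifier of
    the new head equals the corresponding identifier of an existing
    one. Equality here is what the kernel would spell !memcmp()."""
    if new["uuid"] is not None and new["uuid"] == existing["uuid"]:
        return True
    if any(new["nguid"]) and new["nguid"] == existing["nguid"]:
        return True
    if any(new["eui64"]) and new["eui64"] == existing["eui64"]:
        return True
    return False

a = {"uuid": None, "nguid": bytes(16), "eui64": bytes([1] * 8)}
b = {"uuid": None, "nguid": bytes(16), "eui64": bytes([1] * 8)}
c = {"uuid": None, "nguid": bytes(16), "eui64": bytes([2] * 8)}
print(ids_collide(a, b), ids_collide(a, c))   # same EUI-64 vs. different
```

Note the all-zero NGUIDs do not trigger a collision: a null identifier carries no identity, which is what the memchr_inv() guards in the patch express.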