Re: [PATCH 1/2] nvme: pci: simplify timeout handling

2018-05-10 Thread Ming Lei
On Fri, May 11, 2018 at 5:05 AM, Keith Busch wrote:
> On Fri, May 11, 2018 at 04:52:11AM +0800, Ming Lei wrote:
>> Hi Keith,
>>
>> On Tue, May 8, 2018 at 11:30 PM, Keith Busch  wrote:
>> > On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
>> >> This sync may be raced with one timed-out request, which may be handled
>> >> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
>> >> work reliably.
>> >
>> > Ming,
>> >
>> > As proposed, that scenario is impossible to encounter. Resetting the
>> > controller inline with the timeout reaps all the commands, and then
>> > sets the controller state to RESETTING. While blk-mq may not allow the
>> > driver to complete those requests, having the driver sync with the queues
>> > will hold the controller in the reset state until blk-mq is done with
>> > its timeout work; therefore, it is impossible for the NVMe driver to
>> > return "BLK_EH_RESET_TIMER", and all commands will be completed through
>> > nvme_timeout's BLK_EH_HANDLED exactly as desired.
>>
>> That isn't true for multiple namespace case,  each request queue has its
>> own timeout work, and all these timeout work can be triggered concurrently.
>
> The controller state is most certainly not per queue/namespace. It's
> global to the controller. Once the reset is triggered, nvme_timeout can
> only return EH_HANDLED.

One exception is PCI error recovery, in which EH_RESET_TIMER may still
be returned at any time.

Also, two timeouts can happen at the same time from more than one
namespace, just before resetting is started (before the state is updated
to NVME_CTRL_RESETTING).

Or one timeout comes from the admin queue and another from a namespace,
both at the same time, still before the update to NVME_CTRL_RESETTING.

In the above two situations, one timeout can be handled as EH_HANDLED
while the other is handled as EH_RESET_TIMER.

So simply draining timeouts with blk_sync_queue() isn't enough.
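For reference, here is a simplified sketch of the nvme_timeout() decision
points in play (following the 4.17/4.18-era driver; abort handling, polling
and logging are left out, so treat it as a schematic rather than the literal
code):

static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
{
	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
	struct nvme_dev *dev = iod->nvmeq->dev;

	/* PCI error recovery in progress: must not reset, keep waiting */
	if (pci_channel_offline(to_pci_dev(dev->dev)))
		return BLK_EH_RESET_TIMER;

	/* a reset is already running: this command will be reaped by it */
	if (dev->ctrl.state == NVME_CTRL_RESETTING)
		return BLK_EH_HANDLED;

	/* otherwise this timeout kicks off the recovery itself */
	nvme_dev_disable(dev, false);
	nvme_reset_ctrl(&dev->ctrl);
	return BLK_EH_HANDLED;
}

Before the state flips to NVME_CTRL_RESETTING, timeout works from different
queues can each take a different branch above, which is exactly the
EH_HANDLED/EH_RESET_TIMER mix described here.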


Thanks,
Ming Lei


Re: [PATCH 1/2] nvme: pci: simplify timeout handling

2018-05-10 Thread Ming Lei
On Thu, May 10, 2018 at 04:43:57PM -0600, Keith Busch wrote:
> On Fri, May 11, 2018 at 06:03:59AM +0800, Ming Lei wrote:
> > Sorry, I forgot to mention: it isn't enough to simply sync timeouts
> > inside reset().
> > 
> > Another tricky thing is freeze & unfreeze: right now freeze is done in
> > nvme_dev_disable(), and unfreeze is done in nvme_reset_work(). That means
> > we have to make sure both are paired, otherwise queues may be kept
> > frozen forever.
> > 
> > This issue is covered by my V4 & V5 too.
> 
> That isn't an issue in my patch either.

When blk_sync_queue() is used in your patch for draining timeouts,
several nvme_timeout() instances may be drained; that means more than
one nvme_start_freeze() may be called from those nvme_timeout()
instances, but only one nvme_reset_work() is finally run, so the queues
are kept frozen forever even though the reset completes successfully.

So your patch won't fix this issue.
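To make the imbalance concrete, this is roughly what the two sides do
(simplified from the nvme core helpers, with locking and error handling
stripped out):

/* called from nvme_dev_disable(): start one freeze per namespace queue */
void nvme_start_freeze(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_freeze_queue_start(ns->queue);	/* mq_freeze_depth++ */
}

/* called from nvme_reset_work(): undo it exactly once */
void nvme_unfreeze(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_mq_unfreeze_queue(ns->queue);	/* mq_freeze_depth-- */
}

Two timeouts mean nvme_start_freeze() runs twice, but only one
nvme_reset_work() follows, so mq_freeze_depth never drops back to zero and
the queues stay frozen.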

All these problems are mentioned in the commit log of V4, and I
am proposing this approach to try to address them all.

Thanks,
Ming


Re: [PATCH 1/2] nvme: pci: simplify timeout handling

2018-05-10 Thread Keith Busch
On Fri, May 11, 2018 at 06:03:59AM +0800, Ming Lei wrote:
> Sorry, I forgot to mention: it isn't enough to simply sync timeouts
> inside reset().
> 
> Another tricky thing is freeze & unfreeze: right now freeze is done in
> nvme_dev_disable(), and unfreeze is done in nvme_reset_work(). That means
> we have to make sure both are paired, otherwise queues may be kept
> frozen forever.
> 
> This issue is covered by my V4 & V5 too.

That isn't an issue in my patch either.


Re: [PATCH V4 0/7] nvme: pci: fix & improve timeout handling

2018-05-10 Thread Ming Lei
Hi Laurence,

Many thanks for your quick test!

On Fri, May 11, 2018 at 5:59 AM, Laurence Oberman  wrote:
> On Thu, 2018-05-10 at 18:28 +0800, Ming Lei wrote:
>> On Sat, May 05, 2018 at 07:11:33PM -0400, Laurence Oberman wrote:
>> > On Sat, 2018-05-05 at 21:58 +0800, Ming Lei wrote:
>> > > Hi,
>> > >
>> > > The 1st patch introduces blk_quiesce_timeout() and
>> > > blk_unquiesce_timeout()
>> > > for NVMe, meantime fixes blk_sync_queue().
>> > >
>> > > The 2nd patch covers timeout for admin commands for recovering
>> > > controller
>> > > for avoiding possible deadlock.
>> > >
>> > > The 3rd and 4th patches avoid to wait_freeze on queues which
>> > > aren't
>> > > frozen.
>> > >
>> > > The last 4 patches fix several races wrt. the NVMe timeout handler,
>> > > and finally make blktests block/011 pass. Meantime the NVMe PCI
>> > > timeout mechanism becomes much more robust than before.
>> > >
>> > > gitweb:
>> > >   https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V4
>> > >
>> > > V4:
>> > >   - fix nvme_init_set_host_mem_cmd()
>> > >   - use nested EH model, and run both nvme_dev_disable() and
>> > >   resetting in one same context
>> > >
>> > > V3:
>> > >   - fix one new race related freezing in patch 4,
>> > > nvme_reset_work()
>> > >   may hang forever without this patch
>> > >   - rewrite the last 3 patches, and avoid to break
>> > > nvme_reset_ctrl*()
>> > >
>> > > V2:
>> > >   - fix draining timeout work, so no need to change return value
>> > > from
>> > >   .timeout()
>> > >   - fix race between nvme_start_freeze() and nvme_unfreeze()
>> > >   - cover timeout for admin commands running in EH
>> > >
>> > > Ming Lei (7):
>> > >   block: introduce blk_quiesce_timeout() and
>> > > blk_unquiesce_timeout()
>> > >   nvme: pci: cover timeout for admin commands running in EH
>> > >   nvme: pci: only wait freezing if queue is frozen
>> > >   nvme: pci: freeze queue in nvme_dev_disable() in case of error
>> > > recovery
>> > >   nvme: core: introduce 'reset_lock' for sync reset state and
>> > > reset
>> > > activities
>> > >   nvme: pci: prepare for supporting error recovery from resetting
>> > > context
>> > >   nvme: pci: support nested EH
>> > >
>> > >  block/blk-core.c |  21 +++-
>> > >  block/blk-mq.c   |   9 ++
>> > >  block/blk-timeout.c  |   5 +-
>> > >  drivers/nvme/host/core.c |  46 ++-
>> > >  drivers/nvme/host/nvme.h |   5 +
>> > >  drivers/nvme/host/pci.c  | 304
>> > > ---
>> > >  include/linux/blkdev.h   |  13 ++
>> > >  7 files changed, 356 insertions(+), 47 deletions(-)
>> > >
>> > > Cc: Jianchao Wang 
>> > > Cc: Christoph Hellwig 
>> > > Cc: Sagi Grimberg 
>> > > Cc: linux-n...@lists.infradead.org
>> > > Cc: Laurence Oberman 
>> >
>> > Hello Ming
>> >
>> > I have a two node NUMA system here running your kernel tree
>> > 4.17.0-rc3.ming.nvme+
>> >
>> > [root@segstorage1 ~]# numactl --hardware
>> > available: 2 nodes (0-1)
>> > node 0 cpus: 0 3 5 6 8 11 13 14
>> > node 0 size: 63922 MB
>> > node 0 free: 61310 MB
>> > node 1 cpus: 1 2 4 7 9 10 12 15
>> > node 1 size: 64422 MB
>> > node 1 free: 62372 MB
>> > node distances:
>> > node   0   1
>> >   0:  10  20
>> >   1:  20  10
>> >
>> > I ran block/011
>> >
>> > [root@segstorage1 blktests]# ./check block/011
>> > block/011 => nvme0n1 (disable PCI device while doing
>> > I/O)[failed]
>> > runtime...  106.936s
>> > --- tests/block/011.out 2018-05-05 18:01:14.268414752
>> > -0400
>> > +++ results/nvme0n1/block/011.out.bad   2018-05-05
>> > 19:07:21.028634858 -0400
>> > @@ -1,2 +1,36 @@
>> >  Running block/011
>> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
>> > IO_U_F_FLIGHT) == 0' failed.
>> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
>> > IO_U_F_FLIGHT) == 0' failed.
>> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
>> > IO_U_F_FLIGHT) == 0' failed.
>> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
>> > IO_U_F_FLIGHT) == 0' failed.
>> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
>> > IO_U_F_FLIGHT) == 0' failed.
>> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
>> > IO_U_F_FLIGHT) == 0' failed.
>> > ...
>> > (Run 'diff -u tests/block/011.out
>> > results/nvme0n1/block/011.out.bad' to see the entire diff)
>> >
>> > [ 1421.738551] run blktests block/011 at 2018-05-05 19:05:34
>> > [ 1452.676351] nvme nvme0: controller is down; will reset:
>> > CSTS=0x3,
>> > PCI_STATUS=0x10
>> > [ 1452.718221] nvme nvme0: controller is down; will reset:
>> > CSTS=0x3,
>> > PCI_STATUS=0x10
>> > [ 1452.718239] nvme nvme0: EH 0: before shutdown
>> > [ 1452.760890] nvme nvme0: controller is down; will reset:
>> > CSTS=0x3,
>> > PCI_STATUS=0x10
>> > [ 1452.760894] nvme nvme0: controller is down; 

Re: [PATCH 1/2] nvme: pci: simplify timeout handling

2018-05-10 Thread Ming Lei
On Fri, May 11, 2018 at 5:18 AM, Keith Busch wrote:
> On Fri, May 11, 2018 at 05:10:40AM +0800, Ming Lei wrote:
>> On Fri, May 11, 2018 at 5:05 AM, Keith Busch
>>  wrote:
>> > On Fri, May 11, 2018 at 04:52:11AM +0800, Ming Lei wrote:
>> >> Hi Keith,
>> >>
>> >> On Tue, May 8, 2018 at 11:30 PM, Keith Busch  
>> >> wrote:
>> >> > On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
>> >> >> This sync may be raced with one timed-out request, which may be handled
>> >> >> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
>> >> >> work reliably.
>> >> >
>> >> > Ming,
>> >> >
>> >> > As proposed, that scenario is impossible to encounter. Resetting the
>> >> > controller inline with the timeout reaps all the commands, and then
>> >> > sets the controller state to RESETTING. While blk-mq may not allow the
>> >> > driver to complete those requests, having the driver sync with the 
>> >> > queues
>> >> > will hold the controller in the reset state until blk-mq is done with
>> >> > its timeout work; therefore, it is impossible for the NVMe driver to
>> >> > return "BLK_EH_RESET_TIMER", and all commands will be completed through
>> >> > nvme_timeout's BLK_EH_HANDLED exactly as desired.
>> >>
>> >> That isn't true for multiple namespace case,  each request queue has its
>> >> own timeout work, and all these timeout work can be triggered 
>> >> concurrently.
>> >
>> > The controller state is most certainly not per queue/namespace. It's
>> > global to the controller. Once the reset is triggered, nvme_timeout can
>> > only return EH_HANDLED.
>>
>> It is related with EH_HANDLED, please see the following case:
>>
>> 1) when req A from N1 is timed out, nvme_timeout() handles
>> it as EH_HANDLED: nvme_dev_disable() and reset is scheduled.
>>
>> 2) when req B from N2 is timed out, nvme_timeout() handles
>> it as EH_HANDLED, then nvme_dev_disable() is called exactly
>> when reset is in-progress, so queues become quiesced, and nothing
>> can move on in the resetting triggered by N1.
>
> Huh? The nvme_sync_queues ensures that doesn't happen. That was the
> whole point.

Sorry, I forgot to mention: it isn't enough to simply sync timeouts inside reset().

Another tricky thing is freeze & unfreeze: right now freeze is done in
nvme_dev_disable(), and unfreeze is done in nvme_reset_work(). That means
we have to make sure both are paired, otherwise queues may be kept
frozen forever.

This issue is covered by my V4 & V5 too.

thanks,
Ming Lei


Re: [PATCH V4 0/7] nvme: pci: fix & improve timeout handling

2018-05-10 Thread Laurence Oberman
On Thu, 2018-05-10 at 18:28 +0800, Ming Lei wrote:
> On Sat, May 05, 2018 at 07:11:33PM -0400, Laurence Oberman wrote:
> > On Sat, 2018-05-05 at 21:58 +0800, Ming Lei wrote:
> > > Hi,
> > > 
> > > The 1st patch introduces blk_quiesce_timeout() and
> > > blk_unquiesce_timeout()
> > > for NVMe, meantime fixes blk_sync_queue().
> > > 
> > > The 2nd patch covers timeout for admin commands for recovering
> > > controller
> > > for avoiding possible deadlock.
> > > 
> > > The 3rd and 4th patches avoid to wait_freeze on queues which
> > > aren't
> > > frozen.
> > > 
> > > The last 4 patches fix several races wrt. the NVMe timeout handler,
> > > and finally make blktests block/011 pass. Meantime the NVMe PCI
> > > timeout mechanism becomes much more robust than before.
> > > 
> > > gitweb:
> > >   https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V4
> > > 
> > > V4:
> > >   - fix nvme_init_set_host_mem_cmd()
> > >   - use nested EH model, and run both nvme_dev_disable() and
> > >   resetting in one same context
> > > 
> > > V3:
> > >   - fix one new race related freezing in patch 4,
> > > nvme_reset_work()
> > >   may hang forever without this patch
> > >   - rewrite the last 3 patches, and avoid to break
> > > nvme_reset_ctrl*()
> > > 
> > > V2:
> > >   - fix draining timeout work, so no need to change return value
> > > from
> > >   .timeout()
> > >   - fix race between nvme_start_freeze() and nvme_unfreeze()
> > >   - cover timeout for admin commands running in EH
> > > 
> > > Ming Lei (7):
> > >   block: introduce blk_quiesce_timeout() and
> > > blk_unquiesce_timeout()
> > >   nvme: pci: cover timeout for admin commands running in EH
> > >   nvme: pci: only wait freezing if queue is frozen
> > >   nvme: pci: freeze queue in nvme_dev_disable() in case of error
> > > recovery
> > >   nvme: core: introduce 'reset_lock' for sync reset state and
> > > reset
> > > activities
> > >   nvme: pci: prepare for supporting error recovery from resetting
> > > context
> > >   nvme: pci: support nested EH
> > > 
> > >  block/blk-core.c |  21 +++-
> > >  block/blk-mq.c   |   9 ++
> > >  block/blk-timeout.c  |   5 +-
> > >  drivers/nvme/host/core.c |  46 ++-
> > >  drivers/nvme/host/nvme.h |   5 +
> > >  drivers/nvme/host/pci.c  | 304
> > > ---
> > >  include/linux/blkdev.h   |  13 ++
> > >  7 files changed, 356 insertions(+), 47 deletions(-)
> > > 
> > > Cc: Jianchao Wang 
> > > Cc: Christoph Hellwig 
> > > Cc: Sagi Grimberg 
> > > Cc: linux-n...@lists.infradead.org
> > > Cc: Laurence Oberman 
> > 
> > Hello Ming
> > 
> > I have a two node NUMA system here running your kernel tree
> > 4.17.0-rc3.ming.nvme+
> > 
> > [root@segstorage1 ~]# numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 3 5 6 8 11 13 14
> > node 0 size: 63922 MB
> > node 0 free: 61310 MB
> > node 1 cpus: 1 2 4 7 9 10 12 15
> > node 1 size: 64422 MB
> > node 1 free: 62372 MB
> > node distances:
> > node   0   1 
> >   0:  10  20 
> >   1:  20  10 
> > 
> > I ran block/011
> > 
> > [root@segstorage1 blktests]# ./check block/011
> > block/011 => nvme0n1 (disable PCI device while doing
> > I/O)[failed]
> > runtime...  106.936s
> > --- tests/block/011.out 2018-05-05 18:01:14.268414752
> > -0400
> > +++ results/nvme0n1/block/011.out.bad   2018-05-05
> > 19:07:21.028634858 -0400
> > @@ -1,2 +1,36 @@
> >  Running block/011
> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> > IO_U_F_FLIGHT) == 0' failed.
> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> > IO_U_F_FLIGHT) == 0' failed.
> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> > IO_U_F_FLIGHT) == 0' failed.
> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> > IO_U_F_FLIGHT) == 0' failed.
> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> > IO_U_F_FLIGHT) == 0' failed.
> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> > IO_U_F_FLIGHT) == 0' failed.
> > ...
> > (Run 'diff -u tests/block/011.out
> > results/nvme0n1/block/011.out.bad' to see the entire diff)
> > 
> > [ 1421.738551] run blktests block/011 at 2018-05-05 19:05:34
> > [ 1452.676351] nvme nvme0: controller is down; will reset:
> > CSTS=0x3,
> > PCI_STATUS=0x10
> > [ 1452.718221] nvme nvme0: controller is down; will reset:
> > CSTS=0x3,
> > PCI_STATUS=0x10
> > [ 1452.718239] nvme nvme0: EH 0: before shutdown
> > [ 1452.760890] nvme nvme0: controller is down; will reset:
> > CSTS=0x3,
> > PCI_STATUS=0x10
> > [ 1452.760894] nvme nvme0: controller is down; will reset:
> > CSTS=0x3,
> > PCI_STATUS=0x10
> > [ 1452.760897] nvme nvme0: controller is down; will reset:
> > CSTS=0x3,
> > PCI_STATUS=0x10
> > [ 1452.760900] nvme nvme0: controller is down; will reset:
> > CSTS=0x3,
> > PCI_STATUS=0x10
> > [ 

Re: [PATCH 1/2] nvme: pci: simplify timeout handling

2018-05-10 Thread Ming Lei
On Thu, May 10, 2018 at 03:44:41PM -0600, Keith Busch wrote:
> On Fri, May 11, 2018 at 05:24:46AM +0800, Ming Lei wrote:
> > Could you share the link?
> 
> The diff was in this reply here:
> 
> http://lists.infradead.org/pipermail/linux-nvme/2018-April/017019.html
> 
> > Firstly, the previous nvme_sync_queues() won't work reliably, so this
> > patch introduces blk_unquiesce_timeout() and blk_quiesce_timeout() for
> > this purpose.
> >
> > Secondly, I remember that you only call nvme_sync_queues() at the
> > entry of nvme_reset_work(), but a timeout (either admin or normal IO)
> > can happen again during resetting; that is another race addressed by
> > this patchset but not covered by your proposal.
> 
> I sync the queues at the beginning because it ensures there is not
> a single in flight request for the entire controller (all namespaces
> plus admin queue) before transitioning to the connecting state.
> 
> If a command times out during connecting state, we go to the dead state.

Actually, it hangs forever in blk_mq_freeze_queue_wait() called from
nvme_reset_dev(); the controller isn't in a dead state.
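Roughly, the call chain that hangs looks like this (a sketch of the
4.17-era code paths, not the literal series):

nvme_reset_dev()
  -> nvme_wait_freeze(&dev->ctrl)
       -> blk_mq_freeze_queue_wait(ns->queue)		/* for each namespace */
            -> wait_event(q->mq_freeze_wq,
                          percpu_ref_is_zero(&q->q_usage_counter));

If the timed-out requests are never completed, q_usage_counter never reaches
zero and the wait never returns, so the controller just sits in that state
instead of being marked dead.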

Thanks,
Ming


Re: [PATCH 1/2] nvme: pci: simplify timeout handling

2018-05-10 Thread Ming Lei
On Thu, May 10, 2018 at 03:44:41PM -0600, Keith Busch wrote:
> On Fri, May 11, 2018 at 05:24:46AM +0800, Ming Lei wrote:
> > Could you share the link?
> 
> The diff was in this reply here:
> 
> http://lists.infradead.org/pipermail/linux-nvme/2018-April/017019.html
> 
> > Firstly, the previous nvme_sync_queues() won't work reliably, so this
> > patch introduces blk_unquiesce_timeout() and blk_quiesce_timeout() for
> > this purpose.
> >
> > Secondly, I remember that you only call nvme_sync_queues() at the
> > entry of nvme_reset_work(), but a timeout (either admin or normal IO)
> > can happen again during resetting; that is another race addressed by
> > this patchset but not covered by your proposal.
> 
> I sync the queues at the beginning because it ensures there is not
> a single in flight request for the entire controller (all namespaces
> plus admin queue) before transitioning to the connecting state.

But it can't avoid the race I mentioned, since nvme_dev_disable() can
happen again during resetting.

> 
> If a command times out during connecting state, we go to the dead state.

That is too risky, since IO during resetting isn't much different from
IO submitted via other paths.

For example, in the case of blktests block/011 the controller shouldn't
be put into the dead state, and this patchset (V4 & V5) covers that case well.

Thanks,
Ming


Re: [PATCH 01/33] block: add a lower-level bio_add_page interface

2018-05-10 Thread Andreas Dilger
On May 10, 2018, at 12:40 AM, Christoph Hellwig  wrote:
> 
> On Wed, May 09, 2018 at 08:12:43AM -0700, Matthew Wilcox wrote:
>> (page, len, off) is a bit weird to me.  Usually we do (page, off, len).
> 
> That's what I'd usually do, too.  But this odd convention is what
> bio_add_page uses, so I decided to stick to that instead of having two
> different conventions in one family of functions.

Would it make sense to change the bio_add_page() and bio_add_pc_page()
to use the more common convention instead of continuing the spread of
this non-standard calling convention?  This is doubly problematic since
"off" and "len" are both unsigned int values so it is easy to get them
mixed up, and just reordering the bio_add_page() arguments would not
generate any errors.
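For example, with the current prototype both of the calls below compile
cleanly, and only one of them does what the caller meant (prototype as in
the current tree, call sites purely illustrative):

int bio_add_page(struct bio *bio, struct page *page,
		 unsigned int len, unsigned int off);

	/* intended: add 512 bytes starting at offset 0 of the page */
	bio_add_page(bio, page, 512, 0);

	/* len and off swapped: still compiles, quietly does the wrong thing */
	bio_add_page(bio, page, 0, 512);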

One option would be to rename this function bio_page_add() so that misuses
become build errors, or to first add bio_page_add() and mark bio_add_page()
deprecated, allowing a short transition period. There are about 50 uses
under drivers/ and 50 under fs/.

Cheers, Andreas


Re: [PATCH 1/2] nvme: pci: simplify timeout handling

2018-05-10 Thread Keith Busch
On Fri, May 11, 2018 at 05:24:46AM +0800, Ming Lei wrote:
> Could you share the link?

The diff was in this reply here:

http://lists.infradead.org/pipermail/linux-nvme/2018-April/017019.html

> Firstly, the previous nvme_sync_queues() won't work reliably, so this
> patch introduces blk_unquiesce_timeout() and blk_quiesce_timeout() for
> this purpose.
>
> Secondly, I remember that you only call nvme_sync_queues() at the
> entry of nvme_reset_work(), but a timeout (either admin or normal IO)
> can happen again during resetting; that is another race addressed by
> this patchset but not covered by your proposal.

I sync the queues at the beginning because it ensures there is not
a single in flight request for the entire controller (all namespaces
plus admin queue) before transitioning to the connecting state.

If a command times out during connecting state, we go to the dead state.


Re: [PATCH 1/2] nvme: pci: simplify timeout handling

2018-05-10 Thread Ming Lei
On Thu, May 10, 2018 at 03:18:29PM -0600, Keith Busch wrote:
> On Fri, May 11, 2018 at 05:10:40AM +0800, Ming Lei wrote:
> > On Fri, May 11, 2018 at 5:05 AM, Keith Busch
> >  wrote:
> > > On Fri, May 11, 2018 at 04:52:11AM +0800, Ming Lei wrote:
> > >> Hi Keith,
> > >>
> > >> On Tue, May 8, 2018 at 11:30 PM, Keith Busch  
> > >> wrote:
> > >> > On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
> > >> >> This sync may be raced with one timed-out request, which may be 
> > >> >> handled
> > >> >> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues 
> > >> >> can't
> > >> >> work reliably.
> > >> >
> > >> > Ming,
> > >> >
> > >> > As proposed, that scenario is impossible to encounter. Resetting the
> > >> > controller inline with the timeout reaps all the commands, and then
> > >> > sets the controller state to RESETTING. While blk-mq may not allow the
> > >> > driver to complete those requests, having the driver sync with the 
> > >> > queues
> > >> > will hold the controller in the reset state until blk-mq is done with
> > >> > its timeout work; therefore, it is impossible for the NVMe driver to
> > >> > return "BLK_EH_RESET_TIMER", and all commands will be completed through
> > >> > nvme_timeout's BLK_EH_HANDLED exactly as desired.
> > >>
> > >> That isn't true for multiple namespace case,  each request queue has its
> > >> own timeout work, and all these timeout work can be triggered 
> > >> concurrently.
> > >
> > > The controller state is most certainly not per queue/namespace. It's
> > > global to the controller. Once the reset is triggered, nvme_timeout can
> > > only return EH_HANDLED.
> > 
> > It is related with EH_HANDLED, please see the following case:
> > 
> > 1) when req A from N1 is timed out, nvme_timeout() handles
> > it as EH_HANDLED: nvme_dev_disable() and reset is scheduled.
> > 
> > 2) when req B from N2 is timed out, nvme_timeout() handles
> > it as EH_HANDLED, then nvme_dev_disable() is called exactly
> > when reset is in-progress, so queues become quiesced, and nothing
> > can move on in the resetting triggered by N1.
> 
> Huh? The nvme_sync_queues ensures that doesn't happen. That was the
> whole point.

Could you share the link?

Firstly, the previous nvme_sync_queues() won't work reliably, so this
patch introduces blk_unquiesce_timeout() and blk_quiesce_timeout() for
this purpose.

Secondly, I remember that you only call nvme_sync_queues() at the
entry of nvme_reset_work(), but a timeout (either admin or normal IO)
can happen again during resetting; that is another race addressed by
this patchset but not covered by your proposal.

Thanks,
Ming


Re: [PATCH 1/2] nvme: pci: simplify timeout handling

2018-05-10 Thread Keith Busch
On Fri, May 11, 2018 at 05:10:40AM +0800, Ming Lei wrote:
> On Fri, May 11, 2018 at 5:05 AM, Keith Busch
>  wrote:
> > On Fri, May 11, 2018 at 04:52:11AM +0800, Ming Lei wrote:
> >> Hi Keith,
> >>
> >> On Tue, May 8, 2018 at 11:30 PM, Keith Busch  wrote:
> >> > On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
> >> >> This sync may be raced with one timed-out request, which may be handled
> >> >> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
> >> >> work reliably.
> >> >
> >> > Ming,
> >> >
> >> > As proposed, that scenario is impossible to encounter. Resetting the
> >> > controller inline with the timeout reaps all the commands, and then
> >> > sets the controller state to RESETTING. While blk-mq may not allow the
> >> > driver to complete those requests, having the driver sync with the queues
> >> > will hold the controller in the reset state until blk-mq is done with
> >> > its timeout work; therefore, it is impossible for the NVMe driver to
> >> > return "BLK_EH_RESET_TIMER", and all commands will be completed through
> >> > nvme_timeout's BLK_EH_HANDLED exactly as desired.
> >>
> >> That isn't true for multiple namespace case,  each request queue has its
> >> own timeout work, and all these timeout work can be triggered concurrently.
> >
> > The controller state is most certainly not per queue/namespace. It's
> > global to the controller. Once the reset is triggered, nvme_timeout can
> > only return EH_HANDLED.
> 
> It is related with EH_HANDLED, please see the following case:
> 
> 1) when req A from N1 is timed out, nvme_timeout() handles
> it as EH_HANDLED: nvme_dev_disable() and reset is scheduled.
> 
> 2) when req B from N2 is timed out, nvme_timeout() handles
> it as EH_HANDLED, then nvme_dev_disable() is called exactly
> when reset is in-progress, so queues become quiesced, and nothing
> can move on in the resetting triggered by N1.

Huh? The nvme_sync_queues ensures that doesn't happen. That was the
whole point.


Re: [PATCH 1/2] nvme: pci: simplify timeout handling

2018-05-10 Thread Ming Lei
On Fri, May 11, 2018 at 5:05 AM, Keith Busch wrote:
> On Fri, May 11, 2018 at 04:52:11AM +0800, Ming Lei wrote:
>> Hi Keith,
>>
>> On Tue, May 8, 2018 at 11:30 PM, Keith Busch  wrote:
>> > On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
>> >> This sync may be raced with one timed-out request, which may be handled
>> >> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
>> >> work reliably.
>> >
>> > Ming,
>> >
>> > As proposed, that scenario is impossible to encounter. Resetting the
>> > controller inline with the timeout reaps all the commands, and then
>> > sets the controller state to RESETTING. While blk-mq may not allow the
>> > driver to complete those requests, having the driver sync with the queues
>> > will hold the controller in the reset state until blk-mq is done with
>> > its timeout work; therefore, it is impossible for the NVMe driver to
>> > return "BLK_EH_RESET_TIMER", and all commands will be completed through
>> > nvme_timeout's BLK_EH_HANDLED exactly as desired.
>>
>> That isn't true for multiple namespace case,  each request queue has its
>> own timeout work, and all these timeout work can be triggered concurrently.
>
> The controller state is most certainly not per queue/namespace. It's
> global to the controller. Once the reset is triggered, nvme_timeout can
> only return EH_HANDLED.

It is related to EH_HANDLED; please see the following case:

1) when req A from N1 times out, nvme_timeout() handles
it as EH_HANDLED: nvme_dev_disable() is called and a reset is scheduled.

2) when req B from N2 times out, nvme_timeout() handles
it as EH_HANDLED too, and nvme_dev_disable() is called exactly
while the reset is in progress, so the queues become quiesced and nothing
can make progress in the reset triggered by N1.

Thanks,
Ming Lei


Re: [PATCH 1/2] nvme: pci: simplify timeout handling

2018-05-10 Thread Keith Busch
On Fri, May 11, 2018 at 04:52:11AM +0800, Ming Lei wrote:
> Hi Keith,
> 
> On Tue, May 8, 2018 at 11:30 PM, Keith Busch  wrote:
> > On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
> >> This sync may be raced with one timed-out request, which may be handled
> >> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
> >> work reliably.
> >
> > Ming,
> >
> > As proposed, that scenario is impossible to encounter. Resetting the
> > controller inline with the timeout reaps all the commands, and then
> > sets the controller state to RESETTING. While blk-mq may not allow the
> > driver to complete those requests, having the driver sync with the queues
> > will hold the controller in the reset state until blk-mq is done with
> > its timeout work; therefore, it is impossible for the NVMe driver to
> > return "BLK_EH_RESET_TIMER", and all commands will be completed through
> > nvme_timeout's BLK_EH_HANDLED exactly as desired.
> 
> That isn't true for multiple namespace case,  each request queue has its
> own timeout work, and all these timeout work can be triggered concurrently.

The controller state is most certainly not per queue/namespace. It's
global to the controller. Once the reset is triggered, nvme_timeout can
only return EH_HANDLED.


Re: [PATCH V4 1/7] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()

2018-05-10 Thread Ming Lei
On Thu, May 10, 2018 at 03:01:04PM +, Bart Van Assche wrote:
> On Sat, 2018-05-05 at 21:58 +0800, Ming Lei wrote:
> > Turns out the current way can't drain timout completely because mod_timer()
> > can be triggered in the work func, which can be just run inside the synced
> > timeout work:
> > 
> > del_timer_sync(&q->timeout);
> > cancel_work_sync(&q->timeout_work);
> > 
> > This patch introduces one flag of 'timeout_off' for fixing this issue, turns
> > out this simple way does work.
> > 
> > Also blk_quiesce_timeout() and blk_unquiesce_timeout() are introduced for
> > draining timeout, which is needed by NVMe.
> 
> Hello Ming,
> 
> The description of the above patch does not motivate sufficiently why you 
> think
> that this change is necessary. As you know it is already possible to wait 
> until
> timeout handling has finished by calling blk_mq_freeze_queue() +
> blk_mq_unfreeze_queue(). An explanation is needed of why you think that 
> calling

blk_mq_freeze_queue() + blk_mq_unfreeze_queue() can't work: you have to
call blk_mq_freeze_queue_wait() between the two, but blk_mq_freeze_queue_wait()
is big trouble for NVMe and can't be used inside nvme_dev_disable().

You can find the usage in the last patch of this series.
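For reference, the freeze-based drain would look roughly like this
(4.17-era API names, schematic only):

	blk_freeze_queue_start(q);	/* stop new requests from entering */
	blk_mq_freeze_queue_wait(q);	/* block until q_usage_counter reaches zero */
	/* no request is in flight at this point */
	blk_mq_unfreeze_queue(q);

The middle step is the problem: it only returns once every outstanding
request has completed, and in nvme_dev_disable() the requests we need to
recover are exactly the ones that will never complete on their own.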

Thanks,
Ming


Re: [PATCH V4 6/7] nvme: pci: prepare for supporting error recovery from resetting context

2018-05-10 Thread Ming Lei
On Mon, May 07, 2018 at 08:04:18AM -0700, James Smart wrote:
> 
> 
> On 5/5/2018 6:59 AM, Ming Lei wrote:
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -2365,14 +2365,14 @@ static void nvme_remove_dead_ctrl(struct nvme_dev 
> > *dev, int status)
> > nvme_put_ctrl(&dev->ctrl);
> >   }
> > -static void nvme_reset_work(struct work_struct *work)
> > +static void nvme_reset_dev(struct nvme_dev *dev)
> >   {
> > -   struct nvme_dev *dev =
> > -   container_of(work, struct nvme_dev, ctrl.reset_work);
> > bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
> > int result = -ENODEV;
> > enum nvme_ctrl_state new_state = NVME_CTRL_LIVE;
> > +   mutex_lock(&dev->ctrl.reset_lock);
> > +
> > if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING))
> > goto out;
> 
> I believe the reset_lock is unnecessary (patch 5) as it should be covered by
> the transition of the state to RESETTING which is done under lock.
> 
> Thus the error is:
> instead of:
>  if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING))
>      goto out;
> 
> it should be:
>  if (dev->ctrl.state != NVME_CTRL_RESETTING))
>      return;
> 

Right, I have dropped this patch in V5 (not posted yet):

https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V5

Thanks,
Ming


Re: [PATCH 1/2] nvme: pci: simplify timeout handling

2018-05-10 Thread Ming Lei
Hi Keith,

On Tue, May 8, 2018 at 11:30 PM, Keith Busch  wrote:
> On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
>> This sync may be raced with one timed-out request, which may be handled
>> as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't
>> work reliably.
>
> Ming,
>
> As proposed, that scenario is impossible to encounter. Resetting the
> controller inline with the timeout reaps all the commands, and then
> sets the controller state to RESETTING. While blk-mq may not allow the
> driver to complete those requests, having the driver sync with the queues
> will hold the controller in the reset state until blk-mq is done with
> its timeout work; therefore, it is impossible for the NVMe driver to
> return "BLK_EH_RESET_TIMER", and all commands will be completed through
> nvme_timeout's BLK_EH_HANDLED exactly as desired.

That isn't true for the multiple-namespace case: each request queue has its
own timeout work, and all these timeout works can be triggered concurrently.

>
> Could you please recheck my suggestion? The alternatives proposed are
> far too risky for a 4.17 consideration, and I'm hoping we can stabilize
> this behavior in the current release if possible.

Bart's rework of timeout handling won't cover the multiple-namespace case,
so we have to sync timeouts inside the driver.

The approach taken in my V4 is simpler; could you take a look at it?

Thanks,
Ming Lei


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Jerome Glisse
On Thu, May 10, 2018 at 01:10:15PM -0600, Alex Williamson wrote:
> On Thu, 10 May 2018 18:41:09 +
> "Stephen  Bates"  wrote:
> > >Reasons is that GPU are giving up on PCIe (see all specialize link like
> > >NVlink that are popping up in GPU space). So for fast GPU inter-connect
> > >we have this new links.   
> > 
> > I look forward to Nvidia open-licensing NVLink to anyone who wants to use 
> > it ;-).
> 
> No doubt, the marketing for it is quick to point out the mesh topology
> of NVLink, but I haven't seen any technical documents that describe the
> isolation capabilities or IOMMU interaction.  Whether this is included
> or an afterthought, I have no idea.

AFAIK there is no IOMMU on NVLink between devices; walking a page table while
sustaining 80GB/s or 160GB/s is hard to achieve :) I think the idea behind
those interconnects is that devices in the mesh are inherently secure, i.e.
each device is supposed to make sure that no one can abuse it.

GPUs, with their virtual address spaces and contextualized program execution
units, are supposed to be secure (a Spectre-like bug might be lurking in
there, but I doubt it).

So for those interconnects you program physical addresses directly into the
devices' page tables, and those physical addresses are untranslated from the
hardware's perspective.

Note that the kernel driver that does the actual GPU page table programming
can sanity-check the values it is setting, so checks can also happen at
setup time. But after that the assumption is that the hardware is secure and
no one can abuse it, AFAICT.

> 
> > >Also the IOMMU isolation do matter a lot to us. Think someone using 
> > > this
> > >peer to peer to gain control of a server in the cloud.  
> 
> From that perspective, do we have any idea what NVLink means for
> topology and IOMMU provided isolation and translation?  I've seen a
> device assignment user report that seems to suggest it might pretend to
> be PCIe compatible, but the assigned GPU ultimately doesn't work
> correctly in a VM, so perhaps the software compatibility is only so
> deep. Thanks,

Note that each GPU (in the configurations I am aware of) also has a PCIe
link to the CPU/main memory, so from that point of view they behave very
much like regular PCIe devices. It is just that the GPUs in the mesh can
access each other's memory through the high-bandwidth interconnect.

I am not sure how much is public beyond that; I will ask NVIDIA to have
someone chime in on this thread and shed light on this, if possible.

Cheers,
Jérôme


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Alex Williamson
On Thu, 10 May 2018 18:41:09 +
"Stephen  Bates"  wrote:
> >Reasons is that GPU are giving up on PCIe (see all specialize link like
> >NVlink that are popping up in GPU space). So for fast GPU inter-connect
> >we have this new links.   
> 
> I look forward to Nvidia open-licensing NVLink to anyone who wants to use it 
> ;-).

No doubt, the marketing for it is quick to point out the mesh topology
of NVLink, but I haven't seen any technical documents that describe the
isolation capabilities or IOMMU interaction.  Whether this is included
or an afterthought, I have no idea.

> >Also the IOMMU isolation do matter a lot to us. Think someone using this
> >peer to peer to gain control of a server in the cloud.  

From that perspective, do we have any idea what NVLink means for
topology and IOMMU provided isolation and translation?  I've seen a
device assignment user report that seems to suggest it might pretend to
be PCIe compatible, but the assigned GPU ultimately doesn't work
correctly in a VM, so perhaps the software compatibility is only so
deep. Thanks,

Alex


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Logan Gunthorpe


On 10/05/18 12:41 PM, Stephen  Bates wrote:
> Hi Jerome
> 
>>Note on GPU we do would not rely on ATS for peer to peer. Some part
>>of the GPU (DMA engines) do not necessarily support ATS. Yet those
>>are the part likely to be use in peer to peer.
> 
> OK this is good to know. I agree the DMA engine is probably one of the GPU 
> components most applicable to p2pdma.
> 
>>We (ake GPU people aka the good guys ;)) do no want to do peer to peer
>>for performance reasons ie we do not care having our transaction going
>>to the root complex and back down the destination. At least in use case
>>i am working on this is fine.
> 
> If the GPU people are the good guys does that make the NVMe people the bad 
> guys ;-). If so, what are the RDMA people??? Again good to know.

The NVMe people are the Nice Neighbors, the RDMA people are the
Righteous Romantics and the PCI people are the Pleasant Protagonists...

Obviously.

Logan


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Stephen Bates
Hi Jerome

>Hopes this helps understanding the big picture. I over simplify thing and
>devils is in the details.

This was a great primer thanks for putting it together. An LWN.net article 
perhaps ;-)??

Stephen




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Stephen Bates
Hi Jerome

>Note on GPU we do would not rely on ATS for peer to peer. Some part
>of the GPU (DMA engines) do not necessarily support ATS. Yet those
>are the part likely to be use in peer to peer.

OK this is good to know. I agree the DMA engine is probably one of the GPU 
components most applicable to p2pdma.

>We (ake GPU people aka the good guys ;)) do no want to do peer to peer
>for performance reasons ie we do not care having our transaction going
>to the root complex and back down the destination. At least in use case
>i am working on this is fine.

If the GPU people are the good guys does that make the NVMe people the bad guys 
;-). If so, what are the RDMA people??? Again good to know.

>Reasons is that GPU are giving up on PCIe (see all specialize link like
>NVlink that are popping up in GPU space). So for fast GPU inter-connect
>we have this new links. 

I look forward to Nvidia open-licensing NVLink to anyone who wants to use it 
;-). Or maybe we'll all just switch to OpenGenCCIX when the time comes.

>Also the IOMMU isolation do matter a lot to us. Think someone using this
>peer to peer to gain control of a server in the cloud.

I agree that IOMMU isolation is very desirable. Hence the desire to ensure we 
can keep the IOMMU on while doing p2pdma if at all possible whilst still 
delivering the desired performance to the user.

Stephen



[PATCH] blk-mq: Rework blk-mq timeout handling again

2018-05-10 Thread Bart Van Assche
Recently the blk-mq timeout handling code was reworked. See also Tejun
Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).

This patch reworks the blk-mq timeout handling code again. The timeout
handling code is simplified by introducing a state machine per request.
This change prevents the blk-mq timeout handling code from ignoring
completions that occur after blk_mq_check_expired() has been called and
before blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block
driver timeout handler always returns BLK_EH_RESET_TIMER, the result is
that the request never terminates.

Fix this race as follows:
- Replace the __deadline member of struct request with a new member
  called das that contains the generation number, state and deadline
  (a sketch of this layout follows the list below). Only 32 bits are
  used for the deadline field such that all three fields occupy only
  64 bits. This change reduces the maximum supported request timeout
  value from (2**63/HZ) to (2**31/HZ).
- Remove all request member variables that became superfluous due to
  this change: gstate, gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to
  this patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.
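
A sketch of the layout implied above (field names are illustrative, not
necessarily those used in the actual patch):

union blk_deadline_and_state {
	struct {
		u32 generation:30;	/* request generation number */
		u32 state:2;		/* MQ_RQ_IDLE / MQ_RQ_IN_FLIGHT / MQ_RQ_COMPLETE */
		u32 deadline;		/* jiffies; 32 bits -> max timeout 2**31/HZ */
	};
	u64 das;			/* one 64-bit word, so cmpxchg64() can update
					 * generation, state and deadline atomically;
					 * sh and mips fall back to a spinlock as the
					 * notes below mention */
};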

Notes:
- A spinlock is used to protect atomic changes of rq->das on those
  architectures that do not provide a cmpxchg64() implementation
  (sh and mips).
- Atomic instructions are only used to update the request state if
  a concurrent request state change could be in progress.
- blk_add_timer() has been split into two functions - one for the
  legacy block layer and one for blk-mq.

Signed-off-by: Bart Van Assche 
Cc: Tejun Heo 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Sebastian Ott 
Cc: Sagi Grimberg 
Cc: Israel Rukshin ,
Cc: Max Gurtovoy 
Signed-off-by: Jens Axboe 
---

Changes compared to v6:
- Used a union instead of bit manipulations to store multiple values into
  a single 64-bit field.
- Reduced the size of the timeout field from 64 to 32 bits.
- Made sure that the block layer still builds with this patch applied
  for the sh and mips architectures.
- Fixed two sparse warnings that were introduced by this patch in the
  WRITE_ONCE() calls.

Changes compared to v5:
- Restored the synchronize_rcu() call between marking a request for timeout
  handling and the actual timeout handling to avoid that timeout handling
  starts while .queue_rq() is still in progress if the timeout is very short.
- Only use cmpxchg() if another context could attempt to change the request
  state concurrently. Use WRITE_ONCE() otherwise.

Changes compared to v4:
- Addressed multiple review comments from Christoph. The most important are
  that atomic_long_cmpxchg() has been changed into cmpxchg() and also that
  there is now a nice and clean split between the legacy and blk-mq versions
  of blk_add_timer().
- Changed the patch name and modified the patch description because there is
  disagreement about whether or not the v4.16 blk-mq core can complete a
  single request twice. Kept the "Cc: stable" tag because of
  https://bugzilla.kernel.org/show_bug.cgi?id=199077.

Changes compared to v3 (see also 
https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html):
- Removed the spinlock again that was introduced to protect the request state.
  v4 uses atomic_long_cmpxchg() instead.
- Split __deadline into two variables - one for the legacy block layer and one
  for blk-mq.

Changes compared to v2 
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html):
- Rebased and retested on top of kernel v4.16.

Changes compared to v1 
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html):
- Removed the gstate and aborted_gstate members of struct request and used
  the __deadline member to encode both the generation and state information.

 block/blk-core.c   |   6 --
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c | 178 +
 block/blk-mq.h | 118 
 block/blk-timeout.c|  89 ++---
 block/blk.h|  21 +++---
 include/linux/blkdev.h |  44 ++--
 7 files changed, 221 insertions(+), 236 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1418a1ccd80d..223f650712f4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request 
*rq)
rq->internal_tag = -1;
rq->start_time_ns = ktime_get_ns();
rq->part = NULL;
-   

Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Logan Gunthorpe


On 10/05/18 11:11 AM, Stephen  Bates wrote:
>> Not to me. In the p2pdma code we specifically program DMA engines with
>> the PCI bus address. 
> 
> Ah yes of course. Brain fart on my part. We are not programming the P2PDMA 
> initiator with an IOVA but with the PCI bus address...
> 
>> So regardless of whether we are using the IOMMU or
>> not, the packets will be forwarded directly to the peer. If the ACS
>>  Redir bits are on they will be forced back to the RC by the switch and
>>  the transaction will fail. If we clear the ACS bits, the TLPs will go
>>  where we want and everything will work (but we lose the isolation of ACS).
> 
> Agreed.
> 
>>For EPs that support ATS, we should (but don't necessarily have to)
>>program them with the IOVA address so they can go through the
>>translation process which will allow P2P without disabling the ACS Redir
>>bits -- provided the ACS direct translation bit is set. (And btw, if it
>>is, then we lose the benefit of ACS protecting against malicious EPs).
>>But, per above, the ATS transaction should involve only the IOVA address
>>so the ACS bits not being set should not break ATS.
> 
> Well we would still have to clear some ACS bits but now we can clear only for 
> translated addresses.

We don't have to clear the ACS Redir bits as we did in the first case.
We just have to make sure the ACS Direct Translated bit is set.
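
For what it's worth, setting that bit only needs the standard config
accessors. Something along these lines (a sketch, not code from the series):

static int pci_enable_acs_direct_translated(struct pci_dev *pdev)
{
	int pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
	u16 ctrl;

	if (!pos)
		return -ENODEV;

	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
	ctrl |= PCI_ACS_DT;	/* Direct Translated P2P: ATS-translated
				 * requests may be routed directly to the peer */
	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
	return 0;
}

With PCI_ACS_DT set, translated TLPs can go peer to peer while untranslated
ones are still redirected upstream, so the Redirect bits themselves can stay
enabled.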

Logan


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Stephen Bates
> Not to me. In the p2pdma code we specifically program DMA engines with
> the PCI bus address. 

Ah yes of course. Brain fart on my part. We are not programming the P2PDMA 
initiator with an IOVA but with the PCI bus address...

> So regardless of whether we are using the IOMMU or
> not, the packets will be forwarded directly to the peer. If the ACS
>  Redir bits are on they will be forced back to the RC by the switch and
>  the transaction will fail. If we clear the ACS bits, the TLPs will go
>  where we want and everything will work (but we lose the isolation of ACS).

Agreed.

>For EPs that support ATS, we should (but don't necessarily have to)
>program them with the IOVA address so they can go through the
>translation process which will allow P2P without disabling the ACS Redir
>bits -- provided the ACS direct translation bit is set. (And btw, if it
>is, then we lose the benefit of ACS protecting against malicious EPs).
>But, per above, the ATS transaction should involve only the IOVA address
>so the ACS bits not being set should not break ATS.

Well we would still have to clear some ACS bits but now we can clear only for 
translated addresses.

Stephen




Re: [PATCH 6/9] sbitmap: warn if using smaller shallow depth than was setup

2018-05-10 Thread Jens Axboe
On 5/10/18 11:02 AM, Omar Sandoval wrote:
> On Thu, May 10, 2018 at 10:24:24AM -0600, Jens Axboe wrote:
>> From: Omar Sandoval 
>>
>> Make sure the user passed the right value to
>> sbitmap_queue_min_shallow_depth().
> 
> An unlucky bisect that lands between this change and the BFQ/Kyber
> changes is going to trigger this warning. We should have it after the
> BFQ/Kyber changes.

Good point, I'll move it after the kyber/bfq change.

-- 
Jens Axboe



Re: [PATCH 5/9] sbitmap: fix missed wakeups caused by sbitmap_queue_get_shallow()

2018-05-10 Thread Jens Axboe
On 5/10/18 11:01 AM, Omar Sandoval wrote:
> On Thu, May 10, 2018 at 10:24:23AM -0600, Jens Axboe wrote:
>> From: Omar Sandoval 
>>
>> The sbitmap queue wake batch is calculated such that once allocations
>> start blocking, all of the bits which are already allocated must be
>> enough to fulfill the batch counters of all of the waitqueues. However,
>> the shallow allocation depth can break this invariant, since we block
>> before our full depth is being utilized. Add
>> sbitmap_queue_min_shallow_depth(), which saves the minimum shallow depth
>> the sbq will use, and update sbq_calc_wake_batch() to take it into
>> account.
>>
>> Acked-by: Paolo Valente 
>> Signed-off-by: Omar Sandoval 
> 
> Reviewed -- haha, wait, thanks for picking this one up :)

Just checking if you're awake!

-- 
Jens Axboe



Re: [PATCH 8/9] kyber-iosched: update shallow depth when setting up hardware queue

2018-05-10 Thread Omar Sandoval
On Thu, May 10, 2018 at 10:24:26AM -0600, Jens Axboe wrote:
> We don't expect the async depth to be smaller than the wake batch
> count for sbitmap, but just in case, inform sbitmap of what shallow
> depth kyber may use.
> 
> Acked-by: Paolo Valente 

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jens Axboe 
> ---
>  block/kyber-iosched.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
> index 564967fafe5f..5b33dc394cc7 100644
> --- a/block/kyber-iosched.c
> +++ b/block/kyber-iosched.c
> @@ -378,6 +378,7 @@ static void kyber_exit_sched(struct elevator_queue *e)
>  
>  static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
>  {
> + struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data;
>   struct kyber_hctx_data *khd;
>   int i;
>  
> @@ -400,6 +401,8 @@ static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, 
> unsigned int hctx_idx)
>   khd->batching = 0;
>  
>   hctx->sched_data = khd;
> + sbitmap_queue_min_shallow_depth(&hctx->sched_tags->bitmap_tags,
> + kqd->async_depth);
>  
>   return 0;
>  }
> -- 
> 2.7.4
> 


Re: [PATCH 7/9] bfq-iosched: update shallow depth to smallest one used

2018-05-10 Thread Omar Sandoval
On Thu, May 10, 2018 at 10:24:25AM -0600, Jens Axboe wrote:
> If our shallow depth is smaller than the wake batching of sbitmap,
> we can introduce hangs. Ensure that sbitmap knows how low we'll go.
> 
> Acked-by: Paolo Valente 

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jens Axboe 
> ---
>  block/bfq-iosched.c | 17 ++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 10294124d597..b622e73a326a 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -5081,10 +5081,13 @@ void bfq_put_async_queues(struct bfq_data *bfqd, 
> struct bfq_group *bfqg)
>  
>  /*
>   * See the comments on bfq_limit_depth for the purpose of
> - * the depths set in the function.
> + * the depths set in the function. Return minimum shallow depth we'll use.
>   */
> -static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue 
> *bt)
> +static unsigned int bfq_update_depths(struct bfq_data *bfqd,
> +   struct sbitmap_queue *bt)
>  {
> + unsigned int i, j, min_shallow = UINT_MAX;
> +
>   /*
>* In-word depths if no bfq_queue is being weight-raised:
>* leaving 25% of tags only for sync reads.
> @@ -5115,14 +5118,22 @@ static void bfq_update_depths(struct bfq_data *bfqd, 
> struct sbitmap_queue *bt)
>   bfqd->word_depths[1][0] = max(((1U << bt->sb.shift) * 3) >> 4, 1U);
>   /* no more than ~37% of tags for sync writes (~20% extra tags) */
>   bfqd->word_depths[1][1] = max(((1U << bt->sb.shift) * 6) >> 4, 1U);
> +
> + for (i = 0; i < 2; i++)
> + for (j = 0; j < 2; j++)
> + min_shallow = min(min_shallow, bfqd->word_depths[i][j]);
> +
> + return min_shallow;
>  }
>  
>  static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
>  {
>   struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
>   struct blk_mq_tags *tags = hctx->sched_tags;
> + unsigned int min_shallow;
>  
> - bfq_update_depths(bfqd, &tags->bitmap_tags);
> + min_shallow = bfq_update_depths(bfqd, &tags->bitmap_tags);
> + sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, min_shallow);
>   return 0;
>  }
>  
> -- 
> 2.7.4
> 


Re: [PATCH 6/9] sbitmap: warn if using smaller shallow depth than was setup

2018-05-10 Thread Omar Sandoval
On Thu, May 10, 2018 at 10:24:24AM -0600, Jens Axboe wrote:
> From: Omar Sandoval 
> 
> Make sure the user passed the right value to
> sbitmap_queue_min_shallow_depth().

An unlucky bisect that lands between this change and the BFQ/Kyber
changes is going to trigger this warning. We should have it after the
BFQ/Kyber changes.

> Acked-by: Paolo Valente 
> Signed-off-by: Omar Sandoval 
> Signed-off-by: Jens Axboe 
> ---
>  lib/sbitmap.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index d21473b42465..8f0950fbaa5c 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -402,6 +402,8 @@ int __sbitmap_queue_get_shallow(struct sbitmap_queue *sbq,
>   unsigned int hint, depth;
>   int nr;
>  
> + WARN_ON_ONCE(shallow_depth < sbq->min_shallow_depth);
> +
>   hint = this_cpu_read(*sbq->alloc_hint);
>   depth = READ_ONCE(sbq->sb.depth);
>   if (unlikely(hint >= depth)) {
> -- 
> 2.7.4
> 


Re: [PATCH 5/9] sbitmap: fix missed wakeups caused by sbitmap_queue_get_shallow()

2018-05-10 Thread Omar Sandoval
On Thu, May 10, 2018 at 10:24:23AM -0600, Jens Axboe wrote:
> From: Omar Sandoval 
> 
> The sbitmap queue wake batch is calculated such that once allocations
> start blocking, all of the bits which are already allocated must be
> enough to fulfill the batch counters of all of the waitqueues. However,
> the shallow allocation depth can break this invariant, since we block
> before our full depth is being utilized. Add
> sbitmap_queue_min_shallow_depth(), which saves the minimum shallow depth
> the sbq will use, and update sbq_calc_wake_batch() to take it into
> account.
> 
> Acked-by: Paolo Valente 
> Signed-off-by: Omar Sandoval 

Reviewed -- haha, wait, thanks for picking this one up :)

> Signed-off-by: Jens Axboe 
> ---
>  include/linux/sbitmap.h | 29 +
>  lib/sbitmap.c   | 49 
> -
>  2 files changed, 69 insertions(+), 9 deletions(-)


Re: [PATCH 4/9] bfq-iosched: remove unused variable

2018-05-10 Thread Omar Sandoval
On Thu, May 10, 2018 at 10:24:22AM -0600, Jens Axboe wrote:
> bfqd->sb_shift was attempted used as a cache for the sbitmap queue
> shift, but we don't need it, as it never changes. Kill it with fire.
> 
> Acked-by: Paolo Valente 

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jens Axboe 
> ---
>  block/bfq-iosched.c | 16 +++-
>  block/bfq-iosched.h |  6 --
>  2 files changed, 7 insertions(+), 15 deletions(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 0cd8aa80c32d..10294124d597 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -5085,26 +5085,24 @@ void bfq_put_async_queues(struct bfq_data *bfqd, 
> struct bfq_group *bfqg)
>   */
>  static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue 
> *bt)
>  {
> - bfqd->sb_shift = bt->sb.shift;
> -
>   /*
>* In-word depths if no bfq_queue is being weight-raised:
>* leaving 25% of tags only for sync reads.
>*
>* In next formulas, right-shift the value
> -  * (1U<<bfqd->sb_shift), instead of computing directly
> -  * (1U<<(bfqd->sb_shift - something)), to be robust against
> -  * any possible value of bfqd->sb_shift, without having to
> +  * (1U<<bt->sb.shift), instead of computing directly
> +  * (1U<<(bt->sb.shift - something)), to be robust against
> +  * any possible value of bt->sb.shift, without having to
>* limit 'something'.
>*/
>   /* no more than 50% of tags for async I/O */
> - bfqd->word_depths[0][0] = max((1U << bfqd->sb_shift) >> 1, 1U);
> + bfqd->word_depths[0][0] = max((1U << bt->sb.shift) >> 1, 1U);
>   /*
>* no more than 75% of tags for sync writes (25% extra tags
>* w.r.t. async I/O, to prevent async I/O from starving sync
>* writes)
>*/
> - bfqd->word_depths[0][1] = max(((1U << bfqd->sb_shift) * 3) >> 2, 1U);
> + bfqd->word_depths[0][1] = max(((1U << bt->sb.shift) * 3) >> 2, 1U);
>  
>   /*
>* In-word depths in case some bfq_queue is being weight-
> @@ -5114,9 +5112,9 @@ static void bfq_update_depths(struct bfq_data *bfqd, 
> struct sbitmap_queue *bt)
>* shortage.
>*/
>   /* no more than ~18% of tags for async I/O */
> - bfqd->word_depths[1][0] = max(((1U>4, 1U);
> - bfqd->word_depths[1][0] = max(((1U << bfqd->sb_shift) * 3) >> 4, 1U);
>   /* no more than ~37% of tags for sync writes (~20% extra tags) */
> - bfqd->word_depths[1][1] = max(((1U>4, 1U);
> - bfqd->word_depths[1][1] = max(((1U << bfqd->sb_shift) * 6) >> 4, 1U);
>  }
>  
>  static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
> diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
> index 8ec7ff92cd6f..faac509cb35e 100644
> --- a/block/bfq-iosched.h
> +++ b/block/bfq-iosched.h
> @@ -636,12 +636,6 @@ struct bfq_data {
>   struct bfq_queue *bio_bfqq;
>  
>   /*
> -  * Cached sbitmap shift, used to compute depth limits in
> -  * bfq_update_depths.
> -  */
> - unsigned int sb_shift;
> -
> - /*
>* Depth limits used in bfq_limit_depth (see comments on the
>* function)
>*/
> -- 
> 2.7.4
> 


Re: [PATCH 3/9] bfq: calculate shallow depths at init time

2018-05-10 Thread Omar Sandoval
On Thu, May 10, 2018 at 10:24:21AM -0600, Jens Axboe wrote:
> It doesn't change, so don't put it in the per-IO hot path.
> 
> Acked-by: Paolo Valente 

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jens Axboe 
> ---
>  block/bfq-iosched.c | 97 
> +++--
>  1 file changed, 50 insertions(+), 47 deletions(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index db38e88a5670..0cd8aa80c32d 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -487,46 +487,6 @@ static struct request *bfq_choose_req(struct bfq_data 
> *bfqd,
>  }
>  
>  /*
> - * See the comments on bfq_limit_depth for the purpose of
> - * the depths set in the function.
> - */
> -static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue 
> *bt)
> -{
> - bfqd->sb_shift = bt->sb.shift;
> -
> - /*
> -  * In-word depths if no bfq_queue is being weight-raised:
> -  * leaving 25% of tags only for sync reads.
> -  *
> -  * In next formulas, right-shift the value
> -  * (1U<<(bfqd->sb_shift - something)), to be robust against
> -  * any possible value of bfqd->sb_shift, without having to
> -  * limit 'something'.
> -  */
> - /* no more than 50% of tags for async I/O */
> - bfqd->word_depths[0][0] = max((1U << bfqd->sb_shift) >> 1, 1U);
> - /*
> -  * no more than 75% of tags for sync writes (25% extra tags
> -  * w.r.t. async I/O, to prevent async I/O from starving sync
> -  * writes)
> -  */
> - bfqd->word_depths[0][1] = max(((1U << bfqd->sb_shift) * 3) >> 2, 1U);
> -
> - /*
> -  * In-word depths in case some bfq_queue is being weight-
> -  * raised: leaving ~63% of tags for sync reads. This is the
> -  * highest percentage for which, in our tests, application
> -  * start-up times didn't suffer from any regression due to tag
> -  * shortage.
> -  */
> - /* no more than ~18% of tags for async I/O */
> - bfqd->word_depths[1][0] = max(((1U << bfqd->sb_shift) * 3) >> 4, 1U);
> - /* no more than ~37% of tags for sync writes (~20% extra tags) */
> - bfqd->word_depths[1][1] = max(((1U << bfqd->sb_shift) * 6) >> 4, 1U);
> -}
> -
> -/*
>   * Async I/O can easily starve sync I/O (both sync reads and sync
>   * writes), by consuming all tags. Similarly, storms of sync writes,
>   * such as those that sync(2) may trigger, can starve sync reads.
> @@ -535,18 +495,11 @@ static void bfq_update_depths(struct bfq_data *bfqd, 
> struct sbitmap_queue *bt)
>   */
>  static void bfq_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
>  {
> - struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
>   struct bfq_data *bfqd = data->q->elevator->elevator_data;
> - struct sbitmap_queue *bt;
>  
>   if (op_is_sync(op) && !op_is_write(op))
>   return;
>  
> - bt = &tags->bitmap_tags;
> -
> - if (unlikely(bfqd->sb_shift != bt->sb.shift))
> - bfq_update_depths(bfqd, bt);
> -
>   data->shallow_depth =
>   bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)];
>  
> @@ -5126,6 +5079,55 @@ void bfq_put_async_queues(struct bfq_data *bfqd, 
> struct bfq_group *bfqg)
>   __bfq_put_async_bfqq(bfqd, &bfqd->async_idle_bfqq);
>  }
>  
> +/*
> + * See the comments on bfq_limit_depth for the purpose of
> + * the depths set in the function.
> + */
> +static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue 
> *bt)
> +{
> + bfqd->sb_shift = bt->sb.shift;
> +
> + /*
> +  * In-word depths if no bfq_queue is being weight-raised:
> +  * leaving 25% of tags only for sync reads.
> +  *
> +  * In next formulas, right-shift the value
> +  * (1U<<(bfqd->sb_shift - something)), to be robust against
> +  * any possible value of bfqd->sb_shift, without having to
> +  * limit 'something'.
> +  */
> + /* no more than 50% of tags for async I/O */
> + bfqd->word_depths[0][0] = max((1U << bfqd->sb_shift) >> 1, 1U);
> + /*
> +  * no more than 75% of tags for sync writes (25% extra tags
> +  * w.r.t. async I/O, to prevent async I/O from starving sync
> +  * writes)
> +  */
> + bfqd->word_depths[0][1] = max(((1U << bfqd->sb_shift) * 3) >> 2, 1U);
> +
> + /*
> +  * In-word depths in case some bfq_queue is being weight-
> +  * raised: leaving ~63% of tags for sync reads. This is the
> +  * highest percentage for which, in our tests, application
> +  * start-up times didn't suffer from any regression due to tag
> +  * shortage.
> +  */
> + /* no more than ~18% of tags for async I/O */
> + bfqd->word_depths[1][0] = max(((1U << bfqd->sb_shift) * 3) >> 4, 1U);
> + /* no more than ~37% of tags for sync writes (~20% extra tags) */
> + bfqd->word_depths[1][1] = max(((1U << bfqd->sb_shift) * 6) >> 4, 1U);
> +}
> +
> +static int 

Re: [PATCH 2/9] bfq-iosched: don't worry about reserved tags in limit_depth

2018-05-10 Thread Omar Sandoval
On Thu, May 10, 2018 at 10:24:20AM -0600, Jens Axboe wrote:
> Reserved tags are used for error handling, we don't need to
> care about them for regular IO. The core won't call us for these
> anyway.
> 
> Acked-by: Paolo Valente 

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jens Axboe 
> ---
>  block/bfq-iosched.c | 9 +
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index eefd8a4bc936..db38e88a5670 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -542,14 +542,7 @@ static void bfq_limit_depth(unsigned int op, struct 
> blk_mq_alloc_data *data)
>   if (op_is_sync(op) && !op_is_write(op))
>   return;
>  
> - if (data->flags & BLK_MQ_REQ_RESERVED) {
> - if (unlikely(!tags->nr_reserved_tags)) {
> - WARN_ON_ONCE(1);
> - return;
> - }
> - bt = &tags->breserved_tags;
> - } else
> - bt = &tags->bitmap_tags;
> + bt = &tags->bitmap_tags;
>  
>   if (unlikely(bfqd->sb_shift != bt->sb.shift))
>   bfq_update_depths(bfqd, bt);
> -- 
> 2.7.4
> 


Re: [PATCH 1/9] blk-mq: don't call into depth limiting for reserved tags

2018-05-10 Thread Omar Sandoval
On Thu, May 10, 2018 at 10:24:19AM -0600, Jens Axboe wrote:
> It's not useful, they are internal and/or error handling recovery
> commands.
> 
> Acked-by: Paolo Valente 

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jens Axboe 
> ---
>  block/blk-mq.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4e9d83594cca..64630caaf27e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -360,9 +360,11 @@ static struct request *blk_mq_get_request(struct 
> request_queue *q,
>  
>   /*
>* Flush requests are special and go directly to the
> -  * dispatch list.
> +  * dispatch list. Don't include reserved tags in the
> +  * limiting, as it isn't useful.
>*/
> - if (!op_is_flush(op) && e->type->ops.mq.limit_depth)
> + if (!op_is_flush(op) && e->type->ops.mq.limit_depth &&
> + !(data->flags & BLK_MQ_REQ_RESERVED))
>   e->type->ops.mq.limit_depth(op, data);
>   }
>  
> -- 
> 2.7.4
> 


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Logan Gunthorpe


On 10/05/18 08:16 AM, Stephen  Bates wrote:
> Hi Christian
> 
>> Why would a switch not identify that as a peer address? We use the PASID 
>>together with ATS to identify the address space which a transaction 
>>should use.
> 
> I think you are conflating two types of TLPs here. If the device supports ATS 
> then it will issue a TR TLP to obtain a translated address from the IOMMU. 
> This TR TLP will be addressed to the RP and so regardless of ACS it is going 
> up to the Root Port. When it gets the response it gets the physical address 
> and can use that with the TA bit set for the p2pdma. In the case of ATS 
> support we also have more control over ACS as we can disable it just for TA 
> addresses (as per 7.7.7.7.2 of the spec).

Yes. Remember if we are using the IOMMU the EP is being programmed
(regardless of whether it's a DMA engine, NTB window or GPUVA) with an
IOVA address which is separate from the device's PCI bus address. Any
packet addressed to an IOVA address is going to go back to the root
complex no matter what the ACS bits say. Only once ATS translates the
address back into the PCI bus address will the EP send packets to the
peer and the switch will attempt to route them to the peer and only then
do the ACS bits apply. And the direct translated ACS bit allows packets
that have purportedly been translated through.

>  >   If I'm not completely mistaken when you disable ACS it is perfectly 
>  >   possible that a bridge identifies a transaction as belonging to a peer 
>  >   address, which isn't what we want here.
>
> You are right here and I think this illustrates a problem for using the IOMMU 
> at all when P2PDMA devices do not support ATS. Let me explain:
> 
> If we want to do a P2PDMA and the DMA device does not support ATS then I 
> think we have to disable the IOMMU (something Mike suggested earlier). The 
> reason is that since ATS is not an option the EP must initiate the DMA using 
> the addresses passed down to it. If the IOMMU is on then this is an IOVA that 
> could (with some non-zero probability) point to an IO Memory address in the 
> same PCI domain. So if we disable ACS we are in trouble as we might MemWr to 
> the wrong place but if we enable ACS we lose much of the benefit of P2PDMA. 
> Disabling the IOMMU removes the IOVA risk and ironically also resolves the 
> IOMMU grouping issues.
> So I think if we want to support performant P2PDMA for devices that don't 
> have ATS (and no NVMe SSDs today support ATS) then we have to disable the 
> IOMMU. I know this is problematic for AMDs use case so perhaps we also need 
> to consider a mode for P2PDMA for devices that DO support ATS where we can 
> enable the IOMMU (but in this case EPs without ATS cannot participate as 
> P2PDMA DMA iniators).
> 
> Make sense?

Not to me. In the p2pdma code we specifically program DMA engines with
the PCI bus address. So regardless of whether we are using the IOMMU or
not, the packets will be forwarded directly to the peer. If the ACS
Redir bits are on they will be forced back to the RC by the switch and
the transaction will fail. If we clear the ACS bits, the TLPs will go
where we want and everything will work (but we lose the isolation of ACS).

For EPs that support ATS, we should (but don't necessarily have to)
program them with the IOVA address so they can go through the
translation process which will allow P2P without disabling the ACS Redir
bits -- provided the ACS direct translation bit is set. (And btw, if it
is, then we lose the benefit of ACS protecting against malicious EPs).
But, per above, the ATS transaction should involve only the IOVA address
so the ACS bits not being set should not break ATS.

Logan





[PATCH 8/9] kyber-iosched: update shallow depth when setting up hardware queue

2018-05-10 Thread Jens Axboe
We don't expect the async depth to be smaller than the wake batch
count for sbitmap, but just in case, inform sbitmap of what shallow
depth kyber may use.

Acked-by: Paolo Valente 
Signed-off-by: Jens Axboe 
---
 block/kyber-iosched.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 564967fafe5f..5b33dc394cc7 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -378,6 +378,7 @@ static void kyber_exit_sched(struct elevator_queue *e)
 
 static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
 {
+   struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data;
struct kyber_hctx_data *khd;
int i;
 
@@ -400,6 +401,8 @@ static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, 
unsigned int hctx_idx)
khd->batching = 0;
 
hctx->sched_data = khd;
+   sbitmap_queue_min_shallow_depth(&hctx->sched_tags->bitmap_tags,
+   kqd->async_depth);
 
return 0;
 }
-- 
2.7.4
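
The ordering matters here: per the kernel-doc added in patch 5/9,
sbitmap_queue_min_shallow_depth() has to run after the queue is initialized
and before any shallow allocation is issued. A rough sketch of that ordering
outside kyber (the function name and parameters below are made up for
illustration, not taken from the series):

	static int example_setup(struct sbitmap_queue *sbq, unsigned int depth,
				 unsigned int async_depth)
	{
		int ret;

		/* 1) initialize the bitmap queue (-1 lets sbitmap pick the shift) */
		ret = sbitmap_queue_init_node(sbq, depth, -1, false, GFP_KERNEL,
					      NUMA_NO_NODE);
		if (ret)
			return ret;

		/* 2) declare the smallest shallow depth that will ever be passed */
		sbitmap_queue_min_shallow_depth(sbq, async_depth);

		/* 3) only after that may the hot path do shallow allocations:
		 *    __sbitmap_queue_get_shallow(sbq, async_depth);
		 */
		return 0;
	}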



[PATCH 9/9] sbitmap: fix race in wait batch accounting

2018-05-10 Thread Jens Axboe
If we have multiple callers of sbq_wake_up(), we can end up in a
situation where the wait_cnt will continually go more and more
negative. Consider the case where our wake batch is 1, hence
wait_cnt will start out as 1.

wait_cnt == 1

CPU0CPU1
atomic_dec_return(), cnt == 0
atomic_dec_return(), cnt == -1
cmpxchg(-1, 0) (succeeds)
[wait_cnt now 0]
cmpxchg(0, 1) (fails)

This ends up with wait_cnt being 0, we'll wakeup immediately
next time. Going through the same loop as above again, and
we'll have wait_cnt -1.

For the case where we have a larger wake batch, the only
difference is that the starting point will be higher. We'll
still end up with continually smaller batch wakeups, which
defeats the purpose of the rolling wakeups.

Always reset the wait_cnt to the batch value. Then it doesn't
matter who wins the race. But ensure that whomever does win
the race is the one that increments the ws index as well.

Signed-off-by: Jens Axboe 
---
 lib/sbitmap.c | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 8f0950fbaa5c..ffc36e695f50 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -478,22 +478,26 @@ static void sbq_wake_up(struct sbitmap_queue *sbq)
 
wait_cnt = atomic_dec_return(&ws->wait_cnt);
if (wait_cnt <= 0) {
+   int ret;
+
wake_batch = READ_ONCE(sbq->wake_batch);
+
/*
 * Pairs with the memory barrier in sbitmap_queue_resize() to
 * ensure that we see the batch size update before the wait
 * count is reset.
 */
smp_mb__before_atomic();
+
/*
-* If there are concurrent callers to sbq_wake_up(), the last
-* one to decrement the wait count below zero will bump it back
-* up. If there is a concurrent resize, the count reset will
-* either cause the cmpxchg to fail or overwrite after the
-* cmpxchg.
+* For concurrent callers (either resize, or just multiple
+* concurrent wakeups of waiters), the one that wins the
+* cmpxchg() race increments the waiter index.
 */
-   atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wait_cnt + wake_batch);
-   sbq_index_atomic_inc(&sbq->wake_index);
+   ret = atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wake_batch);
+   if (ret == wait_cnt)
+   sbq_index_atomic_inc(&sbq->wake_index);
+
wake_up_nr(&ws->wait, wake_batch);
}
 }
-- 
2.7.4
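
To see the "only the cmpxchg winner advances the index" pattern in isolation,
here is a stand-alone user-space sketch using C11 atomics; it mirrors the
logic of the hunk above but is not the kernel code itself, and the names are
made up:

	#include <stdatomic.h>
	#include <stdio.h>

	static atomic_int wait_cnt;
	static atomic_int wake_index;

	static void wake_up_path(int wake_batch)
	{
		/* mimic atomic_dec_return(): fetch old value, compute new */
		int cnt = atomic_fetch_sub(&wait_cnt, 1) - 1;

		if (cnt <= 0) {
			int expected = cnt;

			/* always reset to the full batch; whoever wins the
			 * cmpxchg against the value it saw bumps the index */
			if (atomic_compare_exchange_strong(&wait_cnt, &expected,
							   wake_batch))
				atomic_fetch_add(&wake_index, 1);
		}
	}

	int main(void)
	{
		atomic_store(&wait_cnt, 1);
		wake_up_path(1);
		wake_up_path(1);
		printf("wait_cnt=%d wake_index=%d\n",
		       atomic_load(&wait_cnt), atomic_load(&wake_index));
		return 0;
	}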



[PATCH 4/9] bfq-iosched: remove unused variable

2018-05-10 Thread Jens Axboe
bfqd->sb_shift was attempted used as a cache for the sbitmap queue
shift, but we don't need it, as it never changes. Kill it with fire.

Acked-by: Paolo Valente 
Signed-off-by: Jens Axboe 
---
 block/bfq-iosched.c | 16 +++-
 block/bfq-iosched.h |  6 --
 2 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0cd8aa80c32d..10294124d597 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5085,26 +5085,24 @@ void bfq_put_async_queues(struct bfq_data *bfqd, struct 
bfq_group *bfqg)
  */
 static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
 {
-   bfqd->sb_shift = bt->sb.shift;
-
/*
 * In-word depths if no bfq_queue is being weight-raised:
 * leaving 25% of tags only for sync reads.
 *
 * In next formulas, right-shift the value
-* (1U<<(bfqd->sb_shift - something)), to be robust against
-* any possible value of bfqd->sb_shift, without having to
+* (1U<<(bt->sb.shift - something)), to be robust against
+* any possible value of bt->sb.shift, without having to
 * limit 'something'.
 */
/* no more than 50% of tags for async I/O */
-   bfqd->word_depths[0][0] = max((1U << bfqd->sb_shift) >> 1, 1U);
+   bfqd->word_depths[0][0] = max((1U << bt->sb.shift) >> 1, 1U);
/*
 * no more than 75% of tags for sync writes (25% extra tags
 * w.r.t. async I/O, to prevent async I/O from starving sync
 * writes)
 */
-   bfqd->word_depths[0][1] = max(((1U << bfqd->sb_shift) * 3) >> 2, 1U);
+   bfqd->word_depths[0][1] = max(((1U << bt->sb.shift) * 3) >> 2, 1U);
 
/*
 * In-word depths in case some bfq_queue is being weight-
@@ -5114,9 +5112,9 @@ static void bfq_update_depths(struct bfq_data *bfqd, 
struct sbitmap_queue *bt)
 * shortage.
 */
/* no more than ~18% of tags for async I/O */
-   bfqd->word_depths[1][0] = max(((1U << bfqd->sb_shift) * 3) >> 4, 1U);
+   bfqd->word_depths[1][0] = max(((1U << bt->sb.shift) * 3) >> 4, 1U);
/* no more than ~37% of tags for sync writes (~20% extra tags) */
-   bfqd->word_depths[1][1] = max(((1U << bfqd->sb_shift) * 6) >> 4, 1U);
+   bfqd->word_depths[1][1] = max(((1U << bt->sb.shift) * 6) >> 4, 1U);
 }
 
 static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 8ec7ff92cd6f..faac509cb35e 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -636,12 +636,6 @@ struct bfq_data {
struct bfq_queue *bio_bfqq;
 
/*
-* Cached sbitmap shift, used to compute depth limits in
-* bfq_update_depths.
-*/
-   unsigned int sb_shift;
-
-   /*
 * Depth limits used in bfq_limit_depth (see comments on the
 * function)
 */
-- 
2.7.4
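
As a concrete sanity check of the percentages in the comments: assuming
bt->sb.shift = 6 (i.e. 64 bits per word), the four limits evaluate to

	(1U << 6) >> 1        = 32   /* 50%  of 64 */
	((1U << 6) * 3) >> 2  = 48   /* 75%  of 64 */
	((1U << 6) * 3) >> 4  = 12   /* ~18% of 64 */
	((1U << 6) * 6) >> 4  = 24   /* ~37% of 64 */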



[PATCH 3/9] bfq: calculate shallow depths at init time

2018-05-10 Thread Jens Axboe
It doesn't change, so don't put it in the per-IO hot path.

Acked-by: Paolo Valente 
Signed-off-by: Jens Axboe 
---
 block/bfq-iosched.c | 97 +++--
 1 file changed, 50 insertions(+), 47 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index db38e88a5670..0cd8aa80c32d 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -487,46 +487,6 @@ static struct request *bfq_choose_req(struct bfq_data 
*bfqd,
 }
 
 /*
- * See the comments on bfq_limit_depth for the purpose of
- * the depths set in the function.
- */
-static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
-{
-   bfqd->sb_shift = bt->sb.shift;
-
-   /*
-* In-word depths if no bfq_queue is being weight-raised:
-* leaving 25% of tags only for sync reads.
-*
-* In next formulas, right-shift the value
-* (1U<<(bfqd->sb_shift - something)), to be robust against
-* any possible value of bfqd->sb_shift, without having to
-* limit 'something'.
-*/
-   /* no more than 50% of tags for async I/O */
-   bfqd->word_depths[0][0] = max((1U << bfqd->sb_shift) >> 1, 1U);
-   /*
-* no more than 75% of tags for sync writes (25% extra tags
-* w.r.t. async I/O, to prevent async I/O from starving sync
-* writes)
-*/
-   bfqd->word_depths[0][1] = max(((1U << bfqd->sb_shift) * 3) >> 2, 1U);
-
-   /*
-* In-word depths in case some bfq_queue is being weight-
-* raised: leaving ~63% of tags for sync reads. This is the
-* highest percentage for which, in our tests, application
-* start-up times didn't suffer from any regression due to tag
-* shortage.
-*/
-   /* no more than ~18% of tags for async I/O */
-   bfqd->word_depths[1][0] = max(((1U << bfqd->sb_shift) * 3) >> 4, 1U);
-   /* no more than ~37% of tags for sync writes (~20% extra tags) */
-   bfqd->word_depths[1][1] = max(((1U << bfqd->sb_shift) * 6) >> 4, 1U);
-}
-
-/*
  * Async I/O can easily starve sync I/O (both sync reads and sync
  * writes), by consuming all tags. Similarly, storms of sync writes,
  * such as those that sync(2) may trigger, can starve sync reads.
@@ -535,18 +495,11 @@ static void bfq_update_depths(struct bfq_data *bfqd, 
struct sbitmap_queue *bt)
  */
 static void bfq_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
 {
-   struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
struct bfq_data *bfqd = data->q->elevator->elevator_data;
-   struct sbitmap_queue *bt;
 
if (op_is_sync(op) && !op_is_write(op))
return;
 
-   bt = &tags->bitmap_tags;
-
-   if (unlikely(bfqd->sb_shift != bt->sb.shift))
-   bfq_update_depths(bfqd, bt);
-
data->shallow_depth =
bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)];
 
@@ -5126,6 +5079,55 @@ void bfq_put_async_queues(struct bfq_data *bfqd, struct 
bfq_group *bfqg)
__bfq_put_async_bfqq(bfqd, &bfqd->async_idle_bfqq);
 }
 
+/*
+ * See the comments on bfq_limit_depth for the purpose of
+ * the depths set in the function.
+ */
+static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
+{
+   bfqd->sb_shift = bt->sb.shift;
+
+   /*
+* In-word depths if no bfq_queue is being weight-raised:
+* leaving 25% of tags only for sync reads.
+*
+* In next formulas, right-shift the value
+* (1U<<(bfqd->sb_shift - something)), to be robust against
+* any possible value of bfqd->sb_shift, without having to
+* limit 'something'.
+*/
+   /* no more than 50% of tags for async I/O */
+   bfqd->word_depths[0][0] = max((1U << bfqd->sb_shift) >> 1, 1U);
+   /*
+* no more than 75% of tags for sync writes (25% extra tags
+* w.r.t. async I/O, to prevent async I/O from starving sync
+* writes)
+*/
+   bfqd->word_depths[0][1] = max(((1U << bfqd->sb_shift) * 3) >> 2, 1U);
+
+   /*
+* In-word depths in case some bfq_queue is being weight-
+* raised: leaving ~63% of tags for sync reads. This is the
+* highest percentage for which, in our tests, application
+* start-up times didn't suffer from any regression due to tag
+* shortage.
+*/
+   /* no more than ~18% of tags for async I/O */
+   bfqd->word_depths[1][0] = max(((1U << bfqd->sb_shift) * 3) >> 4, 1U);
+   /* no more than ~37% of tags for sync writes (~20% extra tags) */
+   bfqd->word_depths[1][1] = max(((1U << bfqd->sb_shift) * 6) >> 4, 1U);
+}
+
+static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
+{
+   struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
+   struct blk_mq_tags *tags = hctx->sched_tags;
+
+   bfq_update_depths(bfqd, &tags->bitmap_tags);

[PATCH 6/9] sbitmap: warn if using smaller shallow depth than was setup

2018-05-10 Thread Jens Axboe
From: Omar Sandoval 

Make sure the user passed the right value to
sbitmap_queue_min_shallow_depth().

Acked-by: Paolo Valente 
Signed-off-by: Omar Sandoval 
Signed-off-by: Jens Axboe 
---
 lib/sbitmap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index d21473b42465..8f0950fbaa5c 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -402,6 +402,8 @@ int __sbitmap_queue_get_shallow(struct sbitmap_queue *sbq,
unsigned int hint, depth;
int nr;
 
+   WARN_ON_ONCE(shallow_depth < sbq->min_shallow_depth);
+
hint = this_cpu_read(*sbq->alloc_hint);
depth = READ_ONCE(sbq->sb.depth);
if (unlikely(hint >= depth)) {
-- 
2.7.4



[PATCH 5/9] sbitmap: fix missed wakeups caused by sbitmap_queue_get_shallow()

2018-05-10 Thread Jens Axboe
From: Omar Sandoval 

The sbitmap queue wake batch is calculated such that once allocations
start blocking, all of the bits which are already allocated must be
enough to fulfill the batch counters of all of the waitqueues. However,
the shallow allocation depth can break this invariant, since we block
before our full depth is being utilized. Add
sbitmap_queue_min_shallow_depth(), which saves the minimum shallow depth
the sbq will use, and update sbq_calc_wake_batch() to take it into
account.

Acked-by: Paolo Valente 
Signed-off-by: Omar Sandoval 
Signed-off-by: Jens Axboe 
---
 include/linux/sbitmap.h | 29 +
 lib/sbitmap.c   | 49 -
 2 files changed, 69 insertions(+), 9 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 841585f6e5f2..0c4a9c242dd7 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -127,6 +127,12 @@ struct sbitmap_queue {
 * @round_robin: Allocate bits in strict round-robin order.
 */
bool round_robin;
+
+   /**
+* @min_shallow_depth: The minimum shallow depth which may be passed to
+* sbitmap_queue_get_shallow() or __sbitmap_queue_get_shallow().
+*/
+   unsigned int min_shallow_depth;
 };
 
 /**
@@ -390,6 +396,9 @@ int __sbitmap_queue_get(struct sbitmap_queue *sbq);
  * @shallow_depth: The maximum number of bits to allocate from a single word.
  * See sbitmap_get_shallow().
  *
+ * If you call this, make sure to call sbitmap_queue_min_shallow_depth() after
+ * initializing @sbq.
+ *
  * Return: Non-negative allocated bit number if successful, -1 otherwise.
  */
 int __sbitmap_queue_get_shallow(struct sbitmap_queue *sbq,
@@ -424,6 +433,9 @@ static inline int sbitmap_queue_get(struct sbitmap_queue 
*sbq,
  * @shallow_depth: The maximum number of bits to allocate from a single word.
  * See sbitmap_get_shallow().
  *
+ * If you call this, make sure to call sbitmap_queue_min_shallow_depth() after
+ * initializing @sbq.
+ *
  * Return: Non-negative allocated bit number if successful, -1 otherwise.
  */
 static inline int sbitmap_queue_get_shallow(struct sbitmap_queue *sbq,
@@ -439,6 +451,23 @@ static inline int sbitmap_queue_get_shallow(struct 
sbitmap_queue *sbq,
 }
 
 /**
+ * sbitmap_queue_min_shallow_depth() - Inform a &struct sbitmap_queue of the
+ * minimum shallow depth that will be used.
+ * @sbq: Bitmap queue in question.
+ * @min_shallow_depth: The minimum shallow depth that will be passed to
+ * sbitmap_queue_get_shallow() or __sbitmap_queue_get_shallow().
+ *
+ * sbitmap_queue_clear() batches wakeups as an optimization. The batch size
+ * depends on the depth of the bitmap. Since the shallow allocation functions
+ * effectively operate with a different depth, the shallow depth must be taken
+ * into account when calculating the batch size. This function must be called
+ * with the minimum shallow depth that will be used. Failure to do so can 
result
+ * in missed wakeups.
+ */
+void sbitmap_queue_min_shallow_depth(struct sbitmap_queue *sbq,
+unsigned int min_shallow_depth);
+
+/**
  * sbitmap_queue_clear() - Free an allocated bit and wake up waiters on a
  *  sbitmap_queue.
  * @sbq: Bitmap to free from.
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index e6a9c06ec70c..d21473b42465 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -270,18 +270,33 @@ void sbitmap_bitmap_show(struct sbitmap *sb, struct 
seq_file *m)
 }
 EXPORT_SYMBOL_GPL(sbitmap_bitmap_show);
 
-static unsigned int sbq_calc_wake_batch(unsigned int depth)
+static unsigned int sbq_calc_wake_batch(struct sbitmap_queue *sbq,
+   unsigned int depth)
 {
unsigned int wake_batch;
+   unsigned int shallow_depth;
 
/*
 * For each batch, we wake up one queue. We need to make sure that our
-* batch size is small enough that the full depth of the bitmap is
-* enough to wake up all of the queues.
+* batch size is small enough that the full depth of the bitmap,
+* potentially limited by a shallow depth, is enough to wake up all of
+* the queues.
+*
+* Each full word of the bitmap has bits_per_word bits, and there might
+* be a partial word. There are depth / bits_per_word full words and
+* depth % bits_per_word bits left over. In bitwise arithmetic:
+*
+* bits_per_word = 1 << shift
+* depth / bits_per_word = depth >> shift
+* depth % bits_per_word = depth & ((1 << shift) - 1)
+*
+* Each word can be limited to sbq->min_shallow_depth bits.
 */
-   wake_batch = SBQ_WAKE_BATCH;
-   if (wake_batch > depth / SBQ_WAIT_QUEUES)
-   wake_batch = max(1U, depth / SBQ_WAIT_QUEUES);
+   shallow_depth = min(1U << sbq->sb.shift, sbq->min_shallow_depth);
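
The rest of this hunk is cut off in the archive. Going by the comment above,
the remainder of sbq_calc_wake_batch() would look roughly like the sketch
below (a reconstruction, not text copied from the patch): the usable depth is
recomputed with each word capped at the minimum shallow depth, and the batch
is then derived from that reduced depth.

	/* full words contribute at most shallow_depth bits each; the
	 * partial word contributes its own bits, again capped */
	depth = ((depth >> sbq->sb.shift) * shallow_depth +
		 min(depth & ((1U << sbq->sb.shift) - 1), shallow_depth));
	wake_batch = clamp_t(unsigned int, depth / SBQ_WAIT_QUEUES, 1,
			     SBQ_WAKE_BATCH);

	return wake_batch;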

[PATCH 7/9] bfq-iosched: update shallow depth to smallest one used

2018-05-10 Thread Jens Axboe
If our shallow depth is smaller than the wake batching of sbitmap,
we can introduce hangs. Ensure that sbitmap knows how low we'll go.

Acked-by: Paolo Valente 
Signed-off-by: Jens Axboe 
---
 block/bfq-iosched.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 10294124d597..b622e73a326a 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5081,10 +5081,13 @@ void bfq_put_async_queues(struct bfq_data *bfqd, struct 
bfq_group *bfqg)
 
 /*
  * See the comments on bfq_limit_depth for the purpose of
- * the depths set in the function.
+ * the depths set in the function. Return minimum shallow depth we'll use.
  */
-static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
+static unsigned int bfq_update_depths(struct bfq_data *bfqd,
+ struct sbitmap_queue *bt)
 {
+   unsigned int i, j, min_shallow = UINT_MAX;
+
/*
 * In-word depths if no bfq_queue is being weight-raised:
 * leaving 25% of tags only for sync reads.
@@ -5115,14 +5118,22 @@ static void bfq_update_depths(struct bfq_data *bfqd, 
struct sbitmap_queue *bt)
bfqd->word_depths[1][0] = max(((1U << bt->sb.shift) * 3) >> 4, 1U);
/* no more than ~37% of tags for sync writes (~20% extra tags) */
bfqd->word_depths[1][1] = max(((1U << bt->sb.shift) * 6) >> 4, 1U);
+
+   for (i = 0; i < 2; i++)
+   for (j = 0; j < 2; j++)
+   min_shallow = min(min_shallow, bfqd->word_depths[i][j]);
+
+   return min_shallow;
 }
 
 static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
 {
struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
struct blk_mq_tags *tags = hctx->sched_tags;
+   unsigned int min_shallow;
 
-   bfq_update_depths(bfqd, &tags->bitmap_tags);
+   min_shallow = bfq_update_depths(bfqd, &tags->bitmap_tags);
+   sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, min_shallow);
return 0;
 }
 
-- 
2.7.4



[PATCHSET v3 0/9] blk-mq-sched and sbitmap shallow depth

2018-05-10 Thread Jens Axboe
Found another issue in sbitmap around our wake batch accounting.
See the last patch for details.

This series passes my testing. The sbitmap change was improved
by Omar, I swapped in his patch instead.

You can also find this series here:

http://git.kernel.dk/cgit/linux-block/log/?h=for-4.18/iosched

 block/bfq-iosched.c |  113 +---
 block/bfq-iosched.h |6 --
 block/blk-mq.c  |6 +-
 block/kyber-iosched.c   |3 +
 include/linux/sbitmap.h |   29 
 lib/sbitmap.c   |   69 ++---
 6 files changed, 148 insertions(+), 78 deletions(-)

-- 
Jens Axboe




[PATCH 2/9] bfq-iosched: don't worry about reserved tags in limit_depth

2018-05-10 Thread Jens Axboe
Reserved tags are used for error handling, we don't need to
care about them for regular IO. The core won't call us for these
anyway.

Acked-by: Paolo Valente 
Signed-off-by: Jens Axboe 
---
 block/bfq-iosched.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index eefd8a4bc936..db38e88a5670 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -542,14 +542,7 @@ static void bfq_limit_depth(unsigned int op, struct 
blk_mq_alloc_data *data)
if (op_is_sync(op) && !op_is_write(op))
return;
 
-   if (data->flags & BLK_MQ_REQ_RESERVED) {
-   if (unlikely(!tags->nr_reserved_tags)) {
-   WARN_ON_ONCE(1);
-   return;
-   }
-   bt = &tags->breserved_tags;
-   } else
-   bt = &tags->bitmap_tags;
+   bt = &tags->bitmap_tags;
 
if (unlikely(bfqd->sb_shift != bt->sb.shift))
bfq_update_depths(bfqd, bt);
-- 
2.7.4



[PATCH 1/9] blk-mq: don't call into depth limiting for reserved tags

2018-05-10 Thread Jens Axboe
It's not useful, they are internal and/or error handling recovery
commands.

Acked-by: Paolo Valente 
Signed-off-by: Jens Axboe 
---
 block/blk-mq.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4e9d83594cca..64630caaf27e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -360,9 +360,11 @@ static struct request *blk_mq_get_request(struct 
request_queue *q,
 
/*
 * Flush requests are special and go directly to the
-* dispatch list.
+* dispatch list. Don't include reserved tags in the
+* limiting, as it isn't useful.
 */
-   if (!op_is_flush(op) && e->type->ops.mq.limit_depth)
+   if (!op_is_flush(op) && e->type->ops.mq.limit_depth &&
+   !(data->flags & BLK_MQ_REQ_RESERVED))
e->type->ops.mq.limit_depth(op, data);
}
 
-- 
2.7.4



Re: [PATCHSET v2 0/7] blk-mq-sched and sbitmap shallow depth

2018-05-10 Thread Jens Axboe
On 5/9/18 10:32 PM, Paolo Valente wrote:
> 
> 
>> On 9 May 2018, at 22:49, Jens Axboe  wrote:
>>
>> Omar had some valid complaints about the previous patchset, mostly
>> around the fact that we should not be updating depths on a per-IO
>> basis.  He's right. In fact, BFQ oddly enough does it from the
>> ->limit_depth hook, but the shallow depth counts remain static.
>>
>> This series cleans up the ->limit_depth() handling, and keeps it
>> out of the hot path.
>>
>> Mike, this should (still) fix your hangs with BFQ. This series is
>> (functionally) identical to the bundled up version I sent you earlier.
>>
> 
> Only one question/curiosity: is performance unaffected by the lowering of
> the wake batch counts?

It's hard to discuss performance within the realm of "it hangs without
it", there's just no way around this. That said, We still have the
rolling waitqueues (default is 8) which also helps in reducing the
overhead on both wakeups and adding to wait queues. So we should still
be quite fine.

As it so happens, there's another bug in wake batch accounting in
sbitmap, which would have defeated much of the win over time anyway. New
series coming with that extra patch.

-- 
Jens Axboe



Re: [PATCH BUGFIX] block, bfq: postpone rq preparation to insert or merge

2018-05-10 Thread Bart Van Assche
On Fri, 2018-05-04 at 19:17 +0200, Paolo Valente wrote:
> When invoked for an I/O request rq, [ ... ]

Tested-by: Bart Van Assche 





Re: v4.16-rc1 + dm-mpath + BFQ

2018-05-10 Thread Paolo Valente


> On 10 May 2018, at 18:12, Bart Van Assche  wrote:
> 
> On Fri, 2018-05-04 at 22:11 +0200, Paolo Valente wrote:
>>> On 30 March 2018, at 18:57, Bart Van Assche  wrote:
>>> 
>>> On Fri, 2018-03-30 at 10:23 +0200, Paolo Valente wrote:
 Still 4.16-rc1, being that the version for which you reported this
 issue in the first place.
>>> 
>>> A vanilla v4.16-rc1 kernel is not sufficient to run the srp-test software
>>> since RDMA/CM support for the SRP target driver is missing from that kernel.
>>> That's why I asked you to use the for-next branch from my github repository
>>> in a previous e-mail. Anyway, since the necessary patches are now in
>>> linux-next, the srp-test software can also be run against linux-next. Here
>>> are the results that I obtained with label next-20180329 and the kernel
>>> config attached to your previous e-mail:
>>> 
>>> # while ./srp-test/run_tests -c -d -r 10 -e bfq; do :; done
>>> 
>>> BUG: unable to handle kernel NULL pointer dereference at 0200
>>> PGD 0 P4D 0 
>>> Oops: 0002 [#1] SMP PTI
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>> 1.0.0-prebuilt.qemu-project.org 04/01/2014
>>> RIP: 0010:rb_erase+0x284/0x380
>>> Call Trace:
>>> 
>>> elv_rb_del+0x24/0x30
>>> bfq_remove_request+0x9a/0x2e0 [bfq]
>>> ? rcu_read_lock_sched_held+0x64/0x70
>>> ? update_load_avg+0x72b/0x760
>>> bfq_finish_requeue_request+0x2e1/0x3b0 [bfq]
>>> ? __lock_is_held+0x5a/0xa0
>>> blk_mq_free_request+0x5f/0x1a0
>>> blk_put_request+0x23/0x60
>>> multipath_release_clone+0xe/0x10
>>> dm_softirq_done+0xe3/0x270
>>> __blk_mq_complete_request_remote+0x18/0x20
>>> flush_smp_call_function_queue+0xa1/0x150
>>> generic_smp_call_function_single_interrupt+0x13/0x30
>>> smp_call_function_single_interrupt+0x4d/0x220
>>> call_function_single_interrupt+0xf/0x20
>>> 
>> 
>> I suspect my recent fix [1] might fix your failure too.
>> 
>> [1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1682264.html
> 
> Hello Paolo,
> 
> With patch [1] applied I can't reproduce the aforementioned crash. I will add
> my Tested-by.
> 

Great, thanks!

Paolo

> Thanks,
> 
> Bart.
> 
> 



Re: v4.16-rc1 + dm-mpath + BFQ

2018-05-10 Thread Bart Van Assche
On Fri, 2018-05-04 at 22:11 +0200, Paolo Valente wrote:
> > On 30 March 2018, at 18:57, Bart Van Assche  wrote:
> > 
> > On Fri, 2018-03-30 at 10:23 +0200, Paolo Valente wrote:
> > > Still 4.16-rc1, being that the version for which you reported this
> > > issue in the first place.
> > 
> > A vanilla v4.16-rc1 kernel is not sufficient to run the srp-test software
> > since RDMA/CM support for the SRP target driver is missing from that kernel.
> > That's why I asked you to use the for-next branch from my github repository
> > in a previous e-mail. Anyway, since the necessary patches are now in
> > linux-next, the srp-test software can also be run against linux-next. Here
> > are the results that I obtained with label next-20180329 and the kernel
> > config attached to your previous e-mail:
> > 
> > # while ./srp-test/run_tests -c -d -r 10 -e bfq; do :; done
> > 
> > BUG: unable to handle kernel NULL pointer dereference at 0200
> > PGD 0 P4D 0 
> > Oops: 0002 [#1] SMP PTI
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > 1.0.0-prebuilt.qemu-project.org 04/01/2014
> > RIP: 0010:rb_erase+0x284/0x380
> > Call Trace:
> > 
> > elv_rb_del+0x24/0x30
> > bfq_remove_request+0x9a/0x2e0 [bfq]
> > ? rcu_read_lock_sched_held+0x64/0x70
> > ? update_load_avg+0x72b/0x760
> > bfq_finish_requeue_request+0x2e1/0x3b0 [bfq]
> > ? __lock_is_held+0x5a/0xa0
> > blk_mq_free_request+0x5f/0x1a0
> > blk_put_request+0x23/0x60
> > multipath_release_clone+0xe/0x10
> > dm_softirq_done+0xe3/0x270
> > __blk_mq_complete_request_remote+0x18/0x20
> > flush_smp_call_function_queue+0xa1/0x150
> > generic_smp_call_function_single_interrupt+0x13/0x30
> > smp_call_function_single_interrupt+0x4d/0x220
> > call_function_single_interrupt+0xf/0x20
> > 
> 
> I suspect my recent fix [1] might fix your failure too.
> 
> [1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1682264.html

Hello Paolo,

With patch [1] applied I can't reproduce the aforementioned crash. I will add
my Tested-by.

Thanks,

Bart.




Re: v4.16-rc1 + dm-mpath + BFQ

2018-05-10 Thread Laurence Oberman
On Thu, 2018-05-10 at 15:16 +, Bart Van Assche wrote:
> On Fri, 2018-05-04 at 16:42 -0400, Laurence Oberman wrote:
> > I was never able to reproduce Barts original issue using his tree
> > and
> > actual mlx5/cx4 hardware and ibsrp
> > I enabled BFQ with no other special tuning for the mpath and
> > subpaths.
> > I was waiting for him to come back from vacation to check with him.
> 
> (back in the office)
> 
> Hello Laurence,
> 
> What I understood from off-list communication is that you tried to
> find
> a way to reproduce what I reported without using the srp-test
> software.
> My understanding is that both Paolo and I can reproduce the reported
> issue
> with the srp-test software.
> 
> Bart.
> 
> 
> 

Hello Bart

using your kernel
4.17.0-rc2.bart+

CONFIG_IOSCHED_BFQ=y
CONFIG_BFQ_GROUP_IOSCHED=y

These are all SRP LUNS

36001405b2b5c6c24c084b6fa4d55da2f dm-27 LIO-ORG ,block-10
size=3.9G features='2 queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 2:0:0:9  sdap 66:144 active ready running
  `- 1:0:0:9  sdaz 67:48  active ready running

36001405b26ebe76dcb94a489f6f245f8 dm-18 LIO-ORG ,block-21
size=3.9G features='2 queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 2:0:0:20 sdx  65:112 active ready running
  `- 1:0:0:20 sdaa 65:160 active ready running

[root@ibclient ~]# cd /sys/block
[root@ibclient block]# cat /sys/block/dm-18/queue/scheduler
mq-deadline kyber [bfq] none
[root@ibclient block]# cat /sys/block/sdaa/queue/scheduler
mq-deadline kyber [bfq] none
[root@ibclient block]# cat /sys/block/sdx/queue/scheduler
mq-deadline kyber [bfq] none

Not using the test software just exercising the LUNS via my own tests I
am unable to get the OOPS

I guess something in the srp-test software triggers it then.

Doing plenty of IO to 5 mpath devices (1.3Gbytes/sec)

#Time cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map
KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut 
12:08:320   0  1437   1107  88G   5M   1G 902M 300M
178M  1380K345  0  0  6 74  0   4 

Thanks
Laurence



[PATCH 00/18] Convert default pr_fmt from empty to KBUILD_MODNAME

2018-05-10 Thread Joe Perches
pr_<level> logging uses allow a prefix to be specified with a
specific #define pr_fmt

The default of pr_fmt in printk.h is #define pr_fmt(fmt) fmt
so no prefixing of logging output is generically done.

There are several output logging uses like dump_stack() that are
unprefixed and should remain unprefixed.

This patch series attempts to convert the default #define of pr_fmt to
KBUILD_MODNAME ": " fmt and as well update the various bits of the kernel
that should _not_ be prefixed by adding #define pr_fmt(fmt) fmt to those
compilation units that do not want output message prefixing.
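
To make the mechanism concrete, a minimal made-up module built as example.ko
would behave like this once the default is changed (illustration only, not
part of the series):

	#include <linux/module.h>
	#include <linux/printk.h>

	/*
	 * With the proposed default of
	 *     #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
	 * no per-file define is needed; the line below logs
	 * "example: device ready".
	 */
	static int __init example_init(void)
	{
		pr_info("device ready\n");
		return 0;
	}
	module_init(example_init);
	MODULE_LICENSE("GPL");

	/*
	 * A compilation unit that must stay unprefixed instead keeps
	 *     #define pr_fmt(fmt) fmt
	 * above its includes, which is what this series adds to the files
	 * whose output should not be prefixed.
	 */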

There are about 1200 uses of #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
that could be removed if the default is changed.

A script that does this removal (and removes any blank lines that follow)
for the linux-kernel tree is included below:

$ git grep -P --name-only "define\s+pr_fmt\b" | \
  grep -v include/linux/printk.h | \
  xargs perl -i -e 'local $/; while (<>) {s/(\n)*[ \t]*#[ \t]*define[ 
\t]+pr_fmt[ \t]*\([ \t]*(\w+)[ \t]*\)[ \t]*KBUILD_MODNAME[ \t]*\": \"[ \t]*\2[ 
\t]*\n\s*/\1\n/; s/^\n//; print;}'

This script should be run after this patch series is applied.

The above script output diff is currently:

1198 files changed, 70 insertions(+), 2241 deletions(-)

Joe Perches (18):
  kernel: Use pr_fmt
  lib: Use pr_fmt
  printk: Convert pr_fmt from blank define to KBUILD_MODNAME
  x86: Remove pr_fmt duplicate logging prefixes
  x86/mtrr: Rename main.c to mtrr.c and remove duplicate prefixes
  net: Remove pr_fmt duplicate logging prefixes
  blk-mq: Remove pr_fmt duplicate logging prefixes
  random: Remove pr_fmt duplicate logging prefixes
  ptp: Remove pr_fmt duplicate logging prefixes
  efifb: Remove pr_fmt duplicate logging prefixes
  proc: Remove pr_fmt duplicate logging prefixes
  uprobes: Remove pr_fmt duplicate logging prefixes
  printk: Remove pr_fmt duplicate logging prefixes
  lib/mpi: Remove pr_fmt duplicate logging prefixes
  security: Remove pr_fmt duplicate logging prefixes
  aoe: Remove pr_fmt duplicate logging prefixes
  security: encrypted-keys: Remove pr_fmt duplicate logging prefixes
  rcu: Use pr_fmt to prefix "rcu: " to logging output

 arch/x86/events/amd/ibs.c  |  2 +-
 arch/x86/kernel/cpu/mtrr/Makefile  |  2 +-
 arch/x86/kernel/cpu/mtrr/{main.c => mtrr.c}| 33 ++---
 arch/x86/kernel/e820.c | 32 ++--
 arch/x86/kernel/hpet.c |  5 +-
 arch/x86/kernel/uprobes.c  |  4 +-
 arch/x86/mm/numa.c | 22 -
 block/blk-mq.c |  9 ++--
 drivers/block/aoe/aoeblk.c | 29 ++-
 drivers/block/aoe/aoechr.c | 11 ++---
 drivers/block/aoe/aoecmd.c | 34 ++---
 drivers/block/aoe/aoedev.c | 19 +++-
 drivers/block/aoe/aoemain.c|  6 +--
 drivers/block/aoe/aoenet.c | 19 +++-
 drivers/char/hw_random/via-rng.c   | 10 ++--
 drivers/char/random.c  | 16 +++---
 drivers/ptp/ptp_clock.c|  4 +-
 drivers/video/fbdev/efifb.c| 48 +-
 fs/proc/root.c |  6 +--
 include/linux/printk.h |  2 +-
 kernel/acct.c  |  2 +
 kernel/async.c | 14 +++---
 kernel/audit_tree.c|  2 +-
 kernel/backtracetest.c |  8 +--
 kernel/crash_core.c| 29 ++-
 kernel/events/uprobes.c|  3 +-
 kernel/exit.c  |  2 +
 kernel/hung_task.c | 13 +++--
 kernel/kprobes.c   | 20 +---
 kernel/module.c| 59 +++
 kernel/panic.c |  3 ++
 kernel/params.c| 13 +++--
 kernel/pid.c   |  2 +
 kernel/printk/printk.c |  2 +-
 kernel/profile.c   |  2 +
 kernel/range.c |  2 +-
 kernel/rcu/rcu_segcblist.c |  2 +
 kernel/rcu/rcuperf.c   | 10 ++--
 kernel/rcu/rcutorture.c| 46 +-
 kernel/rcu/srcutiny.c  |  2 +
 kernel/rcu/srcutree.c  |  5 +-
 kernel/rcu/tiny.c  |  3 ++
 kernel/rcu/tree.c  |  8 +--
 kernel/rcu/tree_plugin.h   | 67 +++---
 kernel/rcu/update.c| 19 +---
 kernel/relay.c |  5 +-
 kernel/seccomp.c   |  4 +-
 

[PATCH 16/18] aoe: Remove pr_fmt duplicate logging prefixes

2018-05-10 Thread Joe Perches
Converting pr_fmt from a simple define to use KBUILD_MODNAME added
some duplicate logging prefixes to existing uses.

Remove them.

Signed-off-by: Joe Perches 
---
 drivers/block/aoe/aoeblk.c  | 29 ++---
 drivers/block/aoe/aoechr.c  | 11 +--
 drivers/block/aoe/aoecmd.c  | 34 --
 drivers/block/aoe/aoedev.c  | 19 +++
 drivers/block/aoe/aoemain.c |  6 +++---
 drivers/block/aoe/aoenet.c  | 19 ---
 6 files changed, 53 insertions(+), 65 deletions(-)

diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
index 6797e6c23c8a..d008cc46c0bd 100644
--- a/drivers/block/aoe/aoeblk.c
+++ b/drivers/block/aoe/aoeblk.c
@@ -205,7 +205,7 @@ aoedisk_add_debugfs(struct aoedev *d)
entry = debugfs_create_file(p, 0444, aoe_debugfs_dir, d,
&aoe_debugfs_fops);
if (IS_ERR_OR_NULL(entry)) {
-   pr_info("aoe: cannot create debugfs file for %s\n",
+   pr_info("cannot create debugfs file for %s\n",
d->gd->disk_name);
return;
}
@@ -237,8 +237,7 @@ aoeblk_open(struct block_device *bdev, fmode_t mode)
ulong flags;
 
if (!virt_addr_valid(d)) {
-   pr_crit("aoe: invalid device pointer in %s\n",
-   __func__);
+   pr_crit("invalid device pointer in %s\n", __func__);
WARN_ON(1);
return -ENODEV;
}
@@ -282,8 +281,8 @@ aoeblk_request(struct request_queue *q)
 
d = q->queuedata;
if ((d->flags & DEVFL_UP) == 0) {
-   pr_info_ratelimited("aoe: device %ld.%d is not up\n",
-   d->aoemajor, d->aoeminor);
+   pr_info_ratelimited("device %ld.%d is not up\n",
+   d->aoemajor, d->aoeminor);
while ((rq = blk_peek_request(q))) {
blk_start_request(rq);
aoe_end_request(d, rq, 1);
@@ -299,7 +298,7 @@ aoeblk_getgeo(struct block_device *bdev, struct hd_geometry 
*geo)
struct aoedev *d = bdev->bd_disk->private_data;
 
if ((d->flags & DEVFL_UP) == 0) {
-   printk(KERN_ERR "aoe: disk not up\n");
+   pr_err("disk not up\n");
return -ENODEV;
}
 
@@ -319,7 +318,7 @@ aoeblk_ioctl(struct block_device *bdev, fmode_t mode, uint 
cmd, ulong arg)
 
d = bdev->bd_disk->private_data;
if ((d->flags & DEVFL_UP) == 0) {
-   pr_err("aoe: disk not up\n");
+   pr_err("disk not up\n");
return -ENODEV;
}
 
@@ -332,7 +331,7 @@ aoeblk_ioctl(struct block_device *bdev, fmode_t mode, uint 
cmd, ulong arg)
 
/* udev calls scsi_id, which uses SG_IO, resulting in noise */
if (cmd != SG_IO)
-   pr_info("aoe: unknown ioctl 0x%x\n", cmd);
+   pr_info("unknown ioctl 0x%x\n", cmd);
 
return -ENOTTY;
 }
@@ -370,22 +369,22 @@ aoeblk_gdalloc(void *vp)
 
gd = alloc_disk(AOE_PARTITIONS);
if (gd == NULL) {
-   pr_err("aoe: cannot allocate disk structure for %ld.%d\n",
-   d->aoemajor, d->aoeminor);
+   pr_err("cannot allocate disk structure for %ld.%d\n",
+  d->aoemajor, d->aoeminor);
goto err;
}
 
mp = mempool_create(MIN_BUFS, mempool_alloc_slab, mempool_free_slab,
buf_pool_cache);
if (mp == NULL) {
-   printk(KERN_ERR "aoe: cannot allocate bufpool for %ld.%d\n",
-   d->aoemajor, d->aoeminor);
+   pr_err("cannot allocate bufpool for %ld.%d\n",
+  d->aoemajor, d->aoeminor);
goto err_disk;
}
q = blk_init_queue(aoeblk_request, &d->lock);
if (q == NULL) {
-   pr_err("aoe: cannot allocate block queue for %ld.%d\n",
-   d->aoemajor, d->aoeminor);
+   pr_err("cannot allocate block queue for %ld.%d\n",
+  d->aoemajor, d->aoeminor);
goto err_mempool;
}
blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
@@ -457,7 +456,7 @@ aoeblk_init(void)
return -ENOMEM;
aoe_debugfs_dir = debugfs_create_dir("aoe", NULL);
if (IS_ERR_OR_NULL(aoe_debugfs_dir)) {
-   pr_info("aoe: cannot create debugfs directory\n");
+   pr_info("cannot create debugfs directory\n");
aoe_debugfs_dir = NULL;
}
return 0;
diff --git a/drivers/block/aoe/aoechr.c b/drivers/block/aoe/aoechr.c
index ab41be625a53..57553ee1375c 100644
--- a/drivers/block/aoe/aoechr.c
+++ b/drivers/block/aoe/aoechr.c
@@ -69,8 +69,7 @@ static int
 interfaces(const char __user *str, size_t size)
 {
if (set_aoe_iflist(str, size)) {
-   printk(KERN_ERR
-   "aoe: 

[PATCH 07/18] blk-mq: Remove pr_fmt duplicate logging prefixes

2018-05-10 Thread Joe Perches
Converting pr_fmt from a simple define to use KBUILD_MODNAME added
some duplicate logging prefixes to existing uses.

Remove them.

Signed-off-by: Joe Perches 
---
 block/blk-mq.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9ce9cac16c3f..37e6206e745c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2782,13 +2782,13 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set 
*set)
} while (set->queue_depth);
 
if (!set->queue_depth || err) {
-   pr_err("blk-mq: failed to allocate request map\n");
+   pr_err("failed to allocate request map\n");
return -ENOMEM;
}
 
if (depth != set->queue_depth)
-   pr_info("blk-mq: reduced tag depth (%u -> %u)\n",
-   depth, set->queue_depth);
+   pr_info("reduced tag depth (%u -> %u)\n",
+   depth, set->queue_depth);
 
return 0;
 }
@@ -2845,8 +2845,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
return -EINVAL;
 
if (set->queue_depth > BLK_MQ_MAX_DEPTH) {
-   pr_info("blk-mq: reduced tag depth to %u\n",
-   BLK_MQ_MAX_DEPTH);
+   pr_info("reduced tag depth to %u\n", BLK_MQ_MAX_DEPTH);
set->queue_depth = BLK_MQ_MAX_DEPTH;
}
 
-- 
2.15.0



Re: v4.16-rc1 + dm-mpath + BFQ

2018-05-10 Thread Paolo Valente


> On 10 May 2018, at 17:16, Bart Van Assche  wrote:
> 
> On Fri, 2018-05-04 at 16:42 -0400, Laurence Oberman wrote:
>> I was never able to reproduce Barts original issue using his tree and
>> actual mlx5/cx4 hardware and ibsrp
>> I enabled BFQ with no other special tuning for the mpath and subpaths.
>> I was waiting for him to come back from vacation to check with him.
> 
> (back in the office)
> 
> Hello Laurence,
> 
> What I understood from off-list communication is that you tried to find
> a way to reproduce what I reported without using the srp-test software.
> My understanding is that both Paolo and I can reproduce the reported issue
> with the srp-test software.
> 

Thanks for chiming in, Bart.

Above all, with my fix [1] it should be gone.

Looking forward to your feedback,
Paolo

[1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1682264.html
> Bart.
> 
> 
> 



Re: v4.16-rc1 + dm-mpath + BFQ

2018-05-10 Thread Bart Van Assche
On Fri, 2018-05-04 at 16:42 -0400, Laurence Oberman wrote:
> I was never able to reproduce Barts original issue using his tree and
> actual mlx5/cx4 hardware and ibsrp
> I enabled BFQ with no other special tuning for the mpath and subpaths.
> I was waiting for him to come back from vacation to check with him.

(back in the office)

Hello Laurence,

What I understood from off-list communication is that you tried to find
a way to reproduce what I reported without using the srp-test software.
My understanding is that both Paolo and I can reproduce the reported issue
with the srp-test software.

Bart.





Re: stop using buffer heads in xfs and iomap

2018-05-10 Thread Darrick J. Wong
On Wed, May 09, 2018 at 09:47:57AM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this series adds support for reading blocks from disk using the iomap
> interface, and then gradually switched the buffered I/O path to not
> require buffer heads.  It has survived xfstests for 1k and 4k block
> size.
> 
> There are various small changes to the core VFS, block and readahead
> code to make this happen.
> 
> 
> A git tree is available at:
> 
> git://git.infradead.org/users/hch/xfs.git xfs-remove-bufferheads
> 
> Gitweb:
> 
> 
> http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-remove-bufferheads

I ran xfstests on this for fun last night but hung in g/095:

FSTYP -- xfs (debug)
PLATFORM  -- Linux/x86_64 submarine-djwong-mtr01 4.17.0-rc4-djw
MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1, -i sparse=1, -b size=1024, /dev/sdf
MOUNT_OPTIONS -- /dev/sdf /opt

FWIW the stock v4 and the 'v5 with everything and 4k blocks' vms
passed, so I guess there's a bug somewhere in the sub-page block size
code paths...

--D

[ 2586.943205] run fstests generic/095 at 2018-05-09 23:28:01
[ 2587.252740] XFS (sdf): Unmounting Filesystem
[ 2587.908441] XFS (sdf): Mounting V5 Filesystem
[ 2587.914685] XFS (sdf): Ending clean mount
[ 2702.258764] INFO: task kworker/u10:3:11834 blocked for more than 60 seconds.
[ 2702.261734]   Tainted: GW 4.17.0-rc4-djw #2
[ 2702.263607] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 2702.265600] kworker/u10:3   D11984 11834  2 0x8000
[ 2702.273445] Workqueue: writeback wb_workfn (flush-8:80)
[ 2702.274751] Call Trace:
[ 2702.275339]  ? __schedule+0x3e4/0xa70
[ 2702.276112]  ? blk_flush_plug_list+0xe4/0x280
[ 2702.277086]  schedule+0x40/0x90
[ 2702.277967]  io_schedule+0x16/0x40
[ 2702.278774]  __lock_page+0x12d/0x160
[ 2702.279680]  ? page_cache_tree_insert+0x100/0x100
[ 2702.280712]  write_cache_pages+0x32c/0x530
[ 2702.281820]  ? xfs_add_to_ioend+0x350/0x350 [xfs]
[ 2702.292350]  xfs_vm_writepages+0x57/0x80 [xfs]
[ 2702.294048]  do_writepages+0x1a/0x70
[ 2702.295068]  __writeback_single_inode+0x59/0x800
[ 2702.296118]  writeback_sb_inodes+0x282/0x550
[ 2702.297039]  __writeback_inodes_wb+0x87/0xb0
[ 2702.298173]  wb_writeback+0x430/0x5d0
[ 2702.299332]  ? wb_workfn+0x448/0x740
[ 2702.300578]  wb_workfn+0x448/0x740
[ 2702.301434]  ? lock_acquire+0xab/0x200
[ 2702.305413]  process_one_work+0x1ef/0x650
[ 2702.306687]  worker_thread+0x4d/0x3e0
[ 2702.307671]  kthread+0x106/0x140
[ 2702.308473]  ? rescuer_thread+0x340/0x340
[ 2702.309442]  ? kthread_delayed_work_timer_fn+0x90/0x90
[ 2702.310995]  ret_from_fork+0x3a/0x50
[ 2702.312088] INFO: task fio:2618 blocked for more than 60 seconds.
[ 2702.313395]   Tainted: GW 4.17.0-rc4-djw #2
[ 2702.315139] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 2702.316820] fio D14224  2618   2612 0x
[ 2702.318050] Call Trace:
[ 2702.318757]  ? __schedule+0x3e4/0xa70
[ 2702.319639]  ? rwsem_down_read_failed+0x7f/0x170
[ 2702.320798]  schedule+0x40/0x90
[ 2702.321630]  rwsem_down_read_failed+0x128/0x170
[ 2702.322752]  ? current_time+0x18/0x70
[ 2702.323857]  ? xfs_file_dio_aio_read+0x6d/0x1c0 [xfs]
[ 2702.325162]  ? call_rwsem_down_read_failed+0x14/0x30
[ 2702.326423]  call_rwsem_down_read_failed+0x14/0x30
[ 2702.328393]  ? xfs_ilock+0x28f/0x330 [xfs]
[ 2702.329539]  down_read_nested+0x9d/0xa0
[ 2702.330452]  xfs_ilock+0x28f/0x330 [xfs]
[ 2702.331427]  xfs_file_dio_aio_read+0x6d/0x1c0 [xfs]
[ 2702.332590]  xfs_file_read_iter+0x9a/0xb0 [xfs]
[ 2702.333992]  __vfs_read+0x136/0x1a0
[ 2702.335133]  vfs_read+0xa3/0x150
[ 2702.336129]  ksys_read+0x45/0xa0
[ 2702.337085]  do_syscall_64+0x56/0x180
[ 2702.337985]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2702.339537] RIP: 0033:0x7ff4c152751d
[ 2702.340623] RSP: 002b:7fffa13c93b0 EFLAGS: 0293 ORIG_RAX: 

[ 2702.342774] RAX: ffda RBX: 00a0a2c0 RCX: 7ff4c152751d
[ 2702.344269] RDX: 0400 RSI: 00a17c00 RDI: 0005
[ 2702.345801] RBP: 7ff4a6f57000 R08: 08002000 R09: 0004
[ 2702.347342] R10: 0001 R11: 0293 R12: 
[ 2702.348861] R13: 0400 R14: 00a0a2e8 R15: 7ff4a6f57000
[ 2702.351298] INFO: task fio:2619 blocked for more than 60 seconds.
[ 2702.353158]   Tainted: GW 4.17.0-rc4-djw #2
[ 2702.355103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 2702.357586] fio D14224  2619   2612 0x
[ 2702.359181] Call Trace:
[ 2702.359815]  ? __schedule+0x3e4/0xa70
[ 2702.360708]  ? rwsem_down_read_failed+0x7f/0x170
[ 2702.361880]  schedule+0x40/0x90
[ 2702.362727]  rwsem_down_read_failed+0x128/0x170
[ 2702.363811]  ? current_time+0x18/0x70
[ 2702.364775]  ? xfs_file_dio_aio_read+0x6d/0x1c0 [xfs]
[ 2702.365994]  ? call_rwsem_down_read_failed+0x14/0x30
[ 

Re: [PATCH 10/33] iomap: add an iomap-based bmap implementation

2018-05-10 Thread Darrick J. Wong
On Thu, May 10, 2018 at 08:42:50AM +0200, Christoph Hellwig wrote:
> On Wed, May 09, 2018 at 09:46:28AM -0700, Darrick J. Wong wrote:
> > On Wed, May 09, 2018 at 09:48:07AM +0200, Christoph Hellwig wrote:
> > > This adds a simple iomap-based implementation of the legacy ->bmap
> > > interface.  Note that we can't easily add checks for rt or reflink
> > > files, so these will have to remain in the callers.  This interface
> > > just needs to die..
> > 
> > You /can/ check these...
> > 
> > if (iomap->bdev != inode->i_sb->s_bdev)
> > return 0;
> > if (iomap->flags & IOMAP_F_SHARED)
> > return 0;
> 
> The latter only checks for a shared extent, not a file with possibly
> shared extents.  I'd rather keep the check for a file with possible
> shared extents.



> > > +static loff_t
> > > +iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
> > > + void *data, struct iomap *iomap)
> > > +{
> > > + sector_t *bno = data;
> > > +
> > > + if (iomap->type == IOMAP_MAPPED)
> > > + *bno = (iomap->addr + pos - iomap->offset) >> inode->i_blkbits;
> > 
> > Does this need to be careful w.r.t. overflow on systems where sector_t
> > is a 32-bit unsigned long?
> > 
> > Also, ioctl_fibmap() typecasts the returned sector_t to an int, which
> > also seems broken.  I agree the interface needs to die, but ioctls take
> > a long time to deprecate.
> 
> Not much we can do about the interface.

Yes, the interface is fubar, but if file /foo maps to block 8589934720
then do we return the truncated result 128?

--D
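
For reference, a stand-alone user-space sketch of the truncation in question
(the block number follows the example above; the variable names are made up):

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t block = 8589934720ULL;	/* 0x200000080, needs > 32 bits */
		int fibmap_value = (int)block;	/* what a cast to int keeps */

		/* prints: 8589934720 truncates to 128 */
		printf("%llu truncates to %d\n",
		       (unsigned long long)block, fibmap_value);
		return 0;
	}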


Re: [PATCH V4 1/7] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()

2018-05-10 Thread Bart Van Assche
On Sat, 2018-05-05 at 21:58 +0800, Ming Lei wrote:
> Turns out the current way can't drain timout completely because mod_timer()
> can be triggered in the work func, which can be just run inside the synced
> timeout work:
> 
> del_timer_sync(&q->timeout);
> cancel_work_sync(&q->timeout_work);
> 
> This patch introduces one flag of 'timeout_off' for fixing this issue, turns
> out this simple way does work.
> 
> Also blk_quiesce_timeout() and blk_unquiesce_timeout() are introduced for
> draining timeout, which is needed by NVMe.

Hello Ming,

The description of the above patch does not motivate sufficiently why you think
that this change is necessary. As you know it is already possible to wait until
timeout handling has finished by calling blk_mq_freeze_queue() +
blk_mq_unfreeze_queue(). An explanation is needed of why you think that calling
blk_mq_freeze_queue() + blk_mq_unfreeze_queue() is not sufficient and why you
chose the solution implemented in this patch.

Thanks,

Bart.
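
For readers without the patch in front of them, a rough sketch of the
re-arming race the changelog refers to; the 'timeout_off' gate below is
sketched from the description only (the field name and ordering are
assumptions, not the posted code):

/*
 * The naive drain is racy because the timeout work may call mod_timer()
 * on q->timeout again before it finishes:
 *
 *	del_timer_sync(&q->timeout);
 *	cancel_work_sync(&q->timeout_work);
 *
 * A 'timeout_off'-style flag makes further re-arming a no-op first:
 */
static void blk_quiesce_timeout_sketch(struct request_queue *q)
{
	q->timeout_off = true;			/* assumed field, checked before re-arming */
	del_timer_sync(&q->timeout);
	cancel_work_sync(&q->timeout_work);
	del_timer_sync(&q->timeout);		/* in case the work re-armed it meanwhile */
}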





Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Jerome Glisse
On Thu, May 10, 2018 at 04:29:44PM +0200, Christian König wrote:
> Am 10.05.2018 um 16:20 schrieb Stephen Bates:
> > Hi Jerome
> > 
> > > As it is tie to PASID this is done using IOMMU so looks for caller
> > > of amd_iommu_bind_pasid() or intel_svm_bind_mm() in GPU the existing
> > >   user is the AMD GPU driver see:
> > Ah thanks. This cleared things up for me. A quick search shows there are 
> > still no users of intel_svm_bind_mm() but I see the AMD version used in 
> > that GPU driver.
> 
> Just FYI: There is also another effort ongoing to give both the AMD, Intel
> as well as ARM IOMMUs a common interface so that drivers can use whatever
> the platform offers for SVM support.
> 
> > One thing I could not grok from the code how the GPU driver indicates which 
> > DMA events require ATS translations and which do not. I am assuming the 
> > driver implements someway of indicating that and its not just a global ON 
> > or OFF for all DMAs? The reason I ask is that I looking at if NVMe was to 
> > support ATS what would need to be added in the NVMe spec above and beyond 
> > what we have in PCI ATS to support efficient use of ATS (for example would 
> > we need a flag in the submission queue entries to indicate a particular 
> > IO's SGL/PRP should undergo ATS).
> 
> Oh, well that is complicated at best.
> 
> On very old hardware it wasn't a window, but instead you had to use special
> commands in your shader which indicated that you want to use an ATS
> transaction instead of a normal PCIe transaction for your read/write/atomic.
> 
> As Jerome explained on most hardware we have a window inside the internal
> GPU address space which when accessed issues a ATS transaction with a
> configurable PASID.
> 
> But on very newer hardware that window became a bit in the GPUVM page
> tables, so in theory we now can control it on a 4K granularity basis for the
> internal 48bit GPU address space.
> 

To complete this a 50 lines primer on GPU:

GPUVA - GPU virtual address
GPUPA - GPU physical address

GPUs run programs very much like CPUs do, except that a program will have
many thousands of threads running concurrently. There is a hierarchy of
groups for a given program, i.e. threads are grouped together; the lowest
hierarchy level has a group size of <= 64 threads on most GPUs.

Those programs (called shaders for graphics, think OpenGL or Vulkan, or
compute kernels for GPGPU, think OpenCL or CUDA) are submitted by userspace
against a given address space. In the "old" days (a couple of years back,
when dinosaurs were still roaming the earth) this address space was
specific to the GPU, and each user space program could create multiple
GPU address spaces. All memory operations done by the program were
against this address space. Hence every PCIe transaction is spawned from
a program + address space.

GPUs use a page table + window aperture (the window aperture is going away,
so you can focus on the page table) to translate a GPU virtual address into
a physical address. The physical address can point to GPU local memory,
to system memory, or to another PCIe device's memory (i.e. some PCIe BAR).

So every PCIe transaction is spawned through this process of GPUVA to
GPUPA; the GPUPA is then handled by the GPU MMU unit, which either spawns
a PCIe transaction for a non-local GPUPA or accesses local memory otherwise.
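
Purely as an illustration of that flow (every name below is made up; this is
not driver code), a compilable sketch of the GPUVA -> GPUPA routing decision:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

enum gpupa_target { TARGET_VRAM, TARGET_PCIE };	/* local memory vs. the bus */

struct gpupa {
	uint64_t addr;
	enum gpupa_target target;
};

/* stand-in for the per-address-space page table walk */
static struct gpupa gpu_translate(uint64_t gpuva)
{
	struct gpupa pa = { .addr = gpuva, .target = TARGET_VRAM };

	/* pretend everything above 1 GiB maps to the bus (system RAM or a peer BAR) */
	if (gpuva >= (1ULL << 30))
		pa.target = TARGET_PCIE;
	return pa;
}

static void gpu_access(uint64_t gpuva)
{
	struct gpupa pa = gpu_translate(gpuva);

	if (pa.target == TARGET_VRAM)
		printf("GPUVA %#" PRIx64 ": local VRAM access\n", pa.addr);
	else
		printf("GPUVA %#" PRIx64 ": PCIe transaction (system RAM or peer BAR)\n", pa.addr);
}

int main(void)
{
	gpu_access(0x1000);		/* resolves to local memory */
	gpu_access(1ULL << 32);		/* resolves to a bus access */
	return 0;
}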


So per se the kernel driver does not configure which transaction uses ATS
or peer to peer. A userspace program creates a GPU virtual address space
and binds objects into it. Such an object can be system memory or some
other PCIe device's memory, in which case we would want to do a peer to peer.


So you won't find any such logic in the kernel. What you find is the creation
of virtual address spaces and the binding of objects.


Above I talk about the old days. Nowadays we want the GPU virtual address
space to be exactly the same as the CPU virtual address space of the
process which initiates the GPU program. This is where we use PASID and
ATS. Here userspace creates a special "GPU context" that says that the GPU
virtual address space will be the same as that of the program that created
the GPU context. A process ID is then allocated and the mm_struct is bound
to this process ID in the IOMMU driver. All programs executed on the GPU
then use the process ID to identify the address space against which they
are running.


In all of the above I did not talk about the DMA engines which sit on the
"side" of the GPU to copy memory around. GPUs have multiple DMA engines with
different capabilities; some of those DMA engines use the same GPU address
space as described above, others use GPUPA directly.


Hope this helps in understanding the big picture. I oversimplify things and
the devil is in the details.

Cheers,
Jérôme


Re: [PATCH 7/7] psi: cgroup support

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 01:07:36PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:35PM -0400, Johannes Weiner wrote:
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -260,6 +260,18 @@ void psi_task_change(struct task_struct *task, u64 
> > now, int clear, int set)
> > task->psi_flags |= set;
> >  
> > psi_group_update(&psi_system, cpu, now, clear, set);
> > +
> > +#ifdef CONFIG_CGROUPS
> > +   cgroup = task->cgroups->dfl_cgrp;
> > +   while (cgroup && (parent = cgroup_parent(cgroup))) {
> > +   struct psi_group *group;
> > +
> > +   group = cgroup_psi(cgroup);
> > +   psi_group_update(group, cpu, now, clear, set);
> > +
> > +   cgroup = parent;
> > +   }
> > +#endif
> >  }
> 
> TJ fixed needing that for stats at some point, why can't you do the
> same?

The stats deltas are all additive, so it's okay to delay flushing them
up the tree until right before somebody is trying to look at them.

With this, though, we are tracking the time of an aggregate state composed
of child tasks, and that state might not be identical for you and all
your ancestors, so every time a task state changes we have to evaluate
and start/stop clocks on every level, because we cannot derive our
state from the state history of our child groups.

For example, say you have the following tree:

  root
 /
A
  /   \
 A1   A2
  running=1   running=1

I.e. there is a running task in A1 and one in A2.

root, A, A1, and A2 are all PSI_NONE as nothing is stalled.

Now the task in A2 enters a memstall.

  root
 /
A
  /   \
 A1   A2
  running=1   memstall=1

From the perspective of A2, the group is now fully blocked and starts
recording time in PSI_FULL.

From the perspective of A, it has a working group below it and a
stalled one, which would make it PSI_SOME, so it starts recording time
in PSI_SOME.

The root/system level likewise has to start the timer on PSI_SOME.

Now the task in A1 enters a memstall, and we have to propagate the
PSI_FULL state up A1 -> A -> root.

I'm not quite sure how we could make this lazy. Say we hadn't
propagated the state from A1 and A2 right away, and somebody is asking
about the averages for A. We could tell that A1 and A2 had been in
PSI_FULL recently, but we wouldn't know exactly whether their time in these
states fully overlapped (all PSI_FULL), overlapped partially (some
PSI_FULL and some PSI_SOME), or didn't overlap at all (PSI_SOME).


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Jerome Glisse
On Thu, May 10, 2018 at 02:16:25PM +, Stephen  Bates wrote:
> Hi Christian
> 
> > Why would a switch not identify that as a peer address? We use the PASID 
> >together with ATS to identify the address space which a transaction 
> >should use.
> 
> I think you are conflating two types of TLPs here. If the device supports ATS 
> then it will issue a TR TLP to obtain a translated address from the IOMMU. 
> This TR TLP will be addressed to the RP and so regardless of ACS it is going 
> up to the Root Port. When it gets the response it gets the physical address 
> and can use that with the TA bit set for the p2pdma. In the case of ATS 
> support we also have more control over ACS as we can disable it just for TA 
> addresses (as per 7.7.7.7.2 of the spec).
> 
>  >   If I'm not completely mistaken when you disable ACS it is perfectly 
>  >   possible that a bridge identifies a transaction as belonging to a peer 
>  >   address, which isn't what we want here.
>
> You are right here and I think this illustrates a problem for using the IOMMU 
> at all when P2PDMA devices do not support ATS. Let me explain:
> 
> If we want to do a P2PDMA and the DMA device does not support ATS then I 
> think we have to disable the IOMMU (something Mike suggested earlier). The 
> reason is that since ATS is not an option the EP must initiate the DMA using 
> the addresses passed down to it. If the IOMMU is on then this is an IOVA that 
> could (with some non-zero probability) point to an IO Memory address in the 
> same PCI domain. So if we disable ACS we are in trouble as we might MemWr to 
> the wrong place but if we enable ACS we lose much of the benefit of P2PDMA. 
> Disabling the IOMMU removes the IOVA risk and ironically also resolves the 
> IOMMU grouping issues.
> 
> So I think if we want to support performant P2PDMA for devices that don't 
> have ATS (and no NVMe SSDs today support ATS) then we have to disable the 
> IOMMU. I know this is problematic for AMDs use case so perhaps we also need 
> to consider a mode for P2PDMA for devices that DO support ATS where we can 
> enable the IOMMU (but in this case EPs without ATS cannot participate as 
> P2PDMA DMA iniators).
> 
> Make sense?
> 

Note that on GPUs we would not rely on ATS for peer to peer. Some parts
of the GPU (the DMA engines) do not necessarily support ATS, yet those
are the parts most likely to be used in peer to peer.

However there is a distinction in objective here that I believe is getting
lost. We (aka GPU people, aka the good guys ;)) do not want to do peer to
peer for performance reasons, i.e. we do not care about our transactions
going to the root complex and back down to the destination. At least in the
use case I am working on this is fine.

The reason is that GPUs are giving up on PCIe (see all the specialized links
like NVLink that are popping up in the GPU space). So for fast GPU
interconnect we have these new links. Yet for legacy and interoperability we
would like to do peer to peer with other devices like RDMA ... going through
the root complex would be fine from a performance point of view. Worst
case it is slower than the existing design where system memory is
used as a bounce buffer.

Also, the IOMMU isolation does matter a lot to us. Think of someone using
this peer to peer path to gain control of a server in the cloud.

Cheers,
Jérôme


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Christian König

Am 10.05.2018 um 16:20 schrieb Stephen Bates:

Hi Jerome


As it is tie to PASID this is done using IOMMU so looks for caller
of amd_iommu_bind_pasid() or intel_svm_bind_mm() in GPU the existing
  user is the AMD GPU driver see:
 
Ah thanks. This cleared things up for me. A quick search shows there are still no users of intel_svm_bind_mm() but I see the AMD version used in that GPU driver.


Just FYI: There is also another effort ongoing to give the AMD, Intel 
as well as ARM IOMMUs a common interface so that drivers can use 
whatever the platform offers for SVM support.



One thing I could not grok from the code how the GPU driver indicates which DMA 
events require ATS translations and which do not. I am assuming the driver 
implements someway of indicating that and its not just a global ON or OFF for 
all DMAs? The reason I ask is that I looking at if NVMe was to support ATS what 
would need to be added in the NVMe spec above and beyond what we have in PCI 
ATS to support efficient use of ATS (for example would we need a flag in the 
submission queue entries to indicate a particular IO's SGL/PRP should undergo 
ATS).


Oh, well that is complicated at best.

On very old hardware it wasn't a window, but instead you had to use 
special commands in your shader which indicated that you want to use an 
ATS transaction instead of a normal PCIe transaction for your 
read/write/atomic.


As Jerome explained on most hardware we have a window inside the 
internal GPU address space which when accessed issues a ATS transaction 
with a configurable PASID.


But on very newer hardware that window became a bit in the GPUVM page 
tables, so in theory we now can control it on a 4K granularity basis for 
the internal 48bit GPU address space.


Christian.



Cheers

Stephen





Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 12:21:00PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > +   local_irq_disable();
> > +   rq = this_rq();
> > +   raw_spin_lock(&rq->lock);
> > +   rq_pin_lock(rq, &rf);
> 
> Given that churn in sched.h, you seen rq_lock() and friends.
> 
> Either write this like:
> 
>   local_irq_disable();
>   rq = this_rq();
>   rq_lock(rq, );
> 
> Or instroduce "rq = this_rq_lock_irq()", which we could also use in
> do_sched_yield().

Sounds good, I'll add that.

> > +   update_rq_clock(rq);
> > +
> > +   current->flags |= PF_MEMSTALL;
> > +   psi_task_change(current, rq_clock(rq), 0, TSK_MEMSTALL);
> > +
> > +   rq_unpin_lock(rq, &rf);
> > +   raw_spin_unlock(&rq->lock);
> > +   local_irq_enable();
> 
> That's called rq_unlock_irq().

I'll use that. This code was first written against a kernel that
didn't have 8a8c69c32778 ("sched/core: Add rq->lock wrappers.") yet ;)
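
For reference, a sketch of the this_rq_lock_irq() helper Peter suggests
above; the shape below simply mirrors the existing rq_lock()/rq_unlock()
wrappers and is an assumption, not the committed code:

static inline struct rq *this_rq_lock_irq(struct rq_flags *rf)
	__acquires(rq->lock)
{
	struct rq *rq;

	local_irq_disable();
	rq = this_rq();
	rq_lock(rq, rf);

	return rq;
}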


Re: [PATCH] mtip32xx: Fix an error handling path in 'mtip_pci_probe()'

2018-05-10 Thread Jens Axboe
On 5/10/18 1:27 AM, Christophe JAILLET wrote:
> Branch to the right label in the error handling path in order to keep it
> logical.

Looks good, applied.

-- 
Jens Axboe



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Stephen Bates
Hi Jerome

> As it is tie to PASID this is done using IOMMU so looks for caller
> of amd_iommu_bind_pasid() or intel_svm_bind_mm() in GPU the existing
>  user is the AMD GPU driver see:

Ah thanks. This cleared things up for me. A quick search shows there are still 
no users of intel_svm_bind_mm() but I see the AMD version used in that GPU 
driver.

One thing I could not grok from the code is how the GPU driver indicates which 
DMA events require ATS translations and which do not. I am assuming the driver 
implements some way of indicating that and it's not just a global ON or OFF for 
all DMAs? The reason I ask is that I am looking at what, if NVMe were to support 
ATS, would need to be added in the NVMe spec above and beyond what we have in 
PCI ATS to support efficient use of ATS (for example, would we need a flag in 
the submission queue entries to indicate that a particular IO's SGL/PRP should 
undergo ATS).

Cheers

Stephen



Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 12:14:54PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 15750c222ca2..1658477466d5 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
[...]
> What's all this churn about?

The psi callbacks in kernel/sched/stats.h use these rq lock functions
from this file, but sched.h includes stats.h before those definitions.

I'll move this into a separate patch with a proper explanation.


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Stephen Bates
Hi Christian

> Why would a switch not identify that as a peer address? We use the PASID 
>together with ATS to identify the address space which a transaction 
>should use.

I think you are conflating two types of TLPs here. If the device supports ATS 
then it will issue a TR TLP to obtain a translated address from the IOMMU. This 
TR TLP will be addressed to the RP and so regardless of ACS it is going up to 
the Root Port. When it gets the response it gets the physical address and can 
use that with the TA bit set for the p2pdma. In the case of ATS support we also 
have more control over ACS as we can disable it just for TA addresses (as per 
7.7.7.7.2 of the spec).

 >   If I'm not completely mistaken when you disable ACS it is perfectly 
 >   possible that a bridge identifies a transaction as belonging to a peer 
 >   address, which isn't what we want here.
   
You are right here and I think this illustrates a problem for using the IOMMU 
at all when P2PDMA devices do not support ATS. Let me explain:

If we want to do a P2PDMA and the DMA device does not support ATS then I think 
we have to disable the IOMMU (something Mike suggested earlier). The reason is 
that since ATS is not an option the EP must initiate the DMA using the 
addresses passed down to it. If the IOMMU is on then this is an IOVA that could 
(with some non-zero probability) point to an IO Memory address in the same PCI 
domain. So if we disable ACS we are in trouble as we might MemWr to the wrong 
place but if we enable ACS we lose much of the benefit of P2PDMA. Disabling the 
IOMMU removes the IOVA risk and ironically also resolves the IOMMU grouping 
issues.

So I think if we want to support performant P2PDMA for devices that don't have 
ATS (and no NVMe SSDs today support ATS) then we have to disable the IOMMU. I 
know this is problematic for AMDs use case so perhaps we also need to consider 
a mode for P2PDMA for devices that DO support ATS where we can enable the IOMMU 
(but in this case EPs without ATS cannot participate as P2PDMA DMA initiators).

Make sense?

Stephen
 





Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 12:05:51PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > +   u64 some[NR_PSI_RESOURCES] = { 0, };
> > +   u64 full[NR_PSI_RESOURCES] = { 0, };
> 
> > +   some[r] /= max(nonidle_total, 1UL);
> > +   full[r] /= max(nonidle_total, 1UL);
> 
> That's a bare 64bit divide.. that typically failed to build on 32bit
> archs.

Ah yes, I'll switch that to do_div(). Thanks
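
A hedged sketch of what that switch could look like, assuming the non-idle
total fits in 32 bits (it is a jiffies count over a ~2s window); otherwise
div64_u64() would be the tool:

	u32 divisor = max(nonidle_total, 1UL);

	for (r = 0; r < NR_PSI_RESOURCES; r++) {
		do_div(some[r], divisor);	/* do_div() divides the u64 in place */
		do_div(full[r], divisor);
	}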


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 12:04:55PM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > +static void psi_clock(struct work_struct *work)
> > +{
> > +   u64 some[NR_PSI_RESOURCES] = { 0, };
> > +   u64 full[NR_PSI_RESOURCES] = { 0, };
> > +   unsigned long nonidle_total = 0;
> > +   unsigned long missed_periods;
> > +   struct delayed_work *dwork;
> > +   struct psi_group *group;
> > +   unsigned long expires;
> > +   int cpu;
> > +   int r;
> > +
> > +   dwork = to_delayed_work(work);
> > +   group = container_of(dwork, struct psi_group, clock_work);
> > +
> > +   /*
> > +* Calculate the sampling period. The clock might have been
> > +* stopped for a while.
> > +*/
> > +   expires = group->period_expires;
> > +   missed_periods = (jiffies - expires) / MY_LOAD_FREQ;
> > +   group->period_expires = expires + ((1 + missed_periods) * MY_LOAD_FREQ);
> > +
> > +   /*
> > +* Aggregate the per-cpu state into a global state. Each CPU
> > +* is weighted by its non-idle time in the sampling period.
> > +*/
> > +   for_each_online_cpu(cpu) {
> 
> Typically when using online CPU state, you also need hotplug notifiers
> to deal with changes in the online set.
> 
> You also typically need something like cpus_read_lock() around an
> iteration of online CPUs, to avoid the set changing while you're poking
> at them.
> 
> The lack for neither is evident or explained.

The per-cpu state we access is allocated for each possible CPU, so
that is safe (and state being all 0 is semantically sound, too). In a
race with onlining, we might miss some per-cpu samples, but would
catch them the next time. In a race with offlining, we may never
consider the final up to 2s state history of the disappearing CPU; we
could have an offlining callback to flush the state, but I'm not sure
this would be an actual problem in the real world since the error is
small (smallest averaging window is 5 sampling periods) and then would
age out quickly.

I can certainly add a comment explaining this at least.

> > +   struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu);
> > +   unsigned long nonidle;
> > +
> > +   nonidle = nsecs_to_jiffies(groupc->nonidle_time);
> > +   groupc->nonidle_time = 0;
> > +   nonidle_total += nonidle;
> > +
> > +   for (r = 0; r < NR_PSI_RESOURCES; r++) {
> > +   struct psi_resource *res = &groupc->res[r];
> > +
> > +   some[r] += (res->times[0] + res->times[1]) * nonidle;
> > +   full[r] += res->times[1] * nonidle;
> > +
> > +   /* It's racy, but we can tolerate some error */
> > +   res->times[0] = 0;
> > +   res->times[1] = 0;
> > +   }
> > +   }


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 11:59:38AM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> > new file mode 100644
> > index ..b22b0ffc729d
> > --- /dev/null
> > +++ b/include/linux/psi_types.h
> > @@ -0,0 +1,84 @@
> > +#ifndef _LINUX_PSI_TYPES_H
> > +#define _LINUX_PSI_TYPES_H
> > +
> > +#include 
> > +
> > +#ifdef CONFIG_PSI
> > +
> > +/* Tracked task states */
> > +enum psi_task_count {
> > +   NR_RUNNING,
> > +   NR_IOWAIT,
> > +   NR_MEMSTALL,
> > +   NR_PSI_TASK_COUNTS,
> > +};
> > +
> > +/* Task state bitmasks */
> > +#define TSK_RUNNING (1 << NR_RUNNING)
> > +#define TSK_IOWAIT (1 << NR_IOWAIT)
> > +#define TSK_MEMSTALL   (1 << NR_MEMSTALL)
> > +
> > +/* Resources that workloads could be stalled on */
> > +enum psi_res {
> > +   PSI_CPU,
> > +   PSI_MEM,
> > +   PSI_IO,
> > +   NR_PSI_RESOURCES,
> > +};
> > +
> > +/* Pressure states for a group of tasks */
> > +enum psi_state {
> > +   PSI_NONE,   /* No stalled tasks */
> > +   PSI_SOME,   /* Stalled tasks & working tasks */
> > +   PSI_FULL,   /* Stalled tasks & no working tasks */
> > +   NR_PSI_STATES,
> > +};
> > +
> > +struct psi_resource {
> > +   /* Current pressure state for this resource */
> > +   enum psi_state state;
> > +
> > +   /* Start of current state (cpu_clock) */
> > +   u64 state_start;
> > +
> > +   /* Time sampling buckets for pressure states (ns) */
> > +   u64 times[NR_PSI_STATES - 1];
> 
> Fails to explain why no FULL.

It's NONE that's excluded. I'll add a comment.

> > +struct psi_group_cpu {
> > +   /* States of the tasks belonging to this group */
> > +   unsigned int tasks[NR_PSI_TASK_COUNTS];
> > +
> 
> AFAICT there's a hole here, that would fit the @nonidle member. Which
> also avoids the later hole generated by it.

Good spot, I'll reshuffle this accordingly.

> > +   /* Per-resource pressure tracking in this group */
> > +   struct psi_resource res[NR_PSI_RESOURCES];
> > +
> > +   /* There are runnable or D-state tasks */
> > +   bool nonidle;
> 
> Mandatory complaint about using _Bool in composites goes here.

int it is.

Thanks


Re: [PATCH 5/7] sched: loadavg: make calc_load_n() public

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 11:49:06AM +0200, Peter Zijlstra wrote:
> On Mon, May 07, 2018 at 05:01:33PM -0400, Johannes Weiner wrote:
> > +static inline unsigned long
> > +fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
> > +{
> > +   unsigned long result = 1UL << frac_bits;
> > +
> > +   if (n) {
> > +   for (;;) {
> > +   if (n & 1) {
> > +   result *= x;
> > +   result += 1UL << (frac_bits - 1);
> > +   result >>= frac_bits;
> > +   }
> > +   n >>= 1;
> > +   if (!n)
> > +   break;
> > +   x *= x;
> > +   x += 1UL << (frac_bits - 1);
> > +   x >>= frac_bits;
> > +   }
> > +   }
> > +
> > +   return result;
> > +}
> 
> No real objection; but that does look a wee bit fat for an inline I
> suppose.

Fair enough, I'll put these back where I found them and make
calc_load_n() extern instead.
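
For context, the existing caller in kernel/sched/loadavg.c is roughly the
following, which is why only calc_load_n() needs to become extern while
fixed_power_int() can stay private to that file:

static unsigned long
calc_load_n(unsigned long load, unsigned long exp,
	    unsigned long active, unsigned int n)
{
	return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
}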


Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO

2018-05-10 Thread Johannes Weiner
On Wed, May 09, 2018 at 01:38:49PM +0200, Peter Zijlstra wrote:
> On Wed, May 09, 2018 at 12:46:18PM +0200, Peter Zijlstra wrote:
> > On Mon, May 07, 2018 at 05:01:34PM -0400, Johannes Weiner wrote:
> > 
> > > @@ -2038,6 +2038,7 @@ try_to_wake_up(struct task_struct *p, unsigned int 
> > > state, int wake_flags)
> > >   cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
> > >   if (task_cpu(p) != cpu) {
> > >   wake_flags |= WF_MIGRATED;
> > > + psi_ttwu_dequeue(p);
> > >   set_task_cpu(p, cpu);
> > >   }
> > >  
> > 
> > > +static inline void psi_ttwu_dequeue(struct task_struct *p)
> > > +{
> > > + /*
> > > +  * Is the task being migrated during a wakeup? Make sure to
> > > +  * deregister its sleep-persistent psi states from the old
> > > +  * queue, and let psi_enqueue() know it has to requeue.
> > > +  */
> > > + if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
> > > + struct rq_flags rf;
> > > + struct rq *rq;
> > > + int clear = 0;
> > > +
> > > + if (p->in_iowait)
> > > + clear |= TSK_IOWAIT;
> > > + if (p->flags & PF_MEMSTALL)
> > > + clear |= TSK_MEMSTALL;
> > > +
> > > + rq = __task_rq_lock(p, &rf);
> > > + update_rq_clock(rq);
> > > + psi_task_change(p, rq_clock(rq), clear, 0);
> > > + p->sched_psi_wake_requeue = 1;
> > > + __task_rq_unlock(rq, &rf);
> > > + }
> > > +}
> > 
> > Yeah, no... not happening.
> > 
> > We spend a lot of time to never touch the old rq->lock on wakeups. Mason
> > was the one pushing for that, so he should very well know this.
> > 
> > The one cross-cpu atomic (iowait) is already a problem (the whole iowait
> > accounting being useless makes it even worse), adding significant remote
> > prodding is just really bad.
> 
> Also, since all you need is the global number, I don't think you
> actually need any of this. See what we do for nr_uninterruptible.
> 
> In general I think you want to (re)read loadavg.c some more, and maybe
> reuse a bit more of that.

So there is a reason I'm tracking productivity states per-cpu and not
globally. Consider the following example periods on two CPUs:

CPU 0
Task 1: | EXECUTING  | memstalled |
Task 2: | runqueued  | EXECUTING  |

CPU 1
Task 3: | memstalled | EXECUTING  |

If we tracked only the global number of stalled tasks, similarly to
nr_uninterruptible, the number would be elevated throughout the whole
sampling period, giving a pressure value of 100% for "some stalled".
And, since there is always something executing, a "full stall" of 0%.

Now consider what happens when the Task 3 sequence is the other way
around:

CPU 0
Task 1: | EXECUTING  | memstalled |
Task 2: | runqueued  | EXECUTING  |

CPU 1
Task 3: | EXECUTING  | memstalled |

Here the number of stalled tasks is elevated only during half of the
sampling period, this time giving a pressure reading of 50% for "some"
(and again 0% for "full").

That's a different measurement, but in terms of workload progress, the
sequences are functionally equivalent. In both scenarios the same
amount of productive CPU cycles is spent advancing tasks 1, 2 and 3,
and the same amount of potentially productive CPU time is lost due to
the contention of memory. We really ought to read the same pressure.

So what I'm doing is calculating the productivity loss on each CPU in
a sampling period as if they were independent time slices. It doesn't
matter how you slice and dice the sequences within each one - if used
CPU time and lost CPU time have the same proportion, we have the same
pressure.

In both scenarios above, this method will give a pressure reading of
some=50% and full=25% of "normalized walltime", which is the time loss
the work would experience on a single CPU executing it serially.

To illustrate:

CPU X
1234
Task 1: | EXECUTING  | memstalled | sleeping   | sleeping   |
Task 2: | runqueued  | EXECUTING  | sleeping   | sleeping   |
Task 3: | sleeping   | sleeping   | EXECUTING  | memstalled |

You can clearly see the 50% of walltime in which *somebody* isn't
advancing (2 and 4), and the 25% of walltime in which *no* tasks are
(3). Same amount of work, same memory stalls, same pressure numbers.

Globalized state tracking would produce those numbers on the single
CPU (obviously), but once concurrency gets into the mix, it's
questionable what its results mean. It certainly isn't able to
reliably detect equivalent slowdowns of individual tasks ("some" is
all over the place), and in this example wasn't able to capture the
impact of contention on overall work completion ("full" is 0%).

* CPU 0: some = 50%, full =  0%
  CPU 1: some = 50%, full = 50%
avg: some = 50%, full = 25%
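
The same numbers fall out of a tiny standalone model of the aggregation in
the quoted psi_clock() (an illustration, not kernel code): each CPU's stall
time is weighted by that CPU's non-idle time, then normalized by the total
non-idle time.

#include <stdio.h>

int main(void)
{
	/* fractions of the sampling period, per CPU, from the example above */
	double some_time[2] = { 0.5, 0.5 };	/* somebody stalled */
	double full_time[2] = { 0.0, 0.5 };	/* everybody stalled */
	double nonidle[2]   = { 1.0, 1.0 };	/* both CPUs busy the whole period */
	double some = 0, full = 0, nonidle_total = 0;

	for (int cpu = 0; cpu < 2; cpu++) {
		some += some_time[cpu] * nonidle[cpu];
		full += full_time[cpu] * nonidle[cpu];
		nonidle_total += nonidle[cpu];
	}
	printf("some = %.0f%%, full = %.0f%%\n",
	       100 * some / nonidle_total, 100 * full / nonidle_total);
	/* prints: some = 50%, full = 25% */
	return 0;
}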


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Christian König

Am 09.05.2018 um 18:45 schrieb Logan Gunthorpe:


On 09/05/18 07:40 AM, Christian König wrote:

The key takeaway is that when any device has ATS enabled you can't
disable ACS without breaking it (even if you unplug and replug it).

I don't follow how you came to this conclusion...
  The ACS bits we'd be turning off are the ones that force TLPs addressed
at a peer to go to the RC. However, ATS translation packets will be
addressed to an untranslated address which a switch will not identify as
a peer address so it should send upstream regardless the state of the
ACS Req/Comp redirect bits.


Why would a switch not identify that as a peer address? We use the PASID 
together with ATS to identify the address space which a transaction 
should use.


If I'm not completely mistaken when you disable ACS it is perfectly 
possible that a bridge identifies a transaction as belonging to a peer 
address, which isn't what we want here.


Christian.



Once the translation comes back, the ATS endpoint should send the TLP to
the peer address with the AT packet type and it will be directed to the
peer provided the Direct Translated bit is set (or the redirect bits are
unset).

I can't see how turning off the Req/Comp redirect bits could break
anything except for the isolation they provide.

Logan




Re: [PATCH V4 0/7] nvme: pci: fix & improve timeout handling

2018-05-10 Thread Ming Lei
On Sat, May 05, 2018 at 07:11:33PM -0400, Laurence Oberman wrote:
> On Sat, 2018-05-05 at 21:58 +0800, Ming Lei wrote:
> > Hi,
> > 
> > The 1st patch introduces blk_quiesce_timeout() and
> > blk_unquiesce_timeout()
> > for NVMe, meantime fixes blk_sync_queue().
> > 
> > The 2nd patch covers timeout for admin commands for recovering
> > controller
> > for avoiding possible deadlock.
> > 
> > The 3rd and 4th patches avoid to wait_freeze on queues which aren't
> > frozen.
> > 
> > The last 4 patches fixes several races wrt. NVMe timeout handler, and
> > finally can make blktests block/011 passed. Meantime the NVMe PCI
> > timeout
> > mechanism becomes much more robust than before.
> > 
> > gitweb:
> > https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V4
> > 
> > V4:
> > - fix nvme_init_set_host_mem_cmd()
> > - use nested EH model, and run both nvme_dev_disable() and
> > resetting in one same context
> > 
> > V3:
> > - fix one new race related freezing in patch 4,
> > nvme_reset_work()
> > may hang forever without this patch
> > - rewrite the last 3 patches, and avoid to break
> > nvme_reset_ctrl*()
> > 
> > V2:
> > - fix draining timeout work, so no need to change return value
> > from
> > .timeout()
> > - fix race between nvme_start_freeze() and nvme_unfreeze()
> > - cover timeout for admin commands running in EH
> > 
> > Ming Lei (7):
> >   block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()
> >   nvme: pci: cover timeout for admin commands running in EH
> >   nvme: pci: only wait freezing if queue is frozen
> >   nvme: pci: freeze queue in nvme_dev_disable() in case of error
> > recovery
> >   nvme: core: introduce 'reset_lock' for sync reset state and reset
> > activities
> >   nvme: pci: prepare for supporting error recovery from resetting
> > context
> >   nvme: pci: support nested EH
> > 
> >  block/blk-core.c |  21 +++-
> >  block/blk-mq.c   |   9 ++
> >  block/blk-timeout.c  |   5 +-
> >  drivers/nvme/host/core.c |  46 ++-
> >  drivers/nvme/host/nvme.h |   5 +
> >  drivers/nvme/host/pci.c  | 304
> > ---
> >  include/linux/blkdev.h   |  13 ++
> >  7 files changed, 356 insertions(+), 47 deletions(-)
> > 
> > Cc: Jianchao Wang 
> > Cc: Christoph Hellwig 
> > Cc: Sagi Grimberg 
> > Cc: linux-n...@lists.infradead.org
> > Cc: Laurence Oberman 
> 
> Hello Ming
> 
> I have a two node NUMA system here running your kernel tree
> 4.17.0-rc3.ming.nvme+
> 
> [root@segstorage1 ~]# numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 3 5 6 8 11 13 14
> node 0 size: 63922 MB
> node 0 free: 61310 MB
> node 1 cpus: 1 2 4 7 9 10 12 15
> node 1 size: 64422 MB
> node 1 free: 62372 MB
> node distances:
> node   0   1 
>   0:  10  20 
>   1:  20  10 
> 
> I ran block/011
> 
> [root@segstorage1 blktests]# ./check block/011
> block/011 => nvme0n1 (disable PCI device while doing I/O)[failed]
> runtime...  106.936s
> --- tests/block/011.out   2018-05-05 18:01:14.268414752 -0400
> +++ results/nvme0n1/block/011.out.bad 2018-05-05
> 19:07:21.028634858 -0400
> @@ -1,2 +1,36 @@
>  Running block/011
> +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> IO_U_F_FLIGHT) == 0' failed.
> +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> IO_U_F_FLIGHT) == 0' failed.
> +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> IO_U_F_FLIGHT) == 0' failed.
> +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> IO_U_F_FLIGHT) == 0' failed.
> +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> IO_U_F_FLIGHT) == 0' failed.
> +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> IO_U_F_FLIGHT) == 0' failed.
> ...
> (Run 'diff -u tests/block/011.out
> results/nvme0n1/block/011.out.bad' to see the entire diff)
> 
> [ 1421.738551] run blktests block/011 at 2018-05-05 19:05:34
> [ 1452.676351] nvme nvme0: controller is down; will reset: CSTS=0x3,
> PCI_STATUS=0x10
> [ 1452.718221] nvme nvme0: controller is down; will reset: CSTS=0x3,
> PCI_STATUS=0x10
> [ 1452.718239] nvme nvme0: EH 0: before shutdown
> [ 1452.760890] nvme nvme0: controller is down; will reset: CSTS=0x3,
> PCI_STATUS=0x10
> [ 1452.760894] nvme nvme0: controller is down; will reset: CSTS=0x3,
> PCI_STATUS=0x10
> [ 1452.760897] nvme nvme0: controller is down; will reset: CSTS=0x3,
> PCI_STATUS=0x10
> [ 1452.760900] nvme nvme0: controller is down; will reset: CSTS=0x3,
> PCI_STATUS=0x10
> [ 1452.760903] nvme nvme0: controller is down; will reset: CSTS=0x3,
> PCI_STATUS=0x10
> [ 1452.760906] nvme nvme0: controller is down; will reset: CSTS=0x3,
> PCI_STATUS=0x10
> [ 1452.760909] nvme nvme0: controller is down; will reset: CSTS=0x3,
> PCI_STATUS=0x10
> [ 1452.760912] nvme nvme0: controller is down; will reset: CSTS=0x3,
> 

Re: [PATCH 01/33] block: add a lower-level bio_add_page interface

2018-05-10 Thread Ming Lei
On Wed, May 9, 2018 at 3:47 PM, Christoph Hellwig  wrote:
> For the upcoming removal of buffer heads in XFS we need to keep track of
> the number of outstanding writeback requests per page.  For this we need
> to know if bio_add_page merged a region with the previous bvec or not.
> Instead of adding additional arguments this refactors bio_add_page to
> be implemented using three lower level helpers which users like XFS can
> use directly if they care about the merge decisions.

The merge policy may be transparent to the fs, for example with multipage bvecs.

>
> Signed-off-by: Christoph Hellwig 
> ---
>  block/bio.c | 87 ++---
>  include/linux/bio.h |  9 +
>  2 files changed, 67 insertions(+), 29 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 53e0f0a1ed94..6ceba6adbf42 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -773,7 +773,7 @@ int bio_add_pc_page(struct request_queue *q, struct bio 
> *bio, struct page
> return 0;
> }
>
> -   if (bio->bi_vcnt >= bio->bi_max_vecs)
> +   if (bio_full(bio))
> return 0;
>
> /*
> @@ -820,6 +820,59 @@ int bio_add_pc_page(struct request_queue *q, struct bio 
> *bio, struct page
>  }
>  EXPORT_SYMBOL(bio_add_pc_page);
>
> +/**
> + * __bio_try_merge_page - try adding data to an existing bvec
> + * @bio: destination bio
> + * @page: page to add
> + * @len: length of the range to add
> + * @off: offset into @page
> + *
> + * Try adding the data described at @page + @offset to the last bvec of @bio.
> + * Return %true on success or %false on failure.  This can happen frequently
> + * for file systems with a block size smaller than the page size.
> + */
> +bool __bio_try_merge_page(struct bio *bio, struct page *page,
> +   unsigned int len, unsigned int off)
> +{
> +   if (bio->bi_vcnt > 0) {
> +   struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
> +
> +   if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) 
> {
> +   bv->bv_len += len;
> +   bio->bi_iter.bi_size += len;
> +   return true;
> +   }
> +   }
> +   return false;
> +}
> +EXPORT_SYMBOL_GPL(__bio_try_merge_page);
> +
> +/**
> + * __bio_add_page - add page to a bio in a new segment
> + * @bio: destination bio
> + * @page: page to add
> + * @len: length of the range to add
> + * @off: offset into @page
> + *
> + * Add the data at @page + @offset to @bio as a new bvec.  The caller must
> + * ensure that @bio has space for another bvec.
> + */
> +void __bio_add_page(struct bio *bio, struct page *page,
> +   unsigned int len, unsigned int off)
> +{
> +   struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt];
> +
> +   WARN_ON_ONCE(bio_full(bio));
> +
> +   bv->bv_page = page;
> +   bv->bv_offset = off;
> +   bv->bv_len = len;
> +
> +   bio->bi_iter.bi_size += len;
> +   bio->bi_vcnt++;
> +}
> +EXPORT_SYMBOL_GPL(__bio_add_page);

Given both __bio_try_merge_page() and __bio_add_page() are exported,
please add WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)), otherwise
both may be misused by external users.
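
A sketch of where such a check could sit (only the new lines are meaningful
here; the elided bodies are as in the patch). A cloned bio shares its
parent's bvec table, so neither helper should ever touch bi_io_vec on one:

bool __bio_try_merge_page(struct bio *bio, struct page *page,
		unsigned int len, unsigned int off)
{
	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
		return false;
	/* ... merge attempt as in the patch ... */
}

void __bio_add_page(struct bio *bio, struct page *page,
		unsigned int len, unsigned int off)
{
	WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
	WARN_ON_ONCE(bio_full(bio));
	/* ... add the new bvec as in the patch ... */
}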

-- 
Ming Lei


[PATCH] mtip32xx: Fix an error handling path in 'mtip_pci_probe()'

2018-05-10 Thread Christophe JAILLET
Branch to the right label in the error handling path in order to keep it
logical.

Signed-off-by: Christophe JAILLET 
---
 drivers/block/mtip32xx/mtip32xx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/mtip32xx/mtip32xx.c 
b/drivers/block/mtip32xx/mtip32xx.c
index 769c551e3d71..06257b872a7d 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -4273,7 +4273,7 @@ static int mtip_pci_probe(struct pci_dev *pdev,
if (!dd->isr_workq) {
dev_warn(&pdev->dev, "Can't create wq %d\n", dd->instance);
rv = -ENOMEM;
-   goto block_initialize_err;
+   goto setmask_err;
}
 
memset(cpu_list, 0, sizeof(cpu_list));
-- 
2.17.0



Re: [PATCH 11/33] iomap: add an iomap-based readpage and readpages implementation

2018-05-10 Thread Christoph Hellwig
On Thu, May 10, 2018 at 11:17:58AM +1000, Dave Chinner wrote:
> > +   if (ret <= 0)
> > +   break;
> > +   pos += ret;
> > +   length -= ret;
> > +   }
> > +
> > +   ret = 0;
> 
> This means the function will always return zero, regardless of
> whether iomap_apply returned an error or not.
> 
> > +   if (ctx.bio)
> > +   submit_bio(ctx.bio);
> > +   if (ctx.cur_page) {
> > +   if (!ctx.cur_page_in_bio)
> > +   unlock_page(ctx.cur_page);
> > +   put_page(ctx.cur_page);
> > +   }
> > +   WARN_ON_ONCE(ret && !list_empty(ctx.pages));
> 
> And this warning will never trigger. Was this intended behaviour?
> If it is, it needs a comment, because it looks wrong

Yes, the break should have been a goto out that jumps to just after the
ret = 0 assignment.
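
That is, roughly (a sketch of the intended control flow, actor arguments
abbreviated; the error in ret survives past the loop):

	while (length > 0) {
		ret = iomap_apply(inode, pos, length, 0, ops, &ctx,
				iomap_readpages_actor);
		if (ret <= 0)
			goto done;
		pos += ret;
		length -= ret;
	}
	ret = 0;
done:
	if (ctx.bio)
		submit_bio(ctx.bio);
	/* ... rest of the cleanup as in the patch, then return ret ... */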


Re: [PATCH 10/33] iomap: add an iomap-based bmap implementation

2018-05-10 Thread Christoph Hellwig
On Wed, May 09, 2018 at 09:46:28AM -0700, Darrick J. Wong wrote:
> On Wed, May 09, 2018 at 09:48:07AM +0200, Christoph Hellwig wrote:
> > This adds a simple iomap-based implementation of the legacy ->bmap
> > interface.  Note that we can't easily add checks for rt or reflink
> > files, so these will have to remain in the callers.  This interface
> > just needs to die..
> 
> You /can/ check these...
> 
> if (iomap->bdev != inode->i_sb->s_bdev)
>   return 0;
> if (iomap->flags & IOMAP_F_SHARED)
>   return 0;

The latter only checks for a shared extent, not a file with possibly
shared extents.  I'd rather keep the check for a file with possible
shared extents.

> > +static loff_t
> > +iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
> > +   void *data, struct iomap *iomap)
> > +{
> > +   sector_t *bno = data;
> > +
> > +   if (iomap->type == IOMAP_MAPPED)
> > +   *bno = (iomap->addr + pos - iomap->offset) >> inode->i_blkbits;
> 
> Does this need to be careful w.r.t. overflow on systems where sector_t
> is a 32-bit unsigned long?
> 
> Also, ioctl_fibmap() typecasts the returned sector_t to an int, which
> also seems broken.  I agree the interface needs to die, but ioctls take
> a long time to deprecate.

Not much we can do about the interface.
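
One hedged way to at least avoid silent truncation in the actor (a sketch,
not the posted patch): refuse to report a block number that FIBMAP's int
return cannot represent.

static loff_t
iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
		void *data, struct iomap *iomap)
{
	sector_t *bno = data;
	u64 blkno;

	if (iomap->type != IOMAP_MAPPED)
		return 0;

	blkno = (iomap->addr + pos - iomap->offset) >> inode->i_blkbits;
	if (blkno > INT_MAX)	/* ioctl_fibmap() stuffs the result into an int */
		return 0;

	*bno = blkno;
	return 0;
}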


Re: [PATCH 06/33] mm: give the 'ret' variable a better name __do_page_cache_readahead

2018-05-10 Thread Christoph Hellwig
On Wed, May 09, 2018 at 08:45:01AM -0700, Matthew Wilcox wrote:
> On Wed, May 09, 2018 at 09:48:03AM +0200, Christoph Hellwig wrote:
> > It counts the number of pages acted on, so name it nr_pages to make that
> > obvious.
> > 
> > Signed-off-by: Christoph Hellwig 
> 
> Yes!
> 
> Also, it can't return an error, so how about changing it to unsigned int?
> And deleting the error check from the caller?

I'll take a look at that.


Re: [PATCH 02/33] fs: factor out a __generic_write_end helper

2018-05-10 Thread Christoph Hellwig
On Wed, May 09, 2018 at 08:15:56AM -0700, Matthew Wilcox wrote:
> On Wed, May 09, 2018 at 09:47:59AM +0200, Christoph Hellwig wrote:
> >  }
> >  EXPORT_SYMBOL(generic_write_end);
> >  
> > +
> >  /*
> 
> Spurious?

Yes.


Re: [PATCH 01/33] block: add a lower-level bio_add_page interface

2018-05-10 Thread Christoph Hellwig
On Wed, May 09, 2018 at 08:12:43AM -0700, Matthew Wilcox wrote:
> (page, len, off) is a bit weird to me.  Usually we do (page, off, len).

That's what I'd usually do, too.  But this odd convention is what
bio_add_page uses, so I decided to stick to that instead of having two
different conventions in one family of functions.


Re: inconsistent lock state in fs_reclaim_acquire (2)

2018-05-10 Thread Dmitry Vyukov
On Thu, May 10, 2018 at 7:57 AM, syzbot
 wrote:
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:036db8bd9637 Merge branch 'for-4.17-fixes' of git://git.ke..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=146dab5b80
> kernel config:  https://syzkaller.appspot.com/x/.config?x=31f4b3733894ef79
> dashboard link: https://syzkaller.appspot.com/bug?extid=63833431f68be871fb95
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+63833431f68be871f...@syzkaller.appspotmail.com


Tetsuo thinks this is a block layer problem.
+block maintainers


> 
> WARNING: inconsistent lock state
> 4.17.0-rc4+ #39 Not tainted
> 
> inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
> syz-executor3/10560 [HC0[0]:SC1[1]:HE1:SE0] takes:
> EXT4-fs (sda1): re-mounted. Opts: minixdf,
> 1180cb72 (fs_reclaim){+.?.}, at: fs_reclaim_acquire.part.82+0x0/0x30
> mm/page_alloc.c:463
> {SOFTIRQ-ON-W} state was registered at:
>   lock_acquire+0x1dc/0x520 kernel/locking/lockdep.c:3920
>   fs_reclaim_acquire.part.82+0x24/0x30 mm/page_alloc.c:3739
>   fs_reclaim_acquire+0x14/0x20 mm/page_alloc.c:3740
>   slab_pre_alloc_hook mm/slab.h:418 [inline]
>   slab_alloc_node mm/slab.c:3299 [inline]
>   kmem_cache_alloc_node_trace+0x39/0x770 mm/slab.c:3661
>   kmalloc_node include/linux/slab.h:550 [inline]
>   kzalloc_node include/linux/slab.h:712 [inline]
>   alloc_worker+0xbd/0x2e0 kernel/workqueue.c:1704
>   init_rescuer.part.25+0x1f/0x190 kernel/workqueue.c:4000
>   init_rescuer kernel/workqueue.c:3997 [inline]
>   workqueue_init+0x51f/0x7d0 kernel/workqueue.c:5732
>   kernel_init_freeable+0x2ad/0x58e init/main.c:1115
>   kernel_init+0x11/0x1b3 init/main.c:1053
>   ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
> irq event stamp: 8616
> hardirqs last  enabled at (8616): []
> __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:160 [inline]
> hardirqs last  enabled at (8616): []
> _raw_spin_unlock_irqrestore+0x74/0xc0 kernel/locking/spinlock.c:184
> hardirqs last disabled at (8615): []
> __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:108 [inline]
> hardirqs last disabled at (8615): []
> _raw_spin_lock_irqsave+0x74/0xc0 kernel/locking/spinlock.c:152
> softirqs last  enabled at (7446): []
> __do_softirq+0x778/0xaf5 kernel/softirq.c:311
> softirqs last disabled at (8601): [] invoke_softirq
> kernel/softirq.c:365 [inline]
> softirqs last disabled at (8601): [] irq_exit+0x1d1/0x200
> kernel/softirq.c:405
>
> other info that might help us debug this:
>  Possible unsafe locking scenario:
>
>CPU0
>
>   lock(fs_reclaim);
>   <Interrupt>
> lock(fs_reclaim);
>
>  *** DEADLOCK ***
>
> 2 locks held by syz-executor3/10560:
>  #0: 99a775e3 (&(ptlock_ptr(page))->rlock#2){+.+.}, at: spin_lock
> include/linux/spinlock.h:310 [inline]
>  #0: 99a775e3 (&(ptlock_ptr(page))->rlock#2){+.+.}, at:
> zap_pte_range mm/memory.c:1298 [inline]
>  #0: 99a775e3 (&(ptlock_ptr(page))->rlock#2){+.+.}, at:
> zap_pmd_range mm/memory.c:1441 [inline]
>  #0: 99a775e3 (&(ptlock_ptr(page))->rlock#2){+.+.}, at:
> zap_pud_range mm/memory.c:1470 [inline]
>  #0: 99a775e3 (&(ptlock_ptr(page))->rlock#2){+.+.}, at:
> zap_p4d_range mm/memory.c:1491 [inline]
>  #0: 99a775e3 (&(ptlock_ptr(page))->rlock#2){+.+.}, at:
> unmap_page_range+0x99b/0x2200 mm/memory.c:1512
>  #1: bbb25b9d (rcu_callback){}, at: __rcu_reclaim
> kernel/rcu/rcu.h:168 [inline]
>  #1: bbb25b9d (rcu_callback){}, at: rcu_do_batch
> kernel/rcu/tree.c:2675 [inline]
>  #1: bbb25b9d (rcu_callback){}, at: invoke_rcu_callbacks
> kernel/rcu/tree.c:2930 [inline]
>  #1: bbb25b9d (rcu_callback){}, at: __rcu_process_callbacks
> kernel/rcu/tree.c:2897 [inline]
>  #1: bbb25b9d (rcu_callback){}, at:
> rcu_process_callbacks+0xa2c/0x15f0 kernel/rcu/tree.c:2914
>
> stack backtrace:
> CPU: 0 PID: 10560 Comm: syz-executor3 Not tainted 4.17.0-rc4+ #39
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  <IRQ>
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>  print_usage_bug.cold.59+0x320/0x41a kernel/locking/lockdep.c:2542
>  valid_state kernel/locking/lockdep.c:2555 [inline]
>  mark_lock_irq kernel/locking/lockdep.c:2749 [inline]
>  mark_lock+0x1034/0x19e0 kernel/locking/lockdep.c:3147
>  mark_irqflags kernel/locking/lockdep.c:3025 [inline]
>  __lock_acquire+0x1622/0x5140 kernel/locking/lockdep.c:3388
>  lock_acquire+0x1dc/0x520 kernel/locking/lockdep.c:3920
>  fs_reclaim_acquire.part.82+0x24/0x30 mm/page_alloc.c:3739
>  fs_reclaim_acquire+0x14/0x20