Re: [PATCH 10/33] iomap: add an iomap-based bmap implementation

2018-05-11 Thread Darrick J. Wong
On Fri, May 11, 2018 at 08:25:27AM +0200, Christoph Hellwig wrote:
> On Thu, May 10, 2018 at 08:08:38AM -0700, Darrick J. Wong wrote:
> > > > > + sector_t *bno = data;
> > > > > +
> > > > > + if (iomap->type == IOMAP_MAPPED)
> > > > > + *bno = (iomap->addr + pos - iomap->offset) >> inode->i_blkbits;
> > > > 
> > > > Does this need to be careful w.r.t. overflow on systems where sector_t
> > > > is a 32-bit unsigned long?
> > > > 
> > > > Also, ioctl_fibmap() typecasts the returned sector_t to an int, which
> > > > also seems broken.  I agree the interface needs to die, but ioctls take
> > > > a long time to deprecate.
> > > 
> > > Not much we can do about the interface.
> > 
> > Yes, the interface is fubar, but if file /foo maps to block 8589934720
> > then do we return the truncated result 128?
> 
> Then we'll get a corrupt result.  What do you think we could do here
> either in the old or new code?

I think the only thing we /can/ do is figure out if we'd be truncating
the result, dump a warning to the kernel log, and return 0, because we
don't want smartypants FIBMAP callers to be using crap block pointers.

Something like this for the bmap implementation...

	uint64_t mapping = iomap->addr;

#ifdef CONFIG_LBDAF
	if (mapping > ULONG_MAX) {
		/* Do not truncate results. */
		return 0;
	}
#endif

...and in the bmap ioctl...

	sector_t mapping = ...;

	if (mapping > INT_MAX) {
		WARN(1, "would truncate bmap result, go fix your stupid program");
		return 0;
	}
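
Folding the two fragments above into one hedged helper (the helper name
and the exact cut-off are illustrative, not from the patch):

/*
 * Sketch only: refuse to hand a truncated block number back through
 * the legacy ->bmap/FIBMAP path.  FIBMAP ultimately stores the result
 * in an int, so anything above INT_MAX would be silently corrupted.
 */
static sector_t bmap_clamp_result(u64 blkno)
{
	if (blkno > INT_MAX) {
		WARN_ONCE(1, "FIBMAP result would be truncated\n");
		return 0;	/* callers treat 0 as "no mapping" */
	}
	return blkno;
}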

--D



Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-11 Thread Ming Lei
Hi Keith,

On Fri, May 11, 2018 at 02:50:28PM -0600, Keith Busch wrote:
> On Fri, May 11, 2018 at 08:29:24PM +0800, Ming Lei wrote:
> > Hi,
> > 
> > The 1st patch introduces blk_quiesce_timeout() and blk_unquiesce_timeout()
> > for NVMe, meantime fixes blk_sync_queue().
> > 
> > The 2nd patch covers timeout for admin commands for recovering controller
> > for avoiding possible deadlock.
> > 
> > The 3rd and 4th patches avoid wait_freeze on queues which aren't frozen.
> > 
> > The last 5 patches fix several races wrt. the NVMe timeout handler, and
> > finally make blktests block/011 pass. Meantime the NVMe PCI timeout
> > mechanism becomes much more robust than before.
> > 
> > gitweb:
> > https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V5
> 
> Hi Ming,
> 
> First test with simulated broken links is unsuccessful. I'm getting
> stuck here:
> 
> [<0>] blk_mq_freeze_queue_wait+0x46/0xb0
> [<0>] blk_cleanup_queue+0x78/0x170
> [<0>] nvme_ns_remove+0x137/0x1a0 [nvme_core]
> [<0>] nvme_remove_namespaces+0x86/0xc0 [nvme_core]
> [<0>] nvme_remove+0x6b/0x130 [nvme]
> [<0>] pci_device_remove+0x36/0xb0
> [<0>] device_release_driver_internal+0x157/0x220
> [<0>] nvme_remove_dead_ctrl_work+0x29/0x40 [nvme]
> [<0>] process_one_work+0x170/0x350
> [<0>] worker_thread+0x2e/0x380
> [<0>] kthread+0x111/0x130
> [<0>] ret_from_fork+0x1f/0x30
> 
> 
> Here's the last parts of the kernel logs capturing the failure:
> 
> [  760.679105] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679116] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679120] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679124] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679127] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679131] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679135] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679138] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679141] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679144] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679148] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679151] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679155] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679158] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679161] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679164] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679169] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679172] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679176] nvme nvme1: EH 0: before shutdown
> [  760.679177] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679181] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679185] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679189] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679192] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679196] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679199] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679202] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679240] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679243] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679246] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> 
> ( above repeats a few more hundred times )
> 
> [  760.679960] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.701468] nvme nvme1: EH 0: after shutdown, top eh: 1
> [  760.727099] pci_raw_set_power_state: 62 callbacks suppressed
> [  760.727103] nvme :86:00.0: Refused to change power state, currently in 
> D3

EH may not cover this kind of failure, so it fails on the 1st try.

> [  760.727483] nvme nvme1: EH 0: state 4, eh_done -19, top eh 1
> [  760.727485] nvme nvme1: EH 0: after recovery -19
> [  760.727488] nvme nvme1: EH: fail controller

The above issue (hang in nvme_remove())

Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-11 Thread Logan Gunthorpe

On 5/11/2018 4:24 PM, Stephen Bates wrote:
> All
>
>> Alex (or anyone else) can you point to where IOVA addresses are generated?
>
> A case of RTFM perhaps (though a pointer to the code would still be
> appreciated).
>
> https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt
>
> Some exceptions to IOVA
> -----------------------
> Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff).
> The same is true for peer to peer transactions. Hence we reserve the
> address from PCI MMIO ranges so they are not allocated for IOVA addresses.


Hmm, except I'm not sure how to interpret that. It sounds like there
can't be an IOVA address that overlaps with the PCI MMIO range, which is
good and what I'd expect.

But for peer to peer they say they don't translate the address, which
implies to me that the intention is for a peer to peer address not to be
mapped in the same way using the dma_map interface (of course, if you
were using ATS you'd want this for sure). Unless the existing dma_map
calls notice a PCI MMIO address and handle it differently, but I don't
see how.


Logan



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-11 Thread Stephen Bates
All

> Alex (or anyone else) can you point to where IOVA addresses are generated?

A case of RTFM perhaps (though a pointer to the code would still be 
appreciated).

https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt

Some exceptions to IOVA
-----------------------
Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff).
The same is true for peer to peer transactions. Hence we reserve the
address from PCI MMIO ranges so they are not allocated for IOVA addresses.

Cheers

Stephen


Re: [PATCH v8] blk-mq: Rework blk-mq timeout handling again

2018-05-11 Thread Bart Van Assche
On Fri, 2018-05-11 at 15:21 -0600, Jens Axboe wrote:
> On 5/11/18 3:08 PM, Bart Van Assche wrote:
> blk_mq_rq_update_aborted_gstate(rq, gstate);
> > +   union blk_deadline_and_state das = READ_ONCE(rq->das);
> > +   unsigned long now = jiffies;
> > +   int32_t diff_jiffies = das.deadline - now;
> > +   int32_t diff_next = das.deadline - data->next;
> > +   enum mq_rq_state rq_state = das.state;
> 
> Why the int32_t types? Just use an int/unsigned int?

The code that uses the diff_jiffies and diff_next variables can only work
correctly if the subtraction results are truncated to 32 bits. Hence the
use of the int32_t data types to make this explicit. I think using int32_t
instead of int makes this code easier to read.
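
As a rough stand-alone illustration (not part of the patch) of the
wrap-around behavior that the truncated 32-bit subtraction relies on:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* hypothetical values: deadline stored just after a 32-bit wrap,
	 * current jiffies just before it */
	uint32_t deadline = 0x00000010;
	uint32_t now = 0xfffffff0;

	/* truncating the subtraction to a signed 32-bit value makes the
	 * comparison wrap-safe: a positive diff means "not yet expired" */
	int32_t diff = deadline - now;

	printf("diff = %d -> %s\n", diff, diff > 0 ? "not expired" : "expired");
	return 0;
}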

> > @@ -103,37 +97,119 @@ extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
> >  
> >  void blk_mq_release(struct request_queue *q);
> >  
> > +#if defined(CONFIG_ARC) || \
> > +defined(CONFIG_ARM) && defined(CONFIG_CPU_THUMBONLY) || \
> > +defined(CONFIG_C6x) || defined(CONFIG_H8300) || \
> > +defined(CONFIG_HEXAGON) || defined(CONFIG_MICROBLAZE) || \
> > +defined(CONFIG_MIPS) && !defined(CONFIG_64BIT) || \
> > +defined(CONFIG_NDS32) || defined(CONFIG_NIOS) || \
> > +defined(CONFIG_OPENRISC) || defined(CONFIG_PPC32) || \
> > +defined(CONFIG_SUPERH) || defined(CONFIG_UML) || defined(CONFIG_UNICORE32)
> > +#undef CONFIG_ARCH_HAS_CMPXCHG64
> > +#else
> > +#define CONFIG_ARCH_HAS_CMPXCHG64
> > +#endif
> 
> I think the encapsulation of blk_mq_set_rq_state() in terms of lock vs
> cmpxchg64 is fine, but this does not belong in the block layer.
> 
> > +   /* Only used if CONFIG_ARCH_HAS_CMPXCHG64 is not defined. */
> > +   spinlock_t das_lock;
> 
> Let's put it under an if defined() then.

OK, I will move CONFIG_ARCH_HAS_CMPXCHG64 into arch/*/Kconfig and I will
surround spinlock_t das_lock with #if defined(...) / #endif.

Thanks,

Bart.




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-11 Thread Stephen Bates
>I find this hard to believe. There's always the possibility that some 
>part of the system doesn't support ACS so if the PCI bus addresses and 
>IOVA overlap there's a good chance that P2P and ATS won't work at all on 
>some hardware.

I tend to agree but this comes down to how IOVA addresses are generated in the 
kernel. Alex (or anyone else) can you point to where IOVA addresses are 
generated? As Logan stated earlier, p2pdma bypasses this and programs the PCI 
bus address directly but other IO going to the same PCI EP may flow through the 
IOMMU and be programmed with IOVA rather than PCI bus addresses.

> I prefer 
>the option to disable the ACS bit on boot and let the existing code put 
>the devices into their own IOMMU group (as it should already do to 
>support hardware that doesn't have ACS support).

+1

Stephen




Re: [PATCH v8] blk-mq: Rework blk-mq timeout handling again

2018-05-11 Thread Jens Axboe
On 5/11/18 3:08 PM, Bart Van Assche wrote:
blk_mq_rq_update_aborted_gstate(rq, gstate);
> + union blk_deadline_and_state das = READ_ONCE(rq->das);
> + unsigned long now = jiffies;
> + int32_t diff_jiffies = das.deadline - now;
> + int32_t diff_next = das.deadline - data->next;
> + enum mq_rq_state rq_state = das.state;

Why the int32_t types? Just use an int/unsigned int?

> @@ -103,37 +97,119 @@ extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
>  
>  void blk_mq_release(struct request_queue *q);
>  
> +#if defined(CONFIG_ARC) || \
> +defined(CONFIG_ARM) && defined(CONFIG_CPU_THUMBONLY) || \
> +defined(CONFIG_C6x) || defined(CONFIG_H8300) || \
> +defined(CONFIG_HEXAGON) || defined(CONFIG_MICROBLAZE) || \
> +defined(CONFIG_MIPS) && !defined(CONFIG_64BIT) || \
> +defined(CONFIG_NDS32) || defined(CONFIG_NIOS) || \
> +defined(CONFIG_OPENRISC) || defined(CONFIG_PPC32) || \
> +defined(CONFIG_SUPERH) || defined(CONFIG_UML) || defined(CONFIG_UNICORE32)
> +#undef CONFIG_ARCH_HAS_CMPXCHG64
> +#else
> +#define CONFIG_ARCH_HAS_CMPXCHG64
> +#endif

I think the encapsulation of blk_mq_set_rq_state() in terms of lock vs
cmpxchg64 is fine, but this does not belong in the block layer.

> + /* Only used if CONFIG_ARCH_HAS_CMPXCHG64 is not defined. */
> + spinlock_t das_lock;

Let's put it under an if defined() then.

-- 
Jens Axboe



Re: [PATCH 00/10] Misc block layer patches for bcachefs

2018-05-11 Thread Jens Axboe
On 5/8/18 7:33 PM, Kent Overstreet wrote:
>  - Add separately allocated mempools, biosets: bcachefs uses both all over the
>place
> 
>  - Bit of utility code - bio_copy_data_iter(), zero_fill_bio_iter()
> 
>  - bio_list_copy_data(), the bi_next check - defensiveness because of a bug I
>had fun chasing down at one point
> 
>  - add some exports, because bcachefs does dio its own way
>  - show whether fua is supported in sysfs, because I don't know of anything 
> that
>exports whether the _block layer_ specifically thinks fua is supported.

Looked over the series, and looks like both good cleanups and optimizations.
If we can get the mempool patch sorted, I can apply this for 4.18.

-- 
Jens Axboe



Re: [PATCH 01/10] mempool: Add mempool_init()/mempool_exit()

2018-05-11 Thread Jens Axboe
On 5/8/18 7:33 PM, Kent Overstreet wrote:
> Allows mempools to be embedded in other structs, getting rid of a
> pointer indirection from allocation fastpaths.
> 
> mempool_exit() is safe to call on an uninitialized but zeroed mempool.
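
For readers unfamiliar with the new API, a hedged sketch (not taken from
the patch) of what embedding a mempool looks like; the kmalloc-backed
init helper used below is an assumption based on this series:

#include <linux/mempool.h>
#include <linux/slab.h>

struct my_dev {
	mempool_t rq_pool;	/* embedded, no pointer indirection */
};

static int my_dev_init(struct my_dev *d)
{
	/* back the pool with 64-byte kmalloc allocations, min 16 elements */
	return mempool_init_kmalloc_pool(&d->rq_pool, 16, 64);
}

static void my_dev_exit(struct my_dev *d)
{
	/* per the commit message: safe even on a zeroed, never-initialized pool */
	mempool_exit(&d->rq_pool);
}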

Looks fine to me. I'd like to carry it through the block branch, as some
of the following cleanups depend on it. Kent, can you post a v2 with
the destroy -> exit typo fixed up?

But would be nice to have someone sign off on it...

-- 
Jens Axboe



[PATCH v8] blk-mq: Rework blk-mq timeout handling again

2018-05-11 Thread Bart Van Assche
Recently the blk-mq timeout handling code was reworked. See also Tejun
Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).

This patch reworks the blk-mq timeout handling code again. The timeout
handling code is simplified by introducing a state machine per request.
This change prevents the blk-mq timeout handling code from ignoring
completions that occur after blk_mq_check_expired() has been called and
before blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block
driver timeout handler always returns BLK_EH_RESET_TIMER, the result
would be that the request never terminates.

Fix this race as follows:
- Replace the __deadline member of struct request by a new member
  called das that contains the generation number, state and deadline.
  Only 32 bits are used for the deadline field such that all three
  fields occupy only 64 bits. This change reduces the maximum supported
  request timeout value from (2**63/HZ) to (2**31/HZ).
- Remove all request member variables that became superfluous due to
  this change: gstate, gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to
  this patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.

Notes:
- A spinlock is used to protect atomic changes of rq->das on those
  architectures that do not provide a cmpxchg64() implementation.
- Atomic instructions are only used to update the request state if
  a concurrent request state change could be in progress (see the sketch
  after these notes).
- blk_add_timer() has been split into two functions - one for the
  legacy block layer and one for blk-mq.
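
A rough sketch of what the two notes above about atomic state updates
describe (field names follow the blk_deadline_and_state union quoted
elsewhere in this thread; the code in the actual patch may differ):

static bool blk_mq_change_rq_state(struct request *rq,
				   enum mq_rq_state old_state,
				   enum mq_rq_state new_state)
{
	union blk_deadline_and_state old_val = READ_ONCE(rq->das);
	union blk_deadline_and_state new_val = old_val;

	if (old_val.state != old_state)
		return false;

	new_val.state = new_state;
	new_val.generation++;

	/*
	 * Succeeds only if nobody else touched rq->das in the meantime.
	 * On architectures without cmpxchg64() the spinlock fallback
	 * mentioned above is used instead.
	 */
	return cmpxchg64(&rq->das.das, old_val.das, new_val.das) == old_val.das;
}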

Signed-off-by: Bart Van Assche 
Cc: Tejun Heo 
Cc: Christoph Hellwig 
Cc: Jianchao Wang 
Cc: Ming Lei 
Cc: Sebastian Ott 
Cc: Sagi Grimberg 
Cc: Israel Rukshin ,
Cc: Max Gurtovoy 
---

Changes compared to v7:
- Fixed the generation number mechanism. Note: with this patch applied the
  behavior of the block layer does not depend on the generation number.
- Added more 32-bit architectures to the list of architectures on which
  cmpxchg64() should not be used.

Changes compared to v6:
- Used a union instead of bit manipulations to store multiple values into
  a single 64-bit field.
- Reduced the size of the timeout field from 64 to 32 bits.
- Made sure that the block layer still builds with this patch applied
  for the sh and mips architectures.
- Fixed two sparse warnings that were introduced by this patch in the
  WRITE_ONCE() calls.

Changes compared to v5:
- Restored the synchronize_rcu() call between marking a request for timeout
  handling and the actual timeout handling to avoid that timeout handling
  starts while .queue_rq() is still in progress if the timeout is very short.
- Only use cmpxchg() if another context could attempt to change the request
  state concurrently. Use WRITE_ONCE() otherwise.

Changes compared to v4:
- Addressed multiple review comments from Christoph. The most important are
  that atomic_long_cmpxchg() has been changed into cmpxchg() and also that
  there is now a nice and clean split between the legacy and blk-mq versions
  of blk_add_timer().
- Changed the patch name and modified the patch description because there is
  disagreement about whether or not the v4.16 blk-mq core can complete a
  single request twice. Kept the "Cc: stable" tag because of
  https://bugzilla.kernel.org/show_bug.cgi?id=199077.

Changes compared to v3 (see also 
https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html):
- Removed the spinlock again that was introduced to protect the request state.
  v4 uses atomic_long_cmpxchg() instead.
- Split __deadline into two variables - one for the legacy block layer and one
  for blk-mq.

Changes compared to v2 
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html):
- Rebased and retested on top of kernel v4.16.

Changes compared to v1 
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html):
- Removed the gstate and aborted_gstate members of struct request and used
  the __deadline member to encode both the generation and state information.

 block/blk-core.c   |   6 --
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c | 178 +
 block/blk-mq.h | 130 
 block/blk-timeout.c|  89 ++---
 block/blk.h|  21 +++---
 include/linux/blkdev.h |  44 ++--
 7 files changed, 233 insertions(+), 236 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1418a1ccd80d..223f650712f4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request 
*rq)
rq->internal_tag = -1;
r

Re: make a few block drivers highmem safe

2018-05-11 Thread Jens Axboe
On 5/9/18 7:59 AM, Christoph Hellwig wrote:
> Hi all,
> 
> this series converts a few random block drivers to be highmem safe,
> in preparation of eventually getting rid of the block layer bounce
> buffering support.

Applied, thanks.

-- 
Jens Axboe



Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-11 Thread Keith Busch
On Fri, May 11, 2018 at 08:29:24PM +0800, Ming Lei wrote:
> Hi,
> 
> The 1st patch introduces blk_quiesce_timeout() and blk_unquiesce_timeout()
> for NVMe, meantime fixes blk_sync_queue().
> 
> The 2nd patch covers timeout for admin commands for recovering controller
> for avoiding possible deadlock.
> 
> The 3rd and 4th patches avoid wait_freeze on queues which aren't frozen.
> 
> The last 5 patches fix several races wrt. the NVMe timeout handler, and
> finally make blktests block/011 pass. Meantime the NVMe PCI timeout
> mechanism becomes much more robust than before.
> 
> gitweb:
>   https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V5

Hi Ming,

First test with simulated broken links is unsuccessful. I'm getting
stuck here:

[<0>] blk_mq_freeze_queue_wait+0x46/0xb0
[<0>] blk_cleanup_queue+0x78/0x170
[<0>] nvme_ns_remove+0x137/0x1a0 [nvme_core]
[<0>] nvme_remove_namespaces+0x86/0xc0 [nvme_core]
[<0>] nvme_remove+0x6b/0x130 [nvme]
[<0>] pci_device_remove+0x36/0xb0
[<0>] device_release_driver_internal+0x157/0x220
[<0>] nvme_remove_dead_ctrl_work+0x29/0x40 [nvme]
[<0>] process_one_work+0x170/0x350
[<0>] worker_thread+0x2e/0x380
[<0>] kthread+0x111/0x130
[<0>] ret_from_fork+0x1f/0x30


Here's the last parts of the kernel logs capturing the failure:

[  760.679105] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679116] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679120] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679124] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679127] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679131] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679135] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679138] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679141] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679144] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679148] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679151] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679155] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679158] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679161] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679164] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679169] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679172] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679176] nvme nvme1: EH 0: before shutdown
[  760.679177] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679181] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679185] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679189] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679192] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679196] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679199] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679202] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679240] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679243] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679246] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x

( above repeats a few more hundred times )

[  760.679960] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.701468] nvme nvme1: EH 0: after shutdown, top eh: 1
[  760.727099] pci_raw_set_power_state: 62 callbacks suppressed
[  760.727103] nvme :86:00.0: Refused to change power state, currently in D3
[  760.727483] nvme nvme1: EH 0: state 4, eh_done -19, top eh 1
[  760.727485] nvme nvme1: EH 0: after recovery -19
[  760.727488] nvme nvme1: EH: fail controller
[  760.727491] nvme nvme1: Removing after probe failure status: 0
[  760.735138] nvme1n1: detected capacity change from 1200243695616 to 0



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-11 Thread Logan Gunthorpe

On 5/11/2018 2:52 AM, Christian König wrote:
> This only works when the IOVA and the PCI bus addresses never overlap.
> I'm not sure how the IOVA allocation works but I don't think we
> guarantee that on Linux.


I find this hard to believe. There's always the possibility that some 
part of the system doesn't support ACS so if the PCI bus addresses and 
IOVA overlap there's a good chance that P2P and ATS won't work at all on 
some hardware.



> If we really want to enable P2P without ATS and IOMMU enabled I think we
> should probably approach it like this:
>
> a) Make double sure that IOVA in an IOMMU group never overlap with PCI
> BARs in that group.
>
> b) Add configuration options to put a whole PCI branch of devices (e.g.
> a bridge) into a single IOMMU group.
>
> c) Add a configuration option to disable the ACS bit on bridges in the
> same IOMMU group.


I think a configuration option to manage IOMMU groups as you suggest 
would be a very complex interface and difficult to implement. I prefer 
the option to disable the ACS bit on boot and let the existing code put 
the devices into their own IOMMU group (as it should already do to 
support hardware that doesn't have ACS support).


Logan


Re: [PATCH] blk-mq: Rework blk-mq timeout handling again

2018-05-11 Thread Bart Van Assche
On Fri, 2018-05-11 at 14:35 +0200, Christoph Hellwig wrote:
> > It should be due to union blk_deadline_and_state. 
> > +union blk_deadline_and_state {
> > +   struct {
> > +   uint32_t generation:30;
> > +   uint32_t state:2;
> > +   uint32_t deadline;
> > +   };
> > +   unsigned long legacy_deadline;
> > +   uint64_t das;
> > +};
> 
> Yikes.  Or we just move it into a separate field.  This patch already
> shrinks struct request a lot, so I'd rather do that to keep it simple.

Hello Christoph,

Are you perhaps referring to the legacy_deadline field? Have you noticed that
Jianchao used a legacy block layer function in blk-mq code and that that is
why a wrong value for the deadline appeared in the trace output?

Thanks,

Bart.





Re: [PATCH] blk-mq: Rework blk-mq timeout handling again

2018-05-11 Thread Bart Van Assche
On Fri, 2018-05-11 at 20:06 +0800, jianchao.wang wrote:
> Hi bart
> 
> I add debug log in blk_mq_add_timer as following
> 
> void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
>  enum mq_rq_state new)
> {
>struct request_queue *q = req->q;
> 
>if (!req->timeout)
>req->timeout = q->rq_timeout;
>if (!blk_mq_rq_set_deadline(req, jiffies + req->timeout, old, new))
>WARN_ON_ONCE(true);
> 
>trace_printk("jiffies %lx to %x ldl %lx gen %u dl %x\n",
>jiffies,
>req->timeout,
>blk_rq_deadline(req),
>req->das.generation,
>req->das.deadline);
>  
>return __blk_add_timer(req);
>  }
> 
> And get log below:
> 
>  jbd2/sda2-8-320   [000] ...195.030824: blk_mq_add_timer: jiffies 
> 37c0 to 1d4c ldl 550c4000 gen 0 dl 550c
> kworker/0:1H-136   [000] ...195.031822: blk_mq_add_timer: jiffies 
> 37c0 to 1d4c ldl 550c4000 gen 0 dl 550c
> kworker/6:1H-244   [006] ...195.041695: blk_mq_add_timer: jiffies 
> 37c3 to 1d4c ldl 550f4000 gen 0 dl 550f
> kworker/6:1H-244   [006] ...195.041954: blk_mq_add_timer: jiffies 
> 37c3 to 1d4c ldl 550f4000 gen 0 dl 550f
> 
> The blk_rq_deadline return 550c4000 which looks really crazy.

The bug is in the above trace_printk() call: blk_rq_deadline() must only be
used for the legacy block layer and not for blk-mq code. If you have a look
at the value of the das.deadline field then one can see that the value of
that field is correct: 0x550c - 0x37c0 = 7500 = 30 * 250. Does that mean
that HZ = 250 on the system on which you ran this test?

> And generation never change. 

That's a good catch. The code for incrementing the generation number occurs in
blk_mq_change_rq_state() but is missing from blk_mq_rq_set_deadline(). I will 
fix this.

Bart.




[PATCH V5 8/9] nvme: core: introduce nvme_force_change_ctrl_state()

2018-05-11 Thread Ming Lei
When the controller is being reset, a timeout may still be triggered.
To handle this situation, the controller state has to be changed to
NVME_CTRL_RESETTING first.

So introduce nvme_force_change_ctrl_state() for this purpose.

Cc: James Smart 
Cc: Jianchao Wang 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: linux-n...@lists.infradead.org
Cc: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 drivers/nvme/host/core.c | 33 +
 drivers/nvme/host/nvme.h |  2 ++
 2 files changed, 35 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index adb1743e87f7..af053309c610 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -254,6 +254,39 @@ void nvme_cancel_request(struct request *req, void *data, 
bool reserved)
 }
 EXPORT_SYMBOL_GPL(nvme_cancel_request);
 
+/* should only be used by EH for handling error during reset */
+void nvme_force_change_ctrl_state(struct nvme_ctrl *ctrl,
+   enum nvme_ctrl_state new_state)
+{
+   enum nvme_ctrl_state old_state;
+   unsigned long flags;
+   bool warn = true;
+
+   spin_lock_irqsave(&ctrl->lock, flags);
+   old_state = ctrl->state;
+   switch (new_state) {
+   case NVME_CTRL_RESETTING:
+   switch (old_state) {
+   case NVME_CTRL_LIVE:
+   case NVME_CTRL_ADMIN_ONLY:
+   case NVME_CTRL_RESETTING:
+   case NVME_CTRL_CONNECTING:
+   warn = false;
+   break;
+   default:
+   break;
+   }
+   default:
+   break;
+   }
+   if (warn)
+   WARN(1, "Make sure you want to change to %d from %d\n",
+   new_state, old_state);
+   ctrl->state = new_state;
+   spin_unlock_irqrestore(&ctrl->lock, flags);
+}
+EXPORT_SYMBOL_GPL(nvme_force_change_ctrl_state);
+
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
enum nvme_ctrl_state new_state)
 {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 021f7147f779..2a68df197c71 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -385,6 +385,8 @@ void nvme_complete_rq(struct request *req);
 void nvme_cancel_request(struct request *req, void *data, bool reserved);
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
enum nvme_ctrl_state new_state);
+void nvme_force_change_ctrl_state(struct nvme_ctrl *ctrl,
+   enum nvme_ctrl_state new_state);
 int nvme_disable_ctrl(struct nvme_ctrl *ctrl, u64 cap);
 int nvme_enable_ctrl(struct nvme_ctrl *ctrl, u64 cap);
 int nvme_shutdown_ctrl(struct nvme_ctrl *ctrl);
-- 
2.9.5



[PATCH V5 7/9] nvme: pci: don't unfreeze queue until controller state updating succeeds

2018-05-11 Thread Ming Lei
If updating the controller state to LIVE or ADMIN_ONLY fails, the
controller will be removed, so it is no longer necessary to unfreeze
the queue.

This makes it easier for the following patch to avoid leaking the
freeze counter.

Cc: James Smart 
Cc: Jianchao Wang 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: linux-n...@lists.infradead.org
Cc: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 drivers/nvme/host/pci.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d880356feee2..b79c7f016489 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2370,6 +2370,7 @@ static int nvme_reset_dev(struct nvme_dev *dev)
bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
int result = -ENODEV;
enum nvme_ctrl_state new_state = NVME_CTRL_LIVE;
+   bool unfreeze_queue = false;
 
lockdep_assert_held(&dev->ctrl.reset_lock);
 
@@ -2456,7 +2457,7 @@ static int nvme_reset_dev(struct nvme_dev *dev)
/* hit this only when allocate tagset fails */
if (nvme_dev_add(dev))
new_state = NVME_CTRL_ADMIN_ONLY;
-   nvme_unfreeze(&dev->ctrl);
+   unfreeze_queue = true;
}
 
result = -ENODEV;
@@ -2471,6 +2472,9 @@ static int nvme_reset_dev(struct nvme_dev *dev)
}
 
nvme_start_ctrl(&dev->ctrl);
+
+   if (unfreeze_queue)
+   nvme_unfreeze(&dev->ctrl);
return 0;
 
  out:
-- 
2.9.5



Re: [PATCH] blk-mq: Rework blk-mq timeout handling again

2018-05-11 Thread Christoph Hellwig
> It should be due to union blk_deadline_and_state. 
> +union blk_deadline_and_state {
> + struct {
> + uint32_t generation:30;
> + uint32_t state:2;
> + uint32_t deadline;
> + };
> + unsigned long legacy_deadline;
> + uint64_t das;
> +};

Yikes.  Or we just move it into a separate field.  This patch already
shrinks struct request a lot, so I'd rather do that to keep it simple.


[PATCH V5 9/9] nvme: pci: support nested EH

2018-05-11 Thread Ming Lei
When a request times out, nvme_timeout() currently handles it in the
following way:

nvme_dev_disable(dev, false);
nvme_reset_ctrl(&dev->ctrl);
return BLK_EH_HANDLED.

There are several issues about the above approach:

1) IO may fail during resetting

An admin IO timeout may be triggered in nvme_reset_dev() when an error
happens. A normal IO timeout may also be triggered during nvme_wait_freeze()
in the reset path. When either kind of timeout happens, the current reset
mechanism can't work any more.

2) race between nvme_start_freeze and nvme_wait_freeze() & nvme_unfreeze()

- nvme_dev_disable() and resetting the controller are both required for
recovering the controller, but the two are run from different contexts.
nvme_start_freeze() is called from nvme_dev_disable(), which runs in timeout
work context, and nvme_unfreeze() is run from reset work context.
Unfortunately a timeout may be triggered while the controller is being
reset, so nvme_start_freeze() may be run several times.

- Also, two reset works may run one after another, which may cause
nvme_wait_freeze() to hang forever too.

3) all namespace's EH require to shutdown & reset the controller

The block layer's timeout handler is per request queue, which means each
namespace's error handling may shut down & reset the whole controller;
the shutdown from one namespace may then quiesce queues while a reset
from another namespace is in progress.

This patch fixes the above issues by using nested EH:

1) run controller shutdown (nvme_dev_disable()) and reset (nvme_reset_dev())
from the same EH context, so the above race 2) can be fixed easily.

2) always start a new context for handling EH, and cancel all in-flight
requests (including the timed-out ones) in nvme_dev_disable() by quiescing
the timeout event before shutting down the controller.

3) limit the max number of nested EH; when the limit is reached, remove
the controller and fail all in-flight requests.

With this approach, blktests block/011 passes.
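
As a rough illustration of point 3) above (the eh_lock, nested_eh and
NVME_MAX_NESTED_EH names follow the struct additions in the diff below;
the actual check in the patch may differ):

static bool nvme_eh_may_nest(struct nvme_dev *dev)
{
	bool ok;

	/* bound the number of nested EH attempts; once the limit is hit
	 * the caller gives up, fails all requests and removes the controller */
	spin_lock(&dev->eh_lock);
	ok = ++dev->nested_eh <= NVME_MAX_NESTED_EH;
	spin_unlock(&dev->eh_lock);

	return ok;
}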

Cc: James Smart 
Cc: Jianchao Wang 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: linux-n...@lists.infradead.org
Cc: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 drivers/nvme/host/core.c |  22 +
 drivers/nvme/host/nvme.h |   2 +
 drivers/nvme/host/pci.c  | 243 +++
 3 files changed, 247 insertions(+), 20 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index af053309c610..1dec353388be 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3581,6 +3581,28 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_start_freeze);
 
+void nvme_unquiesce_timeout(struct nvme_ctrl *ctrl)
+{
+   struct nvme_ns *ns;
+
+   down_read(&ctrl->namespaces_rwsem);
+   list_for_each_entry(ns, &ctrl->namespaces, list)
+   blk_unquiesce_timeout(ns->queue);
+   up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_unquiesce_timeout);
+
+void nvme_quiesce_timeout(struct nvme_ctrl *ctrl)
+{
+   struct nvme_ns *ns;
+
+   down_read(&ctrl->namespaces_rwsem);
+   list_for_each_entry(ns, &ctrl->namespaces, list)
+   blk_quiesce_timeout(ns->queue);
+   up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_quiesce_timeout);
+
 void nvme_stop_queues(struct nvme_ctrl *ctrl)
 {
struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 2a68df197c71..934cf98925f3 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -407,6 +407,8 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void 
*buffer, size_t len,
 void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
union nvme_result *res);
 
+void nvme_unquiesce_timeout(struct nvme_ctrl *ctrl);
+void nvme_quiesce_timeout(struct nvme_ctrl *ctrl);
 void nvme_stop_queues(struct nvme_ctrl *ctrl);
 void nvme_start_queues(struct nvme_ctrl *ctrl);
 void nvme_kill_queues(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b79c7f016489..4366c21e59b2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -71,6 +71,7 @@ struct nvme_queue;
 static void nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
freeze_queue);
+static int nvme_reset_dev(struct nvme_dev *dev);
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -113,6 +114,23 @@ struct nvme_dev {
dma_addr_t host_mem_descs_dma;
struct nvme_host_mem_buf_desc *host_mem_descs;
void **host_mem_desc_bufs;
+
+   /* EH handler */
+   spinlock_t  eh_lock;
+   boolctrl_shutdown_started;
+   boolctrl_failed;
+   unsigned intnested_eh;
+   struct work_struct fail_ctrl_work;
+   wait_queue_head_t   eh_wq;
+   struct list_headeh_head;
+};
+
+#define  NVME_MAX_NESTED_EH32
+struct nvme_eh_work {
+   struct work_s

[PATCH V5 6/9] nvme: pci: move error handling out of nvme_reset_dev()

2018-05-11 Thread Ming Lei
Once nested EH is introduced, we may not need to handle errors
in the inner EH, so move error handling out of nvme_reset_dev().

Meantime, return the reset result to the caller.

Cc: James Smart 
Cc: Jianchao Wang 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: linux-n...@lists.infradead.org
Cc: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 drivers/nvme/host/pci.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a924246ffdb6..d880356feee2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2365,7 +2365,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, 
int status)
nvme_put_ctrl(&dev->ctrl);
 }
 
-static void nvme_reset_dev(struct nvme_dev *dev)
+static int nvme_reset_dev(struct nvme_dev *dev)
 {
bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
int result = -ENODEV;
@@ -2459,6 +2459,7 @@ static void nvme_reset_dev(struct nvme_dev *dev)
nvme_unfreeze(&dev->ctrl);
}
 
+   result = -ENODEV;
/*
 * If only admin queue live, keep it to do further investigation or
 * recovery.
@@ -2470,19 +2471,22 @@ static void nvme_reset_dev(struct nvme_dev *dev)
}
 
nvme_start_ctrl(&dev->ctrl);
-   return;
+   return 0;
 
  out:
-   nvme_remove_dead_ctrl(dev, result);
+   return result;
 }
 
 static void nvme_reset_work(struct work_struct *work)
 {
struct nvme_dev *dev =
container_of(work, struct nvme_dev, ctrl.reset_work);
+   int result;
 
mutex_lock(&dev->ctrl.reset_lock);
-   nvme_reset_dev(dev);
+   result = nvme_reset_dev(dev);
+   if (result)
+   nvme_remove_dead_ctrl(dev, result);
mutex_unlock(&dev->ctrl.reset_lock);
 }
 
-- 
2.9.5



[PATCH V5 2/9] nvme: pci: cover timeout for admin commands running in EH

2018-05-11 Thread Ming Lei
When admin commands are used in EH for recovering the controller, we have
to cover their timeout ourselves and can't depend on the block layer's
timeout, since a deadlock may be caused when these commands are timed out
by the block layer again.

Cc: James Smart 
Cc: Jianchao Wang 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: linux-n...@lists.infradead.org
Cc: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 drivers/nvme/host/pci.c | 81 ++---
 1 file changed, 70 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fbc71fac6f1e..ff09b1c760ea 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1733,21 +1733,28 @@ static inline void nvme_release_cmb(struct nvme_dev 
*dev)
}
 }
 
-static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits)
+static void nvme_init_set_host_mem_cmd(struct nvme_dev *dev,
+   struct nvme_command *c, u32 bits)
 {
u64 dma_addr = dev->host_mem_descs_dma;
+
+   memset(c, 0, sizeof(*c));
+   c->features.opcode  = nvme_admin_set_features;
+   c->features.fid = cpu_to_le32(NVME_FEAT_HOST_MEM_BUF);
+   c->features.dword11 = cpu_to_le32(bits);
+   c->features.dword12 = cpu_to_le32(dev->host_mem_size >>
+ ilog2(dev->ctrl.page_size));
+   c->features.dword13 = cpu_to_le32(lower_32_bits(dma_addr));
+   c->features.dword14 = cpu_to_le32(upper_32_bits(dma_addr));
+   c->features.dword15 = cpu_to_le32(dev->nr_host_mem_descs);
+}
+
+static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits)
+{
struct nvme_command c;
int ret;
 
-   memset(&c, 0, sizeof(c));
-   c.features.opcode   = nvme_admin_set_features;
-   c.features.fid  = cpu_to_le32(NVME_FEAT_HOST_MEM_BUF);
-   c.features.dword11  = cpu_to_le32(bits);
-   c.features.dword12  = cpu_to_le32(dev->host_mem_size >>
- ilog2(dev->ctrl.page_size));
-   c.features.dword13  = cpu_to_le32(lower_32_bits(dma_addr));
-   c.features.dword14  = cpu_to_le32(upper_32_bits(dma_addr));
-   c.features.dword15  = cpu_to_le32(dev->nr_host_mem_descs);
+   nvme_init_set_host_mem_cmd(dev, &c, bits);
 
ret = nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
if (ret) {
@@ -1758,6 +1765,58 @@ static int nvme_set_host_mem(struct nvme_dev *dev, u32 
bits)
return ret;
 }
 
+static void nvme_set_host_mem_end_io(struct request *rq, blk_status_t sts)
+{
+   struct completion *waiting = rq->end_io_data;
+
+   rq->end_io_data = NULL;
+
+   /*
+* complete last, if this is a stack request the process (and thus
+* the rq pointer) could be invalid right after this complete()
+*/
+   complete(waiting);
+}
+
+/*
+ * This function can only be used inside nvme_dev_disable() when timeout
+ * may not work, then this function has to cover the timeout by itself.
+ *
+ * When wait_for_completion_io_timeout() returns 0 and timeout happens,
+ * this request will be completed after controller is shutdown.
+ */
+static int nvme_set_host_mem_timeout(struct nvme_dev *dev, u32 bits)
+{
+   DECLARE_COMPLETION_ONSTACK(wait);
+   struct nvme_command c;
+   struct request_queue *q = dev->ctrl.admin_q;
+   struct request *req;
+   int ret;
+
+   nvme_init_set_host_mem_cmd(dev, &c, bits);
+
+   req = nvme_alloc_request(q, &c, 0, NVME_QID_ANY);
+   if (IS_ERR(req))
+   return PTR_ERR(req);
+
+   req->timeout = ADMIN_TIMEOUT;
+   req->end_io_data = &wait;
+
+   blk_execute_rq_nowait(q, NULL, req, false,
+   nvme_set_host_mem_end_io);
+   ret = wait_for_completion_io_timeout(&wait, ADMIN_TIMEOUT);
+   if (ret > 0) {
+   if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+   ret = -EINTR;
+   else
+   ret = nvme_req(req)->status;
+   blk_mq_free_request(req);
+   } else
+   ret = -EINTR;
+
+   return ret;
+}
+
 static void nvme_free_host_mem(struct nvme_dev *dev)
 {
int i;
@@ -2216,7 +2275,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool 
shutdown)
 * but I'd rather be safe than sorry..
 */
if (dev->host_mem_descs)
-   nvme_set_host_mem(dev, 0);
+   nvme_set_host_mem_timeout(dev, 0);
nvme_disable_io_queues(dev);
nvme_disable_admin_queue(dev, shutdown);
}
-- 
2.9.5



[PATCH V5 1/9] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()

2018-05-11 Thread Ming Lei
Turns out the current way can't drain timeouts completely because mod_timer()
can be triggered in the work func, which can run right inside the synced
timeout work:

	del_timer_sync(&q->timeout);
	cancel_work_sync(&q->timeout_work);

This patch introduces a 'timeout_off' flag to fix this issue; it turns out
this simple approach does work.

Also blk_quiesce_timeout() and blk_unquiesce_timeout() are introduced for
draining timeouts, which is needed by NVMe.

Cc: James Smart 
Cc: Bart Van Assche 
Cc: Jianchao Wang 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: linux-n...@lists.infradead.org
Cc: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 block/blk-core.c   | 21 +++--
 block/blk-mq.c |  9 +
 block/blk-timeout.c|  5 -
 include/linux/blkdev.h | 13 +
 4 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b431eb0..c277f1023703 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -392,6 +392,22 @@ void blk_stop_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_stop_queue);
 
+void blk_unquiesce_timeout(struct request_queue *q)
+{
+   blk_mark_timeout_quiesce(q, false);
+   mod_timer(&q->timeout, jiffies + q->rq_timeout);
+}
+EXPORT_SYMBOL(blk_unquiesce_timeout);
+
+void blk_quiesce_timeout(struct request_queue *q)
+{
+   blk_mark_timeout_quiesce(q, true);
+
+   del_timer_sync(&q->timeout);
+   cancel_work_sync(&q->timeout_work);
+}
+EXPORT_SYMBOL(blk_quiesce_timeout);
+
 /**
  * blk_sync_queue - cancel any pending callbacks on a queue
  * @q: the queue
@@ -412,8 +428,7 @@ EXPORT_SYMBOL(blk_stop_queue);
  */
 void blk_sync_queue(struct request_queue *q)
 {
-   del_timer_sync(&q->timeout);
-   cancel_work_sync(&q->timeout_work);
+   blk_quiesce_timeout(q);
 
if (q->mq_ops) {
struct blk_mq_hw_ctx *hctx;
@@ -425,6 +440,8 @@ void blk_sync_queue(struct request_queue *q)
} else {
cancel_delayed_work_sync(&q->delay_work);
}
+
+   blk_mark_timeout_quiesce(q, false);
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9ce9cac16c3f..173f1723e48f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -917,6 +917,15 @@ static void blk_mq_timeout_work(struct work_struct *work)
};
struct blk_mq_hw_ctx *hctx;
int i;
+   bool timeout_off;
+   unsigned long flags;
+
+   spin_lock_irqsave(q->queue_lock, flags);
+   timeout_off = q->timeout_off;
+   spin_unlock_irqrestore(q->queue_lock, flags);
+
+   if (timeout_off)
+   return;
 
/* A deadlock might occur if a request is stuck requiring a
 * timeout at the same time a queue freeze is waiting
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 652d4d4d3e97..ffd0b609091e 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -136,12 +136,15 @@ void blk_timeout_work(struct work_struct *work)
 
spin_lock_irqsave(q->queue_lock, flags);
 
+   if (q->timeout_off)
+   goto exit;
+
list_for_each_entry_safe(rq, tmp, &q->timeout_list, timeout_list)
blk_rq_check_expired(rq, &next, &next_set);
 
if (next_set)
mod_timer(&q->timeout, round_jiffies_up(next));
-
+exit:
spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5c4eee043191..a2cc4aaecf50 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -584,6 +584,7 @@ struct request_queue {
struct timer_list   timeout;
struct work_struct  timeout_work;
struct list_headtimeout_list;
+   booltimeout_off;
 
struct list_headicq_list;
 #ifdef CONFIG_BLK_CGROUP
@@ -1017,6 +1018,18 @@ extern void blk_execute_rq(struct request_queue *, 
struct gendisk *,
 extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
  struct request *, int, rq_end_io_fn *);
 
+static inline void blk_mark_timeout_quiesce(struct request_queue *q, bool 
quiesce)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(q->queue_lock, flags);
+   q->timeout_off = quiesce;
+   spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+extern void blk_quiesce_timeout(struct request_queue *q);
+extern void blk_unquiesce_timeout(struct request_queue *q);
+
 int blk_status_to_errno(blk_status_t status);
 blk_status_t errno_to_blk_status(int errno);
 
-- 
2.9.5



[PATCH V5 4/9] nvme: pci: freeze queue in nvme_dev_disable() in case of error recovery

2018-05-11 Thread Ming Lei
When nvme_dev_disable() is used for error recovery, we should always
freeze queues before shutting down the controller:

- the reset handler assumes queues are frozen, and will wait_freeze &
unfreeze them explicitly; if queues aren't frozen during nvme_dev_disable(),
the reset handler may wait forever even though there aren't any requests
allocated.

- this way may avoid cancelling lots of requests during error recovery

This patch introduces the 'freeze_queue' parameter to fix this issue.

Cc: James Smart 
Cc: Jianchao Wang 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: linux-n...@lists.infradead.org
Cc: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 drivers/nvme/host/pci.c | 47 ---
 1 file changed, 32 insertions(+), 15 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 57bd7bebd1e5..1fafe5d01355 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -69,7 +69,8 @@ struct nvme_dev;
 struct nvme_queue;
 
 static void nvme_process_cq(struct nvme_queue *nvmeq);
-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
+static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
+   freeze_queue);
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -1197,7 +1198,7 @@ static enum blk_eh_timer_return nvme_timeout(struct 
request *req, bool reserved)
 */
if (nvme_should_reset(dev, csts)) {
nvme_warn_reset(dev, csts);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, true);
nvme_reset_ctrl(&dev->ctrl);
return BLK_EH_HANDLED;
}
@@ -1224,7 +1225,7 @@ static enum blk_eh_timer_return nvme_timeout(struct 
request *req, bool reserved)
dev_warn(dev->ctrl.device,
 "I/O %d QID %d timeout, disable controller\n",
 req->tag, nvmeq->qid);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, false);
nvme_req(req)->flags |= NVME_REQ_CANCELLED;
return BLK_EH_HANDLED;
default:
@@ -1240,7 +1241,7 @@ static enum blk_eh_timer_return nvme_timeout(struct 
request *req, bool reserved)
dev_warn(dev->ctrl.device,
 "I/O %d QID %d timeout, reset controller\n",
 req->tag, nvmeq->qid);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, true);
nvme_reset_ctrl(&dev->ctrl);
 
/*
@@ -2239,19 +2240,35 @@ static void nvme_pci_disable(struct nvme_dev *dev)
}
 }
 
-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
+/*
+ * Resetting often follows nvme_dev_disable(), so queues need to be frozen
+ * before resetting.
+ */
+static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
+   freeze_queue)
 {
int i;
bool dead = true;
struct pci_dev *pdev = to_pci_dev(dev->dev);
bool frozen = false;
 
+   /*
+* 'freeze_queue' is only valid for non-shutdown, and we do
+* inline freeze & wait_freeze_timeout for shutdown just for
+* completing as many as possible requests before shutdown
+*/
+   if (shutdown)
+   freeze_queue = false;
+
+   if (freeze_queue)
+   nvme_start_freeze(&dev->ctrl);
+
mutex_lock(&dev->shutdown_lock);
if (pci_is_enabled(pdev)) {
u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
-   if (dev->ctrl.state == NVME_CTRL_LIVE ||
-   dev->ctrl.state == NVME_CTRL_RESETTING) {
+   if (shutdown && (dev->ctrl.state == NVME_CTRL_LIVE ||
+   dev->ctrl.state == NVME_CTRL_RESETTING)) {
nvme_start_freeze(&dev->ctrl);
frozen = true;
}
@@ -2343,7 +2360,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, 
int status)
dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", 
status);
 
nvme_get_ctrl(&dev->ctrl);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, false);
if (!queue_work(nvme_wq, &dev->remove_work))
nvme_put_ctrl(&dev->ctrl);
 }
@@ -2364,7 +2381,7 @@ static void nvme_reset_work(struct work_struct *work)
 * moving on.
 */
if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(dev, false, false);
 
/*
 * Introduce CONNECTING state from nvme-fc/rdma transports to mark the
@@ -2613,7 +2630,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct 
pci_device_id *id)
 static void nvme_reset_prepare(struct pci_dev *pdev)
 {
struct nvme_dev *dev = pci_get_drvdata(pdev);
-   nvme_dev_disable(dev, false);
+   nvme_dev_disable(d

[PATCH V5 3/9] nvme: pci: only wait freezing if queue is frozen

2018-05-11 Thread Ming Lei
In nvme_dev_disable(), called while shutting down the controller,
nvme_wait_freeze_timeout() may be run on a controller that is not
frozen yet, so add a check to avoid that case.

Cc: James Smart 
Cc: Jianchao Wang 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: linux-n...@lists.infradead.org
Cc: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 drivers/nvme/host/pci.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ff09b1c760ea..57bd7bebd1e5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2244,14 +2244,17 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool 
shutdown)
int i;
bool dead = true;
struct pci_dev *pdev = to_pci_dev(dev->dev);
+   bool frozen = false;
 
mutex_lock(&dev->shutdown_lock);
if (pci_is_enabled(pdev)) {
u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
if (dev->ctrl.state == NVME_CTRL_LIVE ||
-   dev->ctrl.state == NVME_CTRL_RESETTING)
+   dev->ctrl.state == NVME_CTRL_RESETTING) {
nvme_start_freeze(&dev->ctrl);
+   frozen = true;
+   }
dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
pdev->error_state  != pci_channel_io_normal);
}
@@ -2261,7 +2264,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool 
shutdown)
 * doing a safe shutdown.
 */
if (!dead) {
-   if (shutdown)
+   if (shutdown && frozen)
nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
}
 
-- 
2.9.5



[PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-11 Thread Ming Lei
Hi,

The 1st patch introduces blk_quiesce_timeout() and blk_unquiesce_timeout()
for NVMe, meantime fixes blk_sync_queue().

The 2nd patch covers timeout for admin commands for recovering controller
for avoiding possible deadlock.

The 3rd and 4th patches avoid wait_freeze on queues which aren't frozen.

The last 5 patches fix several races wrt. the NVMe timeout handler, and
finally make blktests block/011 pass. Meantime the NVMe PCI timeout
mechanism becomes much more robust than before.

gitweb:
https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V5

V5:
- avoid to remove controller in case of reset failure in inner EHs
- make sure that nvme_unfreeze and nvme_start_freeze are paired

V4:
- fix nvme_init_set_host_mem_cmd()
- use nested EH model, and run both nvme_dev_disable() and
resetting in the same context

V3:
- fix one new race related to freezing in patch 4; nvme_reset_work()
may hang forever without this patch
- rewrite the last 3 patches, and avoid to break nvme_reset_ctrl*()

V2:
- fix draining timeout work, so no need to change return value from
.timeout()
- fix race between nvme_start_freeze() and nvme_unfreeze()
- cover timeout for admin commands running in EH

Ming Lei (9):
  block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()
  nvme: pci: cover timeout for admin commands running in EH
  nvme: pci: only wait freezing if queue is frozen
  nvme: pci: freeze queue in nvme_dev_disable() in case of error
recovery
  nvme: pci: prepare for supporting error recovery from resetting
context
  nvme: pci: move error handling out of nvme_reset_dev()
  nvme: pci: don't unfreeze queue until controller state updating
succeeds
  nvme: core: introduce nvme_force_change_ctrl_state()
  nvme: pci: support nested EH

 block/blk-core.c |  21 ++-
 block/blk-mq.c   |   9 ++
 block/blk-timeout.c  |   5 +-
 drivers/nvme/host/core.c |  57 +++
 drivers/nvme/host/nvme.h |   7 +
 drivers/nvme/host/pci.c  | 402 +--
 include/linux/blkdev.h   |  13 ++
 7 files changed, 462 insertions(+), 52 deletions(-)

Cc: James Smart 
Cc: Jianchao Wang 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: linux-n...@lists.infradead.org
Cc: Laurence Oberman 
-- 
2.9.5



[PATCH V5 5/9] nvme: pci: prepare for supporting error recovery from resetting context

2018-05-11 Thread Ming Lei
Either admin or normal IO in the reset context may time out because a
controller error happens. When this timeout happens, we may have to
start controller recovery again.

This patch introduces 'reset_lock' and holds this lock while running the
reset, so that we may support nested reset in the following patches.

Cc: James Smart 
Cc: Jianchao Wang 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Cc: linux-n...@lists.infradead.org
Cc: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 drivers/nvme/host/core.c |  2 ++
 drivers/nvme/host/nvme.h |  3 +++
 drivers/nvme/host/pci.c  | 20 +---
 3 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a3771c5729f5..adb1743e87f7 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3424,6 +3424,8 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device 
*dev,
INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work);
INIT_WORK(&ctrl->delete_work, nvme_delete_ctrl_work);
 
+   mutex_init(&ctrl->reset_lock);
+
ret = ida_simple_get(&nvme_instance_ida, 0, 0, GFP_KERNEL);
if (ret < 0)
goto out;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 7ded7a51c430..021f7147f779 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -146,6 +146,9 @@ struct nvme_ctrl {
struct device ctrl_device;
struct device *device;  /* char device */
struct cdev cdev;
+
+   /* sync reset activities */
+   struct mutex reset_lock;
struct work_struct reset_work;
struct work_struct delete_work;
 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 1fafe5d01355..a924246ffdb6 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2365,14 +2365,14 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, 
int status)
nvme_put_ctrl(&dev->ctrl);
 }
 
-static void nvme_reset_work(struct work_struct *work)
+static void nvme_reset_dev(struct nvme_dev *dev)
 {
-   struct nvme_dev *dev =
-   container_of(work, struct nvme_dev, ctrl.reset_work);
bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
int result = -ENODEV;
enum nvme_ctrl_state new_state = NVME_CTRL_LIVE;
 
+   lockdep_assert_held(&dev->ctrl.reset_lock);
+
if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING))
goto out;
 
@@ -2448,7 +2448,11 @@ static void nvme_reset_work(struct work_struct *work)
new_state = NVME_CTRL_ADMIN_ONLY;
} else {
nvme_start_queues(&dev->ctrl);
+   mutex_unlock(&dev->ctrl.reset_lock);
+
nvme_wait_freeze(&dev->ctrl);
+
+   mutex_lock(&dev->ctrl.reset_lock);
/* hit this only when allocate tagset fails */
if (nvme_dev_add(dev))
new_state = NVME_CTRL_ADMIN_ONLY;
@@ -2472,6 +2476,16 @@ static void nvme_reset_work(struct work_struct *work)
nvme_remove_dead_ctrl(dev, result);
 }
 
+static void nvme_reset_work(struct work_struct *work)
+{
+   struct nvme_dev *dev =
+   container_of(work, struct nvme_dev, ctrl.reset_work);
+
+   mutex_lock(&dev->ctrl.reset_lock);
+   nvme_reset_dev(dev);
+   mutex_unlock(&dev->ctrl.reset_lock);
+}
+
 static void nvme_remove_dead_ctrl_work(struct work_struct *work)
 {
struct nvme_dev *dev = container_of(work, struct nvme_dev, remove_work);
-- 
2.9.5



Re: [PATCH] blk-mq: Rework blk-mq timeout handling again

2018-05-11 Thread jianchao.wang
Hi bart

I add debug log in blk_mq_add_timer as following

void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
		      enum mq_rq_state new)
{
	struct request_queue *q = req->q;

	if (!req->timeout)
		req->timeout = q->rq_timeout;
	if (!blk_mq_rq_set_deadline(req, jiffies + req->timeout, old, new))
		WARN_ON_ONCE(true);

	trace_printk("jiffies %lx to %x ldl %lx gen %u dl %x\n",
		     jiffies,
		     req->timeout,
		     blk_rq_deadline(req),
		     req->das.generation,
		     req->das.deadline);

	return __blk_add_timer(req);
}

And get log below:

 jbd2/sda2-8-320   [000] ...195.030824: blk_mq_add_timer: jiffies 
37c0 to 1d4c ldl 550c4000 gen 0 dl 550c
kworker/0:1H-136   [000] ...195.031822: blk_mq_add_timer: jiffies 
37c0 to 1d4c ldl 550c4000 gen 0 dl 550c
kworker/6:1H-244   [006] ...195.041695: blk_mq_add_timer: jiffies 
37c3 to 1d4c ldl 550f4000 gen 0 dl 550f
kworker/6:1H-244   [006] ...195.041954: blk_mq_add_timer: jiffies 
37c3 to 1d4c ldl 550f4000 gen 0 dl 550f

blk_rq_deadline() returns 550c4000, which looks really crazy.
It should be due to union blk_deadline_and_state.
+union blk_deadline_and_state {
+   struct {
+   uint32_t generation:30;
+   uint32_t state:2;
+   uint32_t deadline;
+   };
+   unsigned long legacy_deadline;
+   uint64_t das;
+};

And generation never changes.

Thanks
Jianchao



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-11 Thread Christian König

On 10.05.2018 at 19:15, Logan Gunthorpe wrote:


On 10/05/18 11:11 AM, Stephen Bates wrote:

Not to me. In the p2pdma code we specifically program DMA engines with
the PCI bus address.

Ah yes of course. Brain fart on my part. We are not programming the P2PDMA 
initiator with an IOVA but with the PCI bus address...


By disabling the ACS bits on the intermediate bridges you turn their 
address routing from IOVA addresses (which are to be resolved by the 
root complex) back to PCI bus addresses (which are resolved locally in 
the bridge).


This only works when the IOVA and the PCI bus addresses never overlap. 
I'm not sure how the IOVA allocation works but I don't think we 
guarantee that on Linux.





So regardless of whether we are using the IOMMU or
not, the packets will be forwarded directly to the peer. If the ACS
Redir bits are on they will be forced back to the RC by the switch and
the transaction will fail. If we clear the ACS bits, the TLPs will go
where we want and everything will work (but we lose the isolation of ACS).

Agreed.


If we really want to enable P2P without ATS and IOMMU enabled I think we 
should probably approach it like this:


a) Make double sure that IOVA in an IOMMU group never overlap with PCI 
BARs in that group.


b) Add configuration options to put a whole PCI branch of devices (e.g. 
a bridge) into a single IOMMU group.


c) Add a configuration option to disable the ACS bit on bridges in the 
same IOMMU group.



I agree that we have a rather special case here, but I still find that 
approach rather brave and would vote for disabling P2P without ATS when 
IOMMU is enabled.



BTW: I can't say anything about other implementations, but at least for
the AMD-IOMMU the transaction won't fail when it is sent to the root
complex.


Instead the root complex would send it to the correct device. I already 
tested that on an AMD Ryzen with IOMMU enabled and P2P between two GPUs 
(but could be that this only works because of ATS).


Regards,
Christian.


For EPs that support ATS, we should (but don't necessarily have to)
program them with the IOVA address so they can go through the
translation process which will allow P2P without disabling the ACS Redir
bits -- provided the ACS direct translation bit is set. (And btw, if it
is, then we lose the benefit of ACS protecting against malicious EPs).
But, per above, the ATS transaction should involve only the IOVA address
so the ACS bits not being set should not break ATS.
 
Well we would still have to clear some ACS bits but now we can clear only for translated addresses.

We don't have to clear the ACS Redir bits as we did in the first case.
We just have to make sure the ACS Direct Translated bit is set.

Logan