Re: [PATCH 10/33] iomap: add an iomap-based bmap implementation
On Fri, May 11, 2018 at 08:25:27AM +0200, Christoph Hellwig wrote:
> On Thu, May 10, 2018 at 08:08:38AM -0700, Darrick J. Wong wrote:
> > > > > +	sector_t *bno = data;
> > > > > +
> > > > > +	if (iomap->type == IOMAP_MAPPED)
> > > > > +		*bno = (iomap->addr + pos - iomap->offset) >> inode->i_blkbits;
> > > >
> > > > Does this need to be careful w.r.t. overflow on systems where
> > > > sector_t is a 32-bit unsigned long?
> > > >
> > > > Also, ioctl_fibmap() typecasts the returned sector_t to an int, which
> > > > also seems broken.  I agree the interface needs to die, but ioctls
> > > > take a long time to deprecate.
> > >
> > > Not much we can do about the interface.
> >
> > Yes, the interface is fubar, but if file /foo maps to block 8589934720
> > then do we return the truncated result 128?
>
> Then we'll get a corrupt result.  What do you think we could do here
> either in the old or new code?

I think the only thing we /can/ do is figure out if we'd be truncating
the result, dump a warning to the kernel log, and return 0, because we
don't want smartypants FIBMAP callers to be using crap block pointers.
Something like this for the bmap implementation...

	uint64_t mapping = iomap->addr;

#ifdef CONFIG_LBDAF
	if (mapping > ULONG_MAX) {
		/* Do not truncate results. */
		return 0;
	}
#endif

...and in the bmap ioctl...

	sector_t mapping = ...;

	if (mapping > INT_MAX) {
		WARN(1, "would truncate bmap result, go fix your stupid program");
		return 0;
	}

--D
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
Hi Keith, On Fri, May 11, 2018 at 02:50:28PM -0600, Keith Busch wrote: > On Fri, May 11, 2018 at 08:29:24PM +0800, Ming Lei wrote: > > Hi, > > > > The 1st patch introduces blk_quiesce_timeout() and blk_unquiesce_timeout() > > for NVMe, meantime fixes blk_sync_queue(). > > > > The 2nd patch covers timeout for admin commands for recovering controller > > for avoiding possible deadlock. > > > > The 3rd and 4th patches avoid to wait_freeze on queues which aren't frozen. > > > > The last 5 patches fixes several races wrt. NVMe timeout handler, and > > finally can make blktests block/011 passed. Meantime the NVMe PCI timeout > > mecanism become much more rebost than before. > > > > gitweb: > > https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V5 > > Hi Ming, > > First test with simulated broken links is unsuccessful. I'm getting > stuck here: > > [<0>] blk_mq_freeze_queue_wait+0x46/0xb0 > [<0>] blk_cleanup_queue+0x78/0x170 > [<0>] nvme_ns_remove+0x137/0x1a0 [nvme_core] > [<0>] nvme_remove_namespaces+0x86/0xc0 [nvme_core] > [<0>] nvme_remove+0x6b/0x130 [nvme] > [<0>] pci_device_remove+0x36/0xb0 > [<0>] device_release_driver_internal+0x157/0x220 > [<0>] nvme_remove_dead_ctrl_work+0x29/0x40 [nvme] > [<0>] process_one_work+0x170/0x350 > [<0>] worker_thread+0x2e/0x380 > [<0>] kthread+0x111/0x130 > [<0>] ret_from_fork+0x1f/0x30 > > > Here's the last parts of the kernel logs capturing the failure: > > [ 760.679105] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679116] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679120] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679124] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679127] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679131] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679135] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x > [ 760.679138] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679141] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679144] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679148] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679151] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679155] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679158] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679161] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679164] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679169] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679172] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679176] nvme nvme1: EH 0: before shutdown > [ 760.679177] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679181] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679185] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679189] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679192] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679196] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679199] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679202] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679240] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679243] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > [ 760.679246] nvme nvme1: controller is down; will reset: CSTS=0x, > PCI_STATUS=0x > > ( above repeats a few more hundred times ) > > [ 
760.679960] nvme nvme1: controller is down; will reset: CSTS=0x,
> PCI_STATUS=0x
> [ 760.701468] nvme nvme1: EH 0: after shutdown, top eh: 1
> [ 760.727099] pci_raw_set_power_state: 62 callbacks suppressed
> [ 760.727103] nvme :86:00.0: Refused to change power state, currently in
> D3

EH may not cover this kind of failure, so it fails in the 1st try.

> [ 760.727483] nvme nvme1: EH 0: state 4, eh_done -19, top eh 1
> [ 760.727485] nvme nvme1: EH 0: after recovery -19
> [ 760.727488] nvme nvme1: EH: fail controller

The above issue (hang in nvme_remove())
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 5/11/2018 4:24 PM, Stephen Bates wrote:
> All
>
> Alex (or anyone else) can you point to where IOVA addresses are generated?
>
> A case of RTFM perhaps (though a pointer to the code would still be
> appreciated).
>
> https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt
>
> Some exceptions to IOVA
> ---
> Interrupt ranges are not address translated, (0xfee0 - 0xfeef).
> The same is true for peer to peer transactions. Hence we reserve the
> address from PCI MMIO ranges so they are not allocated for IOVA addresses.

Hmm, except I'm not sure how to interpret that. It sounds like there
can't be an IOVA address that overlaps with the PCI MMIO range, which is
good and what I'd expect. But for peer to peer they say they don't
translate the address, which implies to me that the intention is for a
peer to peer address to not be mapped in the same way using the dma_map
interface (of course, if you were using ATS you'd want this for sure).
Unless the existing dma_map commands notice a PCI MMIO address and
handle them differently, but I don't see how.

Logan
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
All

> Alex (or anyone else) can you point to where IOVA addresses are generated?

A case of RTFM perhaps (though a pointer to the code would still be
appreciated).

https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt

Some exceptions to IOVA
---
Interrupt ranges are not address translated, (0xfee0 - 0xfeef).
The same is true for peer to peer transactions. Hence we reserve the
address from PCI MMIO ranges so they are not allocated for IOVA addresses.

Cheers

Stephen
Re: [PATCH v8] blk-mq: Rework blk-mq timeout handling again
On Fri, 2018-05-11 at 15:21 -0600, Jens Axboe wrote:
> On 5/11/18 3:08 PM, Bart Van Assche wrote:
> > 	blk_mq_rq_update_aborted_gstate(rq, gstate);
> > +	union blk_deadline_and_state das = READ_ONCE(rq->das);
> > +	unsigned long now = jiffies;
> > +	int32_t diff_jiffies = das.deadline - now;
> > +	int32_t diff_next = das.deadline - data->next;
> > +	enum mq_rq_state rq_state = das.state;
>
> Why the int32_t types? Just use an int/unsigned int?

The code that uses the diff_jiffies and diff_next variables can only
work correctly if the subtraction results are truncated to 32 bits.
Hence the use of the int32_t data types to make this explicit. I think
using int32_t instead of int makes this code easier to read.

> > @@ -103,37 +97,119 @@ extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
> >
> >  void blk_mq_release(struct request_queue *q);
> >
> > +#if defined(CONFIG_ARC) ||						\
> > +	defined(CONFIG_ARM) && defined(CONFIG_CPU_THUMBONLY) ||		\
> > +	defined(CONFIG_C6x) || defined(CONFIG_H8300) ||			\
> > +	defined(CONFIG_HEXAGON) || defined(CONFIG_MICROBLAZE) ||	\
> > +	defined(CONFIG_MIPS) && !defined(CONFIG_64BIT) ||		\
> > +	defined(CONFIG_NDS32) || defined(CONFIG_NIOS) ||		\
> > +	defined(CONFIG_OPENRISC) || defined(CONFIG_PPC32) ||		\
> > +	defined(CONFIG_SUPERH) || defined(CONFIG_UML) || defined(CONFIG_UNICORE32)
> > +#undef CONFIG_ARCH_HAS_CMPXCHG64
> > +#else
> > +#define CONFIG_ARCH_HAS_CMPXCHG64
> > +#endif
>
> I think the encapsulation of blk_mq_set_rq_state() in terms of lock vs
> cmpxchg64 is fine, but this does not belong in the block layer.
>
> > +	/* Only used if CONFIG_ARCH_HAS_CMPXCHG64 is not defined. */
> > +	spinlock_t das_lock;
>
> Let's put it under an if defined() then.

OK, I will move CONFIG_ARCH_HAS_CMPXCHG64 into arch/*/Kconfig and I will
surround spinlock_t das_lock with #if defined(...) / #endif.

Thanks,

Bart.
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> I find this hard to believe. There's always the possibility that some
> part of the system doesn't support ACS so if the PCI bus addresses and
> IOVA overlap there's a good chance that P2P and ATS won't work at all
> on some hardware.

I tend to agree but this comes down to how IOVA addresses are generated
in the kernel. Alex (or anyone else) can you point to where IOVA
addresses are generated? As Logan stated earlier, p2pdma bypasses this
and programs the PCI bus address directly, but other IO going to the
same PCI EP may flow through the IOMMU and be programmed with IOVA
rather than PCI bus addresses.

> I prefer the option to disable the ACS bit on boot and let the
> existing code put the devices into their own IOMMU group (as it should
> already do to support hardware that doesn't have ACS support).

+1

Stephen
Re: [PATCH v8] blk-mq: Rework blk-mq timeout handling again
On 5/11/18 3:08 PM, Bart Van Assche wrote:
> 	blk_mq_rq_update_aborted_gstate(rq, gstate);
> +	union blk_deadline_and_state das = READ_ONCE(rq->das);
> +	unsigned long now = jiffies;
> +	int32_t diff_jiffies = das.deadline - now;
> +	int32_t diff_next = das.deadline - data->next;
> +	enum mq_rq_state rq_state = das.state;

Why the int32_t types? Just use an int/unsigned int?

> @@ -103,37 +97,119 @@ extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
>
>  void blk_mq_release(struct request_queue *q);
>
> +#if defined(CONFIG_ARC) ||						\
> +	defined(CONFIG_ARM) && defined(CONFIG_CPU_THUMBONLY) ||		\
> +	defined(CONFIG_C6x) || defined(CONFIG_H8300) ||			\
> +	defined(CONFIG_HEXAGON) || defined(CONFIG_MICROBLAZE) ||	\
> +	defined(CONFIG_MIPS) && !defined(CONFIG_64BIT) ||		\
> +	defined(CONFIG_NDS32) || defined(CONFIG_NIOS) ||		\
> +	defined(CONFIG_OPENRISC) || defined(CONFIG_PPC32) ||		\
> +	defined(CONFIG_SUPERH) || defined(CONFIG_UML) || defined(CONFIG_UNICORE32)
> +#undef CONFIG_ARCH_HAS_CMPXCHG64
> +#else
> +#define CONFIG_ARCH_HAS_CMPXCHG64
> +#endif

I think the encapsulation of blk_mq_set_rq_state() in terms of lock vs
cmpxchg64 is fine, but this does not belong in the block layer.

> +	/* Only used if CONFIG_ARCH_HAS_CMPXCHG64 is not defined. */
> +	spinlock_t das_lock;

Let's put it under an if defined() then.

-- 
Jens Axboe
Re: [PATCH 00/10] Misc block layer patches for bcachefs
On 5/8/18 7:33 PM, Kent Overstreet wrote:
> - Add separately allocated mempools, biosets: bcachefs uses both all
>   over the place
>
> - Bit of utility code - bio_copy_data_iter(), zero_fill_bio_iter()
>
> - bio_list_copy_data(), the bi_next check - defensiveness because of a
>   bug I had fun chasing down at one point
>
> - add some exports, because bcachefs does dio its own way
>
> - show whether fua is supported in sysfs, because I don't know of
>   anything that exports whether the _block layer_ specifically thinks
>   fua is supported.

Looked over the series, and looks like both good cleanups and
optimizations. If we can get the mempool patch sorted, I can apply this
for 4.18.

-- 
Jens Axboe
Re: [PATCH 01/10] mempool: Add mempool_init()/mempool_exit()
On 5/8/18 7:33 PM, Kent Overstreet wrote: > Allows mempools to be embedded in other structs, getting rid of a > pointer indirection from allocation fastpaths. > > mempool_exit() is safe to call on an uninitialized but zeroed mempool. Looks fine to me. I'd like to carry it through the block branch, as some of the following cleanups depend on it. Kent, can you post a v2 with the destroy -> exit typo fixed up? But would be nice to have someone sign off on it... -- Jens Axboe
[PATCH v8] blk-mq: Rework blk-mq timeout handling again
Recently the blk-mq timeout handling code was reworked. See also Tejun
Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
This patch reworks the blk-mq timeout handling code again. The timeout
handling code is simplified by introducing a state machine per request.
This change prevents the blk-mq timeout handling code from ignoring
completions that occur after blk_mq_check_expired() has been called and
before blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block
driver timeout handler always returns BLK_EH_RESET_TIMER then the result
of that race is that the request never terminates.

Fix this race as follows:
- Replace the __deadline member of struct request by a new member called
  das that contains the generation number, state and deadline. Only 32
  bits are used for the deadline field such that all three fields occupy
  only 64 bits. This change reduces the maximum supported request
  timeout value from (2**63/HZ) to (2**31/HZ).
- Remove all request member variables that became superfluous due to
  this change: gstate, gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to
  this patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely the
  RCU lock and unlock statements in blk_mq_complete_request() and also
  the synchronize_rcu() call in the timeout handler.

Notes:
- A spinlock is used to protect atomic changes of rq->das on those
  architectures that do not provide a cmpxchg64() implementation.
- Atomic instructions are only used to update the request state if a
  concurrent request state change could be in progress.
- blk_add_timer() has been split into two functions - one for the legacy
  block layer and one for blk-mq.
Signed-off-by: Bart Van Assche
Cc: Tejun Heo
Cc: Christoph Hellwig
Cc: Jianchao Wang
Cc: Ming Lei
Cc: Sebastian Ott
Cc: Sagi Grimberg
Cc: Israel Rukshin
Cc: Max Gurtovoy
---
Changes compared to v7:
- Fixed the generation number mechanism. Note: with this patch applied
  the behavior of the block layer does not depend on the generation
  number.
- Added more 32-bit architectures to the list of architectures on which
  cmpxchg64() should not be used.

Changes compared to v6:
- Used a union instead of bit manipulations to store multiple values
  into a single 64-bit field.
- Reduced the size of the timeout field from 64 to 32 bits.
- Made sure that the block layer still builds with this patch applied
  for the sh and mips architectures.
- Fixed two sparse warnings that were introduced by this patch in the
  WRITE_ONCE() calls.

Changes compared to v5:
- Restored the synchronize_rcu() call between marking a request for
  timeout handling and the actual timeout handling, to avoid that
  timeout handling starts while .queue_rq() is still in progress if the
  timeout is very short.
- Only use cmpxchg() if another context could attempt to change the
  request state concurrently. Use WRITE_ONCE() otherwise.

Changes compared to v4:
- Addressed multiple review comments from Christoph. The most important
  are that atomic_long_cmpxchg() has been changed into cmpxchg() and
  also that there is now a nice and clean split between the legacy and
  blk-mq versions of blk_add_timer().
- Changed the patch name and modified the patch description because
  there is disagreement about whether or not the v4.16 blk-mq core can
  complete a single request twice. Kept the "Cc: stable" tag because of
  https://bugzilla.kernel.org/show_bug.cgi?id=199077.

Changes compared to v3 (see also
https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html):
- Removed the spinlock again that was introduced to protect the request
  state. v4 uses atomic_long_cmpxchg() instead.
- Split __deadline into two variables - one for the legacy block layer
  and one for blk-mq.

Changes compared to v2
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html):
- Rebased and retested on top of kernel v4.16.

Changes compared to v1
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html):
- Removed the gstate and aborted_gstate members of struct request and
  used the __deadline member to encode both the generation and state
  information.

 block/blk-core.c       |   6 --
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c         | 178 +
 block/blk-mq.h         | 130
 block/blk-timeout.c    |  89 ++---
 block/blk.h            |  21 +++---
 include/linux/blkdev.h |  44 ++--
 7 files changed, 233 insertions(+), 236 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1418a1ccd80d..223f650712f4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->internal_tag = -1;
 	r
Re: make a few block drivers highmem safe
On 5/9/18 7:59 AM, Christoph Hellwig wrote: > Hi all, > > this series converts a few random block drivers to be highmem safe, > in preparation of eventually getting rid of the block layer bounce > buffering support. Applied, thanks. -- Jens Axboe
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
On Fri, May 11, 2018 at 08:29:24PM +0800, Ming Lei wrote:
> Hi,
>
> The 1st patch introduces blk_quiesce_timeout() and blk_unquiesce_timeout()
> for NVMe, meantime fixes blk_sync_queue().
>
> The 2nd patch covers timeout for admin commands for recovering the
> controller, for avoiding a possible deadlock.
>
> The 3rd and 4th patches avoid calling wait_freeze on queues which
> aren't frozen.
>
> The last 5 patches fix several races wrt. the NVMe timeout handler, and
> finally make blktests block/011 pass. Meantime the NVMe PCI timeout
> mechanism becomes much more robust than before.
>
> gitweb:
> https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V5

Hi Ming,

First test with simulated broken links is unsuccessful. I'm getting
stuck here:

[<0>] blk_mq_freeze_queue_wait+0x46/0xb0
[<0>] blk_cleanup_queue+0x78/0x170
[<0>] nvme_ns_remove+0x137/0x1a0 [nvme_core]
[<0>] nvme_remove_namespaces+0x86/0xc0 [nvme_core]
[<0>] nvme_remove+0x6b/0x130 [nvme]
[<0>] pci_device_remove+0x36/0xb0
[<0>] device_release_driver_internal+0x157/0x220
[<0>] nvme_remove_dead_ctrl_work+0x29/0x40 [nvme]
[<0>] process_one_work+0x170/0x350
[<0>] worker_thread+0x2e/0x380
[<0>] kthread+0x111/0x130
[<0>] ret_from_fork+0x1f/0x30

Here's the last parts of the kernel logs capturing the failure:

[ 760.679105] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679116] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679120] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679124] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679127] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679131] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679135] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679138] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679141] nvme nvme1: controller is down; will reset: CSTS=0x,
PCI_STATUS=0x [ 760.679144] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.679148] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.679151] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.679155] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.679158] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.679161] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.679164] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.679169] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.679172] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.679176] nvme nvme1: EH 0: before shutdown [ 760.679177] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.679181] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.679185] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.679189] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.679192] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.679196] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.679199] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.679202] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.679240] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.679243] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.679246] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x ( above repeats a few more hundred times ) [ 760.679960] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x [ 760.701468] nvme nvme1: EH 0: after shutdown, top eh: 1 [ 760.727099] pci_raw_set_power_state: 62 callbacks suppressed [ 760.727103] nvme :86:00.0: Refused to change power state, currently in 
D3
[ 760.727483] nvme nvme1: EH 0: state 4, eh_done -19, top eh 1
[ 760.727485] nvme nvme1: EH 0: after recovery -19
[ 760.727488] nvme nvme1: EH: fail controller
[ 760.727491] nvme nvme1: Removing after probe failure status: 0
[ 760.735138] nvme1n1: detected capacity change from 1200243695616 to 0
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 5/11/2018 2:52 AM, Christian König wrote:
> This only works when the IOVA and the PCI bus addresses never overlap.
> I'm not sure how the IOVA allocation works but I don't think we
> guarantee that on Linux.

I find this hard to believe. There's always the possibility that some
part of the system doesn't support ACS so if the PCI bus addresses and
IOVA overlap there's a good chance that P2P and ATS won't work at all on
some hardware.

> If we really want to enable P2P without ATS and IOMMU enabled I think
> we should probably approach it like this:
>
> a) Make double sure that IOVA in an IOMMU group never overlap with PCI
>    BARs in that group.
>
> b) Add configuration options to put a whole PCI branch of devices
>    (e.g. a bridge) into a single IOMMU group.
>
> c) Add a configuration option to disable the ACS bit on bridges in the
>    same IOMMU group.

I think a configuration option to manage IOMMU groups as you suggest
would be a very complex interface and difficult to implement. I prefer
the option to disable the ACS bit on boot and let the existing code put
the devices into their own IOMMU group (as it should already do to
support hardware that doesn't have ACS support).

Logan
Re: [PATCH] blk-mq: Rework blk-mq timeout handling again
On Fri, 2018-05-11 at 14:35 +0200, Christoph Hellwig wrote:
> > It should be due to union blk_deadline_and_state.
> >
> > +union blk_deadline_and_state {
> > +	struct {
> > +		uint32_t generation:30;
> > +		uint32_t state:2;
> > +		uint32_t deadline;
> > +	};
> > +	unsigned long legacy_deadline;
> > +	uint64_t das;
> > +};
>
> Yikes. Or we just move it into a separate field. This patch already
> shrinks struct request a lot, so I'd rather do that to keep it simple.

Hello Christoph,

Are you perhaps referring to the legacy_deadline field? Have you noticed
that Jianchao used a legacy block layer function in blk-mq code and that
that is why a wrong value for the deadline appeared in the trace output?

Thanks,

Bart.
Re: [PATCH] blk-mq: Rework blk-mq timeout handling again
On Fri, 2018-05-11 at 20:06 +0800, jianchao.wang wrote:
> Hi Bart
>
> I add debug log in blk_mq_add_timer as following
>
> void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
> 		      enum mq_rq_state new)
> {
> 	struct request_queue *q = req->q;
>
> 	if (!req->timeout)
> 		req->timeout = q->rq_timeout;
> 	if (!blk_mq_rq_set_deadline(req, jiffies + req->timeout, old, new))
> 		WARN_ON_ONCE(true);
>
> 	trace_printk("jiffies %lx to %x ldl %lx gen %u dl %x\n",
> 		jiffies,
> 		req->timeout,
> 		blk_rq_deadline(req),
> 		req->das.generation,
> 		req->das.deadline);
>
> 	return __blk_add_timer(req);
> }
>
> And get log below:
>
> jbd2/sda2-8-320  [000] ... 195.030824: blk_mq_add_timer: jiffies 37c0 to 1d4c ldl 550c4000 gen 0 dl 550c
> kworker/0:1H-136 [000] ... 195.031822: blk_mq_add_timer: jiffies 37c0 to 1d4c ldl 550c4000 gen 0 dl 550c
> kworker/6:1H-244 [006] ... 195.041695: blk_mq_add_timer: jiffies 37c3 to 1d4c ldl 550f4000 gen 0 dl 550f
> kworker/6:1H-244 [006] ... 195.041954: blk_mq_add_timer: jiffies 37c3 to 1d4c ldl 550f4000 gen 0 dl 550f
>
> The blk_rq_deadline return 550c4000 which looks really crazy.

The bug is in the above trace_printk() call: blk_rq_deadline() must only
be used for the legacy block layer and not for blk-mq code. If you have
a look at the value of the das.deadline field then one can see that the
value of that field is correct: 0x550c - 0x37c0 = 7500 = 30 * 250. Does
that mean that HZ = 250 on the system on which you ran this test?

> And generation never change.

That's a good catch. The code for incrementing the generation number
occurs in blk_mq_change_rq_state() but is missing from
blk_mq_rq_set_deadline(). I will fix this.

Bart.
[PATCH V5 8/9] nvme: core: introduce nvme_force_change_ctrl_state()
When the controller is being reset, timeouts may still be triggered; to
handle this situation, the controller state has to be changed to
NVME_CTRL_RESETTING first. So introduce nvme_force_change_ctrl_state()
for this purpose.

Cc: James Smart
Cc: Jianchao Wang
Cc: Christoph Hellwig
Cc: Sagi Grimberg
Cc: linux-n...@lists.infradead.org
Cc: Laurence Oberman
Signed-off-by: Ming Lei
---
 drivers/nvme/host/core.c | 33 +
 drivers/nvme/host/nvme.h |  2 ++
 2 files changed, 35 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index adb1743e87f7..af053309c610 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -254,6 +254,39 @@ void nvme_cancel_request(struct request *req, void *data, bool reserved)
 }
 EXPORT_SYMBOL_GPL(nvme_cancel_request);

+/* should only be used by EH for handling error during reset */
+void nvme_force_change_ctrl_state(struct nvme_ctrl *ctrl,
+		enum nvme_ctrl_state new_state)
+{
+	enum nvme_ctrl_state old_state;
+	unsigned long flags;
+	bool warn = true;
+
+	spin_lock_irqsave(&ctrl->lock, flags);
+	old_state = ctrl->state;
+	switch (new_state) {
+	case NVME_CTRL_RESETTING:
+		switch (old_state) {
+		case NVME_CTRL_LIVE:
+		case NVME_CTRL_ADMIN_ONLY:
+		case NVME_CTRL_RESETTING:
+		case NVME_CTRL_CONNECTING:
+			warn = false;
+			break;
+		default:
+			break;
+		}
+	default:
+		break;
+	}
+	if (warn)
+		WARN(1, "Make sure you want to change to %d from %d\n",
+				new_state, old_state);
+	ctrl->state = new_state;
+	spin_unlock_irqrestore(&ctrl->lock, flags);
+}
+EXPORT_SYMBOL_GPL(nvme_force_change_ctrl_state);
+
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		enum nvme_ctrl_state new_state)
 {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 021f7147f779..2a68df197c71 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -385,6 +385,8 @@ void nvme_complete_rq(struct request *req);
 void nvme_cancel_request(struct request *req, void *data, bool reserved);
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		enum nvme_ctrl_state new_state);
+void nvme_force_change_ctrl_state(struct nvme_ctrl *ctrl,
+		enum nvme_ctrl_state new_state);
 int nvme_disable_ctrl(struct nvme_ctrl *ctrl, u64 cap);
 int nvme_enable_ctrl(struct nvme_ctrl *ctrl, u64 cap);
 int nvme_shutdown_ctrl(struct nvme_ctrl *ctrl);
-- 
2.9.5
[PATCH V5 7/9] nvme: pci: don't unfreeze queue until controller state updating succeeds
If it fails to update the controller state to LIVE or ADMIN_ONLY, the
controller will be removed, so it is not necessary to unfreeze the queue
any more. This will make it easier for the following patch to avoid
leaking the freeze counter.

Cc: James Smart
Cc: Jianchao Wang
Cc: Christoph Hellwig
Cc: Sagi Grimberg
Cc: linux-n...@lists.infradead.org
Cc: Laurence Oberman
Signed-off-by: Ming Lei
---
 drivers/nvme/host/pci.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d880356feee2..b79c7f016489 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2370,6 +2370,7 @@ static int nvme_reset_dev(struct nvme_dev *dev)
 	bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
 	int result = -ENODEV;
 	enum nvme_ctrl_state new_state = NVME_CTRL_LIVE;
+	bool unfreeze_queue = false;

 	lockdep_assert_held(&dev->ctrl.reset_lock);

@@ -2456,7 +2457,7 @@ static int nvme_reset_dev(struct nvme_dev *dev)
 		/* hit this only when allocate tagset fails */
 		if (nvme_dev_add(dev))
 			new_state = NVME_CTRL_ADMIN_ONLY;
-		nvme_unfreeze(&dev->ctrl);
+		unfreeze_queue = true;
 	}

 	result = -ENODEV;
@@ -2471,6 +2472,9 @@ static int nvme_reset_dev(struct nvme_dev *dev)
 	}

 	nvme_start_ctrl(&dev->ctrl);
+
+	if (unfreeze_queue)
+		nvme_unfreeze(&dev->ctrl);
 	return 0;

 out:
-- 
2.9.5
Re: [PATCH] blk-mq: Rework blk-mq timeout handling again
> It should be due to union blk_deadline_and_state.
>
> +union blk_deadline_and_state {
> +	struct {
> +		uint32_t generation:30;
> +		uint32_t state:2;
> +		uint32_t deadline;
> +	};
> +	unsigned long legacy_deadline;
> +	uint64_t das;
> +};

Yikes. Or we just move it into a separate field. This patch already
shrinks struct request a lot, so I'd rather do that to keep it simple.
[PATCH V5 9/9] nvme: pci: support nested EH
When one req is timed out, nvme_timeout() currently handles it in the
following way:

	nvme_dev_disable(dev, false);
	nvme_reset_ctrl(&dev->ctrl);
	return BLK_EH_HANDLED;

There are several issues with the above approach:

1) IO may fail during resetting

An admin IO timeout may be triggered in nvme_reset_dev() when an error
happens. A normal IO timeout may be triggered too during
nvme_wait_freeze() in the reset path. When either kind of timeout
happens, the current reset mechanism can't work any more.

2) race between nvme_start_freeze() and nvme_wait_freeze() &
nvme_unfreeze()

- nvme_dev_disable() and resetting the controller are both required for
recovering the controller, but the two are run from different contexts.
nvme_start_freeze() is called from nvme_dev_disable(), which is run from
the timeout work context, and nvme_unfreeze() is run from the reset work
context. Unfortunately a timeout may be triggered while the controller
is being reset, so nvme_start_freeze() may be run several times.

- Also two reset works may run one after the other; this may cause a
hang in nvme_wait_freeze() forever too.

3) all namespaces' EH requires shutting down & resetting the controller

The block layer's timeout handler is per-request-queue; that means each
namespace's error handling may shut down & reset the whole controller,
so the shutdown from one namespace may quiesce queues while a reset from
another namespace is in progress.

This patch fixes the above issues by using nested EH:

1) run controller shutdown (nvme_dev_disable()) and reset
(nvme_reset_dev()) from one and the same EH context, so the race in 2)
above can be fixed easily.

2) always start a new context for handling EH, and cancel all in-flight
requests (including the timed-out ones) in nvme_dev_disable() by
quiescing the timeout event before shutting down the controller.

3) limit the max number of nested EH; when the limit is reached, remove
the controller and fail all in-flight requests.

With this approach, blktests block/011 can be passed.
Cc: James Smart Cc: Jianchao Wang Cc: Christoph Hellwig Cc: Sagi Grimberg Cc: linux-n...@lists.infradead.org Cc: Laurence Oberman Signed-off-by: Ming Lei --- drivers/nvme/host/core.c | 22 + drivers/nvme/host/nvme.h | 2 + drivers/nvme/host/pci.c | 243 +++ 3 files changed, 247 insertions(+), 20 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index af053309c610..1dec353388be 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -3581,6 +3581,28 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl) } EXPORT_SYMBOL_GPL(nvme_start_freeze); +void nvme_unquiesce_timeout(struct nvme_ctrl *ctrl) +{ + struct nvme_ns *ns; + + down_read(&ctrl->namespaces_rwsem); + list_for_each_entry(ns, &ctrl->namespaces, list) + blk_unquiesce_timeout(ns->queue); + up_read(&ctrl->namespaces_rwsem); +} +EXPORT_SYMBOL_GPL(nvme_unquiesce_timeout); + +void nvme_quiesce_timeout(struct nvme_ctrl *ctrl) +{ + struct nvme_ns *ns; + + down_read(&ctrl->namespaces_rwsem); + list_for_each_entry(ns, &ctrl->namespaces, list) + blk_quiesce_timeout(ns->queue); + up_read(&ctrl->namespaces_rwsem); +} +EXPORT_SYMBOL_GPL(nvme_quiesce_timeout); + void nvme_stop_queues(struct nvme_ctrl *ctrl) { struct nvme_ns *ns; diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 2a68df197c71..934cf98925f3 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -407,6 +407,8 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len, void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status, union nvme_result *res); +void nvme_unquiesce_timeout(struct nvme_ctrl *ctrl); +void nvme_quiesce_timeout(struct nvme_ctrl *ctrl); void nvme_stop_queues(struct nvme_ctrl *ctrl); void nvme_start_queues(struct nvme_ctrl *ctrl); void nvme_kill_queues(struct nvme_ctrl *ctrl); diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index b79c7f016489..4366c21e59b2 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ 
-71,6 +71,7 @@ struct nvme_queue; static void nvme_process_cq(struct nvme_queue *nvmeq); static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool freeze_queue); +static int nvme_reset_dev(struct nvme_dev *dev); /* * Represents an NVM Express device. Each nvme_dev is a PCI function. @@ -113,6 +114,23 @@ struct nvme_dev { dma_addr_t host_mem_descs_dma; struct nvme_host_mem_buf_desc *host_mem_descs; void **host_mem_desc_bufs; + + /* EH handler */ + spinlock_t eh_lock; + boolctrl_shutdown_started; + boolctrl_failed; + unsigned intnested_eh; + struct work_struct fail_ctrl_work; + wait_queue_head_t eh_wq; + struct list_headeh_head; +}; + +#define NVME_MAX_NESTED_EH32 +struct nvme_eh_work { + struct work_s
[PATCH V5 6/9] nvme: pci: move error handling out of nvme_reset_dev()
Once nested EH is introduced, we may not need to handle errors in the
inner EH, so move error handling out of nvme_reset_dev(). Meantime
return the reset result to the caller.

Cc: James Smart
Cc: Jianchao Wang
Cc: Christoph Hellwig
Cc: Sagi Grimberg
Cc: linux-n...@lists.infradead.org
Cc: Laurence Oberman
Signed-off-by: Ming Lei
---
 drivers/nvme/host/pci.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a924246ffdb6..d880356feee2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2365,7 +2365,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
 	nvme_put_ctrl(&dev->ctrl);
 }
 
-static void nvme_reset_dev(struct nvme_dev *dev)
+static int nvme_reset_dev(struct nvme_dev *dev)
 {
 	bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL);
 	int result = -ENODEV;
@@ -2459,6 +2459,7 @@ static void nvme_reset_dev(struct nvme_dev *dev)
 		nvme_unfreeze(&dev->ctrl);
 	}
 
+	result = -ENODEV;
 	/*
 	 * If only admin queue live, keep it to do further investigation or
 	 * recovery.
@@ -2470,19 +2471,22 @@ static void nvme_reset_dev(struct nvme_dev *dev)
 	}
 
 	nvme_start_ctrl(&dev->ctrl);
-	return;
+	return 0;
 
  out:
-	nvme_remove_dead_ctrl(dev, result);
+	return result;
 }
 
 static void nvme_reset_work(struct work_struct *work)
 {
 	struct nvme_dev *dev =
 		container_of(work, struct nvme_dev, ctrl.reset_work);
+	int result;
 
 	mutex_lock(&dev->ctrl.reset_lock);
-	nvme_reset_dev(dev);
+	result = nvme_reset_dev(dev);
+	if (result)
+		nvme_remove_dead_ctrl(dev, result);
 	mutex_unlock(&dev->ctrl.reset_lock);
 }
 
-- 
2.9.5
[PATCH V5 2/9] nvme: pci: cover timeout for admin commands running in EH
When admin commands are used in EH for recovering controller, we have to cover their timeout and can't depend on block's timeout since deadlock may be caused when these commands are timed-out by block layer again. Cc: James Smart Cc: Jianchao Wang Cc: Christoph Hellwig Cc: Sagi Grimberg Cc: linux-n...@lists.infradead.org Cc: Laurence Oberman Signed-off-by: Ming Lei --- drivers/nvme/host/pci.c | 81 ++--- 1 file changed, 70 insertions(+), 11 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index fbc71fac6f1e..ff09b1c760ea 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -1733,21 +1733,28 @@ static inline void nvme_release_cmb(struct nvme_dev *dev) } } -static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits) +static void nvme_init_set_host_mem_cmd(struct nvme_dev *dev, + struct nvme_command *c, u32 bits) { u64 dma_addr = dev->host_mem_descs_dma; + + memset(c, 0, sizeof(*c)); + c->features.opcode = nvme_admin_set_features; + c->features.fid = cpu_to_le32(NVME_FEAT_HOST_MEM_BUF); + c->features.dword11 = cpu_to_le32(bits); + c->features.dword12 = cpu_to_le32(dev->host_mem_size >> + ilog2(dev->ctrl.page_size)); + c->features.dword13 = cpu_to_le32(lower_32_bits(dma_addr)); + c->features.dword14 = cpu_to_le32(upper_32_bits(dma_addr)); + c->features.dword15 = cpu_to_le32(dev->nr_host_mem_descs); +} + +static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits) +{ struct nvme_command c; int ret; - memset(&c, 0, sizeof(c)); - c.features.opcode = nvme_admin_set_features; - c.features.fid = cpu_to_le32(NVME_FEAT_HOST_MEM_BUF); - c.features.dword11 = cpu_to_le32(bits); - c.features.dword12 = cpu_to_le32(dev->host_mem_size >> - ilog2(dev->ctrl.page_size)); - c.features.dword13 = cpu_to_le32(lower_32_bits(dma_addr)); - c.features.dword14 = cpu_to_le32(upper_32_bits(dma_addr)); - c.features.dword15 = cpu_to_le32(dev->nr_host_mem_descs); + nvme_init_set_host_mem_cmd(dev, &c, bits); ret = nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, 
NULL, 0); if (ret) { @@ -1758,6 +1765,58 @@ static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits) return ret; } +static void nvme_set_host_mem_end_io(struct request *rq, blk_status_t sts) +{ + struct completion *waiting = rq->end_io_data; + + rq->end_io_data = NULL; + + /* +* complete last, if this is a stack request the process (and thus +* the rq pointer) could be invalid right after this complete() +*/ + complete(waiting); +} + +/* + * This function can only be used inside nvme_dev_disable() when timeout + * may not work, then this function has to cover the timeout by itself. + * + * When wait_for_completion_io_timeout() returns 0 and timeout happens, + * this request will be completed after controller is shutdown. + */ +static int nvme_set_host_mem_timeout(struct nvme_dev *dev, u32 bits) +{ + DECLARE_COMPLETION_ONSTACK(wait); + struct nvme_command c; + struct request_queue *q = dev->ctrl.admin_q; + struct request *req; + int ret; + + nvme_init_set_host_mem_cmd(dev, &c, bits); + + req = nvme_alloc_request(q, &c, 0, NVME_QID_ANY); + if (IS_ERR(req)) + return PTR_ERR(req); + + req->timeout = ADMIN_TIMEOUT; + req->end_io_data = &wait; + + blk_execute_rq_nowait(q, NULL, req, false, + nvme_set_host_mem_end_io); + ret = wait_for_completion_io_timeout(&wait, ADMIN_TIMEOUT); + if (ret > 0) { + if (nvme_req(req)->flags & NVME_REQ_CANCELLED) + ret = -EINTR; + else + ret = nvme_req(req)->status; + blk_mq_free_request(req); + } else + ret = -EINTR; + + return ret; +} + static void nvme_free_host_mem(struct nvme_dev *dev) { int i; @@ -2216,7 +2275,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown) * but I'd rather be safe than sorry.. */ if (dev->host_mem_descs) - nvme_set_host_mem(dev, 0); + nvme_set_host_mem_timeout(dev, 0); nvme_disable_io_queues(dev); nvme_disable_admin_queue(dev, shutdown); } -- 2.9.5
[PATCH V5 1/9] block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()
Turns out the current way can't drain timeouts completely, because
mod_timer() can be triggered in the work func, which may run just
inside the synced timeout work:

	del_timer_sync(&q->timeout);
	cancel_work_sync(&q->timeout_work);

This patch introduces a 'timeout_off' flag to fix this issue; turns out
this simple way does work.

Also blk_quiesce_timeout() and blk_unquiesce_timeout() are introduced
for draining timeouts, which is needed by NVMe.

Cc: James Smart
Cc: Bart Van Assche
Cc: Jianchao Wang
Cc: Christoph Hellwig
Cc: Sagi Grimberg
Cc: linux-n...@lists.infradead.org
Cc: Laurence Oberman
Signed-off-by: Ming Lei
---
 block/blk-core.c       | 21 +++++++++++++++++++--
 block/blk-mq.c         |  9 +++++++++
 block/blk-timeout.c    |  5 ++++-
 include/linux/blkdev.h | 13 +++++++++++++
 4 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b431eb0..c277f1023703 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -392,6 +392,22 @@ void blk_stop_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_stop_queue);
 
+void blk_unquiesce_timeout(struct request_queue *q)
+{
+	blk_mark_timeout_quiesce(q, false);
+	mod_timer(&q->timeout, jiffies + q->rq_timeout);
+}
+EXPORT_SYMBOL(blk_unquiesce_timeout);
+
+void blk_quiesce_timeout(struct request_queue *q)
+{
+	blk_mark_timeout_quiesce(q, true);
+
+	del_timer_sync(&q->timeout);
+	cancel_work_sync(&q->timeout_work);
+}
+EXPORT_SYMBOL(blk_quiesce_timeout);
+
 /**
  * blk_sync_queue - cancel any pending callbacks on a queue
  * @q: the queue
@@ -412,8 +428,7 @@ EXPORT_SYMBOL(blk_stop_queue);
  */
 void blk_sync_queue(struct request_queue *q)
 {
-	del_timer_sync(&q->timeout);
-	cancel_work_sync(&q->timeout_work);
+	blk_quiesce_timeout(q);
 
 	if (q->mq_ops) {
 		struct blk_mq_hw_ctx *hctx;
@@ -425,6 +440,8 @@ void blk_sync_queue(struct request_queue *q)
 	} else {
 		cancel_delayed_work_sync(&q->delay_work);
 	}
+
+	blk_mark_timeout_quiesce(q, false);
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9ce9cac16c3f..173f1723e48f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -917,6 +917,15 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	};
 	struct blk_mq_hw_ctx *hctx;
 	int i;
+	bool timeout_off;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	timeout_off = q->timeout_off;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	if (timeout_off)
+		return;
 
 	/* A deadlock might occur if a request is stuck requiring a
 	 * timeout at the same time a queue freeze is waiting
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 652d4d4d3e97..ffd0b609091e 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -136,12 +136,15 @@ void blk_timeout_work(struct work_struct *work)
 
 	spin_lock_irqsave(q->queue_lock, flags);
 
+	if (q->timeout_off)
+		goto exit;
+
 	list_for_each_entry_safe(rq, tmp, &q->timeout_list, timeout_list)
 		blk_rq_check_expired(rq, &next, &next_set);
 
 	if (next_set)
 		mod_timer(&q->timeout, round_jiffies_up(next));
-
+exit:
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5c4eee043191..a2cc4aaecf50 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -584,6 +584,7 @@ struct request_queue {
 	struct timer_list	timeout;
 	struct work_struct	timeout_work;
 	struct list_head	timeout_list;
+	bool			timeout_off;
 
 	struct list_head	icq_list;
 #ifdef CONFIG_BLK_CGROUP
@@ -1017,6 +1018,18 @@ extern void blk_execute_rq(struct request_queue *, struct gendisk *,
 extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
 				  struct request *, int, rq_end_io_fn *);
 
+static inline void blk_mark_timeout_quiesce(struct request_queue *q, bool quiesce)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	q->timeout_off = quiesce;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+extern void blk_quiesce_timeout(struct request_queue *q);
+extern void blk_unquiesce_timeout(struct request_queue *q);
+
 int blk_status_to_errno(blk_status_t status);
 blk_status_t errno_to_blk_status(int errno);
 
-- 
2.9.5
[PATCH V5 4/9] nvme: pci: freeze queue in nvme_dev_disable() in case of error recovery
When nvme_dev_disable() is used for error recovery, we should always freeze queues before shutdown controller: - reset handler supposes queues are frozen, and will wait_freeze & unfreeze them explicitly, if queues aren't frozen during nvme_dev_disable(), reset handler may wait forever even though there isn't any requests allocated. - this way may avoid to cancel lots of requests during error recovery This patch introduces the parameter of 'freeze_queue' for fixing this issue. Cc: James Smart Cc: Jianchao Wang Cc: Christoph Hellwig Cc: Sagi Grimberg Cc: linux-n...@lists.infradead.org Cc: Laurence Oberman Signed-off-by: Ming Lei --- drivers/nvme/host/pci.c | 47 --- 1 file changed, 32 insertions(+), 15 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 57bd7bebd1e5..1fafe5d01355 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -69,7 +69,8 @@ struct nvme_dev; struct nvme_queue; static void nvme_process_cq(struct nvme_queue *nvmeq); -static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown); +static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool + freeze_queue); /* * Represents an NVM Express device. Each nvme_dev is a PCI function. 
@@ -1197,7 +1198,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved) */ if (nvme_should_reset(dev, csts)) { nvme_warn_reset(dev, csts); - nvme_dev_disable(dev, false); + nvme_dev_disable(dev, false, true); nvme_reset_ctrl(&dev->ctrl); return BLK_EH_HANDLED; } @@ -1224,7 +1225,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved) dev_warn(dev->ctrl.device, "I/O %d QID %d timeout, disable controller\n", req->tag, nvmeq->qid); - nvme_dev_disable(dev, false); + nvme_dev_disable(dev, false, false); nvme_req(req)->flags |= NVME_REQ_CANCELLED; return BLK_EH_HANDLED; default: @@ -1240,7 +1241,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved) dev_warn(dev->ctrl.device, "I/O %d QID %d timeout, reset controller\n", req->tag, nvmeq->qid); - nvme_dev_disable(dev, false); + nvme_dev_disable(dev, false, true); nvme_reset_ctrl(&dev->ctrl); /* @@ -2239,19 +2240,35 @@ static void nvme_pci_disable(struct nvme_dev *dev) } } -static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown) +/* + * Resetting often follows nvme_dev_disable(), so queues need to be frozen + * before resetting. 
+ */ +static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool + freeze_queue) { int i; bool dead = true; struct pci_dev *pdev = to_pci_dev(dev->dev); bool frozen = false; + /* +* 'freeze_queue' is only valid for non-shutdown, and we do +* inline freeze & wait_freeze_timeout for shutdown just for +* completing as many as possible requests before shutdown +*/ + if (shutdown) + freeze_queue = false; + + if (freeze_queue) + nvme_start_freeze(&dev->ctrl); + mutex_lock(&dev->shutdown_lock); if (pci_is_enabled(pdev)) { u32 csts = readl(dev->bar + NVME_REG_CSTS); - if (dev->ctrl.state == NVME_CTRL_LIVE || - dev->ctrl.state == NVME_CTRL_RESETTING) { + if (shutdown && (dev->ctrl.state == NVME_CTRL_LIVE || + dev->ctrl.state == NVME_CTRL_RESETTING)) { nvme_start_freeze(&dev->ctrl); frozen = true; } @@ -2343,7 +2360,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status) dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", status); nvme_get_ctrl(&dev->ctrl); - nvme_dev_disable(dev, false); + nvme_dev_disable(dev, false, false); if (!queue_work(nvme_wq, &dev->remove_work)) nvme_put_ctrl(&dev->ctrl); } @@ -2364,7 +2381,7 @@ static void nvme_reset_work(struct work_struct *work) * moving on. */ if (dev->ctrl.ctrl_config & NVME_CC_ENABLE) - nvme_dev_disable(dev, false); + nvme_dev_disable(dev, false, false); /* * Introduce CONNECTING state from nvme-fc/rdma transports to mark the @@ -2613,7 +2630,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) static void nvme_reset_prepare(struct pci_dev *pdev) { struct nvme_dev *dev = pci_get_drvdata(pdev); - nvme_dev_disable(dev, false); + nvme_dev_disable(d
[PATCH V5 3/9] nvme: pci: only wait freezing if queue is frozen
In nvme_dev_disable() called while shutting down the controller,
nvme_wait_freeze_timeout() may be done on a controller which isn't
frozen yet, so add a check to avoid that case.

Cc: James Smart
Cc: Jianchao Wang
Cc: Christoph Hellwig
Cc: Sagi Grimberg
Cc: linux-n...@lists.infradead.org
Cc: Laurence Oberman
Signed-off-by: Ming Lei
---
 drivers/nvme/host/pci.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ff09b1c760ea..57bd7bebd1e5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2244,14 +2244,17 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	int i;
 	bool dead = true;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	bool frozen = false;
 
 	mutex_lock(&dev->shutdown_lock);
 	if (pci_is_enabled(pdev)) {
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
 		if (dev->ctrl.state == NVME_CTRL_LIVE ||
-		    dev->ctrl.state == NVME_CTRL_RESETTING)
+		    dev->ctrl.state == NVME_CTRL_RESETTING) {
 			nvme_start_freeze(&dev->ctrl);
+			frozen = true;
+		}
 		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
 			pdev->error_state != pci_channel_io_normal);
 	}
@@ -2261,7 +2264,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	 * doing a safe shutdown.
 	 */
 	if (!dead) {
-		if (shutdown)
+		if (shutdown && frozen)
 			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
 	}
 
-- 
2.9.5
[PATCH V5 0/9] nvme: pci: fix & improve timeout handling
Hi,

The 1st patch introduces blk_quiesce_timeout() and blk_unquiesce_timeout()
for NVMe, and meantime fixes blk_sync_queue().

The 2nd patch covers timeouts for admin commands used for recovering the
controller, to avoid a possible deadlock.

The 3rd and 4th patches avoid calling wait_freeze on queues which aren't
frozen.

The last 5 patches fix several races w.r.t. the NVMe timeout handler, and
finally make blktests block/011 pass. Meantime the NVMe PCI timeout
mechanism becomes much more robust than before.

gitweb:
	https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V5

V5:
	- avoid to remove controller in case of reset failure in inner EHs
	- make sure that nvme_unfreeze and nvme_start_freeze are paired

V4:
	- fix nvme_init_set_host_mem_cmd()
	- use nested EH model, and run both nvme_dev_disable() and
	  resetting in one same context

V3:
	- fix one new race related to freezing in patch 4; nvme_reset_work()
	  may hang forever without this patch
	- rewrite the last 3 patches, and avoid to break nvme_reset_ctrl*()

V2:
	- fix draining timeout work, so no need to change return value
	  from .timeout()
	- fix race between nvme_start_freeze() and nvme_unfreeze()
	- cover timeout for admin commands running in EH

Ming Lei (9):
  block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()
  nvme: pci: cover timeout for admin commands running in EH
  nvme: pci: only wait freezing if queue is frozen
  nvme: pci: freeze queue in nvme_dev_disable() in case of error recovery
  nvme: pci: prepare for supporting error recovery from resetting context
  nvme: pci: move error handling out of nvme_reset_dev()
  nvme: pci: don't unfreeze queue until controller state updating succeeds
  nvme: core: introduce nvme_force_change_ctrl_state()
  nvme: pci: support nested EH

 block/blk-core.c         |  21 +-
 block/blk-mq.c           |   9 ++
 block/blk-timeout.c      |   5 +-
 drivers/nvme/host/core.c |  57 +++
 drivers/nvme/host/nvme.h |   7 +
 drivers/nvme/host/pci.c  | 402 +--
 include/linux/blkdev.h   |  13 ++
 7 files changed, 462 insertions(+), 52 deletions(-)

Cc: James Smart
Cc: Jianchao Wang
Cc: Christoph Hellwig
Cc: Sagi Grimberg
Cc: linux-n...@lists.infradead.org
Cc: Laurence Oberman

-- 
2.9.5
[PATCH V5 5/9] nvme: pci: prepare for supporting error recovery from resetting context
Either the admin or normal IO in reset context may be timed out because controller error happens. When this timeout happens, we may have to start controller recovery again. This patch introduces 'reset_lock' and holds this lock when running reset, so that we may support nested reset in the following patches. Cc: James Smart Cc: Jianchao Wang Cc: Christoph Hellwig Cc: Sagi Grimberg Cc: linux-n...@lists.infradead.org Cc: Laurence Oberman Signed-off-by: Ming Lei --- drivers/nvme/host/core.c | 2 ++ drivers/nvme/host/nvme.h | 3 +++ drivers/nvme/host/pci.c | 20 +--- 3 files changed, 22 insertions(+), 3 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index a3771c5729f5..adb1743e87f7 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -3424,6 +3424,8 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev, INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work); INIT_WORK(&ctrl->delete_work, nvme_delete_ctrl_work); + mutex_init(&ctrl->reset_lock); + ret = ida_simple_get(&nvme_instance_ida, 0, 0, GFP_KERNEL); if (ret < 0) goto out; diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 7ded7a51c430..021f7147f779 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -146,6 +146,9 @@ struct nvme_ctrl { struct device ctrl_device; struct device *device; /* char device */ struct cdev cdev; + + /* sync reset activities */ + struct mutex reset_lock; struct work_struct reset_work; struct work_struct delete_work; diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 1fafe5d01355..a924246ffdb6 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -2365,14 +2365,14 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status) nvme_put_ctrl(&dev->ctrl); } -static void nvme_reset_work(struct work_struct *work) +static void nvme_reset_dev(struct nvme_dev *dev) { - struct nvme_dev *dev = - container_of(work, struct nvme_dev, ctrl.reset_work); bool was_suspend = 
!!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL); int result = -ENODEV; enum nvme_ctrl_state new_state = NVME_CTRL_LIVE; + lockdep_assert_held(&dev->ctrl.reset_lock); + if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING)) goto out; @@ -2448,7 +2448,11 @@ static void nvme_reset_work(struct work_struct *work) new_state = NVME_CTRL_ADMIN_ONLY; } else { nvme_start_queues(&dev->ctrl); + mutex_unlock(&dev->ctrl.reset_lock); + nvme_wait_freeze(&dev->ctrl); + + mutex_lock(&dev->ctrl.reset_lock); /* hit this only when allocate tagset fails */ if (nvme_dev_add(dev)) new_state = NVME_CTRL_ADMIN_ONLY; @@ -2472,6 +2476,16 @@ static void nvme_reset_work(struct work_struct *work) nvme_remove_dead_ctrl(dev, result); } +static void nvme_reset_work(struct work_struct *work) +{ + struct nvme_dev *dev = + container_of(work, struct nvme_dev, ctrl.reset_work); + + mutex_lock(&dev->ctrl.reset_lock); + nvme_reset_dev(dev); + mutex_unlock(&dev->ctrl.reset_lock); +} + static void nvme_remove_dead_ctrl_work(struct work_struct *work) { struct nvme_dev *dev = container_of(work, struct nvme_dev, remove_work); -- 2.9.5
Re: [PATCH] blk-mq: Rework blk-mq timeout handling again
Hi Bart

I added a debug log in blk_mq_add_timer as follows:

void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
		      enum mq_rq_state new)
{
	struct request_queue *q = req->q;

	if (!req->timeout)
		req->timeout = q->rq_timeout;
	if (!blk_mq_rq_set_deadline(req, jiffies + req->timeout, old, new))
		WARN_ON_ONCE(true);

	trace_printk("jiffies %lx to %x ldl %lx gen %u dl %x\n",
		     jiffies, req->timeout, blk_rq_deadline(req),
		     req->das.generation, req->das.deadline);

	return __blk_add_timer(req);
}

And got the log below:

jbd2/sda2-8-320  [000] ...195.030824: blk_mq_add_timer: jiffies 37c0 to 1d4c ldl 550c4000 gen 0 dl 550c
kworker/0:1H-136 [000] ...195.031822: blk_mq_add_timer: jiffies 37c0 to 1d4c ldl 550c4000 gen 0 dl 550c
kworker/6:1H-244 [006] ...195.041695: blk_mq_add_timer: jiffies 37c3 to 1d4c ldl 550f4000 gen 0 dl 550f
kworker/6:1H-244 [006] ...195.041954: blk_mq_add_timer: jiffies 37c3 to 1d4c ldl 550f4000 gen 0 dl 550f

blk_rq_deadline returns 550c4000, which looks really crazy. It should
be due to union blk_deadline_and_state.

+union blk_deadline_and_state {
+	struct {
+		uint32_t generation:30;
+		uint32_t state:2;
+		uint32_t deadline;
+	};
+	unsigned long legacy_deadline;
+	uint64_t das;
+};

And generation never changes.

Thanks
Jianchao
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Am 10.05.2018 um 19:15 schrieb Logan Gunthorpe:

On 10/05/18 11:11 AM, Stephen Bates wrote:

Not to me. In the p2pdma code we specifically program DMA engines with
the PCI bus address.

Ah yes of course. Brain fart on my part. We are not programming the
P2PDMA initiator with an IOVA but with the PCI bus address...

By disabling the ACS bits on the intermediate bridges you turn their
address routing from IOVA addresses (which are to be resolved by the
root complex) back to PCI bus addresses (which are resolved locally in
the bridge).

This only works when the IOVA and the PCI bus addresses never overlap.
I'm not sure how the IOVA allocation works, but I don't think we
guarantee that on Linux.

So regardless of whether we are using the IOMMU or not, the packets
will be forwarded directly to the peer. If the ACS Redir bits are on
they will be forced back to the RC by the switch and the transaction
will fail. If we clear the ACS bits, the TLPs will go where we want and
everything will work (but we lose the isolation of ACS).

Agreed.

If we really want to enable P2P without ATS and IOMMU enabled, I think
we should probably approach it like this:

a) Make double sure that IOVAs in an IOMMU group never overlap with PCI
   BARs in that group.

b) Add configuration options to put a whole PCI branch of devices
   (e.g. a bridge) into a single IOMMU group.

c) Add a configuration option to disable the ACS bit on bridges in the
   same IOMMU group.

I agree that we have a rather special case here, but I still find that
approach rather brave and would vote for disabling P2P without ATS when
the IOMMU is enabled.

BTW: I can't say anything about other implementations, but at least for
the AMD-IOMMU the transaction won't fail when it is sent to the root
complex. Instead the root complex would send it to the correct device.
I already tested that on an AMD Ryzen with IOMMU enabled and P2P
between two GPUs (but it could be that this only works because of ATS).

Regards,
Christian.
For EPs that support ATS, we should (but don't necessarily have to)
program them with the IOVA address so they can go through the
translation process, which will allow P2P without disabling the ACS
Redir bits -- provided the ACS Direct Translated bit is set. (And btw,
if it is, then we lose the benefit of ACS protecting against malicious
EPs.) But, per the above, the ATS transaction should involve only the
IOVA address, so the ACS bits not being set should not break ATS.

Well, we would still have to clear some ACS bits, but now we can clear
them only for translated addresses. We don't have to clear the ACS
Redir bits as we did in the first case. We just have to make sure the
ACS Direct Translated bit is set.

Logan