Re: [REGRESSION] v4.11-rc3: lpfc: panic during module removal / shutdown
On 04/04/17 06:53, Mauricio Faria de Oliveira wrote:
> Hi Junichi,
>
> On 03/28/2017 11:29 PM, Junichi Nomura wrote:
>> Since commit 895427bd012c ("scsi: lpfc: NVME Initiator: Base modifications"),
>> "rmmod lpfc" started to cause a panic or corruption due to a double free.
>
> Thanks for the report. Can you please check whether the patch just sent
> ([PATCH] lpfc: fix double free of bound CQ/WQ ring pointer) resolves it?

It works for me. Thank you!

Considering future maintenance, it might be a bit fragile to depend only
on the code comment to represent the relation between the cq/wq and the
shared pring, but that is the maintainers' choice.

-- 
Jun'ichi Nomura, NEC Corporation / NEC Solution Innovators, Ltd.
Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
On Mon, Apr 3, 2017 at 4:12 PM, Logan Gunthorpe wrote:
>
> On 03/04/17 04:47 PM, Dan Williams wrote:
>> I wouldn't necessarily conflate supporting pfn_t in the scatterlist
>> with the stalled struct-page-less DMA effort. A pfn_t_to_page()
>> conversion will still work and be required. However you're right, the
>> minute we use pfn_t for this we're into the realm of special case
>> drivers that understand scatterlists with special "I/O-pfn_t" entries.
>
> Well yes, it would certainly be possible to convert the scatterlist code
> from page_link to pfn_t. (The only slightly tricky thing is that
> scatterlist uses extra chaining bits and pfn_t uses extra flag bits, so
> they'd have to be harmonized somehow.) But if we aren't moving toward
> struct-page-less DMA, I fail to see the point of the conversion.
>
> I'll definitely need IO scatterlists of some form or another, and I like
> pfn_t, but right now it just seems like extra work with unclear benefit.
> (Though, if someone told me that I can't use a third bit in the
> page_link field then maybe that would be a good reason to move to pfn_t.)
>
>> However, maybe that's what we want? I think peer-to-peer DMA is not a
>> general purpose feature unless/until we get it standardized in PCI. So
>> maybe drivers with special case scatterlist support is exactly what we
>> want for now.
>
> Well, I think this should be completely independent from PCI code. I see
> no reason why we can't have infrastructure for DMA on iomem from any
> bus. Largely all the work I've done in this area is completely agnostic
> to the bus in use. (Except for any kind of white/black list when it is
> used.)

The completely agnostic part is where I get worried, but I shouldn't say
any more until I actually read the patch. The worry is cases where this
agnostic enabling allows unsuspecting code paths to do the wrong thing,
like bypass iomem safety.

> The "special case scatterlist" is essentially what I'm proposing in the
> patch I sent upthread, it just stores the flag in the page_link instead
> of in a pfn_t.

Makes sense. The suggestion of pfn_t was to try to get more type safety
throughout the stack, so that, again, unsuspecting code paths that get an
I/O pfn aren't able to do things like page_address() or kmap() without
failing. I'll stop commenting now and set aside some time to go read the
patches.
Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
On 03/04/17 04:47 PM, Dan Williams wrote:
> I wouldn't necessarily conflate supporting pfn_t in the scatterlist
> with the stalled struct-page-less DMA effort. A pfn_t_to_page()
> conversion will still work and be required. However you're right, the
> minute we use pfn_t for this we're into the realm of special case
> drivers that understand scatterlists with special "I/O-pfn_t" entries.

Well yes, it would certainly be possible to convert the scatterlist code
from page_link to pfn_t. (The only slightly tricky thing is that
scatterlist uses extra chaining bits and pfn_t uses extra flag bits, so
they'd have to be harmonized somehow.) But if we aren't moving toward
struct-page-less DMA, I fail to see the point of the conversion.

I'll definitely need IO scatterlists of some form or another, and I like
pfn_t, but right now it just seems like extra work with unclear benefit.
(Though, if someone told me that I can't use a third bit in the
page_link field then maybe that would be a good reason to move to pfn_t.)

> However, maybe that's what we want? I think peer-to-peer DMA is not a
> general purpose feature unless/until we get it standardized in PCI. So
> maybe drivers with special case scatterlist support is exactly what we
> want for now.

Well, I think this should be completely independent from PCI code. I see
no reason why we can't have infrastructure for DMA on iomem from any
bus. Largely all the work I've done in this area is completely agnostic
to the bus in use. (Except for any kind of white/black list when it is
used.)

The "special case scatterlist" is essentially what I'm proposing in the
patch I sent upthread, it just stores the flag in the page_link instead
of in a pfn_t.

Logan
Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
On Mon, Apr 3, 2017 at 3:10 PM, Logan Gunthorpe wrote:
>
> On 03/04/17 03:44 PM, Dan Williams wrote:
>> On Mon, Apr 3, 2017 at 2:20 PM, Logan Gunthorpe wrote:
>>> Hi Christoph,
>>>
>>> What are your thoughts on an approach like the following untested
>>> draft patch?
>>>
>>> The patch (if fleshed out) makes it so iomem can be used in an sgl,
>>> and WARN_ONs will occur in places where drivers attempt to access
>>> iomem directly through the sgl.
>>>
>>> I'd also probably create a p2pmem_alloc_sgl helper function so driver
>>> writers wouldn't have to mess with sg_set_iomem_page.
>>>
>>> With all that in place, it should be relatively safe for drivers to
>>> implement p2pmem even though we'd still technically be violating the
>>> __iomem boundary in some places.
>>
>> Just reacting to this mail, I still haven't had a chance to take a
>> look at the rest of the series.
>>
>> The pfn_t type was invented to carry extra type and page lookup
>> information about the memory behind a given pfn. At first glance that
>> seems a more natural place to carry an indication that this is an
>> "I/O" pfn.
>
> I agree... But what are the plans for pfn_t? Is anyone working on using
> it in the scatterlist code? Currently it's not there yet, and given the
> assertion that we will continue to be using struct page for DMA, is
> that a direction we'd want to go?

I wouldn't necessarily conflate supporting pfn_t in the scatterlist
with the stalled struct-page-less DMA effort. A pfn_t_to_page()
conversion will still work and be required. However you're right, the
minute we use pfn_t for this we're into the realm of special case
drivers that understand scatterlists with special "I/O-pfn_t" entries.

However, maybe that's what we want? I think peer-to-peer DMA is not a
general purpose feature unless/until we get it standardized in PCI. So
maybe drivers with special case scatterlist support is exactly what we
want for now.

Thoughts?
Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
On 03/04/17 03:44 PM, Dan Williams wrote:
> On Mon, Apr 3, 2017 at 2:20 PM, Logan Gunthorpe wrote:
>> Hi Christoph,
>>
>> What are your thoughts on an approach like the following untested
>> draft patch?
>>
>> The patch (if fleshed out) makes it so iomem can be used in an sgl,
>> and WARN_ONs will occur in places where drivers attempt to access
>> iomem directly through the sgl.
>>
>> I'd also probably create a p2pmem_alloc_sgl helper function so driver
>> writers wouldn't have to mess with sg_set_iomem_page.
>>
>> With all that in place, it should be relatively safe for drivers to
>> implement p2pmem even though we'd still technically be violating the
>> __iomem boundary in some places.
>
> Just reacting to this mail, I still haven't had a chance to take a
> look at the rest of the series.
>
> The pfn_t type was invented to carry extra type and page lookup
> information about the memory behind a given pfn. At first glance that
> seems a more natural place to carry an indication that this is an
> "I/O" pfn.

I agree... But what are the plans for pfn_t? Is anyone working on using
it in the scatterlist code? Currently it's not there yet, and given the
assertion that we will continue to be using struct page for DMA, is that
a direction we'd want to go?

Logan
Re: [REGRESSION] v4.11-rc3: lpfc: panic during module removal / shutdown
Hi Junichi,

On 03/28/2017 11:29 PM, Junichi Nomura wrote:
> Since commit 895427bd012c ("scsi: lpfc: NVME Initiator: Base modifications"),
> "rmmod lpfc" started to cause a panic or corruption due to a double free.

Thanks for the report. Can you please check whether the patch just sent
([PATCH] lpfc: fix double free of bound CQ/WQ ring pointer) resolves it?
I don't have a setup handy to test it right now.

cheers,

-- 
Mauricio Faria de Oliveira
IBM Linux Technology Center
[PATCH] lpfc: fix double free of bound CQ/WQ ring pointer
commit 895427bd012c ("scsi: lpfc: NVME Initiator: Base modifications")
binds the CQ's and WQ's ring pointer (sets it to the same address on both):

    lpfc_create_wq_cq():
        ...
        rc = lpfc_cq_create(phba, cq, eq, <...>)
        ...
        rc = lpfc_wq_create(phba, wq, cq, qtype);
        ...
        /* Bind this CQ/WQ to the NVME ring */
        pring = wq->pring;
        ...
        cq->pring = pring;
        ...

The commit frees both the CQ and WQ for FCP/NVME in
lpfc_sli4_queue_destroy(), which causes a double free (potential
corruption or panic) on freeing the ring pointer of the second entity
(the CQ is first, the WQ is second):

    lpfc_pci_remove_one()                      # that is, .remove / .shutdown
    -> lpfc_pci_remove_one_s4()
       -> lpfc_sli4_hba_unset()
          -> lpfc_sli4_queue_destroy()
             -> lpfc_sli4_release_queues()     # Release FCP/NVME cqs
                -> __lpfc_sli4_release_queue()
                   -> lpfc_sli4_queue_free()
                      -> kfree(queue->pring)   # first free
             -> lpfc_sli4_release_queues()     # Release FCP/NVME wqs
                -> __lpfc_sli4_release_queue()
                   -> lpfc_sli4_queue_free()
                      -> kfree(queue->pring)   # second free

So, check for WQs in lpfc_sli4_queue_free() and do not free the pring
there, as it is freed earlier through the bound CQ.

[The WQs are created only via lpfc_wq_create(), which sets
struct lpfc_queue::type == LPFC_WQ. That happens in 2 sites
(lpfc_create_wq_cq() and lpfc_fof_queue_setup()), both of which bind the
CQs and WQs. Thus, checking for the LPFC_WQ type correlates to whether
the WQ is bound to a CQ, which is freed first.]

Additional details:

For reference, that binding also occurs in one other function:

    lpfc_fof_queue_setup():
        ...
        rc = lpfc_cq_create(phba, phba->sli4_hba.oas_cq, <...>)
        ...
        rc = lpfc_wq_create(phba, phba->sli4_hba.oas_wq, <...>)
        ...
        /* Bind this CQ/WQ to the NVME ring */
        pring = phba->sli4_hba.oas_wq->pring;
        ...
        phba->sli4_hba.oas_cq->pring = pring;

It also used to occur similarly in lpfc_sli4_queue_setup(), but that was
changed by the commit; the problem is more related to the new freeing
pattern introduced in lpfc_sli4_queue_destroy() plus the bound CQs/WQs.
    -	/* Bind this WQ to the next FCP ring */
    -	pring = >ring[MAX_SLI3_CONFIGURED_RINGS + fcp_wqidx];
    ...
    -	phba->sli4_hba.fcp_cq[fcp_wqidx]->pring = pring;

commit 85e8a23936ab ("scsi: lpfc: Add shutdown method for kexec") made
this more likely, as lpfc_pci_remove_one() is called on driver shutdown
(e.g., modprobe -r / rmmod).

(This patch is partially based on a different patch suggested by
Johannes, thus adding a Suggested-by tag for due credit.)

Signed-off-by: Mauricio Faria de Oliveira
Reported-by: Junichi Nomura
Suggested-by: Johannes Thumshirn
---
 drivers/scsi/lpfc/lpfc_sli.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/lpfc/lpfc_sli.c b/drivers/scsi/lpfc/lpfc_sli.c
index 1c9fa45df7eb..8befe841adaa 100644
--- a/drivers/scsi/lpfc/lpfc_sli.c
+++ b/drivers/scsi/lpfc/lpfc_sli.c
@@ -13758,7 +13758,14 @@ void lpfc_sli4_els_xri_abort_event_proc(struct lpfc_hba *phba)
 		lpfc_free_rq_buffer(queue->phba, queue);
 		kfree(queue->rqbp);
 	}
-	kfree(queue->pring);
+
+	/*
+	 * The WQs/CQs' pring is bound (same pointer).
+	 * So free it only once, and not again on WQ.
+	 */
+	if (queue->type != LPFC_WQ)
+		kfree(queue->pring);
+
 	kfree(queue);
 	return;
 }
-- 
1.8.3.1
Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
On Mon, Apr 3, 2017 at 2:20 PM, Logan Gunthorpe wrote:
> Hi Christoph,
>
> What are your thoughts on an approach like the following untested
> draft patch?
>
> The patch (if fleshed out) makes it so iomem can be used in an sgl,
> and WARN_ONs will occur in places where drivers attempt to access
> iomem directly through the sgl.
>
> I'd also probably create a p2pmem_alloc_sgl helper function so driver
> writers wouldn't have to mess with sg_set_iomem_page.
>
> With all that in place, it should be relatively safe for drivers to
> implement p2pmem even though we'd still technically be violating the
> __iomem boundary in some places.

Just reacting to this mail, I still haven't had a chance to take a
look at the rest of the series.

The pfn_t type was invented to carry extra type and page lookup
information about the memory behind a given pfn. At first glance that
seems a more natural place to carry an indication that this is an
"I/O" pfn.
Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.
Hi Christoph,

What are your thoughts on an approach like the following untested
draft patch?

The patch (if fleshed out) makes it so iomem can be used in an sgl,
and WARN_ONs will occur in places where drivers attempt to access
iomem directly through the sgl.

I'd also probably create a p2pmem_alloc_sgl helper function so driver
writers wouldn't have to mess with sg_set_iomem_page.

With all that in place, it should be relatively safe for drivers to
implement p2pmem even though we'd still technically be violating the
__iomem boundary in some places.

Logan

commit b435a154a4ec4f82766f6ab838092c3c5a9388ac
Author: Logan Gunthorpe
Date:   Wed Feb 8 12:44:52 2017 -0700

    scatterlist: Add support for iomem pages

    This patch steals another bit from the page_link field to indicate
    whether the sg points to iomem. In sg_copy_buffer we use this flag
    to select between memcpy and iomemcpy. Other sg_miter users will get
    a WARN_ON unless they indicate they support iomemory by setting the
    SG_MITER_IOMEM flag.

    Also added are sg_kmap functions which would replace a common
    pattern of kmap(sg_page(sg)). These new functions then also warn if
    the caller tries to map io memory. Another option may be to
    automatically copy the iomem to a new page and return that
    transparently to the driver.

    Another coccinelle patch would then be done to convert
    kmap(sg_page(sg)) instances to the appropriate sg_kmap calls.
Signed-off-by: Logan Gunthorpe

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 0007b79..bd690a2c 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -37,6 +37,9 @@
 #include

+/* Avoid the highmem.h macro from aliasing our ops->kunmap_atomic */
+#undef kunmap_atomic
+
 static inline int is_dma_buf_file(struct file *);

 struct dma_buf_list {
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index cb3c8fe..7608da0 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -5,6 +5,7 @@
 #include
 #include
 #include
+#include
 #include

 struct scatterlist {
@@ -53,6 +54,9 @@ struct sg_table {
  *
  * If bit 1 is set, then this sg entry is the last element in a list.
  *
+ * We also use bit 2 to indicate whether the page_link points to an
+ * iomem page or not.
+ *
  * See sg_next().
  *
  */
@@ -64,10 +68,17 @@ struct sg_table {
  * a valid sg entry, or whether it points to the start of a new scatterlist.
  * Those low bits are there for everyone! (thanks mason :-)
  */
-#define sg_is_chain(sg)		((sg)->page_link & 0x01)
-#define sg_is_last(sg)		((sg)->page_link & 0x02)
+#define PAGE_LINK_MASK		0x7
+#define PAGE_LINK_CHAIN		0x1
+#define PAGE_LINK_LAST		0x2
+#define PAGE_LINK_IOMEM		0x4
+
+#define sg_is_chain(sg)		((sg)->page_link & PAGE_LINK_CHAIN)
+#define sg_is_last(sg)		((sg)->page_link & PAGE_LINK_LAST)
 #define sg_chain_ptr(sg)	\
-	((struct scatterlist *) ((sg)->page_link & ~0x03))
+	((struct scatterlist *) ((sg)->page_link & ~(PAGE_LINK_CHAIN | \
+						     PAGE_LINK_LAST)))
+#define sg_is_iomem(sg)		((sg)->page_link & PAGE_LINK_IOMEM)

 /**
  * sg_assign_page - Assign a given page to an SG entry
@@ -81,13 +92,13 @@ struct sg_table {
  **/
 static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
 {
-	unsigned long page_link = sg->page_link & 0x3;
+	unsigned long page_link = sg->page_link & PAGE_LINK_MASK;

 	/*
 	 * In order for the low bit stealing approach to work, pages
-	 * must be aligned at a 32-bit boundary as a minimum.
+	 * must be aligned at a 64-bit boundary as a minimum.
 	 */
-	BUG_ON((unsigned long) page & 0x03);
+	BUG_ON((unsigned long) page & PAGE_LINK_MASK);
 #ifdef CONFIG_DEBUG_SG
 	BUG_ON(sg->sg_magic != SG_MAGIC);
 	BUG_ON(sg_is_chain(sg));
@@ -117,13 +128,56 @@ static inline void sg_set_page(struct scatterlist *sg, struct page *page,
 	sg->length = len;
 }

+/**
+ * sg_set_iomem_page - Set sg entry to point at given iomem page
+ * @sg:		SG entry
+ * @page:	The page
+ * @len:	Length of data
+ * @offset:	Offset into page
+ *
+ * Description:
+ *   Same as sg_set_page but used when the page is a ZONE_DEVICE page
+ *   that points to IO memory.
+ *
+ **/
+static inline void sg_set_iomem_page(struct scatterlist *sg, struct page *page,
+				     unsigned int len, unsigned int offset)
+{
+	sg_set_page(sg, page, len, offset);
+	sg->page_link |= PAGE_LINK_IOMEM;
+}
+
 static inline struct page *sg_page(struct scatterlist *sg)
 {
 #ifdef CONFIG_DEBUG_SG
 	BUG_ON(sg->sg_magic != SG_MAGIC);
 	BUG_ON(sg_is_chain(sg));
 #endif
-	return (struct page *)((sg)->page_link & ~0x3);
+	return (struct page *)((sg)->page_link &
Re: [PATCH] libiscsi: use vzalloc for large allocations in iscsi_pool_init
On Mon, Apr 3, 2017 at 6:30 AM, Kyle Fortin wrote:
>
> 	for (i = 0; i < q->max; i++)
> 		kfree(q->pool[i]);
> -	kfree(q->pool);
> +	if (q->is_pool_vmalloc)

You could do something like:

	if (is_vmalloc_addr(q->pool))
		vfree(...);
	else
		kfree(...);

and then remove the bool.

Chetan
[Bug 195233] New: sd: discard_granularity set smaller than minimum_io_size when LBPRZ=1
https://bugzilla.kernel.org/show_bug.cgi?id=195233

            Bug ID: 195233
           Summary: sd: discard_granularity set smaller than
                    minimum_io_size when LBPRZ=1
           Product: IO/Storage
           Version: 2.5
    Kernel Version: 4.4.0
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: SCSI
          Assignee: linux-scsi@vger.kernel.org
          Reporter: dbuck...@oreilly.com
        Regression: No

Between the 3.10 and 4.4 kernels, this patch:
https://patchwork.kernel.org/patch/6995131/ was merged to ensure
discard_granularity is set to logical_block_size when a disk reports
LBPRZ=1.

In a case where a disk reports LBPRZ=1 with a 512 logical_block_size and
4096 physical_block_size and minimum_io_size, this results in
discard_granularity incorrectly being set to 512, which at least in my
case results in discard requests being ignored.

I believe the easiest fix would be changing:

	if (sdkp->lbprz) {
		q->limits.discard_alignment = 0;
		q->limits.discard_granularity = logical_block_size;
	} else {

to:

	if (sdkp->lbprz) {
		q->limits.discard_alignment = 0;
		q->limits.discard_granularity = max(logical_block_size,
						    minimum_io_size);
	} else {

but any change which results in discard_granularity being set properly
would solve this.

I have verified that with a 3.10 kernel discard_granularity is set to
4096 and works as expected, but with 4.4 it is set to 512 and does not
work.

This is a fairly severe bug for my use case, as I rely on discard to free
unused blocks from the storage backing LUNs mounted by Linux clients, and
this effectively makes critical functions like thin provisioning and
snapshots infeasible for 4.x clients. I have only tested this with the
Ubuntu-packaged kernels, but the problematic code is still present in the
current source.

I'm happy to provide any additional information that might be helpful.

-- 
You are receiving this mail because:
You are the assignee for the bug.
Re: linux-next: Tree for Apr 3 (scsi/lpfc)
On 04/03/17 01:13, Stephen Rothwell wrote:
> Hi all,
>
> Changes since 20170331:

on i386, when SCSI_LPFC=y and:

	CONFIG_NVME_CORE=m
	CONFIG_BLK_DEV_NVME=m
	CONFIG_BLK_DEV_NVME_SCSI=y
	CONFIG_NVME_FABRICS=m
	CONFIG_NVME_FC=m
	CONFIG_NVME_TARGET=m

drivers/built-in.o: In function `lpfc_nvme_create_localport':
(.text+0x28ce6b): undefined reference to `nvme_fc_register_localport'
drivers/built-in.o: In function `lpfc_nvme_destroy_localport':
(.text+0x28d263): undefined reference to `nvme_fc_unregister_remoteport'
drivers/built-in.o: In function `lpfc_nvme_destroy_localport':
(.text+0x28d2d3): undefined reference to `nvme_fc_unregister_localport'
drivers/built-in.o: In function `lpfc_nvme_register_port':
(.text+0x28d576): undefined reference to `nvme_fc_register_remoteport'
drivers/built-in.o: In function `lpfc_nvme_unregister_port':
(.text+0x28d93c): undefined reference to `nvme_fc_unregister_remoteport'

so SCSI_LPFC depends on NVME_FC...

Reported-by: Randy Dunlap

-- 
~Randy
Re: [PATCH] target/user: PGR Support
On 04/03/2017 12:20 AM, Shie-rei Huang wrote:
> After a PGR command is processed in the kernel, is it possible for the
> user mode to be notified of the command, so that the user mode has a
> chance to do its part of the PGR processing? Below is one use case of
> it. Suppose two TCMU servers are set up as an active/passive cluster
> for high availability. An initiator's PGR command is sent to the
> active server only. But the active server's user mode would like the
> passive server to know about the PGR command also and sync up its
> configfs state with the active server. This is to make sure that when
> the active server dies, the passive server can take over with the
> correct PGR state.

We have been discussing this in a couple of different threads, and it
does not work right now as you saw, but I am looking into it and I think
some other people are too.

For HA we would need to handle these:

1. We can't just pass a raw PGR command to userspace, because we do not
have enough info. For commands where we need to know the I_T nexus the
command came in on, we need to pass that info upwards; for example, a
registration with SPEC_I_PT=0 ALL_TG_PT=0. If we did a new netlink
interface then we could pass that extra info with the processed PGR
info, or we could modify the existing tcmu one to pass the raw PGR
command with the transport info.

2. We do not want to return status for the PGR command before userspace
has distributed the PGR info. So we can't do something really simple
like just send a netlink event and return status for the command
immediately, and we can't just modify the APTPL file to be usable when
APTPL is not used and then just watch it for changes from userspace. We
need some extra coordination: send an event to pass the PGR/I_T nexus
info, then the kernel needs to wait for a response, and then it can
return status for the command.

> Thanks,
>
> -Shie-rei
>
> On Thu, Mar 30, 2017 at 7:47 AM, Bryant G. Ly wrote:
>> This adds PGR support for just TCMU, since tcmu doesn't
>> have the necessary IT_NEXUS info to process PGR in userspace,
>> so have those commands be processed in kernel.
>>
>> Signed-off-by: Bryant G. Ly
>> ---
>>  drivers/target/target_core_configfs.c | 10 +-
>>  drivers/target/target_core_device.c   | 27 +++
>>  drivers/target/target_core_pr.c       |  2 +-
>>  drivers/target/target_core_pscsi.c    |  3 ++-
>>  include/target/target_core_backend.h  |  1 +
>>  5 files changed, 36 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/target/target_core_configfs.c
>> b/drivers/target/target_core_configfs.c
>> index 38b5025..edfb098 100644
>> --- a/drivers/target/target_core_configfs.c
>> +++ b/drivers/target/target_core_configfs.c
>> @@ -1366,7 +1366,7 @@ static ssize_t target_pr_res_holder_show(struct
>> config_item *item, char *page)
>>  	struct se_device *dev = pr_to_dev(item);
>>  	int ret;
>>
>> -	if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH)
>> +	if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR)
>>  		return sprintf(page, "Passthrough\n");
>>
>>  	spin_lock(>dev_reservation_lock);
>> @@ -1506,7 +1506,7 @@ static ssize_t target_pr_res_type_show(struct
>> config_item *item, char *page)
>>  {
>>  	struct se_device *dev = pr_to_dev(item);
>>
>> -	if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH)
>> +	if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR)
>>  		return sprintf(page, "SPC_PASSTHROUGH\n");
>>  	else if (dev->dev_reservation_flags & DRF_SPC2_RESERVATIONS)
>>  		return sprintf(page, "SPC2_RESERVATIONS\n");
>> @@ -1519,7 +1519,7 @@ static ssize_t target_pr_res_aptpl_active_show(struct
>> config_item *item,
>>  {
>>  	struct se_device *dev = pr_to_dev(item);
>>
>> -	if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH)
>> +	if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR)
>>  		return 0;
>>
>>  	return sprintf(page, "APTPL Bit Status: %s\n",
>> @@ -1531,7 +1531,7 @@ static ssize_t
>> target_pr_res_aptpl_metadata_show(struct config_item *item,
>>  {
>>  	struct se_device *dev = pr_to_dev(item);
>>
>> -	if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH)
>> +	if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR)
>>  		return 0;
>>
>>  	return sprintf(page, "Ready to process PR APTPL metadata..\n");
>> @@ -1577,7 +1577,7 @@ static ssize_t
>> target_pr_res_aptpl_metadata_store(struct config_item *item,
>>  	u16 tpgt = 0;
>>  	u8 type = 0;
>>
>> -	if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH)
>> +	if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR)
>>  		return count;
>>  	if
Re: scsi-mq - tag# and can_queue, performance.
On Mon, 3 Apr 2017, 9:47am, Jens Axboe wrote:
> On 04/03/2017 10:41 AM, Arun Easi wrote:
>> On Mon, 3 Apr 2017, 8:20am, Bart Van Assche wrote:
>>> On Mon, 2017-04-03 at 09:29 +0200, Hannes Reinecke wrote:
>>>> On 04/03/2017 08:37 AM, Arun Easi wrote:
>>>>> If the above is true, then for a LLD to get tag# within its
>>>>> max-tasks range, it has to report max-tasks / number-of-hw-queues
>>>>> in can_queue, and in the I/O path, use the tag and hwq# to arrive
>>>>> at an index# to use. This, though, leads to a poor use of tag
>>>>> resources -- the queue reaching its capacity while the LLD can
>>>>> still take it.
>>>>
>>>> Shared tag sets continue to dog block-mq on 'legacy' (ie non-NVMe)
>>>> HBAs. ATM the only 'real' solution to this problem is indeed to have
>>>> a static split of the entire tag space by the number of hardware
>>>> queues. With the mentioned tag-starvation problem.
>>>
>>> Hello Arun and Hannes,
>>>
>>> Apparently the current blk_mq_alloc_tag_set() implementation is well
>>> suited for drivers like NVMe and ib_srp but not for traditional SCSI
>>> HBA drivers. How about adding a BLK_MQ_F_* flag that tells
>>> __blk_mq_alloc_rq_maps() to allocate a single set of tags for all
>>> hardware queues, and also adding a flag to struct scsi_host_template
>>> such that SCSI LLDs can enable this behavior?
>>
>> Hi Bart,
>>
>> This would certainly be beneficial in my case. Moreover, it certainly
>> makes sense to move the logic up where multiple drivers can leverage
>> it.
>>
>> Perhaps use the percpu_ida* interfaces to do that, but I think I read
>> somewhere that it is not efficient (enough?) and is the reason to go
>> the current way for block tags.
>
> You don't have to change the underlying tag generation to solve this
> problem, Bart already pretty much outlined a fix that would work.
> percpu_ida works fine if you never use more than roughly half the
> available space, it's a poor fit for request tags where we want to
> retain good behavior and scaling at or near tag exhaustion. That's why
> blk-mq ended up rolling its own, which is now generically available as
> lib/sbitmap.c.

Sounds good. Thanks for the education, Jens.

Regards,
-Arun
Re: scsi-mq - tag# and can_queue, performance.
On 04/03/2017 10:41 AM, Arun Easi wrote:
> On Mon, 3 Apr 2017, 8:20am, Bart Van Assche wrote:
>> On Mon, 2017-04-03 at 09:29 +0200, Hannes Reinecke wrote:
>>> On 04/03/2017 08:37 AM, Arun Easi wrote:
>>>> If the above is true, then for a LLD to get tag# within its
>>>> max-tasks range, it has to report max-tasks / number-of-hw-queues
>>>> in can_queue, and in the I/O path, use the tag and hwq# to arrive
>>>> at an index# to use. This, though, leads to a poor use of tag
>>>> resources -- the queue reaching its capacity while the LLD can
>>>> still take it.
>>>
>>> Shared tag sets continue to dog block-mq on 'legacy' (ie non-NVMe)
>>> HBAs. ATM the only 'real' solution to this problem is indeed to have
>>> a static split of the entire tag space by the number of hardware
>>> queues. With the mentioned tag-starvation problem.
>>
>> Hello Arun and Hannes,
>>
>> Apparently the current blk_mq_alloc_tag_set() implementation is well
>> suited for drivers like NVMe and ib_srp but not for traditional SCSI
>> HBA drivers. How about adding a BLK_MQ_F_* flag that tells
>> __blk_mq_alloc_rq_maps() to allocate a single set of tags for all
>> hardware queues, and also adding a flag to struct scsi_host_template
>> such that SCSI LLDs can enable this behavior?
>
> Hi Bart,
>
> This would certainly be beneficial in my case. Moreover, it certainly
> makes sense to move the logic up where multiple drivers can leverage
> it.
>
> Perhaps use the percpu_ida* interfaces to do that, but I think I read
> somewhere that it is not efficient (enough?) and is the reason to go
> the current way for block tags.

You don't have to change the underlying tag generation to solve this
problem, Bart already pretty much outlined a fix that would work.
percpu_ida works fine if you never use more than roughly half the
available space, it's a poor fit for request tags where we want to
retain good behavior and scaling at or near tag exhaustion. That's why
blk-mq ended up rolling its own, which is now generically available as
lib/sbitmap.c.

-- 
Jens Axboe
Re: scsi-mq - tag# and can_queue, performance.
On Mon, 3 Apr 2017, 8:20am, Bart Van Assche wrote:
> On Mon, 2017-04-03 at 09:29 +0200, Hannes Reinecke wrote:
>> On 04/03/2017 08:37 AM, Arun Easi wrote:
>>> If the above is true, then for a LLD to get tag# within its
>>> max-tasks range, it has to report max-tasks / number-of-hw-queues in
>>> can_queue, and in the I/O path, use the tag and hwq# to arrive at an
>>> index# to use. This, though, leads to a poor use of tag resources --
>>> the queue reaching its capacity while the LLD can still take it.
>>
>> Shared tag sets continue to dog block-mq on 'legacy' (ie non-NVMe)
>> HBAs. ATM the only 'real' solution to this problem is indeed to have
>> a static split of the entire tag space by the number of hardware
>> queues. With the mentioned tag-starvation problem.
>
> Hello Arun and Hannes,
>
> Apparently the current blk_mq_alloc_tag_set() implementation is well
> suited for drivers like NVMe and ib_srp but not for traditional SCSI
> HBA drivers. How about adding a BLK_MQ_F_* flag that tells
> __blk_mq_alloc_rq_maps() to allocate a single set of tags for all
> hardware queues, and also adding a flag to struct scsi_host_template
> such that SCSI LLDs can enable this behavior?

Hi Bart,

This would certainly be beneficial in my case. Moreover, it certainly
makes sense to move the logic up where multiple drivers can leverage it.

Perhaps use the percpu_ida* interfaces to do that, but I think I read
somewhere that it is not efficient (enough?) and is the reason to go the
current way for block tags.

Regards,
-Arun
Re: scsi-mq - tag# and can_queue, performance.
On Mon, 3 Apr 2017, 12:29am, Hannes Reinecke wrote: > On 04/03/2017 08:37 AM, Arun Easi wrote: > > Hi Folks, > > > > I would like to seek your input on a few topics on SCSI / block > > multi-queue. > > > > 1. Tag# generation. > > > > The context is with SCSI MQ on. My question is, what should a LLD do to > > get request tag values in the range 0 through can_queue - 1 across *all* > > of the queues. In our QLogic 41XXX series of adapters, we have a per > > session submit queue, a shared task memory (shared across all queues) and > > N completion queues (separate MSI-X vectors). We report N as the > > nr_hw_queues. I would like to, if possible, use the block layer tags to > > index into the above shared task memory area. > > > > From looking at the scsi/block source, it appears that when a LLD reports > > a value say #C, in can_queue (via scsi_host_template), that value is used > > as the max depth when corresponding block layer queues are created. So, > > while SCSI restricts the number of commands to LLD at #C, the request tag > > generated across any of the queues can range from 0..#C-1. Please correct > > me if I got this wrong. > > > > If the above is true, then for a LLD to get tag# within it's max-tasks > > range, it has to report max-tasks / number-of-hw-queues in can_queue, and > > in the I/O path, use the tag and hwq# to arrive at a index# to use. This, > > though, leads to a poor use of tag resources -- queue reaching it's > > capacity while LLD can still take it. > > > Yep. > > > blk_mq_unique_tag() would not work here, because it just puts the hwq# in > > the upper 16 bits, which need not fall in the max-tasks range. > > > > Perhaps the current MQ model is to cater to a queue pair > > (submit/completion) kind of hardware model; nevertheless I would like to > > know how other hardware variants can makes use of it. > > > He. Welcome to the club. > > Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe) > HBAs. 
ATM the only 'real' solution to this problem is indeed to have a > static split of the entire tag space by the number of hardware queues. > With the mentioned tag-starvation problem. > > If we were to continue with the tag to hardware ID mapping, we would > need to implement a dynamic tag space mapping onto hardware queues. > My idea to that would be to not map the entire tag space, but rather the > individual bit words onto the hardware queue. Then we could make the > mapping dynamic, where the individual words are mapped onto the queues > only as needed. > However, the _one_ big problem we're facing here is timeouts. > With the 1:1 mapping between tags and hardware IDs we can only re-use > the tag once the timeout is _definitely_ resolved. But this means > the command will be active, and we cannot return blk_mq_complete() until > the timeout itself has been resolved. > With FC this essentially means until the corresponding XIDs are safe to > re-use, ie after all ABRT/RRQ etc processing has been completed. > Hence we totally lose the ability to return the command itself with > -ETIMEDOUT and continue with I/O processing even though the original XID > is still being held by firmware. > > In the light of this I wonder if it wouldn't be better to completely > decouple block-layer tags and hardware IDs, and have an efficient > algorithm mapping the block-layer tags onto hardware IDs. > That should avoid the arbitrary tag starvation problem, and would allow > us to handle timeouts efficiently. > Of course, we don't _have_ such an efficient algorithm; maybe it's time > to have a generic one within the kernel as quite some drivers would > _love_ to just use the generic implementation here. > (qla2xxx, lpfc, fcoe, mpt3sas etc all suffer from the same problem) > > > 2. mq vs non-mq performance gain. > > > > This is more like a poll, I guess. I was wondering what performance gains > > folks are observing with SCSI MQ on.
I saw Christoph H.'s slide deck that > > has one slide that shows a 200k IOPS gain. > > > > From my testing, though, I was not lucky to observe that big of a change. > > In fact, the difference was not even noticeable(*). For e.g., for 512 > > bytes random read test, in both cases, gave me in the vicinity of 2M IOPs. > > When I say both cases, I meant, one with scsi_mod's use_blk_mq set to 0 > > and another with 1 (LLD is reloaded when it is done). I only used one NUMA > > node for this run. The test was run on a x86_64 setup. > > > You _really_ should have listened to my talk at VAULT. Would you have a slide deck / minutes that could be shared? > For 'legacy' HBAs there indeed is not much of a performance gain to be > had; the max gain is indeed for heavy parallel I/O. I have multiple devices (I-T nexuses) in my setup, so definitely there are parallel I/Os. > And there even is a scheduler issue when running with a single > submission thread; there I've measured a performance _drop_ by up to > 50%. Which, as Jens claims,
[Bug 191471] Mp2Sas and Mpt3Sas (now mpxsas) drivers are spinning forever in the IRQ handler under load condition
https://bugzilla.kernel.org/show_bug.cgi?id=191471

Timur Tabi (ti...@codeaurora.org) changed:

           What    |Removed |Added
----------------------------------------------------------
                 CC|        |ti...@codeaurora.org

--- Comment #1 from Timur Tabi (ti...@codeaurora.org) ---
Created attachment 255737
  --> https://bugzilla.kernel.org/attachment.cgi?id=255737&action=edit
Proposed fix

-- 
You are receiving this mail because:
You are the assignee for the bug.
Re: scsi-mq - tag# and can_queue, performance.
On Mon, 2017-04-03 at 09:29 +0200, Hannes Reinecke wrote: > On 04/03/2017 08:37 AM, Arun Easi wrote: > > If the above is true, then for a LLD to get tag# within it's max-tasks > > range, it has to report max-tasks / number-of-hw-queues in can_queue, and > > in the I/O path, use the tag and hwq# to arrive at a index# to use. This, > > though, leads to a poor use of tag resources -- queue reaching it's > > capacity while LLD can still take it. > > Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe) > HBAs. ATM the only 'real' solution to this problem is indeed have a > static split of the entire tag space by the number of hardware queues. > With the mentioned tag-starvation problem. Hello Arun and Hannes, Apparently the current blk_mq_alloc_tag_set() implementation is well suited for drivers like NVMe and ib_srp but not for traditional SCSI HBA drivers. How about adding a BLK_MQ_F_* flag that tells __blk_mq_alloc_rq_maps() to allocate a single set of tags for all hardware queues and also to add a flag to struct scsi_host_template such that SCSI LLDs can enable this behavior? Bart.
Re: [PATCH] csiostor: switch to pci_alloc_irq_vectors
On Fri, Mar 31, 2017 at 12:25:27PM +0530, Christoph Hellwig wrote:
> On Fri, Jan 20, 2017 at 07:27:02PM -0500, Martin K. Petersen wrote:
> > > "Christoph" == Christoph Hellwig writes:
> >
> > Christoph> And get automatic MSI-X affinity for free.
> >
> > Chelsio folks: Please review and test!
>
> ping!

The csiostor driver is triggering a WARN_ON with this patch, at drivers/pci/msi.c:1193:

1172 int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
1173                                    unsigned int max_vecs, unsigned int flags,
1174                                    const struct irq_affinity *affd)
1175 {
1176         static const struct irq_affinity msi_default_affd;
1177         int vecs = -ENOSPC;
1178
1179         if (flags & PCI_IRQ_AFFINITY) {
...
1192         } else {
1193                 if (WARN_ON(affd))
1194                         affd = NULL;
1195         }

The PCI_IRQ_AFFINITY flag is missing:

+	cnt = pci_alloc_irq_vectors_affinity(hw->pdev, min, cnt,
+					     PCI_IRQ_MSIX, );
+	if (cnt < 0)
+		return cnt;
-	}
[PATCH] sas: remove sas_domain_release_transport
sas_domain_release_transport is unused since at least v3.13, remove it.

Signed-off-by: Johannes Thumshirn
---
 drivers/scsi/libsas/sas_init.c | 7 -------
 include/scsi/libsas.h          | 1 -
 2 files changed, 8 deletions(-)

diff --git a/drivers/scsi/libsas/sas_init.c b/drivers/scsi/libsas/sas_init.c
index 15ef8e2..64e9cdd 100644
--- a/drivers/scsi/libsas/sas_init.c
+++ b/drivers/scsi/libsas/sas_init.c
@@ -566,13 +566,6 @@ sas_domain_attach_transport(struct sas_domain_function_template *dft)
 }
 EXPORT_SYMBOL_GPL(sas_domain_attach_transport);
 
-
-void sas_domain_release_transport(struct scsi_transport_template *stt)
-{
-	sas_release_transport(stt);
-}
-EXPORT_SYMBOL_GPL(sas_domain_release_transport);
-
 /* -- SAS Class register/unregister -- */
 
 static int __init sas_class_init(void)

diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index dae99d7..dd0f72c 100644
--- a/include/scsi/libsas.h
+++ b/include/scsi/libsas.h
@@ -693,7 +693,6 @@ extern int sas_bios_param(struct scsi_device *, sector_t capacity, int *hsc);
 extern struct scsi_transport_template *
 sas_domain_attach_transport(struct sas_domain_function_template *);
-extern void sas_domain_release_transport(struct scsi_transport_template *);
 
 int sas_discover_root_expander(struct domain_device *);
-- 
2.10.2
Re: [PATCH] libiscsi: use vzalloc for large allocations in iscsi_pool_init
On Apr 3, 2017, at 9:41 AM, Johannes Thumshirn wrote: > > On Mon, Apr 03, 2017 at 06:30:21AM -0700, Kyle Fortin wrote: >> iscsiadm session login can fail with the following error: >> >> iscsiadm: Could not login to [iface: default, target: iqn.1986-03.com... >> iscsiadm: initiator reported error (9 - internal error) >> >> When /etc/iscsi/iscsid.conf sets node.session.cmds_max = 4096, it results >> in 64K-sized kmallocs per session. A system under fragmented slab >> pressure may not have any 64K objects available and fail iscsiadm session >> login. Even though memory objects of a smaller size are available, the >> large order allocation ends up failing. >> >> The kernel will print a warning and dump_stack, like below: > > There is a series of patches in Andrew's mmotm tree, which introduces > a kvmalloc() function that does exactly what you're looking for. > > Maybe you want to base your patch on top of it. > > -- > Johannes Thumshirn Storage > jthumsh...@suse.de +49 911 74053 689 > SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg > GF: Felix Imendörffer, Jane Smithard, Graham Norton > HRB 21284 (AG Nürnberg) > Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850 Thanks Johannes. I’ll take a look. -- Kyle Fortin - Oracle Linux Engineering
Re: [PATCH] libiscsi: use vzalloc for large allocations in iscsi_pool_init
On Mon, Apr 03, 2017 at 06:30:21AM -0700, Kyle Fortin wrote: > iscsiadm session login can fail with the following error: > > iscsiadm: Could not login to [iface: default, target: iqn.1986-03.com... > iscsiadm: initiator reported error (9 - internal error) > > When /etc/iscsi/iscsid.conf sets node.session.cmds_max = 4096, it results > in 64K-sized kmallocs per session. A system under fragmented slab > pressure may not have any 64K objects available and fail iscsiadm session > login. Even though memory objects of a smaller size are available, the > large order allocation ends up failing. > > The kernel will print a warning and dump_stack, like below: There is a series of patches in Andrew's mmotm tree, which introduces a kvmalloc() function that does exactly what you're looking for. Maybe you want to base your patch on top of it. -- Johannes Thumshirn Storage jthumsh...@suse.de +49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
[PATCH] libiscsi: use vzalloc for large allocations in iscsi_pool_init
iscsiadm session login can fail with the following error:

iscsiadm: Could not login to [iface: default, target: iqn.1986-03.com...
iscsiadm: initiator reported error (9 - internal error)

When /etc/iscsi/iscsid.conf sets node.session.cmds_max = 4096, it results in 64K-sized kmallocs per session. A system under fragmented slab pressure may not have any 64K objects available and fail iscsiadm session login. Even though memory objects of a smaller size are available, the large order allocation ends up failing.

The kernel will print a warning and dump_stack, like below:

iscsid: page allocation failure: order:4, mode:0xc0d0
CPU: 0 PID: 2456 Comm: iscsid Not tainted 4.1.12-61.1.28.el6uek.x86_64 #2
Call Trace:
 [] dump_stack+0x63/0x83
 [] warn_alloc_failed+0xea/0x140
 [] __alloc_pages_slowpath+0x409/0x760
 [] __alloc_pages_nodemask+0x2b1/0x2d0
 [] ? dev_attr_host_ipaddress+0x20/0xc722
 [] alloc_pages_current+0xaf/0x170
 [] alloc_kmem_pages+0x31/0xd0
 [] ? iscsi_transport_group+0x20/0xc7e2
 [] kmalloc_order+0x18/0x50
 [] kmalloc_order_trace+0x34/0xe0
 [] ? transport_remove_classdev+0x70/0x70
 [] __kmalloc+0x27d/0x2a0
 [] ? complete_all+0x4d/0x60
 [] iscsi_pool_init+0x69/0x160 [libiscsi]
 [] ? device_initialize+0xb0/0xd0
 [] iscsi_session_setup+0x180/0x2f4 [libiscsi]
 [] ? iscsi_max_lun+0x20/0xfa9e [iscsi_tcp]
 [] iscsi_sw_tcp_session_create+0xcf/0x150 [iscsi_tcp]
 [] ? iscsi_max_lun+0x20/0xfa9e [iscsi_tcp]
 [] iscsi_if_create_session+0x33/0xd0
 [] ? iscsi_max_lun+0x20/0xfa9e [iscsi_tcp]
 [] iscsi_if_recv_msg+0x508/0x8c0 [scsi_transport_iscsi]
 [] ? __alloc_pages_nodemask+0x19b/0x2d0
 [] ? __kmalloc_node_track_caller+0x209/0x2c0
 [] iscsi_if_rx+0x7c/0x200 [scsi_transport_iscsi]
 [] netlink_unicast+0x126/0x1c0
 [] netlink_sendmsg+0x36c/0x400
 [] sock_sendmsg+0x4d/0x60
 [] ___sys_sendmsg+0x30a/0x330
 [] ? handle_pte_fault+0x20c/0x230
 [] ? __handle_mm_fault+0x1bc/0x330
 [] ? handle_mm_fault+0xb2/0x1a0
 [] __sys_sendmsg+0x49/0x90
 [] SyS_sendmsg+0x19/0x20
 [] system_call_fastpath+0x12/0x71

Use vzalloc for iscsi_pool allocations larger than PAGE_SIZE. This only affects hosts using a non-standard larger /etc/iscsi/iscsid.conf node.session.cmds_max value. Since iscsi_pool_init is also called to allocate very small pools per cmd for r2t handling, it is best to retain using kzalloc for those allocations.

Signed-off-by: Kyle Fortin
Tested-by: Kyle Fortin
Reviewed-by: Joseph Slember
Reviewed-by: Lance Hartmann
---
 drivers/scsi/libiscsi.c | 15 +++++++++++++--
 include/scsi/libiscsi.h |  1 +
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index 3fca34a675af..5a622ba2f10d 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -27,6 +27,7 @@
 #include
 #include
 #include
+#include <linux/vmalloc.h>
 #include
 #include
 #include
@@ -2546,6 +2547,7 @@ int iscsi_eh_recover_target(struct scsi_cmnd *sc)
 iscsi_pool_init(struct iscsi_pool *q, int max, void ***items, int item_size)
 {
 	int i, num_arrays = 1;
+	int alloc_size;
 
 	memset(q, 0, sizeof(*q));
 
@@ -2555,7 +2557,13 @@ int iscsi_eh_recover_target(struct scsi_cmnd *sc)
 	 * the array.
 	 */
 	if (items)
 		num_arrays++;
-	q->pool = kzalloc(num_arrays * max * sizeof(void*), GFP_KERNEL);
+
+	alloc_size = num_arrays * max * sizeof(void *);
+	if (alloc_size > PAGE_SIZE) {
+		q->pool = vzalloc(alloc_size);
+		q->is_pool_vmalloc = true;
+	} else
+		q->pool = kzalloc(alloc_size, GFP_KERNEL);
 	if (q->pool == NULL)
 		return -ENOMEM;
 
@@ -2589,7 +2597,10 @@ void iscsi_pool_free(struct iscsi_pool *q)
 	for (i = 0; i < q->max; i++)
 		kfree(q->pool[i]);
-	kfree(q->pool);
+	if (q->is_pool_vmalloc)
+		vfree(q->pool);
+	else
+		kfree(q->pool);
 }
 EXPORT_SYMBOL_GPL(iscsi_pool_free);

diff --git a/include/scsi/libiscsi.h b/include/scsi/libiscsi.h
index 583875ea136a..e3421e527559 100644
--- a/include/scsi/libiscsi.h
+++ b/include/scsi/libiscsi.h
@@ -258,6 +258,7 @@ struct iscsi_pool {
 	struct kfifo	queue;		/* FIFO Queue */
 	void		**pool;		/* Pool of elements */
 	int		max;		/* Max number of elements */
+	bool		is_pool_vmalloc;
 };
 
 /* Session's states */
-- 
1.8.3.1
Re: [PATCH v2] scsi: storvsc: Add support for FC rport.
On 04/03/2017 08:17 AM, Christoph Hellwig wrote:
>> 	if (host->transportt == fc_transport_template) {
>> +		struct fc_rport_identifiers ids = {
>> +			.roles = FC_PORT_ROLE_FCP_TARGET,
>> +		};
>
> I don't think storvsc ever acts as FCP target.

In order to make the SCSI scan work, indicating FC_PORT_ROLE_FCP_TARGET as a role was necessary due to its test in fc_scsi_scan_rport. The idea here is to avoid making any changes to the fc_transport driver, which was of some concern.

Thanks,
Cathy
Re: [PATCH v2] scsi: storvsc: Add support for FC rport.
> 	if (host->transportt == fc_transport_template) {
> +		struct fc_rport_identifiers ids = {
> +			.roles = FC_PORT_ROLE_FCP_TARGET,
> +		};

I don't think storvsc ever acts as FCP target.
[PATCH 1/2] scsi: convert unrecovered read error to -EILSEQ
It is quite easy to get an URE after a power failure, and the resulting message looks scary. The URE happens due to an internal drive CRC mismatch caused by a partial sector update. Most people interpret such a message as "my drive is dying", which is a reasonable assumption if your dmesg is full of complaints from disks and read(2) returns EIO. In fact this error is not fatal: one can fix it easily by rewriting the affected sector. So we should handle an URE as follows:
- Return EILSEQ to signal the caller that this is a bad-data problem
- Do not retry the command, because retrying is useless.

### Test case
# Test uses two HDDs: sdb sdc
# Write_phase
# let fio work ~100 sec and then cut the power
fio --ioengine=libaio --direct=1 --rw=write --bs=1M --iodepth=16 \
    --time_based=1 --runtime=600 --filesize=1G --size=1T \
    --name /dev/sdb --name /dev/sdc
# Check_phase after the system comes back
fio --ioengine=libaio --direct=1 --group_reporting --rw=read --bs=1M \
    --iodepth=16 --size=1G --filesize=1G --name=/dev/sdb --name /dev/sdc

More info about URE probability here:
https://plus.google.com/101761226576930717211/posts/Pctq7kk1dLL

Signed-off-by: Dmitry Monakhov
---
 drivers/scsi/scsi_lib.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 19125d7..59d64ad 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -961,6 +961,19 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 			/* See SSC3rXX or current. */
 			action = ACTION_FAIL;
 			break;
+		case MEDIUM_ERROR:
+			if (sshdr.asc == 0x11) {
+				/* Handle unrecovered read error */
+				switch (sshdr.ascq) {
+				case 0x00: /* URE */
+				case 0x04: /* URE auto reallocate failed */
+				case 0x0B: /* URE recommend reassignment */
+				case 0x0C: /* URE recommend rewrite the data */
+					action = ACTION_FAIL;
+					error = -EILSEQ;
+					break;
+				}
+			}
 		default:
 			action = ACTION_FAIL;
 			break;
-- 
2.9.3
[PATCH 2/2] block: Improve error handling verbosity
EILSEQ is returned due to an internal csum error on disk/fabric; let's add a special message to distinguish it from others. Also dump the original numerical error code.

Signed-off-by: Dmitry Monakhov
---
 block/blk-core.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 071a998..8eab846 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2576,13 +2576,16 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
 		case -ENODATA:
 			error_type = "critical medium";
 			break;
+		case -EILSEQ:
+			error_type = "bad data";
+			break;
 		case -EIO:
 		default:
 			error_type = "I/O";
 			break;
 		}
-		printk_ratelimited(KERN_ERR "%s: %s error, dev %s, sector %llu\n",
-				   __func__, error_type, req->rq_disk ?
+		printk_ratelimited(KERN_ERR "%s: %s error (%d), dev %s, sector %llu\n",
+				   __func__, error_type, error, req->rq_disk ?
 				   req->rq_disk->disk_name : "?",
 				   (unsigned long long)blk_rq_pos(req));
-- 
2.9.3
[PATCH v2] scsi: storvsc: Add support for FC rport.
Included in the current storvsc driver for Hyper-V is the ability to access LUNs on an FC fabric via a virtualized Fibre Channel adapter exposed by the Hyper-V host. The driver also attaches to the FC transport to allow host and port names to be published under /sys/class/fc_host/hostX. Current customer tools running on the VM require that these names be available in the well known standard location under fc_host/hostX.

A problem arose when attaching to the FC transport. The scsi_scan code attempts to call fc_user_scan, which has basically become a no-op due to the fact that the virtualized FC device does not expose rports. At this point you cannot refresh the scsi bus after mapping or unmapping LUNs on the SAN without a reboot of the VM.

This patch stubs in an rport per fc_host in storvsc so that the requirement of a defined rport is now met within the fc_transport, and echo "- - -" > /sys/class/scsi_host/hostX/scan now works.

Signed-off-by: Cathy Avery
---
Changes since v1:
- Fix fc_rport_identifiers init [Stephen Hemminger]
- Better error checking
---
 drivers/scsi/storvsc_drv.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 638e5f4..37646d1 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -478,6 +478,9 @@ struct storvsc_device {
 	 */
 	u64 node_name;
 	u64 port_name;
+#if IS_ENABLED(CONFIG_SCSI_FC_ATTRS)
+	struct fc_rport *rport;
+#endif
 };
 
 struct hv_host_device {
@@ -1816,19 +1819,27 @@ static int storvsc_probe(struct hv_device *device,
 		target = (device->dev_instance.b[5] << 8 |
 			  device->dev_instance.b[4]);
 		ret = scsi_add_device(host, 0, target, 0);
-		if (ret) {
-			scsi_remove_host(host);
-			goto err_out2;
-		}
+		if (ret)
+			goto err_out3;
 	}
 #if IS_ENABLED(CONFIG_SCSI_FC_ATTRS)
 	if (host->transportt == fc_transport_template) {
+		struct fc_rport_identifiers ids = {
+			.roles = FC_PORT_ROLE_FCP_TARGET,
+		};
+
 		fc_host_node_name(host) = stor_device->node_name;
 		fc_host_port_name(host) = stor_device->port_name;
+		stor_device->rport = fc_remote_port_add(host, 0, &ids);
+		if (!stor_device->rport)
+			goto err_out3;
 	}
 #endif
 	return 0;
 
+err_out3:
+	scsi_remove_host(host);
+
 err_out2:
 	/*
 	 * Once we have connected with the host, we would need to
@@ -1854,8 +1865,10 @@ static int storvsc_remove(struct hv_device *dev)
 	struct Scsi_Host *host = stor_device->host;
 
 #if IS_ENABLED(CONFIG_SCSI_FC_ATTRS)
-	if (host->transportt == fc_transport_template)
+	if (host->transportt == fc_transport_template) {
+		fc_remote_port_delete(stor_device->rport);
 		fc_remove_host(host);
+	}
 #endif
 	scsi_remove_host(host);
 	storvsc_dev_remove(dev);
-- 
2.5.0
Re: scsi-mq - tag# and can_queue, performance.
On 04/03/2017 08:37 AM, Arun Easi wrote: > Hi Folks, > > I would like to seek your input on a few topics on SCSI / block > multi-queue. > > 1. Tag# generation. > > The context is with SCSI MQ on. My question is, what should a LLD do to > get request tag values in the range 0 through can_queue - 1 across *all* > of the queues. In our QLogic 41XXX series of adapters, we have a per > session submit queue, a shared task memory (shared across all queues) and > N completion queues (separate MSI-X vectors). We report N as the > nr_hw_queues. I would like to, if possible, use the block layer tags to > index into the above shared task memory area. > > From looking at the scsi/block source, it appears that when a LLD reports > a value say #C, in can_queue (via scsi_host_template), that value is used > as the max depth when corresponding block layer queues are created. So, > while SCSI restricts the number of commands to LLD at #C, the request tag > generated across any of the queues can range from 0..#C-1. Please correct > me if I got this wrong. > > If the above is true, then for a LLD to get tag# within it's max-tasks > range, it has to report max-tasks / number-of-hw-queues in can_queue, and > in the I/O path, use the tag and hwq# to arrive at a index# to use. This, > though, leads to a poor use of tag resources -- queue reaching it's > capacity while LLD can still take it. > Yep. > blk_mq_unique_tag() would not work here, because it just puts the hwq# in > the upper 16 bits, which need not fall in the max-tasks range. > > Perhaps the current MQ model is to cater to a queue pair > (submit/completion) kind of hardware model; nevertheless I would like to > know how other hardware variants can makes use of it. > He. Welcome to the club. Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe) HBAs. ATM the only 'real' solution to this problem is indeed have a static split of the entire tag space by the number of hardware queues. 
With the mentioned tag-starvation problem. If we were to continue with the tag to hardware ID mapping, we would need to implement a dynamic tag space mapping onto hardware queues. My idea to that would be to not map the entire tag space, but rather the individual bit words onto the hardware queue. Then we could make the mapping dynamic, where the individual words are mapped onto the queues only as needed. However, the _one_ big problem we're facing here is timeouts. With the 1:1 mapping between tags and hardware IDs we can only re-use the tag once the timeout is _definitely_ resolved. But this means the command will be active, and we cannot return blk_mq_complete() until the timeout itself has been resolved. With FC this essentially means until the corresponding XIDs are safe to re-use, ie after all ABRT/RRQ etc processing has been completed. Hence we totally lose the ability to return the command itself with -ETIMEDOUT and continue with I/O processing even though the original XID is still being held by firmware. In the light of this I wonder if it wouldn't be better to completely decouple block-layer tags and hardware IDs, and have an efficient algorithm mapping the block-layer tags onto hardware IDs. That should avoid the arbitrary tag starvation problem, and would allow us to handle timeouts efficiently. Of course, we don't _have_ such an efficient algorithm; maybe it's time to have a generic one within the kernel as quite some drivers would _love_ to just use the generic implementation here. (qla2xxx, lpfc, fcoe, mpt3sas etc all suffer from the same problem) > 2. mq vs non-mq performance gain. > > This is more like a poll, I guess. I was wondering what performance gains > folks are observing with SCSI MQ on. I saw Christoph H.'s slide deck that > has one slide that shows a 200k IOPS gain. > > From my testing, though, I was not lucky to observe that big of a change. > In fact, the difference was not even noticeable(*).
For e.g., for 512 > bytes random read test, in both cases, gave me in the vicinity of 2M IOPs. > When I say both cases, I meant, one with scsi_mod's use_blk_mq set to 0 > and another with 1 (LLD is reloaded when it is done). I only used one NUMA > node for this run. The test was run on a x86_64 setup. > You _really_ should have listened to my talk at VAULT. For 'legacy' HBAs there indeed is not much of a performance gain to be had; the max gain is indeed for heavy parallel I/O. And there even is a scheduler issue when running with a single submission thread; there I've measured a performance _drop_ by up to 50%. Which, as Jens claims, really looks like a block-layer issue rather than a generic problem. > * See item 3 for a special handling. > > 3. add_random slowness. > > One thing I observed with MQ on and off was this block layer tunable, > add_random, which as I understand is to tune disk entropy contribution. > With non-MQ, it is turned on, and with MQ, it is turned off by default. > > This got noticed
scsi-mq - tag# and can_queue, performance.
Hi Folks, I would like to seek your input on a few topics on SCSI / block multi-queue. 1. Tag# generation. The context is with SCSI MQ on. My question is, what should an LLD do to get request tag values in the range 0 through can_queue - 1 across *all* of the queues. In our QLogic 41XXX series of adapters, we have a per session submit queue, a shared task memory (shared across all queues) and N completion queues (separate MSI-X vectors). We report N as the nr_hw_queues. I would like to, if possible, use the block layer tags to index into the above shared task memory area. From looking at the scsi/block source, it appears that when a LLD reports a value say #C, in can_queue (via scsi_host_template), that value is used as the max depth when corresponding block layer queues are created. So, while SCSI restricts the number of commands to LLD at #C, the request tag generated across any of the queues can range from 0..#C-1. Please correct me if I got this wrong. If the above is true, then for a LLD to get tag# within its max-tasks range, it has to report max-tasks / number-of-hw-queues in can_queue, and in the I/O path, use the tag and hwq# to arrive at a index# to use. This, though, leads to a poor use of tag resources -- a queue reaching its capacity while the LLD can still take it. blk_mq_unique_tag() would not work here, because it just puts the hwq# in the upper 16 bits, which need not fall in the max-tasks range. Perhaps the current MQ model is to cater to a queue pair (submit/completion) kind of hardware model; nevertheless I would like to know how other hardware variants can make use of it. 2. mq vs non-mq performance gain. This is more like a poll, I guess. I was wondering what performance gains folks are observing with SCSI MQ on. I saw Christoph H.'s slide deck that has one slide that shows a 200k IOPS gain. From my testing, though, I was not lucky to observe that big of a change. In fact, the difference was not even noticeable(*).
For e.g., a 512-byte random read test, in both cases, gave me in the vicinity of 2M IOPS. When I say both cases, I meant, one with scsi_mod's use_blk_mq set to 0 and another with 1 (LLD is reloaded when it is done). I only used one NUMA node for this run. The test was run on a x86_64 setup. * See item 3 for special handling. 3. add_random slowness. One thing I observed with MQ on and off was this block layer tunable, add_random, which as I understand is to tune disk entropy contribution. With non-MQ, it is turned on, and with MQ, it is turned off by default. This got noticed because, when I was running multi-port testing, there was a big drop in IOPS with and without MQ (~200K IOPS to 1M+ IOPS when the test ran on same NUMA node / across NUMA nodes). Just wondering why we have it ON in one setting and OFF in another. Sorry for the rather long e-mail, but your comments/thoughts are much appreciated. Regards, -Arun
Re: always use REQ_OP_WRITE_ZEROES for zeroing offload
On 03/31/2017 06:32 PM, Christoph Hellwig wrote: > This series makes REQ_OP_WRITE_ZEROES the only zeroing offload > supported by the block layer, and switches existing implementations > of REQ_OP_DISCARD that correctly set discard_zeroes_data to it, > removes incorrect discard_zeroes_data, and also switches WRITE SAME > based zeroing in SCSI to this new method. > > The series is against the block for-next tree. > > A git tree is also available at: > > git://git.infradead.org/users/hch/block.git discard-rework > > Gitweb: > > > http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/discard-rework Thank you for doing this. For this series: Reviewed-by: Hannes Reinecke

Cheers,
Hannes
-- 
Dr. Hannes Reinecke  Teamlead Storage & Networking
h...@suse.de  +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)