Re: [REGRESSION] v4.11-rc3: lpfc: panic during module removal / shutdown

2017-04-03 Thread Junichi Nomura
On 04/04/17 06:53, Mauricio Faria de Oliveira wrote:
> Hi Junichi,
> 
> On 03/28/2017 11:29 PM, Junichi Nomura wrote:
>> Since commit 895427bd012c ("scsi: lpfc: NVME Initiator: Base modifications"),
>> "rmmod lpfc" starting to cause panic or corruption due to double free.
> 
> Thanks for the report. Can you please check whether the patch just sent
> ([PATCH] lpfc: fix double free of bound CQ/WQ ring pointer) resolves it?

It works for me. Thank you!

Considering future maintenance, it might be a bit fragile to depend just
on the code comment to represent the relation between the cq/wq and the
shared pring, but that's the maintainers' choice.

-- 
Jun'ichi Nomura, NEC Corporation / NEC Solution Innovators, Ltd.


Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.

2017-04-03 Thread Dan Williams
On Mon, Apr 3, 2017 at 4:12 PM, Logan Gunthorpe  wrote:
>
>
> On 03/04/17 04:47 PM, Dan Williams wrote:
>> I wouldn't necessarily conflate supporting pfn_t in the scatterlist
>> with the stalled struct-page-less DMA effort. A pfn_t_to_page()
>> conversion will still work and be required. However you're right, the
>> minute we use pfn_t for this we're into the realm of special case
>> drivers that understand scatterlists with special "I/O-pfn_t" entries.
>
> Well yes, it would certainly be possible to convert the scatterlist code
> from page_link to pfn_t. (The only slightly tricky thing is that
> scatterlist uses extra chaining bits and pfn_t uses extra flag bits so
> they'd have to be harmonized somehow). But if we aren't moving toward
> struct-page-less DMA, I fail to see the point of the conversion.
>
> I'll definitely need IO scatterlists of some form or another and I like
> pfn_t but right now it just seems like extra work with unclear benefit.
> (Though, if someone told me that I can't use a third bit in the
> page_link field then maybe that would be a good reason to move to pfn_t.)
>
>> However, maybe that's what we want? I think peer-to-peer DMA is not a
>> general purpose feature unless/until we get it standardized in PCI. So
>> maybe drivers with special-case scatterlist support are exactly what we
>> want for now.
>
> Well, I think this should be completely independent from PCI code. I see
> no reason why we can't have infrastructure for DMA on iomem from any
> bus. Largely all the work I've done in this area is completely agnostic
> to the bus in use. (Except for any kind of white/black list when it is
> used.)

The completely agnostic part is where I get worried, but I shouldn't
say any more until I actually read the patch. The worry is cases where
this agnostic enabling allows unsuspecting code paths to do the wrong
thing, like bypassing iomem safety.
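To make that concrete, a minimal sketch (not from these patches; bar_addr
and len are placeholders): a plain memcpy() from ioremap()ed memory
compiles fine but is undefined behavior on many architectures, which is
exactly the kind of bypass I mean:

	#include <linux/io.h>

	static void demo_iomem_access(resource_size_t bar_addr, unsigned long len)
	{
		void __iomem *mmio = ioremap(bar_addr, len);
		char buf[64];

		if (!mmio)
			return;
		memcpy_fromio(buf, mmio, sizeof(buf));	/* safe: I/O accessor */
		/* memcpy(buf, mmio, sizeof(buf)) would be the unsafe bypass */
		iounmap(mmio);
	}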

> The "special case scatterlist" is essentially what I'm proposing in the
> patch I sent upthread, it just stores the flag in the page_link instead
> of in a pfn_t.

Makes sense. The suggestion of pfn_t was to try to get more type
safety throughout the stack. So that, again, unsuspecting code paths
that get an I/O pfn aren't able to do things like page_address() or
kmap() without failing.
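For example, a checked helper along these lines (a sketch against the
v4.11-era pfn_t API only; the special "I/O" flag discussed above does
not exist yet):

	#include <linux/pfn_t.h>
	#include <linux/highmem.h>

	static void *pfn_kmap_checked(pfn_t pfn)
	{
		/* refuse to kmap() pfns with no struct page behind them */
		if (!pfn_t_has_page(pfn))
			return NULL;
		return kmap(pfn_t_to_page(pfn));
	}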

I'll stop commenting now and set aside some time to go read the patches.


Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.

2017-04-03 Thread Logan Gunthorpe


On 03/04/17 04:47 PM, Dan Williams wrote:
> I wouldn't necessarily conflate supporting pfn_t in the scatterlist
> with the stalled struct-page-less DMA effort. A pfn_t_to_page()
> conversion will still work and be required. However you're right, the
> minute we use pfn_t for this we're into the realm of special case
> drivers that understand scatterlists with special "I/O-pfn_t" entries.

Well yes, it would certainly be possible to convert the scatterlist code
from page_link to pfn_t. (The only slightly tricky thing is that
scatterlist uses extra chaining bits and pfn_t uses extra flag bits so
they'd have to be harmonized somehow). But if we aren't moving toward
struct-page-less DMA, I fail to see the point of the conversion.

I'll definitely need IO scatterlists of some form or another and I like
pfn_t but right now it just seems like extra work with unclear benefit.
(Though, if someone told me that I can't use a third bit in the
page_link field then maybe that would be a good reason to move to pfn_t.)

> However, maybe that's what we want? I think peer-to-peer DMA is not a
> general purpose feature unless/until we get it standardized in PCI. So
> maybe drivers with special-case scatterlist support are exactly what we
> want for now.

Well, I think this should be completely independent from PCI code. I see
no reason why we can't have infrastructure for DMA on iomem from any
bus. Largely all the work I've done in this area is completely agnostic
to the bus in use. (Except for any kind of white/black list when it is
used.)

The "special case scatterlist" is essentially what I'm proposing in the
patch I sent upthread, it just stores the flag in the page_link instead
of in a pfn_t.

Logan


Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.

2017-04-03 Thread Dan Williams
On Mon, Apr 3, 2017 at 3:10 PM, Logan Gunthorpe  wrote:
>
>
> On 03/04/17 03:44 PM, Dan Williams wrote:
>> On Mon, Apr 3, 2017 at 2:20 PM, Logan Gunthorpe  wrote:
>>> Hi Christoph,
>>>
>>> What are your thoughts on an approach like the following untested
>>> draft patch.
>>>
>>> The patch (if fleshed out) makes it so iomem can be used in an sgl
>>> and WARN_ONs will occur in places where drivers attempt to access
>>> iomem directly through the sgl.
>>>
>>> I'd also probably create a p2pmem_alloc_sgl helper function so driver
>>> writers wouldn't have to mess with sg_set_iomem_page.
>>>
>>> With all that in place, it should be relatively safe for drivers to
>>> implement p2pmem even though we'd still technically be violating the
>>> __iomem boundary in some places.
>>
>> Just reacting to this mail, I still haven't had a chance to take a
>> look at the rest of the series.
>>
>> The pfn_t type was invented to carry extra type and page lookup
>> information about the memory behind a given pfn. At first glance that
>> seems a more natural place to carry an indication that this is an
>> "I/O" pfn.
>
> I agree... But what are the plans for pfn_t? Is anyone working on using
> it in the scatterlist code? Currently it's not there yet, and given the
> assertion that we will continue to be using struct page for DMA, is that
> a direction we'd want to go?
>

I wouldn't necessarily conflate supporting pfn_t in the scatterlist
with the stalled struct-page-less DMA effort. A pfn_t_to_page()
conversion will still work and be required. However you're right, the
minute we use pfn_t for this we're into the realm of special case
drivers that understand scatterlists with special "I/O-pfn_t" entries.
However, maybe that's what we want? I think peer-to-peer DMA is not a
general purpose feature unless/until we get it standardized in PCI. So
maybe drivers with special-case scatterlist support are exactly what we
want for now.

Thoughts?


Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.

2017-04-03 Thread Logan Gunthorpe


On 03/04/17 03:44 PM, Dan Williams wrote:
> On Mon, Apr 3, 2017 at 2:20 PM, Logan Gunthorpe  wrote:
>> Hi Christoph,
>>
>> What are your thoughts on an approach like the following untested
>> draft patch.
>>
>> The patch (if fleshed out) makes it so iomem can be used in an sgl
>> and WARN_ONs will occur in places where drivers attempt to access
>> iomem directly through the sgl.
>>
>> I'd also probably create a p2pmem_alloc_sgl helper function so driver
>> writers wouldn't have to mess with sg_set_iomem_page.
>>
>> With all that in place, it should be relatively safe for drivers to
>> implement p2pmem even though we'd still technically be violating the
>> __iomem boundary in some places.
> 
> Just reacting to this mail, I still haven't had a chance to take a
> look at the rest of the series.
> 
> The pfn_t type was invented to carry extra type and page lookup
> information about the memory behind a given pfn. At first glance that
> seems a more natural place to carry an indication that this is an
> "I/O" pfn.

I agree... But what are the plans for pfn_t? Is anyone working on using
it in the scatterlist code? Currently it's not there yet, and given the
assertion that we will continue to be using struct page for DMA, is that
a direction we'd want to go?

Logan


Re: [REGRESSION] v4.11-rc3: lpfc: panic during module removal / shutdown

2017-04-03 Thread Mauricio Faria de Oliveira

Hi Junichi,

On 03/28/2017 11:29 PM, Junichi Nomura wrote:

Since commit 895427bd012c ("scsi: lpfc: NVME Initiator: Base modifications"),
"rmmod lpfc" starting to cause panic or corruption due to double free.


Thanks for the report. Can you please check whether the patch just sent
([PATCH] lpfc: fix double free of bound CQ/WQ ring pointer) resolves it?

I don't have a setup handy to test it right now.

cheers,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center



[PATCH] lpfc: fix double free of bound CQ/WQ ring pointer

2017-04-03 Thread Mauricio Faria de Oliveira
commit 895427bd012c ("scsi: lpfc: NVME Initiator: Base modifications")
binds the CQ's and WQ's ring pointer (sets it to the same address on both).

lpfc_create_wq_cq():
...
rc = lpfc_cq_create(phba, cq, eq, <...>)
...
rc = lpfc_wq_create(phba, wq, cq, qtype);
...
/* Bind this CQ/WQ to the NVME ring */
pring = wq->pring;
...
cq->pring = pring;
...

The commit frees both CQ & WQ for FCP/NVME on lpfc_sli4_queue_destroy(),
which causes a double free (potential corruption or panic) on freeing
the ring pointer of the second entity (CQ is first, WQ is second):

lpfc_pci_remove_one() # that is, .remove / .shutdown
-> lpfc_pci_remove_one_s4()
   -> lpfc_sli4_hba_unset()
  -> lpfc_sli4_queue_destroy()

 -> lpfc_sli4_release_queues()  # Release FCP/NVME cqs
-> __lpfc_sli4_release_queue()
   -> lpfc_sli4_queue_free()
  -> kfree(queue->pring)  # first free

 -> lpfc_sli4_release_queues()  # Release FCP/NVME wqs
-> __lpfc_sli4_release_queue()
   -> lpfc_sli4_queue_free()
  -> kfree(queue->pring)  # second free

So, check for WQs in lpfc_sli4_queue_free() and do not free the pring,
as it was already freed via the bound CQ.  [The WQs are created only via
lpfc_wq_create(), which sets struct lpfc_queue::type == LPFC_WQ.  And
that happens in two sites (lpfc_create_wq_cq() & lpfc_fof_queue_setup()),
both of which bind the CQs & WQs. Thus, checking for the LPFC_WQ type
correlates to whether the WQ is bound to a CQ, which is freed first.]

Additional details:

For reference, that binding also occurs on one other function:

lpfc_fof_queue_setup():
...
rc = lpfc_cq_create(phba, phba->sli4_hba.oas_cq, <...>)
...
rc = lpfc_wq_create(phba, phba->sli4_hba.oas_wq, <...>)
...
/* Bind this CQ/WQ to the NVME ring */
pring = phba->sli4_hba.oas_wq->pring;
...
phba->sli4_hba.oas_cq->pring = pring;

The binding also used to occur similarly in lpfc_sli4_queue_setup(), but
was changed by that commit; the problem, though, is more related to the
new freeing pattern introduced in lpfc_sli4_queue_destroy() plus the
bound CQs/WQs.

-   /* Bind this WQ to the next FCP ring */
-   pring = &psli->ring[MAX_SLI3_CONFIGURED_RINGS + fcp_wqidx];
...
-   phba->sli4_hba.fcp_cq[fcp_wqidx]->pring = pring;

commit 85e8a23936ab ("scsi: lpfc: Add shutdown method for kexec") made
this more likely as lpfc_pci_remove_one() is called on driver shutdown
(e.g., modprobe -r / rmmod).

(this patch is partially based on a different patch suggested by Johannes,
 thus adding a Suggested-by tag for due credit.)

Signed-off-by: Mauricio Faria de Oliveira 
Reported-by: Junichi Nomura 
Suggested-by: Johannes Thumshirn 
---
 drivers/scsi/lpfc/lpfc_sli.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/lpfc/lpfc_sli.c b/drivers/scsi/lpfc/lpfc_sli.c
index 1c9fa45df7eb..8befe841adaa 100644
--- a/drivers/scsi/lpfc/lpfc_sli.c
+++ b/drivers/scsi/lpfc/lpfc_sli.c
@@ -13758,7 +13758,14 @@ void lpfc_sli4_els_xri_abort_event_proc(struct 
lpfc_hba *phba)
lpfc_free_rq_buffer(queue->phba, queue);
kfree(queue->rqbp);
}
-   kfree(queue->pring);
+
+   /*
+* The WQs/CQs' pring is bound (same pointer).
+* So free it only once, and not again on WQ.
+*/
+   if (queue->type != LPFC_WQ)
+   kfree(queue->pring);
+
kfree(queue);
return;
 }
-- 
1.8.3.1



Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.

2017-04-03 Thread Dan Williams
On Mon, Apr 3, 2017 at 2:20 PM, Logan Gunthorpe  wrote:
> Hi Christoph,
>
> What are your thoughts on an approach like the following untested
> draft patch.
>
> The patch (if fleshed out) makes it so iomem can be used in an sgl
> and WARN_ONs will occur in places where drivers attempt to access
> iomem directly through the sgl.
>
> I'd also probably create a p2pmem_alloc_sgl helper function so driver
> writers wouldn't have to mess with sg_set_iomem_page.
>
> With all that in place, it should be relatively safe for drivers to
> implement p2pmem even though we'd still technically be violating the
> __iomem boundary in some places.

Just reacting to this mail, I still haven't had a chance to take a
look at the rest of the series.

The pfn_t type was invented to carry extra type and page lookup
information about the memory behind a given pfn. At first glance that
seems a more natural place to carry an indication that this is an
"I/O" pfn.


Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.

2017-04-03 Thread Logan Gunthorpe
Hi Christoph,

What are your thoughts on an approach like the following untested
draft patch.

The patch (if fleshed out) makes it so iomem can be used in an sgl
and WARN_ONs will occur in places where drivers attempt to access
iomem directly through the sgl.

I'd also probably create a p2pmem_alloc_sgl helper function so driver
writers wouldn't have to mess with sg_set_iomem_page.

With all that in place, it should be relatively safe for drivers to
implement p2pmem even though we'd still technically be violating the
__iomem boundary in some places.

Logan


commit b435a154a4ec4f82766f6ab838092c3c5a9388ac
Author: Logan Gunthorpe 
Date:   Wed Feb 8 12:44:52 2017 -0700

scatterlist: Add support for iomem pages

This patch steals another bit from the page_link field to indicate the
sg points to iomem. In sg_copy_buffer we use this flag to select
between memcpy and iomemcpy. Other sg_miter users will get a WARN_ON
unless they indicate they support iomemory by setting the
SG_MITER_IOMEM flag.

Also added are sg_kmap functions which would replace a common pattern
of kmap(sg_page(sg)). These new functions then also warn if the caller
tries to map io memory. Another option may be to automatically copy
the iomem to a new page and return that transparently to the driver.

Another coccinelle patch would then be done to convert kmap(sg_page(sg))
instances to the appropriate sg_kmap calls.

Signed-off-by: Logan Gunthorpe 

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 0007b79..bd690a2c 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -37,6 +37,9 @@

 #include 

+/* Prevent the highmem.h macro from aliasing our ops->kunmap_atomic */
+#undef kunmap_atomic
+
 static inline int is_dma_buf_file(struct file *);

 struct dma_buf_list {
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index cb3c8fe..7608da0 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 

 struct scatterlist {
@@ -53,6 +54,9 @@ struct sg_table {
  *
  * If bit 1 is set, then this sg entry is the last element in a list.
  *
+ * We also use bit 2 to indicate whether the page_link points to an
+ * iomem page or not.
+ *
  * See sg_next().
  *
  */
@@ -64,10 +68,17 @@ struct sg_table {
  * a valid sg entry, or whether it points to the start of a new
scatterlist.
  * Those low bits are there for everyone! (thanks mason :-)
  */
-#define sg_is_chain(sg)((sg)->page_link & 0x01)
-#define sg_is_last(sg) ((sg)->page_link & 0x02)
+#define PAGE_LINK_MASK 0x7
+#define PAGE_LINK_CHAIN0x1
+#define PAGE_LINK_LAST 0x2
+#define PAGE_LINK_IOMEM0x4
+
+#define sg_is_chain(sg)((sg)->page_link & PAGE_LINK_CHAIN)
+#define sg_is_last(sg) ((sg)->page_link & PAGE_LINK_LAST)
 #define sg_chain_ptr(sg)   \
-   ((struct scatterlist *) ((sg)->page_link & ~0x03))
+   ((struct scatterlist *) ((sg)->page_link & ~(PAGE_LINK_CHAIN | \
+PAGE_LINK_LAST)))
+#define sg_is_iomem(sg)((sg)->page_link & PAGE_LINK_IOMEM)

 /**
  * sg_assign_page - Assign a given page to an SG entry
@@ -81,13 +92,13 @@ struct sg_table {
  **/
 static inline void sg_assign_page(struct scatterlist *sg, struct page
*page)
 {
-   unsigned long page_link = sg->page_link & 0x3;
+   unsigned long page_link = sg->page_link & PAGE_LINK_MASK;

/*
 * In order for the low bit stealing approach to work, pages
-* must be aligned at a 32-bit boundary as a minimum.
+* must be aligned at a 64-bit boundary as a minimum.
 */
-   BUG_ON((unsigned long) page & 0x03);
+   BUG_ON((unsigned long) page & PAGE_LINK_MASK);
 #ifdef CONFIG_DEBUG_SG
BUG_ON(sg->sg_magic != SG_MAGIC);
BUG_ON(sg_is_chain(sg));
@@ -117,13 +128,56 @@ static inline void sg_set_page(struct scatterlist
*sg, struct page *page,
sg->length = len;
 }

+/**
+ * sg_set_iomem_page - Set sg entry to point at given iomem page
+ * @sg: SG entry
+ * @page:   The page
+ * @len:Length of data
+ * @offset: Offset into page
+ *
+ * Description:
+ *   Same as sg_set_page but used when the page is a ZONE_DEVICE page that
+ *   points to IO memory.
+ *
+ **/
+static inline void sg_set_iomem_page(struct scatterlist *sg, struct
page *page,
+unsigned int len, unsigned int offset)
+{
+   sg_set_page(sg, page, len, offset);
+   sg->page_link |= PAGE_LINK_IOMEM;
+}
+
 static inline struct page *sg_page(struct scatterlist *sg)
 {
 #ifdef CONFIG_DEBUG_SG
BUG_ON(sg->sg_magic != SG_MAGIC);
BUG_ON(sg_is_chain(sg));
 #endif
-   return (struct page *)((sg)->page_link & ~0x3);
+   return (struct page *)((sg)->page_link & 
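For reference, a minimal sketch of the sg_kmap()/sg_kunmap() helpers the
commit message above describes (only the names come from the message; the
bodies are an assumption):

	static inline void *sg_kmap(struct scatterlist *sg)
	{
		/* catch drivers trying to map I/O memory through the sgl */
		WARN_ON(sg_is_iomem(sg));
		return kmap(sg_page(sg));
	}

	static inline void sg_kunmap(struct scatterlist *sg, void *addr)
	{
		kunmap(sg_page(sg));
	}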

Re: [PATCH] libiscsi: use vzalloc for large allocations in iscsi_pool_init

2017-04-03 Thread Chet L
On Mon, Apr 3, 2017 at 6:30 AM, Kyle Fortin  wrote:

>
> for (i = 0; i < q->max; i++)
> kfree(q->pool[i]);
> -   kfree(q->pool);
> +   if (q->is_pool_vmalloc)

you could do something like:

if (is_vmalloc_addr(q->pool))
        vfree(q->pool);
else
        kfree(q->pool);

And then remove the bool.
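A minimal sketch of the resulting iscsi_pool_free(), assuming the
allocation side stays as in the patch (is_vmalloc_addr() is declared in
linux/mm.h):

	void iscsi_pool_free(struct iscsi_pool *q)
	{
		int i;

		for (i = 0; i < q->max; i++)
			kfree(q->pool[i]);	/* per-item allocations stay kmalloc'd */

		if (is_vmalloc_addr(q->pool))	/* true for vzalloc'd pools */
			vfree(q->pool);
		else
			kfree(q->pool);
	}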


Chetan


[Bug 195233] New: sd: discard_granularity set smaller than minimum_io_size when LBPRZ=1

2017-04-03 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=195233

Bug ID: 195233
   Summary: sd: discard_granularity set smaller than minimum_io_size when LBPRZ=1
   Product: IO/Storage
   Version: 2.5
Kernel Version: 4.4.0
  Hardware: All
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: SCSI
  Assignee: linux-scsi@vger.kernel.org
  Reporter: dbuck...@oreilly.com
Regression: No

Between 3.10 and 4.4 kernels this patch:
https://patchwork.kernel.org/patch/6995131/ was merged to ensure
discard_granularity is set to logical_block_size when a disk reports LBPRZ=1. 
In a case where a disk reports LBPRZ=1 with a 512 logical_block_size and 4096
physical_block_size and minimum_io_size, this results in discard_granularity
incorrectly being set to 512 which at least in my case results in discard
requests being ignored.

I believe the easiest fix would be changing:

if (sdkp->lbprz) {
q->limits.discard_alignment = 0;
q->limits.discard_granularity = logical_block_size;
} else {

to:

if (sdkp->lbprz) {
q->limits.discard_alignment = 0;
q->limits.discard_granularity = max(logical_block_size,
minimum_io_size);
} else {

but any change which results in discard_granularity being set properly would
solve this.

I have verified that with a 3.10 kernel discard_granularity is set to 4096 and
works as expected, but with 4.4 it is set to 512 and does not work. 

This is a fairly severe bug for my use case, as I rely on discard to free
unused blocks from the storage backing LUNs mounted by Linux clients and this
effectively makes critical functions like thin-provisioning and snapshots
infeasible for 4.x clients. 

I have only tested this with the Ubuntu-packaged kernels, but the problematic
code still is present in the current source.  I'm happy to provide any
additional information that might be helpful.

-- 
You are receiving this mail because:
You are the assignee for the bug.


Re: linux-next: Tree for Apr 3 (scsi/lpfc)

2017-04-03 Thread Randy Dunlap
On 04/03/17 01:13, Stephen Rothwell wrote:
> Hi all,
> 
> Changes since 20170331:
> 

on i386:

when SCSI_LPFC=y and
CONFIG_NVME_CORE=m
CONFIG_BLK_DEV_NVME=m
CONFIG_BLK_DEV_NVME_SCSI=y
CONFIG_NVME_FABRICS=m
CONFIG_NVME_FC=m
CONFIG_NVME_TARGET=m

drivers/built-in.o: In function `lpfc_nvme_create_localport':
(.text+0x28ce6b): undefined reference to `nvme_fc_register_localport'
drivers/built-in.o: In function `lpfc_nvme_destroy_localport':
(.text+0x28d263): undefined reference to `nvme_fc_unregister_remoteport'
drivers/built-in.o: In function `lpfc_nvme_destroy_localport':
(.text+0x28d2d3): undefined reference to `nvme_fc_unregister_localport'
drivers/built-in.o: In function `lpfc_nvme_register_port':
(.text+0x28d576): undefined reference to `nvme_fc_register_remoteport'
drivers/built-in.o: In function `lpfc_nvme_unregister_port':
(.text+0x28d93c): undefined reference to `nvme_fc_unregister_remoteport'


so SCSI_LPFC depends on NVME_FC...
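A hedged sketch of the usual Kconfig idiom for a built-in driver that
depends on a tristate symbol (the maintainers may well pick a different
fix):

config SCSI_LPFC
	tristate "Emulex LightPulse Fibre Channel Support"
	depends on PCI && SCSI
	depends on NVME_FC || NVME_FC=n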



Reported-by: Randy Dunlap 

-- 
~Randy


Re: [PATCH] target/user: PGR Support

2017-04-03 Thread Mike Christie
On 04/03/2017 12:20 AM, Shie-rei Huang wrote:
> After a PGR command is processed in the kernel, is it possible for the
> user mode to be notified with the command so that the user mode has a
> chance to do its part of PGR processing. Below is one use case of it.
> Suppose two TCMU servers are set up as an active/passive cluster for
> high availability. An initiator's PGR command is sent to the active
> server only. But the active server's user mode would like the passive
> server to know about the PGR command also and sync up its configfs
> state with the active server. This is to make sure that when the
> active server dies, the passive server can take over with the correct
> PGR state.

We have been discussing this in a couple different threads, and it does
not work right now as you saw, but I am looking into and I think some
other people are too.

For HA we would need to handle these:

1. We can't just pass a raw PGR command to userspace, because we do not
have enough info. For commands where we need to know the I_T nexus the
command came in on, we need to pass that info upwards. For example,
a registration with SPEC_I_PT=0 and ALL_TG_PT=0.

If we did a new netlink interface then we could pass that extra info
with the processed PGR info, or we could modify the existing tcmu one to
pass the raw PGR command with the transport info.

2. We do not want to return status for the PGR command before userspace
has distributed the PGR info. So, we can't do something really simple
like just send a netlink event and return status for the command
immediately, nor can we just modify the APTPL file to be usable when
APTPL is not used and then just watch it for changes from userspace.

We need some extra coordination, like sending an event to pass the
PGR/I_T nexus info; the kernel then needs to wait for a response before
it can return status for the command.


> Thanks,
> 
> -Shie-rei
> 
> On Thu, Mar 30, 2017 at 7:47 AM, Bryant G. Ly
>  wrote:
>> This adds PGR support for just TCMU, since tcmu doesn't
>> have the necessary IT_NEXUS info to process PGR in userspace,
>> so have those commands be processed in kernel.
>>
>> Signed-off-by: Bryant G. Ly 
>> ---
>>  drivers/target/target_core_configfs.c | 10 +-
>>  drivers/target/target_core_device.c   | 27 +++
>>  drivers/target/target_core_pr.c   |  2 +-
>>  drivers/target/target_core_pscsi.c|  3 ++-
>>  include/target/target_core_backend.h  |  1 +
>>  5 files changed, 36 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/target/target_core_configfs.c 
>> b/drivers/target/target_core_configfs.c
>> index 38b5025..edfb098 100644
>> --- a/drivers/target/target_core_configfs.c
>> +++ b/drivers/target/target_core_configfs.c
>> @@ -1366,7 +1366,7 @@ static ssize_t target_pr_res_holder_show(struct 
>> config_item *item, char *page)
>> struct se_device *dev = pr_to_dev(item);
>> int ret;
>>
>> -   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH)
>> +   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR)
>> return sprintf(page, "Passthrough\n");
>>
>> spin_lock(&dev->dev_reservation_lock);
>> @@ -1506,7 +1506,7 @@ static ssize_t target_pr_res_type_show(struct 
>> config_item *item, char *page)
>>  {
>> struct se_device *dev = pr_to_dev(item);
>>
>> -   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH)
>> +   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR)
>> return sprintf(page, "SPC_PASSTHROUGH\n");
>> else if (dev->dev_reservation_flags & DRF_SPC2_RESERVATIONS)
>> return sprintf(page, "SPC2_RESERVATIONS\n");
>> @@ -1519,7 +1519,7 @@ static ssize_t target_pr_res_aptpl_active_show(struct 
>> config_item *item,
>>  {
>> struct se_device *dev = pr_to_dev(item);
>>
>> -   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH)
>> +   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR)
>> return 0;
>>
>> return sprintf(page, "APTPL Bit Status: %s\n",
>> @@ -1531,7 +1531,7 @@ static ssize_t 
>> target_pr_res_aptpl_metadata_show(struct config_item *item,
>>  {
>> struct se_device *dev = pr_to_dev(item);
>>
>> -   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH)
>> +   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR)
>> return 0;
>>
>> return sprintf(page, "Ready to process PR APTPL metadata..\n");
>> @@ -1577,7 +1577,7 @@ static ssize_t 
>> target_pr_res_aptpl_metadata_store(struct config_item *item,
>> u16 tpgt = 0;
>> u8 type = 0;
>>
>> -   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH)
>> +   if (dev->transport->transport_flags & TRANSPORT_FLAG_PASSTHROUGH_PGR)
>> return count;
>> if 

Re: scsi-mq - tag# and can_queue, performance.

2017-04-03 Thread Arun Easi
On Mon, 3 Apr 2017, 9:47am, Jens Axboe wrote:

> On 04/03/2017 10:41 AM, Arun Easi wrote:
> > On Mon, 3 Apr 2017, 8:20am, Bart Van Assche wrote:
> > 
> >> On Mon, 2017-04-03 at 09:29 +0200, Hannes Reinecke wrote:
> >>> On 04/03/2017 08:37 AM, Arun Easi wrote:
>  If the above is true, then for a LLD to get tag# within its max-tasks 
>  range, it has to report max-tasks / number-of-hw-queues in can_queue, and 
>  in the I/O path, use the tag and hwq# to arrive at an index# to use. This, 
>  though, leads to a poor use of tag resources -- the queue reaching its 
>  capacity while the LLD can still take it.
> >>>
> >>> Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe)
> >>> HBAs. ATM the only 'real' solution to this problem is indeed have a
> >>> static split of the entire tag space by the number of hardware queues.
> >>> With the mentioned tag-starvation problem.
> >>
> >> Hello Arun and Hannes,
> >>
> >> Apparently the current blk_mq_alloc_tag_set() implementation is well suited
> >> for drivers like NVMe and ib_srp but not for traditional SCSI HBA drivers.
> >> How about adding a BLK_MQ_F_* flag that tells __blk_mq_alloc_rq_maps() to
> >> allocate a single set of tags for all hardware queues and also to add a 
> >> flag
> >> to struct scsi_host_template such that SCSI LLDs can enable this behavior?
> >>
> > 
> > Hi Bart,
> > 
> > This would certainly be beneficial in my case. Moreover, it certainly 
> > makes sense to move the logic up where multiple drivers can leverage it. 
> > 
> > Perhaps use percpu_ida* interfaces to do that, but I think I read 
> > somewhere that it is not efficient (enough?), which is the reason block 
> > tags went the current way.
> 
> You don't have to change the underlying tag generation to solve this
> problem, Bart already pretty much outlined a fix that would work.
> percpu_ida works fine if you never use more than roughly half the
> available space, it's a poor fit for request tags where we want to
> retain good behavior and scaling at or near tag exhaustion. That's why
> blk-mq ended up rolling its own, which is now generically available as
> lib/sbitmap.c.
> 

Sounds good. Thanks for the education, Jens.

Regards,
-Arun


Re: scsi-mq - tag# and can_queue, performance.

2017-04-03 Thread Jens Axboe
On 04/03/2017 10:41 AM, Arun Easi wrote:
> On Mon, 3 Apr 2017, 8:20am, Bart Van Assche wrote:
> 
>> On Mon, 2017-04-03 at 09:29 +0200, Hannes Reinecke wrote:
>>> On 04/03/2017 08:37 AM, Arun Easi wrote:
 If the above is true, then for a LLD to get tag# within its max-tasks 
 range, it has to report max-tasks / number-of-hw-queues in can_queue, and 
 in the I/O path, use the tag and hwq# to arrive at an index# to use. This, 
 though, leads to a poor use of tag resources -- the queue reaching its 
 capacity while the LLD can still take it.
>>>
>>> Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe)
>>> HBAs. ATM the only 'real' solution to this problem is indeed have a
>>> static split of the entire tag space by the number of hardware queues.
>>> With the mentioned tag-starvation problem.
>>
>> Hello Arun and Hannes,
>>
>> Apparently the current blk_mq_alloc_tag_set() implementation is well suited
>> for drivers like NVMe and ib_srp but not for traditional SCSI HBA drivers.
>> How about adding a BLK_MQ_F_* flag that tells __blk_mq_alloc_rq_maps() to
>> allocate a single set of tags for all hardware queues and also to add a flag
>> to struct scsi_host_template such that SCSI LLDs can enable this behavior?
>>
> 
> Hi Bart,
> 
> This would certainly be beneficial in my case. Moreover, it certainly 
> makes sense to move the logic up where multiple drivers can leverage it. 
> 
> Perhaps use percpu_ida* interfaces to do that, but I think I read 
> somewhere that it is not efficient (enough?), which is the reason block 
> tags went the current way.

You don't have to change the underlying tag generation to solve this
problem, Bart already pretty much outlined a fix that would work.
percpu_ida works fine if you never use more than roughly half the
available space, it's a poor fit for request tags where we want to
retain good behavior and scaling at or near tag exhaustion. That's why
blk-mq ended up rolling its own, which is now generically available as
lib/sbitmap.c.
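For reference, a minimal usage sketch of lib/sbitmap (v4.9+ API; the
depth is illustrative):

	#include <linux/sbitmap.h>

	static struct sbitmap_queue tags;

	static int example_init(void)
	{
		/* depth 1024, shift auto-selected (-1), no round-robin */
		return sbitmap_queue_init_node(&tags, 1024, -1, false,
					       GFP_KERNEL, NUMA_NO_NODE);
	}

	static int example_get_tag(void)
	{
		return __sbitmap_queue_get(&tags);	/* -1 when exhausted */
	}

	static void example_put_tag(int tag, unsigned int cpu)
	{
		sbitmap_queue_clear(&tags, tag, cpu);
	}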

-- 
Jens Axboe



Re: scsi-mq - tag# and can_queue, performance.

2017-04-03 Thread Arun Easi
On Mon, 3 Apr 2017, 8:20am, Bart Van Assche wrote:

> On Mon, 2017-04-03 at 09:29 +0200, Hannes Reinecke wrote:
> > On 04/03/2017 08:37 AM, Arun Easi wrote:
> > > If the above is true, then for a LLD to get tag# within its max-tasks 
> > > range, it has to report max-tasks / number-of-hw-queues in can_queue, and 
> > > in the I/O path, use the tag and hwq# to arrive at an index# to use. This, 
> > > though, leads to a poor use of tag resources -- the queue reaching its 
> > > capacity while the LLD can still take it.
> >
> > Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe)
> > HBAs. ATM the only 'real' solution to this problem is indeed have a
> > static split of the entire tag space by the number of hardware queues.
> > With the mentioned tag-starvation problem.
> 
> Hello Arun and Hannes,
> 
> Apparently the current blk_mq_alloc_tag_set() implementation is well suited
> for drivers like NVMe and ib_srp but not for traditional SCSI HBA drivers.
> How about adding a BLK_MQ_F_* flag that tells __blk_mq_alloc_rq_maps() to
> allocate a single set of tags for all hardware queues and also to add a flag
> to struct scsi_host_template such that SCSI LLDs can enable this behavior?
> 

Hi Bart,

This would certainly be beneficial in my case. Moreover, it certainly 
makes sense to move the logic up where multiple drivers can leverage it. 

Perhaps use percpu_ida* interfaces to do that, but I think I read 
somewhere that it is not efficient (enough?), which is the reason block 
tags went the current way.

Regards,
-Arun

Re: scsi-mq - tag# and can_queue, performance.

2017-04-03 Thread Arun Easi
On Mon, 3 Apr 2017, 12:29am, Hannes Reinecke wrote:

> On 04/03/2017 08:37 AM, Arun Easi wrote:
> > Hi Folks,
> > 
> > I would like to seek your input on a few topics on SCSI / block 
> > multi-queue.
> > 
> > 1. Tag# generation.
> > 
> > The context is with SCSI MQ on. My question is, what should a LLD do to 
> > get request tag values in the range 0 through can_queue - 1 across *all* 
> > of the queues. In our QLogic 41XXX series of adapters, we have a per 
> > session submit queue, a shared task memory (shared across all queues) and 
> > N completion queues (separate MSI-X vectors). We report N as the 
> > nr_hw_queues. I would like to, if possible, use the block layer tags to 
> > index into the above shared task memory area.
> > 
> > From looking at the scsi/block source, it appears that when a LLD reports 
> > a value say #C, in can_queue (via scsi_host_template), that value is used 
> > as the max depth when corresponding block layer queues are created. So, 
> > while SCSI restricts the number of commands to LLD at #C, the request tag 
> > generated across any of the queues can range from 0..#C-1. Please correct 
> > me if I got this wrong.
> > 
> > If the above is true, then for a LLD to get tag# within its max-tasks 
> > range, it has to report max-tasks / number-of-hw-queues in can_queue, and 
> > in the I/O path, use the tag and hwq# to arrive at an index# to use. This, 
> > though, leads to a poor use of tag resources -- the queue reaching its 
> > capacity while the LLD can still take it.
> > 
> Yep.
> 
> > blk_mq_unique_tag() would not work here, because it just puts the hwq# in 
> > the upper 16 bits, which need not fall in the max-tasks range.
> > 
> > Perhaps the current MQ model is to cater to a queue pair 
> > (submit/completion) kind of hardware model; nevertheless I would like to 
> > know how other hardware variants can makes use of it.
> > 
> Heh. Welcome to the club.
> 
> Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe)
> HBAs. ATM the only 'real' solution to this problem is indeed have a
> static split of the entire tag space by the number of hardware queues.
> With the mentioned tag-starvation problem.
> 
> If we were to continue with the tag to hardware ID mapping, we would
> need to implement a dynamic tag space mapping onto hardware queues.
> My idea for that would be to map not the entire tag space, but rather the
> individual bit words, onto the hardware queues. Then we could make the
> mapping dynamic, where the individual words are mapped onto the queues
> only as needed.
> However, the _one_ big problem we're facing here is timeouts.
> With the 1:1 mapping between tags and hardware IDs we can only re-use
> the tag once the timeout is _definitely_ resolved. But this means
> the command will be active, and we cannot return blk_mq_complete() until
> the timeout itself has been resolved.
> With FC this essentially means until the corresponding XIDs are safe to
> re-use, ie after all ABRT/RRQ etc processing has been completed.
> Hence we totally lose the ability to return the command itself with
> -ETIMEDOUT and continue with I/O processing even though the original XID
> is still being held by firmware.
> 
> In the light of this I wonder if it wouldn't be better to completely
> decouple block-layer tags and hardware IDs, and have an efficient
> algorithm mapping the block-layer tags onto hardware IDs.
> That should avoid the arbitrary tag starvation problem, and would allow
> us to handle timeouts efficiently.
> Of course, we don't _have_ such an efficient algorithm; maybe it's time
> to have a generic one within the kernel as quite some drivers would
> _love_ to just use the generic implementation here.
> (qla2xxx, lpfc, fcoe, mpt3sas etc all suffer from the same problem)
> 
> > 2. mq vs non-mq performance gain.
> > 
> > This is more like a poll, I guess. I was wondering what performance gains 
> > folks are observing with SCSI MQ on. I saw Christoph H.'s slide deck that 
> > has one slide that shows a 200k IOPS gain.
> > 
> > From my testing, though, I was not lucky to observe that big of a change. 
> > In fact, the difference was not even noticeable(*). For e.g., for 512 
> > bytes random read test, in both cases, gave me in the vicinity of 2M IOPs. 
> > When I say both cases, I meant, one with scsi_mod's use_blk_mq set to 0 
> > and another with 1 (LLD is reloaded when it is done). I only used one NUMA 
> > node for this run. The test was run on a x86_64 setup.
> > 
> You _really_ should have listened to my talk at VAULT.

Would you have a slide deck / minutes that could be shared?

> For 'legacy' HBAs there indeed is not much of a performance gain to be
> had; the max gain is indeed for heavy parallel I/O.

I have multiple devices (I-T nexuses) in my setup, so definitely there are 
parallel I/Os.

> And there even is a scheduler issue when running with a single
> submission thread; there I've measured a performance _drop_ by up to
> 50%. Which, as Jens claims, really looks like a block-layer issue rather
> than a generic problem.

[Bug 191471] Mp2Sas and Mpt3Sas (now mpxsas) drivers are spinning forever in the IRQ handler under load condition

2017-04-03 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=191471

Timur Tabi (ti...@codeaurora.org) changed:

           What    |Removed         |Added
----------------------------------------------------------------
                 CC|                |ti...@codeaurora.org

--- Comment #1 from Timur Tabi (ti...@codeaurora.org) ---
Created attachment 255737
  --> https://bugzilla.kernel.org/attachment.cgi?id=255737&action=edit
Proposed fix

-- 
You are receiving this mail because:
You are the assignee for the bug.


Re: scsi-mq - tag# and can_queue, performance.

2017-04-03 Thread Bart Van Assche
On Mon, 2017-04-03 at 09:29 +0200, Hannes Reinecke wrote:
> On 04/03/2017 08:37 AM, Arun Easi wrote:
> > If the above is true, then for a LLD to get tag# within it's max-tasks 
> > range, it has to report max-tasks / number-of-hw-queues in can_queue, and 
> > in the I/O path, use the tag and hwq# to arrive at a index# to use. This, 
> > though, leads to a poor use of tag resources -- queue reaching it's 
> > capacity while LLD can still take it.
>
> Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe)
> HBAs. ATM the only 'real' solution to this problem is indeed have a
> static split of the entire tag space by the number of hardware queues.
> With the mentioned tag-starvation problem.

Hello Arun and Hannes,

Apparently the current blk_mq_alloc_tag_set() implementation is well suited
for drivers like NVMe and ib_srp but not for traditional SCSI HBA drivers.
How about adding a BLK_MQ_F_* flag that tells __blk_mq_alloc_rq_maps() to
allocate a single set of tags for all hardware queues and also to add a flag
to struct scsi_host_template such that SCSI LLDs can enable this behavior?
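A hedged sketch of the idea (BLK_MQ_F_GLOBAL_TAGS is a hypothetical
name; no such flag exists in v4.11):

	static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
	{
		int i;

		if (set->flags & BLK_MQ_F_GLOBAL_TAGS) {	/* hypothetical */
			/* allocate one tag map and share it across all hctxs */
			if (!__blk_mq_alloc_rq_map(set, 0))
				return -ENOMEM;
			for (i = 1; i < set->nr_hw_queues; i++)
				set->tags[i] = set->tags[0];
			return 0;
		}

		/* ... existing per-hctx allocation path ... */
	}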

Bart.

Re: [PATCH] csiostor: switch to pci_alloc_irq_vectors

2017-04-03 Thread Varun Prakash
On Fri, Mar 31, 2017 at 12:25:27PM +0530, Christoph Hellwig wrote:
> On Fri, Jan 20, 2017 at 07:27:02PM -0500, Martin K. Petersen wrote:
> > > "Christoph" == Christoph Hellwig  writes:
> >
> > Christoph> And get automatic MSI-X affinity for free.
> >
> > Chelsio folks: Please review and test!
>
> ping!

The csiostor driver triggers a WARN_ON at drivers/pci/msi.c:1193 with
this patch:

1172 int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int 
min_vecs,
1173unsigned int max_vecs, unsigned int 
flags,
1174const struct irq_affinity *affd)
1175 {
1176 static const struct irq_affinity msi_default_affd;
1177 int vecs = -ENOSPC;
1178
1179 if (flags & PCI_IRQ_AFFINITY) {
...
1192 } else {
1193 if (WARN_ON(affd))
1194 affd = NULL;
1195 }


The PCI_IRQ_AFFINITY flag is missing:

+   cnt = pci_alloc_irq_vectors_affinity(hw->pdev, min, cnt, PCI_IRQ_MSIX,
+   );
+   if (cnt < 0)
return cnt;
-   }
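A hedged sketch of the corrected call (the irq_affinity descriptor
contents are an assumption):

+   static const struct irq_affinity affd = { .pre_vectors = 0 };
+
+   cnt = pci_alloc_irq_vectors_affinity(hw->pdev, min, cnt,
+                                        PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
+                                        &affd);
+   if (cnt < 0)
+       return cnt;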


[PATCH] sas: remove sas_domain_release_transport

2017-04-03 Thread Johannes Thumshirn
sas_domain_release_transport is unused since at least v3.13, remove it.

Signed-off-by: Johannes Thumshirn 
---
 drivers/scsi/libsas/sas_init.c | 7 ---
 include/scsi/libsas.h  | 1 -
 2 files changed, 8 deletions(-)

diff --git a/drivers/scsi/libsas/sas_init.c b/drivers/scsi/libsas/sas_init.c
index 15ef8e2..64e9cdd 100644
--- a/drivers/scsi/libsas/sas_init.c
+++ b/drivers/scsi/libsas/sas_init.c
@@ -566,13 +566,6 @@ sas_domain_attach_transport(struct 
sas_domain_function_template *dft)
 }
 EXPORT_SYMBOL_GPL(sas_domain_attach_transport);
 
-
-void sas_domain_release_transport(struct scsi_transport_template *stt)
-{
-   sas_release_transport(stt);
-}
-EXPORT_SYMBOL_GPL(sas_domain_release_transport);
-
 /* -- SAS Class register/unregister -- */
 
 static int __init sas_class_init(void)
diff --git a/include/scsi/libsas.h b/include/scsi/libsas.h
index dae99d7..dd0f72c 100644
--- a/include/scsi/libsas.h
+++ b/include/scsi/libsas.h
@@ -693,7 +693,6 @@ extern int sas_bios_param(struct scsi_device *,
  sector_t capacity, int *hsc);
 extern struct scsi_transport_template *
 sas_domain_attach_transport(struct sas_domain_function_template *);
-extern void sas_domain_release_transport(struct scsi_transport_template *);
 
 int  sas_discover_root_expander(struct domain_device *);
 
-- 
2.10.2



Re: [PATCH] libiscsi: use vzalloc for large allocations in iscsi_pool_init

2017-04-03 Thread Kyle Fortin
On Apr 3, 2017, at 9:41 AM, Johannes Thumshirn  wrote:
> 
> On Mon, Apr 03, 2017 at 06:30:21AM -0700, Kyle Fortin wrote:
>> iscsiadm session login can fail with the following error:
>> 
>> iscsiadm: Could not login to [iface: default, target: iqn.1986-03.com...
>> iscsiadm: initiator reported error (9 - internal error)
>> 
>> When /etc/iscsi/iscsid.conf sets node.session.cmds_max = 4096, it results
>> in 64K-sized kmallocs per session.  A system under fragmented slab
>> pressure may not have any 64K objects available and fail iscsiadm session
>> login. Even though memory objects of a smaller size are available, the
>> large order allocation ends up failing.
>> 
>> The kernel will print a warning and dump_stack, like below:
> 
> There is a series of patches in Andrew's mmotm tree, which introduces
> a kvmalloc() function, that does exactly what you're looking for.
> 
> Maybe you want to base your patch on top of it.
> 
> -- 
> Johannes Thumshirn  Storage
> jthumsh...@suse.de+49 911 74053 689
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
> Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Thanks Johannes.  I’ll take a look.
--
Kyle Fortin - Oracle Linux Engineering






Re: [PATCH] libiscsi: use vzalloc for large allocations in iscsi_pool_init

2017-04-03 Thread Johannes Thumshirn
On Mon, Apr 03, 2017 at 06:30:21AM -0700, Kyle Fortin wrote:
> iscsiadm session login can fail with the following error:
> 
> iscsiadm: Could not login to [iface: default, target: iqn.1986-03.com...
> iscsiadm: initiator reported error (9 - internal error)
> 
> When /etc/iscsi/iscsid.conf sets node.session.cmds_max = 4096, it results
> in 64K-sized kmallocs per session.  A system under fragmented slab
> pressure may not have any 64K objects available and fail iscsiadm session
> login. Even though memory objects of a smaller size are available, the
> large order allocation ends up failing.
> 
> The kernel will print a warning and dump_stack, like below:

There is a series of patches in Andrew's mmotm tree, which introduces
a kvmalloc() function, that does exactly what you're looking for.

Maybe you want to base your patch on top of it.
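For reference, a hedged sketch against that mmotm API (kvzalloc() and
kvfree(), later merged in v4.12); it falls back to vmalloc automatically
and would remove the need for the extra bool:

	q->pool = kvzalloc(num_arrays * max * sizeof(void *), GFP_KERNEL);
	if (q->pool == NULL)
		return -ENOMEM;

	/* ... and in iscsi_pool_free(): */
	kvfree(q->pool);	/* handles both kmalloc'd and vmalloc'd memory */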

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


[PATCH] libiscsi: use vzalloc for large allocations in iscsi_pool_init

2017-04-03 Thread Kyle Fortin
iscsiadm session login can fail with the following error:

iscsiadm: Could not login to [iface: default, target: iqn.1986-03.com...
iscsiadm: initiator reported error (9 - internal error)

When /etc/iscsi/iscsid.conf sets node.session.cmds_max = 4096, it results
in 64K-sized kmallocs per session.  A system under fragmented slab
pressure may not have any 64K objects available and fail iscsiadm session
login. Even though memory objects of a smaller size are available, the
large order allocation ends up failing.

The kernel will print a warning and dump_stack, like below:

iscsid: page allocation failure: order:4, mode:0xc0d0
CPU: 0 PID: 2456 Comm: iscsid Not tainted 4.1.12-61.1.28.el6uek.x86_64 #2
Call Trace:
 [] dump_stack+0x63/0x83
 [] warn_alloc_failed+0xea/0x140
 [] __alloc_pages_slowpath+0x409/0x760
 [] __alloc_pages_nodemask+0x2b1/0x2d0
 [] ? dev_attr_host_ipaddress+0x20/0xc722
 [] alloc_pages_current+0xaf/0x170
 [] alloc_kmem_pages+0x31/0xd0
 [] ? iscsi_transport_group+0x20/0xc7e2
 [] kmalloc_order+0x18/0x50
 [] kmalloc_order_trace+0x34/0xe0
 [] ? transport_remove_classdev+0x70/0x70
 [] __kmalloc+0x27d/0x2a0
 [] ? complete_all+0x4d/0x60
 [] iscsi_pool_init+0x69/0x160 [libiscsi]
 [] ? device_initialize+0xb0/0xd0
 [] iscsi_session_setup+0x180/0x2f4 [libiscsi]
 [] ? iscsi_max_lun+0x20/0xfa9e [iscsi_tcp]
 [] iscsi_sw_tcp_session_create+0xcf/0x150 [iscsi_tcp]
 [] ? iscsi_max_lun+0x20/0xfa9e [iscsi_tcp]
 [] iscsi_if_create_session+0x33/0xd0
 [] ? iscsi_max_lun+0x20/0xfa9e [iscsi_tcp]
 [] iscsi_if_recv_msg+0x508/0x8c0 [scsi_transport_iscsi]
 [] ? __alloc_pages_nodemask+0x19b/0x2d0
 [] ? __kmalloc_node_track_caller+0x209/0x2c0
 [] iscsi_if_rx+0x7c/0x200 [scsi_transport_iscsi]
 [] netlink_unicast+0x126/0x1c0
 [] netlink_sendmsg+0x36c/0x400
 [] sock_sendmsg+0x4d/0x60
 [] ___sys_sendmsg+0x30a/0x330
 [] ? handle_pte_fault+0x20c/0x230
 [] ? __handle_mm_fault+0x1bc/0x330
 [] ? handle_mm_fault+0xb2/0x1a0
 [] __sys_sendmsg+0x49/0x90
 [] SyS_sendmsg+0x19/0x20
 [] system_call_fastpath+0x12/0x71

Use vzalloc for iscsi_pool allocations larger than PAGE_SIZE.  This only
affects hosts using a non-standard larger /etc/iscsi/iscsid.conf
node.session.cmds_max value. Since iscsi_pool_init is also called to
allocate very small pools per cmd for r2t handling, it is best to retain
using kzalloc for those allocations.

Signed-off-by: Kyle Fortin 
Tested-by: Kyle Fortin 
Reviewed-by: Joseph Slember 
Reviewed-by: Lance Hartmann 
---
 drivers/scsi/libiscsi.c | 15 +--
 include/scsi/libiscsi.h |  1 +
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index 3fca34a675af..5a622ba2f10d 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2546,6 +2547,7 @@ int iscsi_eh_recover_target(struct scsi_cmnd *sc)
 iscsi_pool_init(struct iscsi_pool *q, int max, void ***items, int item_size)
 {
int i, num_arrays = 1;
+   int alloc_size;
 
memset(q, 0, sizeof(*q));
 
@@ -2555,7 +2557,13 @@ int iscsi_eh_recover_target(struct scsi_cmnd *sc)
 * the array. */
if (items)
num_arrays++;
-   q->pool = kzalloc(num_arrays * max * sizeof(void*), GFP_KERNEL);
+
+   alloc_size = num_arrays * max * sizeof(void *);
+   if (alloc_size > PAGE_SIZE) {
+   q->pool = vzalloc(alloc_size);
+   q->is_pool_vmalloc = true;
+   } else
+   q->pool = kzalloc(alloc_size, GFP_KERNEL);
if (q->pool == NULL)
return -ENOMEM;
 
@@ -2589,7 +2597,10 @@ void iscsi_pool_free(struct iscsi_pool *q)
 
for (i = 0; i < q->max; i++)
kfree(q->pool[i]);
-   kfree(q->pool);
+   if (q->is_pool_vmalloc)
+   vfree(q->pool);
+   else
+   kfree(q->pool);
 }
 EXPORT_SYMBOL_GPL(iscsi_pool_free);
 
diff --git a/include/scsi/libiscsi.h b/include/scsi/libiscsi.h
index 583875ea136a..e3421e527559 100644
--- a/include/scsi/libiscsi.h
+++ b/include/scsi/libiscsi.h
@@ -258,6 +258,7 @@ struct iscsi_pool {
struct kfifoqueue;  /* FIFO Queue */
void**pool; /* Pool of elements */
int max;/* Max number of elements */
+   boolis_pool_vmalloc;
 };
 
 /* Session's states */
-- 
1.8.3.1



Re: [PATCH v2] scsi: storvsc: Add support for FC rport.

2017-04-03 Thread Cathy Avery

On 04/03/2017 08:17 AM, Christoph Hellwig wrote:

if (host->transportt == fc_transport_template) {
+   struct fc_rport_identifiers ids = {
+   .roles = FC_PORT_ROLE_FCP_TARGET,
+   };

I don't think storvsc ever acts as FCP target.


In order to implement the workaround so that the SCSI scan works,
indicating FC_PORT_ROLE_FCP_TARGET as a role was necessary due to the
check in fc_scsi_scan_rport(). The idea here is to avoid making any
changes to the fc_transport driver, which was of some concern.



Thanks,

Cathy


Re: [PATCH v2] scsi: storvsc: Add support for FC rport.

2017-04-03 Thread Christoph Hellwig
>   if (host->transportt == fc_transport_template) {
> + struct fc_rport_identifiers ids = {
> + .roles = FC_PORT_ROLE_FCP_TARGET,
> + };

I don't think storvsc ever acts as FCP target.


[PATCH 1/2] scsi: convert unrecovered read error to -EILSEQ

2017-04-03 Thread Dmitry Monakhov
It is quite easy to get a URE (unrecovered read error) after a power
failure, along with a scary message. The URE happens due to an internal
drive CRC mismatch caused by a partial sector update. Most people
interpret such a message as "my drive is dying", which is a reasonable
assumption if your dmesg is full of complaints from disks and read(2)
returns EIO. In fact this error is not fatal; one can fix it easily by
rewriting the affected sector.

So we have to handle a URE as follows:
- Return EILSEQ to signal the caller that this is a bad-data problem.
- Do not retry the command, because that is useless.



### Test case
#Test uses two HDD: disks sdb sdc
#Write_phase
# let fio work ~100sec and then cut the power
fio --ioengine=libaio --direct=1 --rw=write --bs=1M --iodepth=16 \
--time_based=1 --runtime=600 --filesize=1G --size=1T \
--name /dev/sdb --name /dev/sdc

# Check_phase after system goes back
fio --ioengine=libaio --direct=1 --group_reporting --rw=read --bs=1M \
--iodepth=16 --size=1G --filesize=1G \
--name=/dev/sdb --name /dev/sdc

More info about URE probability here:
https://plus.google.com/101761226576930717211/posts/Pctq7kk1dLL

Signed-off-by: Dmitry Monakhov 
---
 drivers/scsi/scsi_lib.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 19125d7..59d64ad 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -961,6 +961,19 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned 
int good_bytes)
/* See SSC3rXX or current. */
action = ACTION_FAIL;
break;
+   case MEDIUM_ERROR:
+   if (sshdr.asc == 0x11) {
+   /* Handle unrecovered read error */
+   switch (sshdr.ascq) {
+   case 0x00: /* URE */
+   case 0x04: /* URE auto reallocate failed */
+   case 0x0B: /* URE recommend reassignment*/
+   case 0x0C: /* URE recommend rewrite the data */
+   action = ACTION_FAIL;
+   error = -EILSEQ;
+   break;
+   }
+   }
default:
action = ACTION_FAIL;
break;
-- 
2.9.3



[PATCH 2/2] block: Improve error handling verbosity

2017-04-03 Thread Dmitry Monakhov
EILSEQ is returned due to an internal checksum error on the disk/fabric,
so let's add a special message to distinguish it from the others. Also
dump the original numerical error code.

Signed-off-by: Dmitry Monakhov 
---
 block/blk-core.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 071a998..8eab846 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2576,13 +2576,16 @@ bool blk_update_request(struct request *req, int error, 
unsigned int nr_bytes)
case -ENODATA:
error_type = "critical medium";
break;
+   case -EILSEQ:
+   error_type = "bad data";
+   break;
case -EIO:
default:
error_type = "I/O";
break;
}
-   printk_ratelimited(KERN_ERR "%s: %s error, dev %s, sector 
%llu\n",
-  __func__, error_type, req->rq_disk ?
+   printk_ratelimited(KERN_ERR "%s: %s error (%d), dev %s, sector 
%llu\n",
+  __func__, error_type, error, req->rq_disk ?
   req->rq_disk->disk_name : "?",
   (unsigned long long)blk_rq_pos(req));
 
-- 
2.9.3



[PATCH v2] scsi: storvsc: Add support for FC rport.

2017-04-03 Thread Cathy Avery
Included in the current storvsc driver for Hyper-V is the ability
to access LUNs on an FC fabric via a virtualized Fibre Channel
adapter exposed by the Hyper-V host.
the FC transport to allow host and port names to be published under
/sys/class/fc_host/hostX. Current customer tools running on the VM
require that these names be available in the well known standard
location under fc_host/hostX.

A problem arose when attaching to the FC transport. The scsi_scan
code attempts to call fc_user_scan which has basically become a no-op
due to the fact that the virtualized FC device does not expose rports.
At this point you cannot refresh the scsi bus after mapping or unmapping
luns on the SAN without a reboot of the VM.

This patch stubs in an rport per fc_host in storvsc so that the
requirement of a defined rport is now met within the fc_transport and
echo "- - -" > /sys/class/scsi_host/hostX/scan now works.

Signed-off-by: Cathy Avery 
---
Changes since v1:
- Fix fc_rport_identifiers init [Stephen Hemminger]
- Better error checking
---
 drivers/scsi/storvsc_drv.c | 23 ++-
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 638e5f4..37646d1 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -478,6 +478,9 @@ struct storvsc_device {
 */
u64 node_name;
u64 port_name;
+#if IS_ENABLED(CONFIG_SCSI_FC_ATTRS)
+   struct fc_rport *rport;
+#endif
 };
 
 struct hv_host_device {
@@ -1816,19 +1819,27 @@ static int storvsc_probe(struct hv_device *device,
target = (device->dev_instance.b[5] << 8 |
 device->dev_instance.b[4]);
ret = scsi_add_device(host, 0, target, 0);
-   if (ret) {
-   scsi_remove_host(host);
-   goto err_out2;
-   }
+   if (ret)
+   goto err_out3;
}
 #if IS_ENABLED(CONFIG_SCSI_FC_ATTRS)
if (host->transportt == fc_transport_template) {
+   struct fc_rport_identifiers ids = {
+   .roles = FC_PORT_ROLE_FCP_TARGET,
+   };
+
fc_host_node_name(host) = stor_device->node_name;
fc_host_port_name(host) = stor_device->port_name;
+   stor_device->rport = fc_remote_port_add(host, 0, &ids);
+   if (!stor_device->rport)
+   goto err_out3;
}
 #endif
return 0;
 
+err_out3:
+   scsi_remove_host(host);
+
 err_out2:
/*
 * Once we have connected with the host, we would need to
@@ -1854,8 +1865,10 @@ static int storvsc_remove(struct hv_device *dev)
struct Scsi_Host *host = stor_device->host;
 
 #if IS_ENABLED(CONFIG_SCSI_FC_ATTRS)
-   if (host->transportt == fc_transport_template)
+   if (host->transportt == fc_transport_template) {
+   fc_remote_port_delete(stor_device->rport);
fc_remove_host(host);
+   }
 #endif
scsi_remove_host(host);
storvsc_dev_remove(dev);
-- 
2.5.0



Re: scsi-mq - tag# and can_queue, performance.

2017-04-03 Thread Hannes Reinecke
On 04/03/2017 08:37 AM, Arun Easi wrote:
> Hi Folks,
> 
> I would like to seek your input on a few topics on SCSI / block 
> multi-queue.
> 
> 1. Tag# generation.
> 
> The context is with SCSI MQ on. My question is, what should a LLD do to 
> get request tag values in the range 0 through can_queue - 1 across *all* 
> of the queues. In our QLogic 41XXX series of adapters, we have a per 
> session submit queue, a shared task memory (shared across all queues) and 
> N completion queues (separate MSI-X vectors). We report N as the 
> nr_hw_queues. I would like to, if possible, use the block layer tags to 
> index into the above shared task memory area.
> 
> From looking at the scsi/block source, it appears that when a LLD reports 
> a value say #C, in can_queue (via scsi_host_template), that value is used 
> as the max depth when corresponding block layer queues are created. So, 
> while SCSI restricts the number of commands to LLD at #C, the request tag 
> generated across any of the queues can range from 0..#C-1. Please correct 
> me if I got this wrong.
> 
> If the above is true, then for a LLD to get tag# within its max-tasks 
> range, it has to report max-tasks / number-of-hw-queues in can_queue, and 
> in the I/O path, use the tag and hwq# to arrive at an index# to use. This, 
> though, leads to a poor use of tag resources -- the queue reaching its 
> capacity while the LLD can still take it.
> 
Yep.

> blk_mq_unique_tag() would not work here, because it just puts the hwq# in 
> the upper 16 bits, which need not fall in the max-tasks range.
> 
> Perhaps the current MQ model is to cater to a queue pair 
> (submit/completion) kind of hardware model; nevertheless I would like to 
> know how other hardware variants can make use of it.
> 
Heh. Welcome to the club.

Shared tag sets continue to dog the block-mq on 'legacy' (ie non-NVMe)
HBAs. ATM the only 'real' solution to this problem is indeed to have a
static split of the entire tag space by the number of hardware queues --
with the mentioned tag-starvation problem.

If we were to continue with the tag to hardware ID mapping, we would
need to implement a dynamic tag space mapping onto hardware queues.
My idea would be to not map the entire tag space, but rather the
individual bit words, onto the hardware queues. Then we could make the
mapping dynamic, where the individual words are mapped onto the queues
only as needed.
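
A rough sketch of what I mean (purely hypothetical -- no such
infrastructure exists in the kernel today, and all names are made up):

#include <linux/spinlock.h>

#define TAGS_PER_WORD	64	/* one 64-bit word of the tag bitmap */

struct tag_word_map {
	spinlock_t lock;
	unsigned int nr_words;
	short *word_owner;	/* hwq owning each word, -1 if free */
};

/*
 * Claim a free bit word for @hwq; the 64 tags it covers then belong
 * to that queue until the word is handed back.
 */
static int tag_map_claim_word(struct tag_word_map *map, int hwq)
{
	unsigned int i;
	int ret = -1;

	spin_lock(&map->lock);
	for (i = 0; i < map->nr_words; i++) {
		if (map->word_owner[i] < 0) {
			map->word_owner[i] = hwq;
			ret = i;	/* covers tags i*64 .. i*64+63 */
			break;
		}
	}
	spin_unlock(&map->lock);

	return ret;
}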
However, the _one_ big problem we're facing here is timeouts.
With the 1:1 mapping between tags and hardware IDs we can only re-use
the tag once the timeout is _definitely_ resolved. But this means
the command stays active, and we cannot call blk_mq_complete() until
the timeout itself has been resolved.
With FC this essentially means until the corresponding XIDs are safe to
re-use, ie after all ABRT/RRQ etc processing has been completed.
Hence we totally lose the ability to return the command itself with
-ETIMEDOUT and continue with I/O processing even though the original XID
is still being held by firmware.

In the light of this I wonder if it wouldn't be better to completely
decouple block-layer tags and hardware IDs, and have an efficient
algorithm mapping the block-layer tags onto hardware IDs.
That should avoid the arbitrary tag starvation problem, and would allow
us to handle timeouts efficiently.
Of course, we don't _have_ such an efficient algorithm; maybe it's time
to have a generic one within the kernel, as quite a few drivers would
_love_ to just use a generic implementation here.
(qla2xxx, lpfc, fcoe, mpt3sas etc all suffer from the same problem)
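
To illustrate the decoupling, a minimal sketch (hypothetical LLD; the
my_lld_* names are made up, and ida_simple_get() is only a stand-in for
whatever efficient allocator we'd actually want in the hot path):

#include <linux/idr.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

#define MY_LLD_MAX_TASKS	2048	/* made-up firmware task limit */

static DEFINE_IDA(hw_id_pool);

static int my_lld_queuecommand(struct Scsi_Host *shost,
			       struct scsi_cmnd *sc)
{
	/* Hardware ID allocated independently of the blk-mq tag. */
	int hw_id = ida_simple_get(&hw_id_pool, 0, MY_LLD_MAX_TASKS,
				   GFP_ATOMIC);

	if (hw_id < 0)
		return SCSI_MLQUEUE_HOST_BUSY;

	sc->host_scribble = (unsigned char *)(unsigned long)hw_id;
	/* ... build the IOCB around hw_id, not around the blk-mq tag ... */
	return 0;
}

/* Called only once ABRT/RRQ processing says the XID is safe to re-use. */
static void my_lld_release_hw_id(struct scsi_cmnd *sc)
{
	ida_simple_remove(&hw_id_pool, (unsigned long)sc->host_scribble);
}

The timed-out command can then be failed back to the block layer right
away; only the hardware ID stays reserved until the XID is really free.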

> 2. mq vs non-mq performance gain.
> 
> This is more like a poll, I guess. I was wondering what performance gains 
> folks are observing with SCSI MQ on. I saw Christoph H.'s slide deck that 
> has one slide that shows a 200k IOPS gain.
> 
> From my testing, though, I was not lucky enough to observe that big of a 
> change. In fact, the difference was not even noticeable(*). E.g., a 
> 512-byte random read test gave me in the vicinity of 2M IOPS in both cases. 
> When I say both cases, I mean one with scsi_mod's use_blk_mq set to 0 
> and another with 1 (the LLD is reloaded in between). I only used one NUMA 
> node for this run. The test was run on an x86_64 setup.
> 
You _really_ should have listened to my talk at VAULT.
For 'legacy' HBAs there indeed is not much of a performance gain to be
had; the real gain shows up with heavy parallel I/O.
And there even is a scheduler issue when running with a single
submission thread; there I've measured a performance _drop_ of up to
50%, which, as Jens claims, really looks like a block-layer issue rather
than a generic problem.


> * See item 3 for special handling.
> 
> 3. add_random slowness.
> 
> One thing I observed with MQ on and off was this block layer tunable, 
> add_random, which, as I understand it, tunes the disk entropy contribution. 
> With non-MQ, it is turned on, and with MQ, it is turned off by default.
> 
> This got noticed 

scsi-mq - tag# and can_queue, performance.

2017-04-03 Thread Arun Easi
Hi Folks,

I would like to seek your input on a few topics on SCSI / block 
multi-queue.

1. Tag# generation.

The context is with SCSI MQ on. My question is, what should an LLD do to 
get request tag values in the range 0 through can_queue - 1 across *all* 
of the queues. In our QLogic 41XXX series of adapters, we have a per 
session submit queue, a shared task memory (shared across all queues) and 
N completion queues (separate MSI-X vectors). We report N as the 
nr_hw_queues. I would like to, if possible, use the block layer tags to 
index into the above shared task memory area.

From looking at the scsi/block source, it appears that when an LLD reports 
a value, say #C, in can_queue (via scsi_host_template), that value is used 
as the max depth when corresponding block layer queues are created. So, 
while SCSI restricts the number of commands to the LLD at #C, the request tag 
generated across any of the queues can range from 0..#C-1. Please correct 
me if I got this wrong.

If the above is true, then for an LLD to get a tag# within its max-tasks 
range, it has to report max-tasks / number-of-hw-queues in can_queue, and 
in the I/O path, use the tag and hwq# to arrive at an index# to use. This, 
though, leads to poor use of tag resources -- a queue reaching its 
capacity while the LLD can still take more.
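
To be concrete, something like this is what I mean (a sketch; the
my_lld_* names and per_queue_depth are made up, and it folds the
decomposed tag back together rather than using the raw unique tag):

#include <linux/blk-mq.h>
#include <scsi/scsi_cmnd.h>

/*
 * can_queue was reported as max-tasks / nr_hw_queues, so the per-queue
 * tag is 0 .. per_queue_depth - 1 and can be combined with the hwq#
 * into an index into the shared task memory.
 */
static u32 my_lld_task_index(struct scsi_cmnd *sc, u32 per_queue_depth)
{
	u32 unique = blk_mq_unique_tag(sc->request);
	u16 hwq = blk_mq_unique_tag_to_hwq(unique);
	u16 tag = blk_mq_unique_tag_to_tag(unique);

	return hwq * per_queue_depth + tag;
}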

blk_mq_unique_tag() would not work here, because it just puts the hwq# in 
the upper 16 bits, which need not fall in the max-tasks range.
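
(For reference, the value is composed as (hwq << 16) | tag, which the
helpers in include/linux/blk-mq.h take apart again:)

#include <linux/blk-mq.h>
#include <linux/printk.h>

static void show_unique_tag(struct request *rq)
{
	u32 unique = blk_mq_unique_tag(rq);		/* (hwq << 16) | tag */
	u16 hwq = blk_mq_unique_tag_to_hwq(unique);	/* unique >> 16 */
	u16 tag = blk_mq_unique_tag_to_tag(unique);	/* unique & 0xffff */

	/* e.g. tag 5 on hwq 2 gives 0x20005 -- far outside 0..max-tasks-1 */
	pr_info("unique=0x%x hwq=%u tag=%u\n", unique, hwq, tag);
}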

Perhaps the current MQ model is to cater to a queue pair 
(submit/completion) kind of hardware model; nevertheless I would like to 
know how other hardware variants can make use of it.

2. mq vs non-mq performance gain.

This is more like a poll, I guess. I was wondering what performance gains 
folks are observing with SCSI MQ on. I saw Christoph H.'s slide deck that 
has one slide that shows a 200k IOPS gain.

From my testing, though, I was not lucky enough to observe that big of a 
change. In fact, the difference was not even noticeable(*). E.g., a 
512-byte random read test gave me in the vicinity of 2M IOPS in both cases. 
When I say both cases, I mean one with scsi_mod's use_blk_mq set to 0 
and another with 1 (the LLD is reloaded in between). I only used one NUMA 
node for this run. The test was run on an x86_64 setup.

* See item 3 for special handling.

3. add_random slowness.

One thing I observed with MQ on and off was this block layer tunable, 
add_random, which, as I understand it, tunes the disk entropy contribution. 
With non-MQ, it is turned on, and with MQ, it is turned off by default.

This got noticed because, when I was running multi-port testing, there was 
a big drop in IOPS both with and without MQ (~200K IOPS vs. 1M+ IOPS, 
whether the test ran on the same NUMA node or across NUMA nodes).

Just wondering why we have it ON in one setting and OFF in the other.
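
(For reference, the asymmetry appears to come straight from the default
queue flags -- quoted from memory of the 4.11-era include/linux/blkdev.h,
so worth double-checking against the tree:)

#define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
				 (1 << QUEUE_FLAG_STACKABLE)	|	\
				 (1 << QUEUE_FLAG_SAME_COMP)	|	\
				 (1 << QUEUE_FLAG_ADD_RANDOM))

#define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
				 (1 << QUEUE_FLAG_SAME_COMP)	|	\
				 (1 << QUEUE_FLAG_POLL))

Legacy request queues start out with QUEUE_FLAG_DEFAULT, which includes
QUEUE_FLAG_ADD_RANDOM; blk-mq queues start out with QUEUE_FLAG_MQ_DEFAULT,
which does not.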

Sorry for the rather long e-mail, but your comments/thoughts are much 
appreciated.

Regards,
-Arun



Re: always use REQ_OP_WRITE_ZEROES for zeroing offload

2017-04-03 Thread Hannes Reinecke
On 03/31/2017 06:32 PM, Christoph Hellwig wrote:
> This series makes REQ_OP_WRITE_ZEROES the only zeroing offload
> supported by the block layer, and switches existing implementations
> of REQ_OP_DISCARD that correctly set discard_zeroes_data to it,
> removes incorrect discard_zeroes_data, and also switches WRITE SAME
> based zeroing in SCSI to this new method.
> 
> The series is against the block for-next tree.
> 
> A git tree is also available at:
> 
> git://git.infradead.org/users/hch/block.git discard-rework
> 
> Gitweb:
> 
> 
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/discard-rework
Thank you for doing this.

For this series:

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)