RE: Broadcom 9460 raid card takes too long at system resuming

2021-04-20 Thread Kashyap Desai
Hi Chris -

Most likely the behavior you described is tied to how much time the FW takes
to become ready.
In the case of actual init from a fresh boot, the FW already starts running
as soon as the system is powered on, and the user may not be aware of it.
By the time the OS boot reaches driver load on a fresh boot, enough time has
already been spent in system bring-up. This is not true in the case of
resume (hibernation).
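
For reference, that wait is essentially a bounded poll of the FW state
register; a minimal sketch of such a loop is below (names, masks and
timeouts are illustrative assumptions, not the driver's exact code):

/*
 * Illustrative sketch only: poll the controller status register until the
 * FW reports READY or a timeout expires. At resume the FW is still booting,
 * so this loop can legitimately spin for tens of seconds.
 * EXAMPLE_FW_STATE_MASK / EXAMPLE_FW_STATE_READY are assumed constants.
 */
static int example_wait_for_fw_ready(void __iomem *status_reg,
				     unsigned long timeout_ms)
{
	unsigned long waited_ms = 0;

	while ((readl(status_reg) & EXAMPLE_FW_STATE_MASK) !=
	       EXAMPLE_FW_STATE_READY) {
		if (waited_ms >= timeout_ms)
			return -ETIMEDOUT;	/* FW never became ready */
		msleep(20);			/* poll interval */
		waited_ms += 20;
	}
	return 0;
}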

Kashyap

> -Original Message-
> From: Chris Chiu [mailto:chris.c...@canonical.com]
> Sent: Monday, April 19, 2021 3:45 PM
> To: kashyap.de...@broadcom.com; sumit.sax...@broadcom.com;
> shivasharan.srikanteshw...@broadcom.com;
> megaraidlinux@broadcom.com
> Cc: linux-s...@vger.kernel.org; Linux Kernel
> 
> Subject: Broadcom 9460 raid card takes too long at system resuming
>
> Hi,
> We found that the Broadcom 9460 RAID card will take ~40 seconds in
> megasas_resume. It is mainly waiting for the FW to come to ready state,
> please refer to the following kernel log. The FW version is
> "megasas: 07.714.04.00-rc1". It seems that the
> megasas_transition_to_ready() loop costs ~40 seconds in megasas_resume.
> However, the same megasas_transition_to_ready() function only takes a few
> milliseconds to complete in megasas_init_fw(). The .read_fw_status_reg maps
> to megasas_read_fw_status_reg_fusion. I tried to add
> pci_enable_device_mem() and pci_set_master() before
> megasas_transition_to_ready() in megasas_resume() but it makes no
> difference.
>
> I don't really know what makes the difference between driver probe and
> resume. The lspci information of the raid controller is here
> https://gist.github.com/mschiu77/e74ec084cc925643add845fa4dc31ab6.
> Any suggestions about what I can do to find out the cause? Thanks.
>
> [   62.357688] megaraid_sas 0000:45:00.0: megasas_resume is called
> [   62.357719] megaraid_sas 0000:45:00.0: Waiting for FW to come to ready state
> [  104.382571] megaraid_sas 0000:45:00.0: FW now in Ready state
> [  104.382576] megaraid_sas 0000:45:00.0: 63 bit DMA mask and 63 bit consistent mask
> [  104.383350] megaraid_sas 0000:45:00.0: requested/available msix 33/33
> [  104.383669] megaraid_sas 0000:45:00.0: Performance mode :Latency
> [  104.383671] megaraid_sas 0000:45:00.0: FW supports sync cache: Yes
> [  104.383677] megaraid_sas 0000:45:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x4009
> [  104.550570] megaraid_sas 0000:45:00.0: FW provided supportMaxExtLDs: 1   max_lds: 64
> [  104.550574] megaraid_sas 0000:45:00.0: controller type   : MR(4096MB)
> [  104.550575] megaraid_sas 0000:45:00.0: Online Controller Reset(OCR)  : Enabled
> [  104.550577] megaraid_sas 0000:45:00.0: Secure JBOD support   : Yes
> [  104.550579] megaraid_sas 0000:45:00.0: NVMe passthru support : Yes
> [  104.550581] megaraid_sas 0000:45:00.0: FW provided TM TaskAbort/Reset timeout: 6 secs/60 secs
> [  104.550583] megaraid_sas 0000:45:00.0: JBOD sequence map support : Yes
> [  104.550585] megaraid_sas 0000:45:00.0: PCI Lane Margining support: No
> [  104.550999] megaraid_sas 0000:45:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x4000
>
> Chris




RE: problem booting 5.10

2020-12-09 Thread Kashyap Desai
> -Original Message-
> From: John Garry [mailto:john.ga...@huawei.com]
> Sent: Wednesday, December 9, 2020 9:22 PM
> To: Julia Lawall ; Kashyap Desai
> 
> Cc: Martin K. Petersen ; Linus Torvalds
> ; James E.J. Bottomley
> ; Linux Kernel Mailing List  ker...@vger.kernel.org>; nicolas.pa...@univ-grenoble-alpes.fr; linux-scsi
> ; Ming Lei ; Sumit Saxena
> ; Shivasharan S
> 
> Subject: Re: problem booting 5.10
>
> On 09/12/2020 15:44, Julia Lawall wrote:
> >
> > On Tue, 8 Dec 2020, John Garry wrote:
> >
> >> On 08/12/2020 22:51, Martin K. Petersen wrote:
> >>> Julia,
> >>>
> >>>> This solves the problem.  Starting from 5.10-rc7 and doing this
> >>>> revert, I get a kernel that boots.
> >> Hi Julia,
> >>
> >> Can you also please test Ming's patchset here (without the megaraid
> >> sas
> >> revert) when you get a chance:
> >> https://lore.kernel.org/linux-block/20201203012638.543321-1-ming.lei@redhat.com/
> > 5.10-rc7 plus these three commits boots fine.
> >
>
> Hi Julia,
>
> Ok, Thanks for the confirmation. A sort of relief.
>
> @Kashyap, It would be good if we could recreate this, just in case.

I already tested the series and the issue is fixed for me. The final patch
(V2) provided by Ming already carries a "Tested-by" tag from me.
I confirmed once again using the config file provided by Julia Lawall, with
the same result - the issue was recreated and is fixed by Ming's patch below.

https://lore.kernel.org/linux-block/20201203012638.543321-2-ming@redhat.com/

Kashyap

>
> Cheers,
> John





RE: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

2020-11-04 Thread Kashyap Desai
> >
> > v5.10-rc2 is also broken here.
>
> John, Kashyap, any update on this? If this is going to take a while to fix
> it
> proper, should I send a patch to revert this or at least disable the
> feature by
> default for megaraid_sas in the meantime, so it no longer breaks the
> existing
> systems out there?

I am trying to get similar h/w to try it out. All my current h/w works fine.
Give me a couple of days.
If this is not an obviously common issue and needs more time, we will go
with the module-parameter disable method.
I will let you know.

Kashyap




RE: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

2020-11-02 Thread Kashyap Desai
> On Wed, 2020-08-19 at 23:20 +0800, John Garry wrote:
> > From: Kashyap Desai 
> >
> > Fusion adapters can steer completions to individual queues, and we now
> > have support for shared host-wide tags.
> > So we can enable multiqueue support for fusion adapters.
> >
> > Once driver enable shared host-wide tags, cpu hotplug feature is also
> > supported as it was enabled using below patchsets - commit
> > bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are
> > offline")
> >
> > Currently driver has provision to disable host-wide tags using
> > "host_tagset_enable" module parameter.
> >
> > Once we do not have any major performance regression using host-wide
> > tags, we will drop the hand-crafted interrupt affinity settings.
> >
> > Performance is also meeting the expecatation - (used both none and
> > mq-deadline scheduler)
> > 24 Drive SSD on Aero with/without this patch can get 3.1M IOPs
> > 3 VDs consist of 8 SAS SSD on Aero with/without this patch can get
> > 3.1M IOPs.
> >
> > Signed-off-by: Kashyap Desai 
> > Signed-off-by: Hannes Reinecke 
> > Signed-off-by: John Garry 
>
> Reverting this commit fixed an issue that Dell Power Edge R6415 server
> with
> megaraid_sas is unable to boot.

I will take a look at this. BTW, can you try keeping the same patch but using
the module parameter "host_tagset_enable=0"?
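
For reference, the knob in question is a plain module parameter that gates
whether the shared host-wide tagset is advertised to the SCSI midlayer; a
minimal sketch of the pattern is below (the parameter name comes from the
patch, the surrounding helper is illustrative):

static int host_tagset_enable = 1;
module_param(host_tagset_enable, int, 0444);
MODULE_PARM_DESC(host_tagset_enable,
		 "Shared host tagset enable/disable. Default: enable(1)");

/* Illustrative helper: only expose multiple hardware queues when the
 * shared tagset is enabled; host_tagset_enable=0 falls back to the old
 * single-queue behaviour, which is what is being suggested above. */
static void example_setup_host_tagset(struct Scsi_Host *shost,
				      unsigned int nr_hw_queues)
{
	if (host_tagset_enable && nr_hw_queues > 1) {
		shost->host_tagset = 1;
		shost->nr_hw_queues = nr_hw_queues;
	}
}

Passing megaraid_sas.host_tagset_enable=0 on the kernel command line (or via
modprobe) is the equivalent of the suggestion above.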

Kashyap




RE: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs

2020-09-28 Thread Kashyap Desai
>
> John,
>
> > Have you had a chance to check these outstanding SCSI patches?
> >
> > scsi: megaraid_sas: Added support for shared host tagset for
cpuhotplug
> > scsi: scsi_debug: Support host tagset
> > scsi: hisi_sas: Switch v3 hw to MQ
> > scsi: core: Show nr_hw_queues in sysfs
> > scsi: Add host and host template flag 'host_tagset'
>
> These look good to me.
>
> Jens, feel free to merge.

Hi Jens, gentle ping. I am not able to find commits for the above-listed SCSI
patches. I want to use your repo, which has the above-mentioned patches, for
 patch submission.  Martin has Acked the SCSI patches.

>
> Acked-by: Martin K. Petersen 
>
> --
> Martin K. Petersen      Oracle Linux Engineering




RE: [PATCH 1/8] scsi: add a host / host template field for the virt boundary

2019-06-20 Thread Kashyap Desai
> -Original Message-
> From: megaraidlinux@broadcom.com
> [mailto:megaraidlinux@broadcom.com] On Behalf Of Ming Lei
> Sent: Tuesday, June 18, 2019 6:05 AM
> To: Christoph Hellwig 
> Cc: Martin K . Petersen ; Sagi Grimberg
> ; Max Gurtovoy ; Bart Van
> Assche ; linux-rdma ;
> Linux SCSI List ;
> megaraidlinux@broadcom.com; mpt-fusionlinux@broadcom.com;
> linux-hyp...@vger.kernel.org; Linux Kernel Mailing List  ker...@vger.kernel.org>
> Subject: Re: [PATCH 1/8] scsi: add a host / host template field for the
> virt
> boundary
>
> On Mon, Jun 17, 2019 at 8:21 PM Christoph Hellwig  wrote:
> >
> > This allows drivers setting it up easily instead of branching out to
> > block layer calls in slave_alloc, and ensures the upgraded
> > max_segment_size setting gets picked up by the DMA layer.
> >
> > Signed-off-by: Christoph Hellwig 
> > ---
> >  drivers/scsi/hosts.c | 3 +++
> >  drivers/scsi/scsi_lib.c  | 3 ++-
> >  include/scsi/scsi_host.h | 3 +++
> >  3 files changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
> > index ff0d8c6a8d0c..55522b7162d3 100644
> > --- a/drivers/scsi/hosts.c
> > +++ b/drivers/scsi/hosts.c
> > @@ -462,6 +462,9 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
> >  	else
> >  		shost->dma_boundary = 0xffffffff;
> >
> > +	if (sht->virt_boundary_mask)
> > +		shost->virt_boundary_mask = sht->virt_boundary_mask;
> > +
> >  	device_initialize(&shost->shost_gendev);
> >  	dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
> >  	shost->shost_gendev.bus = &scsi_bus_type;
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index 65d0a10c76ad..d333bb6b1c59 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -1775,7 +1775,8 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
> >  	dma_set_seg_boundary(dev, shost->dma_boundary);
> >
> >  	blk_queue_max_segment_size(q, shost->max_segment_size);
> > -	dma_set_max_seg_size(dev, shost->max_segment_size);
> > +	blk_queue_virt_boundary(q, shost->virt_boundary_mask);
> > +	dma_set_max_seg_size(dev, queue_max_segment_size(q));
>
> The patch looks fine, also suggest to make sure that max_segment_size is
> block-size aligned, and un-aligned max segment size has caused trouble on
> mmc.

I validated the changes on the latest and a few older series of controllers.
After the changes, I see that max_segment_size is updated:
find /sys/ -name max_segment_size  |grep sdb |xargs grep -r .
/sys/devices/pci:3a/:3a:00.0/:3b:00.0/:3c:04.0/:40:00.0/:41:00.0/:42:00.0/host0/target0:2:12/0:2:12:0/block/sdb/queue/max_segment_size:4294967295

I verified that a single SGE with a 1MB transfer length works fine.

Acked-by: Kashyap Desai <kashyap.de...@broadcom.com>

>
> Thanks,
> Ming Lei
>
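
For context, the field added by this patch is meant to be set declaratively
by the low-level driver in its host template instead of calling
blk_queue_virt_boundary() from slave_alloc; a minimal sketch is below
(hypothetical driver, with PAGE_SIZE - 1 as an NVMe PRP-style boundary):

/* Sketch: opting in to the new field from a SCSI host template. Only
 * virt_boundary_mask comes from the patch above; the template name and
 * queuecommand are illustrative placeholders. */
static struct scsi_host_template example_nvme_capable_template = {
	.module			= THIS_MODULE,
	.name			= "example_hba",
	.queuecommand		= example_queuecommand,
	/* NVMe PRP-style constraint: a segment must not cross a page */
	.virt_boundary_mask	= PAGE_SIZE - 1,
};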


RE: [PATCH 10/13] megaraid_sas: set virt_boundary_mask in the scsi host

2019-06-13 Thread Kashyap Desai
>
> So before I respin this series, can you help with a way to figure out for
> mpt3sas and megaraid if a given controller supports NVMe devices at all, so
> that we don't have to set the virt boundary if not?


In MegaRAID we have the enum below; VENTURA_SERIES and AERO_SERIES
support NVMe:

enum MR_ADAPTER_TYPE {
	MFI_SERIES = 1,
	THUNDERBOLT_SERIES = 2,
	INVADER_SERIES = 3,
	VENTURA_SERIES = 4,
	AERO_SERIES = 5,
};

In the mpt3sas driver we have the method below - if IOC Facts reports NVMe
device support in ProtocolFlags, we can consider it an HBA with NVMe drive
support:

ioc->facts.ProtocolFlags & MPI2_IOCFACTS_PROTOCOL_NVME_DEVICES
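
Putting the two answers together, the gating check being asked for could
look roughly like the sketch below; adapter_type and the enum values follow
the MegaRAID enum above, the mpt3sas check uses the IOC Facts flag quoted
above, and the wrapper names themselves are illustrative:

/* Sketch: decide whether a controller can have NVMe devices behind it,
 * so the virt boundary only needs to be set when it is actually useful. */
static bool megaraid_supports_nvme(struct megasas_instance *instance)
{
	return instance->adapter_type == VENTURA_SERIES ||
	       instance->adapter_type == AERO_SERIES;
}

static bool mpt3sas_supports_nvme(struct MPT3SAS_ADAPTER *ioc)
{
	return !!(ioc->facts.ProtocolFlags &
		  MPI2_IOCFACTS_PROTOCOL_NVME_DEVICES);
}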

Kashyap


RE: [RFC PATCH] irq/affinity: Mark the pre/post vectors as regular interrupts

2018-09-20 Thread Kashyap Desai
> This is the wrong direction as it does not allow to do initial affinity
> assignement for the non-managed interrupts on allocation time. And that's
> what Kashyap and Sumit are looking for.
>
> The trivial fix for the possible breakage when irq_default_affinity !=
> cpu_possible_mask is to set the affinity for the pre/post vectors to
> cpu_possible_mask and be done with it.
>
> One other thing I noticed while staring at that is the fact, that the PCI
> code does not care about the return value of irq_create_affinity_masks() at
> all. It just proceeds if masks is NULL as if the stuff was requested with
> the affinity descriptor pointer being NULL. I don't think that's a
> brilliant idea because the drivers might rely on the interrupt being
> managed, but it might be a non-issue and just result in bad locality.
>
> Christoph?
>
> So back to the problem at hand. We need to change the affinity management
> scheme in a way which lets us differentiate between managed and unmanaged
> interrupts, so it allows to automatically assign (initial) affinity to all
> of them.
>
> Right now everything is bound to the cpumasks array, which does not allow
> to convey more information. So we need to come up with something different.
>
> Something like the below (does not compile and is just for reference)
> should do the trick. I'm not sure whether we want to convey the information
> (masks and bitmap) in a single data structure or not, but that's an
> implementation detail.


Dou - Do you want me to test your patch, or shall I wait for the next
version?

>
> Thanks,
>
>   tglx
>
> 8<-
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -535,15 +535,16 @@ static struct msi_desc *
>  msi_setup_entry(struct pci_dev *dev, int nvec, const struct irq_affinity *affd)
>  {
>  	struct cpumask *masks = NULL;
> +	unsigned long managed = 0;
>  	struct msi_desc *entry;
>  	u16 control;
>
>  	if (affd)
> -		masks = irq_create_affinity_masks(nvec, affd);
> +		masks = irq_create_affinity_masks(nvec, affd, &managed);
>
>
>  	/* MSI Entry Initialization */
> -	entry = alloc_msi_entry(&dev->dev, nvec, masks);
> +	entry = alloc_msi_entry(&dev->dev, nvec, masks, managed);
>  	if (!entry)
>  		goto out;
>
> @@ -672,15 +673,22 @@ static int msix_setup_entries(struct pci
>  			      struct msix_entry *entries, int nvec,
>  			      const struct irq_affinity *affd)
>  {
> +	/*
> +	 * MSIX_MAX_VECTORS = 2048, i.e. 256 bytes. Might need runtime
> +	 * allocation. OTOH, are 2048 vectors realistic?
> +	 */
> +	DECLARE_BITMAP(managed, MSIX_MAX_VECTORS);
>  	struct cpumask *curmsk, *masks = NULL;
>  	struct msi_desc *entry;
>  	int ret, i;
>
>  	if (affd)
> -		masks = irq_create_affinity_masks(nvec, affd);
> +		masks = irq_create_affinity_masks(nvec, affd, managed);
>
>  	for (i = 0, curmsk = masks; i < nvec; i++) {
> -		entry = alloc_msi_entry(&dev->dev, 1, curmsk);
> +		unsigned long m = test_bit(i, managed) ? 1 : 0;
> +
> +		entry = alloc_msi_entry(&dev->dev, 1, curmsk, m);
>  		if (!entry) {
>  			if (!i)
>  				iounmap(base);
> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -27,7 +27,8 @@
>   * and the affinity masks from @affinity are copied.
>   */
>  struct msi_desc *
> -alloc_msi_entry(struct device *dev, int nvec, const struct cpumask *affinity)
> +alloc_msi_entry(struct device *dev, int nvec, const struct cpumask *affinity,
> +		unsigned long managed)
>  {
>  	struct msi_desc *desc;
>
> @@ -38,6 +39,7 @@ alloc_msi_entry(struct device *dev, int
>  	INIT_LIST_HEAD(&desc->list);
>  	desc->dev = dev;
>  	desc->nvec_used = nvec;
> +	desc->managed = managed;
>  	if (affinity) {
>  		desc->affinity = kmemdup(affinity,
>  			nvec * sizeof(*desc->affinity), GFP_KERNEL);
> @@ -416,7 +418,7 @@ int msi_domain_alloc_irqs(struct irq_dom
>
>  		virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
>  					       dev_to_node(dev), &arg, false,
> -					       desc->affinity);
> +					       desc->affinity, desc->managed);
>  		if (virq < 0) {
>  			ret = -ENOSPC;
>  			if (ops->handle_error)


RE: Affinity managed interrupts vs non-managed interrupts

2018-09-11 Thread Kashyap Desai
>
> The point I don't get here is why you need separate reply queues for
> the interrupt coalesce setting.  Shouldn't this just be a flag at
> submission time that indicates the amount of coalescing that should
> happen?
>
> What is the benefit of having different completion queues?

Having a different set of queues (something like N:16, where N queues run
without interrupt coalescing and 16 dedicated queues run with interrupt
coalescing) lets us avoid the penalty introduced by interrupt coalescing,
especially for lower QD profiles.
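
A rough sketch of the submission-path decision this implies, based on
per-device outstanding I/O, is below (the structure, fields and threshold
are illustrative assumptions, not the driver's actual code):

struct example_instance {
	u16 first_coalescing_queue;	/* index of the first of the 16 coalescing queues */
	u16 nr_regular_queues;		/* the N queues without coalescing */
	unsigned int coalescing_threshold;
};

static u16 example_pick_reply_queue(struct example_instance *inst,
				    unsigned int cpu,
				    unsigned int sdev_outstanding)
{
	/* High queue depth: interrupt coalescing costs little latency. */
	if (sdev_outstanding >= inst->coalescing_threshold)
		return inst->first_coalescing_queue + (cpu % 16);

	/* Low queue depth / single worker: latency matters, avoid coalescing. */
	return cpu % inst->nr_regular_queues;
}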

Kashyap


RE: Affinity managed interrupts vs non-managed interrupts

2018-08-31 Thread Kashyap Desai
> >
> > It is not yet finalized, but it can be based on per sdev outstanding,
> > shost_busy etc.
> > We want to use special 16 reply queue for IO acceleration (these queues
> > are working interrupt coalescing mode. This is a h/w feature)
>
> TBH, this does not make any sense whatsoever. Why are you trying to have
> extra interrupts for coalescing instead of doing the following:

Thomas,

We are using this feature mainly for performance and not for CPU hotplug
issues.
I read your points #1 to #4 below as mostly addressing CPU hotplug concerns,
right?  We also want to make sure that if we convert the megaraid_sas driver
from managed to non-managed interrupts, we can still meet the CPU hotplug
requirement.  If we use "pci_enable_msix_range" and manually set affinity in
the driver using irq_set_affinity_hint, the CPU hotplug feature works as
expected: irqbalancer is able to retain the older mapping, and whenever an
offlined CPU comes back, irqbalancer restores the same old mapping.

If we use all 72 reply queues (all in interrupt coalescing mode) without any
extra reply queues, we don't have any issue with the CPU to MSI-x mapping or
with CPU hotplug.
Our major problem with that method is that latency is very bad at lower QD
and/or in the single-worker case.

To solve that problem we have added an extra 16 reply queues (a special h/w
feature for performance only) which can work in interrupt coalescing mode,
while the existing 72 reply queues work without any interrupt coalescing.
The best way to map the additional 16 reply queues is to map them to the
local NUMA node.

I understand that it is a unique requirement, but at the same time we may be
able to do it gracefully (in the irq subsystem), since as you mentioned
"irq_set_affinity_hint" should be avoided in a low-level driver.



>
> 1) Allocate 72 reply queues which get nicely spread out to every CPU on the
>    system with affinity spreading.
>
> 2) Have a configuration for your reply queues which allows them to be
>    grouped, e.g. by phsyical package.
>
> 3) Have a mechanism to mark a reply queue offline/online and handle that on
>    CPU hotplug. That means on unplug you have to wait for the reply queue
>    which is associated to the outgoing CPU to be empty and no new requests
>    to be queued, which has to be done for the regular per CPU reply queues
>    anyway.
>
> 4) On queueing the request, flag it 'coalescing' which causes the
>    hard/firmware to direct the reply to the first online reply queue in the
>    group.
>
> If the last CPU of a group goes offline, then the normal hotplug mechanism
> takes effect and the whole thing is put 'offline' as well. This works
> nicely for all kind of scenarios even if you have more CPUs than queues. No
> extras, no magic affinity hints, it just works.
>
> Hmm?
>
> > Yes. We did not used " pci_alloc_irq_vectors_affinity".
> > We used " pci_enable_msix_range" and manually set affinity in driver
using
> > irq_set_affinity_hint.
>
> I still regret the day when I merged that abomination.

Is it possible to have a similar mapping in the managed interrupt case, as
below?

for (i = 0; i < 16; i++)
	irq_set_affinity_hint(pci_irq_vector(instance->pdev, i),
			      cpumask_of_node(local_numa_node));

Currently we always see that the managed pre-vectors have an affinity mask
of CPUs 0-71 and the effective CPU is always 0.
We want some changes in the current API which would allow us to pass flags
(like *local numa affinity*) so that the CPU to MSI-x mapping comes from the
local NUMA node and the effective CPUs are spread across the local NUMA node.

>
> Thanks,
>
>   tglx


RE: Affinity managed interrupts vs non-managed interrupts

2018-08-30 Thread Kashyap Desai
Hi Thomas, Ming, Chris et all,

Your input will help us make changes to the megaraid_sas driver.  We are
currently waiting for community response.

Is it recommended to use " pci_enable_msix_range" and have the low-level
driver do the affinity setting, given that the current APIs around
pci_alloc_irq_vectors do not meet our requirement?

We want more MSI-x vectors than online CPUs, and using pre_vectors we can do
that, but the first 16 MSI-x vectors should be mapped to the local NUMA node
with the effective CPUs spread across the CPUs of the local NUMA node. This
is not possible using pci_alloc_irq_vectors_affinity; see the sketch below.
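
For reference, the allocation being discussed looks roughly like the sketch
below (the vector counts of 16 pre_vectors plus 72 managed vectors are taken
from this thread; the surrounding code is illustrative):

/* Sketch: 16 extra (non-managed) pre_vectors in front of the managed,
 * affinity-spread vectors. As described above, the pre_vectors end up
 * affine to all online CPUs with effective CPU 0, which is the behaviour
 * this thread wants to change. */
struct irq_affinity desc = { .pre_vectors = 16 };
int nvec;

nvec = pci_alloc_irq_vectors_affinity(instance->pdev,
				      16 + 1,		/* minimum acceptable */
				      16 + 72,		/* 16 extra + one per CPU */
				      PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
				      &desc);
if (nvec < 0)
	return nvec;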

Do we need kernel API changes, or should the low-level driver manage it via
irq_set_affinity_hint?

Kashyap

> -Original Message-
> From: Sumit Saxena [mailto:sumit.sax...@broadcom.com]
> Sent: Wednesday, August 29, 2018 4:46 AM
> To: Ming Lei
> Cc: t...@linutronix.de; h...@lst.de; linux-kernel@vger.kernel.org; Kashyap
> Desai; Shivasharan Srikanteshwara
> Subject: RE: Affinity managed interrupts vs non-managed interrupts
>
> > -Original Message-
> > From: Ming Lei [mailto:ming@redhat.com]
> > Sent: Wednesday, August 29, 2018 2:16 PM
> > To: Sumit Saxena 
> > Cc: t...@linutronix.de; h...@lst.de; linux-kernel@vger.kernel.org
> > Subject: Re: Affinity managed interrupts vs non-managed interrupts
> >
> > Hello Sumit,
> Hi Ming,
> Thanks for response.
> >
> > On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote:
> > >  Affinity managed interrupts vs non-managed interrupts
> > >
> > > Hi Thomas,
> > >
> > > We are working on next generation MegaRAID product where requirement
> > > is- to allocate additional 16 MSI-x vectors in addition to number of
> > > MSI-x vectors megaraid_sas driver usually allocates.  MegaRAID adapter
> > > supports 128 MSI-x vectors.
> > >
> > > To explain the requirement and solution, consider that we have 2
> > > socket system (each socket having 36 logical CPUs). Current driver
> > > will allocate total 72 MSI-x vectors by calling API-
> > > pci_alloc_irq_vectors(with flag- PCI_IRQ_AFFINITY).  All 72 MSI-x
> > > vectors will have affinity across NUMA nodes and interrupts are
> > > affinity managed.
> > >
> > > If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors =
> > > 16 and, driver can allocate 16 + 72 MSI-x vectors.
> >
> > Could you explain a bit what the specific use case the extra 16 vectors
> > is?
> We are trying to avoid the penalty due to one interrupt per IO completion
> and decided to coalesce interrupts on these extra 16 reply queues.
> For regular 72 reply queues, we will not coalesce interrupts as for low IO
> workload, interrupt coalescing may take more time due to less IO
> completions.
> In IO submission path, driver will decide which set of reply queues
> (either extra 16 reply queues or regular 72 reply queues) to be picked
> based on IO workload.
> >
> > >
> > > All pre_vectors (16) will be mapped to all available online CPUs but
> > > effective affinity of each vector is to CPU 0. Our requirement is to
> > > have pre_vectors 16 reply queues to be mapped to local NUMA node with
> > > effective CPU should be spread within local node cpu mask. Without
> > > changing kernel code, we can
> >
> > If all CPUs in one NUMA node is offline, can this use case work as
> > expected?
> > Seems we have to understand what the use case is and how it works.
>
> Yes, if all CPUs of the NUMA node is offlined, IRQ-CPU affinity will be
> broken and irqbalancer takes care of migrating affected IRQs to online
> CPUs of different NUMA node.
> When offline CPUs are onlined again, irqbalancer restores affinity.
> >
> >
> > Thanks,
> > Ming



RE: system hung up when offlining CPUs

2017-09-13 Thread Kashyap Desai
>
> On 09/12/2017 08:15 PM, YASUAKI ISHIMATSU wrote:
> > + linux-scsi and maintainers of megasas
> >
> > When offlining CPU, I/O stops. Do you have any ideas?
> >
> > On 09/07/2017 04:23 PM, YASUAKI ISHIMATSU wrote:
> >> Hi Mark and Christoph,
> >>
> >> Sorry for the late reply. I appreciated that you fixed the issue on kvm
> environment.
> >> But the issue still occurs on physical server.
> >>
> >> Here ares irq information that I summarized megasas irqs from
> >> /proc/interrupts and /proc/irq/*/smp_affinity_list on my server:
> >>
> >> ---
> >> IRQ  affinity_list  IRQ_TYPE
> >>  42  0-5            IR-PCI-MSI 1048576-edge megasas
> >>  43  0-5            IR-PCI-MSI 1048577-edge megasas
> >>  44  0-5            IR-PCI-MSI 1048578-edge megasas
> >>  45  0-5            IR-PCI-MSI 1048579-edge megasas
> >>  46  0-5            IR-PCI-MSI 1048580-edge megasas
> >>  47  0-5            IR-PCI-MSI 1048581-edge megasas
> >>  48  0-5            IR-PCI-MSI 1048582-edge megasas
> >>  49  0-5            IR-PCI-MSI 1048583-edge megasas
> >>  50  0-5            IR-PCI-MSI 1048584-edge megasas
> >>  51  0-5            IR-PCI-MSI 1048585-edge megasas
> >>  52  0-5            IR-PCI-MSI 1048586-edge megasas
> >>  53  0-5            IR-PCI-MSI 1048587-edge megasas
> >>  54  0-5            IR-PCI-MSI 1048588-edge megasas
> >>  55  0-5            IR-PCI-MSI 1048589-edge megasas
> >>  56  0-5            IR-PCI-MSI 1048590-edge megasas
> >>  57  0-5            IR-PCI-MSI 1048591-edge megasas
> >>  58  0-5            IR-PCI-MSI 1048592-edge megasas
> >>  59  0-5            IR-PCI-MSI 1048593-edge megasas
> >>  60  0-5            IR-PCI-MSI 1048594-edge megasas
> >>  61  0-5            IR-PCI-MSI 1048595-edge megasas
> >>  62  0-5            IR-PCI-MSI 1048596-edge megasas
> >>  63  0-5            IR-PCI-MSI 1048597-edge megasas
> >>  64  0-5            IR-PCI-MSI 1048598-edge megasas
> >>  65  0-5            IR-PCI-MSI 1048599-edge megasas
> >>  66  24-29          IR-PCI-MSI 1048600-edge megasas
> >>  67  24-29          IR-PCI-MSI 1048601-edge megasas
> >>  68  24-29          IR-PCI-MSI 1048602-edge megasas
> >>  69  24-29          IR-PCI-MSI 1048603-edge megasas
> >>  70  24-29          IR-PCI-MSI 1048604-edge megasas
> >>  71  24-29          IR-PCI-MSI 1048605-edge megasas
> >>  72  24-29          IR-PCI-MSI 1048606-edge megasas
> >>  73  24-29          IR-PCI-MSI 1048607-edge megasas
> >>  74  24-29          IR-PCI-MSI 1048608-edge megasas
> >>  75  24-29          IR-PCI-MSI 1048609-edge megasas
> >>  76  24-29          IR-PCI-MSI 1048610-edge megasas
> >>  77  24-29          IR-PCI-MSI 1048611-edge megasas
> >>  78  24-29          IR-PCI-MSI 1048612-edge megasas
> >>  79  24-29          IR-PCI-MSI 1048613-edge megasas
> >>  80  24-29          IR-PCI-MSI 1048614-edge megasas
> >>  81  24-29          IR-PCI-MSI 1048615-edge megasas
> >>  82  24-29          IR-PCI-MSI 1048616-edge megasas
> >>  83  24-29          IR-PCI-MSI 1048617-edge megasas
> >>  84  24-29          IR-PCI-MSI 1048618-edge megasas
> >>  85  24-29          IR-PCI-MSI 1048619-edge megasas
> >>  86  24-29          IR-PCI-MSI 1048620-edge megasas
> >>  87  24-29          IR-PCI-MSI 1048621-edge megasas
> >>  88  24-29          IR-PCI-MSI 1048622-edge megasas
> >>  89  24-29          IR-PCI-MSI 1048623-edge megasas
> >> ---
> >>
> >> In my server, IRQ#66-89 are sent to CPU#24-29. And if I offline
> >> CPU#24-29, I/O does not work, showing the following messages.
> >>
> >> ---
> >> [...] sd 0:2:0:0: [sda] tag#1 task abort called for scmd(8820574d7560)
> >> [...] sd 0:2:0:0: [sda] tag#1 CDB: Read(10) 28 00 0d e8 cf 78 00 00 08 00
> >> [...] sd 0:2:0:0: task abort: FAILED scmd(8820574d7560)
> >> [...] sd 0:2:0:0: [sda] tag#0 task abort called for scmd(882057426560)
> >> [...] sd 0:2:0:0: [sda] tag#0 CDB: Write(10) 2a 00 0d 58 37 00 00 00 08 00
> >> [...] sd 0:2:0:0: task abort: FAILED scmd(882057426560)
> >> [...] sd 0:2:0:0: target reset called for scmd(8820574d7560)
> >> [...] sd 0:2:0:0: [sda] tag#1 megasas: target reset FAILED!!
> >> [...] sd 0:2:0:0: [sda] tag#0 Controller reset is requested due to IO timeout
> >> [...] SCSI command pointer: (882057426560)   SCSI host state: 5 SCSI
> >> [...] IO request frame:
> >> [...]
> >> 
> >> [...]
> >> [...] megaraid_sas 0000:02:00.0: [ 0]waiting for 2 commands to complete for scsi0
> >> [...] INFO: task auditd:1200 blocked for more than 120 seconds.
> >> [...]   Not tainted 4.13.0+ #15
> >> [...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> [...] auditd  D0  1200  1 0x
> >> [...] Call Trace:
> >> [...]  __schedule+0x28d/0x890
> >> [...]  schedule+0x36/0x80
> >> [...]  io_schedule+0x16/0x40
> >> [...]  wait_on_page_bit_common+0x109/0x1c0
> >> [...]  ? page_cache_tree_insert+0xf0/0xf0
> >> [...]  __filemap_fdatawait_range+0x127/0x190
> >> [...]  ? __filemap_fdatawrite_range+0xd1/0x100
> >> [...]  file_write_and_wait_range+0x60/0xb0
> >> [...]  xfs_file_fsync+0x67/0x1d0 [xfs]
> >> [...]  


RE: [PATCH v2 00/13] mpt3sas driver NVMe support:

2017-08-07 Thread Kashyap Desai
> -Original Message-
> From: James Bottomley [mailto:james.bottom...@hansenpartnership.com]
> Sent: Monday, August 07, 2017 7:48 PM
> To: Kashyap Desai; Christoph Hellwig; Hannes Reinecke
> Cc: Suganath Prabu Subramani; martin.peter...@oracle.com; linux-
> s...@vger.kernel.org; Sathya Prakash Veerichetty; linux-
> ker...@vger.kernel.org; Chaitra Basappa; Sreekanth Reddy; linux-
> n...@lists.infradead.org
> Subject: Re: [PATCH v2 00/13] mpt3sas driver NVMe support:
>
> On Mon, 2017-08-07 at 19:26 +0530, Kashyap Desai wrote:
> > >
> > > -Original Message-
> > > From: James Bottomley [mailto:james.bottom...@hansenpartnership.com
> > > ]
> > > Sent: Saturday, August 05, 2017 8:12 PM
> > > To: Christoph Hellwig; Hannes Reinecke
> > > Cc: Suganath Prabu S; martin.peter...@oracle.com; linux-
> > > s...@vger.kernel.org; sathya.prak...@broadcom.com;
> > > kashyap.de...@broadcom.com; linux-kernel@vger.kernel.org;
> > > chaitra.basa...@broadcom.com; sreekanth.re...@broadcom.com; linux-
> > > n...@lists.infradead.org
> > > Subject: Re: [PATCH v2 00/13] mpt3sas driver NVMe support:
> > >
> > > On Sat, 2017-08-05 at 06:53 -0700, Christoph Hellwig wrote:
> > > >
> > > > On Wed, Aug 02, 2017 at 10:14:40AM +0200, Hannes Reinecke wrote:
> > > > >
> > > > >
> > > > > I'm not happy with this approach.
> > > > > NVMe devices should _not_ appear as SCSI devices; this will just
> > > > > confuse matters _and_ will be incompatible with 'normal' NVMe
> > > > > devices.
> > > > >
> > > > > Rather I would like to see the driver to hook into the existing
> > > > > NVMe framework (which essentially means to treat the mpt3sas as
> > > > > a weird NVMe-over-Fabrics HBA), and expose the NVMe devices like
> > > > > any other NVMe HBA.
> > > >
> > > > That doesn't make any sense.  The devices behind the mpt adapter
> > > > don't look like NVMe devices at all for the hosts - there are no
> > > > NVMe commands or queues involved at all, they hide behind the same
> > > > somewhat leaky scsi abstraction as other devices behind the mpt
> > > > controller.
> > >
> > > You might think about what we did for SAS: split the generic handler
> > > into two pieces, libsas for driving the devices, which mpt didn't
> > > need because of the fat firmware and the SAS transport class so mpt
> > > could at least show the same sysfs files as everything else for SAS
> > > devices.
> >
> >  Ventura generation of controllers are adding connectivity of NVME
> >  drives seamlessly and protocol handling is in Firmware.
> >  Same as SCSI to ATA translation done in firmware, Ventura controller
> >  is doing SCSI to NVME translation and for end user protocol handling
> >  is abstracted.
> >
> >  This product handles new Transport protocol (NVME) same as ATA and
> >  transport is abstracted for end user.
> >
> > NVME pass-through related driver code, it is just a big tunnel for
> > user space application. It is just a basic framework like SATA PASS-
> > Through in existing mpt3sas driver.
>
> I know how it works ... and I'm on record as not liking your SATL approach
> because we keep tripping across bugs in the SATL that we have to fix in the
> driver.

We discussed NVMe device support behind  with Hannes, and he
suggested describing the product behavior to a wider audience for awareness.
Just wanted to share the notes.

>
> However, at least for bot SAS and SATA they appear to the system as SCSI
> devices regardless of HBA, so we've largely smoothed over any problems if
> you transfer from mp3sas to another SAS/SATA controller.
>
> I believe your current proposal is to have NVMe devices appear as SCSI,
> which isn't how the native NVMe driver handles them at all.  This is going
> to have to be special cased in any tool designed to handle nvme devices and
> it's going to cause big problems if someone changes controller (or moves
> the device).  What's the proposal for making this as painless as possible?

We have to attempt this use case and see how it behaves. I have not tried
this, so I am not sure whether things are really bad or whether some tuning
would be enough. I will get back to you on this.

I understood the request as: we need udev rules to work well for the *same*
NVMe drive whether it is behind  or behind the native  driver.
Example: if a user has the OS installed on an NVMe drive that is exposed as a
SCSI disk behind the  driver, the system should still be able to boot
when the same NVMe drive is hooked up where it is detected by the native
 driver (and vice versa).

>
> James


RE: [PATCH v2 00/13] mpt3sas driver NVMe support:

2017-08-07 Thread Kashyap Desai
> -Original Message-
> From: James Bottomley [mailto:james.bottom...@hansenpartnership.com]
> Sent: Saturday, August 05, 2017 8:12 PM
> To: Christoph Hellwig; Hannes Reinecke
> Cc: Suganath Prabu S; martin.peter...@oracle.com; linux-
> s...@vger.kernel.org; sathya.prak...@broadcom.com;
> kashyap.de...@broadcom.com; linux-kernel@vger.kernel.org;
> chaitra.basa...@broadcom.com; sreekanth.re...@broadcom.com; linux-
> n...@lists.infradead.org
> Subject: Re: [PATCH v2 00/13] mpt3sas driver NVMe support:
>
> On Sat, 2017-08-05 at 06:53 -0700, Christoph Hellwig wrote:
> > On Wed, Aug 02, 2017 at 10:14:40AM +0200, Hannes Reinecke wrote:
> > >
> > > I'm not happy with this approach.
> > > NVMe devices should _not_ appear as SCSI devices; this will just
> > > confuse matters _and_ will be incompatible with 'normal' NVMe
> > > devices.
> > >
> > > Rather I would like to see the driver to hook into the existing NVMe
> > > framework (which essentially means to treat the mpt3sas as a weird
> > > NVMe-over-Fabrics HBA), and expose the NVMe devices like any other
> > > NVMe HBA.
> >
> > That doesn't make any sense.  The devices behind the mpt adapter don't
> > look like NVMe devices at all for the hosts - there are no NVMe
> > commands or queues involved at all, they hide behind the same somewhat
> > leaky scsi abstraction as other devices behind the mpt controller.
>
> You might think about what we did for SAS: split the generic handler into
> two pieces, libsas for driving the devices, which mpt didn't need because
> of the fat firmware and the SAS transport class so mpt could at least show
> the same sysfs files as everything else for SAS devices.

The Ventura generation of controllers adds connectivity for NVMe drives
seamlessly, and the protocol handling is in firmware.
Just as SCSI-to-ATA translation is done in firmware, the Ventura controller
does SCSI-to-NVMe translation, so for the end user the protocol handling is
abstracted.

This product handles the new transport protocol (NVMe) the same way as ATA,
and the transport is abstracted for the end user.

The NVMe pass-through related driver code is just a big tunnel for user-space
applications, a basic framework like the SATA pass-through in the existing
mpt3sas driver.

>
> Fortunately for NVMe it's very simple at the moment its just a couple of
> host files and wwid on the devices.
>
> James
>
>
> > The only additional leak is that the controller now supports NVMe-
> > like PRPs in additions to its existing two SGL formats.
> >


RE: [RESEND][PATCH 07/10][SCSI]mpt2sas: Added Reply Descriptor Post Queue (RDPQ) Array support

2017-04-27 Thread Kashyap Desai
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Martin K. Petersen
> Sent: Thursday, April 27, 2017 3:55 AM
> To: Sreekanth Reddy
> Cc: Martin K. Petersen; j...@kernel.org; linux-s...@vger.kernel.org;
linux-
> ker...@vger.kernel.org; Christoph Hellwig
> Subject: Re: [RESEND][PATCH 07/10][SCSI]mpt2sas: Added Reply Descriptor
> Post Queue (RDPQ) Array support
>
>
> Sreekanth,
>
> > We need to satisfy this condition on those system where 32 bit dma
> > consistent mask is not supported and it only supports 64 bit dma
> > consistent mask. So on these system we can't set
> > pci_set_consistent_dma_mask() to DMA_BIT_MASK(32).
>
> Which systems are you talking about?
>
> It seems a bit unrealistic to require all devices to support 64-bit DMA.

Martin - We have found that on a certain ARM64 platform all devices have
to support 64-bit DMA. I discussed this on linux-arm-kernel; below is the
thread.

http://marc.info/?l=linux-arm-kernel&m=148880763816046&w=2

ARM64 does not support SWIOTLB, and that is the reason we need to place
all DMA pools above 4GB.
E.g. if I map the crash kernel above 4GB on an x86_64 platform, devices
can still honor a 32-bit DMA mask since the arch-specific code in x86_64
supports SWIOTLB.
The same settings on the ARM64 platform fail with a 32-bit DMA mask.

In one particular ARM64 setup, I also see that the region below 4GB is
mapped to the SoC and the kernel components are mapped above the 4GB
region.

Can we add the below logic in the MR/IT drivers to meet this requirement?

- Driver will attempt a DMA buffer allocation above 4GB and check the
start and end physical addresses.
If the DMA buffer crosses the "same 4GB region" boundary (I mean the high
32 bits of the address should be constant for that region), the driver
will hold that region and attempt one more allocation.
If the second allocation also does not fit in the "same 4GB region", we
will give up on driver load. A rough sketch of this check is shown below.
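
To make the idea concrete, this is a minimal sketch of the boundary check
and single retry described above, assuming a coherent DMA allocation. The
helper name mr_alloc_same_4gb() and the exact retry policy are
illustrative only, not actual MR/IT driver code.

#include <linux/dma-mapping.h>
#include <linux/kernel.h>

/* Sketch only: helper name and retry policy are illustrative, not actual
 * MR/IT driver code. */
static void *mr_alloc_same_4gb(struct device *dev, size_t sz,
			       dma_addr_t *handle)
{
	dma_addr_t first_handle;
	void *first, *retry;

	first = dma_alloc_coherent(dev, sz, &first_handle, GFP_KERNEL);
	if (!first)
		return NULL;

	/* Start and end share the same upper 32 bits => same 4GB region. */
	if (upper_32_bits(first_handle) ==
	    upper_32_bits(first_handle + sz - 1)) {
		*handle = first_handle;
		return first;
	}

	/* Hold the crossing buffer so the second attempt lands elsewhere. */
	retry = dma_alloc_coherent(dev, sz, handle, GFP_KERNEL);
	dma_free_coherent(dev, sz, first, first_handle);

	if (retry &&
	    upper_32_bits(*handle) == upper_32_bits(*handle + sz - 1))
		return retry;

	if (retry)
		dma_free_coherent(dev, sz, retry, *handle);

	/* Second attempt also crosses a 4GB boundary: fail driver load. */
	return NULL;
}

Holding the first (crossing) buffer while retrying is what pushes the
allocator toward a different region; whichever buffer fails the check is
freed.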

Before we attempt the above logic, we would like to understand whether
there is any other reliable way to handle this in Linux.

Most of the time we are going to get the "same 4GB region", so we are OK
with detecting this corner case and bailing out of driver load. There is
no report of an issue from the field, but we wanted to protect against
such a failure in the future.

Thanks, Kashyap

>
> --
> Martin K. PetersenOracle Linux Engineering


RE: out of range LBA using sg_raw

2017-03-08 Thread Kashyap Desai
> -Original Message-
> From: Martin K. Petersen [mailto:martin.peter...@oracle.com]
> Sent: Wednesday, March 08, 2017 10:03 PM
> To: Kashyap Desai
> Cc: Christoph Hellwig; linux-kernel@vger.kernel.org; linux-
> s...@vger.kernel.org
> Subject: Re: out of range LBA using sg_raw
>
> >>>>> "Kashyap" == Kashyap Desai <kashyap.de...@broadcom.com> writes:
>
> Kashyap,
>
> Kashyap> I am just curious to know how badly we have to scrutinize each
> Kashyap> packet before sending to Fast Path as we are in IO path and
> Kashyap> recommend only important checks to be added.
>
> As Christoph pointed out, when the fast path is in use you assume the
role of
> the SCSI device. And therefore it is your responsibility to ensure that
the VD's
> capacity and other relevant constraints are being honored. Just like the
MR
> firmware and any attached disks would.

Martin -

I agree on this point. I am planning to study all possible sanity checks
of this kind in the driver for VDs rather than trying to fix only the one
specific scenario described here.
Do you think a fix in this area is good for kernel-stable as well, or
should it just stay in linux-next, since it is not so severe considering
real-world exposure?
I am trying to understand the priority and severity of this issue.

>
> It is a feature that there is no sanity checking in the sg interface.
> The intent is to be able to pass through commands directly to a device
and
> have the device act upon them. Including fail them if they don't make
any
> sense.

Understood; sg_raw is not designed to do sanity checking.

>
> PS. I'm really no fan of the fast path. It's super messy to have the VD
layout
> handled in two different places.
>
> --
> Martin K. PetersenOracle Linux Engineering


RE: out of range LBA using sg_raw

2017-03-08 Thread Kashyap Desai
> -Original Message-
> From: Christoph Hellwig [mailto:h...@infradead.org]
> Sent: Wednesday, March 08, 2017 9:37 PM
> To: Kashyap Desai
> Cc: Christoph Hellwig; linux-kernel@vger.kernel.org; linux-
> s...@vger.kernel.org
> Subject: Re: out of range LBA using sg_raw
>
> On Wed, Mar 08, 2017 at 09:29:28PM +0530, Kashyap Desai wrote:
> > Thanks Chris. It is understood to have sanity in driver, but how
> > critical such checks where SG_IO type interface send pass-through
request.
> ?
> > Are you suggesting as good to have sanity or very important as there
> > may be a real-time exposure other than SG_IO interface ? I am confused
> > over must or good to have check.
> > Also one more fault I can generate using below sg_raw command -
>
> SCSI _devices_ need to sanity check any input and fail commands instead
of
> crashing or causing other problems.  Normal SCSI HBA drivers don't need
to
> do that as they don't interpret CDBs.  Megaraid (and a few other raid
drivers)
> are special in that they take on part of the device functionality and do
> interpret CDBs sometimes.  In that case you'll need to do all that
sanity
> checking and generate proper errors.
>
> It would be nice to have come common helpers for this shared between
> everyone interpreting SCSI CBD (e.g. the SCSI target code, the NVMe SCSI
> emulation and the various RAID drivers).

Thanks Chris.  I will  continue on this and will come back with changes.
Let me check with Broadcom internally and figure out all possible
scenarios for megaraid_sas.

Thanks, Kashyap


RE: out of range LBA using sg_raw

2017-03-08 Thread Kashyap Desai
> -Original Message-
> From: Bart Van Assche [mailto:bart.vanass...@sandisk.com]
> Sent: Wednesday, March 08, 2017 9:35 PM
> To: h...@infradead.org; kashyap.de...@broadcom.com
> Cc: linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: out of range LBA using sg_raw
>
> On Wed, 2017-03-08 at 21:29 +0530, Kashyap Desai wrote:
> > Also one more fault I can generate using below sg_raw command -
> >
> > "sg_raw -r 32k /dev/sdx 28 00 01 4f ff ff 00 00 08 00"
> >
> > Provide more scsi data length compare to actual SG buffer. Do you
> > suggest such SG_IO interface vulnerability is good to be captured in
driver.
>
> That's not a vulnerability of the SG I/O interface. A SCSI device has to
set the
> residual count correctly if the SCSI data length does not match the size
of the
> data buffer.

Thanks Bart.  I will pass this information to the Broadcom firmware
developers. Maybe the Tx/Rx (DMA) related code in MR (and also in the
Fusion IT HBA) cannot handle this case because some sanity checks are not
passed. A sketch of how such an underrun could be reported back through
the midlayer residual count is shown below.
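
For reference, here is a minimal sketch of how an LLD could report such an
underrun to the SCSI midlayer through the residual count. bytes_xferred is
an assumed parameter (the byte count the firmware actually moved); this is
not actual megaraid_sas completion code.

#include <linux/types.h>
#include <scsi/scsi_cmnd.h>

/* Sketch only: bytes_xferred is assumed to come from the firmware
 * completion; this is not actual driver completion code. */
static void report_underrun(struct scsi_cmnd *scmd, u32 bytes_xferred)
{
	u32 requested = scsi_bufflen(scmd);

	if (bytes_xferred < requested)
		/* Residual = requested length minus what actually moved. */
		scsi_set_resid(scmd, requested - bytes_xferred);
	else
		scsi_set_resid(scmd, 0);
}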

>
> Bart.


RE: out of range LBA using sg_raw

2017-03-08 Thread Kashyap Desai
> -Original Message-
> From: Christoph Hellwig [mailto:h...@infradead.org]
> Sent: Wednesday, March 08, 2017 8:41 PM
> To: Kashyap Desai
> Cc: linux-kernel@vger.kernel.org; linux-s...@vger.kernel.org
> Subject: Re: out of range LBA using sg_raw
>
> Hi Kashyap,
>
> for SG_IO passthrough requests we can't validate command validity for
> commands as the block layer treats them as opaque.  The SCSI device
> implementation needs to handle incorrect parameter to be robust.
>
> For your fast path bypass the megaraid driver assumes part of the SCSI
device
> implementation, so it will have to check for validity.

Thanks Chris. It is understood that the driver should have sanity checks,
but how critical are such checks when an SG_IO type interface sends a
pass-through request?
Are you suggesting the sanity checks are merely good to have, or very
important because there may be real-world exposure beyond the SG_IO
interface? I am confused over whether the check is a must-have or just
good to have.
Also, one more fault I can generate using the below sg_raw command -

"sg_raw -r 32k /dev/sdx 28 00 01 4f ff ff 00 00 08 00"

This provides a larger SCSI data length than the actual SG buffer. Do you
suggest that such an SG_IO interface vulnerability is also worth catching
in the driver?

I am just curious to know how heavily we have to scrutinize each packet
before sending it to the fast path, as we are in the IO path, and I would
recommend adding only the important checks.

Thanks, Kashyap


out of range LBA using sg_raw

2017-03-08 Thread Kashyap Desai
Hi -

I need help to understand whether the below is something we should
consider fixing in the megaraid_sas driver, or treat as an unrealistic
exposure.

I have created a sliced VD of size 10GB (RAID 1) using 2 drives. Each
physical drive is 256GB in size.

The last LBA of the VD and of the actual physical disk associated with
that VD are different. The actual physical disk has a larger LBA range
than the VD.

Below is the read-capacity detail of VD0:

# sg_readcap /dev/sdu
Read Capacity results:
   Last logical block address=20971519 (0x13fffff), Number of
blocks=20971520
   Logical block length=512 bytes
Hence:
   Device size: 10737418240 bytes, 10240.0 MiB, 10.74 GB

Using the below sg_raw command, we should see "LBA out of range" sense. In
the 0x28 (READ(10)) CDB, I pass an LBA beyond the last LBA of the VD
(0x13fffff).

sg_raw -r 4k /dev/sdx 28 00 01 4f ff ff 00 00 08 00

It works if the VD created behind the MR controller does not support fast
path write.
In the fast path write case, the driver converts the VD LBA to the
underlying physical disk LBA and sends the IO directly to the physical
disk. Since the physical disk has a large enough LBA range to respond, it
will not return the "LBA out of range" sense.

The megaraid_sas driver never validates the LBA range for a VD, as it
assumes this is validated by the upper layers of the SCSI stack. Other sg
tools like sg_dd, sg_write, dd etc. have LBA range checks, so the driver
never receives an out-of-range LBA through them. A sketch of what such a
driver-side check might look like is shown below.
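
For illustration only, here is a minimal sketch of the kind of bounds
check being discussed, assuming a READ(10)/WRITE(10) CDB and that the
total number of VD blocks is already known to the driver. The helper name
and the vd_blocks parameter are hypothetical; this is not actual
megaraid_sas code.

#include <asm/unaligned.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_proto.h>

/* Sketch only: 'vd_blocks' (number of blocks in the VD, from the RAID
 * map / READ CAPACITY data) is an assumed parameter. */
static bool vd_lba_in_range(struct scsi_cmnd *scmd, u64 vd_blocks)
{
	const u8 *cdb = scmd->cmnd;
	u64 lba;
	u32 num_blocks;

	/* Only 10-byte read/write CDBs are decoded in this sketch. */
	if (cdb[0] != READ_10 && cdb[0] != WRITE_10)
		return true;

	lba = get_unaligned_be32(&cdb[2]);
	num_blocks = get_unaligned_be16(&cdb[7]);

	/* Reject commands that start or end beyond the last VD block. */
	return lba + num_blocks <= vd_blocks;
}

A command failing this check could then be completed with ILLEGAL REQUEST
/ LOGICAL BLOCK ADDRESS OUT OF RANGE sense instead of being sent down the
fast path.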

What is the suggestion? Shall I add such a check in the megaraid_sas
driver, or is this not a valid scenario, since the "sg_raw" tool can send
any type of command and does not require multiple sanity checks in the
driver?

Thanks, Kashyap


RE: Device or HBA level QD throttling creates randomness in sequetial workload

2017-01-30 Thread Kashyap Desai
> -Original Message-
> From: Jens Axboe [mailto:ax...@kernel.dk]
> Sent: Monday, January 30, 2017 10:03 PM
> To: Bart Van Assche; osan...@osandov.com; kashyap.de...@broadcom.com
> Cc: linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org;
> h...@infradead.org; linux-bl...@vger.kernel.org; paolo.vale...@linaro.org
> Subject: Re: Device or HBA level QD throttling creates randomness in
> sequetial workload
>
> On 01/30/2017 09:30 AM, Bart Van Assche wrote:
> > On Mon, 2017-01-30 at 19:22 +0530, Kashyap Desai wrote:
> >> -   if (atomic_inc_return(>fw_outstanding) >
> >> -   instance->host->can_queue) {
> >> -   atomic_dec(>fw_outstanding);
> >> -   return SCSI_MLQUEUE_HOST_BUSY;
> >> -   }
> >> +   if (atomic_inc_return(>fw_outstanding) >
safe_can_queue) {
> >> +   is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
> >> +   /* For rotational device wait for sometime to get fusion
> >> + command
> >> from pool.
> >> +* This is just to reduce proactive re-queue at mid layer
> >> + which is
> >> not
> >> +* sending sorted IO in SCSI.MQ mode.
> >> +*/
> >> +   if (!is_nonrot)
> >> +   udelay(100);
> >> +   }
> >
> > The SCSI core does not allow to sleep inside the queuecommand()
> > callback function.
>
> udelay() is a busy loop, so it's not sleeping. That said, it's obviously
NOT a
> great idea. We want to fix the reordering due to requeues, not introduce
> random busy delays to work around it.

Thanks for the feedback. I do realize that udelay() is going to be very
odd in the queuecommand callback. I will keep this in mind. The preferred
solution is the blk-mq scheduler patches.
>
> --
> Jens Axboe


RE: Device or HBA level QD throttling creates randomness in sequetial workload

2017-01-30 Thread Kashyap Desai
Hi Jens/Omar,

I used the git.kernel.dk/linux-block branch blk-mq-sched (commit
0efe27068ecf37ece2728a99b863763286049ab5) and confirm that the issue
reported in this thread is resolved.

Now I am seeing that both MQ and SQ mode result in a sequential IO pattern
while IO is getting re-queued in the block layer.

To get similar performance without the blk-mq-sched feature, is it
reasonable to pause IO for a few usec in the LLD?
I mean, I want to avoid the driver asking the SML/block layer to re-queue
the IO (if it is sequential on rotational media).

Explaining w.r.t. the megaraid_sas driver: this driver exposes can_queue,
but it internally consumes commands for RAID 1 and the fast path.
In the worst case, can_queue/2 will consume all firmware resources and the
driver will re-queue further IOs to the SML as below -

   if (atomic_inc_return(>fw_outstanding) >
   instance->host->can_queue) {
   atomic_dec(>fw_outstanding);
   return SCSI_MLQUEUE_HOST_BUSY;
   }

I want to avoid above SCSI_MLQUEUE_HOST_BUSY.

Need your suggestion for below changes -

diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c
b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index 9a9c84f..a683eb0 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -54,6 +54,7 @@
 #include 
 #include 
 #include 
+#include 

 #include "megaraid_sas_fusion.h"
 #include "megaraid_sas.h"
@@ -2572,7 +2573,15 @@ void megasas_prepare_secondRaid1_IO(struct
megasas_instance *instance,
struct megasas_cmd_fusion *cmd, *r1_cmd = NULL;
union MEGASAS_REQUEST_DESCRIPTOR_UNION *req_desc;
u32 index;
-   struct fusion_context *fusion;
+   boolis_nonrot;
+   u32 safe_can_queue;
+   u32 num_cpus;
+   struct fusion_context *fusion;
+
+   fusion = instance->ctrl_context;
+
+   num_cpus = num_online_cpus();
+   safe_can_queue = instance->cur_can_queue - num_cpus;

fusion = instance->ctrl_context;

@@ -2584,11 +2593,15 @@ void megasas_prepare_secondRaid1_IO(struct
megasas_instance *instance,
return SCSI_MLQUEUE_DEVICE_BUSY;
}

-   if (atomic_inc_return(>fw_outstanding) >
-   instance->host->can_queue) {
-   atomic_dec(>fw_outstanding);
-   return SCSI_MLQUEUE_HOST_BUSY;
-   }
+   if (atomic_inc_return(>fw_outstanding) > safe_can_queue) {
+   is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
+   /* For rotational device wait for sometime to get fusion command
from pool.
+* This is just to reduce proactive re-queue at mid layer which is
not
+* sending sorted IO in SCSI.MQ mode.
+*/
+   if (!is_nonrot)
+   udelay(100);
+   }

cmd = megasas_get_cmd_fusion(instance, scmd->request->tag);

` Kashyap

> -Original Message-
> From: Kashyap Desai [mailto:kashyap.de...@broadcom.com]
> Sent: Tuesday, November 01, 2016 11:11 AM
> To: 'Jens Axboe'; 'Omar Sandoval'
> Cc: 'linux-s...@vger.kernel.org'; 'linux-kernel@vger.kernel.org'; 'linux-
> bl...@vger.kernel.org'; 'Christoph Hellwig'; 'paolo.vale...@linaro.org'
> Subject: RE: Device or HBA level QD throttling creates randomness in
> sequetial workload
>
> Jens- Replied inline.
>
>
> Omar -  I tested your WIP repo and figure out System hangs only if I pass
> "
> scsi_mod.use_blk_mq=Y". Without this, your WIP branch works fine, but I
> am looking for scsi_mod.use_blk_mq=Y.
>
> Also below is snippet of blktrace. In case of higher per device QD, I see
> Requeue request in blktrace.
>
> 65,128 10 6268 2.432404509 18594  P   N [fio]
>  65,128 10 6269 2.432405013 18594  U   N [fio] 1
>  65,128 10 6270 2.432405143 18594  I  WS 148800 + 8 [fio]
>  65,128 10 6271 2.432405740 18594  R  WS 148800 + 8 [0]
>  65,128 10 6272 2.432409794 18594  Q  WS 148808 + 8 [fio]
>  65,128 10 6273 2.432410234 18594  G  WS 148808 + 8 [fio]
>  65,128 10 6274 2.432410424 18594  S  WS 148808 + 8 [fio]
>  65,128 23 3626 2.432432595 16232  D  WS 148800 + 8
> [kworker/23:1H]
>  65,128 22 3279 2.432973482 0  C  WS 147432 + 8 [0]
>  65,128  7 6126 2.433032637 18594  P   N [fio]
>  65,128  7 6127 2.433033204 18594  U   N [fio] 1
>  65,128  7 6128 2.433033346 18594  I  WS 148808 + 8 [fio]
>  65,128  7 6129 2.433033871 18594  D  WS 148808 + 8 [fio]
>  65,128  7 6130 2.433034559 18594  R  WS 148808 + 8 [0]
>  65,128  7 6131 2.433039796 18594  Q  WS 148816 + 8 [fio]
>  65,128  7 6132 2.433040206 18594  G  WS 148816 + 8 [fio]
>  65,128  7 6133 2.433040351 18594  S  WS 148816 + 8 [fio]
>  65,128  9 6392 2.433133729 0  C  WS 147240 + 8 [0]
>  65,128  9 6393 2.433138166   905  D  WS 148808 + 8 [kworker/9:1H]
>  65,128  7 6134 2.433167450 18594  P   N [fio]
>  65,128  7 6135

RE: [PATCH][V2] scsi: megaraid-sas: fix spelling mistake of "outstanding"

2016-11-29 Thread Kashyap Desai
> -Original Message-
> From: Bart Van Assche [mailto:bart.vanass...@sandisk.com]
> Sent: Wednesday, November 30, 2016 12:50 AM
> To: Colin King; Kashyap Desai; Sumit Saxena; Shivasharan S; James E . J .
> Bottomley; Martin K . Petersen; megaraidlinux@broadcom.com; linux-
> s...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: [PATCH][V2] scsi: megaraid-sas: fix spelling mistake of
> "outstanding"
>
> On 11/29/2016 11:13 AM, Colin King wrote:
> > Trivial fix to spelling mistake "oustanding" to "outstanding" in
> > dev_info and scmd_printk messages. Also join wrapped literal string in
> > the scmd_printk.
>
> Reviewed-by: Bart Van Assche <bart.vanass...@sandisk.com>

Please hold this commit, as we have a patch set in progress for the
MegaRAID driver and it will conflict with this patch if it is committed.
We will be re-submitting this patch with the appropriate Signed-off-by and
Submitted-by tags.

Thanks, Kashyap


RE: [GIT PULL] SCSI fixes for 4.9-rc3

2016-11-11 Thread Kashyap Desai
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Gabriel C
> Sent: Friday, November 11, 2016 9:40 AM
> To: James Bottomley; Andrew Morton; Linus Torvalds
> Cc: linux-scsi; linux-kernel; sta...@vger.kernel.org
> Subject: Re: [GIT PULL] SCSI fixes for 4.9-rc3
>
>
>
> On 11.11.2016 04:30, Gabriel C wrote:
> >
> > On 05.11.2016 14:29, James Bottomley wrote:
> >
> >
> > ...
> >
> >> Kashyap Desai (1):
> >>   scsi: megaraid_sas: Fix data integrity failure for JBOD
> >> (passthrough) devices
> >>
> >> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> >> b/drivers/scsi/megaraid/megaraid_sas_base.c
> >> index 9ff57de..d8b1fbd 100644
> >> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> >> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> >> @@ -1700,16 +1700,13 @@ megasas_queue_command(struct Scsi_Host
> *shost, struct scsi_cmnd *scmd)
> >>goto out_done;
> >>}
> >>
> >> -  switch (scmd->cmnd[0]) {
> >> -  case SYNCHRONIZE_CACHE:
> >> -  /*
> >> -   * FW takes care of flush cache on its own
> >> -   * No need to send it down
> >> -   */
> >> +  /*
> >> +   * FW takes care of flush cache on its own for Virtual Disk.
> >> +   * No need to send it down for VD. For JBOD send
> SYNCHRONIZE_CACHE to FW.
> >> +   */
> >> +  if ((scmd->cmnd[0] == SYNCHRONIZE_CACHE) &&
> >> +MEGASAS_IS_LOGICAL(scmd)) {
> >>scmd->result = DID_OK << 16;
> >>goto out_done;
> >> -  default:
> >> -  break;
> >>}
> >>
> >>return instance->instancet->build_and_issue_cmd(instance, scmd);
> >
> > This patch breaks my box.. I'm not able to boot it anymore.
> > It seems with this patch I have /dev/sda[a-z] to /dev/sdz[a-z] ?!?
> >
> > I'm not sure how to get an log since dracut times out and I'm dropped
> > , after a very long time of probing 'ghost devices', in a emercency
> > shell,
> journalctl doesn't work also..
> >
> > After reverting this one I can boot normal.
> >
> > Box is a FUJITSU PRIMERGY TX200 S5..

Please check the commit now. The below commit has the complete fix.

http://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/commit/?id=5e5ec1759dd663a1d5a2f10930224dd009e500e8


> >
> > This is from an working kernel..
> >
> > [5.119371] megaraid_sas :01:00.0: FW now in Ready state
> > [5.119418] megaraid_sas :01:00.0: firmware supports msix
> > : (0)
> > [5.119420] megaraid_sas :01:00.0: current msix/online cpus
> > : (1/16)
> > [5.119422] megaraid_sas :01:00.0: RDPQ mode : (disabled)
> > [5.123100] ehci-pci :00:1a.7: cache line size of 32 is not
> > supported
> > [5.123113] ehci-pci :00:1a.7: irq 18, io mem 0xb002
> >
> > ...
> >
> > [5.208063] megaraid_sas :01:00.0: controller type   :
> > MR(256MB)
> > [5.208065] megaraid_sas :01:00.0: Online Controller Reset(OCR)
> > :
> Enabled
> > [5.208067] megaraid_sas :01:00.0: Secure JBOD support   : No
> > [5.208070] megaraid_sas :01:00.0: megasas_init_mfi:
> fw_support_ieee=0
> > [5.208073] megaraid_sas :01:00.0: INIT adapter done
> > [5.208075] megaraid_sas :01:00.0: Jbod map is not supported
> megasas_setup_jbod_map 4967
> > [5.230163] megaraid_sas :01:00.0: MR_DCMD_PD_LIST_QUERY
> failed/not supported by firmware
> > [5.252080] megaraid_sas :01:00.0: DCMD not supported by
> > firmware -
> megasas_ld_list_query 4369
> > [5.274086] megaraid_sas :01:00.0: pci id:
> (0x1000)/(0x0060)/(0x1734)/(0x10f9)
> > [5.274089] megaraid_sas :01:00.0: unevenspan support: no
> > [5.274090] megaraid_sas :01:00.0: firmware crash dump   : no
> > [5.274092] megaraid_sas :01:00.0: jbod sync map : no
> > [5.274094] scsi host0: Avago SAS based MegaRAID driver
> > [5.280022] scsi 0:0:6:0: Direct-Access ATA  WDC WD5002ABYS-5
> > 3B06
> PQ: 0 ANSI: 5
> > [5.282153] scsi 0:0:7:0: Direct-Access ATA  WDC WD5002ABYS-5
> > 3B06
> PQ: 0 ANSI: 5
> > [5.285180] scsi 0:0:10:0: Direct-Access ATA  ST500NM0011
> > FTM6 PQ:
> 0 ANSI: 5
> > [5.369885] scsi 0:2:0:0: Direct-Access LSI  MegaRAID SAS RMB
> > 1.40 PQ:
> 0 ANSI: 5
> >
> > ..
> >
> > Please let me know if you need more infos and/or want me to test
> > patches.
> >
> >
>
> I managed to get some parts of the broken dmesg. There it is :
>
> http://ftp.frugalware.org/pub/other/people/crazy/kernel/broken-dmesg
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of
> a message to majord...@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html


RE: Device or HBA level QD throttling creates randomness in sequetial workload

2016-10-31 Thread Kashyap Desai
Jens- Replied inline.


Omar - I tested your WIP repo and figured out that the system hangs only
if I pass "scsi_mod.use_blk_mq=Y". Without this, your WIP branch works
fine, but I am looking for scsi_mod.use_blk_mq=Y.

Also, below is a snippet of blktrace. In case of a higher per-device QD, I
see requeue requests in the blktrace.

65,128 10 6268 2.432404509 18594  P   N [fio]
 65,128 10 6269 2.432405013 18594  U   N [fio] 1
 65,128 10 6270 2.432405143 18594  I  WS 148800 + 8 [fio]
 65,128 10 6271 2.432405740 18594  R  WS 148800 + 8 [0]
 65,128 10 6272 2.432409794 18594  Q  WS 148808 + 8 [fio]
 65,128 10 6273 2.432410234 18594  G  WS 148808 + 8 [fio]
 65,128 10 6274 2.432410424 18594  S  WS 148808 + 8 [fio]
 65,128 23 3626 2.432432595 16232  D  WS 148800 + 8 [kworker/23:1H]
 65,128 22 3279 2.432973482 0  C  WS 147432 + 8 [0]
 65,128  7 6126 2.433032637 18594  P   N [fio]
 65,128  7 6127 2.433033204 18594  U   N [fio] 1
 65,128  7 6128 2.433033346 18594  I  WS 148808 + 8 [fio]
 65,128  7 6129 2.433033871 18594  D  WS 148808 + 8 [fio]
 65,128  7 6130 2.433034559 18594  R  WS 148808 + 8 [0]
 65,128  7 6131 2.433039796 18594  Q  WS 148816 + 8 [fio]
 65,128  7 6132 2.433040206 18594  G  WS 148816 + 8 [fio]
 65,128  7 6133 2.433040351 18594  S  WS 148816 + 8 [fio]
 65,128  9 6392 2.433133729 0  C  WS 147240 + 8 [0]
 65,128  9 6393 2.433138166   905  D  WS 148808 + 8 [kworker/9:1H]
 65,128  7 6134 2.433167450 18594  P   N [fio]
 65,128  7 6135 2.433167911 18594  U   N [fio] 1
 65,128  7 6136 2.433168074 18594  I  WS 148816 + 8 [fio]
 65,128  7 6137 2.433168492 18594  D  WS 148816 + 8 [fio]
 65,128  7 6138 2.433174016 18594  Q  WS 148824 + 8 [fio]
 65,128  7 6139 2.433174282 18594  G  WS 148824 + 8 [fio]
 65,128  7 6140 2.433174613 18594  S  WS 148824 + 8 [fio]
CPU0 (sdy):
 Reads Queued:   0,0KiB  Writes Queued:  79,
316KiB
 Read Dispatches:0,0KiB  Write Dispatches:   67,
18,446,744,073PiB
 Reads Requeued: 0   Writes Requeued:86
 Reads Completed:0,0KiB  Writes Completed:   98,
392KiB
 Read Merges:0,0KiB  Write Merges:0,
0KiB
 Read depth: 0   Write depth: 5
 IO unplugs:79   Timer unplugs:   0



` Kashyap

> -Original Message-
> From: Jens Axboe [mailto:ax...@kernel.dk]
> Sent: Monday, October 31, 2016 10:54 PM
> To: Kashyap Desai; Omar Sandoval
> Cc: linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> bl...@vger.kernel.org; Christoph Hellwig; paolo.vale...@linaro.org
> Subject: Re: Device or HBA level QD throttling creates randomness in
> sequetial
> workload
>
> Hi,
>
> One guess would be that this isn't around a requeue condition, but rather
> the
> fact that we don't really guarantee any sort of hard FIFO behavior between
> the
> software queues. Can you try this test patch to see if it changes the
> behavior for
> you? Warning: untested...

Jens - I tested the patch, but I still see a random IO pattern for the
expected sequential run. I am intentionally running the re-queue case and
seeing the issue at the time of re-queue.
If there is no requeue, I see no issue at the LLD.


>
> diff --git a/block/blk-mq.c b/block/blk-mq.c index
> f3d27a6dee09..5404ca9c71b2
> 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -772,6 +772,14 @@ static inline unsigned int queued_to_index(unsigned
> int
> queued)
>   return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
>   }
>
> +static int rq_pos_cmp(void *priv, struct list_head *a, struct list_head
> +*b) {
> + struct request *rqa = container_of(a, struct request, queuelist);
> + struct request *rqb = container_of(b, struct request, queuelist);
> +
> + return blk_rq_pos(rqa) < blk_rq_pos(rqb); }
> +
>   /*
>* Run this hardware queue, pulling any software queues mapped to it in.
>* Note that this function currently has various problems around
> ordering @@ -
> 812,6 +820,14 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx
> *hctx)
>   }
>
>   /*
> +  * If the device is rotational, sort the list sanely to avoid
> +  * unecessary seeks. The software queues are roughly FIFO, but
> +  * only roughly, there are no hard guarantees.
> +  */
> + if (!blk_queue_nonrot(q))
> + list_sort(NULL, _list, rq_pos_cmp);
> +
> + /*
>* Start off with dptr being NULL, so we start the first request
>* immediately, even if we have more pending.
>*/
>
> --
> Jens Axboe


RE: Device or HBA level QD throttling creates randomness in sequetial workload

2016-10-24 Thread Kashyap Desai
> -Original Message-
> From: Omar Sandoval [mailto:osan...@osandov.com]
> Sent: Monday, October 24, 2016 9:11 PM
> To: Kashyap Desai
> Cc: linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> bl...@vger.kernel.org; ax...@kernel.dk; Christoph Hellwig;
> paolo.vale...@linaro.org
> Subject: Re: Device or HBA level QD throttling creates randomness in
sequetial
> workload
>
> On Mon, Oct 24, 2016 at 06:35:01PM +0530, Kashyap Desai wrote:
> > >
> > > On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > > > Hi -
> > > >
> > > > I found below conversation and it is on the same line as I wanted
> > > > some input from mailing list.
> > > >
> > > > http://marc.info/?l=linux-kernel&m=147569860526197&w=2
> > > >
> > > > I can do testing on any WIP item as Omar mentioned in above
> > discussion.
> > > > https://github.com/osandov/linux/tree/blk-mq-iosched
> >
> > I tried build kernel using this repo, but looks like it is not allowed
> > to reboot due to some changes in  layer.
>
> Did you build the most up-to-date version of that branch? I've been
force
> pushing to it, so the commit id that you built would be useful.
> What boot failure are you seeing?

Below is the latest commit on the repo.
commit b077a9a5149f17ccdaa86bc6346fa256e3c1feda
Author: Omar Sandoval <osan...@fb.com>
Date:   Tue Sep 20 11:20:03 2016 -0700

[WIP] blk-mq: limit bio queue depth

I have the latest repo from 4.9/scsi-next maintained by Martin, which
boots fine. The only delta is that "CONFIG_SBITMAP" is enabled in the WIP
blk-mq-iosched branch. I could not see any meaningful data on the boot
hang, so I am going to try one more time tomorrow.


>
> > >
> > > Are you using blk-mq for this disk? If not, then the work there
> > > won't
> > affect you.
> >
YES. I am using blk-mq for my test. I also confirm that if use_blk_mq is
disabled, the sequential workload issue is not seen and scheduling
works well.
>
> Ah, okay, perfect. Can you send the fio job file you're using? Hard to
tell exactly
> what's going on without the details. A sequential workload with just one
> submitter is about as easy as it gets, so this _should_ be behaving
nicely.



; setup numa policy for each thread
; 'numactl --show' to determine the maximum numa nodes
[global]
ioengine=libaio
buffered=0
rw=write
bssplit=4K/100
iodepth=256
numjobs=1
direct=1
runtime=60s
allow_mounted_write=0

[job1]
filename=/dev/sdd
..
[job24]
filename=/dev/sdaa

When I set /sys/module/scsi_mod/parameters/use_blk_mq = 1, below is the
io scheduler detail (it is in blk-mq mode):
/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/host10/target10:2:13/10:2:13:0/block/sdq/queue/scheduler:none

When I set /sys/module/scsi_mod/parameters/use_blk_mq = 0, the
io scheduler picked by the SML is cfq:
/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/host10/target10:2:13/10:2:13:0/block/sdq/queue/scheduler:noop deadline [cfq]

I see that blk-mq performance is very low for the sequential write workload, and
I confirm that blk-mq converts the sequential workload into a random stream due
to the io-scheduler change between blk-mq and the legacy block layer.

>
> > >
> > > > Is there any workaround/alternative in latest upstream kernel, if
> > > > user wants to see limited penalty  for Sequential Work load on HDD
?
> > > >
> > > > ` Kashyap
> > > >
>
> P.S., your emails are being marked as spam by Gmail. Actually, Gmail
seems to
> mark just about everything I get from Broadcom as spam due to failed
DMARC.
>
> --
> Omar


RE: Device or HBA level QD throttling creates randomness in sequetial workload

2016-10-24 Thread Kashyap Desai
>
> On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > Hi -
> >
> > I found below conversation and it is on the same line as I wanted some
> > input from mailing list.
> >
> > http://marc.info/?l=linux-kernel&m=147569860526197&w=2
> >
> > I can do testing on any WIP item as Omar mentioned in above
discussion.
> > https://github.com/osandov/linux/tree/blk-mq-iosched

I tried building a kernel using this repo, but it looks like the system fails to
boot due to some changes in that layer.

>
> Are you using blk-mq for this disk? If not, then the work there won't
affect you.

Yes, I am using blk-mq for my test. I also confirm that if use_blk_mq is
disabled, the sequential workload issue is not seen and scheduling works
well.

>
> > Is there any workaround/alternative in latest upstream kernel, if user
> > wants to see limited penalty  for Sequential Work load on HDD ?
> >
> > ` Kashyap
> >


RE: Device or HBA level QD throttling creates randomness in sequetial workload

2016-10-21 Thread Kashyap Desai
Hi -

I found the below conversation, and it is along the same lines as the input I
wanted from the mailing list.

http://marc.info/?l=linux-kernel&m=147569860526197&w=2

I can do testing on any WIP item as Omar mentioned in above discussion.
https://github.com/osandov/linux/tree/blk-mq-iosched

Is there any workaround/alternative in the latest upstream kernel if a user
wants to see a limited penalty for a sequential workload on HDD?

` Kashyap

> -Original Message-
> From: Kashyap Desai [mailto:kashyap.de...@broadcom.com]
> Sent: Thursday, October 20, 2016 3:39 PM
> To: linux-s...@vger.kernel.org
> Subject: Device or HBA level QD throttling creates randomness in
sequetial
> workload
>
> [ Apologies if you find more than one instance of my email.
> The web-based email client has some issues, so I am now trying git send-email. ]
>
> Hi,
>
> I am doing some performance tuning in the MR driver to understand how sdev
> queue depth and hba queue depth play a role in IO submission from the layer
> above. I have 24 JBODs connected to a MR 12Gb/s controller and I can see
> performance for a 4K sequential workload as below.
>
> HBA QD for MR controller is 4065 and Per device QD is set to 32
>
> queue depth from 256 reports 300K IOPS
> queue depth from 128 reports 330K IOPS
> queue depth from 64 reports 360K IOPS
> queue depth from 32 reports 510K IOPS
>
> In the MR driver I added a debug print and confirmed that more IO comes to
> the driver as random IO whenever I have a queue depth of more than 32.
>
> I have debugged using the scsi logging level and blktrace as well. Below is a
> snippet of logs using the scsi logging level. In summary, if the SML does flow
> control of IO due to the device QD or the HBA QD, the IO coming to the LLD has
> a more random pattern.
>
> I see the IO coming to the driver is not sequential.
>
> [79546.912041] sd 18:2:21:0: [sdy] tag#854 CDB: Write(10) 2a 00 00 03 c0 3b 00 00 01 00
> [79546.912049] sd 18:2:21:0: [sdy] tag#855 CDB: Write(10) 2a 00 00 03 c0 3c 00 00 01 00
> [79546.912053] sd 18:2:21:0: [sdy] tag#886 CDB: Write(10) 2a 00 00 03 c0 5b 00 00 01 00
>
>  After LBA "00 03 c0 3c" the next command is with LBA "00 03 c0 5b".
> Two sequences are overlapped due to sdev QD throttling.
>
> [79546.912056] sd 18:2:21:0: [sdy] tag#887 CDB: Write(10) 2a 00 00 03 c0 5c 00 00 01 00
> [79546.912250] sd 18:2:21:0: [sdy] tag#856 CDB: Write(10) 2a 00 00 03 c0 3d 00 00 01 00
> [79546.912257] sd 18:2:21:0: [sdy] tag#888 CDB: Write(10) 2a 00 00 03 c0 5d 00 00 01 00
> [79546.912259] sd 18:2:21:0: [sdy] tag#857 CDB: Write(10) 2a 00 00 03 c0 3e 00 00 01 00
> [79546.912268] sd 18:2:21:0: [sdy] tag#858 CDB: Write(10) 2a 00 00 03 c0 3f 00 00 01 00
>
>  If scsi_request_fn() breaks due to unavailability of the device queue (due to
> the below check), will there be any side effect such as the one I observe?
> if (!scsi_dev_queue_ready(q, sdev))
>  break;
>
> If I reduce the HBA QD and make sure IO from the layer above is throttled due
> to the HBA QD, there is the same impact.
> The MR driver uses a host-wide shared tag map.
>
> Can someone help me understand whether this can be made tunable in the LLD by
> providing additional settings, or whether it is expected behavior? The problem
> I am facing is that I am not able to figure out the optimal device queue depth
> for different configurations and workloads.
>
> Thanks, Kashyap
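
For anyone experimenting with the per-device QD values discussed above, the SCSI
midlayer exposes scsi_change_queue_depth(). A minimal sketch of an LLD-side hook
that clamps and applies a new sdev queue depth (the function name is illustrative;
this is not the actual megaraid_sas code):

#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>

/* Sketch: clamp the requested per-device queue depth to the host can_queue
 * limit and let the midlayer apply it. */
static int example_change_queue_depth(struct scsi_device *sdev, int qdepth)
{
	struct Scsi_Host *shost = sdev->host;

	if (qdepth > shost->can_queue)
		qdepth = shost->can_queue;

	return scsi_change_queue_depth(sdev, qdepth);
}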


RE: Observing Softlockup's while running heavy IOs

2016-08-23 Thread Kashyap Desai
> -Original Message-
> From: Elliott, Robert (Persistent Memory) [mailto:elli...@hpe.com]
> Sent: Saturday, August 20, 2016 2:58 AM
> To: Sreekanth Reddy
> Cc: linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org;
> irqbala...@lists.infradead.org; Kashyap Desai; Sathya Prakash Veerichetty;
> Chaitra Basappa; Suganath Prabu Subramani
> Subject: RE: Observing Softlockup's while running heavy IOs
>
>
>
> > -Original Message-
> > From: Sreekanth Reddy [mailto:sreekanth.re...@broadcom.com]
> > Sent: Friday, August 19, 2016 6:45 AM
> > To: Elliott, Robert (Persistent Memory) <elli...@hpe.com>
> > Subject: Re: Observing Softlockup's while running heavy IOs
> >
> ...
> > Yes I am also observing that all the interrupts are routed to one CPU.
> > But still I observing softlockups (sometime hardlockups) even when I
> > set rq_affinity to 2.

How about the below scenario? For simplicity, consider an HBA with a single MSI-x
vector. (Whenever the HBA supports fewer MSI-x vectors than the number of logical
CPUs on the system, we can see these issues more frequently.)

Assume we have 32 logical CPUs (4 sockets, each with 8 logical CPUs). CPU-0 is
not participating in IO.
The remaining CPUs, 1 through 31, are submitting IO. In such a scenario,
rq_affinity=2 and an irqbalance that honors the *exact* smp_affinity_hint will
not help.

We may see a soft/hard lockup on CPU-0. Are we going to resolve such an issue, or
is it too rare to happen in the field?


>
> That'll ensure the block layer's completion handling is done there, but
> not your
> driver's interrupt handler (which precedes the block layer completion
> handling).
>
>
> > Is there any way to route the interrupts to the same CPUs which have
> > submitted the corresponding IOs?
> > or
> > Is there any way/option in the irqbalance/kernel which can route
> > interrupts to CPUs (enabled in affinity_hint) in a round-robin manner
> > after a specific time period?
>
> Ensure your driver creates one MSIX interrupt per CPU core, uses that
> interrupt
> for all submissions from that core, and reports that it would like that
> interrupt to
> be serviced by that core in /proc/irq/nnn/affinity_hint.
>
> Even with hyperthreading, this needs to be based on the logical CPU cores,
> not
> just the physical core or the physical socket.
> You can swamp a logical CPU core as easily as a physical CPU core.
>
> Then, provide an irqbalance policy script that honors the affinity_hint
> for your
> driver, or turn off irqbalance and manually set /proc/irq/nnn/smp_affinity
> to
> match the affinity_hint.
>
> Some versions of irqbalance honor the hints; some purposely don't and need
> to
> be overridden with a policy script.
>
>
> ---
> Robert Elliott, HPE Persistent Memory
>
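
To illustrate the one-affinity-hint-per-MSI-X-vector approach described above, a
rough sketch (the driver context structure and the vector-to-CPU mapping are
illustrative only; a real driver would map each vector to the CPUs that actually
submit on it):

#include <linux/interrupt.h>
#include <linux/cpumask.h>
#include <linux/pci.h>

struct example_ioc {				/* hypothetical driver context */
	struct msix_entry *msix_entries;	/* filled by pci_enable_msix_range() */
	int msix_vector_count;
};

/* Sketch: publish one preferred CPU per MSI-X vector in
 * /proc/irq/<nnn>/affinity_hint so irqbalance (or a manual write to
 * smp_affinity) can keep completions on the submitting CPU. */
static void example_set_affinity_hints(struct example_ioc *ioc)
{
	int i;

	for (i = 0; i < ioc->msix_vector_count; i++) {
		int cpu = i % num_online_cpus();	/* simplistic mapping */

		irq_set_affinity_hint(ioc->msix_entries[i].vector,
				      get_cpu_mask(cpu));
	}
}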


RE: [PATCH] megaraid_sas: Fix probing cards without io port

2016-08-09 Thread Kashyap Desai
 > > that does not have an io port resource allocated from the BIOS, and the
 > > kernel cannot assign one due to an io port shortage.
> >
> > The driver is only looking for MEM, and should not fail.
>
> > It turns out megasas_init_fw() etc are using bar index as mask.
> > index 1 is used as mask 1, so that pci_request_selected_regions() is
> > trying to request BAR0 instead of BAR1.
>
> > Fix all related reference.

Reviewed the patch and it looks good. Thanks, Yinghai, for submitting the patch.

Acked-by:  Kashyap Desai <kashyap.de...@broadcom.com>

>
> Kashyap? Sumit?
>
> --
> Martin K. Petersen	Oracle Linux Engineering
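
For reference, the index-versus-mask confusion described above comes from
pci_request_selected_regions() taking a bit mask of BARs rather than a BAR index.
A small illustrative sketch:

#include <linux/pci.h>

/* Sketch: pci_request_selected_regions() expects a bit mask of BARs, so a
 * raw index of 1 requests BAR0 (mask 0x1); convert the index to a mask. */
static int example_request_bar(struct pci_dev *pdev, int bar_index)
{
	/* wrong: pci_request_selected_regions(pdev, bar_index, "example"); */
	return pci_request_selected_regions(pdev, 1 << bar_index, "example");
}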


RE: [PATCH] megaraid:Make various functions static in megaraid_sas_fusion.c

2015-12-17 Thread Kashyap Desai
> -Original Message-
> From: Nicholas Krause [mailto:xerofo...@gmail.com]
> Sent: Monday, July 06, 2015 10:06 PM
> To: kashyap.de...@avagotech.com
> Cc: sumit.sax...@avagotech.com; uday.ling...@avagotech.com;
> jbottom...@odin.com; megaraidlinux@avagotech.com; linux-
> s...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH] megaraid:Make various functions static in
> megaraid_sas_fusion.c
>
> This makes various functions that have no external callers outside of
their
> definition/declaration in the file megaraid_sas_fusion.c to be properly
> declared as static functions now.
>
> Signed-off-by: Nicholas Krause 

Nicholas -

You posted a patch with the subject line "megaraid:Make the function
megasas_alloc_cmd_fusion static" for similar code changes.
Let's have a consolidated patch for this, and Avago can post it with the next
change set (to keep the git commit history clean without any failure).

Just to be clear - you can send us the patch and we can add it with a
Submitted-by signature carrying your name. For now, NACK.

` Kashyap
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] megaraid:Make the function megasas_alloc_cmd_fusion static

2015-12-17 Thread Kashyap Desai
> -int
> +static int
>  megasas_alloc_cmds_fusion(struct megasas_instance *instance)  {
>   int i, j, count;

Good catch. Let's not include this patch, to avoid further confusion with the
series of acked patches which are not making a significant functional difference.
NACK, as it does not make any functional difference. We can make many more such
changes in one shot; let's address those later, as we have a few patches
ready to send in a few days.


> --
> 2.5.0
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH RESEND 09/25] mpt3sas: Don't send PHYDISK_HIDDEN Raid Action request on SAS2 HBA's

2015-11-11 Thread Kashyap Desai
> -Original Message-
> From: Hannes Reinecke [mailto:h...@suse.de]
> Sent: Wednesday, November 11, 2015 6:27 PM
> To: Sreekanth Reddy; j...@kernel.org
> Cc: martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> jbottom...@parallels.com; sathya.prak...@avagotech.com;
> kashyap.de...@avagotech.com; linux-kernel@vger.kernel.org;
> h...@infradead.org; chaitra.basa...@avagotech.com; suganath-
> prabu.subram...@avagotech.com
> Subject: Re: [PATCH RESEND 09/25] mpt3sas: Don't send PHYDISK_HIDDEN
> Raid Action request on SAS2 HBA's
>
> On 11/11/2015 01:00 PM, Sreekanth Reddy wrote:
> > From: Sreekanth Reddy 
> >
> > Don't send PHYDISK_HIDDEN Raid Action request for SAS2 HBA's.
> > Since these HBA's doesn't support this Raid Action.
> >
> > Also enable fast_path only for SAS3 HBA's.
> >
> > Signed-off-by: Sreekanth Reddy 
> > ---
> >  drivers/scsi/mpt3sas/mpt3sas_scsih.c | 19 +--
> >  1 file changed, 17 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> > b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> > index a638920..80469d0 100644
> > --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> > +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> > @@ -1165,8 +1165,10 @@ scsih_target_alloc(struct scsi_target *starget)
> > if (test_bit(sas_device->handle, ioc->pd_handles))
> > sas_target_priv_data->flags |=
> > MPT_TARGET_FLAGS_RAID_COMPONENT;
> > +#ifndef SCSI_MPT2SAS
> > if (sas_device->fast_path)
> > sas_target_priv_data->flags |=
> MPT_TARGET_FASTPATH_IO;
> > +#endif
> > }
> > spin_unlock_irqrestore(>sas_device_lock, flags);
> >
> > @@ -3719,11 +3721,13 @@ scsih_qcmd(struct Scsi_Host *shost, struct
> scsi_cmnd *scmd)
> > ioc->build_zero_len_sge(ioc, _request->SGL);
> >
> > if (likely(mpi_request->Function ==
> MPI2_FUNCTION_SCSI_IO_REQUEST))
> > {
> > +#ifndef SCSI_MPT2SAS
> > if (sas_target_priv_data->flags &
> MPT_TARGET_FASTPATH_IO) {
> > mpi_request->IoFlags = cpu_to_le16(scmd-
> >cmd_len |
> > MPI25_SCSIIO_IOFLAGS_FAST_PATH);
> > mpt3sas_base_put_smid_fast_path(ioc, smid,
> handle);
> > } else
> > +#endif
> > mpt3sas_base_put_smid_scsi_io(ioc, smid, handle);
> > } else
> > mpt3sas_base_put_smid_default(ioc, smid); @@ -5031,8
> +5035,10 @@
> > _scsih_add_device(struct MPT3SAS_ADAPTER *ioc, u16 handle, u8
> phy_num,
> > sas_device->device_info = device_info;
> > sas_device->sas_address = sas_address;
> > sas_device->phy = sas_device_pg0.PhyNum;
> > +#ifndef SCSI_MPT2SAS
> > sas_device->fast_path = (le16_to_cpu(sas_device_pg0.Flags) &
> > MPI25_SAS_DEVICE0_FLAGS_FAST_PATH_CAPABLE) ? 1 : 0;
> > +#endif
> >
> > if (sas_device_pg0.Flags &
> MPI2_SAS_DEVICE0_FLAGS_ENCL_LEVEL_VALID) {
> > sas_device->enclosure_level =
> > @@ -5731,6 +5737,7 @@ _scsih_sas_discovery_event(struct
> MPT3SAS_ADAPTER *ioc,
> > }
> >  }
> >
> > +#ifndef SCSI_MPT2SAS
> >  /**
> >   * _scsih_ir_fastpath - turn on fastpath for IR physdisk
> >   * @ioc: per adapter object
> > @@ -5750,7 +5757,6 @@ _scsih_ir_fastpath(struct MPT3SAS_ADAPTER
> *ioc, u16 handle, u8 phys_disk_num)
> > u16 ioc_status;
> > u32 log_info;
> >
> > -
> > mutex_lock(>scsih_cmds.mutex);
> >
> > if (ioc->scsih_cmds.status != MPT3_CMD_NOT_USED) { @@ -
> 5825,6
> > +5831,8 @@ _scsih_ir_fastpath(struct MPT3SAS_ADAPTER *ioc, u16
> handle, u8 phys_disk_num)
> > FORCE_BIG_HAMMER);
> > return rc;
> >  }
> > +/* End of not defined SCSI_MPT2SAS */ #endif
> >
> >  /**
> >   * _scsih_reprobe_lun - reprobing lun @@ -6017,8 +6025,10 @@
> > _scsih_sas_pd_hide(struct MPT3SAS_ADAPTER *ioc,
> > if (!sas_device)
> > return;
> >
> > +#ifndef SCSI_MPT2SAS
> > /* hiding raid component */
> > _scsih_ir_fastpath(ioc, handle, element->PhysDiskNum);
> > +#endif
> > if (starget)
> > starget_for_each_device(starget, (void *)1,
> _scsih_reprobe_lun);  }
> > @@ -6067,7 +6077,9 @@ _scsih_sas_pd_add(struct MPT3SAS_ADAPTER
> *ioc,
> > sas_device = _scsih_sas_device_find_by_handle(ioc, handle);
> > spin_unlock_irqrestore(>sas_device_lock, flags);
> > if (sas_device) {
> > +#ifndef SCSI_MPT2SAS
> > _scsih_ir_fastpath(ioc, handle, element->PhysDiskNum);
> > +#endif
> > return;
> > }
> >
> > @@ -6091,7 +6103,9 @@ _scsih_sas_pd_add(struct MPT3SAS_ADAPTER
> *ioc,
> > mpt3sas_transport_update_links(ioc, sas_address, handle,
> > sas_device_pg0.PhyNum,
> MPI2_SAS_NEG_LINK_RATE_1_5);
> >
> > +#ifndef SCSI_MPT2SAS
> > _scsih_ir_fastpath(ioc, handle, element->PhysDiskNum);
> > +#endif
> > _scsih_add_device(ioc, handle, 0, 1);  }
> >
> > @@ -6202,13 +6216,14 @@ _scsih_sas_ir_config_change_event(struct
> > MPT3SAS_ADAPTER *ioc,
> >
> > element = 

RE: [patch 1/2] megaraid_sas: remove a stray tab

2015-11-10 Thread Kashyap Desai
> -Original Message-
> From: Dan Carpenter [mailto:dan.carpen...@oracle.com]
> Sent: Saturday, November 07, 2015 9:46 PM
> To: Kashyap Desai; Sumit Saxena
> Cc: Uday Lingala; James E.J. Bottomley; megaraidlinux@avagotech.com;
> linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org; kernel-
> janit...@vger.kernel.org
> Subject: [patch 1/2] megaraid_sas: remove a stray tab
>
> One line was indented more than the others.
>
> Signed-off-by: Dan Carpenter <dan.carpen...@oracle.com>

Let us combine your next patch (because that is also an indentation issue)
and this one. We will resubmit the patch. For now, please ignore this one, to
avoid any issue for the pending patches to be committed.

>
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> b/drivers/scsi/megaraid/megaraid_sas_base.c
> index cc95372..829e9e9 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -5923,9 +5923,9 @@ static void megasas_detach_one(struct pci_dev
> *pdev)
>
>   if (instance->ctrl_context) {
>   megasas_release_fusion(instance);
> - pd_seq_map_sz = sizeof(struct
> MR_PD_CFG_SEQ_NUM_SYNC) +
> + pd_seq_map_sz = sizeof(struct
> MR_PD_CFG_SEQ_NUM_SYNC) +
>   (sizeof(struct MR_PD_CFG_SEQ) *
> - (MAX_PHYSICAL_DEVICES - 1));
> + (MAX_PHYSICAL_DEVICES - 1));
>   for (i = 0; i < 2 ; i++) {
>   if (fusion->ld_map[i])
>   dma_free_coherent(>pdev->dev,
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 1/3] megaraid_sas: Convert dev_printk to dev_

2015-10-27 Thread Kashyap Desai
> > +   dev_dbg(&instance->pdev->dev, "Error copying out
> > cmd_status\n");
> > error = -EFAULT;
> > }
> >
>
> Reviewed-by: Johannes Thumshirn 

We will consider all three patches for future submission. As of now we have
two patch sets pending to be committed.
We are working on a few more patches in megaraid_sas which will do cleanup in
the driver module plus proper error handling of DCMD command timeouts. They will
cover the patch posted with the below subject -

[PATCH 3/3] megaraid_sas: return -ENOMEM when create DMA pool for cmd frames
failed

James - we will be resending these patch sets on top of the latest outstanding
megaraid_sas driver patches, so that we can avoid any conflict in commits.
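
For context, the conversion this series performs replaces dev_printk() calls that
carry an explicit log level with the matching dev_<level>() helpers. A minimal
before/after sketch (the messages here are illustrative only):

#include <linux/device.h>

/* Sketch: the same messages before and after the conversion. */
static void example_log_status(struct device *dev, int error)
{
	/* before: dev_printk(KERN_DEBUG, dev, "Error copying out cmd_status\n"); */
	dev_dbg(dev, "Error copying out cmd_status\n");

	/* before: dev_printk(KERN_ERR, dev, "DCMD timed out, error %d\n", error); */
	dev_err(dev, "DCMD timed out, error %d\n", error);
}
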
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]

2015-07-10 Thread Kashyap Desai
>
> I am about to commit the patch that was successfully tested by the
> customer on
> SLES 12, but I'm a bit confused. The upstream patch you referred to is:
>
> https://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/commit/?h=for-next&id=6431f5d7c6025f8b007af06ea090de308f7e6881
> [SCSI] megaraid_sas: megaraid_sas driver init fails in kdump kernel
>
> But the patch I used is the one you sent by e-mail on May 28th. It is
> completely
> different!
>
> So what am I supposed to do? Use the patch you sent (and that was tested
> by
> the customer) for SLES 11 SP3 and SLES 12? Or was it just for testing and
> the
> proper way of fixing the problem would be to backport the upstream commit?

You can use that patch as a valid candidate for upstream submission. One of
the MR maintainers (Sumit Saxena) will send that patch. We are just
organizing the other patch series.
Since SLES has already ported the patch without a commit id, we are fine. I am
just noting for reference that the patch sent via email will be sent upstream
very soon along with the other patch set.

Thanks, Kashyap
>
> Please advise,
> --
> Jean Delvare
> SUSE L3 Support
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]

2015-07-07 Thread Kashyap Desai
> -Original Message-
> From: Jean Delvare [mailto:jdelv...@suse.de]
> Sent: Tuesday, July 07, 2015 2:14 PM
> To: Kashyap Desai
> Cc: Bjorn Helgaas; Robin H. Johnson; Adam Radford; Neela Syam Kolli;
linux-
> s...@vger.kernel.org; arkadiusz.bub...@open-e.com; Matthew Garrett;
Sumit
> Saxena; Uday Lingala; PDL,MEGARAIDLINUX; linux-...@vger.kernel.org;
linux-
> ker...@vger.kernel.org; Myron Stowe
> Subject: Re: megaraid_sas: "FW in FAULT state!!", how to get more debug
> output? [BKO63661]
>
> Hi Kashyap,
>
> On Thu, 28 May 2015 19:05:35 +0530, Kashyap Desai wrote:
> > Bjorn/Robin,
> >
> > Apologies for delay. Here is one quick suggestion as we have seen
> > similar issue (not exactly similar, but high probably to have same
> > issue) while controller is configured on VM as pass-through and VM
reboot
> abruptly.
> > In that particular issue, driver interact with FW which may  require
> > chip reset to bring controller to operation state.
> >
> > Relevant patch was submitted for only Older controller as it was only
> > seen for few MegaRaid controller. Below patch already try to do chip
> > reset, but only for limited controllers...I have attached one more
> > patch which does chip reset from driver load time for
> > Thunderbolt/Invader/Fury etc. (In your case you have Thunderbolt
> > controller, so attached patch is required.)
> >
> > http://www.spinics.net/lists/linux-scsi/msg67288.html
> >
> > Please post the result with attached patch.
>
> Good news! Customer tested your patch and said it fixed the problem :-)
>
> I am now in the process of backporting the patch to the SLES 11 SP3
kernel for
> further testing. I'll let you know how it goes. Thank you very much for
your
> assistance.

Thanks for the confirmation. The patch I submitted to you is something we added
recently (as part of a common interface approach to do a chip reset at load
time). We will be submitting that patch to mainline soon.

~ Kashyap
>
> --
> Jean Delvare
> SUSE L3 Support
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]

2015-06-30 Thread Kashyap Desai
Jean,

The patch is available in the below repo -

git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git -b for-next

Commit id -
6431f5d7c6025f8b007af06ea090de308f7e6881

If you share the megaraid_sas driver code from your tree, I can provide the
patch for you.

` Kashyap

> -Original Message-
> From: Jean Delvare [mailto:jdelv...@suse.de]
> Sent: Monday, June 29, 2015 6:55 PM
> To: Kashyap Desai
> Cc: Bjorn Helgaas; Robin H. Johnson; Adam Radford; Neela Syam Kolli;
> linux-
> s...@vger.kernel.org; arkadiusz.bub...@open-e.com; Matthew Garrett; Sumit
> Saxena; Uday Lingala; PDL,MEGARAIDLINUX; linux-...@vger.kernel.org; linux-
> ker...@vger.kernel.org; Myron Stowe
> Subject: RE: megaraid_sas: "FW in FAULT state!!", how to get more debug
> output? [BKO63661]
>
> Hi Kashyap,
>
> Thanks for the patch. May I ask what tree it was based on? Linus'
> latest? I am trying to apply it to the SLES 11 SP3 and SLES 12 kernel
> trees (based
> on kernel v3.0 + a bunch of backports and v3.12
> respectively) but your patch fails to apply in both cases. I'll try harder
> but I don't
> know anything about the megaraid_sas code so I really don't know where I'm
> going.
>
> Does your patch depend on any other that may not be present in the SLES
> 11 SP3 and SLES 12 kernels?
>
> Thanks,
> Jean
>
> Le Thursday 28 May 2015 à 19:05 +0530, Kashyap Desai a écrit :
> > Bjorn/Robin,
> >
> > Apologies for delay. Here is one quick suggestion as we have seen
> > similar issue (not exactly similar, but high probably to have same
> > issue) while controller is configured on VM as pass-through and VM
> > reboot
> abruptly.
> > In that particular issue, driver interact with FW which may  require
> > chip reset to bring controller to operation state.
> >
> > Relevant patch was submitted for only Older controller as it was only
> > seen for few MegaRaid controller. Below patch already try to do chip
> > reset, but only for limited controllers...I have attached one more
> > patch which does chip reset from driver load time for
> > Thunderbolt/Invader/Fury etc. (In your case you have Thunderbolt
> > controller, so attached patch is required.)
> >
> > http://www.spinics.net/lists/linux-scsi/msg67288.html
> >
> > Please post the result with attached patch.
> >
> > Thanks, Kashyap
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]

2015-05-28 Thread Kashyap Desai
Bjorn/Robin,

Apologies for the delay. Here is one quick suggestion, as we have seen a similar
issue (not exactly the same, but highly probable to be the same issue) when the
controller is configured on a VM as pass-through and the VM reboots abruptly.
In that particular issue, the driver interacts with FW in a way that may require
a chip reset to bring the controller to an operational state.

The relevant patch was submitted only for older controllers, as the issue was
only seen on a few MegaRAID controllers. The patch below already tries to do a
chip reset, but only for limited controllers... I have attached one more patch
which does a chip reset at driver load time for Thunderbolt/Invader/Fury etc.
(In your case you have a Thunderbolt controller, so the attached patch is
required.)

http://www.spinics.net/lists/linux-scsi/msg67288.html

Please post the result with attached patch.

Thanks, Kashyap
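
To illustrate the load-time chip-reset idea, the general pattern is to poll the
firmware state register and, if the firmware reports FAULT, perform an adapter
reset before continuing initialization. A rough sketch with hypothetical register
layout and state values (this is not the attached patch):

#include <linux/io.h>
#include <linux/delay.h>
#include <linux/errno.h>

#define EX_FW_STATE_MASK	0xF0000000	/* hypothetical */
#define EX_FW_STATE_FAULT	0xF0000000	/* hypothetical */
#define EX_FW_STATE_READY	0xB0000000	/* hypothetical */
#define EX_MAX_WAIT_SECS	180

/* Sketch: wait for the firmware to reach READY; report FAULT so the
 * caller can trigger a chip reset and retry during driver load. */
static int example_wait_fw_ready(void __iomem *fw_state_reg)
{
	int i;

	for (i = 0; i < EX_MAX_WAIT_SECS; i++) {
		u32 state = readl(fw_state_reg) & EX_FW_STATE_MASK;

		if (state == EX_FW_STATE_READY)
			return 0;
		if (state == EX_FW_STATE_FAULT)
			return -EIO;	/* caller performs chip reset */
		msleep(1000);
	}

	return -ETIMEDOUT;
}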

> -Original Message-
> From: Bjorn Helgaas [mailto:bhelg...@google.com]
> Sent: Thursday, May 28, 2015 5:55 PM
> To: Robin H. Johnson
> Cc: Adam Radford; Neela Syam Kolli; linux-s...@vger.kernel.org;
> arkadiusz.bub...@open-e.com; Matthew Garrett; Kashyap Desai; Sumit
Saxena;
> Uday Lingala; megaraidlinux@avagotech.com;
linux-...@vger.kernel.org;
> linux-kernel@vger.kernel.org; Jean Delvare; Myron Stowe
> Subject: Re: megaraid_sas: "FW in FAULT state!!", how to get more debug
> output? [BKO63661]
>
> [+cc Jean, Myron]
>
> Hello megaraid maintainers,
>
> Have you been able to take a look at this at all?  People have been
reporting this
> issue since 2012 on upstream, Debian, and Ubuntu, and now we're getting
> reports on SLES.
>
> My theory is that the Linux driver relies on some MegaRAID
initialization done by
> the option ROM, and the bug happens when the BIOS doesn't execute the
option
> ROM.
>
> If that's correct, you should be able to reproduce it on any system by
booting
> Linux (v3.3 or later) without running the MegaRAID SAS 2208 option ROM
(either
> by enabling a BIOS "fast boot" switch, or modifying the BIOS to skip
it).  If the
> Linux driver doesn't rely on the option ROM, you might even be able to
> reproduce it by physically removing the option ROM from the MegaRAID.
>
> Bjorn
>
> On Wed, Apr 29, 2015 at 12:28:32PM -0500, Bjorn Helgaas wrote:
> > [+cc linux-pci, linux-kernel, Kashyap, Sumit, Uday, megaraidlinux.pdl]
> >
> > On Sun, Jul 13, 2014 at 01:35:51AM +, Robin H. Johnson wrote:
> > > On Sat, Jul 12, 2014 at 11:29:20AM -0600, Bjorn Helgaas wrote:
> > > > Thanks for the report, Robin.
> > > >
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=63661 bisected the
> > > > problem to 3c076351c402 ("PCI: Rework ASPM disable code"), which
> > > > appeared in v3.3.  For starters, can you verify that, e.g., by
> > > > building
> > > > 69166fbf02c7 (the parent of 3c076351c402) to make sure that it
> > > > works, and building 3c076351c402 itself to make sure it fails?
> > > >
> > > > Assuming that's the case, please attach the complete dmesg and
> > > > "lspci -vvxxx" output for both kernels to the bugzilla.  ASPM is a
> > > > feature that is configured on both ends of a PCIe link, so I want
> > > > to see the lspci info for the whole system, not just the SAS
adapters.
> > > >
> > > > It's not practical to revert 3c076351c402 now, so I'd also like to
> > > > see the same information for the newest possible kernel (if this
> > > > is possible; I'm not clear on whether you can boot your system or
> > > > not) so we can figure out what needs to be changed.
> > > TL;DR: FastBoot is leaving the MegaRaidSAS in a weird state, and it
> > > fails to start; Commit 3c076351c402 did make it worse, but I think
> > > we're right that the bug lies in the SAS code.
> > >
> > > Ok, I have done more testing on it (40+ boots), and I think we can
> > > show the problem is somewhere in how the BIOS/EFI/ROM brings up the
> > > card in FastBoot more, or how it leaves the card.
> >
> > I attached your dmesg and lspci logs to
> > https://bugzilla.kernel.org/show_bug.cgi?id=63661, thank you!  You did
> > a huge amount of excellent testing and analysis, and I'm sorry that we
> > haven't made progress using the results.
> >
> > I still think this is a megaraid_sas driver bug, but I don't have
> > enough evidence to really point fingers.
> >
> > Based on your testing, before 3c076351c402 ("PCI: Rework ASPM disable
> > code"), megaraid_sas worked reliably.  After 3c076351c402,
> > megaraid_sas does not work reliably when BIOS Fast Boot is enabled.
> >
> >

RE: megaraid_sas: FW in FAULT state!!, how to get more debug output? [BKO63661]

2015-05-28 Thread Kashyap Desai
Bjorn/Robin,

Apologies for the delay. Here is one quick suggestion: we have seen a similar
issue (not exactly the same, but very likely with the same root cause) when the
controller is configured as a pass-through device on a VM and the VM reboots
abruptly. In that particular case the driver has to interact with the FW, which
may require a chip reset to bring the controller back to an operational state.

The relevant patch was submitted only for older controllers, since the problem
was only seen on a few MegaRAID controllers. The patch below already performs a
chip reset, but only for a limited set of controllers... I have attached one
more patch which performs the chip reset at driver load time for
Thunderbolt/Invader/Fury etc. (In your case you have a Thunderbolt controller,
so the attached patch is required.)

http://www.spinics.net/lists/linux-scsi/msg67288.html

Please post the result with the attached patch.

Thanks, Kashyap
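
The idea behind the attached change is, roughly, to check the outbound FW state
early during driver load and force an adapter (chip) reset if the firmware is
found faulted, before attempting the normal wait-for-ready handshake. A minimal
sketch of that approach (illustrative only, not the attached patch itself; the
helper names follow the megaraid_sas driver, but exact signatures differ across
kernel versions):

    /* Illustrative sketch only -- not the attached patch. */
    u32 fw_state = instance->instancet->read_fw_status_reg(instance) &
                   MFI_STATE_MASK;

    if (fw_state == MFI_STATE_FAULT) {
        /*
         * A skipped option ROM or an abrupt reboot can leave the
         * controller faulted: reset the chip before the normal
         * transition-to-ready handshake.
         */
        instance->instancet->adp_reset(instance, instance->reg_set);
    }

    if (megasas_transition_to_ready(instance, 0))
        goto fail_ready_state;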

 -Original Message-
 From: Bjorn Helgaas [mailto:bhelg...@google.com]
 Sent: Thursday, May 28, 2015 5:55 PM
 To: Robin H. Johnson
 Cc: Adam Radford; Neela Syam Kolli; linux-s...@vger.kernel.org;
 arkadiusz.bub...@open-e.com; Matthew Garrett; Kashyap Desai; Sumit
Saxena;
 Uday Lingala; megaraidlinux@avagotech.com;
linux-...@vger.kernel.org;
 linux-kernel@vger.kernel.org; Jean Delvare; Myron Stowe
 Subject: Re: megaraid_sas: FW in FAULT state!!, how to get more debug
 output? [BKO63661]

 [+cc Jean, Myron]

 Hello megaraid maintainers,

 Have you been able to take a look at this at all?  People have been
reporting this
 issue since 2012 on upstream, Debian, and Ubuntu, and now we're getting
 reports on SLES.

 My theory is that the Linux driver relies on some MegaRAID
initialization done by
 the option ROM, and the bug happens when the BIOS doesn't execute the
option
 ROM.

 If that's correct, you should be able to reproduce it on any system by
booting
 Linux (v3.3 or later) without running the MegaRAID SAS 2208 option ROM
(either
 by enabling a BIOS fast boot switch, or modifying the BIOS to skip
it).  If the
 Linux driver doesn't rely on the option ROM, you might even be able to
 reproduce it by physically removing the option ROM from the MegaRAID.

 Bjorn

 On Wed, Apr 29, 2015 at 12:28:32PM -0500, Bjorn Helgaas wrote:
  [+cc linux-pci, linux-kernel, Kashyap, Sumit, Uday, megaraidlinux.pdl]
 
  On Sun, Jul 13, 2014 at 01:35:51AM +, Robin H. Johnson wrote:
   On Sat, Jul 12, 2014 at 11:29:20AM -0600, Bjorn Helgaas wrote:
Thanks for the report, Robin.
   
https://bugzilla.kernel.org/show_bug.cgi?id=63661 bisected the
problem to 3c076351c402 (PCI: Rework ASPM disable code), which
appeared in v3.3.  For starters, can you verify that, e.g., by
building
69166fbf02c7 (the parent of 3c076351c402) to make sure that it
works, and building 3c076351c402 itself to make sure it fails?
   
Assuming that's the case, please attach the complete dmesg and
lspci -vvxxx output for both kernels to the bugzilla.  ASPM is a
feature that is configured on both ends of a PCIe link, so I want
to see the lspci info for the whole system, not just the SAS
adapters.
   
It's not practical to revert 3c076351c402 now, so I'd also like to
see the same information for the newest possible kernel (if this
is possible; I'm not clear on whether you can boot your system or
not) so we can figure out what needs to be changed.
   TL;DR: FastBoot is leaving the MegaRaidSAS in a weird state, and it
   fails to start; Commit 3c076351c402 did make it worse, but I think
   we're right that the bug lies in the SAS code.
  
   Ok, I have done more testing on it (40+ boots), and I think we can
   show the problem is somewhere in how the BIOS/EFI/ROM brings up the
   card in FastBoot more, or how it leaves the card.
 
  I attached your dmesg and lspci logs to
  https://bugzilla.kernel.org/show_bug.cgi?id=63661, thank you!  You did
  a huge amount of excellent testing and analysis, and I'm sorry that we
  haven't made progress using the results.
 
  I still think this is a megaraid_sas driver bug, but I don't have
  enough evidence to really point fingers.
 
  Based on your testing, before 3c076351c402 (PCI: Rework ASPM disable
  code), megaraid_sas worked reliably.  After 3c076351c402,
  megaraid_sas does not work reliably when BIOS Fast Boot is enabled.
 
  Fast Boot probably means we don't run the option ROM on the device.
  Your dmesg logs show that in the working case, BIOS has enabled the
  device.  In the failing case it has not.  They also show that when
  Fast Boot is enabled, there's a little less MTRR write-protect space,
  which I'm guessing is space that wasn't needed for shadowing option
  ROMs.
 
  I suspect megaraid_sas depends on something done by the option ROM,
  and that prior to 3c076351c402, Linux did something to ASPM that was
  enough to make megaraid_sas work.
 
  I attached a couple debug patches to
  https://bugzilla.kernel.org/show_bug.cgi?id=63661 that log all the
  ASPM configuration the PCI core does.  One applies to 69166fbf02c7

RE: [PATCH] scsi: megaraid_sas: Prevent future %p disaster

2015-02-10 Thread Kashyap Desai
Acked-by: Kashyap Desai 

> -Original Message-
> From: Rasmus Villemoes [mailto:li...@rasmusvillemoes.dk]
> Sent: Friday, February 06, 2015 8:04 PM
> To: Kashyap Desai; Sumit Saxena; Uday Lingala; James E.J. Bottomley
> Cc: Rasmus Villemoes; megaraidlinux@avagotech.com; linux-
> s...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH] scsi: megaraid_sas: Prevent future %p disaster
>
> There is currently no %po format extension, so currently the letters
"on" are
> simply skipped and the pointer is printed as expected (while missing the
word
> on). However, it is probably only a matter of time before someone comes
up
> with a %po extension, at which point this is likely to fail
spectacularly.
>
> Signed-off-by: Rasmus Villemoes 
> ---
>  drivers/scsi/megaraid/megaraid_sas_base.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> b/drivers/scsi/megaraid/megaraid_sas_base.c
> index ff283d23788a..e7c6b9c946d6 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -3105,7 +3105,7 @@ megasas_internal_reset_defer_cmds(struct
> megasas_instance *instance)
>   for (i = 0; i < max_cmd; i++) {
>   cmd = instance->cmd_list[i];
>   if (cmd->sync_cmd == 1 || cmd->scmd) {
> - printk(KERN_NOTICE "megasas: moving
> cmd[%d]:%p:%d:%p"
> + printk(KERN_NOTICE "megasas: moving
> cmd[%d]:%p:%d:%p "
>   "on the defer queue as
internal\n",
>   defer_index, cmd, cmd->sync_cmd, cmd-
> >scmd);
>
> --
> 2.1.3
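
For context on why a trailing letter after %p is risky: printk() already has a
family of %p extensions (%pM, %pI4, ...) where the characters following %p
select a formatter, so a future %po would silently change the meaning of the
original format string. A small illustration (the mac and cmd variables are
hypothetical, purely for the example):

    u8 mac[6] = { 0xde, 0xad, 0xbe, 0xef, 0x00, 0x01 };

    pr_info("nic %pM\n", mac);          /* %pM consumes the 'M' and prints a MAC address */
    pr_info("cmd %p on queue\n", cmd);  /* safe: the space keeps "on" out of the specifier */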


Re: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.

2014-08-19 Thread Kashyap Desai
On Tue, Aug 19, 2014 at 9:36 PM, Christoph Hellwig  wrote:
> On Tue, Aug 19, 2014 at 03:51:42AM +0530, Kashyap Desai wrote:
>> I read  this comment and find that very few drivers are using this
>> cmd_list.  I think if we remove this cmd_list, performance will scale as I
>> am seeing major contention in this lock.
>> Just thought to ping you to see if this is known limitation for now or any
>> plan to change this lock in near future ?
>
> Removing the lock entirely and pushing the list into the two drivers
> using it is on my TODO list.  Bart actually suggested keeping the code in the
> SCSI core and having a flag to enabled.  Given that I'm too busy to get my 
> full
> version done in time, it might be a good idea if someone picks up Barts
> idea.  Can you send me a patch to add a enable_cmd_list flag to the host
> template and only enable it for aacraid and dpt_i2o?
>

Sure. I will work on relevant code change and will post patch for review.

-- 
Device Driver Developer @ Avagotech
Kashyap D. Desai
Note - my new email address
kashyap.de...@avagotech.com
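
The flag suggested above would, in rough terms, let the SCSI core keep
maintaining the per-device command list only for the few drivers that actually
walk it (aacraid and dpt_i2o), so every other driver avoids taking the
host-wide lock on each command. A minimal sketch of that shape (the flag name
and the exact hook point are assumptions for illustration; the eventual
mainline change may look different):

    /* Hypothetical opt-in flag in the host template -- illustration only. */
    struct scsi_host_template {
        ...
        unsigned enable_cmd_list:1;    /* driver walks sdev->cmd_list */
    };

    /* Command setup path: only track the command when the driver opted in. */
    if (shost->hostt->enable_cmd_list) {
        spin_lock_irqsave(&sdev->list_lock, flags);
        list_add_tail(&cmd->list, &sdev->cmd_list);
        spin_unlock_irqrestore(&sdev->list_lock, flags);
    }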


Re: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.

2014-08-19 Thread Kashyap Desai
On Tue, Aug 19, 2014 at 3:51 AM, Kashyap Desai
 wrote:
>
> > -Original Message-
> > From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> > ow...@vger.kernel.org] On Behalf Of Christoph Hellwig
> > Sent: Friday, July 18, 2014 3:43 PM
> > To: James Bottomley; linux-s...@vger.kernel.org
> > Cc: Jens Axboe; Bart Van Assche; Mike Christie; Martin K. Petersen;
> Robert
> > Elliott; Webb Scales; linux-kernel@vger.kernel.org
> > Subject: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
> >
> > This patch adds support for an alternate I/O path in the scsi midlayer
> which
> > uses the blk-mq infrastructure instead of the legacy request code.
> >
> > Use of blk-mq is fully transparent to drivers, although for now a host
> > template field is provided to opt out of blk-mq usage in case any
> unforseen
> > incompatibilities arise.
> >
> > In general replacing the legacy request code with blk-mq is a simple and
> > mostly mechanical transformation.  The biggest exception is the new code
> > that deals with the fact the I/O submissions in blk-mq must happen from
> > process context, which slightly complicates the I/O completion handler.
> > The second biggest differences is that blk-mq is build around the
> concept of
> > preallocated requests that also include driver specific data, which in
> SCSI
> > context means the scsi_cmnd structure.  This completely avoids dynamic
> > memory allocations for the fast path through I/O submission.
> >
> > Due the preallocated requests the MQ code path exclusively uses the
> host-
> > wide shared tag allocator instead of a per-LUN one.  This only affects
> drivers
> > actually using the block layer provided tag allocator instead of their
> own.
> > Unlike the old path blk-mq always provides a tag, although drivers don't
> have
> > to use it.
> >
> > For now the blk-mq path is disable by defauly and must be enabled using
> the
> > "use_blk_mq" module parameter.  Once the remaining work in the block
> > layer to make blk-mq more suitable for slow devices is complete I hope
> to
> > make it the default and eventually even remove the old code path.
> >
> > Based on the earlier scsi-mq prototype by Nicholas Bellinger.
> >
> > Thanks to Bart Van Assche and Robert Elliot for testing, benchmarking
> and
> > various sugestions and code contributions.
> >
> > Signed-off-by: Christoph Hellwig 
> > Reviewed-by: Hannes Reinecke 
> > Reviewed-by: Webb Scales 
> > Acked-by: Jens Axboe 
> > Tested-by: Bart Van Assche 
> > Tested-by: Robert Elliott 
> > ---
> >  drivers/scsi/hosts.c  |  35 +++-
> >  drivers/scsi/scsi.c   |   5 +-
> >  drivers/scsi/scsi_lib.c   | 464
> > --
> >  drivers/scsi/scsi_priv.h  |   3 +
> >  drivers/scsi/scsi_scan.c  |   5 +-
> >  drivers/scsi/scsi_sysfs.c |   2 +
> >  include/scsi/scsi_host.h  |  18 +-
> >  include/scsi/scsi_tcq.h   |  28 ++-
> >  8 files changed, 488 insertions(+), 72 deletions(-)
> >
> > diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c index
> 0632eee..6de80e3
> > 100644
> > --- a/drivers/scsi/hosts.c
> > +++ b/drivers/scsi/hosts.c
> > @@ -213,9 +213,24 @@ int scsi_add_host_with_dma(struct Scsi_Host
> > *shost, struct device *dev,
> >   goto fail;
> >   }
> >
> > + if (shost_use_blk_mq(shost)) {
> > + error = scsi_mq_setup_tags(shost);
> > + if (error)
> > + goto fail;
> > + }
> > +
> > + /*
> > +  * Note that we allocate the freelist even for the MQ case for
> now,
> > +  * as we need a command set aside for scsi_reset_provider.  Having
> > +  * the full host freelist and one command available for that is a
> > +  * little heavy-handed, but avoids introducing a special allocator
> > +  * just for this.  Eventually the structure of scsi_reset_provider
> > +  * will need a major overhaul.
> > +  */
> >   error = scsi_setup_command_freelist(shost);
> >   if (error)
> > - goto fail;
> > + goto out_destroy_tags;
> > +
> >
> >   if (!shost->shost_gendev.parent)
> >   shost->shost_gendev.parent = dev ? dev : &platform_bus;
> > @@ -226,7 +241,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost,
> > struct device *dev,
> >
> >   error = device_add(&shost->shost_gendev);
> >   if (error)
> 


RE: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.

2014-08-18 Thread Kashyap Desai
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Friday, July 18, 2014 3:43 PM
> To: James Bottomley; linux-s...@vger.kernel.org
> Cc: Jens Axboe; Bart Van Assche; Mike Christie; Martin K. Petersen;
Robert
> Elliott; Webb Scales; linux-kernel@vger.kernel.org
> Subject: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
>
> This patch adds support for an alternate I/O path in the scsi midlayer
which
> uses the blk-mq infrastructure instead of the legacy request code.
>
> Use of blk-mq is fully transparent to drivers, although for now a host
> template field is provided to opt out of blk-mq usage in case any
unforseen
> incompatibilities arise.
>
> In general replacing the legacy request code with blk-mq is a simple and
> mostly mechanical transformation.  The biggest exception is the new code
> that deals with the fact the I/O submissions in blk-mq must happen from
> process context, which slightly complicates the I/O completion handler.
> The second biggest differences is that blk-mq is build around the
concept of
> preallocated requests that also include driver specific data, which in
SCSI
> context means the scsi_cmnd structure.  This completely avoids dynamic
> memory allocations for the fast path through I/O submission.
>
> Due the preallocated requests the MQ code path exclusively uses the
host-
> wide shared tag allocator instead of a per-LUN one.  This only affects
drivers
> actually using the block layer provided tag allocator instead of their
own.
> Unlike the old path blk-mq always provides a tag, although drivers don't
have
> to use it.
>
> For now the blk-mq path is disable by defauly and must be enabled using
the
> "use_blk_mq" module parameter.  Once the remaining work in the block
> layer to make blk-mq more suitable for slow devices is complete I hope
to
> make it the default and eventually even remove the old code path.
>
> Based on the earlier scsi-mq prototype by Nicholas Bellinger.
>
> Thanks to Bart Van Assche and Robert Elliot for testing, benchmarking
and
> various sugestions and code contributions.
>
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Hannes Reinecke 
> Reviewed-by: Webb Scales 
> Acked-by: Jens Axboe 
> Tested-by: Bart Van Assche 
> Tested-by: Robert Elliott 
> ---
>  drivers/scsi/hosts.c  |  35 +++-
>  drivers/scsi/scsi.c   |   5 +-
>  drivers/scsi/scsi_lib.c   | 464
> --
>  drivers/scsi/scsi_priv.h  |   3 +
>  drivers/scsi/scsi_scan.c  |   5 +-
>  drivers/scsi/scsi_sysfs.c |   2 +
>  include/scsi/scsi_host.h  |  18 +-
>  include/scsi/scsi_tcq.h   |  28 ++-
>  8 files changed, 488 insertions(+), 72 deletions(-)
>
> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c index
0632eee..6de80e3
> 100644
> --- a/drivers/scsi/hosts.c
> +++ b/drivers/scsi/hosts.c
> @@ -213,9 +213,24 @@ int scsi_add_host_with_dma(struct Scsi_Host
> *shost, struct device *dev,
>   goto fail;
>   }
>
> + if (shost_use_blk_mq(shost)) {
> + error = scsi_mq_setup_tags(shost);
> + if (error)
> + goto fail;
> + }
> +
> + /*
> +  * Note that we allocate the freelist even for the MQ case for
now,
> +  * as we need a command set aside for scsi_reset_provider.  Having
> +  * the full host freelist and one command available for that is a
> +  * little heavy-handed, but avoids introducing a special allocator
> +  * just for this.  Eventually the structure of scsi_reset_provider
> +  * will need a major overhaul.
> +  */
>   error = scsi_setup_command_freelist(shost);
>   if (error)
> - goto fail;
> + goto out_destroy_tags;
> +
>
>   if (!shost->shost_gendev.parent)
>   shost->shost_gendev.parent = dev ? dev : &platform_bus;
> @@ -226,7 +241,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost,
> struct device *dev,
>
>   error = device_add(&shost->shost_gendev);
>   if (error)
> - goto out;
> + goto out_destroy_freelist;
>
>   pm_runtime_set_active(&shost->shost_gendev);
>   pm_runtime_enable(&shost->shost_gendev);
> @@ -279,8 +294,11 @@ int scsi_add_host_with_dma(struct Scsi_Host
> *shost, struct device *dev,
>   device_del(&shost->shost_dev);
>   out_del_gendev:
>   device_del(&shost->shost_gendev);
> - out:
> + out_destroy_freelist:
>   scsi_destroy_command_freelist(shost);
> + out_destroy_tags:
> + if (shost_use_blk_mq(shost))
> + scsi_mq_destroy_tags(shost);
>   fail:
>   return error;
>  }
> @@ -309,8 +327,13 @@ static void scsi_host_dev_release(struct device
> *dev)
>   }
>
>   scsi_destroy_command_freelist(shost);
> - if (shost->bqt)
> - blk_free_tags(shost->bqt);
> + if (shost_use_blk_mq(shost)) {
> + if (shost->tag_set.tags)
> + scsi_mq_destroy_tags(shost);
> + } else {
> + 


RE: [PATCH v2 RESEND 14/23] megaraid: Use pci_enable_msix_range() instead of pci_enable_msix()

2014-08-11 Thread Kashyap Desai
> -Original Message-
> From: Alexander Gordeev [mailto:agord...@redhat.com]
> Sent: Monday, August 11, 2014 1:40 PM
> To: Kashyap Desai; Neela Syam Kolli
> Cc: linux-s...@vger.kernel.org; linux-...@vger.kernel.org; linux-
> ker...@vger.kernel.org; Christoph Hellwig
> Subject: Re: [PATCH v2 RESEND 14/23] megaraid: Use
> pci_enable_msix_range() instead of pci_enable_msix()
>
> On Wed, Jul 16, 2014 at 08:05:18PM +0200, Alexander Gordeev wrote:
> > As result of deprecation of MSI-X/MSI enablement functions
> > pci_enable_msix() and pci_enable_msi_block() all drivers using these
> > two interfaces need to be updated to use the new
> > pci_enable_msi_range()  or pci_enable_msi_exact() and
> > pci_enable_msix_range() or pci_enable_msix_exact() interfaces.
>
> Kashyap, Neela,
>
> Could you please reveiw this patch?

I have reviewed this patch. Please consider it Acked by me.
 Acked-by: Kashyap Desai 

>
> Thanks!
>
> > Signed-off-by: Alexander Gordeev 
> > Cc: Neela Syam Kolli 
> > Cc: linux-s...@vger.kernel.org
> > Cc: linux-...@vger.kernel.org
> > ---
> >  drivers/scsi/megaraid/megaraid_sas_base.c |   20 +++-
> >  1 files changed, 7 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> > b/drivers/scsi/megaraid/megaraid_sas_base.c
> > index ba06102..7a4e75e 100644
> > --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> > @@ -4105,17 +4105,11 @@ static int megasas_init_fw(struct
> megasas_instance *instance)
> >  (unsigned
int)num_online_cpus());
> > for (i = 0; i < instance->msix_vectors; i++)
> > instance->msixentry[i].entry = i;
> > -   i = pci_enable_msix(instance->pdev, instance->msixentry,
> > -   instance->msix_vectors);
> > -   if (i >= 0) {
> > -   if (i) {
> > -   if (!pci_enable_msix(instance->pdev,
> > -instance->msixentry,
i))
> > -   instance->msix_vectors = i;
> > -   else
> > -   instance->msix_vectors = 0;
> > -   }
> > -   } else
> > +   i = pci_enable_msix_range(instance->pdev, instance-
> >msixentry,
> > + 1, instance->msix_vectors);
> > +   if (i)
> > +   instance->msix_vectors = i;
> > +   else
> > instance->msix_vectors = 0;
> >
> > dev_info(&instance->pdev->dev, "[scsi%d]: FW supports"
> > @@ -5135,8 +5129,8 @@ megasas_resume(struct pci_dev *pdev)
> >
> > /* Now re-enable MSI-X */
> > if (instance->msix_vectors &&
> > -   pci_enable_msix(instance->pdev, instance->msixentry,
> > -   instance->msix_vectors))
> > +   pci_enable_msix_exact(instance->pdev, instance->msixentry,
> > + instance->msix_vectors))
> > goto fail_reenable_msix;
> >
> > switch (instance->pdev->device) {
> > --
> > 1.7.7.6
> >
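
For reference, pci_enable_msix_range() returns the number of vectors actually
allocated (between minvec and maxvec) on success and a negative errno on
failure, so only a positive return value means MSI-X is enabled. A minimal
sketch of the intended calling convention (illustrative only; pdev, entries and
max_vectors are stand-in names, not the driver's own):

    int nvec;

    nvec = pci_enable_msix_range(pdev, entries, 1, max_vectors);
    if (nvec < 0) {
        /* e.g. -ENOSPC or -EINVAL: fall back to legacy INTx */
        max_vectors = 0;
    } else {
        /* nvec >= 1: this many vectors were granted */
        max_vectors = nvec;
    }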


RE: [PATCH v2 RESEND 13/23] megaraid: Fail resume if MSI-X re-initialization failed

2014-07-28 Thread Kashyap Desai
> -Original Message-
> From: Alexander Gordeev [mailto:agord...@redhat.com]
> Sent: Saturday, July 26, 2014 1:54 PM
> To: linux-kernel@vger.kernel.org; Neela Syam Kolli
> Subject: Re: [PATCH v2 RESEND 13/23] megaraid: Fail resume if MSI-X re-
> initialization failed
>
> On Wed, Jul 16, 2014 at 08:05:17PM +0200, Alexander Gordeev wrote:
> > Currently the driver fails to analize MSI-X re-enablement status on
> > resuming and always assumes the success. This update checks the MSI-X
> > initialization result and fails to resume if MSI-Xs re-enablement
> > failed.
>
> Hi Neela,
>
> Could you please review megaraid patches in this series?
>
> Thanks!
Please consider this patch Acked by me.

Acked-by: Kashyap Desai 

>
> > Signed-off-by: Alexander Gordeev 
> > ---
> >  drivers/scsi/megaraid/megaraid_sas_base.c |8 +---
> >  1 files changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> > b/drivers/scsi/megaraid/megaraid_sas_base.c
> > index 112799b..ba06102 100644
> > --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> > @@ -5134,9 +5134,10 @@ megasas_resume(struct pci_dev *pdev)
> > goto fail_ready_state;
> >
> > /* Now re-enable MSI-X */
> > -   if (instance->msix_vectors)
> > -   pci_enable_msix(instance->pdev, instance->msixentry,
> > -   instance->msix_vectors);
> > +   if (instance->msix_vectors &&
> > +   pci_enable_msix(instance->pdev, instance->msixentry,
> > +   instance->msix_vectors))
> > +   goto fail_reenable_msix;
> >
> > switch (instance->pdev->device) {
> > case PCI_DEVICE_ID_LSI_FUSION:
> > @@ -5245,6 +5246,7 @@ fail_init_mfi:
> >
> >  fail_set_dma_mask:
> >  fail_ready_state:
> > +fail_reenable_msix:
> >
> > pci_disable_device(pdev);
> >
> > --
> > 1.7.7.6
> >
>
> --
> Regards,
> Alexander Gordeev
> agord...@redhat.com

