RE: Broadcom 9460 raid card takes too long at system resuming
Hi Chris - Most likely the behavior you described is associated with how much time the FW takes to become active. On an actual init from a fresh boot, the FW starts running as soon as the system is powered on, which the user may not be aware of. By the time OS boot reaches driver load on a fresh boot, enough time has already been spent in system bring-up. This is not true in the case of resume (hibernation).

Kashyap

> -----Original Message-----
> From: Chris Chiu [mailto:chris.c...@canonical.com]
> Sent: Monday, April 19, 2021 3:45 PM
> To: kashyap.de...@broadcom.com; sumit.sax...@broadcom.com;
> shivasharan.srikanteshw...@broadcom.com; megaraidlinux@broadcom.com
> Cc: linux-s...@vger.kernel.org; Linux Kernel
> Subject: Broadcom 9460 raid card takes too long at system resuming
>
> Hi,
> We found that the Broadcom 9460 RAID card will take ~40 seconds in
> megasas_resume. It is mainly waiting for the FW to come to ready state;
> please refer to the following kernel log. The FW version is
> "megasas: 07.714.04.00-rc1". It seems that the
> megasas_transition_to_ready() loop costs ~40 seconds in megasas_resume.
> However, the same megasas_transition_to_ready() function only takes a few
> milliseconds to complete in megasas_init_fw(). The .read_fw_status_reg
> maps to megasas_read_fw_status_reg_fusion. I tried to add
> pci_enable_device_mem() and pci_set_master() before
> megasas_transition_to_ready() in megasas_resume(), but it makes no
> difference.
>
> I don't really know what makes the difference between driver probe and
> resume. The lspci information of the raid controller is here:
> https://gist.github.com/mschiu77/e74ec084cc925643add845fa4dc31ab6
> Any suggestions about what I can do to find out the cause? Thanks.
> > [   62.357688] megaraid_sas :45:00.0: megasas_resume is called
> > [   62.357719] megaraid_sas :45:00.0: Waiting for FW to come to ready state
> > [  104.382571] megaraid_sas :45:00.0: FW now in Ready state
> > [  104.382576] megaraid_sas :45:00.0: 63 bit DMA mask and 63 bit consistent mask
> > [  104.383350] megaraid_sas :45:00.0: requested/available msix 33/33
> > [  104.383669] megaraid_sas :45:00.0: Performance mode :Latency
> > [  104.383671] megaraid_sas :45:00.0: FW supports sync cache : Yes
> > [  104.383677] megaraid_sas :45:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x4009
> > [  104.550570] megaraid_sas :45:00.0: FW provided supportMaxExtLDs: 1 max_lds: 64
> > [  104.550574] megaraid_sas :45:00.0: controller type : MR(4096MB)
> > [  104.550575] megaraid_sas :45:00.0: Online Controller Reset(OCR) : Enabled
> > [  104.550577] megaraid_sas :45:00.0: Secure JBOD support : Yes
> > [  104.550579] megaraid_sas :45:00.0: NVMe passthru support : Yes
> > [  104.550581] megaraid_sas :45:00.0: FW provided TM TaskAbort/Reset timeout: 6 secs/60 secs
> > [  104.550583] megaraid_sas :45:00.0: JBOD sequence map support : Yes
> > [  104.550585] megaraid_sas :45:00.0: PCI Lane Margining support: No
> > [  104.550999] megaraid_sas :45:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x4000
>
> Chris
RE: problem booting 5.10
> -----Original Message-----
> From: John Garry [mailto:john.ga...@huawei.com]
> Sent: Wednesday, December 9, 2020 9:22 PM
> To: Julia Lawall; Kashyap Desai
> Cc: Martin K. Petersen; Linus Torvalds; James E.J. Bottomley;
> Linux Kernel Mailing List <linux-ker...@vger.kernel.org>;
> nicolas.pa...@univ-grenoble-alpes.fr; linux-scsi; Ming Lei;
> Sumit Saxena; Shivasharan S
> Subject: Re: problem booting 5.10
>
> On 09/12/2020 15:44, Julia Lawall wrote:
> >
> > On Tue, 8 Dec 2020, John Garry wrote:
> >
> >> On 08/12/2020 22:51, Martin K. Petersen wrote:
> >>> Julia,
> >>>
> >>>> This solves the problem. Starting from 5.10-rc7 and doing this
> >>>> revert, I get a kernel that boots.
> >>
> >> Hi Julia,
> >>
> >> Can you also please test Ming's patchset here (without the megaraid
> >> sas revert) when you get a chance:
> >> https://lore.kernel.org/linux-block/20201203012638.543321-1-ming.lei@redhat.com/
> >
> > 5.10-rc7 plus these three commits boots fine.
>
> Hi Julia,
>
> Ok, thanks for the confirmation. A sort of relief.
>
> @Kashyap, it would be good if we could recreate this, just in case.

I already tested the series and the issue is fixed for me. The final patch (V2) provided by Ming already has a "Tested-by" tag from me.

I once again confirmed using the config file provided by Julia Lawall, with the same result: the issue was recreated once again and is fixed by Ming's patch below.
https://lore.kernel.org/linux-block/20201203012638.543321-2-ming@redhat.com/

Kashyap

> Cheers,
> John
RE: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug
> > v5.10-rc2 is also broken here.
>
> John, Kashyap, any update on this? If this is going to take a while to
> fix properly, should I send a patch to revert this, or at least disable
> the feature by default for megaraid_sas in the meantime, so it no longer
> breaks the existing systems out there?

I am trying to get similar h/w to try it out. All my current h/w works fine. Give me a couple of days' time. If this is not an obviously common issue and needs time, we will go with the module-parameter disable method. I will let you know.

Kashyap
RE: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug
> On Wed, 2020-08-19 at 23:20 +0800, John Garry wrote:
> > From: Kashyap Desai
> >
> > Fusion adapters can steer completions to individual queues, and we now
> > have support for shared host-wide tags.
> > So we can enable multiqueue support for fusion adapters.
> >
> > Once the driver enables shared host-wide tags, the cpu hotplug feature
> > is also supported, as it was enabled by commit bf0beec0607d ("blk-mq:
> > drain I/O when all CPUs in a hctx are offline").
> >
> > Currently the driver has a provision to disable host-wide tags using
> > the "host_tagset_enable" module parameter.
> >
> > Once we do not have any major performance regression using host-wide
> > tags, we will drop the hand-crafted interrupt affinity settings.
> >
> > Performance is also meeting expectations (used both none and
> > mq-deadline schedulers):
> > 24-drive SSD on Aero with/without this patch can get 3.1M IOPS.
> > 3 VDs consisting of 8 SAS SSDs on Aero with/without this patch can get
> > 3.1M IOPS.
> >
> > Signed-off-by: Kashyap Desai
> > Signed-off-by: Hannes Reinecke
> > Signed-off-by: John Garry
>
> Reverting this commit fixed an issue where a Dell PowerEdge R6415 server
> with megaraid_sas is unable to boot.

I will take a look at this. BTW, can you try keeping the same patch but
using the module parameter "host_tagset_enable=0"?

Kashyap
RE: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
> > John,
> >
> > Have you had a chance to check these outstanding SCSI patches?
> >
> >   scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug
> >   scsi: scsi_debug: Support host tagset
> >   scsi: hisi_sas: Switch v3 hw to MQ
> >   scsi: core: Show nr_hw_queues in sysfs
> >   scsi: Add host and host template flag 'host_tagset'
>
> These look good to me.
>
> Jens, feel free to merge.

Hi Jens,

Gentle ping. I am not able to find commits for the SCSI patches listed above. I want to use your repo, which has the above-mentioned patches, for patch submission. Martin has Acked the SCSI patches.

> Acked-by: Martin K. Petersen
>
> --
> Martin K. Petersen    Oracle Linux Engineering
RE: [PATCH 1/8] scsi: add a host / host template field for the virt boundary
> -----Original Message-----
> From: megaraidlinux@broadcom.com [mailto:megaraidlinux@broadcom.com]
> On Behalf Of Ming Lei
> Sent: Tuesday, June 18, 2019 6:05 AM
> To: Christoph Hellwig
> Cc: Martin K. Petersen; Sagi Grimberg; Max Gurtovoy; Bart Van Assche;
> linux-rdma; Linux SCSI List; megaraidlinux@broadcom.com;
> mpt-fusionlinux@broadcom.com; linux-hyp...@vger.kernel.org;
> Linux Kernel Mailing List
> Subject: Re: [PATCH 1/8] scsi: add a host / host template field for the
> virt boundary
>
> On Mon, Jun 17, 2019 at 8:21 PM Christoph Hellwig wrote:
> >
> > This allows drivers setting it up easily instead of branching out to
> > block layer calls in slave_alloc, and ensures the upgraded
> > max_segment_size setting gets picked up by the DMA layer.
> >
> > Signed-off-by: Christoph Hellwig
> > ---
> >  drivers/scsi/hosts.c     | 3 +++
> >  drivers/scsi/scsi_lib.c  | 3 ++-
> >  include/scsi/scsi_host.h | 3 +++
> >  3 files changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
> > index ff0d8c6a8d0c..55522b7162d3 100644
> > --- a/drivers/scsi/hosts.c
> > +++ b/drivers/scsi/hosts.c
> > @@ -462,6 +462,9 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
> >         else
> >                 shost->dma_boundary = 0xffffffff;
> >
> > +       if (sht->virt_boundary_mask)
> > +               shost->virt_boundary_mask = sht->virt_boundary_mask;
> > +
> >         device_initialize(&shost->shost_gendev);
> >         dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
> >         shost->shost_gendev.bus = &scsi_bus_type;
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index 65d0a10c76ad..d333bb6b1c59 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -1775,7 +1775,8 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
> >         dma_set_seg_boundary(dev, shost->dma_boundary);
> >
> >         blk_queue_max_segment_size(q, shost->max_segment_size);
> > -       dma_set_max_seg_size(dev, shost->max_segment_size);
> > +       blk_queue_virt_boundary(q, shost->virt_boundary_mask);
> > +       dma_set_max_seg_size(dev, queue_max_segment_size(q));
>
> The patch looks fine; I also suggest making sure that max_segment_size is
> block-size aligned, as an unaligned max segment size has caused trouble
> on mmc.

I validated the changes on the latest and a few older series of controllers.
After the changes, I noticed max_segment_size is updated:

find /sys/ -name max_segment_size | grep sdb | xargs grep -r .
/sys/devices/pci:3a/:3a:00.0/:3b:00.0/:3c:04.0/:40:00.0/:41:00.0/:42:00.0/host0/target0:2:12/0:2:12:0/block/sdb/queue/max_segment_size:4294967295

I verified that a single SGE with a 1MB transfer length is working fine.

Acked-by: Kashyap Desai <kashyap.de...@broadcom.com>

> Thanks,
> Ming Lei
RE: [PATCH 10/13] megaraid_sas: set virt_boundary_mask in the scsi host
> So before I respin this series, can you help with a way to figure out
> for mpt3sas and megaraid if a given controller supports NVMe devices at
> all, so that we don't have to set the virt boundary if not?

In MegaRAID we have the enum below; VENTURA_SERIES and AERO_SERIES support
NVMe:

enum MR_ADAPTER_TYPE {
	MFI_SERIES = 1,
	THUNDERBOLT_SERIES = 2,
	INVADER_SERIES = 3,
	VENTURA_SERIES = 4,
	AERO_SERIES = 5,
};

In the mpt3sas driver we have the method below: if IOC FACTS reports NVMe
device support in the protocol flags, we can consider it an HBA with NVMe
drive support.

ioc->facts.ProtocolFlags & MPI2_IOCFACTS_PROTOCOL_NVME_DEVICES

Kashyap
RE: [RFC PATCH] irq/affinity: Mark the pre/post vectors as regular interrupts
> This is the wrong direction as it does not allow doing initial affinity
> assignment for the non-managed interrupts at allocation time. And that's
> what Kashyap and Sumit are looking for.
>
> The trivial fix for the possible breakage when irq_default_affinity !=
> cpu_possible_mask is to set the affinity for the pre/post vectors to
> cpu_possible_mask and be done with it.
>
> One other thing I noticed while staring at that is the fact that the PCI
> code does not care about the return value of irq_create_affinity_masks()
> at all. It just proceeds if masks is NULL, as if the stuff was requested
> with the affinity descriptor pointer being NULL. I don't think that's a
> brilliant idea because the drivers might rely on the interrupt being
> managed, but it might be a non-issue and just result in bad locality.
>
> Christoph?
>
> So back to the problem at hand. We need to change the affinity
> management scheme in a way which lets us differentiate between managed
> and unmanaged interrupts, so it allows us to automatically assign
> (initial) affinity to all of them.
>
> Right now everything is bound to the cpumasks array, which does not
> allow conveying more information. So we need to come up with something
> different.
>
> Something like the below (does not compile and is just for reference)
> should do the trick. I'm not sure whether we want to convey the
> information (masks and bitmap) in a single data structure or not, but
> that's an implementation detail.

Dou - Do you want me to test your patch, or shall I wait for the next
version?
> Thanks,
>
>         tglx
>
> 8<-
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -535,15 +535,16 @@ static struct msi_desc *
>  msi_setup_entry(struct pci_dev *dev, int nvec, const struct irq_affinity *affd)
>  {
>  	struct cpumask *masks = NULL;
> +	unsigned long managed = 0;
>  	struct msi_desc *entry;
>  	u16 control;
>
>  	if (affd)
> -		masks = irq_create_affinity_masks(nvec, affd);
> +		masks = irq_create_affinity_masks(nvec, affd, &managed);
>
>  	/* MSI Entry Initialization */
> -	entry = alloc_msi_entry(&dev->dev, nvec, masks);
> +	entry = alloc_msi_entry(&dev->dev, nvec, masks, managed);
>  	if (!entry)
>  		goto out;
>
> @@ -672,15 +673,22 @@ static int msix_setup_entries(struct pci
>  			      struct msix_entry *entries, int nvec,
>  			      const struct irq_affinity *affd)
>  {
> +	/*
> +	 * MSIX_MAX_VECTORS = 2048, i.e. 256 bytes. Might need runtime
> +	 * allocation. OTOH, are 2048 vectors realistic?
> +	 */
> +	DECLARE_BITMAP(managed, MSIX_MAX_VECTORS);
>  	struct cpumask *curmsk, *masks = NULL;
>  	struct msi_desc *entry;
>  	int ret, i;
>
>  	if (affd)
> -		masks = irq_create_affinity_masks(nvec, affd);
> +		masks = irq_create_affinity_masks(nvec, affd, managed);
>
>  	for (i = 0, curmsk = masks; i < nvec; i++) {
> -		entry = alloc_msi_entry(&dev->dev, 1, curmsk);
> +		unsigned long m = test_bit(i, managed) ? 1 : 0;
> +
> +		entry = alloc_msi_entry(&dev->dev, 1, curmsk, m);
>  		if (!entry) {
>  			if (!i)
>  				iounmap(base);
> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -27,7 +27,8 @@
>   * and the affinity masks from @affinity are copied.
>   */
>  struct msi_desc *
> -alloc_msi_entry(struct device *dev, int nvec, const struct cpumask *affinity)
> +alloc_msi_entry(struct device *dev, int nvec, const struct cpumask *affinity,
> +		unsigned long managed)
>  {
>  	struct msi_desc *desc;
>
> @@ -38,6 +39,7 @@ alloc_msi_entry(struct device *dev, int
>  	INIT_LIST_HEAD(&desc->list);
>  	desc->dev = dev;
>  	desc->nvec_used = nvec;
> +	desc->managed = managed;
>  	if (affinity) {
>  		desc->affinity = kmemdup(affinity,
>  				nvec * sizeof(*desc->affinity), GFP_KERNEL);
> @@ -416,7 +418,7 @@ int msi_domain_alloc_irqs(struct irq_dom
>
>  		virq = __irq_domain_alloc_irqs(domain, -1, desc->nvec_used,
>  					       dev_to_node(dev), &arg, false,
> -					       desc->affinity);
> +					       desc->affinity, desc->managed);
>  		if (virq < 0) {
>  			ret = -ENOSPC;
>  			if (ops->handle_error)
RE: Affinity managed interrupts vs non-managed interrupts
> The point I don't get here is why you need separate reply queues for
> the interrupt coalesce setting. Shouldn't this just be a flag at
> submission time that indicates the amount of coalescing that should
> happen?
>
> What is the benefit of having different completion queues?

By having different sets of queues (something like N:16, where N queues
run without interrupt coalescing and 16 dedicated queues run with
interrupt coalescing), we want to avoid the penalty introduced by
interrupt coalescing, especially for lower-QD profiles.

Kashyap
RE: Affinity managed interrupts vs non-managed interrupts
> > It is not yet finalized, but it can be based on per-sdev outstanding,
> > shost_busy etc.
> > We want to use a special set of 16 reply queues for IO acceleration
> > (these queues work in interrupt coalescing mode; this is a h/w
> > feature).
>
> TBH, this does not make any sense whatsoever. Why are you trying to
> have extra interrupts for coalescing instead of doing the following:

Thomas,

We are using this feature mainly for performance and not for CPU hotplug
issues. I read your #1 to #4 points below as addressing CPU hotplug
concerns, right? We also want to make sure that if we convert the
megaraid_sas driver from managed to non-managed interrupts, we can still
meet the CPU hotplug requirement.

If we use "pci_enable_msix_range" and manually set affinity in the driver
using irq_set_affinity_hint, the cpu hotplug feature works as expected.
The system is able to retain the older mapping, and whenever an offlined
cpu comes back, irqbalance restores the same old mapping.

If we use all 72 reply queues (all in interrupt coalescing mode) without
any extra reply queues, we don't have any issue with cpu-msix mapping and
cpu hotplug. Our major problem with that method is that latency is very
bad at lower QD and/or in the single-worker case.

To solve that problem we have added an extra 16 reply queues (a special
h/w feature for performance only) which work in interrupt coalescing
mode, while the existing 72 reply queues work without any interrupt
coalescing. The best way to map the additional 16 reply queues is to map
them to the local numa node.

I understand that it is a unique requirement, but at the same time we may
be able to do it gracefully (in the irq subsystem), given that, as you
mentioned, "irq_set_affinity_hint" should be avoided in low-level
drivers.

> 1) Allocate 72 reply queues which get nicely spread out to every CPU on
>    the system with affinity spreading.
>
> 2) Have a configuration for your reply queues which allows them to be
>    grouped, e.g. by physical package.
> 3) Have a mechanism to mark a reply queue offline/online and handle
>    that on CPU hotplug. That means on unplug you have to wait for the
>    reply queue which is associated to the outgoing CPU to be empty and
>    no new requests to be queued, which has to be done for the regular
>    per-CPU reply queues anyway.
>
> 4) On queueing the request, flag it 'coalescing', which causes the
>    hardware/firmware to direct the reply to the first online reply
>    queue in the group.
>
> If the last CPU of a group goes offline, then the normal hotplug
> mechanism takes effect and the whole thing is put 'offline' as well.
> This works nicely for all kinds of scenarios, even if you have more
> CPUs than queues. No extras, no magic affinity hints, it just works.
>
> Hmm?
>
> > Yes. We did not use "pci_alloc_irq_vectors_affinity".
> > We used "pci_enable_msix_range" and manually set affinity in the
> > driver using irq_set_affinity_hint.
>
> I still regret the day when I merged that abomination.

Is it possible to have a similar mapping in the managed-interrupt case,
as below?

	for (i = 0; i < 16; i++)
		irq_set_affinity_hint(pci_irq_vector(instance->pdev, i),
				      cpumask_of_node(local_numa_node));

Currently we always see that managed interrupts for the pre-vectors are
mapped to CPUs 0-71 and the effective cpu is always 0. We want some
changes in the current API which would allow us to pass flags (like
*local numa affinity*) so that the cpu-msix mappings come from the local
numa node and the effective cpus are spread across the local numa node.

> Thanks,
>
>         tglx
RE: Affinity managed interrupts vs non-managed interrupts
Hi Thomas, Ming, Chris et al.,

Your input will help us decide on changes for the megaraid_sas driver. We
are currently waiting for a community response.

Is it recommended to use "pci_enable_msix_range" and have the low-level
driver do the affinity setting, given that the current APIs around
pci_alloc_irq_vectors do not meet our requirement? We want more MSI-x
vectors than online CPUs, and using pre_vectors we can do that, but the
first 16 msix vectors should be mapped to the local numa node with the
effective cpus spread across the cpus of the local numa node. This is not
possible using pci_alloc_irq_vectors_affinity.

Do we need kernel API changes, or should we have the low-level driver
manage it via irq_set_affinity_hint?

Kashyap

> -----Original Message-----
> From: Sumit Saxena [mailto:sumit.sax...@broadcom.com]
> Sent: Wednesday, August 29, 2018 4:46 AM
> To: Ming Lei
> Cc: t...@linutronix.de; h...@lst.de; linux-kernel@vger.kernel.org;
> Kashyap Desai; Shivasharan Srikanteshwara
> Subject: RE: Affinity managed interrupts vs non-managed interrupts
>
> > -----Original Message-----
> > From: Ming Lei [mailto:ming@redhat.com]
> > Sent: Wednesday, August 29, 2018 2:16 PM
> > To: Sumit Saxena
> > Cc: t...@linutronix.de; h...@lst.de; linux-kernel@vger.kernel.org
> > Subject: Re: Affinity managed interrupts vs non-managed interrupts
> >
> > Hello Sumit,
>
> Hi Ming,
> Thanks for the response.
>
> > On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote:
> > > Affinity managed interrupts vs non-managed interrupts
> > >
> > > Hi Thomas,
> > >
> > > We are working on a next-generation MegaRAID product where the
> > > requirement is to allocate an additional 16 MSI-x vectors on top of
> > > the number of MSI-x vectors the megaraid_sas driver usually
> > > allocates. The MegaRAID adapter supports 128 MSI-x vectors.
> > >
> > > To explain the requirement and solution, consider that we have a
> > > 2-socket system (each socket having 36 logical CPUs).
> > > The current driver will allocate a total of 72 MSI-x vectors by
> > > calling the API pci_alloc_irq_vectors (with flag PCI_IRQ_AFFINITY).
> > > All 72 MSI-x vectors will have affinity across NUMA nodes and the
> > > interrupts are affinity managed.
> > >
> > > If the driver calls pci_alloc_irq_vectors_affinity() with
> > > pre_vectors = 16, the driver can allocate 16 + 72 MSI-x vectors.
> >
> > Could you explain a bit what the specific use case for the extra 16
> > vectors is?
>
> We are trying to avoid the penalty of one interrupt per IO completion,
> and decided to coalesce interrupts on these extra 16 reply queues.
> For the regular 72 reply queues, we will not coalesce interrupts, as
> for low-IO workloads interrupt coalescing may take more time due to
> fewer IO completions.
> In the IO submission path, the driver will decide which set of reply
> queues (either the extra 16 reply queues or the regular 72 reply
> queues) to pick based on the IO workload.
>
> > > All pre_vectors (16) will be mapped to all available online CPUs,
> > > but the effective affinity of each vector is to CPU 0. Our
> > > requirement is to have the pre_vectors 16 reply queues mapped to the
> > > local NUMA node, with the effective CPUs spread within the local
> > > node's cpu mask. Without changing kernel code, we can
> >
> > If all CPUs in one NUMA node are offline, can this use case work as
> > expected? Seems we have to understand what the use case is and how it
> > works.
>
> Yes, if all CPUs of the NUMA node are offlined, the IRQ-CPU affinity
> will be broken and irqbalance takes care of migrating the affected IRQs
> to online CPUs of a different NUMA node.
> When offline CPUs are onlined again, irqbalance restores the affinity.
>
> > Thanks,
> > Ming
RE: Affinity managed interrupts vs non-managed interrupts
Hi Thomas, Ming, Chris et all, Your input will help us to do changes for megaraid_sas driver. We are currently waiting for community response. Is it recommended to use " pci_enable_msix_range" and have low level driver do affinity setting because current APIs around pci_alloc_irq_vectors do not meet our requirement. We want more than online CPU msix vectors and using pre_vector we can do that, but first 16 msix should be mapped to local numa node with effective cpu spread across cpus of local numa node. This is not possible using pci_alloc_irq_vectors_affinity. Do we need kernel API changes or let's have low level driver to manage it via irq_set_affinity_hint ? Kashyap > -Original Message- > From: Sumit Saxena [mailto:sumit.sax...@broadcom.com] > Sent: Wednesday, August 29, 2018 4:46 AM > To: Ming Lei > Cc: t...@linutronix.de; h...@lst.de; linux-kernel@vger.kernel.org; Kashyap > Desai; Shivasharan Srikanteshwara > Subject: RE: Affinity managed interrupts vs non-managed interrupts > > > -Original Message- > > From: Ming Lei [mailto:ming@redhat.com] > > Sent: Wednesday, August 29, 2018 2:16 PM > > To: Sumit Saxena > > Cc: t...@linutronix.de; h...@lst.de; linux-kernel@vger.kernel.org > > Subject: Re: Affinity managed interrupts vs non-managed interrupts > > > > Hello Sumit, > Hi Ming, > Thanks for response. > > > > On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote: > > > Affinity managed interrupts vs non-managed interrupts > > > > > > Hi Thomas, > > > > > > We are working on next generation MegaRAID product where requirement > > > is- to allocate additional 16 MSI-x vectors in addition to number of > > > MSI-x vectors megaraid_sas driver usually allocates. MegaRAID adapter > > > supports 128 MSI-x vectors. > > > > > > To explain the requirement and solution, consider that we have 2 > > > socket system (each socket having 36 logical CPUs). 
The current driver > > > will allocate a total of 72 MSI-x vectors by calling the API > > > pci_alloc_irq_vectors() (with the flag PCI_IRQ_AFFINITY). All 72 MSI-x > > > vectors will have affinity across NUMA nodes and the interrupts are > affinity > > managed. > > > > > > If the driver calls pci_alloc_irq_vectors_affinity() with pre_vectors = > > > 16, the driver can allocate 16 + 72 MSI-x vectors. > > > > Could you explain a bit what the specific use case for the extra 16 vectors > is? > We are trying to avoid the penalty due to one interrupt per IO completion > and decided to coalesce interrupts on these extra 16 reply queues. > For the regular 72 reply queues, we will not coalesce interrupts, as for a low IO > workload interrupt coalescing may take more time due to fewer IO > completions. > In the IO submission path, the driver will decide which set of reply queues > (either the extra 16 reply queues or the regular 72 reply queues) to pick > based on the IO workload. > > > > > > > > All pre_vectors (16) will be mapped to all available online CPUs but the > > > effective affinity of each vector is to CPU 0. Our requirement is to > > > have the pre_vectors 16 reply queues mapped to the local NUMA node with > > > the effective CPUs spread within the local node cpu mask. Without > > > changing kernel code, we can > > > > If all CPUs in one NUMA node are offline, can this use case work as > expected? > > Seems we have to understand what the use case is and how it works. > > Yes, if all CPUs of the NUMA node are offlined, the IRQ-CPU affinity will be > broken and irqbalance takes care of migrating the affected IRQs to online > CPUs of a different NUMA node. > When the offline CPUs are onlined again, irqbalance restores the affinity. > > > > > > Thanks, > > Ming
RE: system hung up when offlining CPUs
> > On 09/12/2017 08:15 PM, YASUAKI ISHIMATSU wrote: > > + linux-scsi and maintainers of megasas > > > > When offlining CPU, I/O stops. Do you have any ideas? > > > > On 09/07/2017 04:23 PM, YASUAKI ISHIMATSU wrote: > >> Hi Mark and Christoph, > >> > >> Sorry for the late reply. I appreciated that you fixed the issue on the kvm > environment. > >> But the issue still occurs on a physical server. > >> > >> Here is the irq information for the megasas irqs that I summarized from > >> /proc/interrupts and /proc/irq/*/smp_affinity_list on my server: > >> > >> --- > >> IRQ affinity_list IRQ_TYPE > >> 42 0-5 IR-PCI-MSI 1048576-edge megasas > >> 43 0-5 IR-PCI-MSI 1048577-edge megasas > >> 44 0-5 IR-PCI-MSI 1048578-edge megasas > >> 45 0-5 IR-PCI-MSI 1048579-edge megasas > >> 46 0-5 IR-PCI-MSI 1048580-edge megasas > >> 47 0-5 IR-PCI-MSI 1048581-edge megasas > >> 48 0-5 IR-PCI-MSI 1048582-edge megasas > >> 49 0-5 IR-PCI-MSI 1048583-edge megasas > >> 50 0-5 IR-PCI-MSI 1048584-edge megasas > >> 51 0-5 IR-PCI-MSI 1048585-edge megasas > >> 52 0-5 IR-PCI-MSI 1048586-edge megasas > >> 53 0-5 IR-PCI-MSI 1048587-edge megasas > >> 54 0-5 IR-PCI-MSI 1048588-edge megasas > >> 55 0-5 IR-PCI-MSI 1048589-edge megasas > >> 56 0-5 IR-PCI-MSI 1048590-edge megasas > >> 57 0-5 IR-PCI-MSI 1048591-edge megasas > >> 58 0-5 IR-PCI-MSI 1048592-edge megasas > >> 59 0-5 IR-PCI-MSI 1048593-edge megasas > >> 60 0-5 IR-PCI-MSI 1048594-edge megasas > >> 61 0-5 IR-PCI-MSI 1048595-edge megasas > >> 62 0-5 IR-PCI-MSI 1048596-edge megasas > >> 63 0-5 IR-PCI-MSI 1048597-edge megasas > >> 64 0-5 IR-PCI-MSI 1048598-edge megasas > >> 65 0-5 IR-PCI-MSI 1048599-edge megasas > >> 66 24-29 IR-PCI-MSI 1048600-edge megasas > >> 67 24-29 IR-PCI-MSI 1048601-edge megasas > >> 68 24-29 IR-PCI-MSI 1048602-edge megasas > >> 69 24-29 IR-PCI-MSI 1048603-edge megasas > >> 70 24-29 IR-PCI-MSI 1048604-edge megasas > >> 71 24-29 IR-PCI-MSI 1048605-edge megasas > >> 72 24-29 IR-PCI-MSI 1048606-edge megasas > >> 73 24-29 IR-PCI-MSI 1048607-edge megasas > >> 74 24-29 IR-PCI-MSI 1048608-edge megasas > >> 
75 24-29 IR-PCI-MSI 1048609-edge megasas > >> 76 24-29 IR-PCI-MSI 1048610-edge megasas > >> 77 24-29 IR-PCI-MSI 1048611-edge megasas > >> 78 24-29 IR-PCI-MSI 1048612-edge megasas > >> 79 24-29 IR-PCI-MSI 1048613-edge megasas > >> 80 24-29 IR-PCI-MSI 1048614-edge megasas > >> 81 24-29 IR-PCI-MSI 1048615-edge megasas > >> 82 24-29 IR-PCI-MSI 1048616-edge megasas > >> 83 24-29 IR-PCI-MSI 1048617-edge megasas > >> 84 24-29 IR-PCI-MSI 1048618-edge megasas > >> 85 24-29 IR-PCI-MSI 1048619-edge megasas > >> 86 24-29 IR-PCI-MSI 1048620-edge megasas > >> 87 24-29 IR-PCI-MSI 1048621-edge megasas > >> 88 24-29 IR-PCI-MSI 1048622-edge megasas > >> 89 24-29 IR-PCI-MSI 1048623-edge megasas > >> --- > >> > >> In my server, IRQ#66-89 are sent to CPU#24-29. And if I offline > >> CPU#24-29, I/O does not work, showing the following messages. > >> > >> --- > >> [...] sd 0:2:0:0: [sda] tag#1 task abort called for > >> scmd(8820574d7560) [...] sd 0:2:0:0: [sda] tag#1 CDB: Read(10) 28 > >> 00 0d e8 cf 78 00 00 08 00 [...] sd 0:2:0:0: task abort: FAILED > >> scmd(8820574d7560) [...] sd 0:2:0:0: [sda] tag#0 task abort > >> called for scmd(882057426560) [...] sd 0:2:0:0: [sda] tag#0 CDB: > >> Write(10) 2a 00 0d 58 37 00 00 00 08 00 [...] sd 0:2:0:0: task abort: > >> FAILED scmd(882057426560) [...] sd 0:2:0:0: target reset called > >> for scmd(8820574d7560) [...] sd 0:2:0:0: [sda] tag#1 megasas: > >> target > reset FAILED!! > >> [...] sd 0:2:0:0: [sda] tag#0 Controller reset is requested due to IO > >> timeout > >> [...] SCSI command pointer: (882057426560) SCSI host state: 5 > >> SCSI > >> [...] IO request frame: > >> [...] > >> > >> [...] > >> [...] megaraid_sas :02:00.0: [ 0]waiting for 2 commands to > >> complete for scsi0 [...] INFO: task auditd:1200 blocked for more than > >> 120 > seconds. > >> [...] Not tainted 4.13.0+ #15 > >> [...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this > message. > >> [...] auditd D0 1200 1 0x > >> [...] Call Trace: > >> [...] 
__schedule+0x28d/0x890 > >> [...] schedule+0x36/0x80 > >> [...] io_schedule+0x16/0x40 > >> [...] wait_on_page_bit_common+0x109/0x1c0 > >> [...] ? page_cache_tree_insert+0xf0/0xf0 [...] > >> __filemap_fdatawait_range+0x127/0x190 > >> [...] ? __filemap_fdatawrite_range+0xd1/0x100 > >> [...] file_write_and_wait_range+0x60/0xb0 > >> [...] xfs_file_fsync+0x67/0x1d0 [xfs] [...] > >>
RE: [PATCH v2 00/13] mpt3sas driver NVMe support:
> -Original Message- > From: James Bottomley [mailto:james.bottom...@hansenpartnership.com] > Sent: Monday, August 07, 2017 7:48 PM > To: Kashyap Desai; Christoph Hellwig; Hannes Reinecke > Cc: Suganath Prabu Subramani; martin.peter...@oracle.com; linux- > s...@vger.kernel.org; Sathya Prakash Veerichetty; linux- > ker...@vger.kernel.org; Chaitra Basappa; Sreekanth Reddy; linux- > n...@lists.infradead.org > Subject: Re: [PATCH v2 00/13] mpt3sas driver NVMe support: > > On Mon, 2017-08-07 at 19:26 +0530, Kashyap Desai wrote: > > > > > > -Original Message- > > > From: James Bottomley [mailto:james.bottom...@hansenpartnership.com > > > ] > > > Sent: Saturday, August 05, 2017 8:12 PM > > > To: Christoph Hellwig; Hannes Reinecke > > > Cc: Suganath Prabu S; martin.peter...@oracle.com; linux- > > > s...@vger.kernel.org; sathya.prak...@broadcom.com; > > > kashyap.de...@broadcom.com; linux-kernel@vger.kernel.org; > > > chaitra.basa...@broadcom.com; sreekanth.re...@broadcom.com; linux- > > > n...@lists.infradead.org > > > Subject: Re: [PATCH v2 00/13] mpt3sas driver NVMe support: > > > > > > On Sat, 2017-08-05 at 06:53 -0700, Christoph Hellwig wrote: > > > > > > > > On Wed, Aug 02, 2017 at 10:14:40AM +0200, Hannes Reinecke wrote: > > > > > > > > > > > > > > > I'm not happy with this approach. > > > > > NVMe devices should _not_ appear as SCSI devices; this will just > > > > > confuse matters _and_ will be incompatible with 'normal' NVMe > > > > > devices. > > > > > > > > > > Rather I would like to see the driver to hook into the existing > > > > > NVMe framework (which essentially means to treat the mpt3sas as > > > > > a weird NVMe-over-Fabrics HBA), and expose the NVMe devices like > > > > > any other NVMe HBA. > > > > > > > > That doesn't make any sense. 
The devices behind the mpt adapter > > > > don't look like NVMe devices at all for the hosts - there are no > > > > NVMe commands or queues involved at all, they hide behind the same > > > > somewhat leaky scsi abstraction as other devices behind the mpt > > > > controller. > > > > > > You might think about what we did for SAS: split the generic handler > > > into two pieces, libsas for driving the devices, which mpt didn't > > > need because of the fat firmware and the SAS transport class so mpt > > > could at least show the same sysfs files as everything else for SAS > > > devices. > > > > The Ventura generation of controllers adds connectivity for NVMe > > drives seamlessly, and the protocol handling is in firmware. > > Just as SCSI-to-ATA translation is done in firmware, the Ventura > > controller does SCSI-to-NVMe translation, so protocol handling > > is abstracted from the end user. > > > > This product handles the new transport protocol (NVMe) the same way as ATA, and > > the transport is abstracted from the end user. > > > > The NVMe pass-through driver code is just a big tunnel for > > user-space applications; it is just a basic framework like the SATA pass- > > through in the existing mpt3sas driver. > > I know how it works ... and I'm on record as not liking your SATL approach > because we keep tripping across bugs in the SATL that we have to fix in > the > driver. We discussed NVMe device support behind the controller with Hannes, and he suggested describing the product behavior to a wider audience for awareness. Just wanted to share the notes. > > However, at least for both SAS and SATA they appear to the system as SCSI > devices regardless of HBA, so we've largely smoothed over any problems if > you > transfer from mpt3sas to another SAS/SATA controller. > > I believe your current proposal is to have NVMe devices appear as SCSI, > which > isn't how the native NVMe driver handles them at all. 
This is going to > have to > be special cased in any tool designed to handle nvme devices and it's > going to > cause big problems if someone changes controller (or moves the > device). What's the proposal for making this as painless as possible? We have to attempt this use case and see how it behaves. I have not tried it, so I am not sure whether things are really bad or whether some tuning may be enough. I will get back to you on this. I understood the request as: we need udev rules that work for the *same* NVMe drive whether it sits behind the controller or is native. Example - if the user has the OS installed on an NVMe drive exposed through the driver as a SCSI disk, the system should still boot when the same NVMe drive is hooked up so that it is detected by the native driver (and vice versa). > > James
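One possible shape for such udev glue, assuming the SATL exposes the same WWID (ID_WWN) through the HBA path as the native nvme driver does. The rule file name, match keys, and symlink path here are illustrative, not a tested recipe:

```
# Hypothetical /etc/udev/rules.d/99-by-wwid.rules sketch: create one stable
# symlink keyed on the device WWID, so the same NVMe drive resolves to the
# same path whether it appears as sdX behind the HBA or as nvmeXnY natively.
# Assumes ID_WWN is populated identically through both paths.
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", \
    ENV{ID_WWN}=="?*", SYMLINK+="disk/by-drivewwn/$env{ID_WWN}"
```

Root filesystems referenced through such a WWID-keyed symlink (or through the existing by-id links, if their names match across both paths) would then survive moving the drive between controllers.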
RE: [PATCH v2 00/13] mpt3sas driver NVMe support:
> -Original Message- > From: James Bottomley [mailto:james.bottom...@hansenpartnership.com] > Sent: Saturday, August 05, 2017 8:12 PM > To: Christoph Hellwig; Hannes Reinecke > Cc: Suganath Prabu S; martin.peter...@oracle.com; linux- > s...@vger.kernel.org; sathya.prak...@broadcom.com; > kashyap.de...@broadcom.com; linux-kernel@vger.kernel.org; > chaitra.basa...@broadcom.com; sreekanth.re...@broadcom.com; linux- > n...@lists.infradead.org > Subject: Re: [PATCH v2 00/13] mpt3sas driver NVMe support: > > On Sat, 2017-08-05 at 06:53 -0700, Christoph Hellwig wrote: > > On Wed, Aug 02, 2017 at 10:14:40AM +0200, Hannes Reinecke wrote: > > > > > > I'm not happy with this approach. > > > NVMe devices should _not_ appear as SCSI devices; this will just > > > confuse matters _and_ will be incompatible with 'normal' NVMe > > > devices. > > > > > > Rather I would like to see the driver to hook into the existing NVMe > > > framework (which essentially means to treat the mpt3sas as a weird > > > NVMe-over-Fabrics HBA), and expose the NVMe devices like any other > > > NVMe HBA. > > > > That doesn't make any sense. The devices behind the mpt adapter don't > > look like NVMe devices at all for the hosts - there are no NVMe > > commands or queues involved at all, they hide behind the same somewhat > > leaky scsi abstraction as other devices behind the mpt controller. > > You might think about what we did for SAS: split the generic handler into > two > pieces, libsas for driving the devices, which mpt didn't need because of > the fat > firmware and the SAS transport class so mpt could at least show the same > sysfs > files as everything else for SAS devices. The Ventura generation of controllers adds connectivity for NVMe drives seamlessly, and the protocol handling is in firmware. Just as SCSI-to-ATA translation is done in firmware, the Ventura controller does SCSI-to-NVMe translation, so protocol handling is abstracted from the end user. 
This product handles the new transport protocol (NVMe) the same way as ATA, and the transport is abstracted from the end user. The NVMe pass-through driver code is just a big tunnel for user-space applications; it is just a basic framework like the SATA pass-through in the existing mpt3sas driver. > > Fortunately for NVMe it's very simple at the moment; it's just a couple of > host > files and wwid on the devices. > > James > > > > The only additional leak is that the controller now supports NVMe- > > like PRPs in addition to its existing two SGL formats. > >
RE: [RESEND][PATCH 07/10][SCSI]mpt2sas: Added Reply Descriptor Post Queue (RDPQ) Array support
> -Original Message- > From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi- > ow...@vger.kernel.org] On Behalf Of Martin K. Petersen > Sent: Thursday, April 27, 2017 3:55 AM > To: Sreekanth Reddy > Cc: Martin K. Petersen; j...@kernel.org; linux-s...@vger.kernel.org; linux- > ker...@vger.kernel.org; Christoph Hellwig > Subject: Re: [RESEND][PATCH 07/10][SCSI]mpt2sas: Added Reply Descriptor > Post Queue (RDPQ) Array support > > > Sreekanth, > > > We need to satisfy this condition on those systems where a 32-bit dma > > consistent mask is not supported and only a 64-bit dma > > consistent mask is supported. So on these systems we can't set > > pci_set_consistent_dma_mask() to DMA_BIT_MASK(32). > > Which systems are you talking about? > > It seems a bit unrealistic to require all devices to support 64-bit DMA. Martin - We have found that certain ARM64 platforms require all devices to support 64-bit DMA. I discussed this on linux-arm-kernel; the thread is below. http://marc.info/?l=linux-arm-kernel=148880763816046=2 That ARM64 platform does not support SWIOTLB, and that is the reason we need to place all DMA pools above 4GB. E.g. if I map the crash kernel above 4GB on an x86_64 platform, it honors the 32-bit DMA mask, since the arch-specific code in x86_64 supports SWIOTLB. The same settings on the ARM64 platform fail the 32-bit DMA mask. In one particular ARM64 setup, I also see that the region below 4GB is mapped to the SoC and the kernel components are mapped above the 4GB region. Can we add the below logic in the MR/IT drivers to meet this requirement? - The driver will attempt the DMA buffer allocation above 4GB and check the start and end physical addresses. If the DMA buffer crosses the "same 4GB region" boundary (I mean the high address should be constant for that region), the driver will hold that region and attempt one more allocation. If the second allocation also does not fall within a "same 4GB region", we will give up on driver load. Before we attempt the above logic, we would like to understand whether there is any other reliable way to handle this in Linux. 
Most of the time we are going to get the "same 4GB region", so we are OK with detecting this corner case and bailing out of driver load. There is no report of this issue from the field, but we wanted to protect against the failure in the future. Thanks, Kashyap > > -- > Martin K. Petersen  Oracle Linux Engineering
RE: out of range LBA using sg_raw
> -Original Message- > From: Martin K. Petersen [mailto:martin.peter...@oracle.com] > Sent: Wednesday, March 08, 2017 10:03 PM > To: Kashyap Desai > Cc: Christoph Hellwig; linux-kernel@vger.kernel.org; linux- > s...@vger.kernel.org > Subject: Re: out of range LBA using sg_raw > > >>>>> "Kashyap" == Kashyap Desai <kashyap.de...@broadcom.com> writes: > > Kashyap, > > Kashyap> I am just curious to know how badly we have to scrutinize each > Kashyap> packet before sending to Fast Path as we are in IO path and > Kashyap> recommend only important checks to be added. > > As Christoph pointed out, when the fast path is in use you assume the role of > the SCSI device. And therefore it is your responsibility to ensure that the VD's > capacity and other relevant constraints are being honored. Just like the MR > firmware and any attached disks would. Martin - Agreed on this point. I am planning to study all such possible sanity checks in the driver for VDs, rather than fixing only the one specific scenario described here. Do you think a fix in this area is good for kernel-stable as well, or should it just stay in linux-next, since it is not so severe considering real-world exposure? I am trying to understand the priority and severity of this issue. > > It is a feature that there is no sanity checking in the sg interface. > The intent is to be able to pass through commands directly to a device and > have the device act upon them. Including fail them if they don't make any > sense. Understood - sg_raw is not designed to sanity check. > > PS. I'm really no fan of the fast path. It's super messy to have the VD layout > handled in two different places. > > -- > Martin K. Petersen  Oracle Linux Engineering
RE: out of range LBA using sg_raw
> -----Original Message-----
> From: Christoph Hellwig [mailto:h...@infradead.org]
> Sent: Wednesday, March 08, 2017 9:37 PM
> To: Kashyap Desai
> Cc: Christoph Hellwig; linux-kernel@vger.kernel.org; linux-s...@vger.kernel.org
> Subject: Re: out of range LBA using sg_raw
>
> On Wed, Mar 08, 2017 at 09:29:28PM +0530, Kashyap Desai wrote:
> > Thanks Chris. It is understood to have sanity in driver, but how
> > critical such checks where SG_IO type interface send pass-through request?
> > Are you suggesting as good to have sanity or very important as there
> > may be a real-time exposure other than SG_IO interface? I am confused
> > over must or good to have check.
> > Also one more fault I can generate using below sg_raw command -
>
> SCSI _devices_ need to sanity-check any input and fail commands instead of
> crashing or causing other problems. Normal SCSI HBA drivers don't need to
> do that as they don't interpret CDBs. Megaraid (and a few other RAID
> drivers) are special in that they take on part of the device functionality
> and do interpret CDBs sometimes. In that case you'll need to do all that
> sanity checking and generate proper errors.
>
> It would be nice to have some common helpers for this, shared between
> everyone interpreting SCSI CDBs (e.g. the SCSI target code, the NVMe SCSI
> emulation and the various RAID drivers).

Thanks Chris. I will continue on this and will come back with changes. Let me
check with Broadcom internally and figure out all possible scenarios for
megaraid_sas.

Thanks, Kashyap
RE: out of range LBA using sg_raw
> -----Original Message-----
> From: Bart Van Assche [mailto:bart.vanass...@sandisk.com]
> Sent: Wednesday, March 08, 2017 9:35 PM
> To: h...@infradead.org; kashyap.de...@broadcom.com
> Cc: linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: out of range LBA using sg_raw
>
> On Wed, 2017-03-08 at 21:29 +0530, Kashyap Desai wrote:
> > Also one more fault I can generate using below sg_raw command -
> >
> > "sg_raw -r 32k /dev/sdx 28 00 01 4f ff ff 00 00 08 00"
> >
> > Provide more SCSI data length compared to the actual SG buffer. Do you
> > suggest such SG_IO interface vulnerability is good to be captured in
> > driver.
>
> That's not a vulnerability of the SG I/O interface. A SCSI device has to
> set the residual count correctly if the SCSI data length does not match the
> size of the data buffer.

Thanks Bart. I will pass this information to the Broadcom firmware
developers. It may be that the Tx/Rx (DMA) related code in MR (and also in
the Fusion IT HBA) cannot handle this because some sanity checks are not
performed.

> Bart.
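The residual-count rule Bart describes can be sketched in a few lines. The
userspace C below is illustrative only (not megaraid_sas or kernel code): a
device that sees a mismatch between the CDB's data length and the allocated
buffer transfers the smaller amount and reports the untransferred remainder of
the buffer as the residual, the value Linux drivers set via scsi_set_resid().

```c
#include <stdint.h>

/* Illustrative model: transfer min(requested, buffer) bytes and report
 * the unused part of the data buffer as the residual count, instead of
 * overrunning either side. */
struct xfer_result {
    uint32_t transferred;
    uint32_t resid;     /* reported relative to the allocated buffer */
};

static struct xfer_result scsi_data_xfer(uint32_t requested_len,
                                         uint32_t buffer_len)
{
    struct xfer_result r;

    r.transferred = requested_len < buffer_len ? requested_len : buffer_len;
    r.resid = buffer_len - r.transferred;
    return r;
}
```

For the sg_raw example above (32 KiB buffer, CDB requesting 8 blocks = 4 KiB),
a well-behaved device would transfer 4096 bytes and report a residual of 28672.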
RE: out of range LBA using sg_raw
> -----Original Message-----
> From: Christoph Hellwig [mailto:h...@infradead.org]
> Sent: Wednesday, March 08, 2017 8:41 PM
> To: Kashyap Desai
> Cc: linux-kernel@vger.kernel.org; linux-s...@vger.kernel.org
> Subject: Re: out of range LBA using sg_raw
>
> Hi Kashyap,
>
> for SG_IO passthrough requests we can't validate command validity, as the
> block layer treats them as opaque. The SCSI device implementation needs to
> handle incorrect parameters to be robust.
>
> For your fast path bypass, the megaraid driver assumes part of the SCSI
> device implementation, so it will have to check for validity.

Thanks Chris. It is understood that the driver needs sanity checks, but how
critical are such checks when an SG_IO-type interface sends a pass-through
request? Are you suggesting they are merely good to have, or very important
because there may be real exposure outside the SG_IO interface? I am unsure
whether these are must-have or good-to-have checks.

One more fault I can generate using the sg_raw command below:

"sg_raw -r 32k /dev/sdx 28 00 01 4f ff ff 00 00 08 00"

This provides a larger SCSI data length than the actual SG buffer. Do you
suggest such an SG_IO interface weakness should also be caught in the driver?

I am just curious how thoroughly we have to scrutinize each packet before
sending it to the Fast Path, since we are in the IO path, and would recommend
adding only the important checks.

Thanks, Kashyap
out of range LBA using sg_raw
Hi - I need help to understand whether the below is something we should fix
in the megaraid_sas driver, or whether it is not a realistic exposure.

I created a sliced VD of size 10GB (RAID 1) using 2 drives. Each physical
drive is 256GB, so the last LBA of the VD differs from that of the physical
disks backing it: the physical disks have a larger LBA range than the VD.

Below is the readcap detail of VD0:

# sg_readcap /dev/sdu
Read Capacity results:
   Last logical block address=20971519 (0x13fffff), Number of blocks=20971520
   Logical block length=512 bytes
Hence:
   Device size: 10737418240 bytes, 10240.0 MiB, 10.74 GB

Using the sg_raw command below, we should see "LBA out of range" sense. The
0x28 (READ(10)) CDB passes an LBA beyond the VD's last LBA of 0x13fffff:

sg_raw -r 4k /dev/sdx 28 00 01 4f ff ff 00 00 08 00

This works if the VD behind the MR controller does not support Fast Path
Write. With Fast Path Write, the driver converts the VD LBA to the underlying
physical disk LBA and sends the IO directly to the physical disk. Since the
physical disk has a large enough LBA range to respond, it will not return
"LBA out of range" sense. The megaraid_sas driver never validates the LBA
range for a VD, as it assumes the upper layers of the SCSI stack have
validated it. Other sg tools such as sg_dd and sg_write, and dd itself, check
the LBA range, so the driver never receives an out-of-range LBA from them.

What is the suggestion? Shall I add a check in the megaraid_sas driver, or is
this not a valid scenario, since "sg_raw" can send any type of command and
the driver should not need multiple sanity checks for it?

Thanks, Kashyap
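The out-of-range condition described above is straightforward to test
generically. The sketch below (plain userspace C, not megaraid_sas code)
decodes the big-endian LBA and transfer length from a READ(10)/WRITE(10) CDB
and checks them against a device's last LBA; the helper names are made up for
illustration. Fed the CDB from the sg_raw command and the 10GB VD's capacity
from the sg_readcap output, it flags the access as out of range.

```c
#include <stdint.h>
#include <stdbool.h>

/* Decode the 32-bit big-endian LBA at bytes 2..5 of a READ(10) CDB. */
static uint64_t cdb10_lba(const uint8_t *cdb)
{
    return ((uint64_t)cdb[2] << 24) | ((uint64_t)cdb[3] << 16) |
           ((uint64_t)cdb[4] << 8)  |  (uint64_t)cdb[5];
}

/* Decode the 16-bit big-endian transfer length (in blocks) at bytes 7..8. */
static uint32_t cdb10_num_blocks(const uint8_t *cdb)
{
    return ((uint32_t)cdb[7] << 8) | (uint32_t)cdb[8];
}

/* True if the access stays within a device whose last LBA is last_lba. */
static bool lba_in_range(const uint8_t *cdb, uint64_t last_lba)
{
    return cdb10_lba(cdb) + cdb10_num_blocks(cdb) <= last_lba + 1;
}
```

The CDB `28 00 01 4f ff ff 00 00 08 00` decodes to LBA 0x014fffff (22020095)
with a transfer length of 8 blocks, which is past the VD's last LBA 20971519
but well within a 256GB physical disk - exactly the gap the Fast Path exposes.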
RE: Device or HBA level QD throttling creates randomness in sequetial workload
> -----Original Message-----
> From: Jens Axboe [mailto:ax...@kernel.dk]
> Sent: Monday, January 30, 2017 10:03 PM
> To: Bart Van Assche; osan...@osandov.com; kashyap.de...@broadcom.com
> Cc: linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org;
> h...@infradead.org; linux-bl...@vger.kernel.org; paolo.vale...@linaro.org
> Subject: Re: Device or HBA level QD throttling creates randomness in
> sequetial workload
>
> On 01/30/2017 09:30 AM, Bart Van Assche wrote:
> > On Mon, 2017-01-30 at 19:22 +0530, Kashyap Desai wrote:
> >> -	if (atomic_inc_return(&instance->fw_outstanding) >
> >> -			instance->host->can_queue) {
> >> -		atomic_dec(&instance->fw_outstanding);
> >> -		return SCSI_MLQUEUE_HOST_BUSY;
> >> -	}
> >> +	if (atomic_inc_return(&instance->fw_outstanding) > safe_can_queue) {
> >> +		is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
> >> +		/* For rotational device wait for sometime to get fusion command
> >> +		 * from pool. This is just to reduce proactive re-queue at mid
> >> +		 * layer which is not sending sorted IO in SCSI.MQ mode.
> >> +		 */
> >> +		if (!is_nonrot)
> >> +			udelay(100);
> >> +	}
> >
> > The SCSI core does not allow to sleep inside the queuecommand()
> > callback function.
>
> udelay() is a busy loop, so it's not sleeping. That said, it's obviously
> NOT a great idea. We want to fix the reordering due to requeues, not
> introduce random busy delays to work around it.

Thanks for the feedback. I realize that udelay() would be very odd in the
queuecommand callback; I will keep this in mind. The preferred solution is
the blk-mq scheduler patches.

> --
> Jens Axboe
RE: Device or HBA level QD throttling creates randomness in sequetial workload
Hi Jens/Omar,

I used the git.kernel.dk/linux-block branch blk-mq-sched (commit
0efe27068ecf37ece2728a99b863763286049ab5) and confirm that the issue reported
in this thread is resolved. Both MQ and SQ mode now result in a sequential IO
pattern while IO is getting re-queued in the block layer.

To get similar performance without the blk-mq-sched feature, is it reasonable
to pause IO for a few usec in the LLD? That is, I want to avoid the driver
asking the SML/block layer to re-queue the IO (if it is sequential on
rotational media).

Explaining w.r.t. the megaraid_sas driver: the driver exposes can_queue, but
it internally consumes commands for RAID 1 fast path. In the worst case,
can_queue/2 will consume all firmware resources and the driver will re-queue
further IOs to the SML as below:

	if (atomic_inc_return(&instance->fw_outstanding) >
			instance->host->can_queue) {
		atomic_dec(&instance->fw_outstanding);
		return SCSI_MLQUEUE_HOST_BUSY;
	}

I want to avoid the above SCSI_MLQUEUE_HOST_BUSY. I need your suggestion on
the changes below:

diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c
b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index 9a9c84f..a683eb0 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -54,6 +54,7 @@
 #include
 #include
 #include
+#include
 #include "megaraid_sas_fusion.h"
 #include "megaraid_sas.h"
@@ -2572,7 +2573,15 @@ void megasas_prepare_secondRaid1_IO(struct megasas_instance *instance,
 	struct megasas_cmd_fusion *cmd, *r1_cmd = NULL;
 	union MEGASAS_REQUEST_DESCRIPTOR_UNION *req_desc;
 	u32 index;
-	struct fusion_context *fusion;
+	bool is_nonrot;
+	u32 safe_can_queue;
+	u32 num_cpus;
+	struct fusion_context *fusion;
+
+	fusion = instance->ctrl_context;
+
+	num_cpus = num_online_cpus();
+	safe_can_queue = instance->cur_can_queue - num_cpus;
 
 	fusion = instance->ctrl_context;
@@ -2584,11 +2593,15 @@
 		return SCSI_MLQUEUE_DEVICE_BUSY;
 	}
 
-	if (atomic_inc_return(&instance->fw_outstanding) >
-			instance->host->can_queue) {
-		atomic_dec(&instance->fw_outstanding);
-		return SCSI_MLQUEUE_HOST_BUSY;
-	}
+	if (atomic_inc_return(&instance->fw_outstanding) > safe_can_queue) {
+		is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
+		/* For rotational device wait for sometime to get fusion command
+		 * from pool. This is just to reduce proactive re-queue at mid
+		 * layer which is not sending sorted IO in SCSI.MQ mode.
+		 */
+		if (!is_nonrot)
+			udelay(100);
+	}
 
 	cmd = megasas_get_cmd_fusion(instance, scmd->request->tag);

Kashyap

> -----Original Message-----
> From: Kashyap Desai [mailto:kashyap.de...@broadcom.com]
> Sent: Tuesday, November 01, 2016 11:11 AM
> To: 'Jens Axboe'; 'Omar Sandoval'
> Cc: 'linux-s...@vger.kernel.org'; 'linux-kernel@vger.kernel.org';
> 'linux-bl...@vger.kernel.org'; 'Christoph Hellwig'; 'paolo.vale...@linaro.org'
> Subject: RE: Device or HBA level QD throttling creates randomness in
> sequetial workload
>
> Jens - replied inline.
>
> Omar - I tested your WIP repo and figured out the system hangs only if I
> pass "scsi_mod.use_blk_mq=Y". Without this, your WIP branch works fine,
> but I am looking for scsi_mod.use_blk_mq=Y.
>
> Also below is a snippet of blktrace. In case of higher per-device QD, I
> see Requeue requests in blktrace.
>
> 65,128 10 6268 2.432404509 18594 P N [fio]
> 65,128 10 6269 2.432405013 18594 U N [fio] 1
> 65,128 10 6270 2.432405143 18594 I WS 148800 + 8 [fio]
> 65,128 10 6271 2.432405740 18594 R WS 148800 + 8 [0]
> 65,128 10 6272 2.432409794 18594 Q WS 148808 + 8 [fio]
> 65,128 10 6273 2.432410234 18594 G WS 148808 + 8 [fio]
> 65,128 10 6274 2.432410424 18594 S WS 148808 + 8 [fio]
> 65,128 23 3626 2.432432595 16232 D WS 148800 + 8 [kworker/23:1H]
> 65,128 22 3279 2.432973482     0 C WS 147432 + 8 [0]
> 65,128  7 6126 2.433032637 18594 P N [fio]
> 65,128  7 6127 2.433033204 18594 U N [fio] 1
> 65,128  7 6128 2.433033346 18594 I WS 148808 + 8 [fio]
> 65,128  7 6129 2.433033871 18594 D WS 148808 + 8 [fio]
> 65,128  7 6130 2.433034559 18594 R WS 148808 + 8 [0]
> 65,128  7 6131 2.433039796 18594 Q WS 148816 + 8 [fio]
> 65,128  7 6132 2.433040206 18594 G WS 148816 + 8 [fio]
> 65,128  7 6133 2.433040351 18594 S WS 148816 + 8 [fio]
> 65,128  9 6392 2.433133729     0 C WS 147240 + 8 [0]
> 65,128  9 6393 2.433138166   905 D WS 148808 + 8 [kworker/9:1H]
> 65,128  7 6134 2.433167450 18594 P N [fio]
> 65,128  7 6135
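The throttling idea in the patch above boils down to a small decision: reserve
headroom of one command per online CPU below can_queue, and when a rotational
device is over that safe limit, busy-wait briefly (udelay(100)) before issuing
instead of bouncing the command back as SCSI_MLQUEUE_HOST_BUSY. The userspace
C below is only a model of that decision, not driver code; the function name
and parameters are invented for illustration.

```c
#include <stdbool.h>

/* Model of the throttle in the patch above (illustrative only).
 * Returns the busy-wait in microseconds before issuing the command:
 * 0 when under the safe limit or for non-rotational devices, 100 for a
 * rotational device over the limit, mirroring the udelay(100) call. */
static int throttle_delay_us(int fw_outstanding, int can_queue,
                             int num_online_cpus, bool is_nonrot)
{
    /* safe_can_queue = cur_can_queue - num_online_cpus() in the patch */
    int safe_can_queue = can_queue - num_online_cpus;

    if (fw_outstanding > safe_can_queue && !is_nonrot)
        return 100;
    return 0;
}
```

The per-CPU headroom means a burst from every CPU at once still leaves fusion
commands in the pool; the delay only smooths the rotational case, where
re-queues destroy the sequential ordering the thread is about.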
RE: [PATCH][V2] scsi: megaraid-sas: fix spelling mistake of "outstanding"
> -----Original Message-----
> From: Bart Van Assche [mailto:bart.vanass...@sandisk.com]
> Sent: Wednesday, November 30, 2016 12:50 AM
> To: Colin King; Kashyap Desai; Sumit Saxena; Shivasharan S; James E. J.
> Bottomley; Martin K. Petersen; megaraidlinux@broadcom.com;
> linux-s...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: [PATCH][V2] scsi: megaraid-sas: fix spelling mistake of
> "outstanding"
>
> On 11/29/2016 11:13 AM, Colin King wrote:
> > Trivial fix to spelling mistake "oustanding" to "outstanding" in
> > dev_info and scmd_printk messages. Also join wrapped literal string in
> > the scmd_printk.
>
> Reviewed-by: Bart Van Assche <bart.vanass...@sandisk.com>

Please hold this commit, as we have a patch set in progress for the MegaRAID
driver that would conflict with it. We will re-submit this patch with the
appropriate Signed-off-by tags.

Thanks, Kashyap
RE: [GIT PULL] SCSI fixes for 4.9-rc3
> -----Original Message-----
> From: linux-scsi-ow...@vger.kernel.org
> [mailto:linux-scsi-ow...@vger.kernel.org] On Behalf Of Gabriel C
> Sent: Friday, November 11, 2016 9:40 AM
> To: James Bottomley; Andrew Morton; Linus Torvalds
> Cc: linux-scsi; linux-kernel; sta...@vger.kernel.org
> Subject: Re: [GIT PULL] SCSI fixes for 4.9-rc3
>
> On 11.11.2016 04:30, Gabriel C wrote:
> > On 05.11.2016 14:29, James Bottomley wrote:
> >
> > ...
> >
> >> Kashyap Desai (1):
> >>       scsi: megaraid_sas: Fix data integrity failure for JBOD
> >>       (passthrough) devices
> >>
> >> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> >> b/drivers/scsi/megaraid/megaraid_sas_base.c
> >> index 9ff57de..d8b1fbd 100644
> >> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> >> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> >> @@ -1700,16 +1700,13 @@ megasas_queue_command(struct Scsi_Host *shost, struct scsi_cmnd *scmd)
> >>  		goto out_done;
> >>  	}
> >>
> >> -	switch (scmd->cmnd[0]) {
> >> -	case SYNCHRONIZE_CACHE:
> >> -		/*
> >> -		 * FW takes care of flush cache on its own
> >> -		 * No need to send it down
> >> -		 */
> >> +	/*
> >> +	 * FW takes care of flush cache on its own for Virtual Disk.
> >> +	 * No need to send it down for VD. For JBOD send SYNCHRONIZE_CACHE to FW.
> >> +	 */
> >> +	if ((scmd->cmnd[0] == SYNCHRONIZE_CACHE) && MEGASAS_IS_LOGICAL(scmd)) {
> >>  		scmd->result = DID_OK << 16;
> >>  		goto out_done;
> >> -	default:
> >> -		break;
> >>  	}
> >>
> >>  	return instance->instancet->build_and_issue_cmd(instance, scmd);
> >
> > This patch breaks my box - I'm not able to boot it anymore.
> > It seems with this patch I have /dev/sda[a-z] to /dev/sdz[a-z] ?!?
> >
> > I'm not sure how to get a log since dracut times out and, after a very
> > long time of probing 'ghost devices', I'm dropped into an emergency
> > shell; journalctl doesn't work either.
> >
> > After reverting this one I can boot normally.
> >
> > Box is a FUJITSU PRIMERGY TX200 S5.

Please check the current commit. The commit below has the complete fix:
http://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/commit/?id=5e5ec1759dd663a1d5a2f10930224dd009e500e8

> > This is from a working kernel:
> >
> > [ 5.119371] megaraid_sas 0000:01:00.0: FW now in Ready state
> > [ 5.119418] megaraid_sas 0000:01:00.0: firmware supports msix        : (0)
> > [ 5.119420] megaraid_sas 0000:01:00.0: current msix/online cpus      : (1/16)
> > [ 5.119422] megaraid_sas 0000:01:00.0: RDPQ mode                     : (disabled)
> > [ 5.123100] ehci-pci 0000:00:1a.7: cache line size of 32 is not supported
> > [ 5.123113] ehci-pci 0000:00:1a.7: irq 18, io mem 0xb0020000
> >
> > ...
> >
> > [ 5.208063] megaraid_sas 0000:01:00.0: controller type               : MR(256MB)
> > [ 5.208065] megaraid_sas 0000:01:00.0: Online Controller Reset(OCR)  : Enabled
> > [ 5.208067] megaraid_sas 0000:01:00.0: Secure JBOD support           : No
> > [ 5.208070] megaraid_sas 0000:01:00.0: megasas_init_mfi: fw_support_ieee=0
> > [ 5.208073] megaraid_sas 0000:01:00.0: INIT adapter done
> > [ 5.208075] megaraid_sas 0000:01:00.0: Jbod map is not supported megasas_setup_jbod_map 4967
> > [ 5.230163] megaraid_sas 0000:01:00.0: MR_DCMD_PD_LIST_QUERY failed/not supported by firmware
> > [ 5.252080] megaraid_sas 0000:01:00.0: DCMD not supported by firmware - megasas_ld_list_query 4369
> > [ 5.274086] megaraid_sas 0000:01:00.0: pci id: (0x1000)/(0x0060)/(0x1734)/(0x10f9)
> > [ 5.274089] megaraid_sas 0000:01:00.0: unevenspan support: no
> > [ 5.274090] megaraid_sas 0000:01:00.0: firmware crash dump : no
> > [ 5.274092] megaraid_sas 0000:01:00.0: jbod sync map : no
> > [ 5.274094] scsi host0: Avago SAS based MegaRAID driver
> > [ 5.280022] scsi 0:0:6:0: Direct-Access ATA WDC WD5002ABYS-5 3B06 PQ: 0 ANSI: 5
> > [ 5.282153] scsi 0:0:7:0: Direct-Access ATA WDC WD5002ABYS-5 3B06 PQ: 0 ANSI: 5
> > [ 5.285180] scsi 0:0:10:0: Direct-Access ATA ST500NM0011 FTM6 PQ: 0 ANSI: 5
> > [ 5.369885] scsi 0:2:0:0: Direct-Access LSI MegaRAID SAS RMB 1.40 PQ: 0 ANSI: 5
> >
> > ...
> >
> > Please let me know if you need more infos and/or want me to test
> > patches.
> >
> > I managed to get some parts of the broken dmesg. There it is:
> > http://ftp.frugalware.org/pub/other/people/crazy/kernel/broken-dmesg
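The quoted patch narrows the driver's inline completion of SYNCHRONIZE_CACHE
to logical (virtual disk) targets, since MR firmware flushes VD caches itself,
while JBOD passthrough devices now get the flush forwarded. The userspace C
below only models that dispatch decision; `is_virtual_disk` is a hypothetical
stand-in for the driver's MEGASAS_IS_LOGICAL() test, and the 0x35 opcode is
the standard SCSI SYNCHRONIZE CACHE(10).

```c
#include <stdbool.h>
#include <stdint.h>

#define SYNCHRONIZE_CACHE 0x35  /* SCSI SYNCHRONIZE CACHE(10) opcode */

/* Model of the dispatch decision in the patch above (illustrative only).
 * Returns true if the command must go down to the firmware, false if the
 * driver may complete it inline with DID_OK (FW already flushes VD caches). */
static bool must_send_to_fw(uint8_t opcode, bool is_virtual_disk)
{
    if (opcode == SYNCHRONIZE_CACHE && is_virtual_disk)
        return false;   /* VD flush: complete inline, as in the patch */
    return true;        /* JBOD flushes and everything else go to FW */
}
```

The regression Gabriel hit came from the VD/JBOD classification itself rather
than this decision shape, which is why the follow-up commit linked above
reworked the logical-device check.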
RE: [GIT PULL] SCSI fixes for 4.9-rc3
> -Original Message- > From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi- > ow...@vger.kernel.org] On Behalf Of Gabriel C > Sent: Friday, November 11, 2016 9:40 AM > To: James Bottomley; Andrew Morton; Linus Torvalds > Cc: linux-scsi; linux-kernel; sta...@vger.kernel.org > Subject: Re: [GIT PULL] SCSI fixes for 4.9-rc3 > > > > On 11.11.2016 04:30, Gabriel C wrote: > > > > On 05.11.2016 14:29, James Bottomley wrote: > > > > > > ... > > > >> Kashyap Desai (1): > >> scsi: megaraid_sas: Fix data integrity failure for JBOD > >> (passthrough) devices > >> > >> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c > >> b/drivers/scsi/megaraid/megaraid_sas_base.c > >> index 9ff57de..d8b1fbd 100644 > >> --- a/drivers/scsi/megaraid/megaraid_sas_base.c > >> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c > >> @@ -1700,16 +1700,13 @@ megasas_queue_command(struct Scsi_Host > *shost, struct scsi_cmnd *scmd) > >>goto out_done; > >>} > >> > >> - switch (scmd->cmnd[0]) { > >> - case SYNCHRONIZE_CACHE: > >> - /* > >> - * FW takes care of flush cache on its own > >> - * No need to send it down > >> - */ > >> + /* > >> + * FW takes care of flush cache on its own for Virtual Disk. > >> + * No need to send it down for VD. For JBOD send > SYNCHRONIZE_CACHE to FW. > >> + */ > >> + if ((scmd->cmnd[0] == SYNCHRONIZE_CACHE) && > >> +MEGASAS_IS_LOGICAL(scmd)) { > >>scmd->result = DID_OK << 16; > >>goto out_done; > >> - default: > >> - break; > >>} > >> > >>return instance->instancet->build_and_issue_cmd(instance, scmd); > > > > This patch breaks my box.. I'm not able to boot it anymore. > > It seems with this patch I have /dev/sda[a-z] to /dev/sdz[a-z] ?!? > > > > I'm not sure how to get an log since dracut times out and I'm dropped > > , after a very long time of probing 'ghost devices', in a emercency > > shell, > journalctl doesn't work also.. > > > > After reverting this one I can boot normal. > > > > Box is a FUJITSU PRIMERGY TX200 S5.. Please check now commit. 
Below commit has complete fix. http://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/commit/?id=5e5ec1759dd663a1d5a2f10930224dd009e500e8 > > > > This is from an working kernel.. > > > > [5.119371] megaraid_sas :01:00.0: FW now in Ready state > > [5.119418] megaraid_sas :01:00.0: firmware supports msix > > : (0) > > [5.119420] megaraid_sas :01:00.0: current msix/online cpus > > : (1/16) > > [5.119422] megaraid_sas :01:00.0: RDPQ mode : (disabled) > > [5.123100] ehci-pci :00:1a.7: cache line size of 32 is not > > supported > > [5.123113] ehci-pci :00:1a.7: irq 18, io mem 0xb002 > > > > ... > > > > [5.208063] megaraid_sas :01:00.0: controller type : > > MR(256MB) > > [5.208065] megaraid_sas :01:00.0: Online Controller Reset(OCR) > > : > Enabled > > [5.208067] megaraid_sas :01:00.0: Secure JBOD support : No > > [5.208070] megaraid_sas :01:00.0: megasas_init_mfi: > fw_support_ieee=0 > > [5.208073] megaraid_sas :01:00.0: INIT adapter done > > [5.208075] megaraid_sas :01:00.0: Jbod map is not supported > megasas_setup_jbod_map 4967 > > [5.230163] megaraid_sas :01:00.0: MR_DCMD_PD_LIST_QUERY > failed/not supported by firmware > > [5.252080] megaraid_sas :01:00.0: DCMD not supported by > > firmware - > megasas_ld_list_query 4369 > > [5.274086] megaraid_sas :01:00.0: pci id: > (0x1000)/(0x0060)/(0x1734)/(0x10f9) > > [5.274089] megaraid_sas :01:00.0: unevenspan support: no > > [5.274090] megaraid_sas :01:00.0: firmware crash dump : no > > [5.274092] megaraid_sas :01:00.0: jbod sync map : no > > [5.274094] scsi host0: Avago SAS based MegaRAID driver > > [5.280022] scsi 0:0:6:0: Direct-Access ATA WDC WD5002ABYS-5 > > 3B06 > PQ: 0 ANSI: 5 > > [5.282153] scsi 0:0:7:0: Direct-Access ATA WDC WD5002ABYS-5 > > 3B06 > PQ: 0 ANSI: 5 > > [5.285180] scsi 0:0:10:0: Direct-Access ATA ST500NM0011 > > FTM6 PQ: > 0 ANSI: 5 > > [5.369885] scsi 0:2:0:0: Direct-Access LSI MegaRAID SAS RMB > > 1.40 PQ: > 0 ANSI: 5 > > > > .. 
> > Please let me know if you need more info and/or want me to test patches.
> >
> > I managed to get some parts of the broken dmesg. Here it is:
> > http://ftp.frugalware.org/pub/other/people/crazy/kernel/broken-dmesg

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Device or HBA level QD throttling creates randomness in sequential workload
Jens - replied inline.

Omar - I tested your WIP repo and figured out that the system hangs only if I pass "scsi_mod.use_blk_mq=Y". Without this, your WIP branch works fine, but I am looking for scsi_mod.use_blk_mq=Y. Below is also a snippet of blktrace. With a higher per-device QD, I see requeue (R) events in blktrace.

65,128 10 6268  2.432404509 18594  P   N [fio]
65,128 10 6269  2.432405013 18594  U   N [fio] 1
65,128 10 6270  2.432405143 18594  I  WS 148800 + 8 [fio]
65,128 10 6271  2.432405740 18594  R  WS 148800 + 8 [0]
65,128 10 6272  2.432409794 18594  Q  WS 148808 + 8 [fio]
65,128 10 6273  2.432410234 18594  G  WS 148808 + 8 [fio]
65,128 10 6274  2.432410424 18594  S  WS 148808 + 8 [fio]
65,128 23 3626  2.432432595 16232  D  WS 148800 + 8 [kworker/23:1H]
65,128 22 3279  2.432973482     0  C  WS 147432 + 8 [0]
65,128  7 6126  2.433032637 18594  P   N [fio]
65,128  7 6127  2.433033204 18594  U   N [fio] 1
65,128  7 6128  2.433033346 18594  I  WS 148808 + 8 [fio]
65,128  7 6129  2.433033871 18594  D  WS 148808 + 8 [fio]
65,128  7 6130  2.433034559 18594  R  WS 148808 + 8 [0]
65,128  7 6131  2.433039796 18594  Q  WS 148816 + 8 [fio]
65,128  7 6132  2.433040206 18594  G  WS 148816 + 8 [fio]
65,128  7 6133  2.433040351 18594  S  WS 148816 + 8 [fio]
65,128  9 6392  2.433133729     0  C  WS 147240 + 8 [0]
65,128  9 6393  2.433138166   905  D  WS 148808 + 8 [kworker/9:1H]
65,128  7 6134  2.433167450 18594  P   N [fio]
65,128  7 6135  2.433167911 18594  U   N [fio] 1
65,128  7 6136  2.433168074 18594  I  WS 148816 + 8 [fio]
65,128  7 6137  2.433168492 18594  D  WS 148816 + 8 [fio]
65,128  7 6138  2.433174016 18594  Q  WS 148824 + 8 [fio]
65,128  7 6139  2.433174282 18594  G  WS 148824 + 8 [fio]
65,128  7 6140  2.433174613 18594  S  WS 148824 + 8 [fio]

CPU0 (sdy):
 Reads Queued:        0,     0KiB   Writes Queued:       79,   316KiB
 Read Dispatches:     0,     0KiB   Write Dispatches:    67, 18,446,744,073PiB
 Reads Requeued:      0              Writes Requeued:    86
 Reads Completed:     0,     0KiB   Writes Completed:    98,   392KiB
 Read Merges:         0,     0KiB   Write Merges:         0,     0KiB
 Read depth:          0              Write depth:         5
 IO unplugs:         79              Timer unplugs:       0

` Kashyap

> -----Original Message-----
> From: Jens Axboe [mailto:ax...@kernel.dk]
> Sent: Monday, October 31, 2016 10:54 PM
> To: Kashyap Desai; Omar Sandoval
> Cc: linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-bl...@vger.kernel.org; Christoph Hellwig; paolo.vale...@linaro.org
> Subject: Re: Device or HBA level QD throttling creates randomness in sequential workload
>
> Hi,
>
> One guess would be that this isn't around a requeue condition, but rather the fact that we don't really guarantee any sort of hard FIFO behavior between the software queues. Can you try this test patch to see if it changes the behavior for you? Warning: untested...

Jens - I tested the patch, but I still see a random IO pattern for an expected sequential run. I am intentionally running the requeue case and seeing the issue at the time of requeue. If there is no requeue, I see no issue at the LLD.

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f3d27a6dee09..5404ca9c71b2 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -772,6 +772,14 @@ static inline unsigned int queued_to_index(unsigned int queued)
>  	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
>  }
>
> +static int rq_pos_cmp(void *priv, struct list_head *a, struct list_head *b)
> +{
> +	struct request *rqa = container_of(a, struct request, queuelist);
> +	struct request *rqb = container_of(b, struct request, queuelist);
> +
> +	return blk_rq_pos(rqa) < blk_rq_pos(rqb);
> +}
> +
>  /*
>   * Run this hardware queue, pulling any software queues mapped to it in.
>   * Note that this function currently has various problems around ordering
> @@ -812,6 +820,14 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
>  	}
>
>  	/*
> +	 * If the device is rotational, sort the list sanely to avoid
> +	 * unecessary seeks. The software queues are roughly FIFO, but
> +	 * only roughly, there are no hard guarantees.
> +	 */
> +	if (!blk_queue_nonrot(q))
> +		list_sort(NULL, &rq_list, rq_pos_cmp);
> +
> +	/*
>  	 * Start off with dptr being NULL, so we start the first request
>  	 * immediately, even if we have more pending.
>  	 */
>
> --
> Jens Axboe
RE: Device or HBA level QD throttling creates randomness in sequential workload
> -----Original Message-----
> From: Omar Sandoval [mailto:osan...@osandov.com]
> Sent: Monday, October 24, 2016 9:11 PM
> To: Kashyap Desai
> Cc: linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-bl...@vger.kernel.org; ax...@kernel.dk; Christoph Hellwig; paolo.vale...@linaro.org
> Subject: Re: Device or HBA level QD throttling creates randomness in sequential workload
>
> On Mon, Oct 24, 2016 at 06:35:01PM +0530, Kashyap Desai wrote:
> > > On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > > > Hi -
> > > >
> > > > I found below conversation and it is on the same line as I wanted some input from mailing list.
> > > >
> > > > http://marc.info/?l=linux-kernel&m=147569860526197&w=2
> > > >
> > > > I can do testing on any WIP item as Omar mentioned in above discussion.
> > > > https://github.com/osandov/linux/tree/blk-mq-iosched
> >
> > I tried to build a kernel using this repo, but it looks like it fails to boot due to some changes in the block layer.
>
> Did you build the most up-to-date version of that branch? I've been force pushing to it, so the commit id that you built would be useful.
> What boot failure are you seeing?

Below is the latest commit on the repo:

commit b077a9a5149f17ccdaa86bc6346fa256e3c1feda
Author: Omar Sandoval <osan...@fb.com>
Date:   Tue Sep 20 11:20:03 2016 -0700

    [WIP] blk-mq: limit bio queue depth

I have the latest repo from 4.9/scsi-next maintained by Martin, which boots fine. The only delta is that "CONFIG_SBITMAP" is enabled in the WIP blk-mq-iosched branch. I could not see any meaningful data on the boot hang, so I am going to try one more time tomorrow.

> > > Are you using blk-mq for this disk? If not, then the work there won't affect you.
> >
> > YES. I am using blk-mq for my test. I also confirm that if use_blk_mq is disabled, the sequential workload issue is not seen and scheduling works well.
>
> Ah, okay, perfect. Can you send the fio job file you're using?
> Hard to tell exactly what's going on without the details. A sequential workload with just one submitter is about as easy as it gets, so this _should_ be behaving nicely.

; setup numa policy for each thread
; 'numactl --show' to determine the maximum numa nodes
[global]
ioengine=libaio
buffered=0
rw=write
bssplit=4K/100
iodepth=256
numjobs=1
direct=1
runtime=60s
allow_mounted_write=0

[job1]
filename=/dev/sdd
..
[job24]
filename=/dev/sdaa

When I tune /sys/module/scsi_mod/parameters/use_blk_mq = 1, below is the io scheduler detail (it is in blk-mq mode):

/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/host10/target10:2:13/10:2:13:0/block/sdq/queue/scheduler:none

When I set /sys/module/scsi_mod/parameters/use_blk_mq = 0, the io scheduler picked by the SML is:

/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/host10/target10:2:13/10:2:13:0/block/sdq/queue/scheduler:noop deadline [cfq]

I see blk-mq performance is very low for the sequential write workload, and I confirm that blk-mq converts the sequential workload into a random stream due to the io-scheduler change in blk-mq vs the legacy block layer.

> > > > Is there any workaround/alternative in latest upstream kernel, if user
> > > > wants to see limited penalty for Sequential Work load on HDD ?
> > > >
> > > > ` Kashyap
>
> P.S., your emails are being marked as spam by Gmail. Actually, Gmail seems to mark just about everything I get from Broadcom as spam due to failed DMARC.
>
> --
> Omar
RE: Device or HBA level QD throttling creates randomness in sequential workload
> > On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > Hi -
> >
> > I found below conversation and it is on the same line as I wanted some input from mailing list.
> >
> > http://marc.info/?l=linux-kernel&m=147569860526197&w=2
> >
> > I can do testing on any WIP item as Omar mentioned in above discussion.
> > https://github.com/osandov/linux/tree/blk-mq-iosched

I tried to build a kernel using this repo, but it looks like it fails to boot due to some changes in the block layer.

> Are you using blk-mq for this disk? If not, then the work there won't affect you.

YES. I am using blk-mq for my test. I also confirm that if use_blk_mq is disabled, the sequential workload issue is not seen and scheduling works well.

> > Is there any workaround/alternative in latest upstream kernel, if user
> > wants to see limited penalty for Sequential Work load on HDD ?
> >
> > ` Kashyap
RE: Device or HBA level QD throttling creates randomness in sequential workload
Hi -

I found below conversation and it is on the same line as I wanted some input from mailing list.

http://marc.info/?l=linux-kernel&m=147569860526197&w=2

I can do testing on any WIP item as Omar mentioned in above discussion.
https://github.com/osandov/linux/tree/blk-mq-iosched

Is there any workaround/alternative in the latest upstream kernel, if a user wants to see limited penalty for a sequential workload on HDD?

` Kashyap

> -----Original Message-----
> From: Kashyap Desai [mailto:kashyap.de...@broadcom.com]
> Sent: Thursday, October 20, 2016 3:39 PM
> To: linux-s...@vger.kernel.org
> Subject: Device or HBA level QD throttling creates randomness in sequential workload
>
> [ Apologies if you find more than one instance of my email.
> The web-based email client has some issue, so now trying git send-email. ]
>
> Hi,
>
> I am doing some performance tuning in the MR driver to understand how sdev queue depth and HBA queue depth play a role in IO submission from the layer above.
> I have 24 JBODs connected to a MR 12Gb/s controller, and I see performance for a 4K sequential workload as below.
>
> HBA QD for the MR controller is 4065 and per-device QD is set to 32:
>
>	queue depth 256 reports 300K IOPS
>	queue depth 128 reports 330K IOPS
>	queue depth 64 reports 360K IOPS
>	queue depth 32 reports 510K IOPS
>
> In the MR driver I added a debug print and confirmed that more IO comes to the driver as random IO whenever I have a queue depth of more than 32.
>
> I have debugged using the scsi logging level and blktrace as well. Below is a snippet of logs using the scsi logging level. In summary, if the SML does flow control of IO due to device QD or HBA QD, IO coming to the LLD has a more random pattern.
>
> I see IO coming to the driver is not sequential.
>
> [79546.912041] sd 18:2:21:0: [sdy] tag#854 CDB: Write(10) 2a 00 00 03 c0 3b 00 00 01 00
> [79546.912049] sd 18:2:21:0: [sdy] tag#855 CDB: Write(10) 2a 00 00 03 c0 3c 00 00 01 00
> [79546.912053] sd 18:2:21:0: [sdy] tag#886 CDB: Write(10) 2a 00 00 03 c0 5b 00 00 01 00
>
> After LBA "00 03 c0 3c" the next command is with LBA "00 03 c0 5b".
> Two sequences are overlapped due to sdev QD throttling.
>
> [79546.912056] sd 18:2:21:0: [sdy] tag#887 CDB: Write(10) 2a 00 00 03 c0 5c 00 00 01 00
> [79546.912250] sd 18:2:21:0: [sdy] tag#856 CDB: Write(10) 2a 00 00 03 c0 3d 00 00 01 00
> [79546.912257] sd 18:2:21:0: [sdy] tag#888 CDB: Write(10) 2a 00 00 03 c0 5d 00 00 01 00
> [79546.912259] sd 18:2:21:0: [sdy] tag#857 CDB: Write(10) 2a 00 00 03 c0 3e 00 00 01 00
> [79546.912268] sd 18:2:21:0: [sdy] tag#858 CDB: Write(10) 2a 00 00 03 c0 3f 00 00 01 00
>
> If scsi_request_fn() breaks due to unavailability of the device queue (due to the below check), will there be any side effect as I observe?
>
>	if (!scsi_dev_queue_ready(q, sdev))
>		break;
>
> If I reduce the HBA QD and make sure IO from the layer above is throttled due to the HBA QD, there is the same impact.
> The MR driver uses a host-wide shared tag map.
>
> Can someone help me figure out whether this can be made tunable in the LLD by providing additional settings, or whether it is expected behavior? The problem I am facing is that I am not able to figure out the optimal device queue depth for different configurations and workloads.
>
> Thanks, Kashyap
RE: Observing Softlockups while running heavy IOs
> -----Original Message-----
> From: Elliott, Robert (Persistent Memory) [mailto:elli...@hpe.com]
> Sent: Saturday, August 20, 2016 2:58 AM
> To: Sreekanth Reddy
> Cc: linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org; irqbala...@lists.infradead.org; Kashyap Desai; Sathya Prakash Veerichetty; Chaitra Basappa; Suganath Prabu Subramani
> Subject: RE: Observing Softlockup's while running heavy IOs
>
> > -----Original Message-----
> > From: Sreekanth Reddy [mailto:sreekanth.re...@broadcom.com]
> > Sent: Friday, August 19, 2016 6:45 AM
> > To: Elliott, Robert (Persistent Memory) <elli...@hpe.com>
> > Subject: Re: Observing Softlockup's while running heavy IOs
> ...
> > Yes, I am also observing that all the interrupts are routed to one CPU.
> > But I still observe softlockups (sometimes hardlockups) even when I
> > set rq_affinity to 2.

How about the below scenario?

For simplicity, take an HBA with a single MSI-x vector. (Whenever the HBA supports fewer MSI-x vectors than there are logical CPUs on the system, we can see these issues frequently.) Assume we have 32 logical CPUs (4 sockets, each with 8 logical CPUs). CPU-0 is not participating in IO; the remaining CPUs, 1 to 31, are submitting IO. In such a scenario, rq_affinity=2 and an irqbalance supporting *exact* smp_affinity_hint will not help. We may see a soft/hard lockup on CPU-0. Are we going to resolve such an issue, or is it very rare to happen in the field?

> That'll ensure the block layer's completion handling is done there, but not your driver's interrupt handler (which precedes the block layer completion handling).
>
> > Is there any way to route the interrupts to the same CPUs which have
> > submitted the corresponding IOs?
> > or
> > Is there any way/option in the irqbalance/kernel which can route
> > interrupts to CPUs (enabled in affinity_hint) in a round-robin manner
> > after a specific time period.
> Ensure your driver creates one MSIX interrupt per CPU core, uses that interrupt for all submissions from that core, and reports that it would like that interrupt to be serviced by that core in /proc/irq/nnn/affinity_hint.
>
> Even with hyperthreading, this needs to be based on the logical CPU cores, not just the physical core or the physical socket.
> You can swamp a logical CPU core as easily as a physical CPU core.
>
> Then, provide an irqbalance policy script that honors the affinity_hint for your driver, or turn off irqbalance and manually set /proc/irq/nnn/smp_affinity to match the affinity_hint.
>
> Some versions of irqbalance honor the hints; some purposely don't and need to be overridden with a policy script.
>
> ---
> Robert Elliott, HPE Persistent Memory
RE: [PATCH] megaraid_sas: Fix probing cards without io port
> > that does not have an io port resource allocated from the BIOS, and the
> > kernel can not assign one due to io port shortage.
> >
> > The driver is only looking for MEM, and should not fail.
> >
> > It turns out megasas_init_fw() etc are using the bar index as a mask.
> > index 1 is used as mask 1, so that pci_request_selected_regions() is
> > trying to request BAR0 instead of BAR1.
> >
> > Fix all related references.

Reviewed the patch and it looks good. Thanks, Yinghai, for submitting the patch.

Acked-by: Kashyap Desai <kashyap.de...@broadcom.com>

> Kashyap? Sumit?
>
> --
> Martin K. Petersen   Oracle Linux Engineering
RE: [PATCH] megaraid:Make various functions static in megaraid_sas_fusion.c
> -----Original Message-----
> From: Nicholas Krause [mailto:xerofo...@gmail.com]
> Sent: Monday, July 06, 2015 10:06 PM
> To: kashyap.de...@avagotech.com
> Cc: sumit.sax...@avagotech.com; uday.ling...@avagotech.com; jbottom...@odin.com; megaraidlinux@avagotech.com; linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH] megaraid:Make various functions static in megaraid_sas_fusion.c
>
> This makes various functions that have no external callers outside of their definition/declaration in the file megaraid_sas_fusion.c properly declared as static functions.
>
> Signed-off-by: Nicholas Krause

Nicholas - You posted a patch with the subject line "megaraid:Make the function megasas_alloc_cmd_fusion static" for similar code changes. Let's have a consolidated patch for this, and Avago can post it with the next change set (to keep the git commit history clean without any failure).

Just to be clear - you can send us the patch and we can add it with your name as the submitter in the signature. For now, NACK.

` Kashyap
RE: [PATCH] megaraid:Make the function megasas_alloc_cmd_fusion static
> -int > +static int > megasas_alloc_cmds_fusion(struct megasas_instance *instance) { > int i, j, count;

Good catch. Let's not include this patch, to avoid further confusion across a series of acked patches that make no significant functional difference. NACK, as it makes no functional difference. We can make many more such changes in one shot; let's address those later, as we have a few patches ready to send in a few days.

> -- > 2.5.0
RE: [PATCH RESEND 09/25] mpt3sas: Don't send PHYDISK_HIDDEN Raid Action request on SAS2 HBA's
> -----Original Message----- > From: Hannes Reinecke [mailto:h...@suse.de] > Sent: Wednesday, November 11, 2015 6:27 PM > To: Sreekanth Reddy; j...@kernel.org > Cc: martin.peter...@oracle.com; linux-s...@vger.kernel.org; > jbottom...@parallels.com; sathya.prak...@avagotech.com; > kashyap.de...@avagotech.com; linux-kernel@vger.kernel.org; > h...@infradead.org; chaitra.basa...@avagotech.com; suganath- > prabu.subram...@avagotech.com > Subject: Re: [PATCH RESEND 09/25] mpt3sas: Don't send PHYDISK_HIDDEN > Raid Action request on SAS2 HBA's > > On 11/11/2015 01:00 PM, Sreekanth Reddy wrote: > > From: Sreekanth Reddy > > > > Don't send the PHYDISK_HIDDEN Raid Action request for SAS2 HBAs, > > since these HBAs don't support this Raid Action. > > > > Also enable fast_path only for SAS3 HBAs. > > > > Signed-off-by: Sreekanth Reddy > > --- > > drivers/scsi/mpt3sas/mpt3sas_scsih.c | 19 +-- > > 1 file changed, 17 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c > > b/drivers/scsi/mpt3sas/mpt3sas_scsih.c > > index a638920..80469d0 100644 > > --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c > > +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c > > @@ -1165,8 +1165,10 @@ scsih_target_alloc(struct scsi_target *starget) > > if (test_bit(sas_device->handle, ioc->pd_handles)) > > sas_target_priv_data->flags |= > > MPT_TARGET_FLAGS_RAID_COMPONENT; > > +#ifndef SCSI_MPT2SAS > > if (sas_device->fast_path) > > sas_target_priv_data->flags |= > MPT_TARGET_FASTPATH_IO; > > +#endif > > } > > spin_unlock_irqrestore(&ioc->sas_device_lock, flags); > > > > @@ -3719,11 +3721,13 @@ scsih_qcmd(struct Scsi_Host *shost, struct > scsi_cmnd *scmd) > > ioc->build_zero_len_sge(ioc, &mpi_request->SGL); > > > > if (likely(mpi_request->Function == > MPI2_FUNCTION_SCSI_IO_REQUEST)) > > { > > +#ifndef SCSI_MPT2SAS > > if (sas_target_priv_data->flags & > MPT_TARGET_FASTPATH_IO) { > > mpi_request->IoFlags = cpu_to_le16(scmd->cmd_len | > > MPI25_SCSIIO_IOFLAGS_FAST_PATH); > >
mpt3sas_base_put_smid_fast_path(ioc, smid, > handle); > > } else > > +#endif > > mpt3sas_base_put_smid_scsi_io(ioc, smid, handle); > > } else > > mpt3sas_base_put_smid_default(ioc, smid); @@ -5031,8 > +5035,10 @@ > > _scsih_add_device(struct MPT3SAS_ADAPTER *ioc, u16 handle, u8 > phy_num, > > sas_device->device_info = device_info; > > sas_device->sas_address = sas_address; > > sas_device->phy = sas_device_pg0.PhyNum; > > +#ifndef SCSI_MPT2SAS > > sas_device->fast_path = (le16_to_cpu(sas_device_pg0.Flags) & > > MPI25_SAS_DEVICE0_FLAGS_FAST_PATH_CAPABLE) ? 1 : 0; > > +#endif > > > > if (sas_device_pg0.Flags & > MPI2_SAS_DEVICE0_FLAGS_ENCL_LEVEL_VALID) { > > sas_device->enclosure_level = > > @@ -5731,6 +5737,7 @@ _scsih_sas_discovery_event(struct > MPT3SAS_ADAPTER *ioc, > > } > > } > > > > +#ifndef SCSI_MPT2SAS > > /** > > * _scsih_ir_fastpath - turn on fastpath for IR physdisk > > * @ioc: per adapter object > > @@ -5750,7 +5757,6 @@ _scsih_ir_fastpath(struct MPT3SAS_ADAPTER > *ioc, u16 handle, u8 phys_disk_num) > > u16 ioc_status; > > u32 log_info; > > > > - > > mutex_lock(&ioc->scsih_cmds.mutex); > > > > if (ioc->scsih_cmds.status != MPT3_CMD_NOT_USED) { @@ -5825,6 > > +5831,8 @@ _scsih_ir_fastpath(struct MPT3SAS_ADAPTER *ioc, u16 > handle, u8 phys_disk_num) > > FORCE_BIG_HAMMER); > > return rc; > > } > > +/* End of not defined SCSI_MPT2SAS */ #endif > > > > /** > > * _scsih_reprobe_lun - reprobing lun @@ -6017,8 +6025,10 @@ > > _scsih_sas_pd_hide(struct MPT3SAS_ADAPTER *ioc, > > if (!sas_device) > > return; > > > > +#ifndef SCSI_MPT2SAS > > /* hiding raid component */ > > _scsih_ir_fastpath(ioc, handle, element->PhysDiskNum); > > +#endif > > if (starget) > > starget_for_each_device(starget, (void *)1, > _scsih_reprobe_lun); } > > @@ -6067,7 +6077,9 @@ _scsih_sas_pd_add(struct MPT3SAS_ADAPTER > *ioc, > > sas_device = _scsih_sas_device_find_by_handle(ioc, handle); > > spin_unlock_irqrestore(&ioc->sas_device_lock, flags); > > if (sas_device) { > > +#ifndef SCSI_MPT2SAS > >
_scsih_ir_fastpath(ioc, handle, element->PhysDiskNum); > > +#endif > > return; > > } > > > > @@ -6091,7 +6103,9 @@ _scsih_sas_pd_add(struct MPT3SAS_ADAPTER > *ioc, > > mpt3sas_transport_update_links(ioc, sas_address, handle, > > sas_device_pg0.PhyNum, > MPI2_SAS_NEG_LINK_RATE_1_5); > > > > +#ifndef SCSI_MPT2SAS > > _scsih_ir_fastpath(ioc, handle, element->PhysDiskNum); > > +#endif > > _scsih_add_device(ioc, handle, 0, 1); } > > > > @@ -6202,13 +6216,14 @@ _scsih_sas_ir_config_change_event(struct > > MPT3SAS_ADAPTER *ioc, > > > > element =
RE: [patch 1/2] megaraid_sas: remove a stray tab
> -----Original Message----- > From: Dan Carpenter [mailto:dan.carpen...@oracle.com] > Sent: Saturday, November 07, 2015 9:46 PM > To: Kashyap Desai; Sumit Saxena > Cc: Uday Lingala; James E.J. Bottomley; megaraidlinux@avagotech.com; > linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org; kernel- > janit...@vger.kernel.org > Subject: [patch 1/2] megaraid_sas: remove a stray tab > > One line was indented more than the others. > > Signed-off-by: Dan Carpenter

Let us combine your next patch (because that is also an indentation issue) and this one, and we will resubmit the patch. For now, please ignore it, to avoid any issues for the pending patches waiting to be committed.

> > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c > b/drivers/scsi/megaraid/megaraid_sas_base.c > index cc95372..829e9e9 100644 > --- a/drivers/scsi/megaraid/megaraid_sas_base.c > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c > @@ -5923,9 +5923,9 @@ static void megasas_detach_one(struct pci_dev > *pdev) > > if (instance->ctrl_context) { > megasas_release_fusion(instance); > - pd_seq_map_sz = sizeof(struct > MR_PD_CFG_SEQ_NUM_SYNC) + > + pd_seq_map_sz = sizeof(struct > MR_PD_CFG_SEQ_NUM_SYNC) + > (sizeof(struct MR_PD_CFG_SEQ) * > - (MAX_PHYSICAL_DEVICES - 1)); > + (MAX_PHYSICAL_DEVICES - 1)); > for (i = 0; i < 2 ; i++) { > if (fusion->ld_map[i]) > dma_free_coherent(&instance->pdev->dev,
RE: [PATCH 1/3] megaraid_sas: Convert dev_printk to dev_
> > + dev_dbg(&instance->pdev->dev, "Error copying out > > cmd_status\n"); > > error = -EFAULT; > > } > > > > Reviewed-by: Johannes Thumshirn

We will consider all three patches for future submission. As of now we have two patch sets pending to be committed. We are working on a few more megaraid_sas patches which will clean up the driver module and add proper error handling for DCMD command timeouts. That work will cover the patch posted with the subject below - [PATCH 3/3] megaraid_sas: return -ENOMEM when create DMA pool for cmd frames failed

James - We will resend these patch sets on top of the latest outstanding megaraid_sas driver patches, so that we can avoid any conflicts in commits.
RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
> > I am about to commit the patch that was successfully tested by the customer on > SLES 12, but I'm a bit confused. The upstream patch you referred to is: > > https://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/commit/?h=for-next&id=6431f5d7c6025f8b007af06ea090de308f7e6881 > [SCSI] megaraid_sas: megaraid_sas driver init fails in kdump kernel > > But the patch I used is the one you sent by e-mail on May 28th. It is completely > different! > > So what am I supposed to do? Use the patch you sent (and that was tested by > the customer) for SLES 11 SP3 and SLES 12? Or was it just for testing, and the > proper way of fixing the problem would be to backport the upstream commit?

You can use that patch as a valid candidate for upstream submission. One of the MR maintainers (Sumit Saxena) will send that patch; we are just organizing the other patch series. Since SLES has already ported the patch without a commit id, we are fine. I am just noting for reference that the patch sent via e-mail will go upstream very soon along with the other patch set.

Thanks, Kashyap

> > Please advise, > -- > Jean Delvare > SUSE L3 Support
RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
> -----Original Message----- > From: Jean Delvare [mailto:jdelv...@suse.de] > Sent: Tuesday, July 07, 2015 2:14 PM > To: Kashyap Desai > Cc: Bjorn Helgaas; Robin H. Johnson; Adam Radford; Neela Syam Kolli; linux- > s...@vger.kernel.org; arkadiusz.bub...@open-e.com; Matthew Garrett; Sumit > Saxena; Uday Lingala; PDL,MEGARAIDLINUX; linux-...@vger.kernel.org; linux- > ker...@vger.kernel.org; Myron Stowe > Subject: Re: megaraid_sas: "FW in FAULT state!!", how to get more debug > output? [BKO63661] > > Hi Kashyap, > > On Thu, 28 May 2015 19:05:35 +0530, Kashyap Desai wrote: > > Bjorn/Robin, > > > > Apologies for the delay. Here is one quick suggestion, as we have seen a > > similar issue (not exactly the same, but very probably the same issue) > > while the controller is configured on a VM as pass-through and the VM reboots > abruptly. > > In that particular issue, the driver interacts with FW in a way that may require a > > chip reset to bring the controller to an operational state. > > > > The relevant patch was submitted only for older controllers, as the issue was > > only seen on a few MegaRAID controllers. The patch below already tries to do a > > chip reset, but only for a limited set of controllers... I have attached one more > > patch which does a chip reset at driver load time for > > Thunderbolt/Invader/Fury etc. (In your case you have a Thunderbolt > > controller, so the attached patch is required.) > > > > http://www.spinics.net/lists/linux-scsi/msg67288.html > > > > Please post the result with the attached patch. > > Good news! The customer tested your patch and said it fixed the problem :-) > > I am now in the process of backporting the patch to the SLES 11 SP3 kernel for > further testing. I'll let you know how it goes. Thank you very much for your > assistance.

Thanks for the confirmation. The patch I submitted to you is one we added recently (as part of a common interface approach to do a chip reset at load time). We will be submitting that patch to mainline soon.

~ Kashyap

> > -- > Jean Delvare > SUSE L3 Support
RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
Jean, The patch is available in the repo below - git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git -b for-next Commit id - 6431f5d7c6025f8b007af06ea090de308f7e6881 If you share the megaraid_sas driver code of your tree, I can provide the patch for you.

Kashyap

> -----Original Message----- > From: Jean Delvare [mailto:jdelv...@suse.de] > Sent: Monday, June 29, 2015 6:55 PM > To: Kashyap Desai > Cc: Bjorn Helgaas; Robin H. Johnson; Adam Radford; Neela Syam Kolli; > linux- > s...@vger.kernel.org; arkadiusz.bub...@open-e.com; Matthew Garrett; Sumit > Saxena; Uday Lingala; PDL,MEGARAIDLINUX; linux-...@vger.kernel.org; linux- > ker...@vger.kernel.org; Myron Stowe > Subject: RE: megaraid_sas: "FW in FAULT state!!", how to get more debug > output? [BKO63661] > > Hi Kashyap, > > Thanks for the patch. May I ask what tree it was based on? Linus' > latest? I am trying to apply it to the SLES 11 SP3 and SLES 12 kernel > trees (based > on kernel v3.0 + a bunch of backports and v3.12 > respectively) but your patch fails to apply in both cases. I'll try harder > but I don't > know anything about the megaraid_sas code, so I really don't know where I'm > going. > > Does your patch depend on any other that may not be present in the SLES > 11 SP3 and SLES 12 kernels? > > Thanks, > Jean > > Le Thursday 28 May 2015 à 19:05 +0530, Kashyap Desai a écrit : > > Bjorn/Robin, > > > > Apologies for the delay. Here is one quick suggestion, as we have seen a > > similar issue (not exactly the same, but very probably the same > > issue) while the controller is configured on a VM as pass-through and the VM > > reboots abruptly. > > In that particular issue, the driver interacts with FW in a way that may require a > > chip reset to bring the controller to an operational state. > > > > The relevant patch was submitted only for older controllers, as the issue was > > only seen on a few MegaRAID controllers. The patch below already tries to do a > > chip reset, but only for a limited set of controllers... I have attached one more > > patch which does a chip reset at driver load time for > > Thunderbolt/Invader/Fury etc. (In your case you have a Thunderbolt > > controller, so the attached patch is required.) > > > > http://www.spinics.net/lists/linux-scsi/msg67288.html > > > > Please post the result with the attached patch. > > > > Thanks, Kashyap >
RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
Bjorn/Robin, Apologies for the delay. Here is one quick suggestion, as we have seen a similar issue (not exactly the same, but very probably the same issue) while the controller is configured on a VM as pass-through and the VM reboots abruptly. In that particular issue, the driver interacts with FW in a way that may require a chip reset to bring the controller to an operational state. The relevant patch was submitted only for older controllers, as the issue was only seen on a few MegaRAID controllers. The patch below already tries to do a chip reset, but only for a limited set of controllers... I have attached one more patch which does a chip reset at driver load time for Thunderbolt/Invader/Fury etc. (In your case you have a Thunderbolt controller, so the attached patch is required.) http://www.spinics.net/lists/linux-scsi/msg67288.html Please post the result with the attached patch. Thanks, Kashyap > -----Original Message----- > From: Bjorn Helgaas [mailto:bhelg...@google.com] > Sent: Thursday, May 28, 2015 5:55 PM > To: Robin H. Johnson > Cc: Adam Radford; Neela Syam Kolli; linux-s...@vger.kernel.org; > arkadiusz.bub...@open-e.com; Matthew Garrett; Kashyap Desai; Sumit Saxena; > Uday Lingala; megaraidlinux@avagotech.com; linux-...@vger.kernel.org; > linux-kernel@vger.kernel.org; Jean Delvare; Myron Stowe > Subject: Re: megaraid_sas: "FW in FAULT state!!", how to get more debug > output? [BKO63661] > > [+cc Jean, Myron] > > Hello megaraid maintainers, > > Have you been able to take a look at this at all? People have been reporting this > issue since 2012 on upstream, Debian, and Ubuntu, and now we're getting > reports on SLES. > > My theory is that the Linux driver relies on some MegaRAID initialization done by > the option ROM, and the bug happens when the BIOS doesn't execute the option > ROM. > > If that's correct, you should be able to reproduce it on any system by booting > Linux (v3.3 or later) without running the MegaRAID SAS 2208 option ROM (either > by enabling a BIOS "fast boot" switch, or modifying the BIOS to skip it). 
If the > Linux driver doesn't rely on the option ROM, you might even be able to > reproduce it by physically removing the option ROM from the MegaRAID. > > Bjorn > > On Wed, Apr 29, 2015 at 12:28:32PM -0500, Bjorn Helgaas wrote: > > [+cc linux-pci, linux-kernel, Kashyap, Sumit, Uday, megaraidlinux.pdl] > > > > On Sun, Jul 13, 2014 at 01:35:51AM +, Robin H. Johnson wrote: > > > On Sat, Jul 12, 2014 at 11:29:20AM -0600, Bjorn Helgaas wrote: > > > > Thanks for the report, Robin. > > > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=63661 bisected the > > > > problem to 3c076351c402 ("PCI: Rework ASPM disable code"), which > > > > appeared in v3.3. For starters, can you verify that, e.g., by > > > > building > > > > 69166fbf02c7 (the parent of 3c076351c402) to make sure that it > > > > works, and building 3c076351c402 itself to make sure it fails? > > > > > > > > Assuming that's the case, please attach the complete dmesg and > > > > "lspci -vvxxx" output for both kernels to the bugzilla. ASPM is a > > > > feature that is configured on both ends of a PCIe link, so I want > > > > to see the lspci info for the whole system, not just the SAS adapters. > > > > > > > > It's not practical to revert 3c076351c402 now, so I'd also like to > > > > see the same information for the newest possible kernel (if this > > > > is possible; I'm not clear on whether you can boot your system or > > > > not) so we can figure out what needs to be changed. > > > TL;DR: FastBoot is leaving the MegaRaidSAS in a weird state, and it > > > fails to start; Commit 3c076351c402 did make it worse, but I think > > > we're right that the bug lies in the SAS code. > > > > > > Ok, I have done more testing on it (40+ boots), and I think we can > > > show the problem is somewhere in how the BIOS/EFI/ROM brings up the > > > card in FastBoot more, or how it leaves the card. > > > > I attached your dmesg and lspci logs to > > https://bugzilla.kernel.org/show_bug.cgi?id=63661, thank you! 
You did > > a huge amount of excellent testing and analysis, and I'm sorry that we > > haven't made progress using the results. > > > > I still think this is a megaraid_sas driver bug, but I don't have > > enough evidence to really point fingers. > > > > Based on your testing, before 3c076351c402 ("PCI: Rework ASPM disable > > code"), megaraid_sas worked reliably. After 3c076351c402, > > megaraid_sas does not work reliably when BIOS Fast Boot is enabled. > > > >
> > Fast Boot probably means we don't run the option ROM on the device. Your dmesg logs show that in the working case, BIOS has enabled the device. In the failing case it has not. They also show that when Fast Boot is enabled, there's a little less MTRR write-protect space, which I'm guessing is space that wasn't needed for shadowing option ROMs. > > I suspect megaraid_sas depends on something done by the option ROM, and that prior to 3c076351c402, Linux did something to ASPM that was enough to make megaraid_sas work. > > I attached a couple debug patches to https://bugzilla.kernel.org/show_bug.cgi?id=63661 that log all the ASPM configuration the PCI core does. One applies to 69166fbf02c7
RE: [PATCH] scsi: megaraid_sas: Prevent future %p disaster
Acked-by: Kashyap Desai > -Original Message- > From: Rasmus Villemoes [mailto:li...@rasmusvillemoes.dk] > Sent: Friday, February 06, 2015 8:04 PM > To: Kashyap Desai; Sumit Saxena; Uday Lingala; James E.J. Bottomley > Cc: Rasmus Villemoes; megaraidlinux@avagotech.com; linux- > s...@vger.kernel.org; linux-kernel@vger.kernel.org > Subject: [PATCH] scsi: megaraid_sas: Prevent future %p disaster > > There is currently no %po format extension, so currently the letters "on" are > simply skipped and the pointer is printed as expected (while missing the word > on). However, it is probably only a matter of time before someone comes up > with a %po extension, at which point this is likely to fail spectacularly. > > Signed-off-by: Rasmus Villemoes > --- > drivers/scsi/megaraid/megaraid_sas_base.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c > b/drivers/scsi/megaraid/megaraid_sas_base.c > index ff283d23788a..e7c6b9c946d6 100644 > --- a/drivers/scsi/megaraid/megaraid_sas_base.c > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c > @@ -3105,7 +3105,7 @@ megasas_internal_reset_defer_cmds(struct > megasas_instance *instance) > for (i = 0; i < max_cmd; i++) { > cmd = instance->cmd_list[i]; > if (cmd->sync_cmd == 1 || cmd->scmd) { > - printk(KERN_NOTICE "megasas: moving > cmd[%d]:%p:%d:%p" > + printk(KERN_NOTICE "megasas: moving > cmd[%d]:%p:%d:%p " > "on the defer queue as internal\n", > defer_index, cmd, cmd->sync_cmd, cmd- > >scmd); > > -- > 2.1.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
On Tue, Aug 19, 2014 at 9:36 PM, Christoph Hellwig wrote: > On Tue, Aug 19, 2014 at 03:51:42AM +0530, Kashyap Desai wrote: >> I read this comment and found that very few drivers are using this >> cmd_list. I think if we remove this cmd_list, performance will scale, as I >> am seeing major contention on this lock. >> Just thought to ping you to see if this is a known limitation for now, or if there is any >> plan to change this lock in the near future? > > Removing the lock entirely and pushing the list into the two drivers > using it is on my TODO list. Bart actually suggested keeping the code in the > SCSI core and having a flag to enable it. Given that I'm too busy to get my > full > version done in time, it might be a good idea if someone picks up Bart's > idea. Can you send me a patch to add an enable_cmd_list flag to the host > template and only enable it for aacraid and dpt_i2o? > Sure. I will work on the relevant code change and will post a patch for review. -- Device Driver Developer @ Avagotech Kashyap D. Desai Note - my new email address kashyap.de...@avagotech.com
RE: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
> -Original Message- > From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi- > ow...@vger.kernel.org] On Behalf Of Christoph Hellwig > Sent: Friday, July 18, 2014 3:43 PM > To: James Bottomley; linux-s...@vger.kernel.org > Cc: Jens Axboe; Bart Van Assche; Mike Christie; Martin K. Petersen; Robert > Elliott; Webb Scales; linux-kernel@vger.kernel.org > Subject: [PATCH 13/14] scsi: add support for a blk-mq based I/O path. > > This patch adds support for an alternate I/O path in the scsi midlayer which > uses the blk-mq infrastructure instead of the legacy request code. > > Use of blk-mq is fully transparent to drivers, although for now a host > template field is provided to opt out of blk-mq usage in case any unforeseen > incompatibilities arise. > > In general, replacing the legacy request code with blk-mq is a simple and > mostly mechanical transformation. The biggest exception is the new code > that deals with the fact that I/O submissions in blk-mq must happen from > process context, which slightly complicates the I/O completion handler. > The second biggest difference is that blk-mq is built around the concept of > preallocated requests that also include driver-specific data, which in the SCSI > context means the scsi_cmnd structure. This completely avoids dynamic > memory allocations for the fast path through I/O submission. > > Due to the preallocated requests, the MQ code path exclusively uses the host- > wide shared tag allocator instead of a per-LUN one. This only affects drivers > actually using the block layer provided tag allocator instead of their own. > Unlike the old path, blk-mq always provides a tag, although drivers don't have > to use it. > > For now the blk-mq path is disabled by default and must be enabled using the > "use_blk_mq" module parameter. Once the remaining work in the block > layer to make blk-mq more suitable for slow devices is complete, I hope to > make it the default and eventually even remove the old code path. 
> > Based on the earlier scsi-mq prototype by Nicholas Bellinger. > > Thanks to Bart Van Assche and Robert Elliott for testing, benchmarking and > various suggestions and code contributions. > > Signed-off-by: Christoph Hellwig > Reviewed-by: Hannes Reinecke > Reviewed-by: Webb Scales > Acked-by: Jens Axboe > Tested-by: Bart Van Assche > Tested-by: Robert Elliott > --- > drivers/scsi/hosts.c | 35 +++- > drivers/scsi/scsi.c | 5 +- > drivers/scsi/scsi_lib.c | 464 > -- > drivers/scsi/scsi_priv.h | 3 + > drivers/scsi/scsi_scan.c | 5 +- > drivers/scsi/scsi_sysfs.c | 2 + > include/scsi/scsi_host.h | 18 +- > include/scsi/scsi_tcq.h | 28 ++- > 8 files changed, 488 insertions(+), 72 deletions(-) > > diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c index 0632eee..6de80e3 > 100644 > --- a/drivers/scsi/hosts.c > +++ b/drivers/scsi/hosts.c > @@ -213,9 +213,24 @@ int scsi_add_host_with_dma(struct Scsi_Host > *shost, struct device *dev, > goto fail; > } > > + if (shost_use_blk_mq(shost)) { > + error = scsi_mq_setup_tags(shost); > + if (error) > + goto fail; > + } > + > + /* > + * Note that we allocate the freelist even for the MQ case for now, > + * as we need a command set aside for scsi_reset_provider. Having > + * the full host freelist and one command available for that is a > + * little heavy-handed, but avoids introducing a special allocator > + * just for this. Eventually the structure of scsi_reset_provider > + * will need a major overhaul. > + */ > error = scsi_setup_command_freelist(shost); > if (error) > - goto fail; > + goto out_destroy_tags; > + > > if (!shost->shost_gendev.parent) > shost->shost_gendev.parent = dev ? 
dev : &platform_bus; > @@ -226,7 +241,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, > struct device *dev, > > error = device_add(&shost->shost_gendev); > if (error) > - goto out; > + goto out_destroy_freelist; > > pm_runtime_set_active(&shost->shost_gendev); > pm_runtime_enable(&shost->shost_gendev); > @@ -279,8 +294,11 @@ int scsi_add_host_with_dma(struct Scsi_Host > *shost, struct device *dev, > device_del(&shost->shost_dev); > out_del_gendev: > device_del(&shost->shost_gendev); > - out: > + out_destroy_freelist: > scsi_destroy_command_freelist(shost); > + out_destroy_tags: > + if (shost_use_blk_mq(shost)) > + scsi_mq_destroy_tags(shost); > fail: > return error; > } > @@ -309,8 +327,13 @@ static void scsi_host_dev_release(struct device > *dev) > } > > scsi_destroy_command_freelist(shost); > - if (shost->bqt) > - blk_free_tags(shost->bqt); > + if (shost_use_blk_mq(shost)) { > + if (shost->tag_set.tags) > + scsi_mq_destroy_tags(shost); > + } else { > +
RE: [PATCH v2 RESEND 14/23] megaraid: Use pci_enable_msix_range() instead of pci_enable_msix()
> -Original Message- > From: Alexander Gordeev [mailto:agord...@redhat.com] > Sent: Monday, August 11, 2014 1:40 PM > To: Kashyap Desai; Neela Syam Kolli > Cc: linux-s...@vger.kernel.org; linux-...@vger.kernel.org; linux- > ker...@vger.kernel.org; Christoph Hellwig > Subject: Re: [PATCH v2 RESEND 14/23] megaraid: Use > pci_enable_msix_range() instead of pci_enable_msix() > > On Wed, Jul 16, 2014 at 08:05:18PM +0200, Alexander Gordeev wrote: > > As a result of the deprecation of the MSI-X/MSI enablement functions > > pci_enable_msix() and pci_enable_msi_block(), all drivers using these > > two interfaces need to be updated to use the new > > pci_enable_msi_range() or pci_enable_msi_exact() and > > pci_enable_msix_range() or pci_enable_msix_exact() interfaces. > > Kashyap, Neela, > > Could you please review this patch? I have reviewed this patch. Please consider it acked. Acked-by: Kashyap Desai > > Thanks! > > > Signed-off-by: Alexander Gordeev > > Cc: Neela Syam Kolli > > Cc: linux-s...@vger.kernel.org > > Cc: linux-...@vger.kernel.org > > --- > > drivers/scsi/megaraid/megaraid_sas_base.c | 20 +++- > > 1 files changed, 7 insertions(+), 13 deletions(-) > > > > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c > > b/drivers/scsi/megaraid/megaraid_sas_base.c > > index ba06102..7a4e75e 100644 > > --- a/drivers/scsi/megaraid/megaraid_sas_base.c > > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c > > @@ -4105,17 +4105,11 @@ static int megasas_init_fw(struct > megasas_instance *instance) > > (unsigned int)num_online_cpus()); > > for (i = 0; i < instance->msix_vectors; i++) > > instance->msixentry[i].entry = i; > > - i = pci_enable_msix(instance->pdev, instance->msixentry, > > - instance->msix_vectors); > > - if (i >= 0) { > > - if (i) { > > - if (!pci_enable_msix(instance->pdev, > > -instance->msixentry, i)) > > - instance->msix_vectors = i; > > - else > > - instance->msix_vectors = 0; > > - } > > - } else > > + i = pci_enable_msix_range(instance->pdev, instance- 
>msixentry, > > + 1, instance->msix_vectors); > > + if (i) > > + instance->msix_vectors = i; > > + else > > instance->msix_vectors = 0; > > > > dev_info(>pdev->dev, "[scsi%d]: FW supports" > > @@ -5135,8 +5129,8 @@ megasas_resume(struct pci_dev *pdev) > > > > /* Now re-enable MSI-X */ > > if (instance->msix_vectors && > > - pci_enable_msix(instance->pdev, instance->msixentry, > > - instance->msix_vectors)) > > + pci_enable_msix_exact(instance->pdev, instance->msixentry, > > + instance->msix_vectors)) > > goto fail_reenable_msix; > > > > switch (instance->pdev->device) { > > -- > > 1.7.7.6 > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v2 RESEND 13/23] megaraid: Fail resume if MSI-X re-initialization failed
> -----Original Message-----
> From: Alexander Gordeev [mailto:agord...@redhat.com]
> Sent: Saturday, July 26, 2014 1:54 PM
> To: linux-kernel@vger.kernel.org; Neela Syam Kolli
> Subject: Re: [PATCH v2 RESEND 13/23] megaraid: Fail resume if MSI-X
> re-initialization failed
>
> On Wed, Jul 16, 2014 at 08:05:17PM +0200, Alexander Gordeev wrote:
> > Currently the driver fails to analyze the MSI-X re-enablement status on
> > resume and always assumes success. This update checks the MSI-X
> > initialization result and fails the resume if MSI-X re-enablement
> > failed.
>
> Hi Neela,
>
> Could you please review the megaraid patches in this series?
>
> Thanks!

Please consider this patch acked by me.

Acked-by: Kashyap Desai

> > Signed-off-by: Alexander Gordeev
> > ---
> >  drivers/scsi/megaraid/megaraid_sas_base.c | 8 +++++---
> >  1 files changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> > b/drivers/scsi/megaraid/megaraid_sas_base.c
> > index 112799b..ba06102 100644
> > --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> > @@ -5134,9 +5134,10 @@ megasas_resume(struct pci_dev *pdev)
> >  		goto fail_ready_state;
> >
> >  	/* Now re-enable MSI-X */
> > -	if (instance->msix_vectors)
> > -		pci_enable_msix(instance->pdev, instance->msixentry,
> > -				instance->msix_vectors);
> > +	if (instance->msix_vectors &&
> > +	    pci_enable_msix(instance->pdev, instance->msixentry,
> > +			    instance->msix_vectors))
> > +		goto fail_reenable_msix;
> >
> >  	switch (instance->pdev->device) {
> >  	case PCI_DEVICE_ID_LSI_FUSION:
> > @@ -5245,6 +5246,7 @@ fail_init_mfi:
> >
> >  fail_set_dma_mask:
> >  fail_ready_state:
> > +fail_reenable_msix:
> >
> >  	pci_disable_device(pdev);
> >
> > --
> > 1.7.7.6
>
> --
> Regards,
> Alexander Gordeev
> agord...@redhat.com