(2013/05/07 21:50), Don Dutile wrote: > On 05/07/2013 03:09 AM, Takao Indoh wrote: >> Sorry for the delayed response. >> >> (2013/04/30 23:54), Sumner, William wrote: >>> I have installed your original patch set (from last November) and tested >>> with three platforms, each with a different IO configuration. On the first >>> platform crashdumps were consistently successful. On the second and third >>> platforms, the reset of one specific PCI device on each platform (a >>> different device type on each platform) immediately hung the crash kernel. >>> This was consistent across multiple tries. Temporarily modifying the patch >>> to skip resetting the one problem-device on each platform allowed each >>> platform to create a dump successfully. >>> >>> I am now working with our IO team to determine the exact nature of the hang >>> and to see if the hang is unique to the specific cards or the specific >>> platforms. >>> >>> Also I am starting to look at an alternate possibility: >>> >>> The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level >>> Reset) describes reset use-cases for "a partitioned environment where >>> hardware is migrated from one partition to another" and for "system >>> software is taking down the software stack for a Function and then >>> rebuilding that stack". >>> >>> These use-cases seem very similar to transitioning into the crash kernel. >>> The "Implementation Note" at the end of that section describes an algorithm >>> for "Avoiding Data Corruption From Stale Completions" which looks like it >>> might be applicable to stopping ongoing DMA. I am hopeful that adding some >>> of the steps from this algorithm to one of the already proposed patches >>> would avoid the hang that I saw on two platforms. >> >> It seems that the algorithm you mentioned requires four steps. >> >> 1. Clear Command register >> 2. Wait a few milliseconds (Or polling Transactions Pending bit) >> 3. Do FLR >> 4. Wait 100 ms >> >> My patch does not do step 1 and 2. So, I would appreciate it if you >> could add the following into save_config() in my latest patch and >> confirm if kernel still hangs up or not. >> >> subordinate = dev->subordinate; >> list_for_each_entry(child,&subordinate->devices, bus_list) { >> dev_info(&child->dev, "save state\n"); >> pci_save_state(child); >> + pci_write_config_word(child, PCI_COMMAND, 0); >> + msleep(1000); > As Linus pointed out in an earlier patch, msleep() after each device > is s-l-o-w and unnecessary; should do a single msleep(1000) after > *all* the command registers are written. That way, the added delay is 1sec, > not 1sec*ndevs.
Yeah, of course. I added these two lines just to confirm if clearing command register resolves Bill's problem. They are tentative things. > q: is this turning off command register for only PCI(e) endpoints? > -- shouldn't have to turn off command register in bridges/switches. Yes, only clearing register of endpoints because kdump fails due to following error when turning off command register of switch as well. Fusion MPT SAS Host driver 3.04.20 mptbase: ioc0: Initiating bringup mptbase: ioc0: WARNING - Unexpected doorbell active! mptbase: ioc0: ERROR - Failed to come READY after reset! IocState=f0000000 mptbase: ioc0: WARNING - ResetHistory bit failed to clear! mptbase: ioc0: ERROR - Diagnostic reset FAILED! (ffffffffh) mptbase: ioc0: WARNING - NOT READY WARNING! mptbase: ioc0: ERROR - didn't initialize properly! (-1) mptsas: probe of 0000:50:00.0 failed with error -1 Thanks, Takao Indoh > >> } >> >> Thanks, >> Takao Indoh >> >> >>> >>> Bill Sumner >>> >>> -----Original Message----- >>> From: kexec [mailto:kexec-boun...@lists.infradead.org] On Behalf Of Don >>> Dutile >>> Sent: Friday, April 26, 2013 8:43 AM >>> To: Takao Indoh >>> Cc: linux-...@vger.kernel.org; ke...@lists.infradead.org; >>> linux-ker...@vger.kernel.org; tin...@gmail.com; >>> iommu@lists.linux-foundation.org; bhelg...@google.com >>> Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA >>> >>> On 04/25/2013 11:10 PM, Takao Indoh wrote: >>>> (2013/04/26 3:01), Don Dutile wrote: >>>>> On 04/25/2013 01:11 AM, Takao Indoh wrote: >>>>>> (2013/04/25 4:59), Don Dutile wrote: >>>>>>> On 04/24/2013 12:58 AM, Takao Indoh wrote: >>>>>>>> This patch resets PCIe devices on boot to stop ongoing DMA. When >>>>>>>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each >>>>>>>> PCIe root port and downstream port to reset its downstream endpoint. >>>>>>>> >>>>>>>> Problem: >>>>>>>> This patch solves the problem that kdump can fail when intel_iommu=on >>>>>>>> is >>>>>>>> specified. When intel_iommu=on is specified, many dma-remapping errors >>>>>>>> occur in second kernel and it causes problems like driver error or PCI >>>>>>>> SERR, at last kdump fails. This problem is caused as follows. >>>>>>>> 1) Devices are working on first kernel. >>>>>>>> 2) Switch to second kernel(kdump kernel). The devices are still working >>>>>>>> and its DMA continues during this switch. >>>>>>>> 3) iommu is initialized during second kernel boot and ongoing DMA >>>>>>>> causes >>>>>>>> dma-remapping errors. >>>>>>>> >>>>>>>> Solution: >>>>>>>> All DMA transactions have to be stopped before iommu is initialized. By >>>>>>>> this patch devices are reset and in-flight DMA is stopped before >>>>>>>> pci_iommu_init. >>>>>>>> >>>>>>>> To invoke hot reset on an endpoint, its upstream link need to be reset. >>>>>>>> reset_pcie_devices() is called from fs_initcall_sync, and it finds root >>>>>>>> port/downstream port whose child is PCIe endpoint, and then reset link >>>>>>>> between them. If the endpoint is VGA device, it is skipped because the >>>>>>>> monitor blacks out if VGA controller is reset. >>>>>>>> >>>>>>> Couple questions wrt VGA device: >>>>>>> (1) Many graphics devices are multi-function, one function being VGA; >>>>>>> is the VGA always function 0, so this scan sees it first& >>>>>>> doesn't >>>>>>> do a reset on that PCIe link? if the VGA is not function 0, >>>>>>> won't >>>>>>> this logic break (will reset b/c function 0 is non-VGA >>>>>>> graphics) ? >>>>>> >>>>>> VGA is not reset irrespective of its function number. The logic of this >>>>>> patch is: >>>>>> >>>>>> for_each_pci_dev(dev) { >>>>>> if (dev is not PCIe) >>>>>> continue; >>>>>> if (dev is not root port/downstream port) ---(1) >>>>>> continue; >>>>>> list_for_each_entry(child,&dev->subordinate->devices, >>>>>> bus_list) { >>>>>> if (child is upstream port or bridge or VGA) ---(2) >>>>>> continue; >>>>>> } >>>>>> do_reset_its_child(dev); >>>>>> } >>>>>> >>>>>> Therefore VGA itself is skipped by (1), and upstream device(root port or >>>>>> downstream port) of VGA is also skipped by (2). >>>>>> >>>>>> >>>>>>> (2) I'm hearing VGA will soon not be the a required console; this logic >>>>>>> assumes it is, and why it isn't blanked. >>>>>>> Q: Should the filter be based on a device having a >>>>>>> device-class of display ? >>>>>> >>>>>> I want to avoid the situation that user's monitor blacks out and user >>>>>> cannot know what's going on. That's reason why I introduced the logic to >>>>>> skip VGA. As far as I tested the logic based on device-class works well, >>>>> sorry, I read your description, which said VGA, but your are filtering on >>>>> display class, >>>>> which includes non-VGA as well. So, all set ... but large, (x16) non-VGA >>>>> display devices >>>>> are probably one of the most aggressive DMA engines on a system.... and >>>>> will grow as >>>>> asymmetric processing using GPUs gets architected into a device-agnostic >>>>> manner. >>>>> So, this may work well for servers, which is the primary consumer/user of >>>>> this feature, >>>>> and they typically have built-in graphics that are generally used in >>>>> simple VGA mode, >>>>> so this may be sufficient for now. >>>> >>>> Ok, understood. >>>> >>>> >>>>> >>>>>> but I would appreciate it if there are better ways. >>>>>> >>>>> You probably don't want to hear it but.... >>>>> a) only turn off cmd-reg master enable bit >>>>> b) only do reset based on a list of devices known not to >>>>> obey their cmd-reg master enable bit, and only do reset to those >>>>> devices. >>>>> But, given the testing you've done so far, this optional (need cmdline) >>>>> feature, >>>>> let's start here. >>>> >>>> Ok. Either way I think we need more testing. >>>> >>>> >>>>>>> >>>>>>>> Actually this is v8 patch but quite different from v7 and it's been so >>>>>>>> long since previous post, so I start over again. >>>>>>> Thanks for this re-start. I need to continue reviewing the rest. >>>>>> >>>>>> Thank you for your review! >>>>>> >>>>>>> >>>>>>> Q: Why not force IOMMU off when re-booting a kexec kernel to perform a >>>>>>> crash >>>>>>> dump? After the crash dump, the system is rebooting to >>>>>>> previous (iommu=on) setting. >>>>>>> That logic, along w/your previous patch to disable the IOMMU >>>>>>> if iommu=off >>>>>>> is set, would remove this (relatively slow) PCI init >>>>>>> sequencing ? >>>>>> >>>>>> To force iommu off, all ongoing DMA have to be stopped before that since >>>>>> they are accessing the device address, not physical address. If we >>>>>> disable >>>>>> iommu without stopping in-flihgt DMA, devices access invalid memory area >>>>>> and it causes memory corruption or PCI-SERR due to DMA error. >>>>> Right, that's a 'duh' on my part. >>>>> I thought 'disable iommu' == 'block all dma' and it just turns it off& >>>>> let's the ongoing DMA run... >>>>> Please ignore this question... sigh. >>>>> >>>>>> >>>>>> So, whether we use iommu or not in second kernel, we have to stop DMA in >>>>>> second kernel if iommu is used in first kernel. >>>>>> >>>>>> Thanks, >>>>>> Takao Indoh >>>>>> >>>>>> >>>>>>> >>>>>>>> Previous post: >>>>>>>> [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump >>>>>>>> https://lkml.org/lkml/2012/11/26/814 >>>>>>>> >>>>>>>> Signed-off-by: Takao Indoh<indou.ta...@jp.fujitsu.com> >>>>>>>> --- >>>>>>>> Documentation/kernel-parameters.txt | 2 + >>>>>>>> drivers/pci/pci.c | 103 >>>>>>>> +++++++++++++++++++++++++++++++++++ >>>>>>>> 2 files changed, 105 insertions(+), 0 deletions(-) >>>>>>>> >>>>>>>> diff --git a/Documentation/kernel-parameters.txt >>>>>>>> b/Documentation/kernel-parameters.txt >>>>>>>> index 4609e81..2a31ade 100644 >>>>>>>> --- a/Documentation/kernel-parameters.txt >>>>>>>> +++ b/Documentation/kernel-parameters.txt >>>>>>>> @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can >>>>>>>> also be entirely omitted. >>>>>>>> any pair of devices, possibly at the cost of >>>>>>>> reduced performance. This also guarantees >>>>>>>> that hot-added devices will work. >>>>>>>> + pcie_reset_devices Reset PCIe endpoint on boot by hot >>>>>>>> + reset >>>>>>>> cbiosize=nn[KMG] The fixed amount of bus space >>>>>>>> which is >>>>>>>> reserved for the CardBus bridge's IO window. >>>>>>>> The default value is 256 bytes. >>>>>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >>>>>>>> index b099e00..42385c9 100644 >>>>>>>> --- a/drivers/pci/pci.c >>>>>>>> +++ b/drivers/pci/pci.c >>>>>>>> @@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus >>>>>>>> *bus) >>>>>>>> } >>>>>>>> EXPORT_SYMBOL(pci_fixup_cardbus); >>>>>>>> >>>>>>>> +/* >>>>>>>> + * Return true if dev is PCIe root port or downstream port whose >>>>>>>> child is PCIe >>>>>>>> + * endpoint except VGA device. >>>>>>>> + */ >>>>>>>> +static int __init need_reset(struct pci_dev *dev) >>>>>>>> +{ >>>>>>>> + struct pci_bus *subordinate; >>>>>>>> + struct pci_dev *child; >>>>>>>> + >>>>>>>> + if (!pci_is_pcie(dev) || !dev->subordinate || >>>>>>>> + list_empty(&dev->subordinate->devices) || >>>>>>>> + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)&& >>>>>>>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))) >>>>>>>> + return 0; >>>>>>>> + >>>>>>>> + subordinate = dev->subordinate; >>>>>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) { >>>>>>>> + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) || >>>>>>>> + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) || >>>>>>>> + ((child->class>> 16) == PCI_BASE_CLASS_DISPLAY)) >>>>>>>> + /* Don't reset switch, bridge, VGA device */ >>>>>>>> + return 0; >>>>>>>> + } >>>>>>>> + >>>>>>>> + return 1; >>>>>>>> +} >>>>>>>> + >>>>>>>> +static void __init save_config(struct pci_dev *dev) >>>>> This should be renamed. it implies it is saving the config of the pdev >>>>> passed in, >>>>> when in reality, it is saving the config of all the devices attached to >>>>> this pdev. >>>>> i.e., save_downstream_configs() >>>>> >>>>>>>> +{ >>>>>>>> + struct pci_bus *subordinate; >>>>>>>> + struct pci_dev *child; >>>>>>>> + >>>>>>>> + if (!need_reset(dev)) >>>>>>>> + return; >>>>>>>> + >>>>>>>> + subordinate = dev->subordinate; >>>>>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) { >>>>>>>> + dev_info(&child->dev, "save state\n"); >>>>>>>> + pci_save_state(child); >>>>>>>> + } >>>>>>>> +} >>>>>>>> + >>>>>>>> +static void __init restore_config(struct pci_dev *dev) >>>>> inverse of above: restore_downstream_configs() >>>>>>>> +{ >>>>>>>> + struct pci_bus *subordinate; >>>>>>>> + struct pci_dev *child; >>>>>>>> + >>>>>>>> + if (!need_reset(dev)) >>>>>>>> + return; >>>>>>>> + >>>>>>>> + subordinate = dev->subordinate; >>>>>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) { >>>>>>>> + dev_info(&child->dev, "restore state\n"); >>>>>>>> + pci_restore_state(child); >>>>>>>> + } >>>>>>>> +} >>>>>>>> + >>>>>>>> +static void __init do_device_reset(struct pci_dev *dev) >>>>> do_downstream_device_reset() -- it's not resetting this pdev, >>>>> but the pdev's of the devices attached to it. >>>>> >>>> Right, I'll change them as you said. >>>> >>>> >>>>>>>> +{ >>>>>>>> + u16 ctrl; >>>>>>>> + >>>>>>>> + if (!need_reset(dev)) >>>>>>>> + return; >>>>>>>> + >>>>>>>> + dev_info(&dev->dev, "Reset Secondary bus\n"); >>>>>>>> + >>>>>>>> + /* Assert Secondary Bus Reset */ >>>>>>>> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&ctrl); >>>>>>>> + ctrl |= PCI_BRIDGE_CTL_BUS_RESET; >>>>>>>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl); >>>>>>>> + >>>>>>>> + msleep(2); >>>>>>>> + >>>>> This works well for x86, which uses ioport registers to access >>>>> these<256-offset registers, b/c the write() function can't return >>>>> until the write is actually completed, but for a non-x86 system, >>>>> with fully mmconf'd PCI space, a write() may still be a write& run >>>>> (sitting in CPU write-merge) buffer, so if you need a full 2ms, >>>>> you ought to do another read_config() to the device, to flush the write, >>>>> before starting the msleep(2) clock. >>>>> >>>> I didn't know that... I'll insert something like this before msleep. >>>> >>>> pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&dummy); >>>> >>>> BTW, I referred aer_do_secondary_bus_reset() when I wrote this code. >>>> Probably need the same fix, right? >>>> >>> sounds like it. >>> Interested to hear from someone in non-x86-land how >>> they ensure pci cfg writes aren't buffered in their cpus; >>> maybe handled via special uncached regions on other arches. >>> >>>> >>>>>>>> + /* De-assert Secondary Bus Reset */ >>>>>>>> + ctrl&= ~PCI_BRIDGE_CTL_BUS_RESET; >>>>>>>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl); >>>>>>>> +} >>>>>>>> + >>>>>>>> +static int __initdata pcie_reset_devices; >>>>>>>> +static int __init reset_pcie_devices(void) >>>>> this should be reset_pcie_endpoints() ... it's not resetting all pcie >>>>> devices >>>>>>>> +{ >>>>>>>> + struct pci_dev *dev = NULL; >>>>>>>> + >>>>>>>> + if (!pcie_reset_devices) >>>>>>>> + return 0; >>>>>>>> + >>>>>>>> + for_each_pci_dev(dev) >>>>>>>> + save_config(dev); >>>>>>>> + >>>>>>>> + for_each_pci_dev(dev) >>>>>>>> + do_device_reset(dev); >>>>>>>> + >>>>>>>> + msleep(1000); >>>>>>>> + >>>>>>>> + for_each_pci_dev(dev) >>>>>>>> + restore_config(dev); >>>>>>>> + >>>>> My apologies if past thread covered this sequence... >>>>> Why three loops through all PCIe devices on the system? >>>>> why not have the first for-each-pci-dev() loop filter devices >>>>> to be reset, and save_config those pdev's, >>>>> return a list of saved_pdev's; feed that list into the do_device_reset(); >>>>> then mpsleep(), then restore the list. >>>>> in fact, once you create a link list of pdev's to reset, >>>>> just loop that list doing save, then reset; rtn the list, do msleep(), >>>>> then restore the config of pdevs in the list. >>>>> Otherwise, doing much more traversing than what's needed. >>>>> Doing a great deal more config-saving then needed right now as well >>>>> (saving all non-endpt devices that aren't reset). >>>>> >>>> >>>> Creating list is nice idea, thanks. I'll do. >>>> >>>> I'll have vacation next week, so I'll post v2 patch the week after >>>> next. >>>> >>>> Thanks, >>>> Takao Indoh >>>> >>> Have a good vacation! Thanks for your willingness >>> to respin again& make this improvement. >>> >>> >>> _______________________________________________ >>> kexec mailing list >>> ke...@lists.infradead.org >>> http://lists.infradead.org/mailman/listinfo/kexec >>> >>> >> >> > > > _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu