Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
On Thu, Aug 01, 2013 at 03:34:06PM +0900, Takao Indoh wrote: > (2013/08/01 6:23), Rafael J. Wysocki wrote: > > On Wednesday, July 31, 2013 03:08:03 PM Bjorn Helgaas wrote: > >> [+cc Rafael, linux-acpi] > >> > >> On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh > >> wrote: > >> > >>> On x86, currently IOMMU initialization run *after* PCI enumeration, but > >>> what you are talking about is that it should be changed so that x86 > >>> IOMMU initialization is done *before* PCI enumeration like sparc, right? > >> > >> Yes. I don't know whether or when that initialization order will ever > >> be changed, but I do think we should avoid building more > >> infrastructure that depends on the current order. > >> > >> Changing the order is a pretty big deal because it's a lot more than > >> just the IOMMU. Basically I think we should be enumerating ACPI > >> devices, including the IOMMU, before PCI devices, but there's a lot of > >> legacy involved in that area. Added Rafael in case he has any > >> thoughts. > > > > Well, actually, I'm not really familiar with IOMMUs, sorry. > > > > I do think that initializing IOMMU before PCI enumeration would be better, > > however. At least if the ordering should be the same on all architectures, > > which I suppose is the case, that's the one I'd choose. > > Ok guys. If x86 IOMMU maintainer also thinks changing order is > necessary, maybe I need to give up device reset in kdump kernel and > consider doing it in panic kernel. I don't think trying to reset all the devices in panic kernel is a good idea. We need to handle the problem at IOMMU level first which is independent of whether devices have been reset or not. IOW, we should have the capability to initialize IOMMU first and be able to deal with devices which are doing DMA. I am not against doing device reset and it most likely is a good thing but it should happen in second kernel and not in crashed kernel. Thanks Vivek -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
On Thu, 2013-08-01 at 15:34 +0900, Takao Indoh wrote: > (2013/08/01 6:23), Rafael J. Wysocki wrote: > > On Wednesday, July 31, 2013 03:08:03 PM Bjorn Helgaas wrote: > >> [+cc Rafael, linux-acpi] > >> > >> On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh > >> wrote: > >> > >>> On x86, currently IOMMU initialization run *after* PCI enumeration, but > >>> what you are talking about is that it should be changed so that x86 > >>> IOMMU initialization is done *before* PCI enumeration like sparc, right? > >> > >> Yes. I don't know whether or when that initialization order will ever > >> be changed, but I do think we should avoid building more > >> infrastructure that depends on the current order. > >> > >> Changing the order is a pretty big deal because it's a lot more than > >> just the IOMMU. Basically I think we should be enumerating ACPI > >> devices, including the IOMMU, before PCI devices, but there's a lot of > >> legacy involved in that area. Added Rafael in case he has any > >> thoughts. > > > > Well, actually, I'm not really familiar with IOMMUs, sorry. > > > > I do think that initializing IOMMU before PCI enumeration would be better, > > however. At least if the ordering should be the same on all architectures, > > which I suppose is the case, that's the one I'd choose. > > Ok guys. If x86 IOMMU maintainer also thinks changing order is > necessary, maybe I need to give up device reset in kdump kernel and > consider doing it in panic kernel. > > Either way, I need bus reset interface to reset devices. Bjorn, could > you review the bus reset patches Alex posted yesterday? I'll post a non-RFC version today, I've made a couple cleanups, tuned the delays and rolled in the AER version of secondary bus reset. Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
(2013/08/01 6:23), Rafael J. Wysocki wrote: > On Wednesday, July 31, 2013 03:08:03 PM Bjorn Helgaas wrote: >> [+cc Rafael, linux-acpi] >> >> On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh >> wrote: >> >>> On x86, currently IOMMU initialization run *after* PCI enumeration, but >>> what you are talking about is that it should be changed so that x86 >>> IOMMU initialization is done *before* PCI enumeration like sparc, right? >> >> Yes. I don't know whether or when that initialization order will ever >> be changed, but I do think we should avoid building more >> infrastructure that depends on the current order. >> >> Changing the order is a pretty big deal because it's a lot more than >> just the IOMMU. Basically I think we should be enumerating ACPI >> devices, including the IOMMU, before PCI devices, but there's a lot of >> legacy involved in that area. Added Rafael in case he has any >> thoughts. > > Well, actually, I'm not really familiar with IOMMUs, sorry. > > I do think that initializing IOMMU before PCI enumeration would be better, > however. At least if the ordering should be the same on all architectures, > which I suppose is the case, that's the one I'd choose. Ok guys. If x86 IOMMU maintainer also thinks changing order is necessary, maybe I need to give up device reset in kdump kernel and consider doing it in panic kernel. Either way, I need bus reset interface to reset devices. Bjorn, could you review the bus reset patches Alex posted yesterday? Thanks, Takao Indoh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
On Wednesday, July 31, 2013 03:08:03 PM Bjorn Helgaas wrote: > [+cc Rafael, linux-acpi] > > On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh > wrote: > > > On x86, currently IOMMU initialization run *after* PCI enumeration, but > > what you are talking about is that it should be changed so that x86 > > IOMMU initialization is done *before* PCI enumeration like sparc, right? > > Yes. I don't know whether or when that initialization order will ever > be changed, but I do think we should avoid building more > infrastructure that depends on the current order. > > Changing the order is a pretty big deal because it's a lot more than > just the IOMMU. Basically I think we should be enumerating ACPI > devices, including the IOMMU, before PCI devices, but there's a lot of > legacy involved in that area. Added Rafael in case he has any > thoughts. Well, actually, I'm not really familiar with IOMMUs, sorry. I do think that initializing IOMMU before PCI enumeration would be better, however. At least if the ordering should be the same on all architectures, which I suppose is the case, that's the one I'd choose. Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
[+cc Rafael, linux-acpi] On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh wrote: > On x86, currently IOMMU initialization run *after* PCI enumeration, but > what you are talking about is that it should be changed so that x86 > IOMMU initialization is done *before* PCI enumeration like sparc, right? Yes. I don't know whether or when that initialization order will ever be changed, but I do think we should avoid building more infrastructure that depends on the current order. Changing the order is a pretty big deal because it's a lot more than just the IOMMU. Basically I think we should be enumerating ACPI devices, including the IOMMU, before PCI devices, but there's a lot of legacy involved in that area. Added Rafael in case he has any thoughts. > Hmm, ok, I think I need to post attached patch to iommu list and > discuss it including current order of x86 IOMMU initialization. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
On Tue, Jul 30, 2013 at 09:59:16AM -0600, Bjorn Helgaas wrote: > On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh > wrote: > > (2013/07/29 23:17), Bjorn Helgaas wrote: > >> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh > >> wrote: > >>> (2013/07/26 2:00), Bjorn Helgaas wrote: > > My point about IOMMU and PCI initialization order doesn't go away just > because it doesn't fit "kdump policy." Having system initialization > occur in a logical order is far more important than making kdump work. > >>> > >>> My next plan is as follows. I think this is matched to logical order > >>> on boot. > >>> > >>> drivers/pci/pci.c > >>> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus > >>> *bus) > >>> > >>> drivers/iommu/intel-iommu.c > >>> - On initialization, if IOMMU is already enabled, call this bus reset > >>>function before disabling and re-enabling IOMMU. > >> > >> I raised this issue because of arches like sparc that enumerate the > >> IOMMU before the PCI devices that use it. In that situation, I think > >> you're proposing this: > >> > >>panic kernel > >> enable IOMMU > >> panic > >>kdump kernel > >> initialize IOMMU (already enabled) > >>pci_reset_bus > >>disable IOMMU > >>enable IOMMU > >> enumerate PCI devices > >> > >> But the problem is that when you call pci_reset_bus(), you haven't > >> enumerated the PCI devices, so you don't know what to reset. > > > > Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu > > initialization is based on the assumption that enumeration of PCI devices > > is already done. We can find target device from IOMMU page table instead > > of scanning all devices in pci tree. > > > > Therefore, this idea is only for intel-iommu. Other architectures need > > to implement their own reset code. > > That's my point. I'm opposed to adding code to PCI when it only > benefits x86 and we know other arches will need a fundamentally > different design. I would rather have a design that can work for all > arches. I agree. It makes sense to work on a design which works for all arches. And that's when I feel that handling this problem at IOMMU level makes more sense, if we can. Thanks Vivek -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
On Thu, Jul 25, 2013 at 11:00:46AM -0600, Bjorn Helgaas wrote: > On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh > wrote: > > Sorry for letting this discussion slide, I was busy on other works:-( > > Anyway, the summary of previous discussion is: > > - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on > > boot. This expects PCI enumeration is done before IOMMU > > initialization as follows. > > (1) PCI enumeration > > (2) fs_initcall ---> device reset > > (3) IOMMU initialization > > - This works on x86, but does not work on other architecture because > > IOMMU is initialized before PCI enumeration on some architectures. So, > > device reset should be done where IOMMU is initialized instead of > > initcall. > > - Or, as another idea, we can reset devices in first kernel(panic kernel) > > > > Resetting devices in panic kernel is against kdump policy and seems not to > > be good idea. So I think adding reset code into iommu initialization is > > better. I'll post patches for that. > > Of course nobody *wants* to do anything in the panic kernel. But > simply saying "it's against kdump policy and seems not to be a good > idea" is not a technical argument. There are things that are > impractical to do in the kdump kernel, so they have to be done in the > panic kernel even though we know the kernel is unreliable and the > attempt may fail. I think resetting all devices in crashed kernel is really a lot of code. If there is a small piece of code, it can still be considered. I don't know much about IOMMU or PCI or PCIE. But I am taking one step back and discuss again the idea of not resetting the IOMMU in second kernel. I think resetting the bus is a good idea but just resetting PCIE will solve only part of the problem and we will same issues with devices on other buses. So what sounds more appealing if we could fix this particular problem at IOMMU level first (and continue to develp patches for resetting various buses). In the past also these ideas have been proposed that continue to use translation table from first kernel. Retain those mappings and don't reset IOMMU. Reserve some space for kdump mappings in first kernel and use that reserved mapping space in second kernel. It never got implemented though. Bjorn, so what's the fundamental problem with this idea? Also, what's wrong with DMAR error. If some device tried to do DMA, and DMA was blocked because IOMMU got reset and mappings are no more there, why does it lead to failure. Shouldn't we just reate limit error messages in such case and if device is needed, anyway driver will reset it. Other problem mentioned in this thread is PCI SERR. What is it? Is it some kind of error device reports if it can't do DMA successfully. Can these errors be simply ignored kdump kernel? This problem sounds similar to a device keeping interrupt asserted in second kernel and kernel simply disables the interrupt line if nobody claims the interrupt. IOW, it feels to me that we should handle the issue (DMAR error) at IOMMU level first (instead of trying to make sure that by the time we get to initialize IOMMU(), all devices in system have been quiesced and nobody is doing DMA). Thanks Vivek -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
(2013/07/31 12:11), Alex Williamson wrote: > On Wed, 2013-07-31 at 09:35 +0900, Takao Indoh wrote: >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >> index e37fea6..c595997 100644 >> --- a/drivers/pci/pci.c >> +++ b/drivers/pci/pci.c >> @@ -3392,6 +3392,59 @@ int pci_reset_function(struct pci_dev *dev) >> EXPORT_SYMBOL_GPL(pci_reset_function); >> >> /** >> + * pci_reset_bus - reset a PCI bus >> + * @bus: PCI bus to reset >> + * >> + * Returns 0 if the bus was successfully reset or negative if failed. >> + */ >> +int pci_reset_bus(struct pci_bus *bus) >> +{ >> +struct pci_dev *pdev; >> +u16 ctrl; >> + >> +if (!bus->self) >> +return -ENOTTY; >> + >> +list_for_each_entry(pdev, &bus->devices, bus_list) >> +if (pdev->subordinate) >> +return -ENOTTY; >> + >> +/* Save config registers of children */ >> +list_for_each_entry(pdev, &bus->devices, bus_list) { >> +dev_info(&pdev->dev, "Save state\n"); >> +pci_save_state(pdev); >> +} >> + >> +dev_info(&bus->self->dev, "Reset Secondary bus\n"); >> + >> +/* Assert Secondary Bus Reset */ >> +pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl); >> +ctrl |= PCI_BRIDGE_CTL_BUS_RESET; >> +pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl); >> + >> +/* Read config again to flush previous write */ >> +pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl); >> + >> +msleep(2); >> + >> +/* De-assert Secondary Bus Reset */ >> +ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET; >> +pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl); >> + >> +/* Wait for completion */ >> +msleep(1000); > > > We already have secondary bus reset code in this file, why are we > duplicating it here? Also, why are these delays different from the > existing code? I'm also in need of a bus reset interface for when we > assign all of the devices on a bus to userspace and do not have working > function level resets per device. I'll post my patch series and perhaps > we can collaborate on a pci bus reset interface. Thanks, Good point. Yes, we have already similar functions. pci_parent_bus_reset() 1. Assert secondary bus reset 2. msleep(100) 3. De-assert secondary bus reset 4. msleep(100) aer_do_secondary_bus_reset() 1. Assert secondary bus reset 2. msleep(2) 3. De-assert secondary bus reset, 4. msleep(200) To be honest, I wrote my reset code almost one years ago, so I forgot the reason why I separated them. Basically my reset code is based on aer_do_secondary_bus_reset(). The different is waiting time after reset. My patch has 1000msec waiting time. At first my reset code is almost same as aer_do_secondary_bus_reset(). But when I tested the reset code, I found that on certain machine restoring config registers failed after reset. It failed because 200msec waiting time was too short. And I found the following description in PCIe spec. According to this, I thought we should wait at least 1000msec. 6.6.1. Conventional Reset * The Root Complex and/or system software must allow at least 1.0s after a Conventional Reset of a device, before it may determine that a device which fails to return a Successful Completion status for a valid Configuration Request is a broken device. This period is independent of how quickly Link training completes. Note: This delay is analogous to the Trhfa parameter specified for PCI/PCI-X, and is intended to allow an adequate amount of time for devices which require self initialization. * When attempting a Configuration access to devices on a PCI or PCI-X bus segment behind a PCI Express/PCI(-X) Bridge, the timing parameter Trhfa must be respected. And I saw patches you posted today, yes, your patch looks helpful for my purpose:-) Thanks, Takao Indoh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
On Wed, 2013-07-31 at 09:35 +0900, Takao Indoh wrote: > (2013/07/31 0:59), Bjorn Helgaas wrote: > > On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh > > wrote: > >> (2013/07/29 23:17), Bjorn Helgaas wrote: > >>> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh > >>> wrote: > (2013/07/26 2:00), Bjorn Helgaas wrote: > > > > My point about IOMMU and PCI initialization order doesn't go away just > > because it doesn't fit "kdump policy." Having system initialization > > occur in a logical order is far more important than making kdump work. > > My next plan is as follows. I think this is matched to logical order > on boot. > > drivers/pci/pci.c > - Add function to reset bus, for example, pci_reset_bus(struct pci_bus > *bus) > > drivers/iommu/intel-iommu.c > - On initialization, if IOMMU is already enabled, call this bus reset > function before disabling and re-enabling IOMMU. > >>> > >>> I raised this issue because of arches like sparc that enumerate the > >>> IOMMU before the PCI devices that use it. In that situation, I think > >>> you're proposing this: > >>> > >>> panic kernel > >>> enable IOMMU > >>> panic > >>> kdump kernel > >>> initialize IOMMU (already enabled) > >>> pci_reset_bus > >>> disable IOMMU > >>> enable IOMMU > >>> enumerate PCI devices > >>> > >>> But the problem is that when you call pci_reset_bus(), you haven't > >>> enumerated the PCI devices, so you don't know what to reset. > >> > >> Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu > >> initialization is based on the assumption that enumeration of PCI devices > >> is already done. We can find target device from IOMMU page table instead > >> of scanning all devices in pci tree. > >> > >> Therefore, this idea is only for intel-iommu. Other architectures need > >> to implement their own reset code. > > > > That's my point. I'm opposed to adding code to PCI when it only > > benefits x86 and we know other arches will need a fundamentally > > different design. I would rather have a design that can work for all > > arches. > > > > If your implementation is totally implemented under arch/x86 (or in > > intel-iommu.c, I guess), I can't object as much. However, I think > > that eventually even x86 should enumerate the IOMMUs via ACPI before > > we enumerate PCI devices. > > > > It's pretty clear that's how BIOS designers expect the OS to work. > > For example, sec 8.7.3 of the Intel Virtualization Technology for > > Directed I/O spec, rev 1.3, shows the expectation that remapping > > hardware (IOMMU) is initialized before discovering the I/O hierarchy > > below a hot-added host bridge. Obviously you're not talking about a > > hot-add scenario, but I think the same sequence should apply at > > boot-time as well. > > Of course I won't add something just for x86 into common PCI layer. I > attach my new patch, though it is not well tested yet. > > On x86, currently IOMMU initialization run *after* PCI enumeration, but > what you are talking about is that it should be changed so that x86 > IOMMU initialization is done *before* PCI enumeration like sparc, right? > > Hmm, ok, I think I need to post attached patch to iommu list and > discuss it including current order of x86 IOMMU initialization. > > Thanks, > Takao Indoh > --- > drivers/iommu/intel-iommu.c | 55 +- > drivers/pci/pci.c | 53 > include/linux/pci.h |1 > 3 files changed, 108 insertions(+), 1 deletion(-) > > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c > index eec0d3e..fb8a546 100644 > --- a/drivers/iommu/intel-iommu.c > +++ b/drivers/iommu/intel-iommu.c > @@ -3663,6 +3663,56 @@ static struct notifier_block device_nb = { > .notifier_call = device_notifier, > }; > > +/* Reset PCI device if its entry exists in DMAR table */ > +static void __init iommu_reset_devices(struct intel_iommu *iommu, u16 > segment) > +{ > + u64 addr; > + struct root_entry *root; > + struct context_entry *context; > + int bus, devfn; > + struct pci_dev *dev; > + > + addr = dmar_readq(iommu->reg + DMAR_RTADDR_REG); > + if (!addr) > + return; > + > + /* > + * In the case of kdump, ioremap is needed because root-entry table > + * exists in first kernel's memory area which is not mapped in second > + * kernel > + */ > + root = (struct root_entry*)ioremap(addr, PAGE_SIZE); > + if (!root) > + return; > + > + for (bus=0; bus + if (!root_present(&root[bus])) > + continue; > + > + context = (struct context_entry *)ioremap( > + root[bus].val & VTD_PAGE_MASK, PAGE_SIZE); > + if (!context) > + continue; > + > + for (devfn=0; devfn +
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
(2013/07/31 0:59), Bjorn Helgaas wrote: > On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh > wrote: >> (2013/07/29 23:17), Bjorn Helgaas wrote: >>> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh >>> wrote: (2013/07/26 2:00), Bjorn Helgaas wrote: > > My point about IOMMU and PCI initialization order doesn't go away just > because it doesn't fit "kdump policy." Having system initialization > occur in a logical order is far more important than making kdump work. My next plan is as follows. I think this is matched to logical order on boot. drivers/pci/pci.c - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus) drivers/iommu/intel-iommu.c - On initialization, if IOMMU is already enabled, call this bus reset function before disabling and re-enabling IOMMU. >>> >>> I raised this issue because of arches like sparc that enumerate the >>> IOMMU before the PCI devices that use it. In that situation, I think >>> you're proposing this: >>> >>> panic kernel >>> enable IOMMU >>> panic >>> kdump kernel >>> initialize IOMMU (already enabled) >>> pci_reset_bus >>> disable IOMMU >>> enable IOMMU >>> enumerate PCI devices >>> >>> But the problem is that when you call pci_reset_bus(), you haven't >>> enumerated the PCI devices, so you don't know what to reset. >> >> Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu >> initialization is based on the assumption that enumeration of PCI devices >> is already done. We can find target device from IOMMU page table instead >> of scanning all devices in pci tree. >> >> Therefore, this idea is only for intel-iommu. Other architectures need >> to implement their own reset code. > > That's my point. I'm opposed to adding code to PCI when it only > benefits x86 and we know other arches will need a fundamentally > different design. I would rather have a design that can work for all > arches. > > If your implementation is totally implemented under arch/x86 (or in > intel-iommu.c, I guess), I can't object as much. However, I think > that eventually even x86 should enumerate the IOMMUs via ACPI before > we enumerate PCI devices. > > It's pretty clear that's how BIOS designers expect the OS to work. > For example, sec 8.7.3 of the Intel Virtualization Technology for > Directed I/O spec, rev 1.3, shows the expectation that remapping > hardware (IOMMU) is initialized before discovering the I/O hierarchy > below a hot-added host bridge. Obviously you're not talking about a > hot-add scenario, but I think the same sequence should apply at > boot-time as well. Of course I won't add something just for x86 into common PCI layer. I attach my new patch, though it is not well tested yet. On x86, currently IOMMU initialization run *after* PCI enumeration, but what you are talking about is that it should be changed so that x86 IOMMU initialization is done *before* PCI enumeration like sparc, right? Hmm, ok, I think I need to post attached patch to iommu list and discuss it including current order of x86 IOMMU initialization. Thanks, Takao Indoh --- drivers/iommu/intel-iommu.c | 55 +- drivers/pci/pci.c | 53 include/linux/pci.h |1 3 files changed, 108 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index eec0d3e..fb8a546 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -3663,6 +3663,56 @@ static struct notifier_block device_nb = { .notifier_call = device_notifier, }; +/* Reset PCI device if its entry exists in DMAR table */ +static void __init iommu_reset_devices(struct intel_iommu *iommu, u16 segment) +{ + u64 addr; + struct root_entry *root; + struct context_entry *context; + int bus, devfn; + struct pci_dev *dev; + + addr = dmar_readq(iommu->reg + DMAR_RTADDR_REG); + if (!addr) + return; + + /* +* In the case of kdump, ioremap is needed because root-entry table +* exists in first kernel's memory area which is not mapped in second +* kernel +*/ + root = (struct root_entry*)ioremap(addr, PAGE_SIZE); + if (!root) + return; + + for (bus=0; busbus)) /* go to next bus */ + break; + else /* Try per-function reset */ + pci_reset_function(dev); + + } + iounmap(context); + } + iounmap(root); +} + int __init intel_iommu_init(void) { int ret = 0; @@ -3687,8 +3737,11 @@ int __init intel_iommu_init(void) continue; iommu = drhd->iommu; - if (iommu->gcmd & DMA_GCMD_TE) + if (iommu->gcmd & DMA_GCMD_TE) { +
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh wrote: > (2013/07/29 23:17), Bjorn Helgaas wrote: >> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh >> wrote: >>> (2013/07/26 2:00), Bjorn Helgaas wrote: My point about IOMMU and PCI initialization order doesn't go away just because it doesn't fit "kdump policy." Having system initialization occur in a logical order is far more important than making kdump work. >>> >>> My next plan is as follows. I think this is matched to logical order >>> on boot. >>> >>> drivers/pci/pci.c >>> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus) >>> >>> drivers/iommu/intel-iommu.c >>> - On initialization, if IOMMU is already enabled, call this bus reset >>>function before disabling and re-enabling IOMMU. >> >> I raised this issue because of arches like sparc that enumerate the >> IOMMU before the PCI devices that use it. In that situation, I think >> you're proposing this: >> >>panic kernel >> enable IOMMU >> panic >>kdump kernel >> initialize IOMMU (already enabled) >>pci_reset_bus >>disable IOMMU >>enable IOMMU >> enumerate PCI devices >> >> But the problem is that when you call pci_reset_bus(), you haven't >> enumerated the PCI devices, so you don't know what to reset. > > Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu > initialization is based on the assumption that enumeration of PCI devices > is already done. We can find target device from IOMMU page table instead > of scanning all devices in pci tree. > > Therefore, this idea is only for intel-iommu. Other architectures need > to implement their own reset code. That's my point. I'm opposed to adding code to PCI when it only benefits x86 and we know other arches will need a fundamentally different design. I would rather have a design that can work for all arches. If your implementation is totally implemented under arch/x86 (or in intel-iommu.c, I guess), I can't object as much. However, I think that eventually even x86 should enumerate the IOMMUs via ACPI before we enumerate PCI devices. It's pretty clear that's how BIOS designers expect the OS to work. For example, sec 8.7.3 of the Intel Virtualization Technology for Directed I/O spec, rev 1.3, shows the expectation that remapping hardware (IOMMU) is initialized before discovering the I/O hierarchy below a hot-added host bridge. Obviously you're not talking about a hot-add scenario, but I think the same sequence should apply at boot-time as well. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
(2013/07/29 23:17), Bjorn Helgaas wrote: > On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh > wrote: >> (2013/07/26 2:00), Bjorn Helgaas wrote: >>> On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh >>> wrote: Sorry for letting this discussion slide, I was busy on other works:-( Anyway, the summary of previous discussion is: - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on boot. This expects PCI enumeration is done before IOMMU initialization as follows. (1) PCI enumeration (2) fs_initcall ---> device reset (3) IOMMU initialization - This works on x86, but does not work on other architecture because IOMMU is initialized before PCI enumeration on some architectures. So, device reset should be done where IOMMU is initialized instead of initcall. - Or, as another idea, we can reset devices in first kernel(panic kernel) Resetting devices in panic kernel is against kdump policy and seems not to be good idea. So I think adding reset code into iommu initialization is better. I'll post patches for that. >>> >>> Of course nobody *wants* to do anything in the panic kernel. But >>> simply saying "it's against kdump policy and seems not to be a good >>> idea" is not a technical argument. There are things that are >>> impractical to do in the kdump kernel, so they have to be done in the >>> panic kernel even though we know the kernel is unreliable and the >>> attempt may fail. >> >> Accessing kernel data in panic kernel causes panic again, so >> - Don't touch kernel data in panic situation >> - Jump to kdump kernel as quickly as possible, and do things in safe >>kernel >> These are basic "kdump policy". Of course if there are any works which >> we cannot do in kdump kernel and can do only in panic kernel, for >> example saving registers or stopping cpus, we should do them in panic >> kernel. >> >> Resetting devices in panic kernel is worth considering if we can safely >> find pci_dev and reset it, but I have no idea how to do that because >> for example struct pci_dev may be borken. > > Nobody can guarantee that the panic kernel can do *anything* safely > because any arbitrary kernel data or text may be corrupted. But if > you consider any specific data structure, e.g., CPU or PCI device > lists, it's not very likely that it will be corrupted. To reset device we need to scan pci device tree using for_each_pci_dev. Something like bust_spinlocks() to clear pci_lock forcibly is needed. Vivek, adding these into kdump is acceptable for you? Or any other ideas? I think iterating over a list like for_each_pci_dev is dangerous. > >>> My point about IOMMU and PCI initialization order doesn't go away just >>> because it doesn't fit "kdump policy." Having system initialization >>> occur in a logical order is far more important than making kdump work. >> >> My next plan is as follows. I think this is matched to logical order >> on boot. >> >> drivers/pci/pci.c >> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus) >> >> drivers/iommu/intel-iommu.c >> - On initialization, if IOMMU is already enabled, call this bus reset >>function before disabling and re-enabling IOMMU. > > I raised this issue because of arches like sparc that enumerate the > IOMMU before the PCI devices that use it. In that situation, I think > you're proposing this: > >panic kernel > enable IOMMU > panic >kdump kernel > initialize IOMMU (already enabled) >pci_reset_bus >disable IOMMU >enable IOMMU > enumerate PCI devices > > But the problem is that when you call pci_reset_bus(), you haven't > enumerated the PCI devices, so you don't know what to reset. Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu initialization is based on the assumption that enumeration of PCI devices is already done. We can find target device from IOMMU page table instead of scanning all devices in pci tree. Therefore, this idea is only for intel-iommu. Other architectures need to implement their own reset code. Thanks, Takao Indoh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh wrote: > (2013/07/26 2:00), Bjorn Helgaas wrote: >> On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh >> wrote: >>> Sorry for letting this discussion slide, I was busy on other works:-( >>> Anyway, the summary of previous discussion is: >>> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on >>>boot. This expects PCI enumeration is done before IOMMU >>>initialization as follows. >>> (1) PCI enumeration >>> (2) fs_initcall ---> device reset >>> (3) IOMMU initialization >>> - This works on x86, but does not work on other architecture because >>>IOMMU is initialized before PCI enumeration on some architectures. So, >>>device reset should be done where IOMMU is initialized instead of >>>initcall. >>> - Or, as another idea, we can reset devices in first kernel(panic kernel) >>> >>> Resetting devices in panic kernel is against kdump policy and seems not to >>> be good idea. So I think adding reset code into iommu initialization is >>> better. I'll post patches for that. >> >> Of course nobody *wants* to do anything in the panic kernel. But >> simply saying "it's against kdump policy and seems not to be a good >> idea" is not a technical argument. There are things that are >> impractical to do in the kdump kernel, so they have to be done in the >> panic kernel even though we know the kernel is unreliable and the >> attempt may fail. > > Accessing kernel data in panic kernel causes panic again, so > - Don't touch kernel data in panic situation > - Jump to kdump kernel as quickly as possible, and do things in safe > kernel > These are basic "kdump policy". Of course if there are any works which > we cannot do in kdump kernel and can do only in panic kernel, for > example saving registers or stopping cpus, we should do them in panic > kernel. > > Resetting devices in panic kernel is worth considering if we can safely > find pci_dev and reset it, but I have no idea how to do that because > for example struct pci_dev may be borken. Nobody can guarantee that the panic kernel can do *anything* safely because any arbitrary kernel data or text may be corrupted. But if you consider any specific data structure, e.g., CPU or PCI device lists, it's not very likely that it will be corrupted. >> My point about IOMMU and PCI initialization order doesn't go away just >> because it doesn't fit "kdump policy." Having system initialization >> occur in a logical order is far more important than making kdump work. > > My next plan is as follows. I think this is matched to logical order > on boot. > > drivers/pci/pci.c > - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus) > > drivers/iommu/intel-iommu.c > - On initialization, if IOMMU is already enabled, call this bus reset > function before disabling and re-enabling IOMMU. I raised this issue because of arches like sparc that enumerate the IOMMU before the PCI devices that use it. In that situation, I think you're proposing this: panic kernel enable IOMMU panic kdump kernel initialize IOMMU (already enabled) pci_reset_bus disable IOMMU enable IOMMU enumerate PCI devices But the problem is that when you call pci_reset_bus(), you haven't enumerated the PCI devices, so you don't know what to reset. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
(2013/07/26 2:00), Bjorn Helgaas wrote: > On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh > wrote: >> Sorry for letting this discussion slide, I was busy on other works:-( >> Anyway, the summary of previous discussion is: >> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on >>boot. This expects PCI enumeration is done before IOMMU >>initialization as follows. >> (1) PCI enumeration >> (2) fs_initcall ---> device reset >> (3) IOMMU initialization >> - This works on x86, but does not work on other architecture because >>IOMMU is initialized before PCI enumeration on some architectures. So, >>device reset should be done where IOMMU is initialized instead of >>initcall. >> - Or, as another idea, we can reset devices in first kernel(panic kernel) >> >> Resetting devices in panic kernel is against kdump policy and seems not to >> be good idea. So I think adding reset code into iommu initialization is >> better. I'll post patches for that. > > Of course nobody *wants* to do anything in the panic kernel. But > simply saying "it's against kdump policy and seems not to be a good > idea" is not a technical argument. There are things that are > impractical to do in the kdump kernel, so they have to be done in the > panic kernel even though we know the kernel is unreliable and the > attempt may fail. Accessing kernel data in panic kernel causes panic again, so - Don't touch kernel data in panic situation - Jump to kdump kernel as quickly as possible, and do things in safe kernel These are basic "kdump policy". Of course if there are any works which we cannot do in kdump kernel and can do only in panic kernel, for example saving registers or stopping cpus, we should do them in panic kernel. Resetting devices in panic kernel is worth considering if we can safely find pci_dev and reset it, but I have no idea how to do that because for example struct pci_dev may be borken. > > My point about IOMMU and PCI initialization order doesn't go away just > because it doesn't fit "kdump policy." Having system initialization > occur in a logical order is far more important than making kdump work. My next plan is as follows. I think this is matched to logical order on boot. drivers/pci/pci.c - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus) drivers/iommu/intel-iommu.c - On initialization, if IOMMU is already enabled, call this bus reset function before disabling and re-enabling IOMMU. Thanks, Takao Indoh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
(2013/07/25 23:24), Vivek Goyal wrote: > On Wed, Jul 24, 2013 at 03:29:58PM +0900, Takao Indoh wrote: >> Sorry for letting this discussion slide, I was busy on other works:-( >> Anyway, the summary of previous discussion is: >> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on >>boot. This expects PCI enumeration is done before IOMMU >>initialization as follows. >> (1) PCI enumeration >> (2) fs_initcall ---> device reset >> (3) IOMMU initialization >> - This works on x86, but does not work on other architecture because >>IOMMU is initialized before PCI enumeration on some architectures. So, >>device reset should be done where IOMMU is initialized instead of >>initcall. >> - Or, as another idea, we can reset devices in first kernel(panic kernel) >> >> Resetting devices in panic kernel is against kdump policy and seems not to >> be good idea. So I think adding reset code into iommu initialization is >> better. I'll post patches for that. > > I don't understand all the details but I agree that idea of trying to > reset IOMMU in crashed kernel might not fly. > >> >> Another discussion point is how to handle buggy devices. Resetting buggy >> devices makes system more unstable. One of ideas is using boot parameter >> so that user can choose to reset devices or not. > > So who would decide which device is buggy and don't reset it. Give > some details here. I found the case that kdump does not work after resetting devices and it works when removing reset patch. The cause of problem is a bug of PCIe switch chip. If there is boot parameter not to reset devices, user can use it as workaround. I think in this case we should add PCI quirk to avoid this buggy hardware, but we need to wait errata from vendor and it basically takes long time. > > Can't we simply blacklist associated module, so that it never loads > and then it never tries to reset the devices? > So you mean that device reset should be done on its driver loading? Thanks, Takao Indoh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh wrote: > Sorry for letting this discussion slide, I was busy on other works:-( > Anyway, the summary of previous discussion is: > - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on > boot. This expects PCI enumeration is done before IOMMU > initialization as follows. > (1) PCI enumeration > (2) fs_initcall ---> device reset > (3) IOMMU initialization > - This works on x86, but does not work on other architecture because > IOMMU is initialized before PCI enumeration on some architectures. So, > device reset should be done where IOMMU is initialized instead of > initcall. > - Or, as another idea, we can reset devices in first kernel(panic kernel) > > Resetting devices in panic kernel is against kdump policy and seems not to > be good idea. So I think adding reset code into iommu initialization is > better. I'll post patches for that. Of course nobody *wants* to do anything in the panic kernel. But simply saying "it's against kdump policy and seems not to be a good idea" is not a technical argument. There are things that are impractical to do in the kdump kernel, so they have to be done in the panic kernel even though we know the kernel is unreliable and the attempt may fail. My point about IOMMU and PCI initialization order doesn't go away just because it doesn't fit "kdump policy." Having system initialization occur in a logical order is far more important than making kdump work. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
On Wed, Jul 24, 2013 at 03:29:58PM +0900, Takao Indoh wrote: > Sorry for letting this discussion slide, I was busy on other works:-( > Anyway, the summary of previous discussion is: > - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on > boot. This expects PCI enumeration is done before IOMMU > initialization as follows. > (1) PCI enumeration > (2) fs_initcall ---> device reset > (3) IOMMU initialization > - This works on x86, but does not work on other architecture because > IOMMU is initialized before PCI enumeration on some architectures. So, > device reset should be done where IOMMU is initialized instead of > initcall. > - Or, as another idea, we can reset devices in first kernel(panic kernel) > > Resetting devices in panic kernel is against kdump policy and seems not to > be good idea. So I think adding reset code into iommu initialization is > better. I'll post patches for that. I don't understand all the details but I agree that idea of trying to reset IOMMU in crashed kernel might not fly. > > Another discussion point is how to handle buggy devices. Resetting buggy > devices makes system more unstable. One of ideas is using boot parameter > so that user can choose to reset devices or not. So who would decide which device is buggy and don't reset it. Give some details here. Can't we simply blacklist associated module, so that it never loads and then it never tries to reset the devices? Thanks Vivek -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
Sorry for letting this discussion slide, I was busy on other works:-( Anyway, the summary of previous discussion is: - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on boot. This expects PCI enumeration is done before IOMMU initialization as follows. (1) PCI enumeration (2) fs_initcall ---> device reset (3) IOMMU initialization - This works on x86, but does not work on other architecture because IOMMU is initialized before PCI enumeration on some architectures. So, device reset should be done where IOMMU is initialized instead of initcall. - Or, as another idea, we can reset devices in first kernel(panic kernel) Resetting devices in panic kernel is against kdump policy and seems not to be good idea. So I think adding reset code into iommu initialization is better. I'll post patches for that. Another discussion point is how to handle buggy devices. Resetting buggy devices makes system more unstable. One of ideas is using boot parameter so that user can choose to reset devices or not. An existed parameter "reset_devices" is not helpful for this purpose because it is always necessary for kdump and user cannot get rid of it. Introducing new boot parameter seems to be unpopular for maintainers. Any ideas? Thanks, Takao Indoh (2013/06/14 11:11), Takao Indoh wrote: > (2013/06/13 12:41), Bjorn Helgaas wrote: >> On Wed, Jun 12, 2013 at 8:44 PM, Takao Indoh >> wrote: >>> (2013/06/12 13:45), Bjorn Helgaas wrote: [+cc Vivek, Haren; sorry I didn't think to add you earlier] On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh wrote: > (2013/06/11 11:20), Bjorn Helgaas wrote: >> I'm not sure you need to reset legacy devices (or non-PCI devices) >> yet, but the current hook isn't anchored anywhere -- it's just an >> fs_initcall() that doesn't give the reader any clue about the >> connection between the reset and the problem it's solving. >> >> If we do something like this patch, I think it needs to be done at the >> point where we enable or disable the IOMMU. That way, it's connected >> to the important event, and there's a clue about how to make >> corresponding fixes for other IOMMUs. > > Ok. pci_iommu_init() is appropriate place to add this hook? I looked at various IOMMU init places today, and it's far more complicated and varied than I had hoped. This reset scheme depends on enumerating PCI devices before we initialize the IOMMU used by those devices. x86 works that way today, but not all architectures do (see the sparc pci_fire_pbm_init(), for >>> >>> Sorry, could you tell me which part depends on architecture? >> >> Your patch works if PCIe devices are reset before the kdump kernel >> enables the IOMMU. On x86, this is possible because PCI enumeration >> happens before the IOMMU initialization. On sparc, the IOMMU is >> initialized before PCI devices are enumerated, so there would still be >> a window where ongoing DMA could cause an IOMMU error. > > Ok, understood, thanks. > > Hmmm, it seems to be difficult to find out method which is common to > all architectures. So, what I can do for now is introducing reset scheme > which is only for x86. > > 1) Change this patch so that it work only on x86 platform. For example > call this reset code from x86_init.iommu.iommu_init() instead of > fs_initcall. > > Or another idea is: > > 2) Enumerate PCI devices in IOMMU layer. That is: > PCI layer > Just provide interface to reset given strcut pci_dev. Maybe > pci_reset_function() looks good for this purpose. > IOMMU layer > Determine which devices should be reset. On kernel boot, check if > IOMMU is already active or not, and if active, check IOMMU page > table and reset devices whose entry exists there. > >> Of course, it might be possible to reorganize the sparc code to to the >> IOMMU init *after* it enumerates PCI devices. But I think that change >> would be hard to justify. >> >> And I think even on x86, it would be better if we did the IOMMU init >> before PCI enumeration -- the PCI devices depend on the IOMMU, so >> logically the IOMMU should be initialized first so the PCI devices can >> be associated with it as they are enumerated. > > So third idea is: > > 3) Do reset before PCI enumeration(arch_initcall_sync or somewhere). We > need to implement new code to enumerate PCI devices and reset them > for this purpose. > > Idea 2 is not difficult to implement, but one problem is that this > method may be dangerous. We need to scan IOMMU page table which is used > in previous kernel, but it may be broken. Idea 3 seems to be difficult > to implement... > > >> example). And I think conceptually, the IOMMU should be enumerated and initialized *before* the devices that use it. So I'm uncomfortable with that aspect of this scheme. It would be at least conceivable to reset the de
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
(2013/06/13 12:41), Bjorn Helgaas wrote: > On Wed, Jun 12, 2013 at 8:44 PM, Takao Indoh > wrote: >> (2013/06/12 13:45), Bjorn Helgaas wrote: >>> [+cc Vivek, Haren; sorry I didn't think to add you earlier] >>> >>> On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh >>> wrote: (2013/06/11 11:20), Bjorn Helgaas wrote: >>> > I'm not sure you need to reset legacy devices (or non-PCI devices) > yet, but the current hook isn't anchored anywhere -- it's just an > fs_initcall() that doesn't give the reader any clue about the > connection between the reset and the problem it's solving. > > If we do something like this patch, I think it needs to be done at the > point where we enable or disable the IOMMU. That way, it's connected > to the important event, and there's a clue about how to make > corresponding fixes for other IOMMUs. Ok. pci_iommu_init() is appropriate place to add this hook? >>> >>> I looked at various IOMMU init places today, and it's far more >>> complicated and varied than I had hoped. >>> >>> This reset scheme depends on enumerating PCI devices before we >>> initialize the IOMMU used by those devices. x86 works that way today, >>> but not all architectures do (see the sparc pci_fire_pbm_init(), for >> >> Sorry, could you tell me which part depends on architecture? > > Your patch works if PCIe devices are reset before the kdump kernel > enables the IOMMU. On x86, this is possible because PCI enumeration > happens before the IOMMU initialization. On sparc, the IOMMU is > initialized before PCI devices are enumerated, so there would still be > a window where ongoing DMA could cause an IOMMU error. Ok, understood, thanks. Hmmm, it seems to be difficult to find out method which is common to all architectures. So, what I can do for now is introducing reset scheme which is only for x86. 1) Change this patch so that it work only on x86 platform. For example call this reset code from x86_init.iommu.iommu_init() instead of fs_initcall. Or another idea is: 2) Enumerate PCI devices in IOMMU layer. That is: PCI layer Just provide interface to reset given strcut pci_dev. Maybe pci_reset_function() looks good for this purpose. IOMMU layer Determine which devices should be reset. On kernel boot, check if IOMMU is already active or not, and if active, check IOMMU page table and reset devices whose entry exists there. > Of course, it might be possible to reorganize the sparc code to to the > IOMMU init *after* it enumerates PCI devices. But I think that change > would be hard to justify. > > And I think even on x86, it would be better if we did the IOMMU init > before PCI enumeration -- the PCI devices depend on the IOMMU, so > logically the IOMMU should be initialized first so the PCI devices can > be associated with it as they are enumerated. So third idea is: 3) Do reset before PCI enumeration(arch_initcall_sync or somewhere). We need to implement new code to enumerate PCI devices and reset them for this purpose. Idea 2 is not difficult to implement, but one problem is that this method may be dangerous. We need to scan IOMMU page table which is used in previous kernel, but it may be broken. Idea 3 seems to be difficult to implement... > >>> example). And I think conceptually, the IOMMU should be enumerated >>> and initialized *before* the devices that use it. >>> >>> So I'm uncomfortable with that aspect of this scheme. >>> >>> It would be at least conceivable to reset the devices in the system >>> kernel, before the kexec. I know we want to do as little as possible >>> in the crashing kernel, but it's at least a possibility, and it might >>> be cleaner. >> >> I bet this will be not accepted by kdump maintainer. Everything in panic >> kernel is unreliable. > > kdump is inherently unreliable. The kdump kernel doesn't start from > an arbitrary machine state. We don't expect it to tolerate all CPUs > running, for example. Maybe it should be expected to tolerate PCI > devices running, either. What I wanted to say is that any resources of first kernel are unreliable. Under panic situation, struct pci_dev tree may be broken, or pci_lock may be already hold by someone, etc. So, if we do this in first kernel, maybe kdump needs its own code to enumerate PCI devices and reset them. Vivek? Thanks, Takao Indoh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
On Wed, Jun 12, 2013 at 8:44 PM, Takao Indoh wrote: > (2013/06/12 13:45), Bjorn Helgaas wrote: >> [+cc Vivek, Haren; sorry I didn't think to add you earlier] >> >> On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh >> wrote: >>> (2013/06/11 11:20), Bjorn Helgaas wrote: >> I'm not sure you need to reset legacy devices (or non-PCI devices) yet, but the current hook isn't anchored anywhere -- it's just an fs_initcall() that doesn't give the reader any clue about the connection between the reset and the problem it's solving. If we do something like this patch, I think it needs to be done at the point where we enable or disable the IOMMU. That way, it's connected to the important event, and there's a clue about how to make corresponding fixes for other IOMMUs. >>> >>> Ok. pci_iommu_init() is appropriate place to add this hook? >> >> I looked at various IOMMU init places today, and it's far more >> complicated and varied than I had hoped. >> >> This reset scheme depends on enumerating PCI devices before we >> initialize the IOMMU used by those devices. x86 works that way today, >> but not all architectures do (see the sparc pci_fire_pbm_init(), for > > Sorry, could you tell me which part depends on architecture? Your patch works if PCIe devices are reset before the kdump kernel enables the IOMMU. On x86, this is possible because PCI enumeration happens before the IOMMU initialization. On sparc, the IOMMU is initialized before PCI devices are enumerated, so there would still be a window where ongoing DMA could cause an IOMMU error. Of course, it might be possible to reorganize the sparc code to to the IOMMU init *after* it enumerates PCI devices. But I think that change would be hard to justify. And I think even on x86, it would be better if we did the IOMMU init before PCI enumeration -- the PCI devices depend on the IOMMU, so logically the IOMMU should be initialized first so the PCI devices can be associated with it as they are enumerated. >> example). And I think conceptually, the IOMMU should be enumerated >> and initialized *before* the devices that use it. >> >> So I'm uncomfortable with that aspect of this scheme. >> >> It would be at least conceivable to reset the devices in the system >> kernel, before the kexec. I know we want to do as little as possible >> in the crashing kernel, but it's at least a possibility, and it might >> be cleaner. > > I bet this will be not accepted by kdump maintainer. Everything in panic > kernel is unreliable. kdump is inherently unreliable. The kdump kernel doesn't start from an arbitrary machine state. We don't expect it to tolerate all CPUs running, for example. Maybe it should be expected to tolerate PCI devices running, either. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
(2013/06/12 22:19), Don Dutile wrote: > On 06/11/2013 07:19 PM, Sumner, William wrote: >> >>> (2013/06/11 11:20), Bjorn Helgaas wrote: On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh wrote: > (2013/06/07 13:14), Bjorn Helgaas wrote: >> One thing I'm not sure about is that you are only resetting PCIe >> devices, but I don't think the problem is actually specific to PCIe, >> is it? I think the same issue could occur on any system with an >> IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but >> there are systems with IOMMUs for plain old PCI devices, e.g., >> PA-RISC. > > Right, this is not specific to PCIe. The reasons why the target is only > PCIe is just to make algorithm to reset simple. It is possible to reset > legacy PCI devices in my patch, but code becomes somewhat complicated. I > thought recently most systems used PCIe and there was little demand for > resetting legacy PCI. Therefore I decided not to reset legacy PCI > devices, but I'll do if there are requests :-) I'm not sure you need to reset legacy devices (or non-PCI devices) yet, but the current hook isn't anchored anywhere -- it's just an fs_initcall() that doesn't give the reader any clue about the connection between the reset and the problem it's solving. If we do something like this patch, I think it needs to be done at the point where we enable or disable the IOMMU. That way, it's connected to the important event, and there's a clue about how to make corresponding fixes for other IOMMUs. >>> >>> Ok. pci_iommu_init() is appropriate place to add this hook? >>> We already have a "reset_devices" boot option. This is for the same purpose, as far as I can tell, and I'm not sure there's value in having both "reset_devices" and "pci=pcie_reset_endpoint_devices". In fact, there's nothing specific even to PCI here. The Intel VT-d docs seem carefully written so they could apply to either PCIe or non-PCI devices. >>> >>> Yeah, I can integrate this option into reset_devices. The reason why >>> I separate them is to avoid regression. >>> >>> I have tested my patch on many machines and basically it worked, but I >>> found two machines where this reset patch caused problem. The first one >>> was caused by bug of raid card firmware. After updating firmware, this >>> patch worked. The second one was due to bug of PCIe switch chip. I >>> reported this bug to the vendor but it is not fixed yet. >>> >>> Anyway, this patch may cause problems on such a buggy machine, so I >>> introduced new boot parameter so that user can enable or disable this >>> function aside from reset_devices. >>> >>> Actually Vivek Goyal, kdump maintainer said same thing. He proposed using >>> reset_devices instead of adding new one, and using quirk or something to >>> avoid such a buggy devices. >> >> With respect to "and using quirk or something to avoid such buggy devices", >> I believe that it will be necessary to provide a mechanism for devices that >> need special handling to do the reset -- perhaps something like a list >> of tuples: (device_type, function_to_call) with a default function_to_call >> when the device_type is not found in the list. These functions would >> need to be physically separate from the device driver because if the device >> is present it needs to be reset even if the crash kernel chooses not to load >> the driver for that device. >> >>> >>> So, basically I agree with using reset_devices, but I want to prepare >>> workaround in case this reset causes something wrong. >>> >> I like the ability to specify the original "reset_devices" separately from >> invoking this new mechanism. With so many different uses for Linux in >> so many different environments and with so many different device drivers >> it seems reasonable to keep the ability to tell the device drivers to >> reset their devices -- instead of pulling the reset line on all devices. >> >> I also like the ability to invoke the new reset feature separately from >> telling the device drivers to do it. >> >>> >> I tried to make a list of the interesting scenarios and the events >> that are relevant to this problem: >> >>Case 1: IOMMU off in system, off in kdump kernel >> system kernel leaves IOMMU off >>DMA targets system-kernel memory >> kexec to kdump kernel (IOMMU off, devices untouched) >>DMA targets system-kernel memory (harmless) >> kdump kernel re-inits device >>DMA targets kdump-kernel memory >> >>Case 2: IOMMU off in system kernel, on in kdump kernel >> system kernel leaves IOMMU off >>DMA targets system-kernel memory >> kexec to kdump kernel (IOMMU off, devices untouched) >>DMA targets system-kernel memory (harmless) >> kdu
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
(2013/06/12 13:45), Bjorn Helgaas wrote: > [+cc Vivek, Haren; sorry I didn't think to add you earlier] > > On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh > wrote: >> (2013/06/11 11:20), Bjorn Helgaas wrote: > >>> I'm not sure you need to reset legacy devices (or non-PCI devices) >>> yet, but the current hook isn't anchored anywhere -- it's just an >>> fs_initcall() that doesn't give the reader any clue about the >>> connection between the reset and the problem it's solving. >>> >>> If we do something like this patch, I think it needs to be done at the >>> point where we enable or disable the IOMMU. That way, it's connected >>> to the important event, and there's a clue about how to make >>> corresponding fixes for other IOMMUs. >> >> Ok. pci_iommu_init() is appropriate place to add this hook? > > I looked at various IOMMU init places today, and it's far more > complicated and varied than I had hoped. > > This reset scheme depends on enumerating PCI devices before we > initialize the IOMMU used by those devices. x86 works that way today, > but not all architectures do (see the sparc pci_fire_pbm_init(), for Sorry, could you tell me which part depends on architecture? > example). And I think conceptually, the IOMMU should be enumerated > and initialized *before* the devices that use it. > > So I'm uncomfortable with that aspect of this scheme. > > It would be at least conceivable to reset the devices in the system > kernel, before the kexec. I know we want to do as little as possible > in the crashing kernel, but it's at least a possibility, and it might > be cleaner. I bet this will be not accepted by kdump maintainer. Everything in panic kernel is unreliable. Thanks, Takao Indoh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
On 06/11/2013 07:19 PM, Sumner, William wrote: (2013/06/11 11:20), Bjorn Helgaas wrote: On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh wrote: (2013/06/07 13:14), Bjorn Helgaas wrote: One thing I'm not sure about is that you are only resetting PCIe devices, but I don't think the problem is actually specific to PCIe, is it? I think the same issue could occur on any system with an IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but there are systems with IOMMUs for plain old PCI devices, e.g., PA-RISC. Right, this is not specific to PCIe. The reasons why the target is only PCIe is just to make algorithm to reset simple. It is possible to reset legacy PCI devices in my patch, but code becomes somewhat complicated. I thought recently most systems used PCIe and there was little demand for resetting legacy PCI. Therefore I decided not to reset legacy PCI devices, but I'll do if there are requests :-) I'm not sure you need to reset legacy devices (or non-PCI devices) yet, but the current hook isn't anchored anywhere -- it's just an fs_initcall() that doesn't give the reader any clue about the connection between the reset and the problem it's solving. If we do something like this patch, I think it needs to be done at the point where we enable or disable the IOMMU. That way, it's connected to the important event, and there's a clue about how to make corresponding fixes for other IOMMUs. Ok. pci_iommu_init() is appropriate place to add this hook? We already have a "reset_devices" boot option. This is for the same purpose, as far as I can tell, and I'm not sure there's value in having both "reset_devices" and "pci=pcie_reset_endpoint_devices". In fact, there's nothing specific even to PCI here. The Intel VT-d docs seem carefully written so they could apply to either PCIe or non-PCI devices. Yeah, I can integrate this option into reset_devices. The reason why I separate them is to avoid regression. I have tested my patch on many machines and basically it worked, but I found two machines where this reset patch caused problem. The first one was caused by bug of raid card firmware. After updating firmware, this patch worked. The second one was due to bug of PCIe switch chip. I reported this bug to the vendor but it is not fixed yet. Anyway, this patch may cause problems on such a buggy machine, so I introduced new boot parameter so that user can enable or disable this function aside from reset_devices. Actually Vivek Goyal, kdump maintainer said same thing. He proposed using reset_devices instead of adding new one, and using quirk or something to avoid such a buggy devices. With respect to "and using quirk or something to avoid such buggy devices", I believe that it will be necessary to provide a mechanism for devices that need special handling to do the reset -- perhaps something like a list of tuples: (device_type, function_to_call) with a default function_to_call when the device_type is not found in the list. These functions would need to be physically separate from the device driver because if the device is present it needs to be reset even if the crash kernel chooses not to load the driver for that device. So, basically I agree with using reset_devices, but I want to prepare workaround in case this reset causes something wrong. I like the ability to specify the original "reset_devices" separately from invoking this new mechanism. With so many different uses for Linux in so many different environments and with so many different device drivers it seems reasonable to keep the ability to tell the device drivers to reset their devices -- instead of pulling the reset line on all devices. I also like the ability to invoke the new reset feature separately from telling the device drivers to do it. I tried to make a list of the interesting scenarios and the events that are relevant to this problem: Case 1: IOMMU off in system, off in kdump kernel system kernel leaves IOMMU off DMA targets system-kernel memory kexec to kdump kernel (IOMMU off, devices untouched) DMA targets system-kernel memory (harmless) kdump kernel re-inits device DMA targets kdump-kernel memory Case 2: IOMMU off in system kernel, on in kdump kernel system kernel leaves IOMMU off DMA targets system-kernel memory kexec to kdump kernel (IOMMU off, devices untouched) DMA targets system-kernel memory (harmless) kdump kernel enables IOMMU with no valid mappings DMA causes IOMMU errors (annoying but harmless) kdump kernel re-inits device DMA targets IOMMU-mapped kdump-kernel memory Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU system kernel enables IOMMU DMA targets IOMMU-mapped system-kernel memory kexec to kdump kernel (IOMMU on, devices untouched) DMA targets IOMMU-mapped system-
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
[+cc Vivek, Haren; sorry I didn't think to add you earlier] On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh wrote: > (2013/06/11 11:20), Bjorn Helgaas wrote: >> I'm not sure you need to reset legacy devices (or non-PCI devices) >> yet, but the current hook isn't anchored anywhere -- it's just an >> fs_initcall() that doesn't give the reader any clue about the >> connection between the reset and the problem it's solving. >> >> If we do something like this patch, I think it needs to be done at the >> point where we enable or disable the IOMMU. That way, it's connected >> to the important event, and there's a clue about how to make >> corresponding fixes for other IOMMUs. > > Ok. pci_iommu_init() is appropriate place to add this hook? I looked at various IOMMU init places today, and it's far more complicated and varied than I had hoped. This reset scheme depends on enumerating PCI devices before we initialize the IOMMU used by those devices. x86 works that way today, but not all architectures do (see the sparc pci_fire_pbm_init(), for example). And I think conceptually, the IOMMU should be enumerated and initialized *before* the devices that use it. So I'm uncomfortable with that aspect of this scheme. It would be at least conceivable to reset the devices in the system kernel, before the kexec. I know we want to do as little as possible in the crashing kernel, but it's at least a possibility, and it might be cleaner. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
[Your quoting is messed up below] >> This is Takao's text > This is Bill's text On Tue, Jun 11, 2013 at 5:19 PM, Sumner, William wrote: >>(2013/06/11 11:20), Bjorn Helgaas wrote: >>> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh >>> wrote: >>Actually Vivek Goyal, kdump maintainer said same thing. He proposed using >>reset_devices instead of adding new one, and using quirk or something to >>avoid such a buggy devices. > > With respect to "and using quirk or something to avoid such buggy devices", > I believe that it will be necessary to provide a mechanism for devices that > need special handling to do the reset ... You mean something like pci_dev_specific_reset()? >>So, basically I agree with using reset_devices, but I want to prepare >>workaround in case this reset causes something wrong. >> > I like the ability to specify the original "reset_devices" separately from > invoking this new mechanism. With so many different uses for Linux in > so many different environments and with so many different device drivers > it seems reasonable to keep the ability to tell the device drivers to > reset their devices -- instead of pulling the reset line on all devices. The problem with adding new options for this and that is it confuses users and fragments testing. Users already randomly try options and publish combinations that happen to work as "solutions." Then we don't get problem reports and don't get a chance to fix things properly. The ideal OS would have zero options and be able to figure everything out by itself. So that's my bias about giving users flexibility by adding new options -- I don't like it, and I think we have to accept some lack of flexibility to keep the system complexity tractable :) > ... > Thinking out of the box: > Much of the discussion about dealing with the ongoing DMA leftover > from the system kernel has assumed that the crash kernel will reset > the IOMMU -- which causes various problems if done while there is any DMA > still active -- which leads to the idea of stopping all of the DMA. > > Suppose the crash kernel does not reset the hardware IOMMU, but simply > detects that it is active, resets only the devices that are necessary > for the crash kernel to operate, and re-programs only the translations > for those devices. All other translations remain the same (and remain valid) > so all leftover DMA continues into its buffer in the system kernel area > where it is harmless. New translations needed by the kdump kernel are > added to the existing tables. > > I have not yet tried this, so I am not ready to propose it as anything more > than a discussion topic at this time. > > It might work this way: (A small modification to case 3a above) > > IOMMU on in system kernel, kdump kernel accepts active IOMMU >system kernel enables IOMMU > DMA targets IOMMU-mapped system-kernel memory >kexec to kdump kernel (IOMMU on, devices untouched) > DMA targets IOMMU-mapped system-kernel memory >kdump kernel detects active IOMMU and doesn't touch it > DMA targets IOMMU-mapped system-kernel memory >kdump kernel does not re-initialize IOMMU hardware >kdump kernel initializes IOMMU in-memory management structures >kdump kernel calls device drivers' standard initialization functions > Drivers initialize their own devices -- DMA from that device stops > When drivers request new DMA mappings, the kdump IOMMU driver: > 1. Updates its in-memory mgt structures for that device & range > 2. Updates IOMMU translate tables for that device & range > . Translations for all other devices & ranges are unchanged > 3. Flushes IOMMU TLB to force IOMMU hardware update This is certainly an interesting idea. It would require some additional smarts in the IOMMU driver to take over existing page tables. Those data structures from the system kernel are outside the memory map known to the kdump kernel, so there'd likely be VM issues there. The IOMMU driver would also have to be able to reconstruct the state for the bus address space allocator, e.g., use the I/O page tables to rebuild a bitmap of allocated space. We'll end up with a bunch of dma_map() calls from the system kernel that don't have corresponding dma_unmap()s. I guess when a driver initializes its device, there might have to be a new interface to "unmap any existing mappings for this device, even though I don't know what they are." I think it's possible, but it sounds like quite a lot of work and I'm not sure it's worth it for this relatively limited use case. And it's more IOMMU-specific than a "reset devices" strategy, so most of it would have to be done in each IOMMU driver. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
>(2013/06/11 11:20), Bjorn Helgaas wrote: >> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh >> wrote: >>> (2013/06/07 13:14), Bjorn Helgaas wrote: >> One thing I'm not sure about is that you are only resetting PCIe devices, but I don't think the problem is actually specific to PCIe, is it? I think the same issue could occur on any system with an IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but there are systems with IOMMUs for plain old PCI devices, e.g., PA-RISC. >>> >>> Right, this is not specific to PCIe. The reasons why the target is only >>> PCIe is just to make algorithm to reset simple. It is possible to reset >>> legacy PCI devices in my patch, but code becomes somewhat complicated. I >>> thought recently most systems used PCIe and there was little demand for >>> resetting legacy PCI. Therefore I decided not to reset legacy PCI >>> devices, but I'll do if there are requests :-) >> >> I'm not sure you need to reset legacy devices (or non-PCI devices) >> yet, but the current hook isn't anchored anywhere -- it's just an >> fs_initcall() that doesn't give the reader any clue about the >> connection between the reset and the problem it's solving. >> >> If we do something like this patch, I think it needs to be done at the >> point where we enable or disable the IOMMU. That way, it's connected >> to the important event, and there's a clue about how to make >> corresponding fixes for other IOMMUs. > >Ok. pci_iommu_init() is appropriate place to add this hook? > >> We already have a "reset_devices" boot option. This is for the same >> purpose, as far as I can tell, and I'm not sure there's value in >> having both "reset_devices" and "pci=pcie_reset_endpoint_devices". In >> fact, there's nothing specific even to PCI here. The Intel VT-d docs >> seem carefully written so they could apply to either PCIe or non-PCI >> devices. > >Yeah, I can integrate this option into reset_devices. The reason why >I separate them is to avoid regression. > >I have tested my patch on many machines and basically it worked, but I >found two machines where this reset patch caused problem. The first one >was caused by bug of raid card firmware. After updating firmware, this >patch worked. The second one was due to bug of PCIe switch chip. I >reported this bug to the vendor but it is not fixed yet. > >Anyway, this patch may cause problems on such a buggy machine, so I >introduced new boot parameter so that user can enable or disable this >function aside from reset_devices. > >Actually Vivek Goyal, kdump maintainer said same thing. He proposed using >reset_devices instead of adding new one, and using quirk or something to >avoid such a buggy devices. With respect to "and using quirk or something to avoid such buggy devices", I believe that it will be necessary to provide a mechanism for devices that need special handling to do the reset -- perhaps something like a list of tuples: (device_type, function_to_call) with a default function_to_call when the device_type is not found in the list. These functions would need to be physically separate from the device driver because if the device is present it needs to be reset even if the crash kernel chooses not to load the driver for that device. > >So, basically I agree with using reset_devices, but I want to prepare >workaround in case this reset causes something wrong. > I like the ability to specify the original "reset_devices" separately from invoking this new mechanism. With so many different uses for Linux in so many different environments and with so many different device drivers it seems reasonable to keep the ability to tell the device drivers to reset their devices -- instead of pulling the reset line on all devices. I also like the ability to invoke the new reset feature separately from telling the device drivers to do it. > >> I tried to make a list of the interesting scenarios and the events that are relevant to this problem: Case 1: IOMMU off in system, off in kdump kernel system kernel leaves IOMMU off DMA targets system-kernel memory kexec to kdump kernel (IOMMU off, devices untouched) DMA targets system-kernel memory (harmless) kdump kernel re-inits device DMA targets kdump-kernel memory Case 2: IOMMU off in system kernel, on in kdump kernel system kernel leaves IOMMU off DMA targets system-kernel memory kexec to kdump kernel (IOMMU off, devices untouched) DMA targets system-kernel memory (harmless) kdump kernel enables IOMMU with no valid mappings DMA causes IOMMU errors (annoying but harmless) kdump kernel re-inits device DMA targets IOMMU-mapped kdump-kernel memory Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU system kernel en
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
(2013/06/11 11:20), Bjorn Helgaas wrote: > On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh > wrote: >> (2013/06/07 13:14), Bjorn Helgaas wrote: > >>> One thing I'm not sure about is that you are only resetting PCIe >>> devices, but I don't think the problem is actually specific to PCIe, >>> is it? I think the same issue could occur on any system with an >>> IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but >>> there are systems with IOMMUs for plain old PCI devices, e.g., >>> PA-RISC. >> >> Right, this is not specific to PCIe. The reasons why the target is only >> PCIe is just to make algorithm to reset simple. It is possible to reset >> legacy PCI devices in my patch, but code becomes somewhat complicated. I >> thought recently most systems used PCIe and there was little demand for >> resetting legacy PCI. Therefore I decided not to reset legacy PCI >> devices, but I'll do if there are requests :-) > > I'm not sure you need to reset legacy devices (or non-PCI devices) > yet, but the current hook isn't anchored anywhere -- it's just an > fs_initcall() that doesn't give the reader any clue about the > connection between the reset and the problem it's solving. > > If we do something like this patch, I think it needs to be done at the > point where we enable or disable the IOMMU. That way, it's connected > to the important event, and there's a clue about how to make > corresponding fixes for other IOMMUs. Ok. pci_iommu_init() is appropriate place to add this hook? > We already have a "reset_devices" boot option. This is for the same > purpose, as far as I can tell, and I'm not sure there's value in > having both "reset_devices" and "pci=pcie_reset_endpoint_devices". In > fact, there's nothing specific even to PCI here. The Intel VT-d docs > seem carefully written so they could apply to either PCIe or non-PCI > devices. Yeah, I can integrate this option into reset_devices. The reason why I separate them is to avoid regression. I have tested my patch on many machines and basically it worked, but I found two machines where this reset patch caused problem. The first one was caused by bug of raid card firmware. After updating firmware, this patch worked. The second one was due to bug of PCIe switch chip. I reported this bug to the vendor but it is not fixed yet. Anyway, this patch may cause problems on such a buggy machine, so I introduced new boot parameter so that user can enable or disable this function aside from reset_devices. Actually Vivek Goyal, kdump maintainer said same thing. He proposed using reset_devices instead of adding new one, and using quirk or something to avoid such a buggy devices. So, basically I agree with using reset_devices, but I want to prepare workaround in case this reset causes something wrong. > >>> I tried to make a list of the interesting scenarios and the events >>> that are relevant to this problem: >>> >>> Case 1: IOMMU off in system, off in kdump kernel >>> system kernel leaves IOMMU off >>> DMA targets system-kernel memory >>> kexec to kdump kernel (IOMMU off, devices untouched) >>> DMA targets system-kernel memory (harmless) >>> kdump kernel re-inits device >>> DMA targets kdump-kernel memory >>> >>> Case 2: IOMMU off in system kernel, on in kdump kernel >>> system kernel leaves IOMMU off >>> DMA targets system-kernel memory >>> kexec to kdump kernel (IOMMU off, devices untouched) >>> DMA targets system-kernel memory (harmless) >>> kdump kernel enables IOMMU with no valid mappings >>> DMA causes IOMMU errors (annoying but harmless) >>> kdump kernel re-inits device >>> DMA targets IOMMU-mapped kdump-kernel memory >>> >>> Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU >>> system kernel enables IOMMU >>> DMA targets IOMMU-mapped system-kernel memory >>> kexec to kdump kernel (IOMMU on, devices untouched) >>> DMA targets IOMMU-mapped system-kernel memory >>> kdump kernel doesn't know about IOMMU or doesn't touch it >>> DMA targets IOMMU-mapped system-kernel memory >>> kdump kernel re-inits device >>> kernel assumes no IOMMU, so all new DMA mappings are invalid >>> because DMAs actually do go through the IOMMU >>> (** corruption or other non-recoverable error likely **) >>> >>> Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU >>> system kernel enables IOMMU >>> DMA targets IOMMU-mapped system-kernel memory >>> kexec to kdump kernel (IOMMU on, devices untouched) >>> DMA targets IOMMU-mapped system-kernel memory >>> kdump kernel disables IOMMU >>> DMA targets IOMMU-mapped system-kernel memory, but IOMMU is >>> disabled >>> (** corruption or other non-recoverable error likely **) >>> kdump kernel re-inits device
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh wrote: > (2013/06/07 13:14), Bjorn Helgaas wrote: >> One thing I'm not sure about is that you are only resetting PCIe >> devices, but I don't think the problem is actually specific to PCIe, >> is it? I think the same issue could occur on any system with an >> IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but >> there are systems with IOMMUs for plain old PCI devices, e.g., >> PA-RISC. > > Right, this is not specific to PCIe. The reasons why the target is only > PCIe is just to make algorithm to reset simple. It is possible to reset > legacy PCI devices in my patch, but code becomes somewhat complicated. I > thought recently most systems used PCIe and there was little demand for > resetting legacy PCI. Therefore I decided not to reset legacy PCI > devices, but I'll do if there are requests :-) I'm not sure you need to reset legacy devices (or non-PCI devices) yet, but the current hook isn't anchored anywhere -- it's just an fs_initcall() that doesn't give the reader any clue about the connection between the reset and the problem it's solving. If we do something like this patch, I think it needs to be done at the point where we enable or disable the IOMMU. That way, it's connected to the important event, and there's a clue about how to make corresponding fixes for other IOMMUs. We already have a "reset_devices" boot option. This is for the same purpose, as far as I can tell, and I'm not sure there's value in having both "reset_devices" and "pci=pcie_reset_endpoint_devices". In fact, there's nothing specific even to PCI here. The Intel VT-d docs seem carefully written so they could apply to either PCIe or non-PCI devices. >> I tried to make a list of the interesting scenarios and the events >> that are relevant to this problem: >> >> Case 1: IOMMU off in system, off in kdump kernel >>system kernel leaves IOMMU off >> DMA targets system-kernel memory >>kexec to kdump kernel (IOMMU off, devices untouched) >> DMA targets system-kernel memory (harmless) >>kdump kernel re-inits device >> DMA targets kdump-kernel memory >> >> Case 2: IOMMU off in system kernel, on in kdump kernel >>system kernel leaves IOMMU off >> DMA targets system-kernel memory >>kexec to kdump kernel (IOMMU off, devices untouched) >> DMA targets system-kernel memory (harmless) >>kdump kernel enables IOMMU with no valid mappings >> DMA causes IOMMU errors (annoying but harmless) >>kdump kernel re-inits device >> DMA targets IOMMU-mapped kdump-kernel memory >> >> Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU >>system kernel enables IOMMU >> DMA targets IOMMU-mapped system-kernel memory >>kexec to kdump kernel (IOMMU on, devices untouched) >> DMA targets IOMMU-mapped system-kernel memory >>kdump kernel doesn't know about IOMMU or doesn't touch it >> DMA targets IOMMU-mapped system-kernel memory >>kdump kernel re-inits device >> kernel assumes no IOMMU, so all new DMA mappings are invalid >> because DMAs actually do go through the IOMMU >> (** corruption or other non-recoverable error likely **) >> >> Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU >>system kernel enables IOMMU >> DMA targets IOMMU-mapped system-kernel memory >>kexec to kdump kernel (IOMMU on, devices untouched) >> DMA targets IOMMU-mapped system-kernel memory >>kdump kernel disables IOMMU >> DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled >> (** corruption or other non-recoverable error likely **) >>kdump kernel re-inits device >> DMA targets kdump-kernel memory >> >> Case 4: IOMMU on in system kernel, on in kdump kernel >>system kernel enables IOMMU >> DMA targets IOMMU-mapped system-kernel memory >>kexec to kdump kernel (IOMMU on, devices untouched) >> DMA targets IOMMU-mapped system-kernel memory >>kdump kernel enables IOMMU with no valid mappings >> DMA causes IOMMU errors (annoying but harmless) >>kdump kernel re-inits device >> DMA targets IOMMU-mapped kdump-kernel memory > > This is not harmless. Errors like PCI SERR are detected here, and it > makes driver or system unstable, and kdump fails. I also got report that > system hangs up due to this. OK, let's take this slowly. Does an IOMMU error in the system kernel also cause SERR or make the system unstable? Is that the expected behavior on IOMMU errors, or is there something special about the kdump scenario that causes SERRs? I see lots of DMAR errors, e.g., those in https://bugzilla.redhat.com/show_bug.cgi?id=743790, that are reported with printk and don't seem to cause an SERR. Maybe the SERR is system-specific behavior? ht
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
Thank you for your comments! (2013/06/07 13:14), Bjorn Helgaas wrote: > Sorry it's taken me so long to look at this. I've been putting this > off because the patch doesn't seem "obviously correct," and I don't > feel like I really understand the problem. I'm naive about both > IOMMUs and kexec/kdump, so please pardon (and help me with) any silly > questions or assumptions below. > > On Mon, May 13, 2013 at 11:29 PM, Takao Indoh > wrote: >> This patch resets PCIe devices on boot to stop ongoing DMA. When >> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered >> on each PCIe root port and downstream port to reset its downstream >> endpoint. >> >> Problem: >> This patch solves the problem that kdump can fail when intel_iommu=on is >> specified. When intel_iommu=on is specified, many dma-remapping errors >> occur in second kernel and it causes problems like driver error or PCI >> SERR, at last kdump fails. This problem is caused as follows. >> 1) Devices are working on first kernel. >> 2) Switch to second kernel(kdump kernel). The devices are still working >> and its DMA continues during this switch. >> 3) iommu is initialized during second kernel boot and ongoing DMA causes >> dma-remapping errors. > > If I understand correctly, the problem only happens on systems with an > IOMMU that's enabled in either the system or kdump kernel (or both). > For systems without an IOMMU (or if it is disabled in both the system > and kdump kernels), any ongoing DMA should use addresses that target > system-kernel memory and should not affect the kdump kernel. Yes, that's correct. > One thing I'm not sure about is that you are only resetting PCIe > devices, but I don't think the problem is actually specific to PCIe, > is it? I think the same issue could occur on any system with an > IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but > there are systems with IOMMUs for plain old PCI devices, e.g., > PA-RISC. Right, this is not specific to PCIe. The reasons why the target is only PCIe is just to make algorithm to reset simple. It is possible to reset legacy PCI devices in my patch, but code becomes somewhat complicated. I thought recently most systems used PCIe and there was little demand for resetting legacy PCI. Therefore I decided not to reset legacy PCI devices, but I'll do if there are requests :-) > > I tried to make a list of the interesting scenarios and the events > that are relevant to this problem: > > Case 1: IOMMU off in system, off in kdump kernel >system kernel leaves IOMMU off > DMA targets system-kernel memory >kexec to kdump kernel (IOMMU off, devices untouched) > DMA targets system-kernel memory (harmless) >kdump kernel re-inits device > DMA targets kdump-kernel memory > > Case 2: IOMMU off in system kernel, on in kdump kernel >system kernel leaves IOMMU off > DMA targets system-kernel memory >kexec to kdump kernel (IOMMU off, devices untouched) > DMA targets system-kernel memory (harmless) >kdump kernel enables IOMMU with no valid mappings > DMA causes IOMMU errors (annoying but harmless) >kdump kernel re-inits device > DMA targets IOMMU-mapped kdump-kernel memory > > Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU >system kernel enables IOMMU > DMA targets IOMMU-mapped system-kernel memory >kexec to kdump kernel (IOMMU on, devices untouched) > DMA targets IOMMU-mapped system-kernel memory >kdump kernel doesn't know about IOMMU or doesn't touch it > DMA targets IOMMU-mapped system-kernel memory >kdump kernel re-inits device > kernel assumes no IOMMU, so all new DMA mappings are invalid > because DMAs actually do go through the IOMMU > (** corruption or other non-recoverable error likely **) > > Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU >system kernel enables IOMMU > DMA targets IOMMU-mapped system-kernel memory >kexec to kdump kernel (IOMMU on, devices untouched) > DMA targets IOMMU-mapped system-kernel memory >kdump kernel disables IOMMU > DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled > (** corruption or other non-recoverable error likely **) >kdump kernel re-inits device > DMA targets kdump-kernel memory > > Case 4: IOMMU on in system kernel, on in kdump kernel >system kernel enables IOMMU > DMA targets IOMMU-mapped system-kernel memory >kexec to kdump kernel (IOMMU on, devices untouched) > DMA targets IOMMU-mapped system-kernel memory >kdump kernel enables IOMMU with no valid mappings > DMA causes IOMMU errors (annoying but harmless) This is not harmless. Errors like PCI SERR are detected here, and it makes driver or system unstable
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
Sorry it's taken me so long to look at this. I've been putting this off because the patch doesn't seem "obviously correct," and I don't feel like I really understand the problem. I'm naive about both IOMMUs and kexec/kdump, so please pardon (and help me with) any silly questions or assumptions below. On Mon, May 13, 2013 at 11:29 PM, Takao Indoh wrote: > This patch resets PCIe devices on boot to stop ongoing DMA. When > "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered > on each PCIe root port and downstream port to reset its downstream > endpoint. > > Problem: > This patch solves the problem that kdump can fail when intel_iommu=on is > specified. When intel_iommu=on is specified, many dma-remapping errors > occur in second kernel and it causes problems like driver error or PCI > SERR, at last kdump fails. This problem is caused as follows. > 1) Devices are working on first kernel. > 2) Switch to second kernel(kdump kernel). The devices are still working >and its DMA continues during this switch. > 3) iommu is initialized during second kernel boot and ongoing DMA causes >dma-remapping errors. If I understand correctly, the problem only happens on systems with an IOMMU that's enabled in either the system or kdump kernel (or both). For systems without an IOMMU (or if it is disabled in both the system and kdump kernels), any ongoing DMA should use addresses that target system-kernel memory and should not affect the kdump kernel. One thing I'm not sure about is that you are only resetting PCIe devices, but I don't think the problem is actually specific to PCIe, is it? I think the same issue could occur on any system with an IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but there are systems with IOMMUs for plain old PCI devices, e.g., PA-RISC. I tried to make a list of the interesting scenarios and the events that are relevant to this problem: Case 1: IOMMU off in system, off in kdump kernel system kernel leaves IOMMU off DMA targets system-kernel memory kexec to kdump kernel (IOMMU off, devices untouched) DMA targets system-kernel memory (harmless) kdump kernel re-inits device DMA targets kdump-kernel memory Case 2: IOMMU off in system kernel, on in kdump kernel system kernel leaves IOMMU off DMA targets system-kernel memory kexec to kdump kernel (IOMMU off, devices untouched) DMA targets system-kernel memory (harmless) kdump kernel enables IOMMU with no valid mappings DMA causes IOMMU errors (annoying but harmless) kdump kernel re-inits device DMA targets IOMMU-mapped kdump-kernel memory Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU system kernel enables IOMMU DMA targets IOMMU-mapped system-kernel memory kexec to kdump kernel (IOMMU on, devices untouched) DMA targets IOMMU-mapped system-kernel memory kdump kernel doesn't know about IOMMU or doesn't touch it DMA targets IOMMU-mapped system-kernel memory kdump kernel re-inits device kernel assumes no IOMMU, so all new DMA mappings are invalid because DMAs actually do go through the IOMMU (** corruption or other non-recoverable error likely **) Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU system kernel enables IOMMU DMA targets IOMMU-mapped system-kernel memory kexec to kdump kernel (IOMMU on, devices untouched) DMA targets IOMMU-mapped system-kernel memory kdump kernel disables IOMMU DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled (** corruption or other non-recoverable error likely **) kdump kernel re-inits device DMA targets kdump-kernel memory Case 4: IOMMU on in system kernel, on in kdump kernel system kernel enables IOMMU DMA targets IOMMU-mapped system-kernel memory kexec to kdump kernel (IOMMU on, devices untouched) DMA targets IOMMU-mapped system-kernel memory kdump kernel enables IOMMU with no valid mappings DMA causes IOMMU errors (annoying but harmless) kdump kernel re-inits device DMA targets IOMMU-mapped kdump-kernel memory The problem cases I see are 3a and 3b, but that's not the problem you're describing. Obviously I'm missing something. It sounds like you're seeing problems in case 2 or case 4, where the IOMMU is enabled in the kdump kernel. Maybe my assumption about the IOMMU being enabled with no valid mappings is wrong? Or maybe those IOMMU errors are not actually harmless? Resetting PCI/PCIe devices will help with cases 2, 4, and 3b, but not with case 3a. Therefore, it seems like the kdump kernel *must* contain IOMMU support unless it knows for certain that the system kernel wasn't using the IOMMU. Do you have any bugzilla references or problem report URLs you could include here? Obviously I'm very confused h
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
Ping Bjorn, could you please tell your impression on this idea? This fix is very important for kdump. For exmaple, kdump does not work if PCI passthrough is used on KVM guest. Please see patch description for details. If you don't agree this patch, please tell me what changes I should make. Thanks, Takao Indoh (2013/05/14 14:29), Takao Indoh wrote: > This patch resets PCIe devices on boot to stop ongoing DMA. When > "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered > on each PCIe root port and downstream port to reset its downstream > endpoint. > > Problem: > This patch solves the problem that kdump can fail when intel_iommu=on is > specified. When intel_iommu=on is specified, many dma-remapping errors > occur in second kernel and it causes problems like driver error or PCI > SERR, at last kdump fails. This problem is caused as follows. > 1) Devices are working on first kernel. > 2) Switch to second kernel(kdump kernel). The devices are still working > and its DMA continues during this switch. > 3) iommu is initialized during second kernel boot and ongoing DMA causes > dma-remapping errors. > > Solution: > All DMA transactions have to be stopped before iommu is initialized. By > this patch devices are reset and in-flight DMA is stopped before > pci_iommu_init. > > To invoke hot reset on an endpoint, its upstream link need to be reset. > reset_pcie_endpoints() is called from fs_initcall_sync, and it finds > root port/downstream port whose child is PCIe endpoint, and then reset > link between them. If the endpoint is VGA device, it is skipped because > the monitor blacks out if VGA controller is reset. > > Changelog: > v2: > - Read pci config before de-assert secondary bus reset to flush previous >write > - Change function/variable name > - Make a list of devices to be reset > > v1: > https://patchwork.kernel.org/patch/2482291/ > > Signed-off-by: Takao Indoh > --- > Documentation/kernel-parameters.txt |2 + > drivers/pci/pci.c | 113 > +++ > 2 files changed, 115 insertions(+), 0 deletions(-) > > diff --git a/Documentation/kernel-parameters.txt > b/Documentation/kernel-parameters.txt > index c3bfacb..8c9e8e4 100644 > --- a/Documentation/kernel-parameters.txt > +++ b/Documentation/kernel-parameters.txt > @@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be > entirely omitted. > any pair of devices, possibly at the cost of > reduced performance. This also guarantees > that hot-added devices will work. > + pcie_reset_endpoint_devices Reset PCIe endpoint on boot by > + hot reset > cbiosize=nn[KMG]The fixed amount of bus space which is > reserved for the CardBus bridge's IO window. > The default value is 256 bytes. > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index a899d8b..70c1205 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus) > } > EXPORT_SYMBOL(pci_fixup_cardbus); > > +/* > + * Return true if dev is PCIe root port or downstream port whose child is > PCIe > + * endpoint except VGA device. > + */ > +static int __init need_reset(struct pci_dev *dev) > +{ > + struct pci_bus *subordinate; > + struct pci_dev *child; > + > + if (!pci_is_pcie(dev) || !dev->subordinate || > + list_empty(&dev->subordinate->devices) || > + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) && > + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))) > + return 0; > + > + subordinate = dev->subordinate; > + list_for_each_entry(child, &subordinate->devices, bus_list) { > + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) || > + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) || > + ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY)) > + /* Don't reset switch, bridge, VGA device */ > + return 0; > + } > + > + return 1; > +} > + > +static void __init save_downstream_configs(struct pci_dev *dev) > +{ > + struct pci_bus *subordinate; > + struct pci_dev *child; > + > + subordinate = dev->subordinate; > + list_for_each_entry(child, &subordinate->devices, bus_list) { > + dev_info(&child->dev, "save state\n"); > + pci_save_state(child); > + } > +} > + > +static void __init restore_downstream_configs(struct pci_dev *dev) > +{ > + struct pci_bus *subordinate; > + struct pci_dev *child; > + > + subordinate = dev->subordinate; > + list_for_each_entry(child, &subordinate->devices, bus_list) { > + dev_info(&child->dev, "restore state\n"); > + pci_restore_state(child); > + } > +}
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
Hi Bjorn, Any comments, ack/nack? Thanks, Takao Indoh (2013/05/15 7:04), Eric W. Biederman wrote: > Takao Indoh writes: > >> This patch resets PCIe devices on boot to stop ongoing DMA. When >> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered >> on each PCIe root port and downstream port to reset its downstream >> endpoint. >> >> Problem: >> This patch solves the problem that kdump can fail when intel_iommu=on is >> specified. When intel_iommu=on is specified, many dma-remapping errors >> occur in second kernel and it causes problems like driver error or PCI >> SERR, at last kdump fails. This problem is caused as follows. >> 1) Devices are working on first kernel. >> 2) Switch to second kernel(kdump kernel). The devices are still working >> and its DMA continues during this switch. >> 3) iommu is initialized during second kernel boot and ongoing DMA causes >> dma-remapping errors. >> >> Solution: >> All DMA transactions have to be stopped before iommu is initialized. By >> this patch devices are reset and in-flight DMA is stopped before >> pci_iommu_init. >> >> To invoke hot reset on an endpoint, its upstream link need to be reset. >> reset_pcie_endpoints() is called from fs_initcall_sync, and it finds >> root port/downstream port whose child is PCIe endpoint, and then reset >> link between them. If the endpoint is VGA device, it is skipped because >> the monitor blacks out if VGA controller is reset. > > At a quick skim this patch looks reasonable. > > Acked-by: "Eric W. Biederman" > >> Changelog: >> v2: >> - Read pci config before de-assert secondary bus reset to flush previous >>write >> - Change function/variable name >> - Make a list of devices to be reset >> >> v1: >> https://patchwork.kernel.org/patch/2482291/ >> >> Signed-off-by: Takao Indoh >> --- >> Documentation/kernel-parameters.txt |2 + >> drivers/pci/pci.c | 113 >> +++ >> 2 files changed, 115 insertions(+), 0 deletions(-) >> >> diff --git a/Documentation/kernel-parameters.txt >> b/Documentation/kernel-parameters.txt >> index c3bfacb..8c9e8e4 100644 >> --- a/Documentation/kernel-parameters.txt >> +++ b/Documentation/kernel-parameters.txt >> @@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be >> entirely omitted. >> any pair of devices, possibly at the cost of >> reduced performance. This also guarantees >> that hot-added devices will work. >> +pcie_reset_endpoint_devices Reset PCIe endpoint on boot by >> +hot reset >> cbiosize=nn[KMG]The fixed amount of bus space which is >> reserved for the CardBus bridge's IO window. >> The default value is 256 bytes. >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >> index a899d8b..70c1205 100644 >> --- a/drivers/pci/pci.c >> +++ b/drivers/pci/pci.c >> @@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus) >> } >> EXPORT_SYMBOL(pci_fixup_cardbus); >> >> +/* >> + * Return true if dev is PCIe root port or downstream port whose child is >> PCIe >> + * endpoint except VGA device. >> + */ >> +static int __init need_reset(struct pci_dev *dev) >> +{ >> +struct pci_bus *subordinate; >> +struct pci_dev *child; >> + >> +if (!pci_is_pcie(dev) || !dev->subordinate || >> +list_empty(&dev->subordinate->devices) || >> +((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) && >> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))) >> +return 0; >> + >> +subordinate = dev->subordinate; >> +list_for_each_entry(child, &subordinate->devices, bus_list) { >> +if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) || >> +(pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) || >> +((child->class >> 16) == PCI_BASE_CLASS_DISPLAY)) >> +/* Don't reset switch, bridge, VGA device */ >> +return 0; >> +} >> + >> +return 1; >> +} >> + >> +static void __init save_downstream_configs(struct pci_dev *dev) >> +{ >> +struct pci_bus *subordinate; >> +struct pci_dev *child; >> + >> +subordinate = dev->subordinate; >> +list_for_each_entry(child, &subordinate->devices, bus_list) { >> +dev_info(&child->dev, "save state\n"); >> +pci_save_state(child); >> +} >> +} >> + >> +static void __init restore_downstream_configs(struct pci_dev *dev) >> +{ >> +struct pci_bus *subordinate; >> +struct pci_dev *child; >> + >> +subordinate = dev->subordinate; >> +list_for_each_entry(child, &subordinate->devices, bus_list) { >> +dev_info(&child->dev, "restore state\n"); >> +pci_restore_state(child); >> +} >> +} >> + >> +static void __init do_downstream_device_reset(struct pci_dev
Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
Takao Indoh writes: > This patch resets PCIe devices on boot to stop ongoing DMA. When > "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered > on each PCIe root port and downstream port to reset its downstream > endpoint. > > Problem: > This patch solves the problem that kdump can fail when intel_iommu=on is > specified. When intel_iommu=on is specified, many dma-remapping errors > occur in second kernel and it causes problems like driver error or PCI > SERR, at last kdump fails. This problem is caused as follows. > 1) Devices are working on first kernel. > 2) Switch to second kernel(kdump kernel). The devices are still working >and its DMA continues during this switch. > 3) iommu is initialized during second kernel boot and ongoing DMA causes >dma-remapping errors. > > Solution: > All DMA transactions have to be stopped before iommu is initialized. By > this patch devices are reset and in-flight DMA is stopped before > pci_iommu_init. > > To invoke hot reset on an endpoint, its upstream link need to be reset. > reset_pcie_endpoints() is called from fs_initcall_sync, and it finds > root port/downstream port whose child is PCIe endpoint, and then reset > link between them. If the endpoint is VGA device, it is skipped because > the monitor blacks out if VGA controller is reset. At a quick skim this patch looks reasonable. Acked-by: "Eric W. Biederman" > Changelog: > v2: > - Read pci config before de-assert secondary bus reset to flush previous > write > - Change function/variable name > - Make a list of devices to be reset > > v1: > https://patchwork.kernel.org/patch/2482291/ > > Signed-off-by: Takao Indoh > --- > Documentation/kernel-parameters.txt |2 + > drivers/pci/pci.c | 113 > +++ > 2 files changed, 115 insertions(+), 0 deletions(-) > > diff --git a/Documentation/kernel-parameters.txt > b/Documentation/kernel-parameters.txt > index c3bfacb..8c9e8e4 100644 > --- a/Documentation/kernel-parameters.txt > +++ b/Documentation/kernel-parameters.txt > @@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be > entirely omitted. > any pair of devices, possibly at the cost of > reduced performance. This also guarantees > that hot-added devices will work. > + pcie_reset_endpoint_devices Reset PCIe endpoint on boot by > + hot reset > cbiosize=nn[KMG]The fixed amount of bus space which is > reserved for the CardBus bridge's IO window. > The default value is 256 bytes. > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index a899d8b..70c1205 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus) > } > EXPORT_SYMBOL(pci_fixup_cardbus); > > +/* > + * Return true if dev is PCIe root port or downstream port whose child is > PCIe > + * endpoint except VGA device. > + */ > +static int __init need_reset(struct pci_dev *dev) > +{ > + struct pci_bus *subordinate; > + struct pci_dev *child; > + > + if (!pci_is_pcie(dev) || !dev->subordinate || > + list_empty(&dev->subordinate->devices) || > + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) && > + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))) > + return 0; > + > + subordinate = dev->subordinate; > + list_for_each_entry(child, &subordinate->devices, bus_list) { > + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) || > + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) || > + ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY)) > + /* Don't reset switch, bridge, VGA device */ > + return 0; > + } > + > + return 1; > +} > + > +static void __init save_downstream_configs(struct pci_dev *dev) > +{ > + struct pci_bus *subordinate; > + struct pci_dev *child; > + > + subordinate = dev->subordinate; > + list_for_each_entry(child, &subordinate->devices, bus_list) { > + dev_info(&child->dev, "save state\n"); > + pci_save_state(child); > + } > +} > + > +static void __init restore_downstream_configs(struct pci_dev *dev) > +{ > + struct pci_bus *subordinate; > + struct pci_dev *child; > + > + subordinate = dev->subordinate; > + list_for_each_entry(child, &subordinate->devices, bus_list) { > + dev_info(&child->dev, "restore state\n"); > + pci_restore_state(child); > + } > +} > + > +static void __init do_downstream_device_reset(struct pci_dev *dev) > +{ > + u16 ctrl; > + > + dev_info(&dev->dev, "Reset Secondary bus\n"); > + > + /* Assert Secondary Bus Reset */ > + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl); > + ct