Sorry for the delayed response.
(2013/04/30 23:54), Sumner, William wrote:
> I have installed your original patch set (from last November) and tested with
> three platforms, each with a different IO configuration. On the first
> platform crashdumps were consistently successful. On the second and third
> platforms, the reset of one specific PCI device on each platform (a different
> device type on each platform) immediately hung the crash kernel. This was
> consistent across multiple tries. Temporarily modifying the patch to skip
> resetting the one problem-device on each platform allowed each platform to
> create a dump successfully.
>
> I am now working with our IO team to determine the exact nature of the hang
> and to see if the hang is unique to the specific cards or the specific
> platforms.
>
> Also I am starting to look at an alternate possibility:
>
> The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level
> Reset) describes reset use-cases for "a partitioned environment where
> hardware is migrated from one partition to another" and for "system software
> is taking down the software stack for a Function and then rebuilding that
> stack".
>
> These use-cases seem very similar to transitioning into the crash kernel.
> The "Implementation Note" at the end of that section describes an algorithm
> for "Avoiding Data Corruption From Stale Completions" which looks like it
> might be applicable to stopping ongoing DMA. I am hopeful that adding some
> of the steps from this algorithm to one of the already proposed patches would
> avoid the hang that I saw on two platforms.
It seems that the algorithm you mentioned requires four steps.
1. Clear Command register
2. Wait a few milliseconds (Or polling Transactions Pending bit)
3. Do FLR
4. Wait 100 ms
My patch does not do step 1 and 2. So, I would appreciate it if you
could add the following into save_config() in my latest patch and
confirm if kernel still hangs up or not.
subordinate = dev->subordinate;
list_for_each_entry(child, &subordinate->devices, bus_list) {
dev_info(&child->dev, "save state\n");
pci_save_state(child);
+ pci_write_config_word(child, PCI_COMMAND, 0);
+ msleep(1000);
}
Thanks,
Takao Indoh
>
> Bill Sumner
>
> -----Original Message-----
> From: kexec [mailto:[email protected]] On Behalf Of Don Dutile
> Sent: Friday, April 26, 2013 8:43 AM
> To: Takao Indoh
> Cc: [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA
>
> On 04/25/2013 11:10 PM, Takao Indoh wrote:
>> (2013/04/26 3:01), Don Dutile wrote:
>>> On 04/25/2013 01:11 AM, Takao Indoh wrote:
>>>> (2013/04/25 4:59), Don Dutile wrote:
>>>>> On 04/24/2013 12:58 AM, Takao Indoh wrote:
>>>>>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>>>>>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
>>>>>> PCIe root port and downstream port to reset its downstream endpoint.
>>>>>>
>>>>>> Problem:
>>>>>> This patch solves the problem that kdump can fail when intel_iommu=on is
>>>>>> specified. When intel_iommu=on is specified, many dma-remapping errors
>>>>>> occur in second kernel and it causes problems like driver error or PCI
>>>>>> SERR, at last kdump fails. This problem is caused as follows.
>>>>>> 1) Devices are working on first kernel.
>>>>>> 2) Switch to second kernel(kdump kernel). The devices are still working
>>>>>> and its DMA continues during this switch.
>>>>>> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>>>>>> dma-remapping errors.
>>>>>>
>>>>>> Solution:
>>>>>> All DMA transactions have to be stopped before iommu is initialized. By
>>>>>> this patch devices are reset and in-flight DMA is stopped before
>>>>>> pci_iommu_init.
>>>>>>
>>>>>> To invoke hot reset on an endpoint, its upstream link need to be reset.
>>>>>> reset_pcie_devices() is called from fs_initcall_sync, and it finds root
>>>>>> port/downstream port whose child is PCIe endpoint, and then reset link
>>>>>> between them. If the endpoint is VGA device, it is skipped because the
>>>>>> monitor blacks out if VGA controller is reset.
>>>>>>
>>>>> Couple questions wrt VGA device:
>>>>> (1) Many graphics devices are multi-function, one function being VGA;
>>>>> is the VGA always function 0, so this scan sees it first&
>>>>> doesn't
>>>>> do a reset on that PCIe link? if the VGA is not function 0,
>>>>> won't
>>>>> this logic break (will reset b/c function 0 is non-VGA graphics)
>>>>> ?
>>>>
>>>> VGA is not reset irrespective of its function number. The logic of this
>>>> patch is:
>>>>
>>>> for_each_pci_dev(dev) {
>>>> if (dev is not PCIe)
>>>> continue;
>>>> if (dev is not root port/downstream port) ---(1)
>>>> continue;
>>>> list_for_each_entry(child,&dev->subordinate->devices, bus_list) {
>>>> if (child is upstream port or bridge or VGA) ---(2)
>>>> continue;
>>>> }
>>>> do_reset_its_child(dev);
>>>> }
>>>>
>>>> Therefore VGA itself is skipped by (1), and upstream device(root port or
>>>> downstream port) of VGA is also skipped by (2).
>>>>
>>>>
>>>>> (2) I'm hearing VGA will soon not be the a required console; this logic
>>>>> assumes it is, and why it isn't blanked.
>>>>> Q: Should the filter be based on a device having a device-class
>>>>> of display ?
>>>>
>>>> I want to avoid the situation that user's monitor blacks out and user
>>>> cannot know what's going on. That's reason why I introduced the logic to
>>>> skip VGA. As far as I tested the logic based on device-class works well,
>>> sorry, I read your description, which said VGA, but your are filtering on
>>> display class,
>>> which includes non-VGA as well. So, all set ... but large, (x16) non-VGA
>>> display devices
>>> are probably one of the most aggressive DMA engines on a system.... and
>>> will grow as
>>> asymmetric processing using GPUs gets architected into a device-agnostic
>>> manner.
>>> So, this may work well for servers, which is the primary consumer/user of
>>> this feature,
>>> and they typically have built-in graphics that are generally used in simple
>>> VGA mode,
>>> so this may be sufficient for now.
>>
>> Ok, understood.
>>
>>
>>>
>>>> but I would appreciate it if there are better ways.
>>>>
>>> You probably don't want to hear it but....
>>> a) only turn off cmd-reg master enable bit
>>> b) only do reset based on a list of devices known not to
>>> obey their cmd-reg master enable bit, and only do reset to those
>>> devices.
>>> But, given the testing you've done so far, this optional (need cmdline)
>>> feature,
>>> let's start here.
>>
>> Ok. Either way I think we need more testing.
>>
>>
>>>>>
>>>>>> Actually this is v8 patch but quite different from v7 and it's been so
>>>>>> long since previous post, so I start over again.
>>>>> Thanks for this re-start. I need to continue reviewing the rest.
>>>>
>>>> Thank you for your review!
>>>>
>>>>>
>>>>> Q: Why not force IOMMU off when re-booting a kexec kernel to perform a
>>>>> crash
>>>>> dump? After the crash dump, the system is rebooting to previous
>>>>> (iommu=on) setting.
>>>>> That logic, along w/your previous patch to disable the IOMMU if
>>>>> iommu=off
>>>>> is set, would remove this (relatively slow) PCI init sequencing ?
>>>>
>>>> To force iommu off, all ongoing DMA have to be stopped before that since
>>>> they are accessing the device address, not physical address. If we disable
>>>> iommu without stopping in-flihgt DMA, devices access invalid memory area
>>>> and it causes memory corruption or PCI-SERR due to DMA error.
>>> Right, that's a 'duh' on my part.
>>> I thought 'disable iommu' == 'block all dma' and it just turns it off&
>>> let's the ongoing DMA run...
>>> Please ignore this question... sigh.
>>>
>>>>
>>>> So, whether we use iommu or not in second kernel, we have to stop DMA in
>>>> second kernel if iommu is used in first kernel.
>>>>
>>>> Thanks,
>>>> Takao Indoh
>>>>
>>>>
>>>>>
>>>>>> Previous post:
>>>>>> [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
>>>>>> https://lkml.org/lkml/2012/11/26/814
>>>>>>
>>>>>> Signed-off-by: Takao Indoh<[email protected]>
>>>>>> ---
>>>>>> Documentation/kernel-parameters.txt | 2 +
>>>>>> drivers/pci/pci.c | 103
>>>>>> +++++++++++++++++++++++++++++++++++
>>>>>> 2 files changed, 105 insertions(+), 0 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/kernel-parameters.txt
>>>>>> b/Documentation/kernel-parameters.txt
>>>>>> index 4609e81..2a31ade 100644
>>>>>> --- a/Documentation/kernel-parameters.txt
>>>>>> +++ b/Documentation/kernel-parameters.txt
>>>>>> @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also
>>>>>> be entirely omitted.
>>>>>> any pair of devices, possibly at the cost of
>>>>>> reduced performance. This also guarantees
>>>>>> that hot-added devices will work.
>>>>>> + pcie_reset_devices Reset PCIe endpoint on boot by hot
>>>>>> + reset
>>>>>> cbiosize=nn[KMG] The fixed amount of bus space which is
>>>>>> reserved for the CardBus bridge's IO window.
>>>>>> The default value is 256 bytes.
>>>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>>>>> index b099e00..42385c9 100644
>>>>>> --- a/drivers/pci/pci.c
>>>>>> +++ b/drivers/pci/pci.c
>>>>>> @@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus
>>>>>> *bus)
>>>>>> }
>>>>>> EXPORT_SYMBOL(pci_fixup_cardbus);
>>>>>>
>>>>>> +/*
>>>>>> + * Return true if dev is PCIe root port or downstream port whose child
>>>>>> is PCIe
>>>>>> + * endpoint except VGA device.
>>>>>> + */
>>>>>> +static int __init need_reset(struct pci_dev *dev)
>>>>>> +{
>>>>>> + struct pci_bus *subordinate;
>>>>>> + struct pci_dev *child;
>>>>>> +
>>>>>> + if (!pci_is_pcie(dev) || !dev->subordinate ||
>>>>>> + list_empty(&dev->subordinate->devices) ||
>>>>>> + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)&&
>>>>>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
>>>>>> + return 0;
>>>>>> +
>>>>>> + subordinate = dev->subordinate;
>>>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) {
>>>>>> + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
>>>>>> + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
>>>>>> + ((child->class>> 16) == PCI_BASE_CLASS_DISPLAY))
>>>>>> + /* Don't reset switch, bridge, VGA device */
>>>>>> + return 0;
>>>>>> + }
>>>>>> +
>>>>>> + return 1;
>>>>>> +}
>>>>>> +
>>>>>> +static void __init save_config(struct pci_dev *dev)
>>> This should be renamed. it implies it is saving the config of the pdev
>>> passed in,
>>> when in reality, it is saving the config of all the devices attached to
>>> this pdev.
>>> i.e., save_downstream_configs()
>>>
>>>>>> +{
>>>>>> + struct pci_bus *subordinate;
>>>>>> + struct pci_dev *child;
>>>>>> +
>>>>>> + if (!need_reset(dev))
>>>>>> + return;
>>>>>> +
>>>>>> + subordinate = dev->subordinate;
>>>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) {
>>>>>> + dev_info(&child->dev, "save state\n");
>>>>>> + pci_save_state(child);
>>>>>> + }
>>>>>> +}
>>>>>> +
>>>>>> +static void __init restore_config(struct pci_dev *dev)
>>> inverse of above: restore_downstream_configs()
>>>>>> +{
>>>>>> + struct pci_bus *subordinate;
>>>>>> + struct pci_dev *child;
>>>>>> +
>>>>>> + if (!need_reset(dev))
>>>>>> + return;
>>>>>> +
>>>>>> + subordinate = dev->subordinate;
>>>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) {
>>>>>> + dev_info(&child->dev, "restore state\n");
>>>>>> + pci_restore_state(child);
>>>>>> + }
>>>>>> +}
>>>>>> +
>>>>>> +static void __init do_device_reset(struct pci_dev *dev)
>>> do_downstream_device_reset() -- it's not resetting this pdev,
>>> but the pdev's of the devices attached to it.
>>>
>> Right, I'll change them as you said.
>>
>>
>>>>>> +{
>>>>>> + u16 ctrl;
>>>>>> +
>>>>>> + if (!need_reset(dev))
>>>>>> + return;
>>>>>> +
>>>>>> + dev_info(&dev->dev, "Reset Secondary bus\n");
>>>>>> +
>>>>>> + /* Assert Secondary Bus Reset */
>>>>>> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&ctrl);
>>>>>> + ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
>>>>>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>>>>>> +
>>>>>> + msleep(2);
>>>>>> +
>>> This works well for x86, which uses ioport registers to access
>>> these<256-offset registers, b/c the write() function can't return
>>> until the write is actually completed, but for a non-x86 system,
>>> with fully mmconf'd PCI space, a write() may still be a write& run
>>> (sitting in CPU write-merge) buffer, so if you need a full 2ms,
>>> you ought to do another read_config() to the device, to flush the write,
>>> before starting the msleep(2) clock.
>>>
>> I didn't know that... I'll insert something like this before msleep.
>>
>> pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&dummy);
>>
>> BTW, I referred aer_do_secondary_bus_reset() when I wrote this code.
>> Probably need the same fix, right?
>>
> sounds like it.
> Interested to hear from someone in non-x86-land how
> they ensure pci cfg writes aren't buffered in their cpus;
> maybe handled via special uncached regions on other arches.
>
>>
>>>>>> + /* De-assert Secondary Bus Reset */
>>>>>> + ctrl&= ~PCI_BRIDGE_CTL_BUS_RESET;
>>>>>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>>>>>> +}
>>>>>> +
>>>>>> +static int __initdata pcie_reset_devices;
>>>>>> +static int __init reset_pcie_devices(void)
>>> this should be reset_pcie_endpoints() ... it's not resetting all pcie
>>> devices
>>>>>> +{
>>>>>> + struct pci_dev *dev = NULL;
>>>>>> +
>>>>>> + if (!pcie_reset_devices)
>>>>>> + return 0;
>>>>>> +
>>>>>> + for_each_pci_dev(dev)
>>>>>> + save_config(dev);
>>>>>> +
>>>>>> + for_each_pci_dev(dev)
>>>>>> + do_device_reset(dev);
>>>>>> +
>>>>>> + msleep(1000);
>>>>>> +
>>>>>> + for_each_pci_dev(dev)
>>>>>> + restore_config(dev);
>>>>>> +
>>> My apologies if past thread covered this sequence...
>>> Why three loops through all PCIe devices on the system?
>>> why not have the first for-each-pci-dev() loop filter devices
>>> to be reset, and save_config those pdev's,
>>> return a list of saved_pdev's; feed that list into the do_device_reset();
>>> then mpsleep(), then restore the list.
>>> in fact, once you create a link list of pdev's to reset,
>>> just loop that list doing save, then reset; rtn the list, do msleep(),
>>> then restore the config of pdevs in the list.
>>> Otherwise, doing much more traversing than what's needed.
>>> Doing a great deal more config-saving then needed right now as well
>>> (saving all non-endpt devices that aren't reset).
>>>
>>
>> Creating list is nice idea, thanks. I'll do.
>>
>> I'll have vacation next week, so I'll post v2 patch the week after
>> next.
>>
>> Thanks,
>> Takao Indoh
>>
> Have a good vacation! Thanks for your willingness
> to respin again & make this improvement.
>
>
> _______________________________________________
> kexec mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/kexec
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/