Re: [PATCH] Reset PCIe devices to stop ongoing DMA

Takao Indoh Tue, 07 May 2013 00:10:52 -0700

Sorry for the delayed response.

(2013/04/30 23:54), Sumner, William wrote:
> I have installed your original patch set (from last November) and tested with 
> three platforms, each with a different IO configuration.  On the first 
> platform crashdumps were consistently successful.  On the second and third 
> platforms, the reset of one specific PCI device on each platform (a different 
> device type on each platform) immediately hung the crash kernel. This was 
> consistent across multiple tries.  Temporarily modifying the patch to skip 
> resetting the one problem-device on each platform allowed each platform to 
> create a dump successfully.
> 
> I am now working with our IO team to determine the exact nature of the hang 
> and to see if the hang is unique to the specific cards or the specific 
> platforms.
> 
> Also I am starting to look at an alternate possibility:
> 
> The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level 
> Reset) describes reset use-cases for "a partitioned environment where 
> hardware is migrated from one partition to another" and for "system software 
> is taking down the software stack for a Function and then rebuilding that 
> stack".
> 
> These use-cases seem very similar to transitioning into the crash kernel.  
> The "Implementation Note" at the end of that section describes an algorithm 
> for "Avoiding Data Corruption From Stale Completions" which looks like it 
> might be applicable to stopping ongoing DMA.  I am hopeful that adding some 
> of the steps from this algorithm to one of the already proposed patches would 
> avoid the hang that I saw on two platforms.


It seems that the algorithm you mentioned requires four steps.

1. Clear Command register
2. Wait a few milliseconds (Or polling Transactions Pending bit)
3. Do FLR
4. Wait 100 ms

My patch does not do step 1 and 2. So, I would appreciate it if you
could add the following into save_config() in my latest patch and
confirm if kernel still hangs up or not.

        subordinate = dev->subordinate;
        list_for_each_entry(child, &subordinate->devices, bus_list) {
                dev_info(&child->dev, "save state\n");
                pci_save_state(child);
+               pci_write_config_word(child, PCI_COMMAND, 0);
+               msleep(1000);
        }

Thanks,
Takao Indoh


> 
> Bill Sumner
> 
> -----Original Message-----
> From: kexec [mailto:[email protected]] On Behalf Of Don Dutile
> Sent: Friday, April 26, 2013 8:43 AM
> To: Takao Indoh
> Cc: [email protected]; [email protected]; 
> [email protected]; [email protected]; 
> [email protected]; [email protected]
> Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA
> 
> On 04/25/2013 11:10 PM, Takao Indoh wrote:
>> (2013/04/26 3:01), Don Dutile wrote:
>>> On 04/25/2013 01:11 AM, Takao Indoh wrote:
>>>> (2013/04/25 4:59), Don Dutile wrote:
>>>>> On 04/24/2013 12:58 AM, Takao Indoh wrote:
>>>>>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>>>>>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
>>>>>> PCIe root port and downstream port to reset its downstream endpoint.
>>>>>>
>>>>>> Problem:
>>>>>> This patch solves the problem that kdump can fail when intel_iommu=on is
>>>>>> specified. When intel_iommu=on is specified, many dma-remapping errors
>>>>>> occur in second kernel and it causes problems like driver error or PCI
>>>>>> SERR, at last kdump fails. This problem is caused as follows.
>>>>>> 1) Devices are working on first kernel.
>>>>>> 2) Switch to second kernel(kdump kernel). The devices are still working
>>>>>>         and its DMA continues during this switch.
>>>>>> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>>>>>>         dma-remapping errors.
>>>>>>
>>>>>> Solution:
>>>>>> All DMA transactions have to be stopped before iommu is initialized. By
>>>>>> this patch devices are reset and in-flight DMA is stopped before
>>>>>> pci_iommu_init.
>>>>>>
>>>>>> To invoke hot reset on an endpoint, its upstream link need to be reset.
>>>>>> reset_pcie_devices() is called from fs_initcall_sync, and it finds root
>>>>>> port/downstream port whose child is PCIe endpoint, and then reset link
>>>>>> between them. If the endpoint is VGA device, it is skipped because the
>>>>>> monitor blacks out if VGA controller is reset.
>>>>>>
>>>>> Couple questions wrt VGA device:
>>>>> (1) Many graphics devices are multi-function, one function being VGA;
>>>>>          is the VGA always function 0, so this scan sees it first&   
>>>>> doesn't
>>>>>          do a reset on that PCIe link?  if the VGA is not function 0, 
>>>>> won't
>>>>>          this logic break (will reset b/c function 0 is non-VGA graphics) 
>>>>> ?
>>>>
>>>> VGA is not reset irrespective of its function number. The logic of this
>>>> patch is:
>>>>
>>>> for_each_pci_dev(dev) {
>>>>         if (dev is not PCIe)
>>>>            continue;
>>>>         if (dev is not root port/downstream port) ---(1)
>>>>            continue;
>>>>         list_for_each_entry(child,&dev->subordinate->devices, bus_list) {
>>>>             if (child is upstream port or bridge or VGA) ---(2)
>>>>                 continue;
>>>>         }
>>>>         do_reset_its_child(dev);
>>>> }
>>>>
>>>> Therefore VGA itself is skipped by (1), and upstream device(root port or
>>>> downstream port) of VGA is also skipped by (2).
>>>>
>>>>
>>>>> (2) I'm hearing VGA will soon not be the a required console; this logic
>>>>>          assumes it is, and why it isn't blanked.
>>>>>          Q: Should the filter be based on a device having a device-class 
>>>>> of display ?
>>>>
>>>> I want to avoid the situation that user's monitor blacks out and user
>>>> cannot know what's going on. That's reason why I introduced the logic to
>>>> skip VGA. As far as I tested the logic based on device-class works well,
>>> sorry, I read your description, which said VGA, but your are filtering on 
>>> display class,
>>> which includes non-VGA as well. So, all set ... but large, (x16) non-VGA 
>>> display devices
>>> are probably one of the most aggressive DMA engines on a system.... and 
>>> will grow as
>>> asymmetric processing using GPUs gets architected into a device-agnostic 
>>> manner.
>>> So, this may work well for servers, which is the primary consumer/user of 
>>> this feature,
>>> and they typically have built-in graphics that are generally used in simple 
>>> VGA mode,
>>> so this may be sufficient for now.
>>
>> Ok, understood.
>>
>>
>>>
>>>> but I would appreciate it if there are better ways.
>>>>
>>> You probably don't want to hear it but....
>>> a) only turn off cmd-reg master enable bit
>>> b) only do reset based on a list of devices known not to
>>>       obey their cmd-reg master enable bit, and only do reset to those 
>>> devices.
>>> But, given the testing you've done so far, this optional (need cmdline) 
>>> feature,
>>> let's start here.
>>
>> Ok. Either way I think we need more testing.
>>
>>
>>>>>
>>>>>> Actually this is v8 patch but quite different from v7 and it's been so
>>>>>> long since previous post, so I start over again.
>>>>> Thanks for this re-start.  I need to continue reviewing the rest.
>>>>
>>>> Thank you for your review!
>>>>
>>>>>
>>>>> Q: Why not force IOMMU off when re-booting a kexec kernel to perform a 
>>>>> crash
>>>>>         dump?  After the crash dump, the system is rebooting to previous 
>>>>> (iommu=on) setting.
>>>>>         That logic, along w/your previous patch to disable the IOMMU if 
>>>>> iommu=off
>>>>>         is set, would remove this (relatively slow) PCI init sequencing ?
>>>>
>>>> To force iommu off, all ongoing DMA have to be stopped before that since
>>>> they are accessing the device address, not physical address. If we disable
>>>> iommu without stopping in-flihgt DMA, devices access invalid memory area
>>>> and it causes memory corruption or PCI-SERR due to DMA error.
>>> Right, that's a 'duh' on my part.
>>> I thought 'disable iommu' == 'block all dma' and it just turns it off&
>>> let's the ongoing DMA run...
>>> Please ignore this question... sigh.
>>>
>>>>
>>>> So, whether we use iommu or not in second kernel, we have to stop DMA in
>>>> second kernel if iommu is used in first kernel.
>>>>
>>>> Thanks,
>>>> Takao Indoh
>>>>
>>>>
>>>>>
>>>>>> Previous post:
>>>>>> [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
>>>>>> https://lkml.org/lkml/2012/11/26/814
>>>>>>
>>>>>> Signed-off-by: Takao Indoh<[email protected]>
>>>>>> ---
>>>>>>       Documentation/kernel-parameters.txt |    2 +
>>>>>>       drivers/pci/pci.c                   |  103 
>>>>>> +++++++++++++++++++++++++++++++++++
>>>>>>       2 files changed, 105 insertions(+), 0 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/kernel-parameters.txt 
>>>>>> b/Documentation/kernel-parameters.txt
>>>>>> index 4609e81..2a31ade 100644
>>>>>> --- a/Documentation/kernel-parameters.txt
>>>>>> +++ b/Documentation/kernel-parameters.txt
>>>>>> @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also 
>>>>>> be entirely omitted.
>>>>>>                       any pair of devices, possibly at the cost of
>>>>>>                       reduced performance.  This also guarantees
>>>>>>                       that hot-added devices will work.
>>>>>> +        pcie_reset_devices    Reset PCIe endpoint on boot by hot
>>>>>> +                reset
>>>>>>               cbiosize=nn[KMG]    The fixed amount of bus space which is
>>>>>>                       reserved for the CardBus bridge's IO window.
>>>>>>                       The default value is 256 bytes.
>>>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>>>>> index b099e00..42385c9 100644
>>>>>> --- a/drivers/pci/pci.c
>>>>>> +++ b/drivers/pci/pci.c
>>>>>> @@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus 
>>>>>> *bus)
>>>>>>       }
>>>>>>       EXPORT_SYMBOL(pci_fixup_cardbus);
>>>>>>
>>>>>> +/*
>>>>>> + * Return true if dev is PCIe root port or downstream port whose child 
>>>>>> is PCIe
>>>>>> + * endpoint except VGA device.
>>>>>> + */
>>>>>> +static int __init need_reset(struct pci_dev *dev)
>>>>>> +{
>>>>>> +    struct pci_bus *subordinate;
>>>>>> +    struct pci_dev *child;
>>>>>> +
>>>>>> +    if (!pci_is_pcie(dev) || !dev->subordinate ||
>>>>>> +        list_empty(&dev->subordinate->devices) ||
>>>>>> +        ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)&&
>>>>>> +         (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
>>>>>> +        return 0;
>>>>>> +
>>>>>> +    subordinate = dev->subordinate;
>>>>>> +    list_for_each_entry(child,&subordinate->devices, bus_list) {
>>>>>> +        if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
>>>>>> +            (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
>>>>>> +            ((child->class>>    16) == PCI_BASE_CLASS_DISPLAY))
>>>>>> +            /* Don't reset switch, bridge, VGA device */
>>>>>> +            return 0;
>>>>>> +    }
>>>>>> +
>>>>>> +    return 1;
>>>>>> +}
>>>>>> +
>>>>>> +static void __init save_config(struct pci_dev *dev)
>>> This should be renamed.  it implies it is saving the config of the pdev 
>>> passed in,
>>> when in reality, it is saving the config of all the devices attached to 
>>> this pdev.
>>> i.e., save_downstream_configs()
>>>
>>>>>> +{
>>>>>> +    struct pci_bus *subordinate;
>>>>>> +    struct pci_dev *child;
>>>>>> +
>>>>>> +    if (!need_reset(dev))
>>>>>> +        return;
>>>>>> +
>>>>>> +    subordinate = dev->subordinate;
>>>>>> +    list_for_each_entry(child,&subordinate->devices, bus_list) {
>>>>>> +        dev_info(&child->dev, "save state\n");
>>>>>> +        pci_save_state(child);
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>> +static void __init restore_config(struct pci_dev *dev)
>>> inverse of above: restore_downstream_configs()
>>>>>> +{
>>>>>> +    struct pci_bus *subordinate;
>>>>>> +    struct pci_dev *child;
>>>>>> +
>>>>>> +    if (!need_reset(dev))
>>>>>> +        return;
>>>>>> +
>>>>>> +    subordinate = dev->subordinate;
>>>>>> +    list_for_each_entry(child,&subordinate->devices, bus_list) {
>>>>>> +        dev_info(&child->dev, "restore state\n");
>>>>>> +        pci_restore_state(child);
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>> +static void __init do_device_reset(struct pci_dev *dev)
>>> do_downstream_device_reset() -- it's not resetting this pdev,
>>> but the pdev's of the devices attached to it.
>>>
>> Right, I'll change them as you said.
>>
>>
>>>>>> +{
>>>>>> +    u16 ctrl;
>>>>>> +
>>>>>> +    if (!need_reset(dev))
>>>>>> +        return;
>>>>>> +
>>>>>> +    dev_info(&dev->dev, "Reset Secondary bus\n");
>>>>>> +
>>>>>> +    /* Assert Secondary Bus Reset */
>>>>>> +    pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&ctrl);
>>>>>> +    ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
>>>>>> +    pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>>>>>> +
>>>>>> +    msleep(2);
>>>>>> +
>>> This works well for x86, which uses ioport registers to access
>>> these<256-offset registers, b/c the write() function can't return
>>> until the write is actually completed, but for a non-x86 system,
>>> with fully mmconf'd PCI space, a write() may still be a write&  run
>>> (sitting in CPU write-merge) buffer, so if you need a full 2ms,
>>> you ought to do another read_config() to the device, to flush the write,
>>> before starting the msleep(2) clock.
>>>
>> I didn't know that... I'll insert something like this before msleep.
>>
>> pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&dummy);
>>
>> BTW, I referred aer_do_secondary_bus_reset() when I wrote this code.
>> Probably need the same fix, right?
>>
> sounds like it.
> Interested to hear from someone in non-x86-land how
> they ensure pci cfg writes aren't buffered in their cpus;
> maybe handled via special uncached regions on other arches.
> 
>>
>>>>>> +    /* De-assert Secondary Bus Reset */
>>>>>> +    ctrl&= ~PCI_BRIDGE_CTL_BUS_RESET;
>>>>>> +    pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>>>>>> +}
>>>>>> +
>>>>>> +static int __initdata pcie_reset_devices;
>>>>>> +static int __init reset_pcie_devices(void)
>>> this should be reset_pcie_endpoints() ... it's not resetting all pcie 
>>> devices
>>>>>> +{
>>>>>> +    struct pci_dev *dev = NULL;
>>>>>> +
>>>>>> +    if (!pcie_reset_devices)
>>>>>> +        return 0;
>>>>>> +
>>>>>> +    for_each_pci_dev(dev)
>>>>>> +        save_config(dev);
>>>>>> +
>>>>>> +    for_each_pci_dev(dev)
>>>>>> +        do_device_reset(dev);
>>>>>> +
>>>>>> +    msleep(1000);
>>>>>> +
>>>>>> +    for_each_pci_dev(dev)
>>>>>> +        restore_config(dev);
>>>>>> +
>>> My apologies if past thread covered this sequence...
>>> Why three loops through all PCIe devices on the system?
>>> why not have the first for-each-pci-dev() loop filter devices
>>> to be reset, and save_config those pdev's,
>>> return a list of saved_pdev's;  feed that list into the do_device_reset();
>>> then mpsleep(), then restore the list.
>>> in fact, once you create a link list of pdev's to reset,
>>> just loop that list doing save, then reset; rtn the list, do msleep(),
>>> then restore the config of pdevs in the list.
>>> Otherwise, doing much more traversing than what's needed.
>>> Doing a great deal more config-saving then needed right now as well
>>> (saving all non-endpt devices that aren't reset).
>>>
>>
>> Creating list is nice idea, thanks. I'll do.
>>
>> I'll have vacation next week, so I'll post v2 patch the week after
>> next.
>>
>> Thanks,
>> Takao Indoh
>>
> Have a good vacation! Thanks for your willingness
> to respin again & make this improvement.
> 
> 
> _______________________________________________
> kexec mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/kexec
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

Reply via email to