(2013/05/07 21:50), Don Dutile wrote:
> On 05/07/2013 03:09 AM, Takao Indoh wrote:
>> Sorry for the delayed response.
>>
>> (2013/04/30 23:54), Sumner, William wrote:
>>> I have installed your original patch set (from last November) and tested 
>>> with three platforms, each with a different IO configuration.  On the first 
>>> platform crashdumps were consistently successful.  On the second and third 
>>> platforms, the reset of one specific PCI device on each platform (a 
>>> different device type on each platform) immediately hung the crash kernel. 
>>> This was consistent across multiple tries.  Temporarily modifying the patch 
>>> to skip resetting the one problem-device on each platform allowed each 
>>> platform to create a dump successfully.
>>>
>>> I am now working with our IO team to determine the exact nature of the hang 
>>> and to see if the hang is unique to the specific cards or the specific 
>>> platforms.
>>>
>>> Also I am starting to look at an alternate possibility:
>>>
>>> The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level 
>>> Reset) describes reset use-cases for "a partitioned environment where 
>>> hardware is migrated from one partition to another" and for "system 
>>> software is taking down the software stack for a Function and then 
>>> rebuilding that stack".
>>>
>>> These use-cases seem very similar to transitioning into the crash kernel.  
>>> The "Implementation Note" at the end of that section describes an algorithm 
>>> for "Avoiding Data Corruption From Stale Completions" which looks like it 
>>> might be applicable to stopping ongoing DMA.  I am hopeful that adding some 
>>> of the steps from this algorithm to one of the already proposed patches 
>>> would avoid the hang that I saw on two platforms.
>>
>> It seems that the algorithm you mentioned requires four steps.
>>
>> 1. Clear Command register
>> 2. Wait a few milliseconds (Or polling Transactions Pending bit)
>> 3. Do FLR
>> 4. Wait 100 ms
>>
>> My patch does not do step 1 and 2. So, I would appreciate it if you
>> could add the following into save_config() in my latest patch and
>> confirm if kernel still hangs up or not.
>>
>>           subordinate = dev->subordinate;
>>           list_for_each_entry(child,&subordinate->devices, bus_list) {
>>                   dev_info(&child->dev, "save state\n");
>>                   pci_save_state(child);
>> +               pci_write_config_word(child, PCI_COMMAND, 0);
>> +               msleep(1000);
> As Linus pointed out in an earlier patch, msleep() after each device
> is s-l-o-w and unnecessary; should do a single msleep(1000) after
> *all* the command registers are written.  That way, the added delay is 1sec,
> not 1sec*ndevs.

Yeah, of course. I added these two lines just to confirm if clearing
command register resolves Bill's problem. They are tentative things.

> q: is this turning off command register for only PCI(e) endpoints?
> -- shouldn't have to turn off command register in bridges/switches.

Yes, only clearing register of endpoints because kdump fails due to
following error when turning off command register of switch as well.

Fusion MPT SAS Host driver 3.04.20
mptbase: ioc0: Initiating bringup
mptbase: ioc0: WARNING - Unexpected doorbell active!
mptbase: ioc0: ERROR - Failed to come READY after reset! IocState=f0000000
mptbase: ioc0: WARNING - ResetHistory bit failed to clear!
mptbase: ioc0: ERROR - Diagnostic reset FAILED! (ffffffffh)
mptbase: ioc0: WARNING - NOT READY WARNING!
mptbase: ioc0: ERROR - didn't initialize properly! (-1)
mptsas: probe of 0000:50:00.0 failed with error -1

Thanks,
Takao Indoh

> 
>>           }
>>
>> Thanks,
>> Takao Indoh
>>
>>
>>>
>>> Bill Sumner
>>>
>>> -----Original Message-----
>>> From: kexec [mailto:kexec-boun...@lists.infradead.org] On Behalf Of Don 
>>> Dutile
>>> Sent: Friday, April 26, 2013 8:43 AM
>>> To: Takao Indoh
>>> Cc: linux-...@vger.kernel.org; ke...@lists.infradead.org; 
>>> linux-ker...@vger.kernel.org; tin...@gmail.com; 
>>> iommu@lists.linux-foundation.org; bhelg...@google.com
>>> Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA
>>>
>>> On 04/25/2013 11:10 PM, Takao Indoh wrote:
>>>> (2013/04/26 3:01), Don Dutile wrote:
>>>>> On 04/25/2013 01:11 AM, Takao Indoh wrote:
>>>>>> (2013/04/25 4:59), Don Dutile wrote:
>>>>>>> On 04/24/2013 12:58 AM, Takao Indoh wrote:
>>>>>>>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>>>>>>>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
>>>>>>>> PCIe root port and downstream port to reset its downstream endpoint.
>>>>>>>>
>>>>>>>> Problem:
>>>>>>>> This patch solves the problem that kdump can fail when intel_iommu=on 
>>>>>>>> is
>>>>>>>> specified. When intel_iommu=on is specified, many dma-remapping errors
>>>>>>>> occur in second kernel and it causes problems like driver error or PCI
>>>>>>>> SERR, at last kdump fails. This problem is caused as follows.
>>>>>>>> 1) Devices are working on first kernel.
>>>>>>>> 2) Switch to second kernel(kdump kernel). The devices are still working
>>>>>>>>           and its DMA continues during this switch.
>>>>>>>> 3) iommu is initialized during second kernel boot and ongoing DMA 
>>>>>>>> causes
>>>>>>>>           dma-remapping errors.
>>>>>>>>
>>>>>>>> Solution:
>>>>>>>> All DMA transactions have to be stopped before iommu is initialized. By
>>>>>>>> this patch devices are reset and in-flight DMA is stopped before
>>>>>>>> pci_iommu_init.
>>>>>>>>
>>>>>>>> To invoke hot reset on an endpoint, its upstream link need to be reset.
>>>>>>>> reset_pcie_devices() is called from fs_initcall_sync, and it finds root
>>>>>>>> port/downstream port whose child is PCIe endpoint, and then reset link
>>>>>>>> between them. If the endpoint is VGA device, it is skipped because the
>>>>>>>> monitor blacks out if VGA controller is reset.
>>>>>>>>
>>>>>>> Couple questions wrt VGA device:
>>>>>>> (1) Many graphics devices are multi-function, one function being VGA;
>>>>>>>            is the VGA always function 0, so this scan sees it first&    
>>>>>>> doesn't
>>>>>>>            do a reset on that PCIe link?  if the VGA is not function 0, 
>>>>>>> won't
>>>>>>>            this logic break (will reset b/c function 0 is non-VGA 
>>>>>>> graphics) ?
>>>>>>
>>>>>> VGA is not reset irrespective of its function number. The logic of this
>>>>>> patch is:
>>>>>>
>>>>>> for_each_pci_dev(dev) {
>>>>>>           if (dev is not PCIe)
>>>>>>              continue;
>>>>>>           if (dev is not root port/downstream port) ---(1)
>>>>>>              continue;
>>>>>>           list_for_each_entry(child,&dev->subordinate->devices, 
>>>>>> bus_list) {
>>>>>>               if (child is upstream port or bridge or VGA) ---(2)
>>>>>>                   continue;
>>>>>>           }
>>>>>>           do_reset_its_child(dev);
>>>>>> }
>>>>>>
>>>>>> Therefore VGA itself is skipped by (1), and upstream device(root port or
>>>>>> downstream port) of VGA is also skipped by (2).
>>>>>>
>>>>>>
>>>>>>> (2) I'm hearing VGA will soon not be the a required console; this logic
>>>>>>>            assumes it is, and why it isn't blanked.
>>>>>>>            Q: Should the filter be based on a device having a 
>>>>>>> device-class of display ?
>>>>>>
>>>>>> I want to avoid the situation that user's monitor blacks out and user
>>>>>> cannot know what's going on. That's reason why I introduced the logic to
>>>>>> skip VGA. As far as I tested the logic based on device-class works well,
>>>>> sorry, I read your description, which said VGA, but your are filtering on 
>>>>> display class,
>>>>> which includes non-VGA as well. So, all set ... but large, (x16) non-VGA 
>>>>> display devices
>>>>> are probably one of the most aggressive DMA engines on a system.... and 
>>>>> will grow as
>>>>> asymmetric processing using GPUs gets architected into a device-agnostic 
>>>>> manner.
>>>>> So, this may work well for servers, which is the primary consumer/user of 
>>>>> this feature,
>>>>> and they typically have built-in graphics that are generally used in 
>>>>> simple VGA mode,
>>>>> so this may be sufficient for now.
>>>>
>>>> Ok, understood.
>>>>
>>>>
>>>>>
>>>>>> but I would appreciate it if there are better ways.
>>>>>>
>>>>> You probably don't want to hear it but....
>>>>> a) only turn off cmd-reg master enable bit
>>>>> b) only do reset based on a list of devices known not to
>>>>>         obey their cmd-reg master enable bit, and only do reset to those 
>>>>> devices.
>>>>> But, given the testing you've done so far, this optional (need cmdline) 
>>>>> feature,
>>>>> let's start here.
>>>>
>>>> Ok. Either way I think we need more testing.
>>>>
>>>>
>>>>>>>
>>>>>>>> Actually this is v8 patch but quite different from v7 and it's been so
>>>>>>>> long since previous post, so I start over again.
>>>>>>> Thanks for this re-start.  I need to continue reviewing the rest.
>>>>>>
>>>>>> Thank you for your review!
>>>>>>
>>>>>>>
>>>>>>> Q: Why not force IOMMU off when re-booting a kexec kernel to perform a 
>>>>>>> crash
>>>>>>>           dump?  After the crash dump, the system is rebooting to 
>>>>>>> previous (iommu=on) setting.
>>>>>>>           That logic, along w/your previous patch to disable the IOMMU 
>>>>>>> if iommu=off
>>>>>>>           is set, would remove this (relatively slow) PCI init 
>>>>>>> sequencing ?
>>>>>>
>>>>>> To force iommu off, all ongoing DMA have to be stopped before that since
>>>>>> they are accessing the device address, not physical address. If we 
>>>>>> disable
>>>>>> iommu without stopping in-flihgt DMA, devices access invalid memory area
>>>>>> and it causes memory corruption or PCI-SERR due to DMA error.
>>>>> Right, that's a 'duh' on my part.
>>>>> I thought 'disable iommu' == 'block all dma' and it just turns it off&
>>>>> let's the ongoing DMA run...
>>>>> Please ignore this question... sigh.
>>>>>
>>>>>>
>>>>>> So, whether we use iommu or not in second kernel, we have to stop DMA in
>>>>>> second kernel if iommu is used in first kernel.
>>>>>>
>>>>>> Thanks,
>>>>>> Takao Indoh
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> Previous post:
>>>>>>>> [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
>>>>>>>> https://lkml.org/lkml/2012/11/26/814
>>>>>>>>
>>>>>>>> Signed-off-by: Takao Indoh<indou.ta...@jp.fujitsu.com>
>>>>>>>> ---
>>>>>>>>         Documentation/kernel-parameters.txt |    2 +
>>>>>>>>         drivers/pci/pci.c                   |  103 
>>>>>>>> +++++++++++++++++++++++++++++++++++
>>>>>>>>         2 files changed, 105 insertions(+), 0 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/Documentation/kernel-parameters.txt 
>>>>>>>> b/Documentation/kernel-parameters.txt
>>>>>>>> index 4609e81..2a31ade 100644
>>>>>>>> --- a/Documentation/kernel-parameters.txt
>>>>>>>> +++ b/Documentation/kernel-parameters.txt
>>>>>>>> @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can 
>>>>>>>> also be entirely omitted.
>>>>>>>>                         any pair of devices, possibly at the cost of
>>>>>>>>                         reduced performance.  This also guarantees
>>>>>>>>                         that hot-added devices will work.
>>>>>>>> +        pcie_reset_devices    Reset PCIe endpoint on boot by hot
>>>>>>>> +                reset
>>>>>>>>                 cbiosize=nn[KMG]    The fixed amount of bus space 
>>>>>>>> which is
>>>>>>>>                         reserved for the CardBus bridge's IO window.
>>>>>>>>                         The default value is 256 bytes.
>>>>>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>>>>>>> index b099e00..42385c9 100644
>>>>>>>> --- a/drivers/pci/pci.c
>>>>>>>> +++ b/drivers/pci/pci.c
>>>>>>>> @@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus 
>>>>>>>> *bus)
>>>>>>>>         }
>>>>>>>>         EXPORT_SYMBOL(pci_fixup_cardbus);
>>>>>>>>
>>>>>>>> +/*
>>>>>>>> + * Return true if dev is PCIe root port or downstream port whose 
>>>>>>>> child is PCIe
>>>>>>>> + * endpoint except VGA device.
>>>>>>>> + */
>>>>>>>> +static int __init need_reset(struct pci_dev *dev)
>>>>>>>> +{
>>>>>>>> +    struct pci_bus *subordinate;
>>>>>>>> +    struct pci_dev *child;
>>>>>>>> +
>>>>>>>> +    if (!pci_is_pcie(dev) || !dev->subordinate ||
>>>>>>>> +        list_empty(&dev->subordinate->devices) ||
>>>>>>>> +        ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)&&
>>>>>>>> +         (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
>>>>>>>> +        return 0;
>>>>>>>> +
>>>>>>>> +    subordinate = dev->subordinate;
>>>>>>>> +    list_for_each_entry(child,&subordinate->devices, bus_list) {
>>>>>>>> +        if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
>>>>>>>> +            (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
>>>>>>>> +            ((child->class>>     16) == PCI_BASE_CLASS_DISPLAY))
>>>>>>>> +            /* Don't reset switch, bridge, VGA device */
>>>>>>>> +            return 0;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return 1;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void __init save_config(struct pci_dev *dev)
>>>>> This should be renamed.  it implies it is saving the config of the pdev 
>>>>> passed in,
>>>>> when in reality, it is saving the config of all the devices attached to 
>>>>> this pdev.
>>>>> i.e., save_downstream_configs()
>>>>>
>>>>>>>> +{
>>>>>>>> +    struct pci_bus *subordinate;
>>>>>>>> +    struct pci_dev *child;
>>>>>>>> +
>>>>>>>> +    if (!need_reset(dev))
>>>>>>>> +        return;
>>>>>>>> +
>>>>>>>> +    subordinate = dev->subordinate;
>>>>>>>> +    list_for_each_entry(child,&subordinate->devices, bus_list) {
>>>>>>>> +        dev_info(&child->dev, "save state\n");
>>>>>>>> +        pci_save_state(child);
>>>>>>>> +    }
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void __init restore_config(struct pci_dev *dev)
>>>>> inverse of above: restore_downstream_configs()
>>>>>>>> +{
>>>>>>>> +    struct pci_bus *subordinate;
>>>>>>>> +    struct pci_dev *child;
>>>>>>>> +
>>>>>>>> +    if (!need_reset(dev))
>>>>>>>> +        return;
>>>>>>>> +
>>>>>>>> +    subordinate = dev->subordinate;
>>>>>>>> +    list_for_each_entry(child,&subordinate->devices, bus_list) {
>>>>>>>> +        dev_info(&child->dev, "restore state\n");
>>>>>>>> +        pci_restore_state(child);
>>>>>>>> +    }
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void __init do_device_reset(struct pci_dev *dev)
>>>>> do_downstream_device_reset() -- it's not resetting this pdev,
>>>>> but the pdev's of the devices attached to it.
>>>>>
>>>> Right, I'll change them as you said.
>>>>
>>>>
>>>>>>>> +{
>>>>>>>> +    u16 ctrl;
>>>>>>>> +
>>>>>>>> +    if (!need_reset(dev))
>>>>>>>> +        return;
>>>>>>>> +
>>>>>>>> +    dev_info(&dev->dev, "Reset Secondary bus\n");
>>>>>>>> +
>>>>>>>> +    /* Assert Secondary Bus Reset */
>>>>>>>> +    pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&ctrl);
>>>>>>>> +    ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
>>>>>>>> +    pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>>>>>>>> +
>>>>>>>> +    msleep(2);
>>>>>>>> +
>>>>> This works well for x86, which uses ioport registers to access
>>>>> these<256-offset registers, b/c the write() function can't return
>>>>> until the write is actually completed, but for a non-x86 system,
>>>>> with fully mmconf'd PCI space, a write() may still be a write&   run
>>>>> (sitting in CPU write-merge) buffer, so if you need a full 2ms,
>>>>> you ought to do another read_config() to the device, to flush the write,
>>>>> before starting the msleep(2) clock.
>>>>>
>>>> I didn't know that... I'll insert something like this before msleep.
>>>>
>>>> pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&dummy);
>>>>
>>>> BTW, I referred aer_do_secondary_bus_reset() when I wrote this code.
>>>> Probably need the same fix, right?
>>>>
>>> sounds like it.
>>> Interested to hear from someone in non-x86-land how
>>> they ensure pci cfg writes aren't buffered in their cpus;
>>> maybe handled via special uncached regions on other arches.
>>>
>>>>
>>>>>>>> +    /* De-assert Secondary Bus Reset */
>>>>>>>> +    ctrl&= ~PCI_BRIDGE_CTL_BUS_RESET;
>>>>>>>> +    pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int __initdata pcie_reset_devices;
>>>>>>>> +static int __init reset_pcie_devices(void)
>>>>> this should be reset_pcie_endpoints() ... it's not resetting all pcie 
>>>>> devices
>>>>>>>> +{
>>>>>>>> +    struct pci_dev *dev = NULL;
>>>>>>>> +
>>>>>>>> +    if (!pcie_reset_devices)
>>>>>>>> +        return 0;
>>>>>>>> +
>>>>>>>> +    for_each_pci_dev(dev)
>>>>>>>> +        save_config(dev);
>>>>>>>> +
>>>>>>>> +    for_each_pci_dev(dev)
>>>>>>>> +        do_device_reset(dev);
>>>>>>>> +
>>>>>>>> +    msleep(1000);
>>>>>>>> +
>>>>>>>> +    for_each_pci_dev(dev)
>>>>>>>> +        restore_config(dev);
>>>>>>>> +
>>>>> My apologies if past thread covered this sequence...
>>>>> Why three loops through all PCIe devices on the system?
>>>>> why not have the first for-each-pci-dev() loop filter devices
>>>>> to be reset, and save_config those pdev's,
>>>>> return a list of saved_pdev's;  feed that list into the do_device_reset();
>>>>> then mpsleep(), then restore the list.
>>>>> in fact, once you create a link list of pdev's to reset,
>>>>> just loop that list doing save, then reset; rtn the list, do msleep(),
>>>>> then restore the config of pdevs in the list.
>>>>> Otherwise, doing much more traversing than what's needed.
>>>>> Doing a great deal more config-saving then needed right now as well
>>>>> (saving all non-endpt devices that aren't reset).
>>>>>
>>>>
>>>> Creating list is nice idea, thanks. I'll do.
>>>>
>>>> I'll have vacation next week, so I'll post v2 patch the week after
>>>> next.
>>>>
>>>> Thanks,
>>>> Takao Indoh
>>>>
>>> Have a good vacation! Thanks for your willingness
>>> to respin again&  make this improvement.
>>>
>>>
>>> _______________________________________________
>>> kexec mailing list
>>> ke...@lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/kexec
>>>
>>>
>>
>>
> 
> 
> 


_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Reply via email to