Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-08-01 Thread Vivek Goyal
On Thu, Aug 01, 2013 at 03:34:06PM +0900, Takao Indoh wrote:
> (2013/08/01 6:23), Rafael J. Wysocki wrote:
> > On Wednesday, July 31, 2013 03:08:03 PM Bjorn Helgaas wrote:
> >> [+cc Rafael, linux-acpi]
> >>
> >> On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh  
> >> wrote:
> >>
> >>> On x86, currently IOMMU initialization run *after* PCI enumeration, but
> >>> what you are talking about is that it should be changed so that x86
> >>> IOMMU initialization is done *before* PCI enumeration like sparc, right?
> >>
> >> Yes.  I don't know whether or when that initialization order will ever
> >> be changed, but I do think we should avoid building more
> >> infrastructure that depends on the current order.
> >>
> >> Changing the order is a pretty big deal because it's a lot more than
> >> just the IOMMU.  Basically I think we should be enumerating ACPI
> >> devices, including the IOMMU, before PCI devices, but there's a lot of
> >> legacy involved in that area.  Added Rafael in case he has any
> >> thoughts.
> > 
> > Well, actually, I'm not really familiar with IOMMUs, sorry.
> > 
> > I do think that initializing IOMMU before PCI enumeration would be better,
> > however.  At least if the ordering should be the same on all architectures,
> > which I suppose is the case, that's the one I'd choose.
> 
> Ok guys. If x86 IOMMU maintainer also thinks changing order is
> necessary, maybe I need to give up device reset in kdump kernel and
> consider doing it in panic kernel.

I don't think trying to reset all the devices in panic kernel is
a good idea.

We need to handle the problem at IOMMU level first which is 
independent of whether devices have been reset or not.

IOW, we should have the capability to initialize IOMMU first
and be able to deal with devices which are doing DMA.

I am not against doing device reset and it most likely is a good thing
but it should happen in second kernel and not in crashed kernel.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-08-01 Thread Alex Williamson
On Thu, 2013-08-01 at 15:34 +0900, Takao Indoh wrote:
> (2013/08/01 6:23), Rafael J. Wysocki wrote:
> > On Wednesday, July 31, 2013 03:08:03 PM Bjorn Helgaas wrote:
> >> [+cc Rafael, linux-acpi]
> >>
> >> On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh  
> >> wrote:
> >>
> >>> On x86, currently IOMMU initialization run *after* PCI enumeration, but
> >>> what you are talking about is that it should be changed so that x86
> >>> IOMMU initialization is done *before* PCI enumeration like sparc, right?
> >>
> >> Yes.  I don't know whether or when that initialization order will ever
> >> be changed, but I do think we should avoid building more
> >> infrastructure that depends on the current order.
> >>
> >> Changing the order is a pretty big deal because it's a lot more than
> >> just the IOMMU.  Basically I think we should be enumerating ACPI
> >> devices, including the IOMMU, before PCI devices, but there's a lot of
> >> legacy involved in that area.  Added Rafael in case he has any
> >> thoughts.
> > 
> > Well, actually, I'm not really familiar with IOMMUs, sorry.
> > 
> > I do think that initializing IOMMU before PCI enumeration would be better,
> > however.  At least if the ordering should be the same on all architectures,
> > which I suppose is the case, that's the one I'd choose.
> 
> Ok guys. If x86 IOMMU maintainer also thinks changing order is
> necessary, maybe I need to give up device reset in kdump kernel and
> consider doing it in panic kernel.
> 
> Either way, I need bus reset interface to reset devices. Bjorn, could
> you review the bus reset patches Alex posted yesterday?

I'll post a non-RFC version today, I've made a couple cleanups, tuned
the delays and rolled in the AER version of secondary bus reset.
Thanks,

Alex


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-07-31 Thread Takao Indoh
(2013/08/01 6:23), Rafael J. Wysocki wrote:
> On Wednesday, July 31, 2013 03:08:03 PM Bjorn Helgaas wrote:
>> [+cc Rafael, linux-acpi]
>>
>> On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh  
>> wrote:
>>
>>> On x86, currently IOMMU initialization run *after* PCI enumeration, but
>>> what you are talking about is that it should be changed so that x86
>>> IOMMU initialization is done *before* PCI enumeration like sparc, right?
>>
>> Yes.  I don't know whether or when that initialization order will ever
>> be changed, but I do think we should avoid building more
>> infrastructure that depends on the current order.
>>
>> Changing the order is a pretty big deal because it's a lot more than
>> just the IOMMU.  Basically I think we should be enumerating ACPI
>> devices, including the IOMMU, before PCI devices, but there's a lot of
>> legacy involved in that area.  Added Rafael in case he has any
>> thoughts.
> 
> Well, actually, I'm not really familiar with IOMMUs, sorry.
> 
> I do think that initializing IOMMU before PCI enumeration would be better,
> however.  At least if the ordering should be the same on all architectures,
> which I suppose is the case, that's the one I'd choose.

Ok guys. If x86 IOMMU maintainer also thinks changing order is
necessary, maybe I need to give up device reset in kdump kernel and
consider doing it in panic kernel.

Either way, I need bus reset interface to reset devices. Bjorn, could
you review the bus reset patches Alex posted yesterday?

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-07-31 Thread Rafael J. Wysocki
On Wednesday, July 31, 2013 03:08:03 PM Bjorn Helgaas wrote:
> [+cc Rafael, linux-acpi]
> 
> On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh  
> wrote:
> 
> > On x86, currently IOMMU initialization run *after* PCI enumeration, but
> > what you are talking about is that it should be changed so that x86
> > IOMMU initialization is done *before* PCI enumeration like sparc, right?
> 
> Yes.  I don't know whether or when that initialization order will ever
> be changed, but I do think we should avoid building more
> infrastructure that depends on the current order.
> 
> Changing the order is a pretty big deal because it's a lot more than
> just the IOMMU.  Basically I think we should be enumerating ACPI
> devices, including the IOMMU, before PCI devices, but there's a lot of
> legacy involved in that area.  Added Rafael in case he has any
> thoughts.

Well, actually, I'm not really familiar with IOMMUs, sorry.

I do think that initializing IOMMU before PCI enumeration would be better,
however.  At least if the ordering should be the same on all architectures,
which I suppose is the case, that's the one I'd choose.

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-07-31 Thread Bjorn Helgaas
[+cc Rafael, linux-acpi]

On Tue, Jul 30, 2013 at 6:35 PM, Takao Indoh  wrote:

> On x86, currently IOMMU initialization run *after* PCI enumeration, but
> what you are talking about is that it should be changed so that x86
> IOMMU initialization is done *before* PCI enumeration like sparc, right?

Yes.  I don't know whether or when that initialization order will ever
be changed, but I do think we should avoid building more
infrastructure that depends on the current order.

Changing the order is a pretty big deal because it's a lot more than
just the IOMMU.  Basically I think we should be enumerating ACPI
devices, including the IOMMU, before PCI devices, but there's a lot of
legacy involved in that area.  Added Rafael in case he has any
thoughts.

> Hmm, ok, I think I need to post attached patch to iommu list and
> discuss it including current order of x86 IOMMU initialization.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-07-31 Thread Vivek Goyal
On Tue, Jul 30, 2013 at 09:59:16AM -0600, Bjorn Helgaas wrote:
> On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh
>  wrote:
> > (2013/07/29 23:17), Bjorn Helgaas wrote:
> >> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh  
> >> wrote:
> >>> (2013/07/26 2:00), Bjorn Helgaas wrote:
> 
>  My point about IOMMU and PCI initialization order doesn't go away just
>  because it doesn't fit "kdump policy."  Having system initialization
>  occur in a logical order is far more important than making kdump work.
> >>>
> >>> My next plan is as follows. I think this is matched to logical order
> >>> on boot.
> >>>
> >>> drivers/pci/pci.c
> >>> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus 
> >>> *bus)
> >>>
> >>> drivers/iommu/intel-iommu.c
> >>> - On initialization, if IOMMU is already enabled, call this bus reset
> >>>function before disabling and re-enabling IOMMU.
> >>
> >> I raised this issue because of arches like sparc that enumerate the
> >> IOMMU before the PCI devices that use it.  In that situation, I think
> >> you're proposing this:
> >>
> >>panic kernel
> >>  enable IOMMU
> >>  panic
> >>kdump kernel
> >>  initialize IOMMU (already enabled)
> >>pci_reset_bus
> >>disable IOMMU
> >>enable IOMMU
> >>  enumerate PCI devices
> >>
> >> But the problem is that when you call pci_reset_bus(), you haven't
> >> enumerated the PCI devices, so you don't know what to reset.
> >
> > Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu
> > initialization is based on the assumption that enumeration of PCI devices
> > is already done. We can find target device from IOMMU page table instead
> > of scanning all devices in pci tree.
> >
> > Therefore, this idea is only for intel-iommu. Other architectures need
> > to implement their own reset code.
> 
> That's my point.  I'm opposed to adding code to PCI when it only
> benefits x86 and we know other arches will need a fundamentally
> different design.  I would rather have a design that can work for all
> arches.

I agree. It makes sense to work on a design which works for all
arches.

And that's when I feel that handling this problem at IOMMU level
makes more sense, if we can.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-07-31 Thread Vivek Goyal
On Thu, Jul 25, 2013 at 11:00:46AM -0600, Bjorn Helgaas wrote:
> On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh
>  wrote:
> > Sorry for letting this discussion slide, I was busy on other works:-(
> > Anyway, the summary of previous discussion is:
> > - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
> >   boot. This expects PCI enumeration is done before IOMMU
> >   initialization as follows.
> > (1) PCI enumeration
> > (2) fs_initcall ---> device reset
> > (3) IOMMU initialization
> > - This works on x86, but does not work on other architecture because
> >   IOMMU is initialized before PCI enumeration on some architectures. So,
> >   device reset should be done where IOMMU is initialized instead of
> >   initcall.
> > - Or, as another idea, we can reset devices in first kernel(panic kernel)
> >
> > Resetting devices in panic kernel is against kdump policy and seems not to
> > be good idea. So I think adding reset code into iommu initialization is
> > better. I'll post patches for that.
> 
> Of course nobody *wants* to do anything in the panic kernel.  But
> simply saying "it's against kdump policy and seems not to be a good
> idea" is not a technical argument.  There are things that are
> impractical to do in the kdump kernel, so they have to be done in the
> panic kernel even though we know the kernel is unreliable and the
> attempt may fail.

I think resetting all devices in crashed kernel is really a lot of
code. If there is a small piece of code, it can still be considered.

I don't know much about IOMMU or PCI or PCIE. But I am taking one step
back and discuss again the idea of not resetting the IOMMU in second
kernel.

I think resetting the bus is a good idea but just resetting PCIE
will solve only part of the problem and we will same issues with
devices on other buses.

So what sounds more appealing if we could fix this particular
problem at IOMMU level first (and continue to develp patches for
resetting various buses).

In the past also these ideas have been proposed that continue to
use translation table from first kernel. Retain those mappings and
don't reset IOMMU. Reserve some space for kdump mappings in first
kernel and use that reserved mapping space in second kernel. It
never got implemented though.

Bjorn, so what's the fundamental problem with this idea?

Also, what's wrong with DMAR error. If some device tried to do DMA,
and DMA was blocked because IOMMU got reset and mappings are no more
there, why does it lead to failure. Shouldn't we just reate limit 
error messages in such case and if device is needed, anyway driver
will reset it.

Other problem mentioned in this thread is PCI SERR. What is it? Is
it some kind of error device reports if it can't do DMA successfully.
Can these errors be simply ignored kdump kernel? This problem sounds
similar to a device keeping interrupt asserted in second kernel and
kernel simply disables the interrupt line if nobody claims the
interrupt.

IOW, it feels to me that we should handle the issue (DMAR error) at
IOMMU level first (instead of trying to make sure that by the time
we get to initialize IOMMU(), all devices in system have been quiesced
and nobody is doing DMA).

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-07-30 Thread Takao Indoh
(2013/07/31 12:11), Alex Williamson wrote:
> On Wed, 2013-07-31 at 09:35 +0900, Takao Indoh wrote:
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index e37fea6..c595997 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -3392,6 +3392,59 @@ int pci_reset_function(struct pci_dev *dev)
>>   EXPORT_SYMBOL_GPL(pci_reset_function);
>>   
>>   /**
>> + * pci_reset_bus - reset a PCI bus
>> + * @bus: PCI bus to reset
>> + *
>> + * Returns 0 if the bus was successfully reset or negative if failed.
>> + */
>> +int pci_reset_bus(struct pci_bus *bus)
>> +{
>> +struct pci_dev *pdev;
>> +u16 ctrl;
>> +
>> +if (!bus->self)
>> +return -ENOTTY;
>> +
>> +list_for_each_entry(pdev, &bus->devices, bus_list)
>> +if (pdev->subordinate)
>> +return -ENOTTY;
>> +
>> +/* Save config registers of children */
>> +list_for_each_entry(pdev, &bus->devices, bus_list) {
>> +dev_info(&pdev->dev, "Save state\n");
>> +pci_save_state(pdev);
>> +}
>> +
>> +dev_info(&bus->self->dev, "Reset Secondary bus\n");
>> +
>> +/* Assert Secondary Bus Reset */
>> +pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl);
>> +ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
>> +pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl);
>> +
>> +/* Read config again to flush previous write */
>> +pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl);
>> +
>> +msleep(2);
>> +
>> +/* De-assert Secondary Bus Reset */
>> +ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
>> +pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl);
>> +
>> +/* Wait for completion */
>> +msleep(1000);
> 
> 
> We already have secondary bus reset code in this file, why are we
> duplicating it here?  Also, why are these delays different from the
> existing code?  I'm also in need of a bus reset interface for when we
> assign all of the devices on a bus to userspace and do not have working
> function level resets per device.  I'll post my patch series and perhaps
> we can collaborate on a pci bus reset interface.  Thanks,

Good point. Yes, we have already similar functions.

pci_parent_bus_reset()
1. Assert secondary bus reset
2. msleep(100)
3. De-assert secondary bus reset
4. msleep(100)

aer_do_secondary_bus_reset()
1. Assert secondary bus reset
2. msleep(2)
3. De-assert secondary bus reset,
4. msleep(200)

To be honest, I wrote my reset code almost one years ago, so I forgot
the reason why I separated them.

Basically my reset code is based on aer_do_secondary_bus_reset(). The
different is waiting time after reset. My patch has 1000msec waiting
time.

At first my reset code is almost same as aer_do_secondary_bus_reset().
But when I tested the reset code, I found that on certain machine
restoring config registers failed after reset. It failed because 200msec
waiting time was too short. And I found the following description in
PCIe spec. According to this, I thought we should wait at least 1000msec.

6.6.1. Conventional Reset

* The Root Complex and/or system software must allow at least 1.0s
  after a Conventional Reset of a device, before it may determine that a
  device which fails to return a Successful Completion status for a
  valid Configuration Request is a broken device. This period is
  independent of how quickly Link training completes.

  Note: This delay is analogous to the Trhfa parameter specified for
  PCI/PCI-X, and is intended to allow an adequate amount of time for
  devices which require self initialization.

* When attempting a Configuration access to devices on a PCI or PCI-X
  bus segment behind a PCI Express/PCI(-X) Bridge, the timing parameter
  Trhfa must be respected.

And I saw patches you posted today, yes, your patch looks helpful for
my purpose:-)

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-07-30 Thread Alex Williamson
On Wed, 2013-07-31 at 09:35 +0900, Takao Indoh wrote:
> (2013/07/31 0:59), Bjorn Helgaas wrote:
> > On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh
> >  wrote:
> >> (2013/07/29 23:17), Bjorn Helgaas wrote:
> >>> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh  
> >>> wrote:
>  (2013/07/26 2:00), Bjorn Helgaas wrote:
> > 
> > My point about IOMMU and PCI initialization order doesn't go away just
> > because it doesn't fit "kdump policy."  Having system initialization
> > occur in a logical order is far more important than making kdump work.
> 
>  My next plan is as follows. I think this is matched to logical order
>  on boot.
> 
>  drivers/pci/pci.c
>  - Add function to reset bus, for example, pci_reset_bus(struct pci_bus 
>  *bus)
> 
>  drivers/iommu/intel-iommu.c
>  - On initialization, if IOMMU is already enabled, call this bus reset
>  function before disabling and re-enabling IOMMU.
> >>>
> >>> I raised this issue because of arches like sparc that enumerate the
> >>> IOMMU before the PCI devices that use it.  In that situation, I think
> >>> you're proposing this:
> >>>
> >>> panic kernel
> >>>   enable IOMMU
> >>>   panic
> >>> kdump kernel
> >>>   initialize IOMMU (already enabled)
> >>> pci_reset_bus
> >>> disable IOMMU
> >>> enable IOMMU
> >>>   enumerate PCI devices
> >>>
> >>> But the problem is that when you call pci_reset_bus(), you haven't
> >>> enumerated the PCI devices, so you don't know what to reset.
> >>
> >> Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu
> >> initialization is based on the assumption that enumeration of PCI devices
> >> is already done. We can find target device from IOMMU page table instead
> >> of scanning all devices in pci tree.
> >>
> >> Therefore, this idea is only for intel-iommu. Other architectures need
> >> to implement their own reset code.
> > 
> > That's my point.  I'm opposed to adding code to PCI when it only
> > benefits x86 and we know other arches will need a fundamentally
> > different design.  I would rather have a design that can work for all
> > arches.
> > 
> > If your implementation is totally implemented under arch/x86 (or in
> > intel-iommu.c, I guess), I can't object as much.  However, I think
> > that eventually even x86 should enumerate the IOMMUs via ACPI before
> > we enumerate PCI devices.
> > 
> > It's pretty clear that's how BIOS designers expect the OS to work.
> > For example, sec 8.7.3 of the Intel Virtualization Technology for
> > Directed I/O spec, rev 1.3, shows the expectation that remapping
> > hardware (IOMMU) is initialized before discovering the I/O hierarchy
> > below a hot-added host bridge.  Obviously you're not talking about a
> > hot-add scenario, but I think the same sequence should apply at
> > boot-time as well.
> 
> Of course I won't add something just for x86 into common PCI layer. I
> attach my new patch, though it is not well tested yet.
> 
> On x86, currently IOMMU initialization run *after* PCI enumeration, but
> what you are talking about is that it should be changed so that x86
> IOMMU initialization is done *before* PCI enumeration like sparc, right?
> 
> Hmm, ok, I think I need to post attached patch to iommu list and
> discuss it including current order of x86 IOMMU initialization.
> 
> Thanks,
> Takao Indoh
> ---
>  drivers/iommu/intel-iommu.c |   55 +-
>  drivers/pci/pci.c   |   53 
>  include/linux/pci.h |1
>  3 files changed, 108 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index eec0d3e..fb8a546 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3663,6 +3663,56 @@ static struct notifier_block device_nb = {
>   .notifier_call = device_notifier,
>  };
>  
> +/* Reset PCI device if its entry exists in DMAR table */
> +static void __init iommu_reset_devices(struct intel_iommu *iommu, u16 
> segment)
> +{
> + u64 addr;
> + struct root_entry *root;
> + struct context_entry *context;
> + int bus, devfn;
> + struct pci_dev *dev;
> +
> + addr = dmar_readq(iommu->reg + DMAR_RTADDR_REG);
> + if (!addr)
> + return;
> +
> + /*
> +  *  In the case of kdump, ioremap is needed because root-entry table
> +  *  exists in first kernel's memory area which is not mapped in second
> +  *  kernel
> +  */
> + root = (struct root_entry*)ioremap(addr, PAGE_SIZE);
> + if (!root)
> + return;
> +
> + for (bus=0; bus + if (!root_present(&root[bus]))
> + continue;
> +
> + context = (struct context_entry *)ioremap(
> + root[bus].val & VTD_PAGE_MASK, PAGE_SIZE);
> + if (!context)
> + continue;
> +
> + for (devfn=0; devfn + 

Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-07-30 Thread Takao Indoh
(2013/07/31 0:59), Bjorn Helgaas wrote:
> On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh
>  wrote:
>> (2013/07/29 23:17), Bjorn Helgaas wrote:
>>> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh  
>>> wrote:
 (2013/07/26 2:00), Bjorn Helgaas wrote:
> 
> My point about IOMMU and PCI initialization order doesn't go away just
> because it doesn't fit "kdump policy."  Having system initialization
> occur in a logical order is far more important than making kdump work.

 My next plan is as follows. I think this is matched to logical order
 on boot.

 drivers/pci/pci.c
 - Add function to reset bus, for example, pci_reset_bus(struct pci_bus 
 *bus)

 drivers/iommu/intel-iommu.c
 - On initialization, if IOMMU is already enabled, call this bus reset
 function before disabling and re-enabling IOMMU.
>>>
>>> I raised this issue because of arches like sparc that enumerate the
>>> IOMMU before the PCI devices that use it.  In that situation, I think
>>> you're proposing this:
>>>
>>> panic kernel
>>>   enable IOMMU
>>>   panic
>>> kdump kernel
>>>   initialize IOMMU (already enabled)
>>> pci_reset_bus
>>> disable IOMMU
>>> enable IOMMU
>>>   enumerate PCI devices
>>>
>>> But the problem is that when you call pci_reset_bus(), you haven't
>>> enumerated the PCI devices, so you don't know what to reset.
>>
>> Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu
>> initialization is based on the assumption that enumeration of PCI devices
>> is already done. We can find target device from IOMMU page table instead
>> of scanning all devices in pci tree.
>>
>> Therefore, this idea is only for intel-iommu. Other architectures need
>> to implement their own reset code.
> 
> That's my point.  I'm opposed to adding code to PCI when it only
> benefits x86 and we know other arches will need a fundamentally
> different design.  I would rather have a design that can work for all
> arches.
> 
> If your implementation is totally implemented under arch/x86 (or in
> intel-iommu.c, I guess), I can't object as much.  However, I think
> that eventually even x86 should enumerate the IOMMUs via ACPI before
> we enumerate PCI devices.
> 
> It's pretty clear that's how BIOS designers expect the OS to work.
> For example, sec 8.7.3 of the Intel Virtualization Technology for
> Directed I/O spec, rev 1.3, shows the expectation that remapping
> hardware (IOMMU) is initialized before discovering the I/O hierarchy
> below a hot-added host bridge.  Obviously you're not talking about a
> hot-add scenario, but I think the same sequence should apply at
> boot-time as well.

Of course I won't add something just for x86 into common PCI layer. I
attach my new patch, though it is not well tested yet.

On x86, currently IOMMU initialization run *after* PCI enumeration, but
what you are talking about is that it should be changed so that x86
IOMMU initialization is done *before* PCI enumeration like sparc, right?

Hmm, ok, I think I need to post attached patch to iommu list and
discuss it including current order of x86 IOMMU initialization.

Thanks,
Takao Indoh
---
 drivers/iommu/intel-iommu.c |   55 +-
 drivers/pci/pci.c   |   53 
 include/linux/pci.h |1
 3 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index eec0d3e..fb8a546 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3663,6 +3663,56 @@ static struct notifier_block device_nb = {
.notifier_call = device_notifier,
 };
 
+/* Reset PCI device if its entry exists in DMAR table */
+static void __init iommu_reset_devices(struct intel_iommu *iommu, u16 segment)
+{
+   u64 addr;
+   struct root_entry *root;
+   struct context_entry *context;
+   int bus, devfn;
+   struct pci_dev *dev;
+
+   addr = dmar_readq(iommu->reg + DMAR_RTADDR_REG);
+   if (!addr)
+   return;
+
+   /*
+*  In the case of kdump, ioremap is needed because root-entry table
+*  exists in first kernel's memory area which is not mapped in second
+*  kernel
+*/
+   root = (struct root_entry*)ioremap(addr, PAGE_SIZE);
+   if (!root)
+   return;
+
+   for (bus=0; busbus)) /* go to next bus */
+   break;
+   else /* Try per-function reset */
+   pci_reset_function(dev);
+
+   }
+   iounmap(context);
+   }
+   iounmap(root);
+}
+
 int __init intel_iommu_init(void)
 {
int ret = 0;
@@ -3687,8 +3737,11 @@ int __init intel_iommu_init(void)
continue;
 
iommu = drhd->iommu;
-   if (iommu->gcmd & DMA_GCMD_TE)
+   if (iommu->gcmd & DMA_GCMD_TE) {
+

Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-07-30 Thread Bjorn Helgaas
On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh
 wrote:
> (2013/07/29 23:17), Bjorn Helgaas wrote:
>> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh  
>> wrote:
>>> (2013/07/26 2:00), Bjorn Helgaas wrote:

 My point about IOMMU and PCI initialization order doesn't go away just
 because it doesn't fit "kdump policy."  Having system initialization
 occur in a logical order is far more important than making kdump work.
>>>
>>> My next plan is as follows. I think this is matched to logical order
>>> on boot.
>>>
>>> drivers/pci/pci.c
>>> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus)
>>>
>>> drivers/iommu/intel-iommu.c
>>> - On initialization, if IOMMU is already enabled, call this bus reset
>>>function before disabling and re-enabling IOMMU.
>>
>> I raised this issue because of arches like sparc that enumerate the
>> IOMMU before the PCI devices that use it.  In that situation, I think
>> you're proposing this:
>>
>>panic kernel
>>  enable IOMMU
>>  panic
>>kdump kernel
>>  initialize IOMMU (already enabled)
>>pci_reset_bus
>>disable IOMMU
>>enable IOMMU
>>  enumerate PCI devices
>>
>> But the problem is that when you call pci_reset_bus(), you haven't
>> enumerated the PCI devices, so you don't know what to reset.
>
> Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu
> initialization is based on the assumption that enumeration of PCI devices
> is already done. We can find target device from IOMMU page table instead
> of scanning all devices in pci tree.
>
> Therefore, this idea is only for intel-iommu. Other architectures need
> to implement their own reset code.

That's my point.  I'm opposed to adding code to PCI when it only
benefits x86 and we know other arches will need a fundamentally
different design.  I would rather have a design that can work for all
arches.

If your implementation is totally implemented under arch/x86 (or in
intel-iommu.c, I guess), I can't object as much.  However, I think
that eventually even x86 should enumerate the IOMMUs via ACPI before
we enumerate PCI devices.

It's pretty clear that's how BIOS designers expect the OS to work.
For example, sec 8.7.3 of the Intel Virtualization Technology for
Directed I/O spec, rev 1.3, shows the expectation that remapping
hardware (IOMMU) is initialized before discovering the I/O hierarchy
below a hot-added host bridge.  Obviously you're not talking about a
hot-add scenario, but I think the same sequence should apply at
boot-time as well.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-07-29 Thread Takao Indoh
(2013/07/29 23:17), Bjorn Helgaas wrote:
> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh  
> wrote:
>> (2013/07/26 2:00), Bjorn Helgaas wrote:
>>> On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh
>>>  wrote:
 Sorry for letting this discussion slide, I was busy on other works:-(
 Anyway, the summary of previous discussion is:
 - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
 boot. This expects PCI enumeration is done before IOMMU
 initialization as follows.
   (1) PCI enumeration
   (2) fs_initcall ---> device reset
   (3) IOMMU initialization
 - This works on x86, but does not work on other architecture because
 IOMMU is initialized before PCI enumeration on some architectures. So,
 device reset should be done where IOMMU is initialized instead of
 initcall.
 - Or, as another idea, we can reset devices in first kernel(panic kernel)

 Resetting devices in panic kernel is against kdump policy and seems not to
 be good idea. So I think adding reset code into iommu initialization is
 better. I'll post patches for that.
>>>
>>> Of course nobody *wants* to do anything in the panic kernel.  But
>>> simply saying "it's against kdump policy and seems not to be a good
>>> idea" is not a technical argument.  There are things that are
>>> impractical to do in the kdump kernel, so they have to be done in the
>>> panic kernel even though we know the kernel is unreliable and the
>>> attempt may fail.
>>
>> Accessing kernel data in panic kernel causes panic again, so
>> - Don't touch kernel data in panic situation
>> - Jump to kdump kernel as quickly as possible, and do things in safe
>>kernel
>> These are basic "kdump policy". Of course if there are any works which
>> we cannot do in kdump kernel and can do only in panic kernel, for
>> example saving registers or stopping cpus, we should do them in panic
>> kernel.
>>
>> Resetting devices in panic kernel is worth considering if we can safely
>> find pci_dev and reset it, but I have no idea how to do that because
>> for example struct pci_dev may be borken.
> 
> Nobody can guarantee that the panic kernel can do *anything* safely
> because any arbitrary kernel data or text may be corrupted.  But if
> you consider any specific data structure, e.g., CPU or PCI device
> lists, it's not very likely that it will be corrupted.

To reset device we need to scan pci device tree using for_each_pci_dev.
Something like bust_spinlocks() to clear pci_lock forcibly is needed.
Vivek, adding these into kdump is acceptable for you? Or any other
ideas? I think iterating over a list like for_each_pci_dev is dangerous.

> 
>>> My point about IOMMU and PCI initialization order doesn't go away just
>>> because it doesn't fit "kdump policy."  Having system initialization
>>> occur in a logical order is far more important than making kdump work.
>>
>> My next plan is as follows. I think this is matched to logical order
>> on boot.
>>
>> drivers/pci/pci.c
>> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus)
>>
>> drivers/iommu/intel-iommu.c
>> - On initialization, if IOMMU is already enabled, call this bus reset
>>function before disabling and re-enabling IOMMU.
> 
> I raised this issue because of arches like sparc that enumerate the
> IOMMU before the PCI devices that use it.  In that situation, I think
> you're proposing this:
> 
>panic kernel
>  enable IOMMU
>  panic
>kdump kernel
>  initialize IOMMU (already enabled)
>pci_reset_bus
>disable IOMMU
>enable IOMMU
>  enumerate PCI devices
> 
> But the problem is that when you call pci_reset_bus(), you haven't
> enumerated the PCI devices, so you don't know what to reset.

Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu
initialization is based on the assumption that enumeration of PCI devices 
is already done. We can find target device from IOMMU page table instead
of scanning all devices in pci tree.

Therefore, this idea is only for intel-iommu. Other architectures need
to implement their own reset code.

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-07-29 Thread Bjorn Helgaas
On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh  wrote:
> (2013/07/26 2:00), Bjorn Helgaas wrote:
>> On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh
>>  wrote:
>>> Sorry for letting this discussion slide, I was busy on other works:-(
>>> Anyway, the summary of previous discussion is:
>>> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
>>>boot. This expects PCI enumeration is done before IOMMU
>>>initialization as follows.
>>>  (1) PCI enumeration
>>>  (2) fs_initcall ---> device reset
>>>  (3) IOMMU initialization
>>> - This works on x86, but does not work on other architecture because
>>>IOMMU is initialized before PCI enumeration on some architectures. So,
>>>device reset should be done where IOMMU is initialized instead of
>>>initcall.
>>> - Or, as another idea, we can reset devices in first kernel(panic kernel)
>>>
>>> Resetting devices in panic kernel is against kdump policy and seems not to
>>> be good idea. So I think adding reset code into iommu initialization is
>>> better. I'll post patches for that.
>>
>> Of course nobody *wants* to do anything in the panic kernel.  But
>> simply saying "it's against kdump policy and seems not to be a good
>> idea" is not a technical argument.  There are things that are
>> impractical to do in the kdump kernel, so they have to be done in the
>> panic kernel even though we know the kernel is unreliable and the
>> attempt may fail.
>
> Accessing kernel data in panic kernel causes panic again, so
> - Don't touch kernel data in panic situation
> - Jump to kdump kernel as quickly as possible, and do things in safe
>   kernel
> These are basic "kdump policy". Of course if there are any works which
> we cannot do in kdump kernel and can do only in panic kernel, for
> example saving registers or stopping cpus, we should do them in panic
> kernel.
>
> Resetting devices in panic kernel is worth considering if we can safely
> find pci_dev and reset it, but I have no idea how to do that because
> for example struct pci_dev may be borken.

Nobody can guarantee that the panic kernel can do *anything* safely
because any arbitrary kernel data or text may be corrupted.  But if
you consider any specific data structure, e.g., CPU or PCI device
lists, it's not very likely that it will be corrupted.

>> My point about IOMMU and PCI initialization order doesn't go away just
>> because it doesn't fit "kdump policy."  Having system initialization
>> occur in a logical order is far more important than making kdump work.
>
> My next plan is as follows. I think this is matched to logical order
> on boot.
>
> drivers/pci/pci.c
> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus)
>
> drivers/iommu/intel-iommu.c
> - On initialization, if IOMMU is already enabled, call this bus reset
>   function before disabling and re-enabling IOMMU.

I raised this issue because of arches like sparc that enumerate the
IOMMU before the PCI devices that use it.  In that situation, I think
you're proposing this:

  panic kernel
enable IOMMU
panic
  kdump kernel
initialize IOMMU (already enabled)
  pci_reset_bus
  disable IOMMU
  enable IOMMU
enumerate PCI devices

But the problem is that when you call pci_reset_bus(), you haven't
enumerated the PCI devices, so you don't know what to reset.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-07-28 Thread Takao Indoh
(2013/07/26 2:00), Bjorn Helgaas wrote:
> On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh
>  wrote:
>> Sorry for letting this discussion slide, I was busy on other works:-(
>> Anyway, the summary of previous discussion is:
>> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
>>boot. This expects PCI enumeration is done before IOMMU
>>initialization as follows.
>>  (1) PCI enumeration
>>  (2) fs_initcall ---> device reset
>>  (3) IOMMU initialization
>> - This works on x86, but does not work on other architecture because
>>IOMMU is initialized before PCI enumeration on some architectures. So,
>>device reset should be done where IOMMU is initialized instead of
>>initcall.
>> - Or, as another idea, we can reset devices in first kernel(panic kernel)
>>
>> Resetting devices in panic kernel is against kdump policy and seems not to
>> be good idea. So I think adding reset code into iommu initialization is
>> better. I'll post patches for that.
> 
> Of course nobody *wants* to do anything in the panic kernel.  But
> simply saying "it's against kdump policy and seems not to be a good
> idea" is not a technical argument.  There are things that are
> impractical to do in the kdump kernel, so they have to be done in the
> panic kernel even though we know the kernel is unreliable and the
> attempt may fail.

Accessing kernel data in panic kernel causes panic again, so
- Don't touch kernel data in panic situation
- Jump to kdump kernel as quickly as possible, and do things in safe
  kernel
These are basic "kdump policy". Of course if there are any works which
we cannot do in kdump kernel and can do only in panic kernel, for
example saving registers or stopping cpus, we should do them in panic
kernel.

Resetting devices in panic kernel is worth considering if we can safely
find pci_dev and reset it, but I have no idea how to do that because
for example struct pci_dev may be borken.

> 
> My point about IOMMU and PCI initialization order doesn't go away just
> because it doesn't fit "kdump policy."  Having system initialization
> occur in a logical order is far more important than making kdump work.

My next plan is as follows. I think this is matched to logical order
on boot.

drivers/pci/pci.c
- Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus)

drivers/iommu/intel-iommu.c
- On initialization, if IOMMU is already enabled, call this bus reset
  function before disabling and re-enabling IOMMU.

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-07-28 Thread Takao Indoh
(2013/07/25 23:24), Vivek Goyal wrote:
> On Wed, Jul 24, 2013 at 03:29:58PM +0900, Takao Indoh wrote:
>> Sorry for letting this discussion slide, I was busy on other works:-(
>> Anyway, the summary of previous discussion is:
>> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
>>boot. This expects PCI enumeration is done before IOMMU
>>initialization as follows.
>>  (1) PCI enumeration
>>  (2) fs_initcall ---> device reset
>>  (3) IOMMU initialization
>> - This works on x86, but does not work on other architecture because
>>IOMMU is initialized before PCI enumeration on some architectures. So,
>>device reset should be done where IOMMU is initialized instead of
>>initcall.
>> - Or, as another idea, we can reset devices in first kernel(panic kernel)
>>
>> Resetting devices in panic kernel is against kdump policy and seems not to
>> be good idea. So I think adding reset code into iommu initialization is
>> better. I'll post patches for that.
> 
> I don't understand all the details but I agree that idea of trying to
> reset IOMMU in crashed kernel might not fly.
> 
>>
>> Another discussion point is how to handle buggy devices. Resetting buggy
>> devices makes system more unstable. One of ideas is using boot parameter
>> so that user can choose to reset devices or not.
> 
> So who would decide which device is buggy and don't reset it. Give
> some details here.

I found the case that kdump does not work after resetting devices and
it works when removing reset patch. The cause of problem is a bug of
PCIe switch chip. If there is boot parameter not to reset devices,
user can use it as workaround.

I think in this case we should add PCI quirk to avoid this buggy
hardware, but we need to wait errata from vendor and it basically takes
long time.

> 
> Can't we simply blacklist associated module, so that it never loads
> and then it never tries to reset the devices?
> 

So you mean that device reset should be done on its driver loading?

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-07-25 Thread Bjorn Helgaas
On Wed, Jul 24, 2013 at 12:29 AM, Takao Indoh
 wrote:
> Sorry for letting this discussion slide, I was busy on other works:-(
> Anyway, the summary of previous discussion is:
> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
>   boot. This expects PCI enumeration is done before IOMMU
>   initialization as follows.
> (1) PCI enumeration
> (2) fs_initcall ---> device reset
> (3) IOMMU initialization
> - This works on x86, but does not work on other architecture because
>   IOMMU is initialized before PCI enumeration on some architectures. So,
>   device reset should be done where IOMMU is initialized instead of
>   initcall.
> - Or, as another idea, we can reset devices in first kernel(panic kernel)
>
> Resetting devices in panic kernel is against kdump policy and seems not to
> be good idea. So I think adding reset code into iommu initialization is
> better. I'll post patches for that.

Of course nobody *wants* to do anything in the panic kernel.  But
simply saying "it's against kdump policy and seems not to be a good
idea" is not a technical argument.  There are things that are
impractical to do in the kdump kernel, so they have to be done in the
panic kernel even though we know the kernel is unreliable and the
attempt may fail.

My point about IOMMU and PCI initialization order doesn't go away just
because it doesn't fit "kdump policy."  Having system initialization
occur in a logical order is far more important than making kdump work.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-07-25 Thread Vivek Goyal
On Wed, Jul 24, 2013 at 03:29:58PM +0900, Takao Indoh wrote:
> Sorry for letting this discussion slide, I was busy on other works:-(
> Anyway, the summary of previous discussion is:
> - My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
>   boot. This expects PCI enumeration is done before IOMMU
>   initialization as follows.
> (1) PCI enumeration
> (2) fs_initcall ---> device reset
> (3) IOMMU initialization
> - This works on x86, but does not work on other architecture because
>   IOMMU is initialized before PCI enumeration on some architectures. So,
>   device reset should be done where IOMMU is initialized instead of
>   initcall.
> - Or, as another idea, we can reset devices in first kernel(panic kernel)
> 
> Resetting devices in panic kernel is against kdump policy and seems not to
> be good idea. So I think adding reset code into iommu initialization is
> better. I'll post patches for that.

I don't understand all the details but I agree that idea of trying to
reset IOMMU in crashed kernel might not fly.

> 
> Another discussion point is how to handle buggy devices. Resetting buggy
> devices makes system more unstable. One of ideas is using boot parameter
> so that user can choose to reset devices or not.

So who would decide which device is buggy and don't reset it. Give
some details here.

Can't we simply blacklist associated module, so that it never loads
and then it never tries to reset the devices?

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-07-23 Thread Takao Indoh
Sorry for letting this discussion slide, I was busy on other works:-(
Anyway, the summary of previous discussion is:
- My patch adds new initcall(fs_initcall) to reset all PCIe endpoints on
  boot. This expects PCI enumeration is done before IOMMU
  initialization as follows.
(1) PCI enumeration
(2) fs_initcall ---> device reset
(3) IOMMU initialization
- This works on x86, but does not work on other architecture because
  IOMMU is initialized before PCI enumeration on some architectures. So,
  device reset should be done where IOMMU is initialized instead of
  initcall.
- Or, as another idea, we can reset devices in first kernel(panic kernel)

Resetting devices in panic kernel is against kdump policy and seems not to
be good idea. So I think adding reset code into iommu initialization is
better. I'll post patches for that.

Another discussion point is how to handle buggy devices. Resetting buggy
devices makes system more unstable. One of ideas is using boot parameter
so that user can choose to reset devices or not. An existed parameter
"reset_devices" is not helpful for this purpose because it is always
necessary for kdump and user cannot get rid of it. Introducing new boot
parameter seems to be unpopular for maintainers. Any ideas?

Thanks,
Takao Indoh

(2013/06/14 11:11), Takao Indoh wrote:
> (2013/06/13 12:41), Bjorn Helgaas wrote:
>> On Wed, Jun 12, 2013 at 8:44 PM, Takao Indoh  
>> wrote:
>>> (2013/06/12 13:45), Bjorn Helgaas wrote:
 [+cc Vivek, Haren; sorry I didn't think to add you earlier]

 On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh
  wrote:
> (2013/06/11 11:20), Bjorn Helgaas wrote:

>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>> yet, but the current hook isn't anchored anywhere -- it's just an
>> fs_initcall() that doesn't give the reader any clue about the
>> connection between the reset and the problem it's solving.
>>
>> If we do something like this patch, I think it needs to be done at the
>> point where we enable or disable the IOMMU.  That way, it's connected
>> to the important event, and there's a clue about how to make
>> corresponding fixes for other IOMMUs.
>
> Ok. pci_iommu_init() is appropriate place to add this hook?

 I looked at various IOMMU init places today, and it's far more
 complicated and varied than I had hoped.

 This reset scheme depends on enumerating PCI devices before we
 initialize the IOMMU used by those devices.  x86 works that way today,
 but not all architectures do (see the sparc pci_fire_pbm_init(), for
>>>
>>> Sorry, could you tell me which part depends on architecture?
>>
>> Your patch works if PCIe devices are reset before the kdump kernel
>> enables the IOMMU.  On x86, this is possible because PCI enumeration
>> happens before the IOMMU initialization.  On sparc, the IOMMU is
>> initialized before PCI devices are enumerated, so there would still be
>> a window where ongoing DMA could cause an IOMMU error.
> 
> Ok, understood, thanks.
> 
> Hmmm, it seems to be difficult to find out method which is common to
> all architectures. So, what I can do for now is introducing reset scheme
> which is only for x86.
> 
> 1) Change this patch so that it work only on x86 platform. For example
> call this reset code from x86_init.iommu.iommu_init() instead of
> fs_initcall.
> 
> Or another idea is:
> 
> 2) Enumerate PCI devices in IOMMU layer. That is:
> PCI layer
>   Just provide interface to reset given strcut pci_dev. Maybe
>   pci_reset_function() looks good for this purpose.
> IOMMU layer
>   Determine which devices should be reset. On kernel boot, check if
>   IOMMU is already active or not, and if active, check IOMMU page
>   table and reset devices whose entry exists there.
> 
>> Of course, it might be possible to reorganize the sparc code to to the
>> IOMMU init *after* it enumerates PCI devices.  But I think that change
>> would be hard to justify.
>>
>> And I think even on x86, it would be better if we did the IOMMU init
>> before PCI enumeration -- the PCI devices depend on the IOMMU, so
>> logically the IOMMU should be initialized first so the PCI devices can
>> be associated with it as they are enumerated.
> 
> So third idea is:
> 
> 3) Do reset before PCI enumeration(arch_initcall_sync or somewhere). We
> need to implement new code to enumerate PCI devices and reset them
> for this purpose.
> 
> Idea 2 is not difficult to implement, but one problem is that this
> method may be dangerous. We need to scan IOMMU page table which is used
> in previous kernel, but it may be broken. Idea 3 seems to be difficult
> to implement...
> 
> 
>>
 example).  And I think conceptually, the IOMMU should be enumerated
 and initialized *before* the devices that use it.

 So I'm uncomfortable with that aspect of this scheme.

 It would be at least conceivable to reset the de

Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-06-13 Thread Takao Indoh
(2013/06/13 12:41), Bjorn Helgaas wrote:
> On Wed, Jun 12, 2013 at 8:44 PM, Takao Indoh  
> wrote:
>> (2013/06/12 13:45), Bjorn Helgaas wrote:
>>> [+cc Vivek, Haren; sorry I didn't think to add you earlier]
>>>
>>> On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh
>>>  wrote:
 (2013/06/11 11:20), Bjorn Helgaas wrote:
>>>
> I'm not sure you need to reset legacy devices (or non-PCI devices)
> yet, but the current hook isn't anchored anywhere -- it's just an
> fs_initcall() that doesn't give the reader any clue about the
> connection between the reset and the problem it's solving.
>
> If we do something like this patch, I think it needs to be done at the
> point where we enable or disable the IOMMU.  That way, it's connected
> to the important event, and there's a clue about how to make
> corresponding fixes for other IOMMUs.

 Ok. pci_iommu_init() is appropriate place to add this hook?
>>>
>>> I looked at various IOMMU init places today, and it's far more
>>> complicated and varied than I had hoped.
>>>
>>> This reset scheme depends on enumerating PCI devices before we
>>> initialize the IOMMU used by those devices.  x86 works that way today,
>>> but not all architectures do (see the sparc pci_fire_pbm_init(), for
>>
>> Sorry, could you tell me which part depends on architecture?
> 
> Your patch works if PCIe devices are reset before the kdump kernel
> enables the IOMMU.  On x86, this is possible because PCI enumeration
> happens before the IOMMU initialization.  On sparc, the IOMMU is
> initialized before PCI devices are enumerated, so there would still be
> a window where ongoing DMA could cause an IOMMU error.

Ok, understood, thanks.

Hmmm, it seems to be difficult to find out method which is common to
all architectures. So, what I can do for now is introducing reset scheme
which is only for x86.

1) Change this patch so that it work only on x86 platform. For example
   call this reset code from x86_init.iommu.iommu_init() instead of
   fs_initcall.

Or another idea is:

2) Enumerate PCI devices in IOMMU layer. That is:
   PCI layer
 Just provide interface to reset given strcut pci_dev. Maybe
 pci_reset_function() looks good for this purpose.
   IOMMU layer
 Determine which devices should be reset. On kernel boot, check if
 IOMMU is already active or not, and if active, check IOMMU page
 table and reset devices whose entry exists there.

> Of course, it might be possible to reorganize the sparc code to to the
> IOMMU init *after* it enumerates PCI devices.  But I think that change
> would be hard to justify.
> 
> And I think even on x86, it would be better if we did the IOMMU init
> before PCI enumeration -- the PCI devices depend on the IOMMU, so
> logically the IOMMU should be initialized first so the PCI devices can
> be associated with it as they are enumerated.

So third idea is:

3) Do reset before PCI enumeration(arch_initcall_sync or somewhere). We
   need to implement new code to enumerate PCI devices and reset them
   for this purpose.

Idea 2 is not difficult to implement, but one problem is that this
method may be dangerous. We need to scan IOMMU page table which is used
in previous kernel, but it may be broken. Idea 3 seems to be difficult
to implement...


> 
>>> example).  And I think conceptually, the IOMMU should be enumerated
>>> and initialized *before* the devices that use it.
>>>
>>> So I'm uncomfortable with that aspect of this scheme.
>>>
>>> It would be at least conceivable to reset the devices in the system
>>> kernel, before the kexec.  I know we want to do as little as possible
>>> in the crashing kernel, but it's at least a possibility, and it might
>>> be cleaner.
>>
>> I bet this will be not accepted by kdump maintainer. Everything in panic
>> kernel is unreliable.
> 
> kdump is inherently unreliable.  The kdump kernel doesn't start from
> an arbitrary machine state.  We don't expect it to tolerate all CPUs
> running, for example.  Maybe it should be expected to tolerate PCI
> devices running, either.

What I wanted to say is that any resources of first kernel are
unreliable. Under panic situation, struct pci_dev tree may be broken, or
pci_lock may be already hold by someone, etc. So, if we do this in first
kernel, maybe kdump needs its own code to enumerate PCI devices and
reset them.

Vivek?

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-06-12 Thread Bjorn Helgaas
On Wed, Jun 12, 2013 at 8:44 PM, Takao Indoh  wrote:
> (2013/06/12 13:45), Bjorn Helgaas wrote:
>> [+cc Vivek, Haren; sorry I didn't think to add you earlier]
>>
>> On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh
>>  wrote:
>>> (2013/06/11 11:20), Bjorn Helgaas wrote:
>>
 I'm not sure you need to reset legacy devices (or non-PCI devices)
 yet, but the current hook isn't anchored anywhere -- it's just an
 fs_initcall() that doesn't give the reader any clue about the
 connection between the reset and the problem it's solving.

 If we do something like this patch, I think it needs to be done at the
 point where we enable or disable the IOMMU.  That way, it's connected
 to the important event, and there's a clue about how to make
 corresponding fixes for other IOMMUs.
>>>
>>> Ok. pci_iommu_init() is appropriate place to add this hook?
>>
>> I looked at various IOMMU init places today, and it's far more
>> complicated and varied than I had hoped.
>>
>> This reset scheme depends on enumerating PCI devices before we
>> initialize the IOMMU used by those devices.  x86 works that way today,
>> but not all architectures do (see the sparc pci_fire_pbm_init(), for
>
> Sorry, could you tell me which part depends on architecture?

Your patch works if PCIe devices are reset before the kdump kernel
enables the IOMMU.  On x86, this is possible because PCI enumeration
happens before the IOMMU initialization.  On sparc, the IOMMU is
initialized before PCI devices are enumerated, so there would still be
a window where ongoing DMA could cause an IOMMU error.

Of course, it might be possible to reorganize the sparc code to to the
IOMMU init *after* it enumerates PCI devices.  But I think that change
would be hard to justify.

And I think even on x86, it would be better if we did the IOMMU init
before PCI enumeration -- the PCI devices depend on the IOMMU, so
logically the IOMMU should be initialized first so the PCI devices can
be associated with it as they are enumerated.

>> example).  And I think conceptually, the IOMMU should be enumerated
>> and initialized *before* the devices that use it.
>>
>> So I'm uncomfortable with that aspect of this scheme.
>>
>> It would be at least conceivable to reset the devices in the system
>> kernel, before the kexec.  I know we want to do as little as possible
>> in the crashing kernel, but it's at least a possibility, and it might
>> be cleaner.
>
> I bet this will be not accepted by kdump maintainer. Everything in panic
> kernel is unreliable.

kdump is inherently unreliable.  The kdump kernel doesn't start from
an arbitrary machine state.  We don't expect it to tolerate all CPUs
running, for example.  Maybe it should be expected to tolerate PCI
devices running, either.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-06-12 Thread Takao Indoh
(2013/06/12 22:19), Don Dutile wrote:
> On 06/11/2013 07:19 PM, Sumner, William wrote:
>>
>>> (2013/06/11 11:20), Bjorn Helgaas wrote:
 On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh  
 wrote:
> (2013/06/07 13:14), Bjorn Helgaas wrote:

>> One thing I'm not sure about is that you are only resetting PCIe
>> devices, but I don't think the problem is actually specific to PCIe,
>> is it?  I think the same issue could occur on any system with an
>> IOMMU.  In the x86 PC world, most IOMMUs are connected with PCIe, but
>> there are systems with IOMMUs for plain old PCI devices, e.g.,
>> PA-RISC.
>
> Right, this is not specific to PCIe. The reasons why the target is only
> PCIe is just to make algorithm to reset simple. It is possible to reset
> legacy PCI devices in my patch, but code becomes somewhat complicated. I
> thought recently most systems used PCIe and there was little demand for
> resetting legacy PCI. Therefore I decided not to reset legacy PCI
> devices, but I'll do if there are requests :-)

 I'm not sure you need to reset legacy devices (or non-PCI devices)
 yet, but the current hook isn't anchored anywhere -- it's just an
 fs_initcall() that doesn't give the reader any clue about the
 connection between the reset and the problem it's solving.

 If we do something like this patch, I think it needs to be done at the
 point where we enable or disable the IOMMU.  That way, it's connected
 to the important event, and there's a clue about how to make
 corresponding fixes for other IOMMUs.
>>>
>>> Ok. pci_iommu_init() is appropriate place to add this hook?
>>>
 We already have a "reset_devices" boot option.  This is for the same
 purpose, as far as I can tell, and I'm not sure there's value in
 having both "reset_devices" and "pci=pcie_reset_endpoint_devices".  In
 fact, there's nothing specific even to PCI here.  The Intel VT-d docs
 seem carefully written so they could apply to either PCIe or non-PCI
 devices.
>>>
>>> Yeah, I can integrate this option into reset_devices. The reason why
>>> I separate them is to avoid regression.
>>>
>>> I have tested my patch on many machines and basically it worked, but I
>>> found two machines where this reset patch caused problem. The first one
>>> was caused by bug of raid card firmware. After updating firmware, this
>>> patch worked. The second one was due to bug of PCIe switch chip. I
>>> reported this bug to the vendor but it is not fixed yet.
>>>
>>> Anyway, this patch may cause problems on such a buggy machine, so I
>>> introduced new boot parameter so that user can enable or disable this
>>> function aside from reset_devices.
>>>
>>> Actually Vivek Goyal, kdump maintainer said same thing. He proposed using
>>> reset_devices instead of adding new one, and using quirk or something to
>>> avoid such a buggy devices.
>>
>> With respect to "and using quirk or something to avoid such buggy devices",
>> I believe that it will be necessary to provide a mechanism for devices that
>> need special handling to do the reset -- perhaps something like a list
>> of tuples: (device_type, function_to_call) with a default function_to_call
>> when the device_type is not found in the list.  These functions would
>> need to be physically separate from the device driver because if the device
>> is present it needs to be reset even if the crash kernel chooses not to load
>> the driver for that device.
>>
>>>
>>> So, basically I agree with using reset_devices, but I want to prepare
>>> workaround in case this reset causes something wrong.
>>>
>> I like the ability to specify the original "reset_devices" separately from
>> invoking this new mechanism.  With so many different uses for Linux in
>> so many different environments and with so many different device drivers
>> it seems reasonable to keep the ability to tell the device drivers to
>> reset their devices -- instead of pulling the reset line on all devices.
>>
>> I also like the ability to invoke the new reset feature separately from
>> telling the device drivers to do it.
>>
>>>

>> I tried to make a list of the interesting scenarios and the events
>> that are relevant to this problem:
>>
>>Case 1: IOMMU off in system, off in kdump kernel
>>  system kernel leaves IOMMU off
>>DMA targets system-kernel memory
>>  kexec to kdump kernel (IOMMU off, devices untouched)
>>DMA targets system-kernel memory (harmless)
>>  kdump kernel re-inits device
>>DMA targets kdump-kernel memory
>>
>>Case 2: IOMMU off in system kernel, on in kdump kernel
>>  system kernel leaves IOMMU off
>>DMA targets system-kernel memory
>>  kexec to kdump kernel (IOMMU off, devices untouched)
>>DMA targets system-kernel memory (harmless)
>>  kdu

Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-06-12 Thread Takao Indoh
(2013/06/12 13:45), Bjorn Helgaas wrote:
> [+cc Vivek, Haren; sorry I didn't think to add you earlier]
> 
> On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh
>  wrote:
>> (2013/06/11 11:20), Bjorn Helgaas wrote:
> 
>>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>>> yet, but the current hook isn't anchored anywhere -- it's just an
>>> fs_initcall() that doesn't give the reader any clue about the
>>> connection between the reset and the problem it's solving.
>>>
>>> If we do something like this patch, I think it needs to be done at the
>>> point where we enable or disable the IOMMU.  That way, it's connected
>>> to the important event, and there's a clue about how to make
>>> corresponding fixes for other IOMMUs.
>>
>> Ok. pci_iommu_init() is appropriate place to add this hook?
> 
> I looked at various IOMMU init places today, and it's far more
> complicated and varied than I had hoped.
> 
> This reset scheme depends on enumerating PCI devices before we
> initialize the IOMMU used by those devices.  x86 works that way today,
> but not all architectures do (see the sparc pci_fire_pbm_init(), for

Sorry, could you tell me which part depends on architecture?

> example).  And I think conceptually, the IOMMU should be enumerated
> and initialized *before* the devices that use it.
> 
> So I'm uncomfortable with that aspect of this scheme.
> 
> It would be at least conceivable to reset the devices in the system
> kernel, before the kexec.  I know we want to do as little as possible
> in the crashing kernel, but it's at least a possibility, and it might
> be cleaner.

I bet this will be not accepted by kdump maintainer. Everything in panic
kernel is unreliable.

Thanks,
Takao Indoh

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-06-12 Thread Don Dutile

On 06/11/2013 07:19 PM, Sumner, William wrote:



(2013/06/11 11:20), Bjorn Helgaas wrote:

On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh  wrote:

(2013/06/07 13:14), Bjorn Helgaas wrote:



One thing I'm not sure about is that you are only resetting PCIe
devices, but I don't think the problem is actually specific to PCIe,
is it?  I think the same issue could occur on any system with an
IOMMU.  In the x86 PC world, most IOMMUs are connected with PCIe, but
there are systems with IOMMUs for plain old PCI devices, e.g.,
PA-RISC.


Right, this is not specific to PCIe. The reasons why the target is only
PCIe is just to make algorithm to reset simple. It is possible to reset
legacy PCI devices in my patch, but code becomes somewhat complicated. I
thought recently most systems used PCIe and there was little demand for
resetting legacy PCI. Therefore I decided not to reset legacy PCI
devices, but I'll do if there are requests :-)


I'm not sure you need to reset legacy devices (or non-PCI devices)
yet, but the current hook isn't anchored anywhere -- it's just an
fs_initcall() that doesn't give the reader any clue about the
connection between the reset and the problem it's solving.

If we do something like this patch, I think it needs to be done at the
point where we enable or disable the IOMMU.  That way, it's connected
to the important event, and there's a clue about how to make
corresponding fixes for other IOMMUs.


Ok. pci_iommu_init() is appropriate place to add this hook?


We already have a "reset_devices" boot option.  This is for the same
purpose, as far as I can tell, and I'm not sure there's value in
having both "reset_devices" and "pci=pcie_reset_endpoint_devices".  In
fact, there's nothing specific even to PCI here.  The Intel VT-d docs
seem carefully written so they could apply to either PCIe or non-PCI
devices.


Yeah, I can integrate this option into reset_devices. The reason why
I separate them is to avoid regression.

I have tested my patch on many machines and basically it worked, but I
found two machines where this reset patch caused problem. The first one
was caused by bug of raid card firmware. After updating firmware, this
patch worked. The second one was due to bug of PCIe switch chip. I
reported this bug to the vendor but it is not fixed yet.

Anyway, this patch may cause problems on such a buggy machine, so I
introduced new boot parameter so that user can enable or disable this
function aside from reset_devices.

Actually Vivek Goyal, kdump maintainer said same thing. He proposed using
reset_devices instead of adding new one, and using quirk or something to
avoid such a buggy devices.


With respect to "and using quirk or something to avoid such buggy devices",
I believe that it will be necessary to provide a mechanism for devices that
need special handling to do the reset -- perhaps something like a list
of tuples: (device_type, function_to_call) with a default function_to_call
when the device_type is not found in the list.  These functions would
need to be physically separate from the device driver because if the device
is present it needs to be reset even if the crash kernel chooses not to load
the driver for that device.



So, basically I agree with using reset_devices, but I want to prepare
workaround in case this reset causes something wrong.


I like the ability to specify the original "reset_devices" separately from
invoking this new mechanism.  With so many different uses for Linux in
so many different environments and with so many different device drivers
it seems reasonable to keep the ability to tell the device drivers to
reset their devices -- instead of pulling the reset line on all devices.

I also like the ability to invoke the new reset feature separately from
telling the device drivers to do it.






I tried to make a list of the interesting scenarios and the events
that are relevant to this problem:

   Case 1: IOMMU off in system, off in kdump kernel
 system kernel leaves IOMMU off
   DMA targets system-kernel memory
 kexec to kdump kernel (IOMMU off, devices untouched)
   DMA targets system-kernel memory (harmless)
 kdump kernel re-inits device
   DMA targets kdump-kernel memory

   Case 2: IOMMU off in system kernel, on in kdump kernel
 system kernel leaves IOMMU off
   DMA targets system-kernel memory
 kexec to kdump kernel (IOMMU off, devices untouched)
   DMA targets system-kernel memory (harmless)
 kdump kernel enables IOMMU with no valid mappings
   DMA causes IOMMU errors (annoying but harmless)
 kdump kernel re-inits device
   DMA targets IOMMU-mapped kdump-kernel memory

   Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
 system kernel enables IOMMU
   DMA targets IOMMU-mapped system-kernel memory
 kexec to kdump kernel (IOMMU on, devices untouched)
   DMA targets IOMMU-mapped system-

Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-06-11 Thread Bjorn Helgaas
[+cc Vivek, Haren; sorry I didn't think to add you earlier]

On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh
 wrote:
> (2013/06/11 11:20), Bjorn Helgaas wrote:

>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>> yet, but the current hook isn't anchored anywhere -- it's just an
>> fs_initcall() that doesn't give the reader any clue about the
>> connection between the reset and the problem it's solving.
>>
>> If we do something like this patch, I think it needs to be done at the
>> point where we enable or disable the IOMMU.  That way, it's connected
>> to the important event, and there's a clue about how to make
>> corresponding fixes for other IOMMUs.
>
> Ok. pci_iommu_init() is appropriate place to add this hook?

I looked at various IOMMU init places today, and it's far more
complicated and varied than I had hoped.

This reset scheme depends on enumerating PCI devices before we
initialize the IOMMU used by those devices.  x86 works that way today,
but not all architectures do (see the sparc pci_fire_pbm_init(), for
example).  And I think conceptually, the IOMMU should be enumerated
and initialized *before* the devices that use it.

So I'm uncomfortable with that aspect of this scheme.

It would be at least conceivable to reset the devices in the system
kernel, before the kexec.  I know we want to do as little as possible
in the crashing kernel, but it's at least a possibility, and it might
be cleaner.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-06-11 Thread Bjorn Helgaas
[Your quoting is messed up below]
>> This is Takao's text
> This is Bill's text

On Tue, Jun 11, 2013 at 5:19 PM, Sumner, William  wrote:
>>(2013/06/11 11:20), Bjorn Helgaas wrote:
>>> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh  
>>> wrote:

>>Actually Vivek Goyal, kdump maintainer said same thing. He proposed using
>>reset_devices instead of adding new one, and using quirk or something to
>>avoid such a buggy devices.
>
> With respect to "and using quirk or something to avoid such buggy devices",
> I believe that it will be necessary to provide a mechanism for devices that
> need special handling to do the reset ...

You mean something like pci_dev_specific_reset()?

>>So, basically I agree with using reset_devices, but I want to prepare
>>workaround in case this reset causes something wrong.
>>
> I like the ability to specify the original "reset_devices" separately from
> invoking this new mechanism.  With so many different uses for Linux in
> so many different environments and with so many different device drivers
> it seems reasonable to keep the ability to tell the device drivers to
> reset their devices -- instead of pulling the reset line on all devices.

The problem with adding new options for this and that is it confuses
users and fragments testing.  Users already randomly try options and
publish combinations that happen to work as "solutions."  Then we
don't get problem reports and don't get a chance to fix things
properly.  The ideal OS would have zero options and be able to figure
everything out by itself.  So that's my bias about giving users
flexibility by adding new options -- I don't like it, and I think we
have to accept some lack of flexibility to keep the system complexity
tractable  :)

> ...
> Thinking out of the box:
> Much of the discussion about dealing with the ongoing DMA leftover
> from the system kernel has assumed that the crash kernel will reset
> the IOMMU -- which causes various problems if done while there is any DMA
> still active -- which leads to the idea of stopping all of the DMA.
>
> Suppose the crash kernel does not reset the hardware IOMMU, but simply
> detects that it is active, resets only the devices that are necessary
> for the crash kernel to operate, and re-programs only the translations
> for those devices.  All other translations remain the same (and remain valid)
> so all leftover DMA continues into its buffer in the system kernel area
> where it is harmless.  New translations needed by the kdump kernel are
> added to the existing tables.
>
> I have not yet tried this, so I am not ready to propose it as anything more
> than a discussion topic at this time.
>
> It might work this way: (A small modification to case 3a above)
>
> IOMMU on in system kernel, kdump kernel accepts active IOMMU
>system kernel enables IOMMU
>   DMA targets IOMMU-mapped system-kernel memory
>kexec to kdump kernel (IOMMU on, devices untouched)
>   DMA targets IOMMU-mapped system-kernel memory
>kdump kernel detects active IOMMU and doesn't touch it
>   DMA targets IOMMU-mapped system-kernel memory
>kdump kernel does not re-initialize IOMMU hardware
>kdump kernel initializes IOMMU in-memory management structures
>kdump kernel calls device drivers' standard initialization functions
>   Drivers initialize their own devices -- DMA from that device stops
>   When drivers request new DMA mappings, the kdump IOMMU driver:
>   1. Updates its in-memory mgt structures for that device & range
>   2. Updates IOMMU translate tables for that device & range
>  . Translations for all other devices & ranges are unchanged
>   3. Flushes IOMMU TLB to force IOMMU hardware update

This is certainly an interesting idea.

It would require some additional smarts in the IOMMU driver to take
over existing page tables.  Those data structures from the system
kernel are outside the memory map known to the kdump kernel, so
there'd likely be VM issues there.  The IOMMU driver would also have
to be able to reconstruct the state for the bus address space
allocator, e.g., use the I/O page tables to rebuild a bitmap of
allocated space.

We'll end up with a bunch of dma_map() calls from the system kernel
that don't have corresponding dma_unmap()s.  I guess when a driver
initializes its device, there might have to be a new interface to
"unmap any existing mappings for this device, even though I don't know
what they are."

I think it's possible, but it sounds like quite a lot of work and I'm
not sure it's worth it for this relatively limited use case.  And it's
more IOMMU-specific than a "reset devices" strategy, so most of it
would have to be done in each IOMMU driver.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-06-11 Thread Sumner, William

>(2013/06/11 11:20), Bjorn Helgaas wrote:
>> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh  
>> wrote:
>>> (2013/06/07 13:14), Bjorn Helgaas wrote:
>>
 One thing I'm not sure about is that you are only resetting PCIe
 devices, but I don't think the problem is actually specific to PCIe,
 is it?  I think the same issue could occur on any system with an
 IOMMU.  In the x86 PC world, most IOMMUs are connected with PCIe, but
 there are systems with IOMMUs for plain old PCI devices, e.g.,
 PA-RISC.
>>>
>>> Right, this is not specific to PCIe. The reasons why the target is only
>>> PCIe is just to make algorithm to reset simple. It is possible to reset
>>> legacy PCI devices in my patch, but code becomes somewhat complicated. I
>>> thought recently most systems used PCIe and there was little demand for
>>> resetting legacy PCI. Therefore I decided not to reset legacy PCI
>>> devices, but I'll do if there are requests :-)
>>
>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>> yet, but the current hook isn't anchored anywhere -- it's just an
>> fs_initcall() that doesn't give the reader any clue about the
>> connection between the reset and the problem it's solving.
>>
>> If we do something like this patch, I think it needs to be done at the
>> point where we enable or disable the IOMMU.  That way, it's connected
>> to the important event, and there's a clue about how to make
>> corresponding fixes for other IOMMUs.
>
>Ok. pci_iommu_init() is appropriate place to add this hook?
>
>> We already have a "reset_devices" boot option.  This is for the same
>> purpose, as far as I can tell, and I'm not sure there's value in
>> having both "reset_devices" and "pci=pcie_reset_endpoint_devices".  In
>> fact, there's nothing specific even to PCI here.  The Intel VT-d docs
>> seem carefully written so they could apply to either PCIe or non-PCI
>> devices.
>
>Yeah, I can integrate this option into reset_devices. The reason why
>I separate them is to avoid regression.
>
>I have tested my patch on many machines and basically it worked, but I
>found two machines where this reset patch caused problem. The first one
>was caused by bug of raid card firmware. After updating firmware, this
>patch worked. The second one was due to bug of PCIe switch chip. I
>reported this bug to the vendor but it is not fixed yet.
>
>Anyway, this patch may cause problems on such a buggy machine, so I
>introduced new boot parameter so that user can enable or disable this
>function aside from reset_devices.
>
>Actually Vivek Goyal, kdump maintainer said same thing. He proposed using
>reset_devices instead of adding new one, and using quirk or something to
>avoid such a buggy devices.

With respect to "and using quirk or something to avoid such buggy devices",
I believe that it will be necessary to provide a mechanism for devices that
need special handling to do the reset -- perhaps something like a list
of tuples: (device_type, function_to_call) with a default function_to_call
when the device_type is not found in the list.  These functions would
need to be physically separate from the device driver because if the device
is present it needs to be reset even if the crash kernel chooses not to load
the driver for that device.

>
>So, basically I agree with using reset_devices, but I want to prepare
>workaround in case this reset causes something wrong.
>
I like the ability to specify the original "reset_devices" separately from
invoking this new mechanism.  With so many different uses for Linux in
so many different environments and with so many different device drivers
it seems reasonable to keep the ability to tell the device drivers to
reset their devices -- instead of pulling the reset line on all devices.

I also like the ability to invoke the new reset feature separately from
telling the device drivers to do it.

>
>>
 I tried to make a list of the interesting scenarios and the events
 that are relevant to this problem:

   Case 1: IOMMU off in system, off in kdump kernel
 system kernel leaves IOMMU off
   DMA targets system-kernel memory
 kexec to kdump kernel (IOMMU off, devices untouched)
   DMA targets system-kernel memory (harmless)
 kdump kernel re-inits device
   DMA targets kdump-kernel memory

   Case 2: IOMMU off in system kernel, on in kdump kernel
 system kernel leaves IOMMU off
   DMA targets system-kernel memory
 kexec to kdump kernel (IOMMU off, devices untouched)
   DMA targets system-kernel memory (harmless)
 kdump kernel enables IOMMU with no valid mappings
   DMA causes IOMMU errors (annoying but harmless)
 kdump kernel re-inits device
   DMA targets IOMMU-mapped kdump-kernel memory

   Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
 system kernel en

Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-06-10 Thread Takao Indoh
(2013/06/11 11:20), Bjorn Helgaas wrote:
> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh  
> wrote:
>> (2013/06/07 13:14), Bjorn Helgaas wrote:
> 
>>> One thing I'm not sure about is that you are only resetting PCIe
>>> devices, but I don't think the problem is actually specific to PCIe,
>>> is it?  I think the same issue could occur on any system with an
>>> IOMMU.  In the x86 PC world, most IOMMUs are connected with PCIe, but
>>> there are systems with IOMMUs for plain old PCI devices, e.g.,
>>> PA-RISC.
>>
>> Right, this is not specific to PCIe. The reasons why the target is only
>> PCIe is just to make algorithm to reset simple. It is possible to reset
>> legacy PCI devices in my patch, but code becomes somewhat complicated. I
>> thought recently most systems used PCIe and there was little demand for
>> resetting legacy PCI. Therefore I decided not to reset legacy PCI
>> devices, but I'll do if there are requests :-)
> 
> I'm not sure you need to reset legacy devices (or non-PCI devices)
> yet, but the current hook isn't anchored anywhere -- it's just an
> fs_initcall() that doesn't give the reader any clue about the
> connection between the reset and the problem it's solving.
> 
> If we do something like this patch, I think it needs to be done at the
> point where we enable or disable the IOMMU.  That way, it's connected
> to the important event, and there's a clue about how to make
> corresponding fixes for other IOMMUs.

Ok. pci_iommu_init() is appropriate place to add this hook?

> We already have a "reset_devices" boot option.  This is for the same
> purpose, as far as I can tell, and I'm not sure there's value in
> having both "reset_devices" and "pci=pcie_reset_endpoint_devices".  In
> fact, there's nothing specific even to PCI here.  The Intel VT-d docs
> seem carefully written so they could apply to either PCIe or non-PCI
> devices.

Yeah, I can integrate this option into reset_devices. The reason why
I separate them is to avoid regression.

I have tested my patch on many machines and basically it worked, but I
found two machines where this reset patch caused problem. The first one
was caused by bug of raid card firmware. After updating firmware, this
patch worked. The second one was due to bug of PCIe switch chip. I
reported this bug to the vendor but it is not fixed yet.

Anyway, this patch may cause problems on such a buggy machine, so I
introduced new boot parameter so that user can enable or disable this
function aside from reset_devices.

Actually Vivek Goyal, kdump maintainer said same thing. He proposed using
reset_devices instead of adding new one, and using quirk or something to
avoid such a buggy devices.

So, basically I agree with using reset_devices, but I want to prepare
workaround in case this reset causes something wrong.


> 
>>> I tried to make a list of the interesting scenarios and the events
>>> that are relevant to this problem:
>>>
>>>   Case 1: IOMMU off in system, off in kdump kernel
>>> system kernel leaves IOMMU off
>>>   DMA targets system-kernel memory
>>> kexec to kdump kernel (IOMMU off, devices untouched)
>>>   DMA targets system-kernel memory (harmless)
>>> kdump kernel re-inits device
>>>   DMA targets kdump-kernel memory
>>>
>>>   Case 2: IOMMU off in system kernel, on in kdump kernel
>>> system kernel leaves IOMMU off
>>>   DMA targets system-kernel memory
>>> kexec to kdump kernel (IOMMU off, devices untouched)
>>>   DMA targets system-kernel memory (harmless)
>>> kdump kernel enables IOMMU with no valid mappings
>>>   DMA causes IOMMU errors (annoying but harmless)
>>> kdump kernel re-inits device
>>>   DMA targets IOMMU-mapped kdump-kernel memory
>>>
>>>   Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
>>> system kernel enables IOMMU
>>>   DMA targets IOMMU-mapped system-kernel memory
>>> kexec to kdump kernel (IOMMU on, devices untouched)
>>>   DMA targets IOMMU-mapped system-kernel memory
>>> kdump kernel doesn't know about IOMMU or doesn't touch it
>>>   DMA targets IOMMU-mapped system-kernel memory
>>> kdump kernel re-inits device
>>>   kernel assumes no IOMMU, so all new DMA mappings are invalid
>>> because DMAs actually do go through the IOMMU
>>>   (** corruption or other non-recoverable error likely **)
>>>
>>>   Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
>>> system kernel enables IOMMU
>>>   DMA targets IOMMU-mapped system-kernel memory
>>> kexec to kdump kernel (IOMMU on, devices untouched)
>>>   DMA targets IOMMU-mapped system-kernel memory
>>> kdump kernel disables IOMMU
>>>   DMA targets IOMMU-mapped system-kernel memory, but IOMMU is 
>>> disabled
>>>   (** corruption or other non-recoverable error likely **)
>>> kdump kernel re-inits device

Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-06-10 Thread Bjorn Helgaas
On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh  wrote:
> (2013/06/07 13:14), Bjorn Helgaas wrote:

>> One thing I'm not sure about is that you are only resetting PCIe
>> devices, but I don't think the problem is actually specific to PCIe,
>> is it?  I think the same issue could occur on any system with an
>> IOMMU.  In the x86 PC world, most IOMMUs are connected with PCIe, but
>> there are systems with IOMMUs for plain old PCI devices, e.g.,
>> PA-RISC.
>
> Right, this is not specific to PCIe. The reasons why the target is only
> PCIe is just to make algorithm to reset simple. It is possible to reset
> legacy PCI devices in my patch, but code becomes somewhat complicated. I
> thought recently most systems used PCIe and there was little demand for
> resetting legacy PCI. Therefore I decided not to reset legacy PCI
> devices, but I'll do if there are requests :-)

I'm not sure you need to reset legacy devices (or non-PCI devices)
yet, but the current hook isn't anchored anywhere -- it's just an
fs_initcall() that doesn't give the reader any clue about the
connection between the reset and the problem it's solving.

If we do something like this patch, I think it needs to be done at the
point where we enable or disable the IOMMU.  That way, it's connected
to the important event, and there's a clue about how to make
corresponding fixes for other IOMMUs.

We already have a "reset_devices" boot option.  This is for the same
purpose, as far as I can tell, and I'm not sure there's value in
having both "reset_devices" and "pci=pcie_reset_endpoint_devices".  In
fact, there's nothing specific even to PCI here.  The Intel VT-d docs
seem carefully written so they could apply to either PCIe or non-PCI
devices.

>> I tried to make a list of the interesting scenarios and the events
>> that are relevant to this problem:
>>
>>  Case 1: IOMMU off in system, off in kdump kernel
>>system kernel leaves IOMMU off
>>  DMA targets system-kernel memory
>>kexec to kdump kernel (IOMMU off, devices untouched)
>>  DMA targets system-kernel memory (harmless)
>>kdump kernel re-inits device
>>  DMA targets kdump-kernel memory
>>
>>  Case 2: IOMMU off in system kernel, on in kdump kernel
>>system kernel leaves IOMMU off
>>  DMA targets system-kernel memory
>>kexec to kdump kernel (IOMMU off, devices untouched)
>>  DMA targets system-kernel memory (harmless)
>>kdump kernel enables IOMMU with no valid mappings
>>  DMA causes IOMMU errors (annoying but harmless)
>>kdump kernel re-inits device
>>  DMA targets IOMMU-mapped kdump-kernel memory
>>
>>  Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
>>system kernel enables IOMMU
>>  DMA targets IOMMU-mapped system-kernel memory
>>kexec to kdump kernel (IOMMU on, devices untouched)
>>  DMA targets IOMMU-mapped system-kernel memory
>>kdump kernel doesn't know about IOMMU or doesn't touch it
>>  DMA targets IOMMU-mapped system-kernel memory
>>kdump kernel re-inits device
>>  kernel assumes no IOMMU, so all new DMA mappings are invalid
>> because DMAs actually do go through the IOMMU
>>  (** corruption or other non-recoverable error likely **)
>>
>>  Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
>>system kernel enables IOMMU
>>  DMA targets IOMMU-mapped system-kernel memory
>>kexec to kdump kernel (IOMMU on, devices untouched)
>>  DMA targets IOMMU-mapped system-kernel memory
>>kdump kernel disables IOMMU
>>  DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled
>>  (** corruption or other non-recoverable error likely **)
>>kdump kernel re-inits device
>>  DMA targets kdump-kernel memory
>>
>>  Case 4: IOMMU on in system kernel, on in kdump kernel
>>system kernel enables IOMMU
>>  DMA targets IOMMU-mapped system-kernel memory
>>kexec to kdump kernel (IOMMU on, devices untouched)
>>  DMA targets IOMMU-mapped system-kernel memory
>>kdump kernel enables IOMMU with no valid mappings
>>  DMA causes IOMMU errors (annoying but harmless)
>>kdump kernel re-inits device
>>  DMA targets IOMMU-mapped kdump-kernel memory
>
> This is not harmless. Errors like PCI SERR are detected here, and it
> makes driver or system unstable, and kdump fails. I also got report that
> system hangs up due to this.

OK, let's take this slowly.  Does an IOMMU error in the system kernel
also cause SERR or make the system unstable?  Is that the expected
behavior on IOMMU errors, or is there something special about the
kdump scenario that causes SERRs?  I see lots of DMAR errors, e.g.,
those in https://bugzilla.redhat.com/show_bug.cgi?id=743790, that are
reported with printk and don't seem to cause an SERR.  Maybe the SERR
is system-specific behavior?

ht

Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-06-07 Thread Takao Indoh
Thank you for your comments!

(2013/06/07 13:14), Bjorn Helgaas wrote:
> Sorry it's taken me so long to look at this.  I've been putting this
> off because the patch doesn't seem "obviously correct," and I don't
> feel like I really understand the problem.  I'm naive about both
> IOMMUs and kexec/kdump, so please pardon (and help me with) any silly
> questions or assumptions below.
> 
> On Mon, May 13, 2013 at 11:29 PM, Takao Indoh
>  wrote:
>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
>> on each PCIe root port and downstream port to reset its downstream
>> endpoint.
>>
>> Problem:
>> This patch solves the problem that kdump can fail when intel_iommu=on is
>> specified. When intel_iommu=on is specified, many dma-remapping errors
>> occur in second kernel and it causes problems like driver error or PCI
>> SERR, at last kdump fails. This problem is caused as follows.
>> 1) Devices are working on first kernel.
>> 2) Switch to second kernel(kdump kernel). The devices are still working
>> and its DMA continues during this switch.
>> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>> dma-remapping errors.
> 
> If I understand correctly, the problem only happens on systems with an
> IOMMU that's enabled in either the system or kdump kernel (or both).
> For systems without an IOMMU (or if it is disabled in both the system
> and kdump kernels),  any ongoing DMA should use addresses that target
> system-kernel memory and should not affect the kdump kernel.

Yes, that's correct.

> One thing I'm not sure about is that you are only resetting PCIe
> devices, but I don't think the problem is actually specific to PCIe,
> is it?  I think the same issue could occur on any system with an
> IOMMU.  In the x86 PC world, most IOMMUs are connected with PCIe, but
> there are systems with IOMMUs for plain old PCI devices, e.g.,
> PA-RISC.

Right, this is not specific to PCIe. The reasons why the target is only
PCIe is just to make algorithm to reset simple. It is possible to reset
legacy PCI devices in my patch, but code becomes somewhat complicated. I
thought recently most systems used PCIe and there was little demand for
resetting legacy PCI. Therefore I decided not to reset legacy PCI
devices, but I'll do if there are requests :-)

> 
> I tried to make a list of the interesting scenarios and the events
> that are relevant to this problem:
> 
>  Case 1: IOMMU off in system, off in kdump kernel
>system kernel leaves IOMMU off
>  DMA targets system-kernel memory
>kexec to kdump kernel (IOMMU off, devices untouched)
>  DMA targets system-kernel memory (harmless)
>kdump kernel re-inits device
>  DMA targets kdump-kernel memory
> 
>  Case 2: IOMMU off in system kernel, on in kdump kernel
>system kernel leaves IOMMU off
>  DMA targets system-kernel memory
>kexec to kdump kernel (IOMMU off, devices untouched)
>  DMA targets system-kernel memory (harmless)
>kdump kernel enables IOMMU with no valid mappings
>  DMA causes IOMMU errors (annoying but harmless)
>kdump kernel re-inits device
>  DMA targets IOMMU-mapped kdump-kernel memory
> 
>  Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
>system kernel enables IOMMU
>  DMA targets IOMMU-mapped system-kernel memory
>kexec to kdump kernel (IOMMU on, devices untouched)
>  DMA targets IOMMU-mapped system-kernel memory
>kdump kernel doesn't know about IOMMU or doesn't touch it
>  DMA targets IOMMU-mapped system-kernel memory
>kdump kernel re-inits device
>  kernel assumes no IOMMU, so all new DMA mappings are invalid
> because DMAs actually do go through the IOMMU
>  (** corruption or other non-recoverable error likely **)
> 
>  Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
>system kernel enables IOMMU
>  DMA targets IOMMU-mapped system-kernel memory
>kexec to kdump kernel (IOMMU on, devices untouched)
>  DMA targets IOMMU-mapped system-kernel memory
>kdump kernel disables IOMMU
>  DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled
>  (** corruption or other non-recoverable error likely **)
>kdump kernel re-inits device
>  DMA targets kdump-kernel memory
> 
>  Case 4: IOMMU on in system kernel, on in kdump kernel
>system kernel enables IOMMU
>  DMA targets IOMMU-mapped system-kernel memory
>kexec to kdump kernel (IOMMU on, devices untouched)
>  DMA targets IOMMU-mapped system-kernel memory
>kdump kernel enables IOMMU with no valid mappings
>  DMA causes IOMMU errors (annoying but harmless)

This is not harmless. Errors like PCI SERR are detected here, and it
makes driver or system unstable

Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-06-06 Thread Bjorn Helgaas
Sorry it's taken me so long to look at this.  I've been putting this
off because the patch doesn't seem "obviously correct," and I don't
feel like I really understand the problem.  I'm naive about both
IOMMUs and kexec/kdump, so please pardon (and help me with) any silly
questions or assumptions below.

On Mon, May 13, 2013 at 11:29 PM, Takao Indoh
 wrote:
> This patch resets PCIe devices on boot to stop ongoing DMA. When
> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
> on each PCIe root port and downstream port to reset its downstream
> endpoint.
>
> Problem:
> This patch solves the problem that kdump can fail when intel_iommu=on is
> specified. When intel_iommu=on is specified, many dma-remapping errors
> occur in second kernel and it causes problems like driver error or PCI
> SERR, at last kdump fails. This problem is caused as follows.
> 1) Devices are working on first kernel.
> 2) Switch to second kernel(kdump kernel). The devices are still working
>and its DMA continues during this switch.
> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>dma-remapping errors.

If I understand correctly, the problem only happens on systems with an
IOMMU that's enabled in either the system or kdump kernel (or both).
For systems without an IOMMU (or if it is disabled in both the system
and kdump kernels),  any ongoing DMA should use addresses that target
system-kernel memory and should not affect the kdump kernel.

One thing I'm not sure about is that you are only resetting PCIe
devices, but I don't think the problem is actually specific to PCIe,
is it?  I think the same issue could occur on any system with an
IOMMU.  In the x86 PC world, most IOMMUs are connected with PCIe, but
there are systems with IOMMUs for plain old PCI devices, e.g.,
PA-RISC.

I tried to make a list of the interesting scenarios and the events
that are relevant to this problem:

Case 1: IOMMU off in system, off in kdump kernel
  system kernel leaves IOMMU off
DMA targets system-kernel memory
  kexec to kdump kernel (IOMMU off, devices untouched)
DMA targets system-kernel memory (harmless)
  kdump kernel re-inits device
DMA targets kdump-kernel memory

Case 2: IOMMU off in system kernel, on in kdump kernel
  system kernel leaves IOMMU off
DMA targets system-kernel memory
  kexec to kdump kernel (IOMMU off, devices untouched)
DMA targets system-kernel memory (harmless)
  kdump kernel enables IOMMU with no valid mappings
DMA causes IOMMU errors (annoying but harmless)
  kdump kernel re-inits device
DMA targets IOMMU-mapped kdump-kernel memory

Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
  system kernel enables IOMMU
DMA targets IOMMU-mapped system-kernel memory
  kexec to kdump kernel (IOMMU on, devices untouched)
DMA targets IOMMU-mapped system-kernel memory
  kdump kernel doesn't know about IOMMU or doesn't touch it
DMA targets IOMMU-mapped system-kernel memory
  kdump kernel re-inits device
kernel assumes no IOMMU, so all new DMA mappings are invalid
because DMAs actually do go through the IOMMU
(** corruption or other non-recoverable error likely **)

Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
  system kernel enables IOMMU
DMA targets IOMMU-mapped system-kernel memory
  kexec to kdump kernel (IOMMU on, devices untouched)
DMA targets IOMMU-mapped system-kernel memory
  kdump kernel disables IOMMU
DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled
(** corruption or other non-recoverable error likely **)
  kdump kernel re-inits device
DMA targets kdump-kernel memory

Case 4: IOMMU on in system kernel, on in kdump kernel
  system kernel enables IOMMU
DMA targets IOMMU-mapped system-kernel memory
  kexec to kdump kernel (IOMMU on, devices untouched)
DMA targets IOMMU-mapped system-kernel memory
  kdump kernel enables IOMMU with no valid mappings
DMA causes IOMMU errors (annoying but harmless)
  kdump kernel re-inits device
DMA targets IOMMU-mapped kdump-kernel memory

The problem cases I see are 3a and 3b, but that's not the problem
you're describing.  Obviously I'm missing something.

It sounds like you're seeing problems in case 2 or case 4, where the
IOMMU is enabled in the kdump kernel.  Maybe my assumption about the
IOMMU being enabled with no valid mappings is wrong?  Or maybe those
IOMMU errors are not actually harmless?

Resetting PCI/PCIe devices will help with cases 2, 4, and 3b, but not
with case 3a.  Therefore, it seems like the kdump kernel *must*
contain IOMMU support unless it knows for certain that the system
kernel wasn't using the IOMMU.

Do you have any bugzilla references or problem report URLs you could
include here?

Obviously I'm very confused h

Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-06-06 Thread Takao Indoh
Ping

Bjorn, could you please tell your impression on this idea?

This fix is very important for kdump. For exmaple, kdump does not work
if PCI passthrough is used on KVM guest. Please see patch description
for details.

If you don't agree this patch, please tell me what changes I should
make.

Thanks,
Takao Indoh

(2013/05/14 14:29), Takao Indoh wrote:
> This patch resets PCIe devices on boot to stop ongoing DMA. When
> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
> on each PCIe root port and downstream port to reset its downstream
> endpoint.
> 
> Problem:
> This patch solves the problem that kdump can fail when intel_iommu=on is
> specified. When intel_iommu=on is specified, many dma-remapping errors
> occur in second kernel and it causes problems like driver error or PCI
> SERR, at last kdump fails. This problem is caused as follows.
> 1) Devices are working on first kernel.
> 2) Switch to second kernel(kdump kernel). The devices are still working
> and its DMA continues during this switch.
> 3) iommu is initialized during second kernel boot and ongoing DMA causes
> dma-remapping errors.
> 
> Solution:
> All DMA transactions have to be stopped before iommu is initialized. By
> this patch devices are reset and in-flight DMA is stopped before
> pci_iommu_init.
> 
> To invoke hot reset on an endpoint, its upstream link need to be reset.
> reset_pcie_endpoints() is called from fs_initcall_sync, and it finds
> root port/downstream port whose child is PCIe endpoint, and then reset
> link between them. If the endpoint is VGA device, it is skipped because
> the monitor blacks out if VGA controller is reset.
> 
> Changelog:
> v2:
> - Read pci config before de-assert secondary bus reset to flush previous
>write
> - Change function/variable name
> - Make a list of devices to be reset
> 
> v1:
> https://patchwork.kernel.org/patch/2482291/
> 
> Signed-off-by: Takao Indoh 
> ---
>   Documentation/kernel-parameters.txt |2 +
>   drivers/pci/pci.c   |  113 
> +++
>   2 files changed, 115 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/kernel-parameters.txt 
> b/Documentation/kernel-parameters.txt
> index c3bfacb..8c9e8e4 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be 
> entirely omitted.
>   any pair of devices, possibly at the cost of
>   reduced performance.  This also guarantees
>   that hot-added devices will work.
> + pcie_reset_endpoint_devices Reset PCIe endpoint on boot by
> + hot reset
>   cbiosize=nn[KMG]The fixed amount of bus space which is
>   reserved for the CardBus bridge's IO window.
>   The default value is 256 bytes.
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index a899d8b..70c1205 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>   }
>   EXPORT_SYMBOL(pci_fixup_cardbus);
>   
> +/*
> + * Return true if dev is PCIe root port or downstream port whose child is 
> PCIe
> + * endpoint except VGA device.
> + */
> +static int __init need_reset(struct pci_dev *dev)
> +{
> + struct pci_bus *subordinate;
> + struct pci_dev *child;
> +
> + if (!pci_is_pcie(dev) || !dev->subordinate ||
> + list_empty(&dev->subordinate->devices) ||
> + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
> +  (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
> + return 0;
> +
> + subordinate = dev->subordinate;
> + list_for_each_entry(child, &subordinate->devices, bus_list) {
> + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
> + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
> + ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
> + /* Don't reset switch, bridge, VGA device */
> + return 0;
> + }
> +
> + return 1;
> +}
> +
> +static void __init save_downstream_configs(struct pci_dev *dev)
> +{
> + struct pci_bus *subordinate;
> + struct pci_dev *child;
> +
> + subordinate = dev->subordinate;
> + list_for_each_entry(child, &subordinate->devices, bus_list) {
> + dev_info(&child->dev, "save state\n");
> + pci_save_state(child);
> + }
> +}
> +
> +static void __init restore_downstream_configs(struct pci_dev *dev)
> +{
> + struct pci_bus *subordinate;
> + struct pci_dev *child;
> +
> + subordinate = dev->subordinate;
> + list_for_each_entry(child, &subordinate->devices, bus_list) {
> + dev_info(&child->dev, "restore state\n");
> + pci_restore_state(child);
> + }
> +}

Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-05-21 Thread Takao Indoh
Hi Bjorn,

Any comments, ack/nack?

Thanks,
Takao Indoh

(2013/05/15 7:04), Eric W. Biederman wrote:
> Takao Indoh  writes:
> 
>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
>> on each PCIe root port and downstream port to reset its downstream
>> endpoint.
>>
>> Problem:
>> This patch solves the problem that kdump can fail when intel_iommu=on is
>> specified. When intel_iommu=on is specified, many dma-remapping errors
>> occur in second kernel and it causes problems like driver error or PCI
>> SERR, at last kdump fails. This problem is caused as follows.
>> 1) Devices are working on first kernel.
>> 2) Switch to second kernel(kdump kernel). The devices are still working
>> and its DMA continues during this switch.
>> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>> dma-remapping errors.
>>
>> Solution:
>> All DMA transactions have to be stopped before iommu is initialized. By
>> this patch devices are reset and in-flight DMA is stopped before
>> pci_iommu_init.
>>
>> To invoke hot reset on an endpoint, its upstream link need to be reset.
>> reset_pcie_endpoints() is called from fs_initcall_sync, and it finds
>> root port/downstream port whose child is PCIe endpoint, and then reset
>> link between them. If the endpoint is VGA device, it is skipped because
>> the monitor blacks out if VGA controller is reset.
> 
> At a quick skim this patch looks reasonable.
> 
> Acked-by: "Eric W. Biederman" 
> 
>> Changelog:
>> v2:
>> - Read pci config before de-assert secondary bus reset to flush previous
>>write
>> - Change function/variable name
>> - Make a list of devices to be reset
>>
>> v1:
>> https://patchwork.kernel.org/patch/2482291/
>>
>> Signed-off-by: Takao Indoh 
>> ---
>>   Documentation/kernel-parameters.txt |2 +
>>   drivers/pci/pci.c   |  113 
>> +++
>>   2 files changed, 115 insertions(+), 0 deletions(-)
>>
>> diff --git a/Documentation/kernel-parameters.txt 
>> b/Documentation/kernel-parameters.txt
>> index c3bfacb..8c9e8e4 100644
>> --- a/Documentation/kernel-parameters.txt
>> +++ b/Documentation/kernel-parameters.txt
>> @@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be 
>> entirely omitted.
>>  any pair of devices, possibly at the cost of
>>  reduced performance.  This also guarantees
>>  that hot-added devices will work.
>> +pcie_reset_endpoint_devices Reset PCIe endpoint on boot by
>> +hot reset
>>  cbiosize=nn[KMG]The fixed amount of bus space which is
>>  reserved for the CardBus bridge's IO window.
>>  The default value is 256 bytes.
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index a899d8b..70c1205 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>>   }
>>   EXPORT_SYMBOL(pci_fixup_cardbus);
>>   
>> +/*
>> + * Return true if dev is PCIe root port or downstream port whose child is 
>> PCIe
>> + * endpoint except VGA device.
>> + */
>> +static int __init need_reset(struct pci_dev *dev)
>> +{
>> +struct pci_bus *subordinate;
>> +struct pci_dev *child;
>> +
>> +if (!pci_is_pcie(dev) || !dev->subordinate ||
>> +list_empty(&dev->subordinate->devices) ||
>> +((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
>> +return 0;
>> +
>> +subordinate = dev->subordinate;
>> +list_for_each_entry(child, &subordinate->devices, bus_list) {
>> +if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
>> +(pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
>> +((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
>> +/* Don't reset switch, bridge, VGA device */
>> +return 0;
>> +}
>> +
>> +return 1;
>> +}
>> +
>> +static void __init save_downstream_configs(struct pci_dev *dev)
>> +{
>> +struct pci_bus *subordinate;
>> +struct pci_dev *child;
>> +
>> +subordinate = dev->subordinate;
>> +list_for_each_entry(child, &subordinate->devices, bus_list) {
>> +dev_info(&child->dev, "save state\n");
>> +pci_save_state(child);
>> +}
>> +}
>> +
>> +static void __init restore_downstream_configs(struct pci_dev *dev)
>> +{
>> +struct pci_bus *subordinate;
>> +struct pci_dev *child;
>> +
>> +subordinate = dev->subordinate;
>> +list_for_each_entry(child, &subordinate->devices, bus_list) {
>> +dev_info(&child->dev, "restore state\n");
>> +pci_restore_state(child);
>> +}
>> +}
>> +
>> +static void __init do_downstream_device_reset(struct pci_dev

Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-05-14 Thread Eric W. Biederman
Takao Indoh  writes:

> This patch resets PCIe devices on boot to stop ongoing DMA. When
> "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered
> on each PCIe root port and downstream port to reset its downstream
> endpoint.
>
> Problem:
> This patch solves the problem that kdump can fail when intel_iommu=on is
> specified. When intel_iommu=on is specified, many dma-remapping errors
> occur in second kernel and it causes problems like driver error or PCI
> SERR, at last kdump fails. This problem is caused as follows.
> 1) Devices are working on first kernel.
> 2) Switch to second kernel(kdump kernel). The devices are still working
>and its DMA continues during this switch.
> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>dma-remapping errors.
>
> Solution:
> All DMA transactions have to be stopped before iommu is initialized. By
> this patch devices are reset and in-flight DMA is stopped before
> pci_iommu_init.
>
> To invoke hot reset on an endpoint, its upstream link need to be reset.
> reset_pcie_endpoints() is called from fs_initcall_sync, and it finds
> root port/downstream port whose child is PCIe endpoint, and then reset
> link between them. If the endpoint is VGA device, it is skipped because
> the monitor blacks out if VGA controller is reset.

At a quick skim this patch looks reasonable.

Acked-by: "Eric W. Biederman" 

> Changelog:
> v2:
> - Read pci config before de-assert secondary bus reset to flush previous
>   write
> - Change function/variable name
> - Make a list of devices to be reset
>
> v1:
> https://patchwork.kernel.org/patch/2482291/
>
> Signed-off-by: Takao Indoh 
> ---
>  Documentation/kernel-parameters.txt |2 +
>  drivers/pci/pci.c   |  113 
> +++
>  2 files changed, 115 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/kernel-parameters.txt 
> b/Documentation/kernel-parameters.txt
> index c3bfacb..8c9e8e4 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be 
> entirely omitted.
>   any pair of devices, possibly at the cost of
>   reduced performance.  This also guarantees
>   that hot-added devices will work.
> + pcie_reset_endpoint_devices Reset PCIe endpoint on boot by
> + hot reset
>   cbiosize=nn[KMG]The fixed amount of bus space which is
>   reserved for the CardBus bridge's IO window.
>   The default value is 256 bytes.
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index a899d8b..70c1205 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>  }
>  EXPORT_SYMBOL(pci_fixup_cardbus);
>  
> +/*
> + * Return true if dev is PCIe root port or downstream port whose child is 
> PCIe
> + * endpoint except VGA device.
> + */
> +static int __init need_reset(struct pci_dev *dev)
> +{
> + struct pci_bus *subordinate;
> + struct pci_dev *child;
> +
> + if (!pci_is_pcie(dev) || !dev->subordinate ||
> + list_empty(&dev->subordinate->devices) ||
> + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
> +  (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
> + return 0;
> +
> + subordinate = dev->subordinate;
> + list_for_each_entry(child, &subordinate->devices, bus_list) {
> + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
> + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
> + ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
> + /* Don't reset switch, bridge, VGA device */
> + return 0;
> + }
> +
> + return 1;
> +}
> +
> +static void __init save_downstream_configs(struct pci_dev *dev)
> +{
> + struct pci_bus *subordinate;
> + struct pci_dev *child;
> +
> + subordinate = dev->subordinate;
> + list_for_each_entry(child, &subordinate->devices, bus_list) {
> + dev_info(&child->dev, "save state\n");
> + pci_save_state(child);
> + }
> +}
> +
> +static void __init restore_downstream_configs(struct pci_dev *dev)
> +{
> + struct pci_bus *subordinate;
> + struct pci_dev *child;
> +
> + subordinate = dev->subordinate;
> + list_for_each_entry(child, &subordinate->devices, bus_list) {
> + dev_info(&child->dev, "restore state\n");
> + pci_restore_state(child);
> + }
> +}
> +
> +static void __init do_downstream_device_reset(struct pci_dev *dev)
> +{
> + u16 ctrl;
> +
> + dev_info(&dev->dev, "Reset Secondary bus\n");
> +
> + /* Assert Secondary Bus Reset */
> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
> + ct