RE: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-08 Thread Huaisheng HS1 Ye

> 
> On Tue, May 08, 2018 at 02:59:40AM +, Huaisheng HS1 Ye wrote:
> > Currently, in our mind, an ideal use scenario is to put all page caches
> > into zone_nvm. Without any doubt, the page cache is an efficient and common
> > cache implementation, but it has the disadvantage that all dirty data within
> > it is at risk of being lost on a power failure or system crash. If we put
> > all page caches into NVDIMMs, all dirty data will be safe.
> 
> That's a common misconception.  Some dirty data will still be in the
> CPU caches.  Are you planning on building servers which have enough
> capacitance to allow the CPU to flush all dirty data from LLC to NV-DIMM?
> 
Sorry for not being clear.
For CPU caches, if there is a power failure, NVDIMM ADR guarantees that an
interrupt will be reported to the CPU, and an interrupt response function is
responsible for flushing all dirty data to the NVDIMM.
If there is a system crash, the CPU might not get a chance to execute this
response.

It is hard to make sure everything is safe; what we can do is save the dirty
data that has already been stored to the page cache, but not what is still in
the CPU cache.
Is this an improvement over the current situation?
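
For illustration only, here is a minimal userspace-style sketch of that
"flush what is still in the CPU cache" step, assuming a persistent-memory
mapping and using CLFLUSH for portability (a real kernel path would use
CLWB/CLFLUSHOPT through the arch helpers):

#include <emmintrin.h>   /* _mm_clflush(), _mm_sfence() */
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64

/* Write back every cache line covering [addr, addr + len) and fence, so the
 * data reaches the NVDIMM's persistence domain before we report it as done.
 * Sketch only, not code from this patch set. */
static void flush_to_pmem(const void *addr, size_t len)
{
	uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
	uintptr_t end = (uintptr_t)addr + len;

	for (; p < end; p += CACHELINE)
		_mm_clflush((const void *)p);
	_mm_sfence();
}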

> Then there's the problem of reconnecting the page cache (which is
> pointed to by ephemeral data structures like inodes and dentries) to
> the new inodes.
Yes, it is not easy.

> 
> And then you have to convince customers that what you're doing is safe
> enough for them to trust it ;-)
Sure. 

Sincerely,
Huaisheng Ye


RE: [External] [RFC PATCH v1 3/6] mm, zone_type: create ZONE_NVM and fill into GFP_ZONE_TABLE

2018-05-08 Thread Huaisheng HS1 Ye

> On 05/07/2018 07:33 PM, Huaisheng HS1 Ye wrote:
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index c782e8f..5fe1f63 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -687,6 +687,22 @@ config ZONE_DEVICE
> >
> > +config ZONE_NVM
> > +   bool "Manage NVDIMM (pmem) by memory management (EXPERIMENTAL)"
> > +   depends on NUMA && X86_64
> 
> Hi,
> I'm curious why this depends on NUMA. Couldn't it be useful in non-NUMA
> (i.e., UMA) configs?
> 
I wrote these patches on a two-socket test platform with two DDR DIMMs and two
NVDIMMs installed, so every socket has one DDR DIMM and one NVDIMM attached.
Here are the memory regions from memblock, so you can see the distribution.

 435 [0.00] Zone ranges:
 436 [0.00]   DMA  [mem 0x1000-0x00ff]
 437 [0.00]   DMA32[mem 0x0100-0x]
 438 [0.00]   Normal   [mem 0x0001-0x0046bfff]
 439 [0.00]   NVM  [mem 0x00044000-0x0046bfff]
 440 [0.00]   Device   empty
 441 [0.00] Movable zone start for each node
 442 [0.00] Early memory node ranges
 443 [0.00]   node   0: [mem 0x1000-0x0009]
 444 [0.00]   node   0: [mem 0x0010-0xa69c2fff]
 445 [0.00]   node   0: [mem 0xa7654000-0xa85eefff]
 446 [0.00]   node   0: [mem 0xab399000-0xaf3f6fff]
 447 [0.00]   node   0: [mem 0xaf429000-0xaf7f]
 448 [0.00]   node   0: [mem 0x0001-0x00043fff] Normal 0
 449 [0.00]   node   0: [mem 0x00044000-0x00237fff] NVDIMM 0
 450 [0.00]   node   1: [mem 0x00238000-0x00277fff] Normal 1
 451 [0.00]   node   1: [mem 0x00278000-0x0046bfff] NVDIMM 1

If we disable NUMA, the result is that the Normal and NVM zones would overlap
each other.
Current mm treats all memory regions equally and divides zones just by size:
16M for DMA, 4G for DMA32, and everything above for Normal.
The spanned ranges of the zones cannot overlap.

If we enable NUMA, the DDR and NVDIMM of every socket are separated, and you
can see that the NVDIMM region always sits behind the Normal zone.
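
To make the overlap constraint concrete, here is a small hedged sketch
(hypothetical helper, not mm code) of the rule described above: two zones can
only coexist if their spanned PFN ranges do not intersect.

#include <stdbool.h>
#include <stdint.h>

/* A zone spans [start_pfn, start_pfn + spanned_pages). */
struct zone_span {
	uint64_t start_pfn;
	uint64_t spanned_pages;
};

/* Without NUMA the NVM span would intersect the Normal span, which the zone
 * layout described above cannot tolerate; with one NVDIMM per node the spans
 * stay disjoint. */
static bool spans_overlap(const struct zone_span *a, const struct zone_span *b)
{
	uint64_t a_end = a->start_pfn + a->spanned_pages;
	uint64_t b_end = b->start_pfn + b->spanned_pages;

	return a->start_pfn < b_end && b->start_pfn < a_end;
}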

Sincerely,
Huaisheng Ye 


Re: [PATCH v4 1/3] resource: Use list_head to link sibling resource

2018-05-08 Thread Baoquan He
On 05/07/18 at 07:42pm, kbuild test robot wrote:
> Hi Baoquan,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.17-rc4 next-20180504]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Baoquan-He/resource-Use-list_head-to-link-sibling-resource/20180507-144345
> config: powerpc-defconfig (attached as .config)
> compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=powerpc 
> 
> All errors (new ones prefixed by >>):
> 
>arch/powerpc/kernel/pci-common.c: In function 
> 'pci_process_bridge_OF_ranges':
> >> arch/powerpc/kernel/pci-common.c:764:44: error: incompatible types when 
> >> assigning to type 'struct list_head' from type 'void *'
>res->parent = res->child = res->sibling = NULL;

The code pasted below can fix the above error.

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index fe9733aa..a7e68f6f9f24 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -761,7 +761,9 @@ void pci_process_bridge_OF_ranges(struct pci_controller *hose,
 			res->flags = range.flags;
 			res->start = range.cpu_addr;
 			res->end = range.cpu_addr + range.size - 1;
-			res->parent = res->child = res->sibling = NULL;
+			res->parent = NULL;
+			INIT_LIST_HEAD(&res->child);
+			INIT_LIST_HEAD(&res->sibling);
 		}
 	}
 }

>^
>arch/powerpc/kernel/pci-common.c: In function 'reparent_resources':
> >> arch/powerpc/kernel/pci-common.c:1100:10: error: assignment from 
> >> incompatible pointer type [-Werror=incompatible-pointer-types]
>  for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
>  ^

This reparent_resources() function is a duplicate of the one in
arch/microblaze/pci/pci-common.c, which has been fixed in v4. I planned to move
it to kernel/resource.c in a separate patch, since it's shared by different
architectures, and then fix it in this patch.
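
As an aside, once ->child and ->sibling are list_heads, a child walk (forwards
or in reverse, the original motivation for the series) reads like the hedged
sketch below; this is illustration only, not the actual v4 patch:

#include <linux/list.h>
#include <linux/ioport.h>

/* Find the last child of 'parent' that ends at or below 'limit'.  With the
 * sibling list converted to list_head, reverse iteration is a plain list
 * walk; field names follow the series. */
static struct resource *last_child_ending_before(struct resource *parent,
						 resource_size_t limit)
{
	struct resource *p;

	list_for_each_entry_reverse(p, &parent->child, sibling)
		if (p->end <= limit)
			return p;
	return NULL;
}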

>arch/powerpc/kernel/pci-common.c:1100:50: error: assignment from 
> incompatible pointer type [-Werror=incompatible-pointer-types]
>  for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
>  ^
> >> arch/powerpc/kernel/pci-common.c:1113:13: error: incompatible types when 
> >> assigning to type 'struct list_head' from type 'struct resource *'
>  res->child = *firstpp;
> ^
>arch/powerpc/kernel/pci-common.c:1114:15: error: incompatible types when 
> assigning to type 'struct list_head' from type 'struct resource *'
>  res->sibling = *pp;
>   ^
> >> arch/powerpc/kernel/pci-common.c:1117:9: error: incompatible types when 
> >> assigning to type 'struct resource *' from type 'struct list_head'
>  for (p = res->child; p != NULL; p = p->sibling) {
> ^
>arch/powerpc/kernel/pci-common.c:1117:36: error: incompatible types when 
> assigning to type 'struct resource *' from type 'struct list_head'
>  for (p = res->child; p != NULL; p = p->sibling) {
>^
>cc1: all warnings being treated as errors
> 
> vim +764 arch/powerpc/kernel/pci-common.c
> 
> 13dccb9e Benjamin Herrenschmidt 2007-12-11  642  
> 13dccb9e Benjamin Herrenschmidt 2007-12-11  643  /**
> 13dccb9e Benjamin Herrenschmidt 2007-12-11  644   * 
> pci_process_bridge_OF_ranges - Parse PCI bridge resources from device tree
> 13dccb9e Benjamin Herrenschmidt 2007-12-11  645   * @hose: newly allocated 
> pci_controller to be setup
> 13dccb9e Benjamin Herrenschmidt 2007-12-11  646   * @dev: device node of the 
> host bridge
> 13dccb9e Benjamin Herrenschmidt 2007-12-11  647   * @primary: set if primary 
> bus (32 bits only, soon to be deprecated)
> 13dccb9e Benjamin Herrenschmidt 2007-12-11  648   *
> 13dccb9e Benjamin Herrenschmidt 2007-12-11  649   * This function will parse 
> the "ranges" property of a PCI host bridge device
> 13dccb9e Benjamin Herrenschmidt 2007-12-11  650   * node and setup the 
> resource mapping of a pci controller based on its
> 13dccb9e Benjamin Herrenschmidt 2007-12-11  651   * content.
> 13dccb9e Benjamin Herrenschmidt 2007-12-11  652   *
> 13dccb9e Benjamin Herrenschmidt 2007-12-11  653   * Life would be boring if 
> it wasn't for a few issues that we have to deal
> 13dccb9e Benjamin Herrenschmidt 2007-12-11  654   * with here:
> 13dccb9e Benjamin Herrenschmidt 2007-12-11  655   *
> 13dccb9e Benjamin Herrenschmidt 2007-12-11  656   *   - We can 

Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Alex Williamson
On Tue, 8 May 2018 17:31:48 -0600
Logan Gunthorpe  wrote:
> On 08/05/18 05:11 PM, Alex Williamson wrote:
> > A runtime, sysfs approach has some benefits here,
> > especially in identifying the device assuming we're ok with leaving
> > the persistence problem to userspace tools.  I'm still a little fond of
> > the idea of exposing an acs_flags attribute for devices in sysfs where
> > a write would do a soft unplug and re-add of all affected devices to
> > automatically recreate the proper grouping.  Any dynamic change in
> > routing and grouping would require all DMA be re-established anyway and
> > a soft hotplug seems like an elegant way of handling it.  Thanks,  
> 
> This approach sounds like it has a lot more issues to contend with:
> 
> For starters, a soft unplug/re-add of all the devices behind a switch is
> going to be difficult if a lot of those devices have had drivers
> installed and their respective resources are now mounted or otherwise in
> use.
> 
> Then, do we have to redo the soft-replace every time we change the ACS
> bit for every downstream port? That could mean you have to do dozens of
> soft-replaces before you have all the ACS bits set, which means you have
> a storm of drivers being added and removed.

True, anything requiring tweaking multiple downstream ports would
induce a hot-unplug/replug for each.  A better sysfs interface would
allow multiple downstream ports to be updated in a single shot.

> This would require some kind of fancy custom setup software that runs at
> just the right time in the boot sequence or a lot of work on the users
> part to unbind all the resources, setup the ACS bits and then rebind
> everything (assuming the soft re-add doesn't rebind it every time you
> adjust one ACS bit). Ugly.
> 
> IMO, if we need to do the sysfs approach then we need to be able to
> adjust the groups dynamically in a sensible way and not through the
> large hammer that is soft-replaces. I think this would be great but I
> don't think we will be tackling that with this patch set.

OTOH, I think the only sensible way to dynamically adjust groups is
through hotplug, we cannot have running drivers attached to downstream
endpoints as we're adjusting the routing.  Thanks,

Alex


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Alex Williamson
On Tue, 8 May 2018 19:06:17 -0400
Don Dutile  wrote:
> On 05/08/2018 05:27 PM, Stephen  Bates wrote:
> > As I understand it VMs need to know because VFIO passes IOMMU
> > grouping up into the VMs. So if a IOMMU grouping changes the VM's
> > view of its PCIe topology changes. I think we even have to be
> > cognizant of the fact the OS running on the VM may not even support
> > hot-plug of PCI devices.  
> Alex:
> Really? IOMMU groups are created by the kernel, so I don't know how
> they would be passed into the VMs, unless indirectly via PCI(e)
> layout. At best, twiddling w/ACS enablement (emulation) would cause
> VMs to see different IOMMU groups, but again, VMs are not the
> security point/level, the host/HV's are.

Correct, the VM has no concept of the host's IOMMU groups, only the
hypervisor knows about the groups, but really only to the extent of
which device belongs to which group and whether the group is viable.
Any runtime change to grouping though would require DMA mapping
updates, which I don't see how we can reasonably do with drivers,
vfio-pci or native host drivers, bound to the affected devices.  Thanks,

Alex


Re: [PATCH v3 1/3] resource: Use list_head to link sibling resource

2018-05-08 Thread Wei Yang
On Tue, May 08, 2018 at 08:11:23PM +0800, Baoquan He wrote:
>On 05/08/18 at 08:48pm, Wei Yang wrote:
>> On Mon, May 07, 2018 at 09:14:29AM +0800, Baoquan He wrote:
>> >Hi Wei Yang,
>> >
>> >On 04/26/18 at 09:18am, Wei Yang wrote:
>> >> On Thu, Apr 19, 2018 at 08:18:46AM +0800, Baoquan He wrote:
>> >> >The struct resource uses singly linked list to link siblings. It's not
>> >> >easy to do reverse iteration on sibling list. So replace it with 
>> >> >list_head.
>> >> >
>> >> 
>> >> Hi, Baoquan
>> >> 
>> >> Besides changing the data structure, I have another proposal to do the 
>> >> reverse
>> >> iteration. Which means it would not affect other users, if you just want a
>> >> reverse iteration.
>> >> 
>> >> BTW, I don't think Andrew suggest to use linked-list directly. What he 
>> >> wants
>> >> is a better solution to your first proposal in
>> >> https://patchwork.kernel.org/patch/10300819/.
>> >> 
>> >> Below is my proposal of resource reverse iteration without changing 
>> >> current
>> >> design.
>> >
>> >I got your mail and read it, then interrupted by other thing and forgot
>> >replying, sorry.
>> >
>> >I am fine with your code change. As I said before, I have tried to change
>> >code per reviewers' comment, then let reviewers decide which way is
>> >better. Please feel free to post formal patches and joining discussion
>> >about this issue.
>> 
>> Yep. While I don't have a real requirement to add the reverse version,
>> what is the proper way to send a patch?
>> 
>> Is a patch reply to this thread OK?
>
>I am not sure either, since my patches are still under review and you
>have already pasted your patch. It depends on the maintainers, mainly
>Andrew, and other reviewers who have concerns.

Ok, thanks.

-- 
Wei Yang
Help you, Help me


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Logan Gunthorpe


On 08/05/18 05:11 PM, Alex Williamson wrote:
> On to the implementation details... I already mentioned the BDF issue
> in my other reply.  If we had a way to persistently identify a device,
> would we specify the downstream points at which we want to disable ACS
> or the endpoints that we want to connect?  The latter has a problem
> that the grouping upstream of an endpoint is already set by the time we
> discover the endpoint, so we might need to unwind to get the grouping
> correct.  The former might be more difficult for users to find the
> necessary nodes, but easier for the kernel to deal with during
> discovery.  

I was envisioning the former with kernel helping by printing a dmesg in
certain circumstances to help with figuring out which devices need to be
specified. Specifying a list of endpoints on the command line and having
the kernel try to figure out which downstream ports need to be adjusted
while we are in the middle of enumerating the bus is, like you said, a
nightmare.

> A runtime, sysfs approach has some benefits here,
> especially in identifying the device assuming we're ok with leaving
> the persistence problem to userspace tools.  I'm still a little fond of
> the idea of exposing an acs_flags attribute for devices in sysfs where
> a write would do a soft unplug and re-add of all affected devices to
> automatically recreate the proper grouping.  Any dynamic change in
> routing and grouping would require all DMA be re-established anyway and
> a soft hotplug seems like an elegant way of handling it.  Thanks,

This approach sounds like it has a lot more issues to contend with:

For starters, a soft unplug/re-add of all the devices behind a switch is
going to be difficult if a lot of those devices have had drivers
installed and their respective resources are now mounted or otherwise in
use.

Then, do we have to redo the soft-replace every time we change the ACS
bit for every downstream port? That could mean you have to do dozens of
soft-replaces before you have all the ACS bits set, which means you have
a storm of drivers being added and removed.

This would require some kind of fancy custom setup software that runs at
just the right time in the boot sequence or a lot of work on the users
part to unbind all the resources, setup the ACS bits and then rebind
everything (assuming the soft re-add doesn't rebind it every time you
adjust one ACS bit). Ugly.

IMO, if we need to do the sysfs approach then we need to be able to
adjust the groups dynamically in a sensible way and not through the
large hammer that is soft-replaces. I think this would be great but I
don't think we will be tackling that with this patch set.

Logan




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Logan Gunthorpe


On 08/05/18 05:00 PM, Dan Williams wrote:
>> I'd advise caution with a user supplied BDF approach, we have no
>> guaranteed persistence for a device's PCI address.  Adding a device
>> might renumber the buses, replacing a device with one that consumes
>> more/less bus numbers can renumber the buses, motherboard firmware
>> updates could renumber the buses, pci=assign-buses can renumber the
>> buses, etc.  This is why the VT-d spec makes use of device paths when
>> describing PCI hierarchies, firmware can't know what bus number will be
>> assigned to a device, but it does know the base bus number and the path
>> of devfns needed to get to it.  I don't know how we come up with an
>> option that's easy enough for a user to understand, but reasonably
>> robust against hardware changes.  Thanks,
> 
> True, but at the same time this feature is for "users with custom
> hardware designed for purpose", I assume they would be willing to take
> on the bus renumbering risk. It's already the case that
> /sys/bus/pci/drivers//bind takes BDF, which is why it seemed to
> make a similar interface for the command line. Ideally we could later
> get something into ACPI or other platform firmware to arrange for
> bridges to disable ACS by default if we see p2p becoming a
> common-off-the-shelf feature. I.e. a BIOS switch to enable p2p in a
> given PCI-E sub-domain.

Yeah, I'm having a hard time coming up with an easy enough solution for
the user. I agree with Dan though, the bus renumbering risk would be
fairly low in the custom hardware seeing the switches are likely going
to be directly soldered to the same board with the CPU.

That being said, I suppose we could allow the command line to take either
a BDF or a BaseBus/DF/DF/DF path. Though, implementing this sounds like
a bit of a challenge.

Logan


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Alex Williamson
On Tue, 8 May 2018 22:25:06 +
"Stephen  Bates"  wrote:

> >Yeah, so based on the discussion I'm leaning toward just having a
> >command line option that takes a list of BDFs and disables ACS
> > for them. (Essentially as Dan has suggested.) This avoids the
> > shotgun.  
> 
> I concur that this seems to be where the conversation is taking us.
> 
> @Alex - Before we go do this can you provide input on the approach? I
> don't want to re-spin only to find we are still not converging on the
> ACS issue

I can envision numerous implementation details that makes this less
trivial than it sounds, but it seems like the thing we need to decide
first is if intentionally leaving windows between devices with the
intention of exploiting them for direct P2P DMA in an otherwise IOMMU
managed address space is something we want to do.  From a security
perspective, we already handle this with IOMMU groups because many
devices do not support ACS, the new thing is embracing this rather than
working around it.  It makes me a little twitchy, but so long as the
IOMMU groups match the expected worst case routing between devices,
it's really no different than if we could wipe the ACS capability from
the device.

On to the implementation details... I already mentioned the BDF issue
in my other reply.  If we had a way to persistently identify a device,
would we specify the downstream points at which we want to disable ACS
or the endpoints that we want to connect?  The latter has a problem
that the grouping upstream of an endpoint is already set by the time we
discover the endpoint, so we might need to unwind to get the grouping
correct.  The former might be more difficult for users to find the
necessary nodes, but easier for the kernel to deal with during
discovery.  A runtime, sysfs approach has some benefits here,
especially in identifying the device assuming we're ok with leaving
the persistence problem to userspace tools.  I'm still a little fond of
the idea of exposing an acs_flags attribute for devices in sysfs where
a write would do a soft unplug and re-add of all affected devices to
automatically recreate the proper grouping.  Any dynamic change in
routing and grouping would require all DMA be re-established anyway and
a soft hotplug seems like an elegant way of handling it.  Thanks,

Alex


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Don Dutile

On 05/08/2018 05:27 PM, Stephen  Bates wrote:

Hi Don


Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two 
devices.
That agent should 'request' to the kernel that ACS be removed/circumvented 
(p2p enabled) btwn two endpoints.
I recommend doing so via a sysfs method.


Yes we looked at something like this in the past but it does hit the IOMMU 
grouping issue I discussed earlier today which is not acceptable right now. In 
the long term, once we get IOMMU grouping callbacks to VMs we can look at 
extending p2pdma in this way. But I don't think this is viable for the initial 
series.

 

So I don't understand the comments why VMs should need to know.


As I understand it VMs need to know because VFIO passes IOMMU grouping up into 
the VMs. So if a IOMMU grouping changes the VM's view of its PCIe topology 
changes. I think we even have to be cognizant of the fact the OS running on the 
VM may not even support hot-plug of PCI devices.

Alex:
Really? IOMMU groups are created by the kernel, so don't know how they would be 
passed into the VMs, unless indirectly via  PCI(e) layout.
At best, twiddling w/ACS enablement (emulation) would cause VMs to see 
different IOMMU groups, but again, VMs are not the security point/level, the 
host/HV's are.

 

Is there a thread I need to read up to explain /clear-up the thoughts above?


If you search for p2pdma you should find the previous discussions. Thanks for 
the input!

Stephen
 
 





Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Dan Williams
On Tue, May 8, 2018 at 3:32 PM, Alex Williamson
 wrote:
> On Tue, 8 May 2018 16:10:19 -0600
> Logan Gunthorpe  wrote:
>
>> On 08/05/18 04:03 PM, Alex Williamson wrote:
>> > If IOMMU grouping implies device assignment (because nobody else uses
>> > it to the same extent as device assignment) then the build-time option
>> > falls to pieces, we need a single kernel that can do both.  I think we
>> > need to get more clever about allowing the user to specify exactly at
>> > which points in the topology they want to disable isolation.  Thanks,
>>
>>
>> Yeah, so based on the discussion I'm leaning toward just having a
>> command line option that takes a list of BDFs and disables ACS for them.
>> (Essentially as Dan has suggested.) This avoids the shotgun.
>>
>> Then, the pci_p2pdma_distance command needs to check that ACS is
>> disabled for all bridges between the two devices. If this is not the
>> case, it returns -1. Future work can check if the EP has ATS support, in
>> which case it has to check for the ACS direct translated bit.
>>
>> A user then needs to either disable the IOMMU and/or add the command
>> line option to disable ACS for the specific downstream ports in the PCI
>> hierarchy. This means the IOMMU groups will be less granular but
>> presumably the person adding the command line argument understands this.
>>
>> We may also want to do some work so that there's informative dmesgs on
>> which BDFs need to be specified on the command line so it's not so
>> difficult for the user to figure out.
>
> I'd advise caution with a user supplied BDF approach, we have no
> guaranteed persistence for a device's PCI address.  Adding a device
> might renumber the buses, replacing a device with one that consumes
> more/less bus numbers can renumber the buses, motherboard firmware
> updates could renumber the buses, pci=assign-buses can renumber the
> buses, etc.  This is why the VT-d spec makes use of device paths when
> describing PCI hierarchies, firmware can't know what bus number will be
> assigned to a device, but it does know the base bus number and the path
> of devfns needed to get to it.  I don't know how we come up with an
> option that's easy enough for a user to understand, but reasonably
> robust against hardware changes.  Thanks,

True, but at the same time this feature is for "users with custom
hardware designed for purpose", I assume they would be willing to take
on the bus renumbering risk. It's already the case that
/sys/bus/pci/drivers//bind takes BDF, which is why it seemed to
make a similar interface for the command line. Ideally we could later
get something into ACPI or other platform firmware to arrange for
bridges to disable ACS by default if we see p2p becoming a
common-off-the-shelf feature. I.e. a BIOS switch to enable p2p in a
given PCI-E sub-domain.


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Stephen Bates
>Yeah, so based on the discussion I'm leaning toward just having a
>command line option that takes a list of BDFs and disables ACS for them.
>(Essentially as Dan has suggested.) This avoids the shotgun.

I concur that this seems to be where the conversation is taking us.

@Alex - Before we go do this can you provide input on the approach? I don't 
want to re-spin only to find we are still not converging on the ACS issue

Thanks

Stephen




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Don Dutile

On 05/08/2018 06:03 PM, Alex Williamson wrote:

On Tue, 8 May 2018 21:42:27 +
"Stephen  Bates"  wrote:


Hi Alex


But it would be a much easier proposal to disable ACS when the
IOMMU is not enabled, ACS has no real purpose in that case.


I guess one issue I have with this is that it disables IOMMU groups
for all Root Ports and not just the one(s) we wish to do p2pdma on.


But as I understand this series, we're not really targeting specific
sets of devices either.  It's more of a shotgun approach that we
disable ACS on downstream switch ports and hope that we get the right
set of devices, but with the indecisiveness that we might later
white-list select root ports to further increase the blast radius.


The IOMMU and P2P are already not exclusive, we can bounce off
the IOMMU or make use of ATS as we've previously discussed.  We were
previously talking about a build time config option that you
didn't expect distros to use, so I don't think intervention for the
user to disable the IOMMU if it's enabled by default is a serious
concern either.


ATS definitely makes things more interesting for the cases where the
EPs support it. However I don't really have a handle on how common
ATS support is going to be in the kinds of devices we have been
focused on (NVMe SSDs and RDMA NICs mostly).


What you're trying to do is enable direct peer-to-peer for
endpoints which do not support ATS when the IOMMU is enabled, which
is not something that necessarily makes sense to me.


As above the advantage of leaving the IOMMU on is that it allows for
both p2pdma PCI domains and IOMMU groupings PCI domains in the same
system. It is just that these domains will be separate to each other.


That argument makes sense if we had the ability to select specific sets
of devices, but that's not the case here, right?  With the shotgun
approach, we're clearly favoring one at the expense of the other and
it's not clear why we don't simply force the needle all the way in that
direction such that the results are at least predictable.


  So that leaves avoiding bounce buffers as the remaining IOMMU feature


I agree with you here that the devices we will want to use for p2p
will probably not require a bounce buffer and will support 64 bit DMA
addressing.


I'm still not seeing why it's terribly undesirable to require
devices to support ATS if they want to do direct P2P with an IOMMU
enabled.


I think the one reason is for the use-case above. Allowing IOMMU
groupings on one domain and p2pdma on another domain


If IOMMU grouping implies device assignment (because nobody else uses
it to the same extent as device assignment) then the build-time option
falls to pieces, we need a single kernel that can do both.  I think we
need to get more clever about allowing the user to specify exactly at
which points in the topology they want to disable isolation.  Thanks,

Alex


+1/ack

RDMA VFs lend themselves to NVMe-oF with device assignment; they need a way to
put NVMe 'resources' into an assignable/manageable object for 'IOMMU grouping',
which is really a 'DMA security domain' and less an 'IOMMU grouping domain'.








Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Logan Gunthorpe


On 08/05/18 04:03 PM, Alex Williamson wrote:
> If IOMMU grouping implies device assignment (because nobody else uses
> it to the same extent as device assignment) then the build-time option
> falls to pieces, we need a single kernel that can do both.  I think we
> need to get more clever about allowing the user to specify exactly at
> which points in the topology they want to disable isolation.  Thanks,


Yeah, so based on the discussion I'm leaning toward just having a
command line option that takes a list of BDFs and disables ACS for them.
(Essentially as Dan has suggested.) This avoids the shotgun.

Then, the pci_p2pdma_distance command needs to check that ACS is
disabled for all bridges between the two devices. If this is not the
case, it returns -1. Future work can check if the EP has ATS support, in
which case it has to check for the ACS direct translated bit.
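
A hedged sketch of that check, assuming we simply walk from one endpoint up to
the common upstream bridge (the helper name and exact flag set are mine, not
the series'):

#include <linux/pci.h>

/* Return true if any port between 'ep' and 'common' (exclusive) still has ACS
 * request/completion redirection enabled, i.e. peer-to-peer TLPs would be
 * redirected upstream instead of routed directly. */
static bool path_acs_redirects(struct pci_dev *ep, struct pci_dev *common)
{
	struct pci_dev *p;

	for (p = pci_upstream_bridge(ep); p && p != common;
	     p = pci_upstream_bridge(p)) {
		if (pci_acs_enabled(p, PCI_ACS_RR | PCI_ACS_CR))
			return true;
	}
	return false;
}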

A user then needs to either disable the IOMMU and/or add the command
line option to disable ACS for the specific downstream ports in the PCI
hierarchy. This means the IOMMU groups will be less granular but
presumably the person adding the command line argument understands this.

We may also want to do some work so that there's informative dmesgs on
which BDFs need to be specified on the command line so it's not so
difficult for the user to figure out.
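
And a rough sketch of the sort of boot parameter being discussed; the
"disable_acs_redir" name and the substring match are assumptions for
illustration, not part of the posted series:

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/pci.h>
#include <linux/string.h>

/* e.g. disable_acs_redir=0000:01:00.0,0000:02:00.0 */
static char *disable_acs_redir_param;

static int __init parse_disable_acs_redir(char *str)
{
	disable_acs_redir_param = str;
	return 1;
}
__setup("disable_acs_redir=", parse_disable_acs_redir);

/* During enumeration: was this bridge's BDF listed on the command line? */
static bool bdf_listed(struct pci_dev *dev)
{
	char bdf[32];

	if (!disable_acs_redir_param)
		return false;
	snprintf(bdf, sizeof(bdf), "%04x:%02x:%02x.%d",
		 pci_domain_nr(dev->bus), dev->bus->number,
		 PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
	return strstr(disable_acs_redir_param, bdf) != NULL;
}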

Logan


Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-08 Thread Alex Williamson
On Tue, 8 May 2018 17:25:24 -0400
Don Dutile  wrote:

> On 05/08/2018 12:57 PM, Alex Williamson wrote:
> > On Mon, 7 May 2018 18:23:46 -0500
> > Bjorn Helgaas  wrote:
> >   
> >> On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:  
> >>> Hi Everyone,
> >>>
> >>> Here's v4 of our series to introduce P2P based copy offload to NVMe
> >>> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> >>> is here:
> >>>
> >>> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> >>> ...  
> >>  
> >>> Logan Gunthorpe (14):
> >>>PCI/P2PDMA: Support peer-to-peer memory
> >>>PCI/P2PDMA: Add sysfs group to display p2pmem stats
> >>>PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
> >>>PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> >>>docs-rst: Add a new directory for PCI documentation
> >>>PCI/P2PDMA: Add P2P DMA driver writer's documentation
> >>>block: Introduce PCI P2P flags for request and request queue
> >>>IB/core: Ensure we map P2P memory correctly in
> >>>  rdma_rw_ctx_[init|destroy]()
> >>>nvme-pci: Use PCI p2pmem subsystem to manage the CMB
> >>>nvme-pci: Add support for P2P memory in requests
> >>>nvme-pci: Add a quirk for a pseudo CMB
> >>>nvmet: Introduce helper functions to allocate and free request SGLs
> >>>nvmet-rdma: Use new SGL alloc/free helper for requests
> >>>nvmet: Optionally use PCI P2P memory
> >>>
> >>>   Documentation/ABI/testing/sysfs-bus-pci|  25 +
> >>>   Documentation/PCI/index.rst|  14 +
> >>>   Documentation/driver-api/index.rst |   2 +-
> >>>   Documentation/driver-api/pci/index.rst |  20 +
> >>>   Documentation/driver-api/pci/p2pdma.rst| 166 ++
> >>>   Documentation/driver-api/{ => pci}/pci.rst |   0
> >>>   Documentation/index.rst|   3 +-
> >>>   block/blk-core.c   |   3 +
> >>>   drivers/infiniband/core/rw.c   |  13 +-
> >>>   drivers/nvme/host/core.c   |   4 +
> >>>   drivers/nvme/host/nvme.h   |   8 +
> >>>   drivers/nvme/host/pci.c| 118 +++--
> >>>   drivers/nvme/target/configfs.c |  67 +++
> >>>   drivers/nvme/target/core.c | 143 -
> >>>   drivers/nvme/target/io-cmd.c   |   3 +
> >>>   drivers/nvme/target/nvmet.h|  15 +
> >>>   drivers/nvme/target/rdma.c |  22 +-
> >>>   drivers/pci/Kconfig|  26 +
> >>>   drivers/pci/Makefile   |   1 +
> >>>   drivers/pci/p2pdma.c   | 814 
> >>> +
> >>>   drivers/pci/pci.c  |   6 +
> >>>   include/linux/blk_types.h  |  18 +-
> >>>   include/linux/blkdev.h |   3 +
> >>>   include/linux/memremap.h   |  19 +
> >>>   include/linux/pci-p2pdma.h | 118 +
> >>>   include/linux/pci.h|   4 +
> >>>   26 files changed, 1579 insertions(+), 56 deletions(-)
> >>>   create mode 100644 Documentation/PCI/index.rst
> >>>   create mode 100644 Documentation/driver-api/pci/index.rst
> >>>   create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> >>>   rename Documentation/driver-api/{ => pci}/pci.rst (100%)
> >>>   create mode 100644 drivers/pci/p2pdma.c
> >>>   create mode 100644 include/linux/pci-p2pdma.h  
> >>
> >> How do you envison merging this?  There's a big chunk in drivers/pci, but
> >> really no opportunity for conflicts there, and there's significant stuff in
> >> block and nvme that I don't really want to merge.
> >>
> >> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> >> merge it elsewhere?  
> > 
> > AIUI from previously questioning this, the change is hidden behind a
> > build-time config option and only custom kernels or distros optimized
> > for this sort of support would enable that build option.  I'm more than
> > a little dubious though that we're not going to have a wave of distros
> > enabling this only to get user complaints that they can no longer make
> > effective use of their devices for assignment due to the resulting span
> > of the IOMMU groups, nor is there any sort of compromise, configure
> > the kernel for p2p or device assignment, not both.  Is this really such
> > a unique feature that distro users aren't going to be asking for both
> > features?  Thanks,
> > 
> > Alex  
> At least 1/2 the cases presented to me by existing customers want it in a 
> tunable kernel,
> and tunable btwn two points, if the hw allows it to be 'contained' in that 
> manner, which
> a (layer of) switch(ing) provides.
> To me, that means a kernel cmdline parameter to _enable_, and another sysfs 
> (configfs? ... i'm not a configfs afficionato to say which is best),
> method to make two points p2p dma capable.

That's not what's done 

Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Stephen Bates
Hi Jerome

>I think there is confusion here, Alex properly explained the scheme
>   PCIE-device do a ATS request to the IOMMU which returns a valid
>translation for a virtual address. Device can then use that address
>directly without going through IOMMU for translation.

This makes sense and to be honest I now understand ATS and its interaction with 
ACS a lot better than I did 24 hours ago ;-).

>ATS is implemented by the IOMMU not by the device (well device implement
>the client side of it). Also ATS is meaningless without something like
>PASID as far as i know.

I think it's the client side that is what is important to us. Not many EPs
support ATS today and it's not clear if many will in the future. So, assuming
we want to do p2pdma between devices (some of) which do NOT support ATS, how
best do we handle the ACS issue? Disabling the IOMMU seems a bit strong to me,
given this impacts all the PCI domains in the system and not just the domain we
wish to do P2P on.

Stephen



Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Stephen Bates
Hi Don

>Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two 
>devices.
>That agent should 'request' to the kernel that ACS be removed/circumvented 
> (p2p enabled) btwn two endpoints.
>I recommend doing so via a sysfs method.

Yes we looked at something like this in the past but it does hit the IOMMU 
grouping issue I discussed earlier today which is not acceptable right now. In 
the long term, once we get IOMMU grouping callbacks to VMs we can look at 
extending p2pdma in this way. But I don't think this is viable for the initial 
series. 


>So I don't understand the comments why VMs should need to know.

As I understand it VMs need to know because VFIO passes IOMMU grouping up into
the VMs. So if an IOMMU grouping changes, the VM's view of its PCIe topology
changes. I think we even have to be cognizant of the fact that the OS running
on the VM may not even support hot-plug of PCI devices.

> Is there a thread I need to read up to explain /clear-up the thoughts above?

If you search for p2pdma you should find the previous discussions. Thanks for 
the input!

Stephen





Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Alex Williamson
On Tue, 8 May 2018 14:49:23 -0600
Logan Gunthorpe  wrote:

> On 08/05/18 02:43 PM, Alex Williamson wrote:
> > Yes, GPUs seem to be leading the pack in implementing ATS.  So now the
> > dumb question, why not simply turn off the IOMMU and thus ACS?  The
> > argument of using the IOMMU for security is rather diminished if we're
> > specifically enabling devices to poke one another directly and clearly
> > this isn't favorable for device assignment either.  Are there target
> > systems where this is not a simple kernel commandline option?  Thanks,  
> 
> Well, turning off the IOMMU doesn't necessarily turn off ACS. We've run
> into some bios's that set the bits on boot (which is annoying).

But it would be a much easier proposal to disable ACS when the IOMMU is
not enabled, ACS has no real purpose in that case.

> I also don't expect people will respond well to making the IOMMU and P2P
> exclusive. The IOMMU is often used for more than just security and on
> many platforms it's enabled by default. I'd much rather allow IOMMU use
> but have fewer isolation groups in much the same way as if you had PCI
> bridges that didn't support ACS.

The IOMMU and P2P are already not exclusive, we can bounce off the
IOMMU or make use of ATS as we've previously discussed.  We were
previously talking about a build time config option that you didn't
expect distros to use, so I don't think intervention for the user to
disable the IOMMU if it's enabled by default is a serious concern
either.

What you're trying to do is enable direct peer-to-peer for endpoints
which do not support ATS when the IOMMU is enabled, which is not
something that necessarily makes sense to me.  As I mentioned in a
previous reply, the IOMMU provides us with an I/O virtual address space
for devices, ACS is meant to fill the topology based gaps in that
virtual address space, making transactions follow IOMMU compliant
routing rules to avoid aliases between the IOVA and physical address
spaces.  But this series specifically wants to leave those gaps open
for direct P2P access.

So we compromise the P2P aspect of security, still protecting RAM, but
potentially only to the extent that a device cannot hop through or
interfere with other devices to do its bidding.  Device assignment is
mostly tossed out the window because not only are bigger groups more
difficult to deal with, the IOVA space is riddled with gaps, which is
not really a solved problem.  So that leaves avoiding bounce buffers as
the remaining IOMMU feature, but we're dealing with native express
devices and relatively high end devices that are probably installed in
modern systems, so that seems like a non-issue.

Are there other uses I'm forgetting?  We can enable interrupt remapping
separate from DMA translation, so we can exclude that one.  I'm still
not seeing why it's terribly undesirable to require devices to support
ATS if they want to do direct P2P with an IOMMU enabled.  Thanks,

Alex


Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-08 Thread Don Dutile

On 05/08/2018 12:57 PM, Alex Williamson wrote:

On Mon, 7 May 2018 18:23:46 -0500
Bjorn Helgaas  wrote:


On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:

Hi Everyone,

Here's v4 of our series to introduce P2P based copy offload to NVMe
fabrics. This version has been rebased onto v4.17-rc2. A git repo
is here:

https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
...



Logan Gunthorpe (14):
   PCI/P2PDMA: Support peer-to-peer memory
   PCI/P2PDMA: Add sysfs group to display p2pmem stats
   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
   docs-rst: Add a new directory for PCI documentation
   PCI/P2PDMA: Add P2P DMA driver writer's documentation
   block: Introduce PCI P2P flags for request and request queue
   IB/core: Ensure we map P2P memory correctly in
 rdma_rw_ctx_[init|destroy]()
   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
   nvme-pci: Add support for P2P memory in requests
   nvme-pci: Add a quirk for a pseudo CMB
   nvmet: Introduce helper functions to allocate and free request SGLs
   nvmet-rdma: Use new SGL alloc/free helper for requests
   nvmet: Optionally use PCI P2P memory

  Documentation/ABI/testing/sysfs-bus-pci|  25 +
  Documentation/PCI/index.rst|  14 +
  Documentation/driver-api/index.rst |   2 +-
  Documentation/driver-api/pci/index.rst |  20 +
  Documentation/driver-api/pci/p2pdma.rst| 166 ++
  Documentation/driver-api/{ => pci}/pci.rst |   0
  Documentation/index.rst|   3 +-
  block/blk-core.c   |   3 +
  drivers/infiniband/core/rw.c   |  13 +-
  drivers/nvme/host/core.c   |   4 +
  drivers/nvme/host/nvme.h   |   8 +
  drivers/nvme/host/pci.c| 118 +++--
  drivers/nvme/target/configfs.c |  67 +++
  drivers/nvme/target/core.c | 143 -
  drivers/nvme/target/io-cmd.c   |   3 +
  drivers/nvme/target/nvmet.h|  15 +
  drivers/nvme/target/rdma.c |  22 +-
  drivers/pci/Kconfig|  26 +
  drivers/pci/Makefile   |   1 +
  drivers/pci/p2pdma.c   | 814 +
  drivers/pci/pci.c  |   6 +
  include/linux/blk_types.h  |  18 +-
  include/linux/blkdev.h |   3 +
  include/linux/memremap.h   |  19 +
  include/linux/pci-p2pdma.h | 118 +
  include/linux/pci.h|   4 +
  26 files changed, 1579 insertions(+), 56 deletions(-)
  create mode 100644 Documentation/PCI/index.rst
  create mode 100644 Documentation/driver-api/pci/index.rst
  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
  rename Documentation/driver-api/{ => pci}/pci.rst (100%)
  create mode 100644 drivers/pci/p2pdma.c
  create mode 100644 include/linux/pci-p2pdma.h


How do you envison merging this?  There's a big chunk in drivers/pci, but
really no opportunity for conflicts there, and there's significant stuff in
block and nvme that I don't really want to merge.

If Alex is OK with the ACS situation, I can ack the PCI parts and you could
merge it elsewhere?


AIUI from previously questioning this, the change is hidden behind a
build-time config option and only custom kernels or distros optimized
for this sort of support would enable that build option.  I'm more than
a little dubious though that we're not going to have a wave of distros
enabling this only to get user complaints that they can no longer make
effective use of their devices for assignment due to the resulting span
of the IOMMU groups, nor is there any sort of compromise, configure
the kernel for p2p or device assignment, not both.  Is this really such
a unique feature that distro users aren't going to be asking for both
features?  Thanks,

Alex

At least half the cases presented to me by existing customers want it in a
tunable kernel, and tunable between two points, if the hw allows it to be
'contained' in that manner, which a (layer of) switch(ing) provides.
To me, that means a kernel cmdline parameter to _enable_, and another sysfs
(configfs? ... I'm not enough of a configfs aficionado to say which is best)
method to make two points p2p dma capable.

Worst case, the whole system is one large IOMMU group (the current mindset of
this static or run-time config option); best case (over time, more hw), a
secure set of the primary system with p2p-enabled sections that are deemed
'safe' or 'self-inflicting-unsecure', the latter being the case of today's VM
with an assigned device: it can scribble all over the VM, but not over other
VMs and not over the host/HV.




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Don Dutile

On 05/08/2018 10:44 AM, Stephen  Bates wrote:

Hi Dan


It seems unwieldy that this is a compile time option and not a runtime
option. Can't we have a kernel command line option to opt-in to this
behavior rather than require a wholly separate kernel image?
   
I think because of the security implications associated with p2pdma and ACS we
wanted to make it very clear people were choosing one (p2pdma) or the other
(IOMMU groupings and isolation). However, personally I would prefer including
the option of a run-time kernel parameter too. In fact, a few months ago I
proposed a small patch that did just that [1]. It never really went anywhere,
but if people were open to the idea we could look at adding it to the series.



Whether it is a kernel command-line option or a CONFIG option makes a clear
difference.
One does not have access to the kernel command-line w/o a few privs.
A CONFIG option prevents a distribution from having a default, locked-down
kernel _and_ the ability to be 'unlocked' if the customer/site is 'secure' via
other means.
A run/boot-time option is more flexible and achieves the best of both.
   

Why is this text added in a follow on patch and not the patch that
  introduced the config option?


Because the ACS section was added later in the series and this information is 
associated with that additional functionality.
 

I'm also wondering if that command line option can take a 'bus device
function' address of a switch to limit the scope of where ACS is
disabled.



Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two 
devices.
That agent should 'request' to the kernel that ACS be removed/circumvented (p2p 
enabled) btwn two endpoints.
I recommend doing so via a sysfs method.

That way, the system can limit the 'unsecure' space btwn two devices, likely 
configured on a separate switch, from the rest of the still-secured/ACS-enabled 
PCIe tree.
PCIe is pt-to-pt, effectively; maybe one would have multiple nics/fabrics doing
p2p to/from NVMe, but one could look at it as a list of pairs (nic1<->nvme1;
nic2<->nvme2; ...).
A pair listing would be optimal, allowing the kernel to figure out the ACS path
itself, rather than making the user enter an error-prone
endpoint-switch-switch-...-switch-endpoint chain (see the sketch below).
Additionally, systems that can/prefer to do so via an RP's IOMMU (not optimal,
but better than going all the way to/from memory, and a security/IOVA check is
possible) can modify the pt-to-pt ACS algorithm to accommodate that over time
(e.g., cap bits, be they hw or device-driver/extension/quirk defined, for each
bridge/RP in a PCI domain).

Kernels that never want to support P2P can simply be built without it enabled;
the cmdline option is then moot.
Kernels built with it on *still* need the cmdline option, to be blunt that the
kernel is enabling a feature that could render the entire (IO sub)system
unsecure.
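
A hedged sketch of the pair-style sysfs request described above; the attribute,
its input format and the relax_acs_path() helper are hypothetical and only
meant to show the shape of the interface:

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/pci.h>

static int relax_acs_path(struct pci_dev *a, struct pci_dev *b); /* hypothetical */

/* Userspace writes two BDFs, e.g. "0000:01:00.0 0000:02:00.0", and the kernel
 * relaxes ACS along the path between them. */
static ssize_t p2p_pair_store(struct device *dev,
			      struct device_attribute *attr,
			      const char *buf, size_t count)
{
	unsigned int dom[2], bus[2], slot[2], fn[2];
	struct pci_dev *a, *b;
	int ret;

	if (sscanf(buf, "%x:%x:%x.%x %x:%x:%x.%x",
		   &dom[0], &bus[0], &slot[0], &fn[0],
		   &dom[1], &bus[1], &slot[1], &fn[1]) != 8)
		return -EINVAL;

	a = pci_get_domain_bus_and_slot(dom[0], bus[0], PCI_DEVFN(slot[0], fn[0]));
	b = pci_get_domain_bus_and_slot(dom[1], bus[1], PCI_DEVFN(slot[1], fn[1]));
	ret = (a && b) ? relax_acs_path(a, b) : -ENODEV;

	pci_dev_put(a);
	pci_dev_put(b);
	return ret ? ret : count;
}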


By this you mean the address for either a RP, DSP, USP or MF EP below which we
disable ACS? We could do that, but I don't think it avoids the issue of changes
in IOMMU groupings as devices are added/removed. It simply changes the problem
from affecting an entire PCI domain to a sub-set of the domain. We can already
handle this by doing p2pdma on one RP and normal IOMMU isolation on the other
RPs in the system.


As devices are added, they start in ACS-enabled, secured mode.
As the sysfs entry modifies p2p ability, the IOMMU group is modified as well.


btw -- IOMMU grouping is a host/HV control issue, not a VM control/knowledge
issue, so I don't understand the comments about why VMs should need to know.
   -- configure p2p _before_ assigning devices to VMs ... iommu groups are
      checked at assignment time.
   -- so even with hot-add: separate iommu group, then enable p2p, it becomes
      the same IOMMU group, and then it can only be assigned to the same VM.
   -- VMs don't know IOMMUs & ACS are involved now, and won't later, even if
      devices are dynamically added/removed.

Is there a thread I need to read up to explain /clear-up the thoughts above?


Stephen

[1] https://marc.info/?l=linux-doc&m=150907188310838&w=2
 





Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Jerome Glisse
On Tue, May 08, 2018 at 02:19:05PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 08/05/18 02:13 PM, Alex Williamson wrote:
> > Well, I'm a bit confused, this patch series is specifically disabling
> > ACS on switches, but per the spec downstream switch ports implementing
> > ACS MUST implement direct translated P2P.  So it seems the only
> > potential gap here is the endpoint, which must support ATS or else
> > there's nothing for direct translated P2P to do.  The switch port plays
> > no part in the actual translation of the request, ATS on the endpoint
> > has already cached the translation and is now attempting to use it.
> > For the switch port, this only becomes a routing decision, the request
> > is already translated, therefore ACS RR and EC can be ignored to
> > perform "normal" (direct) routing, as if ACS were not present.  It would
> > be a shame to go to all the trouble of creating this no-ACS mode to find
> > out the target hardware supports ATS and should have simply used it, or
> > we should have disabled the IOMMU altogether, which leaves ACS disabled.
> 
> Ah, ok, I didn't think it was the endpoint that had to implement ATS.
> But in that case, for our application, we need NVMe cards and RDMA NICs
> to all have ATS support and I expect that is just as unlikely. At least
> none of the endpoints on my system support it. Maybe only certain GPUs
> have this support.

I think there is confusion here; Alex properly explained the scheme:
a PCIe device does an ATS request to the IOMMU, which returns a valid
translation for a virtual address. The device can then use that address
directly without going through the IOMMU for translation.

ATS is implemented by the IOMMU, not by the device (well, the device
implements the client side of it). Also, ATS is meaningless without
something like PASID as far as I know.

Cheers,
Jérôme


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Logan Gunthorpe


On 08/05/18 02:43 PM, Alex Williamson wrote:
> Yes, GPUs seem to be leading the pack in implementing ATS.  So now the
> dumb question, why not simply turn off the IOMMU and thus ACS?  The
> argument of using the IOMMU for security is rather diminished if we're
> specifically enabling devices to poke one another directly and clearly
> this isn't favorable for device assignment either.  Are there target
> systems where this is not a simple kernel commandline option?  Thanks,

Well, turning off the IOMMU doesn't necessarily turn off ACS. We've run
into some BIOSes that set the bits on boot (which is annoying).

I also don't expect people will respond well to making the IOMMU and P2P
exclusive. The IOMMU is often used for more than just security and on
many platforms it's enabled by default. I'd much rather allow IOMMU use
but have fewer isolation groups in much the same way as if you had PCI
bridges that didn't support ACS.
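
For reference, a small sketch of clearing those redirect bits on a single
downstream port through the standard config accessors (illustrative only, not
the series' actual helper):

#include <linux/pci.h>

/* Clear the ACS redirect controls on one downstream port so peer TLPs route
 * directly instead of being bounced up to the root complex. */
static void clear_acs_redir(struct pci_dev *pdev)
{
	int pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
	u16 ctrl;

	if (!pos)
		return;

	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
	ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_EC);
	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
}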

Logan


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Alex Williamson
On Tue, 8 May 2018 14:19:05 -0600
Logan Gunthorpe  wrote:

> On 08/05/18 02:13 PM, Alex Williamson wrote:
> > Well, I'm a bit confused, this patch series is specifically disabling
> > ACS on switches, but per the spec downstream switch ports implementing
> > ACS MUST implement direct translated P2P.  So it seems the only
> > potential gap here is the endpoint, which must support ATS or else
> > there's nothing for direct translated P2P to do.  The switch port plays
> > no part in the actual translation of the request, ATS on the endpoint
> > has already cached the translation and is now attempting to use it.
> > For the switch port, this only becomes a routing decision, the request
> > is already translated, therefore ACS RR and EC can be ignored to
> > perform "normal" (direct) routing, as if ACS were not present.  It would
> > be a shame to go to all the trouble of creating this no-ACS mode to find
> > out the target hardware supports ATS and should have simply used it, or
> > we should have disabled the IOMMU altogether, which leaves ACS disabled.  
> 
> Ah, ok, I didn't think it was the endpoint that had to implement ATS.
> But in that case, for our application, we need NVMe cards and RDMA NICs
> to all have ATS support and I expect that is just as unlikely. At least
> none of the endpoints on my system support it. Maybe only certain GPUs
> have this support.

Yes, GPUs seem to be leading the pack in implementing ATS.  So now the
dumb question, why not simply turn off the IOMMU and thus ACS?  The
argument of using the IOMMU for security is rather diminished if we're
specifically enabling devices to poke one another directly and clearly
this isn't favorable for device assignment either.  Are there target
systems where this is not a simple kernel commandline option?  Thanks,

Alex


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Logan Gunthorpe


On 08/05/18 02:13 PM, Alex Williamson wrote:
> Well, I'm a bit confused, this patch series is specifically disabling
> ACS on switches, but per the spec downstream switch ports implementing
> ACS MUST implement direct translated P2P.  So it seems the only
> potential gap here is the endpoint, which must support ATS or else
> there's nothing for direct translated P2P to do.  The switch port plays
> no part in the actual translation of the request, ATS on the endpoint
> has already cached the translation and is now attempting to use it.
> For the switch port, this only becomes a routing decision, the request
> is already translated, therefore ACS RR and EC can be ignored to
> perform "normal" (direct) routing, as if ACS were not present.  It would
> be a shame to go to all the trouble of creating this no-ACS mode to find
> out the target hardware supports ATS and should have simply used it, or
> we should have disabled the IOMMU altogether, which leaves ACS disabled.

Ah, ok, I didn't think it was the endpoint that had to implement ATS.
But in that case, for our application, we need NVMe cards and RDMA NICs
to all have ATS support and I expect that is just as unlikely. At least
none of the endpoints on my system support it. Maybe only certain GPUs
have this support.

Logan
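For reference, whether an endpoint advertises ATS at all can be probed with the standard extended-capability lookup; a minimal sketch (not code from this series):

#include <linux/pci.h>

/* Sketch: does this endpoint expose an ATS extended capability? */
static bool pdev_advertises_ats(struct pci_dev *pdev)
{
        /* pci_find_ext_capability() returns the config space offset, or 0. */
        return pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ATS) != 0;
}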


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Alex Williamson
On Tue, 8 May 2018 13:45:50 -0600
Logan Gunthorpe  wrote:

> On 08/05/18 01:34 PM, Alex Williamson wrote:
> > They are not so unrelated, see the ACS Direct Translated P2P
> > capability, which in fact must be implemented by switch downstream
> > ports implementing ACS and works specifically with ATS.  This appears to
> > be the way the PCI SIG would intend for P2P to occur within an IOMMU
> > managed topology, routing pre-translated DMA directly between peer
> > devices while requiring non-translated requests to bounce through the
> > IOMMU.  Really, what's the value of having an I/O virtual address space
> > provided by an IOMMU if we're going to allow physical DMA between
> > downstream devices, couldn't we just turn off the IOMMU altogether?  Of
> > course ATS is not without holes itself, basically that we trust the
> > endpoint's implementation of ATS implicitly.  Thanks,  
> 
> I agree that this is what the SIG intends, but I don't think hardware
> fully supports this methodology yet. The Direct Translated capability
> just requires switches to forward packets that have the AT request type
> set. It does not require them to do the translation or to support ATS
> such that P2P requests can be translated by the IOMMU. I expect this is
> so that an downstream device can implement ATS and not get messed up by
> an upstream switch that doesn't support it.

Well, I'm a bit confused, this patch series is specifically disabling
ACS on switches, but per the spec downstream switch ports implementing
ACS MUST implement direct translated P2P.  So it seems the only
potential gap here is the endpoint, which must support ATS or else
there's nothing for direct translated P2P to do.  The switch port plays
no part in the actual translation of the request, ATS on the endpoint
has already cached the translation and is now attempting to use it.
For the switch port, this only becomes a routing decision, the request
is already translated, therefore ACS RR and EC can be ignored to
perform "normal" (direct) routing, as if ACS were not present.  It would
be a shame to go to all the trouble of creating this no-ACS mode to find
out the target hardware supports ATS and should have simply used it, or
we should have disabled the IOMMU altogether, which leaves ACS disabled.
Thanks,

Alex
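For reference, the Direct Translated P2P feature described above is an ordinary ACS capability/control bit; a minimal sketch of reading it on a downstream port, assuming only the standard PCI_ACS_* definitions (this is illustration, not part of the patch set):

#include <linux/pci.h>

/* Sketch: will this downstream port route already-translated TLPs directly? */
static bool port_directs_translated_p2p(struct pci_dev *port)
{
        int pos = pci_find_ext_capability(port, PCI_EXT_CAP_ID_ACS);
        u16 cap, ctrl;

        if (!pos)
                return false;

        pci_read_config_word(port, pos + PCI_ACS_CAP, &cap);
        pci_read_config_word(port, pos + PCI_ACS_CTRL, &ctrl);

        return (cap & PCI_ACS_DT) && (ctrl & PCI_ACS_DT);
}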


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Logan Gunthorpe


On 08/05/18 01:34 PM, Alex Williamson wrote:
> They are not so unrelated, see the ACS Direct Translated P2P
> capability, which in fact must be implemented by switch downstream
> ports implementing ACS and works specifically with ATS.  This appears to
> be the way the PCI SIG would intend for P2P to occur within an IOMMU
> managed topology, routing pre-translated DMA directly between peer
> devices while requiring non-translated requests to bounce through the
> IOMMU.  Really, what's the value of having an I/O virtual address space
> provided by an IOMMU if we're going to allow physical DMA between
> downstream devices, couldn't we just turn off the IOMMU altogether?  Of
> course ATS is not without holes itself, basically that we trust the
> endpoint's implementation of ATS implicitly.  Thanks,

I agree that this is what the SIG intends, but I don't think hardware
fully supports this methodology yet. The Direct Translated capability
just requires switches to forward packets that have the AT request type
set. It does not require them to do the translation or to support ATS
such that P2P requests can be translated by the IOMMU. I expect this is
so that an downstream device can implement ATS and not get messed up by
an upstream switch that doesn't support it.

Logan


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Alex Williamson
On Tue, 8 May 2018 13:13:40 -0600
Logan Gunthorpe  wrote:

> On 08/05/18 10:50 AM, Christian König wrote:
> > E.g. transactions are initially send to the root complex for 
> > translation, that's for sure. But at least for AMD GPUs the root complex 
> > answers with the translated address which is then cached in the device.
> > 
> > So further transactions for the same address range then go directly to 
> > the destination.  
> 
> Sounds like you are referring to Address Translation Services (ATS).
> This is quite separate from ACS and, to my knowledge, isn't widely
> supported by switch hardware.

They are not so unrelated, see the ACS Direct Translated P2P
capability, which in fact must be implemented by switch downstream
ports implementing ACS and works specifically with ATS.  This appears to
be the way the PCI SIG would intend for P2P to occur within an IOMMU
managed topology, routing pre-translated DMA directly between peer
devices while requiring non-translated requests to bounce through the
IOMMU.  Really, what's the value of having an I/O virtual address space
provided by an IOMMU if we're going to allow physical DMA between
downstream devices, couldn't we just turn off the IOMMU altogether?  Of
course ATS is not without holes itself, basically that we trust the
endpoint's implementation of ATS implicitly.  Thanks,

Alex


Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-08 Thread Logan Gunthorpe


On 08/05/18 10:57 AM, Alex Williamson wrote:
> AIUI from previously questioning this, the change is hidden behind a
> build-time config option and only custom kernels or distros optimized
> for this sort of support would enable that build option.  I'm more than
> a little dubious though that we're not going to have a wave of distros
> enabling this only to get user complaints that they can no longer make
> effective use of their devices for assignment due to the resulting span
> of the IOMMU groups, nor is there any sort of compromise, configure
> the kernel for p2p or device assignment, not both.  Is this really such
> a unique feature that distro users aren't going to be asking for both
> features?  Thanks,

I think it is. But it sounds like the majority want this to be a command
line option. So we will look at doing that for v5.

Logan


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Logan Gunthorpe


On 08/05/18 10:50 AM, Christian König wrote:
> E.g. transactions are initially send to the root complex for 
> translation, that's for sure. But at least for AMD GPUs the root complex 
> answers with the translated address which is then cached in the device.
> 
> So further transactions for the same address range then go directly to 
> the destination.

Sounds like you are referring to Address Translation Services (ATS).
This is quite separate from ACS and, to my knowledge, isn't widely
supported by switch hardware.

Logan


Re: [PATCH 0/5] fix radix tree multi-order iteration race

2018-05-08 Thread Ross Zwisler
On Thu, May 03, 2018 at 01:24:25PM -0600, Ross Zwisler wrote:
> The following series gets the radix tree test suite compiling again in
> the current linux/master, adds a unit test which exposes a race in the
> radix tree multi-order iteration code, and then fixes that race.
> 
> This race was initially hit on a v4.15 based kernel and results in a GP
> fault.  I've described the race in detail in patches 4 and 5.
> 
> The fix is simple and necessary, and I think it should be merged for
> v4.17.
> 
> This tree has gotten positive build confirmation from the 0-day bot,
> passes the updated radix tree test suite, xfstests, and the original
> test that was hitting the race with the v4.15 based kernel.
> 
> Ross Zwisler (5):
>   radix tree test suite: fix mapshift build target
>   radix tree test suite: fix compilation issue
>   radix tree test suite: add item_delete_rcu()
>   radix tree test suite: multi-order iteration race
>   radix tree: fix multi-order iteration race
> 
>  lib/radix-tree.c  |  6 ++--
>  tools/include/linux/spinlock.h|  3 +-
>  tools/testing/radix-tree/Makefile |  6 ++--
>  tools/testing/radix-tree/multiorder.c | 63 
> +++
>  tools/testing/radix-tree/test.c   | 19 +++
>  tools/testing/radix-tree/test.h   |  3 ++
>  6 files changed, 91 insertions(+), 9 deletions(-)
> 
> -- 
> 2.14.3
> 

Ping on this series.  Does anyone have any feedback?  Matthew?

I'd really like to get it into v4.17.

We can take it through the libnvdimm tree, if that's easiest.
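For readers not familiar with the code in question, the race concerns the standard multi-order slot iteration pattern; a minimal sketch of that pattern, for illustration only (the real test lives in tools/testing/radix-tree/multiorder.c and the fix in lib/radix-tree.c):

#include <linux/radix-tree.h>
#include <linux/rcupdate.h>

/* Sketch: walk every slot, retrying a slot when a concurrent change is seen. */
static void *scan_tree(struct radix_tree_root *root)
{
        struct radix_tree_iter iter;
        void **slot;
        void *last = NULL;

        rcu_read_lock();
        radix_tree_for_each_slot(slot, root, &iter, 0) {
                void *entry = radix_tree_deref_slot(slot);

                if (radix_tree_deref_retry(entry)) {
                        slot = radix_tree_iter_retry(&iter);
                        continue;
                }
                last = entry;
        }
        rcu_read_unlock();

        return last;
}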


Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-08 Thread Alex Williamson
On Mon, 7 May 2018 18:23:46 -0500
Bjorn Helgaas  wrote:

> On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
> > Hi Everyone,
> > 
> > Here's v4 of our series to introduce P2P based copy offload to NVMe
> > fabrics. This version has been rebased onto v4.17-rc2. A git repo
> > is here:
> > 
> > https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> > ...  
> 
> > Logan Gunthorpe (14):
> >   PCI/P2PDMA: Support peer-to-peer memory
> >   PCI/P2PDMA: Add sysfs group to display p2pmem stats
> >   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
> >   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> >   docs-rst: Add a new directory for PCI documentation
> >   PCI/P2PDMA: Add P2P DMA driver writer's documentation
> >   block: Introduce PCI P2P flags for request and request queue
> >   IB/core: Ensure we map P2P memory correctly in
> > rdma_rw_ctx_[init|destroy]()
> >   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
> >   nvme-pci: Add support for P2P memory in requests
> >   nvme-pci: Add a quirk for a pseudo CMB
> >   nvmet: Introduce helper functions to allocate and free request SGLs
> >   nvmet-rdma: Use new SGL alloc/free helper for requests
> >   nvmet: Optionally use PCI P2P memory
> > 
> >  Documentation/ABI/testing/sysfs-bus-pci|  25 +
> >  Documentation/PCI/index.rst|  14 +
> >  Documentation/driver-api/index.rst |   2 +-
> >  Documentation/driver-api/pci/index.rst |  20 +
> >  Documentation/driver-api/pci/p2pdma.rst| 166 ++
> >  Documentation/driver-api/{ => pci}/pci.rst |   0
> >  Documentation/index.rst|   3 +-
> >  block/blk-core.c   |   3 +
> >  drivers/infiniband/core/rw.c   |  13 +-
> >  drivers/nvme/host/core.c   |   4 +
> >  drivers/nvme/host/nvme.h   |   8 +
> >  drivers/nvme/host/pci.c| 118 +++--
> >  drivers/nvme/target/configfs.c |  67 +++
> >  drivers/nvme/target/core.c | 143 -
> >  drivers/nvme/target/io-cmd.c   |   3 +
> >  drivers/nvme/target/nvmet.h|  15 +
> >  drivers/nvme/target/rdma.c |  22 +-
> >  drivers/pci/Kconfig|  26 +
> >  drivers/pci/Makefile   |   1 +
> >  drivers/pci/p2pdma.c   | 814 
> > +
> >  drivers/pci/pci.c  |   6 +
> >  include/linux/blk_types.h  |  18 +-
> >  include/linux/blkdev.h |   3 +
> >  include/linux/memremap.h   |  19 +
> >  include/linux/pci-p2pdma.h | 118 +
> >  include/linux/pci.h|   4 +
> >  26 files changed, 1579 insertions(+), 56 deletions(-)
> >  create mode 100644 Documentation/PCI/index.rst
> >  create mode 100644 Documentation/driver-api/pci/index.rst
> >  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> >  rename Documentation/driver-api/{ => pci}/pci.rst (100%)
> >  create mode 100644 drivers/pci/p2pdma.c
> >  create mode 100644 include/linux/pci-p2pdma.h  
> 
> How do you envison merging this?  There's a big chunk in drivers/pci, but
> really no opportunity for conflicts there, and there's significant stuff in
> block and nvme that I don't really want to merge.
> 
> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> merge it elsewhere?

AIUI from previously questioning this, the change is hidden behind a
build-time config option and only custom kernels or distros optimized
for this sort of support would enable that build option.  I'm more than
a little dubious though that we're not going to have a wave of distros
enabling this only to get user complaints that they can no longer make
effective use of their devices for assignment due to the resulting span
of the IOMMU groups, nor is there any sort of compromise, configure
the kernel for p2p or device assignment, not both.  Is this really such
a unique feature that distro users aren't going to be asking for both
features?  Thanks,

Alex


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Christian König

On 08.05.2018 at 18:27, Logan Gunthorpe wrote:


On 08/05/18 01:17 AM, Christian König wrote:

AMD APUs mandatory need the ACS flag set for the GPU integrated in the
CPU when IOMMU is enabled or otherwise you will break SVM.

Well, given that the current set only disables ACS bits on bridges
(previous versions were only on switches) this shouldn't be an issue for
integrated devices. We do not disable ACS flags globally.


Ok, that is at least a step in the right direction. But I think we 
seriously need to test that for side effects.





And what exactly is the problem here? I'm currently testing P2P with
GPUs in different IOMMU domains and at least with AMD IOMMUs that works
perfectly fine.

In addition to Stephen's comments, seeing we've established a general
need to avoid the root complex (until we have a whitelist at least) we
must have ACS disabled along the path between the devices. Otherwise,
all TLPs will go through the root complex and if there is no support it
will fail.


Well I'm not an expert on this, but if I'm not completely mistaken that 
is not correct.


E.g. transactions are initially send to the root complex for 
translation, that's for sure. But at least for AMD GPUs the root complex 
answers with the translated address which is then cached in the device.


So further transactions for the same address range then go directly to 
the destination.


What you don't want is device isolation, cause in this case the root 
complex handles the transaction themselves. IIRC there where also 
something like "force_isolation" and "nobypass" parameters for the IOMMU 
to control that behavior.


It's already late here, but going to dig up the documentation for that 
tomorrow and/or contact a hardware engineer involved in the ACS spec.


Regards,
Christian.



If the consensus is we want a command line option, then so be it. But
we'll have to deny pretty much all P2P transactions unless the user
correctly disables ACS along the path using the command line option and
this is really annoying for users of this functionality to understand
how to do that correctly.

Logan




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Christian König

On 08.05.2018 at 16:25, Stephen Bates wrote:
 
Hi Christian



AMD APUs mandatory need the ACS flag set for the GPU integrated in the
CPU when IOMMU is enabled or otherwise you will break SVM.

OK but in this case aren't you losing (many of) the benefits of P2P since all 
DMAs will now get routed up to the IOMMU before being passed down to the 
destination PCIe EP?


Well I'm not an expert on this, but I think that is an incorrect 
assumption you guys use here.


At least in the default configuration, even with the IOMMU enabled, P2P 
transactions do NOT necessarily travel up to the root complex for 
translation.


It's already late here, but if nobody beats me I'm going to dig up the 
necessary documentation tomorrow.


Regards,
Christian.




Similar problems arise when you do this for dedicated GPU, but we
haven't upstreamed the support for this yet.

Hmm, as above. With ACS enabled on all downstream ports any P2P enabled DMA 
will be routed to the IOMMU which removes a lot of the benefit.
 

So that is a clear NAK from my side for the approach.

Do you have an alternative? This is the approach we arrived it after a 
reasonably lengthy discussion on the mailing lists. Alex, are you still 
comfortable with this approach?
 

And what exactly is the problem here?
  
We had a pretty lengthy discussion on this topic on one of the previous revisions. The issue is that currently there is no mechanism in the IOMMU code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change its topology (due to PCI hotplug) we had to be cognizant of the fact that ACS settings could change. Since there is no way to currently handle changing ACS settings and hence IOMMU groupings the consensus was to simply disable ACS on all ports in a p2pdma domain. This effectively makes all the devices in the p2pdma domain part of the same IOMMU grouping. The plan will be to address this in time and add a mechanism for IOMMU grouping changes and notification to VMs but that's not part of this series. Note you are still allowed to have ACS functioning on other PCI domains so if you do not a plurality of IOMMU groupings you can still achieve it (but you can't do p2pdma across IOMMU groupings, which is safe).



I'm currently testing P2P with  GPUs in different IOMMU domains and at least 
with AMD IOMMUs that works perfectly fine.

Yup that should work though again I have to ask are you disabling ACS on the 
ports between the two peer devices to get the p2p benefit? If not you are not 
getting all the performance benefit (due to IOMMU routing), if you are then 
there are obviously security implications between those IOMMU domains if they 
are assigned to different VMs. And now the issue is if new devices are added 
and the p2p topology needed to change there would be no way to inform the VMs 
of any IOMMU group change.

Cheers

Stephen
 





Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Logan Gunthorpe


On 08/05/18 01:17 AM, Christian König wrote:
> AMD APUs mandatory need the ACS flag set for the GPU integrated in the 
> CPU when IOMMU is enabled or otherwise you will break SVM.

Well, given that the current set only disables ACS bits on bridges
(previous versions were only on switches) this shouldn't be an issue for
integrated devices. We do not disable ACS flags globally.

> And what exactly is the problem here? I'm currently testing P2P with 
> GPUs in different IOMMU domains and at least with AMD IOMMUs that works 
> perfectly fine.

In addition to Stephen's comments, seeing we've established a general
need to avoid the root complex (until we have a whitelist at least) we
must have ACS disabled along the path between the devices. Otherwise,
all TLPs will go through the root complex and if there is no support it
will fail.

If the consensus is we want a command line option, then so be it. But
we'll have to deny pretty much all P2P transactions unless the user
correctly disables ACS along the path using the command line option and
this is really annoying for users of this functionality to understand
how to do that correctly.

Logan
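For readers following along, the ACS clearing being debated boils down to something like the following simplified sketch (the real code is in patch 04/14 of this series; this is only an approximation):

#include <linux/pci.h>

/*
 * Simplified sketch: stop this port redirecting P2P requests/completions
 * up to the root complex, so peer-to-peer TLPs can be routed directly.
 */
static int sketch_disable_acs_redirect(struct pci_dev *pdev)
{
        int pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
        u16 ctrl;

        if (!pos)
                return 0;

        pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
        ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
        pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);

        return 1;
}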


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Stephen Bates
Hi Dan

>It seems unwieldy that this is a compile time option and not a runtime
>option. Can't we have a kernel command line option to opt-in to this
>behavior rather than require a wholly separate kernel image?
  
I think because of the security implications associated with p2pdma and ACS we 
wanted to make it very clear people were choosing one (p2pdma) or the other 
(IOMMU groupings and isolation). However personally I would prefer including 
the option of a run-time kernel parameter too. In fact a few months ago I 
proposed a small patch that did just that [1]. It never really went anywhere 
but if people were open to the idea we could look at adding it to the series.
  
> Why is this text added in a follow on patch and not the patch that
>  introduced the config option?

Because the ACS section was added later in the series and this information is 
associated with that additional functionality.

> I'm also wondering if that command line option can take a 'bus device
> function' address of a switch to limit the scope of where ACS is
> disabled.

By this you mean the address for either a RP, DSP, USP or MF EP below which we 
disable ACS? We could do that but I don't think it avoids the issue of changes 
in IOMMU groupings as devices are added/removed. It simply changes the problem 
from affecting and entire PCI domain to a sub-set of the domain. We can already 
handle this by doing p2pdma on one RP and normal IOMMU isolation on the other 
RPs in the system.

Stephen

[1] https://marc.info/?l=linux-doc&m=150907188310838&w=2
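If a runtime switch along those lines were added, it could be as small as the following sketch; note that the parameter name, format and helpers below are hypothetical, not taken from [1] or from this series:

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/pci.h>
#include <linux/string.h>

/* Hypothetical boot parameter, e.g. pci_p2pdma_acs_off=0000:03:00.0,
 * naming the port below which ACS redirection would be left disabled. */
static char p2pdma_acs_off_bdf[32];

static int __init p2pdma_acs_off_setup(char *str)
{
        strscpy(p2pdma_acs_off_bdf, str, sizeof(p2pdma_acs_off_bdf));
        return 1;
}
__setup("pci_p2pdma_acs_off=", p2pdma_acs_off_setup);

/* Hypothetical: does @pdev match the BDF given on the command line? */
static bool pdev_matches_acs_off(struct pci_dev *pdev)
{
        unsigned int domain, bus, dev, fn;

        if (sscanf(p2pdma_acs_off_bdf, "%x:%x:%x.%x",
                   &domain, &bus, &dev, &fn) != 4)
                return false;

        return pci_domain_nr(pdev->bus) == domain &&
               pdev->bus->number == bus &&
               pdev->devfn == PCI_DEVFN(dev, fn);
}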




Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Dan Williams
On Mon, Apr 23, 2018 at 4:30 PM, Logan Gunthorpe  wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the groups are
> assigned.
>
> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any PCIe switch heirarchy will be in the same IOMMU
> group. Which implies that individual devices behind any switch
> heirarchy will not be able to be assigned to separate VMs because
> there is no isolation between them. Additionally, any malicious PCIe
> devices will be able to DMA to memory exposed by other EPs in the same
> domain as TLPs will not be checked by the IOMMU.
>
> Given that the intended use case of P2P Memory is for users with
> custom hardware designed for purpose, we do not expect distributors
> to ever need to enable this option. Users that want to use P2P
> must have compiled a custom kernel with this configuration option
> and understand the implications regarding ACS. They will either
> not require ACS or will have design the system in such a way that
> devices that require isolation will be separate from those using P2P
> transactions.

>
> Signed-off-by: Logan Gunthorpe 
> ---
>  drivers/pci/Kconfig|  9 +
>  drivers/pci/p2pdma.c   | 45 ++---
>  drivers/pci/pci.c  |  6 ++
>  include/linux/pci-p2pdma.h |  5 +
>  4 files changed, 50 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>   transations must be between devices behind the same root port.
>   (Typically behind a network of PCIe switches).
>
> + Enabling this option will also disable ACS on all ports behind
> + any PCIe switch. This effectively puts all devices behind any
> + switch heirarchy into the same IOMMU group. Which implies that
> + individual devices behind any switch will not be able to be
> + assigned to separate VMs because there is no isolation between
> + them. Additionally, any malicious PCIe devices will be able to
> + DMA to memory exposed by other EPs in the same domain as TLPs
> + will not be checked by the IOMMU.
> +
>   If unsure, say N.

It seems unwieldy that this is a compile time option and not a runtime
option. Can't we have a kernel command line option to opt-in to this
behavior rather than require a wholly separate kernel image?

Why is this text added in a follow on patch and not the patch that
introduced the config option?

I'm also wondering if that command line option can take a 'bus device
function' address of a switch to limit the scope of where ACS is
disabled.


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Stephen Bates

Hi Christian

> AMD APUs mandatory need the ACS flag set for the GPU integrated in the 
> CPU when IOMMU is enabled or otherwise you will break SVM.

OK but in this case aren't you losing (many of) the benefits of P2P since all 
DMAs will now get routed up to the IOMMU before being passed down to the 
destination PCIe EP?

> Similar problems arise when you do this for dedicated GPU, but we 
> haven't upstreamed the support for this yet.

Hmm, as above. With ACS enabled on all downstream ports any P2P enabled DMA 
will be routed to the IOMMU which removes a lot of the benefit. 

> So that is a clear NAK from my side for the approach.

Do you have an alternative? This is the approach we arrived it after a 
reasonably lengthy discussion on the mailing lists. Alex, are you still 
comfortable with this approach?

> And what exactly is the problem here?
 
We had a pretty lengthy discussion on this topic on one of the previous 
revisions. The issue is that currently there is no mechanism in the IOMMU code 
to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change 
its topology (due to PCI hotplug) we had to be cognizant of the fact that ACS 
settings could change. Since there is no way to currently handle changing ACS 
settings and hence IOMMU groupings the consensus was to simply disable ACS on 
all ports in a p2pdma domain. This effectively makes all the devices in the 
p2pdma domain part of the same IOMMU grouping. The plan will be to address this 
in time and add a mechanism for IOMMU grouping changes and notification to VMs 
but that's not part of this series. Note you are still allowed to have ACS 
functioning on other PCI domains so if you do need a plurality of IOMMU 
groupings you can still achieve it (but you can't do p2pdma across IOMMU 
groupings, which is safe).

> I'm currently testing P2P with  GPUs in different IOMMU domains and at least 
> with AMD IOMMUs that works perfectly fine.

Yup that should work though again I have to ask are you disabling ACS on the 
ports between the two peer devices to get the p2p benefit? If not you are not 
getting all the performance benefit (due to IOMMU routing), if you are then 
there are obviously security implications between those IOMMU domains if they 
are assigned to different VMs. And now the issue is if new devices are added 
and the p2p topology needed to change there would be no way to inform the VMs 
of any IOMMU group change. 

Cheers

Stephen




Re: [PATCH v4 1/3] resource: Use list_head to link sibling resource

2018-05-08 Thread Baoquan He
On 05/07/18 at 11:50pm, kbuild test robot wrote:
> Hi Baoquan,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.17-rc4 next-20180504]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Baoquan-He/resource-Use-list_head-to-link-sibling-resource/20180507-144345
> config: arm-allmodconfig (attached as .config)
> compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=arm 
> 
> All errors (new ones prefixed by >>):

Thanks, below patch can fix it:


diff --git a/arch/arm/plat-samsung/pm-check.c b/arch/arm/plat-samsung/pm-check.c
index cd2c02c68bc3..5494355b1c49 100644
--- a/arch/arm/plat-samsung/pm-check.c
+++ b/arch/arm/plat-samsung/pm-check.c
@@ -46,8 +46,8 @@ typedef u32 *(run_fn_t)(struct resource *ptr, u32 *arg);
 static void s3c_pm_run_res(struct resource *ptr, run_fn_t fn, u32 *arg)
 {
 	while (ptr != NULL) {
-		if (ptr->child != NULL)
-			s3c_pm_run_res(ptr->child, fn, arg);
+		if (!list_empty(&ptr->child))
+			s3c_pm_run_res(resource_first_child(&ptr->child), fn, arg);
 
 		if ((ptr->flags & IORESOURCE_SYSTEM_RAM)
 				== IORESOURCE_SYSTEM_RAM) {
@@ -57,7 +57,7 @@ static void s3c_pm_run_res(struct resource *ptr, run_fn_t fn, u32 *arg)
 			arg = (fn)(ptr, arg);
 		}
 
-		ptr = ptr->sibling;
+		ptr = resource_sibling(ptr);
 	}
 }
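(The resource_sibling() and resource_first_child() helpers used above are introduced by the list_head conversion in this series; their exact definitions live there, but under that layout they would look roughly like the following sketch.)

static inline struct resource *resource_first_child(struct list_head *head)
{
	return list_first_entry_or_null(head, struct resource, sibling);
}

static inline struct resource *resource_sibling(struct resource *res)
{
	if (!res->parent || list_is_last(&res->sibling, &res->parent->child))
		return NULL;
	return list_next_entry(res, sibling);
}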
 

> 
>arch/arm/plat-samsung/pm-check.c: In function 's3c_pm_run_res':
> >> arch/arm/plat-samsung/pm-check.c:49:18: error: invalid operands to binary 
> >> != (have 'struct list_head' and 'void *')
>   if (ptr->child != NULL)
>   ~~ ^~
> >> arch/arm/plat-samsung/pm-check.c:50:19: error: incompatible type for 
> >> argument 1 of 's3c_pm_run_res'
>s3c_pm_run_res(ptr->child, fn, arg);
>   ^~~
>arch/arm/plat-samsung/pm-check.c:46:13: note: expected 'struct resource *' 
> but argument is of type 'struct list_head'
> static void s3c_pm_run_res(struct resource *ptr, run_fn_t fn, u32 *arg)
> ^~
> >> arch/arm/plat-samsung/pm-check.c:60:7: error: incompatible types when 
> >> assigning to type 'struct resource *' from type 'struct list_head'
>   ptr = ptr->sibling;
>   ^
> 
> vim +49 arch/arm/plat-samsung/pm-check.c
> 
> 549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  45  
> 549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  46  static 
> void s3c_pm_run_res(struct resource *ptr, run_fn_t fn, u32 *arg)
> 549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  47  {
> 549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  48   while 
> (ptr != NULL) {
> 549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12 @49   
> if (ptr->child != NULL)
> 549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12 @50   
> s3c_pm_run_res(ptr->child, fn, arg);
> 549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  51  
> 05fee7cf arch/arm/plat-samsung/pm-check.c Toshi Kani 2016-01-26  52   
> if ((ptr->flags & IORESOURCE_SYSTEM_RAM)
> 05fee7cf arch/arm/plat-samsung/pm-check.c Toshi Kani 2016-01-26  53   
> == IORESOURCE_SYSTEM_RAM) {
> 549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  54   
> S3C_PMDBG("Found system RAM at %08lx..%08lx\n",
> 840eeeb8 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  55   
>   (unsigned long)ptr->start,
> 840eeeb8 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  56   
>   (unsigned long)ptr->end);
> 549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  57   
> arg = (fn)(ptr, arg);
> 549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  58   
> }
> 549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  59  
> 549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12 @60   
> ptr = ptr->sibling;
> 549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  61   }
> 549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  62  }
> 549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  63  
> 
> :: The code at line 49 was first introduced by commit
> :: 549c7e33aeb9bfe441ecf68639d2227bb90978e7 [ARM] S3C: Split the resume 
> memory check code from pm.c
> 
> :: TO: Ben Dooks 
> :: CC: Ben Dooks 
> 
> ---
> 0-DAY kernel 

Re: [PATCH v3 1/3] resource: Use list_head to link sibling resource

2018-05-08 Thread Baoquan He
On 05/08/18 at 08:48pm, Wei Yang wrote:
> On Mon, May 07, 2018 at 09:14:29AM +0800, Baoquan He wrote:
> >Hi Wei Yang,
> >
> >On 04/26/18 at 09:18am, Wei Yang wrote:
> >> On Thu, Apr 19, 2018 at 08:18:46AM +0800, Baoquan He wrote:
> >> >The struct resource uses singly linked list to link siblings. It's not
> >> >easy to do reverse iteration on sibling list. So replace it with 
> >> >list_head.
> >> >
> >> 
> >> Hi, Baoquan
> >> 
> >> Besides changing the data structure, I have another proposal to do the 
> >> reverse
> >> iteration. Which means it would not affect other users, if you just want a
> >> reverse iteration.
> >> 
> >> BTW, I don't think Andrew suggest to use linked-list directly. What he 
> >> wants
> >> is a better solution to your first proposal in
> >> https://patchwork.kernel.org/patch/10300819/.
> >> 
> >> Below is my proposal of resource reverse iteration without changing current
> >> design.
> >
> >I got your mail and read it, then interrupted by other thing and forgot
> >replying, sorry.
> >
> >I am fine with your code change. As I said before, I have tried to change
> >code per reviewers' comment, then let reviewers decide which way is
> >better. Please feel free to post formal patches and joining discussion
> >about this issue.
> 
> Yep, while I don't have a real requirement to add the reverse version, so what
> is the proper way to send a patch?
> 
> A patch reply to this thread is ok?

I am not sure either. Since my patches are still under reviewing. And
you have pasted your patch. It depends on maintainers, mainly Andrew and
other reviewers who have concerns.
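For what it's worth, once siblings are linked with list_head, a reverse walk becomes a one-liner with the standard list macros. A minimal sketch, assuming struct resource carries list_head sibling/child members as in this series:

#include <linux/ioport.h>
#include <linux/list.h>

/* Sketch: visit the children of @parent from last to first. */
static void walk_children_reverse(struct resource *parent,
				  void (*fn)(struct resource *))
{
	struct resource *child;

	list_for_each_entry_reverse(child, &parent->child, sibling)
		fn(child);
}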



Re: [PATCH v3 1/3] resource: Use list_head to link sibling resource

2018-05-08 Thread Wei Yang
On Mon, May 07, 2018 at 09:14:29AM +0800, Baoquan He wrote:
>Hi Wei Yang,
>
>On 04/26/18 at 09:18am, Wei Yang wrote:
>> On Thu, Apr 19, 2018 at 08:18:46AM +0800, Baoquan He wrote:
>> >The struct resource uses singly linked list to link siblings. It's not
>> >easy to do reverse iteration on sibling list. So replace it with list_head.
>> >
>> 
>> Hi, Baoquan
>> 
>> Besides changing the data structure, I have another proposal to do the 
>> reverse
>> iteration. Which means it would not affect other users, if you just want a
>> reverse iteration.
>> 
>> BTW, I don't think Andrew suggest to use linked-list directly. What he wants
>> is a better solution to your first proposal in
>> https://patchwork.kernel.org/patch/10300819/.
>> 
>> Below is my proposal of resource reverse iteration without changing current
>> design.
>
>I got your mail and read it, then interrupted by other thing and forgot
>replying, sorry.
>
>I am fine with your code change. As I said before, I have tried to change
>code per reviewers' comment, then let reviewers decide which way is
>better. Please feel free to post formal patches and joining discussion
>about this issue.

Yep, while I don't have a real requirement to add the reverse version, so what
is the proper way to send a patch?

A patch reply to this thread is ok?

>
>Thanks
>Baoquan
>


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Christian König

Hi Bjorn,

On 08.05.2018 at 01:13, Bjorn Helgaas wrote:

[+to Alex]

Alex,

Are you happy with this strategy of turning off ACS based on
CONFIG_PCI_P2PDMA?  We only check this at enumeration-time and
I don't know if there are other places we would care?


thanks for pointing this out, I totally missed this hack.

AMD APUs mandatory need the ACS flag set for the GPU integrated in the 
CPU when IOMMU is enabled or otherwise you will break SVM.


Similar problems arise when you do this for dedicated GPU, but we 
haven't upstreamed the support for this yet.


So that is a clear NAK from my side for the approach.

And what exactly is the problem here? I'm currently testing P2P with 
GPUs in different IOMMU domains and at least with AMD IOMMUs that works 
perfectly fine.


Regards,
Christian.



On Mon, Apr 23, 2018 at 05:30:36PM -0600, Logan Gunthorpe wrote:

For peer-to-peer transactions to work the downstream ports in each
switch must not have the ACS flags set. At this time there is no way
to dynamically change the flags and update the corresponding IOMMU
groups so this is done at enumeration time before the groups are
assigned.

This effectively means that if CONFIG_PCI_P2PDMA is selected then
all devices behind any PCIe switch heirarchy will be in the same IOMMU
group. Which implies that individual devices behind any switch
heirarchy will not be able to be assigned to separate VMs because
there is no isolation between them. Additionally, any malicious PCIe
devices will be able to DMA to memory exposed by other EPs in the same
domain as TLPs will not be checked by the IOMMU.

Given that the intended use case of P2P Memory is for users with
custom hardware designed for purpose, we do not expect distributors
to ever need to enable this option. Users that want to use P2P
must have compiled a custom kernel with this configuration option
and understand the implications regarding ACS. They will either
not require ACS or will have design the system in such a way that
devices that require isolation will be separate from those using P2P
transactions.

Signed-off-by: Logan Gunthorpe 
---
  drivers/pci/Kconfig|  9 +
  drivers/pci/p2pdma.c   | 45 ++---
  drivers/pci/pci.c  |  6 ++
  include/linux/pci-p2pdma.h |  5 +
  4 files changed, 50 insertions(+), 15 deletions(-)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index b2396c22b53e..b6db41d4b708 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -139,6 +139,15 @@ config PCI_P2PDMA
  transations must be between devices behind the same root port.
  (Typically behind a network of PCIe switches).
  
+	  Enabling this option will also disable ACS on all ports behind

+ any PCIe switch. This effectively puts all devices behind any
+ switch heirarchy into the same IOMMU group. Which implies that

s/heirarchy/hierarchy/ (also above in changelog)


+ individual devices behind any switch will not be able to be
+ assigned to separate VMs because there is no isolation between
+ them. Additionally, any malicious PCIe devices will be able to
+ DMA to memory exposed by other EPs in the same domain as TLPs
+ will not be checked by the IOMMU.
+
  If unsure, say N.
  
  config PCI_LABEL

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index ed9dce8552a2..e9f43b43acac 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct device 
*dev)
  }
  
  /*

- * If a device is behind a switch, we try to find the upstream bridge
- * port of the switch. This requires two calls to pci_upstream_bridge():
- * one for the upstream port on the switch, one on the upstream port
- * for the next level in the hierarchy. Because of this, devices connected
- * to the root port will be rejected.
+ * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
+ * @pdev: device to disable ACS flags for
+ *
+ * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
+ * to be disabled on any PCI bridge in order for the TLPS to not be forwarded
+ * up to the RC which is not what we want for P2P.

s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)


+ *
+ * This function is called when the devices are first enumerated and
+ * will result in all devices behind any bridge to be in the same IOMMU
+ * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
+ * on this largish hammer. If you need the devices to be in separate groups
+ * don't enable CONFIG_PCI_P2PDMA.
+ *
+ * Returns 1 if the ACS bits for this device was cleared, otherwise 0.
   */
-static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
+int pci_p2pdma_disable_acs(struct pci_dev *pdev)
  {
-   struct pci_dev *up1, *up2;
+   int pos;
+   u16 ctrl;
  
-	if (!pdev)

-   return NULL;
+