Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
> Yes i need to document that some more in hmm.txt...

Hi Jerome, thanks for the explanation. Can I suggest you update hmm.txt with what you sent out?

> I am about to send RFC for nouveau, i am still working out some bugs.

Great. I will keep an eye out for it. An example user of hmm will be very helpful.

> i will fix the MAINTAINERS as part of those.

Awesome, thanks.

Stephen
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 02/03/18 02:44 PM, Benjamin Herrenschmidt wrote:
> Allright, so, I think I have a plan to fix this, but it will take a little bit of time.
>
> Basically the idea is to have firmware pass to Linux a region that's known to not have anything in it that it can use for the vmalloc space rather than have linux arbitrarily cut the address space in half.
>
> I'm pretty sure I can always find large enough "holes" in the physical address space that are outside of both RAM/OpenCAPI/Nvlink and PCIe/MMIO space. If anything, unused chip IDs. But I don't want Linux to have to know about the intimate HW details so I'll pass it from FW. It will take some time to adjust Linux and get updated FW around though.
>
> Once that's done, I'll be able to have the linear mapping go through the entire 52-bit space (minus that hole). Of course the hole need to be large enough to hold a vmemmap for a 52-bit space, so that's about 4TB. So I probably need a hole that's at least 8TB.
>
> As for the mapping attributes, it should be easy for my linear mapping code to ensure anything that isn't actual RAM is mapped NC.

Very cool. I'm glad to hear you found a way to fix this.

Thanks,

Logan
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Fri, Mar 02, 2018 at 09:38:43PM +, Stephen Bates wrote:
> > It seems people miss-understand HMM :(
>
> Hi Jerome
>
> Your unhappy face emoticon made me sad so I went off to (re)read up on HMM. Along the way I came up with a couple of things.
>
> While hmm.txt is really nice to read it makes no mention of DEVICE_PRIVATE and DEVICE_PUBLIC. It also gives no indication when one might choose to use one over the other. Would it be possible to update hmm.txt to include some discussion on this? I understand that DEVICE_PUBLIC creates a mapping in the kernel's linear address space for the device memory and DEVICE_PRIVATE does not. However, like I said, I am not sure when you would use either one and the pros and cons of doing so. I actually ended up finding some useful information in memremap.h but I don't think it is fair to expect people to dig *that* deep to find this information ;-).

Yes, I need to document that some more in hmm.txt. PRIVATE is for devices that have memory which does not fit regular memory expectations, i.e. cachable, so PCIe device memory fits under that category. If all you need is struct page for such memory then this is a perfect fit. On top of that you can use more HMM features, like using this memory transparently inside a process address space.

PUBLIC is for memory that belongs to a device but can still be accessed by the CPU in a cache-coherent way (CAPI, CCIX, ...). Again, if you have such memory and just want struct page you can use that, and again if you want to use that memory inside a process address space HMM provides more helpers to do so.

> A quick grep shows no drivers using the HMM API in the upstream code today. Is this correct? Are there any examples of out-of-tree drivers that use HMM you can point me to? As a driver developer what resources exist to help me write an HMM-aware driver?

I am about to send an RFC for nouveau; I am still working out some bugs. I was hoping to be done today but I am still fighting with the hardware. There are other drivers being worked on with HMM. I do not know exactly when they will be made public (I expect in the coming months).

How you use HMM is under the control of the device driver, as is how you expose it to userspace; drivers use it how they want to, and there is no pattern or requirement imposed by HMM. All drivers being worked on so far are GPU-like hardware, i.e. a big chunk of on-board memory (several gigabytes) that they want to use inside process address spaces in a fashion transparent to the program and the CPU. Each has its own API exposed to userspace, and while there is a lot of similarity among them, many of the userspace API details are hardware specific. In the GPU world most of the driver lives in userspace; applications target high-level APIs such as OpenGL, Vulkan, OpenCL or CUDA, and those APIs in turn have a hardware-specific userspace driver that talks to hardware-specific IOCTLs. So this is not like a network or block device.

> The (very nice) hmm.txt document is not referenced in the MAINTAINERS file? You might want to fix that when you have a moment.

I have a couple of small fixes/typo patches that I need to clean up and send; I will fix the MAINTAINERS entry as part of those.

Cheers,
Jérôme
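For readers who want to map the PRIVATE/PUBLIC distinction above onto the API, here is a rough sketch of the two registration entry points as they looked in the 4.16-era include/linux/hmm.h. The prototypes are reproduced from memory and may differ slightly between kernel versions, so treat them as illustrative rather than authoritative.

/* Sketch of the HMM device-memory registration entry points (4.16-era
 * API; check include/linux/hmm.h for the authoritative prototypes). */

/* DEVICE_PRIVATE: memory the CPU cannot access coherently (e.g. PCIe BAR
 * memory). HMM picks an unused physical-address "hole" itself, so no
 * resource is passed in. */
struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
                                  struct device *device,
                                  unsigned long size);

/* DEVICE_PUBLIC: cache-coherent device memory (CAPI, CCIX, ...) that
 * already occupies a known physical range, passed in as a resource. */
struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
                                           struct device *device,
                                           struct resource *res);

/* Tear-down for either variant. */
void hmm_devmem_remove(struct hmm_devmem *devmem);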
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Fri, 2018-03-02 at 10:25 +1100, Benjamin Herrenschmidt wrote: > On Thu, 2018-03-01 at 16:19 -0700, Logan Gunthorpe wrote: > > > > On 01/03/18 04:00 PM, Benjamin Herrenschmidt wrote: > > > We use only 52 in practice but yes. > > > > > > > That's 64PB. If you use need > > > > a sparse vmemmap for the entire space it will take 16TB which leaves you > > > > with 63.98PB of address space left. (Similar calculations for other > > > > numbers of address bits.) > > > > > > We only have 52 bits of virtual space for the kernel with the radix > > > MMU. > > > > Ok, assuming you only have 52 bits of physical address space: the sparse > > vmemmap takes 1TB and you're left with 3.9PB of address space for other > > things. So, again, why doesn't that work? Is my math wrong > > The big problem is not the vmemmap, it's the linear mapping Allright, so, I think I have a plan to fix this, but it will take a little bit of time. Basically the idea is to have firmware pass to Linux a region that's known to not have anything in it that it can use for the vmalloc space rather than have linux arbitrarily cut the address space in half. I'm pretty sure I can always find large enough "holes" in the physical address space that are outside of both RAM/OpenCAPI/Nvlink and PCIe/MMIO space. If anything, unused chip IDs. But I don't want Linux to have to know about the intimate HW details so I'll pass it from FW. It will take some time to adjust Linux and get updated FW around though. Once that's done, I'll be able to have the linear mapping go through the entire 52-bit space (minus that hole). Of course the hole need to be large enough to hold a vmemmap for a 52-bit space, so that's about 4TB. So I probably need a hole that's at least 8TB. As for the mapping attributes, it should be easy for my linear mapping code to ensure anything that isn't actual RAM is mapped NC. Cheers, Ben. ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
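A quick sanity check of the ~4TB vmemmap figure above, assuming the typical ppc64 configuration of 64KB pages and a 64-byte struct page (other page sizes scale the result accordingly):

$$ \text{vmemmap size} = \frac{2^{52}\ \text{bytes}}{\texttt{PAGE\_SIZE}} \times \operatorname{sizeof}(\texttt{struct page}) = \frac{2^{52}}{2^{16}} \times 2^{6}\ \text{bytes} = 2^{42}\ \text{bytes} = 4\ \text{TB} $$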
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
> It seems people miss-understand HMM :(

Hi Jerome

Your unhappy face emoticon made me sad so I went off to (re)read up on HMM. Along the way I came up with a couple of things.

While hmm.txt is really nice to read it makes no mention of DEVICE_PRIVATE and DEVICE_PUBLIC. It also gives no indication when one might choose to use one over the other. Would it be possible to update hmm.txt to include some discussion on this? I understand that DEVICE_PUBLIC creates a mapping in the kernel's linear address space for the device memory and DEVICE_PRIVATE does not. However, like I said, I am not sure when you would use either one and the pros and cons of doing so. I actually ended up finding some useful information in memremap.h but I don't think it is fair to expect people to dig *that* deep to find this information ;-).

A quick grep shows no drivers using the HMM API in the upstream code today. Is this correct? Are there any examples of out-of-tree drivers that use HMM you can point me to? As a driver developer what resources exist to help me write an HMM-aware driver?

The (very nice) hmm.txt document is not referenced in the MAINTAINERS file? You might want to fix that when you have a moment.

Stephen
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Fri, 2018-03-02 at 08:57 -0800, Linus Torvalds wrote: > On Fri, Mar 2, 2018 at 8:22 AM, Kani, Toshi wrote: > > > > FWIW, this thing is called MTRRs on x86, which are initialized by BIOS. > > No. > > Or rather, that's simply just another (small) part of it all - and an > architected and documented one at that. > > Like the page table caching entries, the memory type range registers > are really just "secondary information". They don't actually select > between PCIe and RAM, they just affect the behavior on top of that. > > The really nitty-gritty stuff is not architected, and generally not > documented outside (possibly) the BIOS writer's guide that is not made > public. > > Those magical registers contain details like how the DRAM is > interleaved (if it is), what the timings are, where which memory > controller handles which memory range, and what are goes to PCIe etc. > > Basically all the actual *steering* information is very much hidden > away from the kernel (and often from the BIOS too). The parts we see > at a higher level are just tuning and tweaks. > > Note: the details differ _enormously_ between different chips. The > setup can be very different, with things like Knights Landing having > the external cache that can also act as local memory that isn't a > cache but maps at a different physical address instead etc. That's the > kind of steering I'm talking about - at a low level how physical > addresses get mapped to different cache partitions, memory > controllers, or to the IO system etc. Right, MRC code is not documented publicly, and it is very much CPU dependent. It programs address decoders and maps DRAMs to physical address as you described. MTRRs have nothing to do with this memory controller setting. That said, MTRRs specify CPU's memory access type, such as UC and WB. Thanks, -Toshi ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Fri, Mar 2, 2018 at 8:57 AM, Linus Torvalds wrote: > > Like the page table caching entries, the memory type range registers > are really just "secondary information". They don't actually select > between PCIe and RAM, they just affect the behavior on top of that. Side note: historically the two may have been almost the same, since the CPU only had one single unified bus for "memory" (whether that was memory-mapped PCI or actual RAM). The steering was external. But even back then you had extended bits to specify things like how the 640k-1M region got remapped - which could depend on not just the address, but on whether you read or wrote to it. The "lost" 384kB of RAM could either be remapped at a different address, or could be used for shadowing the (slow) ROM contents, or whatever. Linus ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Fri, Mar 2, 2018 at 8:22 AM, Kani, Toshi wrote: > > FWIW, this thing is called MTRRs on x86, which are initialized by BIOS. No. Or rather, that's simply just another (small) part of it all - and an architected and documented one at that. Like the page table caching entries, the memory type range registers are really just "secondary information". They don't actually select between PCIe and RAM, they just affect the behavior on top of that. The really nitty-gritty stuff is not architected, and generally not documented outside (possibly) the BIOS writer's guide that is not made public. Those magical registers contain details like how the DRAM is interleaved (if it is), what the timings are, where which memory controller handles which memory range, and what are goes to PCIe etc. Basically all the actual *steering* information is very much hidden away from the kernel (and often from the BIOS too). The parts we see at a higher level are just tuning and tweaks. Note: the details differ _enormously_ between different chips. The setup can be very different, with things like Knights Landing having the external cache that can also act as local memory that isn't a cache but maps at a different physical address instead etc. That's the kind of steering I'm talking about - at a low level how physical addresses get mapped to different cache partitions, memory controllers, or to the IO system etc. Linus ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Fri, 2018-03-02 at 09:34 +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-01 at 14:31 -0800, Linus Torvalds wrote:
> > On Thu, Mar 1, 2018 at 2:06 PM, Benjamin Herrenschmidt wrote:
> > >
> > > Could be that x86 has the smarts to do the right thing, still trying to untangle the code :-)
> >
> > Afaik, x86 will not cache PCI unless the system is misconfigured, and even then it's more likely to just raise a machine check exception than cache things.
> >
> > The last-level cache is going to do fills and spills directly to the memory controller, not to the PCIe side of things.
> >
> > (I guess you *can* do things differently, and I wouldn't be surprised if some people inside Intel did try to do things differently with trying nvram over PCIe, but in general I think the above is true)
> >
> > You won't find it in the kernel code either. It's in hardware with firmware configuration of what addresses are mapped to the memory controllers (and _how_ they are mapped) and which are not.
>
> Ah thanks ! Thanks explains. We can fix that on ppc64 in our linear mapping code by checking the address vs. memblocks to chose the right page table attributes.

FWIW, this thing is called MTRRs on x86, which are initialized by BIOS. These registers effectively overwrite page table setups. Intel SDM defines the effect as follows ('PAT Entry Value' is the page table setup):

  MTRR Memory Type   PAT Entry Value   Effective Memory Type
  UC                 UC                UC
  UC                 WC                WC
  UC                 WT                UC
  UC                 WB                UC
  UC                 WP                UC

On my system, BIOS sets MTRRs to cover the entire MMIO ranges with UC. Other BIOSes may simply set the MTRR default type to UC, i.e. uncovered ranges become UC.

  # cat /proc/mtrr
   :
  reg01: base=0xc00 (12582912MB), size=2097152MB, count=1: uncachable
   :

  # cat /proc/iomem | grep 'PCI Bus'
   :
  c00-c3f : PCI Bus :00
  c40-c7f : PCI Bus :11
  c80-cbf : PCI Bus :36
  cc0-cff : PCI Bus :5b
  d00-d3f : PCI Bus :80
  d40-d7f : PCI Bus :85
  d80-dbf : PCI Bus :ae
  dc0-dff : PCI Bus :d7

-Toshi
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 01/03/18 04:26 PM, Benjamin Herrenschmidt wrote:
> The big problem is not the vmemmap, it's the linear mapping.

Ah, yes, ok.

Logan
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Thu, 2018-03-01 at 16:19 -0700, Logan Gunthorpe wrote: (Switching back to my non-IBM address ...) > On 01/03/18 04:00 PM, Benjamin Herrenschmidt wrote: > > We use only 52 in practice but yes. > > > > > That's 64PB. If you use need > > > a sparse vmemmap for the entire space it will take 16TB which leaves you > > > with 63.98PB of address space left. (Similar calculations for other > > > numbers of address bits.) > > > > We only have 52 bits of virtual space for the kernel with the radix > > MMU. > > Ok, assuming you only have 52 bits of physical address space: the sparse > vmemmap takes 1TB and you're left with 3.9PB of address space for other > things. So, again, why doesn't that work? Is my math wrong The big problem is not the vmemmap, it's the linear mapping. Cheers, Ben. ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 01/03/18 04:00 PM, Benjamin Herrenschmidt wrote:
> We use only 52 in practice but yes.
>
> > That's 64PB. If you use need a sparse vmemmap for the entire space it will take 16TB which leaves you with 63.98PB of address space left. (Similar calculations for other numbers of address bits.)
>
> We only have 52 bits of virtual space for the kernel with the radix MMU.

Ok, assuming you only have 52 bits of physical address space: the sparse vmemmap takes 1TB and you're left with 3.9PB of address space for other things. So, again, why doesn't that work? Is my math wrong?

Logan
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Thu, 2018-03-01 at 14:57 -0700, Logan Gunthorpe wrote: > > On 01/03/18 02:45 PM, Logan Gunthorpe wrote: > > It handles it fine for many situations. But when you try to map > > something that is at the end of the physical address space then the > > spares-vmemmap needs virtual address space that's the size of the > > physical address space divided by PAGE_SIZE which may be a little bit > > too large... > > Though, considering this more, maybe this shouldn't be a problem... > > Lets say you have 56bits of address space. We use only 52 in practice but yes. > That's 64PB. If you use need > a sparse vmemmap for the entire space it will take 16TB which leaves you > with 63.98PB of address space left. (Similar calculations for other > numbers of address bits.) We only have 52 bits of virtual space for the kernel with the radix MMU. > So I'm not sure what the problem with this is. > > We still have to ensure all the arches map the memory with the right > cache bits but that should be relatively easy to solve. > > Logan ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Thu, 2018-03-01 at 14:31 -0800, Linus Torvalds wrote:
> On Thu, Mar 1, 2018 at 2:06 PM, Benjamin Herrenschmidt wrote:
> >
> > Could be that x86 has the smarts to do the right thing, still trying to untangle the code :-)
>
> Afaik, x86 will not cache PCI unless the system is misconfigured, and even then it's more likely to just raise a machine check exception than cache things.
>
> The last-level cache is going to do fills and spills directly to the memory controller, not to the PCIe side of things.
>
> (I guess you *can* do things differently, and I wouldn't be surprised if some people inside Intel did try to do things differently with trying nvram over PCIe, but in general I think the above is true)
>
> You won't find it in the kernel code either. It's in hardware with firmware configuration of what addresses are mapped to the memory controllers (and _how_ they are mapped) and which are not.

Ah thanks! That explains it. We can fix that on ppc64 in our linear mapping code by checking the address vs. memblocks to choose the right page table attributes.

So the main problem on our side is to figure out the problem of too big PFNs. I need to look at this with Aneesh, we might be able to make things fit with a bit of wrangling.

> You _might_ find it in the BIOS, assuming you understood the tables and had the BIOS writer's guide to unravel the magic registers.
>
> But you might not even find it there. Some of the memory unit timing programming is done very early, and by code that Intel doesn't even release to the BIOS writers except as a magic encrypted blob, afaik. Some of the magic might even be in microcode.
>
> The page table settings for cacheability are more like a hint, and only _part_ of the whole picture. The memory type range registers are another part. And magic low-level uarch, northbridge and memory unit specific magic is yet another part.
>
> So you can disable caching for memory, but I'm pretty sure you can't enable caching for PCIe at least in the common case. At best you can affect how the store buffer works for PCIe.
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Thu, Mar 1, 2018 at 2:06 PM, Benjamin Herrenschmidt wrote: > > Could be that x86 has the smarts to do the right thing, still trying to > untangle the code :-) Afaik, x86 will not cache PCI unless the system is misconfigured, and even then it's more likely to just raise a machine check exception than cache things. The last-level cache is going to do fills and spills directly to the memory controller, not to the PCIe side of things. (I guess you *can* do things differently, and I wouldn't be surprised if some people inside Intel did try to do things differently with trying nvram over PCIe, but in general I think the above is true) You won't find it in the kernel code either. It's in hardware with firmware configuration of what addresses are mapped to the memory controllers (and _how_ they are mapped) and which are not. You _might_ find it in the BIOS, assuming you understood the tables and had the BIOS writer's guide to unravel the magic registers. But you might not even find it there. Some of the memory unit timing programming is done very early, and by code that Intel doesn't even release to the BIOS writers except as a magic encrypted blob, afaik. Some of the magic might even be in microcode. The page table settings for cacheability are more like a hint, and only _part_ of the whole picture. The memory type range registers are another part. And magic low-level uarch, northbridge and memory unit specific magic is yet another part. So you can disable caching for memory, but I'm pretty sure you can't enable caching for PCIe at least in the common case. At best you can affect how the store buffer works for PCIe. Linus ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Thu, 2018-03-01 at 13:53 -0700, Jason Gunthorpe wrote: > On Fri, Mar 02, 2018 at 07:40:15AM +1100, Benjamin Herrenschmidt wrote: > > Also we need to be able to hard block MEMREMAP_WB mappings of non-RAM > > on ppc64 (maybe via an arch hook as it might depend on the processor > > family). Server powerpc cannot do cachable accesses on IO memory > > (unless it's special OpenCAPI or nVlink, but not on PCIe). > > I think you are right on this - even on x86 we must not create > cachable mappings of PCI BARs - there is no way that works the way > anyone would expect. > > I think this series doesn't have a problem here only because it never > touches the BAR pages with the CPU. > > BAR memory should be mapped into the CPU as WC at best on all arches.. Could be that x86 has the smarts to do the right thing, still trying to untangle the code :-) Cheers, Ben. ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 01/03/18 02:45 PM, Logan Gunthorpe wrote:
> It handles it fine for many situations. But when you try to map something that is at the end of the physical address space then the sparse vmemmap needs virtual address space that's the size of the physical address space divided by PAGE_SIZE which may be a little bit too large...

Though, considering this more, maybe this shouldn't be a problem... Let's say you have 56 bits of address space. That's 64PB. If you need a sparse vmemmap for the entire space it will take 16TB which leaves you with 63.98PB of address space left. (Similar calculations apply for other numbers of address bits.) So I'm not sure what the problem with this is.

We still have to ensure all the arches map the memory with the right cache bits but that should be relatively easy to solve.

Logan
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 01/03/18 02:37 PM, Dan Williams wrote:
> Ah ok, I'd need to look at the details. I had been assuming that sparse-vmemmap could handle such a situation, but that could indeed be a broken assumption.

It handles it fine for many situations. But when you try to map something that is at the end of the physical address space then the sparse vmemmap needs virtual address space that's the size of the physical address space divided by PAGE_SIZE which may be a little bit too large...

Logan
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
> The intention of HMM is to be useful for all device memory that wish to have struct page for various reasons.

Hi Jerome and thanks for your input!

Understood. We have looked at HMM in the past and long term I definitely would like to consider how we can add P2P functionality to HMM for both DEVICE_PRIVATE and DEVICE_PUBLIC so we can pass addressable and non-addressable blocks of data between devices. However that is well beyond the intentions of this series ;-).

Stephen
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Thu, Mar 1, 2018 at 12:34 PM, Benjamin Herrenschmidt wrote: > On Thu, 2018-03-01 at 11:21 -0800, Dan Williams wrote: >> On Wed, Feb 28, 2018 at 7:56 PM, Benjamin Herrenschmidt >> wrote: >> > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote: >> > > On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote: >> > > > Hi Everyone, >> > > >> > > >> > > So Oliver (CC) was having issues getting any of that to work for us. >> > > >> > > The problem is that acccording to him (I didn't double check the latest >> > > patches) you effectively hotplug the PCIe memory into the system when >> > > creating struct pages. >> > > >> > > This cannot possibly work for us. First we cannot map PCIe memory as >> > > cachable. (Note that doing so is a bad idea if you are behind a PLX >> > > switch anyway since you'd ahve to manage cache coherency in SW). >> > >> > Note: I think the above means it won't work behind a switch on x86 >> > either, will it ? >> >> The devm_memremap_pages() infrastructure allows placing the memmap in >> "System-RAM" even if the hotplugged range is in PCI space. So, even if >> it is an issue on some configurations, it's just a simple adjustment >> to where the memmap is placed. > > But what happens with that PCI memory ? Is it effectively turned into > nromal memory (ie, usable for normal allocations, potentially used to > populate user pages etc...) or is it kept aside ? > > Also on ppc64, the physical addresses of PCIe make it so far appart > that there's no way we can map them into the linear mapping at the > normal offset of PAGE_OFFSET + (pfn << PAGE_SHIFT), so things like > page_address or virt_to_page cannot work as-is on PCIe addresses. Ah ok, I'd need to look at the details. I had been assuming that sparse-vmemmap could handle such a situation, but that could indeed be a broken assumption. ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Thu, Mar 01, 2018 at 02:15:01PM -0700, Logan Gunthorpe wrote:
> On 01/03/18 02:10 PM, Jerome Glisse wrote:
> > It seems people miss-understand HMM :( you do not have to use all of its features. If all you care about is having struct page then just use that for instance in your case only use those following 3 functions:
> >
> > hmm_devmem_add() or hmm_devmem_add_resource() and hmm_devmem_remove() for cleanup.
>
> To what benefit over just using devm_memremap_pages()? If I'm using the hmm interface and disabling all the features, I don't see the point. We've also cleaned up the devm_memremap_pages() interface to be more usefully generic in such a way that I'd hope HMM starts using it too and gets rid of the code duplication.

The first HMM variant finds a hole and does not require a resource as an input parameter. Beside that, internally for PCIe device memory devm_memremap_pages() does not do the right thing: last time I checked it always creates a linear mapping of the range, i.e. HMM calls add_pages() while devm_memremap_pages() calls arch_add_memory().

When I upstreamed HMM, Dan didn't want me to touch devm_memremap_pages() to match my needs. I am more than happy to modify devm_memremap_pages() to also handle HMM needs.

Note that the intention of HMM is to be a middle layer between low-level infrastructure and device drivers. The idea is that such an impedance layer should make it easier down the road to change how things are handled down below without having to touch many device drivers.

Cheers,
Jérôme
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 01/03/18 02:18 PM, Jerome Glisse wrote:
> This is pretty easy to do with HMM:
>
> unsigned long hmm_page_to_phys_pfn(struct page *page)

This is not useful unless you want to go through all the kernel paths we are using and replace page_to_phys() and friends with something else that calls an HMM function when appropriate...

The problem isn't getting the physical address from a page, it's that we are passing these pages through various kernel interfaces which expect pages that work in the usual manner. (Look at the code: we quite simply provide a way to get the PCI bus address from a page when necessary.)

Logan
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Thu, Mar 01, 2018 at 02:11:34PM -0700, Logan Gunthorpe wrote:
> On 01/03/18 02:03 PM, Benjamin Herrenschmidt wrote:
> > However, what happens if anything calls page_address() on them ? Some DMA ops do that for example, or some devices might ...
>
> Although we could probably work around it with some pain, we rely on page_address() and virt_to_phys(), etc to work on these pages. So on x86, yes, it makes it into the linear mapping.

This is pretty easy to do with HMM:

unsigned long hmm_page_to_phys_pfn(struct page *page)
{
	struct hmm_devmem *devmem;
	unsigned long ppfn;

	/* Sanity test, maybe BUG_ON() */
	if (!is_device_private_page(page))
		return -1UL;

	devmem = page->pgmap->data;
	ppfn = page_to_pfn(page) - devmem->pfn_first;
	return ppfn + devmem->device_phys_base_pfn;
}

Note that the last field does not exist in today's HMM because I did not need such a helper so far, but this can be added.

Cheers,
Jérôme
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 01/03/18 02:10 PM, Jerome Glisse wrote:
> It seems people miss-understand HMM :( you do not have to use all of its features. If all you care about is having struct page then just use that for instance in your case only use those following 3 functions:
>
> hmm_devmem_add() or hmm_devmem_add_resource() and hmm_devmem_remove() for cleanup.

To what benefit over just using devm_memremap_pages()? If I'm using the hmm interface and disabling all the features, I don't see the point. We've also cleaned up the devm_memremap_pages() interface to be more usefully generic in such a way that I'd hope HMM starts using it too and gets rid of the code duplication.

Logan
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 01/03/18 02:03 PM, Benjamin Herrenschmidt wrote:
> However, what happens if anything calls page_address() on them ? Some DMA ops do that for example, or some devices might ...

Although we could probably work around it with some pain, we rely on page_address() and virt_to_phys(), etc to work on these pages. So on x86, yes, it makes it into the linear mapping.

Logan
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Thu, Mar 01, 2018 at 02:03:26PM -0700, Logan Gunthorpe wrote:
> On 01/03/18 01:55 PM, Jerome Glisse wrote:
> > Well this again a new user of struct page for device memory just for one usecase. I wanted HMM to be more versatile so that it could be use for this kind of thing too. I guess the message didn't go through. I will take some cycles tomorrow to look into this patchset to ascertain how struct page is use in this context.
>
> We looked at it but didn't see how any of it was applicable to our needs.

It seems people misunderstand HMM :( You do not have to use all of its features. If all you care about is having struct page then just use that; for instance, in your case, only use these following 3 functions:

hmm_devmem_add() or hmm_devmem_add_resource(), and hmm_devmem_remove() for cleanup.

You can set the fault callback to an empty stub that always returns VM_SIGBUS, or add a patch to allow a NULL callback inside HMM. You don't have to use the free callback if you don't care, and if there is something that doesn't quite match what you want HMM can always be adjusted to address this.

The intention of HMM is to be useful for all device memory that wishes to have struct page for various reasons.

Cheers,
Jérôme
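A minimal sketch of what Jérôme describes above (struct pages only, stub callbacks), written against the 4.16-era HMM API; the function and variable names are illustrative, and the exact hmm_devmem_ops prototypes should be checked against include/linux/hmm.h for the kernel in question:

#include <linux/hmm.h>

/* Nothing to do on free if we only want struct pages. */
static void demo_devmem_free(struct hmm_devmem *devmem, struct page *page)
{
}

/* The memory is never migrated into a process address space, so any
 * CPU fault on one of these pages is an error. */
static int demo_devmem_fault(struct hmm_devmem *devmem,
			     struct vm_area_struct *vma, unsigned long addr,
			     const struct page *page, unsigned int flags,
			     pmd_t *pmdp)
{
	return VM_FAULT_SIGBUS;
}

static const struct hmm_devmem_ops demo_devmem_ops = {
	.free  = demo_devmem_free,
	.fault = demo_devmem_fault,
};

/* Register 'size' bytes of device memory purely to obtain struct pages;
 * returns an ERR_PTR on failure.  hmm_devmem_remove() undoes it. */
static struct hmm_devmem *demo_devmem_register(struct device *dev,
					       unsigned long size)
{
	return hmm_devmem_add(&demo_devmem_ops, dev, size);
}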
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Thu, 2018-03-01 at 11:21 -0800, Dan Williams wrote: > > > The devm_memremap_pages() infrastructure allows placing the memmap in > "System-RAM" even if the hotplugged range is in PCI space. So, even if > it is an issue on some configurations, it's just a simple adjustment > to where the memmap is placed. Actually can you explain a bit more here ? devm_memremap_pages() doesn't take any specific argument about what to do with the memory. It does create the vmemmap sections etc... but does so by calling arch_add_memory(). So __add_memory() isn't called, which means the pages aren't added to the linear mapping. Then you manually add them to ZONE_DEVICE. Am I correct ? In that case, they indeed can't be used as normal memory pages, which is good, and if they are indeed not in the linear mapping, then there is no caching issues. However, what happens if anything calls page_address() on them ? Some DMA ops do that for example, or some devices might ... This is all quite convoluted with no documentation I can find that explains the various expectations. So the question is are those pages landing in the linear mapping, and if yes, by what code path ? The next question is if we ever want that to work on ppc64, we need a way to make this fit in our linear mapping and map it non-cachable, which will require some wrangling on how we handle that mapping. Cheers, Ben. ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 01/03/18 01:55 PM, Jerome Glisse wrote:
> Well this again a new user of struct page for device memory just for one usecase. I wanted HMM to be more versatile so that it could be use for this kind of thing too. I guess the message didn't go through. I will take some cycles tomorrow to look into this patchset to ascertain how struct page is use in this context.

We looked at it but didn't see how any of it was applicable to our needs.

Logan
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 01/03/18 01:53 PM, Jason Gunthorpe wrote:
> On Fri, Mar 02, 2018 at 07:40:15AM +1100, Benjamin Herrenschmidt wrote:
> > Also we need to be able to hard block MEMREMAP_WB mappings of non-RAM on ppc64 (maybe via an arch hook as it might depend on the processor family). Server powerpc cannot do cachable accesses on IO memory (unless it's special OpenCAPI or nVlink, but not on PCIe).
>
> I think you are right on this - even on x86 we must not create cachable mappings of PCI BARs - there is no way that works the way anyone would expect.

On x86, even if I try to make a cachable mapping of a PCI BAR it always ends up being un-cached. The arch code in x86 always does the right thing here. Other arches, not so much.

Logan
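To make the mapping-attribute point concrete, here is a small sketch of how a driver would typically map a BAR with a safe attribute; the helper name is made up, and this only illustrates the standard ioremap_wc()/pci_resource_*() calls, not code from the series:

#include <linux/pci.h>
#include <linux/io.h>

/* Map BAR 'bar' of a PCI device write-combined.  Asking for a cached
 * mapping of MMIO is what the thread warns against: x86 will typically
 * demote it to uncached via MTRR/PAT anyway, while other architectures
 * may hand back a cachable (and incoherent) mapping. */
static void __iomem *demo_map_bar_wc(struct pci_dev *pdev, int bar)
{
	return ioremap_wc(pci_resource_start(pdev, bar),
			  pci_resource_len(pdev, bar));
}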
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Fri, Mar 02, 2018 at 07:29:55AM +1100, Benjamin Herrenschmidt wrote: > On Thu, 2018-03-01 at 11:04 -0700, Logan Gunthorpe wrote: > > > > On 28/02/18 08:56 PM, Benjamin Herrenschmidt wrote: > > > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote: > > > > The problem is that acccording to him (I didn't double check the latest > > > > patches) you effectively hotplug the PCIe memory into the system when > > > > creating struct pages. > > > > > > > > This cannot possibly work for us. First we cannot map PCIe memory as > > > > cachable. (Note that doing so is a bad idea if you are behind a PLX > > > > switch anyway since you'd ahve to manage cache coherency in SW). > > > > > > Note: I think the above means it won't work behind a switch on x86 > > > either, will it ? > > > > This works perfectly fine on x86 behind a switch and we've tested it on > > multiple machines. We've never had an issue of running out of virtual > > space despite our PCI bars typically being located with an offset of > > 56TB or more. The arch code on x86 also somehow figures out not to map > > the memory as cachable so that's not an issue (though, at this point, > > the CPU never accesses the memory so even if it were, it wouldn't affect > > anything). > > Oliver can you look into this ? You sais the memory was effectively > hotplug'ed into the system when creating the struct pages. That would > mean to me that it's a) mapped (which for us is cachable, maybe x86 has > tricks to avoid that) and b) potentially used to populate userspace > pages (that will definitely be cachable). Unless there's something in > there you didn't see that prevents it. > > > We also had this working on ARM64 a while back but it required some out > > of tree ZONE_DEVICE patches and some truly horrid hacks to it's arch > > code to ioremap the memory into the page map. > > > > You didn't mention what architecture you were trying this on. > > ppc64. > > > It may make sense at this point to make this feature dependent on x86 > > until more work is done to make it properly portable. Something like > > arch functions that allow adding IO memory pages to with a specific > > cache setting. Though, if an arch has such restrictive limits on the map > > size it would probably need to address that too somehow. > > Not fan of that approach. > > So there are two issues to consider here: > > - Our MMIO space is very far away from memory (high bits set in the > address) which causes problem with things like vmmemmap, page_address, > virt_to_page etc... Do you have similar issues on arm64 ? HMM private (HMM public is different) works around that by looking for "hole" in address space and using those for hotplug (ie page_to_pfn() != physical pfn of the memory). This is ok for HMM because the memory is never map by the CPU and we can find the physical pfn with a little bit of math (page_to_pfn() - page->pgmap->res->start + page->pgmap->dev-> physical_base_address). To avoid anything going bad i actually do not populate the kernel linear mapping for the range hence definitly no CPU access at all through those struct page. CPU can still access PCIE bar through usual mmio map. > > - We need to ensure that the mechanism (which I'm not familiar with) > that you use to create the struct page's for the device don't end up > turning those device pages into normal "general use" pages for the > system. Oliver thinks it does, you say it doesn't, ... > > Jerome (Glisse), what's your take on this ? Smells like something that > could be covered by HMM... 
Well this is again a new user of struct page for device memory just for one usecase. I wanted HMM to be more versatile so that it could be used for this kind of thing too. I guess the message didn't go through. I will take some cycles tomorrow to look into this patchset to ascertain how struct page is used in this context.

Note that I also want peer to peer for HMM users, but with ACS and using the IOMMU, i.e. having to populate the IOMMU page table of one device to point to the bar of another device. I need to test how many platforms this works on; hardware engineers are unable/unwilling to commit on whether this works or not.

> Logan, the only reason you need struct page's to begin with is for the DMA API right ? Or am I missing something here ?

If it is only needed for that, this sounds like a waste of memory for struct page. Though I understand this allows the new API to match the previous one.

Cheers,
Jérôme
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 01/03/18 01:29 PM, Benjamin Herrenschmidt wrote:
> Oliver can you look into this ? You sais the memory was effectively hotplug'ed into the system when creating the struct pages. That would mean to me that it's a) mapped (which for us is cachable, maybe x86 has tricks to avoid that) and b) potentially used to populate userspace pages (that will definitely be cachable). Unless there's something in there you didn't see that prevents it.

Yes, we've been specifically prohibiting all cases where these pages get passed to userspace. We don't want that. Although it works in limited cases (ie x86), and we use it for some testing, there are dragons there.

> - Our MMIO space is very far away from memory (high bits set in the address) which causes problem with things like vmmemmap, page_address, virt_to_page etc... Do you have similar issues on arm64 ?

No similar issues on arm64. Any chance you could simply not map the PCI bars that way? What's the point of that? It may simply mean ppc64 can't be supported until either that changes or the kernel infrastructure gets more sophisticated.

> Logan, the only reason you need struct page's to begin with is for the DMA API right ? Or am I missing something here ?

It's not so much the DMA map API as it is the entire kernel infrastructure. Scatter lists (which are universally used to setup DMA requests) require pages and bios require pages, etc, etc. In fact, this patch set, in its current form, routes around the DMA API entirely.

Myself [1] and others have done prototype work to migrate away from struct pages and to use pfn_t instead but this work doesn't seem to get very far in the community.

Logan

[1] https://marc.info/?l=linux-kernel&m=149566222124326&w=2
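As a small illustration of the scatterlist point above (not code from the series): every scatterlist entry is described by a struct page plus offset and length, which is why P2P memory needs struct pages at all to ride the block and RDMA stacks.

#include <linux/scatterlist.h>

/* Build a one-entry scatterlist over 'len' bytes of 'page'.  This is the
 * shape of the data the block layer and RDMA stacks pass around; without
 * a struct page backing the P2P memory there is nothing to put here. */
static void demo_fill_sgl(struct scatterlist *sg, struct page *page,
			  unsigned int len, unsigned int offset)
{
	sg_init_table(sg, 1);
	sg_set_page(sg, page, len, offset);
}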
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Fri, 2018-03-02 at 07:34 +1100, Benjamin Herrenschmidt wrote:
> But what happens with that PCI memory ? Is it effectively turned into normal memory (ie, usable for normal allocations, potentially used to populate user pages etc...) or is it kept aside ?

(What I mean is: is it added to the page allocator, basically?)

Also we need to be able to hard block MEMREMAP_WB mappings of non-RAM on ppc64 (maybe via an arch hook as it might depend on the processor family). Server powerpc cannot do cachable accesses on IO memory (unless it's special OpenCAPI or nVlink, but not on PCIe).

> Also on ppc64, the physical addresses of PCIe make it so far apart that there's no way we can map them into the linear mapping at the normal offset of PAGE_OFFSET + (pfn << PAGE_SHIFT), so things like page_address or virt_to_page cannot work as-is on PCIe addresses.

Talking of which ... is there any documentation on the whole devm_memremap_pages() thing? My grep turned out empty...

Cheers,
Ben.
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Thu, 2018-03-01 at 11:21 -0800, Dan Williams wrote:
> On Wed, Feb 28, 2018 at 7:56 PM, Benjamin Herrenschmidt wrote:
> > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> > > On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
> > > > Hi Everyone,
> > >
> > > So Oliver (CC) was having issues getting any of that to work for us.
> > >
> > > The problem is that acccording to him (I didn't double check the latest patches) you effectively hotplug the PCIe memory into the system when creating struct pages.
> > >
> > > This cannot possibly work for us. First we cannot map PCIe memory as cachable. (Note that doing so is a bad idea if you are behind a PLX switch anyway since you'd ahve to manage cache coherency in SW).
> >
> > Note: I think the above means it won't work behind a switch on x86 either, will it ?
>
> The devm_memremap_pages() infrastructure allows placing the memmap in "System-RAM" even if the hotplugged range is in PCI space. So, even if it is an issue on some configurations, it's just a simple adjustment to where the memmap is placed.

But what happens with that PCI memory ? Is it effectively turned into normal memory (ie, usable for normal allocations, potentially used to populate user pages etc...) or is it kept aside ?

Also on ppc64, the physical addresses of PCIe make it so far apart that there's no way we can map them into the linear mapping at the normal offset of PAGE_OFFSET + (pfn << PAGE_SHIFT), so things like page_address or virt_to_page cannot work as-is on PCIe addresses.

Ben.
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Thu, 2018-03-01 at 18:09 +, Stephen Bates wrote: > > > So Oliver (CC) was having issues getting any of that to work for us. > > > > > > The problem is that acccording to him (I didn't double check the latest > > > patches) you effectively hotplug the PCIe memory into the system when > > > creating struct pages. > > > > > > This cannot possibly work for us. First we cannot map PCIe memory as > > > cachable. (Note that doing so is a bad idea if you are behind a PLX > > > switch anyway since you'd ahve to manage cache coherency in SW). > > > > > > Note: I think the above means it won't work behind a switch on x86 > > either, will it ? > > > Ben > > We have done extensive testing of this series and its predecessors > using PCIe switches from both Broadcom (PLX) and Microsemi. We have > also done testing on x86_64, ARM64 and ppc64el based ARCH with > varying degrees of success. The series as it currently stands only > works on x86_64 but modified (hacky) versions have been made to work > on ARM64. The x86_64 testing has been done on a range of (Intel) > CPUs, servers, PCI EPs (including RDMA NICs from at least three > vendors, NVMe SSDs from at least four vendors and P2P devices from > four vendors) and PCI switches. > > I do find it slightly offensive that you would question the series > even working. I hope you are not suggesting we would submit this > framework multiple times without having done testing on it No need to get personal on that. I did specify that this was based on some incomplete understanding of what's going on with that new hack used to create struct pages. As it is, it cannot work on ppc64 however, in part because according to Oliver, we end up mapping things cachable, and in part, because of the address range issues. The latter issue might be fundamental to the approach and unfixable unless we have ways to use hooks for virt_to_page/page_address on these things. Ben. ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Thu, 2018-03-01 at 11:04 -0700, Logan Gunthorpe wrote: > > On 28/02/18 08:56 PM, Benjamin Herrenschmidt wrote: > > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote: > > > The problem is that acccording to him (I didn't double check the latest > > > patches) you effectively hotplug the PCIe memory into the system when > > > creating struct pages. > > > > > > This cannot possibly work for us. First we cannot map PCIe memory as > > > cachable. (Note that doing so is a bad idea if you are behind a PLX > > > switch anyway since you'd ahve to manage cache coherency in SW). > > > > Note: I think the above means it won't work behind a switch on x86 > > either, will it ? > > This works perfectly fine on x86 behind a switch and we've tested it on > multiple machines. We've never had an issue of running out of virtual > space despite our PCI bars typically being located with an offset of > 56TB or more. The arch code on x86 also somehow figures out not to map > the memory as cachable so that's not an issue (though, at this point, > the CPU never accesses the memory so even if it were, it wouldn't affect > anything). Oliver can you look into this ? You sais the memory was effectively hotplug'ed into the system when creating the struct pages. That would mean to me that it's a) mapped (which for us is cachable, maybe x86 has tricks to avoid that) and b) potentially used to populate userspace pages (that will definitely be cachable). Unless there's something in there you didn't see that prevents it. > We also had this working on ARM64 a while back but it required some out > of tree ZONE_DEVICE patches and some truly horrid hacks to it's arch > code to ioremap the memory into the page map. > > You didn't mention what architecture you were trying this on. ppc64. > It may make sense at this point to make this feature dependent on x86 > until more work is done to make it properly portable. Something like > arch functions that allow adding IO memory pages to with a specific > cache setting. Though, if an arch has such restrictive limits on the map > size it would probably need to address that too somehow. Not fan of that approach. So there are two issues to consider here: - Our MMIO space is very far away from memory (high bits set in the address) which causes problem with things like vmmemmap, page_address, virt_to_page etc... Do you have similar issues on arm64 ? - We need to ensure that the mechanism (which I'm not familiar with) that you use to create the struct page's for the device don't end up turning those device pages into normal "general use" pages for the system. Oliver thinks it does, you say it doesn't, ... Jerome (Glisse), what's your take on this ? Smells like something that could be covered by HMM... Logan, the only reason you need struct page's to begin with is for the DMA API right ? Or am I missing something here ? Cheers, Ben. > Thanks, > > Logan > ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 01/03/18 03:31 AM, Sagi Grimberg wrote:
> > * We also reject using devices that employ 'dma_virt_ops' which should fairly simply handle Jason's concerns that this work might break with the HFI, QIB and rxe drivers that use the virtual ops to implement their own special DMA operations.
>
> That's good, but what would happen for these devices? simply fail the mapping causing the ulp to fail its rdma operation? I would think that we need a capability flag for devices that support it.

pci_p2pmem_find() will simply not return any devices when any client uses dma_virt_ops. So in the NVMe target case it simply will not use P2P memory.

And just in case, pci_p2pdma_map_sg() will also return 0 if the device passed to it uses dma_virt_ops as well. So if someone bypasses pci_p2pmem_find() they will get a failure during map.

Logan
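A hedged sketch of the kind of check being described (rejecting clients whose device uses dma_virt_ops); the helper name is made up, and the actual series implements this inside its own client-registration and mapping paths rather than as a standalone function:

#include <linux/device.h>
#include <linux/dma-mapping.h>

/*
 * Illustrative only: dma_virt_ops "DMA-maps" buffers by handing out
 * kernel virtual addresses (used by rxe, hfi1 and qib), which can never
 * reach a peer's BAR, so such devices must not be offered P2P memory.
 */
static bool demo_p2pdma_client_ok(struct device *dev)
{
	return get_dma_ops(dev) != &dma_virt_ops;
}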
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 01/03/18 12:21 PM, Dan Williams wrote:
> > Note: I think the above means it won't work behind a switch on x86 either, will it ?
>
> The devm_memremap_pages() infrastructure allows placing the memmap in "System-RAM" even if the hotplugged range is in PCI space. So, even if it is an issue on some configurations, it's just a simple adjustment to where the memmap is placed.

Thanks for the confirmation Dan!

Logan
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Wed, Feb 28, 2018 at 7:56 PM, Benjamin Herrenschmidt wrote: > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote: >> On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote: >> > Hi Everyone, >> >> >> So Oliver (CC) was having issues getting any of that to work for us. >> >> The problem is that acccording to him (I didn't double check the latest >> patches) you effectively hotplug the PCIe memory into the system when >> creating struct pages. >> >> This cannot possibly work for us. First we cannot map PCIe memory as >> cachable. (Note that doing so is a bad idea if you are behind a PLX >> switch anyway since you'd ahve to manage cache coherency in SW). > > Note: I think the above means it won't work behind a switch on x86 > either, will it ? The devm_memremap_pages() infrastructure allows placing the memmap in "System-RAM" even if the hotplugged range is in PCI space. So, even if it is an issue on some configurations, it's just a simple adjustment to where the memmap is placed. ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
>> So Oliver (CC) was having issues getting any of that to work for us. >> >> The problem is that acccording to him (I didn't double check the latest >> patches) you effectively hotplug the PCIe memory into the system when >> creating struct pages. >> >> This cannot possibly work for us. First we cannot map PCIe memory as >> cachable. (Note that doing so is a bad idea if you are behind a PLX >> switch anyway since you'd ahve to manage cache coherency in SW). > > Note: I think the above means it won't work behind a switch on x86 > either, will it ? Ben We have done extensive testing of this series and its predecessors using PCIe switches from both Broadcom (PLX) and Microsemi. We have also done testing on x86_64, ARM64 and ppc64el based ARCH with varying degrees of success. The series as it currently stands only works on x86_64 but modified (hacky) versions have been made to work on ARM64. The x86_64 testing has been done on a range of (Intel) CPUs, servers, PCI EPs (including RDMA NICs from at least three vendors, NVMe SSDs from at least four vendors and P2P devices from four vendors) and PCI switches. I do find it slightly offensive that you would question the series even working. I hope you are not suggesting we would submit this framework multiple times without having done testing on it Stephen ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 28/02/18 08:56 PM, Benjamin Herrenschmidt wrote: On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote: The problem is that according to him (I didn't double check the latest patches) you effectively hotplug the PCIe memory into the system when creating struct pages. This cannot possibly work for us. First we cannot map PCIe memory as cacheable. (Note that doing so is a bad idea if you are behind a PLX switch anyway since you'd have to manage cache coherency in SW). Note: I think the above means it won't work behind a switch on x86 either, will it? This works perfectly fine on x86 behind a switch and we've tested it on multiple machines. We've never had an issue of running out of virtual space despite our PCI BARs typically being located with an offset of 56TB or more. The arch code on x86 also somehow figures out not to map the memory as cacheable so that's not an issue (though, at this point, the CPU never accesses the memory so even if it were, it wouldn't affect anything). We also had this working on ARM64 a while back but it required some out-of-tree ZONE_DEVICE patches and some truly horrid hacks to its arch code to ioremap the memory into the page map. You didn't mention what architecture you were trying this on. It may make sense at this point to make this feature dependent on x86 until more work is done to make it properly portable. Something like arch functions that allow adding IO memory pages with a specific cache setting. Though, if an arch has such restrictive limits on the map size it would probably need to address that too somehow. Thanks, Logan ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
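The arch-hook idea Logan mentions could look something like the sketch below. Every name here is hypothetical: no such hook exists in the kernel; the point is only to show an interface that would let an architecture add struct pages for IO memory with an explicit, non-cacheable mapping attribute.

#include <linux/mm.h>
#include <asm/pgtable.h>        /* pgprot_noncached() */

/* Hypothetical per-arch hook, named here for illustration only */
int arch_add_iomem_pages(phys_addr_t start, resource_size_t size,
                         pgprot_t prot);

static int p2pdma_add_bar_pages(phys_addr_t bar_start,
                                resource_size_t bar_size)
{
        /*
         * Ask the architecture to create struct pages for the BAR and
         * map it non-cacheable; write-combining could be another option
         * such a hook accepts.
         */
        return arch_add_iomem_pages(bar_start, bar_size,
                                    pgprot_noncached(PAGE_KERNEL));
}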
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
Hi Everyone, Hi Logan, Here's v2 of our series to introduce P2P based copy offload to NVMe fabrics. This version has been rebased onto v4.16-rc3 which already includes Christoph's devpagemap work the previous version was based off as well as a couple of the cleanup patches that were in v1. Additionally, we've made the following changes based on feedback: * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well as a bunch of cleanup and spelling fixes he pointed out in the last series. * To address Alex's ACS concerns, we change to a simpler method of just disabling ACS behind switches for any kernel that has CONFIG_PCI_P2PDMA. * We also reject using devices that employ 'dma_virt_ops' which should fairly simply handle Jason's concerns that this work might break with the HFI, QIB and rxe drivers that use the virtual ops to implement their own special DMA operations. That's good, but what would happen for these devices? Simply fail the mapping, causing the ULP to fail its RDMA operation? I would think that we need a capability flag for devices that support it. ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
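As background on the ACS point above: clearing the peer-to-peer redirect bits on a switch downstream port only involves the standard config-space helpers and pci_regs.h definitions. The sketch below is illustrative of that mechanism and is not the code from the series.

#include <linux/pci.h>

static void sketch_disable_acs_redir(struct pci_dev *pdev)
{
        u16 ctrl;
        int pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);

        if (!pos)
                return;         /* no ACS capability, nothing to do */

        pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);

        /* Let P2P TLPs be routed directly by the switch downstream port */
        ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);

        pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl);
}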
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote: > On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote: > > Hi Everyone, > > > So Oliver (CC) was having issues getting any of that to work for us. > > The problem is that according to him (I didn't double check the latest > patches) you effectively hotplug the PCIe memory into the system when > creating struct pages. > > This cannot possibly work for us. First we cannot map PCIe memory as > cacheable. (Note that doing so is a bad idea if you are behind a PLX > switch anyway since you'd have to manage cache coherency in SW). Note: I think the above means it won't work behind a switch on x86 either, will it? > Then our MMIO space is so far away from our memory space that there is > not enough vmemmap virtual space to be able to do that. > > So this can only work across architectures by using something like HMM > to create special device struct pages. > > Ben. > > > > Here's v2 of our series to introduce P2P based copy offload to NVMe > > fabrics. This version has been rebased onto v4.16-rc3 which already > > includes Christoph's devpagemap work the previous version was based > > off as well as a couple of the cleanup patches that were in v1. > > > > Additionally, we've made the following changes based on feedback: > > > > * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well > > as a bunch of cleanup and spelling fixes he pointed out in the last > > series. > > > > * To address Alex's ACS concerns, we change to a simpler method of > > just disabling ACS behind switches for any kernel that has > > CONFIG_PCI_P2PDMA. > > > > * We also reject using devices that employ 'dma_virt_ops' which should > > fairly simply handle Jason's concerns that this work might break with > > the HFI, QIB and rxe drivers that use the virtual ops to implement > > their own special DMA operations. > > > > Thanks, > > > > Logan > > > > -- > > > > This is a continuation of our work to enable using Peer-to-Peer PCI > > memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who > > provided valuable feedback to get these patches to where they are today. > > > > The concept here is to use memory that's exposed on a PCI BAR as > > data buffers in the NVMe target code such that data can be transferred > > from an RDMA NIC to the special memory and then directly to an NVMe > > device avoiding system memory entirely. The upside of this is better > > QoS for applications running on the CPU utilizing memory and lower > > PCI bandwidth required to the CPU (such that systems could be designed > > with fewer lanes connected to the CPU). However, the > > trade-off is presently a reduction in overall throughput. (Largely due > > to hardware issues that would certainly improve in the future). > > > > Due to these trade-offs we've designed the system to only enable using > > the PCI memory in cases where the NIC, NVMe devices and memory are all > > behind the same PCI switch. This will mean many setups that could likely > > work well will not be supported so that we can be more confident it > > will work and not place any responsibility on the user to understand > > their topology. (We chose to go this route based on feedback we > > received at the last LSF). Future work may enable these transfers behind > > a fabric of PCI switches or perhaps using a white list of known good > > root complexes.
> > > > In order to enable this functionality, we introduce a few new PCI > > functions such that a driver can register P2P memory with the system. > > Struct pages are created for this memory using devm_memremap_pages() > > and the PCI bus offset is stored in the corresponding pagemap structure. > > > > Another set of functions allows a client driver to create a list of > > client devices that will be used in a given P2P transaction and then > > use that list to find any P2P memory that is supported by all the > > client devices. This list is then also used to selectively disable the > > ACS bits for the downstream ports behind these devices. > > > > In the block layer, we also introduce a P2P request flag to indicate a > > given request targets P2P memory as well as a flag for a request queue > > to indicate a given queue supports targeting P2P memory. P2P requests > > will only be accepted by queues that support it. Also, P2P requests > > are marked to not be merged since a non-homogeneous request would > > complicate the DMA mapping requirements. > > > > In the PCI NVMe driver, we modify the existing CMB support to utilize > > the new PCI P2P memory infrastructure and also add support for P2P > > memory in its request queue. When a P2P request is received it uses the > > pci_p2pmem_map_sg() function which applies the necessary transformation > > to get the correct pci_bus_addr_t for the DMA transactions. > > > > In the RDMA core, we also adjust rdma_rw_ctx_init() and > > rdma_rw_ctx_destroy() to take a flag
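The "necessary transformation" mentioned for pci_p2pmem_map_sg() boils down to translating a CPU physical address inside the provider's BAR into the PCI bus address the peer device must be given. The sketch below shows only that arithmetic; the helper name and structure are illustrative, not the series' implementation.

#include <linux/pci.h>
#include <linux/scatterlist.h>

static int sketch_p2p_map_sg(struct pci_dev *provider, int bar,
                             struct scatterlist *sgl, int nents)
{
        pci_bus_addr_t bar_bus = pci_bus_address(provider, bar);
        resource_size_t bar_phys = pci_resource_start(provider, bar);
        struct scatterlist *sg;
        int i;

        for_each_sg(sgl, sg, nents, i) {
                /* Translate the CPU physical address to a PCI bus address */
                sg->dma_address = sg_phys(sg) - bar_phys + bar_bus;
                sg_dma_len(sg) = sg->length;
        }

        return nents;
}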
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote: > Hi Everyone, So Oliver (CC) was having issues getting any of that to work for us. The problem is that according to him (I didn't double check the latest patches) you effectively hotplug the PCIe memory into the system when creating struct pages. This cannot possibly work for us. First we cannot map PCIe memory as cacheable. (Note that doing so is a bad idea if you are behind a PLX switch anyway since you'd have to manage cache coherency in SW). Then our MMIO space is so far away from our memory space that there is not enough vmemmap virtual space to be able to do that. So this can only work across architectures by using something like HMM to create special device struct pages. Ben. > Here's v2 of our series to introduce P2P based copy offload to NVMe > fabrics. This version has been rebased onto v4.16-rc3 which already > includes Christoph's devpagemap work the previous version was based > off as well as a couple of the cleanup patches that were in v1. > > Additionally, we've made the following changes based on feedback: > > * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well > as a bunch of cleanup and spelling fixes he pointed out in the last > series. > > * To address Alex's ACS concerns, we change to a simpler method of > just disabling ACS behind switches for any kernel that has > CONFIG_PCI_P2PDMA. > > * We also reject using devices that employ 'dma_virt_ops' which should > fairly simply handle Jason's concerns that this work might break with > the HFI, QIB and rxe drivers that use the virtual ops to implement > their own special DMA operations. > > Thanks, > > Logan > > -- > > This is a continuation of our work to enable using Peer-to-Peer PCI > memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who > provided valuable feedback to get these patches to where they are today. > > The concept here is to use memory that's exposed on a PCI BAR as > data buffers in the NVMe target code such that data can be transferred > from an RDMA NIC to the special memory and then directly to an NVMe > device avoiding system memory entirely. The upside of this is better > QoS for applications running on the CPU utilizing memory and lower > PCI bandwidth required to the CPU (such that systems could be designed > with fewer lanes connected to the CPU). However, the > trade-off is presently a reduction in overall throughput. (Largely due > to hardware issues that would certainly improve in the future). > > Due to these trade-offs we've designed the system to only enable using > the PCI memory in cases where the NIC, NVMe devices and memory are all > behind the same PCI switch. This will mean many setups that could likely > work well will not be supported so that we can be more confident it > will work and not place any responsibility on the user to understand > their topology. (We chose to go this route based on feedback we > received at the last LSF). Future work may enable these transfers behind > a fabric of PCI switches or perhaps using a white list of known good > root complexes. > > In order to enable this functionality, we introduce a few new PCI > functions such that a driver can register P2P memory with the system. > Struct pages are created for this memory using devm_memremap_pages() > and the PCI bus offset is stored in the corresponding pagemap structure.
> > Another set of functions allows a client driver to create a list of > client devices that will be used in a given P2P transaction and then > use that list to find any P2P memory that is supported by all the > client devices. This list is then also used to selectively disable the > ACS bits for the downstream ports behind these devices. > > In the block layer, we also introduce a P2P request flag to indicate a > given request targets P2P memory as well as a flag for a request queue > to indicate a given queue supports targeting P2P memory. P2P requests > will only be accepted by queues that support it. Also, P2P requests > are marked to not be merged since a non-homogeneous request would > complicate the DMA mapping requirements. > > In the PCI NVMe driver, we modify the existing CMB support to utilize > the new PCI P2P memory infrastructure and also add support for P2P > memory in its request queue. When a P2P request is received it uses the > pci_p2pmem_map_sg() function which applies the necessary transformation > to get the correct pci_bus_addr_t for the DMA transactions. > > In the RDMA core, we also adjust rdma_rw_ctx_init() and > rdma_rw_ctx_destroy() to take a flags argument which indicates whether > to use the PCI P2P mapping functions or not. > > Finally, in the NVMe fabrics target port we introduce a new > configuration boolean: 'allow_p2pmem'. When set, the port will attempt > to find P2P memory supported by the RDMA NIC and all namespaces. If > supported memory is found, it will be used in all IO tr