Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-05 Thread Stephen Bates
>Yes i need to document that some more in hmm.txt...

Hi Jerome, thanks for the explanation. Can I suggest you update hmm.txt with
what you sent out?

>  I am about to send RFC for nouveau, i am still working out some bugs.

Great. I will keep an eye out for it. An example user of hmm will be very 
helpful.

> i will fix the MAINTAINERS as part of those.

Awesome, thanks.

Stephen




Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-02 Thread Logan Gunthorpe



On 02/03/18 02:44 PM, Benjamin Herrenschmidt wrote:

Alright, so, I think I have a plan to fix this, but it will take a
little bit of time.

Basically the idea is to have firmware pass to Linux a region that's
known not to have anything in it, which it can use for the vmalloc space,
rather than have Linux arbitrarily cut the address space in half.

I'm pretty sure I can always find large enough "holes" in the physical
address space that are outside of both RAM/OpenCAPI/NVLink and
PCIe/MMIO space. If anything, unused chip IDs. But I don't want Linux
to have to know about the intimate HW details, so I'll pass it from FW.

It will take some time to adjust Linux and get updated FW around,
though.

Once that's done, I'll be able to have the linear mapping go through
the entire 52-bit space (minus that hole). Of course the hole needs to
be large enough to hold a vmemmap for a 52-bit space, so that's about
4TB. So I probably need a hole that's at least 8TB.

As for the mapping attributes, it should be easy for my linear mapping
code to ensure anything that isn't actual RAM is mapped NC.


Very cool. I'm glad to hear you found a way to fix this.

Thanks,

Logan


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-02 Thread Jerome Glisse
On Fri, Mar 02, 2018 at 09:38:43PM +, Stephen  Bates wrote:
> > It seems people misunderstand HMM :(
> 
> Hi Jerome
> 
> Your unhappy face emoticon made me sad so I went off to (re)read up
> on HMM. Along the way I came up with a couple of things.
> 
> While hmm.txt is really nice to read it makes no mention of
> DEVICE_PRIVATE and DEVICE_PUBLIC. It also gives no indication when
> one might choose to use one over the other. Would it be possible to
> update hmm.txt to include some discussion on this? I understand
> that DEVICE_PUBLIC creates a mapping in the kernel's linear address
> space for the device memory and DEVICE_PRIVATE does not. However,
> like I said, I am not sure when you would use either one and the
> pros and cons of doing so. I actually ended up finding some useful
> information in memremap.h but I don't think it is fair to expect
> people to dig *that* deep to find this information ;-).

Yes, I need to document that some more in hmm.txt. DEVICE_PRIVATE is for
devices that have memory which does not fit regular memory expectations
(i.e. it is not cachable), so PCIe device memory falls under that category.
If all you need is struct page for such memory then this is a perfect fit.
On top of that you can use more HMM features, like using this memory
transparently inside a process address space.

DEVICE_PUBLIC is for memory that belongs to a device but can still be
accessed by the CPU in a cache-coherent way (CAPI, CCIX, ...). Again, if
you have such memory and just want struct page you can use that, and again,
if you want to use it inside a process address space HMM provides more
helpers to do so.
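
To make the "just want struct page" case concrete, here is a minimal sketch
against the hmm.h interface of this era (hmm_devmem_add() and friends, which
are mentioned further down in this thread); the ops structure, callbacks and
variable names below are illustrative placeholders, not code from any real
driver:

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/err.h>

/* Called when a device page is released back to the driver. */
static void my_devmem_free(struct hmm_devmem *devmem, struct page *page)
{
        /* hand the backing device memory back to the driver's allocator */
}

/* Called on a CPU fault on DEVICE_PRIVATE memory; a real driver would
 * migrate the data back to system memory here. */
static int my_devmem_fault(struct hmm_devmem *devmem,
                           struct vm_area_struct *vma,
                           unsigned long addr,
                           const struct page *page,
                           unsigned int flags,
                           pmd_t *pmdp)
{
        return VM_FAULT_SIGBUS; /* placeholder: no migration implemented */
}

static const struct hmm_devmem_ops my_devmem_ops = {
        .free  = my_devmem_free,
        .fault = my_devmem_fault,
};

/* In the driver's probe path: register 'size' bytes of device memory and
 * get struct pages for it; for DEVICE_PRIVATE, HMM finds a free physical
 * range by itself. */
struct hmm_devmem *devmem;

devmem = hmm_devmem_add(&my_devmem_ops, &pdev->dev, size);
if (IS_ERR(devmem))
        return PTR_ERR(devmem);

hmm_devmem_add() registers a DEVICE_PRIVATE range; hmm_devmem_add_resource()
is the variant that takes an explicit, CPU-addressable resource (the
DEVICE_PUBLIC case), and hmm_devmem_remove() tears it down again.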


> A quick grep shows no drivers using the HMM API in the upstream code
> today. Is this correct? Are there any examples of out of tree drivers
> that use HMM you can point me to? As a driver developer, what
> resources exist to help me write an HMM-aware driver?

I am about to send an RFC for nouveau; I am still working out some bugs.
I was hoping to be done today but I am still fighting with the hardware.
There are other drivers being worked on with HMM. I do not know exactly
when they will be made public (I expect in the coming months).

How you use HMM, and how you expose it to userspace, is under the control
of the device driver; drivers use it how they want to use it. There is no
pattern or requirement imposed by HMM. All the drivers being worked on so
far are GPU-like hardware, i.e. a big chunk of on-board memory (several
gigabytes), and they want to use that memory inside a process address space
in a fashion that is transparent to the program and the CPU.

Each has its own API exposed to userspace, and while there is a lot of
similarity among them, many details of the userspace API are hardware
specific. In the GPU world most of the driver is in userspace; applications
target high-level APIs such as OpenGL, Vulkan, OpenCL or CUDA. Those APIs
then have a hardware-specific userspace driver that talks to
hardware-specific IOCTLs. So this is not like a network or block device.


> The (very nice) hmm.txt document is not referenced in the MAINTAINERS
> file. You might want to fix that when you have a moment.

I have a couple of small fixes/typo patches that I need to clean up and send;
I will fix the MAINTAINERS entry as part of those.

Cheers,
Jérôme


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-02 Thread Benjamin Herrenschmidt
On Fri, 2018-03-02 at 10:25 +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-01 at 16:19 -0700, Logan Gunthorpe wrote:
> > 
> > On 01/03/18 04:00 PM, Benjamin Herrenschmidt wrote:
> > > We use only 52 in practice but yes.
> > > 
> > > >   That's 64PB. If you need
> > > > a sparse vmemmap for the entire space it will take 16TB which leaves you
> > > > with 63.98PB of address space left. (Similar calculations for other
> > > > numbers of address bits.)
> > > 
> > > We only have 52 bits of virtual space for the kernel with the radix
> > > MMU.
> > 
> > Ok, assuming you only have 52 bits of physical address space: the sparse 
> > vmemmap takes 1TB and you're left with 3.9PB of address space for other 
> > things. So, again, why doesn't that work? Is my math wrong?
> 
> The big problem is not the vmemmap, it's the linear mapping

Alright, so, I think I have a plan to fix this, but it will take a
little bit of time.

Basically the idea is to have firmware pass to Linux a region that's
known not to have anything in it, which it can use for the vmalloc space,
rather than have Linux arbitrarily cut the address space in half.

I'm pretty sure I can always find large enough "holes" in the physical
address space that are outside of both RAM/OpenCAPI/NVLink and
PCIe/MMIO space. If anything, unused chip IDs. But I don't want Linux
to have to know about the intimate HW details, so I'll pass it from FW.

It will take some time to adjust Linux and get updated FW around,
though.

Once that's done, I'll be able to have the linear mapping go through
the entire 52-bit space (minus that hole). Of course the hole needs to
be large enough to hold a vmemmap for a 52-bit space, so that's about
4TB. So I probably need a hole that's at least 8TB.
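
As a sanity check on that 4TB figure, assuming 64KB pages and a 64-byte
struct page (both assumptions, neither is stated explicitly here):

    vmemmap(52-bit) = 2^52 / PAGE_SIZE * sizeof(struct page)
                    = 2^52 / 2^16 * 2^6
                    = 2^42 bytes, i.e. about 4TB

so an 8TB hole leaves comfortable headroom.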

As for the mapping attributes, it should be easy for my linear mapping
code to ensure anything that isn't actual RAM is mapped NC.

Cheers,
Ben.



Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-02 Thread Stephen Bates
> It seems people misunderstand HMM :(

Hi Jerome

Your unhappy face emoticon made me sad so I went off to (re)read up on HMM. 
Along the way I came up with a couple of things.

While hmm.txt is really nice to read it makes no mention of DEVICE_PRIVATE and 
DEVICE_PUBLIC. It also gives no indication when one might choose to use one 
over the other. Would it be possible to update hmm.txt to include some 
discussion on this? I understand that DEVICE_PUBLIC creates a mapping in the 
kernel's linear address space for the device memory and DEVICE_PRIVATE does 
not. However, like I said, I am not sure when you would use either one and the 
pros and cons of doing so. I actually ended up finding some useful information 
in memremap.h but I don't think it is fair to expect people to dig *that* deep 
to find this information ;-).

A quick grep shows no drivers using the HMM API in the upstream code today. Is
this correct? Are there any examples of out-of-tree drivers that use HMM you
can point me to? As a driver developer, what resources exist to help me write
an HMM-aware driver?

The (very nice) hmm.txt document is not referenced in the MAINTAINERS file. You
might want to fix that when you have a moment.

Stephen




Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-02 Thread Kani, Toshi
On Fri, 2018-03-02 at 08:57 -0800, Linus Torvalds wrote:
> On Fri, Mar 2, 2018 at 8:22 AM, Kani, Toshi  wrote:
> > 
> > FWIW, this thing is called MTRRs on x86, which are initialized by BIOS.
> 
> No.
> 
> Or rather, that's simply just another (small) part of it all - and an
> architected and documented one at that.
> 
> Like the page table caching entries, the memory type range registers
> are really just "secondary information". They don't actually select
> between PCIe and RAM, they just affect the behavior on top of that.
> 
> The really nitty-gritty stuff is not architected, and generally not
> documented outside (possibly) the BIOS writer's guide that is not made
> public.
> 
> Those magical registers contain details like how the DRAM is
> interleaved (if it is), what the timings are, where which memory
> controller handles which memory range, and what goes to PCIe etc.
> 
> Basically all the actual *steering* information is very much hidden
> away from the kernel (and often from the BIOS too). The parts we see
> at a higher level are just tuning and tweaks.
> 
> Note: the details differ _enormously_ between different chips. The
> setup can be very different, with things like Knights Landing having
> the external cache that can also act as local memory that isn't a
> cache but maps at a different physical address instead etc. That's the
> kind of steering I'm talking about - at a low level how physical
> addresses get mapped to different cache partitions, memory
> controllers, or to the IO system etc.

Right, the MRC code is not documented publicly, and it is very much CPU
dependent.  It programs the address decoders and maps DRAM to physical
addresses as you described.  MTRRs have nothing to do with this memory
controller setting.  That said, MTRRs specify the CPU's memory access type,
such as UC and WB.

Thanks,
-Toshi


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-02 Thread Linus Torvalds
On Fri, Mar 2, 2018 at 8:57 AM, Linus Torvalds
 wrote:
>
> Like the page table caching entries, the memory type range registers
> are really just "secondary information". They don't actually select
> between PCIe and RAM, they just affect the behavior on top of that.

Side note: historically the two may have been almost the same, since
the CPU only had one single unified bus for "memory" (whether that was
memory-mapped PCI or actual RAM). The steering was external.

But even back then you had extended bits to specify things like how
the 640k-1M region got remapped - which could depend on not just the
address, but on whether you read or wrote to it.  The "lost" 384kB of
RAM could either be remapped at a different address, or could be used
for shadowing the (slow) ROM contents, or whatever.

  Linus


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-02 Thread Linus Torvalds
On Fri, Mar 2, 2018 at 8:22 AM, Kani, Toshi  wrote:
>
> FWIW, this thing is called MTRRs on x86, which are initialized by BIOS.

No.

Or rather, that's simply just another (small) part of it all - and an
architected and documented one at that.

Like the page table caching entries, the memory type range registers
are really just "secondary information". They don't actually select
between PCIe and RAM, they just affect the behavior on top of that.

The really nitty-gritty stuff is not architected, and generally not
documented outside (possibly) the BIOS writer's guide that is not made
public.

Those magical registers contain details like how the DRAM is
interleaved (if it is), what the timings are, where which memory
controller handles which memory range, and what goes to PCIe etc.

Basically all the actual *steering* information is very much hidden
away from the kernel (and often from the BIOS too). The parts we see
at a higher level are just tuning and tweaks.

Note: the details differ _enormously_ between different chips. The
setup can be very different, with things like Knights Landing having
the external cache that can also act as local memory that isn't a
cache but maps at a different physical address instead etc. That's the
kind of steering I'm talking about - at a low level how physical
addresses get mapped to different cache partitions, memory
controllers, or to the IO system etc.

  Linus


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-02 Thread Kani, Toshi
On Fri, 2018-03-02 at 09:34 +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-01 at 14:31 -0800, Linus Torvalds wrote:
> > On Thu, Mar 1, 2018 at 2:06 PM, Benjamin Herrenschmidt  
> > wrote:
> > > 
> > > Could be that x86 has the smarts to do the right thing, still trying to
> > > untangle the code :-)
> > 
> > Afaik, x86 will not cache PCI unless the system is misconfigured, and
> > even then it's more likely to just raise a machine check exception
> > than cache things.
> > 
> > The last-level cache is going to do fills and spills directly to the
> > memory controller, not to the PCIe side of things.
> > 
> > (I guess you *can* do things differently, and I wouldn't be surprised
> > if some people inside Intel did try to do things differently with
> > trying nvram over PCIe, but in general I think the above is true)
> > 
> > You won't find it in the kernel code either. It's in hardware with
> > firmware configuration of what addresses are mapped to the memory
> > controllers (and _how_ they are mapped) and which are not.
> 
> Ah thanks ! Thanks explains. We can fix that on ppc64 in our linear
> mapping code by checking the address vs. memblocks to chose the right
> page table attributes.

FWIW, this thing is called MTRRs on x86, which are initialized by the BIOS.
These registers effectively override the page table setup.  The Intel SDM
defines the effect as follows, where 'PAT Entry Value' is the page table
setup.

MTRR Memory Type   PAT Entry Value   Effective Memory Type
----------------   ---------------   ---------------------
UC                 UC                UC
UC                 WC                WC
UC                 WT                UC
UC                 WB                UC
UC                 WP                UC

On my system, BIOS sets MTRRs to cover the entire MMIO ranges with UC.
Other BIOSes may simply set the MTRR default type to UC, i.e. uncovered
ranges become UC.

# cat /proc/mtrr
 :
reg01: base=0xc00 (12582912MB), size=2097152MB, count=1:
uncachable
 :

# cat /proc/iomem | grep 'PCI Bus'
 :
c00-c3f : PCI Bus :00
c40-c7f : PCI Bus :11
c80-cbf : PCI Bus :36
cc0-cff : PCI Bus :5b
d00-d3f : PCI Bus :80
d40-d7f : PCI Bus :85
d80-dbf : PCI Bus :ae
dc0-dff : PCI Bus :d7

-Toshi




Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-02 Thread Kani, Toshi
On Fri, 2018-03-02 at 09:34 +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-01 at 14:31 -0800, Linus Torvalds wrote:
> > On Thu, Mar 1, 2018 at 2:06 PM, Benjamin Herrenschmidt  
> > wrote:
> > > 
> > > Could be that x86 has the smarts to do the right thing, still trying to
> > > untangle the code :-)
> > 
> > Afaik, x86 will not cache PCI unless the system is misconfigured, and
> > even then it's more likely to just raise a machine check exception
> > than cache things.
> > 
> > The last-level cache is going to do fills and spills directly to the
> > memory controller, not to the PCIe side of things.
> > 
> > (I guess you *can* do things differently, and I wouldn't be surprised
> > if some people inside Intel did try to do things differently with
> > trying nvram over PCIe, but in general I think the above is true)
> > 
> > You won't find it in the kernel code either. It's in hardware with
> > firmware configuration of what addresses are mapped to the memory
> > controllers (and _how_ they are mapped) and which are not.
> 
> Ah thanks ! Thanks explains. We can fix that on ppc64 in our linear
> mapping code by checking the address vs. memblocks to chose the right
> page table attributes.

FWIW, this thing is called MTRRs on x86, which are initialized by BIOS.
These registers effectively overwrite page table setups.  Intel SDM
defines the effect as follows.  'PAT Entry Value' is the page table
setup.

MTRR Memory Type  PAT Entry Value  Effective Memory Type

UCUC   UC
UCWC   WC
UCWT   UC
UCWB   UC
UCWP   UC 

On my system, BIOS sets MTRRs to cover the entire MMIO ranges with UC.
Other BIOSes may simply set the MTRR default type to UC, i.e. uncovered
ranges become UC.

# cat /proc/mtrr
 :
reg01: base=0xc00 (12582912MB), size=2097152MB, count=1:
uncachable
 :

# cat /proc/iomem | grep 'PCI Bus'
 :
c00-c3f : PCI Bus :00
c40-c7f : PCI Bus :11
c80-cbf : PCI Bus :36
cc0-cff : PCI Bus :5b
d00-d3f : PCI Bus :80
d40-d7f : PCI Bus :85
d80-dbf : PCI Bus :ae
dc0-dff : PCI Bus :d7

-Toshi




Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Logan Gunthorpe



On 01/03/18 04:26 PM, Benjamin Herrenschmidt wrote:

The big problem is not the vmemmap, it's the linear mapping.


Ah, yes, ok.

Logan


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Benjamin Herrenschmidt
On Thu, 2018-03-01 at 16:19 -0700, Logan Gunthorpe wrote:

(Switching back to my non-IBM address ...)

> On 01/03/18 04:00 PM, Benjamin Herrenschmidt wrote:
> > We use only 52 in practice but yes.
> > 
> > >   That's 64PB. If you use need
> > > a sparse vmemmap for the entire space it will take 16TB which leaves you
> > > with 63.98PB of address space left. (Similar calculations for other
> > > numbers of address bits.)
> > 
> > We only have 52 bits of virtual space for the kernel with the radix
> > MMU.
> 
> Ok, assuming you only have 52 bits of physical address space: the sparse 
> vmemmap takes 1TB and you're left with 3.9PB of address space for other 
> things. So, again, why doesn't that work? Is my math wrong?

The big problem is not the vmemmap, it's the linear mapping.

Cheers,
Ben.


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Logan Gunthorpe



On 01/03/18 04:00 PM, Benjamin Herrenschmidt wrote:

We use only 52 in practice but yes.


  That's 64PB. If you need
a sparse vmemmap for the entire space it will take 16TB which leaves you
with 63.98PB of address space left. (Similar calculations for other
numbers of address bits.)


We only have 52 bits of virtual space for the kernel with the radix
MMU.


Ok, assuming you only have 52 bits of physical address space: the sparse 
vmemmap takes 1TB and you're left with 3.9PB of address space for other 
things. So, again, why doesn't that work? Is my math wrong?


Logan



Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Benjamin Herrenschmidt
On Thu, 2018-03-01 at 14:57 -0700, Logan Gunthorpe wrote:
> 
> On 01/03/18 02:45 PM, Logan Gunthorpe wrote:
> > It handles it fine for many situations. But when you try to map 
> > something that is at the end of the physical address space then the 
> > sparse vmemmap needs virtual address space that's the size of the 
> > physical address space divided by PAGE_SIZE which may be a little bit 
> > too large...
> 
> Though, considering this more, maybe this shouldn't be a problem...
> 
> Let's say you have 56 bits of address space.

We use only 52 in practice but yes.

>  That's 64PB. If you need 
> a sparse vmemmap for the entire space it will take 16TB which leaves you 
> with 63.98PB of address space left. (Similar calculations for other 
> numbers of address bits.)

We only have 52 bits of virtual space for the kernel with the radix
MMU.

> So I'm not sure what the problem with this is.
> 
> We still have to ensure all the arches map the memory with the right 
> cache bits but that should be relatively easy to solve.
> 
> Logan



Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Benjamin Herrenschmidt
On Thu, 2018-03-01 at 14:31 -0800, Linus Torvalds wrote:
> On Thu, Mar 1, 2018 at 2:06 PM, Benjamin Herrenschmidt  
> wrote:
> > 
> > Could be that x86 has the smarts to do the right thing, still trying to
> > untangle the code :-)
> 
> Afaik, x86 will not cache PCI unless the system is misconfigured, and
> even then it's more likely to just raise a machine check exception
> than cache things.
> 
> The last-level cache is going to do fills and spills directly to the
> memory controller, not to the PCIe side of things.
> 
> (I guess you *can* do things differently, and I wouldn't be surprised
> if some people inside Intel did try to do things differently with
> trying nvram over PCIe, but in general I think the above is true)
> 
> You won't find it in the kernel code either. It's in hardware with
> firmware configuration of what addresses are mapped to the memory
> controllers (and _how_ they are mapped) and which are not.

Ah, thanks! That explains it. We can fix that on ppc64 in our linear
mapping code by checking the address against memblocks to choose the right
page table attributes.
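
Roughly, the idea would be something like the following sketch (the helper
name and placement are illustrative, not the actual powerpc code; memblock
tells us what is real RAM):

/* e.g. somewhere in the linear-mapping setup path,
 * with <linux/memblock.h> and the pgtable headers included */
static pgprot_t linear_map_prot(phys_addr_t pa)
{
        /* Real RAM keeps the normal cachable kernel mapping... */
        if (memblock_is_memory(pa))
                return PAGE_KERNEL;

        /* ...anything else (PCIe BARs hotplugged for P2P, etc.) gets a
         * non-cachable mapping in the linear map. */
        return pgprot_noncached(PAGE_KERNEL);
}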

So the main problem on our side is to figure out the issue of too-big
PFNs. I need to look at this with Aneesh; we might be able to make
things fit with a bit of wrangling.

> You _might_ find it in the BIOS, assuming you understood the tables
> and had the BIOS writer's guide to unravel the magic registers.
> 
> But you might not even find it there. Some of the memory unit timing
> programming is done very early, and by code that Intel doesn't even
> release to the BIOS writers except as a magic encrypted blob, afaik.
> Some of the magic might even be in microcode.
> 
> The page table settings for cacheability are more like a hint, and
> only _part_ of the whole picture. The memory type range registers are
> another part. And magic low-level uarch, northbridge and memory unit
> specific magic is yet another part.
> 
> So you can disable caching for memory, but I'm pretty sure you can't
> enable caching for PCIe at least in the common case. At best you can
> affect how the store buffer works for PCIe.


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Linus Torvalds
On Thu, Mar 1, 2018 at 2:06 PM, Benjamin Herrenschmidt  wrote:
>
> Could be that x86 has the smarts to do the right thing, still trying to
> untangle the code :-)

Afaik, x86 will not cache PCI unless the system is misconfigured, and
even then it's more likely to just raise a machine check exception
than cache things.

The last-level cache is going to do fills and spills directly to the
memory controller, not to the PCIe side of things.

(I guess you *can* do things differently, and I wouldn't be surprised
if some people inside Intel did try to do things differently with
trying nvram over PCIe, but in general I think the above is true)

You won't find it in the kernel code either. It's in hardware with
firmware configuration of what addresses are mapped to the memory
controllers (and _how_ they are mapped) and which are not.

You _might_ find it in the BIOS, assuming you understood the tables
and had the BIOS writer's guide to unravel the magic registers.

But you might not even find it there. Some of the memory unit timing
programming is done very early, and by code that Intel doesn't even
release to the BIOS writers except as a magic encrypted blob, afaik.
Some of the magic might even be in microcode.

The page table settings for cacheability are more like a hint, and
only _part_ of the whole picture. The memory type range registers are
another part. And magic low-level uarch, northbridge and memory unit
specific magic is yet another part.

So you can disable caching for memory, but I'm pretty sure you can't
enable caching for PCIe at least in the common case. At best you can
affect how the store buffer works for PCIe.

  Linus


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Benjamin Herrenschmidt
On Thu, 2018-03-01 at 13:53 -0700, Jason Gunthorpe wrote:
> On Fri, Mar 02, 2018 at 07:40:15AM +1100, Benjamin Herrenschmidt wrote:
> > Also we need to be able to hard block MEMREMAP_WB mappings of non-RAM
> > on ppc64 (maybe via an arch hook as it might depend on the processor
> > family). Server powerpc cannot do cachable accesses on IO memory
> > (unless it's special OpenCAPI or nVlink, but not on PCIe).
> 
> I think you are right on this - even on x86 we must not create
> cachable mappings of PCI BARs - there is no way that works the way
> anyone would expect.
> 
> I think this series doesn't have a problem here only because it never
> touches the BAR pages with the CPU.
> 
> BAR memory should be mapped into the CPU as WC at best on all arches..

Could be that x86 has the smarts to do the right thing, still trying to
untangle the code :-)

Cheers,
Ben.



Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Logan Gunthorpe



On 01/03/18 02:45 PM, Logan Gunthorpe wrote:
It handles it fine for many situations. But when you try to map 
something that is at the end of the physical address space then the 
sparse vmemmap needs virtual address space that's the size of the 
physical address space divided by PAGE_SIZE which may be a little bit 
too large...


Though, considering this more, maybe this shouldn't be a problem...

Let's say you have 56 bits of address space. That's 64PB. If you need 
a sparse vmemmap for the entire space it will take 16TB which leaves you 
with 63.98PB of address space left. (Similar calculations for other 
numbers of address bits.)


So I'm not sure what the problem with this is.

We still have to ensure all the arches map the memory with the right 
cache bits but that should be relatively easy to solve.


Logan


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Logan Gunthorpe



On 01/03/18 02:37 PM, Dan Williams wrote:

Ah ok, I'd need to look at the details. I had been assuming that
sparse-vmemmap could handle such a situation, but that could indeed be
a broken assumption.


It handles it fine for many situations. But when you try to map 
something that is at the end of the physical address space then the 
sparse vmemmap needs virtual address space that's the size of the 
physical address space divided by PAGE_SIZE which may be a little bit 
too large...
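
(More precisely, with the usual assumptions, the sparse vmemmap needs about

    vmemmap size = (physical address space / PAGE_SIZE) * sizeof(struct page)

of virtual address space, i.e. one struct page, typically 64 bytes, per page
frame in the physical address space.)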


Logan


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Stephen Bates
> The intention of HMM is to be useful for all device memory that wishes
> to have struct page for various reasons.

Hi Jerome, and thanks for your input! Understood. We have looked at HMM in the
past, and long term I would definitely like to consider how we can add P2P
functionality to HMM for both DEVICE_PRIVATE and DEVICE_PUBLIC so we can pass
addressable and non-addressable blocks of data between devices. However, that
is well beyond the intentions of this series ;-).

Stephen




Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Dan Williams
On Thu, Mar 1, 2018 at 12:34 PM, Benjamin Herrenschmidt
 wrote:
> On Thu, 2018-03-01 at 11:21 -0800, Dan Williams wrote:
>> On Wed, Feb 28, 2018 at 7:56 PM, Benjamin Herrenschmidt
>>  wrote:
>> > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
>> > > On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
>> > > > Hi Everyone,
>> > >
>> > >
>> > > So Oliver (CC) was having issues getting any of that to work for us.
>> > >
>> > > The problem is that according to him (I didn't double check the latest
>> > > patches) you effectively hotplug the PCIe memory into the system when
>> > > creating struct pages.
>> > >
>> > > This cannot possibly work for us. First we cannot map PCIe memory as
>> > > cachable. (Note that doing so is a bad idea if you are behind a PLX
>> > > switch anyway since you'd have to manage cache coherency in SW).
>> >
>> > Note: I think the above means it won't work behind a switch on x86
>> > either, will it ?
>>
>> The devm_memremap_pages() infrastructure allows placing the memmap in
>> "System-RAM" even if the hotplugged range is in PCI space. So, even if
>> it is an issue on some configurations, it's just a simple adjustment
>> to where the memmap is placed.
>
> But what happens with that PCI memory ? Is it effectively turned into
> normal memory (ie, usable for normal allocations, potentially used to
> populate user pages etc...) or is it kept aside ?
>
> Also on ppc64, the physical addresses of PCIe make it so far apart
> that there's no way we can map them into the linear mapping at the
> normal offset of PAGE_OFFSET + (pfn << PAGE_SHIFT), so things like
> page_address or virt_to_page cannot work as-is on PCIe addresses.

Ah ok, I'd need to look at the details. I had been assuming that
sparse-vmemmap could handle such a situation, but that could indeed be
a broken assumption.
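
For reference, a bare-bones sketch of that call with the devm_memremap_pages()
signature of this era (four arguments); the variable names are made up and the
resource/percpu_ref setup is assumed to be done elsewhere in the driver:

#include <linux/memremap.h>
#include <linux/err.h>

/* 'res' describes the device range (e.g. a PCI BAR) being hotplugged and
 * 'ref' is a driver-owned percpu_ref tracking page references.  Passing a
 * NULL vmem_altmap means the memmap (the struct pages) is allocated from
 * regular System-RAM rather than carved out of the range itself. */
void *addr;

addr = devm_memremap_pages(dev, res, ref, NULL);
if (IS_ERR(addr))
        return PTR_ERR(addr);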


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Jerome Glisse
On Thu, Mar 01, 2018 at 02:15:01PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 01/03/18 02:10 PM, Jerome Glisse wrote:
> > It seems people miss-understand HMM :( you do not have to use all of
> > its features. If all you care about is having struct page then just
> > use that for instance in your case only use those following 3 functions:
> > 
> > hmm_devmem_add() or hmm_devmem_add_resource() and hmm_devmem_remove()
> > for cleanup.
> 
> To what benefit over just using devm_memremap_pages()? If I'm using the hmm
> interface and disabling all the features, I don't see the point. We've also
> cleaned up the devm_memremap_pages() interface to be more usefully generic
> in such a way that I'd hope HMM starts using it too and gets rid of the code
> duplication.
> 

The first HMM variant finds a hole and does not require a resource as an
input parameter. Besides that, internally for PCIe device memory
devm_memremap_pages() does not do the right thing: last time I checked it
always creates a linear mapping of the range, i.e. HMM calls add_pages()
while devm_memremap_pages() calls arch_add_memory().

When I upstreamed HMM, Dan didn't want me to touch devm_memremap_pages()
to match my needs. I am more than happy to modify devm_memremap_pages() to
also handle HMM's needs.

Note that the intention of HMM is to be a middle layer between low level
infrastructure and device driver. Idea is that such impedance layer should
make it easier down the road to change how thing are handled down below
without having to touch many device driver.

Cheers,
Jérôme


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Logan Gunthorpe



On 01/03/18 02:18 PM, Jerome Glisse wrote:

This is pretty easy to do with HMM:

unsigned long hmm_page_to_phys_pfn(struct page *page)
This is not useful unless you want to go through all the kernel paths we 
are using and replace page_to_phys() and friends with something else 
that calls an HMM function when appropriate...


The problem isn't getting the physical address from a page, it's that we 
are passing these pages through various kernel interfaces which expect 
pages that work in the usual manner. (Look at the code: we quite simply 
provide a way to get the PCI bus address from a page when necessary).


Logan


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Jerome Glisse
On Thu, Mar 01, 2018 at 02:11:34PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 01/03/18 02:03 PM, Benjamin Herrenschmidt wrote:
> > However, what happens if anything calls page_address() on them ? Some
> > DMA ops do that for example, or some devices might ...
> 
> Although we could probably work around it with some pain, we rely on
> page_address() and virt_to_phys(), etc to work on these pages. So on x86,
> yes, it makes it into the linear mapping.

This is pretty easy to do with HMM:

unsigned long hmm_page_to_phys_pfn(struct page *page)
{
        struct hmm_devmem *devmem;
        unsigned long ppfn;

        /* Sanity test, maybe BUG_ON() */
        if (!is_device_private_page(page))
                return -1UL;

        devmem = page->pgmap->data;
        ppfn = page_to_pfn(page) - devmem->pfn_first;
        return ppfn + devmem->device_phys_base_pfn;
}

Note that the last field does not exist in today's HMM because I did not need
such a helper so far, but it can be added.

Cheers,
Jérôme


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Logan Gunthorpe



On 01/03/18 02:10 PM, Jerome Glisse wrote:

It seems people misunderstand HMM :( you do not have to use all of
its features. If all you care about is having struct page, then just
use that; for instance, in your case, only use the following 3 functions:

hmm_devmem_add() or hmm_devmem_add_resource() and hmm_devmem_remove()
for cleanup.


To what benefit over just using devm_memremap_pages()? If I'm using the 
hmm interface and disabling all the features, I don't see the point. 
We've also cleaned up the devm_memremap_pages() interface to be more 
usefully generic in such a way that I'd hope HMM starts using it too and 
gets rid of the code duplication.


Logan



Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Logan Gunthorpe



On 01/03/18 02:03 PM, Benjamin Herrenschmidt wrote:

However, what happens if anything calls page_address() on them ? Some
DMA ops do that for example, or some devices might ...


Although we could probably work around it with some pain, we rely on 
page_address() and virt_to_phys(), etc to work on these pages. So on 
x86, yes, it makes it into the linear mapping.


Logan


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Jerome Glisse
On Thu, Mar 01, 2018 at 02:03:26PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 01/03/18 01:55 PM, Jerome Glisse wrote:
> > Well, this is again a new user of struct page for device memory just for
> > one use case. I wanted HMM to be more versatile so that it could be used
> > for this kind of thing too. I guess the message didn't go through. I
> > will take some cycles tomorrow to look into this patchset to ascertain
> > how struct page is used in this context.
> 
> We looked at it but didn't see how any of it was applicable to our needs.
> 

It seems people misunderstand HMM :( you do not have to use all of
its features. If all you care about is having struct page, then just
use that; for instance, in your case, only use the following 3 functions:

hmm_devmem_add() or hmm_devmem_add_resource() and hmm_devmem_remove()
for cleanup.

You can set the fault callback to an empty stub that always returns
VM_FAULT_SIGBUS, or add a patch to allow a NULL callback inside HMM.

You don't have to use the free callback if you don't care, and if there
is something that doesn't quite match what you want, HMM can always be
adjusted to address this.

The intention of HMM is to be useful for all device memory that wishes
to have struct page for various reasons.

Cheers,
Jérôme
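
[Editorial note: for readers following along, a minimal sketch of the "struct page only" usage Jérôme describes above might look roughly like the code below. It assumes the ~v4.16-era include/linux/hmm.h interface (hmm_devmem_add(), hmm_devmem_remove(), struct hmm_devmem_ops); the exact callback prototypes may differ between kernel versions, and p2p_dev/bar_size are placeholder names, not symbols from the patch set under discussion.]

/* Hedged sketch only: stub out the HMM features that are not needed.
 * Callback prototypes follow the ~v4.16 include/linux/hmm.h and may
 * differ on other kernels; p2p_dev and bar_size are placeholders. */
#include <linux/hmm.h>

static int p2p_devmem_fault(struct hmm_devmem *devmem,
                            struct vm_area_struct *vma,
                            unsigned long addr,
                            const struct page *page,
                            unsigned int flags,
                            pmd_t *pmdp)
{
        /* These pages are never handed to userspace, so a fault here
         * is unexpected. */
        return VM_FAULT_SIGBUS;
}

static void p2p_devmem_free(struct hmm_devmem *devmem, struct page *page)
{
        /* Nothing to do for this use case. */
}

static const struct hmm_devmem_ops p2p_devmem_ops = {
        .fault = p2p_devmem_fault,
        .free  = p2p_devmem_free,
};

/* ... in the driver's probe path: */
struct hmm_devmem *devmem;

devmem = hmm_devmem_add(&p2p_devmem_ops, &p2p_dev->dev, bar_size);
if (IS_ERR(devmem))
        return PTR_ERR(devmem);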


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Benjamin Herrenschmidt
On Thu, 2018-03-01 at 11:21 -0800, Dan Williams wrote:
> 
> 
> The devm_memremap_pages() infrastructure allows placing the memmap in
> "System-RAM" even if the hotplugged range is in PCI space. So, even if
> it is an issue on some configurations, it's just a simple adjustment
> to where the memmap is placed.

Actually can you explain a bit more here ?

devm_memremap_pages() doesn't take any specific argument about what to
do with the memory.

It does create the vmemmap sections etc... but does so by calling
arch_add_memory(). So __add_memory() isn't called, which means the
pages aren't added to the linear mapping. Then you manually add them to
ZONE_DEVICE.

Am I correct ?

In that case, they indeed can't be used as normal memory pages, which
is good, and if they are indeed not in the linear mapping, then there
are no caching issues.

However, what happens if anything calls page_address() on them ? Some
DMA ops do that for example, or some devices might ...

This is all quite convoluted with no documentation I can find that
explains the various expectations.

So the question is: are those pages landing in the linear mapping, and
if yes, by what code path ?

The next question is: if we ever want that to work on ppc64, we need a
way to make this fit in our linear mapping and map it non-cachable,
which will require some wrangling of how we handle that mapping.

Cheers,
Ben.
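
[Editorial note: for context, a rough sketch of what a ~v4.16 caller hands to devm_memremap_pages() after the dev_pagemap rework mentioned in the cover letter. This is not the pci_p2pdma code; the dev_pagemap fields used (res, ref) are assumptions based on include/linux/memremap.h of that era, and bar_res/p2p_ref/dev are placeholders. It also illustrates Ben's point: the caller only says where the range is and how its lifetime is managed, nothing about cacheability or how the memory may be used.]

/* Hedged sketch, not the patch set's code. */
#include <linux/memremap.h>

struct dev_pagemap *pgmap;
void *vaddr;

pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL);
if (!pgmap)
        return -ENOMEM;

pgmap->res = *bar_res;  /* physical range of the PCI BAR */
pgmap->ref = &p2p_ref;  /* percpu_ref gating page lifetime */

/* Creates ZONE_DEVICE struct pages for the range (via arch_add_memory());
 * the pages are not handed to the page allocator's free lists. */
vaddr = devm_memremap_pages(dev, pgmap);
if (IS_ERR(vaddr))
        return PTR_ERR(vaddr);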



Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Logan Gunthorpe



On 01/03/18 01:55 PM, Jerome Glisse wrote:

Well, this is again a new user of struct page for device memory just for
one use case. I wanted HMM to be more versatile so that it could be used
for this kind of thing too. I guess the message didn't go through. I
will take some cycles tomorrow to look into this patchset to ascertain
how struct page is used in this context.


We looked at it but didn't see how any of it was applicable to our needs.

Logan


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Logan Gunthorpe



On 01/03/18 01:53 PM, Jason Gunthorpe wrote:

On Fri, Mar 02, 2018 at 07:40:15AM +1100, Benjamin Herrenschmidt wrote:

Also we need to be able to hard block MEMREMAP_WB mappings of non-RAM
on ppc64 (maybe via an arch hook as it might depend on the processor
family). Server powerpc cannot do cachable accesses on IO memory
(unless it's special OpenCAPI or nVlink, but not on PCIe).


I think you are right on this - even on x86 we must not create
cachable mappings of PCI BARs - there is no way that works the way
anyone would expect.


On x86, even if I try to make a cachable mapping of a PCI BAR, it always
ends up being un-cached. The arch code in x86 always does the right
thing here. Other arches, not so much.


Logan


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Jerome Glisse
On Fri, Mar 02, 2018 at 07:29:55AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-01 at 11:04 -0700, Logan Gunthorpe wrote:
> > 
> > On 28/02/18 08:56 PM, Benjamin Herrenschmidt wrote:
> > > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> > > > The problem is that according to him (I didn't double check the latest
> > > > patches) you effectively hotplug the PCIe memory into the system when
> > > > creating struct pages.
> > > > 
> > > > This cannot possibly work for us. First we cannot map PCIe memory as
> > > > cachable. (Note that doing so is a bad idea if you are behind a PLX
> > > > switch anyway since you'd have to manage cache coherency in SW).
> > > 
> > > Note: I think the above means it won't work behind a switch on x86
> > > either, will it ?
> > 
> > This works perfectly fine on x86 behind a switch and we've tested it on 
> > multiple machines. We've never had an issue of running out of virtual 
> > space despite our PCI bars typically being located with an offset of 
> > 56TB or more. The arch code on x86 also somehow figures out not to map 
> > the memory as cachable so that's not an issue (though, at this point, 
> > the CPU never accesses the memory so even if it were, it wouldn't affect 
> > anything).
> 
> Oliver can you look into this ? You said the memory was effectively
> hotplug'ed into the system when creating the struct pages. That would
> mean to me that it's a) mapped (which for us is cachable, maybe x86 has
> tricks to avoid that) and b) potentially used to populate userspace
> pages (that will definitely be cachable). Unless there's something in
> there you didn't see that prevents it.
> 
> > We also had this working on ARM64 a while back but it required some out 
> > of tree ZONE_DEVICE patches and some truly horrid hacks to its arch
> > code to ioremap the memory into the page map.
> > 
> > You didn't mention what architecture you were trying this on.
> 
> ppc64.
> 
> > It may make sense at this point to make this feature dependent on x86 
> > until more work is done to make it properly portable. Something like 
> > arch functions that allow adding IO memory pages to with a specific 
> > cache setting. Though, if an arch has such restrictive limits on the map 
> > size it would probably need to address that too somehow.
> 
> Not a fan of that approach.
> 
> So there are two issues to consider here:
> 
>  - Our MMIO space is very far away from memory (high bits set in the
> address) which causes problems with things like vmemmap, page_address,
> virt_to_page etc... Do you have similar issues on arm64 ?

HMM private (HMM public is different) works around that by looking for
"holes" in the address space and using those for hotplug (ie page_to_pfn()
!= physical pfn of the memory). This is OK for HMM because the memory
is never mapped by the CPU, and we can find the physical pfn with a little
bit of math (page_to_pfn() - page->pgmap->res->start + page->pgmap->dev->
physical_base_address).

To avoid anything going bad I actually do not populate the kernel linear
mapping for the range, hence definitely no CPU access at all through those
struct pages. The CPU can still access the PCIe BAR through the usual MMIO map.

> 
>  - We need to ensure that the mechanism (which I'm not familiar with)
> that you use to create the struct page's for the device don't end up
> turning those device pages into normal "general use" pages for the
> system. Oliver thinks it does, you say it doesn't, ... 
> 
> Jerome (Glisse), what's your take on this ? Smells like something that
> could be covered by HMM...

Well, this is again a new user of struct page for device memory just for
one use case. I wanted HMM to be more versatile so that it could be used
for this kind of thing too. I guess the message didn't go through. I
will take some cycles tomorrow to look into this patchset to ascertain
how struct page is used in this context.

Note that I also want peer-to-peer for HMM users, but with ACS and using the
IOMMU, ie having to populate the IOMMU page table of one device to point to
the BAR of another device. I need to test on how many platforms this works;
hardware engineers are unable/unwilling to commit on whether this works or
not.


> Logan, the only reason you need struct page's to begin with is for the
> DMA API right ? Or am I missing something here ?

If it is only needed for that, this sounds like a waste of memory for
struct page. Though I understand this allows the new API to match the previous
one.

Cheers,
Jérôme


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Logan Gunthorpe



On 01/03/18 01:29 PM, Benjamin Herrenschmidt wrote:

Oliver can you look into this ? You said the memory was effectively
hotplug'ed into the system when creating the struct pages. That would
mean to me that it's a) mapped (which for us is cachable, maybe x86 has
tricks to avoid that) and b) potentially used to populate userspace
pages (that will definitely be cachable). Unless there's something in
there you didn't see that prevents it.


Yes, we've been specifically prohibiting all cases where these pages get 
passed to userspace. We don't want that. Although it works in limited 
cases (ie x86), and we use it for some testing, there are dragons there.



  - Our MMIO space is very far away from memory (high bits set in the
address) which causes problems with things like vmemmap, page_address,
virt_to_page etc... Do you have similar issues on arm64 ?


No similar issues on arm64. Any chance you could simply not map the PCI 
bars that way? What's the point of that? It may simply mean ppc64 can't 
be supported until either that changes or the kernel infrastructure gets 
more sophisticated.



Logan, the only reason you need struct page's to begin with is for the
DMA API right ? Or am I missing something here ?


It's not so much the DMA map API as it is the entire kernel 
infrastructure. Scatterlists (which are universally used to set up DMA 
requests) require pages, and bios require pages, etc., etc. In fact, this 
patch set, in its current form, routes around the DMA API entirely.


Myself[1] and others have done prototype work to migrate away from 
struct pages and use pfn_t instead, but this work doesn't seem to be 
getting very far in the community.


Logan


[1] https://marc.info/?l=linux-kernel&m=149566222124326&w=2
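
[Editorial note: a small illustration of the scatterlist point above; sg_init_table()/sg_set_page() are the ordinary kernel helpers, and p2p_page stands in for one of the ZONE_DEVICE pages being discussed rather than a symbol from the patch set.]

/* Sketch only: the existing block/RDMA plumbing builds scatterlists
 * from struct page, which is why a bare bus address is not enough.
 * p2p_page is a placeholder. */
#include <linux/scatterlist.h>

struct scatterlist sgl[1];

sg_init_table(sgl, 1);
sg_set_page(&sgl[0], p2p_page, PAGE_SIZE, 0);
/* ...sgl is then passed down the usual DMA/bio paths unchanged. */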


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Jason Gunthorpe
On Fri, Mar 02, 2018 at 07:40:15AM +1100, Benjamin Herrenschmidt wrote:
> Also we need to be able to hard block MEMREMAP_WB mappings of non-RAM
> on ppc64 (maybe via an arch hook as it might depend on the processor
> family). Server powerpc cannot do cachable accesses on IO memory
> (unless it's special OpenCAPI or nVlink, but not on PCIe).

I think you are right on this - even on x86 we must not create
cachable mappings of PCI BARs - there is no way that works the way
anyone would expect.

I think this series doesn't have a problem here only because it never
touches the BAR pages with the CPU.

BAR memory should be mapped into the CPU as WC at best on all arches..

Jason
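
[Editorial note: for reference, the conventional ways a driver maps a BAR for CPU access, neither of which is a cacheable (WB) mapping. pci_iomap() and ioremap_wc() are standard kernel APIs; pdev and bar are placeholders.]

/* Sketch: uncached and write-combined mappings of a BAR. */
#include <linux/pci.h>
#include <linux/io.h>

void __iomem *uc = pci_iomap(pdev, bar, 0);             /* uncached MMIO */
void __iomem *wc = ioremap_wc(pci_resource_start(pdev, bar),
                              pci_resource_len(pdev, bar));  /* write-combined */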


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Benjamin Herrenschmidt
On Fri, 2018-03-02 at 07:34 +1100, Benjamin Herrenschmidt wrote:
> 
> But what happens with that PCI memory ? Is it effectively turned into
> normal memory (ie, usable for normal allocations, potentially used to
> populate user pages etc...) or is it kept aside ?

(What I mean is: is it added to the page allocator, basically?)

Also we need to be able to hard block MEMREMAP_WB mappings of non-RAM
on ppc64 (maybe via an arch hook as it might depend on the processor
family). Server powerpc cannot do cachable accesses on IO memory
(unless it's special OpenCAPI or nVlink, but not on PCIe).

> Also on ppc64, the physical addresses of PCIe make it so far apart
> that there's no way we can map them into the linear mapping at the
> normal offset of PAGE_OFFSET + (pfn << PAGE_SHIFT), so things like
> page_address or virt_to_page cannot work as-is on PCIe addresses.

Talking of which ... is there any documentation on the whole
devm_memremap_pages() machinery ? My grep turned up empty...

Cheers,
Ben.



Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Benjamin Herrenschmidt
On Thu, 2018-03-01 at 11:21 -0800, Dan Williams wrote:
> On Wed, Feb 28, 2018 at 7:56 PM, Benjamin Herrenschmidt
>  wrote:
> > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> > > On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
> > > > Hi Everyone,
> > > 
> > > 
> > > So Oliver (CC) was having issues getting any of that to work for us.
> > > 
> > > The problem is that according to him (I didn't double check the latest
> > > patches) you effectively hotplug the PCIe memory into the system when
> > > creating struct pages.
> > > 
> > > This cannot possibly work for us. First we cannot map PCIe memory as
> > > cachable. (Note that doing so is a bad idea if you are behind a PLX
> > > switch anyway since you'd have to manage cache coherency in SW).
> > 
> > Note: I think the above means it won't work behind a switch on x86
> > either, will it ?
> 
> The devm_memremap_pages() infrastructure allows placing the memmap in
> "System-RAM" even if the hotplugged range is in PCI space. So, even if
> it is an issue on some configurations, it's just a simple adjustment
> to where the memmap is placed.

But what happens with that PCI memory ? Is it effectively turned into
normal memory (ie, usable for normal allocations, potentially used to
populate user pages etc...) or is it kept aside ?

Also on ppc64, the physical addresses of PCIe make it so far apart
that there's no way we can map them into the linear mapping at the
normal offset of PAGE_OFFSET + (pfn << PAGE_SHIFT), so things like
page_address or virt_to_page cannot work as-is on PCIe addresses.

Ben.
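
[Editorial note: the identity Ben refers to, in deliberately simplified form; this is not the actual ppc64 implementation, which has more cases, but it shows why a BAR whose physical address has high bits set falls outside what the linear mapping covers.]

/* Simplified statement of the linear-map assumption (not real arch code). */
#define __va(phys)             ((void *)((unsigned long)(phys) + PAGE_OFFSET))
#define page_address(page)     __va(page_to_pfn(page) << PAGE_SHIFT)

/* A PCIe BAR sitting far above the range the linear mapping actually
 * covers yields a virtual address here that is simply not mapped, so
 * page_address()/virt_to_page() cannot be used as-is on such pages. */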



Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Benjamin Herrenschmidt
On Thu, 2018-03-01 at 18:09 +, Stephen  Bates wrote:
> > > So Oliver (CC) was having issues getting any of that to work for us.
> > > 
> > > The problem is that according to him (I didn't double check the latest
> > > patches) you effectively hotplug the PCIe memory into the system when
> > > creating struct pages.
> > > 
> > > This cannot possibly work for us. First we cannot map PCIe memory as
> > > cachable. (Note that doing so is a bad idea if you are behind a PLX
> > > switch anyway since you'd have to manage cache coherency in SW).
> > 
> >   
> >   Note: I think the above means it won't work behind a switch on x86
> >   either, will it ?
> 
>  
> Ben 
> 
> We have done extensive testing of this series and its predecessors
> using PCIe switches from both Broadcom (PLX) and Microsemi. We have
> also done testing on x86_64, ARM64 and ppc64el based ARCH with
> varying degrees of success. The series as it currently stands only
> works on x86_64 but modified (hacky) versions have been made to work
> on ARM64. The x86_64 testing has been done on a range of (Intel)
> CPUs, servers, PCI EPs (including RDMA NICs from at least three
> vendors, NVMe SSDs from at least four vendors and P2P devices from
> four vendors) and PCI switches.
> 
> I do find it slightly offensive that you would question the series
> even working. I hope you are not suggesting we would submit this
> framework multiple times without having done testing on it

No need to get personal on that. I did specify that this was based on
some incomplete understanding of what's going on with that new hack
used to create struct pages.

As it is, however, it cannot work on ppc64: in part because, according to
Oliver, we end up mapping things cachable, and in part because of the
address range issues.

The latter issue might be fundamental to the approach and unfixable
unless we have ways to use hooks for virt_to_page/page_address on these
things.

Ben.



Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Benjamin Herrenschmidt
On Thu, 2018-03-01 at 11:04 -0700, Logan Gunthorpe wrote:
> 
> On 28/02/18 08:56 PM, Benjamin Herrenschmidt wrote:
> > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> > > The problem is that according to him (I didn't double check the latest
> > > patches) you effectively hotplug the PCIe memory into the system when
> > > creating struct pages.
> > > 
> > > This cannot possibly work for us. First we cannot map PCIe memory as
> > > cachable. (Note that doing so is a bad idea if you are behind a PLX
> > > switch anyway since you'd have to manage cache coherency in SW).
> > 
> > Note: I think the above means it won't work behind a switch on x86
> > either, will it ?
> 
> This works perfectly fine on x86 behind a switch and we've tested it on 
> multiple machines. We've never had an issue of running out of virtual 
> space despite our PCI bars typically being located with an offset of 
> 56TB or more. The arch code on x86 also somehow figures out not to map 
> the memory as cachable so that's not an issue (though, at this point, 
> the CPU never accesses the memory so even if it were, it wouldn't affect 
> anything).

Oliver can you look into this ? You said the memory was effectively
hotplug'ed into the system when creating the struct pages. That would
mean to me that it's a) mapped (which for us is cachable, maybe x86 has
tricks to avoid that) and b) potentially used to populate userspace
pages (that will definitely be cachable). Unless there's something in
there you didn't see that prevents it.

> We also had this working on ARM64 a while back but it required some out 
> of tree ZONE_DEVICE patches and some truly horrid hacks to its arch 
> code to ioremap the memory into the page map.
> 
> You didn't mention what architecture you were trying this on.

ppc64.

> It may make sense at this point to make this feature dependent on x86 
> until more work is done to make it properly portable. Something like 
> arch functions that allow adding IO memory pages to with a specific 
> cache setting. Though, if an arch has such restrictive limits on the map 
> size it would probably need to address that too somehow.

Not a fan of that approach.

So there are two issues to consider here:

 - Our MMIO space is very far away from memory (high bits set in the
address) which causes problems with things like vmemmap, page_address,
virt_to_page etc... Do you have similar issues on arm64 ?

 - We need to ensure that the mechanism (which I'm not familiar with)
that you use to create the struct pages for the device doesn't end up
turning those device pages into normal "general use" pages for the
system. Oliver thinks it does, you say it doesn't, ...

Jerome (Glisse), what's your take on this ? Smells like something that
could be covered by HMM...

Logan, the only reason you need struct pages to begin with is for the
DMA API, right ? Or am I missing something here ?

Cheers,
Ben.

> Thanks,
> 
> Logan
> 



Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Logan Gunthorpe



On 01/03/18 03:31 AM, Sagi Grimberg wrote:

* We also reject using devices that employ 'dma_virt_ops' which should
   fairly simply handle Jason's concerns that this work might break with
   the HFI, QIB and rxe drivers that use the virtual ops to implement
   their own special DMA operations.


That's good, but what would happen for these devices? Simply fail the
mapping, causing the ULP to fail its RDMA operation? I would think
that we need a capability flag for devices that support it.


pci_p2pmem_find() will simply not return any devices when any client 
uses dma_virt_ops. So in the NVMe target case it simply will not 
use P2P memory.


And just in case, pci_p2pdma_map_sg() will also return 0 if the device 
passed to it uses dma_virt_ops as well. So if someone bypasses 
pci_p2pmem_find() they will get a failure during map.


Logan
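
[Editorial note: a sketch of the kind of check being described; get_dma_ops() and dma_virt_ops are real kernel symbols, but the helper name p2pdma_client_supported() is hypothetical and is not claimed to match the patch set's actual implementation.]

/* Hedged sketch of the dma_virt_ops rejection described above. */
#include <linux/dma-mapping.h>

static bool p2pdma_client_supported(struct device *dev)
{
#ifdef CONFIG_DMA_VIRT_OPS
        if (get_dma_ops(dev) == &dma_virt_ops)
                return false;   /* e.g. HFI1, QIB, rxe style clients */
#endif
        return true;
}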


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Logan Gunthorpe



On 01/03/18 12:21 PM, Dan Williams wrote:

Note: I think the above means it won't work behind a switch on x86
either, will it ?


The devm_memremap_pages() infrastructure allows placing the memmap in
"System-RAM" even if the hotplugged range is in PCI space. So, even if
it is an issue on some configurations, it's just a simple adjustment
to where the memmap is placed.


Thanks for the confirmation Dan!

Logan



Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Dan Williams
On Wed, Feb 28, 2018 at 7:56 PM, Benjamin Herrenschmidt
 wrote:
> On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
>> On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
>> > Hi Everyone,
>>
>>
>> So Oliver (CC) was having issues getting any of that to work for us.
>>
>> The problem is that according to him (I didn't double check the latest
>> patches) you effectively hotplug the PCIe memory into the system when
>> creating struct pages.
>>
>> This cannot possibly work for us. First we cannot map PCIe memory as
>> cachable. (Note that doing so is a bad idea if you are behind a PLX
>> switch anyway since you'd have to manage cache coherency in SW).
>
> Note: I think the above means it won't work behind a switch on x86
> either, will it ?

The devm_memremap_pages() infrastructure allows placing the memmap in
"System-RAM" even if the hotplugged range is in PCI space. So, even if
it is an issue on some configurations, it's just a simple adjustment
to where the memmap is placed.


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Stephen Bates
>> So Oliver (CC) was having issues getting any of that to work for us.
>> 
>> The problem is that according to him (I didn't double check the latest
>> patches) you effectively hotplug the PCIe memory into the system when
>> creating struct pages.
>> 
>> This cannot possibly work for us. First we cannot map PCIe memory as
>> cachable. (Note that doing so is a bad idea if you are behind a PLX
>> switch anyway since you'd have to manage cache coherency in SW).
>   
>   Note: I think the above means it won't work behind a switch on x86
>   either, will it ?
 
Ben 

We have done extensive testing of this series and its predecessors using PCIe 
switches from both Broadcom (PLX) and Microsemi. We have also done testing on 
x86_64, ARM64 and ppc64el based architectures with varying degrees of success. The 
series as it currently stands only works on x86_64 but modified (hacky) 
versions have been made to work on ARM64. The x86_64 testing has been done on a 
range of (Intel) CPUs, servers, PCI EPs (including RDMA NICs from at least 
three vendors, NVMe SSDs from at least four vendors and P2P devices from four 
vendors) and PCI switches.

I do find it slightly offensive that you would question whether the series even 
works. I hope you are not suggesting we would submit this framework multiple 
times without having done testing on it.

Stephen



Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Logan Gunthorpe



On 28/02/18 08:56 PM, Benjamin Herrenschmidt wrote:

On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:

The problem is that according to him (I didn't double check the latest
patches) you effectively hotplug the PCIe memory into the system when
creating struct pages.

This cannot possibly work for us. First we cannot map PCIe memory as
cachable. (Note that doing so is a bad idea if you are behind a PLX
switch anyway since you'd have to manage cache coherency in SW).


Note: I think the above means it won't work behind a switch on x86
either, will it ?


This works perfectly fine on x86 behind a switch and we've tested it on 
multiple machines. We've never had an issue of running out of virtual 
space despite our PCI bars typically being located with an offset of 
56TB or more. The arch code on x86 also somehow figures out not to map 
the memory as cachable so that's not an issue (though, at this point, 
the CPU never accesses the memory so even if it were, it wouldn't affect 
anything).


We also had this working on ARM64 a while back but it required some out 
of tree ZONE_DEVICE patches and some truly horrid hacks to its arch 
code to ioremap the memory into the page map.


You didn't mention what architecture you were trying this on.

It may make sense at this point to make this feature dependent on x86 
until more work is done to make it properly portable. Something like 
arch functions that allow adding IO memory pages with a specific 
cache setting. Though, if an arch has such restrictive limits on the map 
size it would probably need to address that too somehow.


Thanks,

Logan


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Sagi Grimberg



Hi Everyone,


Hi Logan,


Here's v2 of our series to introduce P2P based copy offload to NVMe
fabrics. This version has been rebased onto v4.16-rc3 which already
includes Christoph's devpagemap work the previous version was based
off as well as a couple of the cleanup patches that were in v1.

Additionally, we've made the following changes based on feedback:

* Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
   as a bunch of cleanup and spelling fixes he pointed out in the last
   series.

* To address Alex's ACS concerns, we change to a simpler method of
   just disabling ACS behind switches for any kernel that has
   CONFIG_PCI_P2PDMA.

* We also reject using devices that employ 'dma_virt_ops' which should
   fairly simply handle Jason's concerns that this work might break with
   the HFI, QIB and rxe drivers that use the virtual ops to implement
   their own special DMA operations.


That's good, but what would happen for these devices? Simply fail the
mapping, causing the ULP to fail its RDMA operation? I would think
that we need a capability flag for devices that support it.
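
To make the question concrete, the kind of up-front rejection being
discussed is roughly the sketch below. The helper name is made up for
illustration; only get_dma_ops() and dma_virt_ops are existing kernel
symbols, and this is not necessarily how the patches implement the check.

#include <linux/device.h>
#include <linux/dma-mapping.h>

/*
 * Rough sketch only: refuse P2P for a device whose DMA mapping is done
 * in software via dma_virt_ops, since a PCI bus address pointing at a
 * peer's BAR would never reach a real DMA engine.
 */
static bool p2pdma_dma_ops_supported(struct device *dev)
{
#ifdef CONFIG_DMA_VIRT_OPS
        if (get_dma_ops(dev) == &dma_virt_ops)
                return false;
#endif
        return true;
}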


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-02-28 Thread Benjamin Herrenschmidt
On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
> Hi Everyone,


So Oliver (CC) was having issues getting any of that to work for us.

The problem is that according to him (I didn't double check the latest
patches) you effectively hotplug the PCIe memory into the system when
creating struct pages.

This cannot possibly work for us. First, we cannot map PCIe memory as
cacheable. (Note that doing so is a bad idea if you are behind a PLX
switch anyway, since you'd have to manage cache coherency in SW.)

Then our MMIO space is so far away from our memory space that there is
not enough vmemmap virtual space to be able to do that.

So this can only work across architectures by using something like HMM
to create special device struct pages.

Ben.


> Here's v2 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.16-rc3 which already
> includes Christoph's devpagemap work the previous version was based
> off as well as a couple of the cleanup patches that were in v1.
> 
> Additionally, we've made the following changes based on feedback:
> 
> * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
>   as a bunch of cleanup and spelling fixes he pointed out in the last
>   series.
> 
> * To address Alex's ACS concerns, we change to a simpler method of
>   just disabling ACS behind switches for any kernel that has
>   CONFIG_PCI_P2PDMA.
> 
> * We also reject using devices that employ 'dma_virt_ops' which should
>   fairly simply handle Jason's concerns that this work might break with
>   the HFI, QIB and rxe drivers that use the virtual ops to implement
>   their own special DMA operations.
> 
> Thanks,
> 
> Logan
> 
> --
> 
> This is a continuation of our work to enable using Peer-to-Peer PCI
> memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
> provided valuable feedback to get these patches to where they are today.
> 
> The concept here is to use memory that's exposed on a PCI BAR as
> data buffers in the NVME target code such that data can be transferred
> from an RDMA NIC to the special memory and then directly to an NVMe
> device avoiding system memory entirely. The upside of this is better
> QoS for applications running on the CPU utilizing memory and lower
> PCI bandwidth required to the CPU (such that systems could be designed
> with fewer lanes connected to the CPU). However, presently, the
> trade-off is currently a reduction in overall throughput. (Largely due
> to hardware issues that would certainly improve in the future).
> 
> Due to these trade-offs we've designed the system to only enable using
> the PCI memory in cases where the NIC, NVMe devices and memory are all
> behind the same PCI switch. This will mean many setups that could likely
> work well will not be supported so that we can be more confident it
> will work and not place any responsibility on the user to understand
> their topology. (We chose to go this route based on feedback we
> received at the last LSF). Future work may enable these transfers behind
> a fabric of PCI switches or perhaps using a white list of known good
> root complexes.
> 
> In order to enable this functionality, we introduce a few new PCI
> functions such that a driver can register P2P memory with the system.
> Struct pages are created for this memory using devm_memremap_pages()
> and the PCI bus offset is stored in the corresponding pagemap structure.
> 
> Another set of functions allows a client driver to create a list of
> client devices that will be used in a given P2P transaction and then
> use that list to find any P2P memory that is supported by all the
> client devices. This list is then also used to selectively disable the
> ACS bits for the downstream ports behind these devices.
> 
> In the block layer, we also introduce a P2P request flag to indicate a
> given request targets P2P memory as well as a flag for a request queue
> to indicate a given queue supports targeting P2P memory. P2P requests
> will only be accepted by queues that support it. Also, P2P requests
> are marked to not be merged, since a non-homogeneous request would
> complicate the DMA mapping requirements.
> 
> In the PCI NVMe driver, we modify the existing CMB support to utilize
> the new PCI P2P memory infrastructure and also add support for P2P
> memory in its request queue. When a P2P request is received it uses the
> pci_p2pmem_map_sg() function which applies the necessary transformation
> to get the correct pci_bus_addr_t for the DMA transactions.
> 
> In the RDMA core, we also adjust rdma_rw_ctx_init() and
> rdma_rw_ctx_destroy() to take a flags argument which indicates whether
> to use the PCI P2P mapping functions or not.
> 
> Finally, in the NVMe fabrics target port we introduce a new
> configuration boolean: 'allow_p2pmem'. When set, the port will attempt
> to find P2P memory supported by the RDMA NIC and all namespaces. If
> supported memory is found, it will be used in all IO 


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-02-28 Thread Benjamin Herrenschmidt
On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
> > Hi Everyone,
> 
> 
> So Oliver (CC) was having issues getting any of that to work for us.
> 
> The problem is that according to him (I didn't double check the latest
> patches) you effectively hotplug the PCIe memory into the system when
> creating struct pages.
> 
> This cannot possibly work for us. First, we cannot map PCIe memory as
> cacheable. (Note that doing so is a bad idea if you are behind a PLX
> switch anyway, since you'd have to manage cache coherency in SW.)

Note: I think the above means it won't work behind a switch on x86
either, will it?

> Then our MMIO space is so far away from our memory space that there is
> not enough vmemmap virtual space to be able to do that.
> 
> So this can only work across architectures by using something like HMM
> to create special device struct pages.
> 
> Ben.
> 
> 
> > Here's v2 of our series to introduce P2P based copy offload to NVMe
> > fabrics. This version has been rebased onto v4.16-rc3 which already
> > includes Christoph's devpagemap work the previous version was based
> > off as well as a couple of the cleanup patches that were in v1.
> > 
> > Additionally, we've made the following changes based on feedback:
> > 
> > * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
> >   as a bunch of cleanup and spelling fixes he pointed out in the last
> >   series.
> > 
> > * To address Alex's ACS concerns, we change to a simpler method of
> >   just disabling ACS behind switches for any kernel that has
> >   CONFIG_PCI_P2PDMA.
> > 
> > * We also reject using devices that employ 'dma_virt_ops' which should
> >   fairly simply handle Jason's concerns that this work might break with
> >   the HFI, QIB and rxe drivers that use the virtual ops to implement
> >   their own special DMA operations.
> > 
> > Thanks,
> > 
> > Logan
> > 
> > --
> > 
> > This is a continuation of our work to enable using Peer-to-Peer PCI
> > memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
> > provided valuable feedback to get these patches to where they are today.
> > 
> > The concept here is to use memory that's exposed on a PCI BAR as
> > data buffers in the NVME target code such that data can be transferred
> > from an RDMA NIC to the special memory and then directly to an NVMe
> > device avoiding system memory entirely. The upside of this is better
> > QoS for applications running on the CPU utilizing memory and lower
> > PCI bandwidth required to the CPU (such that systems could be designed
> > with fewer lanes connected to the CPU). However, presently, the
> > trade-off is currently a reduction in overall throughput. (Largely due
> > to hardware issues that would certainly improve in the future).
> > 
> > Due to these trade-offs we've designed the system to only enable using
> > the PCI memory in cases where the NIC, NVMe devices and memory are all
> > behind the same PCI switch. This will mean many setups that could likely
> > work well will not be supported so that we can be more confident it
> > will work and not place any responsibility on the user to understand
> > their topology. (We chose to go this route based on feedback we
> > received at the last LSF). Future work may enable these transfers behind
> > a fabric of PCI switches or perhaps using a white list of known good
> > root complexes.
> > 
> > In order to enable this functionality, we introduce a few new PCI
> > functions such that a driver can register P2P memory with the system.
> > Struct pages are created for this memory using devm_memremap_pages()
> > and the PCI bus offset is stored in the corresponding pagemap structure.
> > 
> > Another set of functions allows a client driver to create a list of
> > client devices that will be used in a given P2P transaction and then
> > use that list to find any P2P memory that is supported by all the
> > client devices. This list is then also used to selectively disable the
> > ACS bits for the downstream ports behind these devices.
> > 
> > In the block layer, we also introduce a P2P request flag to indicate a
> > given request targets P2P memory as well as a flag for a request queue
> > to indicate a given queue supports targeting P2P memory. P2P requests
> > will only be accepted by queues that support it. Also, P2P requests
> > are marked to not be merged, since a non-homogeneous request would
> > complicate the DMA mapping requirements.
> > 
> > In the PCI NVMe driver, we modify the existing CMB support to utilize
> > the new PCI P2P memory infrastructure and also add support for P2P
> > memory in its request queue. When a P2P request is received it uses the
> > pci_p2pmem_map_sg() function which applies the necessary transformation
> > to get the correct pci_bus_addr_t for the DMA transactions.
> > 
> > In the RDMA core, we also adjust rdma_rw_ctx_init() and
> > rdma_rw_ctx_destroy() to take a 


[PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-02-28 Thread Logan Gunthorpe
Hi Everyone,

Here's v2 of our series to introduce P2P based copy offload to NVMe
fabrics. This version has been rebased onto v4.16-rc3 which already
includes Christoph's devpagemap work the previous version was based
off as well as a couple of the cleanup patches that were in v1.

Additionally, we've made the following changes based on feedback:

* Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
  as a bunch of cleanup and spelling fixes he pointed out in the last
  series.

* To address Alex's ACS concerns, we change to a simpler method of
  just disabling ACS behind switches for any kernel that has
  CONFIG_PCI_P2PDMA.

* We also reject using devices that employ 'dma_virt_ops' which should
  fairly simply handle Jason's concerns that this work might break with
  the HFI, QIB and rxe drivers that use the virtual ops to implement
  their own special DMA operations.

Thanks,

Logan

--

This is a continuation of our work to enable using Peer-to-Peer PCI
memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
provided valuable feedback to get these patches to where they are today.

The concept here is to use memory that's exposed on a PCI BAR as
data buffers in the NVMe target code such that data can be transferred
from an RDMA NIC to the special memory and then directly to an NVMe
device, avoiding system memory entirely. The upside of this is better
QoS for applications running on the CPU and using system memory, and
lower PCI bandwidth required to the CPU (such that systems could be
designed with fewer lanes connected to the CPU). However, the trade-off
is currently a reduction in overall throughput, largely due to hardware
issues that would certainly improve in the future.

Due to these trade-offs, we've designed the system to only enable using
the PCI memory in cases where the NIC, NVMe devices and memory are all
behind the same PCI switch. This means many setups that would likely
work well will not be supported, so that we can be more confident the
feature works and so that no responsibility is placed on the user to
understand their topology. (We chose to go this route based on feedback
we received at the last LSF.) Future work may enable these transfers
behind a fabric of PCI switches or perhaps using a white list of known
good root complexes.
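
In practice this boils down to checking that every client device and the
p2pmem provider sit below the same PCIe upstream switch port. A simplified
sketch of that idea (not the exact helper in the patches) looks like this:

#include <linux/pci.h>

/*
 * Simplified sketch: walk upstream from a device and return the first
 * PCIe upstream switch port found, or NULL if the device hangs directly
 * off a root port.  Devices are treated as P2P-compatible in this scheme
 * only when this returns the same (non-NULL) port for all of them.
 */
static struct pci_dev *p2p_upstream_switch_port(struct pci_dev *pdev)
{
        struct pci_dev *up = pci_upstream_bridge(pdev);

        while (up) {
                if (pci_pcie_type(up) == PCI_EXP_TYPE_UPSTREAM)
                        return up;
                up = pci_upstream_bridge(up);
        }

        return NULL;
}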

In order to enable this functionality, we introduce a few new PCI
functions such that a driver can register P2P memory with the system.
Struct pages are created for this memory using devm_memremap_pages()
and the PCI bus offset is stored in the corresponding pagemap structure.
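
On the provider side, the flow is roughly "hand a BAR (or part of one) to
the p2pdma core, then publish it". A minimal sketch follows; the function
names follow the series' p2pdma naming, but both the names and the exact
signatures shown here should be treated as assumptions:

#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

/*
 * Provider-side sketch (names and signatures are assumptions, not a
 * quote of the patches).  The p2pdma core creates struct pages for the
 * region with devm_memremap_pages() and records the PCI bus offset for
 * later mapping.
 */
static int example_register_p2pmem(struct pci_dev *pdev)
{
        int rc;

        /* Expose all of BAR 4 as P2P DMA memory */
        rc = pci_p2pdma_add_resource(pdev, 4, pci_resource_len(pdev, 4), 0);
        if (rc)
                return rc;

        /* Make it discoverable by client drivers such as nvmet */
        pci_p2pmem_publish(pdev, true);

        return 0;
}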

Another set of functions allows a client driver to create a list of
client devices that will be used in a given P2P transaction and then
use that list to find any P2P memory that is supported by all the
client devices. This list is then also used to selectively disable the
ACS bits for the downstream ports behind these devices.
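
Client-side, the intended use is roughly "build the client list, then ask
for memory that every client can reach". Illustratively (the helper names
below are assumptions about the API, not a quote of the patches):

#include <linux/list.h>
#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

/*
 * Client-side sketch; helper names are assumptions.  The devices that
 * will touch the buffers are collected into a list, the core picks a
 * p2pmem provider reachable by all of them, and the same list drives
 * the selective ACS disabling mentioned above.
 */
static struct pci_dev *example_find_p2pmem(struct device *rdma_dev,
                                           struct device *nvme_dev)
{
        LIST_HEAD(clients);
        struct pci_dev *provider = NULL;

        if (pci_p2pdma_add_client(&clients, rdma_dev) == 0 &&
            pci_p2pdma_add_client(&clients, nvme_dev) == 0)
                provider = pci_p2pmem_find(&clients);

        pci_p2pdma_client_list_free(&clients);

        return provider;        /* caller holds a reference if non-NULL */
}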

In the block layer, we also introduce a P2P request flag to indicate a
given request targets P2P memory as well as a flag for a request queue
to indicate a given queue supports targeting P2P memory. P2P requests
will only be accepted by queues that support it. Also, P2P requests
are marked to not be merged, since a non-homogeneous request would
complicate the DMA mapping requirements.
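
The gatekeeping this describes amounts to something like the check below
(the REQ_PCI_P2PDMA and QUEUE_FLAG_PCI_P2PDMA names are illustrative):

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Illustrative sketch of the block-layer gate described above.  A bio
 * marked as targeting P2P memory is only admitted to a queue that
 * advertises P2P support; such requests are also never merged.
 */
static bool example_queue_accepts_bio(struct request_queue *q,
                                      struct bio *bio)
{
        if (!(bio->bi_opf & REQ_PCI_P2PDMA))
                return true;    /* ordinary request: no restriction */

        return test_bit(QUEUE_FLAG_PCI_P2PDMA, &q->queue_flags);
}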

In the PCI NVMe driver, we modify the existing CMB support to utilize
the new PCI P2P memory infrastructure and also add support for P2P
memory in its request queue. When a P2P request is received, it uses the
pci_p2pmem_map_sg() function, which applies the necessary transformation
to get the correct pci_bus_addr_t for the DMA transactions.
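
In other words, the driver picks the mapping path per request, roughly as
sketched below; pci_p2pmem_map_sg() is the function named above, but the
exact signature shown here is an assumption:

#include <linux/blkdev.h>
#include <linux/dma-mapping.h>
#include <linux/pci-p2pdma.h>

/*
 * Sketch of the per-request mapping decision.  P2P memory already has
 * the PCI bus offset recorded in its pagemap, so it bypasses the regular
 * DMA API; everything else goes through dma_map_sg() as usual.
 */
static int example_map_data(struct device *dmadev, struct request *req,
                            struct scatterlist *sg, int nents)
{
        if (req->cmd_flags & REQ_PCI_P2PDMA)
                return pci_p2pmem_map_sg(sg, nents);

        return dma_map_sg(dmadev, sg, nents,
                          rq_data_dir(req) == WRITE ?
                          DMA_TO_DEVICE : DMA_FROM_DEVICE);
}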

In the RDMA core, we also adjust rdma_rw_ctx_init() and
rdma_rw_ctx_destroy() to take a flags argument which indicates whether
to use the PCI P2P mapping functions or not.
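
So a ULP such as nvmet-rdma only has to pass one extra argument; roughly
(the flag name and its position in the argument list are assumptions):

#include <rdma/rw.h>

/*
 * Sketch of a ULP call site with the new flags argument.  The
 * RDMA_RW_CTX_FLAG_PCI_P2PDMA name and the exact parameter order are
 * assumptions, shown only to make the shape of the change concrete.
 */
static int example_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
                               u8 port_num, struct scatterlist *sg,
                               u32 sg_cnt, u64 remote_addr, u32 rkey,
                               enum dma_data_direction dir, bool use_p2pmem)
{
        unsigned int flags = use_p2pmem ? RDMA_RW_CTX_FLAG_PCI_P2PDMA : 0;

        return rdma_rw_ctx_init(ctx, qp, port_num, sg, sg_cnt, 0,
                                remote_addr, rkey, dir, flags);
}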

Finally, in the NVMe fabrics target port we introduce a new
configuration boolean: 'allow_p2pmem'. When set, the port will attempt
to find P2P memory supported by the RDMA NIC and all namespaces. If
supported memory is found, it will be used in all IO transfers. And if
a port is using P2P memory, adding new namespaces that are not supported
by that memory will fail.

Logan Gunthorpe (10):
  PCI/P2PDMA: Support peer to peer memory
  PCI/P2PDMA: Add sysfs group to display p2pmem stats
  PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
  PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  block: Introduce PCI P2P flags for request and request queue
  IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
  nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  nvme-pci: Add support for P2P memory in requests
  nvme-pci: Add a quirk for a pseudo CMB
  nvmet: Optionally use PCI P2P memory

 Documentation/ABI/testing/sysfs-bus-pci |  25 ++
 block/blk-core.c|   3 +
 drivers/infiniband/core/rw.c|  21 +-
 drivers/infiniband/ulp/isert/ib_isert.c |   5 +-
 
