RE: Enabling peer to peer device transactions for PCIe devices

2017-01-06 Thread Deucher, Alexander
> -----Original Message-----
> From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com]
> Sent: Friday, January 06, 2017 1:26 PM
> To: Jerome Glisse
> Cc: Sagalovitch, Serguei; Jerome Glisse; Deucher, Alexander; 'linux-
> ker...@vger.kernel.org'; 'linux-r...@vger.kernel.org'; 'linux-
> nvd...@lists.01.org'; 'linux-me...@vger.kernel.org'; 'dri-
> de...@lists.freedesktop.org'; 'linux-...@vger.kernel.org'; Kuehling, Felix;
> Blinzer, Paul; Koenig, Christian; Suthikulpanit, Suravee; Sander, Ben;
> h...@infradead.org; Zhou, David(ChunMing); Yu, Qiang
> Subject: Re: Enabling peer to peer device transactions for PCIe devices
> 
> On Fri, Jan 06, 2017 at 12:37:22PM -0500, Jerome Glisse wrote:
> > On Fri, Jan 06, 2017 at 11:56:30AM -0500, Serguei Sagalovitch wrote:
> > > On 2017-01-05 08:58 PM, Jerome Glisse wrote:
> > > > On Thu, Jan 05, 2017 at 05:30:34PM -0700, Jason Gunthorpe wrote:
> > > > > On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:
> > > > >
> > > > > > > > I still don't understand what you are driving at - you've said in
> > > > > > > > both cases a user VMA exists.
> > > > > > > In the former case no, there is no VMA directly, but if you want one
> > > > > > > then a device can provide one. But such a VMA is useless as CPU
> > > > > > > access is not expected.
> > > > > > I disagree it is useless, the VMA is going to be necessary to support
> > > > > > upcoming things like CAPI, you need it to support O_DIRECT from the
> > > > > > filesystem, DPDK, etc. This is why I am opposed to any model that is
> > > > > > not VMA based for setting up RDMA - that is short-sighted and does
> > > > > > not seem to reflect where the industry is going.
> > > > >
> > > > > > So focus on having a VMA backed by actual physical memory that covers
> > > > > > your GPU objects and ask how do we wire up the '__user *' to the DMA
> > > > > > API in the best way so the DMA API still has enough information to
> > > > > > setup IOMMUs and whatnot.
> > > > > I am talking about 2 different things. Existing hardware and API where
> > > > > you _do not_ have a vma and you do not need one. This is just
> > > > > existing stuff.
> 
> > > > I do not understand why you assume that the existing API doesn't need one.
> > > > I would say that a lot of __existing__ user-level APIs and their support
> > > > in the kernel (especially outside of the graphics domain) assume that we
> > > > have a vma and deal with __user * pointers.
> 
> +1
> 
> > Well, I am thinking of GPUDirect here. Some GPUDirect use cases do not have
> > a vma (struct vm_area_struct) associated with them; they apply directly to
> > GPU objects that aren't exposed to the CPU. Yes, some use cases have a vma
> > for a shared buffer.
> 
> Let's stop talking about GPUDirect. Today we can't even make a VMA
> pointing at a PCI BAR work properly in the kernel - let's start there
> please. People can argue over other options once that is done.
> 
> > For HMM the plan is to restrict it to ODP, and either to replace ODP with
> > HMM or to change ODP so that it does not use get_user_pages_remote() but
> > fetches information directly from the CPU page table. Everything else
> > stays as it is. I posted a patchset to replace ODP with HMM in the past.
> 
> Make a generic API for all of this and you'd have my vote..
> 
> IMHO, you must support basic pinning semantics - that is necessary to
> support generic short lived DMA (eg filesystem, etc). That hardware
> can clearly do that if it can support ODP.

We would definitely like to have support for hardware that can't handle page 
faults gracefully.

Alex

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


RE: Enabling peer to peer device transactions for PCIe devices

2016-11-30 Thread Deucher, Alexander
> -----Original Message-----
> From: Haggai Eran [mailto:hagg...@mellanox.com]
> Sent: Wednesday, November 30, 2016 5:46 AM
> To: Jason Gunthorpe
> Cc: linux-ker...@vger.kernel.org; linux-r...@vger.kernel.org; linux-
> nvd...@ml01.01.org; Koenig, Christian; Suthikulpanit, Suravee; Bridgman,
> John; Deucher, Alexander; linux-me...@vger.kernel.org;
> dan.j.willi...@intel.com; log...@deltatee.com; dri-
> de...@lists.freedesktop.org; Max Gurtovoy; linux-...@vger.kernel.org;
> Sagalovitch, Serguei; Blinzer, Paul; Kuehling, Felix; Sander, Ben
> Subject: Re: Enabling peer to peer device transactions for PCIe devices
> 
> On 11/28/2016 9:02 PM, Jason Gunthorpe wrote:
> > On Mon, Nov 28, 2016 at 06:19:40PM +, Haggai Eran wrote:
> >>>> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> >>>> user-space and the GPU not to migrate it. If they do, the MR gets
> >>>> destroyed immediately.
> >>> That sounds horrible. How can that possibly work? What if the MR is
> >>> being used when the GPU decides to migrate?
> >> Naturally this doesn't support migration. The GPU is expected to pin
> >> these pages as long as the MR lives. The MR invalidation is done only as
> >> a last resort to keep system correctness.
> >
> > That just forces applications to handle horrible unexpected
> > failures. If this sort of thing is needed for correctness then OOM
> > kill the offending process, don't corrupt its operation.
> Yes, that sounds fine. Can we simply kill the process from the GPU driver?
> Or do we need to extend the OOM killer to manage GPU pages?

Christian sent out an RFC patch a while back that extended the OOM killer to
cover memory allocated for the GPU:
https://lists.freedesktop.org/archives/dri-devel/2015-September/089778.html

Alex

> 
> >
> >> I think it is similar to how non-ODP MRs rely on user-space today to
> >> keep them correct. If you do something like madvise(MADV_DONTNEED) on a
> >> non-ODP MR's pages, you can still get yourself into a data corruption
> >> situation (HCA sees one page and the process sees another for the same
> >> virtual address). The pinning that we use only guarantees the HCA's page
> >> won't be reused.
> >
> > That is not really data corruption - the data still goes where it was
> > originally destined. That is an application violating the
> > requirements of a MR.
> I guess it is a matter of terminology. If you compare it to the ODP case
> or the CPU case then you usually expect a single virtual address to map to
> a single physical page. Violating this causes some of your writes to be
> dropped, which is data corruption in my book, even if the application
> caused it.
> 
> > An application cannot munmap/mremap a VMA
> > while a non ODP MR points to it and then keep using the MR.
> Right. And it is perfectly fine to have some similar requirements from the
> application when doing peer to peer with a non-ODP MR.
> 
> > That is totally different from a GPU driver wanting to mess with
> > translation to physical pages.
> >
> >>> From what I understand we are not really talking about kernel p2p,
> >>> everything proposed so far is being mediated by a userspace VMA, so
> >>> I'd focus on making that work.
> >
> >> Fair enough, although we will need both eventually, and I hope the
> >> infrastructure can be shared to some degree.
> >
> > What use case do you see for in kernel?
> Two cases I can think of are RDMA access to an NVMe device's controller
> memory buffer, and O_DIRECT operations that access GPU memory.
> Also, HMM's migration between two GPUs could use peer to peer in the
> kernel, although that is intended to be handled by the GPU driver if I
> understand correctly.
> 
> > Presumably in-kernel could use a vmap or something and the same basic
> > flow?
> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API
> support for peer to peer. I'm not sure we need vmap. We need a way to have
> a scatterlist of MMIO pfns, and ZONE_DEVICE allows that.
> 
> Haggai
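
The madvise(MADV_DONTNEED) case mentioned above can be illustrated from
user space. The following is only a minimal sketch, assuming Linux and
Python 3.8 or later; the function name dontneed_demo is hypothetical and no
RDMA device or MR is involved. It shows the process-side half of the
problem: after MADV_DONTNEED on a private anonymous mapping, the same
virtual address is backed by a fresh zero-filled page. A device that had
pinned the old page would presumably keep using that old page, which is the
"HCA sees one page and the process sees another" situation described in the
quoted text.

```python
import mmap


def dontneed_demo(fill=0x42):
    """Return the first bytes of a private anonymous page before and after
    madvise(MADV_DONTNEED). Illustrative only; assumes Linux."""
    size = mmap.PAGESIZE
    # Private anonymous mapping, one page long.
    m = mmap.mmap(-1, size,
                  flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                  prot=mmap.PROT_READ | mmap.PROT_WRITE)
    try:
        # Touch the page so it is backed by a real physical page.
        m[:] = bytes([fill]) * size
        before = m[:4]
        # Tell the kernel the contents are no longer needed. For a private
        # anonymous mapping the next access gets a new zero-filled page.
        m.madvise(mmap.MADV_DONTNEED)
        after = m[:4]
        return before, after
    finally:
        m.close()


if __name__ == "__main__":
    before, after = dontneed_demo()
    print("before:", before.hex(), "after:", after.hex())
```

Nothing here pins the page, so the old physical page is simply freed; with a
non-ODP MR the pin would keep the old page alive for the HCA while the
process moves on to the new one.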