> -----Original Message-----
> From: Haggai Eran [mailto:haggaie at mellanox.com]
> Sent: Wednesday, November 30, 2016 5:46 AM
> To: Jason Gunthorpe
> Cc: linux-kernel at vger.kernel.org; linux-rdma at vger.kernel.org;
> linux-nvdimm at ml01.01.org; Koenig, Christian; Suthikulpanit, Suravee;
> Bridgman, John; Deucher, Alexander; Linux-media at vger.kernel.org;
> dan.j.williams at intel.com; logang at deltatee.com;
> dri-devel at lists.freedesktop.org; Max Gurtovoy; linux-pci at vger.kernel.org;
> Sagalovitch, Serguei; Blinzer, Paul; Kuehling, Felix; Sander, Ben
> Subject: Re: Enabling peer to peer device transactions for PCIe devices
>
> On 11/28/2016 9:02 PM, Jason Gunthorpe wrote:
> > On Mon, Nov 28, 2016 at 06:19:40PM +0000, Haggai Eran wrote:
> >>>> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> >>>> user-space and the GPU not to migrate it. If they do, the MR gets
> >>>> destroyed immediately.
> >>> That sounds horrible. How can that possibly work? What if the MR is
> >>> being used when the GPU decides to migrate?
> >> Naturally this doesn't support migration. The GPU is expected to pin
> >> these pages as long as the MR lives. The MR invalidation is done only as
> >> a last resort to keep system correctness.
> >
> > That just forces applications to handle horrible unexpected
> > failures. If this sort of thing is needed for correctness then OOM
> > kill the offending process, don't corrupt its operation.
> Yes, that sounds fine. Can we simply kill the process from the GPU driver?
> Or do we need to extend the OOM killer to manage GPU pages?
Christian sent out an RFC patch a while back that extended the OOM killer to
cover memory allocated for the GPU:

https://lists.freedesktop.org/archives/dri-devel/2015-September/089778.html

Alex

>
> >
> >> I think it is similar to how non-ODP MRs rely on user-space today to
> >> keep them correct. If you do something like madvise(MADV_DONTNEED) on a
> >> non-ODP MR's pages, you can still get yourself into a data corruption
> >> situation (HCA sees one page and the process sees another for the same
> >> virtual address). The pinning that we use only guarantees the HCA's page
> >> won't be reused.
> >
> > That is not really data corruption - the data still goes where it was
> > originally destined. That is an application violating the
> > requirements of a MR.
> I guess it is a matter of terminology. If you compare it to the ODP case
> or the CPU case then you usually expect a single virtual address to map to
> a single physical page. Violating this causes some of your writes to be
> dropped, which is data corruption in my book, even if the application
> caused it.
>
> > An application cannot munmap/mremap a VMA
> > while a non ODP MR points to it and then keep using the MR.
> Right. And it is perfectly fine to have some similar requirements from the
> application when doing peer to peer with a non-ODP MR.
>
> > That is totally different from a GPU driver wanting to mess with
> > translation to physical pages.
> >
> >>> From what I understand we are not really talking about kernel p2p,
> >>> everything proposed so far is being mediated by a userspace VMA, so
> >>> I'd focus on making that work.
> >
> >> Fair enough, although we will need both eventually, and I hope the
> >> infrastructure can be shared to some degree.
> >
> > What use case do you see for in kernel?
> Two cases I can think of are RDMA access to an NVMe device's controller
> memory buffer, and O_DIRECT operations that access GPU memory.
> Also, HMM's migration between two GPUs could use peer to peer in the
> kernel, although that is intended to be handled by the GPU driver if I
> understand correctly.
>
> > Presumably in-kernel could use a vmap or something and the same basic
> > flow?
> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API
> support for peer to peer. I'm not sure we need vmap. We need a way to have
> a scatterlist of MMIO pfns, and ZONE_DEVICE allows that.
>
> Haggai