Thank you, Felix, for sharing. See a few comments inline.

> -----Original Message-----
> From: Felix Kuehling <felix.kuehl...@amd.com>
> Sent: Tuesday, January 23, 2024 3:17 PM
> To: Zeng, Oak <oak.z...@intel.com>; Christian König <christian.koe...@amd.com>;
> Danilo Krummrich <d...@redhat.com>; Dave Airlie <airl...@redhat.com>; Daniel
> Vetter <dan...@ffwll.ch>
> Cc: Welty, Brian <brian.we...@intel.com>; dri-devel@lists.freedesktop.org;
> intel-x...@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bo...@intel.com>;
> Ghimiray, Himal Prasad <himal.prasad.ghimi...@intel.com>;
> thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana
> <niranjana.vishwanathap...@intel.com>; Brost, Matthew
> <matthew.br...@intel.com>; Gupta, saurabhg <saurabhg.gu...@intel.com>
> Subject: Re: Making drm_gpuvm work across gpu devices
>
> On 2024-01-23 14:37, Zeng, Oak wrote:
> > Thanks Christian. I have some comment inline below.
> >
> > Danilo, can you also take a look and give your feedback? Thanks.
>
> Sorry, just catching up with this thread now. I'm also not familiar with
> drm_gpuvm.
>
> Some general observations based on my experience with KFD, amdgpu and
> SVM. With SVM we have a single virtual address space managed in user
> mode (basically using mmap) with attributes per virtual address range
> maintained in the kernel mode driver. Different devices get their
> mappings of pieces of that address space using the same virtual
> addresses. We also support migration to different DEVICE_PRIVATE memory
> spaces.
I think the same virtual address can be mapped into different devices. For different devices, reading from the same virtual address returns the same content. The driver either maps the page tables to point to the same physical location, or migrates before mapping. I guess this is what you imply.

> However, we still have page tables managed per device. Each device can
> have different page table formats and layout (e.g. different GPU
> generations in the same system) and the same memory may be mapped with
> different flags on different devices in order to get the right coherence
> behaviour. We also need to maintain per-device DMA mappings somewhere.
> That means, as far as the device page tables are concerned, we still
> have separate address spaces. SVM only adds a layer on top, which
> coordinates these separate device virtual address spaces so that some
> parts of them provide the appearance of a shared virtual address space.

Yes, exactly the same understanding.

> At some point you need to decide, where you draw the boundary between
> managing a per-process shared virtual address space and managing
> per-device virtual address spaces. In amdgpu that boundary is currently
> where kfd_svm code calls amdgpu_vm code to manage the per-device page
> tables.

Exactly; in the xe driver it is xe_svm and xe_vm. Just different names 😊

> In the amdgpu driver, we still have the traditional memory management
> APIs in the render nodes that don't do SVM. They share the device
> virtual address spaces with SVM. We have to be careful that we don't try
> to manage the same device virtual address ranges with these two
> different memory managers. In practice, we let the non-SVM APIs use the
> upper half of the canonical address space, while the lower half can be
> used almost entirely for SVM.

In xekmd, we also have to support mixed usage of the traditional gem_create/vm_bind API and malloc. I was just wondering why you had to split the canonical address space between those two usage models. To illustrate the two models:

Traditional model:
    ptr = mmap(anonymous)
    vm_bind(ptr, bo)
    submit gpu kernel using ptr

System allocator model:
    ptr = mmap(anonymous) or malloc()
    submit gpu kernel using ptr

The point is, both ptrs are allocated anonymously in user space inside one process address space, so there is no collision even if you don't deliberately divide the canonical address space.
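A trivial userspace illustration of that point; nothing here is driver-specific, it just shows that the two allocations come out of the same process address space and therefore can never overlap:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
	size_t sz = 1 << 20;

	/* "traditional" buffer: anonymous mmap that user space would then
	 * vm_bind to a BO before submitting work that uses it */
	void *bound = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* "system allocator" buffer: plain malloc, no binding call at all */
	void *svm = malloc(sz);

	/* both ranges are handed out by the same process address space, so
	 * they never collide, without partitioning the canonical space */
	printf("bound=%p svm=%p\n", bound, svm);
	return 0;
}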
Thanks,
Oak

>
> Regards,
> Felix
>
> >> -----Original Message-----
> >> From: Christian König <christian.koe...@amd.com>
> >> Sent: Tuesday, January 23, 2024 6:13 AM
> >> To: Zeng, Oak <oak.z...@intel.com>; Danilo Krummrich <d...@redhat.com>;
> >> Dave Airlie <airl...@redhat.com>; Daniel Vetter <dan...@ffwll.ch>
> >> Cc: Welty, Brian <brian.we...@intel.com>; dri-devel@lists.freedesktop.org;
> >> intel-x...@lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bo...@intel.com>;
> >> Ghimiray, Himal Prasad <himal.prasad.ghimi...@intel.com>;
> >> thomas.hellst...@linux.intel.com; Vishwanathapura, Niranjana
> >> <niranjana.vishwanathap...@intel.com>; Brost, Matthew
> >> <matthew.br...@intel.com>
> >> Subject: Re: Making drm_gpuvm work across gpu devices
> >>
> >> Hi Oak,
> >>
> >> Am 23.01.24 um 04:21 schrieb Zeng, Oak:
> >>> Hi Danilo and all,
> >>>
> >>> During the work of Intel's SVM code, we came up the idea of making
> >>> drm_gpuvm to work across multiple gpu devices. See some discussion here:
> >>> https://lore.kernel.org/dri-devel/PH7PR11MB70049E7E6A2F40BF6282ECC292742@PH7PR11MB7004.namprd11.prod.outlook.com/
> >>>
> >>> The reason we try to do this is, for a SVM (shared virtual memory across cpu
> >>> program and all gpu program on all gpu devices) process, the address space has
> >>> to be across all gpu devices. So if we make drm_gpuvm to work across devices,
> >>> then our SVM code can leverage drm_gpuvm as well.
> >>>
> >>> At a first look, it seems feasible because drm_gpuvm doesn't really use the
> >>> drm_device *drm pointer a lot. This param is used only for printing/warning. So I
> >>> think maybe we can delete this drm field from drm_gpuvm.
> >>>
> >>> This way, on a multiple gpu device system, for one process, we can have only
> >>> one drm_gpuvm instance, instead of multiple drm_gpuvm instances (one for
> >>> each gpu device).
> >>>
> >>> What do you think?
> >>
> >> Well from the GPUVM side I don't think it would make much difference if
> >> we have the drm device or not.
> >>
> >> But the experience we had with the KFD I think I should mention that we
> >> should absolutely *not* deal with multiple devices at the same time in
> >> the UAPI or VM objects inside the driver.
> >>
> >> The background is that all the APIs inside the Linux kernel are build
> >> around the idea that they work with only one device at a time. This
> >> accounts for both low level APIs like the DMA API as well as pretty high
> >> level things like for example file system address space etc...
> >
> > Yes most API are per device based.
> >
> > One exception I know is actually the kfd SVM API. If you look at the
> > svm_ioctl function, it is per-process based. Each kfd_process represent a process
> > across N gpu devices. Cc Felix.
> >
> > Need to say, kfd SVM represent a shared virtual address space across CPU and all
> > GPU devices on the system. This is by the definition of SVM (shared virtual memory).
> > This is very different from our legacy gpu *device* driver which works for only one
> > device (i.e., if you want one device to access another device's memory, you will have
> > to use dma-buf export/import etc).
> >
> > We have the same design requirement of SVM. For anyone who want to
> > implement the SVM concept, this is a hard requirement. Since now drm has the
> > drm_gpuvm concept which strictly speaking is designed for one device, I want to see
> > whether we can extend drm_gpuvm to make it work for both single device (as used
> > in xe) and multipe devices (will be used in the SVM code). That is why I brought up
> > this topic.
> >
> >> So when you have multiple GPUs you either have an inseparable cluster of
> >> them which case you would also only have one drm_device. Or you have
> >> separated drm_device which also results in separate drm render nodes and
> >> separate virtual address spaces and also eventually separate IOMMU
> >> domains which gives you separate dma_addresses for the same page and so
> >> separate GPUVM page tables....
> >
> > I am thinking we can still make each device has its separate drm_device/render
> > node/iommu domains/gpu page table. Just as what we have today. I am not plan to
> > change this picture.
> >
> > But the virtual address space will support two modes of operation:
> > 1. one drm_gpuvm per device. This is when svm is not in the picture
> > 2. all devices in the process share one single drm_gpuvm, when svm is in the picture.
> >
> > In xe driver design, we have to support a mixture use of legacy mode (such
> > as gem_create and vm_bind) and svm (such as malloc'ed memory for gpu
> > submission). So whenever SVM is in the picture, we want one single process address
> > space across all devices. Drm_gpuvm doesn't need to be aware of those two
> > operation modes. It is driver's responsibility to use different mode.
> >
> > For example, in mode #1, a driver's vm structure (such as xe_vm) can inherit from
> > drm_gpuvm. In mode #2, a driver's svm structure (xe_svm in this series:
> > https://lore.kernel.org/dri-devel/20240117221223.18540-1-oak.z...@intel.com/)
> > can inherit from drm_gpuvm while each xe_vm (still a per-device based struct) will
> > just have a pointer to the drm_gpuvm structure. This way when svm is in play, we
> > build a 1 process:1 mm_struct:1 xe_svm:N xe_vm correlations which means shared
> > address space across gpu devices.
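To make the mode #2 layering described above a bit more concrete, here is a rough sketch of what I have in mind; the struct layout and field names are illustrative only, not actual xe code:

/* sketch only: assumes <drm/drm_gpuvm.h> for struct drm_gpuvm */

struct xe_svm {				/* one per process (per mm_struct) */
	struct drm_gpuvm gpuvm;		/* the single shared virtual address space */
	struct mm_struct *mm;		/* the CPU address space it mirrors */
	struct list_head vm_list;	/* all the per-device xe_vm's below */
};

struct xe_vm {				/* still one per device, as today */
	struct xe_device *xe;		/* device page tables, DMA mappings, ... */
	struct drm_gpuvm *gpuvm;	/* mode #2: points at xe_svm.gpuvm;
					 * in mode #1 this would instead be an
					 * embedded drm_gpuvm, one per device */
};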
> >
> > This requires some changes of drm_gpuvm design:
> > 1. The drm_device *drm pointer, in mode #2 operation, this can be NULL, means
> > this drm_gpuvm is not for specific gpu device
> > 2. The common dma_resv object: drm_gem_object *r_obj. *Does one dma_resv
> > object allocated/initialized for one device work for all devices*? From first look,
> > dma_resv is just some CPU structure maintaining dma-fences. So I guess it should
> > work for all devices? I definitely need you to comment.
> >
> >> It's up to you how to implement it, but I think it's pretty clear that
> >> you need separate drm_gpuvm objects to manage those.
> >
> > As explained above, I am thinking of one drm_gpuvm object across all devices
> > when SVM is in the picture...
> >
> >> That you map the same thing in all those virtual address spaces at the
> >> same address is a completely different optimization problem I think.
> >
> > Not sure I follow here... the requirement from SVM is, one virtual address points to
> > same physical backing store. For example, whenever CPU or any GPU device access
> > this virtual address, it refers to the same physical content. Of course the physical
> > backing store can be migrated b/t host memory and any of the GPU's device memory,
> > but the content should be consistent.
> >
> > So we are mapping same physical content to the same virtual address in either cpu
> > page table or any gpu device's page table...
> >
> >> What we could certainly do is to optimize hmm_range_fault by making
> >> hmm_range a reference counted object and using it for multiple devices
> >> at the same time if those devices request the same range of an mm_struct.
> >
> > Not very follow. If you are trying to resolve a multiple devices concurrent access
> > problem, I think we should serialize concurrent device fault to one address range.
> > The reason is, during device fault handling, we might migrate the backing store so
> > hmm_range->hmm_pfns[] might have changed after one device access it.
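To spell out what I mean by serializing, a rough sketch; the svm_* types and helpers are made up for illustration, not actual code. The point is just that each range takes a per-range lock across the migrate and map steps, so the next device's fault runs hmm_range_fault() only after any migration has settled:

/* sketch only: assumes hypothetical struct svm_range / struct svm_device */
int svm_range_fault(struct svm_range *range, struct svm_device *dev,
		    u64 addr, bool write)
{
	int ret;

	/* one device fault at a time per address range */
	mutex_lock(&range->fault_lock);

	/* may move the backing store between system and device memory */
	ret = svm_maybe_migrate(range, dev, addr, write);
	if (!ret)
		/* hmm_range_fault() + fill this device's own page tables */
		ret = svm_map_on_device(range, dev);

	mutex_unlock(&range->fault_lock);
	return ret;
}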
> >
> >> I think if you start using the same drm_gpuvm for multiple devices you
> >> will sooner or later start to run into the same mess we have seen with
> >> KFD, where we moved more and more functionality from the KFD to the DRM
> >> render node because we found that a lot of the stuff simply doesn't work
> >> correctly with a single object to maintain the state.
> >
> > As I understand it, KFD is designed to work across devices. A single pseudo /dev/kfd
> > device represent all hardware gpu devices. That is why during kfd open, many pdd
> > (process device data) is created, each for one hardware device for this process. Yes
> > the codes are a little complicated.
> >
> > Kfd manages the shared virtual address space in the kfd driver codes, like the split,
> > merging etc. Here I am looking whether we can leverage the drm_gpuvm code for
> > those functions.
> >
> > As of the shared virtual address space across gpu devices, it is a hard requirement
> > for svm/system allocator (aka malloc for gpu program). We need to make it work
> > either at driver level or drm_gpuvm level. Drm_gpuvm is better because the work
> > can be shared b/t drivers.
> >
> > Thanks a lot,
> > Oak
> >
> >> Just one more point to your original discussion on the xe list: I think
> >> it's perfectly valid for an application to map something at the same
> >> address you already have something else.
> >>
> >> Cheers,
> >> Christian.
> >>
> >>> Thanks,
> >>> Oak
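On Christian's last point above: plain mmap() with MAP_FIXED indeed allows exactly that; the new mapping simply replaces whatever was already mapped in that range, for example:

#include <assert.h>
#include <sys/mman.h>

int main(void)
{
	size_t sz = 1 << 20;
	void *a = mmap(NULL, sz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* map something else at the same address: MAP_FIXED atomically
	 * replaces the existing pages in [a, a + sz) */
	void *b = mmap(a, sz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	assert(a == b);
	return 0;
}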