Re: [HMM 00/16] HMM (Heterogeneous Memory Management) v19
On Thu, Apr 06, 2017 at 11:22:12AM +0800, Figo.zhang wrote:
[...]
> > Heterogeneous Memory Management (HMM) (description and justification)
> >
> > Today device drivers expose dedicated memory allocation APIs through their
> > device files, often relying on a combination of IOCTL and mmap calls. The
> > device can only access and use memory allocated through this API. This
> > effectively splits the program address space into objects allocated for the
> > device and usable by the device, and other regular memory (malloc, mmap of
> > a file, shared memory, …) only accessible by the CPU (or in a very limited
> > way by a device, by pinning memory).
> >
> > Allowing different isolated components of a program to use a device thus
> > requires duplicating the input data structures using the device memory
> > allocator. This is reasonable for simple data structures (arrays, grids,
> > images, …) but it gets extremely complex with advanced data structures
> > (lists, trees, graphs, …) that rely on a web of memory pointers. This is
> > becoming a serious limitation on the kind of workloads that can be
> > offloaded to devices like GPUs.
>
> How is this handled by the current GPU software stack? By maintaining a
> complex middle framework/HAL?

Yes, you still need a framework like OpenCL or CUDA. There is work under way
to leverage GPUs directly from languages like C++, so I expect the HAL to be
hidden more and more from a larger group of programmers. Note that I still
expect some programmers will want to program closer to the hardware to
extract every bit of performance they can.

For OpenCL you need HMM to implement what is described as the fine-grained
system SVM memory model (see the OpenCL 2.0 or later specification).

> > New industry standards like C++, OpenCL or CUDA are pushing to remove this
> > barrier. This requires a shared address space between the GPU device and
> > the CPU so that the GPU can access any memory of a process (while still
> > obeying memory protections like read-only).
> Can the GPU access the whole process's VMAs, or only VMAs whose backing
> system memory has been migrated to the GPU page table?

The whole process's VMAs; memory does not need to be migrated to device
memory. Migration is an optional feature that is necessary for performance,
but the GPU can access system memory just fine.

[...]
> > When a page backing an address of a process is migrated to device memory,
> > the CPU page table entry is set to a new specific swap entry. CPU access
> > to such an address triggers a migration back to system memory, just as if
> > the page had been swapped to disk. HMM also blocks anyone from pinning a
> > ZONE_DEVICE page so that it can always be migrated back to system memory
> > if the CPU accesses it. Conversely, HMM does not migrate to device memory
> > any page that is pinned in system memory.
>
> Is the purpose of migrating system pages to the device so that the device
> can read system memory?
> If the CPU/programs want to read device data, do they need to pin/map the
> device memory into the process address space?
> If multiple applications want to read the same device memory region
> concurrently, how is that done?

The purpose of migrating to device memory is to leverage device memory
bandwidth. PCIe bandwidth is 32GB/s, while device memory bandwidth is between
256GB/s and 1TB/s; device memory also has smaller latency.

The CPU can not access device memory. It can in a limited way over PCIe, but
that would violate the memory model programmers get for regular system
memory, hence for all intents and purposes it is better to say that the CPU
can not access any of the device memory.

Shared VMAs just work, so if a VMA is shared between 2 processes then both
processes can access the same memory. All the semantics that are valid on the
CPU are also valid on the GPU. Nothing changes there.

> It would be better to have a graph showing how the CPU and GPU share the
> address space.

I am not good at making ASCII graphs, nor would I know how to graph this. Any
valid address on the CPU is valid on the GPU, that's it really.
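To make the swap-entry mechanism quoted above concrete, here is a toy
userspace model (NOT kernel code; every name in it is invented for
illustration): a migrated page's CPU page table entry is replaced by a
special "device swap entry", and the next CPU access faults and migrates the
page back, just like a page swapped to disk.

```c
/* Toy model of HMM's migrate-back-on-CPU-access behaviour.
 * All types and names here are invented; this is not the kernel API. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

enum pte_state { PTE_PRESENT, PTE_DEVICE_SWAP };

struct pte {
	enum pte_state state;
	char *page;        /* system page when PTE_PRESENT */
	char *device_page; /* device-side copy when PTE_DEVICE_SWAP */
};

static char system_mem[PAGE_SIZE];
static char device_mem[PAGE_SIZE];

/* Migrate to device memory: copy the data out and turn the CPU page
 * table entry into a device swap entry. */
static void migrate_to_device(struct pte *pte)
{
	memcpy(device_mem, pte->page, PAGE_SIZE);
	pte->state = PTE_DEVICE_SWAP;
	pte->device_page = device_mem;
	pte->page = NULL;
}

/* CPU access: hitting a device swap entry "faults" and migrates the
 * page back to system memory before the access can proceed. */
static char *cpu_access(struct pte *pte)
{
	if (pte->state == PTE_DEVICE_SWAP) {
		memcpy(system_mem, pte->device_page, PAGE_SIZE);
		pte->state = PTE_PRESENT;
		pte->page = system_mem;
		pte->device_page = NULL;
	}
	return pte->page;
}
```

The point of the sketch is only the shape of the state machine: the CPU never
dereferences device memory directly; it always goes through the fault path.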
The migration to device memory is orthogonal to the shared address space.

Cheers,
Jérôme
[HMM 00/16] HMM (Heterogeneous Memory Management) v19
Patchset is on top of mmotm mmotm-2017-04-04-15-00; it conflicts with
Michal's memory hotplug patchset (the first patch in this series is the
conflicting one). There is also a build issue against 4.11-rc* where some
definitions are now in include/linux/sched/mm.h; to fix this, that new header
file needs to be included in migrate.c and hmm.c. The patchset has otherwise
been build tested on different architectures without any issues. It was also
tested with real hardware on x86-64.

Changes since v18:
- Use an enum for memory type instead of a set of flags; this makes a clearer
  separation between the different types of ZONE_DEVICE memory (ie persistent
  or HMM un-addressable memory)
- Don't preserve soft-dirtiness, as checkpoint and restore can not be used
  with an active device driver. This could be revisited if we are ever able
  to save device state
- Drop the extra flag to the migratepage callback of address_space and use a
  new migrate mode instead of adding a new parameter
- Improve comments in various code paths
- Use rw_sem to protect the mirrors list
- Improved Kconfig help description
- Drop over-cautious BUG_ON()
- Added a documentation file
- Build fixes
- Typo fixes

Heterogeneous Memory Management (HMM) (description and justification)

Today device drivers expose dedicated memory allocation APIs through their
device files, often relying on a combination of IOCTL and mmap calls. The
device can only access and use memory allocated through this API. This
effectively splits the program address space into objects allocated for the
device and usable by the device, and other regular memory (malloc, mmap of a
file, shared memory, …) only accessible by the CPU (or in a very limited way
by a device, by pinning memory).

Allowing different isolated components of a program to use a device thus
requires duplicating the input data structures using the device memory
allocator.
This is reasonable for simple data structures (arrays, grids, images, …) but
it gets extremely complex with advanced data structures (lists, trees,
graphs, …) that rely on a web of memory pointers. This is becoming a serious
limitation on the kind of workloads that can be offloaded to devices like
GPUs.

New industry standards like C++, OpenCL or CUDA are pushing to remove this
barrier. This requires a shared address space between the GPU device and the
CPU so that the GPU can access any memory of a process (while still obeying
memory protections like read-only). This kind of feature is also appearing in
various other operating systems.

HMM is a set of helpers to facilitate several aspects of address space
sharing and device memory management. Unlike existing sharing mechanisms that
rely on pinning the pages used by a device, HMM relies on mmu_notifier to
propagate CPU page table updates to the device page table.

Duplicating the CPU page table is only one aspect necessary for efficiently
using a device like a GPU. GPU local memory has bandwidth in the
terabytes/second range, but it is connected to main memory through a system
bus like PCIe, which is limited to 32 gigabytes/second (PCIe 4.0 16x). Thus
it is necessary to allow migration of process memory from main system memory
to device memory. The issue is that on platforms that only have PCIe, the
device memory is not accessible by the CPU with the same properties as main
memory (cache coherency, atomic operations, …).

To allow migration from main memory to device memory, HMM provides a set of
helpers to hotplug device memory as a new type of ZONE_DEVICE memory which is
un-addressable by the CPU but still has struct pages representing it. This
allows most of the core kernel logic that deals with process memory to stay
oblivious to the peculiarities of device memory.

When a page backing an address of a process is migrated to device memory, the
CPU page table entry is set to a new specific swap entry.
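The mirroring idea above can be sketched as a toy userspace model (NOT the
kernel mmu_notifier API; all names are invented): instead of pinning pages,
the device keeps a cached mirror of the CPU page table, an invalidate
callback knocks entries out of the mirror whenever the CPU page table
changes, and a device-side miss refaults from the current CPU mapping.

```c
/* Toy model of mirroring a CPU page table via invalidation callbacks.
 * Invented names; not the kernel mmu_notifier interface. */
#include <assert.h>
#include <stddef.h>

#define NPAGES 4

static unsigned long cpu_pgtable[NPAGES];   /* 0 = not mapped */
static unsigned long device_mirror[NPAGES]; /* device's cached copy */

/* mmu_notifier-style callback: invalidate a range in the device mirror. */
static void mirror_invalidate_range(size_t start, size_t end)
{
	for (size_t i = start; i < end; i++)
		device_mirror[i] = 0;
}

/* CPU side: every page table update notifies the mirror. */
static void cpu_set_pte(size_t idx, unsigned long pfn)
{
	cpu_pgtable[idx] = pfn;
	mirror_invalidate_range(idx, idx + 1);
}

/* Device side: a miss in the mirror refaults from the CPU page table,
 * so the device always observes the CPU's current mapping. */
static unsigned long device_access(size_t idx)
{
	if (device_mirror[idx] == 0)
		device_mirror[idx] = cpu_pgtable[idx];
	return device_mirror[idx];
}
```

This is why no pinning is needed: the CPU side stays free to remap, swap or
migrate pages, and the device simply refaults after each invalidation.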
CPU access to such an address triggers a migration back to system memory,
just as if the page had been swapped to disk. HMM also blocks anyone from
pinning a ZONE_DEVICE page so that it can always be migrated back to system
memory if the CPU accesses it. Conversely, HMM does not migrate to device
memory any page that is pinned in system memory.

To allow efficient migration between device memory and main memory, a new
migrate_vma() helper is added with this patchset. It allows leveraging the
device DMA engine to perform the copy operation. This feature will be used by
upstream drivers like nouveau and mlx5, and probably others in the future
(amdgpu is the next suspect in line). We are actively working on nouveau and
mlx5 support. To test this patchset we also worked with NVidia's closed
source driver team; they have more resources than us to test this kind of
infrastructure, and also a bigger and better userspace eco-system with
various real industry workloads they can use to test and profile HMM.

The expected workload is a program that builds a data set on the CPU (from
disk, from network, from sensors, …). The program uses a GPU API (OpenCL,
CUDA, ...) to give hints on memory placement for the input data and also for
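The batched, driver-driven migration that migrate_vma() enables can be
sketched as a toy userspace model (NOT the kernel migrate_vma() API; the
"DMA" copy is a plain memcpy and every name is invented): the helper collects
candidate pages, hands the whole batch to a driver callback for copying, then
finalizes the mapping changes.

```c
/* Toy model of a collect / driver-copy / finalize migration helper.
 * Invented names; not the kernel migrate_vma() interface. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NPAGES 3
#define PAGE_SIZE 64

static char system_pages[NPAGES][PAGE_SIZE];
static char device_pages[NPAGES][PAGE_SIZE];
static int on_device[NPAGES]; /* where each page currently lives */

/* Driver callback: batch-copy all collected pages (stand-in for a DMA
 * engine doing the transfers in one go). */
static void drv_copy(const size_t *src, size_t n)
{
	for (size_t i = 0; i < n; i++)
		memcpy(device_pages[src[i]], system_pages[src[i]], PAGE_SIZE);
}

/* migrate_vma()-like helper: collect, driver copy, finalize mappings.
 * Returns the number of pages actually migrated. */
static size_t migrate_range(size_t start, size_t end)
{
	size_t collected[NPAGES], n = 0;

	for (size_t i = start; i < end; i++)  /* 1. collect candidates */
		if (!on_device[i])
			collected[n++] = i;

	drv_copy(collected, n);               /* 2. one batched copy */

	for (size_t i = 0; i < n; i++)        /* 3. finalize: flip mappings */
		on_device[collected[i]] = 1;

	return n;
}
```

The batching is the point: handing the driver a whole range at once lets a
real DMA engine amortize setup cost instead of copying page by page.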