Re: [RFC 0/8] Define coherent device memory node
On Sat, Nov 05, 2016 at 10:51:21AM +0530, Anshuman Khandual wrote:
> On 10/25/2016 09:56 AM, Aneesh Kumar K.V wrote:
> > I looked at the hmm-v13 w.r.t migration and I guess some form of device
> > callback/acceleration during migration is something we should definitely
> > have. I still haven't figured out how non addressable and coherent device
> > memory can fit together there. I was waiting for the page cache
> > migration support to be pushed to the repository before I start looking
> > at this closely.
>
> Aneesh, did not get that. Currently basic page cache migration is supported,
> right ? The device callback during migration, fault etc. are supported through
> the page->pgmap pointer and by extending the dev_pagemap structure to
> accommodate new members. IIUC that is the reason ZONE_DEVICE is being modified,
> so that page->pgmap overloading can be used for various driver/device specific
> callbacks while inside core VM functions or HMM functions.
>
> HMM v13 has introduced non-addressable ZONE_DEVICE based device memory which
> can have its struct pages in system RAM even though the memory itself cannot
> be accessed from the CPU. Now coherent device memory is similar to persistent
> memory like NVDIMM, which is already supported through ZONE_DEVICE (though we
> might not want to use vmem_altmap and instead have the struct pages in system
> RAM). Now HMM has to learn to work with a 'dev_pagemap->addressable' type of
> device memory and then support all possible migrations through its API. So in
> a nutshell, these are the changes we need to do to make HMM work with coherent
> device memory.
>
> (0) Support all possible migrations between system RAM and device memory
>     for current un-addressable device memory and make the HMM migration
>     API layer comprehensive and complete.

What is not comprehensive or complete in the API layer ? I think the API is
pretty clear: the migrate function does not rely on anything except the HMM pfn.

> (1) Create coherent device memory representation in ZONE_DEVICE
>     (a) Make it exactly the same as that of persistent memory/NVDIMM
>
>     or
>
>     (b) Create a new type for coherent device memory representation

So i will soon push an updated tree with modifications to the HMM API (from the
device driver point of view; the migrate part is virtually the same). I split
the addressable and movable concepts, and thus it is now easy to support both
coherent addressable memory and non-addressable memory.

> (2) Support all possible migrations between system RAM and device memory
>     for new addressable coherent device memory represented in ZONE_DEVICE
>     extending the HMM migration API layer.
>
> Right now, the HMM v13 patch series supports migration for a subset of private
> anonymous pages for un-addressable device memory. I am wondering how difficult
> it is to implement all possible anon and file mapping migration support for
> both un-addressable and addressable coherent device memory through ZONE_DEVICE.

There is no need to extend the API to support file-backed pages; as a matter of
fact the 2 patches i sent you do support migration of file-backed pages
(page->mapping) to and from ZONE_DEVICE as long as this ZONE_DEVICE memory is
accessible by the CPU and coherent. What i am still working on is the
non-addressable case, which is way more tedious (handling direct IO, read,
write and writeback). So the difficulty for coherent memory is nil; it is the
non-addressable memory that is hard to support with respect to file-backed
pages.

Cheers,
Jérôme
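The addressable/movable split mentioned above can be pictured as two
independent properties of a device memory range. The following is a purely
illustrative sketch; the flag names are hypothetical and are not the actual
hmm-v13 API.

/*
 * Illustrative only: hypothetical flags showing how "can the CPU load/store
 * this memory" and "can these pages be migrated" become independent once the
 * two concepts are split.
 */
#define DEV_MEM_ADDRESSABLE	(1 << 0)	/* CPU can load/store it */
#define DEV_MEM_MOVABLE		(1 << 1)	/* pages may be migrated */

/* Coherent device memory: CPU addressable, cache coherent, and movable. */
#define DEV_MEM_COHERENT	(DEV_MEM_ADDRESSABLE | DEV_MEM_MOVABLE)

/* Discrete GPU memory behind a non-coherent bus: movable but not directly
 * addressable, so any CPU access must first migrate the page back. */
#define DEV_MEM_UNADDRESSABLE	(DEV_MEM_MOVABLE)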
Re: [RFC 0/8] Define coherent device memory node
On 10/25/2016 09:56 AM, Aneesh Kumar K.V wrote:
> I looked at the hmm-v13 w.r.t migration and I guess some form of device
> callback/acceleration during migration is something we should definitely
> have. I still haven't figured out how non addressable and coherent device
> memory can fit together there. I was waiting for the page cache
> migration support to be pushed to the repository before I start looking
> at this closely.

Aneesh, did not get that. Currently basic page cache migration is supported,
right ? The device callback during migration, fault etc. are supported through
the page->pgmap pointer and by extending the dev_pagemap structure to
accommodate new members. IIUC that is the reason ZONE_DEVICE is being modified,
so that page->pgmap overloading can be used for various driver/device specific
callbacks while inside core VM functions or HMM functions.

HMM v13 has introduced non-addressable ZONE_DEVICE based device memory which
can have its struct pages in system RAM even though the memory itself cannot
be accessed from the CPU. Now coherent device memory is similar to persistent
memory like NVDIMM, which is already supported through ZONE_DEVICE (though we
might not want to use vmem_altmap and instead have the struct pages in system
RAM). Now HMM has to learn to work with a 'dev_pagemap->addressable' type of
device memory and then support all possible migrations through its API. So in
a nutshell, these are the changes we need to do to make HMM work with coherent
device memory.

(0) Support all possible migrations between system RAM and device memory
    for current un-addressable device memory and make the HMM migration
    API layer comprehensive and complete.

(1) Create coherent device memory representation in ZONE_DEVICE
    (a) Make it exactly the same as that of persistent memory/NVDIMM

    or

    (b) Create a new type for coherent device memory representation

(2) Support all possible migrations between system RAM and device memory
    for new addressable coherent device memory represented in ZONE_DEVICE
    extending the HMM migration API layer.

Right now, the HMM v13 patch series supports migration for a subset of private
anonymous pages for un-addressable device memory. I am wondering how difficult
it is to implement all possible anon and file mapping migration support for
both un-addressable and addressable coherent device memory through ZONE_DEVICE.
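Option (1)(b) above amounts to tagging a dev_pagemap with the kind of device
memory it describes. A minimal sketch follows; the enum values and the 'type'
member are hypothetical for illustration (they are not the upstream dev_pagemap
layout, which at this point only carried altmap/res/ref/dev).

#include <linux/percpu-refcount.h>
#include <linux/device.h>

/* Hypothetical sketch of option (1)(b): a per-pagemap memory type. */
enum device_memory_type {
	MEMORY_DEVICE_PERSISTENT,	/* NVDIMM-style, CPU addressable */
	MEMORY_DEVICE_COHERENT,		/* CPU addressable and cache coherent */
	MEMORY_DEVICE_UNADDRESSABLE,	/* struct pages exist, CPU cannot touch data */
};

struct dev_pagemap_sketch {
	struct percpu_ref *ref;		/* lifetime of the mapping, as upstream */
	struct device *dev;		/* owning device, as upstream */
	enum device_memory_type type;	/* hypothetical: lets core VM/HMM branch */
};

/* Core VM or HMM code could then branch on page->pgmap->type when deciding
 * whether a CPU access must migrate the page back or can map it directly. */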
Re: [RFC 0/8] Define coherent device memory node
On Fri, Oct 28, 2016 at 10:59:52AM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse writes:
>
> > On Wed, Oct 26, 2016 at 04:39:19PM +0530, Aneesh Kumar K.V wrote:
> >> Jerome Glisse writes:
> >>
> >> > On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote:
> >> >> Jerome Glisse writes:
> >> >>
> >> >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >> >> >
> >> >> I looked at the hmm-v13 w.r.t migration and I guess some form of device
> >> >> callback/acceleration during migration is something we should definitely
> >> >> have. I still haven't figured out how non addressable and coherent device
> >> >> memory can fit together there. I was waiting for the page cache
> >> >> migration support to be pushed to the repository before I start looking
> >> >> at this closely.
> >> >
> >> > The page cache migration does not touch the migrate code path. My issue
> >> > with page cache is writeback. The only difference with existing migrate
> >> > code is the refcount check for ZONE_DEVICE pages. Everything else is the
> >> > same.
> >>
> >> What about the radix tree ? does the file system migrate_page callback
> >> handle replacing a normal page with a ZONE_DEVICE page/exceptional entries ?
> >
> > It uses the exact same existing code (from mm/migrate.c), so yes the radix
> > tree is updated and buffer_heads are migrated.
>
> I looked at the page cache migration patches shared and I find that you are
> not using exceptional entries when we migrate a page cache page to device
> memory. But I am now not sure how a read from page cache will work with that.
>
> ie, a file system read will now find the page in page cache. But we cannot do
> a copy_to_user of that page because that is now backed by unaddressable
> memory, right ?
>
> do_generic_file_read() does
>
>     page = find_get_page(mapping, index);
>     ret = copy_page_to_iter(page, offset, nr, iter);
>
> which does
>
>     void *kaddr = kmap_atomic(page);
>     size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
>     kunmap_atomic(kaddr);

Like i said, right now my patches are mostly broken for un-addressable memory,
for both read and write. I am focusing on page writeback for now as it seemed
to be the more problematic case. For read/write the intention is to trigger a
migration back to system memory inside the filesystem read/write path. This is
also why i will need a flag to indicate whether a filesystem supports migration
to un-addressable memory.

But in your case, where the device memory is accessible, it should just work.
Or do you need to do something special when kmapping a device page ?

Cheers,
Jérôme
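The "migrate back on read" intention described above can be sketched roughly
as below. The helpers hmm_page_is_unaddressable() and
hmm_migrate_to_system_ram() are hypothetical placeholders, not the hmm-v13
interface; only find_get_page()/put_page() are the real kernel API.

#include <linux/mm.h>
#include <linux/pagemap.h>

/* Hypothetical helpers, assumed to be provided by the device-memory layer. */
bool hmm_page_is_unaddressable(struct page *page);
struct page *hmm_migrate_to_system_ram(struct page *page);

static struct page *get_readable_page(struct address_space *mapping,
				      pgoff_t index)
{
	struct page *page = find_get_page(mapping, index);

	if (!page)
		return NULL;

	/*
	 * The CPU cannot kmap an un-addressable ZONE_DEVICE page, so the
	 * read path would first pull the data back into system RAM before
	 * the usual copy_page_to_iter() runs.
	 */
	if (hmm_page_is_unaddressable(page)) {
		struct page *syspage = hmm_migrate_to_system_ram(page);

		put_page(page);
		page = syspage;		/* may be NULL if migration failed */
	}

	/* Coherent, CPU-addressable device memory needs none of this:
	 * copy_page_to_iter() can kmap the device page directly. */
	return page;
}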
Re: [RFC 0/8] Define coherent device memory node
On Fri, Oct 28, 2016 at 11:17:31AM +0530, Anshuman Khandual wrote: > On 10/27/2016 08:35 PM, Jerome Glisse wrote: > > On Thu, Oct 27, 2016 at 12:33:05PM +0530, Anshuman Khandual wrote: > >> On 10/27/2016 10:08 AM, Anshuman Khandual wrote: > >>> On 10/26/2016 09:32 PM, Jerome Glisse wrote: > On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote: > > On 10/26/2016 12:22 AM, Jerome Glisse wrote: > >> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: > >>> Jerome Glisse writes: > On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: > > Jerome Glisse writes: > >> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > > > > [...] > > > In my patchset there is no policy, it is all under device driver control > which > decide what range of memory is migrated and when. I think only device > driver as > proper knowledge to make such decision. By coalescing data from GPU > counters and > request from application made through the uppler level programming API > like > Cuda. > > >>> > >>> Right, I understand that. But what I pointed out here is that there are > >>> problems > >>> now migrating user mapped pages back and forth between LRU system RAM > >>> memory and > >>> non LRU device memory which is yet to be solved. Because you are > >>> proposing a non > >>> LRU based design with ZONE_DEVICE, how we are solving/working around these > >>> problems for bi-directional migration ? > >> > >> Let me elaborate on this bit more. Before non LRU migration support patch > >> series > >> from Minchan, it was not possible to migrate non LRU pages which are > >> generally > >> driver managed through migrate_pages interface. This was affecting the > >> ability > >> to do compaction on platforms which has a large share of non LRU pages. > >> That series > >> actually solved the migration problem and allowed compaction. But it still > >> did not > >> solve the migration problem for non LRU *user mapped* pages. So if the non > >> LRU pages > >> are mapped into a process's page table and being accessed from user space, > >> it can > >> not be moved using migrate_pages interface. > >> > >> Minchan had a draft solution for that problem which is still hosted here. > >> On his > >> suggestion I had tried this solution but still faced some other problems > >> during > >> mapped pages migration. (NOTE: IIRC this was not posted in the community) > >> > >> git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git with the > >> following > >> branch (non-lru-mapped-v1r2-v4.7-rc4-mmotm-2016-06-24-15-53) > >> > >> As I had mentioned earlier, we intend to support all possible migrations > >> between > >> system RAM (LRU) and device memory (Non LRU) for user space mapped pages. > >> > >> (1) System RAM (Anon mapping) --> Device memory, back and forth many times > >> (2) System RAM (File mapping) --> Device memory, back and forth many times > > > > I achieve this 2 objective in HMM, i sent you the additional patches for > > file > > back page migration. I am not done working on them but they are small. > > Sure, will go through them. Thanks ! > > > > > > >> This is not happening now with non LRU pages. Here are some of reasons but > >> before > >> that some notes. 
> >> > >> * Driver initiates all the migrations > >> * Driver does the isolation of pages > >> * Driver puts the isolated pages in a linked list > >> * Driver passes the linked list to migrate_pages interface for migration > >> * IIRC isolation of non LRU pages happens through > >> page->as->aops->isolate_page call > >> * If migration fails, call page->as->aops->putback_page to give the page > >> back to the > >> device driver > >> > >> 1. queue_pages_range() currently does not work with non LRU pages, needs > >> to be fixed > >> > >> 2. After a successful migration from non LRU device memory to LRU system > >> RAM, the non > >>LRU will be freed back. Right now migrate_pages releases these pages to > >> buddy, but > >>in this situation we need the pages to be given back to the driver > >> instead. Hence > >>migrate_pages needs to be changed to accommodate this. > >> > >> 3. After LRU system RAM to non LRU device migration for a mapped page, > >> does the new > >>page (which came from device memory) will be part of core MM LRU either > >> for Anon > >>or File mapping ? > >> > >> 4. After LRU (Anon mapped) system RAM to non LRU device migration for a > >> mapped page, > >>how we are going to store "address_space->address_space_operations" and > >> "Anon VMA > >>Chain" reverse mapping information both on the page->mapping element ? > >> > >> 5. After LRU (File mapped) system RAM to non LRU device migration for a > >> mapped page, > >>how we are going to store "address_space->address_space_operations" of > >> the device > >>driver and radix tree based reverse mapp
Re: [RFC 0/8] Define coherent device memory node
Jerome Glisse writes:
> On Wed, Oct 26, 2016 at 04:39:19PM +0530, Aneesh Kumar K.V wrote:
>> Jerome Glisse writes:
>>
>> > On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote:
>> >> Jerome Glisse writes:
>> >>
>> >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>> >> >
>> >> I looked at the hmm-v13 w.r.t migration and I guess some form of device
>> >> callback/acceleration during migration is something we should definitely
>> >> have. I still haven't figured out how non addressable and coherent device
>> >> memory can fit together there. I was waiting for the page cache
>> >> migration support to be pushed to the repository before I start looking
>> >> at this closely.
>> >
>> > The page cache migration does not touch the migrate code path. My issue
>> > with page cache is writeback. The only difference with existing migrate
>> > code is the refcount check for ZONE_DEVICE pages. Everything else is the
>> > same.
>>
>> What about the radix tree ? does the file system migrate_page callback
>> handle replacing a normal page with a ZONE_DEVICE page/exceptional entries ?
>
> It uses the exact same existing code (from mm/migrate.c), so yes the radix
> tree is updated and buffer_heads are migrated.

I looked at the page cache migration patches shared and I find that you are
not using exceptional entries when we migrate a page cache page to device
memory. But I am now not sure how a read from page cache will work with that.

ie, a file system read will now find the page in page cache. But we cannot do
a copy_to_user of that page because that is now backed by unaddressable
memory, right ?

do_generic_file_read() does

    page = find_get_page(mapping, index);
    ret = copy_page_to_iter(page, offset, nr, iter);

which does

    void *kaddr = kmap_atomic(page);
    size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
    kunmap_atomic(kaddr);

-aneesh
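For reference, the distinction Aneesh is drawing is between a page-cache radix
tree slot that holds a real struct page (what the HMM page cache patches keep,
even when the page is ZONE_DEVICE) and an "exceptional" entry of the kind used
by shmem swap entries and DAX. A small sketch of how lookup code of this era
tells the two apart; this is the generic radix-tree convention, not the HMM
patches themselves.

#include <linux/radix-tree.h>
#include <linux/pagemap.h>

static bool slot_is_real_page(void *entry)
{
	if (!entry)
		return false;			/* hole */
	if (radix_tree_exceptional_entry(entry))
		return false;			/* shadow/DAX/swap-style entry */
	return true;				/* an actual struct page, possibly
						 * backed by device memory */
}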
Re: [RFC 0/8] Define coherent device memory node
On 10/27/2016 08:35 PM, Jerome Glisse wrote: > On Thu, Oct 27, 2016 at 12:33:05PM +0530, Anshuman Khandual wrote: >> On 10/27/2016 10:08 AM, Anshuman Khandual wrote: >>> On 10/26/2016 09:32 PM, Jerome Glisse wrote: On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote: > On 10/26/2016 12:22 AM, Jerome Glisse wrote: >> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: >>> Jerome Glisse writes: On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: > Jerome Glisse writes: >> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > > [...] > In my patchset there is no policy, it is all under device driver control which decide what range of memory is migrated and when. I think only device driver as proper knowledge to make such decision. By coalescing data from GPU counters and request from application made through the uppler level programming API like Cuda. >>> >>> Right, I understand that. But what I pointed out here is that there are >>> problems >>> now migrating user mapped pages back and forth between LRU system RAM >>> memory and >>> non LRU device memory which is yet to be solved. Because you are proposing >>> a non >>> LRU based design with ZONE_DEVICE, how we are solving/working around these >>> problems for bi-directional migration ? >> >> Let me elaborate on this bit more. Before non LRU migration support patch >> series >> from Minchan, it was not possible to migrate non LRU pages which are >> generally >> driver managed through migrate_pages interface. This was affecting the >> ability >> to do compaction on platforms which has a large share of non LRU pages. That >> series >> actually solved the migration problem and allowed compaction. But it still >> did not >> solve the migration problem for non LRU *user mapped* pages. So if the non >> LRU pages >> are mapped into a process's page table and being accessed from user space, >> it can >> not be moved using migrate_pages interface. >> >> Minchan had a draft solution for that problem which is still hosted here. On >> his >> suggestion I had tried this solution but still faced some other problems >> during >> mapped pages migration. (NOTE: IIRC this was not posted in the community) >> >> git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git with the >> following >> branch (non-lru-mapped-v1r2-v4.7-rc4-mmotm-2016-06-24-15-53) >> >> As I had mentioned earlier, we intend to support all possible migrations >> between >> system RAM (LRU) and device memory (Non LRU) for user space mapped pages. >> >> (1) System RAM (Anon mapping) --> Device memory, back and forth many times >> (2) System RAM (File mapping) --> Device memory, back and forth many times > > I achieve this 2 objective in HMM, i sent you the additional patches for file > back page migration. I am not done working on them but they are small. Sure, will go through them. Thanks ! > > >> This is not happening now with non LRU pages. Here are some of reasons but >> before >> that some notes. >> >> * Driver initiates all the migrations >> * Driver does the isolation of pages >> * Driver puts the isolated pages in a linked list >> * Driver passes the linked list to migrate_pages interface for migration >> * IIRC isolation of non LRU pages happens through >> page->as->aops->isolate_page call >> * If migration fails, call page->as->aops->putback_page to give the page >> back to the >> device driver >> >> 1. queue_pages_range() currently does not work with non LRU pages, needs to >> be fixed >> >> 2. 
After a successful migration from non LRU device memory to LRU system >> RAM, the non >>LRU will be freed back. Right now migrate_pages releases these pages to >> buddy, but >>in this situation we need the pages to be given back to the driver >> instead. Hence >>migrate_pages needs to be changed to accommodate this. >> >> 3. After LRU system RAM to non LRU device migration for a mapped page, does >> the new >>page (which came from device memory) will be part of core MM LRU either >> for Anon >>or File mapping ? >> >> 4. After LRU (Anon mapped) system RAM to non LRU device migration for a >> mapped page, >>how we are going to store "address_space->address_space_operations" and >> "Anon VMA >>Chain" reverse mapping information both on the page->mapping element ? >> >> 5. After LRU (File mapped) system RAM to non LRU device migration for a >> mapped page, >>how we are going to store "address_space->address_space_operations" of >> the device >>driver and radix tree based reverse mapping information for the existing >> file >>mapping both on the same page->mapping element ? >> >> 6. IIRC, it was not possible to retain the non LRU identify (page->as->aops >> which will >>defined inside the device driver) and the reverse mapping information >> (either anon >>or file mapping)
Re: [RFC 0/8] Define coherent device memory node
On Thu, Oct 27, 2016 at 12:33:05PM +0530, Anshuman Khandual wrote: > On 10/27/2016 10:08 AM, Anshuman Khandual wrote: > > On 10/26/2016 09:32 PM, Jerome Glisse wrote: > >> On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote: > >>> On 10/26/2016 12:22 AM, Jerome Glisse wrote: > On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: > > Jerome Glisse writes: > >> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: > >>> Jerome Glisse writes: > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: [...] > >> In my patchset there is no policy, it is all under device driver control > >> which > >> decide what range of memory is migrated and when. I think only device > >> driver as > >> proper knowledge to make such decision. By coalescing data from GPU > >> counters and > >> request from application made through the uppler level programming API like > >> Cuda. > >> > > > > Right, I understand that. But what I pointed out here is that there are > > problems > > now migrating user mapped pages back and forth between LRU system RAM > > memory and > > non LRU device memory which is yet to be solved. Because you are proposing > > a non > > LRU based design with ZONE_DEVICE, how we are solving/working around these > > problems for bi-directional migration ? > > Let me elaborate on this bit more. Before non LRU migration support patch > series > from Minchan, it was not possible to migrate non LRU pages which are generally > driver managed through migrate_pages interface. This was affecting the ability > to do compaction on platforms which has a large share of non LRU pages. That > series > actually solved the migration problem and allowed compaction. But it still > did not > solve the migration problem for non LRU *user mapped* pages. So if the non > LRU pages > are mapped into a process's page table and being accessed from user space, it > can > not be moved using migrate_pages interface. > > Minchan had a draft solution for that problem which is still hosted here. On > his > suggestion I had tried this solution but still faced some other problems > during > mapped pages migration. (NOTE: IIRC this was not posted in the community) > > git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git with the > following > branch (non-lru-mapped-v1r2-v4.7-rc4-mmotm-2016-06-24-15-53) > > As I had mentioned earlier, we intend to support all possible migrations > between > system RAM (LRU) and device memory (Non LRU) for user space mapped pages. > > (1) System RAM (Anon mapping) --> Device memory, back and forth many times > (2) System RAM (File mapping) --> Device memory, back and forth many times I achieve this 2 objective in HMM, i sent you the additional patches for file back page migration. I am not done working on them but they are small. > This is not happening now with non LRU pages. Here are some of reasons but > before > that some notes. > > * Driver initiates all the migrations > * Driver does the isolation of pages > * Driver puts the isolated pages in a linked list > * Driver passes the linked list to migrate_pages interface for migration > * IIRC isolation of non LRU pages happens through > page->as->aops->isolate_page call > * If migration fails, call page->as->aops->putback_page to give the page back > to the > device driver > > 1. queue_pages_range() currently does not work with non LRU pages, needs to > be fixed > > 2. After a successful migration from non LRU device memory to LRU system RAM, > the non >LRU will be freed back. 
Right now migrate_pages releases these pages to > buddy, but >in this situation we need the pages to be given back to the driver > instead. Hence >migrate_pages needs to be changed to accommodate this. > > 3. After LRU system RAM to non LRU device migration for a mapped page, does > the new >page (which came from device memory) will be part of core MM LRU either > for Anon >or File mapping ? > > 4. After LRU (Anon mapped) system RAM to non LRU device migration for a > mapped page, >how we are going to store "address_space->address_space_operations" and > "Anon VMA >Chain" reverse mapping information both on the page->mapping element ? > > 5. After LRU (File mapped) system RAM to non LRU device migration for a > mapped page, >how we are going to store "address_space->address_space_operations" of the > device >driver and radix tree based reverse mapping information for the existing > file >mapping both on the same page->mapping element ? > > 6. IIRC, it was not possible to retain the non LRU identify (page->as->aops > which will >defined inside the device driver) and the reverse mapping information > (either anon >or file mapping) together after first round of migration. This non LRU > identity needs >to be retained continuously if we ever need to return this page to device > driver after >success
Re: [RFC 0/8] Define coherent device memory node
On 27/10/16 03:28, Jerome Glisse wrote: > On Wed, Oct 26, 2016 at 06:26:02PM +0530, Anshuman Khandual wrote: >> On 10/26/2016 12:22 AM, Jerome Glisse wrote: >>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: Jerome Glisse writes: > On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: >> Jerome Glisse writes: >>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > > [...] > >>> You can take a look at hmm-v13 if you want to see how i do non LRU page >>> migration. While i put most of the migration code inside hmm_migrate.c >>> it >>> could easily be move to migrate.c without hmm_ prefix. >>> >>> There is 2 missing piece with existing migrate code. First is to put >>> memory >>> allocation for destination under control of who call the migrate code. >>> Second >>> is to allow offloading the copy operation to device (ie not use the CPU >>> to >>> copy data). >>> >>> I believe same requirement also make sense for platform you are >>> targeting. >>> Thus same code can be use. >>> >>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 >>> >>> I haven't posted this patchset yet because we are doing some >>> modifications >>> to the device driver API to accomodate some new features. But the >>> ZONE_DEVICE >>> changes and the overall migration code will stay the same more or less >>> (i have >>> patches that move it to migrate.c and share more code with existing >>> migrate >>> code). >>> >>> If you think i missed anything about lru and page cache please point it >>> to >>> me. Because when i audited code for that i didn't see any road block >>> with >>> the few fs i was looking at (ext4, xfs and core page cache code). >>> >> >> The other restriction around ZONE_DEVICE is, it is not a managed zone. >> That prevents any direct allocation from coherent device by application. >> ie, we would like to force allocation from coherent device using >> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ? > > To achieve this we rely on device fault code path ie when device take a > page fault > with help of HMM it will use existing memory if any for fault address but > if CPU > page table is empty (and it is not file back vma because of readback) > then device > can directly allocate device memory and HMM will update CPU page table to > point to > newly allocated device memory. > That is ok if the device touch the page first. What if we want the allocation touched first by cpu to come from GPU ?. Should we always depend on GPU driver to migrate such pages later from system RAM to GPU memory ? >>> >>> I am not sure what kind of workload would rather have every first CPU >>> access for >>> a range to use device memory. So no my code does not handle that and it is >>> pointless >>> for it as CPU can not access device memory for me. >> >> If the user space application can explicitly allocate device memory >> directly, we >> can save one round of migration when the device start accessing it. But then >> one >> can argue what problem statement the device would work on on a freshly >> allocated >> memory which has not been accessed by CPU for loading the data yet. Will >> look into >> this scenario in more detail. >> >>> >>> That said nothing forbid to add support for ZONE_DEVICE with mbind() like >>> syscall. 
>>> Thought my personnal preference would still be to avoid use of such generic >>> syscall >>> but have device driver set allocation policy through its own userspace API >>> (device >>> driver could reuse internal of mbind() to achieve the end result). >> >> Okay, the basic premise of CDM node is to have a LRU based design where we >> can >> avoid use of driver specific user space memory management code altogether. > > And i think it is not a good fit, at least not for GPU. GPU device driver > have a > big chunk of code dedicated to memory management. You can look at drm/ttm and > at > userspace (most is in userspace). It is not because we want to reinvent the > wheel > it is because they are some unique constraint. > Could you elaborate on the unique constraints a bit more? I looked at ttm briefly (specifically ttm_memory.c), I can see zones being replicated, it feels like a mini-mm is embedded in there. > >>> >>> I am not saying that eveything you want to do is doable now with HMM but, >>> nothing >>> preclude achieving what you want to achieve using ZONE_DEVICE. I really >>> don't think >>> any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and >>> can be reuse >>> with device memory. >> >> With CDM node based design, the expectation is to get all/maximum core VM >> mechanism >> working so that, driver has to do less
Re: [RFC 0/8] Define coherent device memory node
On 10/27/2016 10:08 AM, Anshuman Khandual wrote: > On 10/26/2016 09:32 PM, Jerome Glisse wrote: >> On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote: >>> On 10/26/2016 12:22 AM, Jerome Glisse wrote: On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: > Jerome Glisse writes: > >> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: >>> Jerome Glisse writes: On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: >> >> [...] >> You can take a look at hmm-v13 if you want to see how i do non LRU page migration. While i put most of the migration code inside hmm_migrate.c it could easily be move to migrate.c without hmm_ prefix. There is 2 missing piece with existing migrate code. First is to put memory allocation for destination under control of who call the migrate code. Second is to allow offloading the copy operation to device (ie not use the CPU to copy data). I believe same requirement also make sense for platform you are targeting. Thus same code can be use. hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 I haven't posted this patchset yet because we are doing some modifications to the device driver API to accomodate some new features. But the ZONE_DEVICE changes and the overall migration code will stay the same more or less (i have patches that move it to migrate.c and share more code with existing migrate code). If you think i missed anything about lru and page cache please point it to me. Because when i audited code for that i didn't see any road block with the few fs i was looking at (ext4, xfs and core page cache code). >>> >>> The other restriction around ZONE_DEVICE is, it is not a managed zone. >>> That prevents any direct allocation from coherent device by application. >>> ie, we would like to force allocation from coherent device using >>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ? >> >> To achieve this we rely on device fault code path ie when device take a >> page fault >> with help of HMM it will use existing memory if any for fault address >> but if CPU >> page table is empty (and it is not file back vma because of readback) >> then device >> can directly allocate device memory and HMM will update CPU page table >> to point to >> newly allocated device memory. >> > > That is ok if the device touch the page first. What if we want the > allocation touched first by cpu to come from GPU ?. Should we always > depend on GPU driver to migrate such pages later from system RAM to GPU > memory ? > I am not sure what kind of workload would rather have every first CPU access for a range to use device memory. So no my code does not handle that and it is pointless for it as CPU can not access device memory for me. That said nothing forbid to add support for ZONE_DEVICE with mbind() like syscall. Thought my personnal preference would still be to avoid use of such generic syscall but have device driver set allocation policy through its own userspace API (device driver could reuse internal of mbind() to achieve the end result). I am not saying that eveything you want to do is doable now with HMM but, nothing preclude achieving what you want to achieve using ZONE_DEVICE. I really don't think any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and can be reuse with device memory. Each device is so different from the other that i don't believe in a one API fit all. The drm GPU subsystem of the kernel is a testimony of how little can be share when it comes to GPU. 
The only common code is modesetting. Everything that deals with how to use GPU to compute stuff is per device and most of the logic is in userspace. So i do not see any commonality that could be abstracted at syscall level. I would rather let device driver stack (kernel and userspace) take such decision and have the higher level API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of them. Programmer target those high level API and they intend to use the mechanism each offer to manage memory and memory placement. I would say forcing them to use a second linux specific API to achieve the latter is wrong, at lest for now. So in the end if the mbind() syscall is done by the userspace side of the device driver then why not
Re: [RFC 0/8] Define coherent device memory node
On 10/26/2016 09:32 PM, Jerome Glisse wrote: > On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote: >> On 10/26/2016 12:22 AM, Jerome Glisse wrote: >>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: Jerome Glisse writes: > On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: >> Jerome Glisse writes: >>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > > [...] > >>> You can take a look at hmm-v13 if you want to see how i do non LRU page >>> migration. While i put most of the migration code inside hmm_migrate.c >>> it >>> could easily be move to migrate.c without hmm_ prefix. >>> >>> There is 2 missing piece with existing migrate code. First is to put >>> memory >>> allocation for destination under control of who call the migrate code. >>> Second >>> is to allow offloading the copy operation to device (ie not use the CPU >>> to >>> copy data). >>> >>> I believe same requirement also make sense for platform you are >>> targeting. >>> Thus same code can be use. >>> >>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 >>> >>> I haven't posted this patchset yet because we are doing some >>> modifications >>> to the device driver API to accomodate some new features. But the >>> ZONE_DEVICE >>> changes and the overall migration code will stay the same more or less >>> (i have >>> patches that move it to migrate.c and share more code with existing >>> migrate >>> code). >>> >>> If you think i missed anything about lru and page cache please point it >>> to >>> me. Because when i audited code for that i didn't see any road block >>> with >>> the few fs i was looking at (ext4, xfs and core page cache code). >>> >> >> The other restriction around ZONE_DEVICE is, it is not a managed zone. >> That prevents any direct allocation from coherent device by application. >> ie, we would like to force allocation from coherent device using >> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ? > > To achieve this we rely on device fault code path ie when device take a > page fault > with help of HMM it will use existing memory if any for fault address but > if CPU > page table is empty (and it is not file back vma because of readback) > then device > can directly allocate device memory and HMM will update CPU page table to > point to > newly allocated device memory. > That is ok if the device touch the page first. What if we want the allocation touched first by cpu to come from GPU ?. Should we always depend on GPU driver to migrate such pages later from system RAM to GPU memory ? >>> >>> I am not sure what kind of workload would rather have every first CPU >>> access for >>> a range to use device memory. So no my code does not handle that and it is >>> pointless >>> for it as CPU can not access device memory for me. >>> >>> That said nothing forbid to add support for ZONE_DEVICE with mbind() like >>> syscall. >>> Thought my personnal preference would still be to avoid use of such generic >>> syscall >>> but have device driver set allocation policy through its own userspace API >>> (device >>> driver could reuse internal of mbind() to achieve the end result). >>> >>> I am not saying that eveything you want to do is doable now with HMM but, >>> nothing >>> preclude achieving what you want to achieve using ZONE_DEVICE. I really >>> don't think >>> any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and >>> can be reuse >>> with device memory. 
>>> >>> Each device is so different from the other that i don't believe in a one >>> API fit all. >>> The drm GPU subsystem of the kernel is a testimony of how little can be >>> share when it >>> comes to GPU. The only common code is modesetting. Everything that deals >>> with how to >>> use GPU to compute stuff is per device and most of the logic is in >>> userspace. So i do >>> not see any commonality that could be abstracted at syscall level. I would >>> rather let >>> device driver stack (kernel and userspace) take such decision and have the >>> higher level >>> API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of >>> them. >>> Programmer target those high level API and they intend to use the mechanism >>> each offer >>> to manage memory and memory placement. I would say forcing them to use a >>> second linux >>> specific API to achieve the latter is wrong, at lest for now. >>> >>> So in the end if the mbind() syscall is done by the userspace side of the >>> device driver >>> then why not just having the device driver communicate this through its own >>> kernel >>> API (which can be much more expressive than what standardize syscall >>> offers). I
Re: [RFC 0/8] Define coherent device memory node
On Wed, Oct 26, 2016 at 06:26:02PM +0530, Anshuman Khandual wrote: > On 10/26/2016 12:22 AM, Jerome Glisse wrote: > > On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: > >> Jerome Glisse writes: > >> > >>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: > Jerome Glisse writes: > > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > >>> > >>> [...] > >>> > > You can take a look at hmm-v13 if you want to see how i do non LRU page > > migration. While i put most of the migration code inside hmm_migrate.c > > it > > could easily be move to migrate.c without hmm_ prefix. > > > > There is 2 missing piece with existing migrate code. First is to put > > memory > > allocation for destination under control of who call the migrate code. > > Second > > is to allow offloading the copy operation to device (ie not use the CPU > > to > > copy data). > > > > I believe same requirement also make sense for platform you are > > targeting. > > Thus same code can be use. > > > > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 > > > > I haven't posted this patchset yet because we are doing some > > modifications > > to the device driver API to accomodate some new features. But the > > ZONE_DEVICE > > changes and the overall migration code will stay the same more or less > > (i have > > patches that move it to migrate.c and share more code with existing > > migrate > > code). > > > > If you think i missed anything about lru and page cache please point it > > to > > me. Because when i audited code for that i didn't see any road block > > with > > the few fs i was looking at (ext4, xfs and core page cache code). > > > > The other restriction around ZONE_DEVICE is, it is not a managed zone. > That prevents any direct allocation from coherent device by application. > ie, we would like to force allocation from coherent device using > interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ? > >>> > >>> To achieve this we rely on device fault code path ie when device take a > >>> page fault > >>> with help of HMM it will use existing memory if any for fault address but > >>> if CPU > >>> page table is empty (and it is not file back vma because of readback) > >>> then device > >>> can directly allocate device memory and HMM will update CPU page table to > >>> point to > >>> newly allocated device memory. > >>> > >> > >> That is ok if the device touch the page first. What if we want the > >> allocation touched first by cpu to come from GPU ?. Should we always > >> depend on GPU driver to migrate such pages later from system RAM to GPU > >> memory ? > >> > > > > I am not sure what kind of workload would rather have every first CPU > > access for > > a range to use device memory. So no my code does not handle that and it is > > pointless > > for it as CPU can not access device memory for me. > > If the user space application can explicitly allocate device memory directly, > we > can save one round of migration when the device start accessing it. But then > one > can argue what problem statement the device would work on on a freshly > allocated > memory which has not been accessed by CPU for loading the data yet. Will look > into > this scenario in more detail. > > > > > That said nothing forbid to add support for ZONE_DEVICE with mbind() like > > syscall. 
> > Thought my personnal preference would still be to avoid use of such generic > > syscall > > but have device driver set allocation policy through its own userspace API > > (device > > driver could reuse internal of mbind() to achieve the end result). > > Okay, the basic premise of CDM node is to have a LRU based design where we can > avoid use of driver specific user space memory management code altogether. And i think it is not a good fit, at least not for GPU. GPU device driver have a big chunk of code dedicated to memory management. You can look at drm/ttm and at userspace (most is in userspace). It is not because we want to reinvent the wheel it is because they are some unique constraint. > > > > I am not saying that eveything you want to do is doable now with HMM but, > > nothing > > preclude achieving what you want to achieve using ZONE_DEVICE. I really > > don't think > > any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and > > can be reuse > > with device memory. > > With CDM node based design, the expectation is to get all/maximum core VM > mechanism > working so that, driver has to do less device specific optimization. I think this is a bad idea, today, for GPU but i might be wrong. > > > > Each device is so different from the other that i don't believe in a one > > API fit all. > > Right, so as I had mentioned in the cover letter, > pglist_data->coherent
Re: [RFC 0/8] Define coherent device memory node
On Wed, Oct 26, 2016 at 04:39:19PM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse writes:
>
> > On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote:
> >> Jerome Glisse writes:
> >>
> >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >> >
> >> I looked at the hmm-v13 w.r.t migration and I guess some form of device
> >> callback/acceleration during migration is something we should definitely
> >> have. I still haven't figured out how non addressable and coherent device
> >> memory can fit together there. I was waiting for the page cache
> >> migration support to be pushed to the repository before I start looking
> >> at this closely.
> >
> > The page cache migration does not touch the migrate code path. My issue with
> > page cache is writeback. The only difference with existing migrate code is
> > the refcount check for ZONE_DEVICE pages. Everything else is the same.
>
> What about the radix tree ? does the file system migrate_page callback handle
> replacing a normal page with a ZONE_DEVICE page/exceptional entries ?

It uses the exact same existing code (from mm/migrate.c), so yes the radix tree
is updated and buffer_heads are migrated.

Jérôme
Re: [RFC 0/8] Define coherent device memory node
On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote: > On 10/26/2016 12:22 AM, Jerome Glisse wrote: > > On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: > >> Jerome Glisse writes: > >> > >>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: > Jerome Glisse writes: > > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > >>> > >>> [...] > >>> > > You can take a look at hmm-v13 if you want to see how i do non LRU page > > migration. While i put most of the migration code inside hmm_migrate.c > > it > > could easily be move to migrate.c without hmm_ prefix. > > > > There is 2 missing piece with existing migrate code. First is to put > > memory > > allocation for destination under control of who call the migrate code. > > Second > > is to allow offloading the copy operation to device (ie not use the CPU > > to > > copy data). > > > > I believe same requirement also make sense for platform you are > > targeting. > > Thus same code can be use. > > > > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 > > > > I haven't posted this patchset yet because we are doing some > > modifications > > to the device driver API to accomodate some new features. But the > > ZONE_DEVICE > > changes and the overall migration code will stay the same more or less > > (i have > > patches that move it to migrate.c and share more code with existing > > migrate > > code). > > > > If you think i missed anything about lru and page cache please point it > > to > > me. Because when i audited code for that i didn't see any road block > > with > > the few fs i was looking at (ext4, xfs and core page cache code). > > > > The other restriction around ZONE_DEVICE is, it is not a managed zone. > That prevents any direct allocation from coherent device by application. > ie, we would like to force allocation from coherent device using > interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ? > >>> > >>> To achieve this we rely on device fault code path ie when device take a > >>> page fault > >>> with help of HMM it will use existing memory if any for fault address but > >>> if CPU > >>> page table is empty (and it is not file back vma because of readback) > >>> then device > >>> can directly allocate device memory and HMM will update CPU page table to > >>> point to > >>> newly allocated device memory. > >>> > >> > >> That is ok if the device touch the page first. What if we want the > >> allocation touched first by cpu to come from GPU ?. Should we always > >> depend on GPU driver to migrate such pages later from system RAM to GPU > >> memory ? > >> > > > > I am not sure what kind of workload would rather have every first CPU > > access for > > a range to use device memory. So no my code does not handle that and it is > > pointless > > for it as CPU can not access device memory for me. > > > > That said nothing forbid to add support for ZONE_DEVICE with mbind() like > > syscall. > > Thought my personnal preference would still be to avoid use of such generic > > syscall > > but have device driver set allocation policy through its own userspace API > > (device > > driver could reuse internal of mbind() to achieve the end result). > > > > I am not saying that eveything you want to do is doable now with HMM but, > > nothing > > preclude achieving what you want to achieve using ZONE_DEVICE. I really > > don't think > > any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and > > can be reuse > > with device memory. 
> > > > Each device is so different from the other that i don't believe in a one > > API fit all. > > The drm GPU subsystem of the kernel is a testimony of how little can be > > share when it > > comes to GPU. The only common code is modesetting. Everything that deals > > with how to > > use GPU to compute stuff is per device and most of the logic is in > > userspace. So i do > > not see any commonality that could be abstracted at syscall level. I would > > rather let > > device driver stack (kernel and userspace) take such decision and have the > > higher level > > API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of > > them. > > Programmer target those high level API and they intend to use the mechanism > > each offer > > to manage memory and memory placement. I would say forcing them to use a > > second linux > > specific API to achieve the latter is wrong, at lest for now. > > > > So in the end if the mbind() syscall is done by the userspace side of the > > device driver > > then why not just having the device driver communicate this through its own > > kernel > > API (which can be much more expressive than what standardize syscall > > offers). I would > > rather avoid making change to any
Re: [RFC 0/8] Define coherent device memory node
On 10/26/2016 12:22 AM, Jerome Glisse wrote: > On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: >> Jerome Glisse writes: >> >>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: Jerome Glisse writes: > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: >>> >>> [...] >>> > You can take a look at hmm-v13 if you want to see how i do non LRU page > migration. While i put most of the migration code inside hmm_migrate.c it > could easily be move to migrate.c without hmm_ prefix. > > There is 2 missing piece with existing migrate code. First is to put > memory > allocation for destination under control of who call the migrate code. > Second > is to allow offloading the copy operation to device (ie not use the CPU to > copy data). > > I believe same requirement also make sense for platform you are targeting. > Thus same code can be use. > > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 > > I haven't posted this patchset yet because we are doing some modifications > to the device driver API to accomodate some new features. But the > ZONE_DEVICE > changes and the overall migration code will stay the same more or less (i > have > patches that move it to migrate.c and share more code with existing > migrate > code). > > If you think i missed anything about lru and page cache please point it to > me. Because when i audited code for that i didn't see any road block with > the few fs i was looking at (ext4, xfs and core page cache code). > The other restriction around ZONE_DEVICE is, it is not a managed zone. That prevents any direct allocation from coherent device by application. ie, we would like to force allocation from coherent device using interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ? >>> >>> To achieve this we rely on device fault code path ie when device take a >>> page fault >>> with help of HMM it will use existing memory if any for fault address but >>> if CPU >>> page table is empty (and it is not file back vma because of readback) then >>> device >>> can directly allocate device memory and HMM will update CPU page table to >>> point to >>> newly allocated device memory. >>> >> >> That is ok if the device touch the page first. What if we want the >> allocation touched first by cpu to come from GPU ?. Should we always >> depend on GPU driver to migrate such pages later from system RAM to GPU >> memory ? >> > > I am not sure what kind of workload would rather have every first CPU access > for > a range to use device memory. So no my code does not handle that and it is > pointless > for it as CPU can not access device memory for me. If the user space application can explicitly allocate device memory directly, we can save one round of migration when the device start accessing it. But then one can argue what problem statement the device would work on on a freshly allocated memory which has not been accessed by CPU for loading the data yet. Will look into this scenario in more detail. > > That said nothing forbid to add support for ZONE_DEVICE with mbind() like > syscall. > Thought my personnal preference would still be to avoid use of such generic > syscall > but have device driver set allocation policy through its own userspace API > (device > driver could reuse internal of mbind() to achieve the end result). Okay, the basic premise of CDM node is to have a LRU based design where we can avoid use of driver specific user space memory management code altogether. 
> > I am not saying that eveything you want to do is doable now with HMM but, > nothing > preclude achieving what you want to achieve using ZONE_DEVICE. I really don't > think > any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and > can be reuse > with device memory. With CDM node based design, the expectation is to get all/maximum core VM mechanism working so that, driver has to do less device specific optimization. > > Each device is so different from the other that i don't believe in a one API > fit all. Right, so as I had mentioned in the cover letter, pglist_data->coherent_device actually can become a bit mask indicating the type of coherent device the node is and that can be used to implement multiple types of requirement in core mm for various kinds of devices in the future. > The drm GPU subsystem of the kernel is a testimony of how little can be share > when it > comes to GPU. The only common code is modesetting. Everything that deals with > how to > use GPU to compute stuff is per device and most of the logic is in userspace. > So i do Whats the basic reason which prevents such code/functionality sharing ? > not see any commonality that could be abstracted at syscall level. I would > rather let > device driver stack (kernel and userspace) take
Re: [RFC 0/8] Define coherent device memory node
On 10/26/2016 12:22 AM, Jerome Glisse wrote: > On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: >> Jerome Glisse writes: >> >>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: Jerome Glisse writes: > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: >>> >>> [...] >>> > You can take a look at hmm-v13 if you want to see how i do non LRU page > migration. While i put most of the migration code inside hmm_migrate.c it > could easily be move to migrate.c without hmm_ prefix. > > There is 2 missing piece with existing migrate code. First is to put > memory > allocation for destination under control of who call the migrate code. > Second > is to allow offloading the copy operation to device (ie not use the CPU to > copy data). > > I believe same requirement also make sense for platform you are targeting. > Thus same code can be use. > > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 > > I haven't posted this patchset yet because we are doing some modifications > to the device driver API to accomodate some new features. But the > ZONE_DEVICE > changes and the overall migration code will stay the same more or less (i > have > patches that move it to migrate.c and share more code with existing > migrate > code). > > If you think i missed anything about lru and page cache please point it to > me. Because when i audited code for that i didn't see any road block with > the few fs i was looking at (ext4, xfs and core page cache code). > The other restriction around ZONE_DEVICE is, it is not a managed zone. That prevents any direct allocation from coherent device by application. ie, we would like to force allocation from coherent device using interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ? >>> >>> To achieve this we rely on device fault code path ie when device take a >>> page fault >>> with help of HMM it will use existing memory if any for fault address but >>> if CPU >>> page table is empty (and it is not file back vma because of readback) then >>> device >>> can directly allocate device memory and HMM will update CPU page table to >>> point to >>> newly allocated device memory. >>> >> >> That is ok if the device touch the page first. What if we want the >> allocation touched first by cpu to come from GPU ?. Should we always >> depend on GPU driver to migrate such pages later from system RAM to GPU >> memory ? >> > > I am not sure what kind of workload would rather have every first CPU access > for > a range to use device memory. So no my code does not handle that and it is > pointless > for it as CPU can not access device memory for me. > > That said nothing forbid to add support for ZONE_DEVICE with mbind() like > syscall. > Thought my personnal preference would still be to avoid use of such generic > syscall > but have device driver set allocation policy through its own userspace API > (device > driver could reuse internal of mbind() to achieve the end result). > > I am not saying that eveything you want to do is doable now with HMM but, > nothing > preclude achieving what you want to achieve using ZONE_DEVICE. I really don't > think > any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and > can be reuse > with device memory. > > Each device is so different from the other that i don't believe in a one API > fit all. > The drm GPU subsystem of the kernel is a testimony of how little can be share > when it > comes to GPU. The only common code is modesetting. 
Everything that deals with > how to > use GPU to compute stuff is per device and most of the logic is in userspace. > So i do > not see any commonality that could be abstracted at syscall level. I would > rather let > device driver stack (kernel and userspace) take such decision and have the > higher level > API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of > them. > Programmer target those high level API and they intend to use the mechanism > each offer > to manage memory and memory placement. I would say forcing them to use a > second linux > specific API to achieve the latter is wrong, at lest for now. > > So in the end if the mbind() syscall is done by the userspace side of the > device driver > then why not just having the device driver communicate this through its own > kernel > API (which can be much more expressive than what standardize syscall offers). > I would > rather avoid making change to any syscall for now. > > If latter, down the road, once the userspace ecosystem stabilize, we see that > there > is a good level at which we can abstract memory policy for enough devices > then and > only then it would make sense to either introduce new syscall or grow/modify > existing > one. Right now i fear we could only make bad decisio
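The two missing pieces quoted above (destination allocation under the caller's control, and offloading the copy to the device) can be pictured with a small callback-style interface. The sketch below is hypothetical: the names cdm_migrate_ops and cdm_migrate_range are invented for illustration and are not the hmm_migrate API from hmm-v13.

/*
 * Hypothetical sketch -- not the hmm-v13 interface.  It only illustrates
 * the two points above: the caller supplies the destination allocator and
 * the copy routine, which may drive a device DMA engine instead of the CPU.
 */
struct cdm_page;                        /* stand-in for struct page */

struct cdm_migrate_ops {
        /* caller decides where destination pages come from (e.g. device memory) */
        struct cdm_page *(*alloc_dst)(unsigned long addr, void *private);
        /* caller copies src to dst, possibly by programming the device DMA engine */
        int (*copy)(struct cdm_page *dst, struct cdm_page *src, void *private);
        /* undo path if migration of this page has to be aborted */
        void (*free_dst)(struct cdm_page *dst, void *private);
};

/* Migrate [start, end) of the current address space using the caller's ops. */
int cdm_migrate_range(unsigned long start, unsigned long end,
                      const struct cdm_migrate_ops *ops, void *private);

A GPU driver would implement alloc_dst on top of its own memory manager and copy on top of its DMA engine, which is exactly the "not use the CPU to copy data" point.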
Re: [RFC 0/8] Define coherent device memory node
Jerome Glisse writes: > On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote: >> Jerome Glisse writes: >> >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: >> > >> I looked at the hmm-v13 w.r.t migration and I guess some form of device >> callback/acceleration during migration is something we should definitely >> have. I still haven't figured out how non addressable and coherent device >> memory can fit together there. I was waiting for the page cache >> migration support to be pushed to the repository before I start looking >> at this closely. >> > > The page cache migration does not touch the migrate code path. My issue with > page cache is writeback. The only difference with existing migrate code is > refcount check for ZONE_DEVICE page. Everything else is the same. What about the radix tree ? does file system migrate_page callback handle replacing normal page with ZONE_DEVICE page/exceptional entries ? > > For writeback i need to use a bounce page so basicly i am trying to hook > myself > along the ISA bounce infrastructure for bio and i think it is the easiest path > to solve this in my case. > > In your case where block device can also access the device memory you don't > even need to use bounce page for writeback. > -aneesh
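For context on the radix tree question above: the radix tree slot replacement is done by the generic migration helper, not by each filesystem. A kernel-style sketch, assuming the ~4.8 address_space_operations layout, of what a simple filesystem wires up (example_aops is a made-up name):

#include <linux/fs.h>
#include <linux/migrate.h>

static const struct address_space_operations example_aops = {
        /* ... readpage/writepage and friends elided ... */
        .migratepage    = migrate_page, /* generic helper from mm/migrate.c */
};

migrate_page() calls migrate_page_move_mapping(), which is where the mapping's radix tree slot is switched from the old page to the new one and the page cache references are transferred. Whether that path copes with a ZONE_DEVICE destination page (the refcount check Jerome mentions) and with exceptional entries is exactly what is being asked here.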
Re: [RFC 0/8] Define coherent device memory node
On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote: > Jerome Glisse writes: > > > On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: > >> Jerome Glisse writes: > >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > > > > [...] > > > >> > You can take a look at hmm-v13 if you want to see how i do non LRU page > >> > migration. While i put most of the migration code inside hmm_migrate.c it > >> > could easily be move to migrate.c without hmm_ prefix. > >> > > >> > There is 2 missing piece with existing migrate code. First is to put > >> > memory > >> > allocation for destination under control of who call the migrate code. > >> > Second > >> > is to allow offloading the copy operation to device (ie not use the CPU > >> > to > >> > copy data). > >> > > >> > I believe same requirement also make sense for platform you are > >> > targeting. > >> > Thus same code can be use. > >> > > >> > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 > >> > > >> > I haven't posted this patchset yet because we are doing some > >> > modifications > >> > to the device driver API to accomodate some new features. But the > >> > ZONE_DEVICE > >> > changes and the overall migration code will stay the same more or less > >> > (i have > >> > patches that move it to migrate.c and share more code with existing > >> > migrate > >> > code). > >> > > >> > If you think i missed anything about lru and page cache please point it > >> > to > >> > me. Because when i audited code for that i didn't see any road block with > >> > the few fs i was looking at (ext4, xfs and core page cache code). > >> > > >> > >> The other restriction around ZONE_DEVICE is, it is not a managed zone. > >> That prevents any direct allocation from coherent device by application. > >> ie, we would like to force allocation from coherent device using > >> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ? > > > > To achieve this we rely on device fault code path ie when device take a > > page fault > > with help of HMM it will use existing memory if any for fault address but > > if CPU > > page table is empty (and it is not file back vma because of readback) then > > device > > can directly allocate device memory and HMM will update CPU page table to > > point to > > newly allocated device memory. > > > > That is ok if the device touch the page first. What if we want the > allocation touched first by cpu to come from GPU ?. Should we always > depend on GPU driver to migrate such pages later from system RAM to GPU > memory ? > I am not sure what kind of workload would rather have every first CPU access for a range to use device memory. So no my code does not handle that and it is pointless for it as CPU can not access device memory for me. That said nothing forbid to add support for ZONE_DEVICE with mbind() like syscall. Thought my personnal preference would still be to avoid use of such generic syscall but have device driver set allocation policy through its own userspace API (device driver could reuse internal of mbind() to achieve the end result). I am not saying that eveything you want to do is doable now with HMM but, nothing preclude achieving what you want to achieve using ZONE_DEVICE. I really don't think any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and can be reuse with device memory. Each device is so different from the other that i don't believe in a one API fit all. 
The drm GPU subsystem of the kernel is a testament to how little can be shared when it comes to GPUs. The only common code is modesetting. Everything that deals with how to use the GPU to compute stuff is per device and most of the logic is in userspace. So i do not see any commonality that could be abstracted at the syscall level. I would rather let the device driver stack (kernel and userspace) take such decisions and have the higher level APIs (OpenCL, Cuda, C++17, ...) expose something that makes sense for each of them. Programmers target those high level APIs and they intend to use the mechanisms each offers to manage memory and memory placement. I would say forcing them to use a second Linux-specific API to achieve the latter is wrong, at least for now. So in the end, if the mbind() syscall is done by the userspace side of the device driver, then why not just have the device driver communicate this through its own kernel API (which can be much more expressive than what a standardized syscall offers). I would rather avoid making changes to any syscall for now. If later, down the road, once the userspace ecosystem stabilizes, we see that there is a good level at which we can abstract memory policy for enough devices, then and only then would it make sense to either introduce a new syscall or grow/modify an existing one. Right now i fear we could only make bad decisions that we would regret down the road. I think we can achieve memory device support with the minimum amount
Re: [RFC 0/8] Define coherent device memory node
Jerome Glisse writes: > On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: >> Jerome Glisse writes: >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > > [...] > >> > You can take a look at hmm-v13 if you want to see how i do non LRU page >> > migration. While i put most of the migration code inside hmm_migrate.c it >> > could easily be move to migrate.c without hmm_ prefix. >> > >> > There is 2 missing piece with existing migrate code. First is to put memory >> > allocation for destination under control of who call the migrate code. >> > Second >> > is to allow offloading the copy operation to device (ie not use the CPU to >> > copy data). >> > >> > I believe same requirement also make sense for platform you are targeting. >> > Thus same code can be use. >> > >> > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 >> > >> > I haven't posted this patchset yet because we are doing some modifications >> > to the device driver API to accomodate some new features. But the >> > ZONE_DEVICE >> > changes and the overall migration code will stay the same more or less (i >> > have >> > patches that move it to migrate.c and share more code with existing migrate >> > code). >> > >> > If you think i missed anything about lru and page cache please point it to >> > me. Because when i audited code for that i didn't see any road block with >> > the few fs i was looking at (ext4, xfs and core page cache code). >> > >> >> The other restriction around ZONE_DEVICE is, it is not a managed zone. >> That prevents any direct allocation from coherent device by application. >> ie, we would like to force allocation from coherent device using >> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ? > > To achieve this we rely on device fault code path ie when device take a page > fault > with help of HMM it will use existing memory if any for fault address but if > CPU > page table is empty (and it is not file back vma because of readback) then > device > can directly allocate device memory and HMM will update CPU page table to > point to > newly allocated device memory. > That is ok if the device touch the page first. What if we want the allocation touched first by cpu to come from GPU ?. Should we always depend on GPU driver to migrate such pages later from system RAM to GPU memory ? -aneesh
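For reference, the mbind(MPOL_BIND) style of binding being discussed would look roughly like the userspace sketch below. Node 2 standing in for the coherent device memory node is an assumed example, and whether a ZONE_DEVICE backed node could ever satisfy such a policy is precisely the open question.

#include <numaif.h>             /* mbind(), MPOL_BIND; link with -lnuma */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
        size_t len = 16UL << 20;                /* 16MB */
        unsigned long nodemask = 1UL << 2;      /* assumed CDM node id 2 */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
                return 1;
        /* Ask that faults on [buf, buf + len) be satisfied from node 2 only. */
        if (mbind(buf, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0)) {
                perror("mbind");
                return 1;
        }
        ((char *)buf)[0] = 1;   /* first CPU touch now allocates on the bound node */
        return 0;
}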
Re: [RFC 0/8] Define coherent device memory node
On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote: > Jerome Glisse writes: > > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: [...] > > You can take a look at hmm-v13 if you want to see how i do non LRU page > > migration. While i put most of the migration code inside hmm_migrate.c it > > could easily be move to migrate.c without hmm_ prefix. > > > > There is 2 missing piece with existing migrate code. First is to put memory > > allocation for destination under control of who call the migrate code. > > Second > > is to allow offloading the copy operation to device (ie not use the CPU to > > copy data). > > > > I believe same requirement also make sense for platform you are targeting. > > Thus same code can be use. > > > > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13 > > > > I haven't posted this patchset yet because we are doing some modifications > > to the device driver API to accomodate some new features. But the > > ZONE_DEVICE > > changes and the overall migration code will stay the same more or less (i > > have > > patches that move it to migrate.c and share more code with existing migrate > > code). > > > > If you think i missed anything about lru and page cache please point it to > > me. Because when i audited code for that i didn't see any road block with > > the few fs i was looking at (ext4, xfs and core page cache code). > > > > The other restriction around ZONE_DEVICE is, it is not a managed zone. > That prevents any direct allocation from coherent device by application. > ie, we would like to force allocation from coherent device using > interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ? To achieve this we rely on device fault code path ie when device take a page fault with help of HMM it will use existing memory if any for fault address but if CPU page table is empty (and it is not file back vma because of readback) then device can directly allocate device memory and HMM will update CPU page table to point to newly allocated device memory. So in fact i am not using existing kernel API to achieve this but the whole policy of where to allocate and what to allocate is under device driver responsability and device driver leverage its existing userspace API to get proper hint/direction from the application. Device memory is really a special case in my view, it only make sense to use it if memory is actively access by device and only way device access memory is when it is program to do so through the device driver API. There is nothing such as GPU threads in the kernel and there is no way to spawn or move work thread to GPU. This are specialize device and they require special per device API. So in my view using existing kernel API such as mbind() is counter productive. You might have buggy software that will mbind their memory to device and never use the device which lead to device memory being wasted for a process that never use the device. So my opinion is that you should not try to use existing kernel API to get policy information from userspace but let the device driver gather such policy through its own private API. Cheers, Jérôme
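Jerome's alternative of having the driver gather placement policy through its own userspace API could look something like the sketch below. The ioctl number, structure and names (example_gpu_place_hint, EXAMPLE_GPU_IOCTL_PLACE) are entirely made up; the point is only to contrast a driver-private hint with a generic mbind() call.

#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <stdint.h>

/* Entirely hypothetical driver-private placement hint. */
struct example_gpu_place_hint {
        uint64_t addr;          /* start of the user range */
        uint64_t len;           /* length in bytes */
        uint32_t flags;         /* e.g. "prefer device memory" -- made up */
        uint32_t pad;
};

#define EXAMPLE_GPU_IOCTL_PLACE _IOW('E', 0x42, struct example_gpu_place_hint)

/* Tell the driver, not the core VM, where this range should live; the driver
 * can then allocate device memory on its next device page fault for it. */
static int gpu_hint_place(int gpu_fd, void *addr, uint64_t len, uint32_t flags)
{
        struct example_gpu_place_hint hint = {
                .addr = (uintptr_t)addr, .len = len, .flags = flags,
        };

        return ioctl(gpu_fd, EXAMPLE_GPU_IOCTL_PLACE, &hint);
}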
Re: [RFC 0/8] Define coherent device memory node
On Tue, Oct 25, 2016 at 11:07:39PM +1100, Balbir Singh wrote: > On 25/10/16 04:09, Jerome Glisse wrote: > > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > > > >> [...] > > > >>Core kernel memory features like reclamation, evictions etc. might > >> need to be restricted or modified on the coherent device memory node as > >> they can be performance limiting. The RFC does not propose anything on this > >> yet but it can be looked into later on. For now it just disables Auto NUMA > >> for any VMA which has coherent device memory. > >> > >>Seamless integration of coherent device memory with system memory > >> will enable various other features, some of which can be listed as follows. > >> > >>a. Seamless migrations between system RAM and the coherent memory > >>b. Will have asynchronous and high throughput migrations > >>c. Be able to allocate huge order pages from these memory regions > >>d. Restrict allocations to a large extent to the tasks using the > >> device for workload acceleration > >> > >>Before concluding, will look into the reasons why the existing > >> solutions don't work. There are two basic requirements which have to be > >> satisfies before the coherent device memory can be integrated with core > >> kernel seamlessly. > >> > >>a. PFN must have struct page > >>b. Struct page must able to be inside standard LRU lists > >> > >>The above two basic requirements discard the existing method of > >> device memory representation approaches like these which then requires the > >> need of creating a new framework. > > > > I do not believe the LRU list is a hard requirement, yes when faulting in > > a page inside the page cache it assumes it needs to be added to lru list. > > But i think this can easily be work around. > > > > In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU > > (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...) > > so in my case a file back page must always be spawn first from a regular > > page and once read from disk then i can migrate to GPU page. > > > > I've not seen the HMM patchset, but read from disk will go to ZONE_DEVICE? > Then get migrated? Because in my case device memory is not accessible by anything except the device (not entirely true but for sake of design it is) any page read from disk will be first read into regular page (from regular system memory). It is only once it is uptodate and in page cache that it can be migrated to a ZONE_DEVICE page. So read from disk use an intermediary page. Write back is kind of the same i plan on using a bounce page by leveraging existing bio bounce infrastructure. Cheers, Jérôme
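For the coherent case (device memory addressable by the CPU and on the LRU), the existing lru-based migrate path already lets the caller choose where destination pages come from; what it does not offer is handing the copy itself to the device. A kernel-style sketch, assuming the ~4.8 signatures of migrate_pages() and new_page_t, with invented cdm_* names:

#include <linux/migrate.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* Allocate the destination page on the (driver-chosen) device node. */
static struct page *cdm_new_page(struct page *page, unsigned long private,
                                 int **result)
{
        int nid = (int)private;         /* NUMA node id of the CDM node */

        return alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
}

static void cdm_free_page(struct page *page, unsigned long private)
{
        __free_page(page);
}

/* 'pages' holds isolated, uptodate LRU pages to be moved to node 'nid'. */
static int cdm_migrate_list(struct list_head *pages, int nid)
{
        return migrate_pages(pages, cdm_new_page, cdm_free_page,
                             nid, MIGRATE_SYNC, MR_SYSCALL);
}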
Re: [RFC 0/8] Define coherent device memory node
On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote: > Jerome Glisse writes: > > > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > > > >> [...] > > > >>Core kernel memory features like reclamation, evictions etc. might > >> need to be restricted or modified on the coherent device memory node as > >> they can be performance limiting. The RFC does not propose anything on this > >> yet but it can be looked into later on. For now it just disables Auto NUMA > >> for any VMA which has coherent device memory. > >> > >>Seamless integration of coherent device memory with system memory > >> will enable various other features, some of which can be listed as follows. > >> > >>a. Seamless migrations between system RAM and the coherent memory > >>b. Will have asynchronous and high throughput migrations > >>c. Be able to allocate huge order pages from these memory regions > >>d. Restrict allocations to a large extent to the tasks using the > >> device for workload acceleration > >> > >>Before concluding, will look into the reasons why the existing > >> solutions don't work. There are two basic requirements which have to be > >> satisfies before the coherent device memory can be integrated with core > >> kernel seamlessly. > >> > >>a. PFN must have struct page > >>b. Struct page must able to be inside standard LRU lists > >> > >>The above two basic requirements discard the existing method of > >> device memory representation approaches like these which then requires the > >> need of creating a new framework. > > > > I do not believe the LRU list is a hard requirement, yes when faulting in > > a page inside the page cache it assumes it needs to be added to lru list. > > But i think this can easily be work around. > > > > In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU > > (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...) > > so in my case a file back page must always be spawn first from a regular > > page and once read from disk then i can migrate to GPU page. > > > > So if you accept this intermediary step you can easily use ZONE_DEVICE for > > device memory. This way no lru, no complex dance to make the memory out of > > reach from regular memory allocator. > > One of the reason to look at this as a NUMA node is to allow things like > over-commit of coherent device memory. The pages backing CDM being part of > lru and considering the coherent device as a numa node makes that really > simpler (we can run kswapd for that node). I am not convince that kswapd is what you want for overcommit, for HMM i leave overcommit to device driver and they seem quite happy about handling that themself. Only the device driver have enough information on what is worth evicting or what need to be evicted. > > I think we would have much to gain if we pool our effort on a single common > > solution for device memory. In my case the device memory is not accessible > > by the CPU (because PCIE restrictions), in your case it is. Thus the only > > difference is that in my case it can not be map inside the CPU page table > > while in yours it can. > > IMHO, we should be able to share the HMM migration approach. We > definitely won't need the mirror page table part. That is one of the > reson I requested HMM mirror page table to be a seperate patchset. They will need to share one thing, that is hmm_pfn_t which is a special pfn type in which i store HMM and migrate specific flag for migration. 
Because i can not use the struct list_head lru of struct page i have to do migration using array of pfn and i need to keep some flags per page during migration. So i share the same type hmm_pfn_t btw mirror and migrate code. But that's pretty small and it can be factor out of HMM, i can also just use pfn_t and add flag i need their. > > > > >> > >> (1) Traditional ioremap > >> > >>a. Memory is mapped into kernel (linear and virtual) and user space > >>b. These PFNs do not have struct pages associated with it > >>c. These special PFNs are marked with special flags inside the PTE > >>d. Cannot participate in core VM functions much because of this > >>e. Cannot do easy user space migrations > >> > >> (2) Zone ZONE_DEVICE > >> > >>a. Memory is mapped into kernel and user space > >>b. PFNs do have struct pages associated with it > >>c. These struct pages are allocated inside it's own memory range > >>d. Unfortunately the struct page's union containing LRU has been > >> used for struct dev_pagemap pointer > >>e. Hence it cannot be part of any LRU (like Page cache) > >>f. Hence file cached mapping cannot reside on these PFNs > >>g. Cannot do easy migrations > >> > >>I had also explored non LRU representation of this coherent device > >> memory where the integration with system RAM in the core VM is limited only > >> to the following functions. Not being insid
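The array-of-pfns-with-flags idea mentioned above can be pictured with a tiny encoding like the one below. This is an invented illustration of the concept only, not the actual hmm_pfn_t definition from hmm-v13: the pfn is shifted up a few bits so per-page migration state can ride along in the low bits of the same array entry.

#include <stdint.h>
#include <stdbool.h>

typedef uint64_t cdm_pfn_t;             /* invented name, in the spirit of hmm_pfn_t */

#define CDM_PFN_VALID   (1ULL << 0)     /* entry holds a real pfn */
#define CDM_PFN_MIGRATE (1ULL << 1)     /* page selected for migration */
#define CDM_PFN_LOCKED  (1ULL << 2)     /* page locked for the copy phase */
#define CDM_PFN_SHIFT   3               /* pfn lives above the flag bits */

static inline cdm_pfn_t cdm_pfn_pack(uint64_t pfn, uint64_t flags)
{
        return (pfn << CDM_PFN_SHIFT) | flags;
}

static inline uint64_t cdm_pfn_to_pfn(cdm_pfn_t entry)
{
        return entry >> CDM_PFN_SHIFT;
}

static inline bool cdm_pfn_needs_copy(cdm_pfn_t entry)
{
        return (entry & (CDM_PFN_VALID | CDM_PFN_MIGRATE)) ==
               (CDM_PFN_VALID | CDM_PFN_MIGRATE);
}

The migrate code then walks one such array per range instead of a list_head of pages, which is why page->lru is not needed.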
Re: [RFC 0/8] Define coherent device memory node
On 25/10/16 04:09, Jerome Glisse wrote: > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > >> [...] > >> Core kernel memory features like reclamation, evictions etc. might >> need to be restricted or modified on the coherent device memory node as >> they can be performance limiting. The RFC does not propose anything on this >> yet but it can be looked into later on. For now it just disables Auto NUMA >> for any VMA which has coherent device memory. >> >> Seamless integration of coherent device memory with system memory >> will enable various other features, some of which can be listed as follows. >> >> a. Seamless migrations between system RAM and the coherent memory >> b. Will have asynchronous and high throughput migrations >> c. Be able to allocate huge order pages from these memory regions >> d. Restrict allocations to a large extent to the tasks using the >> device for workload acceleration >> >> Before concluding, will look into the reasons why the existing >> solutions don't work. There are two basic requirements which have to be >> satisfies before the coherent device memory can be integrated with core >> kernel seamlessly. >> >> a. PFN must have struct page >> b. Struct page must able to be inside standard LRU lists >> >> The above two basic requirements discard the existing method of >> device memory representation approaches like these which then requires the >> need of creating a new framework. > > I do not believe the LRU list is a hard requirement, yes when faulting in > a page inside the page cache it assumes it needs to be added to lru list. > But i think this can easily be work around. > > In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU > (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...) > so in my case a file back page must always be spawn first from a regular > page and once read from disk then i can migrate to GPU page. > I've not seen the HMM patchset, but read from disk will go to ZONE_DEVICE? Then get migrated? > So if you accept this intermediary step you can easily use ZONE_DEVICE for > device memory. This way no lru, no complex dance to make the memory out of > reach from regular memory allocator. > > I think we would have much to gain if we pool our effort on a single common > solution for device memory. In my case the device memory is not accessible > by the CPU (because PCIE restrictions), in your case it is. Thus the only > difference is that in my case it can not be map inside the CPU page table > while in yours it can. > I think thats a good idea to pool our efforts at the same time making progress >> >> (1) Traditional ioremap >> >> a. Memory is mapped into kernel (linear and virtual) and user space >> b. These PFNs do not have struct pages associated with it >> c. These special PFNs are marked with special flags inside the PTE >> d. Cannot participate in core VM functions much because of this >> e. Cannot do easy user space migrations >> >> (2) Zone ZONE_DEVICE >> >> a. Memory is mapped into kernel and user space >> b. PFNs do have struct pages associated with it >> c. These struct pages are allocated inside it's own memory range >> d. Unfortunately the struct page's union containing LRU has been >> used for struct dev_pagemap pointer >> e. Hence it cannot be part of any LRU (like Page cache) >> f. Hence file cached mapping cannot reside on these PFNs >> g. 
Cannot do easy migrations >> >> I had also explored non LRU representation of this coherent device >> memory where the integration with system RAM in the core VM is limited only >> to the following functions. Not being inside LRU is definitely going to >> reduce the scope of tight integration with system RAM. >> >> (1) Migration support between system RAM and coherent memory >> (2) Migration support between various coherent memory nodes >> (3) Isolation of the coherent memory >> (4) Mapping the coherent memory into user space through driver's >> struct vm_operations >> (5) HW poisoning of the coherent memory >> >> Allocating the entire memory of the coherent device node right >> after hot plug into ZONE_MOVABLE (where the memory is already inside the >> buddy system) will still expose a time window where other user space >> allocations can come into the coherent device memory node and prevent the >> intended isolation. So traditional hot plug is not the solution. Hence >> started looking into CMA based non LRU solution but then hit the following >> roadblocks. >> >> (1) CMA does not support hot plugging of new memory node >> a. CMA area needs to be marked during boot before buddy is >> initialized >> b. cma_alloc()/cma_release() can happen on the marked area >> c. Should be able to mark the CMA areas just after memory hot plug >> d. cma_alloc()/cma_release() can happen l
Re: [RFC 0/8] Define coherent device memory node
Jerome Glisse writes: > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > >> [...] > >> Core kernel memory features like reclamation, evictions etc. might >> need to be restricted or modified on the coherent device memory node as >> they can be performance limiting. The RFC does not propose anything on this >> yet but it can be looked into later on. For now it just disables Auto NUMA >> for any VMA which has coherent device memory. >> >> Seamless integration of coherent device memory with system memory >> will enable various other features, some of which can be listed as follows. >> >> a. Seamless migrations between system RAM and the coherent memory >> b. Will have asynchronous and high throughput migrations >> c. Be able to allocate huge order pages from these memory regions >> d. Restrict allocations to a large extent to the tasks using the >> device for workload acceleration >> >> Before concluding, will look into the reasons why the existing >> solutions don't work. There are two basic requirements which have to be >> satisfies before the coherent device memory can be integrated with core >> kernel seamlessly. >> >> a. PFN must have struct page >> b. Struct page must able to be inside standard LRU lists >> >> The above two basic requirements discard the existing method of >> device memory representation approaches like these which then requires the >> need of creating a new framework. > > I do not believe the LRU list is a hard requirement, yes when faulting in > a page inside the page cache it assumes it needs to be added to lru list. > But i think this can easily be work around. > > In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU > (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...) > so in my case a file back page must always be spawn first from a regular > page and once read from disk then i can migrate to GPU page. > > So if you accept this intermediary step you can easily use ZONE_DEVICE for > device memory. This way no lru, no complex dance to make the memory out of > reach from regular memory allocator. > > I think we would have much to gain if we pool our effort on a single common > solution for device memory. In my case the device memory is not accessible > by the CPU (because PCIE restrictions), in your case it is. Thus the only > difference is that in my case it can not be map inside the CPU page table > while in yours it can. > >> >> (1) Traditional ioremap >> >> a. Memory is mapped into kernel (linear and virtual) and user space >> b. These PFNs do not have struct pages associated with it >> c. These special PFNs are marked with special flags inside the PTE >> d. Cannot participate in core VM functions much because of this >> e. Cannot do easy user space migrations >> >> (2) Zone ZONE_DEVICE >> >> a. Memory is mapped into kernel and user space >> b. PFNs do have struct pages associated with it >> c. These struct pages are allocated inside it's own memory range >> d. Unfortunately the struct page's union containing LRU has been >> used for struct dev_pagemap pointer >> e. Hence it cannot be part of any LRU (like Page cache) >> f. Hence file cached mapping cannot reside on these PFNs >> g. Cannot do easy migrations >> >> I had also explored non LRU representation of this coherent device >> memory where the integration with system RAM in the core VM is limited only >> to the following functions. Not being inside LRU is definitely going to >> reduce the scope of tight integration with system RAM. 
>> >> (1) Migration support between system RAM and coherent memory >> (2) Migration support between various coherent memory nodes >> (3) Isolation of the coherent memory >> (4) Mapping the coherent memory into user space through driver's >> struct vm_operations >> (5) HW poisoning of the coherent memory >> >> Allocating the entire memory of the coherent device node right >> after hot plug into ZONE_MOVABLE (where the memory is already inside the >> buddy system) will still expose a time window where other user space >> allocations can come into the coherent device memory node and prevent the >> intended isolation. So traditional hot plug is not the solution. Hence >> started looking into CMA based non LRU solution but then hit the following >> roadblocks. >> >> (1) CMA does not support hot plugging of new memory node >> a. CMA area needs to be marked during boot before buddy is >> initialized >> b. cma_alloc()/cma_release() can happen on the marked area >> c. Should be able to mark the CMA areas just after memory hot plug >> d. cma_alloc()/cma_release() can happen later after the hot plug >> e. This is not currently supported right now >> >> (2) Mapped non LRU migration of pages >> a. Recent work from Michan Kim makes non LRU page migratabl
Re: [RFC 0/8] Define coherent device memory node
Jerome Glisse writes: > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > >> [...] > >> Core kernel memory features like reclamation, evictions etc. might >> need to be restricted or modified on the coherent device memory node as >> they can be performance limiting. The RFC does not propose anything on this >> yet but it can be looked into later on. For now it just disables Auto NUMA >> for any VMA which has coherent device memory. >> >> Seamless integration of coherent device memory with system memory >> will enable various other features, some of which can be listed as follows. >> >> a. Seamless migrations between system RAM and the coherent memory >> b. Will have asynchronous and high throughput migrations >> c. Be able to allocate huge order pages from these memory regions >> d. Restrict allocations to a large extent to the tasks using the >> device for workload acceleration >> >> Before concluding, will look into the reasons why the existing >> solutions don't work. There are two basic requirements which have to be >> satisfies before the coherent device memory can be integrated with core >> kernel seamlessly. >> >> a. PFN must have struct page >> b. Struct page must able to be inside standard LRU lists >> >> The above two basic requirements discard the existing method of >> device memory representation approaches like these which then requires the >> need of creating a new framework. > > I do not believe the LRU list is a hard requirement, yes when faulting in > a page inside the page cache it assumes it needs to be added to lru list. > But i think this can easily be work around. > > In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU > (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...) > so in my case a file back page must always be spawn first from a regular > page and once read from disk then i can migrate to GPU page. > > So if you accept this intermediary step you can easily use ZONE_DEVICE for > device memory. This way no lru, no complex dance to make the memory out of > reach from regular memory allocator. One of the reason to look at this as a NUMA node is to allow things like over-commit of coherent device memory. The pages backing CDM being part of lru and considering the coherent device as a numa node makes that really simpler (we can run kswapd for that node). > > I think we would have much to gain if we pool our effort on a single common > solution for device memory. In my case the device memory is not accessible > by the CPU (because PCIE restrictions), in your case it is. Thus the only > difference is that in my case it can not be map inside the CPU page table > while in yours it can. IMHO, we should be able to share the HMM migration approach. We definitely won't need the mirror page table part. That is one of the reson I requested HMM mirror page table to be a seperate patchset. > >> >> (1) Traditional ioremap >> >> a. Memory is mapped into kernel (linear and virtual) and user space >> b. These PFNs do not have struct pages associated with it >> c. These special PFNs are marked with special flags inside the PTE >> d. Cannot participate in core VM functions much because of this >> e. Cannot do easy user space migrations >> >> (2) Zone ZONE_DEVICE >> >> a. Memory is mapped into kernel and user space >> b. PFNs do have struct pages associated with it >> c. These struct pages are allocated inside it's own memory range >> d. 
Unfortunately the struct page's union containing LRU has been >> used for struct dev_pagemap pointer >> e. Hence it cannot be part of any LRU (like Page cache) >> f. Hence file cached mapping cannot reside on these PFNs >> g. Cannot do easy migrations >> >> I had also explored non LRU representation of this coherent device >> memory where the integration with system RAM in the core VM is limited only >> to the following functions. Not being inside LRU is definitely going to >> reduce the scope of tight integration with system RAM. >> >> (1) Migration support between system RAM and coherent memory >> (2) Migration support between various coherent memory nodes >> (3) Isolation of the coherent memory >> (4) Mapping the coherent memory into user space through driver's >> struct vm_operations >> (5) HW poisoning of the coherent memory >> >> Allocating the entire memory of the coherent device node right >> after hot plug into ZONE_MOVABLE (where the memory is already inside the >> buddy system) will still expose a time window where other user space >> allocations can come into the coherent device memory node and prevent the >> intended isolation. So traditional hot plug is not the solution. Hence >> started looking into CMA based non LRU solution but then hit the following >> roadblocks. >> >> (1) CMA does not support hot plugging of new memory node
Re: [RFC 0/8] Define coherent device memory node
On 10/24/2016 11:32 AM, David Nellans wrote: > On 10/24/2016 01:04 PM, Dave Hansen wrote: >> If you *really* don't want a "cdm" page to be migrated, then why isn't >> that policy set on the VMA in the first place? That would keep "cdm" >> pages from being made non-cdm. And, why would autonuma ever make a >> non-cdm page and migrate it in to cdm? There will be no NUMA access >> faults caused by the devices that are fed to autonuma. >> > Pages are desired to be migrateable, both into (starting cpu zone > movable->cdm) and out of (starting cdm->cpu zone movable) but only > through explicit migration, not via autonuma. OK, and is there a reason that the existing mbind code plus NUMA policies fails to give you this behavior? Does autonuma somehow override strict NUMA binding? > other pages in the same > VMA should still be migrateable between CPU nodes via autonuma however. That's not the way the implementation here works, as I understand it. See the VM_CDM patch and my responses to it. > Its expected a lot of these allocations are going to end up in THPs. > I'm not sure we need to explicitly disallow hugetlbfs support but the > identified use case is definitely via THPs not tlbfs. I think THP and hugetlbfs are implementations, not use cases. :) Is it too hard to support hugetlbfs that we should complicate its code to exclude it from this type of memory? Why?
Re: [RFC 0/8] Define coherent device memory node
On 10/24/2016 01:04 PM, Dave Hansen wrote: On 10/23/2016 09:31 PM, Anshuman Khandual wrote: To achieve seamless integration between system RAM and coherent device memory it must be able to utilize core memory kernel features like anon mapping, file mapping, page cache, driver managed pages, HW poisoning, migrations, reclaim, compaction, etc. So, you need to support all these things, but not autonuma or hugetlbfs? What's the reasoning behind that? If you *really* don't want a "cdm" page to be migrated, then why isn't that policy set on the VMA in the first place? That would keep "cdm" pages from being made non-cdm. And, why would autonuma ever make a non-cdm page and migrate it in to cdm? There will be no NUMA access faults caused by the devices that are fed to autonuma. Pages are desired to be migrateable, both into (starting cpu zone movable->cdm) and out of (starting cdm->cpu zone movable) but only through explicit migration, not via autonuma. other pages in the same VMA should still be migrateable between CPU nodes via autonuma however. Its expected a lot of these allocations are going to end up in THPs. I'm not sure we need to explicitly disallow hugetlbfs support but the identified use case is definitely via THPs not tlbfs.
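The explicit-migration-only model in the exchange above maps naturally onto the existing move_pages(2) interface; a userspace sketch follows, again with node 2 as an assumed coherent device memory node id.

#include <numaif.h>             /* move_pages(), MPOL_MF_MOVE; link with -lnuma */
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
        long psz = sysconf(_SC_PAGESIZE);
        void *buf = mmap(NULL, 4 * psz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        void *pages[4];
        int nodes[4], status[4];
        int i;

        if (buf == MAP_FAILED)
                return 1;
        for (i = 0; i < 4; i++) {
                ((char *)buf)[i * psz] = 1;     /* fault in on a CPU node first */
                pages[i] = (char *)buf + i * psz;
                nodes[i] = 2;                   /* assumed CDM node id */
        }
        /* pid 0 means the calling process; migration happens only because we ask. */
        if (move_pages(0, 4, pages, nodes, status, MPOL_MF_MOVE))
                perror("move_pages");
        for (i = 0; i < 4; i++)
                printf("page %d is now on node %d\n", i, status[i]);
        return 0;
}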
Re: [RFC 0/8] Define coherent device memory node
On 10/23/2016 09:31 PM, Anshuman Khandual wrote: > To achieve seamless integration between system RAM and coherent > device memory it must be able to utilize core memory kernel features like > anon mapping, file mapping, page cache, driver managed pages, HW poisoning, > migrations, reclaim, compaction, etc. So, you need to support all these things, but not autonuma or hugetlbfs? What's the reasoning behind that? If you *really* don't want a "cdm" page to be migrated, then why isn't that policy set on the VMA in the first place? That would keep "cdm" pages from being made non-cdm. And, why would autonuma ever make a non-cdm page and migrate it in to cdm? There will be no NUMA access faults caused by the devices that are fed to autonuma. I'm confused.
Re: [RFC 0/8] Define coherent device memory node
On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote: > [...] > Core kernel memory features like reclamation, evictions etc. might > need to be restricted or modified on the coherent device memory node as > they can be performance limiting. The RFC does not propose anything on this > yet but it can be looked into later on. For now it just disables Auto NUMA > for any VMA which has coherent device memory. > > Seamless integration of coherent device memory with system memory > will enable various other features, some of which can be listed as follows. > > a. Seamless migrations between system RAM and the coherent memory > b. Will have asynchronous and high throughput migrations > c. Be able to allocate huge order pages from these memory regions > d. Restrict allocations to a large extent to the tasks using the > device for workload acceleration > > Before concluding, will look into the reasons why the existing > solutions don't work. There are two basic requirements which have to be > satisfies before the coherent device memory can be integrated with core > kernel seamlessly. > > a. PFN must have struct page > b. Struct page must able to be inside standard LRU lists > > The above two basic requirements discard the existing method of > device memory representation approaches like these which then requires the > need of creating a new framework. I do not believe the LRU list is a hard requirement, yes when faulting in a page inside the page cache it assumes it needs to be added to lru list. But i think this can easily be work around. In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...) so in my case a file back page must always be spawn first from a regular page and once read from disk then i can migrate to GPU page. So if you accept this intermediary step you can easily use ZONE_DEVICE for device memory. This way no lru, no complex dance to make the memory out of reach from regular memory allocator. I think we would have much to gain if we pool our effort on a single common solution for device memory. In my case the device memory is not accessible by the CPU (because PCIE restrictions), in your case it is. Thus the only difference is that in my case it can not be map inside the CPU page table while in yours it can. > > (1) Traditional ioremap > > a. Memory is mapped into kernel (linear and virtual) and user space > b. These PFNs do not have struct pages associated with it > c. These special PFNs are marked with special flags inside the PTE > d. Cannot participate in core VM functions much because of this > e. Cannot do easy user space migrations > > (2) Zone ZONE_DEVICE > > a. Memory is mapped into kernel and user space > b. PFNs do have struct pages associated with it > c. These struct pages are allocated inside it's own memory range > d. Unfortunately the struct page's union containing LRU has been > used for struct dev_pagemap pointer > e. Hence it cannot be part of any LRU (like Page cache) > f. Hence file cached mapping cannot reside on these PFNs > g. Cannot do easy migrations > > I had also explored non LRU representation of this coherent device > memory where the integration with system RAM in the core VM is limited only > to the following functions. Not being inside LRU is definitely going to > reduce the scope of tight integration with system RAM. 
> > (1) Migration support between system RAM and coherent memory > (2) Migration support between various coherent memory nodes > (3) Isolation of the coherent memory > (4) Mapping the coherent memory into user space through driver's > struct vm_operations > (5) HW poisoning of the coherent memory > > Allocating the entire memory of the coherent device node right > after hot plug into ZONE_MOVABLE (where the memory is already inside the > buddy system) will still expose a time window where other user space > allocations can come into the coherent device memory node and prevent the > intended isolation. So traditional hot plug is not the solution. Hence > started looking into CMA based non LRU solution but then hit the following > roadblocks. > > (1) CMA does not support hot plugging of new memory node > a. CMA area needs to be marked during boot before buddy is > initialized > b. cma_alloc()/cma_release() can happen on the marked area > c. Should be able to mark the CMA areas just after memory hot plug > d. cma_alloc()/cma_release() can happen later after the hot plug > e. This is not currently supported right now > > (2) Mapped non LRU migration of pages > a. Recent work from Michan Kim makes non LRU page migratable > b. But it still does not support migration of mapped non LRU pages > c. With non LRU CMA re
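For completeness, hooking device memory up as ZONE_DEVICE in the first place is a single call for the driver. A kernel-style sketch, assuming the ~4.8-era prototype devm_memremap_pages(dev, res, ref, altmap) (the signature has changed in later kernels); 'res' describing the device's coherent memory range and 'ref' used for teardown are assumed to already exist in the driver.

#include <linux/memremap.h>
#include <linux/device.h>
#include <linux/ioport.h>

static void *cdm_register_zone_device(struct device *dev, struct resource *res,
                                      struct percpu_ref *ref)
{
        /*
         * Creates struct pages for the range and ties them to a dev_pagemap.
         * That is exactly why page->lru is unavailable here: the same union
         * member now carries the pgmap pointer (point (2)d in the list above).
         */
        return devm_memremap_pages(dev, res, ref, NULL /* no vmem_altmap */);
}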
[RFC 0/8] Define coherent device memory node
There are certain devices like accelerators, GPU cards, network cards, FPGA cards, PLD cards etc. which might contain on-board memory. This on-board memory can be coherent along with system RAM and may be accessible from either the CPU or from the device. The coherency is usually achieved through synchronizing the cache accesses from either side. This makes the device memory appear in the same address space as the system RAM. The on-board device memory and system RAM are coherent but have differences in their properties, as explained and elaborated below. The following diagram shows how the coherent device memory appears in the memory address space.

 +----------------+          +----------------+
 |                |          |                |
 |      CPU       |          |     DEVICE     |
 |                |          |                |
 +----------------+          +----------------+
          |                           |
          |   Shared Address Space    |
 +--------------------------------------------+
 |                     |                      |
 |                     |                      |
 |     System RAM      |   Coherent Memory    |
 |                     |                      |
 |                     |                      |
 +--------------------------------------------+

User space applications might be interested in using the coherent device memory either explicitly or implicitly along with the system RAM, utilizing the basic semantics for memory allocation, access and release. Basically the user applications should be able to allocate memory anywhere (system RAM or coherent memory) and then get it accessed either from the CPU or from the coherent device for various computation or data transformation purposes. User space really should not be concerned about memory placement and subsequent allocations when the memory actually faults because of the access.

To achieve seamless integration between system RAM and coherent device memory it must be able to utilize core memory kernel features like anon mapping, file mapping, page cache, driver managed pages, HW poisoning, migrations, reclaim, compaction, etc. Making the coherent device memory appear as a distinct memory-only NUMA node, which will be initialized like any other node with memory, can create this integration with the currently available system RAM. At the same time there should be a differentiating mark which indicates that this node is a coherent device memory node, not just another memory-only system RAM node.

Coherent device memory invariably isn't available until the driver for the device has been initialized. It is desirable but not required for the device to support memory offlining for purposes such as power management, link management and hardware errors. Kernel allocations should not come here as they cannot be moved out. Hence coherent device memory should go inside the ZONE_MOVABLE zone instead. This guarantees that kernel allocations will never be satisfied from this memory, and any process having unmovable pages on this coherent device memory (likely achieved through pinning later on after the initial allocation) can be killed to free up memory from the page tables, eventually allowing the node to be hot plugged out.

Even after being represented as a NUMA node, the coherent memory might still need some special consideration while inside the kernel. There can be a variety of coherent device memory nodes with different expectations and special considerations from the core kernel. This RFC discusses only one such scenario, where the coherent device memory requires just isolation.

Now let us consider in detail the case of a coherent device memory node which requires isolation. This kind of coherent device memory is on board an external device attached to the system through a link where there is a chance of link errors plugging out the entire memory node with it. Moreover, the memory might also have higher chances of ECC errors as compared to the system RAM. These are just some possibilities.
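The ZONE_MOVABLE placement described above corresponds to what the existing memory hotplug interface calls online_movable. A minimal userspace illustration (the memory block number 42 is made up; a real tool would discover the blocks that belong to the newly added device node):

#include <stdio.h>

int main(void)
{
        const char *path = "/sys/devices/system/memory/memory42/state";
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return 1;
        }
        /* "online_movable" places the block in ZONE_MOVABLE, so no unmovable
         * kernel allocations can ever land in it. */
        fprintf(f, "online_movable\n");
        return fclose(f) ? 1 : 0;
}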
But the fact remains that the coherent device memory can have other properties which might not be desirable for some user space applications. An application should not be exposed to the related risks of a device if it is not taking advantage of the special features of that device and its memory. Because of the reasons explained above, allocations into an isolation-based coherent device memory node should be further regulated, apart from the earlier requirement that kernel allocations must not come there. User space allocations should not come here implicitly without the user application explicitly knowing a