Re: [RFC 0/8] Define coherent device memory node

2016-11-05 Thread Jerome Glisse
On Sat, Nov 05, 2016 at 10:51:21AM +0530, Anshuman Khandual wrote:
> On 10/25/2016 09:56 AM, Aneesh Kumar K.V wrote:
> > I looked at the hmm-v13 w.r.t migration and I guess some form of device
> > callback/acceleration during migration is something we should definitely
> > have. I still haven't figured out how non addressable and coherent device
> > memory can fit together there. I was waiting for the page cache
> > migration support to be pushed to the repository before I start looking
> > at this closely.
> 
> Aneesh, did not get that. Currently basic page cache migration is supported,
> right ? The device callback during migration, fault etc are supported through
> page->pgmap pointer and extending dev_pagemap structure to accommodate new
> members. IIUC that is the reason ZONE_DEVICE is being modified so that page
> ->pgmap overloading can be used for various driver/device specific callbacks
> while inside core VM functions or HMM functions.
> 
> HMM V13 has introduced non-addressable ZONE_DEVICE based device memory which
> can have it's struct pages in system RAM but they cannot be accessed from the
> CPU. Now coherent device memory is kind of similar to persistent memory like
> NVDIMM which is already supported through ZONE_DEVICE (though we might not
> want to use vmemap_altmap instead have the struct pages in the system RAM).
> Now HMM has to learn working with 'dev_pagemap->addressable' type of device
> memory and then support all possible migrations through it's API. So in a
> nutshell, these are the changes we need to do to make HMM work with coherent
> device memory.
> 
> (0) Support all possible migrations between system RAM and device memory
> for current un-addressable device memory and make the HMM migration
> API layer comprehensive and complete.

What is not comprehensive or complete in the API layer? I think the API is
pretty clear: the migrate function does not rely on anything except the HMM pfn.
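
To illustrate the shape of such a pfn-based interface (hypothetical names only,
not the actual hmm-v13 API; the hmm-v13 tree linked further down the thread is
the reference), the core code deals in pfn arrays while destination allocation
and the copy are delegated to the device driver:

#include <linux/mm.h>

/* Illustrative sketch, all names hypothetical. */
struct sketch_migrate_ops {
        /* driver allocates destination pages and performs the copy (DMA ok) */
        void (*alloc_and_copy)(struct vm_area_struct *vma,
                               const unsigned long *src_pfns,
                               unsigned long *dst_pfns,
                               unsigned long start, unsigned long end,
                               void *private);
        /* driver finishes the remapping, or reclaims its pages on failure */
        void (*finalize_and_map)(struct vm_area_struct *vma,
                                 const unsigned long *src_pfns,
                                 const unsigned long *dst_pfns,
                                 unsigned long start, unsigned long end,
                                 void *private);
};

int sketch_migrate(const struct sketch_migrate_ops *ops,
                   struct vm_area_struct *vma,
                   unsigned long start, unsigned long end, void *private);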


> 
> (1) Create coherent device memory representation in ZONE_DEVICE
>   (a) Make it exactly the same as that of persistent memory/NVDIMM
> 
>   or
> 
>   (b) Create a new type for coherent device memory representation

So I will soon push an updated tree with modifications to the HMM API (from the
device driver point of view; the migrate stuff is virtually the same). I split
the addressable and movable concepts, and thus it is now easy to support coherent
addressable memory as well as non-addressable memory.

> 
> (2) Support all possible migrations between system RAM and device memory
> for new addressable coherent device memory represented in ZONE_DEVICE
> extending the HMM migration API layer.
>
> Right now, HMM V13 patch series supports migration for a subset of private
> anonymous pages for un-addressable device memory. I am wondering how difficult
> is it to implement all possible anon, file mapping migration support for both
> un-addressable and addressable coherent device memory through ZONE_DEVICE.
>
 
There is no need to extend the API to support file-backed pages; as a matter of
fact, the 2 patches I sent you do support migration of file-backed pages
(page->mapping) to and from ZONE_DEVICE, as long as this ZONE_DEVICE memory is
accessible by the CPU and coherent. What I am still working on is the
non-addressable case, which is far more tedious (handling direct IO, read, write
and writeback).

So the difficulty for coherent memory is nil; it is the non-addressable memory
that is hard to support with respect to file-backed pages.

Cheers,
Jérôme


Re: [RFC 0/8] Define coherent device memory node

2016-11-04 Thread Anshuman Khandual
On 10/25/2016 09:56 AM, Aneesh Kumar K.V wrote:
> I looked at the hmm-v13 w.r.t migration and I guess some form of device
> callback/acceleration during migration is something we should definitely
> have. I still haven't figured out how non addressable and coherent device
> memory can fit together there. I was waiting for the page cache
> migration support to be pushed to the repository before I start looking
> at this closely.

Aneesh, I did not get that. Currently basic page cache migration is supported,
right? The device callbacks during migration, fault, etc. are supported through
the page->pgmap pointer and by extending the dev_pagemap structure to accommodate
new members. IIUC that is the reason ZONE_DEVICE is being modified, so that
page->pgmap overloading can be used for various driver/device specific callbacks
while inside core VM functions or HMM functions.

HMM V13 has introduced non-addressable ZONE_DEVICE based device memory which
can have its struct pages in system RAM even though the memory itself cannot be
accessed from the CPU. Now coherent device memory is somewhat similar to
persistent memory like NVDIMM, which is already supported through ZONE_DEVICE
(though we might not want to use vmem_altmap and instead keep the struct pages
in system RAM). So HMM has to learn to work with a 'dev_pagemap->addressable'
type of device memory and then support all possible migrations through its API.
In a nutshell, these are the changes we need to make HMM work with coherent
device memory.

(0) Support all possible migrations between system RAM and device memory
for current un-addressable device memory and make the HMM migration
API layer comprehensive and complete.

(1) Create coherent device memory representation in ZONE_DEVICE
(a) Make it exactly the same as that of persistent memory/NVDIMM

or

(b) Create a new type for coherent device memory representation

(2) Support all possible migrations between system RAM and device memory
for new addressable coherent device memory represented in ZONE_DEVICE
extending the HMM migration API layer.

Right now, the HMM V13 patch series supports migration of a subset of private
anonymous pages for un-addressable device memory. I am wondering how difficult
it is to implement full anon and file mapping migration support for both
un-addressable and addressable coherent device memory through ZONE_DEVICE.
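
To make option (1)(a) above a little more concrete, here is a minimal sketch,
assuming the four-argument devm_memremap_pages() form that the pmem/NVDIMM code
uses in kernels of this era; the "addressable" remark is purely hypothetical and
only shows where such a per-pagemap property might live:

#include <linux/device.h>
#include <linux/ioport.h>
#include <linux/memremap.h>
#include <linux/percpu-refcount.h>

/*
 * Give coherent device memory struct pages through the same path pmem uses.
 * Passing a NULL vmem_altmap keeps the struct pages in system RAM instead of
 * carving them out of the device memory itself.
 */
static void *cdm_add_coherent_memory(struct device *dev, struct resource *res,
                                     struct percpu_ref *ref)
{
        return devm_memremap_pages(dev, res, ref, NULL);
}

/*
 * A per-pagemap property along the lines of 'dev_pagemap->addressable' (not an
 * existing field) would then tell HMM / core mm whether the CPU can actually
 * load/store this range.
 */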



Re: [RFC 0/8] Define coherent device memory node

2016-10-28 Thread Jerome Glisse
On Fri, Oct 28, 2016 at 10:59:52AM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse  writes:
> 
> > On Wed, Oct 26, 2016 at 04:39:19PM +0530, Aneesh Kumar K.V wrote:
> >> Jerome Glisse  writes:
> >> 
> >> > On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote:
> >> >> Jerome Glisse  writes:
> >> >> 
> >> >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >> >> >
> >> >> I looked at the hmm-v13 w.r.t migration and I guess some form of device
> >> >> callback/acceleration during migration is something we should definitely
> >> >> have. I still haven't figured out how non addressable and coherent 
> >> >> device
> >> >> memory can fit together there. I was waiting for the page cache
> >> >> migration support to be pushed to the repository before I start looking
> >> >> at this closely.
> >> >> 
> >> >
> >> > The page cache migration does not touch the migrate code path. My issue 
> >> > with
> >> > page cache is writeback. The only difference with existing migrate code 
> >> > is
> >> > refcount check for ZONE_DEVICE page. Everything else is the same.
> >> 
> >> What about the radix tree ? does file system migrate_page callback handle
> >> replacing normal page with ZONE_DEVICE page/exceptional entries ?
> >> 
> >
> > It use the exact same existing code (from mm/migrate.c) so yes the radix 
> > tree
> > is updated and buffer_head are migrated.
> >
> 
> I looked at the the page cache migration patches shared and I find that
> you are not using exceptional entries when we migrate a page cache page to
> device memory. But I am now not sure how a read from page cache will
> work with that.
> 
> ie, a file system read will now find the page in page cache. But we
> cannot do a copy_to_user of that page because that is now backed by an
> unaddressable memory right ?
> 
> do_generic_file_read() does
>   page = find_get_page(mapping, index);
>   
>   ret = copy_page_to_iter(page, offset, nr, iter);
> 
> which does
>   void *kaddr = kmap_atomic(page);
>   size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
>   kunmap_atomic(kaddr);

Like I said, right now my patches are mostly broken for un-addressable memory,
for both read and write. I am focusing on page writeback for now as it seemed to
be the more problematic case. For read/write the intention is to trigger a
migration back to system memory inside the filesystem's read/write path. This is
also why I will need a flag to indicate whether a filesystem supports migration
to un-addressable memory.

But in your case, where the device memory is accessible, it should just work.
Or do you need to do anything special when kmapping a device page?
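
A purely illustrative sketch of that read/write intention, with every helper
name hypothetical (nothing below exists in any tree), might look like the
following; coherent, CPU-accessible device memory would skip all of it and go
straight to kmap_atomic()/copy_page_to_iter():

#include <linux/err.h>
#include <linux/fs.h>

/* Sketch of intent only; all helpers below are hypothetical. */
static struct page *cdm_prepare_page_for_read(struct address_space *mapping,
                                              struct page *page)
{
        /* CPU can touch the page (system RAM or coherent device memory). */
        if (!is_device_unaddressable_page(page))        /* hypothetical test */
                return page;

        /* hypothetical per-filesystem opt-in flag discussed above */
        if (!mapping_supports_device_migrate(mapping))
                return ERR_PTR(-EIO);

        /* migrate back to system RAM and replace the page cache entry so the
         * normal kmap_atomic()/copy path can proceed */
        return migrate_page_back_to_ram(mapping, page);
}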

Cheers,
Jérôme


Re: [RFC 0/8] Define coherent device memory node

2016-10-28 Thread Jerome Glisse
On Fri, Oct 28, 2016 at 11:17:31AM +0530, Anshuman Khandual wrote:
> On 10/27/2016 08:35 PM, Jerome Glisse wrote:
> > On Thu, Oct 27, 2016 at 12:33:05PM +0530, Anshuman Khandual wrote:
> >> On 10/27/2016 10:08 AM, Anshuman Khandual wrote:
> >>> On 10/26/2016 09:32 PM, Jerome Glisse wrote:
>  On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote:
> > On 10/26/2016 12:22 AM, Jerome Glisse wrote:
> >> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
> >>> Jerome Glisse  writes:
>  On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
> > Jerome Glisse  writes:
> >> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> > 
> > [...]
> > 
>  In my patchset there is no policy, it is all under device driver control 
>  which
>  decide what range of memory is migrated and when. I think only device 
>  driver as
>  proper knowledge to make such decision. By coalescing data from GPU 
>  counters and
>  request from application made through the uppler level programming API 
>  like
>  Cuda.
> 
> >>>
> >>> Right, I understand that. But what I pointed out here is that there are 
> >>> problems
> >>> now migrating user mapped pages back and forth between LRU system RAM 
> >>> memory and
> >>> non LRU device memory which is yet to be solved. Because you are 
> >>> proposing a non
> >>> LRU based design with ZONE_DEVICE, how we are solving/working around these
> >>> problems for bi-directional migration ?
> >>
> >> Let me elaborate on this bit more. Before non LRU migration support patch 
> >> series
> >> from Minchan, it was not possible to migrate non LRU pages which are 
> >> generally
> >> driver managed through migrate_pages interface. This was affecting the 
> >> ability
> >> to do compaction on platforms which has a large share of non LRU pages. 
> >> That series
> >> actually solved the migration problem and allowed compaction. But it still 
> >> did not
> >> solve the migration problem for non LRU *user mapped* pages. So if the non 
> >> LRU pages
> >> are mapped into a process's page table and being accessed from user space, 
> >> it can
> >> not be moved using migrate_pages interface.
> >>
> >> Minchan had a draft solution for that problem which is still hosted here. 
> >> On his
> >> suggestion I had tried this solution but still faced some other problems 
> >> during
> >> mapped pages migration. (NOTE: IIRC this was not posted in the community)
> >>
> >> git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git with the 
> >> following
> >> branch (non-lru-mapped-v1r2-v4.7-rc4-mmotm-2016-06-24-15-53) 
> >>
> >> As I had mentioned earlier, we intend to support all possible migrations 
> >> between
> >> system RAM (LRU) and device memory (Non LRU) for user space mapped pages.
> >>
> >> (1) System RAM (Anon mapping) --> Device memory, back and forth many times
> >> (2) System RAM (File mapping) --> Device memory, back and forth many times
> > 
> > I achieve this 2 objective in HMM, i sent you the additional patches for 
> > file
> > back page migration. I am not done working on them but they are small.
> 
> Sure, will go through them. Thanks !
> 
> > 
> > 
> >> This is not happening now with non LRU pages. Here are some of reasons but 
> >> before
> >> that some notes.
> >>
> >> * Driver initiates all the migrations
> >> * Driver does the isolation of pages
> >> * Driver puts the isolated pages in a linked list
> >> * Driver passes the linked list to migrate_pages interface for migration
> >> * IIRC isolation of non LRU pages happens through 
> >> page->as->aops->isolate_page call
> >> * If migration fails, call page->as->aops->putback_page to give the page 
> >> back to the
> >>   device driver
> >>
> >> 1. queue_pages_range() currently does not work with non LRU pages, needs 
> >> to be fixed
> >>
> >> 2. After a successful migration from non LRU device memory to LRU system 
> >> RAM, the non
> >>LRU will be freed back. Right now migrate_pages releases these pages to 
> >> buddy, but
> >>in this situation we need the pages to be given back to the driver 
> >> instead. Hence
> >>migrate_pages needs to be changed to accommodate this.
> >>
> >> 3. After LRU system RAM to non LRU device migration for a mapped page, 
> >> does the new
> >>page (which came from device memory) will be part of core MM LRU either 
> >> for Anon
> >>or File mapping ?
> >>
> >> 4. After LRU (Anon mapped) system RAM to non LRU device migration for a 
> >> mapped page,
> >>how we are going to store "address_space->address_space_operations" and 
> >> "Anon VMA
> >>Chain" reverse mapping information both on the page->mapping element ?
> >>
> >> 5. After LRU (File mapped) system RAM to non LRU device migration for a 
> >> mapped page,
> >>how we are going to store "address_space->address_space_operations" of 
> >> the device
> >>driver and radix tree based reverse mapp

Re: [RFC 0/8] Define coherent device memory node

2016-10-28 Thread Aneesh Kumar K.V
Jerome Glisse  writes:

> On Wed, Oct 26, 2016 at 04:39:19PM +0530, Aneesh Kumar K.V wrote:
>> Jerome Glisse  writes:
>> 
>> > On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote:
>> >> Jerome Glisse  writes:
>> >> 
>> >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>> >> >
>> >> I looked at the hmm-v13 w.r.t migration and I guess some form of device
>> >> callback/acceleration during migration is something we should definitely
>> >> have. I still haven't figured out how non addressable and coherent device
>> >> memory can fit together there. I was waiting for the page cache
>> >> migration support to be pushed to the repository before I start looking
>> >> at this closely.
>> >> 
>> >
>> > The page cache migration does not touch the migrate code path. My issue 
>> > with
>> > page cache is writeback. The only difference with existing migrate code is
>> > refcount check for ZONE_DEVICE page. Everything else is the same.
>> 
>> What about the radix tree ? does file system migrate_page callback handle
>> replacing normal page with ZONE_DEVICE page/exceptional entries ?
>> 
>
> It use the exact same existing code (from mm/migrate.c) so yes the radix tree
> is updated and buffer_head are migrated.
>

I looked at the page cache migration patches shared and I find that
you are not using exceptional entries when we migrate a page cache page to
device memory. But I am now not sure how a read from the page cache will
work with that.

i.e., a filesystem read will now find the page in the page cache. But we
cannot do a copy_to_user of that page because it is now backed by
unaddressable memory, right?

do_generic_file_read() does
  page = find_get_page(mapping, index);

  ret = copy_page_to_iter(page, offset, nr, iter);

which does
  void *kaddr = kmap_atomic(page);
  size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
  kunmap_atomic(kaddr);


-aneesh



Re: [RFC 0/8] Define coherent device memory node

2016-10-27 Thread Anshuman Khandual
On 10/27/2016 08:35 PM, Jerome Glisse wrote:
> On Thu, Oct 27, 2016 at 12:33:05PM +0530, Anshuman Khandual wrote:
>> On 10/27/2016 10:08 AM, Anshuman Khandual wrote:
>>> On 10/26/2016 09:32 PM, Jerome Glisse wrote:
 On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote:
> On 10/26/2016 12:22 AM, Jerome Glisse wrote:
>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
>>> Jerome Glisse  writes:
 On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse  writes:
>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> 
> [...]
> 
 In my patchset there is no policy, it is all under device driver control 
 which
 decide what range of memory is migrated and when. I think only device 
 driver as
 proper knowledge to make such decision. By coalescing data from GPU 
 counters and
 request from application made through the uppler level programming API like
 Cuda.

>>>
>>> Right, I understand that. But what I pointed out here is that there are 
>>> problems
>>> now migrating user mapped pages back and forth between LRU system RAM 
>>> memory and
>>> non LRU device memory which is yet to be solved. Because you are proposing 
>>> a non
>>> LRU based design with ZONE_DEVICE, how we are solving/working around these
>>> problems for bi-directional migration ?
>>
>> Let me elaborate on this bit more. Before non LRU migration support patch 
>> series
>> from Minchan, it was not possible to migrate non LRU pages which are 
>> generally
>> driver managed through migrate_pages interface. This was affecting the 
>> ability
>> to do compaction on platforms which has a large share of non LRU pages. That 
>> series
>> actually solved the migration problem and allowed compaction. But it still 
>> did not
>> solve the migration problem for non LRU *user mapped* pages. So if the non 
>> LRU pages
>> are mapped into a process's page table and being accessed from user space, 
>> it can
>> not be moved using migrate_pages interface.
>>
>> Minchan had a draft solution for that problem which is still hosted here. On 
>> his
>> suggestion I had tried this solution but still faced some other problems 
>> during
>> mapped pages migration. (NOTE: IIRC this was not posted in the community)
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git with the 
>> following
>> branch (non-lru-mapped-v1r2-v4.7-rc4-mmotm-2016-06-24-15-53) 
>>
>> As I had mentioned earlier, we intend to support all possible migrations 
>> between
>> system RAM (LRU) and device memory (Non LRU) for user space mapped pages.
>>
>> (1) System RAM (Anon mapping) --> Device memory, back and forth many times
>> (2) System RAM (File mapping) --> Device memory, back and forth many times
> 
> I achieve this 2 objective in HMM, i sent you the additional patches for file
> back page migration. I am not done working on them but they are small.

Sure, will go through them. Thanks !

> 
> 
>> This is not happening now with non LRU pages. Here are some of reasons but 
>> before
>> that some notes.
>>
>> * Driver initiates all the migrations
>> * Driver does the isolation of pages
>> * Driver puts the isolated pages in a linked list
>> * Driver passes the linked list to migrate_pages interface for migration
>> * IIRC isolation of non LRU pages happens through 
>> page->as->aops->isolate_page call
>> * If migration fails, call page->as->aops->putback_page to give the page 
>> back to the
>>   device driver
>>
>> 1. queue_pages_range() currently does not work with non LRU pages, needs to 
>> be fixed
>>
>> 2. After a successful migration from non LRU device memory to LRU system 
>> RAM, the non
>>LRU will be freed back. Right now migrate_pages releases these pages to 
>> buddy, but
>>in this situation we need the pages to be given back to the driver 
>> instead. Hence
>>migrate_pages needs to be changed to accommodate this.
>>
>> 3. After LRU system RAM to non LRU device migration for a mapped page, does 
>> the new
>>page (which came from device memory) will be part of core MM LRU either 
>> for Anon
>>or File mapping ?
>>
>> 4. After LRU (Anon mapped) system RAM to non LRU device migration for a 
>> mapped page,
>>how we are going to store "address_space->address_space_operations" and 
>> "Anon VMA
>>Chain" reverse mapping information both on the page->mapping element ?
>>
>> 5. After LRU (File mapped) system RAM to non LRU device migration for a 
>> mapped page,
>>how we are going to store "address_space->address_space_operations" of 
>> the device
>>driver and radix tree based reverse mapping information for the existing 
>> file
>>mapping both on the same page->mapping element ?
>>
>> 6. IIRC, it was not possible to retain the non LRU identify (page->as->aops 
>> which will
>>defined inside the device driver) and the reverse mapping information 
>> (either anon
>>or file mapping) 

Re: [RFC 0/8] Define coherent device memory node

2016-10-27 Thread Jerome Glisse
On Thu, Oct 27, 2016 at 12:33:05PM +0530, Anshuman Khandual wrote:
> On 10/27/2016 10:08 AM, Anshuman Khandual wrote:
> > On 10/26/2016 09:32 PM, Jerome Glisse wrote:
> >> On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote:
> >>> On 10/26/2016 12:22 AM, Jerome Glisse wrote:
>  On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
> > Jerome Glisse  writes:
> >> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
> >>> Jerome Glisse  writes:
>  On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:

[...]

> >> In my patchset there is no policy, it is all under device driver control 
> >> which
> >> decide what range of memory is migrated and when. I think only device 
> >> driver as
> >> proper knowledge to make such decision. By coalescing data from GPU 
> >> counters and
> >> request from application made through the uppler level programming API like
> >> Cuda.
> >>
> > 
> > Right, I understand that. But what I pointed out here is that there are 
> > problems
> > now migrating user mapped pages back and forth between LRU system RAM 
> > memory and
> > non LRU device memory which is yet to be solved. Because you are proposing 
> > a non
> > LRU based design with ZONE_DEVICE, how we are solving/working around these
> > problems for bi-directional migration ?
> 
> Let me elaborate on this bit more. Before non LRU migration support patch 
> series
> from Minchan, it was not possible to migrate non LRU pages which are generally
> driver managed through migrate_pages interface. This was affecting the ability
> to do compaction on platforms which has a large share of non LRU pages. That 
> series
> actually solved the migration problem and allowed compaction. But it still 
> did not
> solve the migration problem for non LRU *user mapped* pages. So if the non 
> LRU pages
> are mapped into a process's page table and being accessed from user space, it 
> can
> not be moved using migrate_pages interface.
> 
> Minchan had a draft solution for that problem which is still hosted here. On 
> his
> suggestion I had tried this solution but still faced some other problems 
> during
> mapped pages migration. (NOTE: IIRC this was not posted in the community)
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git with the 
> following
> branch (non-lru-mapped-v1r2-v4.7-rc4-mmotm-2016-06-24-15-53) 
> 
> As I had mentioned earlier, we intend to support all possible migrations 
> between
> system RAM (LRU) and device memory (Non LRU) for user space mapped pages.
> 
> (1) System RAM (Anon mapping) --> Device memory, back and forth many times
> (2) System RAM (File mapping) --> Device memory, back and forth many times

I achieve these 2 objectives in HMM; I sent you the additional patches for
file-backed page migration. I am not done working on them but they are small.


> This is not happening now with non LRU pages. Here are some of reasons but 
> before
> that some notes.
> 
> * Driver initiates all the migrations
> * Driver does the isolation of pages
> * Driver puts the isolated pages in a linked list
> * Driver passes the linked list to migrate_pages interface for migration
> * IIRC isolation of non LRU pages happens through 
> page->as->aops->isolate_page call
> * If migration fails, call page->as->aops->putback_page to give the page back 
> to the
>   device driver
> 
> 1. queue_pages_range() currently does not work with non LRU pages, needs to 
> be fixed
> 
> 2. After a successful migration from non LRU device memory to LRU system RAM, 
> the non
>LRU will be freed back. Right now migrate_pages releases these pages to 
> buddy, but
>in this situation we need the pages to be given back to the driver 
> instead. Hence
>migrate_pages needs to be changed to accommodate this.
> 
> 3. After LRU system RAM to non LRU device migration for a mapped page, does 
> the new
>page (which came from device memory) will be part of core MM LRU either 
> for Anon
>or File mapping ?
> 
> 4. After LRU (Anon mapped) system RAM to non LRU device migration for a 
> mapped page,
>how we are going to store "address_space->address_space_operations" and 
> "Anon VMA
>Chain" reverse mapping information both on the page->mapping element ?
> 
> 5. After LRU (File mapped) system RAM to non LRU device migration for a 
> mapped page,
>how we are going to store "address_space->address_space_operations" of the 
> device
>driver and radix tree based reverse mapping information for the existing 
> file
>mapping both on the same page->mapping element ?
> 
> 6. IIRC, it was not possible to retain the non LRU identify (page->as->aops 
> which will
>defined inside the device driver) and the reverse mapping information 
> (either anon
>or file mapping) together after first round of migration. This non LRU 
> identity needs
>to be retained continuously if we ever need to return this page to device 
> driver after
>success

Re: [RFC 0/8] Define coherent device memory node

2016-10-27 Thread Balbir Singh


On 27/10/16 03:28, Jerome Glisse wrote:
> On Wed, Oct 26, 2016 at 06:26:02PM +0530, Anshuman Khandual wrote:
>> On 10/26/2016 12:22 AM, Jerome Glisse wrote:
>>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
 Jerome Glisse  writes:

> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
>> Jerome Glisse  writes:
>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>
> [...]
>
>>> You can take a look at hmm-v13 if you want to see how i do non LRU page
>>> migration. While i put most of the migration code inside hmm_migrate.c 
>>> it
>>> could easily be move to migrate.c without hmm_ prefix.
>>>
>>> There is 2 missing piece with existing migrate code. First is to put 
>>> memory
>>> allocation for destination under control of who call the migrate code. 
>>> Second
>>> is to allow offloading the copy operation to device (ie not use the CPU 
>>> to
>>> copy data).
>>>
>>> I believe same requirement also make sense for platform you are 
>>> targeting.
>>> Thus same code can be use.
>>>
>>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
>>>
>>> I haven't posted this patchset yet because we are doing some 
>>> modifications
>>> to the device driver API to accomodate some new features. But the 
>>> ZONE_DEVICE
>>> changes and the overall migration code will stay the same more or less 
>>> (i have
>>> patches that move it to migrate.c and share more code with existing 
>>> migrate
>>> code).
>>>
>>> If you think i missed anything about lru and page cache please point it 
>>> to
>>> me. Because when i audited code for that i didn't see any road block 
>>> with
>>> the few fs i was looking at (ext4, xfs and core page cache code).
>>>
>>
>> The other restriction around ZONE_DEVICE is, it is not a managed zone.
>> That prevents any direct allocation from coherent device by application.
>> ie, we would like to force allocation from coherent device using
>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
>
> To achieve this we rely on device fault code path ie when device take a 
> page fault
> with help of HMM it will use existing memory if any for fault address but 
> if CPU
> page table is empty (and it is not file back vma because of readback) 
> then device
> can directly allocate device memory and HMM will update CPU page table to 
> point to
> newly allocated device memory.
>

 That is ok if the device touch the page first. What if we want the
 allocation touched first by cpu to come from GPU ?. Should we always
 depend on GPU driver to migrate such pages later from system RAM to GPU
 memory ?

>>>
>>> I am not sure what kind of workload would rather have every first CPU 
>>> access for
>>> a range to use device memory. So no my code does not handle that and it is 
>>> pointless
>>> for it as CPU can not access device memory for me.
>>
>> If the user space application can explicitly allocate device memory 
>> directly, we
>> can save one round of migration when the device start accessing it. But then 
>> one
>> can argue what problem statement the device would work on on a freshly 
>> allocated
>> memory which has not been accessed by CPU for loading the data yet. Will 
>> look into
>> this scenario in more detail.
>>
>>>
>>> That said nothing forbid to add support for ZONE_DEVICE with mbind() like 
>>> syscall.
>>> Thought my personnal preference would still be to avoid use of such generic 
>>> syscall
>>> but have device driver set allocation policy through its own userspace API 
>>> (device
>>> driver could reuse internal of mbind() to achieve the end result).
>>
>> Okay, the basic premise of CDM node is to have a LRU based design where we 
>> can
>> avoid use of driver specific user space memory management code altogether.
> 
> And i think it is not a good fit, at least not for GPU. GPU device driver 
> have a
> big chunk of code dedicated to memory management. You can look at drm/ttm and 
> at
> userspace (most is in userspace). It is not because we want to reinvent the 
> wheel
> it is because they are some unique constraint.
> 

Could you elaborate on the unique constraints a bit more? I looked at ttm briefly
(specifically ttm_memory.c); I can see zones being replicated, and it feels like
a mini-mm is embedded in there.

> 
>>>
>>> I am not saying that eveything you want to do is doable now with HMM but, 
>>> nothing
>>> preclude achieving what you want to achieve using ZONE_DEVICE. I really 
>>> don't think
>>> any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and 
>>> can be reuse
>>> with device memory.
>>
>> With CDM node based design, the expectation is to get all/maximum core VM 
>> mechanism
>> working so that, driver has to do less 

Re: [RFC 0/8] Define coherent device memory node

2016-10-27 Thread Anshuman Khandual
On 10/27/2016 10:08 AM, Anshuman Khandual wrote:
> On 10/26/2016 09:32 PM, Jerome Glisse wrote:
>> On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote:
>>> On 10/26/2016 12:22 AM, Jerome Glisse wrote:
 On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse  writes:
>
>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
>>> Jerome Glisse  writes:
 On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>>
>> [...]
>>
 You can take a look at hmm-v13 if you want to see how i do non LRU page
 migration. While i put most of the migration code inside hmm_migrate.c 
 it
 could easily be move to migrate.c without hmm_ prefix.

 There is 2 missing piece with existing migrate code. First is to put 
 memory
 allocation for destination under control of who call the migrate code. 
 Second
 is to allow offloading the copy operation to device (ie not use the 
 CPU to
 copy data).

 I believe same requirement also make sense for platform you are 
 targeting.
 Thus same code can be use.

 hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13

 I haven't posted this patchset yet because we are doing some 
 modifications
 to the device driver API to accomodate some new features. But the 
 ZONE_DEVICE
 changes and the overall migration code will stay the same more or less 
 (i have
 patches that move it to migrate.c and share more code with existing 
 migrate
 code).

 If you think i missed anything about lru and page cache please point 
 it to
 me. Because when i audited code for that i didn't see any road block 
 with
 the few fs i was looking at (ext4, xfs and core page cache code).

>>>
>>> The other restriction around ZONE_DEVICE is, it is not a managed zone.
>>> That prevents any direct allocation from coherent device by application.
>>> ie, we would like to force allocation from coherent device using
>>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
>>
>> To achieve this we rely on device fault code path ie when device take a 
>> page fault
>> with help of HMM it will use existing memory if any for fault address 
>> but if CPU
>> page table is empty (and it is not file back vma because of readback) 
>> then device
>> can directly allocate device memory and HMM will update CPU page table 
>> to point to
>> newly allocated device memory.
>>
>
> That is ok if the device touch the page first. What if we want the
> allocation touched first by cpu to come from GPU ?. Should we always
> depend on GPU driver to migrate such pages later from system RAM to GPU
> memory ?
>

 I am not sure what kind of workload would rather have every first CPU 
 access for
 a range to use device memory. So no my code does not handle that and it is 
 pointless
 for it as CPU can not access device memory for me.

 That said nothing forbid to add support for ZONE_DEVICE with mbind() like 
 syscall.
 Thought my personnal preference would still be to avoid use of such 
 generic syscall
 but have device driver set allocation policy through its own userspace API 
 (device
 driver could reuse internal of mbind() to achieve the end result).

 I am not saying that eveything you want to do is doable now with HMM but, 
 nothing
 preclude achieving what you want to achieve using ZONE_DEVICE. I really 
 don't think
 any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and 
 can be reuse
 with device memory.

 Each device is so different from the other that i don't believe in a one 
 API fit all.
 The drm GPU subsystem of the kernel is a testimony of how little can be 
 share when it
 comes to GPU. The only common code is modesetting. Everything that deals 
 with how to
 use GPU to compute stuff is per device and most of the logic is in 
 userspace. So i do
 not see any commonality that could be abstracted at syscall level. I would 
 rather let
 device driver stack (kernel and userspace) take such decision and have the 
 higher level
 API (OpenCL, Cuda, C++17, ...) expose something that make sense for each 
 of them.
 Programmer target those high level API and they intend to use the 
 mechanism each offer
 to manage memory and memory placement. I would say forcing them to use a 
 second linux
 specific API to achieve the latter is wrong, at lest for now.

 So in the end if the mbind() syscall is done by the userspace side of the 
 device driver
 then why not

Re: [RFC 0/8] Define coherent device memory node

2016-10-26 Thread Anshuman Khandual
On 10/26/2016 09:32 PM, Jerome Glisse wrote:
> On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote:
>> On 10/26/2016 12:22 AM, Jerome Glisse wrote:
>>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
 Jerome Glisse  writes:

> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
>> Jerome Glisse  writes:
>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>
> [...]
>
>>> You can take a look at hmm-v13 if you want to see how i do non LRU page
>>> migration. While i put most of the migration code inside hmm_migrate.c 
>>> it
>>> could easily be move to migrate.c without hmm_ prefix.
>>>
>>> There is 2 missing piece with existing migrate code. First is to put 
>>> memory
>>> allocation for destination under control of who call the migrate code. 
>>> Second
>>> is to allow offloading the copy operation to device (ie not use the CPU 
>>> to
>>> copy data).
>>>
>>> I believe same requirement also make sense for platform you are 
>>> targeting.
>>> Thus same code can be use.
>>>
>>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
>>>
>>> I haven't posted this patchset yet because we are doing some 
>>> modifications
>>> to the device driver API to accomodate some new features. But the 
>>> ZONE_DEVICE
>>> changes and the overall migration code will stay the same more or less 
>>> (i have
>>> patches that move it to migrate.c and share more code with existing 
>>> migrate
>>> code).
>>>
>>> If you think i missed anything about lru and page cache please point it 
>>> to
>>> me. Because when i audited code for that i didn't see any road block 
>>> with
>>> the few fs i was looking at (ext4, xfs and core page cache code).
>>>
>>
>> The other restriction around ZONE_DEVICE is, it is not a managed zone.
>> That prevents any direct allocation from coherent device by application.
>> ie, we would like to force allocation from coherent device using
>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
>
> To achieve this we rely on device fault code path ie when device take a 
> page fault
> with help of HMM it will use existing memory if any for fault address but 
> if CPU
> page table is empty (and it is not file back vma because of readback) 
> then device
> can directly allocate device memory and HMM will update CPU page table to 
> point to
> newly allocated device memory.
>

 That is ok if the device touch the page first. What if we want the
 allocation touched first by cpu to come from GPU ?. Should we always
 depend on GPU driver to migrate such pages later from system RAM to GPU
 memory ?

>>>
>>> I am not sure what kind of workload would rather have every first CPU 
>>> access for
>>> a range to use device memory. So no my code does not handle that and it is 
>>> pointless
>>> for it as CPU can not access device memory for me.
>>>
>>> That said nothing forbid to add support for ZONE_DEVICE with mbind() like 
>>> syscall.
>>> Thought my personnal preference would still be to avoid use of such generic 
>>> syscall
>>> but have device driver set allocation policy through its own userspace API 
>>> (device
>>> driver could reuse internal of mbind() to achieve the end result).
>>>
>>> I am not saying that eveything you want to do is doable now with HMM but, 
>>> nothing
>>> preclude achieving what you want to achieve using ZONE_DEVICE. I really 
>>> don't think
>>> any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and 
>>> can be reuse
>>> with device memory.
>>>
>>> Each device is so different from the other that i don't believe in a one 
>>> API fit all.
>>> The drm GPU subsystem of the kernel is a testimony of how little can be 
>>> share when it
>>> comes to GPU. The only common code is modesetting. Everything that deals 
>>> with how to
>>> use GPU to compute stuff is per device and most of the logic is in 
>>> userspace. So i do
>>> not see any commonality that could be abstracted at syscall level. I would 
>>> rather let
>>> device driver stack (kernel and userspace) take such decision and have the 
>>> higher level
>>> API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of 
>>> them.
>>> Programmer target those high level API and they intend to use the mechanism 
>>> each offer
>>> to manage memory and memory placement. I would say forcing them to use a 
>>> second linux
>>> specific API to achieve the latter is wrong, at lest for now.
>>>
>>> So in the end if the mbind() syscall is done by the userspace side of the 
>>> device driver
>>> then why not just having the device driver communicate this through its own 
>>> kernel
>>> API (which can be much more expressive than what standardize syscall 
>>> offers). I 

Re: [RFC 0/8] Define coherent device memory node

2016-10-26 Thread Jerome Glisse
On Wed, Oct 26, 2016 at 06:26:02PM +0530, Anshuman Khandual wrote:
> On 10/26/2016 12:22 AM, Jerome Glisse wrote:
> > On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
> >> Jerome Glisse  writes:
> >>
> >>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
>  Jerome Glisse  writes:
> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >>>
> >>> [...]
> >>>
> > You can take a look at hmm-v13 if you want to see how i do non LRU page
> > migration. While i put most of the migration code inside hmm_migrate.c 
> > it
> > could easily be move to migrate.c without hmm_ prefix.
> >
> > There is 2 missing piece with existing migrate code. First is to put 
> > memory
> > allocation for destination under control of who call the migrate code. 
> > Second
> > is to allow offloading the copy operation to device (ie not use the CPU 
> > to
> > copy data).
> >
> > I believe same requirement also make sense for platform you are 
> > targeting.
> > Thus same code can be use.
> >
> > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
> >
> > I haven't posted this patchset yet because we are doing some 
> > modifications
> > to the device driver API to accomodate some new features. But the 
> > ZONE_DEVICE
> > changes and the overall migration code will stay the same more or less 
> > (i have
> > patches that move it to migrate.c and share more code with existing 
> > migrate
> > code).
> >
> > If you think i missed anything about lru and page cache please point it 
> > to
> > me. Because when i audited code for that i didn't see any road block 
> > with
> > the few fs i was looking at (ext4, xfs and core page cache code).
> >
> 
>  The other restriction around ZONE_DEVICE is, it is not a managed zone.
>  That prevents any direct allocation from coherent device by application.
>  ie, we would like to force allocation from coherent device using
>  interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
> >>>
> >>> To achieve this we rely on device fault code path ie when device take a 
> >>> page fault
> >>> with help of HMM it will use existing memory if any for fault address but 
> >>> if CPU
> >>> page table is empty (and it is not file back vma because of readback) 
> >>> then device
> >>> can directly allocate device memory and HMM will update CPU page table to 
> >>> point to
> >>> newly allocated device memory.
> >>>
> >>
> >> That is ok if the device touch the page first. What if we want the
> >> allocation touched first by cpu to come from GPU ?. Should we always
> >> depend on GPU driver to migrate such pages later from system RAM to GPU
> >> memory ?
> >>
> > 
> > I am not sure what kind of workload would rather have every first CPU 
> > access for
> > a range to use device memory. So no my code does not handle that and it is 
> > pointless
> > for it as CPU can not access device memory for me.
> 
> If the user space application can explicitly allocate device memory directly, 
> we
> can save one round of migration when the device start accessing it. But then 
> one
> can argue what problem statement the device would work on on a freshly 
> allocated
> memory which has not been accessed by CPU for loading the data yet. Will look 
> into
> this scenario in more detail.
> 
> > 
> > That said nothing forbid to add support for ZONE_DEVICE with mbind() like 
> > syscall.
> > Thought my personnal preference would still be to avoid use of such generic 
> > syscall
> > but have device driver set allocation policy through its own userspace API 
> > (device
> > driver could reuse internal of mbind() to achieve the end result).
> 
> Okay, the basic premise of CDM node is to have a LRU based design where we can
> avoid use of driver specific user space memory management code altogether.

And I think it is not a good fit, at least not for GPU. GPU device drivers have a
big chunk of code dedicated to memory management. You can look at drm/ttm and at
userspace (most of it is in userspace). It is not because we want to reinvent the
wheel; it is because there are some unique constraints.


> > 
> > I am not saying that eveything you want to do is doable now with HMM but, 
> > nothing
> > preclude achieving what you want to achieve using ZONE_DEVICE. I really 
> > don't think
> > any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and 
> > can be reuse
> > with device memory.
> 
> With CDM node based design, the expectation is to get all/maximum core VM 
> mechanism
> working so that, driver has to do less device specific optimization.

I think this is a bad idea today for GPU, but I might be wrong.
 
> > 
> > Each device is so different from the other that i don't believe in a one 
> > API fit all.
> 
> Right, so as I had mentioned in the cover letter, 
> pglist_data->coherent

Re: [RFC 0/8] Define coherent device memory node

2016-10-26 Thread Jerome Glisse
On Wed, Oct 26, 2016 at 04:39:19PM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse  writes:
> 
> > On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote:
> >> Jerome Glisse  writes:
> >> 
> >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >> >
> >> I looked at the hmm-v13 w.r.t migration and I guess some form of device
> >> callback/acceleration during migration is something we should definitely
> >> have. I still haven't figured out how non addressable and coherent device
> >> memory can fit together there. I was waiting for the page cache
> >> migration support to be pushed to the repository before I start looking
> >> at this closely.
> >> 
> >
> > The page cache migration does not touch the migrate code path. My issue with
> > page cache is writeback. The only difference with existing migrate code is
> > refcount check for ZONE_DEVICE page. Everything else is the same.
> 
> What about the radix tree ? does file system migrate_page callback handle
> replacing normal page with ZONE_DEVICE page/exceptional entries ?
> 

It uses the exact same existing code (from mm/migrate.c), so yes, the radix tree
is updated and the buffer_heads are migrated.
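
For context, since the radix tree keeps coming up: in the existing code a
filesystem opts into page migration through its address_space_operations, and
the common helpers in mm/migrate.c move the radix tree entry (and, with
buffer_migrate_page(), the buffer_heads). A sketch of the usual wiring in
kernels of this era, for illustration only:

#include <linux/fs.h>
#include <linux/migrate.h>

static const struct address_space_operations example_aops = {
        /* readpage/writepage/etc. omitted for brevity */
#ifdef CONFIG_MIGRATION
        /* buffer_migrate_page() also moves buffer_heads; filesystems without
         * buffer_heads typically use migrate_page() instead */
        .migratepage    = buffer_migrate_page,
#endif
};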

Jérôme


Re: [RFC 0/8] Define coherent device memory node

2016-10-26 Thread Jerome Glisse
On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote:
> On 10/26/2016 12:22 AM, Jerome Glisse wrote:
> > On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
> >> Jerome Glisse  writes:
> >>
> >>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
>  Jerome Glisse  writes:
> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >>>
> >>> [...]
> >>>
> > You can take a look at hmm-v13 if you want to see how i do non LRU page
> > migration. While i put most of the migration code inside hmm_migrate.c 
> > it
> > could easily be move to migrate.c without hmm_ prefix.
> >
> > There is 2 missing piece with existing migrate code. First is to put 
> > memory
> > allocation for destination under control of who call the migrate code. 
> > Second
> > is to allow offloading the copy operation to device (ie not use the CPU 
> > to
> > copy data).
> >
> > I believe same requirement also make sense for platform you are 
> > targeting.
> > Thus same code can be use.
> >
> > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
> >
> > I haven't posted this patchset yet because we are doing some 
> > modifications
> > to the device driver API to accomodate some new features. But the 
> > ZONE_DEVICE
> > changes and the overall migration code will stay the same more or less 
> > (i have
> > patches that move it to migrate.c and share more code with existing 
> > migrate
> > code).
> >
> > If you think i missed anything about lru and page cache please point it 
> > to
> > me. Because when i audited code for that i didn't see any road block 
> > with
> > the few fs i was looking at (ext4, xfs and core page cache code).
> >
> 
>  The other restriction around ZONE_DEVICE is, it is not a managed zone.
>  That prevents any direct allocation from coherent device by application.
>  ie, we would like to force allocation from coherent device using
>  interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
> >>>
> >>> To achieve this we rely on device fault code path ie when device take a 
> >>> page fault
> >>> with help of HMM it will use existing memory if any for fault address but 
> >>> if CPU
> >>> page table is empty (and it is not file back vma because of readback) 
> >>> then device
> >>> can directly allocate device memory and HMM will update CPU page table to 
> >>> point to
> >>> newly allocated device memory.
> >>>
> >>
> >> That is ok if the device touch the page first. What if we want the
> >> allocation touched first by cpu to come from GPU ?. Should we always
> >> depend on GPU driver to migrate such pages later from system RAM to GPU
> >> memory ?
> >>
> > 
> > I am not sure what kind of workload would rather have every first CPU 
> > access for
> > a range to use device memory. So no my code does not handle that and it is 
> > pointless
> > for it as CPU can not access device memory for me.
> > 
> > That said nothing forbid to add support for ZONE_DEVICE with mbind() like 
> > syscall.
> > Thought my personnal preference would still be to avoid use of such generic 
> > syscall
> > but have device driver set allocation policy through its own userspace API 
> > (device
> > driver could reuse internal of mbind() to achieve the end result).
> > 
> > I am not saying that eveything you want to do is doable now with HMM but, 
> > nothing
> > preclude achieving what you want to achieve using ZONE_DEVICE. I really 
> > don't think
> > any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and 
> > can be reuse
> > with device memory.
> > 
> > Each device is so different from the other that i don't believe in a one 
> > API fit all.
> > The drm GPU subsystem of the kernel is a testimony of how little can be 
> > share when it
> > comes to GPU. The only common code is modesetting. Everything that deals 
> > with how to
> > use GPU to compute stuff is per device and most of the logic is in 
> > userspace. So i do
> > not see any commonality that could be abstracted at syscall level. I would 
> > rather let
> > device driver stack (kernel and userspace) take such decision and have the 
> > higher level
> > API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of 
> > them.
> > Programmer target those high level API and they intend to use the mechanism 
> > each offer
> > to manage memory and memory placement. I would say forcing them to use a 
> > second linux
> > specific API to achieve the latter is wrong, at lest for now.
> > 
> > So in the end if the mbind() syscall is done by the userspace side of the 
> > device driver
> > then why not just having the device driver communicate this through its own 
> > kernel
> > API (which can be much more expressive than what standardize syscall 
> > offers). I would
> > rather avoid making change to any

Re: [RFC 0/8] Define coherent device memory node

2016-10-26 Thread Anshuman Khandual
On 10/26/2016 12:22 AM, Jerome Glisse wrote:
> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
>> Jerome Glisse  writes:
>>
>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
 Jerome Glisse  writes:
> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>>>
>>> [...]
>>>
> You can take a look at hmm-v13 if you want to see how i do non LRU page
> migration. While i put most of the migration code inside hmm_migrate.c it
> could easily be move to migrate.c without hmm_ prefix.
>
> There is 2 missing piece with existing migrate code. First is to put 
> memory
> allocation for destination under control of who call the migrate code. 
> Second
> is to allow offloading the copy operation to device (ie not use the CPU to
> copy data).
>
> I believe same requirement also make sense for platform you are targeting.
> Thus same code can be use.
>
> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
>
> I haven't posted this patchset yet because we are doing some modifications
> to the device driver API to accomodate some new features. But the 
> ZONE_DEVICE
> changes and the overall migration code will stay the same more or less (i 
> have
> patches that move it to migrate.c and share more code with existing 
> migrate
> code).
>
> If you think i missed anything about lru and page cache please point it to
> me. Because when i audited code for that i didn't see any road block with
> the few fs i was looking at (ext4, xfs and core page cache code).
>

 The other restriction around ZONE_DEVICE is, it is not a managed zone.
 That prevents any direct allocation from coherent device by application.
 ie, we would like to force allocation from coherent device using
 interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
>>>
>>> To achieve this we rely on device fault code path ie when device take a 
>>> page fault
>>> with help of HMM it will use existing memory if any for fault address but 
>>> if CPU
>>> page table is empty (and it is not file back vma because of readback) then 
>>> device
>>> can directly allocate device memory and HMM will update CPU page table to 
>>> point to
>>> newly allocated device memory.
>>>
>>
>> That is ok if the device touch the page first. What if we want the
>> allocation touched first by cpu to come from GPU ?. Should we always
>> depend on GPU driver to migrate such pages later from system RAM to GPU
>> memory ?
>>
> 
> I am not sure what kind of workload would rather have every first CPU access 
> for
> a range to use device memory. So no my code does not handle that and it is 
> pointless
> for it as CPU can not access device memory for me.

If the user space application can explicitly allocate device memory directly, we
can save one round of migration when the device starts accessing it. But then one
can argue what problem the device would work on with freshly allocated memory
which has not yet been touched by the CPU to load any data. Will look into
this scenario in more detail.

> 
> That said nothing forbid to add support for ZONE_DEVICE with mbind() like 
> syscall.
> Thought my personnal preference would still be to avoid use of such generic 
> syscall
> but have device driver set allocation policy through its own userspace API 
> (device
> driver could reuse internal of mbind() to achieve the end result).

Okay, the basic premise of the CDM node is to have an LRU based design where we
can avoid the use of driver specific user space memory management code altogether.
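
For what it is worth, a runnable user-space illustration of the
mbind(MPOL_BIND, ..) idea quoted above is below. It assumes the CDM node shows
up as an ordinary NUMA node (node 1 is an arbitrary choice here); whether a
ZONE_DEVICE backed node could be targeted this way at all is exactly the open
question in this sub-thread. Build with: gcc -o bindmem bindmem.c -lnuma

#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 4UL << 20;                 /* 4 MiB */
        unsigned long nodemask = 1UL << 1;      /* bit 1: assumed CDM node id */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Ask the kernel to satisfy faults for this range from node 1 only. */
        if (mbind(buf, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0)) {
                perror("mbind");
                return 1;
        }

        memset(buf, 0, len);    /* first touch allocates on the bound node */
        puts("range bound to node 1 and touched");
        return 0;
}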

> 
> I am not saying that eveything you want to do is doable now with HMM but, 
> nothing
> preclude achieving what you want to achieve using ZONE_DEVICE. I really don't 
> think
> any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and 
> can be reuse
> with device memory.

With the CDM node based design, the expectation is to get all (or most) of the
core VM mechanisms working so that the driver has to do less device specific
optimization.

> 
> Each device is so different from the other that i don't believe in a one API 
> fit all.

Right, so as I had mentioned in the cover letter, pglist_data->coherent_device can
actually become a bit mask indicating the type of coherent device the node is, and
that can be used to implement multiple types of requirements in core mm for various
kinds of devices in the future.
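
A tiny sketch of that bit mask idea, only to show the shape (the
coherent_device field is the one from the RFC; the flag values and the helper
are hypothetical):

#include <linux/mmzone.h>

/* Hypothetical flag values, not part of the RFC patches. */
#define COHERENT_DEVICE_GPU     0x1     /* CPU-coherent GPU memory */
#define COHERENT_DEVICE_FPGA    0x2     /* some other coherent accelerator */

/*
 * pglist_data->coherent_device (from the RFC) would then hold an OR of such
 * flags, letting core mm apply per-device-type policy when walking nodes.
 */
static inline bool node_is_coherent_gpu(pg_data_t *pgdat)
{
        return pgdat->coherent_device & COHERENT_DEVICE_GPU;
}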

> The drm GPU subsystem of the kernel is a testimony of how little can be share 
> when it
> comes to GPU. The only common code is modesetting. Everything that deals with 
> how to
> use GPU to compute stuff is per device and most of the logic is in userspace. 
> So i do

What's the basic reason that prevents such code/functionality sharing?

> not see any commonality that could be abstracted at syscall level. I would 
> rather let
> device driver stack (kernel and userspace) take 

Re: [RFC 0/8] Define coherent device memory node

2016-10-26 Thread Anshuman Khandual
On 10/26/2016 12:22 AM, Jerome Glisse wrote:
> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
>> Jerome Glisse  writes:
>>
>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
 Jerome Glisse  writes:
> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>>>
>>> [...]
>>>
> You can take a look at hmm-v13 if you want to see how i do non LRU page
> migration. While i put most of the migration code inside hmm_migrate.c it
> could easily be move to migrate.c without hmm_ prefix.
>
> There is 2 missing piece with existing migrate code. First is to put 
> memory
> allocation for destination under control of who call the migrate code. 
> Second
> is to allow offloading the copy operation to device (ie not use the CPU to
> copy data).
>
> I believe same requirement also make sense for platform you are targeting.
> Thus same code can be use.
>
> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
>
> I haven't posted this patchset yet because we are doing some modifications
> to the device driver API to accommodate some new features. But the ZONE_DEVICE
> changes and the overall migration code will stay the same more or less (I have
> patches that move it to migrate.c and share more code with the existing
> migrate code).
>
> If you think i missed anything about lru and page cache please point it to
> me. Because when i audited code for that i didn't see any road block with
> the few fs i was looking at (ext4, xfs and core page cache code).
>

 The other restriction around ZONE_DEVICE is, it is not a managed zone.
 That prevents any direct allocation from coherent device by application.
 ie, we would like to force allocation from coherent device using
 interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
>>>
>>> To achieve this we rely on the device fault code path, ie when the device
>>> takes a page fault, with the help of HMM it will use existing memory, if any,
>>> for the fault address; but if the CPU page table is empty (and it is not a
>>> file backed vma because of readback) then the device can directly allocate
>>> device memory and HMM will update the CPU page table to point to the newly
>>> allocated device memory.
>>>
>>
>> That is ok if the device touches the page first. What if we want the
>> allocation touched first by the CPU to come from the GPU? Should we always
>> depend on the GPU driver to migrate such pages later from system RAM to GPU
>> memory?
>>
> 
> I am not sure what kind of workload would want every first CPU access to a
> range to use device memory. So no, my code does not handle that, and it is
> pointless for it as the CPU can not access device memory in my case.
> 
> That said, nothing forbids adding support for ZONE_DEVICE with an mbind()-like
> syscall. Though my personal preference would still be to avoid use of such a
> generic syscall and have the device driver set the allocation policy through
> its own userspace API (the device driver could reuse the internals of mbind()
> to achieve the same end result).
> 
> I am not saying that everything you want to do is doable now with HMM, but
> nothing precludes achieving what you want to achieve using ZONE_DEVICE. I
> really don't think any of the existing mm mechanisms (kswapd, lru, numa, ...)
> are a nice fit or can be reused with device memory.
> 
> Each device is so different from the others that I don't believe in one API
> that fits all. The drm GPU subsystem of the kernel is a testimony to how little
> can be shared when it comes to GPUs. The only common code is modesetting.
> Everything that deals with how to use the GPU to compute stuff is per device,
> and most of the logic is in userspace. So I do not see any commonality that
> could be abstracted at the syscall level. I would rather let the device driver
> stack (kernel and userspace) take such decisions and have the higher level APIs
> (OpenCL, Cuda, C++17, ...) expose something that makes sense for each of them.
> Programmers target those high level APIs and they intend to use the mechanisms
> each offers to manage memory and memory placement. I would say forcing them to
> use a second, Linux specific API to achieve the latter is wrong, at least for
> now.
> 
> So in the end, if the mbind() syscall is done by the userspace side of the
> device driver, then why not just have the device driver communicate this
> through its own kernel API (which can be much more expressive than what a
> standardized syscall offers)? I would rather avoid making changes to any
> syscall for now.
> 
> If later down the road, once the userspace ecosystem stabilizes, we see that
> there is a good level at which we can abstract memory policy for enough
> devices, then and only then would it make sense to either introduce a new
> syscall or grow/modify an existing one. Right now I fear we could only make
> bad decisions that we would regret down the road.

Re: [RFC 0/8] Define coherent device memory node

2016-10-26 Thread Aneesh Kumar K.V
Jerome Glisse  writes:

> On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote:
>> Jerome Glisse  writes:
>> 
>> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>> >
>> I looked at the hmm-v13 w.r.t migration and I guess some form of device
>> callback/acceleration during migration is something we should definitely
>> have. I still haven't figured out how non addressable and coherent device
>> memory can fit together there. I was waiting for the page cache
>> migration support to be pushed to the repository before I start looking
>> at this closely.
>> 
>
> The page cache migration does not touch the migrate code path. My issue with
> page cache is writeback. The only difference with the existing migrate code is
> the refcount check for ZONE_DEVICE pages. Everything else is the same.

What about the radix tree? Does the filesystem's migrate_page callback handle
replacing a normal page with a ZONE_DEVICE page/exceptional entry?

>
> For writeback I need to use a bounce page, so basically I am trying to hook
> into the ISA bounce infrastructure for bio, and I think it is the easiest path
> to solve this in my case.
>
> In your case, where the block device can also access the device memory, you
> don't even need to use a bounce page for writeback.
>

-aneesh



Re: [RFC 0/8] Define coherent device memory node

2016-10-25 Thread Jerome Glisse
On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse  writes:
> 
> > On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
> >> Jerome Glisse  writes:
> >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >
> > [...]
> >
> >> > You can take a look at hmm-v13 if you want to see how i do non LRU page
> >> > migration. While i put most of the migration code inside hmm_migrate.c it
> >> > could easily be move to migrate.c without hmm_ prefix.
> >> >
> >> > There is 2 missing piece with existing migrate code. First is to put 
> >> > memory
> >> > allocation for destination under control of who call the migrate code. 
> >> > Second
> >> > is to allow offloading the copy operation to device (ie not use the CPU 
> >> > to
> >> > copy data).
> >> >
> >> > I believe same requirement also make sense for platform you are 
> >> > targeting.
> >> > Thus same code can be use.
> >> >
> >> > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
> >> >
> >> > I haven't posted this patchset yet because we are doing some 
> >> > modifications
> >> > to the device driver API to accomodate some new features. But the 
> >> > ZONE_DEVICE
> >> > changes and the overall migration code will stay the same more or less 
> >> > (i have
> >> > patches that move it to migrate.c and share more code with existing 
> >> > migrate
> >> > code).
> >> >
> >> > If you think i missed anything about lru and page cache please point it 
> >> > to
> >> > me. Because when i audited code for that i didn't see any road block with
> >> > the few fs i was looking at (ext4, xfs and core page cache code).
> >> >
> >> 
> >> The other restriction around ZONE_DEVICE is, it is not a managed zone.
> >> That prevents any direct allocation from coherent device by application.
> >> ie, we would like to force allocation from coherent device using
> >> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
> >
> > To achieve this we rely on the device fault code path, ie when the device
> > takes a page fault, with the help of HMM it will use existing memory, if
> > any, for the fault address; but if the CPU page table is empty (and it is
> > not a file backed vma because of readback) then the device can directly
> > allocate device memory and HMM will update the CPU page table to point to
> > the newly allocated device memory.
> >
> 
> That is ok if the device touches the page first. What if we want the
> allocation touched first by the CPU to come from the GPU? Should we always
> depend on the GPU driver to migrate such pages later from system RAM to GPU
> memory?
> 

I am not sure what kind of workload would want every first CPU access to a range
to use device memory. So no, my code does not handle that, and it is pointless
for it as the CPU can not access device memory in my case.

That said, nothing forbids adding support for ZONE_DEVICE with an mbind()-like
syscall. Though my personal preference would still be to avoid use of such a
generic syscall and have the device driver set the allocation policy through
its own userspace API (the device driver could reuse the internals of mbind()
to achieve the same end result).

I am not saying that everything you want to do is doable now with HMM, but
nothing precludes achieving what you want to achieve using ZONE_DEVICE. I
really don't think any of the existing mm mechanisms (kswapd, lru, numa, ...)
are a nice fit or can be reused with device memory.

Each device is so different from the others that I don't believe in one API
that fits all. The drm GPU subsystem of the kernel is a testimony to how little
can be shared when it comes to GPUs. The only common code is modesetting.
Everything that deals with how to use the GPU to compute stuff is per device,
and most of the logic is in userspace. So I do not see any commonality that
could be abstracted at the syscall level. I would rather let the device driver
stack (kernel and userspace) take such decisions and have the higher level APIs
(OpenCL, Cuda, C++17, ...) expose something that makes sense for each of them.
Programmers target those high level APIs and they intend to use the mechanisms
each offers to manage memory and memory placement. I would say forcing them to
use a second, Linux specific API to achieve the latter is wrong, at least for
now.

So in the end, if the mbind() syscall is done by the userspace side of the
device driver, then why not just have the device driver communicate this
through its own kernel API (which can be much more expressive than what a
standardized syscall offers)? I would rather avoid making changes to any
syscall for now.

If later down the road, once the userspace ecosystem stabilizes, we see that
there is a good level at which we can abstract memory policy for enough
devices, then and only then would it make sense to either introduce a new
syscall or grow/modify an existing one. Right now I fear we could only make
bad decisions that we would regret down the road.

I think we can achieve memory device support with the minimum amou

Re: [RFC 0/8] Define coherent device memory node

2016-10-25 Thread Aneesh Kumar K.V
Jerome Glisse  writes:

> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
>> Jerome Glisse  writes:
>> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>
> [...]
>
>> > You can take a look at hmm-v13 if you want to see how i do non LRU page
>> > migration. While i put most of the migration code inside hmm_migrate.c it
>> > could easily be move to migrate.c without hmm_ prefix.
>> >
>> > There is 2 missing piece with existing migrate code. First is to put memory
>> > allocation for destination under control of who call the migrate code. 
>> > Second
>> > is to allow offloading the copy operation to device (ie not use the CPU to
>> > copy data).
>> >
>> > I believe same requirement also make sense for platform you are targeting.
>> > Thus same code can be use.
>> >
>> > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
>> >
>> > I haven't posted this patchset yet because we are doing some modifications
>> > to the device driver API to accomodate some new features. But the 
>> > ZONE_DEVICE
>> > changes and the overall migration code will stay the same more or less (i 
>> > have
>> > patches that move it to migrate.c and share more code with existing migrate
>> > code).
>> >
>> > If you think i missed anything about lru and page cache please point it to
>> > me. Because when i audited code for that i didn't see any road block with
>> > the few fs i was looking at (ext4, xfs and core page cache code).
>> >
>> 
>> The other restriction around ZONE_DEVICE is, it is not a managed zone.
>> That prevents any direct allocation from coherent device by application.
>> ie, we would like to force allocation from coherent device using
>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
>
> To achieve this we rely on the device fault code path, ie when the device
> takes a page fault, with the help of HMM it will use existing memory, if any,
> for the fault address; but if the CPU page table is empty (and it is not a
> file backed vma because of readback) then the device can directly allocate
> device memory and HMM will update the CPU page table to point to the newly
> allocated device memory.
>

That is ok if the device touches the page first. What if we want the allocation
touched first by the CPU to come from the GPU? Should we always depend on the
GPU driver to migrate such pages later from system RAM to GPU memory?

-aneesh



Re: [RFC 0/8] Define coherent device memory node

2016-10-25 Thread Jerome Glisse
On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse  writes:
> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:

[...]

> > You can take a look at hmm-v13 if you want to see how i do non LRU page
> > migration. While i put most of the migration code inside hmm_migrate.c it
> > could easily be move to migrate.c without hmm_ prefix.
> >
> > There are 2 missing pieces with the existing migrate code. First is to put
> > memory allocation for the destination under control of whoever calls the
> > migrate code. Second is to allow offloading the copy operation to the device
> > (ie not use the CPU to copy data).
> >
> > I believe same requirement also make sense for platform you are targeting.
> > Thus same code can be use.
> >
> > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
> >
> > I haven't posted this patchset yet because we are doing some modifications
> > to the device driver API to accommodate some new features. But the
> > ZONE_DEVICE changes and the overall migration code will stay the same more
> > or less (I have patches that move it to migrate.c and share more code with
> > the existing migrate code).
> >
> > If you think i missed anything about lru and page cache please point it to
> > me. Because when i audited code for that i didn't see any road block with
> > the few fs i was looking at (ext4, xfs and core page cache code).
> >
> 
> The other restriction around ZONE_DEVICE is that it is not a managed zone.
> That prevents any direct allocation from the coherent device by an application.
> ie, we would like to force allocation from the coherent device using an
> interface like mbind(MPOL_BIND..). Is that possible with ZONE_DEVICE?
 
To achieve this we rely on the device fault code path, ie when the device takes
a page fault, with the help of HMM it will use existing memory, if any, for the
fault address; but if the CPU page table is empty (and it is not a file backed
vma because of readback) then the device can directly allocate device memory and
HMM will update the CPU page table to point to the newly allocated device memory.

So in fact I am not using an existing kernel API to achieve this; the whole
policy of where to allocate and what to allocate is under the device driver's
responsibility, and the device driver leverages its existing userspace API to
get proper hints/direction from the application.
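To make that concrete, such a driver-private hint can be as small as an ioctl
carrying a range and a preference flag. Everything below is hypothetical (made
up ioctl number, structure and flag names), only sketching the shape of such an
interface:

#include <linux/ioctl.h>
#include <stdint.h>

/* Hypothetical driver-private hint: "prefer device memory for this range". */
struct fake_dev_mem_hint {
    uint64_t addr;      /* start of the user virtual range */
    uint64_t len;       /* length in bytes                 */
    uint32_t flags;     /* e.g. FAKE_HINT_PREFER_DEVICE    */
    uint32_t pad;
};

#define FAKE_HINT_PREFER_DEVICE (1u << 0)
#define FAKE_DEV_IOC_MEM_HINT   _IOW('F', 0x01, struct fake_dev_mem_hint)

/*
 * Userspace (typically the driver's own runtime, not the application) would do:
 *     struct fake_dev_mem_hint hint = {
 *         .addr = (uint64_t)(uintptr_t)buf, .len = len,
 *         .flags = FAKE_HINT_PREFER_DEVICE,
 *     };
 *     ioctl(dev_fd, FAKE_DEV_IOC_MEM_HINT, &hint);
 * and the driver, on the next device page fault in that range, would allocate
 * device memory instead of reusing system RAM.
 */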

Device memory is really a special case in my view; it only makes sense to use it
if memory is actively accessed by the device, and the only way the device
accesses memory is when it is programmed to do so through the device driver API.
There is no such thing as GPU threads in the kernel and there is no way to spawn
or move a work thread to the GPU. These are specialized devices and they require
special per device APIs. So in my view using an existing kernel API such as
mbind() is counter productive. You might have buggy software that mbinds its
memory to the device and never uses the device, which leads to device memory
being wasted for a process that never uses the device.

So my opinion is that you should not try to use an existing kernel API to get
policy information from userspace but let the device driver gather such policy
through its own private API.

Cheers,
Jérôme


Re: [RFC 0/8] Define coherent device memory node

2016-10-25 Thread Jerome Glisse
On Tue, Oct 25, 2016 at 11:07:39PM +1100, Balbir Singh wrote:
> On 25/10/16 04:09, Jerome Glisse wrote:
> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> > 
> >> [...]
> > 
> >>Core kernel memory features like reclamation, evictions etc. might
> >> need to be restricted or modified on the coherent device memory node as
> >> they can be performance limiting. The RFC does not propose anything on this
> >> yet but it can be looked into later on. For now it just disables Auto NUMA
> >> for any VMA which has coherent device memory.
> >>
> >>Seamless integration of coherent device memory with system memory
> >> will enable various other features, some of which can be listed as follows.
> >>
> >>a. Seamless migrations between system RAM and the coherent memory
> >>b. Will have asynchronous and high throughput migrations
> >>c. Be able to allocate huge order pages from these memory regions
> >>d. Restrict allocations to a large extent to the tasks using the
> >>   device for workload acceleration
> >>
> >>Before concluding, will look into the reasons why the existing
> >> solutions don't work. There are two basic requirements which have to be
> >> satisfies before the coherent device memory can be integrated with core
> >> kernel seamlessly.
> >>
> >>a. PFN must have struct page
> >>b. Struct page must able to be inside standard LRU lists
> >>
> >>The above two basic requirements discard the existing method of
> >> device memory representation approaches like these which then requires the
> >> need of creating a new framework.
> > 
> > I do not believe the LRU list is a hard requirement, yes when faulting in
> > a page inside the page cache it assumes it needs to be added to lru list.
> > But i think this can easily be work around.
> > 
> > In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU
> > (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...)
> > so in my case a file back page must always be spawn first from a regular
> > page and once read from disk then i can migrate to GPU page.
> > 
> 
> I've not seen the HMM patchset, but read from disk will go to ZONE_DEVICE?
> Then get migrated?

Because in my case device memory is not accessible by anything except the device
(not entirely true, but for the sake of the design it is), any page read from
disk will first be read into a regular page (from regular system memory). It is
only once it is uptodate and in the page cache that it can be migrated to a
ZONE_DEVICE page.

So a read from disk uses an intermediary page. Writeback is kind of the same; I
plan on using a bounce page by leveraging the existing bio bounce infrastructure.
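A toy model of that bounce step (plain C, no kernel or block layer APIs; all
names are made up): before a CPU-inaccessible device page is handed to the
block layer, its contents are copied by the device into a regular system RAM
page and the write is issued against that bounce page instead.

#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Stand-in for the device's copy engine (real hardware would use its DMA). */
static void device_copy_to_ram(void *ram_dst, const void *dev_src, size_t len)
{
    memcpy(ram_dst, dev_src, len);
}

/* Stand-in for submitting a write against a CPU/DMA visible page. */
static int submit_write(const void *cpu_visible_page, size_t len)
{
    (void)cpu_visible_page;
    (void)len;
    return 0;
}

/* Write back one device page that the CPU and block device cannot address. */
static int writeback_device_page(const void *dev_page)
{
    void *bounce = malloc(TOY_PAGE_SIZE);   /* ordinary system RAM */
    int ret;

    if (!bounce)
        return -1;

    device_copy_to_ram(bounce, dev_page, TOY_PAGE_SIZE);
    ret = submit_write(bounce, TOY_PAGE_SIZE);  /* block layer only sees RAM */
    free(bounce);
    return ret;
}

In the coherent case discussed in this thread the block device can read the
page directly, so this extra copy is not needed.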

Cheers,
Jérôme


Re: [RFC 0/8] Define coherent device memory node

2016-10-25 Thread Jerome Glisse
On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse  writes:
> 
> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >
> >> [...]
> >
> >>Core kernel memory features like reclamation, evictions etc. might
> >> need to be restricted or modified on the coherent device memory node as
> >> they can be performance limiting. The RFC does not propose anything on this
> >> yet but it can be looked into later on. For now it just disables Auto NUMA
> >> for any VMA which has coherent device memory.
> >> 
> >>Seamless integration of coherent device memory with system memory
> >> will enable various other features, some of which can be listed as follows.
> >> 
> >>a. Seamless migrations between system RAM and the coherent memory
> >>b. Will have asynchronous and high throughput migrations
> >>c. Be able to allocate huge order pages from these memory regions
> >>d. Restrict allocations to a large extent to the tasks using the
> >>   device for workload acceleration
> >> 
> >>Before concluding, will look into the reasons why the existing
> >> solutions don't work. There are two basic requirements which have to be
> >> satisfies before the coherent device memory can be integrated with core
> >> kernel seamlessly.
> >> 
> >>a. PFN must have struct page
> >>b. Struct page must able to be inside standard LRU lists
> >> 
> >>The above two basic requirements discard the existing method of
> >> device memory representation approaches like these which then requires the
> >> need of creating a new framework.
> >
> > I do not believe the LRU list is a hard requirement, yes when faulting in
> > a page inside the page cache it assumes it needs to be added to lru list.
> > But i think this can easily be work around.
> >
> > In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU
> > (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...)
> > so in my case a file back page must always be spawn first from a regular
> > page and once read from disk then i can migrate to GPU page.
> >
> > So if you accept this intermediary step you can easily use ZONE_DEVICE for
> > device memory. This way no lru, no complex dance to make the memory out of
> > reach from regular memory allocator.
> 
> One of the reasons to look at this as a NUMA node is to allow things like
> over-commit of coherent device memory. The pages backing CDM being part of the
> LRU, and considering the coherent device as a NUMA node, makes that really
> simple (we can run kswapd for that node).

I am not convinced that kswapd is what you want for overcommit. For HMM I leave
overcommit to the device driver, and they seem quite happy about handling that
themselves. Only the device driver has enough information on what is worth
evicting or what needs to be evicted.
 
> > I think we would have much to gain if we pool our effort on a single common
> > solution for device memory. In my case the device memory is not accessible
> > by the CPU (because PCIE restrictions), in your case it is. Thus the only
> > difference is that in my case it can not be map inside the CPU page table
> > while in yours it can.
> 
> IMHO, we should be able to share the HMM migration approach. We
> definitely won't need the mirror page table part. That is one of the
> reasons I requested the HMM mirror page table to be a separate patchset.

They will need to share one thing, that is hmm_pfn_t, which is a special pfn
type in which I store HMM- and migrate-specific flags for migration. Because I
can not use the struct list_head lru of struct page, I have to do migration
using an array of pfns, and I need to keep some flags per page during migration.

So I share the same hmm_pfn_t type between the mirror and migrate code. But
that's pretty small and it can be factored out of HMM; I can also just use
pfn_t and add the flags I need there.
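A standalone sketch of that idea (the flag names and bit layout here are
invented, not the hmm-v13 definitions): per-page migration state travels in an
array of pfn-sized words instead of on struct page's lru list.

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t toy_hmm_pfn_t;

#define TOY_PFN_VALID   (1ULL << 63)    /* entry holds a real pfn         */
#define TOY_PFN_MIGRATE (1ULL << 62)    /* page selected for migration    */
#define TOY_PFN_DEVICE  (1ULL << 61)    /* backed by device (ZONE_DEVICE) */
#define TOY_PFN_MASK    ((1ULL << 61) - 1)

static toy_hmm_pfn_t toy_pfn_entry(uint64_t pfn, uint64_t flags)
{
    return (pfn & TOY_PFN_MASK) | flags | TOY_PFN_VALID;
}

static bool toy_pfn_needs_copy(toy_hmm_pfn_t entry)
{
    return (entry & (TOY_PFN_VALID | TOY_PFN_MIGRATE)) ==
           (TOY_PFN_VALID | TOY_PFN_MIGRATE);
}

The migrate path can then walk such an array, allocate destinations for the
entries that need a copy, and hand the whole batch to a device DMA engine
instead of copying with the CPU.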
 
> 
> >
> >> 
> >> (1) Traditional ioremap
> >> 
> >>a. Memory is mapped into kernel (linear and virtual) and user space
> >>b. These PFNs do not have struct pages associated with it
> >>c. These special PFNs are marked with special flags inside the PTE
> >>d. Cannot participate in core VM functions much because of this
> >>e. Cannot do easy user space migrations
> >> 
> >> (2) Zone ZONE_DEVICE
> >> 
> >>a. Memory is mapped into kernel and user space
> >>b. PFNs do have struct pages associated with it
> >>c. These struct pages are allocated inside it's own memory range
> >>d. Unfortunately the struct page's union containing LRU has been
> >>   used for struct dev_pagemap pointer
> >>e. Hence it cannot be part of any LRU (like Page cache)
> >>f. Hence file cached mapping cannot reside on these PFNs
> >>g. Cannot do easy migrations
> >> 
> >>I had also explored non LRU representation of this coherent device
> >> memory where the integration with system RAM in the core VM is limited only
> >> to the following functions. Not being inside LRU is definitely going to
> >> reduce the scope of tight integration with system RAM.

Re: [RFC 0/8] Define coherent device memory node

2016-10-25 Thread Balbir Singh


On 25/10/16 04:09, Jerome Glisse wrote:
> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> 
>> [...]
> 
>>  Core kernel memory features like reclamation, evictions etc. might
>> need to be restricted or modified on the coherent device memory node as
>> they can be performance limiting. The RFC does not propose anything on this
>> yet but it can be looked into later on. For now it just disables Auto NUMA
>> for any VMA which has coherent device memory.
>>
>>  Seamless integration of coherent device memory with system memory
>> will enable various other features, some of which can be listed as follows.
>>
>>  a. Seamless migrations between system RAM and the coherent memory
>>  b. Will have asynchronous and high throughput migrations
>>  c. Be able to allocate huge order pages from these memory regions
>>  d. Restrict allocations to a large extent to the tasks using the
>> device for workload acceleration
>>
>>  Before concluding, will look into the reasons why the existing
>> solutions don't work. There are two basic requirements which have to be
>> satisfies before the coherent device memory can be integrated with core
>> kernel seamlessly.
>>
>>  a. PFN must have struct page
>>  b. Struct page must able to be inside standard LRU lists
>>
>>  The above two basic requirements discard the existing method of
>> device memory representation approaches like these which then requires the
>> need of creating a new framework.
> 
> I do not believe the LRU list is a hard requirement, yes when faulting in
> a page inside the page cache it assumes it needs to be added to lru list.
> But i think this can easily be work around.
> 
> In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU
> (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...)
> so in my case a file back page must always be spawn first from a regular
> page and once read from disk then i can migrate to GPU page.
> 

I've not seen the HMM patchset, but read from disk will go to ZONE_DEVICE?
Then get migrated?

> So if you accept this intermediary step you can easily use ZONE_DEVICE for
> device memory. This way no lru, no complex dance to make the memory out of
> reach from regular memory allocator.
> 
> I think we would have much to gain if we pool our effort on a single common
> solution for device memory. In my case the device memory is not accessible
> by the CPU (because PCIE restrictions), in your case it is. Thus the only
> difference is that in my case it can not be map inside the CPU page table
> while in yours it can.
> 

I think that's a good idea, to pool our efforts while at the same time making progress.

>>
>> (1) Traditional ioremap
>>
>>  a. Memory is mapped into kernel (linear and virtual) and user space
>>  b. These PFNs do not have struct pages associated with it
>>  c. These special PFNs are marked with special flags inside the PTE
>>  d. Cannot participate in core VM functions much because of this
>>  e. Cannot do easy user space migrations
>>
>> (2) Zone ZONE_DEVICE
>>
>>  a. Memory is mapped into kernel and user space
>>  b. PFNs do have struct pages associated with it
>>  c. These struct pages are allocated inside it's own memory range
>>  d. Unfortunately the struct page's union containing LRU has been
>> used for struct dev_pagemap pointer
>>  e. Hence it cannot be part of any LRU (like Page cache)
>>  f. Hence file cached mapping cannot reside on these PFNs
>>  g. Cannot do easy migrations
>>
>>  I had also explored non LRU representation of this coherent device
>> memory where the integration with system RAM in the core VM is limited only
>> to the following functions. Not being inside LRU is definitely going to
>> reduce the scope of tight integration with system RAM.
>>
>> (1) Migration support between system RAM and coherent memory
>> (2) Migration support between various coherent memory nodes
>> (3) Isolation of the coherent memory
>> (4) Mapping the coherent memory into user space through driver's
>> struct vm_operations
>> (5) HW poisoning of the coherent memory
>>
>>  Allocating the entire memory of the coherent device node right
>> after hot plug into ZONE_MOVABLE (where the memory is already inside the
>> buddy system) will still expose a time window where other user space
>> allocations can come into the coherent device memory node and prevent the
>> intended isolation. So traditional hot plug is not the solution. Hence
>> started looking into CMA based non LRU solution but then hit the following
>> roadblocks.
>>
>> (1) CMA does not support hot plugging of new memory node
>>  a. CMA area needs to be marked during boot before buddy is
>> initialized
>>  b. cma_alloc()/cma_release() can happen on the marked area
>>  c. Should be able to mark the CMA areas just after memory hot plug
>>  d. cma_alloc()/cma_release() can happen l

Re: [RFC 0/8] Define coherent device memory node

2016-10-24 Thread Aneesh Kumar K.V
Jerome Glisse  writes:

> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>
>> [...]
>
>>  Core kernel memory features like reclamation, evictions etc. might
>> need to be restricted or modified on the coherent device memory node as
>> they can be performance limiting. The RFC does not propose anything on this
>> yet but it can be looked into later on. For now it just disables Auto NUMA
>> for any VMA which has coherent device memory.
>> 
>>  Seamless integration of coherent device memory with system memory
>> will enable various other features, some of which can be listed as follows.
>> 
>>  a. Seamless migrations between system RAM and the coherent memory
>>  b. Will have asynchronous and high throughput migrations
>>  c. Be able to allocate huge order pages from these memory regions
>>  d. Restrict allocations to a large extent to the tasks using the
>> device for workload acceleration
>> 
>>  Before concluding, will look into the reasons why the existing
>> solutions don't work. There are two basic requirements which have to be
>> satisfies before the coherent device memory can be integrated with core
>> kernel seamlessly.
>> 
>>  a. PFN must have struct page
>>  b. Struct page must able to be inside standard LRU lists
>> 
>>  The above two basic requirements discard the existing method of
>> device memory representation approaches like these which then requires the
>> need of creating a new framework.
>
> I do not believe the LRU list is a hard requirement, yes when faulting in
> a page inside the page cache it assumes it needs to be added to lru list.
> But i think this can easily be work around.
>
> In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU
> (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...)
> so in my case a file back page must always be spawn first from a regular
> page and once read from disk then i can migrate to GPU page.
>
> So if you accept this intermediary step you can easily use ZONE_DEVICE for
> device memory. This way no lru, no complex dance to make the memory out of
> reach from regular memory allocator.
>
> I think we would have much to gain if we pool our effort on a single common
> solution for device memory. In my case the device memory is not accessible
> by the CPU (because PCIE restrictions), in your case it is. Thus the only
> difference is that in my case it can not be map inside the CPU page table
> while in yours it can.
>
>> 
>> (1) Traditional ioremap
>> 
>>  a. Memory is mapped into kernel (linear and virtual) and user space
>>  b. These PFNs do not have struct pages associated with it
>>  c. These special PFNs are marked with special flags inside the PTE
>>  d. Cannot participate in core VM functions much because of this
>>  e. Cannot do easy user space migrations
>> 
>> (2) Zone ZONE_DEVICE
>> 
>>  a. Memory is mapped into kernel and user space
>>  b. PFNs do have struct pages associated with it
>>  c. These struct pages are allocated inside it's own memory range
>>  d. Unfortunately the struct page's union containing LRU has been
>> used for struct dev_pagemap pointer
>>  e. Hence it cannot be part of any LRU (like Page cache)
>>  f. Hence file cached mapping cannot reside on these PFNs
>>  g. Cannot do easy migrations
>> 
>>  I had also explored non LRU representation of this coherent device
>> memory where the integration with system RAM in the core VM is limited only
>> to the following functions. Not being inside LRU is definitely going to
>> reduce the scope of tight integration with system RAM.
>> 
>> (1) Migration support between system RAM and coherent memory
>> (2) Migration support between various coherent memory nodes
>> (3) Isolation of the coherent memory
>> (4) Mapping the coherent memory into user space through driver's
>> struct vm_operations
>> (5) HW poisoning of the coherent memory
>> 
>>  Allocating the entire memory of the coherent device node right
>> after hot plug into ZONE_MOVABLE (where the memory is already inside the
>> buddy system) will still expose a time window where other user space
>> allocations can come into the coherent device memory node and prevent the
>> intended isolation. So traditional hot plug is not the solution. Hence
>> started looking into CMA based non LRU solution but then hit the following
>> roadblocks.
>> 
>> (1) CMA does not support hot plugging of new memory node
>>  a. CMA area needs to be marked during boot before buddy is
>> initialized
>>  b. cma_alloc()/cma_release() can happen on the marked area
>>  c. Should be able to mark the CMA areas just after memory hot plug
>>  d. cma_alloc()/cma_release() can happen later after the hot plug
>>  e. This is not currently supported right now
>> 
>> (2) Mapped non LRU migration of pages
>>  a. Recent work from Minchan Kim makes non LRU pages migratable

Re: [RFC 0/8] Define coherent device memory node

2016-10-24 Thread Aneesh Kumar K.V
Jerome Glisse  writes:

> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>
>> [...]
>
>>  Core kernel memory features like reclamation, evictions etc. might
>> need to be restricted or modified on the coherent device memory node as
>> they can be performance limiting. The RFC does not propose anything on this
>> yet but it can be looked into later on. For now it just disables Auto NUMA
>> for any VMA which has coherent device memory.
>> 
>>  Seamless integration of coherent device memory with system memory
>> will enable various other features, some of which can be listed as follows.
>> 
>>  a. Seamless migrations between system RAM and the coherent memory
>>  b. Will have asynchronous and high throughput migrations
>>  c. Be able to allocate huge order pages from these memory regions
>>  d. Restrict allocations to a large extent to the tasks using the
>> device for workload acceleration
>> 
>>  Before concluding, will look into the reasons why the existing
>> solutions don't work. There are two basic requirements which have to be
>> satisfies before the coherent device memory can be integrated with core
>> kernel seamlessly.
>> 
>>  a. PFN must have struct page
>>  b. Struct page must able to be inside standard LRU lists
>> 
>>  The above two basic requirements discard the existing method of
>> device memory representation approaches like these which then requires the
>> need of creating a new framework.
>
> I do not believe the LRU list is a hard requirement, yes when faulting in
> a page inside the page cache it assumes it needs to be added to lru list.
> But i think this can easily be work around.
>
> In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU
> (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...)
> so in my case a file back page must always be spawn first from a regular
> page and once read from disk then i can migrate to GPU page.
>
> So if you accept this intermediary step you can easily use ZONE_DEVICE for
> device memory. This way no lru, no complex dance to make the memory out of
> reach from regular memory allocator.

One of the reasons to look at this as a NUMA node is to allow things like
over-commit of coherent device memory. The pages backing CDM being part of the
LRU, and considering the coherent device as a NUMA node, makes that really
simple (we can run kswapd for that node).


>
> I think we would have much to gain if we pool our effort on a single common
> solution for device memory. In my case the device memory is not accessible
> by the CPU (because PCIE restrictions), in your case it is. Thus the only
> difference is that in my case it can not be map inside the CPU page table
> while in yours it can.

IMHO, we should be able to share the HMM migration approach. We
definitely won't need the mirror page table part. That is one of the
reasons I requested the HMM mirror page table to be a separate patchset.


>
>> 
>> (1) Traditional ioremap
>> 
>>  a. Memory is mapped into kernel (linear and virtual) and user space
>>  b. These PFNs do not have struct pages associated with it
>>  c. These special PFNs are marked with special flags inside the PTE
>>  d. Cannot participate in core VM functions much because of this
>>  e. Cannot do easy user space migrations
>> 
>> (2) Zone ZONE_DEVICE
>> 
>>  a. Memory is mapped into kernel and user space
>>  b. PFNs do have struct pages associated with it
>>  c. These struct pages are allocated inside it's own memory range
>>  d. Unfortunately the struct page's union containing LRU has been
>> used for struct dev_pagemap pointer
>>  e. Hence it cannot be part of any LRU (like Page cache)
>>  f. Hence file cached mapping cannot reside on these PFNs
>>  g. Cannot do easy migrations
>> 
>>  I had also explored non LRU representation of this coherent device
>> memory where the integration with system RAM in the core VM is limited only
>> to the following functions. Not being inside LRU is definitely going to
>> reduce the scope of tight integration with system RAM.
>> 
>> (1) Migration support between system RAM and coherent memory
>> (2) Migration support between various coherent memory nodes
>> (3) Isolation of the coherent memory
>> (4) Mapping the coherent memory into user space through driver's
>> struct vm_operations
>> (5) HW poisoning of the coherent memory
>> 
>>  Allocating the entire memory of the coherent device node right
>> after hot plug into ZONE_MOVABLE (where the memory is already inside the
>> buddy system) will still expose a time window where other user space
>> allocations can come into the coherent device memory node and prevent the
>> intended isolation. So traditional hot plug is not the solution. Hence
>> started looking into CMA based non LRU solution but then hit the following
>> roadblocks.
>> 
>> (1) CMA does not support hot plugging of new memory node

Re: [RFC 0/8] Define coherent device memory node

2016-10-24 Thread Dave Hansen
On 10/24/2016 11:32 AM, David Nellans wrote:
> On 10/24/2016 01:04 PM, Dave Hansen wrote:
>> If you *really* don't want a "cdm" page to be migrated, then why isn't
>> that policy set on the VMA in the first place?  That would keep "cdm"
>> pages from being made non-cdm.  And, why would autonuma ever make a
>> non-cdm page and migrate it in to cdm?  There will be no NUMA access
>> faults caused by the devices that are fed to autonuma.
>>
> Pages are desired to be migrateable, both into (starting cpu zone
> movable->cdm) and out of (starting cdm->cpu zone movable) but only
> through explicit migration, not via autonuma.

OK, and is there a reason that the existing mbind code plus NUMA
policies fail to give you this behavior?

Does autonuma somehow override strict NUMA binding?

>  other pages in the same
> VMA should still be migrateable between CPU nodes via autonuma however.

That's not the way the implementation here works, as I understand it.
See the VM_CDM patch and my responses to it.

> Its expected a lot of these allocations are going to end up in THPs. 
> I'm not sure we need to explicitly disallow hugetlbfs support but the
> identified use case is definitely via THPs not tlbfs.

I think THP and hugetlbfs are implementations, not use cases. :)

Is it so hard to support hugetlbfs that we should complicate its code
to exclude it from this type of memory?  Why?


Re: [RFC 0/8] Define coherent device memory node

2016-10-24 Thread David Nellans

On 10/24/2016 01:04 PM, Dave Hansen wrote:


> On 10/23/2016 09:31 PM, Anshuman Khandual wrote:
>> To achieve seamless integration between system RAM and coherent
>> device memory it must be able to utilize core memory kernel features like
>> anon mapping, file mapping, page cache, driver managed pages, HW poisoning,
>> migrations, reclaim, compaction, etc.
>
> So, you need to support all these things, but not autonuma or hugetlbfs?
> What's the reasoning behind that?
>
> If you *really* don't want a "cdm" page to be migrated, then why isn't
> that policy set on the VMA in the first place?  That would keep "cdm"
> pages from being made non-cdm.  And, why would autonuma ever make a
> non-cdm page and migrate it in to cdm?  There will be no NUMA access
> faults caused by the devices that are fed to autonuma.

Pages are desired to be migratable, both into (starting CPU zone
movable -> CDM) and out of (starting CDM -> CPU zone movable), but only
through explicit migration, not via autonuma. Other pages in the same
VMA should still be migratable between CPU nodes via autonuma, however.


It's expected that a lot of these allocations are going to end up in THPs.
I'm not sure we need to explicitly disallow hugetlbfs support, but the
identified use case is definitely via THPs, not hugetlbfs.




Re: [RFC 0/8] Define coherent device memory node

2016-10-24 Thread Dave Hansen
On 10/23/2016 09:31 PM, Anshuman Khandual wrote:
>   To achieve seamless integration  between system RAM and coherent
> device memory it must be able to utilize core memory kernel features like
> anon mapping, file mapping, page cache, driver managed pages, HW poisoning,
> migrations, reclaim, compaction, etc.

So, you need to support all these things, but not autonuma or hugetlbfs?
 What's the reasoning behind that?

If you *really* don't want a "cdm" page to be migrated, then why isn't
that policy set on the VMA in the first place?  That would keep "cdm"
pages from being made non-cdm.  And, why would autonuma ever make a
non-cdm page and migrate it in to cdm?  There will be no NUMA access
faults caused by the devices that are fed to autonuma.

I'm confused.


Re: [RFC 0/8] Define coherent device memory node

2016-10-24 Thread Jerome Glisse
On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:

> [...]

>   Core kernel memory features like reclamation, evictions etc. might
> need to be restricted or modified on the coherent device memory node as
> they can be performance limiting. The RFC does not propose anything on this
> yet but it can be looked into later on. For now it just disables Auto NUMA
> for any VMA which has coherent device memory.
> 
>   Seamless integration of coherent device memory with system memory
> will enable various other features, some of which can be listed as follows.
> 
>   a. Seamless migrations between system RAM and the coherent memory
>   b. Will have asynchronous and high throughput migrations
>   c. Be able to allocate huge order pages from these memory regions
>   d. Restrict allocations to a large extent to the tasks using the
>  device for workload acceleration
> 
>   Before concluding, will look into the reasons why the existing
> solutions don't work. There are two basic requirements which have to be
> satisfies before the coherent device memory can be integrated with core
> kernel seamlessly.
> 
>   a. PFN must have struct page
>   b. Struct page must able to be inside standard LRU lists
> 
>   The above two basic requirements discard the existing method of
> device memory representation approaches like these which then requires the
> need of creating a new framework.

I do not believe the LRU list is a hard requirement; yes, when faulting in
a page inside the page cache it is assumed it needs to be added to the lru list,
but I think this can easily be worked around.
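A toy model of that work-around (plain C, not kernel code; names invented):
when a page enters the page cache, only non-device pages get LRU bookkeeping.

#include <stdbool.h>
#include <stddef.h>

struct toy_page {
    bool is_zone_device;        /* backed by device memory (ZONE_DEVICE-like) */
    struct toy_page *lru_next;  /* toy LRU linkage                            */
};

static struct toy_page *toy_lru_head;

static void toy_lru_add(struct toy_page *page)
{
    page->lru_next = toy_lru_head;
    toy_lru_head = page;
}

/* Insert into the (toy) page cache: device pages simply skip the LRU. */
static void toy_page_cache_insert(struct toy_page *page)
{
    if (!page->is_zone_device)
        toy_lru_add(page);
    /* the mapping/radix-tree insertion itself would happen here either way */
}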

In HMM I am using ZONE_DEVICE, and because memory is not accessible from the CPU
(not everyone is blessed with a decent system bus like CAPI, CCIX, Gen-Z, ...),
in my case a file backed page must always be spawned first from a regular page,
and once read from disk I can then migrate it to a GPU page.

So if you accept this intermediary step you can easily use ZONE_DEVICE for
device memory. This way no lru, no complex dance to keep the memory out of
reach of the regular memory allocator.

I think we would have much to gain if we pool our efforts on a single common
solution for device memory. In my case the device memory is not accessible
by the CPU (because of PCIE restrictions); in your case it is. Thus the only
difference is that in my case it can not be mapped inside the CPU page table
while in yours it can.

> 
> (1) Traditional ioremap
> 
>   a. Memory is mapped into kernel (linear and virtual) and user space
>   b. These PFNs do not have struct pages associated with it
>   c. These special PFNs are marked with special flags inside the PTE
>   d. Cannot participate in core VM functions much because of this
>   e. Cannot do easy user space migrations
> 
> (2) Zone ZONE_DEVICE
> 
>   a. Memory is mapped into kernel and user space
>   b. PFNs do have struct pages associated with it
>   c. These struct pages are allocated inside it's own memory range
>   d. Unfortunately the struct page's union containing LRU has been
>  used for struct dev_pagemap pointer
>   e. Hence it cannot be part of any LRU (like Page cache)
>   f. Hence file cached mapping cannot reside on these PFNs
>   g. Cannot do easy migrations
> 
>   I had also explored non LRU representation of this coherent device
> memory where the integration with system RAM in the core VM is limited only
> to the following functions. Not being inside LRU is definitely going to
> reduce the scope of tight integration with system RAM.
> 
> (1) Migration support between system RAM and coherent memory
> (2) Migration support between various coherent memory nodes
> (3) Isolation of the coherent memory
> (4) Mapping the coherent memory into user space through driver's
> struct vm_operations
> (5) HW poisoning of the coherent memory
> 
>   Allocating the entire memory of the coherent device node right
> after hot plug into ZONE_MOVABLE (where the memory is already inside the
> buddy system) will still expose a time window where other user space
> allocations can come into the coherent device memory node and prevent the
> intended isolation. So traditional hot plug is not the solution. Hence
> started looking into CMA based non LRU solution but then hit the following
> roadblocks.
> 
> (1) CMA does not support hot plugging of new memory node
>   a. CMA area needs to be marked during boot before buddy is
>  initialized
>   b. cma_alloc()/cma_release() can happen on the marked area
>   c. Should be able to mark the CMA areas just after memory hot plug
>   d. cma_alloc()/cma_release() can happen later after the hot plug
>   e. This is not currently supported right now
> 
> (2) Mapped non LRU migration of pages
>   a. Recent work from Minchan Kim makes non LRU pages migratable
>   b. But it still does not support migration of mapped non LRU pages
>   c. With non LRU CMA re