On Mon, Nov 21, 2016 at 01:06:45PM +1100, Balbir Singh wrote: > > > On 19/11/16 05:18, Jérôme Glisse wrote: > > To allow use of device un-addressable memory inside a process add a > > special swap type. Also add a new callback to handle page fault on > > such entry. > > > > Signed-off-by: Jérôme Glisse <jgli...@redhat.com> > > Cc: Dan Williams <dan.j.willi...@intel.com> > > Cc: Ross Zwisler <ross.zwis...@linux.intel.com> > > --- > > fs/proc/task_mmu.c | 10 +++++++- > > include/linux/memremap.h | 5 ++++ > > include/linux/swap.h | 18 ++++++++++--- > > include/linux/swapops.h | 67 > > ++++++++++++++++++++++++++++++++++++++++++++++++ > > kernel/memremap.c | 14 ++++++++++ > > mm/Kconfig | 12 +++++++++ > > mm/memory.c | 24 +++++++++++++++++ > > mm/mprotect.c | 12 +++++++++ > > 8 files changed, 158 insertions(+), 4 deletions(-) > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > index 6909582..0726d39 100644 > > --- a/fs/proc/task_mmu.c > > +++ b/fs/proc/task_mmu.c > > @@ -544,8 +544,11 @@ static void smaps_pte_entry(pte_t *pte, unsigned long > > addr, > > } else { > > mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT; > > } > > - } else if (is_migration_entry(swpent)) > > + } else if (is_migration_entry(swpent)) { > > page = migration_entry_to_page(swpent); > > + } else if (is_device_entry(swpent)) { > > + page = device_entry_to_page(swpent); > > + } > > > So the reason there is a device swap entry for a page belonging to a user > process is > that it is in the middle of migration or is it always that a swap entry > represents > unaddressable memory belonging to a GPU device, but its tracked in the page > table > entries of the process.
For page being migrated i use the existing special migration pte entry. This new device special swap entry is only for unaddressable memory belonging to a device (GPU or any else). We need to keep track of those inside the CPU page table. Using a new special swap entry is the easiest way with the minimum amount of change to core mm. [...] > > +#ifdef CONFIG_DEVICE_UNADDRESSABLE > > +static inline swp_entry_t make_device_entry(struct page *page, bool write) > > +{ > > + return swp_entry(write?SWP_DEVICE_WRITE:SWP_DEVICE, page_to_pfn(page)); > > Code style checks I was trying to balance against 79 columns break rule :) [...] > > + } else if (is_device_entry(entry)) { > > + page = device_entry_to_page(entry); > > + > > + get_page(page); > > + rss[mm_counter(page)]++; > > Why does rss count go up? I wanted the device page to be treated like any other page. There is an argument to be made against and for doing that. Do you have strong argument for not doing this ? [...] > > @@ -2536,6 +2557,9 @@ int do_swap_page(struct fault_env *fe, pte_t orig_pte) > > if (unlikely(non_swap_entry(entry))) { > > if (is_migration_entry(entry)) { > > migration_entry_wait(vma->vm_mm, fe->pmd, fe->address); > > + } else if (is_device_entry(entry)) { > > + ret = device_entry_fault(vma, fe->address, entry, > > + fe->flags, fe->pmd); > > What does device_entry_fault() actually do here? Well it is a special fault handler, it must migrate the memory back to some place where the CPU can access it. It only matter for unaddressable memory. > > } else if (is_hwpoison_entry(entry)) { > > ret = VM_FAULT_HWPOISON; > > } else { > > diff --git a/mm/mprotect.c b/mm/mprotect.c > > index 1bc1eb3..70aff3a 100644 > > --- a/mm/mprotect.c > > +++ b/mm/mprotect.c > > @@ -139,6 +139,18 @@ static unsigned long change_pte_range(struct > > vm_area_struct *vma, pmd_t *pmd, > > > > pages++; > > } > > + > > + if (is_write_device_entry(entry)) { > > + pte_t newpte; > > + > > + make_device_entry_read(&entry); > > + newpte = swp_entry_to_pte(entry); > > + if (pte_swp_soft_dirty(oldpte)) > > + newpte = pte_swp_mksoft_dirty(newpte); > > + set_pte_at(mm, addr, pte, newpte); > > + > > + pages++; > > + } > > Does it make sense to call mprotect() on device memory ranges? There is nothing special about vma that containt device memory. They can be private anonymous, share, file back ... So any existing memory syscall must behave as expected. This is really just like any other page except that CPU can not access it. Cheers, Jérôme