Re: [PATCH] mm: hugepage: mark splitted page dirty when needed

2018-09-09 Thread Peter Xu
On Fri, Sep 07, 2018 at 01:54:35PM -0400, Jerome Glisse wrote:
> On Fri, Sep 07, 2018 at 12:35:24PM +0800, Peter Xu wrote:
> > On Thu, Sep 06, 2018 at 05:08:42PM +0300, Kirill A. Shutemov wrote:
> > > On Thu, Sep 06, 2018 at 07:39:33PM +0800, Peter Xu wrote:
> > > > On Wed, Sep 05, 2018 at 03:55:22PM +0300, Kirill A. Shutemov wrote:
> > > > > On Wed, Sep 05, 2018 at 03:30:37PM +0800, Peter Xu wrote:
> > > > > > On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> > > > > > > On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> > > > > > > 
> > > > > > > > On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> > > > > > > >> When splitting a huge page, we should set all small pages as dirty if
> > > > > > > >> the original huge page has the dirty bit set before.  Otherwise we'll
> > > > > > > >> lose the original dirty bit.
> > > > > > > >
> > > > > > > > We don't lose it. It got transfered to struct page flag:
> > > > > > > >
> > > > > > > > if (pmd_dirty(old_pmd))
> > > > > > > > SetPageDirty(page);
> > > > > > > >
> > > > > > > 
> > > > > > > Plus, when split_huge_page_to_list() splits a THP, its subroutine
> > > > > > > __split_huge_page() propagates the dirty bit in the head page flag
> > > > > > > to all subpages in __split_huge_page_tail().
> > > > > > 
> > > > > > Hi, Kirill, Zi,
> > > > > > 
> > > > > > Thanks for your responses!
> > > > > > 
> > > > > > Though in my test the huge page seems to be split not by
> > > > > > split_huge_page_to_list() but by explicit calls to
> > > > > > change_protection().  The stack looks like this (again, this is a
> > > > > > customized kernel, and I added an explicit dump_stack() there):
> > > > > > 
> > > > > >   kernel:  dump_stack+0x5c/0x7b
> > > > > >   kernel:  __split_huge_pmd+0x192/0xdc0
> > > > > >   kernel:  ? update_load_avg+0x8b/0x550
> > > > > >   kernel:  ? update_load_avg+0x8b/0x550
> > > > > >   kernel:  ? account_entity_enqueue+0xc5/0xf0
> > > > > >   kernel:  ? enqueue_entity+0x112/0x650
> > > > > >   kernel:  change_protection+0x3a2/0xab0
> > > > > >   kernel:  mwriteprotect_range+0xdd/0x110
> > > > > >   kernel:  userfaultfd_ioctl+0x50b/0x1210
> > > > > >   kernel:  ? do_futex+0x2cf/0xb20
> > > > > >   kernel:  ? tty_write+0x1d2/0x2f0
> > > > > >   kernel:  ? do_vfs_ioctl+0x9f/0x610
> > > > > >   kernel:  do_vfs_ioctl+0x9f/0x610
> > > > > >   kernel:  ? __x64_sys_futex+0x88/0x180
> > > > > >   kernel:  ksys_ioctl+0x70/0x80
> > > > > >   kernel:  __x64_sys_ioctl+0x16/0x20
> > > > > >   kernel:  do_syscall_64+0x55/0x150
> > > > > >   kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > > 
> > > > > > At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> > > > > > to kernel space, which is handled by mwriteprotect_range().  In case
> > > > > > you'd like to refer to the kernel, it's basically this one from
> > > > > > Andrea's (with very trivial changes):
> > > > > > 
> > > > > >   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
> > > > > > 
> > > > > > So... do we have two paths to split the huge pages separately?
> > > > > 
> > > > > We have two entities that can be split: page table entries and the
> > > > > underlying compound page.
> > > > > 
> > > > > split_huge_pmd() (and variants of it) splits the PMD entry into a PTE
> > > > > page table. It doesn't touch the underlying compound page. The page can
> > > > > still be mapped elsewhere as huge.
> > > > > 
> > > > > split_huge_page() (and variants of it) splits the compound page into a
> > > > > number of 4k pages (or whatever PAGE_SIZE is). The operation requires
> > > > > splitting all PMDs, but not the other way around.
> > > > > 
> > > > > > 
> > > > > > Another (possibly very naive) question is: could any of you hint me
> > > > > > how the page dirty bit is finally applied to the PTEs?  These two
> > > > > > dirty flags confused me for a few days already (the SetPageDirty() one
> > > > > > which sets the page dirty flag, and the pte_mkdirty() which sets that
> > > > > > onto the real PTEs).
> > > > > 
> > > > > The dirty bit from page table entries is transferred to the struct page
> > > > > flag and used for decision making in the reclaim path.
> > > > 
> > > > Thanks for explaining.  It's much clearer for me.
> > > > 
> > > > Though for the issue I have encountered, I am still confused on why
> > > > that dirty bit can be ignored for the split PTEs.  Indeed we have:
> > > > 
> > > > if (pmd_dirty(old_pmd))
> > > > SetPageDirty(page);
> > > > 
> > > > However to me this only transfers (as you explained above) the dirty
> > > > bit (AFAIU it's possibly set by the hardware when the page is written)
> > > > to the page struct of the compound page.  It did not really apply to
> > > > every small page of the split huge page.  As you also explained,
> > > > 

Re: [PATCH] mm: hugepage: mark splitted page dirty when needed

2018-09-07 Thread Jerome Glisse
On Fri, Sep 07, 2018 at 12:35:24PM +0800, Peter Xu wrote:
> On Thu, Sep 06, 2018 at 05:08:42PM +0300, Kirill A. Shutemov wrote:
> > On Thu, Sep 06, 2018 at 07:39:33PM +0800, Peter Xu wrote:
> > > On Wed, Sep 05, 2018 at 03:55:22PM +0300, Kirill A. Shutemov wrote:
> > > > On Wed, Sep 05, 2018 at 03:30:37PM +0800, Peter Xu wrote:
> > > > > On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> > > > > > On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> > > > > > 
> > > > > > > On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> > > > > > >> When splitting a huge page, we should set all small pages as dirty if
> > > > > > >> the original huge page has the dirty bit set before.  Otherwise we'll
> > > > > > >> lose the original dirty bit.
> > > > > > >
> > > > > > > We don't lose it. It got transfered to struct page flag:
> > > > > > >
> > > > > > >   if (pmd_dirty(old_pmd))
> > > > > > >   SetPageDirty(page);
> > > > > > >
> > > > > > 
> > > > > > Plus, when split_huge_page_to_list() splits a THP, its subroutine
> > > > > > __split_huge_page() propagates the dirty bit in the head page flag
> > > > > > to all subpages in __split_huge_page_tail().
> > > > > 
> > > > > Hi, Kirill, Zi,
> > > > > 
> > > > > Thanks for your responses!
> > > > > 
> > > > > Though in my test the huge page seems to be split not by
> > > > > split_huge_page_to_list() but by explicit calls to
> > > > > change_protection().  The stack looks like this (again, this is a
> > > > > customized kernel, and I added an explicit dump_stack() there):
> > > > > 
> > > > >   kernel:  dump_stack+0x5c/0x7b
> > > > >   kernel:  __split_huge_pmd+0x192/0xdc0
> > > > >   kernel:  ? update_load_avg+0x8b/0x550
> > > > >   kernel:  ? update_load_avg+0x8b/0x550
> > > > >   kernel:  ? account_entity_enqueue+0xc5/0xf0
> > > > >   kernel:  ? enqueue_entity+0x112/0x650
> > > > >   kernel:  change_protection+0x3a2/0xab0
> > > > >   kernel:  mwriteprotect_range+0xdd/0x110
> > > > >   kernel:  userfaultfd_ioctl+0x50b/0x1210
> > > > >   kernel:  ? do_futex+0x2cf/0xb20
> > > > >   kernel:  ? tty_write+0x1d2/0x2f0
> > > > >   kernel:  ? do_vfs_ioctl+0x9f/0x610
> > > > >   kernel:  do_vfs_ioctl+0x9f/0x610
> > > > >   kernel:  ? __x64_sys_futex+0x88/0x180
> > > > >   kernel:  ksys_ioctl+0x70/0x80
> > > > >   kernel:  __x64_sys_ioctl+0x16/0x20
> > > > >   kernel:  do_syscall_64+0x55/0x150
> > > > >   kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > 
> > > > > At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> > > > > to kernel space, which is handled by mwriteprotect_range().  In case
> > > > > you'd like to refer to the kernel, it's basically this one from
> > > > > Andrea's (with very trivial changes):
> > > > > 
> > > > >   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
> > > > > 
> > > > > So... do we have two paths to split the huge pages separately?
> > > > 
> > > > We have two entities that can be split: page table entries and the
> > > > underlying compound page.
> > > > 
> > > > split_huge_pmd() (and variants of it) splits the PMD entry into a PTE
> > > > page table. It doesn't touch the underlying compound page. The page can
> > > > still be mapped elsewhere as huge.
> > > > 
> > > > split_huge_page() (and variants of it) splits the compound page into a
> > > > number of 4k pages (or whatever PAGE_SIZE is). The operation requires
> > > > splitting all PMDs, but not the other way around.
> > > > 
> > > > > 
> > > > > Another (possibly very naive) question is: could any of you hint me
> > > > > how the page dirty bit is finally applied to the PTEs?  These two
> > > > > dirty flags confused me for a few days already (the SetPageDirty() one
> > > > > which sets the page dirty flag, and the pte_mkdirty() which sets that
> > > > > onto the real PTEs).
> > > > 
> > > > The dirty bit from page table entries is transferred to the struct page
> > > > flag and used for decision making in the reclaim path.
> > > 
> > > Thanks for explaining.  It's much clearer for me.
> > > 
> > > Though for the issue I have encountered, I am still confused on why
> > > that dirty bit can be ignored for the split PTEs.  Indeed we have:
> > > 
> > >   if (pmd_dirty(old_pmd))
> > >   SetPageDirty(page);
> > > 
> > > However to me this only transfers (as you explained above) the dirty
> > > bit (AFAIU it's possibly set by the hardware when the page is written)
> > > to the page struct of the compound page.  It did not really apply to
> > > every small page of the split huge page.  As you also explained,
> > > this __split_huge_pmd() only splits the PMD entry but it keeps the
> > > compound huge page there, then IMHO it should also apply the dirty
> > > bits from the huge page to all the small page entries, no?
> > 
> > The bit on the compound page represents all small subpages. PageDirty() on any
> > subpage will return true if the compound page is 

Re: [PATCH] mm: hugepage: mark splitted page dirty when needed

2018-09-06 Thread Peter Xu
On Thu, Sep 06, 2018 at 05:08:42PM +0300, Kirill A. Shutemov wrote:
> On Thu, Sep 06, 2018 at 07:39:33PM +0800, Peter Xu wrote:
> > On Wed, Sep 05, 2018 at 03:55:22PM +0300, Kirill A. Shutemov wrote:
> > > On Wed, Sep 05, 2018 at 03:30:37PM +0800, Peter Xu wrote:
> > > > On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> > > > > On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> > > > > 
> > > > > > On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> > > > > >> When splitting a huge page, we should set all small pages as dirty if
> > > > > >> the original huge page has the dirty bit set before.  Otherwise we'll
> > > > > >> lose the original dirty bit.
> > > > > >
> > > > > > We don't lose it. It got transfered to struct page flag:
> > > > > >
> > > > > > if (pmd_dirty(old_pmd))
> > > > > > SetPageDirty(page);
> > > > > >
> > > > > 
> > > > > Plus, when split_huge_page_to_list() splits a THP, its subroutine
> > > > > __split_huge_page() propagates the dirty bit in the head page flag
> > > > > to all subpages in __split_huge_page_tail().
> > > > 
> > > > Hi, Kirill, Zi,
> > > > 
> > > > Thanks for your responses!
> > > > 
> > > > Though in my test the huge page seems to be split not by
> > > > split_huge_page_to_list() but by explicit calls to
> > > > change_protection().  The stack looks like this (again, this is a
> > > > customized kernel, and I added an explicit dump_stack() there):
> > > > 
> > > >   kernel:  dump_stack+0x5c/0x7b
> > > >   kernel:  __split_huge_pmd+0x192/0xdc0
> > > >   kernel:  ? update_load_avg+0x8b/0x550
> > > >   kernel:  ? update_load_avg+0x8b/0x550
> > > >   kernel:  ? account_entity_enqueue+0xc5/0xf0
> > > >   kernel:  ? enqueue_entity+0x112/0x650
> > > >   kernel:  change_protection+0x3a2/0xab0
> > > >   kernel:  mwriteprotect_range+0xdd/0x110
> > > >   kernel:  userfaultfd_ioctl+0x50b/0x1210
> > > >   kernel:  ? do_futex+0x2cf/0xb20
> > > >   kernel:  ? tty_write+0x1d2/0x2f0
> > > >   kernel:  ? do_vfs_ioctl+0x9f/0x610
> > > >   kernel:  do_vfs_ioctl+0x9f/0x610
> > > >   kernel:  ? __x64_sys_futex+0x88/0x180
> > > >   kernel:  ksys_ioctl+0x70/0x80
> > > >   kernel:  __x64_sys_ioctl+0x16/0x20
> > > >   kernel:  do_syscall_64+0x55/0x150
> > > >   kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > 
> > > > At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> > > > to kernel space, which is handled by mwriteprotect_range().  In case
> > > > you'd like to refer to the kernel, it's basically this one from
> > > > Andrea's (with very trivial changes):
> > > > 
> > > >   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
> > > > 
> > > > So... do we have two paths to split the huge pages separately?
> > > 
> > > We have two entities that can be split: page table entries and the
> > > underlying compound page.
> > > 
> > > split_huge_pmd() (and variants of it) splits the PMD entry into a PTE page
> > > table. It doesn't touch the underlying compound page. The page can still
> > > be mapped elsewhere as huge.
> > > 
> > > split_huge_page() (and variants of it) splits the compound page into a
> > > number of 4k pages (or whatever PAGE_SIZE is). The operation requires
> > > splitting all PMDs, but not the other way around.
> > > 
> > > > 
> > > > Another (possibly very naive) question is: could any of you hint me
> > > > how the page dirty bit is finally applied to the PTEs?  These two
> > > > dirty flags confused me for a few days already (the SetPageDirty() one
> > > > which sets the page dirty flag, and the pte_mkdirty() which sets that
> > > > onto the real PTEs).
> > > 
> > > The dirty bit from page table entries is transferred to the struct page
> > > flag and used for decision making in the reclaim path.
> > 
> > Thanks for explaining.  It's much clearer for me.
> > 
> > Though for the issue I have encountered, I am still confused on why
> > that dirty bit can be ignored for the split PTEs.  Indeed we have:
> > 
> > if (pmd_dirty(old_pmd))
> > SetPageDirty(page);
> > 
> > However to me this only transfers (as you explained above) the dirty
> > bit (AFAIU it's possibly set by the hardware when the page is written)
> > to the page struct of the compound page.  It did not really apply to
> > every small page of the split huge page.  As you also explained,
> > this __split_huge_pmd() only splits the PMD entry but it keeps the
> > compound huge page there, then IMHO it should also apply the dirty
> > bits from the huge page to all the small page entries, no?
> 
> The bit on the compound page represents all small subpages. PageDirty() on any
> subpage will return true if the compound page is dirty.

Ah I didn't notice this before (since PageDirty is defined with
PF_HEAD), thanks for pointing out.

> 
> > These dirty bits are really important to my scenario since AFAIU the
> > change_protection() call is using these dirty bits to decide whether
> > it should append the 

Re: [PATCH] mm: hugepage: mark splitted page dirty when needed

2018-09-06 Thread Jerome Glisse
On Thu, Sep 06, 2018 at 07:39:33PM +0800, Peter Xu wrote:
> On Wed, Sep 05, 2018 at 03:55:22PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Sep 05, 2018 at 03:30:37PM +0800, Peter Xu wrote:
> > > On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> > > > On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> > > > 
> > > > > On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> > > > >> When splitting a huge page, we should set all small pages as dirty if
> > > > >> the original huge page has the dirty bit set before.  Otherwise we'll
> > > > >> lose the original dirty bit.
> > > > >
> > > > > We don't lose it. It got transfered to struct page flag:
> > > > >
> > > > >   if (pmd_dirty(old_pmd))
> > > > >   SetPageDirty(page);
> > > > >
> > > > 
> > > > Plus, when split_huge_page_to_list() splits a THP, its subroutine
> > > > __split_huge_page() propagates the dirty bit in the head page flag
> > > > to all subpages in __split_huge_page_tail().
> > > 
> > > Hi, Kirill, Zi,
> > > 
> > > Thanks for your responses!
> > > 
> > > Though in my test the huge page seems to be split not by
> > > split_huge_page_to_list() but by explicit calls to
> > > change_protection().  The stack looks like this (again, this is a
> > > customized kernel, and I added an explicit dump_stack() there):
> > > 
> > >   kernel:  dump_stack+0x5c/0x7b
> > >   kernel:  __split_huge_pmd+0x192/0xdc0
> > >   kernel:  ? update_load_avg+0x8b/0x550
> > >   kernel:  ? update_load_avg+0x8b/0x550
> > >   kernel:  ? account_entity_enqueue+0xc5/0xf0
> > >   kernel:  ? enqueue_entity+0x112/0x650
> > >   kernel:  change_protection+0x3a2/0xab0
> > >   kernel:  mwriteprotect_range+0xdd/0x110
> > >   kernel:  userfaultfd_ioctl+0x50b/0x1210
> > >   kernel:  ? do_futex+0x2cf/0xb20
> > >   kernel:  ? tty_write+0x1d2/0x2f0
> > >   kernel:  ? do_vfs_ioctl+0x9f/0x610
> > >   kernel:  do_vfs_ioctl+0x9f/0x610
> > >   kernel:  ? __x64_sys_futex+0x88/0x180
> > >   kernel:  ksys_ioctl+0x70/0x80
> > >   kernel:  __x64_sys_ioctl+0x16/0x20
> > >   kernel:  do_syscall_64+0x55/0x150
> > >   kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > 
> > > At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> > > to kernel space, which is handled by mwriteprotect_range().  In case
> > > you'd like to refer to the kernel, it's basically this one from
> > > Andrea's (with very trivial changes):
> > > 
> > >   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
> > > 
> > > So... do we have two paths to split the huge pages separately?
> > 
> > We have two entities that can be split: page table entries and the
> > underlying compound page.
> > 
> > split_huge_pmd() (and variants of it) splits the PMD entry into a PTE page
> > table. It doesn't touch the underlying compound page. The page can still be
> > mapped elsewhere as huge.
> > 
> > split_huge_page() (and variants of it) splits the compound page into a
> > number of 4k pages (or whatever PAGE_SIZE is). That operation requires
> > splitting all the PMDs, but not the other way around.
> > 
> > > 
> > > Another (possibly very naive) question is: could any of you hint me
> > > how the page dirty bit is finally applied to the PTEs?  These two
> > > dirty flags confused me for a few days already (the SetPageDirty() one
> > > which sets the page dirty flag, and the pte_mkdirty() which sets that
> > > onto the real PTEs).
> > 
> > The dirty bit from page table entries is transferred to the struct page
> > flag and used for decision making in the reclaim path.
> 
> Thanks for explaining.  It's much clearer to me.
> 
> Though for the issue I have encountered, I am still confused about why
> that dirty bit can be ignored for the split PTEs.  Indeed we have:
> 
>   if (pmd_dirty(old_pmd))
>   SetPageDirty(page);
> 
> However to me this only transfers (as you explained above) the dirty
> bit (AFAIU it's possibly set by the hardware when the page is written)
> to the page struct of the compound page.  It did not really apply to
> every small page of the split huge page.  As you also explained,
> this __split_huge_pmd() only splits the PMD entry but it keeps the
> compound huge page there, then IMHO it should also apply the dirty
> bits from the huge page to all the small page entries, no?
> 
> These dirty bits are really important to my scenario since AFAIU the
> change_protection() call is using these dirty bits to decide whether
> it should append the WRITE bit - it finally corresponds to the lines
> in change_pte_range():
> 
>         /* Avoid taking write faults for known dirty pages */
>         if (dirty_accountable && pte_dirty(ptent) &&
>                         (pte_soft_dirty(ptent) ||
>                          !(vma->vm_flags & VM_SOFTDIRTY))) {
>                 ptent = pte_mkwrite(ptent);
>         }
> 
> So when mprotect() with that range (my case is UFFDIO_WRITEPROTECT,
> which is similar) although we pass in the new protection with VM_WRITE
> here it'll still 


Re: [PATCH] mm: hugepage: mark splitted page dirty when needed

2018-09-06 Thread Kirill A. Shutemov
On Thu, Sep 06, 2018 at 07:39:33PM +0800, Peter Xu wrote:
> On Wed, Sep 05, 2018 at 03:55:22PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Sep 05, 2018 at 03:30:37PM +0800, Peter Xu wrote:
> > > On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> > > > On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> > > > 
> > > > > On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> > > > >> When splitting a huge page, we should set all small pages as dirty if
> > > > >> the original huge page has the dirty bit set before.  Otherwise we'll
> > > > >> lose the original dirty bit.
> > > > >
> > > > > We don't lose it. It got transferred to the struct page flag:
> > > > >
> > > > >   if (pmd_dirty(old_pmd))
> > > > >   SetPageDirty(page);
> > > > >
> > > > 
> > > > Plus, when split_huge_page_to_list() splits a THP, its subroutine 
> > > > __split_huge_page()
> > > > propagates the dirty bit in the head page flag to all subpages in 
> > > > __split_huge_page_tail().
> > > 
> > > Hi, Kirill, Zi,
> > > 
> > > Thanks for your responses!
> > > 
> > > Though in my test the huge page seems to be split not by
> > > split_huge_page_to_list() but by explicit calls to
> > > change_protection().  The stack looks like this (again, this is a
> > > customized kernel, and I added an explicit dump_stack() there):
> > > 
> > >   kernel:  dump_stack+0x5c/0x7b
> > >   kernel:  __split_huge_pmd+0x192/0xdc0
> > >   kernel:  ? update_load_avg+0x8b/0x550
> > >   kernel:  ? update_load_avg+0x8b/0x550
> > >   kernel:  ? account_entity_enqueue+0xc5/0xf0
> > >   kernel:  ? enqueue_entity+0x112/0x650
> > >   kernel:  change_protection+0x3a2/0xab0
> > >   kernel:  mwriteprotect_range+0xdd/0x110
> > >   kernel:  userfaultfd_ioctl+0x50b/0x1210
> > >   kernel:  ? do_futex+0x2cf/0xb20
> > >   kernel:  ? tty_write+0x1d2/0x2f0
> > >   kernel:  ? do_vfs_ioctl+0x9f/0x610
> > >   kernel:  do_vfs_ioctl+0x9f/0x610
> > >   kernel:  ? __x64_sys_futex+0x88/0x180
> > >   kernel:  ksys_ioctl+0x70/0x80
> > >   kernel:  __x64_sys_ioctl+0x16/0x20
> > >   kernel:  do_syscall_64+0x55/0x150
> > >   kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > 
> > > At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> > > to kernel space, which is handled by mwriteprotect_range().  In case
> > > you'd like to refer to the kernel, it's basically this one from
> > > Andrea's (with very trivial changes):
> > > 
> > >   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
> > > 
> > > So... do we have two paths to split the huge pages separately?
> > 
> > We have two entities that can be split: page table entries and the
> > underlying compound page.
> > 
> > split_huge_pmd() (and variants of it) splits the PMD entry into a PTE page
> > table. It doesn't touch the underlying compound page. The page can still be
> > mapped elsewhere as huge.
> > 
> > split_huge_page() (and variants of it) splits the compound page into a
> > number of 4k pages (or whatever PAGE_SIZE is). That operation requires
> > splitting all the PMDs, but not the other way around.
> > 
> > > 
> > > Another (possibly very naive) question is: could any of you hint me
> > > how the page dirty bit is finally applied to the PTEs?  These two
> > > dirty flags confused me for a few days already (the SetPageDirty() one
> > > which sets the page dirty flag, and the pte_mkdirty() which sets that
> > > onto the real PTEs).
> > 
> > The dirty bit from page table entries is transferred to the struct page
> > flag and used for decision making in the reclaim path.
> 
> Thanks for explaining.  It's much clearer to me.
> 
> Though for the issue I have encountered, I am still confused about why
> that dirty bit can be ignored for the split PTEs.  Indeed we have:
> 
>   if (pmd_dirty(old_pmd))
>   SetPageDirty(page);
> 
> However to me this only transfers (as you explained above) the dirty
> bit (AFAIU it's possibly set by the hardware when the page is written)
> to the page struct of the compound page.  It did not really apply to
> every small page of the split huge page.  As you also explained,
> this __split_huge_pmd() only splits the PMD entry but it keeps the
> compound huge page there, then IMHO it should also apply the dirty
> bits from the huge page to all the small page entries, no?

The bit on the compound page represents all small subpages. PageDirty() on
any subpage will return true if the compound page is dirty.

> These dirty bits are really important to my scenario since AFAIU the
> change_protection() call is using these dirty bits to decide whether
> it should append the WRITE bit - it finally corresponds to the lines
> in change_pte_range():
> 
>         /* Avoid taking write faults for known dirty pages */
>         if (dirty_accountable && pte_dirty(ptent) &&
>                         (pte_soft_dirty(ptent) ||
>                          !(vma->vm_flags & VM_SOFTDIRTY))) {
>                 ptent = pte_mkwrite(ptent);
>         }
> 
> So when mprotect() with 


Re: [PATCH] mm: hugepage: mark splitted page dirty when needed

2018-09-06 Thread Peter Xu
On Wed, Sep 05, 2018 at 08:49:20AM -0400, Zi Yan wrote:
> On 5 Sep 2018, at 3:30, Peter Xu wrote:
> 
> > On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> >> On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> >>
> >>> On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
>  When splitting a huge page, we should set all small pages as dirty if
>  the original huge page has the dirty bit set before.  Otherwise we'll
>  lose the original dirty bit.
> >>>
> >>> We don't lose it. It got transferred to the struct page flag:
> >>>
> >>>   if (pmd_dirty(old_pmd))
> >>>   SetPageDirty(page);
> >>>
> >>
> >> Plus, when split_huge_page_to_list() splits a THP, its subroutine 
> >> __split_huge_page()
> >> propagates the dirty bit in the head page flag to all subpages in 
> >> __split_huge_page_tail().
> >
> > Hi, Kirill, Zi,
> >
> > Thanks for your responses!
> >
> > Though in my test the huge page seems to be split not by
> > split_huge_page_to_list() but by explicit calls to
> > change_protection().  The stack looks like this (again, this is a
> > customized kernel, and I added an explicit dump_stack() there):
> >
> >   kernel:  dump_stack+0x5c/0x7b
> >   kernel:  __split_huge_pmd+0x192/0xdc0
> >   kernel:  ? update_load_avg+0x8b/0x550
> >   kernel:  ? update_load_avg+0x8b/0x550
> >   kernel:  ? account_entity_enqueue+0xc5/0xf0
> >   kernel:  ? enqueue_entity+0x112/0x650
> >   kernel:  change_protection+0x3a2/0xab0
> >   kernel:  mwriteprotect_range+0xdd/0x110
> >   kernel:  userfaultfd_ioctl+0x50b/0x1210
> >   kernel:  ? do_futex+0x2cf/0xb20
> >   kernel:  ? tty_write+0x1d2/0x2f0
> >   kernel:  ? do_vfs_ioctl+0x9f/0x610
> >   kernel:  do_vfs_ioctl+0x9f/0x610
> >   kernel:  ? __x64_sys_futex+0x88/0x180
> >   kernel:  ksys_ioctl+0x70/0x80
> >   kernel:  __x64_sys_ioctl+0x16/0x20
> >   kernel:  do_syscall_64+0x55/0x150
> >   kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> > to kernel space, which is handled by mwriteprotect_range().  In case
> > you'd like to refer to the kernel, it's basically this one from
> > Andrea's (with very trivial changes):
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
> >
> > So... do we have two paths to split the huge pages separately?
> >
> > Another (possibly very naive) question is: could any of you hint me
> > how the page dirty bit is finally applied to the PTEs?  These two
> > dirty flags confused me for a few days already (the SetPageDirty() one
> > which sets the page dirty flag, and the pte_mkdirty() which sets that
> > onto the real PTEs).
> 
> change_protection() only causes splitting a PMD entry into multiple PTEs
> but not the physical compound page, so my answer does not apply to your case.
> It is unclear how the dirty bit makes your QEMU get a SIGBUS. I think you
> need to describe your problem with more details.

Hi, Zi,

I explained with some more details on my problem in my other reply to
Kirill.  Please have a look.

> 
> AFAIK, the PageDirty bit will not apply back to any PTEs. So for your case,
> when reporting a page’s dirty bit information, some function in the kernel
> only checks the PTE’s dirty bit but not the dirty bit in the struct page
> flags, which might provide a wrong answer.

Are you suggesting that we should always check both places (the PTE
dirty bit and also the page flag) to know whether a page is dirty
(hence, either bit being set should mean the page is dirty)?

Thanks,

-- 
Peter Xu




Re: [PATCH] mm: hugepage: mark splitted page dirty when needed

2018-09-06 Thread Peter Xu
On Wed, Sep 05, 2018 at 03:55:22PM +0300, Kirill A. Shutemov wrote:
> On Wed, Sep 05, 2018 at 03:30:37PM +0800, Peter Xu wrote:
> > On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> > > On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> > > 
> > > > On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> > > >> When splitting a huge page, we should set all small pages as dirty if
> > > >> the original huge page has the dirty bit set before.  Otherwise we'll
> > > >> lose the original dirty bit.
> > > >
> > > > We don't lose it. It got transferred to the struct page flag:
> > > >
> > > > if (pmd_dirty(old_pmd))
> > > > SetPageDirty(page);
> > > >
> > > 
> > > Plus, when split_huge_page_to_list() splits a THP, its subroutine 
> > > __split_huge_page()
> > > propagates the dirty bit in the head page flag to all subpages in 
> > > __split_huge_page_tail().
> > 
> > Hi, Kirill, Zi,
> > 
> > Thanks for your responses!
> > 
> > Though in my test the huge page seems to be split not by
> > split_huge_page_to_list() but by explicit calls to
> > change_protection().  The stack looks like this (again, this is a
> > customized kernel, and I added an explicit dump_stack() there):
> > 
> >   kernel:  dump_stack+0x5c/0x7b
> >   kernel:  __split_huge_pmd+0x192/0xdc0
> >   kernel:  ? update_load_avg+0x8b/0x550
> >   kernel:  ? update_load_avg+0x8b/0x550
> >   kernel:  ? account_entity_enqueue+0xc5/0xf0
> >   kernel:  ? enqueue_entity+0x112/0x650
> >   kernel:  change_protection+0x3a2/0xab0
> >   kernel:  mwriteprotect_range+0xdd/0x110
> >   kernel:  userfaultfd_ioctl+0x50b/0x1210
> >   kernel:  ? do_futex+0x2cf/0xb20
> >   kernel:  ? tty_write+0x1d2/0x2f0
> >   kernel:  ? do_vfs_ioctl+0x9f/0x610
> >   kernel:  do_vfs_ioctl+0x9f/0x610
> >   kernel:  ? __x64_sys_futex+0x88/0x180
> >   kernel:  ksys_ioctl+0x70/0x80
> >   kernel:  __x64_sys_ioctl+0x16/0x20
> >   kernel:  do_syscall_64+0x55/0x150
> >   kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> > to kernel space, which is handled by mwriteprotect_range().  In case
> > you'd like to refer to the kernel, it's basically this one from
> > Andrea's (with very trivial changes):
> > 
> >   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
> > 
> > So... do we have two paths to split the huge pages separately?
> 
> We have two entities that can be split: page table entries and the
> underlying compound page.
> 
> split_huge_pmd() (and variants of it) splits the PMD entry into a PTE page
> table. It doesn't touch the underlying compound page. The page can still be
> mapped elsewhere as huge.
> 
> split_huge_page() (and variants of it) splits the compound page into a
> number of 4k pages (or whatever PAGE_SIZE is). That operation requires
> splitting all the PMDs, but not the other way around.
> 
> > 
> > Another (possibly very naive) question is: could any of you hint me
> > how the page dirty bit is finally applied to the PTEs?  These two
> > dirty flags confused me for a few days already (the SetPageDirty() one
> > which sets the page dirty flag, and the pte_mkdirty() which sets that
> > onto the real PTEs).
> 
> The dirty bit from page table entries is transferred to the struct page
> flag and used for decision making in the reclaim path.

Thanks for explaining.  It's much clearer to me.

Though for the issue I have encountered, I am still confused about why
that dirty bit can be ignored for the split PTEs.  Indeed we have:

if (pmd_dirty(old_pmd))
SetPageDirty(page);

However to me this only transfers (as you explained above) the dirty
bit (AFAIU it's possibly set by the hardware when the page is written)
to the page struct of the compound page.  It did not really apply to
every small page of the split huge page.  As you also explained,
this __split_huge_pmd() only splits the PMD entry but it keeps the
compound huge page there, then IMHO it should also apply the dirty
bits from the huge page to all the small page entries, no?

These dirty bits are really important to my scenario since AFAIU the
change_protection() call is using these dirty bits to decide whether
it should append the WRITE bit - it finally corresponds to the lines
in change_pte_range():

        /* Avoid taking write faults for known dirty pages */
        if (dirty_accountable && pte_dirty(ptent) &&
                        (pte_soft_dirty(ptent) ||
                         !(vma->vm_flags & VM_SOFTDIRTY))) {
                ptent = pte_mkwrite(ptent);
        }

So when mprotect() with that range (my case is UFFDIO_WRITEPROTECT,
which is similar) although we pass in the new protection with VM_WRITE
here it'll still mask it since the dirty bit is not set, then the
userspace program (in my case, the QEMU thread that handles write
protect failures) can never fix up the write-protected page fault.

Am I missing anything important here?

Regards,

-- 
Peter Xu


Re: [PATCH] mm: hugepage: mark splitted page dirty when needed

2018-09-06 Thread Peter Xu
On Wed, Sep 05, 2018 at 03:55:22PM +0300, Kirill A. Shutemov wrote:
> On Wed, Sep 05, 2018 at 03:30:37PM +0800, Peter Xu wrote:
> > On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> > > On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> > > 
> > > > On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> > > >> When splitting a huge page, we should set all small pages as dirty if
> > > >> the original huge page has the dirty bit set before.  Otherwise we'll
> > > >> lose the original dirty bit.
> > > >
> > > > We don't lose it. It got transfered to struct page flag:
> > > >
> > > > if (pmd_dirty(old_pmd))
> > > > SetPageDirty(page);
> > > >
> > > 
> > > Plus, when split_huge_page_to_list() splits a THP, its subroutine 
> > > __split_huge_page()
> > > propagates the dirty bit in the head page flag to all subpages in 
> > > __split_huge_page_tail().
> > 
> > Hi, Kirill, Zi,
> > 
> > Thanks for your responses!
> > 
> > Though in my test the huge page seems to be splitted not by
> > split_huge_page_to_list() but by explicit calls to
> > change_protection().  The stack looks like this (again, this is a
> > customized kernel, and I added an explicit dump_stack() there):
> > 
> >   kernel:  dump_stack+0x5c/0x7b
> >   kernel:  __split_huge_pmd+0x192/0xdc0
> >   kernel:  ? update_load_avg+0x8b/0x550
> >   kernel:  ? update_load_avg+0x8b/0x550
> >   kernel:  ? account_entity_enqueue+0xc5/0xf0
> >   kernel:  ? enqueue_entity+0x112/0x650
> >   kernel:  change_protection+0x3a2/0xab0
> >   kernel:  mwriteprotect_range+0xdd/0x110
> >   kernel:  userfaultfd_ioctl+0x50b/0x1210
> >   kernel:  ? do_futex+0x2cf/0xb20
> >   kernel:  ? tty_write+0x1d2/0x2f0
> >   kernel:  ? do_vfs_ioctl+0x9f/0x610
> >   kernel:  do_vfs_ioctl+0x9f/0x610
> >   kernel:  ? __x64_sys_futex+0x88/0x180
> >   kernel:  ksys_ioctl+0x70/0x80
> >   kernel:  __x64_sys_ioctl+0x16/0x20
> >   kernel:  do_syscall_64+0x55/0x150
> >   kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> > to kernel space, which is handled by mwriteprotect_range().  In case
> > you'd like to refer to the kernel, it's basically this one from
> > Andrea's (with very trivial changes):
> > 
> >   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
> > 
> > So... do we have two paths to split the huge pages separately?
> 
> We have two entities that can be split: page table entries and the underlying
> compound page.
> 
> split_huge_pmd() (and its variants) splits the PMD entry into a PTE page
> table. It doesn't touch the underlying compound page. The page can still be
> mapped elsewhere as huge.
> 
> split_huge_page() (and its variants) splits the compound page into a number
> of 4k pages (or whatever PAGE_SIZE is). The operation requires splitting all
> PMDs, but not the other way around.
> 
> > 
> > Another (possibly very naive) question is: could any of you hint me
> > how the page dirty bit is finally applied to the PTEs?  These two
> > dirty flags confused me for a few days already (the SetPageDirty() one
> > which sets the page dirty flag, and the pte_mkdirty() which sets that
> > onto the real PTEs).
> 
> The dirty bit from page table entries is transferred to the struct page flag
> and used for decision making in the reclaim path.

Thanks for explaining.  It's much clearer to me now.

Though for the issue I have encountered, I am still confused about why
that dirty bit can be ignored for the split PTEs.  Indeed we have:

if (pmd_dirty(old_pmd))
	SetPageDirty(page);

However, to me this only transfers (as you explained above) the dirty
bit (AFAIU it's possibly set by the hardware when the page is written)
to the page struct of the compound page.  It does not really apply it to
every small page of the split huge page.  As you also explained,
__split_huge_pmd() only splits the PMD entry but keeps the
compound huge page there, so IMHO it should also apply the dirty
bit from the huge page to all the small page entries, no?

These dirty bits are really important to my scenario since AFAIU the
change_protection() call is using them to decide whether
it should restore the write bit - it corresponds to these lines
in change_pte_range():

/* Avoid taking write faults for known dirty pages */
if (dirty_accountable && pte_dirty(ptent) &&
		(pte_soft_dirty(ptent) ||
		 !(vma->vm_flags & VM_SOFTDIRTY))) {
	ptent = pte_mkwrite(ptent);
}

So when mprotect() is called on that range (my case is UFFDIO_WRITEPROTECT,
which is similar), although we pass in the new protection with VM_WRITE
set, it will still be masked out since the dirty bit is not set, so the
userspace program (in my case, the QEMU thread that handles write-protect
faults) can never fix up the write-protected page fault.

Am I missing anything important here?

Regards,

-- 
Peter Xu


Re: [PATCH] mm: hugepage: mark splitted page dirty when needed

2018-09-05 Thread Kirill A. Shutemov
On Wed, Sep 05, 2018 at 03:30:37PM +0800, Peter Xu wrote:
> On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> > On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> > 
> > > On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> > >> When splitting a huge page, we should set all small pages as dirty if
> > >> the original huge page has the dirty bit set before.  Otherwise we'll
> > >> lose the original dirty bit.
> > >
> > > We don't lose it. It got transfered to struct page flag:
> > >
> > >   if (pmd_dirty(old_pmd))
> > >   SetPageDirty(page);
> > >
> > 
> > Plus, when split_huge_page_to_list() splits a THP, its subroutine 
> > __split_huge_page()
> > propagates the dirty bit in the head page flag to all subpages in 
> > __split_huge_page_tail().
> 
> Hi, Kirill, Zi,
> 
> Thanks for your responses!
> 
> Though in my test the huge page seems to be splitted not by
> split_huge_page_to_list() but by explicit calls to
> change_protection().  The stack looks like this (again, this is a
> customized kernel, and I added an explicit dump_stack() there):
> 
>   kernel:  dump_stack+0x5c/0x7b
>   kernel:  __split_huge_pmd+0x192/0xdc0
>   kernel:  ? update_load_avg+0x8b/0x550
>   kernel:  ? update_load_avg+0x8b/0x550
>   kernel:  ? account_entity_enqueue+0xc5/0xf0
>   kernel:  ? enqueue_entity+0x112/0x650
>   kernel:  change_protection+0x3a2/0xab0
>   kernel:  mwriteprotect_range+0xdd/0x110
>   kernel:  userfaultfd_ioctl+0x50b/0x1210
>   kernel:  ? do_futex+0x2cf/0xb20
>   kernel:  ? tty_write+0x1d2/0x2f0
>   kernel:  ? do_vfs_ioctl+0x9f/0x610
>   kernel:  do_vfs_ioctl+0x9f/0x610
>   kernel:  ? __x64_sys_futex+0x88/0x180
>   kernel:  ksys_ioctl+0x70/0x80
>   kernel:  __x64_sys_ioctl+0x16/0x20
>   kernel:  do_syscall_64+0x55/0x150
>   kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> to kernel space, which is handled by mwriteprotect_range().  In case
> you'd like to refer to the kernel, it's basically this one from
> Andrea's (with very trivial changes):
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
> 
> So... do we have two paths to split the huge pages separately?

We have two entities that can be split: page table entries and the underlying
compound page.

split_huge_pmd() (and its variants) splits the PMD entry into a PTE page
table. It doesn't touch the underlying compound page. The page can still be
mapped elsewhere as huge.

split_huge_page() (and its variants) splits the compound page into a number
of 4k pages (or whatever PAGE_SIZE is). The operation requires splitting all
PMDs, but not the other way around.

> 
> Another (possibly very naive) question is: could any of you hint me
> how the page dirty bit is finally applied to the PTEs?  These two
> dirty flags confused me for a few days already (the SetPageDirty() one
> which sets the page dirty flag, and the pte_mkdirty() which sets that
> onto the real PTEs).

The dirty bit from page table entries is transferred to the struct page flag
and used for decision making in the reclaim path.

-- 
 Kirill A. Shutemov


Re: [PATCH] mm: hugepage: mark splitted page dirty when needed

2018-09-05 Thread Zi Yan
On 5 Sep 2018, at 3:30, Peter Xu wrote:

> On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
>> On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
>>
>>> On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
 When splitting a huge page, we should set all small pages as dirty if
 the original huge page has the dirty bit set before.  Otherwise we'll
 lose the original dirty bit.
>>>
>>> We don't lose it. It got transfered to struct page flag:
>>>
>>> if (pmd_dirty(old_pmd))
>>> SetPageDirty(page);
>>>
>>
>> Plus, when split_huge_page_to_list() splits a THP, its subroutine 
>> __split_huge_page()
>> propagates the dirty bit in the head page flag to all subpages in 
>> __split_huge_page_tail().
>
> Hi, Kirill, Zi,
>
> Thanks for your responses!
>
> Though in my test the huge page seems to be splitted not by
> split_huge_page_to_list() but by explicit calls to
> change_protection().  The stack looks like this (again, this is a
> customized kernel, and I added an explicit dump_stack() there):
>
>   kernel:  dump_stack+0x5c/0x7b
>   kernel:  __split_huge_pmd+0x192/0xdc0
>   kernel:  ? update_load_avg+0x8b/0x550
>   kernel:  ? update_load_avg+0x8b/0x550
>   kernel:  ? account_entity_enqueue+0xc5/0xf0
>   kernel:  ? enqueue_entity+0x112/0x650
>   kernel:  change_protection+0x3a2/0xab0
>   kernel:  mwriteprotect_range+0xdd/0x110
>   kernel:  userfaultfd_ioctl+0x50b/0x1210
>   kernel:  ? do_futex+0x2cf/0xb20
>   kernel:  ? tty_write+0x1d2/0x2f0
>   kernel:  ? do_vfs_ioctl+0x9f/0x610
>   kernel:  do_vfs_ioctl+0x9f/0x610
>   kernel:  ? __x64_sys_futex+0x88/0x180
>   kernel:  ksys_ioctl+0x70/0x80
>   kernel:  __x64_sys_ioctl+0x16/0x20
>   kernel:  do_syscall_64+0x55/0x150
>   kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
> to kernel space, which is handled by mwriteprotect_range().  In case
> you'd like to refer to the kernel, it's basically this one from
> Andrea's (with very trivial changes):
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
>
> So... do we have two paths to split the huge pages separately?
>
> Another (possibly very naive) question is: could any of you hint me
> how the page dirty bit is finally applied to the PTEs?  These two
> dirty flags confused me for a few days already (the SetPageDirty() one
> which sets the page dirty flag, and the pte_mkdirty() which sets that
> onto the real PTEs).

change_protection() only causes splitting a PMD entry into multiple PTEs,
not the physical compound page, so my answer does not apply to your case.
It is unclear how the dirty bit makes your QEMU get a SIGBUS. I think you
need to describe your problem in more detail.

AFAIK, the PageDirty bit will not be applied back to any PTEs. So in your case,
when reporting a page's dirty bit information, some function in the kernel only
checks the PTE's dirty bit and not the dirty bit in the struct page flags, which
might give a wrong answer.


—
Best Regards,
Yan Zi




Re: [PATCH] mm: hugepage: mark splitted page dirty when needed

2018-09-05 Thread Peter Xu
On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> 
> > On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> >> When splitting a huge page, we should set all small pages as dirty if
> >> the original huge page has the dirty bit set before.  Otherwise we'll
> >> lose the original dirty bit.
> >
> > We don't lose it. It got transfered to struct page flag:
> >
> > if (pmd_dirty(old_pmd))
> > SetPageDirty(page);
> >
> 
> Plus, when split_huge_page_to_list() splits a THP, its subroutine 
> __split_huge_page()
> propagates the dirty bit in the head page flag to all subpages in 
> __split_huge_page_tail().

Hi, Kirill, Zi,

Thanks for your responses!

Though in my test the huge page seems to be split not by
split_huge_page_to_list() but by explicit calls to
change_protection().  The stack looks like this (again, this is a
customized kernel, and I added an explicit dump_stack() there):

  kernel:  dump_stack+0x5c/0x7b
  kernel:  __split_huge_pmd+0x192/0xdc0
  kernel:  ? update_load_avg+0x8b/0x550
  kernel:  ? update_load_avg+0x8b/0x550
  kernel:  ? account_entity_enqueue+0xc5/0xf0
  kernel:  ? enqueue_entity+0x112/0x650
  kernel:  change_protection+0x3a2/0xab0
  kernel:  mwriteprotect_range+0xdd/0x110
  kernel:  userfaultfd_ioctl+0x50b/0x1210
  kernel:  ? do_futex+0x2cf/0xb20
  kernel:  ? tty_write+0x1d2/0x2f0
  kernel:  ? do_vfs_ioctl+0x9f/0x610
  kernel:  do_vfs_ioctl+0x9f/0x610
  kernel:  ? __x64_sys_futex+0x88/0x180
  kernel:  ksys_ioctl+0x70/0x80
  kernel:  __x64_sys_ioctl+0x16/0x20
  kernel:  do_syscall_64+0x55/0x150
  kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9

At the very time the userspace is sending an UFFDIO_WRITEPROTECT ioctl
to kernel space, which is handled by mwriteprotect_range().  In case
you'd like to refer to the kernel, it's basically this one from
Andrea's (with very trivial changes):

  https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault

So... do we have two paths to split the huge pages separately?

Another (possibly very naive) question is: could any of you hint me
how the page dirty bit is finally applied to the PTEs?  These two
dirty flags confused me for a few days already (the SetPageDirty() one
which sets the page dirty flag, and the pte_mkdirty() which sets that
onto the real PTEs).

Regards,

-- 
Peter Xu


Re: [PATCH] mm: hugepage: mark splitted page dirty when needed

2018-09-04 Thread Zi Yan
On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:

> On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
>> When splitting a huge page, we should set all small pages as dirty if
>> the original huge page has the dirty bit set before.  Otherwise we'll
>> lose the original dirty bit.
>
> We don't lose it. It got transfered to struct page flag:
>
>   if (pmd_dirty(old_pmd))
>   SetPageDirty(page);
>

Plus, when split_huge_page_to_list() splits a THP, its subroutine 
__split_huge_page()
propagates the dirty bit in the head page flag to all subpages in 
__split_huge_page_tail().

--
Best Regards
Yan Zi




Re: [PATCH] mm: hugepage: mark splitted page dirty when needed

2018-09-04 Thread Kirill A. Shutemov
On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> When splitting a huge page, we should set all small pages as dirty if
> the original huge page has the dirty bit set before.  Otherwise we'll
> lose the original dirty bit.

We don't lose it. It got transferred to the struct page flag:

if (pmd_dirty(old_pmd))
	SetPageDirty(page);

-- 
 Kirill A. Shutemov

