Re: [LSF/MM TOPIC] Non standard size THP

2019-02-13 Thread Anshuman Khandual



On 02/13/2019 07:08 PM, Michal Hocko wrote:
> On Wed 13-02-19 18:20:03, Anshuman Khandual wrote:
>> On 02/12/2019 02:03 PM, Kirill A. Shutemov wrote:
>>> Honestly, I'm very skeptical about the idea. It took a lot of time to
>>> stabilize THP for a single page size, equal to the PMD page table size,
>>> but this looks like a new can of worms. :P
>>
>> I understand your concern here, but HW providing more TLB sizes beyond the
>> standard page-table-level (PMD/PUD/PGD) huge pages can help achieve a
>> performance improvement when the buddy allocator is already too fragmented
>> to provide higher-order pages. PUD THP file mapping is already supported
>> for DAX, and PUD THP anon mapping might be supported in the near future (it
>> is not very challenging, other than that allocating a HPAGE_PUD_SIZE huge
>> page at runtime will be difficult). Sizes around PMD, like
>> HPAGE_CONT_PMD_SIZE or HPAGE_CONT_PTE_SIZE, have a better chance of
>> becoming future non-PMD-level anon mappings than PUD-size anon mapping
>> support in THP.
> 
> I do not think our page allocator is really ready to provide >PMD huge
> pages. So even if we deal with all the nasty things wrt locking and page
> table handling, the crux becomes the allocation side. The current
> CMA/contig allocator is anything but useful for THP. It can barely
> handle the hugetlb cases, which are mostly pre-allocation based.

I understand the point for >PMD sizes. Hence we can first narrow the focus
to contiguous-PTE-level huge pages, which are smaller than PMD but could
offer THP benefits on arm64 with the 64K page size config.

> 
> Besides that, is there any real-world usecase driving this, or is it
> merely "this is possible so let's just do it"?

An arm64 kernel with the 64K page size config is mostly unable to use THP,
since at PMD level the huge page is 512MB. But it should be able to benefit
from THP if we have support at the contiguous PTE level of 2MB, which is far
smaller than 512MB.


Re: [LSF/MM TOPIC] Non standard size THP

2019-02-13 Thread Kirill A. Shutemov
On Wed, Feb 13, 2019 at 06:20:03PM +0530, Anshuman Khandual wrote:
> 
> 
> On 02/12/2019 02:03 PM, Kirill A. Shutemov wrote:
> > On Fri, Feb 08, 2019 at 07:43:57AM +0530, Anshuman Khandual wrote:
> >> Hello,
> >>
> >> THP is currently supported for
> >>
> >> - PMD level pages (anon and file)
> >> - PUD level pages (file - DAX file system)
> >>
> >> THP is a single entry mapping at standard page table levels (either PMD
> >> or PUD)
> >>
> >> But architectures like ARM64 support non-standard page-table-level huge
> >> pages with contiguous bits.
> >>
> >> - These are created as multiple entries at either PTE or PMD level
> >> - These multiple entries map pages which are physically contiguous
> >> - A special PTE bit (PTE_CONT) is set on each entry, indicating it is
> >> part of a contiguous range
> >>
> >> These multiple contiguous entries create a huge page size which is
> >> different from the standard PMD/PUD levels, but they provide the
> >> benefits of huge memory: fewer faults, bigger TLB coverage, fewer TLB
> >> misses, etc.
> >>
> >> Currently they are used as HugeTLB pages because
> >>
> >>- The HugeTLB page size is carried in the VMA
> >>- The page table walker can operate on multiple PTE or PMD entries,
> >> given the size in the VMA
> >>- Irrespective of the HugeTLB page size, it is operated on with
> >> set_huge_pte_at() at any level
> >>- set_huge_pte_at() is arch specific and knows how to encode multiple
> >> consecutive entries
> >>
> >> But not as THP huge pages because
> >>
> >>- The THP size is not encoded anywhere (e.g. in the VMA)
> >>- The page table walker expects it to be either at PUD
> >> (HPAGE_PUD_SIZE) or at PMD (HPAGE_PMD_SIZE)
> >>- The page table is operated on directly with set_pmd_at() or
> >> set_pud_at()
> >>- Directly faulted or promoted huge pages are verified with
> >> [pmd|pud]_trans_huge()
> >>
> >> How non-standard huge pages can be supported for THP
> >>
> >>- THP starts recognizing non-standard huge page sizes (exported by
> >> the arch) like HPAGE_CONT_(PMD|PTE)_SIZE
> >>- THP starts operating on either HPAGE_PMD_SIZE, HPAGE_CONT_PMD_SIZE
> >> or HPAGE_CONT_PTE_SIZE
> >>- set_pmd_at() only recognizes HPAGE_PMD_SIZE, hence replace
> >> set_pmd_at() with set_huge_pmd_at()
> >>- set_huge_pmd_at() could differentiate between HPAGE_PMD_SIZE and
> >> HPAGE_CONT_PMD_SIZE
> >>- For HPAGE_CONT_PTE_SIZE, extend the page table walker down to PTE
> >> level
> >>- Use set_huge_pte_at(), which can operate on multiple contiguous PTE
> >> entries
> > 
> > You only listed trivial things. All the tricky stuff is what makes THP
> > transparent.
> 
> Agreed. I was trying to draw an analogy from HugeTLB with respect to page
> table creation and its walking. Huge page collapse and split on such
> non-standard huge pages will involve taking care of many more details.
> 
> > 
> > To consider it seriously we need to understand what it means for
> > split_huge_p?d()/split_huge_page(), and how khugepaged will deal with this.
> 
> Absolutely. Can these operate on non-standard, possibly multi-entry-based
> huge pages? How do we handle atomicity, etc.?

We need to handle split for them to provide transparency.

> > In particular, I worry about exposing (to the user or the CPU) page table
> > state in the middle of a conversion (huge->small or small->huge). Handling
> > this at the page table level provides a level of atomicity that you will
> > not have.
> 
> I understand it might require a software-based lock instead of standard HW
> atomicity constructs, which will make it slow, but is that even possible?

I'm not sure it is possible. I haven't wrapped my head around the idea yet.

> > Honestly, I'm very skeptical about the idea. It took a lot of time to
> > stabilize THP for a single page size, equal to the PMD page table size,
> > but this looks like a new can of worms. :P
> 
> I understand your concern here, but HW providing more TLB sizes beyond the
> standard page-table-level (PMD/PUD/PGD) huge pages can help achieve a
> performance improvement when the buddy allocator is already too fragmented
> to provide higher-order pages. PUD THP file mapping is already supported
> for DAX, and PUD THP anon mapping might be supported in the near future (it
> is not very challenging, other than that allocating a HPAGE_PUD_SIZE huge
> page at runtime will be difficult).

That's a bold claim. I would like to look at the code. :)

Supporting more than one THP page size at the same time brings a lot more
questions, besides the allocation path (although I'm sure compaction will be
happy about this).

For instance, what page size will you allocate for a given fault
address?

How do you deal with pre-allocated page tables? Depositing 513 page tables
for a given PUD THP (one PMD-level table plus 512 PTE-level tables, so the
page can later be split all the way down) might be fun. :P

> Sizes around PMD, like HPAGE_CONT_PMD_SIZE or HPAGE_CONT_PTE_SIZE, have a
> better chance of becoming future non-PMD-level anon mappings than PUD-size
> anon mapping support in THP.
> 
> > 
> > It *might* be possible to support it for DAX, but beyond that...
> >
> 
> Did not get that. Why would you think this is possible or appropriate
> only for a DAX file mapping but not for an anon mapping?

Re: [LSF/MM TOPIC] Non standard size THP

2019-02-13 Thread Michal Hocko
On Wed 13-02-19 18:20:03, Anshuman Khandual wrote:
> On 02/12/2019 02:03 PM, Kirill A. Shutemov wrote:
> > Honestly, I'm very skeptical about the idea. It took a lot of time to
> > stabilize THP for a single page size, equal to the PMD page table size,
> > but this looks like a new can of worms. :P
> 
> I understand your concern here, but HW providing more TLB sizes beyond the
> standard page-table-level (PMD/PUD/PGD) huge pages can help achieve a
> performance improvement when the buddy allocator is already too fragmented
> to provide higher-order pages. PUD THP file mapping is already supported
> for DAX, and PUD THP anon mapping might be supported in the near future (it
> is not very challenging, other than that allocating a HPAGE_PUD_SIZE huge
> page at runtime will be difficult). Sizes around PMD, like
> HPAGE_CONT_PMD_SIZE or HPAGE_CONT_PTE_SIZE, have a better chance of
> becoming future non-PMD-level anon mappings than PUD-size anon mapping
> support in THP.

I do not think our page allocator is really ready to provide >PMD huge
pages. So even if we deal with all the nasty things wrt locking and page
table handling, the crux becomes the allocation side. The current
CMA/contig allocator is anything but useful for THP. It can barely
handle the hugetlb cases, which are mostly pre-allocation based.

Besides that, is there any real-world usecase driving this, or is it
merely "this is possible so let's just do it"?
-- 
Michal Hocko
SUSE Labs


Re: [LSF/MM TOPIC] Non standard size THP

2019-02-13 Thread Kirill A. Shutemov
On Wed, Feb 13, 2019 at 05:06:47AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 12, 2019 at 11:33:31AM +0300, Kirill A. Shutemov wrote:
> > To consider it seriously we need to understand what it means for
> > split_huge_p?d()/split_huge_page(), and how khugepaged will deal with this.
> > 
> > In particular, I worry about exposing (to the user or the CPU) page table
> > state in the middle of a conversion (huge->small or small->huge). Handling
> > this at the page table level provides a level of atomicity that you will
> > not have.
> 
> We could do an RCU-style trick where (eg) for merging 16 consecutive
> entries together, we allocate a new PTE leaf, take the mmap_sem for write,
> copy the page table over, update the new entries, then put the new leaf
> into the PMD level.  Then iterate over the old PTE leaf again, and set
> any dirty bits in the new leaf which were set during the race window.
> 
> Does that cover all the problems?

Probably, but it will kill scalability. Taking mmap_sem for write to
handle a page fault or MADV_DONTNEED will not make anybody happy.

-- 
 Kirill A. Shutemov


Re: [LSF/MM TOPIC] Non standard size THP

2019-02-13 Thread Matthew Wilcox
On Tue, Feb 12, 2019 at 11:33:31AM +0300, Kirill A. Shutemov wrote:
> To consider it seriously we need to understand what it means for
> split_huge_p?d()/split_huge_page(), and how khugepaged will deal with this.
> 
> In particular, I worry about exposing (to the user or the CPU) page table
> state in the middle of a conversion (huge->small or small->huge). Handling
> this at the page table level provides a level of atomicity that you will
> not have.

We could do an RCU-style trick where (eg) for merging 16 consecutive
entries together, we allocate a new PTE leaf, take the mmap_sem for write,
copy the page table over, update the new entries, then put the new leaf
into the PMD level.  Then iterate over the old PTE leaf again, and set
any dirty bits in the new leaf which were set during the race window.

Does that cover all the problems?
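
If it helps the discussion, here is a rough, untested sketch of that
sequence. in_merge_range() is a made-up predicate, pte_mkcont() is arm64's
helper for setting PTE_CONT, and error handling plus the page-table locking
details are elided; treat this as an illustration, not an implementation:

/*
 * Rough sketch only of the copy-and-replace idea above. Assumes
 * [start, end) lies within one PMD-aligned PTE leaf.
 */
static void merge_pte_leaf(struct vm_area_struct *vma, pmd_t *pmd,
			   unsigned long start, unsigned long end)
{
	struct mm_struct *mm = vma->vm_mm;
	pgtable_t newleaf = pte_alloc_one(mm);
	pte_t *old_pte = pte_offset_map(pmd, start & PMD_MASK);
	pte_t *new_pte = page_address(newleaf);
	int i;

	down_write(&mm->mmap_sem);

	/* Copy the old leaf, marking the merged range contiguous. */
	for (i = 0; i < PTRS_PER_PTE; i++)
		new_pte[i] = in_merge_range(i, start, end) ?	/* made up */
			     pte_mkcont(old_pte[i]) : old_pte[i];

	/* Swap the new leaf into the PMD entry and flush. */
	pmd_populate(mm, pmd, newleaf);
	flush_tlb_range(vma, start, end);

	/*
	 * Re-scan the old leaf: carry over dirty bits set by writers
	 * that raced with the copy before the swap.
	 */
	for (i = 0; i < PTRS_PER_PTE; i++)
		if (pte_dirty(old_pte[i]))
			new_pte[i] = pte_mkdirty(new_pte[i]);

	up_write(&mm->mmap_sem);
}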

> Honestly, I'm very skeptical about the idea. It took a lot of time to
> stabilize THP for a single page size, equal to the PMD page table size,
> but this looks like a new can of worms. :P

It's definitely a lot of work, and it has a lot of prerequisites.


Re: [LSF/MM TOPIC] Non standard size THP

2019-02-13 Thread Anshuman Khandual



On 02/12/2019 02:03 PM, Kirill A. Shutemov wrote:
> On Fri, Feb 08, 2019 at 07:43:57AM +0530, Anshuman Khandual wrote:
>> Hello,
>>
>> THP is currently supported for
>>
>> - PMD level pages (anon and file)
>> - PUD level pages (file - DAX file system)
>>
>> THP is a single entry mapping at standard page table levels (either PMD
>> or PUD)
>>
>> But architectures like ARM64 support non-standard page-table-level huge
>> pages with contiguous bits.
>>
>> - These are created as multiple entries at either PTE or PMD level
>> - These multiple entries map pages which are physically contiguous
>> - A special PTE bit (PTE_CONT) is set on each entry, indicating it is
>> part of a contiguous range
>>
>> These multiple contiguous entries create a huge page size which is
>> different from the standard PMD/PUD levels, but they provide the benefits
>> of huge memory: fewer faults, bigger TLB coverage, fewer TLB misses, etc.
>>
>> Currently they are used as HugeTLB pages because
>>
>>  - The HugeTLB page size is carried in the VMA
>>  - The page table walker can operate on multiple PTE or PMD entries,
>> given the size in the VMA
>>  - Irrespective of the HugeTLB page size, it is operated on with
>> set_huge_pte_at() at any level
>>  - set_huge_pte_at() is arch specific and knows how to encode multiple
>> consecutive entries
>>
>> But not as THP huge pages because
>>
>>  - The THP size is not encoded anywhere (e.g. in the VMA)
>>  - The page table walker expects it to be either at PUD (HPAGE_PUD_SIZE)
>> or at PMD (HPAGE_PMD_SIZE)
>>  - The page table is operated on directly with set_pmd_at() or
>> set_pud_at()
>>  - Directly faulted or promoted huge pages are verified with
>> [pmd|pud]_trans_huge()
>>
>> How non-standard huge pages can be supported for THP
>>
>>  - THP starts recognizing non-standard huge page sizes (exported by the
>> arch) like HPAGE_CONT_(PMD|PTE)_SIZE
>>  - THP starts operating on either HPAGE_PMD_SIZE, HPAGE_CONT_PMD_SIZE or
>> HPAGE_CONT_PTE_SIZE
>>  - set_pmd_at() only recognizes HPAGE_PMD_SIZE, hence replace
>> set_pmd_at() with set_huge_pmd_at()
>>  - set_huge_pmd_at() could differentiate between HPAGE_PMD_SIZE and
>> HPAGE_CONT_PMD_SIZE
>>  - For HPAGE_CONT_PTE_SIZE, extend the page table walker down to PTE
>> level
>>  - Use set_huge_pte_at(), which can operate on multiple contiguous PTE
>> entries
> 
> You only listed trivial things. All the tricky stuff is what makes THP
> transparent.

Agreed. I was trying to draw an analogy from HugeTLB with respect to page
table creation and its walking. Huge page collapse and split on such
non-standard huge pages will involve taking care of many more details.

> 
> To consider it seriously we need to understand what it means for
> split_huge_p?d()/split_huge_page(), and how khugepaged will deal with this.

Absolutely. Can these operate on non-standard, possibly multi-entry-based
huge pages? How do we handle atomicity, etc.?

> 
> In particular, I worry about exposing (to the user or the CPU) page table
> state in the middle of a conversion (huge->small or small->huge). Handling
> this at the page table level provides a level of atomicity that you will
> not have.

I understand it might require a software-based lock instead of standard HW
atomicity constructs, which will make it slow, but is that even possible?

> 
> Honestly, I'm very skeptical about the idea. It took a lot of time to
> stabilize THP for a single page size, equal to the PMD page table size,
> but this looks like a new can of worms. :P

I understand your concern here, but HW providing more TLB sizes beyond the
standard page-table-level (PMD/PUD/PGD) huge pages can help achieve a
performance improvement when the buddy allocator is already too fragmented
to provide higher-order pages. PUD THP file mapping is already supported for
DAX, and PUD THP anon mapping might be supported in the near future (it is
not very challenging, other than that allocating a HPAGE_PUD_SIZE huge page
at runtime will be difficult). Sizes around PMD, like HPAGE_CONT_PMD_SIZE or
HPAGE_CONT_PTE_SIZE, have a better chance of becoming future non-PMD-level
anon mappings than PUD-size anon mapping support in THP.

> 
> It *might* be possible to support it for DAX, but beyond that...
>

Did not get that. Why would you think this is possible or appropriate
only for a DAX file mapping but not for an anon mapping?


Re: [LSF/MM TOPIC] Non standard size THP

2019-02-12 Thread Kirill A. Shutemov
On Fri, Feb 08, 2019 at 07:43:57AM +0530, Anshuman Khandual wrote:
> Hello,
> 
> THP is currently supported for
> 
> - PMD level pages (anon and file)
> - PUD level pages (file - DAX file system)
> 
> THP is a single entry mapping at standard page table levels (either PMD or
> PUD)
> 
> But architectures like ARM64 support non-standard page-table-level huge
> pages with contiguous bits.
> 
> - These are created as multiple entries at either PTE or PMD level
> - These multiple entries map pages which are physically contiguous
> - A special PTE bit (PTE_CONT) is set on each entry, indicating it is part
> of a contiguous range
> 
> These multiple contiguous entries create a huge page size which is
> different from the standard PMD/PUD levels, but they provide the benefits
> of huge memory: fewer faults, bigger TLB coverage, fewer TLB misses, etc.
> 
> Currently they are used as HugeTLB pages because
> 
>   - The HugeTLB page size is carried in the VMA
>   - The page table walker can operate on multiple PTE or PMD entries,
> given the size in the VMA
>   - Irrespective of the HugeTLB page size, it is operated on with
> set_huge_pte_at() at any level
>   - set_huge_pte_at() is arch specific and knows how to encode multiple
> consecutive entries
> 
> But not as THP huge pages because
> 
>   - The THP size is not encoded anywhere (e.g. in the VMA)
>   - The page table walker expects it to be either at PUD (HPAGE_PUD_SIZE)
> or at PMD (HPAGE_PMD_SIZE)
>   - The page table is operated on directly with set_pmd_at() or
> set_pud_at()
>   - Directly faulted or promoted huge pages are verified with
> [pmd|pud]_trans_huge()
> 
> How non-standard huge pages can be supported for THP
> 
>   - THP starts recognizing non-standard huge page sizes (exported by the
> arch) like HPAGE_CONT_(PMD|PTE)_SIZE
>   - THP starts operating on either HPAGE_PMD_SIZE, HPAGE_CONT_PMD_SIZE or
> HPAGE_CONT_PTE_SIZE
>   - set_pmd_at() only recognizes HPAGE_PMD_SIZE, hence replace
> set_pmd_at() with set_huge_pmd_at()
>   - set_huge_pmd_at() could differentiate between HPAGE_PMD_SIZE and
> HPAGE_CONT_PMD_SIZE
>   - For HPAGE_CONT_PTE_SIZE, extend the page table walker down to PTE
> level
>   - Use set_huge_pte_at(), which can operate on multiple contiguous PTE
> entries

You only listed trivial things. All the tricky stuff is what makes THP
transparent.

To consider it seriously we need to understand what it means for
split_huge_p?d()/split_huge_page(), and how khugepaged will deal with this.

In particular, I worry about exposing (to the user or the CPU) page table
state in the middle of a conversion (huge->small or small->huge). Handling
this at the page table level provides a level of atomicity that you will
not have.

Honestly, I'm very skeptical about the idea. It took a lot of time to
stabilize THP for a single page size, equal to the PMD page table size, but
this looks like a new can of worms. :P

It *might* be possible to support it for DAX, but beyond that...

-- 
 Kirill A. Shutemov


Re: [LSF/MM TOPIC] Non standard size THP

2019-02-07 Thread Anshuman Khandual



On 02/08/2019 09:54 AM, Matthew Wilcox wrote:
> On Fri, Feb 08, 2019 at 07:43:57AM +0530, Anshuman Khandual wrote:
>> How non-standard huge pages can be supported for THP
>>
>>  - THP starts recognizing non-standard huge page sizes (exported by the
>> arch) like HPAGE_CONT_(PMD|PTE)_SIZE
>>  - THP starts operating on either HPAGE_PMD_SIZE, HPAGE_CONT_PMD_SIZE or
>> HPAGE_CONT_PTE_SIZE
>>  - set_pmd_at() only recognizes HPAGE_PMD_SIZE, hence replace
>> set_pmd_at() with set_huge_pmd_at()
>>  - set_huge_pmd_at() could differentiate between HPAGE_PMD_SIZE and
>> HPAGE_CONT_PMD_SIZE
>>  - For HPAGE_CONT_PTE_SIZE, extend the page table walker down to PTE
>> level
>>  - Use set_huge_pte_at(), which can operate on multiple contiguous PTE
>> entries
> 
> I think your proposed solution reflects thinking like a hardware person
> rather than like a software person.  Or maybe like an MM person rather
> than a FS person.  I see the same problem with Kirill's solutions ;-)

You might be right on this :) I was trying to derive a solution based on
all existing semantics with limited code addition rather than inventing
something completely different.

> 
> Perhaps you don't realise that using larger pages when appropriate
> would also benefit filesystems as well as CPUs.  You didn't include
> linux-fsdevel on this submission, so that's a plausible explanation.

Yes that was an omission. Thanks for adding linux-fsdevel to the thread.

> 
> The XArray currently supports arbitrary power-of-two-naturally-aligned
> page sizes, and conveniently so does the page allocator [1].  The problem
> is that various bits of the MM have a very fixed mindset that pages are
> PTE, PMD or PUD in size.

I agree. But in general it works because an allocated page of the required
order does correspond to one of these levels in the page table.

> 
> We should enhance routines like vmf_insert_page() to handle
> arbitrary sized pages rather than having separate vmf_insert_pfn()
> and vmf_insert_pfn_pmd().  We probably need to enhance the set_pxx_at()
> API to pass in an order, rather than explicitly naming pte/pmd/pud/...

I agree. set_huge_pte_at() actually does that to some extent on ARM64.
But that's just for HugeTLB.

> 
> First, though, we need to actually get arbitrary sized pages handled
> correctly in the page cache.  So if anyone's interested in talking about
> this, but hasn't been reviewing or commenting on the patches I've been
> sending to make this happen, I'm going to seriously question their actual
> commitment to wanting this to happen, rather than wanting a nice holiday
> in Puerto Rico.
> 
> Sorry to be so blunt about this, but I've only had review from Kirill,
> which makes me think that nobody else actually cares about getting
> this fixed.

To be honest I have not been following your work in this regard. I started
looking into this problem late last year, and my goal has been focused more
on a THP solution for huge pages sized at intermediate page table levels.

But I agree with your point that there should be a wider solution which can
make generic MM deal with page sizes of any order rather than only the page
table level ones like PTE/PMD/PUD.

> 
> [1] Support for arbitrary sized and aligned entries is in progress for
> the XArray, but I don't think there's any appetite for changing the buddy
> allocator to let us allocate "pages" that are an arbitrary extent in size.
> 
> 


Re: [LSF/MM TOPIC] Non standard size THP

2019-02-07 Thread Matthew Wilcox
On Fri, Feb 08, 2019 at 07:43:57AM +0530, Anshuman Khandual wrote:
> How non-standard huge pages can be supported for THP
> 
>   - THP starts recognizing non-standard huge page sizes (exported by the
> arch) like HPAGE_CONT_(PMD|PTE)_SIZE
>   - THP starts operating on either HPAGE_PMD_SIZE, HPAGE_CONT_PMD_SIZE or
> HPAGE_CONT_PTE_SIZE
>   - set_pmd_at() only recognizes HPAGE_PMD_SIZE, hence replace
> set_pmd_at() with set_huge_pmd_at()
>   - set_huge_pmd_at() could differentiate between HPAGE_PMD_SIZE and
> HPAGE_CONT_PMD_SIZE
>   - For HPAGE_CONT_PTE_SIZE, extend the page table walker down to PTE
> level
>   - Use set_huge_pte_at(), which can operate on multiple contiguous PTE
> entries

I think your proposed solution reflects thinking like a hardware person
rather than like a software person.  Or maybe like an MM person rather
than a FS person.  I see the same problem with Kirill's solutions ;-)

Perhaps you don't realise that using larger pages when appropriate
would also benefit filesystems as well as CPUs.  You didn't include
linux-fsdevel on this submission, so that's a plausible explanation.

The XArray currently supports arbitrary power-of-two-naturally-aligned
page sizes, and conveniently so does the page allocator [1].  The problem
is that various bits of the MM have a very fixed mindset that pages are
PTE, PMD or PUD in size.

We should enhance routines like vmf_insert_page() to handle
arbitrary sized pages rather than having separate vmf_insert_pfn()
and vmf_insert_pfn_pmd().  We probably need to enhance the set_pxx_at()
API to pass in an order, rather than explicitly naming pte/pmd/pud/...
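
To illustrate the suggestion, one hypothetical shape for such an
order-taking interface. The set_p?d_at() calls are the existing per-level
setters, but everything else (the function name, the PUD_ORDER/PMD_ORDER
dispatch, pte_pud()/pte_pmd() and arch_set_cont_ptes()) is an illustrative
placeholder, not existing API:

/* Hypothetical order-based setter; the dispatch is illustrative only. */
void set_page_entry_at(struct mm_struct *mm, unsigned long addr,
		       void *entryp, pte_t pte, unsigned int order)
{
	if (order >= PUD_ORDER)		/* one PUD-level entry */
		set_pud_at(mm, addr, (pud_t *)entryp, pte_pud(pte));
	else if (order >= PMD_ORDER)	/* one PMD-level entry */
		set_pmd_at(mm, addr, (pmd_t *)entryp, pte_pmd(pte));
	else if (order > 0)		/* arch encodes a contiguous hint */
		arch_set_cont_ptes(mm, addr, (pte_t *)entryp, pte, order);
	else
		set_pte_at(mm, addr, (pte_t *)entryp, pte);
}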

First, though, we need to actually get arbitrary sized pages handled
correctly in the page cache.  So if anyone's interested in talking about
this, but hasn't been reviewing or commenting on the patches I've been
sending to make this happen, I'm going to seriously question their actual
commitment to wanting this to happen, rather than wanting a nice holiday
in Puerto Rico.

Sorry to be so blunt about this, but I've only had review from Kirill,
which makes me think that nobody else actually cares about getting
this fixed.

[1] Support for arbitrary sized and aligned entries is in progress for
the XArray, but I don't think there's any appetite for changing the buddy
allocator to let us allocate "pages" that are an arbitrary extent in size.



[LSF/MM TOPIC] Non standard size THP

2019-02-07 Thread Anshuman Khandual
Hello,

THP is currently supported for

- PMD level pages (anon and file)
- PUD level pages (file - DAX file system)

THP is a single entry mapping at standard page table levels (either PMD or PUD)

But architectures like ARM64 support non-standard page-table-level huge pages
with contiguous bits.

- These are created as multiple entries at either PTE or PMD level
- These multiple entries map pages which are physically contiguous
- A special PTE bit (PTE_CONT) is set on each entry, indicating it is part of
a contiguous range

These multiple contiguous entries create a huge page size which is different
from the standard PMD/PUD levels, but they provide the benefits of huge
memory: fewer faults, bigger TLB coverage, fewer TLB misses, etc.

Currently they are used as HugeTLB pages because

- The HugeTLB page size is carried in the VMA
- The page table walker can operate on multiple PTE or PMD entries, given
the size in the VMA
- Irrespective of the HugeTLB page size, it is operated on with
set_huge_pte_at() at any level
- set_huge_pte_at() is arch specific and knows how to encode multiple
consecutive entries

But not as THP huge pages because

- The THP size is not encoded anywhere (e.g. in the VMA)
- The page table walker expects it to be either at PUD (HPAGE_PUD_SIZE) or
at PMD (HPAGE_PMD_SIZE)
- The page table is operated on directly with set_pmd_at() or set_pud_at()
- Directly faulted or promoted huge pages are verified with
[pmd|pud]_trans_huge()

How non-standard huge pages can be supported for THP

- THP starts recognizing non-standard huge page sizes (exported by the arch)
like HPAGE_CONT_(PMD|PTE)_SIZE
- THP starts operating on either HPAGE_PMD_SIZE, HPAGE_CONT_PMD_SIZE or
HPAGE_CONT_PTE_SIZE
- set_pmd_at() only recognizes HPAGE_PMD_SIZE, hence replace set_pmd_at()
with set_huge_pmd_at()
- set_huge_pmd_at() could differentiate between HPAGE_PMD_SIZE and
HPAGE_CONT_PMD_SIZE
- For HPAGE_CONT_PTE_SIZE, extend the page table walker down to PTE level
- Use set_huge_pte_at(), which can operate on multiple contiguous PTE
entries (a rough sketch follows below)
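
As a minimal illustration of that last point, here is a sketch of a
contiguous-PTE setter, modeled loosely on what arm64's HugeTLB code already
does. CONT_PTES, pte_mkcont() and pte_pgprot() exist on arm64, but this
simplified loop omits the break-before-make sequence the architecture
requires, so treat it as a sketch rather than an implementation:

/* Sketch only: write CONT_PTES entries, each carrying PTE_CONT. */
static void set_cont_pte_at(struct mm_struct *mm, unsigned long addr,
			    pte_t *ptep, pte_t pte)
{
	unsigned long pfn = pte_pfn(pte);
	pgprot_t prot = pte_pgprot(pte_mkcont(pte));	/* add PTE_CONT */
	int i;

	/* Each entry maps the next physical page in the contiguous run. */
	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE, pfn++)
		set_pte_at(mm, addr, ptep, pfn_pte(pfn, prot));
}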

Kirill Shutemov proposed re-working the page table traversal during last
year's LSFMM. A recursive page table walk carrying just the level information
would allow us to introduce artificial or non-standard page table levels for
contiguous bit huge page support (a toy sketch follows the link below).

https://lwn.net/Articles/753267/
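
To make the idea concrete, here is a toy sketch of such a walk, where the
level is an explicit parameter and a contiguous run of entries can simply be
reported as a leaf at an artificial level. All helpers here (level_size(),
level_index(), entry_is_leaf(), entry_to_table()) are hypothetical
placeholders, not existing kernel API:

struct walk_ops {
	void (*leaf)(u64 entry, int level, unsigned long addr);
};

/*
 * Toy sketch: every level is handled by the same function, so an
 * "artificial" contiguous level only changes what entry_is_leaf()
 * reports, not the walker itself. Assumes addr is stride aligned.
 */
static void walk_level(u64 *table, int level, unsigned long addr,
		       unsigned long end, const struct walk_ops *ops)
{
	unsigned long stride = level_size(level);	/* hypothetical */

	while (addr < end) {
		u64 entry = table[level_index(addr, level)];	/* hypothetical */

		if (entry_is_leaf(entry, level))	/* page, block or cont run */
			ops->leaf(entry, level, addr);
		else
			walk_level(entry_to_table(entry), level - 1, addr,
				   min(end, addr + stride), ops);
		addr += stride;
	}
}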

Here is the matrix of contiguous PTE and PMD sizes for the various base page
size configurations on ARM64. Promoting or faulting pages at contiguous PTE
level is much more likely than at PMD level, where pages are more difficult
to allocate at run time.

        CONT PTE    PMD     CONT PMD
        -----------------------------
4K:     64K         2M      32M
16K:    2M          32M     1G
64K:    2M          512M    16G
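
Each entry above is just the number of entries that share a contiguous hint
(or fill one page table) times the size mapped by a single entry. For
example, with 4K base pages a PMD maps 512 x 4K = 2M, a run of 16 contiguous
PTEs maps 16 x 4K = 64K, and a run of 16 contiguous PMDs maps 16 x 2M = 32M;
with 64K base pages a PMD maps 8192 x 64K = 512M and 32 contiguous PTEs map
32 x 64K = 2M.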

Having support for a contiguous-PTE-sized THP will help many workloads
utilize THP benefits. I understand there are many more fine-grained details
which need to be sorted out and difficulties to be overcome, but it is worth
starting a discussion on this front which can really benefit workloads.

- Anshuman