Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-05-19 Thread Joao Martins


On 5/18/21 8:56 PM, Jane Chu wrote:
> On 5/18/2021 10:27 AM, Joao Martins wrote:
> 
>> On 5/5/21 11:36 PM, Joao Martins wrote:
>>> On 5/5/21 11:20 PM, Dan Williams wrote:
 On Wed, May 5, 2021 at 12:50 PM Joao Martins  
 wrote:
> On 5/5/21 7:44 PM, Dan Williams wrote:
>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins  
>> wrote:
>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>> index b46f63dcaed3..bb28d82dda5e 100644
>>> --- a/include/linux/memremap.h
>>> +++ b/include/linux/memremap.h
>>> @@ -114,6 +114,7 @@ struct dev_pagemap {
>>>  struct completion done;
>>>  enum memory_type type;
>>>  unsigned int flags;
>>> +   unsigned long align;
>> I think this wants some kernel-doc above to indicate that non-zero
>> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
>> means "use non-compound base pages".
>> [...]
>>
>> The non-zero value must be
>> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
>> Hmm, maybe it should be an
>> enum:
>>
>> enum devmap_geometry {
>>  DEVMAP_PTE,
>>  DEVMAP_PMD,
>>  DEVMAP_PUD,
>> }
>>
> I suppose a converter between devmap_geometry and page_size would be 
> needed too? And maybe
> the whole dax/nvdimm align values change meanwhile (as a followup 
> improvement)?
 I think it is ok for dax/nvdimm to continue to maintain their align
 value because it should be ok to have 4MB align if the device really
 wanted. However, when it goes to map that alignment with
 memremap_pages() it can pick a mode. For example, it's already the
 case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
 they're already separate concepts that can stay separate.

>>> Gotcha.
>> I am reconsidering part of the above. In general, yes, the meaning of devmap 
>> @align
>> represents a slightly different variation of the device @align i.e. how the 
>> metadata is
>> laid out **but** regardless of what kind of page table entries we use 
>> vmemmap.
>>
>> By using DEVMAP_PTE/PMD/PUD we might end up 1) duplicating what nvdimm/dax 
>> already
>> validates in terms of allowed device @align values (i.e. PAGE_SIZE, PMD_SIZE 
>> and PUD_SIZE)
>> 2) the geometry of metadata is very much tied to the value we pick to @align 
>> at namespace
>> provisioning -- not the "align" we might use at mmap() perhaps that's what 
>> you referred
>> above? -- and 3) the value of geometry actually derives from dax device 
>> @align because we
>> will need to create compound pages representing a page size of @align value.
>>
>> Using your example above: you're saying that dax->align == 1G is mapped with 
>> DEVMAP_PTEs,
>> in reality the vmemmap is populated with PMDs/PUDs page tables (depending on 
>> what archs
>> decide to do at vmemmap_populate()) and uses base pages as its metadata 
>> regardless of what
>> device @align. In reality what we want to convey in @geometry is not page 
>> table sizes, but
>> just the page size used for the vmemmap of the dax device. Additionally, 
>> limiting its
>> value might not be desirable... if tomorrow Linux for some arch supports 
>> dax/nvdimm
>> devices with 4M align or 64K align, the value of @geometry will have to 
>> reflect the 4M to
>> create compound pages of order 10 for the said vmemmap.
>>
>> I am going to wait until you finish reviewing the remaining four patches of 
>> this series,
>> but maybe this is a simple misnomer (s/align/geometry/) with a comment but 
>> without
>> DEVMAP_{PTE,PMD,PUD} enum part? Or perhaps its own struct with a value and 
>> enum a
>> setter/getter to audit its value? Thoughts?
> 
> Good points there.
> 
> My understanding is that  dax->align  conveys granularity of size while 
> carving out a namespace it's a geometry attribute loosely akin to sector size 
> of a spindle 
> disk.  I tend to think that device pagesize  has almost no relation to 
> "align" in that, it's 
> possible to have 1G "align" and 4K pagesize, or verse versa.  That is, with 
> the advent of compound page 
> support, it is possible to totally separate the two concepts.
> 
> How about adding a new option to "ndctl create-namespace" that describes 
> device creator's desired pagesize, and another parameter to describe whether 
> the pagesize shall 
> be fixed or allowed to be split up, such that, if the intention is to never 
> split up 2M pagesize, then it 
> would be possible to save a lot metadata space on the device?

Maybe that can be selected by the driver too, but it's an interesting point you 
raise
should we settle with the geometry (e.g. like a geometry sysfs entry IIUC your
suggestion?). device-dax for example would use geometry == align and therefore 
save space
(like what I propose in patch 10). But fsdax would retain the default that is 
geometry =
PAGE_SIZE and align = PMD_SIZE should it want to split pages.

Interestingly, devmap poisoni

Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-05-18 Thread Jane Chu

On 5/18/2021 10:27 AM, Joao Martins wrote:


On 5/5/21 11:36 PM, Joao Martins wrote:

On 5/5/21 11:20 PM, Dan Williams wrote:

On Wed, May 5, 2021 at 12:50 PM Joao Martins  wrote:

On 5/5/21 7:44 PM, Dan Williams wrote:

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins  wrote:

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index b46f63dcaed3..bb28d82dda5e 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -114,6 +114,7 @@ struct dev_pagemap {
 struct completion done;
 enum memory_type type;
 unsigned int flags;
+   unsigned long align;

I think this wants some kernel-doc above to indicate that non-zero
means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
means "use non-compound base pages".

[...]


The non-zero value must be
PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
Hmm, maybe it should be an
enum:

enum devmap_geometry {
 DEVMAP_PTE,
 DEVMAP_PMD,
 DEVMAP_PUD,
}


I suppose a converter between devmap_geometry and page_size would be needed 
too? And maybe
the whole dax/nvdimm align values change meanwhile (as a followup improvement)?

I think it is ok for dax/nvdimm to continue to maintain their align
value because it should be ok to have 4MB align if the device really
wanted. However, when it goes to map that alignment with
memremap_pages() it can pick a mode. For example, it's already the
case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
they're already separate concepts that can stay separate.


Gotcha.

I am reconsidering part of the above. In general, yes, the meaning of devmap 
@align
represents a slightly different variation of the device @align i.e. how the 
metadata is
laid out **but** regardless of what kind of page table entries we use vmemmap.

By using DEVMAP_PTE/PMD/PUD we might end up 1) duplicating what nvdimm/dax 
already
validates in terms of allowed device @align values (i.e. PAGE_SIZE, PMD_SIZE 
and PUD_SIZE)
2) the geometry of metadata is very much tied to the value we pick to @align at 
namespace
provisioning -- not the "align" we might use at mmap() perhaps that's what you 
referred
above? -- and 3) the value of geometry actually derives from dax device @align 
because we
will need to create compound pages representing a page size of @align value.

Using your example above: you're saying that dax->align == 1G is mapped with 
DEVMAP_PTEs,
in reality the vmemmap is populated with PMDs/PUDs page tables (depending on 
what archs
decide to do at vmemmap_populate()) and uses base pages as its metadata 
regardless of what
device @align. In reality what we want to convey in @geometry is not page table 
sizes, but
just the page size used for the vmemmap of the dax device. Additionally, 
limiting its
value might not be desirable... if tomorrow Linux for some arch supports 
dax/nvdimm
devices with 4M align or 64K align, the value of @geometry will have to reflect 
the 4M to
create compound pages of order 10 for the said vmemmap.

I am going to wait until you finish reviewing the remaining four patches of 
this series,
but maybe this is a simple misnomer (s/align/geometry/) with a comment but 
without
DEVMAP_{PTE,PMD,PUD} enum part? Or perhaps its own struct with a value and enum 
a
setter/getter to audit its value? Thoughts?

Joao


Good points there.

My understanding is that  dax->align  conveys granularity of size while 
carving out a namespace,


it's a geometry attribute loosely akin to sector size of a spindle 
disk.  I tend to think that


device pagesize  has almost no relation to "align" in that, it's 
possible to have 1G "align" and


4K pagesize, or verse versa.  That is, with the advent of compound page 
support, it is possible


to totally separate the two concepts.

How about adding a new option to "ndctl create-namespace" that describes 
device creator's desired


pagesize, and another parameter to describe whether the pagesize shall 
be fixed or allowed to be split up,


such that, if the intention is to never split up 2M pagesize, then it 
would be possible to save a lot metadata


space on the device?

thanks,

-jane
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-05-18 Thread Joao Martins
On 5/5/21 11:36 PM, Joao Martins wrote:
> On 5/5/21 11:20 PM, Dan Williams wrote:
>> On Wed, May 5, 2021 at 12:50 PM Joao Martins  
>> wrote:
>>> On 5/5/21 7:44 PM, Dan Williams wrote:
 On Thu, Mar 25, 2021 at 4:10 PM Joao Martins  
 wrote:
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index b46f63dcaed3..bb28d82dda5e 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -114,6 +114,7 @@ struct dev_pagemap {
> struct completion done;
> enum memory_type type;
> unsigned int flags;
> +   unsigned long align;

 I think this wants some kernel-doc above to indicate that non-zero
 means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
 means "use non-compound base pages".

[...]

 The non-zero value must be
 PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
 Hmm, maybe it should be an
 enum:

 enum devmap_geometry {
 DEVMAP_PTE,
 DEVMAP_PMD,
 DEVMAP_PUD,
 }

>>> I suppose a converter between devmap_geometry and page_size would be needed 
>>> too? And maybe
>>> the whole dax/nvdimm align values change meanwhile (as a followup 
>>> improvement)?
>>
>> I think it is ok for dax/nvdimm to continue to maintain their align
>> value because it should be ok to have 4MB align if the device really
>> wanted. However, when it goes to map that alignment with
>> memremap_pages() it can pick a mode. For example, it's already the
>> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
>> they're already separate concepts that can stay separate.
>>
> Gotcha.

I am reconsidering part of the above. In general, yes, the meaning of devmap 
@align
represents a slightly different variation of the device @align i.e. how the 
metadata is
laid out **but** regardless of what kind of page table entries we use vmemmap.

By using DEVMAP_PTE/PMD/PUD we might end up 1) duplicating what nvdimm/dax 
already
validates in terms of allowed device @align values (i.e. PAGE_SIZE, PMD_SIZE 
and PUD_SIZE)
2) the geometry of metadata is very much tied to the value we pick to @align at 
namespace
provisioning -- not the "align" we might use at mmap() perhaps that's what you 
referred
above? -- and 3) the value of geometry actually derives from dax device @align 
because we
will need to create compound pages representing a page size of @align value.

Using your example above: you're saying that dax->align == 1G is mapped with 
DEVMAP_PTEs,
in reality the vmemmap is populated with PMDs/PUDs page tables (depending on 
what archs
decide to do at vmemmap_populate()) and uses base pages as its metadata 
regardless of what
device @align. In reality what we want to convey in @geometry is not page table 
sizes, but
just the page size used for the vmemmap of the dax device. Additionally, 
limiting its
value might not be desirable... if tomorrow Linux for some arch supports 
dax/nvdimm
devices with 4M align or 64K align, the value of @geometry will have to reflect 
the 4M to
create compound pages of order 10 for the said vmemmap.

I am going to wait until you finish reviewing the remaining four patches of 
this series,
but maybe this is a simple misnomer (s/align/geometry/) with a comment but 
without
DEVMAP_{PTE,PMD,PUD} enum part? Or perhaps its own struct with a value and enum 
a
setter/getter to audit its value? Thoughts?

Joao
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-05-06 Thread Joao Martins
On 5/6/21 12:43 PM, Matthew Wilcox wrote:
> On Thu, May 06, 2021 at 11:23:25AM +0100, Joao Martins wrote:
 I think it is ok for dax/nvdimm to continue to maintain their align
 value because it should be ok to have 4MB align if the device really
 wanted. However, when it goes to map that alignment with
 memremap_pages() it can pick a mode. For example, it's already the
 case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
 they're already separate concepts that can stay separate.
>>>
>>> devdax namespace with align of 1G implies we expect to map them with 1G
>>> pte entries? I didn't follow when you say we map them today with
>>> DEVMAP_PTE entries.
>>
>> This sort of confusion is largelly why Dan is suggesting a @geometry for 
>> naming rather
>> than @align (which traditionally refers to page tables entry sizes in 
>> pagemap-related stuff).
>>
>> DEVMAP_{PTE,PMD,PUD} refers to the representation of metadata in base pages 
>> (DEVMAP_PTE),
>> compound pages of PMD order (DEVMAP_PMD) or compound pages of PUD order 
>> (DEVMAP_PUD).
>>
>> So, today:
>>
>> * namespaces with align of 1G would use *struct pages of order-0* 
>> (DEVMAP_PTE) backed with
>> PMD entries in the direct map.
>> * namespaces with align of 2M would use *struct pages of order-0* 
>> (DEVMAP_PTE) backed with
>> PMD entries in the direct map.
>>
>> After this series:
>>
>> * namespaces with align of 1G would use *compound struct pages of order-30* 
>> (DEVMAP_PUD)
>> backed with PMD entries in the direct map.
> 
> order-18
> 
Right, thanks for the correction.

Forgot to subtract PAGE_SIZE order (12).

>> * namespaces with align of 1G would use *compound struct pages of order-21* 
>> (DEVMAP_PMD)
>> backed with PTE entries in the direct map.
> 
> i think you mean "align of 2M", and order-9.
> 
True.

> (note that these numbers are/can be different for powerpc since it
> supports different TLB page sizes.  please don't get locked into x86
> sizes, and don't ignore the benefits *to software* of managing memory
> in sizes other than just those supported by the hardware).
> 
I am not ignoring that either that I think.

The page size looking values above is also what the consumer of this 
(device-dax) allows
you to use as @align, hence why you see DEVMAP_PTE, PMD and PUD as the sizes.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-05-06 Thread Matthew Wilcox
On Thu, May 06, 2021 at 11:23:25AM +0100, Joao Martins wrote:
> >> I think it is ok for dax/nvdimm to continue to maintain their align
> >> value because it should be ok to have 4MB align if the device really
> >> wanted. However, when it goes to map that alignment with
> >> memremap_pages() it can pick a mode. For example, it's already the
> >> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
> >> they're already separate concepts that can stay separate.
> > 
> > devdax namespace with align of 1G implies we expect to map them with 1G
> > pte entries? I didn't follow when you say we map them today with
> > DEVMAP_PTE entries.
> 
> This sort of confusion is largelly why Dan is suggesting a @geometry for 
> naming rather
> than @align (which traditionally refers to page tables entry sizes in 
> pagemap-related stuff).
> 
> DEVMAP_{PTE,PMD,PUD} refers to the representation of metadata in base pages 
> (DEVMAP_PTE),
> compound pages of PMD order (DEVMAP_PMD) or compound pages of PUD order 
> (DEVMAP_PUD).
> 
> So, today:
> 
> * namespaces with align of 1G would use *struct pages of order-0* 
> (DEVMAP_PTE) backed with
> PMD entries in the direct map.
> * namespaces with align of 2M would use *struct pages of order-0* 
> (DEVMAP_PTE) backed with
> PMD entries in the direct map.
> 
> After this series:
> 
> * namespaces with align of 1G would use *compound struct pages of order-30* 
> (DEVMAP_PUD)
> backed with PMD entries in the direct map.

order-18

> * namespaces with align of 1G would use *compound struct pages of order-21* 
> (DEVMAP_PMD)
> backed with PTE entries in the direct map.

i think you mean "align of 2M", and order-9.

(note that these numbers are/can be different for powerpc since it
supports different TLB page sizes.  please don't get locked into x86
sizes, and don't ignore the benefits *to software* of managing memory
in sizes other than just those supported by the hardware).
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-05-06 Thread Joao Martins



On 5/6/21 9:05 AM, Aneesh Kumar K.V wrote:
> 
> 
> IIUC this series is about devdax namespace with aligh of 1G or 2M where we can
> save the vmmemap space by not allocating memory for tail struct pages? 
> 
Right.

It reuses base pages across the vmemmap, but for the base pages containing
only the tail struct pages.

> Dan Williams  writes:
> 
>>> enum:

 enum devmap_geometry {
 DEVMAP_PTE,
 DEVMAP_PMD,
 DEVMAP_PUD,
 }

>>> I suppose a converter between devmap_geometry and page_size would be needed 
>>> too? And maybe
>>> the whole dax/nvdimm align values change meanwhile (as a followup 
>>> improvement)?
>>
>> I think it is ok for dax/nvdimm to continue to maintain their align
>> value because it should be ok to have 4MB align if the device really
>> wanted. However, when it goes to map that alignment with
>> memremap_pages() it can pick a mode. For example, it's already the
>> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
>> they're already separate concepts that can stay separate.
> 
> devdax namespace with align of 1G implies we expect to map them with 1G
> pte entries? I didn't follow when you say we map them today with
> DEVMAP_PTE entries.
> 
This sort of confusion is largelly why Dan is suggesting a @geometry for naming 
rather
than @align (which traditionally refers to page tables entry sizes in 
pagemap-related stuff).

DEVMAP_{PTE,PMD,PUD} refers to the representation of metadata in base pages 
(DEVMAP_PTE),
compound pages of PMD order (DEVMAP_PMD) or compound pages of PUD order 
(DEVMAP_PUD).

So, today:

* namespaces with align of 1G would use *struct pages of order-0* (DEVMAP_PTE) 
backed with
PMD entries in the direct map.
* namespaces with align of 2M would use *struct pages of order-0* (DEVMAP_PTE) 
backed with
PMD entries in the direct map.

After this series:

* namespaces with align of 1G would use *compound struct pages of order-30* 
(DEVMAP_PUD)
backed with PMD entries in the direct map.
* namespaces with align of 1G would use *compound struct pages of order-21* 
(DEVMAP_PMD)
backed with PTE entries in the direct map.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-05-06 Thread Joao Martins
On 5/6/21 12:03 AM, Dan Williams wrote:
> On Wed, May 5, 2021 at 3:36 PM Joao Martins  wrote:
> [..]
>>> Ah yup, my eyes glazed over that. I think this is another place that
>>> benefits from a more specific name than "align". "pfns_per_compound"
>>> "compound_pfns"?
>>>
>> We are still describing a page, just not a base page. So perhaps 
>> @pfns_per_hpage ?
>>
>> I am fine with @pfns_per_compound or @compound_pfns as well.
> 
> My only concern about hpage is that hpage implies PMD, where compound
> is generic across PMD and PUD.
> 
True.

I will stick with your suggestions.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-05-06 Thread Aneesh Kumar K.V



IIUC this series is about devdax namespace with aligh of 1G or 2M where we can
save the vmmemap space by not allocating memory for tail struct pages? 

Dan Williams  writes:

> > enum:
>> >
>> > enum devmap_geometry {
>> > DEVMAP_PTE,
>> > DEVMAP_PMD,
>> > DEVMAP_PUD,
>> > }
>> >
>> I suppose a converter between devmap_geometry and page_size would be needed 
>> too? And maybe
>> the whole dax/nvdimm align values change meanwhile (as a followup 
>> improvement)?
>
> I think it is ok for dax/nvdimm to continue to maintain their align
> value because it should be ok to have 4MB align if the device really
> wanted. However, when it goes to map that alignment with
> memremap_pages() it can pick a mode. For example, it's already the
> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
> they're already separate concepts that can stay separate.

devdax namespace with align of 1G implies we expect to map them with 1G
pte entries? I didn't follow when you say we map them today with
DEVMAP_PTE entries.


-aneesh
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-05-05 Thread Dan Williams
On Wed, May 5, 2021 at 3:36 PM Joao Martins  wrote:
[..]
> > Ah yup, my eyes glazed over that. I think this is another place that
> > benefits from a more specific name than "align". "pfns_per_compound"
> > "compound_pfns"?
> >
> We are still describing a page, just not a base page. So perhaps 
> @pfns_per_hpage ?
>
> I am fine with @pfns_per_compound or @compound_pfns as well.

My only concern about hpage is that hpage implies PMD, where compound
is generic across PMD and PUD.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-05-05 Thread Joao Martins
On 5/5/21 11:20 PM, Dan Williams wrote:
> On Wed, May 5, 2021 at 12:50 PM Joao Martins  
> wrote:
>> On 5/5/21 7:44 PM, Dan Williams wrote:
>>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins  
>>> wrote:

 Add a new align property for struct dev_pagemap which specifies that a
 pagemap is composed of a set of compound pages of size @align, instead
 of base pages. When these pages are initialised, most are initialised as
 tail pages instead of order-0 pages.

 For certain ZONE_DEVICE users like device-dax which have a fixed page
 size, this creates an opportunity to optimize GUP and GUP-fast walkers,
 treating it the same way as THP or hugetlb pages.

 Signed-off-by: Joao Martins 
 ---
  include/linux/memremap.h | 13 +
  mm/memremap.c|  8 ++--
  mm/page_alloc.c  | 24 +++-
  3 files changed, 42 insertions(+), 3 deletions(-)

 diff --git a/include/linux/memremap.h b/include/linux/memremap.h
 index b46f63dcaed3..bb28d82dda5e 100644
 --- a/include/linux/memremap.h
 +++ b/include/linux/memremap.h
 @@ -114,6 +114,7 @@ struct dev_pagemap {
 struct completion done;
 enum memory_type type;
 unsigned int flags;
 +   unsigned long align;
>>>
>>> I think this wants some kernel-doc above to indicate that non-zero
>>> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
>>> means "use non-compound base pages".
>>
>> Got it. Are you thinking a kernel_doc on top of the variable above, or 
>> preferably on top
>> of pgmap_align()?
> 
> I was thinking in dev_pagemap because this value is more than just a
> plain alignment its restructuring the layout and construction of the
> memmap(), but when I say it that way it seems much less like a vanilla
> align value compared to a construction / geometry mode setting.
> 
/me nods

>>
>>> The non-zero value must be
>>> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
>>> Hmm, maybe it should be an
>>> enum:
>>>
>>> enum devmap_geometry {
>>> DEVMAP_PTE,
>>> DEVMAP_PMD,
>>> DEVMAP_PUD,
>>> }
>>>
>> I suppose a converter between devmap_geometry and page_size would be needed 
>> too? And maybe
>> the whole dax/nvdimm align values change meanwhile (as a followup 
>> improvement)?
> 
> I think it is ok for dax/nvdimm to continue to maintain their align
> value because it should be ok to have 4MB align if the device really
> wanted. However, when it goes to map that alignment with
> memremap_pages() it can pick a mode. For example, it's already the
> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
> they're already separate concepts that can stay separate.
> 
Gotcha.

>>
>> Although to be fair we only ever care about compound page size in this 
>> series (and
>> similarly dax/nvdimm @align properties).
>>
>>> ...because it's more than just an alignment it's a structural
>>> definition of how the memmap is laid out.
>>>

Somehow I missed this other part of your response.

 const struct dev_pagemap_ops *ops;
 void *owner;
 int nr_range;
 @@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct 
 dev_pagemap *pgmap)
 return NULL;
  }

 +static inline unsigned long pgmap_align(struct dev_pagemap *pgmap)
 +{
 +   if (!pgmap || !pgmap->align)
 +   return PAGE_SIZE;
 +   return pgmap->align;
 +}
 +
 +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
 +{
 +   return PHYS_PFN(pgmap_align(pgmap));
 +}
 +
  #ifdef CONFIG_ZONE_DEVICE
  bool pfn_zone_device_reserved(unsigned long pfn);
  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
 diff --git a/mm/memremap.c b/mm/memremap.c
 index 805d761740c4..d160853670c4 100644
 --- a/mm/memremap.c
 +++ b/mm/memremap.c
 @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, 
 struct mhp_params *params,
 memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
 PHYS_PFN(range->start),
 PHYS_PFN(range_len(range)), pgmap);
 -   percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
 -   - pfn_first(pgmap, range_id));
 +   if (pgmap_align(pgmap) > PAGE_SIZE)
 +   percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
 +   - pfn_first(pgmap, range_id)) / 
 pgmap_pfn_align(pgmap));
 +   else
 +   percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
 +   - pfn_first(pgmap, range_id));
 return 0;

  err_add_memory:
 diff --git a/mm/page_alloc.c b/mm/page_alloc.c
 index 58974067bbd4..3a77f9e43f3a 100644
 --- a/mm/page_alloc.c
 +++ b/mm/page_allo

Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-05-05 Thread Dan Williams
On Wed, May 5, 2021 at 12:50 PM Joao Martins  wrote:
>
>
>
> On 5/5/21 7:44 PM, Dan Williams wrote:
> > On Thu, Mar 25, 2021 at 4:10 PM Joao Martins  
> > wrote:
> >>
> >> Add a new align property for struct dev_pagemap which specifies that a
> >> pagemap is composed of a set of compound pages of size @align, instead
> >> of base pages. When these pages are initialised, most are initialised as
> >> tail pages instead of order-0 pages.
> >>
> >> For certain ZONE_DEVICE users like device-dax which have a fixed page
> >> size, this creates an opportunity to optimize GUP and GUP-fast walkers,
> >> treating it the same way as THP or hugetlb pages.
> >>
> >> Signed-off-by: Joao Martins 
> >> ---
> >>  include/linux/memremap.h | 13 +
> >>  mm/memremap.c|  8 ++--
> >>  mm/page_alloc.c  | 24 +++-
> >>  3 files changed, 42 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> >> index b46f63dcaed3..bb28d82dda5e 100644
> >> --- a/include/linux/memremap.h
> >> +++ b/include/linux/memremap.h
> >> @@ -114,6 +114,7 @@ struct dev_pagemap {
> >> struct completion done;
> >> enum memory_type type;
> >> unsigned int flags;
> >> +   unsigned long align;
> >
> > I think this wants some kernel-doc above to indicate that non-zero
> > means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
> > means "use non-compound base pages".
>
> Got it. Are you thinking a kernel_doc on top of the variable above, or 
> preferably on top
> of pgmap_align()?

I was thinking in dev_pagemap because this value is more than just a
plain alignment its restructuring the layout and construction of the
memmap(), but when I say it that way it seems much less like a vanilla
align value compared to a construction / geometry mode setting.

>
> > The non-zero value must be
> > PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
> > Hmm, maybe it should be an
> > enum:
> >
> > enum devmap_geometry {
> > DEVMAP_PTE,
> > DEVMAP_PMD,
> > DEVMAP_PUD,
> > }
> >
> I suppose a converter between devmap_geometry and page_size would be needed 
> too? And maybe
> the whole dax/nvdimm align values change meanwhile (as a followup 
> improvement)?

I think it is ok for dax/nvdimm to continue to maintain their align
value because it should be ok to have 4MB align if the device really
wanted. However, when it goes to map that alignment with
memremap_pages() it can pick a mode. For example, it's already the
case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
they're already separate concepts that can stay separate.

>
> Although to be fair we only ever care about compound page size in this series 
> (and
> similarly dax/nvdimm @align properties).
>
> > ...because it's more than just an alignment it's a structural
> > definition of how the memmap is laid out.
> >
> >> const struct dev_pagemap_ops *ops;
> >> void *owner;
> >> int nr_range;
> >> @@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct 
> >> dev_pagemap *pgmap)
> >> return NULL;
> >>  }
> >>
> >> +static inline unsigned long pgmap_align(struct dev_pagemap *pgmap)
> >> +{
> >> +   if (!pgmap || !pgmap->align)
> >> +   return PAGE_SIZE;
> >> +   return pgmap->align;
> >> +}
> >> +
> >> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
> >> +{
> >> +   return PHYS_PFN(pgmap_align(pgmap));
> >> +}
> >> +
> >>  #ifdef CONFIG_ZONE_DEVICE
> >>  bool pfn_zone_device_reserved(unsigned long pfn);
> >>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> >> diff --git a/mm/memremap.c b/mm/memremap.c
> >> index 805d761740c4..d160853670c4 100644
> >> --- a/mm/memremap.c
> >> +++ b/mm/memremap.c
> >> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, 
> >> struct mhp_params *params,
> >> memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> >> PHYS_PFN(range->start),
> >> PHYS_PFN(range_len(range)), pgmap);
> >> -   percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> >> -   - pfn_first(pgmap, range_id));
> >> +   if (pgmap_align(pgmap) > PAGE_SIZE)
> >> +   percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> >> +   - pfn_first(pgmap, range_id)) / 
> >> pgmap_pfn_align(pgmap));
> >> +   else
> >> +   percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> >> +   - pfn_first(pgmap, range_id));
> >> return 0;
> >>
> >>  err_add_memory:
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 58974067bbd4..3a77f9e43f3a 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
> >> unsigned long pfn, end_pfn = start_pfn + nr_pages;
> >> 

Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-05-05 Thread Joao Martins



On 5/5/21 7:44 PM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins  
> wrote:
>>
>> Add a new align property for struct dev_pagemap which specifies that a
>> pagemap is composed of a set of compound pages of size @align, instead
>> of base pages. When these pages are initialised, most are initialised as
>> tail pages instead of order-0 pages.
>>
>> For certain ZONE_DEVICE users like device-dax which have a fixed page
>> size, this creates an opportunity to optimize GUP and GUP-fast walkers,
>> treating it the same way as THP or hugetlb pages.
>>
>> Signed-off-by: Joao Martins 
>> ---
>>  include/linux/memremap.h | 13 +
>>  mm/memremap.c|  8 ++--
>>  mm/page_alloc.c  | 24 +++-
>>  3 files changed, 42 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index b46f63dcaed3..bb28d82dda5e 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -114,6 +114,7 @@ struct dev_pagemap {
>> struct completion done;
>> enum memory_type type;
>> unsigned int flags;
>> +   unsigned long align;
> 
> I think this wants some kernel-doc above to indicate that non-zero
> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
> means "use non-compound base pages". 

Got it. Are you thinking a kernel_doc on top of the variable above, or 
preferably on top
of pgmap_align()?

> The non-zero value must be
> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE. 
> Hmm, maybe it should be an
> enum:
> 
> enum devmap_geometry {
> DEVMAP_PTE,
> DEVMAP_PMD,
> DEVMAP_PUD,
> }
> 
I suppose a converter between devmap_geometry and page_size would be needed 
too? And maybe
the whole dax/nvdimm align values change meanwhile (as a followup improvement)?

Although to be fair we only ever care about compound page size in this series 
(and
similarly dax/nvdimm @align properties).

> ...because it's more than just an alignment it's a structural
> definition of how the memmap is laid out.
> 
>> const struct dev_pagemap_ops *ops;
>> void *owner;
>> int nr_range;
>> @@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct 
>> dev_pagemap *pgmap)
>> return NULL;
>>  }
>>
>> +static inline unsigned long pgmap_align(struct dev_pagemap *pgmap)
>> +{
>> +   if (!pgmap || !pgmap->align)
>> +   return PAGE_SIZE;
>> +   return pgmap->align;
>> +}
>> +
>> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
>> +{
>> +   return PHYS_PFN(pgmap_align(pgmap));
>> +}
>> +
>>  #ifdef CONFIG_ZONE_DEVICE
>>  bool pfn_zone_device_reserved(unsigned long pfn);
>>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 805d761740c4..d160853670c4 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, 
>> struct mhp_params *params,
>> memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
>> PHYS_PFN(range->start),
>> PHYS_PFN(range_len(range)), pgmap);
>> -   percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>> -   - pfn_first(pgmap, range_id));
>> +   if (pgmap_align(pgmap) > PAGE_SIZE)
>> +   percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
>> +   - pfn_first(pgmap, range_id)) / 
>> pgmap_pfn_align(pgmap));
>> +   else
>> +   percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>> +   - pfn_first(pgmap, range_id));
>> return 0;
>>
>>  err_add_memory:
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 58974067bbd4..3a77f9e43f3a 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
>> unsigned long pfn, end_pfn = start_pfn + nr_pages;
>> struct pglist_data *pgdat = zone->zone_pgdat;
>> struct vmem_altmap *altmap = pgmap_altmap(pgmap);
>> +   unsigned int pfn_align = pgmap_pfn_align(pgmap);
>> +   unsigned int order_align = order_base_2(pfn_align);
>> unsigned long zone_idx = zone_idx(zone);
>> unsigned long start = jiffies;
>> int nid = pgdat->node_id;
>> @@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone,
>> nr_pages = end_pfn - start_pfn;
>> }
>>
>> -   for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>> +   for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
> 
> pfn_align is in bytes and pfn is in pages... is there a "pfn_align >>=
> PAGE_SHIFT" I missed somewhere?
> 
@pfn_align is in pages too. It's pgmap_align() which is in bytes:

+static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
+{
+   return PHYS_PFN(pgmap_alig

Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-05-05 Thread Matthew Wilcox
On Wed, May 05, 2021 at 11:44:29AM -0700, Dan Williams wrote:
> > @@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
> > unsigned long pfn, end_pfn = start_pfn + nr_pages;
> > struct pglist_data *pgdat = zone->zone_pgdat;
> > struct vmem_altmap *altmap = pgmap_altmap(pgmap);
> > +   unsigned int pfn_align = pgmap_pfn_align(pgmap);
> > +   unsigned int order_align = order_base_2(pfn_align);
> > unsigned long zone_idx = zone_idx(zone);
> > unsigned long start = jiffies;
> > int nid = pgdat->node_id;
> > @@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone 
> > *zone,
> > nr_pages = end_pfn - start_pfn;
> > }
> >
> > -   for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> > +   for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
> 
> pfn_align is in bytes and pfn is in pages... is there a "pfn_align >>=
> PAGE_SHIFT" I missed somewhere?

If something is measured in bytes, I like to use size_t (if it's
in memory) and loff_t (if it's on storage).  The compiler doesn't do
anything useful to warn you, but it's a nice indication to humans about
what's going on.  And it removes the temptation to do 'pfn_align >>=
PAGE_SHIFT' and suddenly take pfn_align from being measured in bytes to
being measured in pages.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-05-05 Thread Dan Williams
On Thu, Mar 25, 2021 at 4:10 PM Joao Martins  wrote:
>
> Add a new align property for struct dev_pagemap which specifies that a
> pagemap is composed of a set of compound pages of size @align, instead
> of base pages. When these pages are initialised, most are initialised as
> tail pages instead of order-0 pages.
>
> For certain ZONE_DEVICE users like device-dax which have a fixed page
> size, this creates an opportunity to optimize GUP and GUP-fast walkers,
> treating it the same way as THP or hugetlb pages.
>
> Signed-off-by: Joao Martins 
> ---
>  include/linux/memremap.h | 13 +
>  mm/memremap.c|  8 ++--
>  mm/page_alloc.c  | 24 +++-
>  3 files changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index b46f63dcaed3..bb28d82dda5e 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -114,6 +114,7 @@ struct dev_pagemap {
> struct completion done;
> enum memory_type type;
> unsigned int flags;
> +   unsigned long align;

I think this wants some kernel-doc above to indicate that non-zero
means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
means "use non-compound base pages". The non-zero value must be
PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE. Hmm, maybe it should be an
enum:

enum devmap_geometry {
DEVMAP_PTE,
DEVMAP_PMD,
DEVMAP_PUD,
}

...because it's more than just an alignment it's a structural
definition of how the memmap is laid out.

> const struct dev_pagemap_ops *ops;
> void *owner;
> int nr_range;
> @@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct 
> dev_pagemap *pgmap)
> return NULL;
>  }
>
> +static inline unsigned long pgmap_align(struct dev_pagemap *pgmap)
> +{
> +   if (!pgmap || !pgmap->align)
> +   return PAGE_SIZE;
> +   return pgmap->align;
> +}
> +
> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
> +{
> +   return PHYS_PFN(pgmap_align(pgmap));
> +}
> +
>  #ifdef CONFIG_ZONE_DEVICE
>  bool pfn_zone_device_reserved(unsigned long pfn);
>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 805d761740c4..d160853670c4 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, 
> struct mhp_params *params,
> memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> PHYS_PFN(range->start),
> PHYS_PFN(range_len(range)), pgmap);
> -   percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> -   - pfn_first(pgmap, range_id));
> +   if (pgmap_align(pgmap) > PAGE_SIZE)
> +   percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> +   - pfn_first(pgmap, range_id)) / 
> pgmap_pfn_align(pgmap));
> +   else
> +   percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> +   - pfn_first(pgmap, range_id));
> return 0;
>
>  err_add_memory:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 58974067bbd4..3a77f9e43f3a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
> unsigned long pfn, end_pfn = start_pfn + nr_pages;
> struct pglist_data *pgdat = zone->zone_pgdat;
> struct vmem_altmap *altmap = pgmap_altmap(pgmap);
> +   unsigned int pfn_align = pgmap_pfn_align(pgmap);
> +   unsigned int order_align = order_base_2(pfn_align);
> unsigned long zone_idx = zone_idx(zone);
> unsigned long start = jiffies;
> int nid = pgdat->node_id;
> @@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone,
> nr_pages = end_pfn - start_pfn;
> }
>
> -   for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> +   for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {

pfn_align is in bytes and pfn is in pages... is there a "pfn_align >>=
PAGE_SHIFT" I missed somewhere?
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-03-25 Thread Joao Martins
Add a new align property for struct dev_pagemap which specifies that a
pagemap is composed of a set of compound pages of size @align, instead
of base pages. When these pages are initialised, most are initialised as
tail pages instead of order-0 pages.

For certain ZONE_DEVICE users like device-dax which have a fixed page
size, this creates an opportunity to optimize GUP and GUP-fast walkers,
treating it the same way as THP or hugetlb pages.

Signed-off-by: Joao Martins 
---
 include/linux/memremap.h | 13 +
 mm/memremap.c|  8 ++--
 mm/page_alloc.c  | 24 +++-
 3 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index b46f63dcaed3..bb28d82dda5e 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -114,6 +114,7 @@ struct dev_pagemap {
struct completion done;
enum memory_type type;
unsigned int flags;
+   unsigned long align;
const struct dev_pagemap_ops *ops;
void *owner;
int nr_range;
@@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct 
dev_pagemap *pgmap)
return NULL;
 }
 
+static inline unsigned long pgmap_align(struct dev_pagemap *pgmap)
+{
+   if (!pgmap || !pgmap->align)
+   return PAGE_SIZE;
+   return pgmap->align;
+}
+
+static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
+{
+   return PHYS_PFN(pgmap_align(pgmap));
+}
+
 #ifdef CONFIG_ZONE_DEVICE
 bool pfn_zone_device_reserved(unsigned long pfn);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
diff --git a/mm/memremap.c b/mm/memremap.c
index 805d761740c4..d160853670c4 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct 
mhp_params *params,
memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
PHYS_PFN(range->start),
PHYS_PFN(range_len(range)), pgmap);
-   percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
-   - pfn_first(pgmap, range_id));
+   if (pgmap_align(pgmap) > PAGE_SIZE)
+   percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
+   - pfn_first(pgmap, range_id)) / pgmap_pfn_align(pgmap));
+   else
+   percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
+   - pfn_first(pgmap, range_id));
return 0;
 
 err_add_memory:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 58974067bbd4..3a77f9e43f3a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
unsigned long pfn, end_pfn = start_pfn + nr_pages;
struct pglist_data *pgdat = zone->zone_pgdat;
struct vmem_altmap *altmap = pgmap_altmap(pgmap);
+   unsigned int pfn_align = pgmap_pfn_align(pgmap);
+   unsigned int order_align = order_base_2(pfn_align);
unsigned long zone_idx = zone_idx(zone);
unsigned long start = jiffies;
int nid = pgdat->node_id;
@@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone,
nr_pages = end_pfn - start_pfn;
}
 
-   for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+   for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
struct page *page = pfn_to_page(pfn);
+   unsigned long i;
 
__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+
+   if (pfn_align == 1)
+   continue;
+
+   __SetPageHead(page);
+
+   for (i = 1; i < pfn_align; i++) {
+   __init_zone_device_page(page + i, pfn + i, zone_idx,
+   nid, pgmap);
+   prep_compound_tail(page, i);
+
+   /*
+* The first and second tail pages need to
+* initialized first, hence the head page is
+* prepared last.
+*/
+   if (i == 2)
+   prep_compound_head(page, order_align);
+   }
}
 
pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.1
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org