Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality

2017-07-07 Thread Mike Kravetz
On 07/07/2017 01:45 AM, Anshuman Khandual wrote:
> On 07/06/2017 09:47 PM, Mike Kravetz wrote:
>> The mremap system call has the ability to 'mirror' parts of an existing
>> mapping.  To do so, it creates a new mapping that maps the same pages as
>> the original mapping, just at a different virtual address.  This
>> functionality has existed since at least the 2.6 kernel.
>>
>> This patch simply adds a new flag to mremap which will make this
>> functionality part of the API.  It maintains backward compatibility with
>> the existing way of requesting mirroring (old_size == 0).
>>
>> If this new MREMAP_MIRROR flag is specified, then new_size must equal
>> old_size.  In addition, the MREMAP_MAYMOVE flag must be specified.
> 
> Yeah it all looks good. But why is there this requirement that if
> MREMAP_MIRROR is specified then old_size and new_size must
> be equal?

No real reason.  I just wanted to clearly separate the new interface from
the old.  On second thought, it would be better to require old_size == 0
as in the legacy interface.

-- 
Mike Kravetz


Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality

2017-07-07 Thread Mike Kravetz
On 07/07/2017 03:23 AM, Kirill A. Shutemov wrote:
> On Thu, Jul 06, 2017 at 09:17:26AM -0700, Mike Kravetz wrote:
>> The mremap system call has the ability to 'mirror' parts of an existing
>> mapping.  To do so, it creates a new mapping that maps the same pages as
>> the original mapping, just at a different virtual address.  This
>> functionality has existed since at least the 2.6 kernel.
>>
>> This patch simply adds a new flag to mremap which will make this
>> functionality part of the API.  It maintains backward compatibility with
>> the existing way of requesting mirroring (old_size == 0).
>>
>> If this new MREMAP_MIRROR flag is specified, then new_size must equal
>> old_size.  In addition, the MREMAP_MAYMOVE flag must be specified.
> 
> The patch breaks important invariant that anon page can be mapped into a
> process only once.

Actually, the patch does not add any new functionality.  It only provides
a new interface to existing functionality.

Is it not possible to have an anon page mapped twice into the same process
via System V shared memory?  shmget(anon), shmat(), shmat() again.
Of course, those are shared rather than private anon pages.
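
For reference, a minimal sketch of that shmat() case (untested, error
handling omitted):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    char *a = shmat(id, NULL, 0);   /* first mapping of the segment */
    char *b = shmat(id, NULL, 0);   /* second mapping of the same pages */

    shmctl(id, IPC_RMID, NULL);     /* segment goes away on last detach */
    a[0] = 'x';
    printf("a[0]=%c b[0]=%c\n", a[0], b[0]);    /* both show 'x' */
    return 0;
}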

> 
> What is going to happen to mirrored after CoW for instance?
> 
> In my opinion, it shouldn't be allowed for anon/private mappings at least.
> And with this limitation, I don't see much sense in the new interface --
> just create mirror by mmap()ing the file again.

The code today works for anon shared mappings.  See simple program below.

You are correct in that it makes little or no sense for private mappings.
When looking closer at existing code, mremap() creates a new private
mapping in this case.  This is most likely a bug.

Again, my intention is not to create new functionality but rather document
existing functionality as part of a programming interface.

-- 
Mike Kravetz

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#define __USE_GNU
#include <sys/mman.h>

#define H_PAGESIZE (2 * 1024 * 1024)

int hugetlb = 0;

#define PROTECTION (PROT_READ | PROT_WRITE)
#define ADDR (void *)(0x0UL)
/* #define FLAGS (MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB) */
#define FLAGS (MAP_SHARED|MAP_ANONYMOUS)

int main(int argc, char **argv)
{
    int i;
    long long hpages;
    void *addr;
    void *addr2;

    if (argc == 2) {
        if (!strcmp(argv[1], "hugetlb"))
            hugetlb = 1;
    }

    hpages = 5;

    printf("Reserving an address ...\n");
    addr = mmap(ADDR, H_PAGESIZE * hpages * 2,
            PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE|
            (hugetlb ? MAP_HUGETLB : 0),
            -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    printf("\tgot address %p to %p\n",
        (void *)addr, (void *)(addr + H_PAGESIZE * hpages * 2));

    printf("mmapping %lld 2MB huge pages\n", hpages);
    addr = mmap(addr, H_PAGESIZE * hpages, PROT_READ|PROT_WRITE,
            MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS|
            (hugetlb ? MAP_HUGETLB : 0),
            -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    /* initialize data */
    for (i = 0; i < hpages; i++)
        *((char *)addr + (i * H_PAGESIZE)) = 'a';

    printf("pages allocated and initialized at %p\n", (void *)addr);

    addr2 = mremap(addr, 0, H_PAGESIZE * hpages,
            MREMAP_MAYMOVE | MREMAP_FIXED,
            addr + (H_PAGESIZE * hpages));
    if (addr2 == MAP_FAILED) {
        perror("mremap");
        exit(1);
    }
    printf("mapping relocated to %p\n", (void *)addr2);

    /* verify data through the original mapping */
    printf("Verifying data at address %p\n", (void *)addr);
    for (i = 0; i < hpages; i++) {
        if (*((char *)addr + (i * H_PAGESIZE)) != 'a') {
            printf("data at address %p not as expected\n",
                (void *)((char *)addr + (i * H_PAGESIZE)));
        }
    }
    if (i >= hpages)
        printf("\t success!\n");

    /* verify data through the new (mirrored) mapping */
    printf("Verifying data at address %p\n", (void *)addr2);
    for (i = 0; i < hpages; i++) {
        if (*((char *)addr2 + (i * H_PAGESIZE)) != 'a') {
            printf("data at address %p not as expected\n",
                (void *)((char *)addr2 + (i * H_PAGESIZE)));
        }
    }
    if (i >= hpages)
        printf("\t success!\n");

    return 0;
}


Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality

2017-07-07 Thread Mike Kravetz
On 07/07/2017 10:45 AM, Kirill A. Shutemov wrote:
> On Fri, Jul 07, 2017 at 10:29:52AM -0700, Mike Kravetz wrote:
>> On 07/07/2017 03:23 AM, Kirill A. Shutemov wrote:
>>> On Thu, Jul 06, 2017 at 09:17:26AM -0700, Mike Kravetz wrote:
>>>> The mremap system call has the ability to 'mirror' parts of an existing
>>>> mapping.  To do so, it creates a new mapping that maps the same pages as
>>>> the original mapping, just at a different virtual address.  This
>>>> functionality has existed since at least the 2.6 kernel.
>>>>
>>>> This patch simply adds a new flag to mremap which will make this
>>>> functionality part of the API.  It maintains backward compatibility with
>>>> the existing way of requesting mirroring (old_size == 0).
>>>>
>>>> If this new MREMAP_MIRROR flag is specified, then new_size must equal
>>>> old_size.  In addition, the MREMAP_MAYMOVE flag must be specified.
>>>
>>> The patch breaks important invariant that anon page can be mapped into a
>>> process only once.
>>
>> Actually, the patch does not add any new functionality.  It only provides
>> a new interface to existing functionality.
>>
>> Is it not possible to have an anon page mapped twice into the same process
>> via system V shared memory?  shmget(anon), shmat(), shmat.  
>> Of course, those are shared rather than private anon pages.
> 
> By anon pages I mean, private anon or file pages. These are subject to CoW.
> 
>>> What is going to happen to mirrored after CoW for instance?
>>>
>>> In my opinion, it shouldn't be allowed for anon/private mappings at least.
>>> And with this limitation, I don't see much sense in the new interface --
>>> just create mirror by mmap()ing the file again.
>>
>> The code today works for anon shared mappings.  See simple program below.
>>
>> You are correct in that it makes little or no sense for private mappings.
>> When looking closer at existing code, mremap() creates a new private
>> mapping in this case.  This is most likely a bug.
> 
> IIRC, existing code doesn't create mirrors of private pages as it requires
> old_len to be zero. There's no way to get private pages mapped twice this
> way.

Correct.
As mentioned above, mremap does 'something' for private anon pages when
old_len == 0.  However, this may be considered a bug.  In this case, mremap
creates a new private anon mapping of length new_size.  Since old_len == 0,
it does not unmap any of the old mapping.  So, in this case mremap basically
creates a new private mapping (unrelated to the original) and does not
modify the old mapping.
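
To make this concrete, a rough sketch of the current behavior (untested,
error checks omitted):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;
    char *old = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *new;

    memset(old, 'a', len);

    /* old_size == 0 on a private mapping: today this succeeds and hands
     * back a second private mapping that shares nothing with the first. */
    new = mremap(old, 0, len, MREMAP_MAYMOVE);

    new[0] = 'b';
    printf("old[0]=%c new[0]=%c\n", old[0], new[0]);    /* 'a' and 'b' */
    return 0;
}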

-- 
Mike Kravetz


Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality

2017-07-10 Thread Mike Kravetz
On 07/10/2017 09:22 AM, Vlastimil Babka wrote:
> On 07/09/2017 09:32 AM, Anshuman Khandual wrote:
>> On 07/07/2017 11:39 PM, Mike Kravetz wrote:
>>> On 07/07/2017 10:45 AM, Kirill A. Shutemov wrote:
>>>> On Fri, Jul 07, 2017 at 10:29:52AM -0700, Mike Kravetz wrote:
>>>>> On 07/07/2017 03:23 AM, Kirill A. Shutemov wrote:
>>>>>> What is going to happen to mirrored after CoW for instance?
>>>>>>
>>>>>> In my opinion, it shouldn't be allowed for anon/private mappings at 
>>>>>> least.
>>>>>> And with this limitation, I don't see much sense in the new interface --
>>>>>> just create mirror by mmap()ing the file again.
>>>>>
>>>>> The code today works for anon shared mappings.  See simple program below.
>>>>>
>>>>> You are correct in that it makes little or no sense for private mappings.
>>>>> When looking closer at existing code, mremap() creates a new private
>>>>> mapping in this case.  This is most likely a bug.
>>>>
>>>> IIRC, existing code doesn't create mirrors of private pages as it requires
>>>> old_len to be zero. There's no way to get private pages mapped twice this
>>>> way.
>>>
>>> Correct.
>>> As mentioned above, mremap does 'something' for private anon pages when
>>> old_len == 0.  However, this may be considered a bug.  In this case, mremap
>>> creates a new private anon mapping of length new_size.  Since old_len == 0,
>>> it does not unmap any of the old mapping.  So, in this case mremap basically
>>> creates a new private mapping (unrelated to the original) and does not
>>> modify the old mapping.
>>>
>>
>> Yeah, in my experiment, after the mremap() exits we have two different VMAs
>> which can contain two different sets of data. No page sharing is happening.
> 
> So how does this actually work for the JVM garbage collector use case?
> Aren't the garbage collected objects private anon?

Good point.
The sample program the JVM team gave me uses a shared anon mapping.  As you
mention one would expect these mappings to be private.  I have asked them
for more details on their use case.

> Anyway this should be documented.

Yes, their prototype work seems to take advantage of this existing undocumented
behavior.  It seems we have been carrying this functionality for at least 13
years.  It may be time to document it.

-- 
Mike Kravetz


Re: [PATCH] mm/mremap: Document MREMAP_FIXED dependency on MREMAP_MAYMOVE

2017-07-10 Thread Mike Kravetz
On 07/10/2017 06:41 AM, Michal Hocko wrote:
> On Mon 10-07-17 17:02:11, Anshuman Khandual wrote:
>> In the header file, just specify the dependency of MREMAP_FIXED
>> on MREMAP_MAYMOVE and make it explicit for the user space.
> 
> I really fail to see a point of this patch. The depency belongs to the
> code and it seems that we already enforce it
>   if (flags & MREMAP_FIXED && !(flags & MREMAP_MAYMOVE))
>   return ret;
> 
> So what is the point here?

Agreed, though I am not sure of your reasoning.

If it is to assist the programmer, there is no need, as this is clearly
specified in the man page:

"If  MREMAP_FIXED  is  specified,  then MREMAP_MAYMOVE must also be
 specified."
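
In other words, the only accepted form is something like (illustrative
fragment only):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Illustrative only: MREMAP_FIXED is rejected unless MREMAP_MAYMOVE is
 * also passed, which matches the check quoted above. */
void *move_mapping(void *old_addr, size_t len, void *new_addr)
{
    return mremap(old_addr, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, new_addr);
}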

-- 
Mike Kravetz

> 
>> Signed-off-by: Anshuman Khandual 
>> ---
>>  include/uapi/linux/mman.h | 6 --
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
>> index ade4acd..8cae3f6 100644
>> --- a/include/uapi/linux/mman.h
>> +++ b/include/uapi/linux/mman.h
>> @@ -3,8 +3,10 @@
>>  
>>  #include 
>>  
>> -#define MREMAP_MAYMOVE  1
>> -#define MREMAP_FIXED2
>> +#define MREMAP_MAYMOVE  1 /* VMA can move after remap and resize */
>> +#define MREMAP_FIXED2 /* VMA can remap at particular address */
>> +
>> +/* NOTE: MREMAP_FIXED must be set with MREMAP_MAYMOVE, not alone */
>>  
>>  #define OVERCOMMIT_GUESS0
>>  #define OVERCOMMIT_ALWAYS   1
>> -- 
>> 1.8.5.2
>>
> 


Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality

2017-07-11 Thread Mike Kravetz
On 07/11/2017 05:36 AM, Michal Hocko wrote:
> On Thu 06-07-17 09:17:26, Mike Kravetz wrote:
>> The mremap system call has the ability to 'mirror' parts of an existing
>> mapping.  To do so, it creates a new mapping that maps the same pages as
>> the original mapping, just at a different virtual address.  This
>> functionality has existed since at least the 2.6 kernel.
>>
>> This patch simply adds a new flag to mremap which will make this
>> functionality part of the API.  It maintains backward compatibility with
>> the existing way of requesting mirroring (old_size == 0).
>>
>> If this new MREMAP_MIRROR flag is specified, then new_size must equal
>> old_size.  In addition, the MREMAP_MAYMOVE flag must be specified.
> 
> I have to admit that this came as a suprise to me. There is no mention
> about this special case in the man page and the mremap code is so
> convoluted that I simply didn't see it there. I guess the only
> reasonable usecase is when you do not have a fd for the shared memory.

I was surprised as well when a JVM developer pointed this out.

From the old e-mail thread, here is original use case:
shmget(IPC_PRIVATE, 31498240, 0x1c0|0600) = 11337732
shmat(11337732, 0, 0)   = 0x40299000
shmctl(11337732, IPC_RMID, 0)   = 0
mremap(0x402a9000, 0, 65536, MREMAP_MAYMOVE|MREMAP_FIXED, 0) = 0
mremap(0x402a9000, 0, 65536, MREMAP_MAYMOVE|MREMAP_FIXED, 0x10) = 0x10

The JVM team wants to do something similar.  They are using
mmap(MAP_ANONYMOUS|MAP_SHARED) to create the initial mapping instead
of shmget/shmat.  As Vlastimil mentioned previously, one would not
expect a shared mapping for parts of the JVM heap.  I am working
to get clarification from the JVM team.
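
Roughly, in C, that old use case looks like the following (sketch; sizes
and addresses are illustrative, error handling omitted):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>

int main(void)
{
    int id = shmget(IPC_PRIVATE, 32 * 1024 * 1024, IPC_CREAT | 0600);
    char *base = shmat(id, NULL, 0);
    char *dup;

    shmctl(id, IPC_RMID, NULL);   /* memory lives until the last mapping is gone */

    /* old_size == 0: map 64K of the attached segment a second time at a
     * new, caller-chosen address; both ranges reference the same pages. */
    dup = mremap(base + 0x10000, 0, 64 * 1024,
                 MREMAP_MAYMOVE | MREMAP_FIXED, (void *)0x100000000UL);

    base[0x10000] = 'x';
    printf("dup[0]=%c\n", dup[0]);  /* 'x': same page as base + 0x10000 */
    return 0;
}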

> Anyway the patch should fail with -EINVAL on private mappings as Kirill
> already pointed out

Yes.  I think this should be a separate patch.  As mentioned earlier,
mremap today creates a new/additional private mapping if called in this
way with old_size == 0.  To me, this is a bug.

> and this should go along with an update to the
> man page which describes also the historical behavior.

Yes, man page updates are a must.

One reason for the RFC was to determine if people thought we should:
1) Just document the existing old_size == 0 functionality
2) Create a more explicit interface such as a new mremap flag for this
   functionality

I am waiting to see what direction people prefer before making any
man page updates.

>Make sure you
> document that this is not really a mirroring (e.g. faulting page in one
> address will automatically map it to the other mapping(s)) but merely a
> copy of the range. Maybe MREMAP_COPY would be more appropriate name.

Good point.  mirror is the first word that came to mind, but it does
not exactly apply.

-- 
Mike Kravetz

> 
>> Signed-off-by: Mike Kravetz 
>> ---
>>  include/uapi/linux/mman.h   |  5 +++--
>>  mm/mremap.c | 23 ---
>>  tools/include/uapi/linux/mman.h |  5 +++--
>>  3 files changed, 22 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
>> index ade4acd..6b3e0df 100644
>> --- a/include/uapi/linux/mman.h
>> +++ b/include/uapi/linux/mman.h
>> @@ -3,8 +3,9 @@
>>  
>>  #include 
>>  
>> -#define MREMAP_MAYMOVE  1
>> -#define MREMAP_FIXED2
>> +#define MREMAP_MAYMOVE  0x01
>> +#define MREMAP_FIXED0x02
>> +#define MREMAP_MIRROR   0x04
>>  
>>  #define OVERCOMMIT_GUESS0
>>  #define OVERCOMMIT_ALWAYS   1
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index cd8a1b1..f18ab36 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -516,10 +516,11 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned 
>> long, old_len,
>>  struct vm_userfaultfd_ctx uf = NULL_VM_UFFD_CTX;
>>  LIST_HEAD(uf_unmap);
>>  
>> -if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
>> +if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_MIRROR))
>>  return ret;
>>  
>> -if (flags & MREMAP_FIXED && !(flags & MREMAP_MAYMOVE))
>> +if ((flags & MREMAP_FIXED || flags & MREMAP_MIRROR) &&
>> +!(flags & MREMAP_MAYMOVE))
>>  return ret;
>>  
>>  if (offset_in_page(addr))
>> @@ -528,14 +529,22 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned 
>> long, old_len,
>>  old_len = PAGE_ALIGN(old_len);
>>  new_len = PAGE_ALIGN(new_len);
>>  
>> -/*
>> - * We allow a zero old-len a

Re: [RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality

2017-07-11 Thread Mike Kravetz
On 07/11/2017 02:02 PM, Andrea Arcangeli wrote:
> On Tue, Jul 11, 2017 at 11:23:19AM -0700, Mike Kravetz wrote:
>> I was surprised as well when a JVM developer pointed this out.
>>
>> From the old e-mail thread, here is original use case:
>> shmget(IPC_PRIVATE, 31498240, 0x1c0|0600) = 11337732
>> shmat(11337732, 0, 0)   = 0x40299000
>> shmctl(11337732, IPC_RMID, 0)   = 0
>> mremap(0x402a9000, 0, 65536, MREMAP_MAYMOVE|MREMAP_FIXED, 0) = 0
>> mremap(0x402a9000, 0, 65536, MREMAP_MAYMOVE|MREMAP_FIXED, 0x10) = 
>> 0x10
>>
>> The JVM team wants to do something similar.  They are using
>> mmap(MAP_ANONYMOUS|MAP_SHARED) to create the initial mapping instead
>> of shmget/shmat.  As Vlastimil mentioned previously, one would not
>> expect a shared mapping for parts of the JVM heap.  I am working
>> to get clarification from the JVM team.
> 
> Why don't they use memfd_create instead? That's made so that the fd is
> born anon unlinked so when the last reference is dropped all memory
> associated with it is automatically freed. No need of IPC_RMID and then
> they can use mmap instead of mremap(len=0) to get a double map of it.

Wow!  I did not even know about memfd_create until you mentioned it.
That would certainly work for 'normal' pages.
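
For reference, the memfd_create() based approach would look something like
this (sketch; assumes a memfd_create() wrapper is available, otherwise
syscall(__NR_memfd_create, ...) works):

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 2 * 1024 * 1024;
    int fd = memfd_create("jvm-heap", MFD_CLOEXEC);
    char *a, *b;

    ftruncate(fd, len);
    a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);      /* memory stays around until the last mapping is gone */

    a[0] = 'x';     /* visible at b[0] as well */
    return 0;
}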

> If they use mmap(MAP_ANONYMOUS|MAP_SHARED) it's not hugetlbfs, that
> would have been the only issue.
> 
> Using hugetlbfs for JVM wouldn't be really flexible, better they try
> to leverage THP on SHM or the hugetlbfs reservation gets in the way of
> efficient use of the unused memory for memory allocations that don't
> have a definitive size (i.e. JVM forks or more JVM are run in
> parallel).

Well, the JVM has had a config option for the use of hugetlbfs for quite
some time.  I assume they have already had to deal with these issues.

What prompted this discussion is that they want the mremap mirroring/
duplication functionality extended to support hugetlbfs.  This is pretty
straightforward.  But, I wanted to have a discussion about whether the
mremap(old_size == 0) functionality should be formally documented first.

Do note that if you actually create/mount a hugetlbfs filesystem and
use a fd in that filesystem you can get the desired functionality.  However,
they want to avoid this extra step if possible and use mmap(anon, hugetlb).

I'm guessing that if memfd_create supported hugetlbfs, that would also
meet their needs.  Any thoughts about extending memfd_create support to
hugetlbfs?  I can't think of any big issues.  In fact, 'under the covers'
there actually is a hugetlbfs file created for anon mappings.  However,
that is not exposed to the user.
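
If memfd_create did grow such support, usage might look like this (purely
illustrative; the flag name MFD_HUGETLB is only my guess here, not an
existing interface):

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/mman.h>

/* Hypothetical: create a hugetlb backed memfd for the JVM heap.
 * MFD_HUGETLB is an assumed flag name; heap_size a multiple of 2MB. */
int make_huge_heap_fd(size_t heap_size)
{
    int fd = memfd_create("jvm-heap", MFD_CLOEXEC | MFD_HUGETLB);

    if (fd >= 0 && ftruncate(fd, heap_size) != 0) {
        close(fd);
        fd = -1;
    }
    return fd;
}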

>> Yes.  I think this should be a separate patch.  As mentioned earlier,
>> mremap today creates a new/additional private mapping if called in this
>> way with old_size == 0.  To me, this is a bug.
> 
> Kernel by sheer luck should stay stable, but the result is weird and
> it's unlikely intentional.

Yes, that is why I think it is a bug.  Not that the kernel is unstable, but
rather the unintentional/unexpected result.

> memfd_create doesn't have such issue, the new mmap MAP_PRIVATE will
> get the file pages correctly after a new mmap (even if there were cows
> in the old MAP_PRIVATE mmap).
> 
>> One reason for the RFC was to determine if people thought we should:
>> 1) Just document the existing old_size == 0 functionality
>> 2) Create a more explicit interface such as a new mremap flag for this
>>functionality
>>
>> I am waiting to see what direction people prefer before making any
>> man page updates.
> 
> I guess old_size == 0 would better be dropped if possible, if
> memfd_create fits perfectly your needs as I supposed above. If it's
> not dropped then it's not very far from allowing mmap of /proc/self/mm
> again (removed around so far as 2.3.x?).

Yes, in my google'ing it appears the first users of mremap(old_size == 0)
previously used mmap of /proc/self/mm.

If memfd_create can be extended to support hugetlbfs, then I might suggest
dropping the mremap(old_size == 0) support.  Just a thought.

-- 
Mike Kravetz


[RFC PATCH 0/1] mm/mremap: add MREMAP_MIRROR flag

2017-07-06 Thread Mike Kravetz
The mremap system call has the ability to 'mirror' parts of an existing
mapping.  To do so, it creates a new mapping that maps the same pages as
the original mapping, just at a different virtual address.  This
functionality has existed since at least the 2.6 kernel [1].  A comment
was added to the code to help preserve this feature.

The Oracle JVM team has discovered this feature and used it while
prototyping a new garbage collection model.  This new model shows promise,
and they are considering its use in a future release.  However, since
the only mention of this functionality is a single comment in the kernel,
they are concerned about its future.

I propose the addition of a new MREMAP_MIRROR flag to explicitly request
this functionality.  The flag simply provides the same functionality as
the existing undocumented 'old_size == 0' interface.  As an alternative,
we could simply document the 'old_size == 0' interface in the man page.
In either case, man page modifications would be needed.
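
For illustration, the proposed flag next to the existing interface (sketch
only; MREMAP_MIRROR is the flag proposed by this series):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Illustration only: MREMAP_MIRROR is the new flag proposed here. */
void *mirror_mapping(void *addr, size_t len)
{
#ifdef MREMAP_MIRROR
    /* proposed explicit interface: new_size == old_size, MAYMOVE required */
    return mremap(addr, len, len, MREMAP_MAYMOVE | MREMAP_MIRROR);
#else
    /* existing, undocumented interface: old_size == 0 */
    return mremap(addr, 0, len, MREMAP_MAYMOVE);
#endif
}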

Future Direction

After more formally adding this to the API (either new flag or documenting
existing interface), the mremap code could be enhanced to optimize this
case.  Currently, 'mirroring' only sets up the new mapping.  It does not
create page table entries for new mapping.  This could be added as an
enhancement.

The JVM today has the option of using (static) huge pages.  The mremap
system call does not fully support huge page mappings today.  You can
use mremap to shrink the size of a huge page mapping, but it can not be
used to expand or mirror a mapping.  Such support is fairly
straightforward.

[1] https://lkml.org/lkml/2004/1/12/260

Mike Kravetz (1):
  mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality

 include/uapi/linux/mman.h   |  5 +++--
 mm/mremap.c | 23 ---
 tools/include/uapi/linux/mman.h |  5 +++--
 3 files changed, 22 insertions(+), 11 deletions(-)

-- 
2.7.5



[RFC PATCH 1/1] mm/mremap: add MREMAP_MIRROR flag for existing mirroring functionality

2017-07-06 Thread Mike Kravetz
The mremap system call has the ability to 'mirror' parts of an existing
mapping.  To do so, it creates a new mapping that maps the same pages as
the original mapping, just at a different virtual address.  This
functionality has existed since at least the 2.6 kernel.

This patch simply adds a new flag to mremap which will make this
functionality part of the API.  It maintains backward compatibility with
the existing way of requesting mirroring (old_size == 0).

If this new MREMAP_MIRROR flag is specified, then new_size must equal
old_size.  In addition, the MREMAP_MAYMOVE flag must be specified.

Signed-off-by: Mike Kravetz 
---
 include/uapi/linux/mman.h   |  5 +++--
 mm/mremap.c | 23 ---
 tools/include/uapi/linux/mman.h |  5 +++--
 3 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index ade4acd..6b3e0df 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -3,8 +3,9 @@
 
 #include 
 
-#define MREMAP_MAYMOVE 1
-#define MREMAP_FIXED   2
+#define MREMAP_MAYMOVE 0x01
+#define MREMAP_FIXED   0x02
+#define MREMAP_MIRROR  0x04
 
 #define OVERCOMMIT_GUESS   0
 #define OVERCOMMIT_ALWAYS  1
diff --git a/mm/mremap.c b/mm/mremap.c
index cd8a1b1..f18ab36 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -516,10 +516,11 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned 
long, old_len,
struct vm_userfaultfd_ctx uf = NULL_VM_UFFD_CTX;
LIST_HEAD(uf_unmap);
 
-   if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
+   if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_MIRROR))
return ret;
 
-   if (flags & MREMAP_FIXED && !(flags & MREMAP_MAYMOVE))
+   if ((flags & MREMAP_FIXED || flags & MREMAP_MIRROR) &&
+   !(flags & MREMAP_MAYMOVE))
return ret;
 
if (offset_in_page(addr))
@@ -528,14 +529,22 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned 
long, old_len,
old_len = PAGE_ALIGN(old_len);
new_len = PAGE_ALIGN(new_len);
 
-   /*
-* We allow a zero old-len as a special case
-* for DOS-emu "duplicate shm area" thing. But
-* a zero new-len is nonsensical.
-*/
+   /* A zero new-len is nonsensical. */
if (!new_len)
return ret;
 
+   /*
+* For backward compatibility, we allow a zero old-len to imply
+* mirroring.  This was originally a special case for DOS-emu.
+*/
+   if (!old_len)
+   flags |= MREMAP_MIRROR;
+   else if (flags & MREMAP_MIRROR) {
+   if (old_len != new_len)
+   return ret;
+   old_len = 0;
+   }
+
if (down_write_killable(&current->mm->mmap_sem))
return -EINTR;
 
diff --git a/tools/include/uapi/linux/mman.h b/tools/include/uapi/linux/mman.h
index 81d8edf..069f7a5 100644
--- a/tools/include/uapi/linux/mman.h
+++ b/tools/include/uapi/linux/mman.h
@@ -3,8 +3,9 @@
 
 #include 
 
-#define MREMAP_MAYMOVE 1
-#define MREMAP_FIXED   2
+#define MREMAP_MAYMOVE 0x01
+#define MREMAP_FIXED   0x02
+#define MREMAP_MIRROR  0x04
 
 #define OVERCOMMIT_GUESS   0
 #define OVERCOMMIT_ALWAYS  1
-- 
2.7.5



Re: [RFC PATCH 0/1] mm/mremap: add MREMAP_MIRROR flag

2017-07-07 Thread Mike Kravetz
On 07/07/2017 01:19 AM, Anshuman Khandual wrote:
> On 07/06/2017 09:47 PM, Mike Kravetz wrote:
>> The mremap system call has the ability to 'mirror' parts of an existing
>> mapping.  To do so, it creates a new mapping that maps the same pages as
>> the original mapping, just at a different virtual address.  This
>> functionality has existed since at least the 2.6 kernel [1].  A comment
>> was added to the code to help preserve this feature.
> 
> 
> Is this the comment ? If yes, then its not very clear.
> 
>   /*
>* We allow a zero old-len as a special case
>* for DOS-emu "duplicate shm area" thing. But
>* a zero new-len is nonsensical.
>*/
> 

Yes, I believe that is the comment.

>>
>> The Oracle JVM team has discovered this feature and used it while
>> prototyping a new garbage collection model.  This new model shows promise,
>> and they are considering its use in a future release.  However, since
>> the only mention of this functionality is a single comment in the kernel,
>> they are concerned about its future.
>>
>> I propose the addition of a new MREMAP_MIRROR flag to explicitly request
>> this functionality.  The flag simply provides the same functionality as
>> the existing undocumented 'old_size == 0' interface.  As an alternative,
>> we could simply document the 'old_size == 0' interface in the man page.
>> In either case, man page modifications would be needed.
> 
> Right. Adding MREMAP_MIRROR sounds cleaner from application programming
> point of view. But it extends the interface.

Yes.  That is the reason for the RFC.  We currently have functionality
that is not clearly part of a programming interface.  Application programmers
do not like to depend on something that is not part of an interface.

>>
>> Future Direction
>>
>> After more formally adding this to the API (either new flag or documenting
>> existing interface), the mremap code could be enhanced to optimize this
>> case.  Currently, 'mirroring' only sets up the new mapping.  It does not
>> create page table entries for the new mapping.  This could be added as an
>> enhancement.
> 
> Then how it achieves mirroring, both the pointers should see the same
> data, that can happen with page table entries pointing to same pages,
> right ?

Correct.

In the code today, page tables for the new (mirrored) mapping are created
as needed via faults.  The enhancement would be to create page table entries
for the new mapping.

-- 
Mike Kravetz


[RFC PATCH 1/3] mm:hugetlb: Define system call hugetlb size encodings in single file

2017-07-17 Thread Mike Kravetz
If hugetlb pages are requested in mmap or shmget system calls, a huge
page size other than the default can be requested.  This is accomplished by
encoding the log2 of the huge page size in the upper bits of the flag
argument.  asm-generic and arch specific headers all define the same
values for these encodings.

Put common definitions in a single header file.  Arch specific code can
still override if desired.
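
For reference, this is how the encoding is used from user space with mmap
(illustrative sketch; the local MAP_HUGE_* fallback defines simply mirror
the existing uapi values):

#define _GNU_SOURCE
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT  26                      /* bits [26:31] of the flags */
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB    (30 << MAP_HUGE_SHIFT)  /* log2(1GB) == 30 */
#endif

/* Request a 1GB hugetlb mapping instead of the default huge page size. */
void *map_one_1gb_page(void)
{
    return mmap(NULL, 1UL << 30, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                -1, 0);
}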

Signed-off-by: Mike Kravetz 
---
 include/uapi/asm-generic/hugetlb_encode.h | 30 ++
 1 file changed, 30 insertions(+)
 create mode 100644 include/uapi/asm-generic/hugetlb_encode.h

diff --git a/include/uapi/asm-generic/hugetlb_encode.h 
b/include/uapi/asm-generic/hugetlb_encode.h
new file mode 100644
index 000..aa09fc0
--- /dev/null
+++ b/include/uapi/asm-generic/hugetlb_encode.h
@@ -0,0 +1,30 @@
+#ifndef _ASM_GENERIC_HUGETLB_ENCODE_H_
+#define _ASM_GENERIC_HUGETLB_ENCODE_H_
+
+/*
+ * Several system calls take a flag to request "hugetlb" huge pages.
+ * Without further specification, these system calls will use the
+ * system's default huge page size.  If a system supports multiple
+ * huge page sizes, the desired huge page size can be specified in
+ * bits [26:31] of the flag arguments.  The value in these 6 bits
+ * will encode the log2 of the huge page size.
+ *
+ * The following definitions are associated with this huge page size
+ * encoding in flag arguments.  System call specific header files
+ * that use this encoding should include this file.  They can then
+ * provide definitions based on these with their own specific prefix,
+ * for example: #define MAP_HUGE_SHIFT HUGETLB_FLAG_ENCODE_SHIFT.
+ */
+
+#define HUGETLB_FLAG_ENCODE_SHIFT  26
+#define HUGETLB_FLAG_ENCODE_MASK   0x3f
+
+#define HUGETLB_FLAG_ENCODE_512KB  (19 << MAP_HUGE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_1MB(20 << MAP_HUGE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_2MB(21 << MAP_HUGE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_8MB(23 << MAP_HUGE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_16MB   (24 << MAP_HUGE_SHIFT)
+#define HUGETLB_FLAG_ENCODE_1GB(30 << MAP_HUGE_SHIFT)
+#define HUGETLB_FLAG_ENCODE__16GB  (34 << MAP_HUGE_SHIFT)
+
+#endif /* _ASM_GENERIC_HUGETLB_ENCODE_H_ */
-- 
2.7.5



Re: [PATCH] mm,hugetlb: compute page_size_log properly

2017-07-17 Thread Mike Kravetz
I hate to resurrect this thread, but I would like to add hugetlb support
to memfd_create.  This is for JVM garbage collection as discussed in
this thread [1].

Adding hugetlb support to memfd_create means that memfd_create will take
a flag something like MFD_HUGETLB.  And, if a user wants hugetlb pages
they may want a huge page size different than the system default.  So, it
makes sense to use the same type of encoding used by mmap and shmget.
However, I would hate to copy/paste the same values used by mmap and shmget
and just give them different names.  So, how about something like the
following:

1) Put all the log2 encoded huge page size definitions in a common header
   file.
2) Arch specific code can use these values, or overwrite as needed.
3) All system calls using this encoding (mmap, shmget and memfd_create in
   the future) will use these common values.

I have also put the shm user space definitions in the uapi file as
previously suggested by Matthew Wilcox.  I did not (yet) move the
shm definitions to arch specific files as suggested by Aneesh Kumar.

[1] https://lkml.org/lkml/2017/7/6/564

Mike Kravetz (3):
  mm:hugetlb:  Define system call hugetlb size encodings in single file
  mm: arch: Use new hugetlb size encoding definitions
  mm: shm: Use new hugetlb size encoding definitions

 arch/alpha/include/uapi/asm/mman.h| 14 ++
 arch/mips/include/uapi/asm/mman.h | 14 ++
 arch/parisc/include/uapi/asm/mman.h   | 14 ++
 arch/powerpc/include/uapi/asm/mman.h  | 23 ++-
 arch/x86/include/uapi/asm/mman.h  | 10 --
 arch/xtensa/include/uapi/asm/mman.h   | 14 ++
 include/linux/shm.h   | 17 -
 include/uapi/asm-generic/hugetlb_encode.h | 30 ++
 include/uapi/asm-generic/mman-common.h|  6 --
 include/uapi/linux/shm.h  | 23 +--
 10 files changed, 97 insertions(+), 68 deletions(-)
 create mode 100644 include/uapi/asm-generic/hugetlb_encode.h

-- 
2.7.5



[RFC PATCH 2/3] mm: arch: Use new hugetlb size encoding definitions

2017-07-17 Thread Mike Kravetz
Use the common definitions from hugetlb_encode.h header file for
encoding hugetlb size definitions in mmap system call flags.

Signed-off-by: Mike Kravetz 
---
 arch/alpha/include/uapi/asm/mman.h | 14 ++
 arch/mips/include/uapi/asm/mman.h  | 14 ++
 arch/parisc/include/uapi/asm/mman.h| 14 ++
 arch/powerpc/include/uapi/asm/mman.h   | 23 ++-
 arch/x86/include/uapi/asm/mman.h   | 10 --
 arch/xtensa/include/uapi/asm/mman.h| 14 ++
 include/uapi/asm-generic/mman-common.h |  6 --
 7 files changed, 46 insertions(+), 49 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h 
b/arch/alpha/include/uapi/asm/mman.h
index 02760f6..bfa5828 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -1,6 +1,8 @@
 #ifndef __ALPHA_MMAN_H__
 #define __ALPHA_MMAN_H__
 
+#include 
+
 #define PROT_READ  0x1 /* page can be read */
 #define PROT_WRITE 0x2 /* page can be written */
 #define PROT_EXEC  0x4 /* page can be executed */
@@ -68,15 +70,11 @@
 #define MAP_FILE   0
 
 /*
- * When MAP_HUGETLB is set bits [26:31] encode the log2 of the huge page size.
- * This gives us 6 bits, which is enough until someone invents 128 bit address
- * spaces.
- *
- * Assume these are all power of twos.
- * When 0 use the default page size.
+ * Huge page size encoding when MAP_HUGETLB is specified, and a huge page
+ * size other than the default is desired.  See hugetlb_encode.h
  */
-#define MAP_HUGE_SHIFT 26
-#define MAP_HUGE_MASK  0x3f
+#define MAP_HUGE_SHIFT HUGETLB_FLAG_ENCODE_SHIFT
+#define MAP_HUGE_MASK  HUGETLB_FLAG_ENCODE_MASK
 
#define PKEY_DISABLE_ACCESS    0x1
 #define PKEY_DISABLE_WRITE 0x2
diff --git a/arch/mips/include/uapi/asm/mman.h 
b/arch/mips/include/uapi/asm/mman.h
index 655e2fb..9e55284 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -8,6 +8,8 @@
 #ifndef _ASM_MMAN_H
 #define _ASM_MMAN_H
 
+#include 
+
 /*
  * Protections are chosen from these bits, OR'd together.  The
  * implementation does not necessarily support PROT_EXEC or PROT_WRITE
@@ -95,15 +97,11 @@
 #define MAP_FILE   0
 
 /*
- * When MAP_HUGETLB is set bits [26:31] encode the log2 of the huge page size.
- * This gives us 6 bits, which is enough until someone invents 128 bit address
- * spaces.
- *
- * Assume these are all power of twos.
- * When 0 use the default page size.
+ * Huge page size encoding when MAP_HUGETLB is specified, and a huge page
+ * size other than the default is desired.  See hugetlb_encode.h
  */
-#define MAP_HUGE_SHIFT 26
-#define MAP_HUGE_MASK  0x3f
+#define MAP_HUGE_SHIFT HUGETLB_FLAG_ENCODE_SHIFT
+#define MAP_HUGE_MASK  HUGETLB_FLAG_ENCODE_MASK
 
#define PKEY_DISABLE_ACCESS    0x1
 #define PKEY_DISABLE_WRITE 0x2
diff --git a/arch/parisc/include/uapi/asm/mman.h 
b/arch/parisc/include/uapi/asm/mman.h
index 5979745..11c6d86 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -1,6 +1,8 @@
 #ifndef __PARISC_MMAN_H__
 #define __PARISC_MMAN_H__
 
+#include 
+
 #define PROT_READ  0x1 /* page can be read */
 #define PROT_WRITE 0x2 /* page can be written */
 #define PROT_EXEC  0x4 /* page can be executed */
@@ -65,15 +67,11 @@
 #define MAP_VARIABLE   0
 
 /*
- * When MAP_HUGETLB is set bits [26:31] encode the log2 of the huge page size.
- * This gives us 6 bits, which is enough until someone invents 128 bit address
- * spaces.
- *
- * Assume these are all power of twos.
- * When 0 use the default page size.
+ * Huge page size encoding when MAP_HUGETLB is specified, and a huge page
+ * size other than the default is desired.  See hugetlb_encode.h
  */
-#define MAP_HUGE_SHIFT 26
-#define MAP_HUGE_MASK  0x3f
+#define MAP_HUGE_SHIFT HUGETLB_FLAG_ENCODE_SHIFT
+#define MAP_HUGE_MASK  HUGETLB_FLAG_ENCODE_MASK
 
#define PKEY_DISABLE_ACCESS    0x1
 #define PKEY_DISABLE_WRITE 0x2
diff --git a/arch/powerpc/include/uapi/asm/mman.h 
b/arch/powerpc/include/uapi/asm/mman.h
index ab45cc2..80fd56c 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -8,6 +8,7 @@
 #define _UAPI_ASM_POWERPC_MMAN_H
 
 #include 
+#include 
 
 
 #define PROT_SAO   0x10/* Strong Access Ordering */
@@ -30,19 +31,15 @@
 #define MAP_HUGETLB0x4 /* create a huge page mapping */
 
 /*
- * When MAP_HUGETLB is set, bits [26:31] of the flags argument to mmap(2),
- * encode the log2 of the huge page size. A value of zero indicates that the
- * default huge page size should be used. To use a non-default huge page size,
- * one of these defines can be used, or the size can be encoded by hand. Note
- * that on most systems only a subset, or possibly none, of these sizes will be
- * available.
+ * Huge page size encoding when MAP_HUGETLB is specified, and a huge page
+ * size other than

[RFC PATCH 3/3] mm: shm: Use new hugetlb size encoding definitions

2017-07-17 Thread Mike Kravetz
Use the common definitions from hugetlb_encode.h header file for
encoding hugetlb size definitions in shmget system call flags.  In
addition, move these definitions from the internal header file to the
user space (uapi) header file.

Suggested-by: Matthew Wilcox 
Signed-off-by: Mike Kravetz 
---
 include/linux/shm.h  | 17 -
 include/uapi/linux/shm.h | 23 +--
 2 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/include/linux/shm.h b/include/linux/shm.h
index 04e8818..d56285a 100644
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -27,23 +27,6 @@ struct shmid_kernel /* private to the kernel */
 /* shm_mode upper byte flags */
#define SHM_DEST    01000   /* segment will be destroyed on last detach */
 #define SHM_LOCKED  02000   /* segment will not be swapped */
-#define SHM_HUGETLB 04000   /* segment will use huge TLB pages */
-#define SHM_NORESERVE   01  /* don't check for reservations */
-
-/* Bits [26:31] are reserved */
-
-/*
- * When SHM_HUGETLB is set bits [26:31] encode the log2 of the huge page size.
- * This gives us 6 bits, which is enough until someone invents 128 bit address
- * spaces.
- *
- * Assume these are all power of twos.
- * When 0 use the default page size.
- */
-#define SHM_HUGE_SHIFT  26
-#define SHM_HUGE_MASK   0x3f
-#define SHM_HUGE_2MB    (21 << SHM_HUGE_SHIFT)
-#define SHM_HUGE_1GB    (30 << SHM_HUGE_SHIFT)
 
 #ifdef CONFIG_SYSVIPC
 struct sysv_shm {
diff --git a/include/uapi/linux/shm.h b/include/uapi/linux/shm.h
index 1fbf24e..329bc17 100644
--- a/include/uapi/linux/shm.h
+++ b/include/uapi/linux/shm.h
@@ -3,6 +3,7 @@
 
 #include 
 #include 
+#include 
 #ifndef __KERNEL__
 #include 
 #endif
@@ -40,11 +41,29 @@ struct shmid_ds {
 /* Include the definition of shmid64_ds and shminfo64 */
 #include 
 
-/* permission flag for shmget */
+/* shmget() shmflg values. */
+/* The bottom nine bits are the same as open(2) mode flags */
#define SHM_R      0400    /* or S_IRUGO from  */
#define SHM_W      0200    /* or S_IWUGO from  */
+/* Bits 9 & 10 are IPC_CREAT and IPC_EXCL */
+#define SHM_HUGETLB    04000   /* segment will use huge TLB pages */
+#define SHM_NORESERVE  01      /* don't check for reservations */
 
-/* mode for attach */
+/*
+ * Huge page size encoding when SHM_HUGETLB is specified, and a huge page
+ * size other than the default is desired.  See hugetlb_encode.h
+ */
+#define SHM_HUGE_SHIFT HUGETLB_FLAG_ENCODE_SHIFT
+#define SHM_HUGE_MASK  HUGETLB_FLAG_ENCODE_MASK
+#define SHM_HUGE_512KB HUGETLB_FLAG_ENCODE_512KB
+#define SHM_HUGE_1MB   HUGETLB_FLAG_ENCODE_1MB
+#define SHM_HUGE_2MB   HUGETLB_FLAG_ENCODE_2MB
+#define SHM_HUGE_8MB   HUGETLB_FLAG_ENCODE_8MB
+#define SHM_HUGE_16MB  HUGETLB_FLAG_ENCODE_16MB
+#define SHM_HUGE_1GB   HUGETLB_FLAG_ENCODE_1GB
+#define SHM_HUGE_16GB  HUGETLB_FLAG_ENCODE__16GB
+
+/* shmat() shmflg values */
#define SHM_RDONLY  01  /* read-only access */
#define SHM_RND     02  /* round attach address to SHMLBA boundary */
#define SHM_REMAP   04  /* take-over region on attach */
-- 
2.7.5



Re: [PATCH] mm/mremap: Fail map duplication attempts for private mappings

2017-07-19 Thread Mike Kravetz
On 07/13/2017 12:11 PM, Vlastimil Babka wrote:
> [+CC linux-api]
> 
> On 07/13/2017 05:58 PM, Mike Kravetz wrote:
>> mremap will create a 'duplicate' mapping if old_size == 0 is
>> specified.  Such duplicate mappings make no sense for private
>> mappings.  If duplication is attempted for a private mapping,
>> mremap creates a separate private mapping unrelated to the
>> original mapping and makes no modifications to the original.
>> This is contrary to the purpose of mremap which should return
>> a mapping which is in some way related to the original.
>>
>> Therefore, return EINVAL in the case where an attempt is
>> made to duplicate a private mapping.
>>
>> Signed-off-by: Mike Kravetz 
> 
> Acked-by: Vlastimil Babka 

After considering Michal's concerns with follow on patch, it appears
this patch provides the most desired behavior.  Any other concerns
or issues with this patch?

If this moves forward, I will create man page updates to describe the
mremap(old_size == 0) behavior.

-- 
Mike Kravetz

> 
>> ---
>>  mm/mremap.c | 7 +++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index cd8a1b1..076f506 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -383,6 +383,13 @@ static struct vm_area_struct *vma_to_resize(unsigned 
>> long addr,
>>  if (!vma || vma->vm_start > addr)
>>  return ERR_PTR(-EFAULT);
>>  
>> +/*
>> + * !old_len  is a special case where a mapping is 'duplicated'.
>> + * Do not allow this for private mappings.
>> + */
>> +if (!old_len && !(vma->vm_flags & (VM_SHARED | VM_MAYSHARE)))
>> +return ERR_PTR(-EINVAL);
>> +
>>  if (is_vm_hugetlb_page(vma))
>>  return ERR_PTR(-EINVAL);
>>  
>>
> 


[PATCH v2] mm/mremap: Fail map duplication attempts for private mappings

2017-07-20 Thread Mike Kravetz
mremap will create a 'duplicate' mapping if old_size == 0 is
specified.  Such duplicate mappings make no sense for private
mappings.  If duplication is attempted for a private mapping,
mremap creates a separate private mapping unrelated to the
original mapping and makes no modifications to the original.
This is contrary to the purpose of mremap which should return
a mapping which is in some way related to the original.

Therefore, return EINVAL in the case where an attempt is
made to duplicate a private mapping.  Also, print a warning
message (once) if such an attempt is made.

Signed-off-by: Mike Kravetz 
---
 mm/mremap.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/mm/mremap.c b/mm/mremap.c
index cd8a1b1..949f6a7 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -383,6 +383,15 @@ static struct vm_area_struct *vma_to_resize(unsigned long 
addr,
if (!vma || vma->vm_start > addr)
return ERR_PTR(-EFAULT);
 
+   /*
+* !old_len  is a special case where a mapping is 'duplicated'.
+* Do not allow this for private mappings.
+*/
+   if (!old_len && !(vma->vm_flags & (VM_SHARED | VM_MAYSHARE))) {
+   pr_warn_once("%s (%d): attempted to duplicate a private mapping 
with mremap.  This is not supported.\n", current->comm, current->pid);
+   return ERR_PTR(-EINVAL);
+   }
+
if (is_vm_hugetlb_page(vma))
return ERR_PTR(-EINVAL);
 
-- 
2.7.5



Re: [PATCH] selftests/vm: Add test to validate mirror functionality with mremap

2017-07-20 Thread Mike Kravetz
On 07/20/2017 02:36 AM, Anshuman Khandual wrote:
> This adds a test to validate mirror functionality with mremap()
> system call on shared anon mappings.
> 
> Suggested-by: Mike Kravetz 
> Signed-off-by: Anshuman Khandual 
> ---
>  tools/testing/selftests/vm/Makefile|  1 +
>  .../selftests/vm/mremap_mirror_shared_anon.c   | 54 
> ++

This may be a better fit in LTP where there are already several other
mremap tests.  I honestly do not know the best place for such a test.

>  2 files changed, 55 insertions(+)
>  create mode 100644 tools/testing/selftests/vm/mremap_mirror_shared_anon.c
> 
> diff --git a/tools/testing/selftests/vm/Makefile 
> b/tools/testing/selftests/vm/Makefile
> index cbb29e4..11657ff5 100644
> --- a/tools/testing/selftests/vm/Makefile
> +++ b/tools/testing/selftests/vm/Makefile
> @@ -17,6 +17,7 @@ TEST_GEN_FILES += transhuge-stress
>  TEST_GEN_FILES += userfaultfd
>  TEST_GEN_FILES += mlock-random-test
>  TEST_GEN_FILES += virtual_address_range
> +TEST_GEN_FILES += mremap_mirror_shared_anon
>  
>  TEST_PROGS := run_vmtests
>  
> diff --git a/tools/testing/selftests/vm/mremap_mirror_shared_anon.c 
> b/tools/testing/selftests/vm/mremap_mirror_shared_anon.c
> new file mode 100644
> index 000..b0adbb2
> --- /dev/null
> +++ b/tools/testing/selftests/vm/mremap_mirror_shared_anon.c
> @@ -0,0 +1,54 @@
> +/*
> + * Test to verify mirror functionality with mremap() system
> + * call for shared anon mappings.
> + *
> + * Copyright (C) 2017 Anshuman Khandual, IBM Corporation
> + *
> + * Licensed under GPL V2
> + */
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define PATTERN  0xbe
> +#define ALLOC_SIZE   0x1UL /* Works for 64K and 4K pages */

Why hardcode?  You could use sysconf to get page size and use some
multiple of that.

> +
> +int test_mirror(char *old, char *new, unsigned long size)
> +{
> + unsigned long i;
> +
> + for (i = 0; i < size; i++) {
> + if (new[i] != old[i]) {
> + printf("Mismatch at new[%lu] expected "
> + "%d received %d\n", i, old[i], new[i]);
> + return 1;
> + }
> + }
> + return 0;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> + char *ptr, *mirror_ptr;
> +
> + ptr = mmap(NULL, ALLOC_SIZE, PROT_READ | PROT_WRITE,
> + MAP_SHARED | MAP_ANONYMOUS, -1, 0);
> + if (ptr == MAP_FAILED) {
> + perror("map() failed");
> + return -1;
> + }
> + memset(ptr, PATTERN, ALLOC_SIZE);
> +
> + mirror_ptr =  (char *) mremap(ptr, 0, ALLOC_SIZE, 1);

Why hardcode 1?  You really want the MREMAP_MAYMOVE flag.  Right?

> + if (mirror_ptr == MAP_FAILED) {
> + perror("mremap() failed");
> + return -1;
> + }
> +
> + if (test_mirror(ptr, mirror_ptr, ALLOC_SIZE))
> + return 1;
> + return 0;
> +}

You may want to expand the test to make sure mremap(old_size == 0)
fails for private mappings.  Of course, this assumes my proposed
patch gets in.  Until then, it will succeed and create a new unrelated
mapping.
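
Something along these lines is what I had in mind (sketch only, untested,
first-mmap error checks omitted):

#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    size_t size = 16 * sysconf(_SC_PAGESIZE);   /* no hardcoded page size */
    char *ptr, *mirror, *priv;

    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    memset(ptr, 0xbe, size);

    /* spell out the flag instead of the magic '1' */
    mirror = mremap(ptr, 0, size, MREMAP_MAYMOVE);
    if (mirror == MAP_FAILED || memcmp(ptr, mirror, size))
        return 1;

    /* with the proposed patch, duplicating a private mapping must fail */
    priv = mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mremap(priv, 0, size, MREMAP_MAYMOVE) != MAP_FAILED)
        return 1;

    return 0;
}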

-- 
Mike Kravetz


Re: [PATCH v2] mm/mremap: Fail map duplication attempts for private mappings

2017-07-21 Thread Mike Kravetz
On 07/21/2017 07:36 AM, Michal Hocko wrote:
> On Thu 20-07-17 13:37:59, Mike Kravetz wrote:
>> mremap will create a 'duplicate' mapping if old_size == 0 is
>> specified.  Such duplicate mappings make no sense for private
>> mappings.
> 
> sorry for the nit picking but this is not true strictly speaking.
> It makes some sense, arguably (e.g. take an atomic snapshot of the
> mapping). It doesn't make any sense with the _current_ implementation.
> 
>> If duplication is attempted for a private mapping,
>> mremap creates a separate private mapping unrelated to the
>> original mapping and makes no modifications to the original.
>> This is contrary to the purpose of mremap which should return
>> a mapping which is in some way related to the original.
>>
>> Therefore, return EINVAL in the case where an attempt is
>> made to duplicate a private mapping.  Also, print a warning
>> message (once) if such an attempt is made.
>>
>> Signed-off-by: Mike Kravetz 
> 
> I do not insist on the comment update suggested
> http://lkml.kernel.org/r/20170720082058.gf9...@dhcp22.suse.cz
> but I would appreciate it...
> 
> Other than that looks reasonably to me
> 
> Acked-by: Michal Hocko 

My apologies.  I overlooked your comment about the comment when
creating the patch.  Below is the patch with commit message and
comment updated.

From 5c4a1602bd6a942544ed011dc0a72fd258e874b2 Mon Sep 17 00:00:00 2001
From: Mike Kravetz 
Date: Wed, 12 Jul 2017 13:52:47 -0700
Subject: [PATCH] mm/mremap: Fail map duplication attempts for private mappings

mremap will attempt to create a 'duplicate' mapping if old_size
== 0 is specified.  In the case of private mappings, mremap
will actually create a fresh separate private mapping unrelated
to the original.  This does not fit with the design semantics of
mremap as the intention is to create a new mapping based on the
original.

Therefore, return EINVAL in the case where an attempt is made
to duplicate a private mapping.  Also, print a warning message
(once) if such an attempt is made.

Signed-off-by: Mike Kravetz 
---
 mm/mremap.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/mm/mremap.c b/mm/mremap.c
index cd8a1b1..75b167d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -383,6 +383,19 @@ static struct vm_area_struct *vma_to_resize(unsigned long 
addr,
if (!vma || vma->vm_start > addr)
return ERR_PTR(-EFAULT);
 
+   /*
+* !old_len is a special case where an attempt is made to 'duplicate'
+* a mapping.  This makes no sense for private mappings as it will
+* instead create a fresh/new mapping unrelated to the original.  This
+* is contrary to the basic idea of mremap which creates new mappings
+* based on the original.  There are no known use cases for this
+* behavior.  As a result, fail such attempts.
+*/
+   if (!old_len && !(vma->vm_flags & (VM_SHARED | VM_MAYSHARE))) {
+   pr_warn_once("%s (%d): attempted to duplicate a private mapping 
with mremap.  This is not supported.\n", current->comm, current->pid);
+   return ERR_PTR(-EINVAL);
+   }
+
if (is_vm_hugetlb_page(vma))
return ERR_PTR(-EINVAL);
 
-- 
2.7.5



[PATCH] sparc64: mm: fix copy_tsb to correctly copy huge page TSBs

2017-06-02 Thread Mike Kravetz
When a TSB grows beyond its current capacity, a new TSB is allocated
and copy_tsb is called to copy entries from the old TSB to the new.
A hash shift based on page size is used to calculate the index of an
entry in the TSB.  copy_tsb has hard coded PAGE_SHIFT in these
calculations.  However, for huge page TSBs the value REAL_HPAGE_SHIFT
should be used.  As a result, when copy_tsb is called for a huge page
TSB the entries are placed at the incorrect index in the newly
allocated TSB.  When doing a hardware table walk, the MMU does not
match these entries and we end up in the TSB miss handling code.
This code will then create and write an entry to the correct index
in the TSB.  We take a performance hit for the table walk miss and
recreation of these entries.

Pass a new parameter to copy_tsb that is the page size shift to be
used when copying the TSB.

Suggested-by: Anthony Yznaga 
Signed-off-by: Mike Kravetz 
---
 arch/sparc/kernel/tsb.S | 11 +++
 arch/sparc/mm/tsb.c |  7 +--
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/sparc/kernel/tsb.S b/arch/sparc/kernel/tsb.S
index 10689cf..07c0df9 100644
--- a/arch/sparc/kernel/tsb.S
+++ b/arch/sparc/kernel/tsb.S
@@ -455,13 +455,16 @@ __tsb_context_switch:
.type   copy_tsb,#function
 copy_tsb:  /* %o0=old_tsb_base, %o1=old_tsb_size
 * %o2=new_tsb_base, %o3=new_tsb_size
+* %o4=page_size_shift
 */
sethi   %uhi(TSB_PASS_BITS), %g7
srlx%o3, 4, %o3
-   add %o0, %o1, %g1   /* end of old tsb */
+   add %o0, %o1, %o1   /* end of old tsb */
sllx%g7, 32, %g7
sub %o3, 1, %o3 /* %o3 == new tsb hash mask */
 
+   mov %o4, %g1/* page_size_shift */
+
 661:   prefetcha   [%o0] ASI_N, #one_read
.section.tsb_phys_patch, "ax"
.word   661b
@@ -486,9 +489,9 @@ copy_tsb:   /* %o0=old_tsb_base, %o1=old_tsb_size
/* This can definitely be computed faster... */
srlx%o0, 4, %o5 /* Build index */
and %o5, 511, %o5   /* Mask index */
-   sllx%o5, PAGE_SHIFT, %o5 /* Put into vaddr position */
+   sllx%o5, %g1, %o5   /* Put into vaddr position */
or  %o4, %o5, %o4   /* Full VADDR. */
-   srlx%o4, PAGE_SHIFT, %o4 /* Shift down to create index */
+   srlx%o4, %g1, %o4   /* Shift down to create index */
and %o4, %o3, %o4   /* Mask with new_tsb_nents-1 */
sllx%o4, 4, %o4 /* Shift back up into tsb ent offset */
TSB_STORE(%o2 + %o4, %g2)   /* Store TAG */
@@ -496,7 +499,7 @@ copy_tsb:   /* %o0=old_tsb_base, %o1=old_tsb_size
TSB_STORE(%o2 + %o4, %g3)   /* Store TTE */
 
 80:add %o0, 16, %o0
-   cmp %o0, %g1
+   cmp %o0, %o1
bne,pt  %xcc, 90b
 nop
 
diff --git a/arch/sparc/mm/tsb.c b/arch/sparc/mm/tsb.c
index 0a04811..8d802a2 100644
--- a/arch/sparc/mm/tsb.c
+++ b/arch/sparc/mm/tsb.c
@@ -496,7 +496,8 @@ void tsb_grow(struct mm_struct *mm, unsigned long 
tsb_index, unsigned long rss)
extern void copy_tsb(unsigned long old_tsb_base,
 unsigned long old_tsb_size,
 unsigned long new_tsb_base,
-unsigned long new_tsb_size);
+unsigned long new_tsb_size,
+unsigned long page_size_shift);
unsigned long old_tsb_base = (unsigned long) old_tsb;
unsigned long new_tsb_base = (unsigned long) new_tsb;
 
@@ -504,7 +505,9 @@ void tsb_grow(struct mm_struct *mm, unsigned long 
tsb_index, unsigned long rss)
old_tsb_base = __pa(old_tsb_base);
new_tsb_base = __pa(new_tsb_base);
}
-   copy_tsb(old_tsb_base, old_size, new_tsb_base, new_size);
+   copy_tsb(old_tsb_base, old_size, new_tsb_base, new_size,
+   tsb_index == MM_TSB_BASE ?
+   PAGE_SHIFT : REAL_HPAGE_SHIFT);
}
 
mm->context.tsb_block[tsb_index].tsb = new_tsb;
-- 
2.7.4



Re: [PATCH v2] mm: show total hugetlb memory consumption in /proc/meminfo

2017-11-21 Thread Mike Kravetz
On 11/21/2017 11:59 AM, Roman Gushchin wrote:
> On Tue, Nov 21, 2017 at 11:19:07AM -0800, Andrew Morton wrote:
>>
>> Why not
>>
>>  seq_printf(m,
>>  "HugePages_Total:   %5lu\n"
>>  "HugePages_Free:%5lu\n"
>>  "HugePages_Rsvd:%5lu\n"
>>  "HugePages_Surp:%5lu\n"
>>  "Hugepagesize:   %8lu kB\n",
>>  h->nr_huge_pages,
>>  h->free_huge_pages,
>>  h->resv_huge_pages,
>>  h->surplus_huge_pages,
>>  1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
>>
>>  for_each_hstate(h)
>>  total += (PAGE_SIZE << huge_page_order(h)) * h->nr_huge_pages;
>>  seq_printf(m, "Hugetlb:%8lu kB\n", total / 1024);
>>  
>> ?
> 
> The idea was that the local variable guarantees the consistency
> between Hugetlb and HugePages_Total numbers. Otherwise we have
> to take hugetlb_lock.

Most importantly, it prevents HugePages_Total from being larger than
Hugetlb.

> What we can do, is to rename "count" into "nr_huge_pages", like:
> 
>   for_each_hstate(h) {
>   unsigned long nr_huge_pages = h->nr_huge_pages;
> 
>   total += (PAGE_SIZE << huge_page_order(h)) * nr_huge_pages;
> 
>   if (h == &default_hstate)
>   seq_printf(m,
>  "HugePages_Total:   %5lu\n"
>  "HugePages_Free:%5lu\n"
>  "HugePages_Rsvd:%5lu\n"
>  "HugePages_Surp:%5lu\n"
>  "Hugepagesize:   %8lu kB\n",
>  nr_huge_pages,
>  h->free_huge_pages,
>  h->resv_huge_pages,
>  h->surplus_huge_pages,
>  (PAGE_SIZE << huge_page_order(h)) / 1024);
>   }
> 
>   seq_printf(m, "Hugetlb:%8lu kB\n", total / 1024);
> 
> But maybe taking a lock is not a bad idea, because it will also
> guarantee consistency between other numbers (like HugePages_Free) as well,
> which is not true right now.

You are correct in that there is no consistency guarantee for the numbers
with the default huge page size today.  However, I am not really a fan of
taking the lock for that guarantee.  IMO, the above code is fine.

This discussion reminds me that ideally there should be a per-hstate lock.
My guess is that the global lock is a carry over from the days when only
a single huge page size was supported.  In practice, I don't think this is
much of an issue as people typically only use a single huge page size.  But,
if anyone thinks this is or may be an issue I am happy to make the changes.

-- 
Mike Kravetz

> 
> Thanks!
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org;> em...@kvack.org 
> 


Re: [RFC PATCH 0/3] restructure memfd code

2017-11-21 Thread Mike Kravetz
On 11/21/2017 08:32 AM, Khalid Aziz wrote:
> On Wed, 2017-11-08 at 17:41 -0800, Mike Kravetz wrote:
>> With the addition of memfd hugetlbfs support, we now have the
>> situation
>> where memfd depends on TMPFS -or- HUGETLBFS.  Previously, memfd was
>> only
>> supported on tmpfs, so it made sense that the code resides in
>> shmem.c.
>>
>> This patch series moves the memfd code to separate files (memfd.c and
>> memfd.h).  It creates a new config option MEMFD_CREATE that is
>> defined
>> if either TMPFS or HUGETLBFS is defined.
>>
>> In the current code, memfd is only functional if TMPFS is
>> defined.  If
>> HUGETLBFS is defined and TMPFS is not defined, then memfd
>> functionality
>> will not be available for hugetlbfs.  This does not cause BUGs, just
>> a
>> potential lack of desired functionality.
>>
>> Another way to approach this issue would be to simply make HUGETLBFS
>> depend on TMPFS.
>>
>> This patch series is built on top of the Marc-André Lureau v3 series
>> "memfd: add sealing to hugetlb-backed memory":
>> http://lkml.kernel.org/r/20171107122800.25517-1-marcandre.lureau@redh
>> at.com
>>
>> Mike Kravetz (3):
>>   mm: hugetlbfs: move HUGETLBFS_I outside #ifdef CONFIG_HUGETLBFS
>>   mm: memfd: split out memfd for use by multiple filesystems
>>   mm: memfd: remove memfd code from shmem files and use new memfd
>> files
>>
> 
> Hi Mike,
> 
> This looks like a useful change. After applying patch 2, you end up
> with duplicate definitions of number of symbols though. Although those
> duplicates will not cause compilation problems since memfd.c is not
> compiled until after patch 3 has been applied, would it make more sense
> to combine moving of all code in one patch?

Thanks Khalid,

I was aware of this situation when creating the patches.  They were broken
out as above simply to make the changes easier to review and understand,
though I am not sure that actually helped.  The other option, as you
suggested, is to simply combine the add and remove into a single patch.

I am somewhat waiting to see how Marc-André Lureau's file sealing series
progresses as this series touches the same code.
-- 
Mike Kravetz


Re: [PATCH 1/1] mm/cma: fix alloc_contig_range ret code/potential leak

2017-11-22 Thread Mike Kravetz
On 11/22/2017 04:00 AM, Johannes Weiner wrote:
> On Mon, Nov 20, 2017 at 11:39:30AM -0800, Mike Kravetz wrote:
>> If the call __alloc_contig_migrate_range() in alloc_contig_range
>> returns -EBUSY, processing continues so that test_pages_isolated()
>> is called where there is a tracepoint to identify the busy pages.
>> However, it is possible for busy pages to become available between
>> the calls to these two routines.  In this case, the range of pages
>> may be allocated.   Unfortunately, the original return code (ret
>> == -EBUSY) is still set and returned to the caller.  Therefore,
>> the caller believes the pages were not allocated and they are leaked.
>>
>> Update the return code with the value from test_pages_isolated().
>>
>> Fixes: 8ef5849fa8a2 ("mm/cma: always check which page caused allocation 
>> failure")
>> Cc: 
>> Signed-off-by: Mike Kravetz 
> 
> Wow, good catch.
> 
>> ---
>>  mm/page_alloc.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 77e4d3c5c57b..3605ca82fd29 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -7632,10 +7632,10 @@ int alloc_contig_range(unsigned long start, unsigned 
>> long end,
>>  }
>>  
>>  /* Make sure the range is really isolated. */
>> -if (test_pages_isolated(outer_start, end, false)) {
>> +ret = test_pages_isolated(outer_start, end, false);
>> +if (ret) {
>>  pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
>>  __func__, outer_start, end);
>> -ret = -EBUSY;
>>  goto done;
> 
> Essentially, an -EBUSY from __alloc_contig_migrate_range() doesn't
> mean anything, and we return 0 if the rest of the operations succeed.
> 
> Since we never plan on returning that particular -EBUSY, would it be
> more robust to reset it right then and there, rather than letting it
> run on in ret for more than a screenful?
> 
> It would also be good to note in that fall-through comment that the
> pages becoming free on their own is a distinct possibility.
> 
> As Michal points out, this is really subtle. It makes sense to make it
> as explicit as possible.

Ok, I thought about zeroing ret right after the call to
__alloc_contig_migrate_range when it returns -EBUSY.  It just didn't look
right to me.  But, you are correct.  We should make this as explicit as
possible.  I will respin the patch as suggested and be sure to include an
explicit comment when setting ret = 0.

-- 
Mike Kravetz


[PATCH v2] mm/cma: fix alloc_contig_range ret code/potential leak

2017-11-22 Thread Mike Kravetz
If the call __alloc_contig_migrate_range() in alloc_contig_range
returns -EBUSY, processing continues so that test_pages_isolated()
is called where there is a tracepoint to identify the busy pages.
However, it is possible for busy pages to become available between
the calls to these two routines.  In this case, the range of pages
may be allocated.   Unfortunately, the original return code (ret
== -EBUSY) is still set and returned to the caller.  Therefore,
the caller believes the pages were not allocated and they are leaked.

Update comment to indicate that allocation is still possible even if
__alloc_contig_migrate_range returns -EBUSY.  Also, clear return code
in this case so that it is not accidentally used or returned to caller.

Fixes: 8ef5849fa8a2 ("mm/cma: always check which page caused allocation 
failure")
Cc: 
Signed-off-by: Mike Kravetz 
---
 mm/page_alloc.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 77e4d3c5c57b..25e81844d1aa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7582,11 +7582,18 @@ int alloc_contig_range(unsigned long start, unsigned 
long end,
 
/*
 * In case of -EBUSY, we'd like to know which page causes problem.
-* So, just fall through. We will check it in test_pages_isolated().
+* So, just fall through. test_pages_isolated() has a tracepoint
+* which will report the busy page.
+*
+* It is possible that busy pages could become available before
+* the call to test_pages_isolated, and the range will actually be
+* allocated.  So, if we fall through be sure to clear ret so that
+* -EBUSY is not accidentally used or returned to caller.
 */
ret = __alloc_contig_migrate_range(, start, end);
if (ret && ret != -EBUSY)
goto done;
+   ret = 0;
 
/*
 * Pages from [start, end) are within a MAX_ORDER_NR_PAGES
-- 
2.13.6



Re: hugetlb page migration vs. overcommit

2017-11-22 Thread Mike Kravetz
On 11/22/2017 07:28 AM, Michal Hocko wrote:
> Hi,
> is there any reason why we enforce the overcommit limit during hugetlb
> pages migration? It's in alloc_huge_page_node->__alloc_buddy_huge_page
> path. I am wondering whether this is really an intentional behavior.

I do not think it was intentional.  But, I was not around when that
code was added.

> The page migration allocates a page just temporarily so we should be
> able to go over the overcommit limit for the migration duration. The
> reason I am asking is that hugetlb pages tend to be utilized usually
> (otherwise the memory would be just wasted and pool shrunk) but then
> the migration simply fails which breaks memory hotplug and other
> migration dependent functionality which is quite suboptimal. You can
> workaround that by increasing the overcommit limit.

Yes.  In an environment making optimal use of huge pages, you are unlikely
to have 'spare pages' set aside for a potential migration operation.  So
I agree that it would make sense to try and allocate overcommit pages for
this purpose.

> Why don't we simply migrate as long as we are able to allocate the
> target hugetlb page? I have a half baked patch to remove this
> restriction, would there be an opposition to do something like that?

I would not be opposed and would help with this effort.  My concern would
be any subtle hugetlb accounting issues once you start messing with
additional overcommit pages.

Since Naoya was originally involved in huge page migration, I would welcome
his comments.
-- 
Mike Kravetz

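To make the workaround mentioned above concrete, here is a small sketch (mine,
not from the thread) that raises the hugetlb overcommit limit for the default
huge page size before an operation, such as memory offlining, that needs to
migrate in-use huge pages.  The previous value should be saved and restored
once the operation completes.

#include <stdio.h>

/* Write to /proc/sys/vm/nr_overcommit_hugepages; returns 0 on success. */
static int set_hugetlb_overcommit(unsigned long nr_pages)
{
        FILE *f = fopen("/proc/sys/vm/nr_overcommit_hugepages", "w");

        if (!f)
                return -1;
        fprintf(f, "%lu\n", nr_pages);
        return fclose(f);
}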

Re: [PATCH RFC 1/2] mm, hugetlb: unify core page allocation accounting and initialization

2017-11-29 Thread Mike Kravetz
On 11/28/2017 10:57 PM, Michal Hocko wrote:
> On Tue 28-11-17 13:34:53, Mike Kravetz wrote:
>> On 11/28/2017 06:12 AM, Michal Hocko wrote:
> [...]
>>> +/*
>>> + * Allocates a fresh page to the hugetlb allocator pool in the node 
>>> interleaved
>>> + * manner.
>>> + */
>>>  static int alloc_fresh_huge_page(struct hstate *h, nodemask_t 
>>> *nodes_allowed)
>>>  {
>>> struct page *page;
>>> int nr_nodes, node;
>>> -   int ret = 0;
>>> +   gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>>>  
>>> for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
>>> -   page = alloc_fresh_huge_page_node(h, node);
>>> -   if (page) {
>>> -   ret = 1;
>>> +   page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask,
>>> +   node, nodes_allowed);
>>
>> I don't have the greatest understanding of node/nodemasks, but ...
>> Since __hugetlb_alloc_buddy_huge_page calls __alloc_pages_nodemask(), do
>> we still need to explicitly iterate over nodes with
>> for_each_node_mask_to_alloc() here?
> 
> Yes we do, because callers depend on the round robin allocation policy
> which is implemented by the ugly for_each_node_mask_to_alloc. I am not
> saying I like the way this is done but this is user visible thing.

Ah, thanks.

I missed the __GFP_THISNODE.  Because of that, the nodes_allowed mask is
not used in the allocation attempts.  So, cycling through the nodes with
the for_each_node_mask_to_alloc makes sense.

> Or maybe I've missunderstood the whole thing...

No, this should preserve the original behavior.

-- 
Mike Kravetz


Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration

2017-11-29 Thread Mike Kravetz
On 11/29/2017 01:22 AM, Michal Hocko wrote:
> What about this on top. I haven't tested this yet though.

Yes, this would work.

However, I think a simple modification to your previous free_huge_page
changes would make this unnecessary.  I was confused in your previous
patch because you decremented the per-node surplus page count, but not
the global count.  I think it would have been correct (and made this
patch unnecessary) if you decremented the global counter there as well.

Of course, this patch makes the surplus accounting more explicit.

If we move forward with this patch, one issue below.

> ---
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1b6d7783c717..f5fcd4e355dc 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -119,6 +119,7 @@ long hugetlb_unreserve_pages(struct inode *inode, long 
> start, long end,
>   long freed);
>  bool isolate_huge_page(struct page *page, struct list_head *list);
>  void putback_active_hugepage(struct page *page);
> +void move_hugetlb_state(struct page *oldpage, struct page *newpage, int 
> reason);
>  void free_huge_page(struct page *page);
>  void hugetlb_fix_reserve_counts(struct inode *inode);
>  extern struct mutex *hugetlb_fault_mutex_table;
> @@ -232,6 +233,7 @@ static inline bool isolate_huge_page(struct page *page, 
> struct list_head *list)
>   return false;
>  }
>  #define putback_active_hugepage(p)   do {} while (0)
> +#define move_hugetlb_state(old, new, reason) do {} while (0)
>  
>  static inline unsigned long hugetlb_change_protection(struct vm_area_struct 
> *vma,
>   unsigned long address, unsigned long end, pgprot_t newprot)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 037bf0f89463..30601c1c62f3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -34,6 +34,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "internal.h"
>  
>  int hugetlb_max_hstate __read_mostly;
> @@ -4830,3 +4831,34 @@ void putback_active_hugepage(struct page *page)
>   spin_unlock(&hugetlb_lock);
>   put_page(page);
>  }
> +
> +void move_hugetlb_state(struct page *oldpage, struct page *newpage, int 
> reason)
> +{
> + struct hstate *h = page_hstate(oldpage);
> +
> + hugetlb_cgroup_migrate(oldpage, newpage);
> + set_page_owner_migrate_reason(newpage, reason);
> +
> + /*
> +  * transfer temporary state of the new huge page. This is
> +  * reverse to other transitions because the newpage is going to
> +  * be final while the old one will be freed so it takes over
> +  * the temporary status.
> +  *
> +  * Also note that we have to transfer the per-node surplus state
> +  * here as well otherwise the global surplus count will not match
> +  * the per-node's.
> +  */
> + if (PageHugeTemporary(newpage)) {
> + int old_nid = page_to_nid(oldpage);
> + int new_nid = page_to_nid(newpage);
> +
> + SetPageHugeTemporary(oldpage);
> + ClearPageHugeTemporary(newpage);
> +
> + if (h->surplus_huge_pages_node[old_nid]) {
> +     h->surplus_huge_pages_node[old_nid]--;
> + h->surplus_huge_pages_node[new_nid]++;
> + }

You need to take hugetlb_lock before adjusting the surplus counts.

-- 
Mike Kravetz


Re: [PATCH RFC 2/2] mm, hugetlb: do not rely on overcommit limit during migration

2017-11-30 Thread Mike Kravetz
On 11/29/2017 11:57 PM, Michal Hocko wrote:
> On Wed 29-11-17 11:52:53, Mike Kravetz wrote:
>> On 11/29/2017 01:22 AM, Michal Hocko wrote:
>>> What about this on top. I haven't tested this yet though.
>>
>> Yes, this would work.
>>
>> However, I think a simple modification to your previous free_huge_page
>> changes would make this unnecessary.  I was confused in your previous
>> patch because you decremented the per-node surplus page count, but not
>> the global count.  I think it would have been correct (and made this
>> patch unnecessary) if you decremented the global counter there as well.
> 
> We cannot really increment the global counter because the over number of
> surplus pages during migration doesn't increase.

I was not suggesting we increment the global surplus count.  Rather,
your previous patch should have decremented the global surplus count in
free_huge_page.  Something like:

@@ -1283,7 +1283,13 @@ void free_huge_page(struct page *page)
if (restore_reserve)
h->resv_huge_pages++;
 
-   if (h->surplus_huge_pages_node[nid]) {
+   if (PageHugeTemporary(page)) {
+   list_del(&page->lru);
+   ClearPageHugeTemporary(page);
+   update_and_free_page(h, page);
+   if (h->surplus_huge_pages_node[nid]) {
+   h->surplus_huge_pages--;
+   h->surplus_huge_pages_node[nid]--;
+   }
+   } else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
list_del(&page->lru);
update_and_free_page(h, page);

When we allocate one of these 'PageHugeTemporary' pages, we only increment
the global and node specific nr_huge_pages counters.  To me, this makes all
the huge page counters the same as if there were simply one additional
pre-allocated huge page.  This 'extra' (PageHugeTemporary) page will go
away when free_huge_page is called.  So, my thought is that it is not
necessary to transfer per-node counts from the original to target node.
Of course, I may be missing something.

When thinking about transfering per-node counts as is done in your latest
patch, I took another look at all the per-node counts.  This may show my
ignorance of huge page migration, but do we need to handle the case where
the page being migrated is 'free'?  Is that possible?  If so, there will
be a count for free_huge_pages_node, and the page will be on the per-node
hugepage_freelists, both of which must be handled.

-- 
Mike Kravetz


Re: [PATCH v2 5/9] shmem: add sealing support to hugetlb-backed memfd

2017-11-06 Thread Mike Kravetz
On 11/06/2017 06:39 AM, Marc-André Lureau wrote:
> Adapt add_seals()/get_seals() to work with hugetlbfs-backed memory.
> 
> Teach memfd_create() to allow sealing operations on MFD_HUGETLB.
> 
> Signed-off-by: Marc-André Lureau 

Looks fine to me,

Reviewed-by: Mike Kravetz 

-- 
Mike Kravetz

> ---
>  mm/shmem.c | 47 ---
>  1 file changed, 28 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index b7811979611f..b08ba5d84d84 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2717,6 +2717,19 @@ static int shmem_wait_for_pins(struct address_space 
> *mapping)
>   return error;
>  }
>  
> +static unsigned int *memfd_file_seals_ptr(struct file *file)
> +{
> + if (file->f_op == &shmem_file_operations)
> + return &SHMEM_I(file_inode(file))->seals;
> +
> +#ifdef CONFIG_HUGETLBFS
> + if (file->f_op == &hugetlbfs_file_operations)
> + return &HUGETLBFS_I(file_inode(file))->seals;
> +#endif
> +
> + return NULL;
> +}
> +
>  #define F_ALL_SEALS (F_SEAL_SEAL | \
>F_SEAL_SHRINK | \
>F_SEAL_GROW | \
> @@ -2725,7 +2738,7 @@ static int shmem_wait_for_pins(struct address_space 
> *mapping)
>  static int memfd_add_seals(struct file *file, unsigned int seals)
>  {
>   struct inode *inode = file_inode(file);
> - struct shmem_inode_info *info = SHMEM_I(inode);
> + unsigned int *file_seals;
>   int error;
>  
>   /*
> @@ -2758,8 +2771,6 @@ static int memfd_add_seals(struct file *file, unsigned 
> int seals)
>* other file types.
>*/
>  
> - if (file->f_op != &shmem_file_operations)
> - return -EINVAL;
>   if (!(file->f_mode & FMODE_WRITE))
>   return -EPERM;
>   if (seals & ~(unsigned int)F_ALL_SEALS)
> @@ -2767,12 +2778,18 @@ static int memfd_add_seals(struct file *file, 
> unsigned int seals)
>  
>   inode_lock(inode);
>  
> - if (info->seals & F_SEAL_SEAL) {
> + file_seals = memfd_file_seals_ptr(file);
> + if (!file_seals) {
> + error = -EINVAL;
> + goto unlock;
> + }
> +
> + if (*file_seals & F_SEAL_SEAL) {
>   error = -EPERM;
>   goto unlock;
>   }
>  
> - if ((seals & F_SEAL_WRITE) && !(info->seals & F_SEAL_WRITE)) {
> + if ((seals & F_SEAL_WRITE) && !(*file_seals & F_SEAL_WRITE)) {
>   error = mapping_deny_writable(file->f_mapping);
>   if (error)
>   goto unlock;
> @@ -2784,7 +2801,7 @@ static int memfd_add_seals(struct file *file, unsigned 
> int seals)
>   }
>   }
>  
> - info->seals |= seals;
> + *file_seals |= seals;
>   error = 0;
>  
>  unlock:
> @@ -2794,10 +2811,9 @@ static int memfd_add_seals(struct file *file, unsigned 
> int seals)
>  
>  static int memfd_get_seals(struct file *file)
>  {
> - if (file->f_op != &shmem_file_operations)
> - return -EINVAL;
> + unsigned int *seals = memfd_file_seals_ptr(file);
>  
> - return SHMEM_I(file_inode(file))->seals;
> + return seals ? *seals : -EINVAL;
>  }
>  
>  long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
> @@ -3657,7 +3673,7 @@ SYSCALL_DEFINE2(memfd_create,
>   const char __user *, uname,
>   unsigned int, flags)
>  {
> - struct shmem_inode_info *info;
> + unsigned int *file_seals;
>   struct file *file;
>   int fd, error;
>   char *name;
> @@ -3667,9 +3683,6 @@ SYSCALL_DEFINE2(memfd_create,
>   if (flags & ~(unsigned int)MFD_ALL_FLAGS)
>   return -EINVAL;
>   } else {
> - /* Sealing not supported in hugetlbfs (MFD_HUGETLB) */
> - if (flags & MFD_ALLOW_SEALING)
> - return -EINVAL;
>   /* Allow huge page size encoding in flags. */
>   if (flags & ~(unsigned int)(MFD_ALL_FLAGS |
>   (MFD_HUGE_MASK << MFD_HUGE_SHIFT)))
> @@ -3722,12 +3735,8 @@ SYSCALL_DEFINE2(memfd_create,
>   file->f_flags |= O_RDWR | O_LARGEFILE;
>  
>   if (flags & MFD_ALLOW_SEALING) {
> - /*
> -  * flags check at beginning of function ensures
> -  * this is not a hugetlbfs (MFD_HUGETLB) file.
> -  */
> - info = SHMEM_I(file_inode(file));
> - info->seals &= ~F_SEAL_SEAL;
> + file_seals = memfd_file_seals_ptr(file);
> + *file_seals &= ~F_SEAL_SEAL;
>   }
>  
>   fd_install(fd, file);
> 


Re: [PATCH v2 6/9] memfd-tests: test hugetlbfs sealing

2017-11-06 Thread Mike Kravetz
On 11/06/2017 06:39 AM, Marc-André Lureau wrote:
> Remove most of the special-casing of hugetlbfs now that sealing
> is supported.
> 
> Signed-off-by: Marc-André Lureau 

Reviewed-by: Mike Kravetz 

-- 
Mike Kravetz

> ---
>  tools/testing/selftests/memfd/memfd_test.c | 150 
> +++--
>  1 file changed, 15 insertions(+), 135 deletions(-)
> 
> diff --git a/tools/testing/selftests/memfd/memfd_test.c 
> b/tools/testing/selftests/memfd/memfd_test.c
> index 845e5f67b6f0..cca957a06525 100644
> --- a/tools/testing/selftests/memfd/memfd_test.c
> +++ b/tools/testing/selftests/memfd/memfd_test.c
> @@ -513,6 +513,10 @@ static void mfd_assert_grow_write(int fd)
>   static char *buf;
>   ssize_t l;
>  
> + /* hugetlbfs does not support write */
> + if (hugetlbfs_test)
> + return;
> +
>   buf = malloc(mfd_def_size * 8);
>   if (!buf) {
>   printf("malloc(%d) failed: %m\n", mfd_def_size * 8);
> @@ -533,6 +537,10 @@ static void mfd_fail_grow_write(int fd)
>   static char *buf;
>   ssize_t l;
>  
> + /* hugetlbfs does not support write */
> + if (hugetlbfs_test)
> + return;
> +
>   buf = malloc(mfd_def_size * 8);
>   if (!buf) {
>   printf("malloc(%d) failed: %m\n", mfd_def_size * 8);
> @@ -627,18 +635,13 @@ static void test_create(void)
>   fd = mfd_assert_new("", 0, MFD_CLOEXEC);
>   close(fd);
>  
> - if (!hugetlbfs_test) {
> - /* verify MFD_ALLOW_SEALING is allowed */
> - fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING);
> - close(fd);
> -
> - /* verify MFD_ALLOW_SEALING | MFD_CLOEXEC is allowed */
> - fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING | MFD_CLOEXEC);
> - close(fd);
> - } else {
> - /* sealing is not supported on hugetlbfs */
> - mfd_fail_new("", MFD_ALLOW_SEALING);
> - }
> + /* verify MFD_ALLOW_SEALING is allowed */
> + fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING);
> + close(fd);
> +
> + /* verify MFD_ALLOW_SEALING | MFD_CLOEXEC is allowed */
> + fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING | MFD_CLOEXEC);
> + close(fd);
>  }
>  
>  /*
> @@ -649,10 +652,6 @@ static void test_basic(void)
>  {
>   int fd;
>  
> - /* hugetlbfs does not contain sealing support */
> - if (hugetlbfs_test)
> - return;
> -
>   printf("%s BASIC\n", MEMFD_STR);
>  
>   fd = mfd_assert_new("kern_memfd_basic",
> @@ -697,28 +696,6 @@ static void test_basic(void)
>   close(fd);
>  }
>  
> -/*
> - * hugetlbfs doesn't support seals or write, so just verify grow and shrink
> - * on a hugetlbfs file created via memfd_create.
> - */
> -static void test_hugetlbfs_grow_shrink(void)
> -{
> - int fd;
> -
> - printf("%s HUGETLBFS-GROW-SHRINK\n", MEMFD_STR);
> -
> - fd = mfd_assert_new("kern_memfd_seal_write",
> - mfd_def_size,
> - MFD_CLOEXEC);
> -
> - mfd_assert_read(fd);
> - mfd_assert_write(fd);
> - mfd_assert_shrink(fd);
> - mfd_assert_grow(fd);
> -
> - close(fd);
> -}
> -
>  /*
>   * Test SEAL_WRITE
>   * Test whether SEAL_WRITE actually prevents modifications.
> @@ -727,13 +704,6 @@ static void test_seal_write(void)
>  {
>   int fd;
>  
> - /*
> -  * hugetlbfs does not contain sealing or write support.  Just test
> -  * basic grow and shrink via test_hugetlbfs_grow_shrink.
> -  */
> - if (hugetlbfs_test)
> - return test_hugetlbfs_grow_shrink();
> -
>   printf("%s SEAL-WRITE\n", MEMFD_STR);
>  
>   fd = mfd_assert_new("kern_memfd_seal_write",
> @@ -760,10 +730,6 @@ static void test_seal_shrink(void)
>  {
>   int fd;
>  
> - /* hugetlbfs does not contain sealing support */
> - if (hugetlbfs_test)
> - return;
> -
>   printf("%s SEAL-SHRINK\n", MEMFD_STR);
>  
>   fd = mfd_assert_new("kern_memfd_seal_shrink",
> @@ -790,10 +756,6 @@ static void test_seal_grow(void)
>  {
>   int fd;
>  
> - /* hugetlbfs does not contain sealing support */
> - if (hugetlbfs_test)
> - return;
> -
>   printf("%s SEAL-GROW\n", MEMFD_STR);
>  
>   fd = mfd_assert_new("kern_memfd_seal_grow",
> @@ -820,10 +782,6 @@ static void test_seal_resize(void)
>  {
>   int fd;
>  
> - /* hugetlbfs does

Re: [PATCH v2 7/9] memfd-test: add 'memfd-hugetlb:' prefix when testing hugetlbfs

2017-11-06 Thread Mike Kravetz
On 11/06/2017 06:39 AM, Marc-André Lureau wrote:
> Suggested-by: Mike Kravetz 
> Signed-off-by: Marc-André Lureau 

Reviewed-by: Mike Kravetz 

-- 
Mike Kravetz

> ---
>  tools/testing/selftests/memfd/memfd_test.c | 26 --
>  1 file changed, 16 insertions(+), 10 deletions(-)
> 
> diff --git a/tools/testing/selftests/memfd/memfd_test.c 
> b/tools/testing/selftests/memfd/memfd_test.c
> index cca957a06525..955d09ee16ca 100644
> --- a/tools/testing/selftests/memfd/memfd_test.c
> +++ b/tools/testing/selftests/memfd/memfd_test.c
> @@ -20,6 +20,7 @@
>  #include 
>  
>  #define MEMFD_STR"memfd:"
> +#define MEMFD_HUGE_STR   "memfd-hugetlb:"
>  #define SHARED_FT_STR"(shared file-table)"
>  
>  #define MFD_DEF_SIZE 8192
> @@ -30,6 +31,7 @@
>   */
>  static int hugetlbfs_test;
>  static size_t mfd_def_size = MFD_DEF_SIZE;
> +static const char *memfd_str = MEMFD_STR;
>  
>  /*
>   * Copied from mlock2-tests.c
> @@ -606,7 +608,7 @@ static void test_create(void)
>   char buf[2048];
>   int fd;
>  
> - printf("%s CREATE\n", MEMFD_STR);
> + printf("%s CREATE\n", memfd_str);
>  
>   /* test NULL name */
>   mfd_fail_new(NULL, 0);
> @@ -652,7 +654,7 @@ static void test_basic(void)
>  {
>   int fd;
>  
> - printf("%s BASIC\n", MEMFD_STR);
> + printf("%s BASIC\n", memfd_str);
>  
>   fd = mfd_assert_new("kern_memfd_basic",
>   mfd_def_size,
> @@ -704,7 +706,7 @@ static void test_seal_write(void)
>  {
>   int fd;
>  
> - printf("%s SEAL-WRITE\n", MEMFD_STR);
> + printf("%s SEAL-WRITE\n", memfd_str);
>  
>   fd = mfd_assert_new("kern_memfd_seal_write",
>   mfd_def_size,
> @@ -730,7 +732,7 @@ static void test_seal_shrink(void)
>  {
>   int fd;
>  
> - printf("%s SEAL-SHRINK\n", MEMFD_STR);
> + printf("%s SEAL-SHRINK\n", memfd_str);
>  
>   fd = mfd_assert_new("kern_memfd_seal_shrink",
>   mfd_def_size,
> @@ -756,7 +758,7 @@ static void test_seal_grow(void)
>  {
>   int fd;
>  
> - printf("%s SEAL-GROW\n", MEMFD_STR);
> + printf("%s SEAL-GROW\n", memfd_str);
>  
>   fd = mfd_assert_new("kern_memfd_seal_grow",
>   mfd_def_size,
> @@ -782,7 +784,7 @@ static void test_seal_resize(void)
>  {
>   int fd;
>  
> - printf("%s SEAL-RESIZE\n", MEMFD_STR);
> + printf("%s SEAL-RESIZE\n", memfd_str);
>  
>   fd = mfd_assert_new("kern_memfd_seal_resize",
>   mfd_def_size,
> @@ -808,7 +810,7 @@ static void test_share_dup(char *banner, char *b_suffix)
>  {
>   int fd, fd2;
>  
> - printf("%s %s %s\n", MEMFD_STR, banner, b_suffix);
> + printf("%s %s %s\n", memfd_str, banner, b_suffix);
>  
>   fd = mfd_assert_new("kern_memfd_share_dup",
>   mfd_def_size,
> @@ -850,7 +852,7 @@ static void test_share_mmap(char *banner, char *b_suffix)
>   int fd;
>   void *p;
>  
> - printf("%s %s %s\n", MEMFD_STR,  banner, b_suffix);
> + printf("%s %s %s\n", memfd_str,  banner, b_suffix);
>  
>   fd = mfd_assert_new("kern_memfd_share_mmap",
>   mfd_def_size,
> @@ -884,7 +886,7 @@ static void test_share_open(char *banner, char *b_suffix)
>  {
>   int fd, fd2;
>  
> - printf("%s %s %s\n", MEMFD_STR, banner, b_suffix);
> + printf("%s %s %s\n", memfd_str, banner, b_suffix);
>  
>   fd = mfd_assert_new("kern_memfd_share_open",
>   mfd_def_size,
> @@ -927,7 +929,7 @@ static void test_share_fork(char *banner, char *b_suffix)
>   int fd;
>   pid_t pid;
>  
> - printf("%s %s %s\n", MEMFD_STR, banner, b_suffix);
> + printf("%s %s %s\n", memfd_str, banner, b_suffix);
>  
>   fd = mfd_assert_new("kern_memfd_share_fork",
>   mfd_def_size,
> @@ -963,7 +965,11 @@ int main(int argc, char **argv)
>   }
>  
>   hugetlbfs_test = 1;
> + memfd_str = MEMFD_HUGE_STR;
>   mfd_def_size = hpage_size * 2;
> + } else {
> + printf("Unknown option: %s\n", argv[1]);
> + abort();
>   }
>   }
>  
> 


Re: [PATCH v2 8/9] memfd-test: move common code to a shared unit

2017-11-06 Thread Mike Kravetz
On 11/06/2017 06:39 AM, Marc-André Lureau wrote:
> The memfd & fuse tests will share more common code in the following
> commits to test hugetlb support.
> 
> Signed-off-by: Marc-André Lureau 

Reviewed-by: Mike Kravetz 

-- 
Mike Kravetz

> ---
>  tools/testing/selftests/memfd/Makefile |  5 
>  tools/testing/selftests/memfd/common.c | 45 
> ++
>  tools/testing/selftests/memfd/common.h |  9 ++
>  tools/testing/selftests/memfd/fuse_test.c  |  8 ++
>  tools/testing/selftests/memfd/memfd_test.c | 36 ++--
>  5 files changed, 63 insertions(+), 40 deletions(-)
>  create mode 100644 tools/testing/selftests/memfd/common.c
>  create mode 100644 tools/testing/selftests/memfd/common.h
> 
> diff --git a/tools/testing/selftests/memfd/Makefile 
> b/tools/testing/selftests/memfd/Makefile
> index 3926a0409dda..a5276a91dfbf 100644
> --- a/tools/testing/selftests/memfd/Makefile
> +++ b/tools/testing/selftests/memfd/Makefile
> @@ -12,3 +12,8 @@ fuse_mnt.o: CFLAGS += $(shell pkg-config fuse --cflags)
>  include ../lib.mk
>  
>  $(OUTPUT)/fuse_mnt: LDLIBS += $(shell pkg-config fuse --libs)
> +
> +$(OUTPUT)/memfd_test: memfd_test.c common.o
> +$(OUTPUT)/fuse_test: fuse_test.c common.o
> +
> +EXTRA_CLEAN = common.o
> diff --git a/tools/testing/selftests/memfd/common.c 
> b/tools/testing/selftests/memfd/common.c
> new file mode 100644
> index ..7ed269cd3abb
> --- /dev/null
> +++ b/tools/testing/selftests/memfd/common.c
> @@ -0,0 +1,45 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define _GNU_SOURCE
> +#define __EXPORTED_HEADERS__
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "common.h"
> +
> +int hugetlbfs_test = 0;
> +
> +/*
> + * Copied from mlock2-tests.c
> + */
> +unsigned long default_huge_page_size(void)
> +{
> + unsigned long hps = 0;
> + char *line = NULL;
> + size_t linelen = 0;
> + FILE *f = fopen("/proc/meminfo", "r");
> +
> + if (!f)
> + return 0;
> + while (getline(&line, &linelen, f) > 0) {
> + if (sscanf(line, "Hugepagesize:   %lu kB", &hps) == 1) {
> + hps <<= 10;
> + break;
> + }
> + }
> +
> + free(line);
> + fclose(f);
> + return hps;
> +}
> +
> +int sys_memfd_create(const char *name, unsigned int flags)
> +{
> + if (hugetlbfs_test)
> + flags |= MFD_HUGETLB;
> +
> + return syscall(__NR_memfd_create, name, flags);
> +}
> diff --git a/tools/testing/selftests/memfd/common.h 
> b/tools/testing/selftests/memfd/common.h
> new file mode 100644
> index ..522d2c630bd8
> --- /dev/null
> +++ b/tools/testing/selftests/memfd/common.h
> @@ -0,0 +1,9 @@
> +#ifndef COMMON_H_
> +#define COMMON_H_
> +
> +extern int hugetlbfs_test;
> +
> +unsigned long default_huge_page_size(void);
> +int sys_memfd_create(const char *name, unsigned int flags);
> +
> +#endif
> diff --git a/tools/testing/selftests/memfd/fuse_test.c 
> b/tools/testing/selftests/memfd/fuse_test.c
> index 1ccb7a3eb14b..795a25ba8521 100644
> --- a/tools/testing/selftests/memfd/fuse_test.c
> +++ b/tools/testing/selftests/memfd/fuse_test.c
> @@ -33,15 +33,11 @@
>  #include 
>  #include 
>  
> +#include "common.h"
> +
>  #define MFD_DEF_SIZE 8192
>  #define STACK_SIZE 65536
>  
> -static int sys_memfd_create(const char *name,
> - unsigned int flags)
> -{
> - return syscall(__NR_memfd_create, name, flags);
> -}
> -
>  static int mfd_assert_new(const char *name, loff_t sz, unsigned int flags)
>  {
>   int r, fd;
> diff --git a/tools/testing/selftests/memfd/memfd_test.c 
> b/tools/testing/selftests/memfd/memfd_test.c
> index 955d09ee16ca..4c049b6b6985 100644
> --- a/tools/testing/selftests/memfd/memfd_test.c
> +++ b/tools/testing/selftests/memfd/memfd_test.c
> @@ -19,6 +19,8 @@
>  #include 
>  #include 
>  
> +#include "common.h"
> +
>  #define MEMFD_STR"memfd:"
>  #define MEMFD_HUGE_STR   "memfd-hugetlb:"
>  #define SHARED_FT_STR"(shared file-table)"
> @@ -29,43 +31,9 @@
>  /*
>   * Default is not to test hugetlbfs
>   */
> -static int hugetlbfs_test;
>  static size_t mfd_def_size = MFD_DEF_SIZE;
>  static const char *memfd_str = MEMFD_STR;
>  
> -/*
> - * Copied from mlock2-tests.c
> - */
> -static unsigned long default_huge_page_size(void)
> -{
> - unsigned long hps = 0;
> - char *line = NULL;
> - siz

Re: [PATCH v2 9/9] memfd-test: run fuse test on hugetlb backend memory

2017-11-06 Thread Mike Kravetz
On 11/06/2017 06:39 AM, Marc-André Lureau wrote:
> Suggested-by: Mike Kravetz 
> Signed-off-by: Marc-André Lureau 
> ---
>  tools/testing/selftests/memfd/fuse_test.c  | 30 
> ++
>  tools/testing/selftests/memfd/run_fuse_test.sh |  2 +-
>  tools/testing/selftests/memfd/run_tests.sh |  1 +
>  3 files changed, 28 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/testing/selftests/memfd/fuse_test.c 
> b/tools/testing/selftests/memfd/fuse_test.c
> index 795a25ba8521..0a85b34929e1 100644
> --- a/tools/testing/selftests/memfd/fuse_test.c
> +++ b/tools/testing/selftests/memfd/fuse_test.c
> @@ -38,6 +38,8 @@
>  #define MFD_DEF_SIZE 8192
>  #define STACK_SIZE 65536
>  
> +static size_t mfd_def_size = MFD_DEF_SIZE;
> +
>  static int mfd_assert_new(const char *name, loff_t sz, unsigned int flags)
>  {
>   int r, fd;
> @@ -123,7 +125,7 @@ static void *mfd_assert_mmap_shared(int fd)
>   void *p;
>  
>   p = mmap(NULL,
> -  MFD_DEF_SIZE,
> +  mfd_def_size,
>PROT_READ | PROT_WRITE,
>MAP_SHARED,
>fd,
> @@ -141,7 +143,7 @@ static void *mfd_assert_mmap_private(int fd)
>   void *p;
>  
>   p = mmap(NULL,
> -  MFD_DEF_SIZE,
> +  mfd_def_size,
>PROT_READ | PROT_WRITE,
>MAP_PRIVATE,
>fd,
> @@ -174,7 +176,7 @@ static int sealing_thread_fn(void *arg)
>   usleep(20);
>  
>   /* unmount mapping before sealing to avoid i_mmap_writable failures */
> - munmap(global_p, MFD_DEF_SIZE);
> + munmap(global_p, mfd_def_size);
>  
>   /* Try sealing the global file; expect EBUSY or success. Current
>* kernels will never succeed, but in the future, kernels might
> @@ -224,7 +226,7 @@ static void join_sealing_thread(pid_t pid)
>  
>  int main(int argc, char **argv)
>  {
> - static const char zero[MFD_DEF_SIZE];
> + char *zero;
>   int fd, mfd, r;
>   void *p;
>   int was_sealed;
> @@ -235,6 +237,25 @@ int main(int argc, char **argv)
>   abort();
>   }
>  
> + if (argc >= 3) {
> + if (!strcmp(argv[2], "hugetlbfs")) {
> + unsigned long hpage_size = default_huge_page_size();
> +
> + if (!hpage_size) {
> + printf("Unable to determine huge page size\n");
> + abort();
> + }
> +
> + hugetlbfs_test = 1;
> + mfd_def_size = hpage_size * 2;
> + } else {
> + printf("Unknown option: %s\n", argv[2]);
> + abort();
> + }
> + }
> +
> + zero = calloc(sizeof(*zero), mfd_def_size);
> +
>   /* open FUSE memfd file for GUP testing */
>   printf("opening: %s\n", argv[1]);
>   fd = open(argv[1], O_RDONLY | O_CLOEXEC);

When ftruncate'ing the newly created file, you need to make sure the length
is a multiple of the huge page size for hugetlbfs files.  So, you will want to
do something like:

--- a/tools/testing/selftests/memfd/fuse_test.c
+++ b/tools/testing/selftests/memfd/fuse_test.c
@@ -265,7 +265,7 @@ int main(int argc, char **argv)
 
/* create new memfd-object */
mfd = mfd_assert_new("kern_memfd_fuse",
-MFD_DEF_SIZE,
+mfd_def_size,
 MFD_CLOEXEC | MFD_ALLOW_SEALING);
 
/* mmap memfd-object for writing */

Leaving MFD_DEF_SIZE for the size of reads and writes should be fine.

-- 
Mike Kravetz

> @@ -303,6 +324,7 @@ int main(int argc, char **argv)
>   close(fd);
>  
>   printf("fuse: DONE\n");
> + free(zero);
>  
>   return 0;
>  }
> diff --git a/tools/testing/selftests/memfd/run_fuse_test.sh 
> b/tools/testing/selftests/memfd/run_fuse_test.sh
> index 407df68dfe27..22e572e2d66a 100755
> --- a/tools/testing/selftests/memfd/run_fuse_test.sh
> +++ b/tools/testing/selftests/memfd/run_fuse_test.sh
> @@ -10,6 +10,6 @@ set -e
>  
>  mkdir mnt
>  ./fuse_mnt ./mnt
> -./fuse_test ./mnt/memfd
> +./fuse_test ./mnt/memfd $@
>  fusermount -u ./mnt
>  rmdir ./mnt
> diff --git a/tools/testing/selftests/memfd/run_tests.sh 
> b/tools/testing/selftests/memfd/run_tests.sh
> index daabb350697c..c2d41ed81b24 100755
> --- a/tools/testing/selftests/memfd/run_tests.sh
> +++ b/tools/testing/selftests/memfd/run_tests.sh
> @@ -60,6 +60,7 @@ fi
>  # Run the hugetlbfs test
>  #
>  ./memfd_test hugetlbfs
> +./run_fuse_test.sh hugetlbfs
>  
>  #
>  # Give back any huge pages allocated for the test
> 

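The ftruncate comment above amounts to rounding the requested size up to a
huge page boundary before truncating an MFD_HUGETLB file.  A tiny sketch of
that rounding (an illustration, not code from the series), assuming hpage_size
comes from something like default_huge_page_size() and is therefore a power of
two:

/* Round len up to the next multiple of the (power-of-two) huge page size. */
static unsigned long round_up_to_hpage(unsigned long len, unsigned long hpage_size)
{
        return (len + hpage_size - 1) & ~(hpage_size - 1);
}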

Re: [PATCH v2 0/9] memfd: add sealing to hugetlb-backed memory

2017-11-06 Thread Mike Kravetz
On 11/06/2017 06:39 AM, Marc-André Lureau wrote:
> Hi,
> 
> Recently, Mike Kravetz added hugetlbfs support to memfd. However, he
> didn't add sealing support. One of the reasons to use memfd is to have
> shared memory sealing when doing IPC or sharing memory with another
> process with some extra safety. qemu uses shared memory & hugetables
> with vhost-user (used by dpdk), so it is reasonable to use memfd
> now instead for convenience and security reasons.

Thanks for doing this.

I will create a patch to restructure the code such that memfd_create (and
file sealing) is split out and will depend on CONFIG_TMPFS -or-
CONFIG_HUGETLBFS.  I think this can wait to go in until after this patch
series.  Unless, someone prefers that it go in first?

-- 
Mike Kravetz


> 
> Thanks!
> 
> v1->v2: after Mike review,
> - add "memfd-hugetlb:" prefix in memfd-test
> - run fuse test on hugetlb backend memory
> - rename function memfd_file_get_seals() -> memfd_file_seals_ptr()
> - update commit messages
> - added reviewed-by tags
> 
> RFC->v1:
> - split rfc patch, after early review feedback
> - added patch for memfd-test changes
> - fix build with hugetlbfs disabled
> - small code and commit messages improvements
> 
> Marc-André Lureau (9):
>   shmem: unexport shmem_add_seals()/shmem_get_seals()
>   shmem: rename functions that are memfd-related
>   hugetlb: expose hugetlbfs_inode_info in header
>   hugetlbfs: implement memfd sealing
>   shmem: add sealing support to hugetlb-backed memfd
>   memfd-tests: test hugetlbfs sealing
>   memfd-test: add 'memfd-hugetlb:' prefix when testing hugetlbfs
>   memfd-test: move common code to a shared unit
>   memfd-test: run fuse test on hugetlb backend memory
> 
>  fs/fcntl.c |   2 +-
>  fs/hugetlbfs/inode.c   |  39 +++--
>  include/linux/hugetlb.h|  11 ++
>  include/linux/shmem_fs.h   |   6 +-
>  mm/shmem.c |  59 ---
>  tools/testing/selftests/memfd/Makefile |   5 +
>  tools/testing/selftests/memfd/common.c |  45 ++
>  tools/testing/selftests/memfd/common.h |   9 ++
>  tools/testing/selftests/memfd/fuse_test.c  |  36 +++--
>  tools/testing/selftests/memfd/memfd_test.c | 212 
> -
>  tools/testing/selftests/memfd/run_fuse_test.sh |   2 +-
>  tools/testing/selftests/memfd/run_tests.sh |   1 +
>  12 files changed, 195 insertions(+), 232 deletions(-)
>  create mode 100644 tools/testing/selftests/memfd/common.c
>  create mode 100644 tools/testing/selftests/memfd/common.h
> 


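As an end-to-end illustration of what this series enables (a sketch of mine,
not part of the posted patches), the program below creates a hugetlb-backed
memfd with sealing allowed, sizes it, and then seals it against resizing.  It
uses the raw syscall just as the selftests in this thread do, and assumes a
2MB default huge page size with pages already reserved in the pool; on kernels
without this series, memfd_create() fails with EINVAL for
MFD_HUGETLB | MFD_ALLOW_SEALING.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/memfd.h>

int main(void)
{
        unsigned long hpage_size = 2UL * 1024 * 1024;   /* assumed default */
        int fd = syscall(__NR_memfd_create, "sealed-huge",
                         MFD_CLOEXEC | MFD_HUGETLB | MFD_ALLOW_SEALING);

        if (fd < 0 || ftruncate(fd, 2 * hpage_size) < 0) {
                perror("memfd_create/ftruncate");
                exit(1);
        }

        /* After this, the file can no longer grow, shrink, or gain new seals. */
        if (fcntl(fd, F_ADD_SEALS, F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL) < 0) {
                perror("F_ADD_SEALS");
                exit(1);
        }

        printf("seals: 0x%x\n", fcntl(fd, F_GET_SEALS));
        return 0;
}
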
Re: [PATCH v3 9/9] memfd-test: run fuse test on hugetlb backend memory

2017-11-07 Thread Mike Kravetz
On 11/07/2017 04:28 AM, Marc-André Lureau wrote:
> Suggested-by: Mike Kravetz 
> Signed-off-by: Marc-André Lureau 

Reviewed-by: Mike Kravetz 

-- 
Mike Kravetz

> ---
>  tools/testing/selftests/memfd/fuse_test.c  | 38 
> --
>  tools/testing/selftests/memfd/run_fuse_test.sh |  2 +-
>  tools/testing/selftests/memfd/run_tests.sh |  1 +
>  3 files changed, 32 insertions(+), 9 deletions(-)
> 
> diff --git a/tools/testing/selftests/memfd/fuse_test.c 
> b/tools/testing/selftests/memfd/fuse_test.c
> index 795a25ba8521..b018e835737d 100644
> --- a/tools/testing/selftests/memfd/fuse_test.c
> +++ b/tools/testing/selftests/memfd/fuse_test.c
> @@ -38,6 +38,8 @@
>  #define MFD_DEF_SIZE 8192
>  #define STACK_SIZE 65536
>  
> +static size_t mfd_def_size = MFD_DEF_SIZE;
> +
>  static int mfd_assert_new(const char *name, loff_t sz, unsigned int flags)
>  {
>   int r, fd;
> @@ -123,7 +125,7 @@ static void *mfd_assert_mmap_shared(int fd)
>   void *p;
>  
>   p = mmap(NULL,
> -  MFD_DEF_SIZE,
> +  mfd_def_size,
>PROT_READ | PROT_WRITE,
>MAP_SHARED,
>fd,
> @@ -141,7 +143,7 @@ static void *mfd_assert_mmap_private(int fd)
>   void *p;
>  
>   p = mmap(NULL,
> -  MFD_DEF_SIZE,
> +  mfd_def_size,
>PROT_READ | PROT_WRITE,
>MAP_PRIVATE,
>fd,
> @@ -174,7 +176,7 @@ static int sealing_thread_fn(void *arg)
>   usleep(20);
>  
>   /* unmount mapping before sealing to avoid i_mmap_writable failures */
> - munmap(global_p, MFD_DEF_SIZE);
> + munmap(global_p, mfd_def_size);
>  
>   /* Try sealing the global file; expect EBUSY or success. Current
>* kernels will never succeed, but in the future, kernels might
> @@ -224,7 +226,7 @@ static void join_sealing_thread(pid_t pid)
>  
>  int main(int argc, char **argv)
>  {
> - static const char zero[MFD_DEF_SIZE];
> + char *zero;
>   int fd, mfd, r;
>   void *p;
>   int was_sealed;
> @@ -235,6 +237,25 @@ int main(int argc, char **argv)
>   abort();
>   }
>  
> + if (argc >= 3) {
> + if (!strcmp(argv[2], "hugetlbfs")) {
> + unsigned long hpage_size = default_huge_page_size();
> +
> + if (!hpage_size) {
> + printf("Unable to determine huge page size\n");
> + abort();
> + }
> +
> + hugetlbfs_test = 1;
> + mfd_def_size = hpage_size * 2;
> + } else {
> + printf("Unknown option: %s\n", argv[2]);
> + abort();
> + }
> + }
> +
> + zero = calloc(sizeof(*zero), mfd_def_size);
> +
>   /* open FUSE memfd file for GUP testing */
>   printf("opening: %s\n", argv[1]);
>   fd = open(argv[1], O_RDONLY | O_CLOEXEC);
> @@ -245,7 +266,7 @@ int main(int argc, char **argv)
>  
>   /* create new memfd-object */
>   mfd = mfd_assert_new("kern_memfd_fuse",
> -  MFD_DEF_SIZE,
> +  mfd_def_size,
>MFD_CLOEXEC | MFD_ALLOW_SEALING);
>  
>   /* mmap memfd-object for writing */
> @@ -264,7 +285,7 @@ int main(int argc, char **argv)
>* This guarantees that the receive-buffer is pinned for 1s until the
>* data is written into it. The racing ADD_SEALS should thus fail as
>* the pages are still pinned. */
> - r = read(fd, p, MFD_DEF_SIZE);
> + r = read(fd, p, mfd_def_size);
>   if (r < 0) {
>   printf("read() failed: %m\n");
>   abort();
> @@ -291,10 +312,10 @@ int main(int argc, char **argv)
>* enough to avoid any in-flight writes. */
>  
>   p = mfd_assert_mmap_private(mfd);
> - if (was_sealed && memcmp(p, zero, MFD_DEF_SIZE)) {
> + if (was_sealed && memcmp(p, zero, mfd_def_size)) {
>   printf("memfd sealed during read() but data not discarded\n");
>   abort();
> - } else if (!was_sealed && !memcmp(p, zero, MFD_DEF_SIZE)) {
> + } else if (!was_sealed && !memcmp(p, zero, mfd_def_size)) {
>   printf("memfd sealed after read() but data discarded\n");
>   abort();
>   }
> @@ -303,6 +324,7 @@ int main(int argc, char **argv)
>   close(fd);
>  
>   printf("fuse: DONE\n");
> + free(zero);
>  
>   retur

Re: [RFC PATCH 1/5] mm, hugetlb: unify core page allocation accounting and initialization

2017-12-12 Thread Mike Kravetz
On 12/04/2017 06:01 AM, Michal Hocko wrote:
> From: Michal Hocko 
> 
> hugetlb allocator has two entry points to the page allocator
> - alloc_fresh_huge_page_node
> - __hugetlb_alloc_buddy_huge_page
> 
> The two differ very subtly in two aspects. The first one doesn't care
> about HTLB_BUDDY_* stats and it doesn't initialize the huge page.
> prep_new_huge_page is not used because it not only initializes hugetlb
> specific stuff but because it also put_page and releases the page to
> the hugetlb pool which is not what is required in some contexts. This
> makes things more complicated than necessary.
> 
> Simplify things by a) removing the page allocator entry point duplicity
> and only keep __hugetlb_alloc_buddy_huge_page and b) make
> prep_new_huge_page more reusable by removing the put_page which moves
> the page to the allocator pool. All current callers are updated to call
> put_page explicitly. Later patches will add new callers which won't
> need it.
> 
> This patch shouldn't introduce any functional change.
> 
> Signed-off-by: Michal Hocko 

Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz

> ---
>  mm/hugetlb.c | 61 
> +---
>  1 file changed, 29 insertions(+), 32 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2c9033d39bfe..8189c92fac82 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1157,6 +1157,7 @@ static struct page 
> *alloc_fresh_gigantic_page_node(struct hstate *h, int nid)
>   if (page) {
>   prep_compound_gigantic_page(page, huge_page_order(h));
>   prep_new_huge_page(h, page, nid);
> + put_page(page); /* free it into the hugepage allocator */
>   }
>  
>   return page;
> @@ -1304,7 +1305,6 @@ static void prep_new_huge_page(struct hstate *h, struct 
> page *page, int nid)
>   h->nr_huge_pages++;
>   h->nr_huge_pages_node[nid]++;
>   spin_unlock(&hugetlb_lock);
> - put_page(page); /* free it into the hugepage allocator */
>  }
>  
>  static void prep_compound_gigantic_page(struct page *page, unsigned int 
> order)
> @@ -1381,41 +1381,49 @@ pgoff_t __basepage_index(struct page *page)
>   return (index << compound_order(page_head)) + compound_idx;
>  }
>  
> -static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
> +static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
> + gfp_t gfp_mask, int nid, nodemask_t *nmask)
>  {
> + int order = huge_page_order(h);
>   struct page *page;
>  
> - page = __alloc_pages_node(nid,
> - htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
> - 
> __GFP_RETRY_MAYFAIL|__GFP_NOWARN,
> - huge_page_order(h));
> - if (page) {
> - prep_new_huge_page(h, page, nid);
> - }
> + gfp_mask |= __GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
> + if (nid == NUMA_NO_NODE)
> + nid = numa_mem_id();
> + page = __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
> + if (page)
> + __count_vm_event(HTLB_BUDDY_PGALLOC);
> + else
> + __count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
>  
>   return page;
>  }
>  
> +/*
> + * Allocates a fresh page to the hugetlb allocator pool in the node 
> interleaved
> + * manner.
> + */
>  static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
>  {
>   struct page *page;
>   int nr_nodes, node;
> - int ret = 0;
> + gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>  
>   for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> - page = alloc_fresh_huge_page_node(h, node);
> - if (page) {
> - ret = 1;
> + page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask,
> + node, nodes_allowed);
> + if (page)
>   break;
> - }
> +
>   }
>  
> - if (ret)
> - count_vm_event(HTLB_BUDDY_PGALLOC);
> - else
> - count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
> + if (!page)
> + return 0;
>  
> - return ret;
> + prep_new_huge_page(h, page, page_to_nid(page));
> + put_page(page); /* free it into the hugepage allocator */
> +
> + return 1;
>  }
>  
>  /*
> @@ -1523,17 +1531,6 @@ int dissolve_free_huge_pages(unsigned long start_pfn, 
> unsigned long end_pfn)
>   return rc;
>  }
>  
> -static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
> - gfp_t gfp_mask, int nid, nodemas

Re: [RFC PATCH 2/5] mm, hugetlb: integrate giga hugetlb more naturally to the allocation path

2017-12-12 Thread Mike Kravetz
On 12/04/2017 06:01 AM, Michal Hocko wrote:
> From: Michal Hocko 
> 
> Gigantic hugetlb pages were ingrown to the hugetlb code as an alien
> species with a lot of special casing. The allocation path is not an
> exception. Unnecessarily so to be honest. It is true that the underlying
> allocator is different but that is an implementation detail.
> 
> This patch unifies the hugetlb allocation path that prepares fresh
> pool pages. alloc_fresh_gigantic_page basically copies alloc_fresh_huge_page
> logic so we can move everything there. This will simplify set_max_huge_pages
> which doesn't have to care about what kind of huge page we allocate.
> 
> Signed-off-by: Michal Hocko 

I agree with the analysis.  Thanks for cleaning this up.  There really is
no need for the separate allocation paths.

Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz

> ---
>  mm/hugetlb.c | 53 -
>  1 file changed, 12 insertions(+), 41 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8189c92fac82..ac105fb32620 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1106,7 +1106,8 @@ static bool zone_spans_last_pfn(const struct zone *zone,
>   return zone_spans_pfn(zone, last_pfn);
>  }
>  
> -static struct page *alloc_gigantic_page(int nid, struct hstate *h)
> +static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
> + int nid, nodemask_t *nodemask)
>  {
>   unsigned int order = huge_page_order(h);
>   unsigned long nr_pages = 1 << order;
> @@ -1114,11 +1115,9 @@ static struct page *alloc_gigantic_page(int nid, 
> struct hstate *h)
>   struct zonelist *zonelist;
>   struct zone *zone;
>   struct zoneref *z;
> - gfp_t gfp_mask;
>  
> - gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>   zonelist = node_zonelist(nid, gfp_mask);
> - for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), 
> NULL) {
> + for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), 
> nodemask) {
>   spin_lock_irqsave(&zone->lock, flags);
>  
>   pfn = ALIGN(zone->zone_start_pfn, nr_pages);
> @@ -1149,42 +1148,11 @@ static struct page *alloc_gigantic_page(int nid, 
> struct hstate *h)
>  static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
>  static void prep_compound_gigantic_page(struct page *page, unsigned int 
> order);
>  
> -static struct page *alloc_fresh_gigantic_page_node(struct hstate *h, int nid)
> -{
> - struct page *page;
> -
> - page = alloc_gigantic_page(nid, h);
> - if (page) {
> - prep_compound_gigantic_page(page, huge_page_order(h));
> - prep_new_huge_page(h, page, nid);
> - put_page(page); /* free it into the hugepage allocator */
> - }
> -
> - return page;
> -}
> -
> -static int alloc_fresh_gigantic_page(struct hstate *h,
> - nodemask_t *nodes_allowed)
> -{
> - struct page *page = NULL;
> - int nr_nodes, node;
> -
> - for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> - page = alloc_fresh_gigantic_page_node(h, node);
> - if (page)
> - return 1;
> - }
> -
> - return 0;
> -}
> -
>  #else /* !CONFIG_ARCH_HAS_GIGANTIC_PAGE */
>  static inline bool gigantic_page_supported(void) { return false; }
>  static inline void free_gigantic_page(struct page *page, unsigned int order) 
> { }
>  static inline void destroy_compound_gigantic_page(struct page *page,
>   unsigned int order) { }
> -static inline int alloc_fresh_gigantic_page(struct hstate *h,
> - nodemask_t *nodes_allowed) { return 0; }
>  #endif
>  
>  static void update_and_free_page(struct hstate *h, struct page *page)
> @@ -1410,8 +1378,12 @@ static int alloc_fresh_huge_page(struct hstate *h, 
> nodemask_t *nodes_allowed)
>   gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>  
>   for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> - page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask,
> - node, nodes_allowed);
> + if (hstate_is_gigantic(h))
> + page = alloc_gigantic_page(h, gfp_mask,
> + node, nodes_allowed);
> + else
> + page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask,
> + node, nodes_allowed);
>   if (page)
>   break;
>  
> @@ -1420,6 +1392,8 @@ static int a

Re: [RFC PATCH 3/5] mm, hugetlb: do not rely on overcommit limit during migration

2017-12-13 Thread Mike Kravetz
> delta)
>  retry:
>   spin_unlock(&hugetlb_lock);
>   for (i = 0; i < needed; i++) {
> - page = __alloc_buddy_huge_page(h, htlb_alloc_mask(h),
> + page = __alloc_surplus_huge_page(h, htlb_alloc_mask(h),
>   NUMA_NO_NODE, NULL);
>   if (!page) {
>   alloc_ok = false;
> @@ -2258,7 +2304,7 @@ static unsigned long set_max_huge_pages(struct hstate 
> *h, unsigned long count,
>* First take pages out of surplus state.  Then make up the
>* remaining difference by allocating fresh huge pages.
>*
> -  * We might race with __alloc_buddy_huge_page() here and be unable
> +  * We might race with __alloc_surplus_huge_page() here and be unable
>* to convert a surplus huge page to a normal huge page. That is
>* not critical, though, it just means the overall size of the
>* pool might be one hugepage larger than it needs to be, but
> @@ -2301,7 +2347,7 @@ static unsigned long set_max_huge_pages(struct hstate 
> *h, unsigned long count,
>* By placing pages into the surplus state independent of the
>* overcommit value, we are allowing the surplus pool size to
>* exceed overcommit. There are few sane options here. Since
> -  * __alloc_buddy_huge_page() is checking the global counter,
> +  * __alloc_surplus_huge_page() is checking the global counter,
>* though, we'll note that we're not allowed to exceed surplus
>* and won't grow the pool anywhere else. Not until one of the
>* sysctls are changed, or the surplus pages go out of use.
> @@ -4775,3 +4821,36 @@ void putback_active_hugepage(struct page *page)
>   spin_unlock(_lock);
>   put_page(page);
>  }
> +
> +void move_hugetlb_state(struct page *oldpage, struct page *newpage, int 
> reason)
> +{
> + struct hstate *h = page_hstate(oldpage);
> +
> + hugetlb_cgroup_migrate(oldpage, newpage);
> + set_page_owner_migrate_reason(newpage, reason);
> +
> + /*
> +  * transfer temporary state of the new huge page. This is
> +  * reverse to other transitions because the newpage is going to
> +  * be final while the old one will be freed so it takes over
> +  * the temporary status.
> +  *
> +  * Also note that we have to transfer the per-node surplus state
> +  * here as well otherwise the global surplus count will not match
> +  * the per-node's.
> +  */
> + if (PageHugeTemporary(newpage)) {
> + int old_nid = page_to_nid(oldpage);
> + int new_nid = page_to_nid(newpage);
> +
> + SetPageHugeTemporary(oldpage);
> + ClearPageHugeTemporary(newpage);
> +
> +     spin_lock(&hugetlb_lock);
> + if (h->surplus_huge_pages_node[old_nid]) {
> + h->surplus_huge_pages_node[old_nid]--;
> + h->surplus_huge_pages_node[new_nid]++;
> + }
> + spin_unlock(&hugetlb_lock);
> + }
> +}

In the previous version of this patch, I asked about handling of 'free' huge
pages.  I did a little digging and IIUC, we do not attempt migration of
free huge pages.  The routine isolate_huge_page() has this check:

if (!page_huge_active(page) || !get_page_unless_zero(page)) {
ret = false;
goto unlock;
}

I believe one of your motivations for this effort was memory offlining.
So, this implies that a memory area can not be offlined if it contains
a free (not in use) huge page?  Just FYI and may be something we want to
address later.

My other issues were addressed.

Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz

> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4d0be47a322a..1e5525a25691 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1323,9 +1323,8 @@ static int unmap_and_move_huge_page(new_page_t 
> get_new_page,
>   put_anon_vma(anon_vma);
>  
>   if (rc == MIGRATEPAGE_SUCCESS) {
> - hugetlb_cgroup_migrate(hpage, new_hpage);
> + move_hugetlb_state(hpage, new_hpage, reason);
>   put_new_page = NULL;
> - set_page_owner_migrate_reason(new_hpage, reason);
>   }
>  
>   unlock_page(hpage);
> 


Re: [RFC PATCH 4/5] mm, hugetlb: get rid of surplus page accounting tricks

2017-12-13 Thread Mike Kravetz
ULL);
> - /*
> -  * We incremented the global counters already
> -  */
>   h->nr_huge_pages_node[r_nid]++;
>   h->surplus_huge_pages_node[r_nid]++;
> - } else {
> - h->nr_huge_pages--;
> - h->surplus_huge_pages--;

In the case of a successful surplus allocation, the following counters
are incremented:

h->surplus_huge_pages_node[page_to_nid(page)]++;
h->surplus_huge_pages++;
h->nr_huge_pages_node[r_nid]++;
h->surplus_huge_pages_node[r_nid]++;

Looks like per-node surplus_huge_pages_node is incremented twice, and
global nr_huge_pages is not incremented at all.

Also, you removed r_nid so I'm guessing this will not compile?
-- 
Mike Kravetz


>   }
> +
> +out_unlock:
>   spin_unlock(_lock);
>  
>   return page;
> 


Re: [PATCH v2 1/5] hugetlb: use page.private for hugetlb specific page flags

2021-01-20 Thread Mike Kravetz
On 1/20/21 1:30 AM, Oscar Salvador wrote:
> On Tue, Jan 19, 2021 at 05:30:45PM -0800, Mike Kravetz wrote:
>> + * Macros to create test, set and clear function definitions for
>> + * hugetlb specific page flags.
>> + */
>> +#ifdef CONFIG_HUGETLB_PAGE
>> +#define TESTHPAGEFLAG(uname, flname)\
>> +static inline int HPage##uname(struct page *page)   \
>> +{ BUILD_BUG_ON(sizeof_field(struct page, private) * \
>> +BITS_PER_BYTE < __NR_HPAGEFLAGS);   \
>> +return test_bit(HPG_##flname, &(page->private)); }
>> +
>> +#define SETHPAGEFLAG(uname, flname) \
>> +static inline void SetHPage##uname(struct page *page)   \
>> +{ set_bit(HPG_##flname, &(page->private)); }
>> +
>> +#define CLEARHPAGEFLAG(uname, flname)   \
>> +static inline void ClearHPage##uname(struct page *page) \
>> +{ clear_bit(HPG_##flname, &(page->private)); }
>> +#else
>> +#define TESTHPAGEFLAG(uname, flname)\
>> +static inline int HPage##uname(struct page *page)   \
>> +{ BUILD_BUG_ON(sizeof_field(struct page, private) * \
>> +BITS_PER_BYTE < __NR_HPAGEFLAGS);   \
>> +return 0 }
> 
> You missed a ";" right there.

Thanks.  I made that typo when fixing up some trivial checkpatch warnings.
Lesson learned (again): nothing is too trivial to introduce errors.

> I might be missing something, but I do not think we need a BUILD_BUG_ON there
> when CONFIG_HUGETLB_PAGE is not set?
> Actually, would make more sense to move the BUILD_BUG_ON from above to
> hugetlb_init?

Yes, hugetlb_init is a better location for the BUILD_BUG_ON.  I initially
put it there before creating the !CONFIG_HUGETLB_PAGE version of the macros.
With the !CONFIG_HUGETLB_PAGE versions, none of the flags are ever used so
it is OK if the BUILD_BUG_ON is not included if !CONFIG_HUGETLB_PAGE.

-- 
Mike Kravetz

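To make the macro discussion above a bit more concrete, here is a
self-contained userspace analogue (mine, not the kernel code) of the
mechanism: per-page flag bits packed into a single unsigned long word with a
compile-time check that the flag count fits.  The real patch stores the bits
in page->private and uses BUILD_BUG_ON rather than _Static_assert; the flag
names below are just stand-ins.

#include <limits.h>
#include <stdio.h>

enum hpage_flags {
        HPG_RESTORE_RESERVE,
        HPG_MIGRATABLE,
        HPG_TEMPORARY,
        __NR_HPAGEFLAGS,
};

/* Compile-time check that all flags fit in one word (C11). */
_Static_assert(sizeof(unsigned long) * CHAR_BIT >= __NR_HPAGEFLAGS,
               "hugetlb flag bits do not fit in one word");

static inline void set_hpage_flag(unsigned long *word, enum hpage_flags f)
{
        *word |= 1UL << f;
}

static inline int test_hpage_flag(unsigned long word, enum hpage_flags f)
{
        return !!(word & (1UL << f));
}

int main(void)
{
        unsigned long private = 0;      /* stands in for page->private */

        set_hpage_flag(&private, HPG_TEMPORARY);
        printf("temporary=%d migratable=%d\n",
               test_hpage_flag(private, HPG_TEMPORARY),
               test_hpage_flag(private, HPG_MIGRATABLE));
        return 0;
}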

Re: [PATCH v2 4/5] hugetlb: convert PageHugeTemporary() to HPageTemporary flag

2021-01-20 Thread Mike Kravetz
On 1/20/21 2:09 AM, Oscar Salvador wrote:
> On Tue, Jan 19, 2021 at 05:30:48PM -0800, Mike Kravetz wrote:
>> Use new hugetlb specific HPageTemporary flag to replace the
>> PageHugeTemporary() interfaces.
>>
>> Signed-off-by: Mike Kravetz 
> 
> I would have added a brief comment explaining why it is ok to drop
> the PageHuge() check in PageHugeTemporary.
> AFAICS, the paths checking it already know they are handling with a 
> hugetlb page, but still it is better to mention it in the changelog
> in case someone wonders.

Thanks.

Since I have to do another version, I will add this.

-- 
Mike Kravetz


Re: [RFC PATCH 2/3] hugetlb: convert page_huge_active() to HPageMigratable flag

2021-01-15 Thread Mike Kravetz
On 1/15/21 9:43 AM, Mike Kravetz wrote:
> On 1/15/21 1:17 AM, Oscar Salvador wrote:
>> On Mon, Jan 11, 2021 at 01:01:51PM -0800, Mike Kravetz wrote:
>>> Use the new hugetlb page specific flag to replace the page_huge_active
>>> interfaces.  By it's name, page_huge_active implied that a huge page
>>> was on the active list.  However, that is not really what code checking
>>> the flag wanted to know.  It really wanted to determine if the huge
>>> page could be migrated.  This happens when the page is actually added
>>> the page cache and/or task page table.  This is the reasoning behind the
>>> name change.
>>>
>>> The VM_BUG_ON_PAGE() calls in the interfaces were not really necessary
>>> as in all case but one we KNOW the page is a hugetlb page.  Therefore,
>>> they are removed.  In one call to HPageMigratable() is it possible for
>>> the page to not be a hugetlb page due to a race.  However, the code
>>> making the call (scan_movable_pages) is inherently racy, and page state
>>> will be validated later in the migration process.
>>>
>>> Note:  Since HPageMigratable is used outside hugetlb.c, it can not be
>>> static.  Therefore, a new set of hugetlb page flag macros is added for
>>> non-static flag functions.
>>
>> Two things about this one:
>>
>> I am not sure about the name of this one.
>> It is true that page_huge_active() was only called by memory-hotplug and all
>> it wanted to know was whether the page was in-use and so if it made sense
>> to migrate it, so I see some value in the new PageMigratable flag.
>>
>> However, not all in-use hugetlb can be migrated, e.g: we might have 
>> constraints
>> when it comes to migrate certain sizes of hugetlb, right?
>> So setting HPageMigratable to all active hugetlb pages might be a bit 
>> misleading?
>> HPageActive maybe? (Sorry, don't have a replacement)
> 
> You concerns about the name change are correct.
> 
> The reason for the change came about from discussions about Muchun's series
> of fixes and the need for a new 'page is freed' status to fix a race.  In
> that discussion, Michal asked 'Why can't we simply set page_huge_active when
> the page is allocated and put on the active list?'.  That is mentioned above,
> but we really do not want to try and migrate pages after they are allocated
> and before they are in use.  That causes problems in the fault handling code.
> 
> Anyway, that is how the suggestion for Migration came about.
> 
> In that discussion David Hildenbrand noted that code in alloc_contig_range
> should migrate free hugetlb pages, but there is no support for that today.
> I plan to look at that if nobody else does.  When such code is added, the
> name 'Migratable' will become less applicable.
> 
> I'm not great at naming.  Perhaps 'In_Use' as a flag name might fit better.
> 

I went back and took a closer look.  Migration is the reason the existing
page_huge_active interfaces were introduced.  And, the only use of the
page_huge_active check is to determine if a page can be migrated.  So,
I think 'Migratable' may be the most suitable name.

To address the concern about not all hugetlb sizes are migratable, we can
just make a check before setting the flag.  This should even help in the
migration/offline paths as we will know sooner if the page can be
migrated or not.
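
For example, a small helper along these lines would do it (this is essentially
what the later "only set HPageMigratable for migratable hstates" patch below
does; sketch only):

    static inline void SetHPageMigratableIfSupported(struct page *page)
    {
        if (hugepage_migration_supported(page_hstate(page)))
            SetHPageMigratable(page);
    }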

We can address naming in the 'migrating free hugetlb pages' issue when
that code is written.
-- 
Mike Kravetz


Re: [RFC PATCH 2/3] hugetlb: convert page_huge_active() to HPageMigratable flag

2021-01-15 Thread Mike Kravetz
On 1/15/21 12:29 PM, Oscar Salvador wrote:
> 
> About that alloc_contig_range topic, I would like to take a look unless
> someone is already on it or about to be.
> 
> Thanks Mike for the time ;-)

Feel free.

My first thought is that migration of a free hugetlb page would need to
be something like:
1) allocate a fresh hugetlb page from buddy
2) free the 'migrated' free huge page back to buddy

I do not think we can use the existing 'isolate-migrate' flow.  Isolating
a page would make it unavailable for allocation and that could cause
application issues.  
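
A rough sketch of that flow, only to illustrate the idea; the helper names
marked hypothetical below do not exist, while enqueue_huge_page() and
update_and_free_page() are existing static helpers in mm/hugetlb.c:

    /* Sketch: 'migrate' a free huge page by replacing it in the pool. */
    static int replace_free_huge_page(struct hstate *h, struct page *old_page)
    {
        struct page *new_page;

        /* 1) allocate a fresh huge page for the pool first */
        new_page = alloc_replacement_huge_page(h);       /* hypothetical */
        if (!new_page)
            return -ENOMEM;

        spin_lock(&hugetlb_lock);
        remove_huge_page_from_freelist(h, old_page);     /* hypothetical */
        enqueue_huge_page(h, new_page);
        /* 2) free the 'migrated' free huge page back to buddy */
        update_and_free_page(h, old_page);
        spin_unlock(&hugetlb_lock);
        return 0;
    }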

-- 
Mike Kravetz


Re: [PATCH v2 2/5] hugetlb: convert page_huge_active() HPageMigratable flag

2021-01-20 Thread Mike Kravetz
On 1/20/21 2:00 AM, Oscar Salvador wrote:
> On Wed, Jan 20, 2021 at 10:59:05AM +0100, Oscar Salvador wrote:
>> On Tue, Jan 19, 2021 at 05:30:46PM -0800, Mike Kravetz wrote:
>>> Use the new hugetlb page specific flag HPageMigratable to replace the
>>> page_huge_active interfaces.  By it's name, page_huge_active implied
>>> that a huge page was on the active list.  However, that is not really
>>> what code checking the flag wanted to know.  It really wanted to determine
>>> if the huge page could be migrated.  This happens when the page is actually
>>> added the page cache and/or task page table.  This is the reasoning behind
>>> the name change.
>>>
>>> The VM_BUG_ON_PAGE() calls in the *_huge_active() interfaces are not
>>> really necessary as we KNOW the page is a hugetlb page.  Therefore, they
>>> are removed.
>>>
>>> The routine page_huge_active checked for PageHeadHuge before testing the
>>> active bit.  This is unnecessary in the case where we hold a reference or
>>> lock and know it is a hugetlb head page.  page_huge_active is also called
>>> without holding a reference or lock (scan_movable_pages), and can race with
>>> code freeing the page.  The extra check in page_huge_active shortened the
>>> race window, but did not prevent the race.  Offline code calling
>>> scan_movable_pages already deals with these races, so removing the check
>>> is acceptable.  Add comment to racy code.
>>>
>>> Signed-off-by: Mike Kravetz 
>>
>> Hi Mike,
>>
>> This comment addresses both this patch and the next one.
>>
>> Instead of putting the SetHPageMigratable flag spread over the
>> allocation paths, would it make more sense to place it in
>> alloc_huge_page before returning the page?
>> Then we could opencode SetHPageMigratableIfSupported right there.
> 
> and in putback_active_hugepage.


Hi Oscar,

In Muchun's series of hugetlb bug fixes, Michal asked the same question.

https://lore.kernel.org/linux-mm/7e69a55c-d501-6b42-8225-a677f09fb...@oracle.com/

The 'short answer' is that this would allow a page to be migrated
after allocation but before the page fault code adds it to the page
cache or page tables.  This actually caused bugs in the past.
-- 
Mike Kravetz


[PATCH 1/5] hugetlb: use page.private for hugetlb specific page flags

2021-01-15 Thread Mike Kravetz
As hugetlbfs evolved, state information about hugetlb pages was added.
One 'convenient' way of doing this was to use available fields in tail
pages.  Over time, it has become difficult to know the meaning or contents
of fields simply by looking at a small bit of code.  Sometimes, the
naming is just confusing.  For example: The PagePrivate flag indicates
a huge page reservation was consumed and needs to be restored if an error
is encountered and the page is freed before it is instantiated.  The
page.private field contains the pointer to a subpool if the page is
associated with one.

In an effort to make the code more readable, use page.private to contain
hugetlb specific flags.  A set of hugetlb_*_page_flag() routines are
created for flag manipulation.  More importantly, an enum of flag values
will be created with names that actually reflect their purpose.

In this patch,
- Create infrastructure for hugetlb_*_page_flag functions
- Move subpool pointer to page[1].private to make way for flags
  Create routines with meaningful names to modify subpool field
- Use new HP_Restore_Reserve flag instead of PagePrivate

Conversion of other state information will happen in subsequent patches.

Signed-off-by: Mike Kravetz 
---
 fs/hugetlbfs/inode.c| 12 ++--
 include/linux/hugetlb.h | 61 +
 mm/hugetlb.c| 46 +++
 3 files changed, 87 insertions(+), 32 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 740693d7f255..b8a661780c4a 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -955,15 +955,9 @@ static int hugetlbfs_migrate_page(struct address_space 
*mapping,
if (rc != MIGRATEPAGE_SUCCESS)
return rc;
 
-   /*
-* page_private is subpool pointer in hugetlb pages.  Transfer to
-* new page.  PagePrivate is not associated with page_private for
-* hugetlb pages and can not be set here as only page_huge_active
-* pages can be migrated.
-*/
-   if (page_private(page)) {
-   set_page_private(newpage, page_private(page));
-   set_page_private(page, 0);
+   if (hugetlb_page_subpool(page)) {
+   hugetlb_set_page_subpool(newpage, hugetlb_page_subpool(page));
+   hugetlb_set_page_subpool(page, NULL);
}
 
if (mode != MIGRATE_SYNC_NO_COPY)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ef5b144b8aac..64f8c7a64186 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -472,6 +472,19 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, 
unsigned long addr,
unsigned long flags);
 #endif /* HAVE_ARCH_HUGETLB_UNMAPPED_AREA */
 
+/*
+ * hugetlb page specific state flags.  These flags are located in page.private
+ * of the hugetlb head page.  The hugetlb_*_page_flag() routines should be used
+ * to manipulate these flags.
+ *
+ * HP_Restore_Reserve - Set when a hugetlb page consumes a reservation at
+ * allocation time.  Cleared when page is fully instantiated.  Free
+ * routine checks flag to restore a reservation on error paths.
+ */
+enum hugetlb_page_flags {
+   HP_Restore_Reserve = 0,
+};
+
 #ifdef CONFIG_HUGETLB_PAGE
 
 #define HSTATE_NAME_LEN 32
@@ -531,6 +544,38 @@ extern unsigned int default_hstate_idx;
 
 #define default_hstate (hstates[default_hstate_idx])
 
+static inline int hugetlb_test_page_flag(struct page *page,
+   enum hugetlb_page_flags hp_flag)
+{
+   return test_bit(hp_flag, &page->private);
+}
+
+static inline void hugetlb_set_page_flag(struct page *page,
+   enum hugetlb_page_flags hp_flag)
+{
+   return set_bit(hp_flag, &page->private);
+}
+
+static inline void hugetlb_clear_page_flag(struct page *page,
+   enum hugetlb_page_flags hp_flag)
+{
+   return clear_bit(hp_flag, &page->private);
+}
+
+/*
+ * hugetlb page subpool pointer located in hpage[1].private
+ */
+static inline struct hugepage_subpool *hugetlb_page_subpool(struct page *hpage)
+{
+   return (struct hugepage_subpool *)(hpage+1)->private;
+}
+
+static inline void hugetlb_set_page_subpool(struct page *hpage,
+   struct hugepage_subpool *subpool)
+{
+   set_page_private(hpage+1, (unsigned long)subpool);
+}
+
 static inline struct hstate *hstate_file(struct file *f)
 {
return hstate_inode(file_inode(f));
@@ -775,6 +820,22 @@ void set_page_huge_active(struct page *page);
 #else  /* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
+static inline int hugetlb_test_page_flag(struct page *page,
+   enum hugetlb_page_flags hp_flag)
+{
+   return 0;
+}
+
+static inline void hugetlb_set_page_flag(struct page *page,
+   enum hugetlb_page_flags hp_flag)
+{
+}
+
+static

[PATCH 0/5] create hugetlb flags to consolidate state

2021-01-15 Thread Mike Kravetz
While discussing a series of hugetlb fixes in [1], it became evident
that the hugetlb specific page state information is stored in a somewhat
haphazard manner.  Code dealing with state information would be easier
to read, understand and maintain if this information was stored in a
consistent manner.

This series uses page.private of the hugetlb head page for storing a
set of hugetlb specific page flags.  Routines are provided for test,
set and clear of the flags.

[1] https://lore.kernel.org/r/20210106084739.63318-1-songmuc...@bytedance.com

RFC -> PATCH
  Simplified to use a single set of flag manipulation routines (Oscar)
  Moved flags and routines to hugetlb.h (Muchun)
  Changed format of page flag names (Muchun)
  Changed subpool routine names (Matthew)
  More comments in code (Oscar)

Based on v5.11-rc3-mmotm-2021-01-12-01-57

Mike Kravetz (5):
  hugetlb: use page.private for hugetlb specific page flags
  hugetlb: convert page_huge_active() to HP_Migratable flag
  hugetlb: only set HP_Migratable for migratable hstates
  hugetlb: convert PageHugeTemporary() to HP_Temporary flag
  hugetlb: convert PageHugeFreed to HP_Freed flag

 fs/hugetlbfs/inode.c   |  14 +---
 include/linux/hugetlb.h|  81 
 include/linux/page-flags.h |   6 --
 mm/hugetlb.c   | 150 +++--
 mm/memory_hotplug.c|   8 +-
 mm/migrate.c   |  12 ---
 6 files changed, 137 insertions(+), 134 deletions(-)

-- 
2.29.2



[PATCH 2/5] hugetlb: convert page_huge_active() to HP_Migratable flag

2021-01-15 Thread Mike Kravetz
Use the new hugetlb page specific flag HP_Migratable to replace the
page_huge_active interfaces.  By its name, page_huge_active implied
that a huge page was on the active list.  However, that is not really
what code checking the flag wanted to know.  It really wanted to determine
if the huge page could be migrated.  This happens when the page is actually
added to the page cache and/or task page table.  This is the reasoning behind
the name change.

The VM_BUG_ON_PAGE() calls in the *_huge_active() interfaces are not
really necessary as we KNOW the page is a hugetlb page.  Therefore, they
are removed.

The routine page_huge_active checked for PageHeadHuge before testing the
active bit.  This is unnecessary in the case where we hold a reference or
lock and know it is a hugetlb head page.  page_huge_active is also called
without holding a reference or lock (scan_movable_pages), and can race with
code freeing the page.  The extra check in page_huge_active shortened the
race window, but did not prevent the race.  Offline code calling
scan_movable_pages already deals with these races, so removing the check
is acceptable.

Signed-off-by: Mike Kravetz 
---
 fs/hugetlbfs/inode.c   |  2 +-
 include/linux/hugetlb.h|  4 
 include/linux/page-flags.h |  6 -
 mm/hugetlb.c   | 45 ++
 mm/memory_hotplug.c|  8 ++-
 5 files changed, 23 insertions(+), 42 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index b8a661780c4a..89bc9062b4f6 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -735,7 +735,7 @@ static long hugetlbfs_fallocate(struct file *file, int 
mode, loff_t offset,
 
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 
-   set_page_huge_active(page);
+   hugetlb_set_page_flag(page, HP_Migratable);
/*
 * unlock_page because locked by add_to_page_cache()
 * put_page() due to reference from alloc_huge_page()
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 64f8c7a64186..353d81913cc7 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -480,9 +480,13 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, 
unsigned long addr,
  * HP_Restore_Reserve - Set when a hugetlb page consumes a reservation at
  * allocation time.  Cleared when page is fully instantiated.  Free
  * routine checks flag to restore a reservation on error paths.
+ * HP_Migratable - Set after a newly allocated page is added to the page
+ * cache and/or page tables.  Indicates the page is a candidate for
+ * migration.
  */
 enum hugetlb_page_flags {
HP_Restore_Reserve = 0,
+   HP_Migratable,
 };
 
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index bc6fd1ee7dd6..04a34c08e0a6 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -592,15 +592,9 @@ static inline void ClearPageCompound(struct page *page)
 #ifdef CONFIG_HUGETLB_PAGE
 int PageHuge(struct page *page);
 int PageHeadHuge(struct page *page);
-bool page_huge_active(struct page *page);
 #else
 TESTPAGEFLAG_FALSE(Huge)
 TESTPAGEFLAG_FALSE(HeadHuge)
-
-static inline bool page_huge_active(struct page *page)
-{
-   return 0;
-}
 #endif
 
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b01002d8fc2b..c43cebf2f278 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1353,30 +1353,6 @@ struct hstate *size_to_hstate(unsigned long size)
return NULL;
 }
 
-/*
- * Test to determine whether the hugepage is "active/in-use" (i.e. being linked
- * to hstate->hugepage_activelist.)
- *
- * This function can be called for tail pages, but never returns true for them.
- */
-bool page_huge_active(struct page *page)
-{
-   return PageHeadHuge(page) && PagePrivate(&page[1]);
-}
-
-/* never called for tail page */
-void set_page_huge_active(struct page *page)
-{
-   VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
-   SetPagePrivate(&page[1]);
-}
-
-static void clear_page_huge_active(struct page *page)
-{
-   VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
-   ClearPagePrivate(&page[1]);
-}
-
 /*
  * Internal hugetlb specific page flag. Do not use outside of the hugetlb
  * code
@@ -1438,7 +1414,7 @@ static void __free_huge_page(struct page *page)
}
 
spin_lock(&hugetlb_lock);
-   clear_page_huge_active(page);
+   hugetlb_clear_page_flag(page, HP_Migratable);
hugetlb_cgroup_uncharge_page(hstate_index(h),
 pages_per_huge_page(h), page);
hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
@@ -4221,7 +4197,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, 
struct vm_area_struct *vma,
make_huge_pte(vma, new_page, 1));
page_remove_rmap(old_page, true);
hugepage_add_new_anon_rmap(new_page, vma, haddr);
-   set_page

[PATCH 4/5] hugetlb: convert PageHugeTemporary() to HP_Temporary flag

2021-01-15 Thread Mike Kravetz
Use new hugetlb specific flag HP_Temporary flag to replace the
PageHugeTemporary() interfaces.

Signed-off-by: Mike Kravetz 
---
 include/linux/hugetlb.h |  5 +
 mm/hugetlb.c| 36 +++-
 2 files changed, 12 insertions(+), 29 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e7157cf9967f..166825c85875 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -483,10 +483,15 @@ unsigned long hugetlb_get_unmapped_area(struct file 
*file, unsigned long addr,
  * HP_Migratable - Set after a newly allocated page is added to the page
  * cache and/or page tables.  Indicates the page is a candidate for
  * migration.
+ * HP_Temporary - Set on a page that is temporarily allocated from the buddy
+ * allocator.  Typically used for migration target pages when no pages
+ * are available in the pool.  The hugetlb free page path will
+ * immediately free pages with this flag set to the buddy allocator.
  */
 enum hugetlb_page_flags {
HP_Restore_Reserve = 0,
HP_Migratable,
+   HP_Temporary,
 };
 
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 31e896c70ba0..53e9168a97bd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1353,28 +1353,6 @@ struct hstate *size_to_hstate(unsigned long size)
return NULL;
 }
 
-/*
- * Internal hugetlb specific page flag. Do not use outside of the hugetlb
- * code
- */
-static inline bool PageHugeTemporary(struct page *page)
-{
-   if (!PageHuge(page))
-   return false;
-
-   return (unsigned long)page[2].mapping == -1U;
-}
-
-static inline void SetPageHugeTemporary(struct page *page)
-{
-   page[2].mapping = (void *)-1U;
-}
-
-static inline void ClearPageHugeTemporary(struct page *page)
-{
-   page[2].mapping = NULL;
-}
-
 static void __free_huge_page(struct page *page)
 {
/*
@@ -1422,9 +1400,9 @@ static void __free_huge_page(struct page *page)
if (restore_reserve)
h->resv_huge_pages++;
 
-   if (PageHugeTemporary(page)) {
+   if (hugetlb_test_page_flag(page, HP_Temporary)) {
list_del(&page->lru);
-   ClearPageHugeTemporary(page);
+   hugetlb_clear_page_flag(page, HP_Temporary);
update_and_free_page(h, page);
} else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
@@ -1863,7 +1841,7 @@ static struct page *alloc_surplus_huge_page(struct hstate 
*h, gfp_t gfp_mask,
 * codeflow
 */
if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
-   SetPageHugeTemporary(page);
+   hugetlb_set_page_flag(page, HP_Temporary);
spin_unlock(&hugetlb_lock);
put_page(page);
return NULL;
@@ -1894,7 +1872,7 @@ static struct page *alloc_migrate_huge_page(struct hstate 
*h, gfp_t gfp_mask,
 * We do not account these pages as surplus because they are only
 * temporary and will be released properly on the last reference
 */
-   SetPageHugeTemporary(page);
+   hugetlb_set_page_flag(page, HP_Temporary);
 
return page;
 }
@@ -5608,12 +5586,12 @@ void move_hugetlb_state(struct page *oldpage, struct 
page *newpage, int reason)
 * here as well otherwise the global surplus count will not match
 * the per-node's.
 */
-   if (PageHugeTemporary(newpage)) {
+   if (hugetlb_test_page_flag(newpage, HP_Temporary)) {
int old_nid = page_to_nid(oldpage);
int new_nid = page_to_nid(newpage);
 
-   SetPageHugeTemporary(oldpage);
-   ClearPageHugeTemporary(newpage);
+   hugetlb_set_page_flag(oldpage, HP_Temporary);
+   hugetlb_clear_page_flag(newpage, HP_Temporary);
 
spin_lock(&hugetlb_lock);
if (h->surplus_huge_pages_node[old_nid]) {
-- 
2.29.2



[PATCH 3/5] hugetlb: only set HP_Migratable for migratable hstates

2021-01-15 Thread Mike Kravetz
The HP_Migratable flag indicates a page is a candidate for migration.
Only set the flag if the page's hstate supports migration.  This allows
the migration paths to detect non-migratable pages earlier.  The check
in unmap_and_move_huge_page for migration support can be removed as it
is no longer necessary.  If migration is not supported for the hstate,
HP_Migratable will not be set, the page will not be isolated and no
attempt will be made to migrate.

Signed-off-by: Mike Kravetz 
---
 fs/hugetlbfs/inode.c|  2 +-
 include/linux/hugetlb.h |  9 +
 mm/hugetlb.c|  8 
 mm/migrate.c| 12 
 4 files changed, 14 insertions(+), 17 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 89bc9062b4f6..14d77d01e38d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -735,7 +735,7 @@ static long hugetlbfs_fallocate(struct file *file, int 
mode, loff_t offset,
 
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 
-   hugetlb_set_page_flag(page, HP_Migratable);
+   hugetlb_set_HP_Migratable(page);
/*
 * unlock_page because locked by add_to_page_cache()
 * put_page() due to reference from alloc_huge_page()
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 353d81913cc7..e7157cf9967f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -716,6 +716,15 @@ static inline bool hugepage_migration_supported(struct 
hstate *h)
return arch_hugetlb_migration_supported(h);
 }
 
+/*
+ * Only set flag if hstate supports migration
+ */
+static inline void hugetlb_set_HP_Migratable(struct page *page)
+{
+   if (hugepage_migration_supported(page_hstate(page)))
+   hugetlb_set_page_flag(page, HP_Migratable);
+}
+
 /*
  * Movability check is different as compared to migration check.
  * It determines whether or not a huge page should be placed on
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c43cebf2f278..31e896c70ba0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4197,7 +4197,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, 
struct vm_area_struct *vma,
make_huge_pte(vma, new_page, 1));
page_remove_rmap(old_page, true);
hugepage_add_new_anon_rmap(new_page, vma, haddr);
-   hugetlb_set_page_flag(new_page, HP_Migratable);
+   hugetlb_set_HP_Migratable(new_page);
/* Make the old page be freed below */
new_page = old_page;
}
@@ -4439,7 +4439,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 * been isolated for migration.
 */
if (new_page)
-   hugetlb_set_page_flag(page, HP_Migratable);
+   hugetlb_set_HP_Migratable(page);
 
unlock_page(page);
 out:
@@ -4750,7 +4750,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
update_mmu_cache(dst_vma, dst_addr, dst_pte);
 
spin_unlock(ptl);
-   hugetlb_set_page_flag(page, HP_Migratable);
+   hugetlb_set_HP_Migratable(page);
if (vm_shared)
unlock_page(page);
ret = 0;
@@ -5585,7 +5585,7 @@ void putback_active_hugepage(struct page *page)
 {
VM_BUG_ON_PAGE(!PageHead(page), page);
spin_lock(&hugetlb_lock);
-   hugetlb_set_page_flag(page, HP_Migratable);
+   hugetlb_set_HP_Migratable(page);
list_move_tail(&page->lru, &(page_hstate(page))->hugepage_activelist);
spin_unlock(&hugetlb_lock);
put_page(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index 0339f3874d7c..296d61613abc 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1272,18 +1272,6 @@ static int unmap_and_move_huge_page(new_page_t 
get_new_page,
struct anon_vma *anon_vma = NULL;
struct address_space *mapping = NULL;
 
-   /*
-* Migratability of hugepages depends on architectures and their size.
-* This check is necessary because some callers of hugepage migration
-* like soft offline and memory hotremove don't walk through page
-* tables or check whether the hugepage is pmd-based or not before
-* kicking migration.
-*/
-   if (!hugepage_migration_supported(page_hstate(hpage))) {
-   list_move_tail(&hpage->lru, ret);
-   return -ENOSYS;
-   }
-
if (page_count(hpage) == 1) {
/* page was freed from under us. So we are done. */
putback_active_hugepage(hpage);
-- 
2.29.2



[PATCH 5/5] hugetlb: convert PageHugeFreed to HP_Freed flag

2021-01-15 Thread Mike Kravetz
Use new hugetlb specific flag HP_Freed flag to replace the
PageHugeFreed interfaces.

Signed-off-by: Mike Kravetz 
---
 include/linux/hugetlb.h |  2 ++
 mm/hugetlb.c| 23 ---
 2 files changed, 6 insertions(+), 19 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 166825c85875..5c99969fbbd6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -487,11 +487,13 @@ unsigned long hugetlb_get_unmapped_area(struct file 
*file, unsigned long addr,
  * allocator.  Typically used for migration target pages when no pages
  * are available in the pool.  The hugetlb free page path will
  * immediately free pages with this flag set to the buddy allocator.
+ * HP_Freed - Set when page is on the free lists.
  */
 enum hugetlb_page_flags {
HP_Restore_Reserve = 0,
HP_Migratable,
HP_Temporary,
+   HP_Freed,
 };
 
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 53e9168a97bd..073137b32657 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -79,21 +79,6 @@ DEFINE_SPINLOCK(hugetlb_lock);
 static int num_fault_mutexes;
 struct mutex *hugetlb_fault_mutex_table cacheline_aligned_in_smp;
 
-static inline bool PageHugeFreed(struct page *head)
-{
-   return page_private(head + 4) == -1UL;
-}
-
-static inline void SetPageHugeFreed(struct page *head)
-{
-   set_page_private(head + 4, -1UL);
-}
-
-static inline void ClearPageHugeFreed(struct page *head)
-{
-   set_page_private(head + 4, 0);
-}
-
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
 
@@ -1043,7 +1028,7 @@ static void enqueue_huge_page(struct hstate *h, struct 
page *page)
list_move(&page->lru, &h->hugepage_freelists[nid]);
h->free_huge_pages++;
h->free_huge_pages_node[nid]++;
-   SetPageHugeFreed(page);
+   hugetlb_set_page_flag(page, HP_Freed);
 }
 
 static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
@@ -1060,7 +1045,7 @@ static struct page *dequeue_huge_page_node_exact(struct 
hstate *h, int nid)
 
list_move(&page->lru, &h->hugepage_activelist);
set_page_refcounted(page);
-   ClearPageHugeFreed(page);
+   hugetlb_clear_page_flag(page, HP_Freed);
h->free_huge_pages--;
h->free_huge_pages_node[nid]--;
return page;
@@ -1474,7 +1459,7 @@ static void prep_new_huge_page(struct hstate *h, struct 
page *page, int nid)
spin_lock(&hugetlb_lock);
h->nr_huge_pages++;
h->nr_huge_pages_node[nid]++;
-   ClearPageHugeFreed(page);
+   hugetlb_clear_page_flag(page, HP_Freed);
spin_unlock(&hugetlb_lock);
 }
 
@@ -1747,7 +1732,7 @@ int dissolve_free_huge_page(struct page *page)
 * We should make sure that the page is already on the free list
 * when it is dissolved.
 */
-   if (unlikely(!PageHugeFreed(head))) {
+   if (unlikely(!hugetlb_test_page_flag(head, HP_Freed))) {
rc = -EAGAIN;
goto out;
}
-- 
2.29.2



Re: [PATCH 2/5] hugetlb: convert page_huge_active() to HP_Migratable flag

2021-01-16 Thread Mike Kravetz
On 1/16/21 4:06 AM, Oscar Salvador wrote:
> On Sat, Jan 16, 2021 at 04:24:16AM +, Matthew Wilcox wrote:
>> and name these HPG_restore_reserve and HPG_migratable
>>
>> and generate the calls to hugetlb_set_page_flag etc from macros, eg:
>>
>> #define TESTHPAGEFLAG(uname, lname)  \
>> static __always_inline bool HPage##uname(struct page *page)  \
>> { return test_bit(HPG_##lname, &page->private); }
>> ...
>> #define HPAGEFLAG(uname, lname)  
>> \
>>  TESTHPAGEFLAG(uname, lname) \
>>  SETHPAGEFLAG(uname, lname)  \
>>  CLEARHPAGEFLAG(uname, lname)
>>
>> HPAGEFLAG(RestoreReserve, restore_reserve)
>> HPAGEFLAG(Migratable, migratable)
>>
>> just to mirror page-flags.h more closely.
> 
> That is on me.
> I thought that given the low number of flags, we coud get away with:
> 
> hugetlb_{set,test,clear}_page_flag(page, flag)
> 
> and call it from the code.
> But some of the flags need to be set/tested outside hugetlb code, so
> it indeed looks nicer and more consistent to follow page-flags.h convention.
> 
> Sorry for the noise.

Thanks everyone!

I was unsure about the best way to go for this.  Will send out a new version
in a few days using the page-flag style macros.

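For illustration, with that style each HPAGEFLAG(uname, lname) invocation
expands to a trio of per-flag helpers roughly like this (sketch based on
Matthew's example above and the v3 patch later in this thread):

    /* HPAGEFLAG(Migratable, migratable) generates, approximately: */
    static inline int HPageMigratable(struct page *page)
        { return test_bit(HPG_migratable, &page->private); }
    static inline void SetHPageMigratable(struct page *page)
        { set_bit(HPG_migratable, &page->private); }
    static inline void ClearHPageMigratable(struct page *page)
        { clear_bit(HPG_migratable, &page->private); }
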
-- 
Mike Kravetz


[PATCH v3 1/5] hugetlb: use page.private for hugetlb specific page flags

2021-01-22 Thread Mike Kravetz
As hugetlbfs evolved, state information about hugetlb pages was added.
One 'convenient' way of doing this was to use available fields in tail
pages.  Over time, it has become difficult to know the meaning or contents
of fields simply by looking at a small bit of code.  Sometimes, the
naming is just confusing.  For example: The PagePrivate flag indicates
a huge page reservation was consumed and needs to be restored if an error
is encountered and the page is freed before it is instantiated.  The
page.private field contains the pointer to a subpool if the page is
associated with one.

In an effort to make the code more readable, use page.private to contain
hugetlb specific page flags.  These flags will have test, set and clear
functions similar to those used for 'normal' page flags.  More importantly,
an enum of flag values will be created with names that actually reflect
their purpose.

In this patch,
- Create infrastructure for hugetlb specific page flag functions
- Move subpool pointer to page[1].private to make way for flags
  Create routines with meaningful names to modify subpool field
- Use new HPageRestoreReserve flag instead of PagePrivate

Conversion of other state information will happen in subsequent patches.

Signed-off-by: Mike Kravetz 
---
 fs/hugetlbfs/inode.c| 12 ++--
 include/linux/hugetlb.h | 68 +
 mm/hugetlb.c| 48 +++--
 3 files changed, 96 insertions(+), 32 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 21c20fd5f9ee..b00801fd6002 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -966,15 +966,9 @@ static int hugetlbfs_migrate_page(struct address_space 
*mapping,
if (rc != MIGRATEPAGE_SUCCESS)
return rc;
 
-   /*
-* page_private is subpool pointer in hugetlb pages.  Transfer to
-* new page.  PagePrivate is not associated with page_private for
-* hugetlb pages and can not be set here as only page_huge_active
-* pages can be migrated.
-*/
-   if (page_private(page)) {
-   set_page_private(newpage, page_private(page));
-   set_page_private(page, 0);
+   if (hugetlb_page_subpool(page)) {
+   hugetlb_set_page_subpool(newpage, hugetlb_page_subpool(page));
+   hugetlb_set_page_subpool(page, NULL);
}
 
if (mode != MIGRATE_SYNC_NO_COPY)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b5807f23caf8..a7eb05315c6e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -472,6 +472,60 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, 
unsigned long addr,
unsigned long flags);
 #endif /* HAVE_ARCH_HUGETLB_UNMAPPED_AREA */
 
+/*
+ * hugetlb page specific state flags.  These flags are located in page.private
+ * of the hugetlb head page.  Functions created via the below macros should be
+ * used to manipulate these flags.
+ *
+ * HPG_restore_reserve - Set when a hugetlb page consumes a reservation at
+ * allocation time.  Cleared when page is fully instantiated.  Free
+ * routine checks flag to restore a reservation on error paths.
+ */
+enum hugetlb_page_flags {
+   HPG_restore_reserve = 0,
+   __NR_HPAGEFLAGS,
+};
+
+/*
+ * Macros to create test, set and clear function definitions for
+ * hugetlb specific page flags.
+ */
+#ifdef CONFIG_HUGETLB_PAGE
+#define TESTHPAGEFLAG(uname, flname)   \
+static inline int HPage##uname(struct page *page)  \
+   { return test_bit(HPG_##flname, &(page->private)); }
+
+#define SETHPAGEFLAG(uname, flname)\
+static inline void SetHPage##uname(struct page *page)  \
+   { set_bit(HPG_##flname, &(page->private)); }
+
+#define CLEARHPAGEFLAG(uname, flname)  \
+static inline void ClearHPage##uname(struct page *page)\
+   { clear_bit(HPG_##flname, &(page->private)); }
+#else
+#define TESTHPAGEFLAG(uname, flname)   \
+static inline int HPage##uname(struct page *page)  \
+   { return 0; }
+
+#define SETHPAGEFLAG(uname, flname)\
+static inline void SetHPage##uname(struct page *page)  \
+   { }
+
+#define CLEARHPAGEFLAG(uname, flname)  \
+static inline void ClearHPage##uname(struct page *page)\
+   { }
+#endif
+
+#define HPAGEFLAG(uname, flname)   \
+   TESTHPAGEFLAG(uname, flname)\
+   SETHPAGEFLAG(uname, flname) \
+   CLEARHPAGEFLAG(uname, flname)   \
+
+/*
+ * Create functions associated with hugetlb page flags
+ */
+HPAGEFLAG(RestoreReserve, restore_reserve)
+
 #ifdef CONFIG_HUGETLB_PAGE
 
 #define HSTATE_NAME_LEN 32
@@ -531,6 +585,

[PATCH v3 5/5] hugetlb: convert PageHugeFreed to HPageFreed flag

2021-01-22 Thread Mike Kravetz
Use new hugetlb specific HPageFreed flag to replace the
PageHugeFreed interfaces.

Signed-off-by: Mike Kravetz 
Reviewed-by: Oscar Salvador 
Reviewed-by: Muchun Song 
---
 include/linux/hugetlb.h |  3 +++
 mm/hugetlb.c| 23 ---
 2 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3c86c3a0e144..f09085c0afea 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -487,11 +487,13 @@ unsigned long hugetlb_get_unmapped_area(struct file 
*file, unsigned long addr,
  * allocator.  Typically used for migration target pages when no pages
  * are available in the pool.  The hugetlb free page path will
  * immediately free pages with this flag set to the buddy allocator.
+ * HPG_freed - Set when page is on the free lists.
  */
 enum hugetlb_page_flags {
HPG_restore_reserve = 0,
HPG_migratable,
HPG_temporary,
+   HPG_freed,
__NR_HPAGEFLAGS,
 };
 
@@ -536,6 +538,7 @@ static inline void ClearHPage##uname(struct page *page) 
\
 HPAGEFLAG(RestoreReserve, restore_reserve)
 HPAGEFLAG(Migratable, migratable)
 HPAGEFLAG(Temporary, temporary)
+HPAGEFLAG(Freed, freed)
 
 #ifdef CONFIG_HUGETLB_PAGE
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 70ffa1027988..c1f283cee0a9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -79,21 +79,6 @@ DEFINE_SPINLOCK(hugetlb_lock);
 static int num_fault_mutexes;
 struct mutex *hugetlb_fault_mutex_table cacheline_aligned_in_smp;
 
-static inline bool PageHugeFreed(struct page *head)
-{
-   return page_private(head + 4) == -1UL;
-}
-
-static inline void SetPageHugeFreed(struct page *head)
-{
-   set_page_private(head + 4, -1UL);
-}
-
-static inline void ClearPageHugeFreed(struct page *head)
-{
-   set_page_private(head + 4, 0);
-}
-
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
 
@@ -1043,7 +1028,7 @@ static void enqueue_huge_page(struct hstate *h, struct 
page *page)
list_move(&page->lru, &h->hugepage_freelists[nid]);
h->free_huge_pages++;
h->free_huge_pages_node[nid]++;
-   SetPageHugeFreed(page);
+   SetHPageFreed(page);
 }
 
 static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
@@ -1060,7 +1045,7 @@ static struct page *dequeue_huge_page_node_exact(struct 
hstate *h, int nid)
 
list_move(&page->lru, &h->hugepage_activelist);
set_page_refcounted(page);
-   ClearPageHugeFreed(page);
+   ClearHPageFreed(page);
h->free_huge_pages--;
h->free_huge_pages_node[nid]--;
return page;
@@ -1474,7 +1459,7 @@ static void prep_new_huge_page(struct hstate *h, struct 
page *page, int nid)
spin_lock(&hugetlb_lock);
h->nr_huge_pages++;
h->nr_huge_pages_node[nid]++;
-   ClearPageHugeFreed(page);
+   ClearHPageFreed(page);
spin_unlock(&hugetlb_lock);
 }
 
@@ -1747,7 +1732,7 @@ int dissolve_free_huge_page(struct page *page)
 * We should make sure that the page is already on the free list
 * when it is dissolved.
 */
-   if (unlikely(!PageHugeFreed(head))) {
+   if (unlikely(!HPageFreed(head))) {
spin_unlock(&hugetlb_lock);
cond_resched();
 
-- 
2.29.2



[PATCH v3 4/5] hugetlb: convert PageHugeTemporary() to HPageTemporary flag

2021-01-22 Thread Mike Kravetz
Use new hugetlb specific HPageTemporary flag to replace the
PageHugeTemporary() interfaces.  PageHugeTemporary does contain
a PageHuge() check.  However, this interface is only used within
hugetlb code where we know we are dealing with a hugetlb page.
Therefore, the check can be eliminated.

Signed-off-by: Mike Kravetz 
Reviewed-by: Oscar Salvador 
---
 include/linux/hugetlb.h |  6 ++
 mm/hugetlb.c| 36 +++-
 2 files changed, 13 insertions(+), 29 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index cd1960541f2a..3c86c3a0e144 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -483,10 +483,15 @@ unsigned long hugetlb_get_unmapped_area(struct file 
*file, unsigned long addr,
  * HPG_migratable  - Set after a newly allocated page is added to the page
  * cache and/or page tables.  Indicates the page is a candidate for
  * migration.
+ * HPG_temporary - Set on a page that is temporarily allocated from the buddy
+ * allocator.  Typically used for migration target pages when no pages
+ * are available in the pool.  The hugetlb free page path will
+ * immediately free pages with this flag set to the buddy allocator.
  */
 enum hugetlb_page_flags {
HPG_restore_reserve = 0,
HPG_migratable,
+   HPG_temporary,
__NR_HPAGEFLAGS,
 };
 
@@ -530,6 +535,7 @@ static inline void ClearHPage##uname(struct page *page) 
\
  */
 HPAGEFLAG(RestoreReserve, restore_reserve)
 HPAGEFLAG(Migratable, migratable)
+HPAGEFLAG(Temporary, temporary)
 
 #ifdef CONFIG_HUGETLB_PAGE
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4da1a29ac5e2..70ffa1027988 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1353,28 +1353,6 @@ struct hstate *size_to_hstate(unsigned long size)
return NULL;
 }
 
-/*
- * Internal hugetlb specific page flag. Do not use outside of the hugetlb
- * code
- */
-static inline bool PageHugeTemporary(struct page *page)
-{
-   if (!PageHuge(page))
-   return false;
-
-   return (unsigned long)page[2].mapping == -1U;
-}
-
-static inline void SetPageHugeTemporary(struct page *page)
-{
-   page[2].mapping = (void *)-1U;
-}
-
-static inline void ClearPageHugeTemporary(struct page *page)
-{
-   page[2].mapping = NULL;
-}
-
 static void __free_huge_page(struct page *page)
 {
/*
@@ -1422,9 +1400,9 @@ static void __free_huge_page(struct page *page)
if (restore_reserve)
h->resv_huge_pages++;
 
-   if (PageHugeTemporary(page)) {
+   if (HPageTemporary(page)) {
list_del(&page->lru);
-   ClearPageHugeTemporary(page);
+   ClearHPageTemporary(page);
update_and_free_page(h, page);
} else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
@@ -1860,7 +1838,7 @@ static struct page *alloc_surplus_huge_page(struct hstate 
*h, gfp_t gfp_mask,
 * codeflow
 */
if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
-   SetPageHugeTemporary(page);
+   SetHPageTemporary(page);
spin_unlock(&hugetlb_lock);
put_page(page);
return NULL;
@@ -1891,7 +1869,7 @@ static struct page *alloc_migrate_huge_page(struct hstate 
*h, gfp_t gfp_mask,
 * We do not account these pages as surplus because they are only
 * temporary and will be released properly on the last reference
 */
-   SetPageHugeTemporary(page);
+   SetHPageTemporary(page);
 
return page;
 }
@@ -5612,12 +5590,12 @@ void move_hugetlb_state(struct page *oldpage, struct 
page *newpage, int reason)
 * here as well otherwise the global surplus count will not match
 * the per-node's.
 */
-   if (PageHugeTemporary(newpage)) {
+   if (HPageTemporary(newpage)) {
int old_nid = page_to_nid(oldpage);
int new_nid = page_to_nid(newpage);
 
-   SetPageHugeTemporary(oldpage);
-   ClearPageHugeTemporary(newpage);
+   SetHPageTemporary(oldpage);
+   ClearHPageTemporary(newpage);
 
spin_lock(&hugetlb_lock);
if (h->surplus_huge_pages_node[old_nid]) {
-- 
2.29.2



[PATCH v3 0/5] create hugetlb flags to consolidate state

2021-01-22 Thread Mike Kravetz
While discussing a series of hugetlb fixes in [1], it became evident
that the hugetlb specific page state information is stored in a somewhat
haphazard manner.  Code dealing with state information would be easier
to read, understand and maintain if this information was stored in a
consistent manner.

This series uses page.private of the hugetlb head page for storing a
set of hugetlb specific page flags.  Routines are provided for test,
set and clear of the flags.

[1] https://lore.kernel.org/r/20210106084739.63318-1-songmuc...@bytedance.com

v2 -> v3
  Fixed !CONFIG_HUGETLB_PAGE build bug and moved BUILD_BUG_ON to
  hugetlb_init (Oscar)
  Enhanced patch 3 (HPageTemporary flag) commit message (Oscar)
  Fixed typo and enhanced comment in scan_movable_pages (Miaohe)
Patch -> v2
  Went back to functions similar to 'normal' page flags (Matthew/Muchun)
  Decided to leave check in unmap_and_move_huge_page and print warning
RFC -> PATCH
  Simplified to use a single set of flag manipulation routines (Oscar)
  Moved flags and routines to hugetlb.h (Muchun)
  Changed format of page flag names (Muchun)
  Changed subpool routine names (Matthew)
  More comments in code (Oscar)

Based on v5.11-rc4-mmotm-2021-01-21-20-07

Mike Kravetz (5):
  hugetlb: use page.private for hugetlb specific page flags
  hugetlb: convert page_huge_active() HPageMigratable flag
  hugetlb: only set HPageMigratable for migratable hstates
  hugetlb: convert PageHugeTemporary() to HPageTemporary flag
  hugetlb: convert PageHugeFreed to HPageFreed flag

 fs/hugetlbfs/inode.c   |  14 +---
 include/linux/hugetlb.h|  91 ++
 include/linux/page-flags.h |   6 --
 mm/hugetlb.c   | 152 -
 mm/memory_hotplug.c|   9 ++-
 mm/migrate.c   |   9 +--
 6 files changed, 154 insertions(+), 127 deletions(-)

-- 
2.29.2



Re: [PATCH v13 03/12] mm: hugetlb: free the vmemmap pages associated with each HugeTLB page

2021-01-22 Thread Mike Kravetz
t; entry.
> + * @reuse_page:  the page which is reused for the tail vmemmap 
> pages.
> + * @reuse_addr:  the virtual address of the @reuse_page page.
> + * @vmemmap_pages:   the list head of the vmemmap pages that can be freed.
> + */
> +struct vmemmap_remap_walk {
> + void (*remap_pte)(pte_t *pte, unsigned long addr,
> +   struct vmemmap_remap_walk *walk);
> + struct page *reuse_page;
> + unsigned long reuse_addr;
> + struct list_head *vmemmap_pages;
> +};
> +
> +static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> +   unsigned long end,
> +   struct vmemmap_remap_walk *walk)
> +{
> + pte_t *pte;
> +
> + pte = pte_offset_kernel(pmd, addr);
> +
> + /*
> +  * The reuse_page is found 'first' in table walk before we start
> +  * remapping (which is calling @walk->remap_pte).
> +  */
> + if (walk->reuse_addr == addr) {
> + BUG_ON(pte_none(*pte));
> +
> + walk->reuse_page = pte_page(*pte++);
> + /*
> +  * Becasue the reuse address is part of the range that we are
> +  * walking, skip the reuse address range.
> +  */
> + addr += PAGE_SIZE;
> + }
> +
> + for (; addr != end; addr += PAGE_SIZE, pte++) {
> + BUG_ON(pte_none(*pte));
> +
> + walk->remap_pte(pte, addr, walk);
> + }
> +}
> +
> +static void vmemmap_pmd_range(pud_t *pud, unsigned long addr,
> +   unsigned long end,
> +   struct vmemmap_remap_walk *walk)
> +{
> + pmd_t *pmd;
> + unsigned long next;
> +
> + pmd = pmd_offset(pud, addr);
> + do {
> + BUG_ON(pmd_none(*pmd));
> +
> + next = pmd_addr_end(addr, end);
> + vmemmap_pte_range(pmd, addr, next, walk);
> + } while (pmd++, addr = next, addr != end);
> +}
> +
> +static void vmemmap_pud_range(p4d_t *p4d, unsigned long addr,
> +   unsigned long end,
> +   struct vmemmap_remap_walk *walk)
> +{
> + pud_t *pud;
> + unsigned long next;
> +
> + pud = pud_offset(p4d, addr);
> + do {
> + BUG_ON(pud_none(*pud));
> +
> + next = pud_addr_end(addr, end);
> + vmemmap_pmd_range(pud, addr, next, walk);
> + } while (pud++, addr = next, addr != end);
> +}
> +
> +static void vmemmap_p4d_range(pgd_t *pgd, unsigned long addr,
> +   unsigned long end,
> +   struct vmemmap_remap_walk *walk)
> +{
> + p4d_t *p4d;
> + unsigned long next;
> +
> + p4d = p4d_offset(pgd, addr);
> + do {
> + BUG_ON(p4d_none(*p4d));
> +
> + next = p4d_addr_end(addr, end);
> + vmemmap_pud_range(p4d, addr, next, walk);
> + } while (p4d++, addr = next, addr != end);
> +}
> +
> +static void vmemmap_remap_range(unsigned long start, unsigned long end,
> + struct vmemmap_remap_walk *walk)
> +{
> + unsigned long addr = start;
> + unsigned long next;
> + pgd_t *pgd;
> +
> + VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE));
> + VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE));
> +
> + pgd = pgd_offset_k(addr);
> + do {
> + BUG_ON(pgd_none(*pgd));
> +
> + next = pgd_addr_end(addr, end);
> + vmemmap_p4d_range(pgd, addr, next, walk);
> + } while (pgd++, addr = next, addr != end);
> +
> + /*
> +  * We do not change the mapping of the vmemmap virtual address range
> +  * [@start, @start + PAGE_SIZE) which is belong to the reuse range.
> +  * So we not need to flush the TLB.
> +  */
> + flush_tlb_kernel_range(start - PAGE_SIZE, end);
> +}
> +
> +/*
> + * Free a vmemmap page. A vmemmap page can be allocated from the memblock
> + * allocator or buddy allocator. If the PG_reserved flag is set, it means
> + * that it allocated from the memblock allocator, just free it via the
> + * free_bootmem_page(). Otherwise, use __free_page().
> + */
> +static inline void free_vmemmap_page(struct page *page)
> +{
> + if (PageReserved(page))
> + free_bootmem_page(page);
> + else
> + __free_page(page);
> +}
> +
> +/* Free a list of the vmemmap pages */
> +static void free_vmemmap_page_list(struct list_head *list)
> +{
> + struct page *page, *next;
> +
> + list_for_each_entry_safe(page, next, list, lru) {
> + list_del(&page->lru);
> +       

Re: [PATCH v2 2/5] hugetlb: convert page_huge_active() HPageMigratable flag

2021-01-22 Thread Mike Kravetz
On 1/21/21 10:53 PM, Miaohe Lin wrote:
> Hi:
> On 2021/1/20 9:30, Mike Kravetz wrote:
>> Use the new hugetlb page specific flag HPageMigratable to replace the
>> page_huge_active interfaces.  By it's name, page_huge_active implied
>> that a huge page was on the active list.  However, that is not really
>> what code checking the flag wanted to know.  It really wanted to determine
>> if the huge page could be migrated.  This happens when the page is actually
>> added the page cache and/or task page table.  This is the reasoning behind
> 
> s/added/added to/

Thanks

> 
>> the name change.
>>
...
>> @@ -1260,7 +1260,13 @@ static int scan_movable_pages(unsigned long start, 
>> unsigned long end,
>>  if (!PageHuge(page))
>>  continue;
>>  head = compound_head(page);
>> -if (page_huge_active(head))
>> +/*
>> + * This test is racy as we hold no reference or lock.  The
>> + * hugetlb page could have been free'ed and head is no longer
>> + * a hugetlb page before the following check.  In such unlikely
>> + * cases false positives and negatives are possible.
>> + */
> 
> Is it necessary to mention that: "offline_pages() could handle these racy 
> cases." in the comment ?
> 

I will enhance the comment to say that calling code must deal with these
possible scenarios.
-- 
Mike Kravetz

>> +if (HPageMigratable(head))
>>  goto found;
>>  skip = compound_nr(head) - (page - head);
>>  pfn += skip - 1;
>>
> 
> Looks good to me. Thanks.
> 
> Reviewed-by: Miaohe Lin 
> 


[PATCH v3 2/5] hugetlb: convert page_huge_active() HPageMigratable flag

2021-01-22 Thread Mike Kravetz
Use the new hugetlb page specific flag HPageMigratable to replace the
page_huge_active interfaces.  By its name, page_huge_active implied
that a huge page was on the active list.  However, that is not really
what code checking the flag wanted to know.  It really wanted to determine
if the huge page could be migrated.  This happens when the page is actually
added to the page cache and/or task page table.  This is the reasoning
behind the name change.

The VM_BUG_ON_PAGE() calls in the *_huge_active() interfaces are not
really necessary as we KNOW the page is a hugetlb page.  Therefore, they
are removed.

The routine page_huge_active checked for PageHeadHuge before testing the
active bit.  This is unnecessary in the case where we hold a reference or
lock and know it is a hugetlb head page.  page_huge_active is also called
without holding a reference or lock (scan_movable_pages), and can race with
code freeing the page.  The extra check in page_huge_active shortened the
race window, but did not prevent the race.  Offline code calling
scan_movable_pages already deals with these races, so removing the check
is acceptable.  Add comment to racy code.

Signed-off-by: Mike Kravetz 
Reviewed-by: Oscar Salvador 
Reviewed-by: Muchun Song 
Reviewed-by: Miaohe Lin 
---
 fs/hugetlbfs/inode.c   |  2 +-
 include/linux/hugetlb.h|  5 +
 include/linux/page-flags.h |  6 -
 mm/hugetlb.c   | 45 ++
 mm/memory_hotplug.c|  9 +++-
 5 files changed, 25 insertions(+), 42 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index b00801fd6002..e1d7ed2a53a9 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -735,7 +735,7 @@ static long hugetlbfs_fallocate(struct file *file, int 
mode, loff_t offset,
 
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 
-   set_page_huge_active(page);
+   SetHPageMigratable(page);
/*
 * unlock_page because locked by add_to_page_cache()
 * put_page() due to reference from alloc_huge_page()
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a7eb05315c6e..58be44a915d1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -480,9 +480,13 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, 
unsigned long addr,
  * HPG_restore_reserve - Set when a hugetlb page consumes a reservation at
  * allocation time.  Cleared when page is fully instantiated.  Free
  * routine checks flag to restore a reservation on error paths.
+ * HPG_migratable  - Set after a newly allocated page is added to the page
+ * cache and/or page tables.  Indicates the page is a candidate for
+ * migration.
  */
 enum hugetlb_page_flags {
HPG_restore_reserve = 0,
+   HPG_migratable,
__NR_HPAGEFLAGS,
 };
 
@@ -525,6 +529,7 @@ static inline void ClearHPage##uname(struct page *page) 
\
  * Create functions associated with hugetlb page flags
  */
 HPAGEFLAG(RestoreReserve, restore_reserve)
+HPAGEFLAG(Migratable, migratable)
 
 #ifdef CONFIG_HUGETLB_PAGE
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index ec5d0290e0ee..db914477057b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -592,15 +592,9 @@ static inline void ClearPageCompound(struct page *page)
 #ifdef CONFIG_HUGETLB_PAGE
 int PageHuge(struct page *page);
 int PageHeadHuge(struct page *page);
-bool page_huge_active(struct page *page);
 #else
 TESTPAGEFLAG_FALSE(Huge)
 TESTPAGEFLAG_FALSE(HeadHuge)
-
-static inline bool page_huge_active(struct page *page)
-{
-   return 0;
-}
 #endif
 
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8ec6138ca81b..f1a3c8230dbf 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1353,30 +1353,6 @@ struct hstate *size_to_hstate(unsigned long size)
return NULL;
 }
 
-/*
- * Test to determine whether the hugepage is "active/in-use" (i.e. being linked
- * to hstate->hugepage_activelist.)
- *
- * This function can be called for tail pages, but never returns true for them.
- */
-bool page_huge_active(struct page *page)
-{
-   return PageHeadHuge(page) && PagePrivate(&page[1]);
-}
-
-/* never called for tail page */
-void set_page_huge_active(struct page *page)
-{
-   VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
-   SetPagePrivate(&page[1]);
-}
-
-static void clear_page_huge_active(struct page *page)
-{
-   VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
-   ClearPagePrivate(&page[1]);
-}
-
 /*
  * Internal hugetlb specific page flag. Do not use outside of the hugetlb
  * code
@@ -1438,7 +1414,7 @@ static void __free_huge_page(struct page *page)
}
 
spin_lock(&hugetlb_lock);
-   clear_page_huge_active(page);
+   ClearHPageMigratable(page);
hugetlb_cgroup_uncharge_page(hstate_index(h),
 pages_per_huge_page(h), page);
hugetlb_cgroup_uncharge_pag

[PATCH v3 3/5] hugetlb: only set HPageMigratable for migratable hstates

2021-01-22 Thread Mike Kravetz
The HPageMigratable flag indicates a page is a candidate for migration.
Only set the flag if the page's hstate supports migration.  This allows
the migration paths to detect non-migratable pages earlier.  If migration
is not supported for the hstate, HPageMigratable will not be set, the page
will not be isolated and no attempt will be made to migrate.  We should
never get to unmap_and_move_huge_page for a page where migration is not
supported, so throw a warning if we do.

Signed-off-by: Mike Kravetz 
---
 fs/hugetlbfs/inode.c| 2 +-
 include/linux/hugetlb.h | 9 +
 mm/hugetlb.c| 8 
 mm/migrate.c| 9 -
 4 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index e1d7ed2a53a9..93f7b8d3c5fd 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -735,7 +735,7 @@ static long hugetlbfs_fallocate(struct file *file, int 
mode, loff_t offset,
 
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 
-   SetHPageMigratable(page);
+   SetHPageMigratableIfSupported(page);
/*
 * unlock_page because locked by add_to_page_cache()
 * put_page() due to reference from alloc_huge_page()
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 58be44a915d1..cd1960541f2a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -740,6 +740,15 @@ static inline bool hugepage_migration_supported(struct 
hstate *h)
return arch_hugetlb_migration_supported(h);
 }
 
+/*
+ * Only set HPageMigratable if migration supported for page
+ */
+static inline void SetHPageMigratableIfSupported(struct page *page)
+{
+   if (hugepage_migration_supported(page_hstate(page)))
+   SetHPageMigratable(page);
+}
+
 /*
  * Movability check is different as compared to migration check.
  * It determines whether or not a huge page should be placed on
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f1a3c8230dbf..4da1a29ac5e2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4194,7 +4194,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, 
struct vm_area_struct *vma,
make_huge_pte(vma, new_page, 1));
page_remove_rmap(old_page, true);
hugepage_add_new_anon_rmap(new_page, vma, haddr);
-   SetHPageMigratable(new_page);
+   SetHPageMigratableIfSupported(new_page);
/* Make the old page be freed below */
new_page = old_page;
}
@@ -4436,7 +4436,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 * been isolated for migration.
 */
if (new_page)
-   SetHPageMigratable(page);
+   SetHPageMigratableIfSupported(page);
 
unlock_page(page);
 out:
@@ -4747,7 +4747,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
update_mmu_cache(dst_vma, dst_addr, dst_pte);
 
spin_unlock(ptl);
-   SetHPageMigratable(page);
+   SetHPageMigratableIfSupported(page);
if (vm_shared)
unlock_page(page);
ret = 0;
@@ -5589,7 +5589,7 @@ void putback_active_hugepage(struct page *page)
 {
VM_BUG_ON_PAGE(!PageHead(page), page);
spin_lock(&hugetlb_lock);
-   SetHPageMigratable(page);
+   SetHPageMigratableIfSupported(page);
list_move_tail(&page->lru, &(page_hstate(page))->hugepage_activelist);
spin_unlock(&hugetlb_lock);
put_page(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index a3e1acc72ad7..c8d19e83f372 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1275,13 +1275,12 @@ static int unmap_and_move_huge_page(new_page_t 
get_new_page,
struct address_space *mapping = NULL;
 
/*
-* Migratability of hugepages depends on architectures and their size.
-* This check is necessary because some callers of hugepage migration
-* like soft offline and memory hotremove don't walk through page
-* tables or check whether the hugepage is pmd-based or not before
-* kicking migration.
+* Support for migration should be checked at isolation time.
+* Therefore, we should never get here if migration is not supported
+* for the page.
 */
if (!hugepage_migration_supported(page_hstate(hpage))) {
+   VM_WARN_ON(1);
list_move_tail(&hpage->lru, ret);
return -ENOSYS;
}
-- 
2.29.2



Re: [PATCH v2] Documentation/admin-guide: kernel-parameters: update CMA entries

2021-01-25 Thread Mike Kravetz
On 1/24/21 8:32 PM, Randy Dunlap wrote:
> Add qualifying build option legend [CMA] to kernel boot options
> that requirce CMA support to be enabled for them to be usable.
> 
> Also capitalize 'CMA' when it is used as an acronym.
> 
> Signed-off-by: Randy Dunlap 
> Cc: Jonathan Corbet 
> Cc: linux-...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: Andrew Morton 
> Cc: Mike Kravetz 
> ---
> v2: rebase & resend
> 
>  Documentation/admin-guide/kernel-parameters.txt |8 
>  1 file changed, 4 insertions(+), 4 deletions(-)

Thanks,

Acked-by: Mike Kravetz 

-- 
Mike Kravetz

> 
> --- linux-next-20210122.orig/Documentation/admin-guide/kernel-parameters.txt
> +++ linux-next-20210122/Documentation/admin-guide/kernel-parameters.txt
> @@ -631,7 +631,7 @@
>   kernel/dma/contiguous.c
>  
>   cma_pernuma=nn[MG]
> - [ARM64,KNL]
> + [ARM64,KNL,CMA]
>   Sets the size of kernel per-numa memory area for
>   contiguous memory allocations. A value of 0 disables
>   per-numa CMA altogether. And If this option is not
> @@ -1556,12 +1556,12 @@
>   hpet_mmap=  [X86, HPET_MMAP] Allow userspace to mmap HPET
>   registers.  Default set by CONFIG_HPET_MMAP_DEFAULT.
>  
> - hugetlb_cma=[HW] The size of a cma area used for allocation
> + hugetlb_cma=[HW,CMA] The size of a CMA area used for allocation
>   of gigantic hugepages.
>   Format: nn[KMGTPE]
>  
> - Reserve a cma area of given size and allocate gigantic
> - hugepages using the cma allocator. If enabled, the
> + Reserve a CMA area of given size and allocate gigantic
> + hugepages using the CMA allocator. If enabled, the
>   boot-time allocation of gigantic hugepages is skipped.
>  
>   hugepages=  [HW] Number of HugeTLB pages to allocate at boot.
> 


Re: [External] Re: [PATCH v13 05/12] mm: hugetlb: allocate the vmemmap pages associated with each HugeTLB page

2021-01-25 Thread Mike Kravetz
>>>>>> +  * Once we allocate
>>>>>> +  * vmemmap pages successfully, then we can free
>>>>>> +  * a HugeTLB page.
>>>>>> +  */
>>>>>> + goto retry;
>>>>>
>>>>> Ugh, I don't think this will work, there's no guarantee that we'll ever
>>>>> succeed and now we can't free a 2MB hugepage because we cannot allocate a
>>>>> 4KB page.  We absolutely have to ensure we make forward progress here.
>>>>
>>>> This can trigger a OOM when there is no memory and kill someone to release
>>>> some memory. Right?
>>>>
>>>>>
>>>>> We're going to be freeing the hugetlb page after this succeeeds, can we
>>>>> not use part of the hugetlb page that we're freeing for this memory
>>>>> instead?
>>>>
>>>> It seems a good idea. We can try to allocate memory firstly, if successful,
>>>> just use the new page to remap (it can reduce memory fragmentation).
>>>> If not, we can use part of the hugetlb page to remap. What's your opinion
>>>> about this?
>>>
>>> If the HugeTLB page is a gigantic page which is allocated from
>>> CMA. In this case, we cannot use part of the hugetlb page to remap.
>>> Right?
>>
>> Right; and I don't think the "reuse part of a huge page as vmemmap while
>> freeing, while that part itself might not have a proper vmemmap yet (or
>> might cover itself now)" is particularly straight forward. Maybe I'm
>> wrong :)
>>
>> Also, watch out for huge pages on ZONE_MOVABLE, in that case you also
>> shouldn't allocate the vmemmap from there ...
> 
> Yeah, you are right. So I tend to trigger OOM to kill other processes to
> reclaim some memory when we allocate memory fails.

IIUC, even non-gigantic hugetlb pages can exist in CMA.  They can be migrated
out of CMA if needed (except free pages in the pool, but that is a separate
issue David H already noted in another thread).

When we first started discussing this patch set, one suggestion was to force
hugetlb pool pages to be allocated at boot time and never permit them to be
freed back to the buddy allocator.  A primary reason for the suggestion was
to avoid this issue of needing to allocate memory when freeing a hugetlb page
to buddy.  IMO, that would be an unreasonable restriction for many existing
hugetlb use cases.

A simple thought is that we simply fail the 'freeing hugetlb page to buddy'
if we can not allocate the required vmemmap pages.  However, as David R says
freeing hugetlb pages to buddy is a reasonable way to free up memory in oom
situations.  However, failing the operation 'might' be better than looping
forever trying to allocate the pages needed?  As mentioned in the previous
patch, it would be better to use GFP_ATOMIC to at least dip into reserves if
we can.
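
Purely as an illustration of that thought (not code from the series; the helper
name alloc_vmemmap_page_node() is made up for this sketch), the allocation could
try normal GFP flags first and only fall back to GFP_ATOMIC:

/* Hypothetical sketch: alloc_vmemmap_page_node() is an assumed helper,
 * not an existing kernel function. */
static struct page *alloc_vmemmap_page_node(int nid)
{
	struct page *page;

	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_NORETRY, 0);
	if (!page)
		page = alloc_pages_node(nid, GFP_ATOMIC, 0);	/* dip into reserves */

	return page;
}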

I think using pages of the hugetlb page itself as the vmemmap that covers it
is the only way we can guarantee success of freeing a hugetlb page to buddy.
However, this should only be used when there is no other option, and it could
result in vmemmap pages residing in CMA or ZONE_MOVABLE.  I'm not sure how
much better this is than failing the free to buddy operation.

I don't have a solution.  Just wanted to share some thoughts.

BTW, just thought of something else.  Consider offlining a memory section that
contains a free hugetlb page.  The offline code will try to dissolve the hugetlb
page (free to buddy).  So, vmemmap pages will need to be allocated.  We will
try to allocate vmemmap pages on the same node as the hugetlb page.  But, if
this memory section is the last of the node, all the pages will have been
isolated and no allocations will succeed.  Is that a possible scenario, or am
I just having too many negative thoughts?

-- 
Mike Kravetz


Re: [PATCH] hugetlbfs: make hugepage size conversion more readable

2021-01-21 Thread Mike Kravetz
On 1/21/21 5:42 PM, Miaohe Lin wrote:
> Hi:
> On 2021/1/22 3:00, Mike Kravetz wrote:
>> On 1/20/21 1:23 AM, Miaohe Lin wrote:
>>> The calculation 1U << (h->order + PAGE_SHIFT - 10) is actually equal to
>>> (PAGE_SIZE << (h->order)) >> 10. So we can make it more readable by
>>> replacing it with huge_page_size(h) / SZ_1K.
>>>
>>> Signed-off-by: Miaohe Lin 
>>> ---
>>>  fs/hugetlbfs/inode.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>> index 25c1857ff45d..f94b8f6553fa 100644
>>> --- a/fs/hugetlbfs/inode.c
>>> +++ b/fs/hugetlbfs/inode.c
>>> @@ -1519,8 +1519,8 @@ static struct vfsmount *__init 
>>> mount_one_hugetlbfs(struct hstate *h)
>>> put_fs_context(fc);
>>> }
>>> if (IS_ERR(mnt))
>>> -   pr_err("Cannot mount internal hugetlbfs for page size %uK",
>>> -  1U << (h->order + PAGE_SHIFT - 10));
>>> +   pr_err("Cannot mount internal hugetlbfs for page size %luK",
>>> +  huge_page_size(h) / SZ_1K);
>>
>> I appreciate the effort to make the code more readable.  The existing
>> calculation does take a minute to understand.  However, it is correct and
>> anyone modifying the code should be able to understand.
>>
>> With my compiler, your proposed change adds an additional instruction to
>> the routine mount_one_hugetlbfs.  I know this is not significant, but still
> 
> I thought compiler would generate the same code...
> 
>> it does increase the kernel size for a change that is of questionable value.
>>
>> In the kernel, size in KB is often calculated as (size << (PAGE_SHIFT - 10)).
>> If you change the calculation in the hugetlb code to be:
>>> huge_page_size(h) << (PAGE_SHIFT - 10)
> 
> I'm sorry, but this does not look correct. I think the calculation should be
> huge_page_size(h) >> 10. What do you think?

My bad!  I was looking at code that converts page counts to KB.  Sorry.

Yes, huge_page_size(h) >> 10 is correct.
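
As a quick sanity check (plain userspace C, not kernel code; the values assume
a 2MB huge page with 4KB base pages), both forms give the same result:

#include <stdio.h>

int main(void)
{
	unsigned int order = 9, page_shift = 12;	/* assumed example values */
	unsigned long huge_page_size = 1UL << (order + page_shift);

	printf("%uK\n", 1U << (order + page_shift - 10));	/* existing expression */
	printf("%luK\n", huge_page_size >> 10);			/* proposed expression */
	return 0;
}

Both print 2048K.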

-- 
Mike Kravetz


Re: [PATCH] hugetlbfs: make hugepage size conversion more readable

2021-01-21 Thread Mike Kravetz
On 1/20/21 1:23 AM, Miaohe Lin wrote:
> The calculation 1U << (h->order + PAGE_SHIFT - 10) is actually equal to
> (PAGE_SIZE << (h->order)) >> 10. So we can make it more readable by
> replacing it with huge_page_size(h) / SZ_1K.
> 
> Signed-off-by: Miaohe Lin 
> ---
>  fs/hugetlbfs/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 25c1857ff45d..f94b8f6553fa 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -1519,8 +1519,8 @@ static struct vfsmount *__init 
> mount_one_hugetlbfs(struct hstate *h)
>   put_fs_context(fc);
>   }
>   if (IS_ERR(mnt))
> - pr_err("Cannot mount internal hugetlbfs for page size %uK",
> -1U << (h->order + PAGE_SHIFT - 10));
> + pr_err("Cannot mount internal hugetlbfs for page size %luK",
> +huge_page_size(h) / SZ_1K);

I appreciate the effort to make the code more readable.  The existing
calculation does take a minute to understand.  However, it is correct and
anyone modifying the code should be able to understand.

With my compiler, your proposed change adds an additional instruction to
the routine mount_one_hugetlbfs.  I know this is not significant, but still
it does increase the kernel size for a change that is of questionable value.

In the kernel, size in KB is often calculated as (size << (PAGE_SHIFT - 10)).
If you change the calculation in the hugetlb code to be:

huge_page_size(h) << (PAGE_SHIFT - 10)

my compiler will actually reduce the size of the routine by one instruction.
-- 
Mike Kravetz

>   return mnt;
>  }
>  
> 


Re: [PATCH v2] hugetlbfs: remove meaningless variable avoid_reserve

2021-01-21 Thread Mike Kravetz
On 1/19/21 11:15 PM, Miaohe Lin wrote:
> The variable avoid_reserve is meaningless because we never changed its
> value and just passed it to alloc_huge_page(). So remove it to make code
> more clear that in hugetlbfs_fallocate, we never avoid reserve when alloc
> hugepage yet. Also add a comment offered by Mike Kravetz to explain this.
> 
> Reviewed-by: David Hildenbrand 
> Signed-off-by: Miaohe Lin 
> Cc: Mike Kravetz 
> ---
>  fs/hugetlbfs/inode.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)

Reviewed-by: Mike Kravetz 

> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 4bbfd78a7ccb..14df2f73b8ef 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -680,7 +680,6 @@ static long hugetlbfs_fallocate(struct file *file, int 
> mode, loff_t offset,
>*/
>   struct page *page;
>   unsigned long addr;
> - int avoid_reserve = 0;
>  
>   cond_resched();
>  
> @@ -716,8 +715,15 @@ static long hugetlbfs_fallocate(struct file *file, int 
> mode, loff_t offset,
>   continue;
>   }
>  
> - /* Allocate page and add to page cache */
> - page = alloc_huge_page(&pseudo_vma, addr, avoid_reserve);
> + /*
> +  * Allocate page without setting the avoid_reserve argument.
> +  * There certainly are no reserves associated with the
> +  * pseudo_vma.  However, there could be shared mappings with
> +  * reserves for the file at the inode level.  If we fallocate
> +  * pages in these areas, we need to consume the reserves
> +  * to keep reservation accounting consistent.
> +  */
> + page = alloc_huge_page(&pseudo_vma, addr, 0);
>   hugetlb_drop_vma_policy(&pseudo_vma);
>   if (IS_ERR(page)) {
>   mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> 


-- 
Mike Kravetz


Re: [PATCH v2] hugetlbfs: correct obsolete function name in hugetlbfs_read_iter()

2021-01-19 Thread Mike Kravetz
On 1/17/21 10:32 PM, Miaohe Lin wrote:
> Since commit 36e789144267 ("kill do_generic_mapping_read"), the function
> do_generic_mapping_read() is renamed to do_generic_file_read(). And then
> commit 47c27bc46946 ("fs: pass iocb to do_generic_file_read") renamed it
> to generic_file_buffered_read(). So replace do_generic_mapping_read() with
> generic_file_buffered_read() to keep comment uptodate.
> 
> Signed-off-by: Miaohe Lin 
> ---
>  fs/hugetlbfs/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Thanks,

Reviewed-by: Mike Kravetz 

-- 
Mike Kravetz

> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 23ad6ed8b75f..d02616513b43 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -310,7 +310,7 @@ hugetlbfs_read_actor(struct page *page, unsigned long 
> offset,
>  
>  /*
>   * Support for read() - Find the page attached to f_mapping and copy out the
> - * data. Its *very* similar to do_generic_mapping_read(), we can't use that
> + * data. Its *very* similar to generic_file_buffered_read(), we can't use 
> that
>   * since it has PAGE_SIZE assumptions.
>   */
>  static ssize_t hugetlbfs_read_iter(struct kiocb *iocb, struct iov_iter *to)
> 


Re: [PATCH] hugetlbfs: Use helper macro default_hstate in init_hugetlbfs_fs

2021-01-19 Thread Mike Kravetz
CC Andrew

On 1/19/21 9:53 AM, Mike Kravetz wrote:
> On 1/16/21 1:18 AM, Miaohe Lin wrote:
>> Since commit e5ff215941d5 ("hugetlb: multiple hstates for multiple page
>> sizes"), we can use macro default_hstate to get the struct hstate which
>> we use by default. But init_hugetlbfs_fs() forgot to use it.
>>
>> Signed-off-by: Miaohe Lin 
> 
> Thanks,
> 
> Reviewed-by: Mike Kravetz 
> 


Re: [PATCH] hugetlbfs: remove meaningless variable avoid_reserve

2021-01-19 Thread Mike Kravetz
Please CC Andrew on hugetlb patches as they need to go through his tree.

On 1/16/21 1:26 AM, Miaohe Lin wrote:
> The variable avoid_reserve is meaningless because we never changed its
> value and just passed it to alloc_huge_page(). So remove it to make code
> more clear that in hugetlbfs_fallocate, we never avoid reserve when alloc
> hugepage yet.

One might argue that using a named variable makes the call to alloc_huge_page
more clear.  I do not disagree with the change.  However, there are some
subtle reasons why alloc_huge_page is called with 'avoid_reserve = 0' from
fallocate.  Therefore, I would prefer that a comment be added above the call
in addition to this change.  See below.

> 
> Signed-off-by: Miaohe Lin 
> ---
>  fs/hugetlbfs/inode.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 88751e35e69d..23ad6ed8b75f 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -680,7 +680,6 @@ static long hugetlbfs_fallocate(struct file *file, int 
> mode, loff_t offset,
>*/
>   struct page *page;
>   unsigned long addr;
> - int avoid_reserve = 0;
>  
>   cond_resched();
>  
> @@ -717,7 +716,7 @@ static long hugetlbfs_fallocate(struct file *file, int 
> mode, loff_t offset,
>   }
>  
>   /* Allocate page and add to page cache */

Perhaps, change comment to read:

/*
 * Allocate page without setting the avoid_reserve argument.
 * There certainly are no reserves associated with the
 * pseudo_vma.  However, there could be shared mappings with
 * reserves for the file at the inode level.  If we fallocate
 * pages in these areas, we need to consume the reserves
         * to keep reservation accounting consistent.
 */

-- 
Mike Kravetz

> - page = alloc_huge_page(&pseudo_vma, addr, avoid_reserve);
> + page = alloc_huge_page(&pseudo_vma, addr, 0);
>   hugetlb_drop_vma_policy(&pseudo_vma);
>   if (IS_ERR(page)) {
>   mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> 


Re: [PATCH] hugetlbfs: Use helper macro default_hstate in init_hugetlbfs_fs

2021-01-19 Thread Mike Kravetz
On 1/16/21 1:18 AM, Miaohe Lin wrote:
> Since commit e5ff215941d5 ("hugetlb: multiple hstates for multiple page
> sizes"), we can use macro default_hstate to get the struct hstate which
> we use by default. But init_hugetlbfs_fs() forgot to use it.
> 
> Signed-off-by: Miaohe Lin 

Thanks,

Reviewed-by: Mike Kravetz 

-- 
Mike Kravetz

> ---
>  fs/hugetlbfs/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 9b221b87fbea..88751e35e69d 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -1544,7 +1544,7 @@ static int __init init_hugetlbfs_fs(void)
>   goto out_free;
>  
>   /* default hstate mount is required */
> - mnt = mount_one_hugetlbfs(&hstates[default_hstate_idx]);
> + mnt = mount_one_hugetlbfs(&default_hstate);
>   if (IS_ERR(mnt)) {
>   error = PTR_ERR(mnt);
>   goto out_unreg;
> 


Re: [PATCH v2] hugetlbfs: Remove useless BUG_ON(!inode) in hugetlbfs_setattr()

2021-01-19 Thread Mike Kravetz
CC Andrew

On 1/18/21 3:07 AM, Miaohe Lin wrote:
> When we reach here with inode = NULL, we should have crashed as inode has
> already been dereferenced via hstate_inode. So this BUG_ON(!inode) does not
> take effect and should be removed.
> 
> Signed-off-by: Miaohe Lin 
> ---
>  fs/hugetlbfs/inode.c | 2 --
>  1 file changed, 2 deletions(-)

Thanks,

Reviewed-by: Mike Kravetz 

-- 
Mike Kravetz

> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 016c863b493b..79464963f95e 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -759,8 +759,6 @@ static int hugetlbfs_setattr(struct dentry *dentry, 
> struct iattr *attr)
>   unsigned int ia_valid = attr->ia_valid;
>   struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode);
>  
> - BUG_ON(!inode);
> -
>   error = setattr_prepare(dentry, attr);
>   if (error)
>   return error;
> 


Re: [PATCH v2] mm/hugetlb: avoid unnecessary hugetlb_acct_memory() call

2021-01-19 Thread Mike Kravetz
On 1/15/21 1:44 AM, Miaohe Lin wrote:
> Hi:
> 
> On 2021/1/15 17:28, Oscar Salvador wrote:
>> On Fri, Jan 15, 2021 at 04:20:13AM -0500, Miaohe Lin wrote:
>>> When gbl_reserve is 0, hugetlb_acct_memory() will do nothing except holding
>>> and releasing hugetlb_lock. We should avoid this unnecessary hugetlb_lock
>>> lock/unlock cycle which is happening on 'most' hugetlb munmap operations by
>>> check delta against 0 at the beginning of hugetlb_acct_memory.
>>>
>>> Reviewed-by: David Hildenbrand 
>>> Signed-off-by: Miaohe Lin 
>>
>> I would avoid mentioning gbl_reserve as not all callers use it, and focus
>> on what delta means:
>>
>> "When reservation accounting remains unchanged..", but anyway:
> 
> Sounds good. Maybe Andrew could kindly do this if this patch is picked up ?

Thank you and Andrew.

Looks like Andrew updated the commit message and added to his tree.

-- 
Mike Kravetz


[PATCH v2 1/5] hugetlb: use page.private for hugetlb specific page flags

2021-01-19 Thread Mike Kravetz
As hugetlbfs evolved, state information about hugetlb pages was added.
One 'convenient' way of doing this was to use available fields in tail
pages.  Over time, it has become difficult to know the meaning or contents
of fields simply by looking at a small bit of code.  Sometimes, the
naming is just confusing.  For example: The PagePrivate flag indicates
a huge page reservation was consumed and needs to be restored if an error
is encountered and the page is freed before it is instantiated.  The
page.private field contains the pointer to a subpool if the page is
associated with one.

In an effort to make the code more readable, use page.private to contain
hugetlb specific page flags.  These flags will have test, set and clear
functions similar to those used for 'normal' page flags.  More importantly,
an enum of flag values will be created with names that actually reflect
their purpose.

In this patch,
- Create infrastructure for hugetlb specific page flag functions
- Move subpool pointer to page[1].private to make way for flags
  Create routines with meaningful names to modify subpool field
- Use new HPageRestoreReserve flag instead of PagePrivate

Conversion of other state information will happen in subsequent patches.

Signed-off-by: Mike Kravetz 
---
 fs/hugetlbfs/inode.c| 12 ++-
 include/linux/hugetlb.h | 72 +
 mm/hugetlb.c| 45 +-
 3 files changed, 97 insertions(+), 32 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 740693d7f255..b8a661780c4a 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -955,15 +955,9 @@ static int hugetlbfs_migrate_page(struct address_space 
*mapping,
if (rc != MIGRATEPAGE_SUCCESS)
return rc;
 
-   /*
-* page_private is subpool pointer in hugetlb pages.  Transfer to
-* new page.  PagePrivate is not associated with page_private for
-* hugetlb pages and can not be set here as only page_huge_active
-* pages can be migrated.
-*/
-   if (page_private(page)) {
-   set_page_private(newpage, page_private(page));
-   set_page_private(page, 0);
+   if (hugetlb_page_subpool(page)) {
+   hugetlb_set_page_subpool(newpage, hugetlb_page_subpool(page));
+   hugetlb_set_page_subpool(page, NULL);
}
 
if (mode != MIGRATE_SYNC_NO_COPY)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ef5b144b8aac..be71a00ee2a0 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -472,6 +472,64 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, 
unsigned long addr,
unsigned long flags);
 #endif /* HAVE_ARCH_HUGETLB_UNMAPPED_AREA */
 
+/*
+ * hugetlb page specific state flags.  These flags are located in page.private
+ * of the hugetlb head page.  Functions created via the below macros should be
+ * used to manipulate these flags.
+ *
+ * HPG_restore_reserve - Set when a hugetlb page consumes a reservation at
+ * allocation time.  Cleared when page is fully instantiated.  Free
+ * routine checks flag to restore a reservation on error paths.
+ */
+enum hugetlb_page_flags {
+   HPG_restore_reserve = 0,
+   __NR_HPAGEFLAGS,
+};
+
+/*
+ * Macros to create test, set and clear function definitions for
+ * hugetlb specific page flags.
+ */
+#ifdef CONFIG_HUGETLB_PAGE
+#define TESTHPAGEFLAG(uname, flname)   \
+static inline int HPage##uname(struct page *page)  \
+   { BUILD_BUG_ON(sizeof_field(struct page, private) * \
+   BITS_PER_BYTE < __NR_HPAGEFLAGS);   \
+   return test_bit(HPG_##flname, &(page->private)); }
+
+#define SETHPAGEFLAG(uname, flname)\
+static inline void SetHPage##uname(struct page *page)  \
+   { set_bit(HPG_##flname, &(page->private)); }
+
+#define CLEARHPAGEFLAG(uname, flname)  \
+static inline void ClearHPage##uname(struct page *page)\
+   { clear_bit(HPG_##flname, &(page->private)); }
+#else
+#define TESTHPAGEFLAG(uname, flname)   \
+static inline int HPage##uname(struct page *page)  \
+   { BUILD_BUG_ON(sizeof_field(struct page, private) * \
+   BITS_PER_BYTE < __NR_HPAGEFLAGS);   \
+   return 0; }
+
+#define SETHPAGEFLAG(uname, flname)\
+static inline void SetHPage##uname(struct page *page)  \
+   { }
+
+#define CLEARHPAGEFLAG(uname, flname)  \
+static inline void ClearHPage##uname(struct page *page)\
+   { }
+#endif
+
+#define HPAGEFLAG(uname, flname)   \
+   TESTHPAGEFLAG(uname, flname)\
+  

[PATCH v2 4/5] hugetlb: convert PageHugeTemporary() to HPageTemporary flag

2021-01-19 Thread Mike Kravetz
Use new hugetlb specific HPageTemporary flag to replace the
PageHugeTemporary() interfaces.

Signed-off-by: Mike Kravetz 
---
 include/linux/hugetlb.h |  6 ++
 mm/hugetlb.c| 36 +++-
 2 files changed, 13 insertions(+), 29 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1e17529c8b81..ec329b9cc0fc 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -483,10 +483,15 @@ unsigned long hugetlb_get_unmapped_area(struct file 
*file, unsigned long addr,
  * HPG_migratable  - Set after a newly allocated page is added to the page
  * cache and/or page tables.  Indicates the page is a candidate for
  * migration.
+ * HPG_temporary - Set on a page that is temporarily allocated from the buddy
+ * allocator.  Typically used for migration target pages when no pages
+ * are available in the pool.  The hugetlb free page path will
+ * immediately free pages with this flag set to the buddy allocator.
  */
 enum hugetlb_page_flags {
HPG_restore_reserve = 0,
HPG_migratable,
+   HPG_temporary,
__NR_HPAGEFLAGS,
 };
 
@@ -534,6 +539,7 @@ static inline void ClearHPage##uname(struct page *page) 
\
  */
 HPAGEFLAG(RestoreReserve, restore_reserve)
 HPAGEFLAG(Migratable, migratable)
+HPAGEFLAG(Temporary, temporary)
 
 #ifdef CONFIG_HUGETLB_PAGE
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6e32751489e8..0d2bfc2b6adc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1353,28 +1353,6 @@ struct hstate *size_to_hstate(unsigned long size)
return NULL;
 }
 
-/*
- * Internal hugetlb specific page flag. Do not use outside of the hugetlb
- * code
- */
-static inline bool PageHugeTemporary(struct page *page)
-{
-   if (!PageHuge(page))
-   return false;
-
-   return (unsigned long)page[2].mapping == -1U;
-}
-
-static inline void SetPageHugeTemporary(struct page *page)
-{
-   page[2].mapping = (void *)-1U;
-}
-
-static inline void ClearPageHugeTemporary(struct page *page)
-{
-   page[2].mapping = NULL;
-}
-
 static void __free_huge_page(struct page *page)
 {
/*
@@ -1422,9 +1400,9 @@ static void __free_huge_page(struct page *page)
if (restore_reserve)
h->resv_huge_pages++;
 
-   if (PageHugeTemporary(page)) {
+   if (HPageTemporary(page)) {
list_del(&page->lru);
-   ClearPageHugeTemporary(page);
+   ClearHPageTemporary(page);
update_and_free_page(h, page);
} else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
@@ -1863,7 +1841,7 @@ static struct page *alloc_surplus_huge_page(struct hstate 
*h, gfp_t gfp_mask,
 * codeflow
 */
if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
-   SetPageHugeTemporary(page);
+   SetHPageTemporary(page);
spin_unlock(&hugetlb_lock);
put_page(page);
return NULL;
@@ -1894,7 +1872,7 @@ static struct page *alloc_migrate_huge_page(struct hstate 
*h, gfp_t gfp_mask,
 * We do not account these pages as surplus because they are only
 * temporary and will be released properly on the last reference
 */
-   SetPageHugeTemporary(page);
+   SetHPageTemporary(page);
 
return page;
 }
@@ -5607,12 +5585,12 @@ void move_hugetlb_state(struct page *oldpage, struct 
page *newpage, int reason)
 * here as well otherwise the global surplus count will not match
 * the per-node's.
 */
-   if (PageHugeTemporary(newpage)) {
+   if (HPageTemporary(newpage)) {
int old_nid = page_to_nid(oldpage);
int new_nid = page_to_nid(newpage);
 
-   SetPageHugeTemporary(oldpage);
-   ClearPageHugeTemporary(newpage);
+   SetHPageTemporary(oldpage);
+   ClearHPageTemporary(newpage);
 
spin_lock(&hugetlb_lock);
if (h->surplus_huge_pages_node[old_nid]) {
-- 
2.29.2



[PATCH v2 2/5] hugetlb: convert page_huge_active() HPageMigratable flag

2021-01-19 Thread Mike Kravetz
Use the new hugetlb page specific flag HPageMigratable to replace the
page_huge_active interfaces.  By its name, page_huge_active implied
that a huge page was on the active list.  However, that is not really
what code checking the flag wanted to know.  It really wanted to determine
if the huge page could be migrated.  This happens when the page is actually
added to the page cache and/or task page table.  This is the reasoning behind
the name change.

The VM_BUG_ON_PAGE() calls in the *_huge_active() interfaces are not
really necessary as we KNOW the page is a hugetlb page.  Therefore, they
are removed.

The routine page_huge_active checked for PageHeadHuge before testing the
active bit.  This is unnecessary in the case where we hold a reference or
lock and know it is a hugetlb head page.  page_huge_active is also called
without holding a reference or lock (scan_movable_pages), and can race with
code freeing the page.  The extra check in page_huge_active shortened the
race window, but did not prevent the race.  Offline code calling
scan_movable_pages already deals with these races, so removing the check
is acceptable.  Add comment to racy code.

Signed-off-by: Mike Kravetz 
---
 fs/hugetlbfs/inode.c   |  2 +-
 include/linux/hugetlb.h|  5 +
 include/linux/page-flags.h |  6 -
 mm/hugetlb.c   | 45 ++
 mm/memory_hotplug.c|  8 ++-
 5 files changed, 24 insertions(+), 42 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index b8a661780c4a..ec9f03aa2738 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -735,7 +735,7 @@ static long hugetlbfs_fallocate(struct file *file, int 
mode, loff_t offset,
 
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 
-   set_page_huge_active(page);
+   SetHPageMigratable(page);
/*
 * unlock_page because locked by add_to_page_cache()
 * put_page() due to reference from alloc_huge_page()
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index be71a00ee2a0..ce3d03da0133 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -480,9 +480,13 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, 
unsigned long addr,
  * HPG_restore_reserve - Set when a hugetlb page consumes a reservation at
  * allocation time.  Cleared when page is fully instantiated.  Free
  * routine checks flag to restore a reservation on error paths.
+ * HPG_migratable  - Set after a newly allocated page is added to the page
+ * cache and/or page tables.  Indicates the page is a candidate for
+ * migration.
  */
 enum hugetlb_page_flags {
HPG_restore_reserve = 0,
+   HPG_migratable,
__NR_HPAGEFLAGS,
 };
 
@@ -529,6 +533,7 @@ static inline void ClearHPage##uname(struct page *page) 
\
  * Create functions associated with hugetlb page flags
  */
 HPAGEFLAG(RestoreReserve, restore_reserve)
+HPAGEFLAG(Migratable, migratable)
 
 #ifdef CONFIG_HUGETLB_PAGE
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index bc6fd1ee7dd6..04a34c08e0a6 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -592,15 +592,9 @@ static inline void ClearPageCompound(struct page *page)
 #ifdef CONFIG_HUGETLB_PAGE
 int PageHuge(struct page *page);
 int PageHeadHuge(struct page *page);
-bool page_huge_active(struct page *page);
 #else
 TESTPAGEFLAG_FALSE(Huge)
 TESTPAGEFLAG_FALSE(HeadHuge)
-
-static inline bool page_huge_active(struct page *page)
-{
-   return 0;
-}
 #endif
 
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8bed6b5202d2..c24da40626d3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1353,30 +1353,6 @@ struct hstate *size_to_hstate(unsigned long size)
return NULL;
 }
 
-/*
- * Test to determine whether the hugepage is "active/in-use" (i.e. being linked
- * to hstate->hugepage_activelist.)
- *
- * This function can be called for tail pages, but never returns true for them.
- */
-bool page_huge_active(struct page *page)
-{
-   return PageHeadHuge(page) && PagePrivate(&page[1]);
-}
-
-/* never called for tail page */
-void set_page_huge_active(struct page *page)
-{
-   VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
-   SetPagePrivate(&page[1]);
-}
-
-static void clear_page_huge_active(struct page *page)
-{
-   VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
-   ClearPagePrivate(&page[1]);
-}
-
 /*
  * Internal hugetlb specific page flag. Do not use outside of the hugetlb
  * code
@@ -1438,7 +1414,7 @@ static void __free_huge_page(struct page *page)
}
 
spin_lock(&hugetlb_lock);
-   clear_page_huge_active(page);
+   ClearHPageMigratable(page);
hugetlb_cgroup_uncharge_page(hstate_index(h),
 pages_per_huge_page(h), page);
hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
@@ -4220,7 +4196,7 @@ static vm_fault_t hugetlb_cow(struct m

[PATCH v2 3/5] hugetlb: only set HPageMigratable for migratable hstates

2021-01-19 Thread Mike Kravetz
The HP_Migratable flag indicates a page is a candidate for migration.
Only set the flag if the page's hstate supports migration.  This allows
the migration paths to detect non-migratable pages earlier.  If migration
is not supported for the hstate, HP_Migratable will not be set, the page
will not be isolated and no attempt will be made to migrate.  We should
never get to unmap_and_move_huge_page for a page where migration is not
supported, so throw a warning if we do.

Signed-off-by: Mike Kravetz 
---
 fs/hugetlbfs/inode.c| 2 +-
 include/linux/hugetlb.h | 9 +
 mm/hugetlb.c| 8 
 mm/migrate.c| 9 -
 4 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ec9f03aa2738..8b8acdafd0be 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -735,7 +735,7 @@ static long hugetlbfs_fallocate(struct file *file, int 
mode, loff_t offset,
 
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 
-   SetHPageMigratable(page);
+   SetHPageMigratableIfSupported(page);
/*
 * unlock_page because locked by add_to_page_cache()
 * put_page() due to reference from alloc_huge_page()
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ce3d03da0133..1e17529c8b81 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -744,6 +744,15 @@ static inline bool hugepage_migration_supported(struct 
hstate *h)
return arch_hugetlb_migration_supported(h);
 }
 
+/*
+ * Only set HPageMigratable if migration supported for page
+ */
+static inline void SetHPageMigratableIfSupported(struct page *page)
+{
+   if (hugepage_migration_supported(page_hstate(page)))
+   SetHPageMigratable(page);
+}
+
 /*
  * Movability check is different as compared to migration check.
  * It determines whether or not a huge page should be placed on
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c24da40626d3..6e32751489e8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4196,7 +4196,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, 
struct vm_area_struct *vma,
make_huge_pte(vma, new_page, 1));
page_remove_rmap(old_page, true);
hugepage_add_new_anon_rmap(new_page, vma, haddr);
-   SetHPageMigratable(new_page);
+   SetHPageMigratableIfSupported(new_page);
/* Make the old page be freed below */
new_page = old_page;
}
@@ -4438,7 +4438,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 * been isolated for migration.
 */
if (new_page)
-   SetHPageMigratable(page);
+   SetHPageMigratableIfSupported(page);
 
unlock_page(page);
 out:
@@ -4749,7 +4749,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
update_mmu_cache(dst_vma, dst_addr, dst_pte);
 
spin_unlock(ptl);
-   SetHPageMigratable(page);
+   SetHPageMigratableIfSupported(page);
if (vm_shared)
unlock_page(page);
ret = 0;
@@ -5584,7 +5584,7 @@ void putback_active_hugepage(struct page *page)
 {
VM_BUG_ON_PAGE(!PageHead(page), page);
spin_lock(&hugetlb_lock);
-   SetHPageMigratable(page);
+   SetHPageMigratableIfSupported(page);
list_move_tail(&page->lru, &(page_hstate(page))->hugepage_activelist);
spin_unlock(&hugetlb_lock);
put_page(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index 0339f3874d7c..943391cd1a7c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1273,13 +1273,12 @@ static int unmap_and_move_huge_page(new_page_t 
get_new_page,
struct address_space *mapping = NULL;
 
/*
-* Migratability of hugepages depends on architectures and their size.
-* This check is necessary because some callers of hugepage migration
-* like soft offline and memory hotremove don't walk through page
-* tables or check whether the hugepage is pmd-based or not before
-* kicking migration.
+* Support for migration should be checked at isolation time.
+* Therefore, we should never get here if migration is not supported
+* for the page.
 */
if (!hugepage_migration_supported(page_hstate(hpage))) {
+   VM_WARN_ON(1);
list_move_tail(&hpage->lru, ret);
return -ENOSYS;
}
-- 
2.29.2



[PATCH v2 5/5] hugetlb: convert PageHugeFreed to HPageFreed flag

2021-01-19 Thread Mike Kravetz
Use new hugetlb specific HPageFreed flag to replace the
PageHugeFreed interfaces.

Signed-off-by: Mike Kravetz 
---
 include/linux/hugetlb.h |  3 +++
 mm/hugetlb.c| 23 ---
 2 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ec329b9cc0fc..8fd0970cefdb 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -487,11 +487,13 @@ unsigned long hugetlb_get_unmapped_area(struct file 
*file, unsigned long addr,
  * allocator.  Typically used for migration target pages when no pages
  * are available in the pool.  The hugetlb free page path will
  * immediately free pages with this flag set to the buddy allocator.
+ * HPG_freed - Set when page is on the free lists.
  */
 enum hugetlb_page_flags {
HPG_restore_reserve = 0,
HPG_migratable,
HPG_temporary,
+   HPG_freed,
__NR_HPAGEFLAGS,
 };
 
@@ -540,6 +542,7 @@ static inline void ClearHPage##uname(struct page *page) 
\
 HPAGEFLAG(RestoreReserve, restore_reserve)
 HPAGEFLAG(Migratable, migratable)
 HPAGEFLAG(Temporary, temporary)
+HPAGEFLAG(Freed, freed)
 
 #ifdef CONFIG_HUGETLB_PAGE
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0d2bfc2b6adc..d5a78aedbfda 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -79,21 +79,6 @@ DEFINE_SPINLOCK(hugetlb_lock);
 static int num_fault_mutexes;
 struct mutex *hugetlb_fault_mutex_table cacheline_aligned_in_smp;
 
-static inline bool PageHugeFreed(struct page *head)
-{
-   return page_private(head + 4) == -1UL;
-}
-
-static inline void SetPageHugeFreed(struct page *head)
-{
-   set_page_private(head + 4, -1UL);
-}
-
-static inline void ClearPageHugeFreed(struct page *head)
-{
-   set_page_private(head + 4, 0);
-}
-
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
 
@@ -1043,7 +1028,7 @@ static void enqueue_huge_page(struct hstate *h, struct 
page *page)
list_move(&page->lru, &h->hugepage_freelists[nid]);
h->free_huge_pages++;
h->free_huge_pages_node[nid]++;
-   SetPageHugeFreed(page);
+   SetHPageFreed(page);
 }
 
 static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
@@ -1060,7 +1045,7 @@ static struct page *dequeue_huge_page_node_exact(struct 
hstate *h, int nid)
 
list_move(&page->lru, &h->hugepage_activelist);
set_page_refcounted(page);
-   ClearPageHugeFreed(page);
+   ClearHPageFreed(page);
h->free_huge_pages--;
h->free_huge_pages_node[nid]--;
return page;
@@ -1474,7 +1459,7 @@ static void prep_new_huge_page(struct hstate *h, struct 
page *page, int nid)
spin_lock(&hugetlb_lock);
h->nr_huge_pages++;
h->nr_huge_pages_node[nid]++;
-   ClearPageHugeFreed(page);
+   ClearHPageFreed(page);
spin_unlock(&hugetlb_lock);
 }
 
@@ -1747,7 +1732,7 @@ int dissolve_free_huge_page(struct page *page)
 * We should make sure that the page is already on the free list
 * when it is dissolved.
 */
-   if (unlikely(!PageHugeFreed(head))) {
+   if (unlikely(!HPageFreed(head))) {
rc = -EAGAIN;
goto out;
}
-- 
2.29.2



[PATCH v2 0/5] create hugetlb flags to consolidate state

2021-01-19 Thread Mike Kravetz
While discussing a series of hugetlb fixes in [1], it became evident
that the hugetlb specific page state information is stored in a somewhat
haphazard manner.  Code dealing with state information would be easier
to read, understand and maintain if this information was stored in a
consistent manner.

This series uses page.private of the hugetlb head page for storing a
set of hugetlb specific page flags.  Routines are priovided for test,
set and clear of the flags.

[1] https://lore.kernel.org/r/20210106084739.63318-1-songmuc...@bytedance.com
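
For readers skimming the series, this is roughly how call sites read once the
HPAGEFLAG() helpers are in place (illustrative sketch only, not a hunk from
the patches):

/* Illustration only: helpers generated by HPAGEFLAG() in this series. */
static void example_flag_usage(struct page *page)
{
	SetHPageMigratable(page);		/* page is now a migration candidate */

	if (HPageTemporary(page))
		ClearHPageTemporary(page);	/* will be released to buddy */

	ClearHPageRestoreReserve(page);
}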

Patch -> v2
  Went back to functions similar to 'normal' page flags (Matthew/Muchun)
  Decided to leave check in unmap_and_move_huge_page and print warning

RFC -> PATCH
  Simplified to use a single set of flag manipulation routines (Oscar)
  Moved flags and routines to hugetlb.h (Muchun)
  Changed format of page flag names (Muchun)
  Changed subpool routine names (Matthew)
  More comments in code (Oscar)

Based on v5.11-rc3-mmotm-2021-01-12-01-57

Mike Kravetz (5):
  hugetlb: use page.private for hugetlb specific page flags
  hugetlb: convert page_huge_active() HPageMigratable flag
  hugetlb: only set HPageMigratable for migratable hstates
  hugetlb: convert PageHugeTemporary() to HPageTemporary flag
  hugetlb: convert PageHugeFreed to HPageFreed flag

 fs/hugetlbfs/inode.c   |  14 +---
 include/linux/hugetlb.h|  95 +++
 include/linux/page-flags.h |   6 --
 mm/hugetlb.c   | 149 +++--
 mm/memory_hotplug.c|   8 +-
 mm/migrate.c   |   9 +--
 6 files changed, 154 insertions(+), 127 deletions(-)

-- 
2.29.2



Re: [RFC PATCH 1/3] hugetlb: use page.private for hugetlb specific page flags

2021-01-13 Thread Mike Kravetz
On 1/13/21 5:54 AM, Oscar Salvador wrote:
> On Mon, Jan 11, 2021 at 01:01:50PM -0800, Mike Kravetz wrote:
>> As hugetlbfs evolved, state information about hugetlb pages was added.
>> One 'convenient' way of doing this was to use available fields in tail
>> pages.  Over time, it has become difficult to know the meaning or contents
>> of fields simply be looking at a small bit of code.  Sometimes, the
>> naming is just confusing.  For example: The PagePrivate flag indicates
>> a huge page reservation was consumed and needs to be restored if an error
>> is encountered and the page is freed before it is instantiated.  The
>> page.private field contains the pointer to a subpool if the page is
>> associated with one.
>>
>> In an effort to make the code more readable, use page.private to contain
>> hugetlb specific flags.  These flags will have test, set and clear functions
>> similar to those used for 'normal' page flags.  More importantly, the
>> flags will have names which actually reflect their purpose.
>>
>> In this patch,
>> - Create infrastructure for huge page flag functions
>> - Move subpool pointer to page[1].private to make way for flags
>>   Create routines with meaningful names to modify subpool field
>> - Use new HPageRestoreReserve reserve flag instead of PagePrivate
>>
>> Conversion of other state information will happen in subsequent patches.
> 
> I like this idea, it would make the code much easier to follow, and together
> with Muchun's gathering indiscrete index hugetlb code will start looking less
> scarier.

Thanks for taking a look.

> 
> I do have a question below:
> 
>> +enum htlb_page_flags {
>> +HPAGE_RestoreReserve = 0,
>> +};
>> +
>> +/*
>> + * Macros to create function definitions for hpage flags
>> + */
>> +#define TESTHPAGEFLAG(flname)   \
>> +static inline int HPage##flname(struct page *page)  \
>> +{ return test_bit(HPAGE_##flname, &(page->private)); }
>> +
>> +#define SETHPAGEFLAG(flname)\
>> +static inline void SetHPage##flname(struct page *page)  \
>> +{ set_bit(HPAGE_##flname, &(page->private)); }
>> +
>> +#define CLEARHPAGEFLAG(flname)  \
>> +static inline void ClearHPage##flname(struct page *page)\
>> +{ clear_bit(HPAGE_##flname, &(page->private)); }
>> +
>> +#define HPAGEFLAG(flname)   \
>> +TESTHPAGEFLAG(flname)   \
>> +SETHPAGEFLAG(flname)\
>> +CLEARHPAGEFLAG(flname)
>> +
>> +HPAGEFLAG(RestoreReserve)
> 
> I have mixed feelings about this.
> Could we have a single function that sets/clears the bit/flag?
> e.g:
> 
>  static inline void hugetlb_set_flag(struct page *p, page_flag)
>  {
>  set_bit(flag, &(page->private));
>  }
> 
> etc.
> It would look less of an overkill?

Sure, we could do this.  As noted, I simply patterned this after the
page flag routines in page-flags.h.  If we go to single functions as
you suggest, I would perhaps change the name a bit to indicate the flags
were associated with the page.  Invoking code comparison would be:

SetHPageRestoreReserve(page)
-vs-
hugetlb_set_pageflag(page, HP_Restore_Reserve)

In either case, code would be more readable and you can easily grep for
a specific flag.

If we do go with single functions as above, then they would certainly be
moved to hugetlb.h as some flags need to be accessed outside hugetlb.c.
Muchun has already suggested this movement.
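
To make the comparison concrete, a rough sketch of the single-function style
(illustrative only; names are placeholders and this is not what was posted):

/* Sketch of the suggestion above, for comparison with the per-flag macros. */
static inline void hugetlb_set_pageflag(struct page *page, unsigned int flag)
{
	set_bit(flag, &page->private);
}

static inline bool hugetlb_test_pageflag(struct page *page, unsigned int flag)
{
	return test_bit(flag, &page->private);
}

/* A call site would then read: hugetlb_set_pageflag(page, HP_Restore_Reserve); */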
-- 
Mike Kravetz


Re: [RFC PATCH 1/3] hugetlb: use page.private for hugetlb specific page flags

2021-01-13 Thread Mike Kravetz
On 1/13/21 6:45 AM, Matthew Wilcox wrote:
> On Mon, Jan 11, 2021 at 01:01:50PM -0800, Mike Kravetz wrote:
>> +if (hpage_spool(page)) {
> 
> Could this be named hpage_subpool(), please?
> 

Of course!

-- 
Mike Kravetz


Re: [PATCH] mm/hugetlb: avoid unnecessary hugetlb_acct_memory() call

2021-01-14 Thread Mike Kravetz
On 1/14/21 4:32 AM, David Hildenbrand wrote:
> On 14.01.21 12:31, Miaohe Lin wrote:
>> When gbl_reserve is 0, hugetlb_acct_memory() will do nothing except holding
>> and releasing hugetlb_lock.
> 
> So, what's the deal then? Adding more code?
> 
> If this is a performance improvement, we should spell it out. Otherwise
> I don't see a real benefit of this patch.
> 

Thanks for finding/noticing this.

As David points out, the commit message should state that this is a
performance improvement.  Mention that such a change avoids an unnecessary
hugetlb_lock lock/unlock cycle.  You can also mention that this unnecessary
lock cycle is happening on 'most' hugetlb munmap operations.

>>
>> Signed-off-by: Miaohe Lin 
>> ---
>>  mm/hugetlb.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 737b2dce19e6..fe2da9ad6233 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -5241,7 +5241,8 @@ long hugetlb_unreserve_pages(struct inode *inode, long 
>> start, long end,
>>   * reservations to be released may be adjusted.
>>   */
>>  gbl_reserve = hugepage_subpool_put_pages(spool, (chg - freed));
>> -hugetlb_acct_memory(h, -gbl_reserve);
>> +if (gbl_reserve)
>> +hugetlb_acct_memory(h, -gbl_reserve);

It is true that gbl_reserve is likely to be 0 in this code path.  However,
there are other code paths where hugetlb_acct_memory is called with a delta
value of 0 as well.  I would rather see a simple check at the beginning of
hugetlb_acct_memory, like:

if (!delta)
return 0;
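
In context, the check would sit at the very top of hugetlb_acct_memory(),
before hugetlb_lock is taken (sketch only; the existing body, elided here,
stays unchanged):

static int hugetlb_acct_memory(struct hstate *h, long delta)
{
	int ret = -ENOMEM;

	if (!delta)
		return 0;

	spin_lock(&hugetlb_lock);
	/* ... existing accounting logic unchanged ... */
	spin_unlock(&hugetlb_lock);

	return ret;
}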

-- 
Mike Kravetz

>>  
>>  return 0;
>>  }
>>
> 
> 


Re: [PATCH] mm/hugetlb: Use helper huge_page_order and pages_per_huge_page

2021-01-14 Thread Mike Kravetz
On 1/14/21 3:44 AM, Miaohe Lin wrote:
> Since commit a5516438959d ("hugetlb: modular state for hugetlb page size"),
> we can use huge_page_order to access hstate->order and pages_per_huge_page
> to fetch the pages per huge page. But gather_bootmem_prealloc() forgot to
> use it.
> 
> Signed-off-by: Miaohe Lin 
> ---
>  mm/hugetlb.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Thanks,

Reviewed-by: Mike Kravetz 

-- 
Mike Kravetz


Re: [PATCH v2 2/6] mm: hugetlbfs: fix cannot migrate the fallocated HugeTLB page

2021-01-06 Thread Mike Kravetz
On 1/6/21 8:35 AM, Michal Hocko wrote:
> On Wed 06-01-21 16:47:35, Muchun Song wrote:
>> Because we only can isolate a active page via isolate_huge_page()
>> and hugetlbfs_fallocate() forget to mark it as active, we cannot
>> isolate and migrate those pages.
> 
> I've little bit hard time to understand this initially and had to dive
> into the code to make sense of it. I would consider the following
> wording easier to grasp. Feel free to reuse if you like.
> "
> If a new hugetlb page is allocated during fallocate it will not be
> marked as active (set_page_huge_active) which will result in a later
> isolate_huge_page failure when the page migration code would like to
> move that page. Such a failure would be unexpected and wrong.
> "
> 
> Now to the fix. I believe that this patch shows that the
> set_page_huge_active is just too subtle. Is there any reason why we
> cannot make all freshly allocated huge pages active by default?

I looked into that yesterday.  The primary issue is in page fault code,
hugetlb_no_page is an example.  If page_huge_active is set, then it can
be isolated for migration.  So, migration could race with the page fault
and the page could be migrated before being added to the page table of
the faulting task.  This was an issue when hugetlb_no_page called
set_page_huge_active right after allocating and clearing the huge page.  Commit cb6acd01e2e4
moved the set_page_huge_active after adding the page to the page table
to address this issue.
-- 
Mike Kravetz


Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-06 Thread Mike Kravetz
On 1/6/21 8:56 AM, Michal Hocko wrote:
> On Wed 06-01-21 16:47:36, Muchun Song wrote:
>> There is a race condition between __free_huge_page()
>> and dissolve_free_huge_page().
>>
>> CPU0: CPU1:
>>
>> // page_count(page) == 1
>> put_page(page)
>>   __free_huge_page(page)
>>   dissolve_free_huge_page(page)
>> spin_lock(_lock)
>> // PageHuge(page) && !page_count(page)
>> update_and_free_page(page)
>> // page is freed to the buddy
>> spin_unlock(_lock)
>> spin_lock(_lock)
>> clear_page_huge_active(page)
>> enqueue_huge_page(page)
>> // It is wrong, the page is already freed
>> spin_unlock(_lock)
>>
>> The race windows is between put_page() and spin_lock() which
>> is in the __free_huge_page().
> 
> The race window really is between put_page and dissolve_free_huge_page.
> And the result is that the put_page path would clobber an unrelated page
> (either free or already reused page) which is quite serious.
> Fortunately pages are dissolved very rarely. I believe that user would
> require to be privileged to hit this by intention.
> 
>> We should make sure that the page is already on the free list
>> when it is dissolved.
> 
> Another option would be to check for PageHuge in __free_huge_page. Have
> you considered that rather than add yet another state? The scope of the
> spinlock would have to be extended. If that sounds more tricky then can
> we check the page->lru in the dissolve path? If the page is still
> PageHuge and reference count 0 then there shouldn't be many options
> where it can be queued, right?

The tricky part with expanding lock scope will be the potential call to
hugepage_subpool_put_pages as it may also try to acquire the hugetlb_lock.

I am not sure what you mean by 'check the page->lru'?  If we knew the page
was on the free list, then we could dissolve.  But, I do not think there
is an easy way to determine that from page->lru.  A hugetlb page is either
going to be on the active list or free list.

> 
>> Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle 
>> hugepage")
>> Signed-off-by: Muchun Song 
>> ---
>>  mm/hugetlb.c | 38 ++
>>  1 file changed, 38 insertions(+)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 4741d60f8955..8ff138c17129 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -79,6 +79,21 @@ DEFINE_SPINLOCK(hugetlb_lock);
>>  static int num_fault_mutexes;
>>  struct mutex *hugetlb_fault_mutex_table cacheline_aligned_in_smp;
>>  
>> +static inline bool PageHugeFreed(struct page *head)
>> +{
>> +return (unsigned long)head[3].mapping == -1U;
>> +}
>> +
>> +static inline void SetPageHugeFreed(struct page *head)
>> +{
>> +head[3].mapping = (void *)-1U;
>> +}
>> +
>> +static inline void ClearPageHugeFreed(struct page *head)
>> +{
>> +head[3].mapping = NULL;
>> +}
>> +
>>  /* Forward declaration */
>>  static int hugetlb_acct_memory(struct hstate *h, long delta);
>>  
>> @@ -1028,6 +1043,7 @@ static void enqueue_huge_page(struct hstate *h, struct 
>> page *page)
>>  list_move(&page->lru, &h->hugepage_freelists[nid]);
>>  h->free_huge_pages++;
>>  h->free_huge_pages_node[nid]++;
>> +SetPageHugeFreed(page);
>>  }
>>  
>>  static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
>> @@ -1044,6 +1060,7 @@ static struct page 
>> *dequeue_huge_page_node_exact(struct hstate *h, int nid)
>>  
>>  list_move(&page->lru, &h->hugepage_activelist);
>>  set_page_refcounted(page);
>> +ClearPageHugeFreed(page);
>>  h->free_huge_pages--;
>>  h->free_huge_pages_node[nid]--;
>>  return page;
>> @@ -1291,6 +1308,17 @@ static inline void 
>> destroy_compound_gigantic_page(struct page *page,
>>  unsigned int order) { }
>>  #endif
>>  
>> +/*
>> + * Because we reuse the mapping field of some tail page structs, we should
>> + * reset those mapping to initial value before @head is freed to the buddy
>> + * allocator. The invalid value will be checked in the 
>> free_tail_pages_check().
>> + */

When I suggested using head[3].mapp

Re: [PATCH v2 2/6] mm: hugetlbfs: fix cannot migrate the fallocated HugeTLB page

2021-01-06 Thread Mike Kravetz
On 1/6/21 12:02 PM, Michal Hocko wrote:
> On Wed 06-01-21 11:30:25, Mike Kravetz wrote:
>> On 1/6/21 8:35 AM, Michal Hocko wrote:
>>> On Wed 06-01-21 16:47:35, Muchun Song wrote:
>>>> Because we only can isolate a active page via isolate_huge_page()
>>>> and hugetlbfs_fallocate() forget to mark it as active, we cannot
>>>> isolate and migrate those pages.
>>>
>>> I've little bit hard time to understand this initially and had to dive
>>> into the code to make sense of it. I would consider the following
>>> wording easier to grasp. Feel free to reuse if you like.
>>> "
>>> If a new hugetlb page is allocated during fallocate it will not be
>>> marked as active (set_page_huge_active) which will result in a later
>>> isolate_huge_page failure when the page migration code would like to
>>> move that page. Such a failure would be unexpected and wrong.
>>> "
>>>
>>> Now to the fix. I believe that this patch shows that the
>>> set_page_huge_active is just too subtle. Is there any reason why we
>>> cannot make all freshly allocated huge pages active by default?
>>
>> I looked into that yesterday.  The primary issue is in page fault code,
>> hugetlb_no_page is an example.  If page_huge_active is set, then it can
>> be isolated for migration.  So, migration could race with the page fault
>> and the page could be migrated before being added to the page table of
>> the faulting task.  This was an issue when hugetlb_no_page called
>> set_page_huge_active right after allocating and clearing the huge page.
>> Commit cb6acd01e2e4
>> moved the set_page_huge_active after adding the page to the page table
>> to address this issue.
> 
> Thanks for the clarification. I was not aware of this subtlety. The
> existing comment is not helping much TBH. I am still digesting the
> suggested race. The page is new and exclusive and not visible via page
> tables yet, so the only source of the migration would be pfn based
> (hotplug, poisoning), right?

That is correct.


> Btw. s@set_page_huge_active@set_page_huge_migrateable@ would help
> readability IMHO. With a comment explaining that this _has_ to be called
> after the page is fully initialized.

Agree, I will add that as a future enhancement.

-- 
Mike Kravetz


Re: [PATCH] mm/hugetlb: Fix potential missing huge page size info

2021-01-07 Thread Mike Kravetz
On 1/7/21 4:34 AM, Miaohe Lin wrote:
> The huge page size is encoded for VM_FAULT_HWPOISON errors only. So if we
> return VM_FAULT_HWPOISON, huge page size would just be ignored.
> 
> Fixes: aa50d3a7aa81 ("Encode huge page size for VM_FAULT_HWPOISON errors")
> Signed-off-by: Miaohe Lin 
> Cc: 
> ---
>  mm/hugetlb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Thanks!

Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz


Re: [PATCH] mm/hugetlb: Fix potential double free in hugetlb_register_node() error path

2021-01-07 Thread Mike Kravetz
On 1/7/21 4:32 AM, Miaohe Lin wrote:
> In hugetlb_sysfs_add_hstate(), we would do kobject_put() on hstate_kobjs
> when failed to create sysfs group but forget to set hstate_kobjs to NULL.
> Then in hugetlb_register_node() error path, we may free it again via
> hugetlb_unregister_node().
> 
> Fixes: a3437870160c ("hugetlb: new sysfs interface")
> Signed-off-by: Miaohe Lin 
> Cc: 
> ---
>  mm/hugetlb.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)

Thanks, this is a potential issue that should be fixed.

Reviewed-by: Mike Kravetz 

This has been around for a long time (more than 12 years).  I suspect
nobody actually experienced this issue.  You just discovered via code
inspection.  Correct?
At one time cc stable would not be accepted for this type of issue,
not sure about today.
-- 
Mike Kravetz


Re: [PATCH 3/6] hugetlb: add free page reporting support

2021-01-07 Thread Mike Kravetz
On 1/5/21 7:49 PM, Liang Li wrote:
> hugetlb manages its pages in hstate's free page lists, not in the buddy
> system; this patch tries to make it work for hugetlbfs. It can be
> used for memory overcommit in virtualization and hugetlb pre-zero
> out.

I am not looking closely at the hugetlb changes yet.  There seem to be
higher level questions about page reporting/etc.  Once those are sorted,
I will be happy to take a closer look.  One quick question below.

> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -41,6 +41,7 @@
>  #include 
>  #include 
>  #include 
> +#include "page_reporting.h"
>  #include "internal.h"
>  
>  int hugetlb_max_hstate __read_mostly;
> @@ -1028,6 +1029,9 @@ static void enqueue_huge_page(struct hstate *h, struct 
> page *page)
>   list_move(&page->lru, &h->hugepage_freelists[nid]);
>   h->free_huge_pages++;
>   h->free_huge_pages_node[nid]++;
> + if (hugepage_reported(page))
> + __ClearPageReported(page);
> + hugepage_reporting_notify_free(h->order);
>  }
>  
>  static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> @@ -5531,6 +5535,21 @@ follow_huge_pgd(struct mm_struct *mm, unsigned long 
> address, pgd_t *pgd, int fla
>   return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> 
> PAGE_SHIFT);
>  }
>  
> +void isolate_free_huge_page(struct page *page, struct hstate *h, int nid)
> +{
> + VM_BUG_ON_PAGE(!PageHead(page), page);
> +
> + list_move(>lru, >hugepage_activelist);
> + set_page_refcounted(page);
> +}
> +
> +void putback_isolate_huge_page(struct hstate *h, struct page *page)
> +{
> + int nid = page_to_nid(page);
> +
> + list_move(&page->lru, &h->hugepage_freelists[nid]);
> +}

The above routines move pages between the free and active lists without any
update to free page counts.  How does that work?  Will the number of entries
on the free list get out of sync with the free_huge_pages counters?
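
For reference, the existing enqueue/dequeue paths adjust those counters, so if
the counts are indeed meant to stay in sync the helpers would presumably need
something like the rough sketch below (illustrative only, not a tested patch):

void isolate_free_huge_page(struct page *page, struct hstate *h, int nid)
{
	VM_BUG_ON_PAGE(!PageHead(page), page);

	list_move(&page->lru, &h->hugepage_activelist);
	set_page_refcounted(page);
	h->free_huge_pages--;			/* keep free counts in sync */
	h->free_huge_pages_node[nid]--;
}

void putback_isolate_huge_page(struct hstate *h, struct page *page)
{
	int nid = page_to_nid(page);

	list_move(&page->lru, &h->hugepage_freelists[nid]);
	h->free_huge_pages++;
	h->free_huge_pages_node[nid]++;
}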
-- 
Mike Kravetz


Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-07 Thread Mike Kravetz
On 1/7/21 12:40 AM, Michal Hocko wrote:
> On Wed 06-01-21 12:58:29, Mike Kravetz wrote:
>> On 1/6/21 8:56 AM, Michal Hocko wrote:
>>> On Wed 06-01-21 16:47:36, Muchun Song wrote:
>>>> There is a race condition between __free_huge_page()
>>>> and dissolve_free_huge_page().
>>>>
>>>> CPU0:                                  CPU1:
>>>>
>>>> // page_count(page) == 1
>>>> put_page(page)
>>>>   __free_huge_page(page)
>>>>                                        dissolve_free_huge_page(page)
>>>>                                          spin_lock(&hugetlb_lock)
>>>>                                          // PageHuge(page) && !page_count(page)
>>>>                                          update_and_free_page(page)
>>>>                                          // page is freed to the buddy
>>>>                                          spin_unlock(&hugetlb_lock)
>>>>     spin_lock(&hugetlb_lock)
>>>>     clear_page_huge_active(page)
>>>>     enqueue_huge_page(page)
>>>>     // It is wrong, the page is already freed
>>>>     spin_unlock(&hugetlb_lock)
>>>>
>>>> The race window is between put_page() and the spin_lock() which
>>>> is in __free_huge_page().
>>>
>>> The race window reall is between put_page and dissolve_free_huge_page.
>>> And the result is that the put_page path would clobber an unrelated page
>>> (either free or already reused page) which is quite serious.
>>> Fortunatelly pages are dissolved very rarely. I believe that user would
>>> require to be privileged to hit this by intention.
>>>
>>>> We should make sure that the page is already on the free list
>>>> when it is dissolved.
>>>
>>> Another option would be to check for PageHuge in __free_huge_page. Have
>>> you considered that rather than add yet another state? The scope of the
>>> spinlock would have to be extended. If that sounds more tricky then can
>>> we check the page->lru in the dissolve path? If the page is still
>>> PageHuge and reference count 0 then there shouldn't be many options
>>> where it can be queued, right?
>>
>> The tricky part with expanding lock scope will be the potential call to
>> hugepage_subpool_put_pages as it may also try to acquire the hugetlb_lock.
> 
> Can we rearrange the code and move hugepage_subpool_put_pages after all
> this is done? Or is there any strong reason for the particular ordering?

The reservation code is so fragile, I always get nervous when making
any changes.  However, the straightforward patch below passes some
simple testing.  The only difference I can see is that global counts
are adjusted before sub-pool counts.  This should not be an issue as
global and sub-pool counts are adjusted independently (not under the
same lock).  Allocation code checks sub-pool counts before global
counts.  So, there is a SMALL potential that a racing allocation which
previously succeeded would now fail.  I do not think this is an issue
in practice.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3b38ea958e95..658593840212 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1395,6 +1395,11 @@ static void __free_huge_page(struct page *page)
(struct hugepage_subpool *)page_private(page);
bool restore_reserve;
 
+   spin_lock(&hugetlb_lock);
+   /* check for race with dissolve_free_huge_page/update_and_free_page */
+   if (!PageHuge(page))
+   return;
+
VM_BUG_ON_PAGE(page_count(page), page);
VM_BUG_ON_PAGE(page_mapcount(page), page);
 
@@ -1403,26 +1408,6 @@ static void __free_huge_page(struct page *page)
restore_reserve = PagePrivate(page);
ClearPagePrivate(page);
 
-   /*
-* If PagePrivate() was set on page, page allocation consumed a
-* reservation.  If the page was associated with a subpool, there
-* would have been a page reserved in the subpool before allocation
-* via hugepage_subpool_get_pages().  Since we are 'restoring' the
-* reservtion, do not call hugepage_subpool_put_pages() as this will
-* remove the reserved page from the subpool.
-*/
-   if (!restore_reserve) {
-   /*
-* A return code of zero implies that the subpool will be
-* under its minimum size if the reservation is not restored
-* after page is free.  Therefore, force restore_reserve
-* operation.
-*/
-   if (hugepage_subpool_put_pages(spool, 1) == 0)
-   restore_reserve = true;
-   }
-
-   spin_lock(&hugetlb_lock);
clear_page_huge_active(page);
hugetlb_cgroup_uncharge_page(hstate_index(h),

Re: [External] Re: [PATCH v2 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-07 Thread Mike Kravetz
On 1/7/21 7:11 AM, Muchun Song wrote:
> On Thu, Jan 7, 2021 at 10:11 PM Michal Hocko  wrote:
>>
>> On Thu 07-01-21 20:59:33, Muchun Song wrote:
>>> On Thu, Jan 7, 2021 at 8:38 PM Michal Hocko  wrote:
>> [...]
>>>> Right. Can we simply back off in the dissolving path when ref count is
>>>> 0 && PageHuge() if list_empty(page->lru)? Is there any other scenario
>>>> when the all above is true and the page is not being freed?
>>>
>>> The list_empty(&page->lru) may always return false.
>>> The page before freeing is on the active list
>>> (hstate->hugepage_activelist).  Then it is on the free list
>>> after freeing.  So list_empty(&page->lru) is always false.
>>
>> The point I was trying to make is that the page has to be enqueued when
>> it is dissolved and freed. If the page is not enqueued then something
>> racing. But then I have realized that this is not a great check to
>> detect the race because pages are going to be released to buddy
>> allocator and that will reuse page->lru again. So scratch that and sorry
>> for the detour.
>>
>> But that made me think some more and one way to reliably detect the race
>> should be PageHuge() check in the freeing path. This is what dissolve
>> path does already. PageHuge becomes false during update_and_free_page()
>> while holding the hugetlb_lock. So can we use that?
> 
> It may make the thing complex. Apart from freeing it to the
> buddy allocator, free_huge_page also does something else for
> us. If we detect the race in the freeing path, if it is not a HugeTLB
> page, the freeing path just returns. We also should move those
> things to the dissolve path. Right?
> 
> But I find a tricky problem to solve. See free_huge_page().
> If we are in non-task context, we should schedule a work
> to free the page. We reuse the page->mapping. If the page
> is already freed by the dissolve path. We should not touch
> the page->mapping. So we need to check PageHuge().
> The check and llist_add() should be protected by
> hugetlb_lock. But we cannot do that. Right? If dissolve
> happens after it is linked to the list. We also should
> remove it from the list (hpage_freelist). It seems to make
> the thing more complex.

You are correct.  This is also an issue/potential problem with this
race.  It seems that adding the state information might be the least
complex way to address the issue.
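
For reference, the non-task context path Muchun refers to looks roughly
like this today (paraphrased from memory, not an exact copy of the source):

void free_huge_page(struct page *page)
{
	/*
	 * Defer freeing when not in task context; hugetlb_lock is not
	 * safe to take from softirq context, so punt to a workqueue.
	 */
	if (!in_task()) {
		/*
		 * page->mapping is reused as an llist_node while the page
		 * sits on hpage_freelist -- the field a racing dissolve
		 * must not touch.
		 */
		if (llist_add((struct llist_node *)&page->mapping,
			      &hpage_freelist))
			schedule_work(&free_hpage_work);
		return;
	}

	__free_huge_page(page);
}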

-- 
Mike Kravetz


Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one

2021-01-12 Thread Mike Kravetz
On 1/12/21 6:53 AM, Michal Hocko wrote:
> On Tue 12-01-21 15:41:02, David Hildenbrand wrote:
>> On 12.01.21 15:23, Michal Hocko wrote:
>>> On Tue 12-01-21 13:16:45, Michal Hocko wrote:
>>> [...]
>>>> Well, currently pool pages are not migrateable but you are right that
>>>> this is likely something that we will need to look into in the future
>>>> and this optimization would stand in the way.
>>>
>>> After some more thinking I believe I was wrong in my last statement.
>>> This optimization shouldn't have any effect on pages on the pool as
>>> those stay at reference count 0 and they cannot be isolated either
>>> (clear_page_huge_active before it is enqueued).
>>>
>>> That being said, the migration code would still have to learn about
>>> these pages, but that is out of scope of this discussion.
>>>
>>> Sorry about the confusion from my side.
>>>
>>
>> At this point I am fairly confused what's working at what's not :D
> 
> heh, tell me something about that. Hugetlb is a maze full of land mines.
> 
>> I think this will require more thought, on how to teach
>> alloc_contig_range() (and eventually in some cases offline_pages()?) to
>> do the right thing.
> 
> Well, offlining sort of works because it retries both migrates and
> dissolves. It can fail with the latter due to reservations but that can
> be expected. We can still try harder to relocate/rebalance per numa
> pools to keep the reservation but I strongly suspect nobody has noticed
> this to be a problem so there we are.

Due to my time zone, I get to read all the previous comments before
commenting myself. :)

To be clear, this patch is handling a very specific case where a hugetlb
page was isolated for migration and after being isolated the last reference
to the page was dropped.  Normally, dropping the last reference would send
the page to free_huge_page processing.  free_huge_page processing would
either dissolve the page to the buddy allocator or more likely place the
page on the free list of the pool.  However, since isolation also holds
a reference to the page, processing is continued down the migration path.

Today there is no code to migrate free huge pages in the pool.  Only
active, in-use pages are migrated.  Without this patch, 'free pages' in
the very specific state above would be migrated.  But to be clear, that
was never the intention of any hugetlb migration code.  In that respect,
I believe this patch helps the current code work as designed.

David brings up the valid point that alloc_contig_range needs to deal
with free hugetlb pool pages.  However, that is code which needs to be
written as it does not exist today.  I suspect nobody thought about free
hugetlb pages when alloc_contig_range was written.  This patch should
in no way hinder development of that code.  Free huge pages have a ref
count of 0, and this code is checking for ref count of 1.

That is a long way of saying that I still agree with this patch.  The
only modification/suggestion would be enhancing the commit message as
suggested by Michal.
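
For reference, the check under discussion is essentially of this shape in
unmap_and_move_huge_page() (paraphrased, not a verbatim copy of the patch):

	if (page_count(hpage) == 1) {
		/*
		 * The isolation reference is the only remaining reference;
		 * the page was freed from under us.  Putting it back drops
		 * that reference and lets free_huge_page() handle it, so
		 * there is nothing to migrate.
		 */
		putback_active_hugepage(hpage);
		return MIGRATEPAGE_SUCCESS;
	}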
-- 
Mike Kravetz


Re: [PATCH v12 03/13] mm: Introduce VM_WARN_ON_PAGE macro

2021-01-13 Thread Mike Kravetz
On 1/6/21 6:19 AM, Muchun Song wrote:
> Very similar to VM_WARN_ON_ONCE_PAGE and VM_BUG_ON_PAGE, add
> VM_WARN_ON_PAGE macro.
> 
> Signed-off-by: Muchun Song 
> ---
>  include/linux/mmdebug.h | 8 
>  1 file changed, 8 insertions(+)

I was going to question the use/need for this macro in the following
patch.  Looks like Oscar has already done that, and free_bootmem_page
will now use VM_BUG_ON_PAGE.  So, this patch can be dropped.

-- 
Mike Kravetz


Re: [RFC PATCH 2/3] hugetlb: convert page_huge_active() to HPageMigratable flag

2021-01-15 Thread Mike Kravetz
On 1/15/21 1:17 AM, Oscar Salvador wrote:
> On Mon, Jan 11, 2021 at 01:01:51PM -0800, Mike Kravetz wrote:
>> Use the new hugetlb page specific flag to replace the page_huge_active
>> interfaces.  By it's name, page_huge_active implied that a huge page
>> was on the active list.  However, that is not really what code checking
>> the flag wanted to know.  It really wanted to determine if the huge
>> page could be migrated.  This happens when the page is actually added
>> the page cache and/or task page table.  This is the reasoning behind the
>> name change.
>>
>> The VM_BUG_ON_PAGE() calls in the interfaces were not really necessary
>> as in all cases but one we KNOW the page is a hugetlb page.  Therefore,
>> they are removed.  In one call to HPageMigratable() it is possible for
>> the page to not be a hugetlb page due to a race.  However, the code
>> making the call (scan_movable_pages) is inherently racy, and page state
>> will be validated later in the migration process.
>>
>> Note:  Since HPageMigratable is used outside hugetlb.c, it can not be
>> static.  Therefore, a new set of hugetlb page flag macros is added for
>> non-static flag functions.
> 
> Two things about this one:
> 
> I am not sure about the name of this one.
> It is true that page_huge_active() was only called by memory-hotplug and all
> it wanted to know was whether the page was in-use and so if it made sense
> to migrate it, so I see some value in the new PageMigratable flag.
> 
> However, not all in-use hugetlb can be migrated, e.g: we might have 
> constraints
> when it comes to migrate certain sizes of hugetlb, right?
> So setting HPageMigratable to all active hugetlb pages might be a bit 
> misleading?
> HPageActive maybe? (Sorry, don't have a replacement)

You concerns about the name change are correct.

The reason for the change came about from discussions about Muchun's series
of fixes and the need for a new 'page is freed' status to fix a race.  In
that discussion, Michal asked 'Why can't we simply set page_huge_active when
the page is allocated and put on the active list?'.  That is mentioned above,
but we really do not want to try and migrate pages after they are allocated
and before they are in use.  That causes problems in the fault handling code.

Anyway, that is how the suggestion for Migration came about.

In that discussion David Hildenbrand noted that code in alloc_contig_range
should migrate free hugetlb pages, but there is no support for that today.
I plan to look at that if nobody else does.  When such code is added, the
name 'Migratable' will become less applicable.

I'm not great at naming.  Perhaps 'In_Use' as a flag name might fit better.

> The other thing is that you are right that scan_movable_pages is racy, but
> page_huge_active() was checking if the page had the Head flag set before
> retrieving page[1].
> 
> Before the page_huge_active() in scan_movable_pages() we have the
> if (!PageHuge(page)) check, but could it be that between that check and
> the page_huge_active(), the page gets dissolved, and so we are checking
> a wrong page[1]? Am I making sense? 

Yes, you are making sense.

The reason I decided to drop the check is that it does not eliminate the
race.  Even with that check in page_huge_active, the page could be dissolved
between that check and the check of page[1].  There really is no way to eliminate
the race without holding a reference to the page (or hugetlb_lock).  That
check in page_huge_active just shortens the race window.
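
For context, the check we are talking about currently looks roughly like
this (from memory):

bool page_huge_active(struct page *page)
{
	VM_BUG_ON_PAGE(!PageHuge(page), page);
	return PageHead(page) && PagePrivate(&page[1]);
}

The PageHead() test narrows the window in which a dissolved page would be
misread, but the page can still be dissolved between that test and the
PagePrivate(&page[1]) read, which is why only a reference (or hugetlb_lock)
truly closes the race.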

-- 
Mike Kravetz


Re: [RFC PATCH 3/3] hugetlb: convert PageHugeTemporary() to HPageTempSurplus

2021-01-15 Thread Mike Kravetz
On 1/15/21 2:16 AM, Oscar Salvador wrote:
> On Mon, Jan 11, 2021 at 01:01:52PM -0800, Mike Kravetz wrote:
>> Use new hugetlb specific flag HPageTempSurplus to replace the
>> PageHugeTemporary() interfaces.
>>
>> Signed-off-by: Mike Kravetz 
>> ---
>>  mm/hugetlb.c | 38 +-
>>  1 file changed, 9 insertions(+), 29 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 34ce82f4823c..949e1f987319 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -58,6 +58,7 @@ static unsigned long hugetlb_cma_size __initdata;
>>  enum htlb_page_flags {
>>  HPAGE_RestoreReserve = 0,
>>  HPAGE_Migratable,
>> +HPAGE_TempSurplus,
>>  };
>>  
>>  /*
>> @@ -99,6 +100,7 @@ void ClearHPage##flname(struct page *page)\
>>  
>>  HPAGEFLAG(RestoreReserve)
>>  EXT_HPAGEFLAG(Migratable)
>> +HPAGEFLAG(TempSurplus)
> 
> Would HPAGE_Temporary/Temporary not be a better fit?

Yes it would.

> The point about current PageHugeTemporary is that these pages are temporary as
> they do not belong to the pool and will vanish once the last reference gets
> dropped.
> We do not really care that much about the surplus part?
> 
> Besides, alloc_migrate_huge_page seems to not want to treat these pages as
> surplus.
> 

All correct, not sure why I was thinking 'surplus' when naming the flag.

> Also, I would add a comment either next to each flag or above
> the enum htlb_page_flags (probably the latter) with a brief explanation
> of each flag.

Will do.
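
Something like this is what I have in mind, using the naming you suggested
(an illustrative sketch only, not the final patch):

enum htlb_page_flags {
	HPAGE_RestoreReserve = 0,	/* page consumed a reservation; restore it on free */
	HPAGE_Migratable,		/* page is in use (page cache / page tables) and may be migrated */
	HPAGE_Temporary,		/* page is not part of the pool; freed when the last reference drops */
};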

-- 
Mike Kravetz

> Besides that, it looks fine to me.
> Here I do not see the same problem in
> stripping the PageHuge check in PageHugeTemporary, as I did in previous patch,
> because all callers of it make sure they operate on a hugetlb page.
> 


Re: [External] Re: [PATCH v12 04/13] mm/hugetlb: Free the vmemmap pages associated with each HugeTLB page

2021-01-13 Thread Mike Kravetz
On 1/13/21 1:20 AM, Oscar Salvador wrote:
> On Tue, Jan 12, 2021 at 07:33:33PM +0800, Muchun Song wrote:
>>> It seems a bit odd to only pass "start" for the BUG_ON.
>>> Also, I kind of dislike the "addr += PAGE_SIZE" in vmemmap_pte_range.
>>>
>>> I wonder if adding a ".remap_start_addr" would make more sense.
>>> And adding it here with the vmemmap_remap_walk init.
>>
>> How about introducing a new function which aims to get the reuse
>> page? In this case, we can drop the BUG_ON() and "addr += PAGE_SIZE"
>> which is in vmemmap_pte_range. The vmemmap_remap_range only
>> does the remapping.
> 
> How would that look? 
> It might be good, dunno, but the point is, we should try to make the rules as
> simple as possible, dropping weird assumptions.
> 
> Callers of vmemmap_remap_free should know three things:
> 
> - Range to be remapped
> - Addr to remap to
> - Current implemantion needs addr to be remap to to be part of the complete
>   range
> 
> right?

And, the current implementation requires the remap addr to be the first address
in the complete range.  This is just because of the way the page tables are walked
for remapping.  The remap/reuse page must be found first so that the following
pages can be remapped to it.

That implementation seems to be the 'most efficient' for hugetlb pages where
we want vmemmap pages n+3 and beyond mapped to n+2.

In a more general purpose vmemmap_remap_free implementation, the reuse/remap
address would not necessarily need to be related to the range.  However, this
would require a separate page table walk/validation for the reuse address
independent of the range.  This may be what Muchun was proposing for 'a new
function which aims to get the reuse page'.

IMO, the decision on how to implement depends on the intended use case.
- If this is going to be hugetlb only (or perhaps generic huge page only)
  functionality, then I am OK with an efficient implementation that has
  some restrictions.
- If we see this being used for more general purpose remapping, then we
  should go with a more general purpose implementation.

Again, just my opinion.
-- 
Mike Kravetz


Re: [PATCH v12 12/13] mm/hugetlb: Gather discrete indexes of tail page

2021-01-13 Thread Mike Kravetz
On 1/6/21 6:19 AM, Muchun Song wrote:
> For a HugeTLB page, there is more metadata to save in the struct page.
> But the head struct page cannot meet our needs, so we have to abuse
> other tail struct pages to store the metadata. In order to avoid
> conflicts caused by subsequent use of more tail struct pages, we can
> gather these discrete indexes of the tail struct pages. In this case, it
> will be easier to add a new tail page index later.
> 
> There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct
> page structs that can be used when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP,
> so add a BUILD_BUG_ON to catch invalid usage of the tail struct page.
> 
> Signed-off-by: Muchun Song 
> Reviewed-by: Oscar Salvador 
> ---
>  include/linux/hugetlb.h| 14 ++
>  include/linux/hugetlb_cgroup.h | 15 +--
>  mm/hugetlb.c   | 25 -
>  mm/hugetlb_vmemmap.c   |  8 
>  4 files changed, 43 insertions(+), 19 deletions(-)

My apologies!  I did not get to this patch in previous versions of the
series.  My "RFC create hugetlb flags to consolidate state" was done
before I even noticed your efforts here.

At least we agree the metadata could be better organized. :)

IMO, using page.private of the head page to consolidate flags will be
easier to manage.  So, I would like to use that.

The BUILD_BUG_ON in this patch makes sense.
-- 
Mike Kravetz


Re: [v5] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged

2020-10-02 Thread Mike Kravetz
On 10/2/20 4:25 AM, Michal Hocko wrote:
> On Wed 30-09-20 15:03:11, Mike Kravetz wrote:
>> On 9/30/20 1:47 PM, Vijay Balakrishna wrote:
>>> On 9/30/2020 11:20 AM, Mike Kravetz wrote:
>>>> On 9/29/20 9:49 AM, Vijay Balakrishna wrote:
>>>>
>>>> Sorry for jumping in so late.  Should we use this as an opportunity to
>>>> also fix up the messages logged when (re)calculating mfk?  They are wrong
>>>> and could be quite confusing.
>>>
>>>
>>> Sure.  Please share your thoughts regarding appropriate message.  Here is 
>>> what I'm thinking
>>>
>>> pr_warn("min_free_kbytes is not updated to %d because current value %d is 
>>> preferred\n", new_min_free_kbytes, min_free_kbytes);
>>>
>>> If above message is reasonable I can post a new revision (v6).
>>
>> Just considering the below example,
>>
>>>> For example consider the following sequence
>>>> of operations and corresponding log messages produced.
>>>>
>>>> Freshly booted VM with 2 nodes and 8GB memory:
>>>> # cat /proc/sys/vm/min_free_kbytes
>>>> 90112
>>>> # echo 9 > /proc/sys/vm/min_free_kbytes
>>>> # cat /proc/sys/vm/min_free_kbytes
>>>> 9
>>>> # echo 0 > /sys/devices/system/node/node1/memory56/online
>>>> [  135.099947] Offlined Pages 32768
>>>> [  135.102362] min_free_kbytes is not updated to 11241 because user 
>>>> defined value 9 is preferred
>>
>> I am not sure if there is any value in printing the above line.  Especially
>> in this context as it becomes obsolete with the printing of the next line.
> 
> The original intention was to make it explicit that auto-tuning is
> influenced by the user provided configuration.
> 
>>>> [  135.109070] khugepaged: raising min_free_kbytes from 9 to 90112 to
>>>> help transparent hugepage allocations
>>
>> IMO, the above line is the only one that should be output as a result of the
>> recalculation.
> 
> Well, but khugepaged could be disabled and then the above might not get
> printed. Sure the code could get reorganized and all that but is this
> really worth that?
> 
>> I guess that brings up the question of 'should we continue to track the user
>> defined value if we overwrite it?'.  If we quit tracking, it may help with the
>> next message.
> 
> Auto tuning and user provided override is quite tricky to get sensible.
> Especially in the case here. Admin has provided an override but has the
> potential memory hotplug been considered? Or to make it even more
> complicated, consider that the hotplug happens without admin involvement
> - e.g. memory gets hotremoved due to HW problems. Is the admin provided
> value still meaningful? To be honest I do not have a good answer and I
> am not sure we should care all that much until we see practical
> problems.

I am not insisting that this be cleaned up.  The change in this patch to
ensure THP related calculations are performed during hotplug is the most
important.

I became aware of the logging issues when looking at a customer issue with
an older kernel.  The min_free_kbytes setting was integral to the issue we
were investigating, and it was unclear whether or not the customer had
changed the value.  I knew the system log should contain evidence of manually
setting min_free_kbytes.  However, there was no evidence in the log.  Turns
out the customer did not change the value, but it did cause me to do a deep
dive into the logging code.
-- 
Mike Kravetz


Re: [PATCH 01/10] mm/hugetlb: not necessary to coalesce regions recursively

2020-08-10 Thread Mike Kravetz
On 8/7/20 2:12 AM, Wei Yang wrote:
> Per my understanding, we keep the regions ordered and would always
> coalesce regions properly. So the task to keep this property is just
> to coalesce its neighbour.
> 
> Let's simplify this.
> 
> Signed-off-by: Wei Yang 

Thanks!  It is unfortunate that the region management code is difficult
to understand.

Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz


Re: [PATCH 02/10] mm/hugetlb: make sure to get NULL when list is empty

2020-08-10 Thread Mike Kravetz
On 8/7/20 7:28 AM, Wei Yang wrote:
> On Fri, Aug 07, 2020 at 08:49:51PM +0800, Baoquan He wrote:
>> On 08/07/20 at 05:12pm, Wei Yang wrote:
>>> list_first_entry() does not return NULL even when the list is empty.
>>>
>>> Let's make the behavior explicit by using list_first_entry_or_null();
>>> otherwise we could corrupt the list.
>>>
>>> Signed-off-by: Wei Yang 
>>> ---
>>>  mm/hugetlb.c | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 62ec74f6d03f..0a2f3851b828 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -237,7 +237,8 @@ get_file_region_entry_from_cache(struct resv_map *resv, 
>>> long from, long to)
>>> VM_BUG_ON(resv->region_cache_count <= 0);
>>
>>
>> We already have the above line; is it possible for list_first_entry() to return NULL?
>>
>>>  
>>> resv->region_cache_count--;
>>> -   nrg = list_first_entry(&resv->region_cache, struct file_region, link);
>>> +   nrg = list_first_entry_or_null(&resv->region_cache,
>>> +   struct file_region, link);
>>> VM_BUG_ON(!nrg);
> 
> Or we can remove this VM_BUG_ON()?
> 

I would prefer that we just remove the 'VM_BUG_ON(!nrg)'.  Code elsewhere
is responsible for making sure there is ALWAYS an entry in the cache.  That
is why the 'VM_BUG_ON(resv->region_cache_count <= 0)' is at the beginning
of the routine.
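
To spell out the semantics being discussed (illustrative only, not a patch):

	/*
	 * list_first_entry() never returns NULL: on an empty list it hands
	 * back container_of() applied to the list head itself, and using
	 * that bogus entry corrupts memory.  list_first_entry_or_null()
	 * returns NULL for an empty list instead.  Since callers guarantee
	 * the cache is never empty (hence the VM_BUG_ON on
	 * region_cache_count above), the plain variant is fine once the
	 * 'VM_BUG_ON(!nrg)' is dropped.
	 */
	nrg = list_first_entry(&resv->region_cache, struct file_region, link);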

-- 
Mike Kravetz


Re: [PATCH 03/10] mm/hugetlb: use list_splice to merge two list at once

2020-08-10 Thread Mike Kravetz
On 8/7/20 2:12 AM, Wei Yang wrote:
> Instead of adding allocated file_region entries one by one to region_cache, we
> could use list_splice to merge two lists at once.
> 
> Also, since we know the number of entries in the list, we can increase the
> count directly.
> 
> Signed-off-by: Wei Yang 

Thanks!

Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz


Re: [PATCH 04/10] mm/hugetlb: count file_region to be added when regions_needed != NULL

2020-08-10 Thread Mike Kravetz
On 8/7/20 2:12 AM, Wei Yang wrote:
> There are only two cases of function add_reservation_in_range()
> 
> * count file_region and return the number in regions_needed
> * do the real list operation without counting
> 
> This means it is not necessary to have two parameters to classify these
> two cases.
> 
> Just use regions_needed to separate them.
> 
> Signed-off-by: Wei Yang 

Thanks!

I really like removal of the 'count_only' parameter.  However, within the
routine 'count_only' made the code just a little easier to understand. I
might have:
- Removed the function parameter.
- Added a local variable:
  bool count_only = regions_needed != NULL;
- Left the remaining code as is.

I'm OK with the proposed change.  Any change to readability is VERY minimal.

Reviewed-by: Mike Kravetz 

-- 
Mike Kravetz


Re: [PATCH 06/10] mm/hugetlb: remove redundant huge_pte_alloc() in hugetlb_fault()

2020-08-10 Thread Mike Kravetz
On 8/7/20 2:12 AM, Wei Yang wrote:
> Before proper processing, huge_pte_alloc() would be called
> unconditionally. It is not necessary to do this when ptep is NULL.

Worse, that extra call is a bug.  I believe Andrew pulled this patch into
his queue.  It still could use a review.

https://lore.kernel.org/linux-mm/e670f327-5cf9-1959-96e4-6dc7cc30d...@oracle.com/

-- 
Mike Kravetz


Re: [PATCH 07/10] mm/hugetlb: a page from buddy is not on any list

2020-08-10 Thread Mike Kravetz
On 8/7/20 2:12 AM, Wei Yang wrote:
> The page allocated from buddy is not on any list, so just using list_add()
> is enough.
> 
> Signed-off-by: Wei Yang 

Thanks!

Reviewed-by: Mike Kravetz 

-- 
Mike Kravetz

