Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-30 Thread Hugh Dickins
On Mon, 30 Mar 2015, KOSAKI Motohiro wrote:
> On Thu, Mar 26, 2015 at 10:08 AM, Eric B Munson  wrote:
> > On 03/26/2015 07:56 AM, Davide Libenzi wrote:
> >> On Wed, 25 Mar 2015, David Rientjes wrote:
> >>
> >>> I looked at this thread at http://marc.info/?t=14139250881
> >>> since I didn't have it in my mailbox, and I didn't get a chance
> >>> to actually run your test code.
> >>>
> >>> In short, I think what you're saying is that
> >>>
> >>> ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...) munmap(ptr,
> >>> 4KB) == EINVAL
> >>
> >> I am not sure you have read the email correctly:
> >>
> >> munmap(mmap(size, HUGETLB), size) = EFAIL
> >>
> >> For every size not multiple of the huge page size. Whereas:
> >>
> >> munmap(mmap(size, HUGETLB), ALIGN(size, HUGEPAGE_SIZE)) = OK
> >
> > I think Davide is right here, this is a long existing bug in the
> > MAP_HUGETLB implementation.  Specifically, the mmap man page says:
> >
> > All pages containing a part of the indicated range are unmapped, and
> > subsequent references to these pages will generate SIGSEGV.
> >
> > I realize that huge pages may not have been considered by those that
> > wrote the spec.  But if I read this I would assume that all pages,
> > regardless of size, touched by the munmap() request should be unmapped.
> >
> > Please include
> > Acked-by: Eric B Munson 
> > to the original patch.  I would like to see the mmap man page adjusted
> > to make note of this behavior as well.
> 
> This is just a bug fix, and I don't think it carries a large risk. But be
> cautious: we might have to revert it immediately if this patch causes some
> regression, even if it comes from broken application code.
> 
> Acked-by: KOSAKI Motohiro 

and, without wishing to be confrontational,
Nacked-by: Hugh Dickins 

I agree with you that the risk on munmap is probably not large;
but I still have no interest in spending more time on changing
twelve-year-old established behaviour in this way, then looking
out for the regressions and preparing to revert.

The risk is larger on mprotect, and on the other calls which this
patch, as it stands, would leave inconsistent with munmap.

Hugh


Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-30 Thread KOSAKI Motohiro
On Thu, Mar 26, 2015 at 10:08 AM, Eric B Munson  wrote:
> On 03/26/2015 07:56 AM, Davide Libenzi wrote:
>> On Wed, 25 Mar 2015, David Rientjes wrote:
>>
>>> I looked at this thread at http://marc.info/?t=14139250881
>>> since I didn't have it in my mailbox, and I didn't get a chance
>>> to actually run your test code.
>>>
>>> In short, I think what you're saying is that
>>>
>>> ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...) munmap(ptr,
>>> 4KB) == EINVAL
>>
>> I am not sure you have read the email correctly:
>>
>> munmap(mmap(size, HUGETLB), size) = EFAIL
>>
>> For every size not multiple of the huge page size. Whereas:
>>
>> munmap(mmap(size, HUGETLB), ALIGN(size, HUGEPAGE_SIZE)) = OK
>
> I think Davide is right here, this is a long existing bug in the
> MAP_HUGETLB implementation.  Specifically, the mmap man page says:
>
> All pages containing a part of the indicated range are unmapped, and
> subsequent references to these pages will generate SIGSEGV.
>
> I realize that huge pages may not have been considered by those that
> wrote the spec.  But if I read this I would assume that all pages,
> regardless of size, touched by the munmap() request should be unmapped.
>
> Please include
> Acked-by: Eric B Munson 
> to the original patch.  I would like to see the mmap man page adjusted
> to make note of this behavior as well.

This is just a bug fix, and I don't think it carries a large risk. But be
cautious: we might have to revert it immediately if this patch causes some
regression, even if it comes from broken application code.

Acked-by: KOSAKI Motohiro 


Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-27 Thread Eric B Munson
On Thu, 26 Mar 2015, David Rientjes wrote:

> On Thu, 26 Mar 2015, Davide Libenzi wrote:
> 
> > > Yes, this munmap() behavior of lengths <= hugepage_size - PAGE_SIZE for a
> > > hugetlb vma is long standing and there may be applications that break as a
> > > result of changing the behavior: a database that reserves all allocated
> > > hugetlb memory with mmap() so that it always has exclusive access to those
> > > hugepages, whether they are faulted or not, and maintains its own hugepage
> > > pool (which is common), may test the return value of munmap() and depend
> > > on it returning -EINVAL to determine if it is freeing memory that was
> > > either dynamically allocated or mapped from the hugetlb reserved pool.
> > 
> > You went a long way to create such a case.
> > But, in your case, that application will erroneously consider
> > hugepage-mmapped memory as dynamically allocated, since it will always get
> > EINVAL unless it passes an aligned size, and an aligned size is one that a
> > fix like the one posted in the patch will still leave as a success.
> 
> There was a patch proposed last week to add reserved pools to the 
> hugetlbfs mount option specifically for the case where a large database 
> wants sole reserved access to the hugepage pool.  This is why hugetlbfs 
> pages become reserved on mmap().  In that case, the database never wants 
> to do munmap() and instead maintains its own hugepage pool.
> 
> That makes the usual database case, mmap() all necessary hugetlb pages to 
> reserve them, even easier since they have historically had to maintain 
> this pool amongst various processes.
> 
> Is there a process out there that tests for munmap(ptr) == EINVAL and, if 
> true, returns ptr to its hugepage pool?  I can't say for certain that none 
> exist, that's why the potential for breakage exists.

Such an application can use /proc/pid/smaps to determine the page size
of a mapping.  IMO, this is relying on broken behavior, but I see your
point that this behavior has been present for a long time.

As I stated before, I think we should fix this bug and make munmap()
behavior match what is described in the man page.
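
A sketch of the /proc/pid/smaps approach mentioned above, for illustration
only: it assumes the usual smaps layout (an address-range header line
followed by a "KernelPageSize: <n> kB" field), and the helper name is made
up here.

#include <stdio.h>

/* Return the kernel page size (in bytes) of the mapping containing addr,
 * or -1 if it cannot be determined. */
static long mapping_page_size(const void *addr)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[256];
	unsigned long start, end, target = (unsigned long)addr;
	long kb;
	int in_vma = 0;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%lx-%lx ", &start, &end) == 2)
			in_vma = (target >= start && target < end);
		else if (in_vma &&
			 sscanf(line, "KernelPageSize: %ld kB", &kb) == 1) {
			fclose(f);
			return kb * 1024;
		}
	}
	fclose(f);
	return -1;
}

Comparing the result against sysconf(_SC_PAGESIZE) would let such an
application tell a hugetlb mapping apart from a normal one without relying
on munmap() returning EINVAL.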

> 
> > OTOH, an application, which might be more common than the one you posted,
> > which calls munmap() to release a pointer which it validly got from a 
> > previous mmap(), will leak huge pages as all the issued munmaps will fail.
> > 
> 
> That application would have to be ignoring an EINVAL return value.
> 
> > > If we were to go back in time and decide this when the munmap() behavior 
> > > for hugetlb vmas was originally introduced, that would be valid.  The 
> > > problem is that it could lead to userspace breakage and that's a 
> > > non-starter.
> > > 
> > > What we can do is improve the documentation and man-page to clearly 
> > > specify the long-standing behavior so that nobody encounters unexpected 
> > > results in the future.
> > 
> > This way you will leave the mmap API with broken semantics.
> > In any case, I am done arguing.
> > I will leave to Andrew to sort it out, and to Michael Kerrisk to update 
> > the mmap man pages with the new funny behaviour.
> > 
> 
> The behavior is certainly not new, it has always been the case for 
> munmap() on hugetlb vmas.
> 
> In a strict POSIX interpretation, it refers only to pages in the sense of
> what is returned by sysconf(_SC_PAGESIZE).  Such vmas are not backed by 
> any pages of size sysconf(_SC_PAGESIZE), so this behavior is undefined.  
> It would be best to modify the man page to explicitly state this for 
> MAP_HUGETLB.


Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-27 Thread Vlastimil Babka
Might be too late in this thread, but in case you are going to continue and/or
repost:

[CC += linux-...@vger.kernel.org]
(also linux-man and Michael to match my other reply)

Since this is a kernel-user-space API change, please CC linux-api@. The
kernel source file Documentation/SubmitChecklist notes that all Linux kernel
patches that change userspace interfaces should be CCed to
linux-...@vger.kernel.org, so that the various parties who are interested in API
changes are informed. For further information, see
https://www.kernel.org/doc/man-pages/linux-api-ml.html


On 03/26/2015 09:03 PM, David Rientjes wrote:
> On Thu, 26 Mar 2015, Davide Libenzi wrote:
> 
>> > Yes, this munmap() behavior of lengths <= hugepage_size - PAGE_SIZE for a 
>> > hugetlb vma is long standing and there may be applications that break as a 
>> > result of changing the behavior: a database that reserves all allocated 
>> > hugetlb memory with mmap() so that it always has exclusive access to those 
>> > hugepages, whether they are faulted or not, and maintains its own hugepage 
>> > pool (which is common), may test the return value of munmap() and depend 
>> > on it returning -EINVAL to determine if it is freeing memory that was 
>> > either dynamically allocated or mapped from the hugetlb reserved pool.
>> 
>> You went a long way to create such a case.
>> But, in your case, that application will erroneously consider
>> hugepage-mmapped memory as dynamically allocated, since it will always get
>> EINVAL unless it passes an aligned size, and an aligned size is one that a
>> fix like the one posted in the patch will still leave as a success.
> 
> There was a patch proposed last week to add reserved pools to the 
> hugetlbfs mount option specifically for the case where a large database 
> wants sole reserved access to the hugepage pool.  This is why hugetlbfs 
> pages become reserved on mmap().  In that case, the database never wants 
> to do munmap() and instead maintains its own hugepage pool.
> 
> That makes the usual database case, mmap() all necessary hugetlb pages to 
> reserve them, even easier since they have historically had to maintain 
> this pool amongst various processes.
> 
> Is there a process out there that tests for munmap(ptr) == EINVAL and, if 
> true, returns ptr to its hugepage pool?  I can't say for certain that none 
> exist, that's why the potential for breakage exists.
> 
>> OTOH, an application, which might be more common than the one you posted,
>> which calls munmap() to release a pointer which it validly got from a 
>> previous mmap(), will leak huge pages as all the issued munmaps will fail.
>> 
> 
> That application would have to be ignoring an EINVAL return value.
> 
>> > If we were to go back in time and decide this when the munmap() behavior 
>> > for hugetlb vmas was originally introduced, that would be valid.  The 
>> > problem is that it could lead to userspace breakage and that's a 
>> > non-starter.
>> > 
>> > What we can do is improve the documentation and man-page to clearly 
>> > specify the long-standing behavior so that nobody encounters unexpected 
>> > results in the future.
>> 
>> This way you will leave the mmap API with broken semantics.
>> In any case, I am done arguing.
>> I will leave to Andrew to sort it out, and to Michael Kerrisk to update 
>> the mmap man pages with the new funny behaviour.
>> 
> 
> The behavior is certainly not new, it has always been the case for 
> munmap() on hugetlb vmas.
> 
> In a strict POSIX interpretation, it refers only to pages in the sense of
> what is returned by sysconf(_SC_PAGESIZE).  Such vmas are not backed by 
> any pages of size sysconf(_SC_PAGESIZE), so this behavior is undefined.  
> It would be best to modify the man page to explicitly state this for 
> MAP_HUGETLB.
> 


Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-27 Thread Vlastimil Babka
On 03/26/2015 08:39 PM, Davide Libenzi wrote:
> On Thu, 26 Mar 2015, David Rientjes wrote:
> 
>> Yes, this munmap() behavior of lengths <= hugepage_size - PAGE_SIZE for a 
>> hugetlb vma is long standing and there may be applications that break as a 
>> result of changing the behavior: a database that reserves all allocated 
>> hugetlb memory with mmap() so that it always has exclusive access to those 
>> hugepages, whether they are faulted or not, and maintains its own hugepage 
>> pool (which is common), may test the return value of munmap() and depend 
>> on it returning -EINVAL to determine if it is freeing memory that was 
>> either dynamically allocated or mapped from the hugetlb reserved pool.
> 
> You went a long way to create such a case.
> But, in your case, that application will erroneously consider
> hugepage-mmapped memory as dynamically allocated, since it will always get
> EINVAL unless it passes an aligned size, and an aligned size is one that a
> fix like the one posted in the patch will still leave as a success.
> OTOH, an application, which might be more common than the one you posted,
> which calls munmap() to release a pointer which it validly got from a 
> previous mmap(), will leak huge pages as all the issued munmaps will fail.
> 
> 
>> If we were to go back in time and decide this when the munmap() behavior 
>> for hugetlb vmas was originally introduced, that would be valid.  The 
>> problem is that it could lead to userspace breakage and that's a 
>> non-starter.
>> 
>> What we can do is improve the documentation and man-page to clearly 
>> specify the long-standing behavior so that nobody encounters unexpected 
>> results in the future.
> 
> This way you will leave the mmap API with broken semantics.
> In any case, I am done arguing.
> I will leave to Andrew to sort it out, and to Michael Kerrisk to update 
> the mmap man pages with the new funny behaviour.

+ CC's

You know that people don't always magically CC themselves, or read all of
lkml/linux-mm? :)

> 
> 
> - Davide
> 
> 


Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-26 Thread David Rientjes
On Thu, 26 Mar 2015, Davide Libenzi wrote:

> > Yes, this munmap() behavior of lengths <= hugepage_size - PAGE_SIZE for a 
> > hugetlb vma is long standing and there may be applications that break as a 
> > result of changing the behavior: a database that reserves all allocated 
> > hugetlb memory with mmap() so that it always has exclusive access to those 
> > hugepages, whether they are faulted or not, and maintains its own hugepage 
> > pool (which is common), may test the return value of munmap() and depend 
> > on it returning -EINVAL to determine if it is freeing memory that was 
> > either dynamically allocated or mapped from the hugetlb reserved pool.
> 
> You went a long way to create such a case.
> But, in your case, that application will erroneously consider
> hugepage-mmapped memory as dynamically allocated, since it will always get
> EINVAL unless it passes an aligned size, and an aligned size is one that a
> fix like the one posted in the patch will still leave as a success.

There was a patch proposed last week to add reserved pools to the 
hugetlbfs mount option specifically for the case where a large database 
wants sole reserved access to the hugepage pool.  This is why hugetlbfs 
pages become reserved on mmap().  In that case, the database never wants 
to do munmap() and instead maintains its own hugepage pool.

That makes the usual database case, mmap() all necessary hugetlb pages to 
reserve them, even easier since they have historically had to maintain 
this pool amongst various processes.

Is there a process out there that tests for munmap(ptr) == EINVAL and, if 
true, returns ptr to its hugepage pool?  I can't say for certain that none 
exist, that's why the potential for breakage exists.

> OTOH, an application, which might be more common than the one you posted,
> which calls munmap() to release a pointer which it validly got from a 
> previous mmap(), will leak huge pages as all the issued munmaps will fail.
> 

That application would have to be ignoring an EINVAL return value.

> > If we were to go back in time and decide this when the munmap() behavior 
> > for hugetlb vmas was originally introduced, that would be valid.  The 
> > problem is that it could lead to userspace breakage and that's a 
> > non-starter.
> > 
> > What we can do is improve the documentation and man-page to clearly 
> > specify the long-standing behavior so that nobody encounters unexpected 
> > results in the future.
> 
> This way you will leave the mmap API with broken semantics.
> In any case, I am done arguing.
> I will leave to Andrew to sort it out, and to Michael Kerrisk to update 
> the mmap man pages with the new funny behaviour.
> 

The behavior is certainly not new, it has always been the case for 
munmap() on hugetlb vmas.

In a strict POSIX interpretation, it refers only to pages in the sense of
what is returned by sysconf(_SC_PAGESIZE).  Such vmas are not backed by 
any pages of size sysconf(_SC_PAGESIZE), so this behavior is undefined.  
It would be best to modify the man page to explicitly state this for 
MAP_HUGETLB.


Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-26 Thread Davide Libenzi
On Thu, 26 Mar 2015, David Rientjes wrote:

> Yes, this munmap() behavior of lengths <= hugepage_size - PAGE_SIZE for a 
> hugetlb vma is long standing and there may be applications that break as a 
> result of changing the behavior: a database that reserves all allocated 
> hugetlb memory with mmap() so that it always has exclusive access to those 
> hugepages, whether they are faulted or not, and maintains its own hugepage 
> pool (which is common), may test the return value of munmap() and depend 
> on it returning -EINVAL to determine if it is freeing memory that was 
> either dynamically allocated or mapped from the hugetlb reserved pool.

You went a long way to create such a case.
But, in your case, that application will erroneously consider
hugepage-mmapped memory as dynamically allocated, since it will always get
EINVAL unless it passes an aligned size, and an aligned size is one that a
fix like the one posted in the patch will still leave as a success.
OTOH, an application, which might be more common than the one you posted,
which calls munmap() to release a pointer which it validly got from a 
previous mmap(), will leak huge pages as all the issued munmaps will fail.


> If we were to go back in time and decide this when the munmap() behavior 
> for hugetlb vmas was originally introduced, that would be valid.  The 
> problem is that it could lead to userspace breakage and that's a 
> non-starter.
> 
> What we can do is improve the documentation and man-page to clearly 
> specify the long-standing behavior so that nobody encounters unexpected 
> results in the future.

This way you will leave the mmap API with broken semantics.
In any case, I am done arguing.
I will leave to Andrew to sort it out, and to Michael Kerrisk to update 
the mmap man pages with the new funny behaviour.



- Davide




Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-26 Thread David Rientjes
On Thu, 26 Mar 2015, Davide Libenzi wrote:

> > I looked at this thread at http://marc.info/?t=14139250881 since I 
> > didn't have it in my mailbox, and I didn't get a chance to actually run 
> > your test code.
> > 
> > In short, I think what you're saying is that
> > 
> > ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
> > munmap(ptr, 4KB) == EINVAL
> 
> I am not sure you have read the email correctly:
> 
> munmap(mmap(size, HUGETLB), size) = EFAIL
> 
> For every size not multiple of the huge page size.
> Whereas:
> 
> munmap(mmap(size, HUGETLB), ALIGN(size, HUGEPAGE_SIZE)) = OK
> 

Yes, I read it correctly, and wrote how your test case should have failed 
above.  It fails when you do the 4KB mmap() with MAP_HUGETLB and pass 4KB 
to munmap(), correct?

I have no idea what EFAIL is, though.

> > The question you pose is whether munmap(ptr, 4KB) should succeed for a 
> > hugetlb vma and in your patch you align this to the hugepage size of the 
> > vma in the same manner that munmap(ptr, 2KB) would be aligned to PAGE_SIZE 
> > for a non-hugetlb vma.
> > 
> > The munmap() spec says the whole pages that include any part of the passed 
> > length should be unmapped.  In spirit, I would agree with you that the 
> > page size for the vma is the hugepage size so that would be what would be 
> > unmapped.
> > 
> > But that's going by a spec that doesn't address hugepages and is worded in 
> > a way that {PAGE_SIZE} is the base unit that both mmap() and munmap() is 
> > done.  It carries no notion of variable page sizes and how hugepages 
> > should be handled with respect to pages of {PAGE_SIZE} length.  So I think 
> > this is beyond the scope of the spec: any length is aligned to PAGE_SIZE, 
> > but the munmap() behavior for hugetlb vmas is not restricted.
> > 
> > It would seem too dangerous at this point to change the behavior of 
> > munmap(ptr, 4KB) on a hugetlb vma and that userspace bugs could actually 
> > arise from aligning to the hugepage size.
> 
> You mean, there is a harder failure than the current failure? :)
> 

Yes, this munmap() behavior of lengths <= hugepage_size - PAGE_SIZE for a 
hugetlb vma is long standing and there may be applications that break as a 
result of changing the behavior: a database that reserves all allocated 
hugetlb memory with mmap() so that it always has exclusive access to those 
hugepages, whether they are faulted or not, and maintains its own hugepage 
pool (which is common), may test the return value of munmap() and depend 
on it returning -EINVAL to determine if it is freeing memory that was 
either dynamically allocated or mapped from the hugetlb reserved pool.

> > Some applications purposefully reserve hugetlb pages by mmap() and never 
> > munmap() them so they have exclusive access to hugepages that were 
> > allocated either at boot or runtime by the sysadmin.  If they depend on 
> > the return value of munmap() to determine if memory to free is memory 
> > dynamically allocated by the application or reserved as hugetlb memory, 
> > then this would cause them to break.  I can't say for certain that no such 
> > application exists.
> 
> The fact that certain applications will seldom call an API should be no
> reason for the API to have bugs, or at the very least a behaviour which is
> not only not documented in the man pages, but also totally at odds with
> the normal mmap/munmap semantics.

If we were to go back in time and decide this when the munmap() behavior 
for hugetlb vmas was originally introduced, that would be valid.  The 
problem is that it could lead to userspace breakage and that's a 
non-starter.

What we can do is improve the documentation and man-page to clearly 
specify the long-standing behavior so that nobody encounters unexpected 
results in the future.


Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-26 Thread Eric B Munson
On 03/26/2015 07:56 AM, Davide Libenzi wrote:
> On Wed, 25 Mar 2015, David Rientjes wrote:
> 
>> I looked at this thread at http://marc.info/?t=14139250881
>> since I didn't have it in my mailbox, and I didn't get a chance
>> to actually run your test code.
>> 
>> In short, I think what you're saying is that
>> 
>> ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...) munmap(ptr,
>> 4KB) == EINVAL
> 
> I am not sure you have read the email correctly:
> 
> munmap(mmap(size, HUGETLB), size) = EFAIL
> 
> For every size not multiple of the huge page size. Whereas:
> 
> munmap(mmap(size, HUGETLB), ALIGN(size, HUGEPAGE_SIZE)) = OK

I think Davide is right here, this is a long existing bug in the
MAP_HUGETLB implementation.  Specifically, the mmap man page says:

All pages containing a part of the indicated range are unmapped, and
subsequent references to these pages will generate SIGSEGV.

I realize that huge pages may not have been considered by those that
wrote the spec.  But if I read this I would assume that all pages,
regardless of size, touched by the munmap() request should be unmapped.
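
For illustration, the contrast with a normal mapping can be sketched as
follows (a minimal example, not from the thread): for a non-hugetlb vma the
kernel rounds an unaligned munmap() length up to the page size, so the call
succeeds; the complaint here is that the equivalent call on a MAP_HUGETLB
vma fails instead.

#include <assert.h>
#include <sys/mman.h>

int main(void)
{
	void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	assert(p != MAP_FAILED);
	/* length 1 is rounded up to one page, so the whole page is unmapped */
	assert(munmap(p, 1) == 0);
	return 0;
}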

Please include
Acked-by: Eric B Munson 
to the original patch.  I would like to see the mmap man page adjusted
to make note of this behavior as well.

> 
> 
>> Respecting the mmap(2) POSIX specification?  I don't think 
>> mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...) mapping 2MB violates
>>  POSIX.1-2001 and not only because it obviously doesn't address 
>> MAP_HUGETLB, but I don't think the spec says the system cannot
>> map more memory than len.
>> 
>> Using MAP_HUGETLB is really more a library function than anything
>> else since you could easily implement the same behavior in a
>> library.  That function includes aligning len to the hugepage
>> size, so doing
>> 
>> ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
>> 
>> is the equivalent to doing
>> 
>> ptr = mmap(..., hugepage_size, ..., MAP_HUGETLB | ..., ...)
>> 
>> and that doesn't violate any spec.  But your patch doesn't change
>> mmap() at all, so let's forget about that.
> 
> That is what every mmap() implementation does, irrespective of
> any page size. And that is also what the POSIX spec states: the
> size will be automatically rounded up to a multiple of the
> underlying physical page size. The problem is not mmap() though, in
> this case.
> 
> 
>> The question you pose is whether munmap(ptr, 4KB) should succeed
>> for a hugetlb vma and in your patch you align this to the
>> hugepage size of the vma in the same manner that munmap(ptr, 2KB)
>> would be aligned to PAGE_SIZE for a non-hugetlb vma.
>> 
>> The munmap() spec says the whole pages that include any part of
>> the passed length should be unmapped.  In spirit, I would agree
>> with you that the page size for the vma is the hugepage size so
>> that would be what would be unmapped.
>> 
>> But that's going by a spec that doesn't address hugepages and is
>> worded in a way that {PAGE_SIZE} is the base unit that both
>> mmap() and munmap() is done.  It carries no notion of variable
>> page sizes and how hugepages should be handled with respect to
>> pages of {PAGE_SIZE} length.  So I think this is beyond the scope
>> of the spec: any length is aligned to PAGE_SIZE, but the munmap()
>> behavior for hugetlb vmas is not restricted.
>> 
>> It would seem too dangerous at this point to change the behavior
>> of munmap(ptr, 4KB) on a hugetlb vma and that userspace bugs
>> could actually arise from aligning to the hugepage size.
> 
> You mean, there is a harder failure than the current failure? :)
> 
> 
>> Some applications purposefully reserve hugetlb pages by mmap()
>> and never munmap() them so they have exclusive access to
>> hugepages that were allocated either at boot or runtime by the
>> sysadmin.  If they depend on the return value of munmap() to
>> determine if memory to free is memory dynamically allocated by
>> the application or reserved as hugetlb memory, then this would
>> cause them to break.  I can't say for certain that no such 
>> application exists.
> 
> The fact that certain applications will seldom call an API
> should be no reason for the API to have bugs, or at the very least a
> behaviour which is not only not documented in the man pages, but
> also totally at odds with the normal mmap/munmap semantics.
> Again, the scenario that you are picturing is one where an
> application relies on a permanent failure of munmap (that is what it
> is - it always fails unless the munmap size is a multiple of the huge
> page size) to do some productive task. A munmap() of a huge-page-aligned
> size will succeed in both cases (vanilla and patched).
> 
> 
>> Since hugetlb memory is beyond the scope of the POSIX.1-2001
>> munmap() specification, and there's a potential userspace
>> breakage if the length becomes hugepage aligned, I think the
>> do_unmap() implementation is correct as it stands.
> 
> If the length is huge-page aligned, it will work with or without the
> patch applied.

Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-26 Thread Davide Libenzi
On Wed, 25 Mar 2015, David Rientjes wrote:

> I looked at this thread at http://marc.info/?t=14139250881 since I 
> didn't have it in my mailbox, and I didn't get a chance to actually run 
> your test code.
> 
> In short, I think what you're saying is that
> 
>   ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
>   munmap(ptr, 4KB) == EINVAL

I am not sure you have read the email correctly:

munmap(mmap(size, HUGETLB), size) = EFAIL

For every size not multiple of the huge page size.
Whereas:

munmap(mmap(size, HUGETLB), ALIGN(size, HUGEPAGE_SIZE)) = OK
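
A minimal reproducer of this behaviour, as a sketch only: it assumes a 2MB
default huge page size and at least one free page in the hugetlb pool
(e.g. /proc/sys/vm/nr_hugepages > 0). On current kernels the first munmap()
fails with EINVAL and the second, with the length rounded up to the huge
page size, succeeds.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/mman.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumed huge page size */
#define ALIGN(x, a)	(((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	size_t size = 4096;		/* not a multiple of HPAGE_SIZE */
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	if (munmap(p, size))
		printf("munmap(%zu): %s\n", size, strerror(errno));
	if (munmap(p, ALIGN(size, HPAGE_SIZE)) == 0)
		printf("munmap(ALIGN(%zu, 2MB)): OK\n", size);
	return 0;
}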


> Respecting the mmap(2) POSIX specification?  I don't think 
> mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...) mapping 2MB violates 
> POSIX.1-2001 and not only because it obviously doesn't address 
> MAP_HUGETLB, but I don't think the spec says the system cannot map more 
> memory than len.
> 
> Using MAP_HUGETLB is really more a library function than anything else 
> since you could easily implement the same behavior in a library.  That 
> function includes aligning len to the hugepage size, so doing
> 
>   ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
> 
> is the equivalent to doing
> 
>   ptr = mmap(..., hugepage_size, ..., MAP_HUGETLB | ..., ...)
> 
> and that doesn't violate any spec.  But your patch doesn't change mmap() 
> at all, so let's forget about that.

That is what every mmap() implementation does, irrespective of any page
size. And that is also what the POSIX spec states:
the size will be automatically rounded up to a multiple of the underlying
physical page size.
The problem is not mmap() though, in this case.
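
For illustration only, the "library function" view quoted above could be
sketched roughly as follows; the names are hypothetical and the huge page
size would have to come from elsewhere (e.g. the Hugepagesize: line in
/proc/meminfo).

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

struct huge_mapping {
	void   *addr;
	size_t  len;	/* length rounded up to the huge page size */
};

/* Round the requested length up before mmap() so the same rounded length
 * can later be handed back to munmap(). */
static int huge_map(struct huge_mapping *m, size_t len, size_t hpage_size)
{
	m->len = (len + hpage_size - 1) & ~(hpage_size - 1);
	m->addr = mmap(NULL, m->len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	return m->addr == MAP_FAILED ? -1 : 0;
}

static int huge_unmap(struct huge_mapping *m)
{
	return munmap(m->addr, m->len);	/* aligned length, always accepted */
}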


> The question you pose is whether munmap(ptr, 4KB) should succeed for a 
> hugetlb vma and in your patch you align this to the hugepage size of the 
> vma in the same manner that munmap(ptr, 2KB) would be aligned to PAGE_SIZE 
> for a non-hugetlb vma.
> 
> The munmap() spec says the whole pages that include any part of the passed 
> length should be unmapped.  In spirit, I would agree with you that the 
> page size for the vma is the hugepage size so that would be what would be 
> unmapped.
> 
> But that's going by a spec that doesn't address hugepages and is worded in 
> a way that {PAGE_SIZE} is the base unit that both mmap() and munmap() is 
> done.  It carries no notion of variable page sizes and how hugepages 
> should be handled with respect to pages of {PAGE_SIZE} length.  So I think 
> this is beyond the scope of the spec: any length is aligned to PAGE_SIZE, 
> but the munmap() behavior for hugetlb vmas is not restricted.
> 
> It would seem too dangerous at this point to change the behavior of 
> munmap(ptr, 4KB) on a hugetlb vma and that userspace bugs could actually 
> arise from aligning to the hugepage size.

You mean, there is a harder failure than the current failure? :)


> Some applications purposefully reserve hugetlb pages by mmap() and never 
> munmap() them so they have exclusive access to hugepages that were 
> allocated either at boot or runtime by the sysadmin.  If they depend on 
> the return value of munmap() to determine if memory to free is memory 
> dynamically allocated by the application or reserved as hugetlb memory, 
> then this would cause them to break.  I can't say for certain that no such 
> application exists.

The fact that certain applications will seldom call an API should be no
reason for the API to have bugs, or at the very least a behaviour which is
not only not documented in the man pages, but also totally at odds with
the normal mmap/munmap semantics.
Again, the scenario that you are picturing is one where an application
relies on a permanent failure of munmap (that is what it is - it always
fails unless the munmap size is a multiple of the huge page size) to do
some productive task.
A munmap() of a huge-page-aligned size will succeed in both cases (vanilla
and patched).


> Since hugetlb memory is beyond the scope of the POSIX.1-2001 munmap() 
> specification, and there's a potential userspace breakage if the length 
> becomes hugepage aligned, I think the do_unmap() implementation is correct 
> as it stands.

If the length is huge-page aligned, it will work with or without the
patch applied.
The problem is the other 2097151 out of 2097152 cases, where the length is
not in fact aligned to 2MB (or whatever the huge page size is for the
architecture).


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-26 Thread Davide Libenzi
On Wed, 25 Mar 2015, David Rientjes wrote:

 I looked at this thread at http://marc.info/?t=14139250881 since I 
 didn't have it in my mailbox, and I didn't get a chance to actually run 
 your test code.
 
 In short, I think what you're saying is that
 
   ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
   munmap(ptr, 4KB) == EINVAL

I am not sure you have read the email correctly:

munmap(mmap(size, HUGETLB), size) = EFAIL

For every size not multiple of the huge page size.
Whereas:

munmap(mmap(size, HUGETLB), ALIGN(size, HUGEPAGE_SIZE)) = OK


 Respecting the mmap(2) POSIX specification?  I don't think 
 mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...) mapping 2MB violates 
 POSIX.1-2001 and not only because it obviously doesn't address 
 MAP_HUGETLB, but I don't think the spec says the system cannot map more 
 memory than len.
 
 Using MAP_HUGETLB is really more a library function than anything else 
 since you could easily implement the same behavior in a library.  That 
 function includes aligning len to the hugepage size, so doing
 
   ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
 
 is the equivalent to doing
 
   ptr = mmap(..., hugepage_size, ..., MAP_HUGETLB | ..., ...)
 
 and that doesn't violate any spec.  But your patch doesn't change mmap() 
 at all, so let's forget about that.

That is what every mmap() implementation does, irrespectively of any page 
size. And that is also what the POSIX spec states.
The size will be automatically rounded up to a multiple of the underline 
physical page size.
The problem is not mmap() though, in this case.


 The question you pose is whether munmap(ptr, 4KB) should succeed for a 
 hugetlb vma and in your patch you align this to the hugepage size of the 
 vma in the same manner that munmap(ptr, 2KB) would be aligned to PAGE_SIZE 
 for a non-hugetlb vma.
 
 The munmap() spec says the whole pages that include any part of the passed 
 length should be unmapped.  In spirit, I would agree with you that the 
 page size for the vma is the hugepage size so that would be what would be 
 unmapped.
 
 But that's going by a spec that doesn't address hugepages and is worded in 
 a way that {PAGE_SIZE} is the base unit that both mmap() and munmap() is 
 done.  It carries no notion of variable page sizes and how hugepages 
 should be handled with respect to pages of {PAGE_SIZE} length.  So I think 
 this is beyond the scope of the spec: any length is aligned to PAGE_SIZE, 
 but the munmap() behavior for hugetlb vmas is not restricted.
 
 It would seem too dangerous at this point to change the behavior of 
 munmap(ptr, 4KB) on a hugetlb vma and that userspace bugs could actually 
 arise from aligning to the hugepage size.

You mean, there is an harder failure than the current failure? :)


 Some applications purposefully reserve hugetlb pages by mmap() and never 
 munmap() them so they have exclusive access to hugepages that were 
 allocated either at boot or runtime by the sysadmin.  If they depend on 
 the return value of munmap() to determine if memory to free is memory 
 dynamically allocated by the application or reserved as hugetlb memory, 
 then this would cause them to break.  I can't say for certain that no such 
 application exists.

The fact that certain applications will seldomly call an API, should be no 
reason for API to have bugs, or at the very least, a bahviour which not 
only in not documented in the man pages, but also totally unrespectful of 
the normal mmap/munmap semantics.
Again, the scenario that you are picturing, is one where an application 
relies on a permanent (that is what it is - it always fails unless the 
munmap size is multiple than huge page size) failure of munmap, to do some 
productive task.
An munmap() of huge page aligned size, will succeed in both case (vanilla, 
and patch).


 Since hugetlb memory is beyond the scope of the POSIX.1-2001 munmap() 
 specification, and there's a potential userspace breakage if the length 
 becomes hugepage aligned, I think the do_unmap() implementation is correct 
 as it stands.

If the length is huge page aligned, it will be working with or without 
patch applied.
The problem is for the other 2097151 out of 2097152 cases, where length is 
not indeed aligned to 2MB (or whatever hugepage size is for the 
architecture).


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-26 Thread David Rientjes
On Thu, 26 Mar 2015, Davide Libenzi wrote:

  Yes, this munmap() behavior of lengths = hugepage_size - PAGE_SIZE for a 
  hugetlb vma is long standing and there may be applications that break as a 
  result of changing the behavior: a database that reserves all allocated 
  hugetlb memory with mmap() so that it always has exclusive access to those 
  hugepages, whether they are faulted or not, and maintains its own hugepage 
  pool (which is common), may test the return value of munmap() and depend 
  on it returning -EINVAL to determine if it is freeing memory that was 
  either dynamically allocated or mapped from the hugetlb reserved pool.
 
 You went a long way to create such a case.
 But, in your case, that application will erroneously considering hugepage 
 mmaped memory, as dynamically allocated, since it will always get EINVAL, 
 unless it passes an aligned size. Aligned size, which a fix like the one 
 posted in the patch will still leave as success.

There was a patch proposed last week to add reserved pools to the 
hugetlbfs mount option specifically for the case where a large database 
wants sole reserved access to the hugepage pool.  This is why hugetlbfs 
pages become reserved on mmap().  In that case, the database never wants 
to do munmap() and instead maintains its own hugepage pool.

That makes the usual database case, mmap() all necessary hugetlb pages to 
reserve them, even easier since they have historically had to maintain 
this pool amongst various processes.

Is there a process out there that tests for munmap(ptr) == EINVAL and, if 
true, returns ptr to its hugepage pool?  I can't say for certain that none 
exist, that's why the potential for breakage exists.

 OTOH, an application, which might be more common than the one you posted,
 which calls munmap() to release a pointer which it validly got from a 
 previous mmap(), will leak huge pages as all the issued munmaps will fail.
 

That application would have to be ignoring an EINVAL return value.

  If we were to go back in time and decide this when the munmap() behavior 
  for hugetlb vmas was originally introduced, that would be valid.  The 
  problem is that it could lead to userspace breakage and that's a 
  non-starter.
  
  What we can do is improve the documentation and man-page to clearly 
  specify the long-standing behavior so that nobody encounters unexpected 
  results in the future.
 
 This way you will leave the mmap API with broken semantics.
 In any case, I am done arguing.
 I will leave it to Andrew to sort out, and to Michael Kerrisk to update 
 the mmap man pages with the new funny behaviour.
 

The behavior is certainly not new, it has always been the case for 
munmap() on hugetlb vmas.

In a strict POSIX interpretation, it refers only to pages in the sense of
what is returned by sysconf(_SC_PAGESIZE).  Such vmas are not backed by 
any pages of size sysconf(_SC_PAGESIZE), so this behavior is undefined.  
It would be best to modify the man page to explicitly state this for 
MAP_HUGETLB.
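
To make the distinction concrete, a small sketch that prints the POSIX base
page size next to the default hugepage size; reading the "Hugepagesize:"
line of /proc/meminfo is a Linux convention and is shown here only as one
way to obtain it.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        long base = sysconf(_SC_PAGESIZE);      /* the "page" POSIX refers to */
        long huge_kb = 0;
        char line[128];
        FILE *f = fopen("/proc/meminfo", "r");

        if (f) {
                while (fgets(line, sizeof(line), f))
                        if (sscanf(line, "Hugepagesize: %ld kB", &huge_kb) == 1)
                                break;
                fclose(f);
        }
        printf("base page: %ld bytes, default huge page: %ld kB\n", base, huge_kb);
        return 0;
}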


Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-26 Thread David Rientjes
On Thu, 26 Mar 2015, Davide Libenzi wrote:

  I looked at this thread at http://marc.info/?t=14139250881 since I 
  didn't have it in my mailbox, and I didn't get a chance to actually run 
  your test code.
  
  In short, I think what you're saying is that
  
  ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
  munmap(ptr, 4KB) == EINVAL
 
 I am not sure you have read the email correctly:
 
 munmap(mmap(size, HUGETLB), size) = EFAIL
 
 For every size not multiple of the huge page size.
 Whereas:
 
 munmap(mmap(size, HUGETLB), ALIGN(size, HUGEPAGE_SIZE)) = OK
 

Yes, I read it correctly, and wrote how your test case should have failed 
above.  It fails when you do the 4KB mmap() with MAP_HUGETLB and pass 4KB 
to munmap(), correct?

I have no idea what EFAIL is, though.

  The question you pose is whether munmap(ptr, 4KB) should succeed for a 
  hugetlb vma and in your patch you align this to the hugepage size of the 
  vma in the same manner that munmap(ptr, 2KB) would be aligned to PAGE_SIZE 
  for a non-hugetlb vma.
  
  The munmap() spec says the whole pages that include any part of the passed 
  length should be unmapped.  In spirit, I would agree with you that the 
  page size for the vma is the hugepage size so that would be what would be 
  unmapped.
  
  But that's going by a spec that doesn't address hugepages and is worded in 
  a way that {PAGE_SIZE} is the base unit that both mmap() and munmap() is 
  done.  It carries no notion of variable page sizes and how hugepages 
  should be handled with respect to pages of {PAGE_SIZE} length.  So I think 
  this is beyond the scope of the spec: any length is aligned to PAGE_SIZE, 
  but the munmap() behavior for hugetlb vmas is not restricted.
  
  It would seem too dangerous at this point to change the behavior of 
  munmap(ptr, 4KB) on a hugetlb vma and that userspace bugs could actually 
  arise from aligning to the hugepage size.
 
 You mean, there is a harder failure than the current failure? :)
 

Yes, this munmap() behavior of lengths <= hugepage_size - PAGE_SIZE for a 
hugetlb vma is long standing and there may be applications that break as a 
result of changing the behavior: a database that reserves all allocated 
hugetlb memory with mmap() so that it always has exclusive access to those 
hugepages, whether they are faulted or not, and maintains its own hugepage 
pool (which is common), may test the return value of munmap() and depend 
on it returning -EINVAL to determine if it is freeing memory that was 
either dynamically allocated or mapped from the hugetlb reserved pool.

  Some applications purposefully reserve hugetlb pages by mmap() and never 
  munmap() them so they have exclusive access to hugepages that were 
  allocated either at boot or runtime by the sysadmin.  If they depend on 
  the return value of munmap() to determine if memory to free is memory 
  dynamically allocated by the application or reserved as hugetlb memory, 
  then this would cause them to break.  I can't say for certain that no such 
  application exists.
 
 The fact that certain applications will seldom call an API should be no 
 reason for the API to have bugs, or at the very least a behaviour which is 
 not only undocumented in the man pages, but also totally disrespectful of 
 the normal mmap/munmap semantics.

If we were to go back in time and decide this when the munmap() behavior 
for hugetlb vmas was originally introduced, that would be valid.  The 
problem is that it could lead to userspace breakage and that's a 
non-starter.

What we can do is improve the documentation and man-page to clearly 
specify the long-standing behavior so that nobody encounters unexpected 
results in the future.


Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-26 Thread Davide Libenzi
On Thu, 26 Mar 2015, David Rientjes wrote:

 Yes, this munmap() behavior of lengths <= hugepage_size - PAGE_SIZE for a 
 hugetlb vma is long standing and there may be applications that break as a 
 result of changing the behavior: a database that reserves all allocated 
 hugetlb memory with mmap() so that it always has exclusive access to those 
 hugepages, whether they are faulted or not, and maintains its own hugepage 
 pool (which is common), may test the return value of munmap() and depend 
 on it returning -EINVAL to determine if it is freeing memory that was 
 either dynamically allocated or mapped from the hugetlb reserved pool.

You went a long way to create such a case.
But, in your case, that application will erroneously consider hugepage 
mmapped memory as dynamically allocated, since it will always get EINVAL 
unless it passes an aligned size - an aligned size which a fix like the 
one posted in the patch will still leave as a success.
OTOH, an application, which might be more common than the one you posted,
which calls munmap() to release a pointer which it validly got from a 
previous mmap(), will leak huge pages as all the issued munmaps will fail.


 If we were to go back in time and decide this when the munmap() behavior 
 for hugetlb vmas was originally introduced, that would be valid.  The 
 problem is that it could lead to userspace breakage and that's a 
 non-starter.
 
 What we can do is improve the documentation and man-page to clearly 
 specify the long-standing behavior so that nobody encounters unexpected 
 results in the future.

This way you will leave the mmap API with broken semantics.
In any case, I am done arguing.
I will leave it to Andrew to sort out, and to Michael Kerrisk to update 
the mmap man pages with the new funny behaviour.



- Davide




Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-26 Thread Eric B Munson

On 03/26/2015 07:56 AM, Davide Libenzi wrote:
 On Wed, 25 Mar 2015, David Rientjes wrote:
 
 I looked at this thread at http://marc.info/?t=14139250881
 since I didn't have it in my mailbox, and I didn't get a chance
 to actually run your test code.
 
 In short, I think what you're saying is that
 
 ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...) munmap(ptr,
 4KB) == EINVAL
 
 I am not sure you have read the email correctly:
 
 munmap(mmap(size, HUGETLB), size) = EFAIL
 
 For every size not multiple of the huge page size. Whereas:
 
 munmap(mmap(size, HUGETLB), ALIGN(size, HUGEPAGE_SIZE)) = OK

I think Davide is right here, this is a long existing bug in the
MAP_HUGETLB implementation.  Specifically, the mmap man page says:

All pages containing a part of the indicated range are unmapped, and
subsequent references to these pages will generate SIGSEGV.

I realize that huge pages may not have been considered by those that
wrote the spec.  But if I read this I would assume that all pages,
regardless of size, touched by the munmap() request should be unmapped.
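
(Concretely, assuming a 2MB hugepage size: munmap(ptr, 4096) on a
MAP_HUGETLB mapping touches only the single huge page at [ptr, ptr + 2MB),
so under that reading the whole 2MB page should be unmapped - which is
exactly what aligning the length to the hugepage size, as the proposed
patch does, would produce.)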

Please include
Acked-by: Eric B Munson emun...@akamai.com
to the original patch.  I would like to see the mmap man page adjusted
to make note of this behavior as well.

 
 
 Respecting the mmap(2) POSIX specification?  I don't think 
 mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...) mapping 2MB violates
  POSIX.1-2001 and not only because it obviously doesn't address 
 MAP_HUGETLB, but I don't think the spec says the system cannot
 map more memory than len.
 
 Using MAP_HUGETLB is really more a library function than anything
 else since you could easily implement the same behavior in a
 library.  That function includes aligning len to the hugepage
 size, so doing
 
 ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
 
 is the equivalent to doing
 
 ptr = mmap(..., hugepage_size, ..., MAP_HUGETLB | ..., ...)
 
 and that doesn't violate any spec.  But your patch doesn't change
 mmap() at all, so let's forget about that.
 
 That is what every mmap() implementation does, irrespective of
 any page size. And that is also what the POSIX spec states. The
 size will be automatically rounded up to a multiple of the
 underlying physical page size. The problem is not mmap() though, in
 this case.
 
 
 The question you pose is whether munmap(ptr, 4KB) should succeed
 for a hugetlb vma and in your patch you align this to the
 hugepage size of the vma in the same manner that munmap(ptr, 2KB)
 would be aligned to PAGE_SIZE for a non-hugetlb vma.
 
 The munmap() spec says the whole pages that include any part of
 the passed length should be unmapped.  In spirit, I would agree
 with you that the page size for the vma is the hugepage size so
 that would be what would be unmapped.
 
 But that's going by a spec that doesn't address hugepages and is
 worded in a way that {PAGE_SIZE} is the base unit that both
 mmap() and munmap() is done.  It carries no notion of variable
 page sizes and how hugepages should be handled with respect to
 pages of {PAGE_SIZE} length.  So I think this is beyond the scope
 of the spec: any length is aligned to PAGE_SIZE, but the munmap()
 behavior for hugetlb vmas is not restricted.
 
 It would seem too dangerous at this point to change the behavior
 of munmap(ptr, 4KB) on a hugetlb vma and that userspace bugs
 could actually arise from aligning to the hugepage size.
 
 You mean, there is a harder failure than the current failure? :)
 
 
 Some applications purposefully reserve hugetlb pages by mmap()
 and never munmap() them so they have exclusive access to
 hugepages that were allocated either at boot or runtime by the
 sysadmin.  If they depend on the return value of munmap() to
 determine if memory to free is memory dynamically allocated by
 the application or reserved as hugetlb memory, then this would
 cause them to break.  I can't say for certain that no such 
 application exists.
 
 The fact that certain applications will seldom call an API
 should be no reason for the API to have bugs, or at the very least a
 behaviour which is not only undocumented in the man pages, but
 also totally disrespectful of the normal mmap/munmap semantics. 
 Again, the scenario that you are picturing is one where an
 application relies on a permanent (that is what it is - it always
 fails unless the munmap size is a multiple of the huge page size)
 failure of munmap, to do some productive task. A munmap() of a
 huge-page-aligned size will succeed in both cases (vanilla and patched).
 
 
 Since hugetlb memory is beyond the scope of the POSIX.1-2001
 munmap() specification, and there's a potential userspace
 breakage if the length becomes hugepage aligned, I think the
 do_unmap() implementation is correct as it stands.
 
 If the length is huge page aligned, it will work with or
 without the patch applied. The problem is for the other 2097151 out of
 2097152 cases, where the length is indeed not aligned to 2MB (or
 whatever the hugepage size is for the architecture).
 

Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-25 Thread David Rientjes
On Wed, 25 Mar 2015, Davide Libenzi wrote:

> > When you say "tracking back to 3.2.x", I think you mean you've tried as
> > far back as 3.2.x and found the same behaviour, but not tried further?
> > 
> > From the source, it looks like this is unchanged since MAP_HUGETLB was
> > introduced in 2.6.32.  And is the same behaviour as you've been given
> > with hugetlbfs since it arrived in 2.5.46.
> 
> Went back checking the application logs, and I had to patch the code (the 
> application one - to align the size on munmap()) in May 2014.
> But it might be that we only started noticing it at that time, because 
> before then the allocation pattern was simply monotonic, so it may have 
> always been there.
> The bug test app is ten lines of code, to verify that.
> 

I looked at this thread at http://marc.info/?t=14139250881 since I 
didn't have it in my mailbox, and I didn't get a chance to actually run 
your test code.

In short, I think what you're saying is that

ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
munmap(ptr, 4KB) == EINVAL

> > The patch looks to me as if it will do what you want, and I agree
> > that it's displeasing that you can mmap a size, and then fail to
> > munmap that same size.
> > 
> > But I tend to think that's simply typical of the clunkiness we offer
> > you with hugetlbfs and MAP_HUGETLB: that it would have been better to
> > make a different choice all those years ago, but wrong to change the
> > user interface now.
> > 
> > Perhaps others will disagree.  And if I'm wrong, and the behaviour
> > got somehow changed in 3.2, then that's a different story and we
> > should fix it back.
> 
> This is not an interface change, in the sense old clients will continue to 
> work.
> This is simply respecting the mmap(2) POSIX specification, for a feature 
> which has been exposed via mmap(2).
> 

Respecting the mmap(2) POSIX specification?  I don't think 
mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...) mapping 2MB violates 
POSIX.1-2001 and not only because it obviously doesn't address 
MAP_HUGETLB, but I don't think the spec says the system cannot map more 
memory than len.

Using MAP_HUGETLB is really more a library function than anything else 
since you could easily implement the same behavior in a library.  That 
function includes aligning len to the hugepage size, so doing

ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)

is the equivalent to doing

ptr = mmap(..., hugepage_size, ..., MAP_HUGETLB | ..., ...)

and that doesn't violate any spec.  But your patch doesn't change mmap() 
at all, so let's forget about that.
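
As a sketch of that library-function view (the 2MB hugepage size below is an
assumption for illustration; the helper name is invented here), rounding len
up to the hugepage size before the call makes the two mmap() invocations
above literally equivalent:

#include <sys/mman.h>
#include <stddef.h>

static void *mmap_huge(size_t len)
{
        const size_t hpage = 2UL * 1024 * 1024;         /* assumed hugepage size */

        len = (len + hpage - 1) & ~(hpage - 1);         /* align len up front */
        return mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
}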

The question you pose is whether munmap(ptr, 4KB) should succeed for a 
hugetlb vma and in your patch you align this to the hugepage size of the 
vma in the same manner that munmap(ptr, 2KB) would be aligned to PAGE_SIZE 
for a non-hugetlb vma.

The munmap() spec says the whole pages that include any part of the passed 
length should be unmapped.  In spirit, I would agree with you that the 
page size for the vma is the hugepage size so that would be what would be 
unmapped.

But that's going by a spec that doesn't address hugepages and is worded in 
a way that {PAGE_SIZE} is the base unit that both mmap() and munmap() is 
done.  It carries no notion of variable page sizes and how hugepages 
should be handled with respect to pages of {PAGE_SIZE} length.  So I think 
this is beyond the scope of the spec: any length is aligned to PAGE_SIZE, 
but the munmap() behavior for hugetlb vmas is not restricted.

It would seem too dangerous at this point to change the behavior of 
munmap(ptr, 4KB) on a hugetlb vma and that userspace bugs could actually 
arise from aligning to the hugepage size.

Some applications purposefully reserve hugetlb pages by mmap() and never 
munmap() them so they have exclusive access to hugepages that were 
allocated either at boot or runtime by the sysadmin.  If they depend on 
the return value of munmap() to determine if memory to free is memory 
dynamically allocated by the application or reserved as hugetlb memory, 
then this would cause them to break.  I can't say for certain that no such 
application exists.

Since hugetlb memory is beyond the scope of the POSIX.1-2001 munmap() 
specification, and there's a potential userspace breakage if the length 
becomes hugepage aligned, I think the do_unmap() implementation is correct 
as it stands.


Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-25 Thread Davide Libenzi
On Wed, 25 Mar 2015, Hugh Dickins wrote:

> When you say "tracking back to 3.2.x", I think you mean you've tried as
> far back as 3.2.x and found the same behaviour, but not tried further?
> 
> From the source, it looks like this is unchanged since MAP_HUGETLB was
> introduced in 2.6.32.  And is the same behaviour as you've been given
> with hugetlbfs since it arrived in 2.5.46.

Went back checking the application logs, and I had to patch the code (the 
application one - to align the size on munmap()) in May 2014.
But it might be that we only started noticing it at that time, because 
before then the allocation pattern was simply monotonic, so it may have 
always been there.
The bug test app is ten lines of code, to verify that.


> The patch looks to me as if it will do what you want, and I agree
> that it's displeasing that you can mmap a size, and then fail to
> munmap that same size.
> 
> But I tend to think that's simply typical of the clunkiness we offer
> you with hugetlbfs and MAP_HUGETLB: that it would have been better to
> make a different choice all those years ago, but wrong to change the
> user interface now.
> 
> Perhaps others will disagree.  And if I'm wrong, and the behaviour
> got somehow changed in 3.2, then that's a different story and we
> should fix it back.

This is not an interface change, in the sense old clients will continue to 
work.
This is simply respecting the mmap(2) POSIX specification, for a feature 
which has been exposed via mmap(2).



- Davide




Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-25 Thread Hugh Dickins
On Wed, 22 Oct 2014, Davide Libenzi wrote:

> [Resending with proper CC list suggested by Andrew]

I have recently been reminded of this languishing in my inbox ;)
(along with many others that I won't get to answer so quickly).

And in turn it reminds me of an older one from Joern, who was annoyed
that he couldn't mmap a hugetlbfs file with MAP_HUGETLB.

Cc'ing more people, including Eric, the father of MAP_HUGETLB.

> 
> Calling munmap on a MAP_HUGETLB area, and a size which is not 2MB aligned, 
> causes munmap to fail.  Tested on 3.13.x but tracking back to 3.2.x.

When you say "tracking back to 3.2.x", I think you mean you've tried as
far back as 3.2.x and found the same behaviour, but not tried further?

From the source, it looks like this is unchanged since MAP_HUGETLB was
introduced in 2.6.32.  And is the same behaviour as you've been given
with hugetlbfs since it arrived in 2.5.46.

> In do_munmap() we forcibly want a 4KB default page, and we wrongly 
> calculate the end of the map.  Since the calculated end is within the end 
> address of the target vma, we try to do a split with an address right in 
> the middle of a huge page, which would fail with EINVAL.
> 
> Tentative (untested) patch and test case attached (be sure you have a few 
> huge pages available via /proc/sys/vm/nr_hugepages tinkering).
> 
> 
> Signed-Off-By: Davide Libenzi 

The patch looks to me as if it will do what you want, and I agree
that it's displeasing that you can mmap a size, and then fail to
munmap that same size.

But I tend to think that's simply typical of the clunkiness we offer
you with hugetlbfs and MAP_HUGETLB: that it would have been better to
make a different choice all those years ago, but wrong to change the
user interface now.

Perhaps others will disagree.  And if I'm wrong, and the behaviour
got somehow changed in 3.2, then that's a different story and we
should fix it back.

Hugh

> 
> 
> - Davide
> 
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 7f85520..6dba257 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2528,10 +2528,6 @@ int do_munmap(struct mm_struct *mm, unsigned long 
> start, size_t len)
>   if ((start & ~PAGE_MASK) || start > TASK_SIZE || len > TASK_SIZE-start)
>   return -EINVAL;
>  
> - len = PAGE_ALIGN(len);
> - if (len == 0)
> - return -EINVAL;
> -
>   /* Find the first overlapping VMA */
>   vma = find_vma(mm, start);
>   if (!vma)
> @@ -2539,6 +2535,16 @@ int do_munmap(struct mm_struct *mm, unsigned long 
> start, size_t len)
>   prev = vma->vm_prev;
>   /* we have  start < vma->vm_end  */
>  
> + if (likely(!is_vm_hugetlb_page(vma)))
> + len = PAGE_ALIGN(len);
> + else {
> + unsigned long hpage_size = huge_page_size(hstate_vma(vma));
> +
> + len = ALIGN(len, hpage_size);
> + }
> + if (unlikely(len == 0))
> + return -EINVAL;
> +
>   /* if it doesn't overlap, we have nothing.. */
>   end = start + len;
>   if (vma->vm_start >= end)
> 
> 
> 
> 
> [hugebug.c]
> 
> #include <sys/mman.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <errno.h>
> 
> static void test(int flags, size_t size)
> {
> void* addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
>   flags | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 
> if (addr == MAP_FAILED)
> {
> perror("mmap");
> exit(1);
> }
> *(char*) addr = 17;
> 
> if (munmap(addr, size) != 0)
> {
> perror("munmap");
> exit(1);
> }
> }
> 
> int main(int ac, const char** av)
> {
> static const size_t hugepage_size = 2 * 1024 * 1024;
> 
> printf("Testing normal pages with 2MB size ...\n");
> test(0, hugepage_size);
> printf("OK\n");
> 
> printf("Testing huge pages with 2MB size ...\n");
> test(MAP_HUGETLB, hugepage_size);
> printf("OK\n");
> 
> 
> printf("Testing normal pages with 4KB byte size ...\n");
> test(0, 4096);
> printf("OK\n");
> 
> printf("Testing huge pages with 4KB byte size ...\n");
> test(MAP_HUGETLB, 4096);
> printf("OK\n");
> 
> 
> printf("Testing normal pages with 1 byte size ...\n");
> test(0, 1);
> printf("OK\n");
> 
> printf("Testing huge pages with 1 byte size ...\n");
> test(MAP_HUGETLB, 1);
> printf("OK\n");
> 
> return 0;
> }

