Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned
On Mon, 30 Mar 2015, KOSAKI Motohiro wrote:
> On Thu, Mar 26, 2015 at 10:08 AM, Eric B Munson wrote:
> > On 03/26/2015 07:56 AM, Davide Libenzi wrote:
> > > On Wed, 25 Mar 2015, David Rientjes wrote:
> > >
> > > > I looked at this thread at http://marc.info/?t=14139250881
> > > > since I didn't have it in my mailbox, and I didn't get a chance
> > > > to actually run your test code.
> > > >
> > > > In short, I think what you're saying is that
> > > >
> > > >     ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
> > > >     munmap(ptr, 4KB) == EINVAL
> > >
> > > I am not sure you have read the email correctly:
> > >
> > >     munmap(mmap(size, HUGETLB), size) == EFAIL
> > >
> > > for every size that is not a multiple of the huge page size. Whereas:
> > >
> > >     munmap(mmap(size, HUGETLB), ALIGN(size, HUGEPAGE_SIZE)) == OK
> >
> > I think Davide is right here; this is a long-standing bug in the
> > MAP_HUGETLB implementation. Specifically, the mmap man page says:
> >
> >     All pages containing a part of the indicated range are unmapped,
> >     and subsequent references to these pages will generate SIGSEGV.
> >
> > I realize that huge pages may not have been considered by those who
> > wrote the spec. But reading this, I would assume that all pages,
> > regardless of size, touched by the munmap() request should be unmapped.
> >
> > Please include
> > Acked-by: Eric B Munson
> > to the original patch. I would also like to see the mmap man page
> > adjusted to note this behavior.
>
> This is just a bug fix and I don't think it carries much risk. But as a
> caution, we might have to revert immediately if this patch causes a
> regression, even one coming from broken application code.
>
> Acked-by: KOSAKI Motohiro

and, without wishing to be confrontational,

Nacked-by: Hugh Dickins

I agree with you that the risk on munmap is probably not large; but I
still have no interest in spending more time on changing twelve-year-old
established behaviour in this way, then looking out for the regressions
and preparing to revert.

The risk is larger on mprotect, and on the other calls which this patch,
as it stands, would leave inconsistent with munmap.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned
On Thu, Mar 26, 2015 at 10:08 AM, Eric B Munson wrote:
> On 03/26/2015 07:56 AM, Davide Libenzi wrote:
> > On Wed, 25 Mar 2015, David Rientjes wrote:
> >
> > > I looked at this thread at http://marc.info/?t=14139250881
> > > since I didn't have it in my mailbox, and I didn't get a chance
> > > to actually run your test code.
> > >
> > > In short, I think what you're saying is that
> > >
> > >     ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
> > >     munmap(ptr, 4KB) == EINVAL
> >
> > I am not sure you have read the email correctly:
> >
> >     munmap(mmap(size, HUGETLB), size) == EFAIL
> >
> > for every size that is not a multiple of the huge page size. Whereas:
> >
> >     munmap(mmap(size, HUGETLB), ALIGN(size, HUGEPAGE_SIZE)) == OK
>
> I think Davide is right here; this is a long-standing bug in the
> MAP_HUGETLB implementation. Specifically, the mmap man page says:
>
>     All pages containing a part of the indicated range are unmapped,
>     and subsequent references to these pages will generate SIGSEGV.
>
> I realize that huge pages may not have been considered by those who
> wrote the spec. But reading this, I would assume that all pages,
> regardless of size, touched by the munmap() request should be unmapped.
>
> Please include
> Acked-by: Eric B Munson
> to the original patch. I would also like to see the mmap man page
> adjusted to note this behavior.

This is just a bug fix and I don't think it carries much risk. But as a
caution, we might have to revert immediately if this patch causes a
regression, even one coming from broken application code.

Acked-by: KOSAKI Motohiro
Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned
On Thu, 26 Mar 2015, David Rientjes wrote:
> On Thu, 26 Mar 2015, Davide Libenzi wrote:
>
> > > Yes, this munmap() behavior of lengths <= hugepage_size - PAGE_SIZE
> > > for a hugetlb vma is long standing and there may be applications
> > > that break as a result of changing the behavior: a database that
> > > reserves all allocated hugetlb memory with mmap() so that it always
> > > has exclusive access to those hugepages, whether they are faulted or
> > > not, and maintains its own hugepage pool (which is common), may test
> > > the return value of munmap() and depend on it returning -EINVAL to
> > > determine if it is freeing memory that was either dynamically
> > > allocated or mapped from the hugetlb reserved pool.
> >
> > You went a long way to create such a case.
> > But in your case, that application will erroneously consider hugepage
> > mmapped memory as dynamically allocated, since it will always get
> > EINVAL unless it passes an aligned size. An aligned size which a fix
> > like the one posted in the patch will still leave as success.
>
> There was a patch proposed last week to add reserved pools to the
> hugetlbfs mount option, specifically for the case where a large
> database wants sole reserved access to the hugepage pool. This is why
> hugetlbfs pages become reserved on mmap(). In that case, the database
> never wants to do munmap() and instead maintains its own hugepage pool.
>
> That makes the usual database case, mmap() of all necessary hugetlb
> pages to reserve them, even easier, since they have historically had to
> maintain this pool amongst various processes.
>
> Is there a process out there that tests for munmap(ptr) == EINVAL and,
> if true, returns ptr to its hugepage pool? I can't say for certain that
> none exist; that's why the potential for breakage exists.

Such an application can use /proc/pid/smaps to determine the page size of
a mapping.

IMO, this is relying on broken behavior, but I see where you are coming
from in that this behavior has been present for a long time. As I stated
before, I think we should fix this bug and make munmap() behavior match
what is described in the man page.

> > OTOH, an application, which might be more common than the one you
> > posted, which calls munmap() to release a pointer it validly got from
> > a previous mmap(), will leak huge pages, as all the issued munmaps
> > will fail.
>
> That application would have to be ignoring an EINVAL return value.
>
> > > If we were to go back in time and decide this when the munmap()
> > > behavior for hugetlb vmas was originally introduced, that would be
> > > valid. The problem is that it could lead to userspace breakage and
> > > that's a non-starter.
> > >
> > > What we can do is improve the documentation and man page to clearly
> > > specify the long-standing behavior so that nobody encounters
> > > unexpected results in the future.
> >
> > This way you will leave the mmap API with broken semantics.
> > In any case, I am done arguing.
> > I will leave it to Andrew to sort out, and to Michael Kerrisk to
> > update the mmap man pages with the new funny behaviour.
>
> The behavior is certainly not new; it has always been the case for
> munmap() on hugetlb vmas.
>
> In a strict POSIX interpretation, it refers only to pages in the sense
> of what is returned by sysconf(_SC_PAGESIZE). Such vmas are not backed
> by any pages of size sysconf(_SC_PAGESIZE), so this behavior is
> undefined. It would be best to modify the man page to explicitly state
> this for MAP_HUGETLB.
Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned
Might be too late in this thread, but in case you are going to continue
and/or repost:

[CC += linux-...@vger.kernel.org]
(also linux-man and Michael, to match my other reply)

Since this is a kernel-user-space API change, please CC linux-api@. The
kernel source file Documentation/SubmitChecklist notes that all Linux
kernel patches that change userspace interfaces should be CCed to
linux-...@vger.kernel.org, so that the various parties who are interested
in API changes are informed. For further information, see
https://www.kernel.org/doc/man-pages/linux-api-ml.html

On 03/26/2015 09:03 PM, David Rientjes wrote:
> On Thu, 26 Mar 2015, Davide Libenzi wrote:
>
> > > Yes, this munmap() behavior of lengths <= hugepage_size - PAGE_SIZE
> > > for a hugetlb vma is long standing and there may be applications
> > > that break as a result of changing the behavior: a database that
> > > reserves all allocated hugetlb memory with mmap() so that it always
> > > has exclusive access to those hugepages, whether they are faulted or
> > > not, and maintains its own hugepage pool (which is common), may test
> > > the return value of munmap() and depend on it returning -EINVAL to
> > > determine if it is freeing memory that was either dynamically
> > > allocated or mapped from the hugetlb reserved pool.
> >
> > You went a long way to create such a case.
> > But in your case, that application will erroneously consider hugepage
> > mmapped memory as dynamically allocated, since it will always get
> > EINVAL unless it passes an aligned size. An aligned size which a fix
> > like the one posted in the patch will still leave as success.
>
> There was a patch proposed last week to add reserved pools to the
> hugetlbfs mount option, specifically for the case where a large
> database wants sole reserved access to the hugepage pool. This is why
> hugetlbfs pages become reserved on mmap(). In that case, the database
> never wants to do munmap() and instead maintains its own hugepage pool.
>
> That makes the usual database case, mmap() of all necessary hugetlb
> pages to reserve them, even easier, since they have historically had to
> maintain this pool amongst various processes.
>
> Is there a process out there that tests for munmap(ptr) == EINVAL and,
> if true, returns ptr to its hugepage pool? I can't say for certain that
> none exist; that's why the potential for breakage exists.
>
> > OTOH, an application, which might be more common than the one you
> > posted, which calls munmap() to release a pointer it validly got from
> > a previous mmap(), will leak huge pages, as all the issued munmaps
> > will fail.
>
> That application would have to be ignoring an EINVAL return value.
>
> > > If we were to go back in time and decide this when the munmap()
> > > behavior for hugetlb vmas was originally introduced, that would be
> > > valid. The problem is that it could lead to userspace breakage and
> > > that's a non-starter.
> > >
> > > What we can do is improve the documentation and man page to clearly
> > > specify the long-standing behavior so that nobody encounters
> > > unexpected results in the future.
> >
> > This way you will leave the mmap API with broken semantics.
> > In any case, I am done arguing.
> > I will leave it to Andrew to sort out, and to Michael Kerrisk to
> > update the mmap man pages with the new funny behaviour.
>
> The behavior is certainly not new; it has always been the case for
> munmap() on hugetlb vmas.
>
> In a strict POSIX interpretation, it refers only to pages in the sense
> of what is returned by sysconf(_SC_PAGESIZE). Such vmas are not backed
> by any pages of size sysconf(_SC_PAGESIZE), so this behavior is
> undefined. It would be best to modify the man page to explicitly state
> this for MAP_HUGETLB.
Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned
On 03/26/2015 08:39 PM, Davide Libenzi wrote:
> On Thu, 26 Mar 2015, David Rientjes wrote:
>
> > Yes, this munmap() behavior of lengths <= hugepage_size - PAGE_SIZE
> > for a hugetlb vma is long standing and there may be applications that
> > break as a result of changing the behavior: a database that reserves
> > all allocated hugetlb memory with mmap() so that it always has
> > exclusive access to those hugepages, whether they are faulted or not,
> > and maintains its own hugepage pool (which is common), may test the
> > return value of munmap() and depend on it returning -EINVAL to
> > determine if it is freeing memory that was either dynamically
> > allocated or mapped from the hugetlb reserved pool.
>
> You went a long way to create such a case.
> But in your case, that application will erroneously consider hugepage
> mmapped memory as dynamically allocated, since it will always get
> EINVAL unless it passes an aligned size. An aligned size which a fix
> like the one posted in the patch will still leave as success.
> OTOH, an application, which might be more common than the one you
> posted, which calls munmap() to release a pointer it validly got from a
> previous mmap(), will leak huge pages, as all the issued munmaps will
> fail.
>
> > If we were to go back in time and decide this when the munmap()
> > behavior for hugetlb vmas was originally introduced, that would be
> > valid. The problem is that it could lead to userspace breakage and
> > that's a non-starter.
> >
> > What we can do is improve the documentation and man page to clearly
> > specify the long-standing behavior so that nobody encounters
> > unexpected results in the future.
>
> This way you will leave the mmap API with broken semantics.
> In any case, I am done arguing.
> I will leave it to Andrew to sort out, and to Michael Kerrisk to update
> the mmap man pages with the new funny behaviour.

+ CC's

You know that people don't always magically CC themselves, or read all of
lkml/linux-mm? :)

> - Davide
Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned
On Thu, 26 Mar 2015, Davide Libenzi wrote:
> > Yes, this munmap() behavior of lengths <= hugepage_size - PAGE_SIZE
> > for a hugetlb vma is long standing and there may be applications that
> > break as a result of changing the behavior: a database that reserves
> > all allocated hugetlb memory with mmap() so that it always has
> > exclusive access to those hugepages, whether they are faulted or not,
> > and maintains its own hugepage pool (which is common), may test the
> > return value of munmap() and depend on it returning -EINVAL to
> > determine if it is freeing memory that was either dynamically
> > allocated or mapped from the hugetlb reserved pool.
>
> You went a long way to create such a case.
> But in your case, that application will erroneously consider hugepage
> mmapped memory as dynamically allocated, since it will always get
> EINVAL unless it passes an aligned size. An aligned size which a fix
> like the one posted in the patch will still leave as success.

There was a patch proposed last week to add reserved pools to the
hugetlbfs mount option, specifically for the case where a large database
wants sole reserved access to the hugepage pool. This is why hugetlbfs
pages become reserved on mmap(). In that case, the database never wants
to do munmap() and instead maintains its own hugepage pool.

That makes the usual database case, mmap() of all necessary hugetlb pages
to reserve them, even easier, since they have historically had to
maintain this pool amongst various processes.

Is there a process out there that tests for munmap(ptr) == EINVAL and, if
true, returns ptr to its hugepage pool? I can't say for certain that none
exist; that's why the potential for breakage exists.

> OTOH, an application, which might be more common than the one you
> posted, which calls munmap() to release a pointer it validly got from a
> previous mmap(), will leak huge pages, as all the issued munmaps will
> fail.

That application would have to be ignoring an EINVAL return value.

> > If we were to go back in time and decide this when the munmap()
> > behavior for hugetlb vmas was originally introduced, that would be
> > valid. The problem is that it could lead to userspace breakage and
> > that's a non-starter.
> >
> > What we can do is improve the documentation and man page to clearly
> > specify the long-standing behavior so that nobody encounters
> > unexpected results in the future.
>
> This way you will leave the mmap API with broken semantics.
> In any case, I am done arguing.
> I will leave it to Andrew to sort out, and to Michael Kerrisk to update
> the mmap man pages with the new funny behaviour.

The behavior is certainly not new; it has always been the case for
munmap() on hugetlb vmas.

In a strict POSIX interpretation, it refers only to pages in the sense of
what is returned by sysconf(_SC_PAGESIZE). Such vmas are not backed by
any pages of size sysconf(_SC_PAGESIZE), so this behavior is undefined.
It would be best to modify the man page to explicitly state this for
MAP_HUGETLB.
Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned
On Thu, 26 Mar 2015, David Rientjes wrote:

> Yes, this munmap() behavior of lengths <= hugepage_size - PAGE_SIZE for a hugetlb vma is long standing and there may be applications that break as a result of changing the behavior: a database that reserves all allocated hugetlb memory with mmap() so that it always has exclusive access to those hugepages, whether they are faulted or not, and maintains its own hugepage pool (which is common), may test the return value of munmap() and depend on it returning -EINVAL to determine if it is freeing memory that was either dynamically allocated or mapped from the hugetlb reserved pool.

You went a long way to create such a case. But, in your case, that application will erroneously consider hugepage-mmaped memory as dynamically allocated, since it will always get EINVAL unless it passes an aligned size. An aligned size, which a fix like the one posted in the patch will still leave as success.

OTOH, an application, which might be more common than the one you posted, which calls munmap() to release a pointer which it validly got from a previous mmap(), will leak huge pages as all the issued munmaps will fail.

> If we were to go back in time and decide this when the munmap() behavior for hugetlb vmas was originally introduced, that would be valid. The problem is that it could lead to userspace breakage and that's a non-starter.
>
> What we can do is improve the documentation and man-page to clearly specify the long-standing behavior so that nobody encounters unexpected results in the future.

This way you will leave the mmap API with broken semantics. In any case, I am done arguing. I will leave it to Andrew to sort it out, and to Michael Kerrisk to update the mmap man pages with the new funny behaviour.
- Davide
Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned
On Thu, 26 Mar 2015, Davide Libenzi wrote:

> > I looked at this thread at http://marc.info/?t=14139250881 since I didn't have it in my mailbox, and I didn't get a chance to actually run your test code.
> >
> > In short, I think what you're saying is that
> >
> > ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
> > munmap(ptr, 4KB) == EINVAL
>
> I am not sure you have read the email correctly:
>
> munmap(mmap(size, HUGETLB), size) = EFAIL
>
> For every size not a multiple of the huge page size. Whereas:
>
> munmap(mmap(size, HUGETLB), ALIGN(size, HUGEPAGE_SIZE)) = OK

Yes, I read it correctly, and wrote above how your test case should have failed. It fails when you do the 4KB mmap() with MAP_HUGETLB and pass 4KB to munmap(), correct? I have no idea what EFAIL is, though.

> > The question you pose is whether munmap(ptr, 4KB) should succeed for a hugetlb vma, and in your patch you align this to the hugepage size of the vma in the same manner that munmap(ptr, 2KB) would be aligned to PAGE_SIZE for a non-hugetlb vma.
> >
> > The munmap() spec says the whole pages that include any part of the passed length should be unmapped. In spirit, I would agree with you that the page size for the vma is the hugepage size, so that would be what would be unmapped.
> >
> > But that's going by a spec that doesn't address hugepages and is worded in a way that {PAGE_SIZE} is the base unit in which both mmap() and munmap() are done. It carries no notion of variable page sizes and how hugepages should be handled with respect to pages of {PAGE_SIZE} length. So I think this is beyond the scope of the spec: any length is aligned to PAGE_SIZE, but the munmap() behavior for hugetlb vmas is not restricted.
> >
> > It would seem too dangerous at this point to change the behavior of munmap(ptr, 4KB) on a hugetlb vma, and userspace bugs could actually arise from aligning to the hugepage size.
>
> You mean, there is a harder failure than the current failure? :)

Yes, this munmap() behavior of lengths <= hugepage_size - PAGE_SIZE for a hugetlb vma is long standing and there may be applications that break as a result of changing the behavior: a database that reserves all allocated hugetlb memory with mmap() so that it always has exclusive access to those hugepages, whether they are faulted or not, and maintains its own hugepage pool (which is common), may test the return value of munmap() and depend on it returning -EINVAL to determine if it is freeing memory that was either dynamically allocated or mapped from the hugetlb reserved pool.

> > Some applications purposefully reserve hugetlb pages by mmap() and never munmap() them so they have exclusive access to hugepages that were allocated either at boot or runtime by the sysadmin. If they depend on the return value of munmap() to determine if memory to free is memory dynamically allocated by the application or reserved as hugetlb memory, then this would cause them to break. I can't say for certain that no such application exists.
>
> The fact that certain applications will seldom call an API should be no reason for the API to have bugs, or at the very least a behaviour which is not only not documented in the man pages, but also totally disrespectful of the normal mmap/munmap semantics.

If we were to go back in time and decide this when the munmap() behavior for hugetlb vmas was originally introduced, that would be valid. The problem is that it could lead to userspace breakage and that's a non-starter.

What we can do is improve the documentation and man-page to clearly specify the long-standing behavior so that nobody encounters unexpected results in the future.
Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned
On 03/26/2015 07:56 AM, Davide Libenzi wrote:
> On Wed, 25 Mar 2015, David Rientjes wrote:
>
>> I looked at this thread at http://marc.info/?t=14139250881 since I didn't have it in my mailbox, and I didn't get a chance to actually run your test code.
>>
>> In short, I think what you're saying is that
>>
>> ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
>> munmap(ptr, 4KB) == EINVAL
>
> I am not sure you have read the email correctly:
>
> munmap(mmap(size, HUGETLB), size) = EFAIL
>
> For every size not a multiple of the huge page size. Whereas:
>
> munmap(mmap(size, HUGETLB), ALIGN(size, HUGEPAGE_SIZE)) = OK

I think Davide is right here; this is a long-existing bug in the MAP_HUGETLB implementation. Specifically, the mmap man page says:

    All pages containing a part of the indicated range are unmapped, and subsequent references to these pages will generate SIGSEGV.

I realize that huge pages may not have been considered by those that wrote the spec. But if I read this I would assume that all pages, regardless of size, touched by the munmap() request should be unmapped.

Please include

Acked-by: Eric B Munson

to the original patch. I would like to see the mmap man page adjusted to make note of this behavior as well.

>> Respecting the mmap(2) POSIX specification? I don't think mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...) mapping 2MB violates POSIX.1-2001, and not only because it obviously doesn't address MAP_HUGETLB, but because I don't think the spec says the system cannot map more memory than len.
>>
>> Using MAP_HUGETLB is really more a library function than anything else since you could easily implement the same behavior in a library. That function includes aligning len to the hugepage size, so doing
>>
>> ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
>>
>> is the equivalent to doing
>>
>> ptr = mmap(..., hugepage_size, ..., MAP_HUGETLB | ..., ...)
>>
>> and that doesn't violate any spec. But your patch doesn't change mmap() at all, so let's forget about that.
>
> That is what every mmap() implementation does, irrespective of any page size. And that is also what the POSIX spec states: the size will be automatically rounded up to a multiple of the underlying physical page size. The problem is not mmap(), though, in this case.
>
>> The question you pose is whether munmap(ptr, 4KB) should succeed for a hugetlb vma, and in your patch you align this to the hugepage size of the vma in the same manner that munmap(ptr, 2KB) would be aligned to PAGE_SIZE for a non-hugetlb vma.
>>
>> The munmap() spec says the whole pages that include any part of the passed length should be unmapped. In spirit, I would agree with you that the page size for the vma is the hugepage size, so that would be what would be unmapped.
>>
>> But that's going by a spec that doesn't address hugepages and is worded in a way that {PAGE_SIZE} is the base unit in which both mmap() and munmap() are done. It carries no notion of variable page sizes and how hugepages should be handled with respect to pages of {PAGE_SIZE} length. So I think this is beyond the scope of the spec: any length is aligned to PAGE_SIZE, but the munmap() behavior for hugetlb vmas is not restricted.
>>
>> It would seem too dangerous at this point to change the behavior of munmap(ptr, 4KB) on a hugetlb vma, and userspace bugs could actually arise from aligning to the hugepage size.
>
> You mean, there is a harder failure than the current failure? :)
>
>> Some applications purposefully reserve hugetlb pages by mmap() and never munmap() them so they have exclusive access to hugepages that were allocated either at boot or runtime by the sysadmin. If they depend on the return value of munmap() to determine if memory to free is memory dynamically allocated by the application or reserved as hugetlb memory, then this would cause them to break. I can't say for certain that no such application exists.
>
> The fact that certain applications will seldom call an API should be no reason for the API to have bugs, or at the very least a behaviour which is not only not documented in the man pages, but also totally disrespectful of the normal mmap/munmap semantics.
> Again, the scenario that you are picturing is one where an application relies on a permanent failure of munmap (that is what it is: it always fails unless the munmap size is a multiple of the huge page size) to do some productive task. A munmap() of huge-page-aligned size will succeed in both cases (vanilla, and patched).
>
>> Since hugetlb memory is beyond the scope of the POSIX.1-2001 munmap() specification, and there's a potential userspace breakage if the length becomes hugepage aligned, I think the do_unmap() implementation is correct as it stands.
>
> If the length is huge page aligned, it will be working with or without the patch applied.
Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned
On Wed, 25 Mar 2015, David Rientjes wrote:

> I looked at this thread at http://marc.info/?t=14139250881 since I didn't have it in my mailbox, and I didn't get a chance to actually run your test code.
>
> In short, I think what you're saying is that
>
> ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
> munmap(ptr, 4KB) == EINVAL

I am not sure you have read the email correctly:

munmap(mmap(size, HUGETLB), size) = EFAIL

For every size not a multiple of the huge page size. Whereas:

munmap(mmap(size, HUGETLB), ALIGN(size, HUGEPAGE_SIZE)) = OK

> Respecting the mmap(2) POSIX specification? I don't think mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...) mapping 2MB violates POSIX.1-2001, and not only because it obviously doesn't address MAP_HUGETLB, but because I don't think the spec says the system cannot map more memory than len.
>
> Using MAP_HUGETLB is really more a library function than anything else since you could easily implement the same behavior in a library. That function includes aligning len to the hugepage size, so doing
>
> ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
>
> is the equivalent to doing
>
> ptr = mmap(..., hugepage_size, ..., MAP_HUGETLB | ..., ...)
>
> and that doesn't violate any spec. But your patch doesn't change mmap() at all, so let's forget about that.

That is what every mmap() implementation does, irrespective of any page size. And that is also what the POSIX spec states: the size will be automatically rounded up to a multiple of the underlying physical page size. The problem is not mmap(), though, in this case.

> The question you pose is whether munmap(ptr, 4KB) should succeed for a hugetlb vma, and in your patch you align this to the hugepage size of the vma in the same manner that munmap(ptr, 2KB) would be aligned to PAGE_SIZE for a non-hugetlb vma.
>
> The munmap() spec says the whole pages that include any part of the passed length should be unmapped. In spirit, I would agree with you that the page size for the vma is the hugepage size, so that would be what would be unmapped.
>
> But that's going by a spec that doesn't address hugepages and is worded in a way that {PAGE_SIZE} is the base unit in which both mmap() and munmap() are done. It carries no notion of variable page sizes and how hugepages should be handled with respect to pages of {PAGE_SIZE} length. So I think this is beyond the scope of the spec: any length is aligned to PAGE_SIZE, but the munmap() behavior for hugetlb vmas is not restricted.
>
> It would seem too dangerous at this point to change the behavior of munmap(ptr, 4KB) on a hugetlb vma, and userspace bugs could actually arise from aligning to the hugepage size.

You mean, there is a harder failure than the current failure? :)

> Some applications purposefully reserve hugetlb pages by mmap() and never munmap() them so they have exclusive access to hugepages that were allocated either at boot or runtime by the sysadmin. If they depend on the return value of munmap() to determine if memory to free is memory dynamically allocated by the application or reserved as hugetlb memory, then this would cause them to break. I can't say for certain that no such application exists.

The fact that certain applications will seldom call an API should be no reason for the API to have bugs, or at the very least a behaviour which is not only not documented in the man pages, but also totally disrespectful of the normal mmap/munmap semantics.
Again, the scenario that you are picturing is one where an application relies on a permanent failure of munmap (that is what it is: it always fails unless the munmap size is a multiple of the huge page size) to do some productive task. A munmap() of huge-page-aligned size will succeed in both cases (vanilla, and patched).

> Since hugetlb memory is beyond the scope of the POSIX.1-2001 munmap() specification, and there's a potential userspace breakage if the length becomes hugepage aligned, I think the do_unmap() implementation is correct as it stands.

If the length is huge page aligned, it will be working with or without the patch applied. The problem is for the other 2097151 out of 2097152 cases, where the length is not indeed aligned to 2MB (or whatever the hugepage size is for the architecture).
Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned
On Wed, 25 Mar 2015, Davide Libenzi wrote:

> > When you say "tracking back to 3.2.x", I think you mean you've tried as
> > far back as 3.2.x and found the same behaviour, but not tried further?
> >
> > From the source, it looks like this is unchanged since MAP_HUGETLB was
> > introduced in 2.6.32. And is the same behaviour as you've been given
> > with hugetlbfs since it arrived in 2.5.46.
>
> Went back checking the application logs, and I had to patch the code (the
> application one, to align size on munmap()) in May 2014.
> But it might be we started noticing it at that time, because before the
> allocation pattern was simply monotonic, so it could be it was always
> there.
> The bug test app is ten lines of code, to verify that.

I looked at this thread at http://marc.info/?t=14139250881 since I didn't have it in my mailbox, and I didn't get a chance to actually run your test code. In short, I think what you're saying is that

	ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
	munmap(ptr, 4KB) == EINVAL

> > The patch looks to me as if it will do what you want, and I agree
> > that it's displeasing that you can mmap a size, and then fail to
> > munmap that same size.
> >
> > But I tend to think that's simply typical of the clunkiness we offer
> > you with hugetlbfs and MAP_HUGETLB: that it would have been better to
> > make a different choice all those years ago, but wrong to change the
> > user interface now.
> >
> > Perhaps others will disagree. And if I'm wrong, and the behaviour
> > got somehow changed in 3.2, then that's a different story and we
> > should fix it back.
>
> This is not an interface change, in the sense old clients will continue
> to work.
> This is simply respecting the mmap(2) POSIX specification, for a feature
> which has been exposed via mmap(2).

Respecting the mmap(2) POSIX specification? I don't think

	mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)

mapping 2MB violates POSIX.1-2001, and not only because it obviously doesn't address MAP_HUGETLB: I don't think the spec says the system cannot map more memory than len.

Using MAP_HUGETLB is really more a library function than anything else, since you could easily implement the same behavior in a library. That function includes aligning len to the hugepage size, so doing

	ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)

is equivalent to doing

	ptr = mmap(..., hugepage_size, ..., MAP_HUGETLB | ..., ...)

and that doesn't violate any spec. But your patch doesn't change mmap() at all, so let's forget about that.

The question you pose is whether munmap(ptr, 4KB) should succeed for a hugetlb vma, and in your patch you align this to the hugepage size of the vma in the same manner that munmap(ptr, 2KB) would be aligned to PAGE_SIZE for a non-hugetlb vma. The munmap() spec says the whole pages that include any part of the passed length should be unmapped. In spirit, I would agree with you that the page size for the vma is the hugepage size, so that would be what is unmapped. But that's going by a spec that doesn't address hugepages and is worded as if {PAGE_SIZE} is the base unit in which both mmap() and munmap() operate. It carries no notion of variable page sizes or of how hugepages should be handled with respect to pages of {PAGE_SIZE} length. So I think this is beyond the scope of the spec: any length is aligned to PAGE_SIZE, but the munmap() behavior for hugetlb vmas is not restricted.

It would seem too dangerous at this point to change the behavior of munmap(ptr, 4KB) on a hugetlb vma, and userspace bugs could actually arise from aligning to the hugepage size. Some applications purposefully reserve hugetlb pages by mmap() and never munmap() them, so they have exclusive access to hugepages that were allocated either at boot or at runtime by the sysadmin.

If they depend on the return value of munmap() to determine whether the memory to free was dynamically allocated by the application or reserved as hugetlb memory, then this would cause them to break. I can't say for certain that no such application exists.

Since hugetlb memory is beyond the scope of the POSIX.1-2001 munmap() specification, and there's a potential userspace breakage if the length becomes hugepage aligned, I think the do_munmap() implementation is correct as it stands.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned
On Wed, 25 Mar 2015, Hugh Dickins wrote:

> When you say "tracking back to 3.2.x", I think you mean you've tried as
> far back as 3.2.x and found the same behaviour, but not tried further?
>
> From the source, it looks like this is unchanged since MAP_HUGETLB was
> introduced in 2.6.32. And is the same behaviour as you've been given
> with hugetlbfs since it arrived in 2.5.46.

Went back checking the application logs, and I had to patch the code (the application one, to align size on munmap()) in May 2014. But it might be we started noticing it at that time, because before the allocation pattern was simply monotonic, so it could be it was always there. The bug test app is ten lines of code, to verify that.

> The patch looks to me as if it will do what you want, and I agree
> that it's displeasing that you can mmap a size, and then fail to
> munmap that same size.
>
> But I tend to think that's simply typical of the clunkiness we offer
> you with hugetlbfs and MAP_HUGETLB: that it would have been better to
> make a different choice all those years ago, but wrong to change the
> user interface now.
>
> Perhaps others will disagree. And if I'm wrong, and the behaviour
> got somehow changed in 3.2, then that's a different story and we
> should fix it back.

This is not an interface change, in the sense that old clients will continue to work. This is simply respecting the mmap(2) POSIX specification, for a feature which has been exposed via mmap(2).

- Davide
Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned
On Wed, 22 Oct 2014, Davide Libenzi wrote:
> [Resending with proper CC list suggested by Andrew]

I have recently been reminded of this languishing in my inbox ;) (along with many others that I won't get to answer so quickly). And in turn it reminds me of an older one from Joern, who was annoyed that he couldn't mmap a hugetlbfs file with MAP_HUGETLB. Cc'ing more people, including Eric, the father of MAP_HUGETLB.

> Calling munmap on a MAP_HUGETLB area, and a size which is not 2MB aligned,
> causes munmap to fail. Tested on 3.13.x but tracking back to 3.2.x.

When you say "tracking back to 3.2.x", I think you mean you've tried as far back as 3.2.x and found the same behaviour, but not tried further?

From the source, it looks like this is unchanged since MAP_HUGETLB was introduced in 2.6.32. And is the same behaviour as you've been given with hugetlbfs since it arrived in 2.5.46.

> In do_munmap() we forcibly want a 4KB default page, and we wrongly
> calculate the end of the map. Since the calculated end is within the end
> address of the target vma, we try to do a split with an address right in
> the middle of a huge page, which would fail with EINVAL.
>
> Tentative (untested) patch and test case attached (be sure you have a few
> huge pages available via /proc/sys/vm/nr_hugepages tinkering).
>
> Signed-Off-By: Davide Libenzi davi...@xmailserver.org

The patch looks to me as if it will do what you want, and I agree that it's displeasing that you can mmap a size, and then fail to munmap that same size.

But I tend to think that's simply typical of the clunkiness we offer you with hugetlbfs and MAP_HUGETLB: that it would have been better to make a different choice all those years ago, but wrong to change the user interface now.

Perhaps others will disagree. And if I'm wrong, and the behaviour got somehow changed in 3.2, then that's a different story and we should fix it back.

Hugh

> - Davide
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 7f85520..6dba257 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2528,10 +2528,6 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
>  	if ((start & ~PAGE_MASK) || start > TASK_SIZE || len > TASK_SIZE-start)
>  		return -EINVAL;
>  
> -	len = PAGE_ALIGN(len);
> -	if (len == 0)
> -		return -EINVAL;
> -
>  	/* Find the first overlapping VMA */
>  	vma = find_vma(mm, start);
>  	if (!vma)
> @@ -2539,6 +2535,16 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
>  	prev = vma->vm_prev;
>  	/* we have start < vma->vm_end */
>  
> +	if (likely(!is_vm_hugetlb_page(vma)))
> +		len = PAGE_ALIGN(len);
> +	else {
> +		unsigned long hpage_size = huge_page_size(hstate_vma(vma));
> +
> +		len = ALIGN(len, hpage_size);
> +	}
> +	if (unlikely(len == 0))
> +		return -EINVAL;
> +
>  	/* if it doesn't overlap, we have nothing.. */
>  	end = start + len;
>  	if (vma->vm_start >= end)
>
> [hugebug.c]
>
> #include <sys/mman.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <errno.h>
>
> static void test(int flags, size_t size)
> {
> 	void* addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
> 			  flags | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>
> 	if (addr == MAP_FAILED)
> 	{
> 		perror("mmap");
> 		exit(1);
> 	}
> 	*(char*) addr = 17;
>
> 	if (munmap(addr, size) != 0)
> 	{
> 		perror("munmap");
> 		exit(1);
> 	}
> }
>
> int main(int ac, const char** av)
> {
> 	static const size_t hugepage_size = 2 * 1024 * 1024;
>
> 	printf("Testing normal pages with 2MB size ...\n");
> 	test(0, hugepage_size);
> 	printf("OK\n");
>
> 	printf("Testing huge pages with 2MB size ...\n");
> 	test(MAP_HUGETLB, hugepage_size);
> 	printf("OK\n");
>
> 	printf("Testing normal pages with 4KB byte size ...\n");
> 	test(0, 4096);
> 	printf("OK\n");
>
> 	printf("Testing huge pages with 4KB byte size ...\n");
> 	test(MAP_HUGETLB, 4096);
> 	printf("OK\n");
>
> 	printf("Testing normal pages with 1 byte size ...\n");
> 	test(0, 1);
> 	printf("OK\n");
>
> 	printf("Testing huge pages with 1 byte size ...\n");
> 	test(MAP_HUGETLB, 1);
> 	printf("OK\n");
>
> 	return 0;
> }