Adding Andrew, Michal on CC

On 08/27/2017 01:08 PM, Nadav Amit wrote:
> Mike Kravetz <mike.krav...@oracle.com> wrote:
>
>> On 08/26/2017 12:11 PM, Nadav Amit wrote:
>>> hugetlbfs_fallocate() currently performs put_page() before unlock_page().
>>> This scenario opens a small time window, from the time the page is added
>>> to the page cache, until it is unlocked, in which the page might be
>>> removed from the page-cache by another core. If the page is removed
>>> during this time window, it might cause a memory corruption, as the
>>> wrong page will be unlocked.
>>>
>>> It is arguable whether this scenario can happen in a real system, and
>>> there are several mitigating factors. The issue was found by code
>>> inspection (actually grep), and not by actually triggering the flow.
>>> Yet, since putting the page before unlocking is incorrect it should be
>>> fixed, if only to prevent future breakage or someone copy-pasting this
>>> code.
>>>
>>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
>>>
>>> cc: Eric Biggers <ebigge...@gmail.com>
>>> cc: Mike Kravetz <mike.krav...@oracle.com>
>>>
>>> Signed-off-by: Nadav Amit <na...@vmware.com>
>>
>> Thank you Nadav.
>
> No problem.
>
>>
>> Reviewed-by: Mike Kravetz <mike.krav...@oracle.com>
>>
>> Since hugetlbfs is an in memory filesystem, the only way one 'should' be
>> able to remove a page (file content) is through an inode operation such as
>> truncate, hole punch, or unlink. That was the basis for my response that
>> the inode lock would be required for page freeing.
>>
>> Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
>> I was expecting to see a check for hugetlbfs pages and exit (without
>> modification) if encountered. A quick review of the code did not find
>> any such checks.
>>
>> I'll take a closer look to determine exactly how hugetlbfs files are
>> handled. IMO, there should be something similar to the DAX check where
>> the routine quickly exits.
>
> I did not cc stable when submitting the patch, based on your previous
> response. Let me know if you want me to send v2 which does so.

I still do not believe there is a need to change this in stable. Your patch
should be sufficient to ensure we do the right thing going forward.

Looking at and testing the sys_fadvise64(POSIX_FADV_DONTNEED) code with
hugetlbfs does indeed show a more general problem. One can use
sys_fadvise64() to remove a huge page from a hugetlbfs file. :( This does
not go through the special hugetlbfs page handling code, but rather the
normal mm paths. As a result hugetlbfs accounting (like reserve counts)
gets out of sync and the hugetlbfs filesystem may become unusable. Sigh!!!

I will address this issue in a separate patch.
--
Mike Kravetz
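
For reference, the user-space side of the fadvise path discussed above can be sketched roughly as follows. This is only an illustration of how a caller reaches sys_fadvise64() with POSIX_FADV_DONTNEED through the standard posix_fadvise() wrapper. The helper names drop_cached_range() and drop_file_cache() are hypothetical and do not come from the thread or from any patch. What the kernel then does for a file on hugetlbfs depends on the kernel version, so treat the code as a sketch and not as a reproducer.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L	/* expose posix_fadvise() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: advise the kernel that the cached pages backing
 * [offset, offset + len) of fd are no longer needed. A len of 0 means
 * "from offset to the end of the file".
 *
 * Returns 0 on success or a positive errno value, which is how
 * posix_fadvise() itself reports failure (it does not set errno).
 */
int drop_cached_range(int fd, off_t offset, off_t len)
{
	return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}

/*
 * Hypothetical convenience wrapper: open path read-only, apply the advice
 * to the whole file, and close it again. Returns 0 or a positive errno.
 */
int drop_file_cache(const char *path)
{
	int fd = open(path, O_RDONLY);
	int err;

	if (fd < 0)
		return errno;

	err = drop_cached_range(fd, 0, 0);
	close(fd);
	return err;
}
```

A caller would pass a path on the filesystem of interest, for example drop_file_cache(path), and check for a zero return. On a regular file this is only a hint to the page cache. The concern raised above is that, for a hugetlbfs file, the same call went through the normal mm paths instead of the hugetlbfs-specific ones.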