On 03/08/2018 02:15 PM, Andrew Morton wrote:
> On Thu, 8 Mar 2018 13:05:02 -0800 Mike Kravetz
> wrote:
>
>> A vma with vm_pgoff large enough to overflow a loff_t type when
>> converted to a byte offset can be passed via the remap_file_pages
>> system call. The hugetlbfs mmap routine uses the byte offset to
>> calculate reservations and file size.
>>
>> A sequence such as:
>> mmap(0x20a0, 0x60, 0, 0x66033, -1, 0);
>> remap_file_pages(0x20a0, 0x60, 0, 0x20, 0);
>> will result in the following when task exits/file closed,
>> kernel BUG at mm/hugetlb.c:749!
>> Call Trace:
>> hugetlbfs_evict_inode+0x2f/0x40
>> evict+0xcb/0x190
>> __dentry_kill+0xcb/0x150
>> __fput+0x164/0x1e0
>> task_work_run+0x84/0xa0
>> exit_to_usermode_loop+0x7d/0x80
>> do_syscall_64+0x18b/0x190
>> entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>
>> The overflowed pgoff value causes hugetlbfs to try to set up a
>> mapping with a negative range (end < start) that leaves invalid
>> state which causes the BUG.
>>
>> The previous overflow fix to this code was incomplete and did not
>> take the remap_file_pages system call into account.
>>
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -111,6 +111,7 @@ static void huge_pagevec_release(struct pagevec *pvec)
>> static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct
>> *vma)
>> {
>> struct inode *inode = file_inode(file);
>> +unsigned long ovfl_mask;
>> loff_t len, vma_len;
>> int ret;
>> struct hstate *h = hstate_file(file);
>> @@ -127,12 +128,16 @@ static int hugetlbfs_file_mmap(struct file *file,
>> struct vm_area_struct *vma)
>> vma->vm_ops = &hugetlb_vm_ops;
>>
>> /*
>> - * Offset passed to mmap (before page shift) could have been
>> - * negative when represented as a (l)off_t.
>> + * page based offset in vm_pgoff could be sufficiently large to
>> + * overflow a (l)off_t when converted to byte offset.
>> */
>> -if (((loff_t)vma->vm_pgoff << PAGE_SHIFT) < 0)
>> +ovfl_mask = (1UL << (PAGE_SHIFT + 1)) - 1;
>> +ovfl_mask <<= ((sizeof(unsigned long) * BITS_PER_BYTE) -
>> + (PAGE_SHIFT + 1));
>
> That's a compile-time constant. The compiler will indeed generate an
> immediate load, but I think it would be better to make the code look
> more like we know that it's a constant, if you get what I mean.
> Something like
>
> /*
> * If a pgoff_t is to be converted to a byte index, this is the max value it
> * can have to avoid overflow in that conversion.
> */
> #define PGOFF_T_MAX
Ok
> And I bet that this constant could be used elsewhere - surely it's a
> very common thing to be checking for.
>
>
> Also, the expression seems rather complicated. Why are we adding 1 to
> PAGE_SHIFT? Isn't there a logical way of using PAGE_MASK?
The + 1 is there because this value will eventually be converted to
a loff_t which is signed. So, we need to take that sign bit into
account or we could end up with a negative value.
For PAGE_SHIFT == 12, PAGE_MASK is 0xf000. Our target
mask is 0xfff8 (for the sign bit). So, we could do
PAGE_MASK << (BITS_PER_LONG - (2 * PAGE_SHIFT) - 1)
This legacy hugetlbfs code may be a little different than other areas
in the use of loff_t. When doing some previous work in this area, I
did not find enough common used to make this more general purpose. See,
https://lkml.org/lkml/2017/4/12/793
> The resulting constant is 0xfff8 on 64-bit. We could just
> use along the lines of
>
> 1UL << (BITS_PER_LONG - PAGE_SHIFT - 1)
Ah yes, BITS_PER_LONG is better than (sizeof(unsigned long) * BITS_PER_BYTE
> But why the -1? We should be able to handle a pgoff_t of
> 0xfff0 in this code?
I'm not exactly sure what you are asking/suggesting here and in the line
above. It is because of the conversion to a signed value that we have to
go with 0xfff8 instead of 0xfff0.
Here are a couple options for computing the mask. I changed the name
you suggested to make it more obvious that the mask is being used to
check for loff_t overflow.
If we want to explicitly comptue the mask as in code above.
#define PGOFF_LOFFT_MAX \
(((1UL << (PAGE_SHIFT + 1)) - 1) << (BITS_PER_LONG - (PAGE_SHIFT + 1)))
Or, we use PAGE_MASK
#define PGOFF_LOFFT_MAX (PAGE_MASK << (BITS_PER_LONG - (2 * PAGE_SHIFT) - 1))
In either case, we need a big comment explaining the mask and
how we have that extra bit +/- 1 because the offset will be converted
to a signed value.
> Also, we later to
>
> len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> /* check for overflow */
> if (len < vma_len)
> return -EINVAL;
>
> which is ungainly: even if we passed the PGOFF_T_MAX test, there can
> still be an overflow which we still must check for. Is that avoidable?
> Probably not...
Yes, it is required. That check takes int