Re: [PATCH v2] hugetlbfs: check for pgoff value overflow

2018-03-08 Thread Andrew Morton
On Thu, 8 Mar 2018 15:37:57 -0800 Mike Kravetz  wrote:

> Here are a couple options for computing the mask.  I changed the name
> you suggested to make it more obvious that the mask is being used to
> check for loff_t overflow.
> 
> If we want to explicitly comptue the mask as in code above.
> #define PGOFF_LOFFT_MAX \
>   (((1UL << (PAGE_SHIFT + 1)) - 1) <<  (BITS_PER_LONG - (PAGE_SHIFT + 1)))
> 
> Or, we use PAGE_MASK
> #define PGOFF_LOFFT_MAX (PAGE_MASK << (BITS_PER_LONG - (2 * PAGE_SHIFT) - 1))

Sounds good.

> In either case, we need a big comment explaining the mask and
> how we have that extra bit +/- 1 because the offset will be converted
> to a signed value.

Yup.

> > Also, we later to
> > 
> > len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> > /* check for overflow */
> > if (len < vma_len)
> > return -EINVAL;
> > 
> > which is ungainly: even if we passed the PGOFF_T_MAX test, there can
> > still be an overflow which we still must check for.  Is that avoidable?
> > Probably not...
> 
> Yes, it is required.  That check takes into account the length argument
> which is added to page offset.  So, yes you can pass the first check and
> fail this one.

Well I was sort of wondering if both checks could be done in a single
operation, but I guess not.



Re: [PATCH v2] hugetlbfs: check for pgoff value overflow

2018-03-08 Thread Andrew Morton
On Thu, 8 Mar 2018 15:37:57 -0800 Mike Kravetz  wrote:

> Here are a couple options for computing the mask.  I changed the name
> you suggested to make it more obvious that the mask is being used to
> check for loff_t overflow.
> 
> If we want to explicitly comptue the mask as in code above.
> #define PGOFF_LOFFT_MAX \
>   (((1UL << (PAGE_SHIFT + 1)) - 1) <<  (BITS_PER_LONG - (PAGE_SHIFT + 1)))
> 
> Or, we use PAGE_MASK
> #define PGOFF_LOFFT_MAX (PAGE_MASK << (BITS_PER_LONG - (2 * PAGE_SHIFT) - 1))

Sounds good.

> In either case, we need a big comment explaining the mask and
> how we have that extra bit +/- 1 because the offset will be converted
> to a signed value.

Yup.

> > Also, we later to
> > 
> > len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> > /* check for overflow */
> > if (len < vma_len)
> > return -EINVAL;
> > 
> > which is ungainly: even if we passed the PGOFF_T_MAX test, there can
> > still be an overflow which we still must check for.  Is that avoidable?
> > Probably not...
> 
> Yes, it is required.  That check takes into account the length argument
> which is added to page offset.  So, yes you can pass the first check and
> fail this one.

Well I was sort of wondering if both checks could be done in a single
operation, but I guess not.



Re: [PATCH v2] hugetlbfs: check for pgoff value overflow

2018-03-08 Thread Mike Kravetz
On 03/08/2018 02:15 PM, Andrew Morton wrote:
> On Thu,  8 Mar 2018 13:05:02 -0800 Mike Kravetz  
> wrote:
> 
>> A vma with vm_pgoff large enough to overflow a loff_t type when
>> converted to a byte offset can be passed via the remap_file_pages
>> system call.  The hugetlbfs mmap routine uses the byte offset to
>> calculate reservations and file size.
>>
>> A sequence such as:
>>   mmap(0x20a0, 0x60, 0, 0x66033, -1, 0);
>>   remap_file_pages(0x20a0, 0x60, 0, 0x20, 0);
>> will result in the following when task exits/file closed,
>>   kernel BUG at mm/hugetlb.c:749!
>> Call Trace:
>>   hugetlbfs_evict_inode+0x2f/0x40
>>   evict+0xcb/0x190
>>   __dentry_kill+0xcb/0x150
>>   __fput+0x164/0x1e0
>>   task_work_run+0x84/0xa0
>>   exit_to_usermode_loop+0x7d/0x80
>>   do_syscall_64+0x18b/0x190
>>   entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>
>> The overflowed pgoff value causes hugetlbfs to try to set up a
>> mapping with a negative range (end < start) that leaves invalid
>> state which causes the BUG.
>>
>> The previous overflow fix to this code was incomplete and did not
>> take the remap_file_pages system call into account.
>>
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -111,6 +111,7 @@ static void huge_pagevec_release(struct pagevec *pvec)
>>  static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct 
>> *vma)
>>  {
>>  struct inode *inode = file_inode(file);
>> +unsigned long ovfl_mask;
>>  loff_t len, vma_len;
>>  int ret;
>>  struct hstate *h = hstate_file(file);
>> @@ -127,12 +128,16 @@ static int hugetlbfs_file_mmap(struct file *file, 
>> struct vm_area_struct *vma)
>>  vma->vm_ops = _vm_ops;
>>  
>>  /*
>> - * Offset passed to mmap (before page shift) could have been
>> - * negative when represented as a (l)off_t.
>> + * page based offset in vm_pgoff could be sufficiently large to
>> + * overflow a (l)off_t when converted to byte offset.
>>   */
>> -if (((loff_t)vma->vm_pgoff << PAGE_SHIFT) < 0)
>> +ovfl_mask = (1UL << (PAGE_SHIFT + 1)) - 1;
>> +ovfl_mask <<= ((sizeof(unsigned long) * BITS_PER_BYTE) -
>> +   (PAGE_SHIFT + 1));
> 
> That's a compile-time constant.  The compiler will indeed generate an
> immediate load, but I think it would be better to make the code look
> more like we know that it's a constant, if you get what I mean. 
> Something like
> 
> /*
>  * If a pgoff_t is to be converted to a byte index, this is the max value it
>  * can have to avoid overflow in that conversion.
>  */
> #define PGOFF_T_MAX   

Ok

> And I bet that this constant could be used elsewhere - surely it's a
> very common thing to be checking for.
> 
> 
> Also, the expression seems rather complicated.  Why are we adding 1 to
> PAGE_SHIFT?  Isn't there a logical way of using PAGE_MASK?

The + 1 is there because this value will eventually be converted to
a loff_t which is signed.  So, we need to take that sign bit into
account or we could end up with a negative value.

For PAGE_SHIFT == 12, PAGE_MASK is 0xf000.  Our target
mask is  0xfff8 (for the sign bit).  So, we could do
PAGE_MASK << (BITS_PER_LONG - (2 * PAGE_SHIFT) - 1)

This legacy hugetlbfs code may be a little different than other areas
in the use of loff_t.  When doing some previous work in this area, I
did not find enough common used to make this more general purpose.  See,
https://lkml.org/lkml/2017/4/12/793

> The resulting constant is 0xfff8 on 64-bit.  We could just
> use along the lines of
> 
>   1UL << (BITS_PER_LONG - PAGE_SHIFT - 1)

Ah yes, BITS_PER_LONG is better than (sizeof(unsigned long) * BITS_PER_BYTE

> But why the -1?  We should be able to handle a pgoff_t of
> 0xfff0 in this code?

I'm not exactly sure what you are asking/suggesting here and in the line
above.  It is because of the conversion to a signed value that we have to
go with 0xfff8 instead of 0xfff0.

Here are a couple options for computing the mask.  I changed the name
you suggested to make it more obvious that the mask is being used to
check for loff_t overflow.

If we want to explicitly comptue the mask as in code above.
#define PGOFF_LOFFT_MAX \
(((1UL << (PAGE_SHIFT + 1)) - 1) <<  (BITS_PER_LONG - (PAGE_SHIFT + 1)))

Or, we use PAGE_MASK
#define PGOFF_LOFFT_MAX (PAGE_MASK << (BITS_PER_LONG - (2 * PAGE_SHIFT) - 1))

In either case, we need a big comment explaining the mask and
how we have that extra bit +/- 1 because the offset will be converted
to a signed value.

> Also, we later to
> 
>   len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>   /* check for overflow */
>   if (len < vma_len)
>   return -EINVAL;
> 
> which is ungainly: even if we passed the PGOFF_T_MAX test, there can
> still be an overflow which we still must check for.  Is that avoidable?
> Probably not...

Yes, it is required.  

Re: [PATCH v2] hugetlbfs: check for pgoff value overflow

2018-03-08 Thread Mike Kravetz
On 03/08/2018 02:15 PM, Andrew Morton wrote:
> On Thu,  8 Mar 2018 13:05:02 -0800 Mike Kravetz  
> wrote:
> 
>> A vma with vm_pgoff large enough to overflow a loff_t type when
>> converted to a byte offset can be passed via the remap_file_pages
>> system call.  The hugetlbfs mmap routine uses the byte offset to
>> calculate reservations and file size.
>>
>> A sequence such as:
>>   mmap(0x20a0, 0x60, 0, 0x66033, -1, 0);
>>   remap_file_pages(0x20a0, 0x60, 0, 0x20, 0);
>> will result in the following when task exits/file closed,
>>   kernel BUG at mm/hugetlb.c:749!
>> Call Trace:
>>   hugetlbfs_evict_inode+0x2f/0x40
>>   evict+0xcb/0x190
>>   __dentry_kill+0xcb/0x150
>>   __fput+0x164/0x1e0
>>   task_work_run+0x84/0xa0
>>   exit_to_usermode_loop+0x7d/0x80
>>   do_syscall_64+0x18b/0x190
>>   entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>
>> The overflowed pgoff value causes hugetlbfs to try to set up a
>> mapping with a negative range (end < start) that leaves invalid
>> state which causes the BUG.
>>
>> The previous overflow fix to this code was incomplete and did not
>> take the remap_file_pages system call into account.
>>
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -111,6 +111,7 @@ static void huge_pagevec_release(struct pagevec *pvec)
>>  static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct 
>> *vma)
>>  {
>>  struct inode *inode = file_inode(file);
>> +unsigned long ovfl_mask;
>>  loff_t len, vma_len;
>>  int ret;
>>  struct hstate *h = hstate_file(file);
>> @@ -127,12 +128,16 @@ static int hugetlbfs_file_mmap(struct file *file, 
>> struct vm_area_struct *vma)
>>  vma->vm_ops = _vm_ops;
>>  
>>  /*
>> - * Offset passed to mmap (before page shift) could have been
>> - * negative when represented as a (l)off_t.
>> + * page based offset in vm_pgoff could be sufficiently large to
>> + * overflow a (l)off_t when converted to byte offset.
>>   */
>> -if (((loff_t)vma->vm_pgoff << PAGE_SHIFT) < 0)
>> +ovfl_mask = (1UL << (PAGE_SHIFT + 1)) - 1;
>> +ovfl_mask <<= ((sizeof(unsigned long) * BITS_PER_BYTE) -
>> +   (PAGE_SHIFT + 1));
> 
> That's a compile-time constant.  The compiler will indeed generate an
> immediate load, but I think it would be better to make the code look
> more like we know that it's a constant, if you get what I mean. 
> Something like
> 
> /*
>  * If a pgoff_t is to be converted to a byte index, this is the max value it
>  * can have to avoid overflow in that conversion.
>  */
> #define PGOFF_T_MAX   

Ok

> And I bet that this constant could be used elsewhere - surely it's a
> very common thing to be checking for.
> 
> 
> Also, the expression seems rather complicated.  Why are we adding 1 to
> PAGE_SHIFT?  Isn't there a logical way of using PAGE_MASK?

The + 1 is there because this value will eventually be converted to
a loff_t which is signed.  So, we need to take that sign bit into
account or we could end up with a negative value.

For PAGE_SHIFT == 12, PAGE_MASK is 0xf000.  Our target
mask is  0xfff8 (for the sign bit).  So, we could do
PAGE_MASK << (BITS_PER_LONG - (2 * PAGE_SHIFT) - 1)

This legacy hugetlbfs code may be a little different than other areas
in the use of loff_t.  When doing some previous work in this area, I
did not find enough common used to make this more general purpose.  See,
https://lkml.org/lkml/2017/4/12/793

> The resulting constant is 0xfff8 on 64-bit.  We could just
> use along the lines of
> 
>   1UL << (BITS_PER_LONG - PAGE_SHIFT - 1)

Ah yes, BITS_PER_LONG is better than (sizeof(unsigned long) * BITS_PER_BYTE

> But why the -1?  We should be able to handle a pgoff_t of
> 0xfff0 in this code?

I'm not exactly sure what you are asking/suggesting here and in the line
above.  It is because of the conversion to a signed value that we have to
go with 0xfff8 instead of 0xfff0.

Here are a couple options for computing the mask.  I changed the name
you suggested to make it more obvious that the mask is being used to
check for loff_t overflow.

If we want to explicitly comptue the mask as in code above.
#define PGOFF_LOFFT_MAX \
(((1UL << (PAGE_SHIFT + 1)) - 1) <<  (BITS_PER_LONG - (PAGE_SHIFT + 1)))

Or, we use PAGE_MASK
#define PGOFF_LOFFT_MAX (PAGE_MASK << (BITS_PER_LONG - (2 * PAGE_SHIFT) - 1))

In either case, we need a big comment explaining the mask and
how we have that extra bit +/- 1 because the offset will be converted
to a signed value.

> Also, we later to
> 
>   len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>   /* check for overflow */
>   if (len < vma_len)
>   return -EINVAL;
> 
> which is ungainly: even if we passed the PGOFF_T_MAX test, there can
> still be an overflow which we still must check for.  Is that avoidable?
> Probably not...

Yes, it is required.  That check takes into 

Re: [PATCH v2] hugetlbfs: check for pgoff value overflow

2018-03-08 Thread Andrew Morton
On Thu,  8 Mar 2018 13:05:02 -0800 Mike Kravetz  wrote:

> A vma with vm_pgoff large enough to overflow a loff_t type when
> converted to a byte offset can be passed via the remap_file_pages
> system call.  The hugetlbfs mmap routine uses the byte offset to
> calculate reservations and file size.
> 
> A sequence such as:
>   mmap(0x20a0, 0x60, 0, 0x66033, -1, 0);
>   remap_file_pages(0x20a0, 0x60, 0, 0x20, 0);
> will result in the following when task exits/file closed,
>   kernel BUG at mm/hugetlb.c:749!
> Call Trace:
>   hugetlbfs_evict_inode+0x2f/0x40
>   evict+0xcb/0x190
>   __dentry_kill+0xcb/0x150
>   __fput+0x164/0x1e0
>   task_work_run+0x84/0xa0
>   exit_to_usermode_loop+0x7d/0x80
>   do_syscall_64+0x18b/0x190
>   entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> 
> The overflowed pgoff value causes hugetlbfs to try to set up a
> mapping with a negative range (end < start) that leaves invalid
> state which causes the BUG.
> 
> The previous overflow fix to this code was incomplete and did not
> take the remap_file_pages system call into account.
> 
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -111,6 +111,7 @@ static void huge_pagevec_release(struct pagevec *pvec)
>  static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>  {
>   struct inode *inode = file_inode(file);
> + unsigned long ovfl_mask;
>   loff_t len, vma_len;
>   int ret;
>   struct hstate *h = hstate_file(file);
> @@ -127,12 +128,16 @@ static int hugetlbfs_file_mmap(struct file *file, 
> struct vm_area_struct *vma)
>   vma->vm_ops = _vm_ops;
>  
>   /*
> -  * Offset passed to mmap (before page shift) could have been
> -  * negative when represented as a (l)off_t.
> +  * page based offset in vm_pgoff could be sufficiently large to
> +  * overflow a (l)off_t when converted to byte offset.
>*/
> - if (((loff_t)vma->vm_pgoff << PAGE_SHIFT) < 0)
> + ovfl_mask = (1UL << (PAGE_SHIFT + 1)) - 1;
> + ovfl_mask <<= ((sizeof(unsigned long) * BITS_PER_BYTE) -
> +(PAGE_SHIFT + 1));

That's a compile-time constant.  The compiler will indeed generate an
immediate load, but I think it would be better to make the code look
more like we know that it's a constant, if you get what I mean. 
Something like

/*
 * If a pgoff_t is to be converted to a byte index, this is the max value it
 * can have to avoid overflow in that conversion.
 */
#define PGOFF_T_MAX 

And I bet that this constant could be used elsewhere - surely it's a
very common thing to be checking for.


Also, the expression seems rather complicated.  Why are we adding 1 to
PAGE_SHIFT?  Isn't there a logical way of using PAGE_MASK?

The resulting constant is 0xfff8 on 64-bit.  We could just
use along the lines of

1UL << (BITS_PER_LONG - PAGE_SHIFT - 1)

But why the -1?  We should be able to handle a pgoff_t of
0xfff0 in this code?


Also, we later to

len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
/* check for overflow */
if (len < vma_len)
return -EINVAL;

which is ungainly: even if we passed the PGOFF_T_MAX test, there can
still be an overflow which we still must check for.  Is that avoidable?
Probably not...



Re: [PATCH v2] hugetlbfs: check for pgoff value overflow

2018-03-08 Thread Andrew Morton
On Thu,  8 Mar 2018 13:05:02 -0800 Mike Kravetz  wrote:

> A vma with vm_pgoff large enough to overflow a loff_t type when
> converted to a byte offset can be passed via the remap_file_pages
> system call.  The hugetlbfs mmap routine uses the byte offset to
> calculate reservations and file size.
> 
> A sequence such as:
>   mmap(0x20a0, 0x60, 0, 0x66033, -1, 0);
>   remap_file_pages(0x20a0, 0x60, 0, 0x20, 0);
> will result in the following when task exits/file closed,
>   kernel BUG at mm/hugetlb.c:749!
> Call Trace:
>   hugetlbfs_evict_inode+0x2f/0x40
>   evict+0xcb/0x190
>   __dentry_kill+0xcb/0x150
>   __fput+0x164/0x1e0
>   task_work_run+0x84/0xa0
>   exit_to_usermode_loop+0x7d/0x80
>   do_syscall_64+0x18b/0x190
>   entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> 
> The overflowed pgoff value causes hugetlbfs to try to set up a
> mapping with a negative range (end < start) that leaves invalid
> state which causes the BUG.
> 
> The previous overflow fix to this code was incomplete and did not
> take the remap_file_pages system call into account.
> 
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -111,6 +111,7 @@ static void huge_pagevec_release(struct pagevec *pvec)
>  static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>  {
>   struct inode *inode = file_inode(file);
> + unsigned long ovfl_mask;
>   loff_t len, vma_len;
>   int ret;
>   struct hstate *h = hstate_file(file);
> @@ -127,12 +128,16 @@ static int hugetlbfs_file_mmap(struct file *file, 
> struct vm_area_struct *vma)
>   vma->vm_ops = _vm_ops;
>  
>   /*
> -  * Offset passed to mmap (before page shift) could have been
> -  * negative when represented as a (l)off_t.
> +  * page based offset in vm_pgoff could be sufficiently large to
> +  * overflow a (l)off_t when converted to byte offset.
>*/
> - if (((loff_t)vma->vm_pgoff << PAGE_SHIFT) < 0)
> + ovfl_mask = (1UL << (PAGE_SHIFT + 1)) - 1;
> + ovfl_mask <<= ((sizeof(unsigned long) * BITS_PER_BYTE) -
> +(PAGE_SHIFT + 1));

That's a compile-time constant.  The compiler will indeed generate an
immediate load, but I think it would be better to make the code look
more like we know that it's a constant, if you get what I mean. 
Something like

/*
 * If a pgoff_t is to be converted to a byte index, this is the max value it
 * can have to avoid overflow in that conversion.
 */
#define PGOFF_T_MAX 

And I bet that this constant could be used elsewhere - surely it's a
very common thing to be checking for.


Also, the expression seems rather complicated.  Why are we adding 1 to
PAGE_SHIFT?  Isn't there a logical way of using PAGE_MASK?

The resulting constant is 0xfff8 on 64-bit.  We could just
use along the lines of

1UL << (BITS_PER_LONG - PAGE_SHIFT - 1)

But why the -1?  We should be able to handle a pgoff_t of
0xfff0 in this code?


Also, we later to

len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
/* check for overflow */
if (len < vma_len)
return -EINVAL;

which is ungainly: even if we passed the PGOFF_T_MAX test, there can
still be an overflow which we still must check for.  Is that avoidable?
Probably not...