Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API

2020-08-29 Thread Christoph Hellwig
On Fri, Aug 28, 2020 at 07:40:08PM +0200, Arnd Bergmann wrote:
> Every syscall that passes an iovec seems to do this. If we make import_iovec()
> handle both cases directly, this syscall and a number of others can
> be simplified, and you avoid the x32 entry point I mentioned above

FYI, I do have a series that does this (even tested) and kills tons of
compat syscalls by that.  But by doing that I found the problem that
compat syscalls issued by io_uring don't trigger in_compat_syscall().
I need to go back to fixing the io_uring vs in_compat_syscall() issue
(probably for 5.9) and then submit the whole thing.


Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API

2020-08-28 Thread Arnd Bergmann
On Fri, Aug 28, 2020 at 9:27 PM Jens Axboe  wrote:
> On 8/28/20 12:24 PM, Jens Axboe wrote:

> >> @@ -1683,8 +1683,13 @@ ssize_t import_iovec(int type, const struct
> >> iovec __user * uvector,
> >>  {
> >> ssize_t n;
> >> struct iovec *p;
> >> -   n = rw_copy_check_uvector(type, uvector, nr_segs, fast_segs,
> >> - *iov, &p);
> >> +
> >> +   if (in_compat_syscall())
> >> +   n = compat_rw_copy_check_uvector(type, uvector, nr_segs,
> >> +fast_segs, *iov, &p);
> >> +   else
> >> +   n = rw_copy_check_uvector(type, uvector, nr_segs,
> >> + fast_segs, *iov, &p);
> >> if (n < 0) {
> >> if (p != *iov)
> >> kfree(p);
> >
> > Doesn't work for the async case, where you want to be holding on to the
> > allocated iovec. But in general I think it's a good helper for the sync
> > case, which is by far the majority.
>
> Nevermind, I'm an idiot for reading this totally wrong.
>

I think you are right about the need to pick the compat vs native
behavior based on req->ctx->compat instead of in_compat_syscall()
inside of io_import_iovec().

That one can probably call a lower-level version and when all other
callers get changed to calling

ssize_t import_iovec(int type, const struct iovec __user * uvector,
 unsigned nr_segs, unsigned fast_segs,
 struct iovec **iov, struct iov_iter *i)
{
   return __import_iovec(type, uvector, nr_segs, fast_segs, iov,
i, in_compat_syscall());
}

  Arnd


Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API

2020-08-28 Thread Jens Axboe
On 8/28/20 12:24 PM, Jens Axboe wrote:
> On 8/28/20 11:40 AM, Arnd Bergmann wrote:
>> On Mon, Jun 22, 2020 at 9:29 PM Minchan Kim  wrote:
>>> So finally, the API is as follows,
>>>
>>>  ssize_t process_madvise(int pidfd, const struct iovec *iovec,
>>>unsigned long vlen, int advice, unsigned int flags);
>>
>> I had not followed the discussion earlier and only now came across
>> the syscall in linux-next, sorry for stirring things up this late.
>>
>>> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
>>> b/arch/x86/entry/syscalls/syscall_64.tbl
>>> index 94bf4958d114..8f959d90338a 100644
>>> --- a/arch/x86/entry/syscalls/syscall_64.tbl
>>> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
>>> @@ -364,6 +364,7 @@
>>>  440common  watch_mount sys_watch_mount
>>>  441common  watch_sbsys_watch_sb
>>>  442common  fsinfo  sys_fsinfo
>>> +44364  process_madvise sys_process_madvise
>>>
>>>  #
>>>  # x32-specific system call numbers start at 512 to avoid cache impact
>>> @@ -407,3 +408,4 @@
>>>  545x32 execveatcompat_sys_execveat
>>>  546x32 preadv2 compat_sys_preadv64v2
>>>  547x32 pwritev2compat_sys_pwritev64v2
>>> +548x32 process_madvise compat_sys_process_madvise
>>
>> I think we should not add any new x32-specific syscalls. Instead I think
>> the compat_sys_process_madvise/sys_process_madvise can be
>> merged into one.
>>
>>> +   mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
>>> +   if (IS_ERR_OR_NULL(mm)) {
>>> +   ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
>>> +   goto release_task;
>>> +   }
>>
>> Minor point: Having to use IS_ERR_OR_NULL() tends to be fragile,
>> and I would try to avoid that. Can mm_access() be changed to
>> itself return PTR_ERR(-ESRCH) instead of NULL to improve its
>> calling conventions? I see there are only three other callers.
>>
>>
>>> +   ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, 
>>> &iter);
>>> +   if (ret >= 0) {
>>> +   ret = do_process_madvise(pidfd, &iter, behavior, flags);
>>> +   kfree(iov);
>>> +   }
>>> +   return ret;
>>> +}
>>> +
>>> +#ifdef CONFIG_COMPAT
>> ...
>>> +
>>> +   ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack),
>>> +   &iov, &iter);
>>> +   if (ret >= 0) {
>>> +   ret = do_process_madvise(pidfd, &iter, behavior, flags);
>>> +   kfree(iov);
>>> +   }
>>
>> Every syscall that passes an iovec seems to do this. If we make 
>> import_iovec()
>> handle both cases directly, this syscall and a number of others can
>> be simplified, and you avoid the x32 entry point I mentioned above
>>
>> Something like (untested)
>>
>> index dad8d0cfaaf7..0de4ddff24c1 100644
>> --- a/lib/iov_iter.c
>> +++ b/lib/iov_iter.c
>> @@ -1683,8 +1683,13 @@ ssize_t import_iovec(int type, const struct
>> iovec __user * uvector,
>>  {
>> ssize_t n;
>> struct iovec *p;
>> -   n = rw_copy_check_uvector(type, uvector, nr_segs, fast_segs,
>> - *iov, &p);
>> +
>> +   if (in_compat_syscall())
>> +   n = compat_rw_copy_check_uvector(type, uvector, nr_segs,
>> +fast_segs, *iov, &p);
>> +   else
>> +   n = rw_copy_check_uvector(type, uvector, nr_segs,
>> + fast_segs, *iov, &p);
>> if (n < 0) {
>> if (p != *iov)
>> kfree(p);
> 
> Doesn't work for the async case, where you want to be holding on to the
> allocated iovec. But in general I think it's a good helper for the sync
> case, which is by far the majority.

Nevermind, I'm an idiot for reading this totally wrong.

-- 
Jens Axboe



Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API

2020-08-28 Thread Minchan Kim
On Fri, Aug 28, 2020 at 08:25:34PM +0200, Christian Brauner wrote:
> On Fri, Aug 28, 2020 at 8:24 PM Jens Axboe  wrote:
> >
> > On 8/28/20 11:40 AM, Arnd Bergmann wrote:
> > > On Mon, Jun 22, 2020 at 9:29 PM Minchan Kim  wrote:
> > >> So finally, the API is as follows,
> > >>
> > >>  ssize_t process_madvise(int pidfd, const struct iovec *iovec,
> > >>unsigned long vlen, int advice, unsigned int flags);
> > >
> > > I had not followed the discussion earlier and only now came across
> > > the syscall in linux-next, sorry for stirring things up this late.
> > >
> > >> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> > >> b/arch/x86/entry/syscalls/syscall_64.tbl
> > >> index 94bf4958d114..8f959d90338a 100644
> > >> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > >> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > >> @@ -364,6 +364,7 @@
> > >>  440common  watch_mount sys_watch_mount
> > >>  441common  watch_sbsys_watch_sb
> > >>  442common  fsinfo  sys_fsinfo
> > >> +44364  process_madvise sys_process_madvise
> > >>
> > >>  #
> > >>  # x32-specific system call numbers start at 512 to avoid cache impact
> > >> @@ -407,3 +408,4 @@
> > >>  545x32 execveatcompat_sys_execveat
> > >>  546x32 preadv2 compat_sys_preadv64v2
> > >>  547x32 pwritev2compat_sys_pwritev64v2
> > >> +548x32 process_madvise compat_sys_process_madvise
> > >
> > > I think we should not add any new x32-specific syscalls. Instead I think
> > > the compat_sys_process_madvise/sys_process_madvise can be
> > > merged into one.
> > >
> > >> +   mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> > >> +   if (IS_ERR_OR_NULL(mm)) {
> > >> +   ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> > >> +   goto release_task;
> > >> +   }
> > >
> > > Minor point: Having to use IS_ERR_OR_NULL() tends to be fragile,
> > > and I would try to avoid that. Can mm_access() be changed to
> > > itself return PTR_ERR(-ESRCH) instead of NULL to improve its
> > > calling conventions? I see there are only three other callers.
> > >
> > >
> > >> +   ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, 
> > >> &iter);
> > >> +   if (ret >= 0) {
> > >> +   ret = do_process_madvise(pidfd, &iter, behavior, flags);
> > >> +   kfree(iov);
> > >> +   }
> > >> +   return ret;
> > >> +}
> > >> +
> > >> +#ifdef CONFIG_COMPAT
> > > ...
> > >> +
> > >> +   ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack),
> > >> +   &iov, &iter);
> > >> +   if (ret >= 0) {
> > >> +   ret = do_process_madvise(pidfd, &iter, behavior, flags);
> > >> +   kfree(iov);
> > >> +   }
> > >
> > > Every syscall that passes an iovec seems to do this. If we make 
> > > import_iovec()
> > > handle both cases directly, this syscall and a number of others can
> > > be simplified, and you avoid the x32 entry point I mentioned above
> > >
> > > Something like (untested)
> > >
> > > index dad8d0cfaaf7..0de4ddff24c1 100644
> > > --- a/lib/iov_iter.c
> > > +++ b/lib/iov_iter.c
> > > @@ -1683,8 +1683,13 @@ ssize_t import_iovec(int type, const struct
> > > iovec __user * uvector,
> > >  {
> > > ssize_t n;
> > > struct iovec *p;
> > > -   n = rw_copy_check_uvector(type, uvector, nr_segs, fast_segs,
> > > - *iov, &p);
> > > +
> > > +   if (in_compat_syscall())
> 
> I suggested the exact same solutions roughly 1.5 weeks ago. :)
> Fun when I saw you mentioning this in BBB I knew exactly what you were
> referring too. :)
> 

https://lore.kernel.org/linux-man/20200816081227.ngw3l45c5uncesmr@wittgenstein/

Yes, Christian suggested the idea but mostly for only this new syscall.
I don't have the time to revise the patchset yet but may have next week.
I will follow Christian's suggestion.

Thanks.


Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API

2020-08-28 Thread Christian Brauner
On Fri, Aug 28, 2020 at 8:24 PM Jens Axboe  wrote:
>
> On 8/28/20 11:40 AM, Arnd Bergmann wrote:
> > On Mon, Jun 22, 2020 at 9:29 PM Minchan Kim  wrote:
> >> So finally, the API is as follows,
> >>
> >>  ssize_t process_madvise(int pidfd, const struct iovec *iovec,
> >>unsigned long vlen, int advice, unsigned int flags);
> >
> > I had not followed the discussion earlier and only now came across
> > the syscall in linux-next, sorry for stirring things up this late.
> >
> >> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> >> b/arch/x86/entry/syscalls/syscall_64.tbl
> >> index 94bf4958d114..8f959d90338a 100644
> >> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> >> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> >> @@ -364,6 +364,7 @@
> >>  440common  watch_mount sys_watch_mount
> >>  441common  watch_sbsys_watch_sb
> >>  442common  fsinfo  sys_fsinfo
> >> +44364  process_madvise sys_process_madvise
> >>
> >>  #
> >>  # x32-specific system call numbers start at 512 to avoid cache impact
> >> @@ -407,3 +408,4 @@
> >>  545x32 execveatcompat_sys_execveat
> >>  546x32 preadv2 compat_sys_preadv64v2
> >>  547x32 pwritev2compat_sys_pwritev64v2
> >> +548x32 process_madvise compat_sys_process_madvise
> >
> > I think we should not add any new x32-specific syscalls. Instead I think
> > the compat_sys_process_madvise/sys_process_madvise can be
> > merged into one.
> >
> >> +   mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> >> +   if (IS_ERR_OR_NULL(mm)) {
> >> +   ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> >> +   goto release_task;
> >> +   }
> >
> > Minor point: Having to use IS_ERR_OR_NULL() tends to be fragile,
> > and I would try to avoid that. Can mm_access() be changed to
> > itself return PTR_ERR(-ESRCH) instead of NULL to improve its
> > calling conventions? I see there are only three other callers.
> >
> >
> >> +   ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, 
> >> &iter);
> >> +   if (ret >= 0) {
> >> +   ret = do_process_madvise(pidfd, &iter, behavior, flags);
> >> +   kfree(iov);
> >> +   }
> >> +   return ret;
> >> +}
> >> +
> >> +#ifdef CONFIG_COMPAT
> > ...
> >> +
> >> +   ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack),
> >> +   &iov, &iter);
> >> +   if (ret >= 0) {
> >> +   ret = do_process_madvise(pidfd, &iter, behavior, flags);
> >> +   kfree(iov);
> >> +   }
> >
> > Every syscall that passes an iovec seems to do this. If we make 
> > import_iovec()
> > handle both cases directly, this syscall and a number of others can
> > be simplified, and you avoid the x32 entry point I mentioned above
> >
> > Something like (untested)
> >
> > index dad8d0cfaaf7..0de4ddff24c1 100644
> > --- a/lib/iov_iter.c
> > +++ b/lib/iov_iter.c
> > @@ -1683,8 +1683,13 @@ ssize_t import_iovec(int type, const struct
> > iovec __user * uvector,
> >  {
> > ssize_t n;
> > struct iovec *p;
> > -   n = rw_copy_check_uvector(type, uvector, nr_segs, fast_segs,
> > - *iov, &p);
> > +
> > +   if (in_compat_syscall())

I suggested the exact same solutions roughly 1.5 weeks ago. :)
Fun when I saw you mentioning this in BBB I knew exactly what you were
referring too. :)

Christian


Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API

2020-08-28 Thread Jens Axboe
On 8/28/20 11:40 AM, Arnd Bergmann wrote:
> On Mon, Jun 22, 2020 at 9:29 PM Minchan Kim  wrote:
>> So finally, the API is as follows,
>>
>>  ssize_t process_madvise(int pidfd, const struct iovec *iovec,
>>unsigned long vlen, int advice, unsigned int flags);
> 
> I had not followed the discussion earlier and only now came across
> the syscall in linux-next, sorry for stirring things up this late.
> 
>> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
>> b/arch/x86/entry/syscalls/syscall_64.tbl
>> index 94bf4958d114..8f959d90338a 100644
>> --- a/arch/x86/entry/syscalls/syscall_64.tbl
>> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
>> @@ -364,6 +364,7 @@
>>  440common  watch_mount sys_watch_mount
>>  441common  watch_sbsys_watch_sb
>>  442common  fsinfo  sys_fsinfo
>> +44364  process_madvise sys_process_madvise
>>
>>  #
>>  # x32-specific system call numbers start at 512 to avoid cache impact
>> @@ -407,3 +408,4 @@
>>  545x32 execveatcompat_sys_execveat
>>  546x32 preadv2 compat_sys_preadv64v2
>>  547x32 pwritev2compat_sys_pwritev64v2
>> +548x32 process_madvise compat_sys_process_madvise
> 
> I think we should not add any new x32-specific syscalls. Instead I think
> the compat_sys_process_madvise/sys_process_madvise can be
> merged into one.
> 
>> +   mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
>> +   if (IS_ERR_OR_NULL(mm)) {
>> +   ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
>> +   goto release_task;
>> +   }
> 
> Minor point: Having to use IS_ERR_OR_NULL() tends to be fragile,
> and I would try to avoid that. Can mm_access() be changed to
> itself return PTR_ERR(-ESRCH) instead of NULL to improve its
> calling conventions? I see there are only three other callers.
> 
> 
>> +   ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, 
>> &iter);
>> +   if (ret >= 0) {
>> +   ret = do_process_madvise(pidfd, &iter, behavior, flags);
>> +   kfree(iov);
>> +   }
>> +   return ret;
>> +}
>> +
>> +#ifdef CONFIG_COMPAT
> ...
>> +
>> +   ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack),
>> +   &iov, &iter);
>> +   if (ret >= 0) {
>> +   ret = do_process_madvise(pidfd, &iter, behavior, flags);
>> +   kfree(iov);
>> +   }
> 
> Every syscall that passes an iovec seems to do this. If we make import_iovec()
> handle both cases directly, this syscall and a number of others can
> be simplified, and you avoid the x32 entry point I mentioned above
> 
> Something like (untested)
> 
> index dad8d0cfaaf7..0de4ddff24c1 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1683,8 +1683,13 @@ ssize_t import_iovec(int type, const struct
> iovec __user * uvector,
>  {
> ssize_t n;
> struct iovec *p;
> -   n = rw_copy_check_uvector(type, uvector, nr_segs, fast_segs,
> - *iov, &p);
> +
> +   if (in_compat_syscall())
> +   n = compat_rw_copy_check_uvector(type, uvector, nr_segs,
> +fast_segs, *iov, &p);
> +   else
> +   n = rw_copy_check_uvector(type, uvector, nr_segs,
> + fast_segs, *iov, &p);
> if (n < 0) {
> if (p != *iov)
> kfree(p);

Doesn't work for the async case, where you want to be holding on to the
allocated iovec. But in general I think it's a good helper for the sync
case, which is by far the majority.

-- 
Jens Axboe



Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API

2020-08-28 Thread Arnd Bergmann
On Mon, Jun 22, 2020 at 9:29 PM Minchan Kim  wrote:
> So finally, the API is as follows,
>
>  ssize_t process_madvise(int pidfd, const struct iovec *iovec,
>unsigned long vlen, int advice, unsigned int flags);

I had not followed the discussion earlier and only now came across
the syscall in linux-next, sorry for stirring things up this late.

> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index 94bf4958d114..8f959d90338a 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -364,6 +364,7 @@
>  440common  watch_mount sys_watch_mount
>  441common  watch_sbsys_watch_sb
>  442common  fsinfo  sys_fsinfo
> +44364  process_madvise sys_process_madvise
>
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> @@ -407,3 +408,4 @@
>  545x32 execveatcompat_sys_execveat
>  546x32 preadv2 compat_sys_preadv64v2
>  547x32 pwritev2compat_sys_pwritev64v2
> +548x32 process_madvise compat_sys_process_madvise

I think we should not add any new x32-specific syscalls. Instead I think
the compat_sys_process_madvise/sys_process_madvise can be
merged into one.

> +   mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> +   if (IS_ERR_OR_NULL(mm)) {
> +   ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> +   goto release_task;
> +   }

Minor point: Having to use IS_ERR_OR_NULL() tends to be fragile,
and I would try to avoid that. Can mm_access() be changed to
itself return PTR_ERR(-ESRCH) instead of NULL to improve its
calling conventions? I see there are only three other callers.


> +   ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, 
> &iter);
> +   if (ret >= 0) {
> +   ret = do_process_madvise(pidfd, &iter, behavior, flags);
> +   kfree(iov);
> +   }
> +   return ret;
> +}
> +
> +#ifdef CONFIG_COMPAT
...
> +
> +   ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack),
> +   &iov, &iter);
> +   if (ret >= 0) {
> +   ret = do_process_madvise(pidfd, &iter, behavior, flags);
> +   kfree(iov);
> +   }

Every syscall that passes an iovec seems to do this. If we make import_iovec()
handle both cases directly, this syscall and a number of others can
be simplified, and you avoid the x32 entry point I mentioned above

Something like (untested)

index dad8d0cfaaf7..0de4ddff24c1 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1683,8 +1683,13 @@ ssize_t import_iovec(int type, const struct
iovec __user * uvector,
 {
ssize_t n;
struct iovec *p;
-   n = rw_copy_check_uvector(type, uvector, nr_segs, fast_segs,
- *iov, &p);
+
+   if (in_compat_syscall())
+   n = compat_rw_copy_check_uvector(type, uvector, nr_segs,
+fast_segs, *iov, &p);
+   else
+   n = rw_copy_check_uvector(type, uvector, nr_segs,
+ fast_segs, *iov, &p);
if (n < 0) {
if (p != *iov)
kfree(p);


  Arnd


Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API

2020-06-25 Thread Minchan Kim
On Wed, Jun 24, 2020 at 01:00:14PM -0700, David Rientjes wrote:
> On Mon, 22 Jun 2020, Minchan Kim wrote:
> 
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 551ed816eefe..23abca3f93fa 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -17,6 +17,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -995,6 +996,18 @@ madvise_behavior_valid(int behavior)
> > }
> >  }
> >  
> > +static bool
> > +process_madvise_behavior_valid(int behavior)
> > +{
> > +   switch (behavior) {
> > +   case MADV_COLD:
> > +   case MADV_PAGEOUT:
> > +   return true;
> > +   default:
> > +   return false;
> > +   }
> > +}
> > +
> >  /*
> >   * The madvise(2) system call.
> >   *
> > @@ -1042,6 +1055,11 @@ madvise_behavior_valid(int behavior)
> >   *  MADV_DONTDUMP - the application wants to prevent pages in the given 
> > range
> >   * from being included in its core dump.
> >   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > + *  MADV_COLD - the application is not expected to use this memory soon,
> > + * deactivate pages in this range so that they can be reclaimed
> > + * easily if memory pressure hanppens.
> > + *  MADV_PAGEOUT - the application is not expected to use this memory soon,
> > + * page out the pages in this range immediately.
> >   *
> >   * return values:
> >   *  zero- success
> > @@ -1176,3 +1194,106 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, 
> > size_t, len_in, int, behavior)
> >  {
> > return do_madvise(current, current->mm, start, len_in, behavior);
> >  }
> > +
> > +static int process_madvise_vec(struct task_struct *target_task,
> > +   struct mm_struct *mm, struct iov_iter *iter, int behavior)
> > +{
> > +   struct iovec iovec;
> > +   int ret = 0;
> > +
> > +   while (iov_iter_count(iter)) {
> > +   iovec = iov_iter_iovec(iter);
> > +   ret = do_madvise(target_task, mm, (unsigned long)iovec.iov_base,
> > +   iovec.iov_len, behavior);
> > +   if (ret < 0)
> > +   break;
> > +   iov_iter_advance(iter, iovec.iov_len);
> > +   }
> > +
> > +   return ret;
> > +}
> > +
> > +static ssize_t do_process_madvise(int pidfd, struct iov_iter *iter,
> > +   int behavior, unsigned int flags)
> > +{
> > +   ssize_t ret;
> > +   struct pid *pid;
> > +   struct task_struct *task;
> > +   struct mm_struct *mm;
> > +   size_t total_len = iov_iter_count(iter);
> > +
> > +   if (flags != 0)
> > +   return -EINVAL;
> > +
> > +   pid = pidfd_get_pid(pidfd);
> > +   if (IS_ERR(pid))
> > +   return PTR_ERR(pid);
> > +
> > +   task = get_pid_task(pid, PIDTYPE_PID);
> > +   if (!task) {
> > +   ret = -ESRCH;
> > +   goto put_pid;
> > +   }
> > +
> > +   if (task->mm != current->mm &&
> > +   !process_madvise_behavior_valid(behavior)) {
> > +   ret = -EINVAL;
> > +   goto release_task;
> > +   }
> > +
> > +   mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> > +   if (IS_ERR_OR_NULL(mm)) {
> > +   ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> > +   goto release_task;
> > +   }
> > 
> 
> mm is always task->mm right?  I'm wondering if it would be better to find 
> the mm directly in process_madvise_vec() rather than passing it into the 
> function.  I'm not sure why we'd pass both task and mm here.

That's because of hint Jann provided in the past version.
https://lore.kernel.org/linux-api/CAG48ez27=pwm5m_n_988xt1huo7g7h6artql44zev6td-h-...@mail.gmail.com/

Thanks for the review, David.


Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API

2020-06-24 Thread David Rientjes
On Mon, 22 Jun 2020, Minchan Kim wrote:

> diff --git a/mm/madvise.c b/mm/madvise.c
> index 551ed816eefe..23abca3f93fa 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -995,6 +996,18 @@ madvise_behavior_valid(int behavior)
>   }
>  }
>  
> +static bool
> +process_madvise_behavior_valid(int behavior)
> +{
> + switch (behavior) {
> + case MADV_COLD:
> + case MADV_PAGEOUT:
> + return true;
> + default:
> + return false;
> + }
> +}
> +
>  /*
>   * The madvise(2) system call.
>   *
> @@ -1042,6 +1055,11 @@ madvise_behavior_valid(int behavior)
>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>   *   from being included in its core dump.
>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> + *  MADV_COLD - the application is not expected to use this memory soon,
> + *   deactivate pages in this range so that they can be reclaimed
> + *   easily if memory pressure hanppens.
> + *  MADV_PAGEOUT - the application is not expected to use this memory soon,
> + *   page out the pages in this range immediately.
>   *
>   * return values:
>   *  zero- success
> @@ -1176,3 +1194,106 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, 
> size_t, len_in, int, behavior)
>  {
>   return do_madvise(current, current->mm, start, len_in, behavior);
>  }
> +
> +static int process_madvise_vec(struct task_struct *target_task,
> + struct mm_struct *mm, struct iov_iter *iter, int behavior)
> +{
> + struct iovec iovec;
> + int ret = 0;
> +
> + while (iov_iter_count(iter)) {
> + iovec = iov_iter_iovec(iter);
> + ret = do_madvise(target_task, mm, (unsigned long)iovec.iov_base,
> + iovec.iov_len, behavior);
> + if (ret < 0)
> + break;
> + iov_iter_advance(iter, iovec.iov_len);
> + }
> +
> + return ret;
> +}
> +
> +static ssize_t do_process_madvise(int pidfd, struct iov_iter *iter,
> + int behavior, unsigned int flags)
> +{
> + ssize_t ret;
> + struct pid *pid;
> + struct task_struct *task;
> + struct mm_struct *mm;
> + size_t total_len = iov_iter_count(iter);
> +
> + if (flags != 0)
> + return -EINVAL;
> +
> + pid = pidfd_get_pid(pidfd);
> + if (IS_ERR(pid))
> + return PTR_ERR(pid);
> +
> + task = get_pid_task(pid, PIDTYPE_PID);
> + if (!task) {
> + ret = -ESRCH;
> + goto put_pid;
> + }
> +
> + if (task->mm != current->mm &&
> + !process_madvise_behavior_valid(behavior)) {
> + ret = -EINVAL;
> + goto release_task;
> + }
> +
> + mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> + if (IS_ERR_OR_NULL(mm)) {
> + ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> + goto release_task;
> + }
> 

mm is always task->mm right?  I'm wondering if it would be better to find 
the mm directly in process_madvise_vec() rather than passing it into the 
function.  I'm not sure why we'd pass both task and mm here.

+
> + ret = process_madvise_vec(task, mm, iter, behavior);
> + if (ret >= 0)
> + ret = total_len - iov_iter_count(iter);
> +
> + mmput(mm);
> +release_task:
> + put_task_struct(task);
> +put_pid:
> + put_pid(pid);
> + return ret;
> +}
> +
> +SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, 
> vec,
> + unsigned long, vlen, int, behavior, unsigned int, flags)

I love the idea of adding the flags parameter here and I can think of an 
immediate use case for MADV_HUGEPAGE, which is overloaded.

Today, MADV_HUGEPAGE controls enablement depending on system config and 
controls defrag behavior based on system config.  It also cannot be opted 
out of without setting MADV_NOHUGEPAGE :)

I was thinking of a flag that users could use to trigger an immediate 
collapse in process context regardless of the system config.

So I'm a big advocate of this flags parameter and consider it an absolute 
must for the API.

Acked-by: David Rientjes