Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API
On Fri, Aug 28, 2020 at 07:40:08PM +0200, Arnd Bergmann wrote: > Every syscall that passes an iovec seems to do this. If we make import_iovec() > handle both cases directly, this syscall and a number of others can > be simplified, and you avoid the x32 entry point I mentioned above FYI, I do have a series that does this (even tested) and kills tons of compat syscalls by that. But by doing that I found the problem that compat syscalls issued by io_uring don't trigger in_compat_syscall(). I need to go back to fixing the io_uring vs in_compat_syscall() issue (probably for 5.9) and then submit the whole thing.
Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API
On Fri, Aug 28, 2020 at 9:27 PM Jens Axboe wrote: > On 8/28/20 12:24 PM, Jens Axboe wrote: > >> @@ -1683,8 +1683,13 @@ ssize_t import_iovec(int type, const struct > >> iovec __user * uvector, > >> { > >> ssize_t n; > >> struct iovec *p; > >> - n = rw_copy_check_uvector(type, uvector, nr_segs, fast_segs, > >> - *iov, &p); > >> + > >> + if (in_compat_syscall()) > >> + n = compat_rw_copy_check_uvector(type, uvector, nr_segs, > >> +fast_segs, *iov, &p); > >> + else > >> + n = rw_copy_check_uvector(type, uvector, nr_segs, > >> + fast_segs, *iov, &p); > >> if (n < 0) { > >> if (p != *iov) > >> kfree(p); > > > > Doesn't work for the async case, where you want to be holding on to the > > allocated iovec. But in general I think it's a good helper for the sync > > case, which is by far the majority. > > Nevermind, I'm an idiot for reading this totally wrong. > I think you are right about the need to pick the compat vs native behavior based on req->ctx->compat instead of in_compat_syscall() inside of io_import_iovec(). That one can probably call a lower-level version and when all other callers get changed to calling ssize_t import_iovec(int type, const struct iovec __user * uvector, unsigned nr_segs, unsigned fast_segs, struct iovec **iov, struct iov_iter *i) { return __import_iovec(type, uvector, nr_segs, fast_segs, iov, i, in_compat_syscall()); } Arnd
Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API
On 8/28/20 12:24 PM, Jens Axboe wrote: > On 8/28/20 11:40 AM, Arnd Bergmann wrote: >> On Mon, Jun 22, 2020 at 9:29 PM Minchan Kim wrote: >>> So finally, the API is as follows, >>> >>> ssize_t process_madvise(int pidfd, const struct iovec *iovec, >>>unsigned long vlen, int advice, unsigned int flags); >> >> I had not followed the discussion earlier and only now came across >> the syscall in linux-next, sorry for stirring things up this late. >> >>> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl >>> b/arch/x86/entry/syscalls/syscall_64.tbl >>> index 94bf4958d114..8f959d90338a 100644 >>> --- a/arch/x86/entry/syscalls/syscall_64.tbl >>> +++ b/arch/x86/entry/syscalls/syscall_64.tbl >>> @@ -364,6 +364,7 @@ >>> 440common watch_mount sys_watch_mount >>> 441common watch_sbsys_watch_sb >>> 442common fsinfo sys_fsinfo >>> +44364 process_madvise sys_process_madvise >>> >>> # >>> # x32-specific system call numbers start at 512 to avoid cache impact >>> @@ -407,3 +408,4 @@ >>> 545x32 execveatcompat_sys_execveat >>> 546x32 preadv2 compat_sys_preadv64v2 >>> 547x32 pwritev2compat_sys_pwritev64v2 >>> +548x32 process_madvise compat_sys_process_madvise >> >> I think we should not add any new x32-specific syscalls. Instead I think >> the compat_sys_process_madvise/sys_process_madvise can be >> merged into one. >> >>> + mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS); >>> + if (IS_ERR_OR_NULL(mm)) { >>> + ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH; >>> + goto release_task; >>> + } >> >> Minor point: Having to use IS_ERR_OR_NULL() tends to be fragile, >> and I would try to avoid that. Can mm_access() be changed to >> itself return PTR_ERR(-ESRCH) instead of NULL to improve its >> calling conventions? I see there are only three other callers. >> >> >>> + ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, >>> &iter); >>> + if (ret >= 0) { >>> + ret = do_process_madvise(pidfd, &iter, behavior, flags); >>> + kfree(iov); >>> + } >>> + return ret; >>> +} >>> + >>> +#ifdef CONFIG_COMPAT >> ... >>> + >>> + ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), >>> + &iov, &iter); >>> + if (ret >= 0) { >>> + ret = do_process_madvise(pidfd, &iter, behavior, flags); >>> + kfree(iov); >>> + } >> >> Every syscall that passes an iovec seems to do this. If we make >> import_iovec() >> handle both cases directly, this syscall and a number of others can >> be simplified, and you avoid the x32 entry point I mentioned above >> >> Something like (untested) >> >> index dad8d0cfaaf7..0de4ddff24c1 100644 >> --- a/lib/iov_iter.c >> +++ b/lib/iov_iter.c >> @@ -1683,8 +1683,13 @@ ssize_t import_iovec(int type, const struct >> iovec __user * uvector, >> { >> ssize_t n; >> struct iovec *p; >> - n = rw_copy_check_uvector(type, uvector, nr_segs, fast_segs, >> - *iov, &p); >> + >> + if (in_compat_syscall()) >> + n = compat_rw_copy_check_uvector(type, uvector, nr_segs, >> +fast_segs, *iov, &p); >> + else >> + n = rw_copy_check_uvector(type, uvector, nr_segs, >> + fast_segs, *iov, &p); >> if (n < 0) { >> if (p != *iov) >> kfree(p); > > Doesn't work for the async case, where you want to be holding on to the > allocated iovec. But in general I think it's a good helper for the sync > case, which is by far the majority. Nevermind, I'm an idiot for reading this totally wrong. -- Jens Axboe
Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API
On Fri, Aug 28, 2020 at 08:25:34PM +0200, Christian Brauner wrote: > On Fri, Aug 28, 2020 at 8:24 PM Jens Axboe wrote: > > > > On 8/28/20 11:40 AM, Arnd Bergmann wrote: > > > On Mon, Jun 22, 2020 at 9:29 PM Minchan Kim wrote: > > >> So finally, the API is as follows, > > >> > > >> ssize_t process_madvise(int pidfd, const struct iovec *iovec, > > >>unsigned long vlen, int advice, unsigned int flags); > > > > > > I had not followed the discussion earlier and only now came across > > > the syscall in linux-next, sorry for stirring things up this late. > > > > > >> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl > > >> b/arch/x86/entry/syscalls/syscall_64.tbl > > >> index 94bf4958d114..8f959d90338a 100644 > > >> --- a/arch/x86/entry/syscalls/syscall_64.tbl > > >> +++ b/arch/x86/entry/syscalls/syscall_64.tbl > > >> @@ -364,6 +364,7 @@ > > >> 440common watch_mount sys_watch_mount > > >> 441common watch_sbsys_watch_sb > > >> 442common fsinfo sys_fsinfo > > >> +44364 process_madvise sys_process_madvise > > >> > > >> # > > >> # x32-specific system call numbers start at 512 to avoid cache impact > > >> @@ -407,3 +408,4 @@ > > >> 545x32 execveatcompat_sys_execveat > > >> 546x32 preadv2 compat_sys_preadv64v2 > > >> 547x32 pwritev2compat_sys_pwritev64v2 > > >> +548x32 process_madvise compat_sys_process_madvise > > > > > > I think we should not add any new x32-specific syscalls. Instead I think > > > the compat_sys_process_madvise/sys_process_madvise can be > > > merged into one. > > > > > >> + mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS); > > >> + if (IS_ERR_OR_NULL(mm)) { > > >> + ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH; > > >> + goto release_task; > > >> + } > > > > > > Minor point: Having to use IS_ERR_OR_NULL() tends to be fragile, > > > and I would try to avoid that. Can mm_access() be changed to > > > itself return PTR_ERR(-ESRCH) instead of NULL to improve its > > > calling conventions? I see there are only three other callers. > > > > > > > > >> + ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, > > >> &iter); > > >> + if (ret >= 0) { > > >> + ret = do_process_madvise(pidfd, &iter, behavior, flags); > > >> + kfree(iov); > > >> + } > > >> + return ret; > > >> +} > > >> + > > >> +#ifdef CONFIG_COMPAT > > > ... > > >> + > > >> + ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), > > >> + &iov, &iter); > > >> + if (ret >= 0) { > > >> + ret = do_process_madvise(pidfd, &iter, behavior, flags); > > >> + kfree(iov); > > >> + } > > > > > > Every syscall that passes an iovec seems to do this. If we make > > > import_iovec() > > > handle both cases directly, this syscall and a number of others can > > > be simplified, and you avoid the x32 entry point I mentioned above > > > > > > Something like (untested) > > > > > > index dad8d0cfaaf7..0de4ddff24c1 100644 > > > --- a/lib/iov_iter.c > > > +++ b/lib/iov_iter.c > > > @@ -1683,8 +1683,13 @@ ssize_t import_iovec(int type, const struct > > > iovec __user * uvector, > > > { > > > ssize_t n; > > > struct iovec *p; > > > - n = rw_copy_check_uvector(type, uvector, nr_segs, fast_segs, > > > - *iov, &p); > > > + > > > + if (in_compat_syscall()) > > I suggested the exact same solutions roughly 1.5 weeks ago. :) > Fun when I saw you mentioning this in BBB I knew exactly what you were > referring too. :) > https://lore.kernel.org/linux-man/20200816081227.ngw3l45c5uncesmr@wittgenstein/ Yes, Christian suggested the idea but mostly for only this new syscall. I don't have the time to revise the patchset yet but may have next week. I will follow Christian's suggestion. Thanks.
Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API
On Fri, Aug 28, 2020 at 8:24 PM Jens Axboe wrote: > > On 8/28/20 11:40 AM, Arnd Bergmann wrote: > > On Mon, Jun 22, 2020 at 9:29 PM Minchan Kim wrote: > >> So finally, the API is as follows, > >> > >> ssize_t process_madvise(int pidfd, const struct iovec *iovec, > >>unsigned long vlen, int advice, unsigned int flags); > > > > I had not followed the discussion earlier and only now came across > > the syscall in linux-next, sorry for stirring things up this late. > > > >> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl > >> b/arch/x86/entry/syscalls/syscall_64.tbl > >> index 94bf4958d114..8f959d90338a 100644 > >> --- a/arch/x86/entry/syscalls/syscall_64.tbl > >> +++ b/arch/x86/entry/syscalls/syscall_64.tbl > >> @@ -364,6 +364,7 @@ > >> 440common watch_mount sys_watch_mount > >> 441common watch_sbsys_watch_sb > >> 442common fsinfo sys_fsinfo > >> +44364 process_madvise sys_process_madvise > >> > >> # > >> # x32-specific system call numbers start at 512 to avoid cache impact > >> @@ -407,3 +408,4 @@ > >> 545x32 execveatcompat_sys_execveat > >> 546x32 preadv2 compat_sys_preadv64v2 > >> 547x32 pwritev2compat_sys_pwritev64v2 > >> +548x32 process_madvise compat_sys_process_madvise > > > > I think we should not add any new x32-specific syscalls. Instead I think > > the compat_sys_process_madvise/sys_process_madvise can be > > merged into one. > > > >> + mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS); > >> + if (IS_ERR_OR_NULL(mm)) { > >> + ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH; > >> + goto release_task; > >> + } > > > > Minor point: Having to use IS_ERR_OR_NULL() tends to be fragile, > > and I would try to avoid that. Can mm_access() be changed to > > itself return PTR_ERR(-ESRCH) instead of NULL to improve its > > calling conventions? I see there are only three other callers. > > > > > >> + ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, > >> &iter); > >> + if (ret >= 0) { > >> + ret = do_process_madvise(pidfd, &iter, behavior, flags); > >> + kfree(iov); > >> + } > >> + return ret; > >> +} > >> + > >> +#ifdef CONFIG_COMPAT > > ... > >> + > >> + ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), > >> + &iov, &iter); > >> + if (ret >= 0) { > >> + ret = do_process_madvise(pidfd, &iter, behavior, flags); > >> + kfree(iov); > >> + } > > > > Every syscall that passes an iovec seems to do this. If we make > > import_iovec() > > handle both cases directly, this syscall and a number of others can > > be simplified, and you avoid the x32 entry point I mentioned above > > > > Something like (untested) > > > > index dad8d0cfaaf7..0de4ddff24c1 100644 > > --- a/lib/iov_iter.c > > +++ b/lib/iov_iter.c > > @@ -1683,8 +1683,13 @@ ssize_t import_iovec(int type, const struct > > iovec __user * uvector, > > { > > ssize_t n; > > struct iovec *p; > > - n = rw_copy_check_uvector(type, uvector, nr_segs, fast_segs, > > - *iov, &p); > > + > > + if (in_compat_syscall()) I suggested the exact same solutions roughly 1.5 weeks ago. :) Fun when I saw you mentioning this in BBB I knew exactly what you were referring too. :) Christian
Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API
On 8/28/20 11:40 AM, Arnd Bergmann wrote: > On Mon, Jun 22, 2020 at 9:29 PM Minchan Kim wrote: >> So finally, the API is as follows, >> >> ssize_t process_madvise(int pidfd, const struct iovec *iovec, >>unsigned long vlen, int advice, unsigned int flags); > > I had not followed the discussion earlier and only now came across > the syscall in linux-next, sorry for stirring things up this late. > >> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl >> b/arch/x86/entry/syscalls/syscall_64.tbl >> index 94bf4958d114..8f959d90338a 100644 >> --- a/arch/x86/entry/syscalls/syscall_64.tbl >> +++ b/arch/x86/entry/syscalls/syscall_64.tbl >> @@ -364,6 +364,7 @@ >> 440common watch_mount sys_watch_mount >> 441common watch_sbsys_watch_sb >> 442common fsinfo sys_fsinfo >> +44364 process_madvise sys_process_madvise >> >> # >> # x32-specific system call numbers start at 512 to avoid cache impact >> @@ -407,3 +408,4 @@ >> 545x32 execveatcompat_sys_execveat >> 546x32 preadv2 compat_sys_preadv64v2 >> 547x32 pwritev2compat_sys_pwritev64v2 >> +548x32 process_madvise compat_sys_process_madvise > > I think we should not add any new x32-specific syscalls. Instead I think > the compat_sys_process_madvise/sys_process_madvise can be > merged into one. > >> + mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS); >> + if (IS_ERR_OR_NULL(mm)) { >> + ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH; >> + goto release_task; >> + } > > Minor point: Having to use IS_ERR_OR_NULL() tends to be fragile, > and I would try to avoid that. Can mm_access() be changed to > itself return PTR_ERR(-ESRCH) instead of NULL to improve its > calling conventions? I see there are only three other callers. > > >> + ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, >> &iter); >> + if (ret >= 0) { >> + ret = do_process_madvise(pidfd, &iter, behavior, flags); >> + kfree(iov); >> + } >> + return ret; >> +} >> + >> +#ifdef CONFIG_COMPAT > ... >> + >> + ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), >> + &iov, &iter); >> + if (ret >= 0) { >> + ret = do_process_madvise(pidfd, &iter, behavior, flags); >> + kfree(iov); >> + } > > Every syscall that passes an iovec seems to do this. If we make import_iovec() > handle both cases directly, this syscall and a number of others can > be simplified, and you avoid the x32 entry point I mentioned above > > Something like (untested) > > index dad8d0cfaaf7..0de4ddff24c1 100644 > --- a/lib/iov_iter.c > +++ b/lib/iov_iter.c > @@ -1683,8 +1683,13 @@ ssize_t import_iovec(int type, const struct > iovec __user * uvector, > { > ssize_t n; > struct iovec *p; > - n = rw_copy_check_uvector(type, uvector, nr_segs, fast_segs, > - *iov, &p); > + > + if (in_compat_syscall()) > + n = compat_rw_copy_check_uvector(type, uvector, nr_segs, > +fast_segs, *iov, &p); > + else > + n = rw_copy_check_uvector(type, uvector, nr_segs, > + fast_segs, *iov, &p); > if (n < 0) { > if (p != *iov) > kfree(p); Doesn't work for the async case, where you want to be holding on to the allocated iovec. But in general I think it's a good helper for the sync case, which is by far the majority. -- Jens Axboe
Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API
On Mon, Jun 22, 2020 at 9:29 PM Minchan Kim wrote: > So finally, the API is as follows, > > ssize_t process_madvise(int pidfd, const struct iovec *iovec, >unsigned long vlen, int advice, unsigned int flags); I had not followed the discussion earlier and only now came across the syscall in linux-next, sorry for stirring things up this late. > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl > b/arch/x86/entry/syscalls/syscall_64.tbl > index 94bf4958d114..8f959d90338a 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -364,6 +364,7 @@ > 440common watch_mount sys_watch_mount > 441common watch_sbsys_watch_sb > 442common fsinfo sys_fsinfo > +44364 process_madvise sys_process_madvise > > # > # x32-specific system call numbers start at 512 to avoid cache impact > @@ -407,3 +408,4 @@ > 545x32 execveatcompat_sys_execveat > 546x32 preadv2 compat_sys_preadv64v2 > 547x32 pwritev2compat_sys_pwritev64v2 > +548x32 process_madvise compat_sys_process_madvise I think we should not add any new x32-specific syscalls. Instead I think the compat_sys_process_madvise/sys_process_madvise can be merged into one. > + mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS); > + if (IS_ERR_OR_NULL(mm)) { > + ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH; > + goto release_task; > + } Minor point: Having to use IS_ERR_OR_NULL() tends to be fragile, and I would try to avoid that. Can mm_access() be changed to itself return PTR_ERR(-ESRCH) instead of NULL to improve its calling conventions? I see there are only three other callers. > + ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, > &iter); > + if (ret >= 0) { > + ret = do_process_madvise(pidfd, &iter, behavior, flags); > + kfree(iov); > + } > + return ret; > +} > + > +#ifdef CONFIG_COMPAT ... > + > + ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), > + &iov, &iter); > + if (ret >= 0) { > + ret = do_process_madvise(pidfd, &iter, behavior, flags); > + kfree(iov); > + } Every syscall that passes an iovec seems to do this. If we make import_iovec() handle both cases directly, this syscall and a number of others can be simplified, and you avoid the x32 entry point I mentioned above Something like (untested) index dad8d0cfaaf7..0de4ddff24c1 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -1683,8 +1683,13 @@ ssize_t import_iovec(int type, const struct iovec __user * uvector, { ssize_t n; struct iovec *p; - n = rw_copy_check_uvector(type, uvector, nr_segs, fast_segs, - *iov, &p); + + if (in_compat_syscall()) + n = compat_rw_copy_check_uvector(type, uvector, nr_segs, +fast_segs, *iov, &p); + else + n = rw_copy_check_uvector(type, uvector, nr_segs, + fast_segs, *iov, &p); if (n < 0) { if (p != *iov) kfree(p); Arnd
Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API
On Wed, Jun 24, 2020 at 01:00:14PM -0700, David Rientjes wrote: > On Mon, 22 Jun 2020, Minchan Kim wrote: > > > diff --git a/mm/madvise.c b/mm/madvise.c > > index 551ed816eefe..23abca3f93fa 100644 > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -17,6 +17,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -995,6 +996,18 @@ madvise_behavior_valid(int behavior) > > } > > } > > > > +static bool > > +process_madvise_behavior_valid(int behavior) > > +{ > > + switch (behavior) { > > + case MADV_COLD: > > + case MADV_PAGEOUT: > > + return true; > > + default: > > + return false; > > + } > > +} > > + > > /* > > * The madvise(2) system call. > > * > > @@ -1042,6 +1055,11 @@ madvise_behavior_valid(int behavior) > > * MADV_DONTDUMP - the application wants to prevent pages in the given > > range > > * from being included in its core dump. > > * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump. > > + * MADV_COLD - the application is not expected to use this memory soon, > > + * deactivate pages in this range so that they can be reclaimed > > + * easily if memory pressure hanppens. > > + * MADV_PAGEOUT - the application is not expected to use this memory soon, > > + * page out the pages in this range immediately. > > * > > * return values: > > * zero- success > > @@ -1176,3 +1194,106 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, > > size_t, len_in, int, behavior) > > { > > return do_madvise(current, current->mm, start, len_in, behavior); > > } > > + > > +static int process_madvise_vec(struct task_struct *target_task, > > + struct mm_struct *mm, struct iov_iter *iter, int behavior) > > +{ > > + struct iovec iovec; > > + int ret = 0; > > + > > + while (iov_iter_count(iter)) { > > + iovec = iov_iter_iovec(iter); > > + ret = do_madvise(target_task, mm, (unsigned long)iovec.iov_base, > > + iovec.iov_len, behavior); > > + if (ret < 0) > > + break; > > + iov_iter_advance(iter, iovec.iov_len); > > + } > > + > > + return ret; > > +} > > + > > +static ssize_t do_process_madvise(int pidfd, struct iov_iter *iter, > > + int behavior, unsigned int flags) > > +{ > > + ssize_t ret; > > + struct pid *pid; > > + struct task_struct *task; > > + struct mm_struct *mm; > > + size_t total_len = iov_iter_count(iter); > > + > > + if (flags != 0) > > + return -EINVAL; > > + > > + pid = pidfd_get_pid(pidfd); > > + if (IS_ERR(pid)) > > + return PTR_ERR(pid); > > + > > + task = get_pid_task(pid, PIDTYPE_PID); > > + if (!task) { > > + ret = -ESRCH; > > + goto put_pid; > > + } > > + > > + if (task->mm != current->mm && > > + !process_madvise_behavior_valid(behavior)) { > > + ret = -EINVAL; > > + goto release_task; > > + } > > + > > + mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS); > > + if (IS_ERR_OR_NULL(mm)) { > > + ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH; > > + goto release_task; > > + } > > > > mm is always task->mm right? I'm wondering if it would be better to find > the mm directly in process_madvise_vec() rather than passing it into the > function. I'm not sure why we'd pass both task and mm here. That's because of hint Jann provided in the past version. https://lore.kernel.org/linux-api/CAG48ez27=pwm5m_n_988xt1huo7g7h6artql44zev6td-h-...@mail.gmail.com/ Thanks for the review, David.
Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API
On Mon, 22 Jun 2020, Minchan Kim wrote: > diff --git a/mm/madvise.c b/mm/madvise.c > index 551ed816eefe..23abca3f93fa 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -17,6 +17,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -995,6 +996,18 @@ madvise_behavior_valid(int behavior) > } > } > > +static bool > +process_madvise_behavior_valid(int behavior) > +{ > + switch (behavior) { > + case MADV_COLD: > + case MADV_PAGEOUT: > + return true; > + default: > + return false; > + } > +} > + > /* > * The madvise(2) system call. > * > @@ -1042,6 +1055,11 @@ madvise_behavior_valid(int behavior) > * MADV_DONTDUMP - the application wants to prevent pages in the given range > * from being included in its core dump. > * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump. > + * MADV_COLD - the application is not expected to use this memory soon, > + * deactivate pages in this range so that they can be reclaimed > + * easily if memory pressure hanppens. > + * MADV_PAGEOUT - the application is not expected to use this memory soon, > + * page out the pages in this range immediately. > * > * return values: > * zero- success > @@ -1176,3 +1194,106 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, > size_t, len_in, int, behavior) > { > return do_madvise(current, current->mm, start, len_in, behavior); > } > + > +static int process_madvise_vec(struct task_struct *target_task, > + struct mm_struct *mm, struct iov_iter *iter, int behavior) > +{ > + struct iovec iovec; > + int ret = 0; > + > + while (iov_iter_count(iter)) { > + iovec = iov_iter_iovec(iter); > + ret = do_madvise(target_task, mm, (unsigned long)iovec.iov_base, > + iovec.iov_len, behavior); > + if (ret < 0) > + break; > + iov_iter_advance(iter, iovec.iov_len); > + } > + > + return ret; > +} > + > +static ssize_t do_process_madvise(int pidfd, struct iov_iter *iter, > + int behavior, unsigned int flags) > +{ > + ssize_t ret; > + struct pid *pid; > + struct task_struct *task; > + struct mm_struct *mm; > + size_t total_len = iov_iter_count(iter); > + > + if (flags != 0) > + return -EINVAL; > + > + pid = pidfd_get_pid(pidfd); > + if (IS_ERR(pid)) > + return PTR_ERR(pid); > + > + task = get_pid_task(pid, PIDTYPE_PID); > + if (!task) { > + ret = -ESRCH; > + goto put_pid; > + } > + > + if (task->mm != current->mm && > + !process_madvise_behavior_valid(behavior)) { > + ret = -EINVAL; > + goto release_task; > + } > + > + mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS); > + if (IS_ERR_OR_NULL(mm)) { > + ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH; > + goto release_task; > + } > mm is always task->mm right? I'm wondering if it would be better to find the mm directly in process_madvise_vec() rather than passing it into the function. I'm not sure why we'd pass both task and mm here. + > + ret = process_madvise_vec(task, mm, iter, behavior); > + if (ret >= 0) > + ret = total_len - iov_iter_count(iter); > + > + mmput(mm); > +release_task: > + put_task_struct(task); > +put_pid: > + put_pid(pid); > + return ret; > +} > + > +SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, > vec, > + unsigned long, vlen, int, behavior, unsigned int, flags) I love the idea of adding the flags parameter here and I can think of an immediate use case for MADV_HUGEPAGE, which is overloaded. Today, MADV_HUGEPAGE controls enablement depending on system config and controls defrag behavior based on system config. It also cannot be opted out of without setting MADV_NOHUGEPAGE :) I was thinking of a flag that users could use to trigger an immediate collapse in process context regardless of the system config. So I'm a big advocate of this flags parameter and consider it an absolute must for the API. Acked-by: David Rientjes