> > Also in this thread Kamil mentioned that they also need calling prctl 
> > with PR_SET_MM during restore in their production setup.
>
> We're using that as well but it really feels like this:
>
>       prctl_map = (struct prctl_mm_map){
>           .start_code = start_code,
>           .end_code = end_code,
>           .start_stack = start_stack,
>           .start_data = start_data,
>           .end_data = end_data,
>           .start_brk = start_brk,
>           .brk = brk_val,
>           .arg_start = arg_start,
>           .arg_end = arg_end,
>           .env_start = env_start,
>           .env_end = env_end,
>           .auxv = NULL,
>           .auxv_size = 0,
>           .exe_fd = -1,
>       };
>
> should belong under ns_capable(CAP_SYS_ADMIN). Why is that necessary to relax?

With prctl(PR_SET_MM, PR_SET_MM_MAP, ...), the only privileged operation is
changing the /proc/self/exe symlink via set_mm_exe_file().
See 
https://github.com/torvalds/linux/blob/444fc5cde64330661bf59944c43844e7d4c2ccd8/kernel/sys.c#L2001-L2004
That operation requires CAP_SYS_ADMIN in the current user namespace.

I would argue that the check for setting the current process's exe file
should be reduced to a "can you ptrace a child" check.
Here's why: any process can already masquerade as another executable with
ptrace.
One can fork a child, ptrace it, have the child execve("target_exe"), then 
replace its memory content with an arbitrary program.
With CRIU's libcompel parasite mechanism (https://criu.org/Compel) this is 
fairly easy to implement.
In fact, we could modify CRIU to do just that (with a fair amount of effort,
given the way CRIU is written) instead of relying on PR_SET_MM_EXE_FILE via
prctl(). That, in turn, would give any process an easy way to masquerade as
another one, provided that it can ptrace a child.
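
To illustrate the first half of that trick, here is a minimal sketch (not
CRIU or compel code). The function name and the "/bin/sh" target in the usage
note are hypothetical stand-ins for "target_exe". It forks a traced child,
lets it execve() the target, and reads /proc/<pid>/exe while the child is
stopped at the post-exec SIGTRAP. The memory replacement step is left out.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a traced child that execs `target`, then copy the destination of
 * /proc/<pid>/exe into `buf` once the child has stopped after the exec.
 * Returns the link length on success, -1 on failure.
 */
ssize_t exe_link_of_traced_child(const char *target, char *buf, size_t len)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: ask to be traced, then become the target executable. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        execl(target, target, (char *)NULL);
        _exit(127); /* only reached if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    ssize_t n = -1;
    if (WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP) {
        /* The exec succeeded; the exe link now names the target. */
        char link[64];
        snprintf(link, sizeof(link), "/proc/%d/exe", (int)pid);
        n = readlink(link, buf, len - 1);
        if (n >= 0)
            buf[n] = '\0';
        /*
         * A real tracer would rewrite the child's memory here, e.g. with
         * PTRACE_POKEDATA or a compel parasite, and then resume it.
         */
    }

    /* This sketch just tears the child down if it is still alive. */
    if (!WIFEXITED(status) && !WIFSIGNALED(status)) {
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
    }
    return n;
}
```

Calling exe_link_of_traced_child("/bin/sh", buf, sizeof(buf)) from an
unprivileged process should fill buf with the resolved path of the shell,
which is what any observer of /proc/<pid>/exe would see afterwards, whatever
code ends up running in that child.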

When using PR_SET_MM_EXE_FILE instead of PR_SET_MM_MAP, CAP_SYS_RESOURCE in
the initial user namespace is required:
https://github.com/torvalds/linux/blob/444fc5cde64330661bf59944c43844e7d4c2ccd8/kernel/sys.c#L2109
This seems inconsistent. Also, for some reason, changing auxv is not
privileged when done through PR_SET_MM_MAP, but is privileged otherwise.
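
For comparison, the standalone path can be exercised with a few lines. This
is only a sketch; try_set_exe_file() is a hypothetical helper name, not
something from CRIU or the kernel. It opens a file and asks the kernel to
make it the exe link through PR_SET_MM_EXE_FILE, returning whatever errno
comes back.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/prctl.h>
#include <unistd.h>

/*
 * Try to repoint /proc/self/exe at `path` through the standalone
 * PR_SET_MM_EXE_FILE option (not PR_SET_MM_MAP).
 * Returns 0 on success, otherwise the errno reported by open() or prctl().
 */
int try_set_exe_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return errno;

    int rc = 0;
    if (prctl(PR_SET_MM, PR_SET_MM_EXE_FILE, (unsigned long)fd, 0UL, 0UL) == -1)
        rc = errno; /* saved before close() can overwrite it */

    close(fd);
    return rc;
}
```

On a kernel matching the code linked above, I would expect a caller without
CAP_SYS_RESOURCE to get EPERM back before the file descriptor is even looked
at. The result for a privileged caller depends on the kernel version and on
the state of the process.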
