On Tue, 12 Nov 2013 17:02:36 +0400
Stanislav Kinsbursky <[email protected]> wrote:
12.11.2013 15:12, Jeff Layton wrote:
On Mon, 11 Nov 2013 16:47:03 -0800
Greg KH <[email protected]> wrote:
On Mon, Nov 11, 2013 at 07:18:25AM -0500, Jeff Layton wrote:
We have a bit of a problem wrt upcalls that use call_usermodehelper
with containers and I'd like to bring this to some sort of resolution...
A particularly problematic case (though there are others) is the
nfsdcltrack upcall. It basically uses call_usermodehelper to run a
program in userland to track some information on stable storage for
nfsd.
I thought the discussion at the kernel summit about this issue was:
- don't do this.
- don't do it.
- if you really need to do this, fix nfsd
Sorry, I couldn't make the kernel summit so I missed that discussion. I
guess LWN didn't cover it?
In any case, I guess then that we'll either have to come up with some
way to fix nfsd here, or simply ensure that nfsd can never be started
unless root in the container has a full set of capabilities.
One sort of Rube Goldberg possibility to fix nfsd is:
- when we start nfsd in a container, fork off an extra kernel thread
  that just sits idle. That thread would need to be a descendant of the
  userland process that started nfsd, so we'd need to create it with
  kernel_thread().
- Have the kernel just start up the UMH program in the init_ns mount
  namespace as it currently does, but also pass the pid of the idle
  kernel thread to the UMH upcall.
- The program will then use /proc/<pid>/root and /proc/<pid>/ns/* to set
  itself up for doing things properly.
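For illustration, the last step above (the UMH program using
/proc/<pid>/root and /proc/<pid>/ns/* to set itself up) might look
roughly like the sketch below. It is not taken from nfsdcltrack or any
existing helper: the function names proc_path() and enter_container()
are hypothetical, and so is the choice of which namespaces to join. It
assumes the helper is single-threaded, runs with CAP_SYS_ADMIN and
CAP_SYS_CHROOT, and receives the pid of the idle kernel thread from the
kernel.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Build "/proc/<pid>/<leaf>" in buf. Returns 0, or -1 if it does not fit. */
static int proc_path(char *buf, size_t len, pid_t pid, const char *leaf)
{
    assert(buf != NULL && leaf != NULL);
    int n = snprintf(buf, len, "/proc/%ld/%s", (long)pid, leaf);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Join some namespaces and the root directory of task `pid`.
 * Hypothetical helper: the namespace list is an assumption.
 * Returns 0 on success, -1 on the first failure.
 */
static int enter_container(pid_t pid)
{
    static const char *const leaves[] = { "ns/ipc", "ns/uts", "ns/net", "ns/mnt" };
    enum { NLEAVES = sizeof(leaves) / sizeof(leaves[0]) };
    int ns_fd[NLEAVES];
    int root_fd = -1, ret = -1, opened = 0;
    char path[64];

    /* Open every fd first, while /proc still resolves in the caller's namespaces. */
    for (; opened < NLEAVES; opened++) {
        if (proc_path(path, sizeof(path), pid, leaves[opened]) < 0)
            goto out;
        ns_fd[opened] = open(path, O_RDONLY | O_CLOEXEC);
        if (ns_fd[opened] < 0)
            goto out;
    }
    if (proc_path(path, sizeof(path), pid, "root") < 0)
        goto out;
    root_fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (root_fd < 0)
        goto out;

    /* nstype 0 means "accept whatever namespace type the fd refers to". */
    for (int i = 0; i < NLEAVES; i++)
        if (setns(ns_fd[i], 0) < 0)
            goto out;

    /* Switch to the target task's root directory via the fd opened earlier. */
    if (fchdir(root_fd) < 0 || chroot(".") < 0)
        goto out;
    ret = 0;
out:
    if (root_fd >= 0)
        close(root_fd);
    while (opened-- > 0)
        close(ns_fd[opened]);
    return ret;
}
```

A helper built this way would call enter_container() with the pid it
was handed before touching stable storage. The pid and user namespaces
are deliberately left out of this sketch; joining a pid namespace only
affects children created afterwards, so it would need an extra fork.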
Note that with this mechanism we can't actually run a different binary
per container, but that's probably fine for most purposes.
Hmmm... Why can't we? We can go a bit further with the userspace idea.
We use UMH for a very limited number of user programs. Two, actually:
1) /sbin/nfs_cache_getent
2) /sbin/nfsdcltrack
No, the kernel uses them for a lot more than that. Pretty much all of
the keys API upcalls use it. See all of the callers of
call_usermodehelper. All of them are running user binaries out of the
kernel, and almost all of them are certainly broken wrt containers.
If we convert them into proxies, which use /proc/<pid>/root
and /proc/<pid>/ns/*, this will allow us to look up the right binary.
The only limitation here is the presence of these "proxy" binaries on
the "host".
Suppose I spawn my own container as a user, using all of this spiffy
new user namespace stuff. Then I make the kernel use
call_usermodehelper to call the upcall in the init_ns, and then trick
it into running my new "escape_from_namespace" program with "real" root
privileges.
I don't think we can reasonably assume that having the kernel exec an
arbitrary binary inside of a container is safe. Doing so inside of the
init_ns is marginally more safe, but only marginally so...
And we don't need any significant changes in the kernel.
BTW, Jeff, could you remind me, please, why exactly we need to
use UMH to run the binary?
What are these capabilities that force us to do so?
Nothing _forces_ us to do so, but upcalls are very difficult to handle,
and UMH has a lot of advantages over a long-running daemon launched by
userland.
Originally, I created the nfsdcltrack upcall as a running daemon called
nfsdcld, and the kernel used rpc_pipefs to communicate with it.
Everyone hated it because no one likes to have to run daemons for
infrequently used upcalls. It's a pain for users to ensure that it's
running and it's a pain to handle when it isn't. So, I was encouraged
to turn that instead into a UMH upcall.
But leaving that aside, this problem is a lot larger than just nfsd. We
have a *lot* of UMH upcalls in the kernel, so this problem is more
general than just "fixing" nfsd's.