Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

2015-08-07 Thread Seth Forshee
On Fri, Aug 07, 2015 at 11:35:31AM -0700, Casey Schaufler wrote:
> On 8/7/2015 7:32 AM, Seth Forshee wrote:
> > On Thu, Aug 06, 2015 at 09:20:29AM -0500, Seth Forshee wrote:
> >> On Wed, Aug 05, 2015 at 04:19:03PM -0500, Eric W. Biederman wrote:
> >>> Seth Forshee  writes:
> >>>
> >>>> On Wed, Jul 15, 2015 at 09:47:11PM -0500, Eric W. Biederman wrote:
> >>>>> Seth Forshee  writes:
> >>>>>
> >>>>>> Initially this will be used to eliminate the implicit MNT_NODEV
> >>>>>> flag for mounts from user namespaces. In the future it will also
> >>>>>> be used for translating ids and checking capabilities for
> >>>>>> filesystems mounted from user namespaces.
> >>>>>>
> >>>>>> s_user_ns is initialized in alloc_super() and is generally set to
> >>>>>> current_user_ns(). To avoid security and corruption issues, two
> >>>>>> additional mount checks are also added:
> >>>>>>
> >>>>>>  - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
> >>>>>>    in current_user_ns().
> >>>>>>
> >>>>>>  - sget() will fail with EBUSY when the filesystem it's looking
> >>>>>>    for is already mounted from another user namespace.
> >>>>>>
> >>>>>> proc needs some special handling here. The user namespace of
> >>>>>> current isn't appropriate when forking as a result of clone (2)
> >>>>>> with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
> >>>>>> from within the new user namespace. Instead, the user namespace
> >>>>>> which owns the new pid namespace should be used. sget_userns() is
> >>>>>> added to allow passing of a user namespace other than that of
> >>>>>> current, and this is used by proc_mount(). sget() becomes a
> >>>>>> wrapper around sget_userns() which passes current_user_ns().
> >>>>> From bits of the previous conversation.
> >>>>>
> >>>>> We need sget_userns(..., &init_user_ns) for sysfs.  The sysfs
> >>>>> xattrs can travel from one mount of sysfs to another via the sysfs
> >>>>> backing store.
> >>>>>
> >>>>> For tmpfs and any other filesystems we support mounting without
> >>>>> privilege that support xattrs.  We need to identify them and
> >>>>> see if userspace is taking advantage of the ability to set
> >>>>> xattrs and file caps (unlikely).  If they are we need to call
> >>>>> sget_userns(..., &init_user_ns) on those filesystems as well.
> >>>>>
> >>>>> Possibly/Probably we should just do that for all of the interesting
> >>>>> filesystems to start with and then change back to an ordinary old sget
> >>>>> after we have done the testing and confirmed we will not be introducing
> >>>>> userspace regressions.
> >>>> I was reviewing everything in preparation for sending v2 patches, and I
> >>>> realized that doing this has an undesirable side effect. In patch 2 the
> >>>> implicit nodev is removed for unprivileged mounts, and instead s_user_ns
> >>>> is used to block opening devices in these mounts. When we set s_user_ns
> >>>> to &init_user_ns, it becomes possible to open device nodes from
> >>>> unprivileged mounts of these filesystems.
> >>>>
> >>>> This doesn't pose a real problem today. The only filesystems it will
> >>>> affect are sysfs, tmpfs, and ramfs (no others need s_user_ns =
> >>>> &init_user_ns for user namespace mounts), and all of these aren't
> >>>> problems. sysfs is okay because kernfs doesn't (currently?) allow device
> >>>> nodes, and a user would require CAP_MKNOD to create any device nodes in
> >>>> a tmpfs or ramfs mount.
> >>>>
> >>>> But for sysfs in particular it does mean that we will need to make sure
> >>>> that there's no way that device nodes could start appearing in an
> >>>> unprivileged mount.
> >>> Good point about nodev.  
> >>>
> >>> For tmpfs and ramfs and security labels, the Smack policy of allowing but
> >>> filtering security labels means Smack, once it has those bits, will not
> >>> care which user namespace ramfs and tmpfs live in.  The labels should
> >>> pretty much stay the same in any case.
> >> Smack does care which namespace ramfs and tmpfs are in. With the patch
> >> I've got right now, if s_user_ns != &init_user_ns and the label of an
> >> inode does not match that of the root inode then
> >> security_inode_permission() will return EACCES.
> >>
> >> So if something with CAP_MAC_ADMIN is changing security labels in such a
> >> mount, suddenly those inodes might become inaccessible. And while it may
> >> be unlikely that anyone is doing this it's impossible for me to prove
> >> that's the case.
> >>
> >>> If the same class of handling will also apply to selinux and those are
> >>> the only two security modules that apply labels then we can leave tmpfs
> >>> and ramfs with the security labels of whomever mounted them.
> >> For SELinux I now have a patch which applies mountpoint labeling to
> >> mounts for which s_user_ns != &init_user_ns. I'm less sure than with
> >> Smack how this behavior will differ from what happens today, but my
> >> understanding is that this means that the label of the mountpoint is
> >> used for all objects from that superblock.
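
For readers following the thread, the sget()/sget_userns() relationship
described in the quoted patch summary can be modeled in user space roughly
as below. This is only a sketch under stated assumptions, not the kernel
code: the struct layouts, the fixed-size table, the `key` pointer standing
in for the real test callback, the `err` out-parameter, and the
`current_ns` variable are all hypothetical simplifications made for
illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-ins; these are not the kernel definitions. */
struct user_namespace {
    const char *name;
};

struct super_block {
    const char *fs_type;               /* filesystem type name */
    const void *key;                   /* stands in for sget()'s test callback */
    struct user_namespace *s_user_ns;  /* namespace the mount came from */
    int in_use;
};

struct user_namespace init_user_ns = { "init" };
struct user_namespace *current_ns = &init_user_ns;  /* the "current" task's ns */

#define MAX_SB 8
static struct super_block sb_table[MAX_SB];

/* Models current_user_ns(): the user namespace of the calling task. */
struct user_namespace *current_user_ns(void)
{
    return current_ns;
}

/*
 * Find or create a superblock on behalf of user_ns. If a matching
 * superblock exists but belongs to a different user namespace, fail
 * with EBUSY, as the patch summary describes.
 */
struct super_block *sget_userns(const char *fs_type, const void *key,
                                struct user_namespace *user_ns, int *err)
{
    struct super_block *free_slot = NULL;
    size_t i;

    assert(user_ns != NULL && err != NULL);
    *err = 0;
    for (i = 0; i < MAX_SB; i++) {
        struct super_block *sb = &sb_table[i];

        if (!sb->in_use) {
            if (free_slot == NULL)
                free_slot = sb;
            continue;
        }
        if (strcmp(sb->fs_type, fs_type) != 0 || sb->key != key)
            continue;
        if (sb->s_user_ns != user_ns) {
            *err = EBUSY;  /* already mounted from another namespace */
            return NULL;
        }
        return sb;         /* reuse the existing superblock */
    }
    if (free_slot == NULL) {
        *err = ENOMEM;     /* table full in this toy model */
        return NULL;
    }
    free_slot->fs_type = fs_type;
    free_slot->key = key;
    free_slot->s_user_ns = user_ns;
    free_slot->in_use = 1;
    return free_slot;
}

/* sget() as a thin wrapper that passes the caller's own namespace. */
struct super_block *sget(const char *fs_type, const void *key, int *err)
{
    return sget_userns(fs_type, key, current_user_ns(), err);
}
```

In this model, a second sget() for the same filesystem instance from a
different namespace fails with EBUSY, while a caller such as sysfs could
pass &init_user_ns explicitly through sget_userns(), which is the variant
Eric proposes in the quoted text.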

Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

2015-08-07 Thread Casey Schaufler
On 8/7/2015 7:32 AM, Seth Forshee wrote:
> On Thu, Aug 06, 2015 at 09:20:29AM -0500, Seth Forshee wrote:
>> On Wed, Aug 05, 2015 at 04:19:03PM -0500, Eric W. Biederman wrote:
>>> Seth Forshee  writes:
>>>
>>>> On Wed, Jul 15, 2015 at 09:47:11PM -0500, Eric W. Biederman wrote:
>>>>> Seth Forshee  writes:
>>>>>
>>>>>> Initially this will be used to eliminate the implicit MNT_NODEV
>>>>>> flag for mounts from user namespaces. In the future it will also
>>>>>> be used for translating ids and checking capabilities for
>>>>>> filesystems mounted from user namespaces.
>>>>>>
>>>>>> s_user_ns is initialized in alloc_super() and is generally set to
>>>>>> current_user_ns(). To avoid security and corruption issues, two
>>>>>> additional mount checks are also added:
>>>>>>
>>>>>>  - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
>>>>>>    in current_user_ns().
>>>>>>
>>>>>>  - sget() will fail with EBUSY when the filesystem it's looking
>>>>>>    for is already mounted from another user namespace.
>>>>>>
>>>>>> proc needs some special handling here. The user namespace of
>>>>>> current isn't appropriate when forking as a result of clone (2)
>>>>>> with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
>>>>>> from within the new user namespace. Instead, the user namespace
>>>>>> which owns the new pid namespace should be used. sget_userns() is
>>>>>> added to allow passing of a user namespace other than that of
>>>>>> current, and this is used by proc_mount(). sget() becomes a
>>>>>> wrapper around sget_userns() which passes current_user_ns().
>>>>> From bits of the previous conversation.
>>>>>
>>>>> We need sget_userns(..., &init_user_ns) for sysfs.  The sysfs
>>>>> xattrs can travel from one mount of sysfs to another via the sysfs
>>>>> backing store.
>>>>>
>>>>> For tmpfs and any other filesystems we support mounting without
>>>>> privilege that support xattrs.  We need to identify them and
>>>>> see if userspace is taking advantage of the ability to set
>>>>> xattrs and file caps (unlikely).  If they are we need to call
>>>>> sget_userns(..., &init_user_ns) on those filesystems as well.
>>>>>
>>>>> Possibly/Probably we should just do that for all of the interesting
>>>>> filesystems to start with and then change back to an ordinary old sget
>>>>> after we have done the testing and confirmed we will not be introducing
>>>>> userspace regressions.
>>>> I was reviewing everything in preparation for sending v2 patches, and I
>>>> realized that doing this has an undesirable side effect. In patch 2 the
>>>> implicit nodev is removed for unprivileged mounts, and instead s_user_ns
>>>> is used to block opening devices in these mounts. When we set s_user_ns
>>>> to &init_user_ns, it becomes possible to open device nodes from
>>>> unprivileged mounts of these filesystems.
>>>>
>>>> This doesn't pose a real problem today. The only filesystems it will
>>>> affect are sysfs, tmpfs, and ramfs (no others need s_user_ns =
>>>> &init_user_ns for user namespace mounts), and all of these aren't
>>>> problems. sysfs is okay because kernfs doesn't (currently?) allow device
>>>> nodes, and a user would require CAP_MKNOD to create any device nodes in
>>>> a tmpfs or ramfs mount.
>>>>
>>>> But for sysfs in particular it does mean that we will need to make sure
>>>> that there's no way that device nodes could start appearing in an
>>>> unprivileged mount.
>>> Good point about nodev.  
>>>
>>> For tmpfs and ramfs and security labels, the Smack policy of allowing but
>>> filtering security labels means Smack, once it has those bits, will not
>>> care which user namespace ramfs and tmpfs live in.  The labels should
>>> pretty much stay the same in any case.
>> Smack does care which namespace ramfs and tmpfs are in. With the patch
>> I've got right now, if s_user_ns != &init_user_ns and the label of an
>> inode does not match that of the root inode then
>> security_inode_permission() will return EACCES.
>>
>> So if something with CAP_MAC_ADMIN is changing security labels in such a
>> mount, suddenly those inodes might become inaccessible. And while it may
>> be unlikely that anyone is doing this it's impossible for me to prove
>> that's the case.
>>
>>> If the same class of handling will also apply to selinux and those are
>>> the only two security modules that apply labels then we can leave tmpfs
>>> and ramfs with the security labels of whomever mounted them.
>> For SELinux I now have a patch which applies mountpoint labeling to
>> mounts for which s_user_ns != &init_user_ns. I'm less sure than with
>> Smack how this behavior will differ from what happens today, but my
>> understanding is that this means that the label of the mountpoint is
>> used for all objects from that superblock. Afaik it does not have the
>> Smack behavior of denying access to filesystem objects which have a
>> different label in the backing store.
>>
>>> For sysfs things get a little more interesting.  Assuming tmpfs and
>>> ramfs don't need s_use

Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

2015-08-07 Thread Seth Forshee
On Thu, Aug 06, 2015 at 09:20:29AM -0500, Seth Forshee wrote:
> On Wed, Aug 05, 2015 at 04:19:03PM -0500, Eric W. Biederman wrote:
> > Seth Forshee  writes:
> > 
> > > On Wed, Jul 15, 2015 at 09:47:11PM -0500, Eric W. Biederman wrote:
> > >> Seth Forshee  writes:
> > >> 
> > >> > Initially this will be used to eliminate the implicit MNT_NODEV
> > >> > flag for mounts from user namespaces. In the future it will also
> > >> > be used for translating ids and checking capabilities for
> > >> > filesystems mounted from user namespaces.
> > >> >
> > >> > s_user_ns is initialized in alloc_super() and is generally set to
> > >> > current_user_ns(). To avoid security and corruption issues, two
> > >> > additional mount checks are also added:
> > >> >
> > >> >  - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
> > >> >    in current_user_ns().
> > >> >
> > >> >  - sget() will fail with EBUSY when the filesystem it's looking
> > >> >    for is already mounted from another user namespace.
> > >> >
> > >> > proc needs some special handling here. The user namespace of
> > >> > current isn't appropriate when forking as a result of clone (2)
> > >> > with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
> > >> > from within the new user namespace. Instead, the user namespace
> > >> > which owns the new pid namespace should be used. sget_userns() is
> > >> > added to allow passing of a user namespace other than that of
> > >> > current, and this is used by proc_mount(). sget() becomes a
> > >> > wrapper around sget_userns() which passes current_user_ns().
> > >> 
> > >> From bits of the previous conversation.
> > >> 
> > >> We need sget_userns(..., &init_user_ns) for sysfs.  The sysfs
> > >> xattrs can travel from one mount of sysfs to another via the sysfs
> > >> backing store.
> > >> 
> > >> For tmpfs and any other filesystems we support mounting without
> > >> privilege that support xattrs.  We need to identify them and
> > >> see if userspace is taking advantage of the ability to set
> > >> xattrs and file caps (unlikely).  If they are we need to call
> > >> sget_userns(..., &init_user_ns) on those filesystems as well.
> > >> 
> > >> Possibly/Probably we should just do that for all of the interesting
> > >> filesystems to start with and then change back to an ordinary old sget
> > >> after we have done the testing and confirmed we will not be introducing
> > >> userspace regressions.
> > >
> > > I was reviewing everything in preparation for sending v2 patches, and I
> > > realized that doing this has an undesirable side effect. In patch 2 the
> > > implicit nodev is removed for unprivileged mounts, and instead s_user_ns
> > > is used to block opening devices in these mounts. When we set s_user_ns
> > > to &init_user_ns, it becomes possible to open device nodes from
> > > unprivileged mounts of these filesystems.
> > >
> > > This doesn't pose a real problem today. The only filesystems it will
> > > affect are sysfs, tmpfs, and ramfs (no others need s_user_ns =
> > > &init_user_ns for user namespace mounts), and all of these aren't
> > > problems. sysfs is okay because kernfs doesn't (currently?) allow device
> > > nodes, and a user would require CAP_MKNOD to create any device nodes in
> > > a tmpfs or ramfs mount.
> > >
> > > But for sysfs in particular it does mean that we will need to make sure
> > > that there's no way that device nodes could start appearing in an
> > > unprivileged mount.
> > 
> > Good point about nodev.  
> > 
> > For tmpfs and ramfs and security labels, the Smack policy of allowing but
> > filtering security labels means Smack, once it has those bits, will not
> > care which user namespace ramfs and tmpfs live in.  The labels should
> > pretty much stay the same in any case.
> 
> Smack does care which namespace ramfs and tmpfs are in. With the patch
> I've got right now, if s_user_ns != &init_user_ns and the label of an
> inode does not match that of the root inode then
> security_inode_permission() will return EACCES.
> 
> So if something with CAP_MAC_ADMIN is changing security labels in such a
> mount, suddenly those inodes might become inaccessible. And while it may
> be unlikely that anyone is doing this it's impossible for me to prove
> that's the case.
> 
> > If the same class of handling will also apply to selinux and those are
> > the only two security modules that apply labels then we can leave tmpfs
> > and ramfs with the security labels of whomever mounted them.
> 
> For SELinux I now have a patch which applies mountpoint labeling to
> mounts for which s_user_ns != &init_user_ns. I'm less sure than with
> Smack how this behavior will differ from what happens today, but my
> understanding is that this means that the label of the mountpoint is
> used for all objects from that superblock. Afaik it does not have the
> Smack behavior of denying access to filesystem objects which have a
> different label in the backing store.
> 
> > For sysfs things get a little more inter

Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

2015-08-07 Thread Seth Forshee
On Thu, Aug 06, 2015 at 12:11:53PM -0400, Stephen Smalley wrote:
> On 08/06/2015 11:44 AM, Seth Forshee wrote:
> > On Thu, Aug 06, 2015 at 10:51:16AM -0400, Stephen Smalley wrote:
> >> On 08/06/2015 10:20 AM, Seth Forshee wrote:
> >>> On Wed, Aug 05, 2015 at 04:19:03PM -0500, Eric W. Biederman wrote:
> >>>> Seth Forshee  writes:
> >>>>
> >>>>> On Wed, Jul 15, 2015 at 09:47:11PM -0500, Eric W. Biederman wrote:
> >>>>>> Seth Forshee  writes:
> >>>>>>
> >>>>>>> Initially this will be used to eliminate the implicit MNT_NODEV
> >>>>>>> flag for mounts from user namespaces. In the future it will also
> >>>>>>> be used for translating ids and checking capabilities for
> >>>>>>> filesystems mounted from user namespaces.
> >>>>>>>
> >>>>>>> s_user_ns is initialized in alloc_super() and is generally set to
> >>>>>>> current_user_ns(). To avoid security and corruption issues, two
> >>>>>>> additional mount checks are also added:
> >>>>>>>
> >>>>>>>  - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
> >>>>>>>    in current_user_ns().
> >>>>>>>
> >>>>>>>  - sget() will fail with EBUSY when the filesystem it's looking
> >>>>>>>    for is already mounted from another user namespace.
> >>>>>>>
> >>>>>>> proc needs some special handling here. The user namespace of
> >>>>>>> current isn't appropriate when forking as a result of clone (2)
> >>>>>>> with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
> >>>>>>> from within the new user namespace. Instead, the user namespace
> >>>>>>> which owns the new pid namespace should be used. sget_userns() is
> >>>>>>> added to allow passing of a user namespace other than that of
> >>>>>>> current, and this is used by proc_mount(). sget() becomes a
> >>>>>>> wrapper around sget_userns() which passes current_user_ns().
> >>>>>>
> >>>>>> From bits of the previous conversation.
> >>>>>>
> >>>>>> We need sget_userns(..., &init_user_ns) for sysfs.  The sysfs
> >>>>>> xattrs can travel from one mount of sysfs to another via the sysfs
> >>>>>> backing store.
> >>>>>>
> >>>>>> For tmpfs and any other filesystems we support mounting without
> >>>>>> privilege that support xattrs.  We need to identify them and
> >>>>>> see if userspace is taking advantage of the ability to set
> >>>>>> xattrs and file caps (unlikely).  If they are we need to call
> >>>>>> sget_userns(..., &init_user_ns) on those filesystems as well.
> >>>>>>
> >>>>>> Possibly/Probably we should just do that for all of the interesting
> >>>>>> filesystems to start with and then change back to an ordinary old sget
> >>>>>> after we have done the testing and confirmed we will not be introducing
> >>>>>> userspace regressions.
> >>>>>
> >>>>> I was reviewing everything in preparation for sending v2 patches, and I
> >>>>> realized that doing this has an undesirable side effect. In patch 2 the
> >>>>> implicit nodev is removed for unprivileged mounts, and instead s_user_ns
> >>>>> is used to block opening devices in these mounts. When we set s_user_ns
> >>>>> to &init_user_ns, it becomes possible to open device nodes from
> >>>>> unprivileged mounts of these filesystems.
> >>>>>
> >>>>> This doesn't pose a real problem today. The only filesystems it will
> >>>>> affect are sysfs, tmpfs, and ramfs (no others need s_user_ns =
> >>>>> &init_user_ns for user namespace mounts), and all of these aren't
> >>>>> problems. sysfs is okay because kernfs doesn't (currently?) allow device
> >>>>> nodes, and a user would require CAP_MKNOD to create any device nodes in
> >>>>> a tmpfs or ramfs mount.
> >>>>>
> >>>>> But for sysfs in particular it does mean that we will need to make sure
> >>>>> that there's no way that device nodes could start appearing in an
> >>>>> unprivileged mount.
> >>>>
> >>>> Good point about nodev.
> >>>>
> >>>> For tmpfs and ramfs and security labels, the Smack policy of allowing but
> >>>> filtering security labels means Smack, once it has those bits, will not
> >>>> care which user namespace ramfs and tmpfs live in.  The labels should
> >>>> pretty much stay the same in any case.
> >>>
> >>> Smack does care which namespace ramfs and tmpfs are in. With the patch
> >>> I've got right now, if s_user_ns != &init_user_ns and the label of an
> >>> inode does not match that of the root inode then
> >>> security_inode_permission() will return EACCES.
> >>>
> >>> So if something with CAP_MAC_ADMIN is changing security labels in such a
> >>> mount, suddenly those inodes might become inaccessible. And while it may
> >>> be unlikely that anyone is doing this it's impossible for me to prove
> >>> that's the case.
> >>>
> >>>> If the same class of handling will also apply to selinux and those are
> >>>> the only two security modules that apply labels then we can leave tmpfs
> >>>> and ramfs with the security labels of whomever mounted them.
> >>>
> >>> For SELinux I now have a patch which applies mountpoint labeling to
> >>> mounts for which s_user_ns != &init_user_ns. I'm less sure than with
> >>> Smack how this

Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

2015-08-06 Thread Stephen Smalley
On 08/06/2015 11:44 AM, Seth Forshee wrote:
> On Thu, Aug 06, 2015 at 10:51:16AM -0400, Stephen Smalley wrote:
>> On 08/06/2015 10:20 AM, Seth Forshee wrote:
>>> On Wed, Aug 05, 2015 at 04:19:03PM -0500, Eric W. Biederman wrote:
>>>> Seth Forshee  writes:
>>>>
>>>>> On Wed, Jul 15, 2015 at 09:47:11PM -0500, Eric W. Biederman wrote:
>>>>>> Seth Forshee  writes:
>>>>>>
>>>>>>> Initially this will be used to eliminate the implicit MNT_NODEV
>>>>>>> flag for mounts from user namespaces. In the future it will also
>>>>>>> be used for translating ids and checking capabilities for
>>>>>>> filesystems mounted from user namespaces.
>>>>>>>
>>>>>>> s_user_ns is initialized in alloc_super() and is generally set to
>>>>>>> current_user_ns(). To avoid security and corruption issues, two
>>>>>>> additional mount checks are also added:
>>>>>>>
>>>>>>>  - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
>>>>>>>    in current_user_ns().
>>>>>>>
>>>>>>>  - sget() will fail with EBUSY when the filesystem it's looking
>>>>>>>    for is already mounted from another user namespace.
>>>>>>>
>>>>>>> proc needs some special handling here. The user namespace of
>>>>>>> current isn't appropriate when forking as a result of clone (2)
>>>>>>> with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
>>>>>>> from within the new user namespace. Instead, the user namespace
>>>>>>> which owns the new pid namespace should be used. sget_userns() is
>>>>>>> added to allow passing of a user namespace other than that of
>>>>>>> current, and this is used by proc_mount(). sget() becomes a
>>>>>>> wrapper around sget_userns() which passes current_user_ns().
>>>>>>
>>>>>> From bits of the previous conversation.
>>>>>>
>>>>>> We need sget_userns(..., &init_user_ns) for sysfs.  The sysfs
>>>>>> xattrs can travel from one mount of sysfs to another via the sysfs
>>>>>> backing store.
>>>>>>
>>>>>> For tmpfs and any other filesystems we support mounting without
>>>>>> privilege that support xattrs.  We need to identify them and
>>>>>> see if userspace is taking advantage of the ability to set
>>>>>> xattrs and file caps (unlikely).  If they are we need to call
>>>>>> sget_userns(..., &init_user_ns) on those filesystems as well.
>>>>>>
>>>>>> Possibly/Probably we should just do that for all of the interesting
>>>>>> filesystems to start with and then change back to an ordinary old sget
>>>>>> after we have done the testing and confirmed we will not be introducing
>>>>>> userspace regressions.
>>>>>
>>>>> I was reviewing everything in preparation for sending v2 patches, and I
>>>>> realized that doing this has an undesirable side effect. In patch 2 the
>>>>> implicit nodev is removed for unprivileged mounts, and instead s_user_ns
>>>>> is used to block opening devices in these mounts. When we set s_user_ns
>>>>> to &init_user_ns, it becomes possible to open device nodes from
>>>>> unprivileged mounts of these filesystems.
>>>>>
>>>>> This doesn't pose a real problem today. The only filesystems it will
>>>>> affect are sysfs, tmpfs, and ramfs (no others need s_user_ns =
>>>>> &init_user_ns for user namespace mounts), and all of these aren't
>>>>> problems. sysfs is okay because kernfs doesn't (currently?) allow device
>>>>> nodes, and a user would require CAP_MKNOD to create any device nodes in
>>>>> a tmpfs or ramfs mount.
>>>>>
>>>>> But for sysfs in particular it does mean that we will need to make sure
>>>>> that there's no way that device nodes could start appearing in an
>>>>> unprivileged mount.
>>>>
>>>> Good point about nodev.
>>>>
>>>> For tmpfs and ramfs and security labels, the Smack policy of allowing but
>>>> filtering security labels means Smack, once it has those bits, will not
>>>> care which user namespace ramfs and tmpfs live in.  The labels should
>>>> pretty much stay the same in any case.
>>>
>>> Smack does care which namespace ramfs and tmpfs are in. With the patch
>>> I've got right now, if s_user_ns != &init_user_ns and the label of an
>>> inode does not match that of the root inode then
>>> security_inode_permission() will return EACCES.
>>>
>>> So if something with CAP_MAC_ADMIN is changing security labels in such a
>>> mount, suddenly those inodes might become inaccessible. And while it may
>>> be unlikely that anyone is doing this it's impossible for me to prove
>>> that's the case.
>>>
>>>> If the same class of handling will also apply to selinux and those are
>>>> the only two security modules that apply labels then we can leave tmpfs
>>>> and ramfs with the security labels of whomever mounted them.
>>>
>>> For SELinux I now have a patch which applies mountpoint labeling to
>>> mounts for which s_user_ns != &init_user_ns. I'm less sure than with
>>> Smack how this behavior will differ from what happens today, but my
>>> understanding is that this means that the label of the mountpoint is
>>> used for all objects from that superblock. Afaik it does not have the
>>> Smack behavior of denying access to filesyst

Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

2015-08-06 Thread Seth Forshee
On Thu, Aug 06, 2015 at 10:51:16AM -0400, Stephen Smalley wrote:
> On 08/06/2015 10:20 AM, Seth Forshee wrote:
> > On Wed, Aug 05, 2015 at 04:19:03PM -0500, Eric W. Biederman wrote:
> >> Seth Forshee  writes:
> >>
> >>> On Wed, Jul 15, 2015 at 09:47:11PM -0500, Eric W. Biederman wrote:
> >>>> Seth Forshee  writes:
> >>>>
> >>>>> Initially this will be used to eliminate the implicit MNT_NODEV
> >>>>> flag for mounts from user namespaces. In the future it will also
> >>>>> be used for translating ids and checking capabilities for
> >>>>> filesystems mounted from user namespaces.
> >>>>>
> >>>>> s_user_ns is initialized in alloc_super() and is generally set to
> >>>>> current_user_ns(). To avoid security and corruption issues, two
> >>>>> additional mount checks are also added:
> >>>>>
> >>>>>  - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
> >>>>>    in current_user_ns().
> >>>>>
> >>>>>  - sget() will fail with EBUSY when the filesystem it's looking
> >>>>>    for is already mounted from another user namespace.
> >>>>>
> >>>>> proc needs some special handling here. The user namespace of
> >>>>> current isn't appropriate when forking as a result of clone (2)
> >>>>> with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
> >>>>> from within the new user namespace. Instead, the user namespace
> >>>>> which owns the new pid namespace should be used. sget_userns() is
> >>>>> added to allow passing of a user namespace other than that of
> >>>>> current, and this is used by proc_mount(). sget() becomes a
> >>>>> wrapper around sget_userns() which passes current_user_ns().
> >>>>
> >>>> From bits of the previous conversation.
> >>>>
> >>>> We need sget_userns(..., &init_user_ns) for sysfs.  The sysfs
> >>>> xattrs can travel from one mount of sysfs to another via the sysfs
> >>>> backing store.
> >>>>
> >>>> For tmpfs and any other filesystems we support mounting without
> >>>> privilege that support xattrs.  We need to identify them and
> >>>> see if userspace is taking advantage of the ability to set
> >>>> xattrs and file caps (unlikely).  If they are we need to call
> >>>> sget_userns(..., &init_user_ns) on those filesystems as well.
> >>>>
> >>>> Possibly/Probably we should just do that for all of the interesting
> >>>> filesystems to start with and then change back to an ordinary old sget
> >>>> after we have done the testing and confirmed we will not be introducing
> >>>> userspace regressions.
> >>>
> >>> I was reviewing everything in preparation for sending v2 patches, and I
> >>> realized that doing this has an undesirable side effect. In patch 2 the
> >>> implicit nodev is removed for unprivileged mounts, and instead s_user_ns
> >>> is used to block opening devices in these mounts. When we set s_user_ns
> >>> to &init_user_ns, it becomes possible to open device nodes from
> >>> unprivileged mounts of these filesystems.
> >>>
> >>> This doesn't pose a real problem today. The only filesystems it will
> >>> affect are sysfs, tmpfs, and ramfs (no others need s_user_ns =
> >>> &init_user_ns for user namespace mounts), and all of these aren't
> >>> problems. sysfs is okay because kernfs doesn't (currently?) allow device
> >>> nodes, and a user would require CAP_MKNOD to create any device nodes in
> >>> a tmpfs or ramfs mount.
> >>>
> >>> But for sysfs in particular it does mean that we will need to make sure
> >>> that there's no way that device nodes could start appearing in an
> >>> unprivileged mount.
> >>
> >> Good point about nodev.  
> >>
> >> For tmpfs and ramfs and security labels, the Smack policy of allowing but
> >> filtering security labels means Smack, once it has those bits, will not
> >> care which user namespace ramfs and tmpfs live in.  The labels should
> >> pretty much stay the same in any case.
> > 
> > Smack does care which namespace ramfs and tmpfs are in. With the patch
> > I've got right now, if s_user_ns != &init_user_ns and the label of an
> > inode does not match that of the root inode then
> > security_inode_permission() will return EACCES.
> > 
> > So if something with CAP_MAC_ADMIN is changing security labels in such a
> > mount, suddenly those inodes might become inaccessible. And while it may
> > be unlikely that anyone is doing this it's impossible for me to prove
> > that's the case.
> > 
> >> If the same class of handling will also apply to selinux and those are
> >> the only two security modules that apply labels then we can leave tmpfs
> >> and ramfs with the security labels of whomever mounted them.
> > 
> > For SELinux I now have a patch which applies mountpoint labeling to
> > mounts for which s_user_ns != &init_user_ns. I'm less sure than with
> > Smack how this behavior will differ from what happens today, but my
> > understanding is that this means that the label of the mountpoint is
> > used for all objects from that superblock. Afaik it does not have the
> > Smack behavior of denying access to filesystem objects which have a
> > different label

Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

2015-08-06 Thread Stephen Smalley
On 08/06/2015 10:20 AM, Seth Forshee wrote:
> On Wed, Aug 05, 2015 at 04:19:03PM -0500, Eric W. Biederman wrote:
>> Seth Forshee  writes:
>>
>>> On Wed, Jul 15, 2015 at 09:47:11PM -0500, Eric W. Biederman wrote:
>>>> Seth Forshee  writes:
>>>>
>>>>> Initially this will be used to eliminate the implicit MNT_NODEV
>>>>> flag for mounts from user namespaces. In the future it will also
>>>>> be used for translating ids and checking capabilities for
>>>>> filesystems mounted from user namespaces.
>>>>>
>>>>> s_user_ns is initialized in alloc_super() and is generally set to
>>>>> current_user_ns(). To avoid security and corruption issues, two
>>>>> additional mount checks are also added:
>>>>>
>>>>>  - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
>>>>>    in current_user_ns().
>>>>>
>>>>>  - sget() will fail with EBUSY when the filesystem it's looking
>>>>>    for is already mounted from another user namespace.
>>>>>
>>>>> proc needs some special handling here. The user namespace of
>>>>> current isn't appropriate when forking as a result of clone (2)
>>>>> with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
>>>>> from within the new user namespace. Instead, the user namespace
>>>>> which owns the new pid namespace should be used. sget_userns() is
>>>>> added to allow passing of a user namespace other than that of
>>>>> current, and this is used by proc_mount(). sget() becomes a
>>>>> wrapper around sget_userns() which passes current_user_ns().
>>>>
>>>> From bits of the previous conversation.
>>>>
>>>> We need sget_userns(..., &init_user_ns) for sysfs.  The sysfs
>>>> xattrs can travel from one mount of sysfs to another via the sysfs
>>>> backing store.
>>>>
>>>> For tmpfs and any other filesystems we support mounting without
>>>> privilege that support xattrs.  We need to identify them and
>>>> see if userspace is taking advantage of the ability to set
>>>> xattrs and file caps (unlikely).  If they are we need to call
>>>> sget_userns(..., &init_user_ns) on those filesystems as well.
>>>>
>>>> Possibly/Probably we should just do that for all of the interesting
>>>> filesystems to start with and then change back to an ordinary old sget
>>>> after we have done the testing and confirmed we will not be introducing
>>>> userspace regressions.
>>>
>>> I was reviewing everything in preparation for sending v2 patches, and I
>>> realized that doing this has an undesirable side effect. In patch 2 the
>>> implicit nodev is removed for unprivileged mounts, and instead s_user_ns
>>> is used to block opening devices in these mounts. When we set s_user_ns
>>> to &init_user_ns, it becomes possible to open device nodes from
>>> unprivileged mounts of these filesystems.
>>>
>>> This doesn't pose a real problem today. The only filesystems it will
>>> affect are sysfs, tmpfs, and ramfs (no others need s_user_ns =
>>> &init_user_ns for user namespace mounts), and none of these is a
>>> problem. sysfs is okay because kernfs doesn't (currently?) allow device
>>> nodes, and a user would require CAP_MKNOD to create any device nodes in
>>> a tmpfs or ramfs mount.
>>>
>>> But for sysfs in particular it does mean that we will need to make sure
>>> that there's no way that device nodes could start appearing in an
>>> unprivileged mount.
>>
>> Good point about nodev.  
>>
>> For tmpfs and ramfs and security labels, the Smack policy of allowing but
>> filtering security labels means that Smack, once it has those bits, will not
>> care which user namespace ramfs and tmpfs live in.  The labels should
>> pretty much stay the same in any case.
> 
> Smack does care which namespace ramfs and tmpfs are in. With the patch
> I've got right now, if s_user_ns != &init_user_ns and the label of an
> inode does not match that of the root inode then
> security_inode_permission() will return EACCES.
> 
> So if something with CAP_MAC_ADMIN is changing security labels in such a
> mount, suddenly those inodes might become inaccessible. And while it may
> be unlikely that anyone is doing this it's impossible for me to prove
> that's the case.
> 
>> If the same class of handling will also apply to SELinux and those are
>> the only two security modules that apply labels, then we can leave tmpfs
>> and ramfs with the security labels of whoever mounted them.
> 
> For SELinux I now have a patch which applies mountpoint labeling to
> mounts for which s_user_ns != &init_user_ns. I'm less sure than with
> Smack how this behavior will differ from what happens today, but my
> understanding is that this means that the label of the mountpoint is
> used for all objects from that superblock. Afaik it does not have the
> Smack behavior of denying access to filesystem objects which have a
> different label in the backing store.
> 
>> For sysfs things get a little more interesting.  Assuming tmpfs and
>> ramfs don't need s_user_ns == &init_user_ns, sysfs may be fine operating
>> with possibly invalid security labels set on a different mount of
>> selinux

Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

2015-08-06 Thread Seth Forshee
On Wed, Aug 05, 2015 at 04:19:03PM -0500, Eric W. Biederman wrote:
> Seth Forshee  writes:
> 
> > On Wed, Jul 15, 2015 at 09:47:11PM -0500, Eric W. Biederman wrote:
> >> Seth Forshee  writes:
> >> 
> >> > Initially this will be used to eliminate the implicit MNT_NODEV
> >> > flag for mounts from user namespaces. In the future it will also
> >> > be used for translating ids and checking capabilities for
> >> > filesystems mounted from user namespaces.
> >> >
> >> > s_user_ns is initialized in alloc_super() and is generally set to
> >> > current_user_ns(). To avoid security and corruption issues, two
> >> > additional mount checks are also added:
> >> >
> >> >  - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
> >> >    in current_user_ns().
> >> >
> >> >  - sget() will fail with EBUSY when the filesystem it's looking
> >> >    for is already mounted from another user namespace.
> >> >
> >> > proc needs some special handling here. The user namespace of
> >> > current isn't appropriate when forking as a result of clone (2)
> >> > with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
> >> > from within the new user namespace. Instead, the user namespace
> >> > which owns the new pid namespace should be used. sget_userns() is
> >> > added to allow passing of a user namespace other than that of
> >> > current, and this is used by proc_mount(). sget() becomes a
> >> > wrapper around sget_userns() which passes current_user_ns().
> >> 
> >> From bits of the previous conversation.
> >> 
> >> We need sget_userns(..., &init_user_ns) for sysfs.  The sysfs
> >> xattrs can travel from one mount of sysfs to another via the sysfs
> >> backing store.
> >> 
> >> For tmpfs and any other filesystems we support mounting without
> >> privilege that support xattrs, we need to identify them and
> >> see if userspace is taking advantage of the ability to set
> >> xattrs and file caps (unlikely).  If they are we need to call
> >> sget_userns(..., &init_user_ns) on those filesystems as well.
> >> 
> >> Possibly/Probably we should just do that for all of the interesting
> >> filesystems to start with and then change back to an ordinary old sget
> >> after we have done the testing and confirmed we will not be introducing
> >> userspace regressions.
> >
> > I was reviewing everything in preparation for sending v2 patches, and I
> > realized that doing this has an undesirable side effect. In patch 2 the
> > implicit nodev is removed for unprivileged mounts, and instead s_user_ns
> > is used to block opening devices in these mounts. When we set s_user_ns
> > to &init_user_ns, it becomes possible to open device nodes from
> > unprivileged mounts of these filesystems.
> >
> > This doesn't pose a real problem today. The only filesystems it will
> > affect are sysfs, tmpfs, and ramfs (no others need s_user_ns =
> > &init_user_ns for user namespace mounts), and none of these is a
> > problem. sysfs is okay because kernfs doesn't (currently?) allow device
> > nodes, and a user would require CAP_MKNOD to create any device nodes in
> > a tmpfs or ramfs mount.
> >
> > But for sysfs in particular it does mean that we will need to make sure
> > that there's no way that device nodes could start appearing in an
> > unprivileged mount.
> 
> Good point about nodev.  
> 
> For tmpfs and ramfs and security labels, the Smack policy of allowing but
> filtering security labels means that Smack, once it has those bits, will not
> care which user namespace ramfs and tmpfs live in.  The labels should
> pretty much stay the same in any case.

Smack does care which namespace ramfs and tmpfs are in. With the patch
I've got right now, if s_user_ns != &init_user_ns and the label of an
inode does not match that of the root inode then
security_inode_permission() will return EACCES.
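
As a rough illustration of the rule just described, here is a minimal
userspace sketch. It is not the Smack patch itself: the struct names, the
init_user_ns_model object and model_inode_permission() are hypothetical
stand-ins, and labels are modelled as plain strings.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical stand-ins for kernel objects; not the real structures. */
struct user_ns { int id; };
struct sb { const struct user_ns *s_user_ns; const char *root_label; };
struct ino { const struct sb *sb; const char *label; };

/* Models &init_user_ns. */
const struct user_ns init_user_ns_model = { 0 };

/*
 * Returns 0 if this extra check lets the access continue to the normal
 * permission checks, or -EACCES if it is refused outright.
 */
int model_inode_permission(const struct ino *inode)
{
	const struct sb *sb = inode->sb;

	/* Superblocks owned by the initial namespace are not filtered. */
	if (sb->s_user_ns == &init_user_ns_model)
		return 0;

	/* Elsewhere, a label that differs from the root inode's is refused. */
	if (strcmp(inode->label, sb->root_label) != 0)
		return -EACCES;

	return 0;
}
```

In this model, relabelling an inode inside a mount whose s_user_ns is not
the initial namespace is enough to turn a previously accessible inode into
one that returns -EACCES, which is the regression risk described here.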

So if something with CAP_MAC_ADMIN is changing security labels in such a
mount, suddenly those inodes might become inaccessible. And while it may
be unlikely that anyone is doing this it's impossible for me to prove
that's the case.

> If the same class of handling will also apply to SELinux and those are
> the only two security modules that apply labels, then we can leave tmpfs
> and ramfs with the security labels of whoever mounted them.

For SELinux I now have a patch which applies mountpoint labeling to
mounts for which s_user_ns != &init_user_ns. I'm less sure than with
Smack how this behavior will differ from what happens today, but my
understanding is that this means that the label of the mountpoint is
used for all objects from that superblock. Afaik it does not have the
Smack behavior of denying access to filesystem objects which have a
different label in the backing store.
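
To make the contrast with Smack concrete, a small userspace sketch of
mountpoint labeling as understood above. This is only a model under that
understanding, not SELinux code: effective_label(), the struct layout and
init_ns_model are hypothetical names, and labels are plain strings.

```c
#include <stddef.h>

/* Hypothetical stand-ins for kernel objects; not the real structures. */
struct user_ns { int id; };
struct sb {
	const struct user_ns *s_user_ns;
	const char *mountpoint_label;
};

/* Models &init_user_ns. */
const struct user_ns init_ns_model = { 0 };

/*
 * Picks the label used for an object on this superblock. For mounts
 * outside the initial namespace the mountpoint label is used for every
 * object; otherwise the label stored with the object is used.
 */
const char *effective_label(const struct sb *sb, const char *stored_label)
{
	if (sb->s_user_ns != &init_ns_model)
		return sb->mountpoint_label;
	return stored_label;
}
```

Note that nothing is denied in this model: a differing stored label is
simply ignored, which is the difference from the Smack behavior.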

> For sysfs things get a little more interesting.  Assuming tmpfs and
> ramfs don't need s_user_ns == &init_user_ns, sysfs may be fine operating
> with possibly invalid security labels set on a different mount of
> selinux.  (I am wondering now how all of these labels work in the
> context of

Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

2015-08-05 Thread Eric W. Biederman
Seth Forshee  writes:

> On Wed, Jul 15, 2015 at 09:47:11PM -0500, Eric W. Biederman wrote:
>> Seth Forshee  writes:
>> 
>> > Initially this will be used to eliminate the implicit MNT_NODEV
>> > flag for mounts from user namespaces. In the future it will also
>> > be used for translating ids and checking capabilities for
>> > filesystems mounted from user namespaces.
>> >
>> > s_user_ns is initialized in alloc_super() and is generally set to
>> > current_user_ns(). To avoid security and corruption issues, two
>> > additional mount checks are also added:
>> >
>> >  - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
>> >    in current_user_ns().
>> >
>> >  - sget() will fail with EBUSY when the filesystem it's looking
>> >    for is already mounted from another user namespace.
>> >
>> > proc needs some special handling here. The user namespace of
>> > current isn't appropriate when forking as a result of clone (2)
>> > with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
>> > from within the new user namespace. Instead, the user namespace
>> > which owns the new pid namespace should be used. sget_userns() is
>> > added to allow passing of a user namespace other than that of
>> > current, and this is used by proc_mount(). sget() becomes a
>> > wrapper around sget_userns() which passes current_user_ns().
>> 
>> From bits of the previous conversation.
>> 
>> We need sget_userns(..., &init_user_ns) for sysfs.  The sysfs
>> xattrs can travel from one mount of sysfs to another via the sysfs
>> backing store.
>> 
>> For tmpfs and any other filesystems we support mounting without
>> privilege that support xattrs, we need to identify them and
>> see if userspace is taking advantage of the ability to set
>> xattrs and file caps (unlikely).  If they are we need to call
>> sget_userns(..., &init_user_ns) on those filesystems as well.
>> 
>> Possibly/Probably we should just do that for all of the interesting
>> filesystems to start with and then change back to an ordinary old sget
>> after we have done the testing and confirmed we will not be introducing
>> userspace regressions.
>
> I was reviewing everything in preparation for sending v2 patches, and I
> realized that doing this has an undesirable side effect. In patch 2 the
> implicit nodev is removed for unprivileged mounts, and instead s_user_ns
> is used to block opening devices in these mounts. When we set s_user_ns
> to &init_user_ns, it becomes possible to open device nodes from
> unprivileged mounts of these filesystems.
>
> This doesn't pose a real problem today. The only filesystems it will
> affect are sysfs, tmpfs, and ramfs (no others need s_user_ns =
> &init_user_ns for user namespace mounts), and none of these is a
> problem. sysfs is okay because kernfs doesn't (currently?) allow device
> nodes, and a user would require CAP_MKNOD to create any device nodes in
> a tmpfs or ramfs mount.
>
> But for sysfs in particular it does mean that we will need to make sure
> that there's no way that device nodes could start appearing in an
> unprivileged mount.

Good point about nodev.  

For tmpfs and ramfs and security labels, the Smack policy of allowing but
filtering security labels means that Smack, once it has those bits, will not
care which user namespace ramfs and tmpfs live in.  The labels should
pretty much stay the same in any case.

If the same class of handling will also apply to SELinux and those are
the only two security modules that apply labels, then we can leave tmpfs
and ramfs with the security labels of whoever mounted them.

For sysfs things get a little more interesting.  Assuming tmpfs and
ramfs don't need s_user_ns == &init_user_ns, sysfs may be fine operating
with possibly invalid security labels set on a different mount of
selinux.  (I am wondering now how all of these labels work in the
context of NFS).

The worst case for sysfs is that we come up with a cousin of
SB_I_NO_EXEC, say SB_I_NO_DEV.

But at the moment I am hoping that limited label storage in a user
namespace as you and Casey have been talking about winds up being the
norm and then we can follow the standard rules for setting s_user_ns and
still preserve the current label setting behavior.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

2015-08-05 Thread Seth Forshee
On Wed, Jul 15, 2015 at 09:47:11PM -0500, Eric W. Biederman wrote:
> Seth Forshee  writes:
> 
> > Initially this will be used to eliminate the implicit MNT_NODEV
> > flag for mounts from user namespaces. In the future it will also
> > be used for translating ids and checking capabilities for
> > filesystems mounted from user namespaces.
> >
> > s_user_ns is initialized in alloc_super() and is generally set to
> > current_user_ns(). To avoid security and corruption issues, two
> > additional mount checks are also added:
> >
> >  - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
> >    in current_user_ns().
> >
> >  - sget() will fail with EBUSY when the filesystem it's looking
> >    for is already mounted from another user namespace.
> >
> > proc needs some special handling here. The user namespace of
> > current isn't appropriate when forking as a result of clone (2)
> > with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
> > from within the new user namespace. Instead, the user namespace
> > which owns the new pid namespace should be used. sget_userns() is
> > added to allow passing of a user namespace other than that of
> > current, and this is used by proc_mount(). sget() becomes a
> > wrapper around sget_userns() which passes current_user_ns().
> 
> From bits of the previous conversation.
> 
> We need sget_userns(..., &init_user_ns) for sysfs.  The sysfs
> xattrs can travel from one mount of sysfs to another via the sysfs
> backing store.
> 
> For tmpfs and any other filesystems we support mounting without
> privilege that support xattrs, we need to identify them and
> see if userspace is taking advantage of the ability to set
> xattrs and file caps (unlikely).  If they are we need to call
> sget_userns(..., &init_user_ns) on those filesystems as well.
> 
> Possibly/Probably we should just do that for all of the interesting
> filesystems to start with and then change back to an ordinary old sget
> after we have done the testing and confirmed we will not be introducing
> userspace regressions.

I was reviewing everything in preparation for sending v2 patches, and I
realized that doing this has an undesirable side effect. In patch 2 the
implicit nodev is removed for unprivileged mounts, and instead s_user_ns
is used to block opening devices in these mounts. When we set s_user_ns
to &init_user_ns, it becomes possible to open device nodes from
unprivileged mounts of these filesystems.
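
To spell out the side effect, a minimal userspace sketch of how such a
device-open rule might look. The actual check in patch 2 is not shown in
this thread, so may_open_dev_model(), the struct layout and init_ns_model
are hypothetical and only mirror the description above.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for kernel objects; not the real structures. */
struct user_ns { int id; };
struct sb {
	const struct user_ns *s_user_ns;
	bool mnt_nodev;		/* models an explicit nodev mount flag */
};

/* Models &init_user_ns. */
const struct user_ns init_ns_model = { 0 };

/*
 * Device nodes may be opened only if the mount is not nodev and the
 * superblock belongs to the initial user namespace.
 */
bool may_open_dev_model(const struct sb *sb)
{
	if (sb->mnt_nodev)
		return false;
	return sb->s_user_ns == &init_ns_model;
}
```

With this rule, a superblock created by an unprivileged mount but forced to
s_user_ns = &init_user_ns is indistinguishable from a host mount, so the
check passes for it unless nodev is set by some other means.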

This doesn't pose a real problem today. The only filesystems it will
affect are sysfs, tmpfs, and ramfs (no others need s_user_ns =
&init_user_ns for user namespace mounts), and none of these is a
problem. sysfs is okay because kernfs doesn't (currently?) allow device
nodes, and a user would require CAP_MKNOD to create any device nodes in
a tmpfs or ramfs mount.

But for sysfs in particular it does mean that we will need to make sure
that there's no way that device nodes could start appearing in an
unprivileged mount.


Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

2015-07-31 Thread Eric W. Biederman
Amir Goldstein  writes:

> On Thu, Jul 16, 2015 at 5:47 AM, Eric W. Biederman
>  wrote:
>> Seth Forshee  writes:
>>
>>> Initially this will be used to eliminate the implicit MNT_NODEV
>>> flag for mounts from user namespaces. In the future it will also
>>> be used for translating ids and checking capabilities for
>>> filesystems mounted from user namespaces.
>>>
>>> s_user_ns is initialized in alloc_super() and is generally set to
>>> current_user_ns(). To avoid security and corruption issues, two
>>> additional mount checks are also added:
>>>
>>>  - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
>>>    in current_user_ns().
>>>
>>>  - sget() will fail with EBUSY when the filesystem it's looking
>>>    for is already mounted from another user namespace.
>>>
>>> proc needs some special handling here. The user namespace of
>>> current isn't appropriate when forking as a result of clone (2)
>>> with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
>>> from within the new user namespace. Instead, the user namespace
>>> which owns the new pid namespace should be used. sget_userns() is
>>> added to allow passing of a user namespace other than that of
>>> current, and this is used by proc_mount(). sget() becomes a
>>> wrapper around sget_userns() which passes current_user_ns().
>>
>> From bits of the previous conversation.
>>
>> We need sget_userns(..., &init_user_ns) for sysfs.  The sysfs
>> xattrs can travel from one mount of sysfs to another via the sysfs
>> backing store.
>>
>> For tmpfs and any other filesystems we support mounting without
>> privilege that support xattrs, we need to identify them and
>> see if userspace is taking advantage of the ability to set
>> xattrs and file caps (unlikely).  If they are we need to call
>> sget_userns(..., &init_user_ns) on those filesystems as well.
>>
>> Possibly/Probably we should just do that for all of the interesting
>> filesystems to start with and then change back to an ordinary old sget
>> after we have done the testing and confirmed we will not be introducing
>> userspace regressions.
>
> Eric,
>
> Perhaps it is too soon to discuss here, but how do you envision
> handling of file system private mount options in user ns?
>
> For example, suppose that we get to a point where we can trust
> an ext4 loopback mount to be not vulnerable to exploits.
> That loopback mounted fs could very well have errors, and so an
> error=panic option would be very undesirable in an unprivileged user mount.
>
> Do you think this would require extra flags/callbacks from VFS to
> file system code or would s_user_ns be sufficient?

This case is easy.  In mount or remount we just need to check
capable(CAP_SYS_ADMIN) if someone sets error=panic, and if the capable
call fails, don't allow the mount or the remount.
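
A minimal userspace sketch of that check, for illustration only. The names
mount_req, has_option() and check_mount_options() are hypothetical, the
has_cap_sys_admin field stands in for the result of capable(CAP_SYS_ADMIN),
and the option is spelled errors=panic as ext4 documents it.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of a mount or remount request. */
struct mount_req {
	const char *options;	/* comma-separated option string */
	bool has_cap_sys_admin;	/* stands in for capable(CAP_SYS_ADMIN) */
};

/* Returns true if opt appears as a whole comma-separated token. */
bool has_option(const char *options, const char *opt)
{
	size_t len = strlen(opt);
	const char *p = options;

	while (p && *p) {
		const char *end = strchr(p, ',');
		size_t tok = end ? (size_t)(end - p) : strlen(p);

		if (tok == len && strncmp(p, opt, len) == 0)
			return true;
		p = end ? end + 1 : NULL;
	}
	return false;
}

/* Refuses the request if it asks for panic-on-error without privilege. */
int check_mount_options(const struct mount_req *req)
{
	if (has_option(req->options, "errors=panic") && !req->has_cap_sys_admin)
		return -EPERM;
	return 0;
}
```

The same function would be called from both the mount and the remount
path, so that the option cannot be added after the fact.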

But this corner case is another good reminder that we have to be very
deliberate and very careful before we enable mounting a filesystem this
way.

Eric



Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

2015-07-31 Thread Amir Goldstein
On Thu, Jul 16, 2015 at 5:47 AM, Eric W. Biederman
 wrote:
> Seth Forshee  writes:
>
>> Initially this will be used to eliminate the implicit MNT_NODEV
>> flag for mounts from user namespaces. In the future it will also
>> be used for translating ids and checking capabilities for
>> filesystems mounted from user namespaces.
>>
>> s_user_ns is initialized in alloc_super() and is generally set to
>> current_user_ns(). To avoid security and corruption issues, two
>> additional mount checks are also added:
>>
>>  - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
>>    in current_user_ns().
>>
>>  - sget() will fail with EBUSY when the filesystem it's looking
>>    for is already mounted from another user namespace.
>>
>> proc needs some special handling here. The user namespace of
>> current isn't appropriate when forking as a result of clone (2)
>> with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
>> from within the new user namespace. Instead, the user namespace
>> which owns the new pid namespace should be used. sget_userns() is
>> added to allow passing of a user namespace other than that of
>> current, and this is used by proc_mount(). sget() becomes a
>> wrapper around sget_userns() which passes current_user_ns().
>
> From bits of the previous conversation.
>
> We need sget_userns(..., &init_user_ns) for sysfs.  The sysfs
> xattrs can travel from one mount of sysfs to another via the sysfs
> backing store.
>
> For tmpfs and any other filesystems we support mounting without
> privilege that support xattrs, we need to identify them and
> see if userspace is taking advantage of the ability to set
> xattrs and file caps (unlikely).  If they are we need to call
> sget_userns(..., &init_user_ns) on those filesystems as well.
>
> Possibly/Probably we should just do that for all of the interesting
> filesystems to start with and then change back to an ordinary old sget
> after we have done the testing and confirmed we will not be introducing
> userspace regressions.

Eric,

Perhaps it is too soon to discuss here, but how do you envision
handling of file system private mount options in user ns?

For example, suppose that we get to a point where we can trust
an ext4 loopback mount to be not vulnerable to exploits.
That loopback mounted fs could very well have errors, and so an
error=panic option would be very undesirable in an unprivileged user mount.

Do you think this would require extra flags/callbacks from VFS to
file system code or would s_user_ns be sufficient?

>
> Eric


Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

2015-07-15 Thread Eric W. Biederman
Seth Forshee  writes:

> Initially this will be used to eliminate the implicit MNT_NODEV
> flag for mounts from user namespaces. In the future it will also
> be used for translating ids and checking capabilities for
> filesystems mounted from user namespaces.
>
> s_user_ns is initialized in alloc_super() and is generally set to
> current_user_ns(). To avoid security and corruption issues, two
> additional mount checks are also added:
>
>  - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
>    in current_user_ns().
>
>  - sget() will fail with EBUSY when the filesystem it's looking
>    for is already mounted from another user namespace.
>
> proc needs some special handling here. The user namespace of
> current isn't appropriate when forking as a result of clone (2)
> with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
> from within the new user namespace. Instead, the user namespace
> which owns the new pid namespace should be used. sget_userns() is
> added to allow passing of a user namespace other than that of
> current, and this is used by proc_mount(). sget() becomes a
> wrapper around sget_userns() which passes current_user_ns().

From bits of the previous conversation.

We need sget_userns(..., &init_user_ns) for sysfs.  The sysfs
xattrs can travel from one mount of sysfs to another via the sysfs
backing store.

For tmpfs and any other filesystems we support mounting without
privilege that support xattrs, we need to identify them and
see if userspace is taking advantage of the ability to set
xattrs and file caps (unlikely).  If they are we need to call
sget_userns(..., &init_user_ns) on those filesystems as well.

Possibly/Probably we should just do that for all of the interesting
filesystems to start with and then change back to an ordinary old sget
after we have done the testing and confirmed we will not be introducing
userspace regressions.

Eric


[PATCH 1/7] fs: Add user namesapace member to struct super_block

2015-07-15 Thread Seth Forshee
Initially this will be used to eliminate the implicit MNT_NODEV
flag for mounts from user namespaces. In the future it will also
be used for translating ids and checking capabilities for
filesystems mounted from user namespaces.

s_user_ns is initialized in alloc_super() and is generally set to
current_user_ns(). To avoid security and corruption issues, two
additional mount checks are also added:

 - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
   in current_user_ns().

 - sget() will fail with EBUSY when the filesystem it's looking
   for is already mounted from another user namespace.

proc needs some special handling here. The user namespace of
current isn't appropriate when forking as a result of clone(2)
with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
from within the new user namespace. Instead, the user namespace
which owns the new pid namespace should be used. sget_userns() is
added to allow passing of a user namespace other than that of
current, and this is used by proc_mount(). sget() becomes a
wrapper around sget_userns() which passes current_user_ns().
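
For readers who want the lookup rule in isolation, a minimal userspace
model of the behavior described above. It is a sketch, not the kernel
code: superblocks are matched by an integer key instead of test/set
callbacks, the table is a fixed array, and model_sget_userns(),
model_sget() and current_ns are hypothetical names.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel objects; not the real structures. */
struct user_ns { int id; };
struct sb { int key; const struct user_ns *s_user_ns; };

#define MAX_SB 8
struct sb sb_table[MAX_SB];
size_t nr_sb;

/* Models current_user_ns(); the caller sets it before calling model_sget(). */
const struct user_ns *current_ns;

/*
 * Find or create a superblock for key. An existing match is reused only
 * if it was created from the same user namespace; otherwise -EBUSY.
 */
int model_sget_userns(int key, const struct user_ns *ns, struct sb **out)
{
	for (size_t i = 0; i < nr_sb; i++) {
		if (sb_table[i].key != key)
			continue;
		if (sb_table[i].s_user_ns != ns)
			return -EBUSY;
		*out = &sb_table[i];
		return 0;
	}

	if (nr_sb == MAX_SB)
		return -ENOMEM;

	sb_table[nr_sb].key = key;
	sb_table[nr_sb].s_user_ns = ns;
	*out = &sb_table[nr_sb++];
	return 0;
}

/* The plain variant just passes the caller's own namespace. */
int model_sget(int key, struct sb **out)
{
	return model_sget_userns(key, current_ns, out);
}
```

A caller such as proc would use the explicit variant and pass the
namespace that owns the pid namespace instead of current_ns.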

Signed-off-by: Seth Forshee 
---
 fs/namespace.c |  3 +++
 fs/proc/root.c |  3 ++-
 fs/super.c | 38 +-
 include/linux/fs.h |  8 
 4 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ce428cadd41f..f1f67d663d49 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2357,6 +2357,9 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
 	struct vfsmount *mnt;
 	int err;
 
+	if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+		return -EPERM;
+
 	if (!fstype)
 		return -EINVAL;
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 361ab4ee42fc..4b302cbf13f9 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -117,7 +117,8 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
return ERR_PTR(-EPERM);
}
 
-   sb = sget(fs_type, proc_test_super, proc_set_super, flags, ns);
+   sb = sget_userns(fs_type, proc_test_super, proc_set_super, flags,
+ns->user_ns, ns);
if (IS_ERR(sb))
return ERR_CAST(sb);
 
diff --git a/fs/super.c b/fs/super.c
index b61372354f2b..b5f171aadbf7 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 
@@ -148,6 +149,7 @@ static void destroy_super(struct super_block *s)
list_lru_destroy(&s->s_inode_lru);
for (i = 0; i < SB_FREEZE_LEVELS; i++)
percpu_counter_destroy(&s->s_writers.counter[i]);
+   put_user_ns(s->s_user_ns);
security_sb_free(s);
WARN_ON(!list_empty(&s->s_mounts));
kfree(s->s_subtype);
@@ -163,7 +165,8 @@ static void destroy_super(struct super_block *s)
  * Allocates and initializes a new &struct super_block.  alloc_super()
  * returns a pointer new superblock or %NULL if allocation had failed.
  */
-static struct super_block *alloc_super(struct file_system_type *type, int flags)
+static struct super_block *alloc_super(struct file_system_type *type, int flags,
+  struct user_namespace *user_ns)
 {
struct super_block *s = kzalloc(sizeof(struct super_block),  GFP_USER);
static const struct super_operations default_op;
@@ -231,6 +234,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
s->s_shrink.count_objects = super_cache_count;
s->s_shrink.batch = 1024;
s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
+
+   s->s_user_ns = get_user_ns(user_ns);
return s;
 
 fail:
@@ -427,17 +432,17 @@ void generic_shutdown_super(struct super_block *sb)
 EXPORT_SYMBOL(generic_shutdown_super);
 
 /**
- * sget-   find or create a superblock
+ * sget_userns -   find or create a superblock
  * @type:  filesystem type superblock should belong to
  * @test:  comparison callback
  * @set:   setup callback
  * @flags: mount flags
  * @data:  argument to each of them
  */
-struct super_block *sget(struct file_system_type *type,
+struct super_block *sget_userns(struct file_system_type *type,
int (*test)(struct super_block *,void *),
int (*set)(struct super_block *,void *),
-   int flags,
+   int flags, struct user_namespace *user_ns,
void *data)
 {
struct super_block *s = NULL;
@@ -450,6 +455,10 @@ retry:
hlist_for_each_entry(old, &type->fs_supers, s_instances) {
if (!test(old, data))
continue;
+   if (user_ns != old->s_user_ns) {
+   spin_unlock(&sb_lock);
+   r