from:"Seth Forshee"

[PATCH] tmpfs: disallow CONFIG_TMPFS_INODE64 on alpha

2021-02-08 Thread Seth Forshee

As with s390, alpha is a 64-bit architecture with a 32-bit ino_t.
With CONFIG_TMPFS_INODE64=y tmpfs mounts will get 64-bit inode
numbers and display "inode64" in the mount options, whereas
passing "inode64" in the mount options will fail. This leads to
erroneous behaviours such as this:

 # mkdir mnt
 # mount -t tmpfs nodev mnt
 # mount -o remount,rw mnt
 mount: /home/ubuntu/mnt: mount point not mounted or bad option.

Prevent CONFIG_TMPFS_INODE64 from being selected on alpha.

Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb")
Cc: sta...@vger.kernel.org # v5.9+
Signed-off-by: Seth Forshee 
---
 fs/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 3347ec7bd837..da524c4d7b7e 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -203,7 +203,7 @@ config TMPFS_XATTR
 
 config TMPFS_INODE64
bool "Use 64-bit ino_t by default in tmpfs"
-   depends on TMPFS && 64BIT && !S390
+   depends on TMPFS && 64BIT && !(S390 || ALPHA)
default n
help
  tmpfs has historically used only inode numbers as wide as an unsigned
-- 
2.29.2

Re: [PATCH] tmpfs: Disallow CONFIG_TMPFS_INODE64 on s390

2021-02-08 Thread Seth Forshee

On Sun, Feb 07, 2021 at 05:48:31PM +0300, Kirill A. Shutemov wrote:
> On Fri, Feb 05, 2021 at 05:06:20PM -0600, Seth Forshee wrote:
> > This feature requires ino_t be 64-bits, which is true for every
> > 64-bit architecture but s390, so prevent this option from being
> > selected there.
> 
> Quick grep suggests the same for alpha. Am I wrong?

No, it appears you are right. Looks like my grep missed alpha somehow.

Andrew, do you prefer an additional patch or an updated version of the
previous patch?

[PATCH] tmpfs: Disallow CONFIG_TMPFS_INODE64 on s390

2021-02-05 Thread Seth Forshee

This feature requires ino_t be 64-bits, which is true for every
64-bit architecture but s390, so prevent this option from being
selected there.

Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb")
Cc:  # v5.9+
Signed-off-by: Seth Forshee 
---
 fs/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index aa4c12282301..3347ec7bd837 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -203,7 +203,7 @@ config TMPFS_XATTR
 
 config TMPFS_INODE64
bool "Use 64-bit ino_t by default in tmpfs"
-   depends on TMPFS && 64BIT
+   depends on TMPFS && 64BIT && !S390
default n
help
  tmpfs has historically used only inode numbers as wide as an unsigned
-- 
2.29.2

Re: [PATCH] tmpfs: Don't use 64-bit inodes by defulat with 32-bit ino_t

2021-02-05 Thread Seth Forshee

On Fri, Feb 05, 2021 at 01:23:13PM -0800, Andrew Morton wrote:
> On Fri, 5 Feb 2021 14:55:43 -0600 Seth Forshee  
> wrote:
> 
> > On Fri, Feb 05, 2021 at 12:41:57PM -0800, Andrew Morton wrote:
> > > On Fri,  5 Feb 2021 14:21:59 -0600 Seth Forshee 
> > >  wrote:
> > > 
> > > > Currently there seems to be an assumption in tmpfs that 64-bit
> > > > architectures also have a 64-bit ino_t. This is not true; s390 at
> > > > least has a 32-bit ino_t. With CONFIG_TMPFS_INODE64=y tmpfs
> > > > mounts will get 64-bit inode numbers and display "inode64" in the
> > > > mount options, but passing the "inode64" mount option will fail.
> > > > This leads to the following behavior:
> > > > 
> > > >  # mkdir mnt
> > > >  # mount -t tmpfs nodev mnt
> > > >  # mount -o remount,rw mnt
> > > >  mount: /home/ubuntu/mnt: mount point not mounted or bad option.
> > > > 
> > > > As mount sees "inode64" in the mount options and thus passes it
> > > > in the options for the remount.
> > > > 
> > > > Ideally CONFIG_TMPFS_INODE64 would depend on sizeof(ino_t) < 8,
> > > > but I don't think it's possible to test for this (potentially
> > > > CONFIG_ARCH_HAS_64BIT_INO_T or similar could be added, but I'm
> > > > not sure whether or not that is wanted). So fix this by simply
> > > > refusing to honor the CONFIG_TMPFS_INODE64 setting when
> > > > sizeof(ino_t) < 8.
> > > 
> > > How about changing s390 Kconfig so that CONFIG_TMPFS_INODE64 is not 
> > > enabled?
> > 
> > I did do that for our config. I see the s390 defconfig has it enabled,
> > so I will send a patch for that too. But the fact that it can be
> > configured that way and that the code behaves badly still seems
> > problematic.
> 
> I meant
> 
> --- a/fs/Kconfig~a
> +++ a/fs/Kconfig
> @@ -203,7 +203,7 @@ config TMPFS_XATTR
>  
>  config TMPFS_INODE64
>   bool "Use 64-bit ino_t by default in tmpfs"
> - depends on TMPFS && 64BIT
> + depends on TMPFS && 64BIT && !S390
>   default n
>   help
> tmpfs has historically used only inode numbers as wide as an unsigned
> _

Ah. s390 does appear to be the only architecture that doesn't use
unsigned long for ino_t, so that seems good to me. I can send a patch if
you want, but seems like you've already got one ...

Seth

Re: [PATCH] tmpfs: Don't use 64-bit inodes by defulat with 32-bit ino_t

2021-02-05 Thread Seth Forshee

On Fri, Feb 05, 2021 at 12:41:57PM -0800, Andrew Morton wrote:
> On Fri,  5 Feb 2021 14:21:59 -0600 Seth Forshee  
> wrote:
> 
> > Currently there seems to be an assumption in tmpfs that 64-bit
> > architectures also have a 64-bit ino_t. This is not true; s390 at
> > least has a 32-bit ino_t. With CONFIG_TMPFS_INODE64=y tmpfs
> > mounts will get 64-bit inode numbers and display "inode64" in the
> > mount options, but passing the "inode64" mount option will fail.
> > This leads to the following behavior:
> > 
> >  # mkdir mnt
> >  # mount -t tmpfs nodev mnt
> >  # mount -o remount,rw mnt
> >  mount: /home/ubuntu/mnt: mount point not mounted or bad option.
> > 
> > As mount sees "inode64" in the mount options and thus passes it
> > in the options for the remount.
> > 
> > Ideally CONFIG_TMPFS_INODE64 would depend on sizeof(ino_t) < 8,
> > but I don't think it's possible to test for this (potentially
> > CONFIG_ARCH_HAS_64BIT_INO_T or similar could be added, but I'm
> > not sure whether or not that is wanted). So fix this by simply
> > refusing to honor the CONFIG_TMPFS_INODE64 setting when
> > sizeof(ino_t) < 8.
> 
> How about changing s390 Kconfig so that CONFIG_TMPFS_INODE64 is not enabled?

I did do that for our config. I see the s390 defconfig has it enabled,
so I will send a patch for that too. But the fact that it can be
configured that way and that the code behaves badly still seems
problematic.

> 
> > Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb")
> > Signed-off-by: Seth Forshee 
> 
> With a cc:stable, I assume?

Yes, I forgot that. Thanks.

Seth

[PATCH] tmpfs: Don't use 64-bit inodes by defulat with 32-bit ino_t

2021-02-05 Thread Seth Forshee

Currently there seems to be an assumption in tmpfs that 64-bit
architectures also have a 64-bit ino_t. This is not true; s390 at
least has a 32-bit ino_t. With CONFIG_TMPFS_INODE64=y tmpfs
mounts will get 64-bit inode numbers and display "inode64" in the
mount options, but passing the "inode64" mount option will fail.
This leads to the following behavior:

 # mkdir mnt
 # mount -t tmpfs nodev mnt
 # mount -o remount,rw mnt
 mount: /home/ubuntu/mnt: mount point not mounted or bad option.

As mount sees "inode64" in the mount options and thus passes it
in the options for the remount.

Ideally CONFIG_TMPFS_INODE64 would depend on sizeof(ino_t) < 8,
but I don't think it's possible to test for this (potentially
CONFIG_ARCH_HAS_64BIT_INO_T or similar could be added, but I'm
not sure whether or not that is wanted). So fix this by simply
refusing to honor the CONFIG_TMPFS_INODE64 setting when
sizeof(ino_t) < 8.

Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb")
Signed-off-by: Seth Forshee 
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 7c6b6d8f6c39..efde42acdc7a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3733,7 +3733,7 @@ static int shmem_fill_super(struct super_block *sb, 
struct fs_context *fc)
ctx->blocks = shmem_default_max_blocks();
if (!(ctx->seen & SHMEM_SEEN_INODES))
ctx->inodes = shmem_default_max_inodes();
-   if (!(ctx->seen & SHMEM_SEEN_INUMS))
+   if (!(ctx->seen & SHMEM_SEEN_INUMS) && sizeof(ino_t) >= 8)
ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64);
} else {
sb->s_flags |= SB_NOUSER;
-- 
2.29.2

Re: [PATCH] x86: Disable CET instrumentation in the kernel

2021-01-29 Thread Seth Forshee

On Fri, Jan 29, 2021 at 06:07:55PM +0100, Borislav Petkov wrote:
> On Fri, Jan 29, 2021 at 11:03:31AM -0600, Josh Poimboeuf wrote:
> > On Fri, Jan 29, 2021 at 06:54:08PM +0200, Nikolay Borisov wrote:
> > > 
> > > 
> > > On 29.01.21 г. 18:49 ч., Josh Poimboeuf wrote:
> > > > Agreed, stable is a good idea.   I think Nikolay saw it with GCC 9.
> > > 
> > > 
> > > Yes I did, with the default Ubuntu compiler as well as the default gcc-10 
> > > compiler: 
> > > 
> > > # gcc -v -Q -O2 --help=target | grep protection
> > > 
> > > gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04) 
> > > COLLECT_GCC_OPTIONS='-v' '-Q' '-O2' '--help=target' '-mtune=generic' 
> > > '-march=x86-64'
> > >  /usr/lib/gcc/x86_64-linux-gnu/9/cc1 -v -imultiarch x86_64-linux-gnu 
> > > help-dummy -dumpbase help-dummy -mtune=generic -march=x86-64 -auxbase 
> > > help-dummy -O2 -version --help=target -fasynchronous-unwind-tables 
> > > -fstack-protector-strong -Wformat -Wformat-security 
> > > -fstack-clash-protection -fcf-protection -o /tmp/ccSecttk.s
> > > GNU C17 (Ubuntu 9.3.0-17ubuntu1~20.04) version 9.3.0 (x86_64-linux-gnu)
> > >   compiled by GNU C version 9.3.0, GMP version 6.2.0, MPFR version 4.0.2, 
> > > MPC version 1.1.0, isl version isl-0.22.1-GMP
> > > 
> > > 
> > > It has -fcf-protection turned on by default it seems. 
> > 
> > Yup, explains why I didn't see it:
> > 
> > gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC)
> > COLLECT_GCC_OPTIONS='-v' '-Q' '-O2' '--help=target' '-mtune=generic' 
> > '-march=x86-64'
> >  /usr/libexec/gcc/x86_64-redhat-linux/10/cc1 -v help-dummy -dumpbase 
> > help-dummy -mtune=generic -march=x86-64 -auxbase help-dummy -O2 -version 
> > --help=target -o /tmp/cclBz55H.s
> 
> The fact that you triggered it with an Ubuntu gcc explains why the
> original patch adding that switch:
> 
> 29be86d7f9cb ("kbuild: add -fcf-protection=none when using retpoline flags")
> 
> came from a Canonical.
> 
> Adding the author to Cc for FYI.
> 
> Seth, you can find this thread starting here:
> 
> https://lkml.kernel.org/r/20210128215219.6kct3h2eiustncws@treble

Thanks for the heads up. This still works fine for our needs.

Acked-by: Seth Forshee

Re: BPF selftests build failure in 5.10-rc

2020-12-10 Thread Seth Forshee

On Wed, Dec 09, 2020 at 04:15:35PM -0800, Andrii Nakryiko wrote:
> On Wed, Dec 9, 2020 at 2:24 PM Seth Forshee  
> wrote:
> >
> > Building the BPF selftests with clang 11, I'm getting the following
> > error:
> >
> >CLNG-LLC [test_maps] profiler1.o
> >  In file included from progs/profiler1.c:6:
> >  progs/profiler.inc.h:260:17: error: use of unknown builtin 
> > '__builtin_preserve_enum_value' [-Wimplicit-function-declaration]
> >  int cgrp_id = bpf_core_enum_value(enum 
> > cgroup_subsys_id___local,
> >^
> >  
> > /home/ubuntu/unstable/tools/testing/selftests/bpf/tools/include/bpf/bpf_core_read.h:179:2:
> >  note: expanded from macro 'bpf_core_enum_value'
> >  __builtin_preserve_enum_value(*(typeof(enum_type) *)enum_value, 
> > BPF_ENUMVAL_VALUE)
> >  ^
> >  1 error generated.
> >  llc: error: llc: :1:1: error: expected top-level entity
> >  BPF obj compilation failed
> 
> Addressed by fb3558127cb6 ("bpf: Fix selftest compilation on clang 11")

Great, thanks!

> 
> >
> > I see that test_core_reloc_enumval.c takes precautions around the use of
> > __builtin_preserve_enum_value as it is currently only available in clang
> > 12 nightlies. Is it possible to do something similar here? Though I see
> > that the use of the builtin is not nearly so neatly localized as it is
> > in test_core_reloc_enumval.c.
> >
> > Thanks,
> > Seth

BPF selftests build failure in 5.10-rc

2020-12-09 Thread Seth Forshee

Building the BPF selftests with clang 11, I'm getting the following
error:

   CLNG-LLC [test_maps] profiler1.o
 In file included from progs/profiler1.c:6:
 progs/profiler.inc.h:260:17: error: use of unknown builtin 
'__builtin_preserve_enum_value' [-Wimplicit-function-declaration]
 int cgrp_id = bpf_core_enum_value(enum 
cgroup_subsys_id___local,
   ^
 
/home/ubuntu/unstable/tools/testing/selftests/bpf/tools/include/bpf/bpf_core_read.h:179:2:
 note: expanded from macro 'bpf_core_enum_value'
 __builtin_preserve_enum_value(*(typeof(enum_type) *)enum_value, 
BPF_ENUMVAL_VALUE)
 ^
 1 error generated.
 llc: error: llc: :1:1: error: expected top-level entity
 BPF obj compilation failed

I see that test_core_reloc_enumval.c takes precautions around the use of
__builtin_preserve_enum_value as it is currently only available in clang
12 nightlies. Is it possible to do something similar here? Though I see
that the use of the builtin is not nearly so neatly localized as it is
in test_core_reloc_enumval.c.

Thanks,
Seth

Re: resolve_btfids breaks kernel cross-compilation

2020-09-17 Thread Seth Forshee

On Thu, Sep 17, 2020 at 11:14:06AM +0200, Jiri Olsa wrote:
> On Thu, Sep 17, 2020 at 10:38:12AM +0200, Jiri Olsa wrote:
> > On Thu, Sep 17, 2020 at 10:04:55AM +0200, Jiri Olsa wrote:
> > > On Wed, Sep 16, 2020 at 02:47:33PM -0500, Seth Forshee wrote:
> > > > The requirement to build resolve_btfids whenever CONFIG_DEBUG_INFO_BTF
> > > > is enabled breaks some cross builds. For example, when building a 64-bit
> > > > powerpc kernel on amd64 I get:
> > > > 
> > > >  Auto-detecting system features:
> > > >  ...libelf: [ [32mon[m  ]
> > > >  ...  zlib: [ [32mon[m  ]
> > > >  ...   bpf: [ [31mOFF[m ]
> > > >  
> > > >  BPF API too old
> > > >  make[6]: *** [Makefile:295: bpfdep] Error 1
> > > > 
> > > > The contents of tools/bpf/resolve_btfids/feature/test-bpf.make.output:
> > > > 
> > > >  In file included from 
> > > > /home/sforshee/src/u-k/unstable/tools/arch/powerpc/include/uapi/asm/bitsperlong.h:11,
> > > >   from /usr/include/asm-generic/int-ll64.h:12,
> > > >   from /usr/include/asm-generic/types.h:7,
> > > >   from /usr/include/x86_64-linux-gnu/asm/types.h:1,
> > > >   from 
> > > > /home/sforshee/src/u-k/unstable/tools/include/linux/types.h:10,
> > > >   from 
> > > > /home/sforshee/src/u-k/unstable/tools/include/uapi/linux/bpf.h:11,
> > > >   from test-bpf.c:3:
> > > >  
> > > > /home/sforshee/src/u-k/unstable/tools/include/asm-generic/bitsperlong.h:14:2:
> > > >  error: #error Inconsistent word size. Check asm/bitsperlong.h
> > > > 14 | #error Inconsistent word size. Check asm/bitsperlong.h
> > > >|  ^
> > > > 
> > > > This is because tools/arch/powerpc/include/uapi/asm/bitsperlong.h sets
> > > > __BITS_PER_LONG based on the predefinied compiler macro __powerpc64__,
> > > > which is not defined by the host compiler. What can we do to get cross
> > > > builds working again?
> > > 
> > > could you please share the command line and setup?
> > 
> > I just reproduced.. checking on fix
> 
> I still need to check on few things, but patch below should help

It does help with the word size problem, thanks.

> we might have a problem for cross builds with different endianity
> than the host because libbpf does not support reading BTF data
> with different endianity, and we get:
> 
>   BTFIDS  vmlinux
> libbpf: non-native ELF endianness is not supported

Yes, I see this now when cross building for s390.

Thanks,
Seth

> 
> jirka
> 
> 
> ---
> diff --git a/tools/bpf/resolve_btfids/Makefile 
> b/tools/bpf/resolve_btfids/Makefile
> index a88cd4426398..d3c818b8d8d3 100644
> --- a/tools/bpf/resolve_btfids/Makefile
> +++ b/tools/bpf/resolve_btfids/Makefile
> @@ -1,5 +1,6 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>  include ../../scripts/Makefile.include
> +include ../../scripts/Makefile.arch
>  
>  ifeq ($(srctree),)
>  srctree := $(patsubst %/,%,$(dir $(CURDIR)))
> @@ -29,6 +30,7 @@ endif
>  AR   = $(HOSTAR)
>  CC   = $(HOSTCC)
>  LD   = $(HOSTLD)
> +ARCH = $(HOSTARCH)
>  
>  OUTPUT ?= $(srctree)/tools/bpf/resolve_btfids/
>  
>

resolve_btfids breaks kernel cross-compilation

2020-09-16 Thread Seth Forshee

The requirement to build resolve_btfids whenever CONFIG_DEBUG_INFO_BTF
is enabled breaks some cross builds. For example, when building a 64-bit
powerpc kernel on amd64 I get:

 Auto-detecting system features:
 ...libelf: [ [32mon[m  ]
 ...  zlib: [ [32mon[m  ]
 ...   bpf: [ [31mOFF[m ]
 
 BPF API too old
 make[6]: *** [Makefile:295: bpfdep] Error 1

The contents of tools/bpf/resolve_btfids/feature/test-bpf.make.output:

 In file included from 
/home/sforshee/src/u-k/unstable/tools/arch/powerpc/include/uapi/asm/bitsperlong.h:11,
  from /usr/include/asm-generic/int-ll64.h:12,
  from /usr/include/asm-generic/types.h:7,
  from /usr/include/x86_64-linux-gnu/asm/types.h:1,
  from 
/home/sforshee/src/u-k/unstable/tools/include/linux/types.h:10,
  from 
/home/sforshee/src/u-k/unstable/tools/include/uapi/linux/bpf.h:11,
  from test-bpf.c:3:
 /home/sforshee/src/u-k/unstable/tools/include/asm-generic/bitsperlong.h:14:2: 
error: #error Inconsistent word size. Check asm/bitsperlong.h
14 | #error Inconsistent word size. Check asm/bitsperlong.h
   |  ^

This is because tools/arch/powerpc/include/uapi/asm/bitsperlong.h sets
__BITS_PER_LONG based on the predefinied compiler macro __powerpc64__,
which is not defined by the host compiler. What can we do to get cross
builds working again?

Thanks,
Seth

test_bpf regressions on s390 since 5.4

2020-07-16 Thread seth . forshee

The tests in lib/test_bpf.c were all passing in 5.4 when using the JIT,
but some are failing in 5.7/5.8. Some of the failures are due to the
removal of BPF_SIZE_MAX causing some expected failures to pass, which I
have already send a patch for [1]. The remaining failures appear to be
regressions. I haven't tried 5.5 or 5.6, so I'm not sure exactly when
they first appeared.

These are the tests which currently fail:

 test_bpf: #37 INT: MUL_X jited:1 ret -1 != 1 FAIL (1 times)
 test_bpf: #42 INT: SUB jited:1 ret -55 != 11 FAIL (1 times)
 test_bpf: #44 INT: MUL jited:1 ret 439084800 != 903446258 FAIL (1 times)
 test_bpf: #49 INT: shifts by register jited:1 ret -617 != -1 FAIL (1 times)
 test_bpf: #371 JNE signed compare, test 1 jited:1 ret 2 != 1 FAIL (1 times)
 test_bpf: #372 JNE signed compare, test 2 jited:1 ret 2 != 1 FAIL (1 times)
 test_bpf: #374 JNE signed compare, test 4 jited:1 ret 1 != 2 FAIL (1 times)
 test_bpf: #375 JNE signed compare, test 5 jited:1 ret 2 != 1 FAIL (1 times)

Thanks,
Seth

[1] 
https://lore.kernel.org/lkml/20200716143931.330122-1-seth.fors...@canonical.com/

[PATCH] Revert "test_bpf: flag tests that cannot be jited on s390"

2020-07-16 Thread Seth Forshee

This reverts commit 3203c9010060806ff88c9989aeab4dc8d9a474dc.

The s390 bpf JIT previously had a restriction on the maximum
program size, which required some tests in test_bpf to be flagged
as expected failures. The program size limitation has been removed,
and the tests now pass, so these tests should no longer be flagged.

Fixes: d1242b10ff03 ("s390/bpf: Remove JITed image size limitations")
Signed-off-by: Seth Forshee 
---
 lib/test_bpf.c | 20 
 1 file changed, 20 deletions(-)

diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index a5fddf9ebcb7..ca7d635bccd9 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -5275,31 +5275,21 @@ static struct bpf_test tests[] = {
{   /* Mainly checking JIT here. */
"BPF_MAXINSNS: Ctx heavy transformations",
{ },
-#if defined(CONFIG_BPF_JIT_ALWAYS_ON) && defined(CONFIG_S390)
-   CLASSIC | FLAG_EXPECTED_FAIL,
-#else
CLASSIC,
-#endif
{ },
{
{  1, SKB_VLAN_PRESENT },
{ 10, SKB_VLAN_PRESENT }
},
.fill_helper = bpf_fill_maxinsns6,
-   .expected_errcode = -ENOTSUPP,
},
{   /* Mainly checking JIT here. */
"BPF_MAXINSNS: Call heavy transformations",
{ },
-#if defined(CONFIG_BPF_JIT_ALWAYS_ON) && defined(CONFIG_S390)
-   CLASSIC | FLAG_NO_DATA | FLAG_EXPECTED_FAIL,
-#else
CLASSIC | FLAG_NO_DATA,
-#endif
{ },
{ { 1, 0 }, { 10, 0 } },
.fill_helper = bpf_fill_maxinsns7,
-   .expected_errcode = -ENOTSUPP,
},
{   /* Mainly checking JIT here. */
"BPF_MAXINSNS: Jump heavy test",
@@ -5350,28 +5340,18 @@ static struct bpf_test tests[] = {
{
"BPF_MAXINSNS: exec all MSH",
{ },
-#if defined(CONFIG_BPF_JIT_ALWAYS_ON) && defined(CONFIG_S390)
-   CLASSIC | FLAG_EXPECTED_FAIL,
-#else
CLASSIC,
-#endif
{ 0xfa, 0xfb, 0xfc, 0xfd, },
{ { 4, 0xababab83 } },
.fill_helper = bpf_fill_maxinsns13,
-   .expected_errcode = -ENOTSUPP,
},
{
"BPF_MAXINSNS: ld_abs+get_processor_id",
{ },
-#if defined(CONFIG_BPF_JIT_ALWAYS_ON) && defined(CONFIG_S390)
-   CLASSIC | FLAG_EXPECTED_FAIL,
-#else
CLASSIC,
-#endif
{ },
{ { 1, 0xbee } },
.fill_helper = bpf_fill_ld_abs_get_processor_id,
-   .expected_errcode = -ENOTSUPP,
},
/*
 * LD_IND / LD_ABS on fragmented SKBs
-- 
2.27.0

Re: [PATCH] MAINTAINERS: remove obsolete entry after file renaming

2020-06-29 Thread Seth Forshee

On Sun, Jun 28, 2020 at 08:02:29PM +0200, Lukas Bulwahn wrote:
> Commit f16861b12fa0 ("regulator: rename da903x to da903x-regulator") missed
> to adjust the DIALOG SEMICONDUCTOR DRIVERS section in MAINTAINERS.
> 
> Hence, ./scripts/get_maintainer.pl --self-test=patterns complains:
> 
>   warning: no file matchesF:drivers/regulator/da903x.c
> 
> The da903x-regulator.c file is already covered by the pattern
> drivers/regulator/da9???-regulator.[ch] in the section.
> 
> So, simply remove the non-matching file entry in MAINTAINERS.
> 
> Signed-off-by: Lukas Bulwahn 

I didn't think to check MAINTAINERS when renaming the file. This makes
sense to me.

Acked-by: Seth Forshee 

> ---
> applies cleanly on next-20200626
> 
> Seth, please ack.
> Mark, please pick this minor non-urgent patch into your -next tree.
> 
>  MAINTAINERS | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 04fceaee5200..970136e262c2 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5021,7 +5021,6 @@ F:  drivers/mfd/da91??-*.c
>  F:   drivers/pinctrl/pinctrl-da90??.c
>  F:   drivers/power/supply/da9052-battery.c
>  F:   drivers/power/supply/da91??-*.c
> -F:   drivers/regulator/da903x.c
>  F:   drivers/regulator/da9???-regulator.[ch]
>  F:   drivers/regulator/slg51000-regulator.[ch]
>  F:   drivers/rtc/rtc-da90??.c
> -- 
> 2.17.1
>

[PATCH] regulator: rename da903x to da903x-regulator

2020-06-24 Thread Seth Forshee

This module shares the same name as its parent PMIC driver, which
confuses tools like kmod. Rename the regulator driver to avoid
such problems.

Signed-off-by: Seth Forshee 
---
 drivers/regulator/Makefile | 2 +-
 drivers/regulator/{da903x.c => da903x-regulator.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename drivers/regulator/{da903x.c => da903x-regulator.c} (100%)

diff --git a/drivers/regulator/Makefile b/drivers/regulator/Makefile
index e8f163371071..0796e4a47afa 100644
--- a/drivers/regulator/Makefile
+++ b/drivers/regulator/Makefile
@@ -31,7 +31,7 @@ obj-$(CONFIG_REGULATOR_BD70528) += bd70528-regulator.o
 obj-$(CONFIG_REGULATOR_BD71828) += bd71828-regulator.o
 obj-$(CONFIG_REGULATOR_BD718XX) += bd718x7-regulator.o
 obj-$(CONFIG_REGULATOR_BD9571MWV) += bd9571mwv-regulator.o
-obj-$(CONFIG_REGULATOR_DA903X) += da903x.o
+obj-$(CONFIG_REGULATOR_DA903X) += da903x-regulator.o
 obj-$(CONFIG_REGULATOR_DA9052) += da9052-regulator.o
 obj-$(CONFIG_REGULATOR_DA9055) += da9055-regulator.o
 obj-$(CONFIG_REGULATOR_DA9062) += da9062-regulator.o
diff --git a/drivers/regulator/da903x.c b/drivers/regulator/da903x-regulator.c
similarity index 100%
rename from drivers/regulator/da903x.c
rename to drivers/regulator/da903x-regulator.c
-- 
2.27.0

Re: [PATCH v2] selftests/ftrace: Use printf instead of echo in kprobe syntax error tests

2020-06-04 Thread Seth Forshee

On Fri, May 29, 2020 at 03:26:06PM -0600, Shuah Khan wrote:
> On 5/29/20 2:37 PM, Seth Forshee wrote:
> > On Wed, Mar 04, 2020 at 04:20:09PM -0600, Seth Forshee wrote:
> > > Test cases which use echo to write strings containing backslashes
> > > fail with some shells, as echo's treatment of backslashes in
> > > strings varies between shell implementations. Use printf instead,
> > > as it should behave consistently across different shells. This
> > > requires adjustments to the strings to escape \ and % characters.
> > > ftrace_errlog_check() must also re-escape these characters after
> > > processing them to remove ^ characters.
> > > 
> > > Signed-off-by: Seth Forshee 
> > 
> > Ping. Someone just asked me about this patch, and I noticed that it
> > hasn't been applied or received any feedback.
> > 
> 
> I pulled in this patch from Masami:
> 
> selftests/ftrace: Use printf for backslash included command
> https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git/commit/?h=next&id=8e923a2168afd221ea26e3d9716f21e9578b5c4d
> 
> Looks like a duplicate.
> 
> Seth,
> Is your patch still needed?

Nope, Masami's patch seems to fix the issues addressed by my patch.
Thanks!

Re: [PATCH v2] selftests/ftrace: Use printf instead of echo in kprobe syntax error tests

2020-05-29 Thread Seth Forshee

On Wed, Mar 04, 2020 at 04:20:09PM -0600, Seth Forshee wrote:
> Test cases which use echo to write strings containing backslashes
> fail with some shells, as echo's treatment of backslashes in
> strings varies between shell implementations. Use printf instead,
> as it should behave consistently across different shells. This
> requires adjustments to the strings to escape \ and % characters.
> ftrace_errlog_check() must also re-escape these characters after
> processing them to remove ^ characters.
> 
> Signed-off-by: Seth Forshee 

Ping. Someone just asked me about this patch, and I noticed that it
hasn't been applied or received any feedback.

> ---
> Changes in v2:
>  - Escape backslashes for a couple of additional tests
> 
>  .../testing/selftests/ftrace/test.d/functions |  6 +++---
>  .../test.d/kprobe/kprobe_syntax_errors.tc | 20 +--
>  2 files changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/tools/testing/selftests/ftrace/test.d/functions 
> b/tools/testing/selftests/ftrace/test.d/functions
> index 5d4550591ff9..b38c6eb029e8 100644
> --- a/tools/testing/selftests/ftrace/test.d/functions
> +++ b/tools/testing/selftests/ftrace/test.d/functions
> @@ -114,11 +114,11 @@ yield() {
>  }
>  
>  ftrace_errlog_check() { # err-prefix command-with-error-pos-by-^ command-file
> -pos=$(echo -n "${2%^*}" | wc -c) # error position
> -command=$(echo "$2" | tr -d ^)
> +pos=$(printf "${2%^*}" | wc -c) # error position
> +command=$(printf "$2" | sed -e 's/\^//g' -e 's/%/%%/g' -e 's/\\//g')
>  echo "Test command: $command"
>  echo > error_log
> -(! echo "$command" >> "$3" ) 2> /dev/null
> +(! printf "$command" >> "$3" ) 2> /dev/null
>  grep "$1: error:" -A 3 error_log
>  N=$(tail -n 1 error_log | wc -c)
>  # "  Command: " and "^\n" => 13
> diff --git 
> a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc 
> b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
> index ef1e9bafb098..039c03d230b9 100644
> --- a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
> +++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
> @@ -37,14 +37,14 @@ fi
>  
>  check_error 'p vfs_read ^$none_var'  # BAD_VAR
>  
> -check_error 'p vfs_read ^%none_reg'  # BAD_REG_NAME
> +check_error 'p vfs_read ^%%none_reg' # BAD_REG_NAME
>  check_error 'p vfs_read ^@12345678abcde' # BAD_MEM_ADDR
>  check_error 'p vfs_read ^@+10'   # FILE_ON_KPROBE
>  
>  grep -q "imm-value" README && \
> -check_error 'p vfs_read arg1=\^x'# BAD_IMM
> +check_error 'p vfs_read arg1=\\^x'   # BAD_IMM
>  grep -q "imm-string" README && \
> -check_error 'p vfs_read arg1=\"abcd^'# IMMSTR_NO_CLOSE
> +check_error 'p vfs_read arg1=\\"abcd^'   # IMMSTR_NO_CLOSE
>  
>  check_error 'p vfs_read ^+0@0)'  # DEREF_NEED_BRACE
>  check_error 'p vfs_read ^+0ab1(@0)'  # BAD_DEREF_OFFS
> @@ -80,7 +80,7 @@ check_error 'p vfs_read arg1=^' # 
> NO_ARG_BODY
>  # instruction boundary check is valid on x86 (at this moment)
>  case $(uname -m) in
>x86_64|i[3456]86)
> -echo 'p vfs_read' > kprobe_events
> +printf 'p vfs_read' > kprobe_events
>  if grep -q FTRACE ../kprobes/list ; then
>   check_error 'p ^vfs_read+3' # BAD_INSN_BNDRY (only if 
> function-tracer is enabled)
>  fi
> @@ -89,13 +89,13 @@ esac
>  
>  # multiprobe errors
>  if grep -q "Create/append/" README && grep -q "imm-value" README; then
> -echo 'p:kprobes/testevent _do_fork' > kprobe_events
> +printf 'p:kprobes/testevent _do_fork' > kprobe_events
>  check_error '^r:kprobes/testevent do_exit'   # DIFF_PROBE_TYPE
> -echo 'p:kprobes/testevent _do_fork abcd=\1' > kprobe_events
> -check_error 'p:kprobes/testevent _do_fork ^bcd=\1'   # DIFF_ARG_TYPE
> -check_error 'p:kprobes/testevent _do_fork ^abcd=\1:u8'   # DIFF_ARG_TYPE
> -check_error 'p:kprobes/testevent _do_fork ^abcd=\"foo"'  # DIFF_ARG_TYPE
> -check_error '^p:kprobes/testevent _do_fork abcd=\1'  # SAME_PROBE
> +printf 'p:kprobes/testevent _do_fork abcd=\\1' > kprobe_events
> +check_error 'p:kprobes/testevent _do_fork ^bcd=\\1'  # DIFF_ARG_TYPE
> +check_error 'p:kprobes/testevent _do_fork ^abcd=\\1:u8'  # DIFF_ARG_TYPE
> +check_error 'p:kprobes/testevent _do_fork ^abcd=\\"foo"'# DIFF_ARG_TYPE
> +check_error '^p:kprobes/testevent _do_fork abcd=\\1' # SAME_PROBE
>  fi
>  
>  exit 0
> -- 
> 2.25.0
>

[PATCH] sched: Add ASSEMBLY guards around struct clone_args

2019-09-17 Thread Seth Forshee

The addition of struct clone_args to uapi/linux/sched.h is not
protected by __ASSEMBLY__ guards, causing a FTBFS for glibc on
RISC-V. Add the guards to fix this.

Fixes: 7f192e3cd316 ("fork: add clone3")
Signed-off-by: Seth Forshee 
---
 include/uapi/linux/sched.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b3105ac1381a..851ff1feadd5 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -33,6 +33,7 @@
 #define CLONE_NEWNET   0x4000  /* New network namespace */
 #define CLONE_IO   0x8000  /* Clone io context */
 
+#ifndef __ASSEMBLY__
 /*
  * Arguments for the clone3 syscall
  */
@@ -46,6 +47,7 @@ struct clone_args {
__aligned_u64 stack_size;
__aligned_u64 tls;
 };
+#endif
 
 /*
  * Scheduling policies
-- 
2.20.1

Re: [PATCH 0/1] Small potential fix for shiftfs

2019-08-15 Thread Seth Forshee

On Thu, Aug 15, 2019 at 04:36:02PM +0200, Oleksandr Natalenko wrote:
> Hey, people.
> 
> I was lurking at shiftfs just out of curiosity and managed to bump into
> a compiler warning that is (as I suppose) easily fixed by the subsequent
> patch.
> 
> Feel free to drag this into your Ubuntu tree if needed. I haven't played
> with it yet, just compiling (because I'm looking for something that is
> bindfs but in-kernel) :).

Thanks for the patch. Christian has actually already sent a patch for
this along with another patch which is still under review:

https://lists.ubuntu.com/archives/kernel-team/2019-July/102449.html

Also note that currently shiftfs is only in Ubuntu distro kernels, and
Ubuntu-specific kernel patches should be directed at
kernel-t...@lists.ubuntu.com rather than lkml. If you'll be at LPC,
there's a session to discuss the future of upstreaming shiftfs that you
might find interesting.

Thanks!
Seth

[PATCH v2] kbuild: add -fcf-protection=none when using retpoline flags

2019-07-17 Thread Seth Forshee

The gcc -fcf-protection=branch option is not compatible with
-mindirect-branch=thunk-extern. The latter is used when
CONFIG_RETPOLINE is selected, and this will fail to build with
a gcc which has -fcf-protection=branch enabled by default. Adding
-fcf-protection=none when building with retpoline enabled
prevents such build failures.

Signed-off-by: Seth Forshee 
---
 Makefile | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/Makefile b/Makefile
index 3e4868a6498b..73a94d1db2b6 100644
--- a/Makefile
+++ b/Makefile
@@ -878,6 +878,12 @@ KBUILD_CFLAGS   += $(call 
cc-option,-Werror=designated-init)
 # change __FILE__ to the relative path from the srctree
 KBUILD_CFLAGS  += $(call cc-option,-fmacro-prefix-map=$(srctree)/=)
 
+# ensure -fcf-protection is disabled when using retpoline as it is
+# incompatible with -mindirect-branch=thunk-extern
+ifdef CONFIG_RETPOLINE
+KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none,)
+endif
+
 # use the deterministic mode of AR if available
 KBUILD_ARFLAGS := $(call ar-option,D)
 
-- 
2.20.1

Re: [kbuild:kbuild 5/19] drivers/atm/eni.o: warning: objtool: eni_init_one()+0xe42: indirect call found in RETPOLINE build

2019-07-17 Thread Seth Forshee

On Wed, Jul 17, 2019 at 11:52:07AM +0900, Masahiro Yamada wrote:
> On Wed, Jul 17, 2019 at 1:20 AM Josh Poimboeuf  wrote:
> >
> > On Tue, Jul 16, 2019 at 07:42:49AM -0500, Seth Forshee wrote:
> > > On Tue, Jul 16, 2019 at 03:57:24PM +0900, Masahiro Yamada wrote:
> > > > (+ Josh Poimboeuf)
> > > >
> > > > On Tue, Jul 16, 2019 at 8:44 AM kbuild test robot  
> > > > wrote:
> > > > >
> > > > > tree:   
> > > > > https://kernel.googlesource.com/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git
> > > > >  kbuild
> > > > > head:   0ff0c3753e06c0420c80dac1b0187a442b372acb
> > > > > commit: 2eaf4e87ba258cc3f27e486cdf32d5ba76303c6f [5/19] kbuild: add 
> > > > > -fcf-protection=none to retpoline flags
> > > > > config: x86_64-randconfig-s2-07160214 (attached as .config)
> > > > > compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
> > > > > reproduce:
> > > > > git checkout 2eaf4e87ba258cc3f27e486cdf32d5ba76303c6f
> > > > > # save the attached .config to linux build tree
> > > > > make ARCH=x86_64
> > > >
> > > > 0-day bot reports objtool warnings with the following applied:
> > > > https://patchwork.kernel.org/patch/11037379/
> > > >
> > > > I have no idea about objtool.
> > > >
> > > > Is it better to drop this patch for now?
> > >
> > > I'm surprised that the change would have any impact on a build with
> > > gcc-4.9, since -fcf-protection seems to have been introduced in gcc-8. I
> > > guess there's no full build log that would let us see the actual flags
> > > passed to the compiler.
> > >
> > > I'll try to reproduce this result. If you think the patch should be
> > > dropped in the meantime, that's fine.
> >
> > The problem with this patch is that it's breaking the following check in
> > arch/x86/Makefile.  GCC 4.9 doesn't support retpolines, so it's supposed
> > to fail with the below error.
> >
> > ifdef CONFIG_RETPOLINE
> > ifeq ($(RETPOLINE_CFLAGS),)
> > @echo "You are building kernel with non-retpoline compiler." >&2
> > @echo "Please update your compiler." >&2
> > @false
> > endif
> > endif
> >
> > Maybe the flags should be placed in another variable other than
> > RETPOLINE_CFLAGS.
> 
> 
> 
> Josh,
> Thanks. You are right.
> 
> 
> Seth,
> I think you can add the flag to KBUILD_CFLAGS.
> 
> If you want to make sure this does not affect non-retpoline
> build, you can surround the code with ifdef.
> 
> ifdef CONFIG_RETPOLINE
> KBUILD_CFLAGS  += $(call cc-option,-fcf-protection=none)
> endif

Thanks, I'll send an updated patch shortly.

Seth

Re: [kbuild:kbuild 5/19] drivers/atm/eni.o: warning: objtool: eni_init_one()+0xe42: indirect call found in RETPOLINE build

2019-07-16 Thread Seth Forshee

On Tue, Jul 16, 2019 at 03:57:24PM +0900, Masahiro Yamada wrote:
> (+ Josh Poimboeuf)
> 
> On Tue, Jul 16, 2019 at 8:44 AM kbuild test robot  wrote:
> >
> > tree:   
> > https://kernel.googlesource.com/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git
> >  kbuild
> > head:   0ff0c3753e06c0420c80dac1b0187a442b372acb
> > commit: 2eaf4e87ba258cc3f27e486cdf32d5ba76303c6f [5/19] kbuild: add 
> > -fcf-protection=none to retpoline flags
> > config: x86_64-randconfig-s2-07160214 (attached as .config)
> > compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
> > reproduce:
> > git checkout 2eaf4e87ba258cc3f27e486cdf32d5ba76303c6f
> > # save the attached .config to linux build tree
> > make ARCH=x86_64
> 
> 0-day bot reports objtool warnings with the following applied:
> https://patchwork.kernel.org/patch/11037379/
> 
> I have no idea about objtool.
> 
> Is it better to drop this patch for now?

I'm surprised that the change would have any impact on a build with
gcc-4.9, since -fcf-protection seems to have been introduced in gcc-8. I
guess there's no full build log that would let us see the actual flags
passed to the compiler.

I'll try to reproduce this result. If you think the patch should be
dropped in the meantime, that's fine.

Thanks,
Seth

> 
> Thanks.
> 
> 
> 
> > If you fix the issue, kindly add following tag
> > Reported-by: kbuild test robot 
> >
> > All warnings (new ones prefixed by >>):
> >
> >drivers/atm/eni.o: warning: objtool: eni_do_release()+0x1a: indirect 
> > call found in RETPOLINE build
> >drivers/atm/eni.o: warning: objtool: do_tx()+0x1be: indirect call found 
> > in RETPOLINE build
> >drivers/atm/eni.o: warning: objtool: eni_send()+0x15b: indirect call 
> > found in RETPOLINE build
> >drivers/atm/eni.o: warning: objtool: eni_send()+0x1b4: indirect call 
> > found in RETPOLINE build
> >drivers/atm/eni.o: warning: objtool: eni_send()+0x24d: indirect call 
> > found in RETPOLINE build
> >drivers/atm/eni.o: warning: objtool: eni_int()+0xd1: indirect call found 
> > in RETPOLINE build
> >drivers/atm/eni.o: warning: objtool: poll_rx.isra.16()+0x99: indirect 
> > call found in RETPOLINE build
> >drivers/atm/eni.o: warning: objtool: poll_rx.isra.16()+0xf7: indirect 
> > call found in RETPOLINE build
> >drivers/atm/eni.o: warning: objtool: poll_rx.isra.16()+0x20c: indirect 
> > call found in RETPOLINE build
> >drivers/atm/eni.o: warning: objtool: poll_rx.isra.16()+0x266: indirect 
> > call found in RETPOLINE build
> >drivers/atm/eni.o: warning: objtool: eni_ioctl()+0x54: indirect call 
> > found in RETPOLINE build
> >drivers/atm/eni.o: warning: objtool: eni_tasklet()+0x3f7: indirect call 
> > found in RETPOLINE build
> >drivers/atm/eni.o: warning: objtool: eni_tasklet()+0x420: indirect call 
> > found in RETPOLINE build
> >drivers/atm/eni.o: warning: objtool: eni_tasklet()+0x62f: indirect call 
> > found in RETPOLINE build
> >drivers/atm/eni.o: warning: objtool: eni_tasklet()+0x673: indirect call 
> > found in RETPOLINE build
> > >> drivers/atm/eni.o: warning: objtool: eni_init_one()+0xe42: indirect call 
> > >> found in RETPOLINE build
> >drivers/atm/eni.o: warning: objtool: do_rx_dma.constprop.28()+0xaf: 
> > indirect call found in RETPOLINE build
> >drivers/atm/eni.o: warning: objtool: do_rx_dma.constprop.28()+0x49c: 
> > indirect call found in RETPOLINE build
> > --
> >net//batman-adv/sysfs.o: warning: objtool: 
> > batadv_show_gw_sel_class()+0x6c: indirect call found in RETPOLINE build
> > >> net//batman-adv/sysfs.o: warning: objtool: 
> > >> __batadv_store_uint_attr.isra.9.constprop.10()+0xb7: indirect call found 
> > >> in RETPOLINE build
> >net//batman-adv/sysfs.o: warning: objtool: 
> > batadv_store_gw_sel_class()+0x8b: indirect call found in RETPOLINE build
> > --
> >net//wireless/wext-compat.o: warning: objtool: 
> > cfg80211_wext_siwpower()+0xd7: indirect call found in RETPOLINE build
> >net//wireless/wext-compat.o: warning: objtool: 
> > cfg80211_wext_siwpower()+0x257: indirect call found in RETPOLINE build
> >net//wireless/wext-compat.o: warning: objtool: 
> > cfg80211_wext_siwpower()+0x2f4: indirect call found in RETPOLINE build
> >net//wireless/wext-compat.o: warning: objtool: 
> > cfg80211_wext_siwretry()+0xfb: indirect call found in RETPOLINE build
> >net//wireless/wext-compat.o: warning: objtool: 
> > cfg80211_wext_siwretry()+0x219: indirect call found in RETPOLINE build
> >net//wireless/wext-compat.o: warning: objtool: 
> > cfg80211_wext_siwretry()+0x2b1: indirect call found in RETPOLINE build
> >net//wireless/wext-compat.o: warning: objtool: 
> > cfg80211_wext_siwfrag()+0xce: indirect call found in RETPOLINE build
> >net//wireless/wext-compat.o: warning: objtool: 
> > cfg80211_wext_siwfrag()+0x1f6: indirect call found in RETPOLINE build
> >net//wireless/wext-compat.o: warning: objtool: 
> > cfg80211_wext_siwfrag()+0x2c8: indirect call found in RETPOLINE bui

[PATCH] kbuild: add -fcf-protection=none to retpoline flags

2019-07-09 Thread Seth Forshee

-mindirect-branch and -fcf-protection are not compatible, and
so kernel builds fail with a gcc build where -fcf-protection is
enabled by default. Add -fcf-protection=none to the retpoline
flags to fix this.

Signed-off-by: Seth Forshee 
---
 Makefile | 4 
 1 file changed, 4 insertions(+)

diff --git a/Makefile b/Makefile
index 3e4868a6498b..050f11d19777 100644
--- a/Makefile
+++ b/Makefile
@@ -636,6 +636,10 @@ RETPOLINE_CFLAGS_CLANG := -mretpoline-external-thunk
 RETPOLINE_VDSO_CFLAGS_CLANG := -mretpoline
 RETPOLINE_CFLAGS := $(call cc-option,$(RETPOLINE_CFLAGS_GCC),$(call 
cc-option,$(RETPOLINE_CFLAGS_CLANG)))
 RETPOLINE_VDSO_CFLAGS := $(call cc-option,$(RETPOLINE_VDSO_CFLAGS_GCC),$(call 
cc-option,$(RETPOLINE_VDSO_CFLAGS_CLANG)))
+# -mindirect-branch is incompatible with -fcf-protection, so ensure the
+# latter is disabled
+RETPOLINE_CFLAGS += $(call cc-option,-fcf-protection=none,)
+RETPOLINE_VDSO_CFLAGS += $(call cc-option,-fcf-protection=none,)
 export RETPOLINE_CFLAGS
 export RETPOLINE_VDSO_CFLAGS
 
-- 
2.20.1

Re: [PATCH] x86/ima: require signed kernel modules

2019-02-08 Thread Seth Forshee

On Tue, Feb 05, 2019 at 01:52:21PM -0500, Mimi Zohar wrote:
> On Tue, 2019-02-05 at 12:32 -0600, Seth Forshee wrote:
> > On Tue, Feb 05, 2019 at 11:47:24AM -0500, Mimi Zohar wrote:
> > > Hi Seth,
> > > 
> > > On Tue, 2019-02-05 at 09:18 -0600, Seth Forshee wrote:
> > > > On Thu, Jan 31, 2019 at 02:18:59PM -0500, Mimi Zohar wrote:
> > > > > Require signed kernel modules on systems with secure boot mode 
> > > > > enabled.
> > > > > 
> > > > > To coordinate between appended kernel module signatures and IMA
> > > > > signatures, only define an IMA MODULE_CHECK policy rule if
> > > > > CONFIG_MODULE_SIG is not enabled.
> > > > > 
> > > > > This patch defines a function named set_module_sig_required() and 
> > > > > renames
> > > > > is_module_sig_enforced() to is_module_sig_enforced_or_required().  The
> > > > > call to set_module_sig_required() is dependent on 
> > > > > CONFIG_IMA_ARCH_POLICY
> > > > > being enabled.
> > > > > 
> > > > > Signed-off-by: Mimi Zohar 
> > > > 
> > > > With respect to interactions with the kernel lockdown patches, this
> > > > looks better than the patches I saw previously. I don't feel like I know
> > > > enough about what's going on with IMA to ack the patch, but I feel
> > > > confident that it's at least not going to break signature enforcement
> > > > for us.
> > > 
> > > Thank you for testing!  Could this be translated into a "tested-by"
> > > "(for w/lockdown patches)"?
> > 
> > Yeah, that's fine. To be clear about what I tested, I've confirmed that
> > it doesn't interfere with requiring signed modules under lockdown with
> > CONFIG_IMA_ARCH_POLICY=n and IMA appraisal enabled.
> > 
> > Tested-by: Seth Forshee 
> 
> Oh!  You've disabled the coordination of the two signature
> verification methods.  Any chance you could test with
> "CONFIG_IMA_ARCH_POLICY" enabled?

Ok, I've tested this now and it also seems to be working. However it
does seem redundant to have CONFIG_IMA_ARCH_POLICY alongside lockdown,
as it doesn't enforce anything not already being enforced by lockdown.

Seth

Re: [PATCH] x86/ima: require signed kernel modules

2019-02-05 Thread Seth Forshee

On Tue, Feb 05, 2019 at 11:47:24AM -0500, Mimi Zohar wrote:
> Hi Seth,
> 
> On Tue, 2019-02-05 at 09:18 -0600, Seth Forshee wrote:
> > On Thu, Jan 31, 2019 at 02:18:59PM -0500, Mimi Zohar wrote:
> > > Require signed kernel modules on systems with secure boot mode enabled.
> > > 
> > > To coordinate between appended kernel module signatures and IMA
> > > signatures, only define an IMA MODULE_CHECK policy rule if
> > > CONFIG_MODULE_SIG is not enabled.
> > > 
> > > This patch defines a function named set_module_sig_required() and renames
> > > is_module_sig_enforced() to is_module_sig_enforced_or_required().  The
> > > call to set_module_sig_required() is dependent on CONFIG_IMA_ARCH_POLICY
> > > being enabled.
> > > 
> > > Signed-off-by: Mimi Zohar 
> > 
> > With respect to interactions with the kernel lockdown patches, this
> > looks better than the patches I saw previously. I don't feel like I know
> > enough about what's going on with IMA to ack the patch, but I feel
> > confident that it's at least not going to break signature enforcement
> > for us.
> 
> Thank you for testing!  Could this be translated into a "tested-by"
> "(for w/lockdown patches)"?

Yeah, that's fine. To be clear about what I tested, I've confirmed that
it doesn't interfere with requiring signed modules under lockdown with
CONFIG_IMA_ARCH_POLICY=n and IMA appraisal enabled.

Tested-by: Seth Forshee

Re: [PATCH] x86/ima: require signed kernel modules

2019-02-05 Thread Seth Forshee

On Thu, Jan 31, 2019 at 02:18:59PM -0500, Mimi Zohar wrote:
> Require signed kernel modules on systems with secure boot mode enabled.
> 
> To coordinate between appended kernel module signatures and IMA
> signatures, only define an IMA MODULE_CHECK policy rule if
> CONFIG_MODULE_SIG is not enabled.
> 
> This patch defines a function named set_module_sig_required() and renames
> is_module_sig_enforced() to is_module_sig_enforced_or_required().  The
> call to set_module_sig_required() is dependent on CONFIG_IMA_ARCH_POLICY
> being enabled.
> 
> Signed-off-by: Mimi Zohar 

With respect to interactions with the kernel lockdown patches, this
looks better than the patches I saw previously. I don't feel like I know
enough about what's going on with IMA to ack the patch, but I feel
confident that it's at least not going to break signature enforcement
for us.

Thanks,
Seth

Re: [RFC PATCH 6/6] shiftfs: support nested shiftfs mounts

2018-11-02 Thread Seth Forshee

On Fri, Nov 02, 2018 at 03:16:05PM +0200, Amir Goldstein wrote:
> On Fri, Nov 2, 2018 at 2:44 PM Seth Forshee  
> wrote:
> >
> > On Fri, Nov 02, 2018 at 12:02:45PM +0200, Amir Goldstein wrote:
> > > On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee  
> > > wrote:
> > > >
> > > > shiftfs mounts cannot be nested for two reasons -- global
> > > > CAP_SYS_ADMIN is required to set up a mark mount, and a single
> > > > functional shiftfs mount meets the filesystem stacking depth
> > > > limit.
> > > >
> > > > The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> > > > ids in a mount must be within that mount's s_user_ns, so all that
> > > > is needed is CAP_SYS_ADMIN within that s_user_ns.
> > > >
> > > > The stack depth issue can be worked around with a couple of
> > > > adjustments. First, a mark mount doesn't really need to count
> > > > against the stacking depth as it doesn't contribute to the call
> > > > stack depth during filesystem operations. Therefore the mount
> > > > over the mark mount only needs to count as one more than the
> > > > lower filesystems stack depth.
> > >
> > > That's true, but it also highlights the point that the "mark" sb is
> > > completely unneeded and it really is just a nice little hack.
> > > All the information it really stores is a lower mount reference,
> > > a lower root dentry and a declarative statement "I am shiftable!".
> >
> > Seems I should have saved some of the things I said in my previous
> > response for this one. As you no doubt gleaned from that email, I agree
> > with this.
> >
> > > Come to think about it, "container shiftable" really isn't that different 
> > > from
> > > NFS export with "no_root_squash" and auto mounted USB drive.
> > > I mean the shifting itself is different of course, but the
> > > declaration, not so much.
> > > If I am allowing sudoers on another machine to mess with root owned
> > > files visible
> > > on my machine, I am pretty much have the same issues as container admins
> > > accessing root owned files on my init_user_ns filesystem. In all those 
> > > cases,
> > > I'd better not be executing suid binaries from the untrusted "external" 
> > > source.
> > >
> > > Instead of mounting a dummy filesystem to make the declaration, you could
> > > get the same thing with:
> > >mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
> > > and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED
> > > or whatnot)  constant to uapi and manage to come up good man page 
> > > description.
> > >
> > > Then users could actually mount a filesystem in init_user_ns MS_EXTERN and
> > > avoid the extra bind mount step (for a full filesystem tree export).
> > > Declaring a mounted image MS_EXTERN has merits on its own even without
> > > containers and shitfs, for example with pluggable storage. Other LSMs 
> > > could make
> > > good use of that declaration.
> >
> > I'm missing how we figure out the target user ns in this scheme. We need
> > a context with privileges towards the source path's s_user_ns to say
> > it's okay to shift ids for the files under the source path, and then we
> > need a target user ns for the id shifts. Currently the target is
> > current_user_ns when the final shiftfs mount is created.
> >
> > So, how do we determine the target s_user_ns in your scheme?
> >
> 
> Unless I am completely misunderstanding shiftfs, I think we are saying the
> same thing. You said you wish to get rid of the "mark" fs and that you had
> a POC of implementing the "mark" using xattr.

The PoC was using filesystem contexts and (an earlier version of) the
new mount API, but yes I do think we're in agreement about the mark fs
being awkward.

> I'm just saying another option to implement the mark is using a super block
> flag and you get the target s_user_ns from mnt_sb.
> I did miss the fact that a mount flag is not enough, so that makes the bind
> mount concept fail. Unless, maybe, the mount in the container is a slave
> mount and the "shiftable" mark is set on the master.
> I know too little about mount ns vs. user ns to tell if any of this makes 
> sense.

We need a source and target user ns for the shifting, currently they are
the lower filesystem's s_user_ns and t

Re: [RFC PATCH 6/6] shiftfs: support nested shiftfs mounts

2018-11-02 Thread Seth Forshee

On Fri, Nov 02, 2018 at 12:02:45PM +0200, Amir Goldstein wrote:
> On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee  
> wrote:
> >
> > shiftfs mounts cannot be nested for two reasons -- global
> > CAP_SYS_ADMIN is required to set up a mark mount, and a single
> > functional shiftfs mount meets the filesystem stacking depth
> > limit.
> >
> > The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
> > ids in a mount must be within that mount's s_user_ns, so all that
> > is needed is CAP_SYS_ADMIN within that s_user_ns.
> >
> > The stack depth issue can be worked around with a couple of
> > adjustments. First, a mark mount doesn't really need to count
> > against the stacking depth as it doesn't contribute to the call
> > stack depth during filesystem operations. Therefore the mount
> > over the mark mount only needs to count as one more than the
> > lower filesystems stack depth.
> 
> That's true, but it also highlights the point that the "mark" sb is
> completely unneeded and it really is just a nice little hack.
> All the information it really stores is a lower mount reference,
> a lower root dentry and a declarative statement "I am shiftable!".

Seems I should have saved some of the things I said in my previous
response for this one. As you no doubt gleaned from that email, I agree
with this.

> Come to think about it, "container shiftable" really isn't that different from
> NFS export with "no_root_squash" and auto mounted USB drive.
> I mean the shifting itself is different of course, but the
> declaration, not so much.
> If I am allowing sudoers on another machine to mess with root owned
> files visible
> on my machine, I am pretty much have the same issues as container admins
> accessing root owned files on my init_user_ns filesystem. In all those cases,
> I'd better not be executing suid binaries from the untrusted "external" 
> source.
> 
> Instead of mounting a dummy filesystem to make the declaration, you could
> get the same thing with:
>mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC)
> and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED
> or whatnot)  constant to uapi and manage to come up good man page description.
> 
> Then users could actually mount a filesystem in init_user_ns MS_EXTERN and
> avoid the extra bind mount step (for a full filesystem tree export).
> Declaring a mounted image MS_EXTERN has merits on its own even without
> containers and shitfs, for example with pluggable storage. Other LSMs could 
> make
> good use of that declaration.

I'm missing how we figure out the target user ns in this scheme. We need
a context with privileges towards the source path's s_user_ns to say
it's okay to shift ids for the files under the source path, and then we
need a target user ns for the id shifts. Currently the target is
current_user_ns when the final shiftfs mount is created.

So, how do we determine the target s_user_ns in your scheme?

> 
> >
> > Second, when the lower mount is shiftfs we can just skip down to
> > that mount's lower filesystem and shift ids relative to that.
> > There is no reason to shift ids twice, and the lower path has
> > already been marked safe for id shifting by a user privileged
> > towards all ids in that mount's user ns.
> >
> > Signed-off-by: Seth Forshee 
> > ---
> >  fs/shiftfs.c | 68 +++-
> >  1 file changed, 46 insertions(+), 22 deletions(-)
> >
> > diff --git a/fs/shiftfs.c b/fs/shiftfs.c
> > index b19af7b2fe75..008ace2842b9 100644
> > --- a/fs/shiftfs.c
> > +++ b/fs/shiftfs.c
> > @@ -930,7 +930,7 @@ static int shiftfs_fill_super(struct super_block *sb, 
> > void *raw_data,
> > struct shiftfs_data *data = raw_data;
> > char *name = kstrdup(data->path, GFP_KERNEL);
> > int err = -ENOMEM;
> > -   struct shiftfs_super_info *ssi = NULL;
> > +   struct shiftfs_super_info *ssi = NULL, *mp_ssi;
> > struct path path;
> > struct dentry *dentry;
> >
> > @@ -946,11 +946,7 @@ static int shiftfs_fill_super(struct super_block *sb, 
> > void *raw_data,
> > if (err)
> > goto out;
> >
> > -   /* to mark a mount point, must be real root */
> > -   if (ssi->mark && !capable(CAP_SYS_ADMIN))
> > -   goto out;
> > -
> > -   /* else to mount a mark, must be userns admin */
> > +   /* to mount a mark, must be userns admin */
> > if (!ssi->mark &

Re: [RFC PATCH 0/6] shiftfs fixes and enhancements

2018-11-02 Thread Seth Forshee

On Fri, Nov 02, 2018 at 10:59:38AM +0200, Amir Goldstein wrote:
> [cc: linux-unionfs
> It should the mailing list for *all* "stacking fs".
> We have a lot of common problems I think ;-) ]
> 
> On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee  
> wrote:
> >
> > I've done some work to fix and enhance shiftfs for a number of use
> > cases, so that we would have an idea what a more full-featured shiftfs
> > would look like. I'm intending for these to serve as a point of
> > reference for discussing id shifting mounts/filesystems at plumbers in a
> > couple of weeks [1].
> >
> > Note that these are based on 4.18, and I've added a small fix to James'
> > most recent patch to fix a build issue there. To work with 4.19 they
> > will need a number of updates due to changes in the vfs.
> >
> 
> Seth,
> 
> I like the way you addressed my concerns about nesting and stacking depth.
> Will provide specific nits on patch.
> 
> In preparation to the Plumbers talk (which I will not be attending), I wanted 
> to
> get your opinion on the matters I brought up last time:
> https://marc.info/?l=linux-fsdevel&m=153013920904844&w=2

I want the session at plumbers to not be a "talk" but more of a
discussion of the sorts of things you raise below. But I'm also happy to
talk about them here.

> 1) Having seen what it takes to catch up with overlayfs w.r.t inotify bugs
> and having peeked into 4.19 to see what work you still have lined up for you
> to bring shitfs up to speed with vfs, did you have time to look into my 
> proposal
> for sharing code with overlayfs in the manner that I have implemented the
> snapshotfs POC?
> https://github.com/amir73il/linux/commit/25416757f2ca47759f59b115e6461b11898c4f06
> 
> Even if you end up not saving a single line of code for shiftfs v1
> meaning that all shiftfs_inode_ops are completely separate implementation
> from overlayfs inode ops, this may still be beneficial to shitfs in
> the long run.
> For example, you may, in fact, won't need to change anything to work with 
> v4.19.
> shittfs (as an overlayfs alias) would use ovl_file_operations and
> shiftfs_inode_ops.

I don't recall seeing the shapshotfs patches before. If id shifting
remains an overlay-style fs and not a feature of the vfs, then I
absolutely think something like this will make life much easier.

> Another example, from the top of my head, see what it took to add NFS export
> support to snapshotfs, because of the code reuse with overlayfs:
> https://github.com/amir73il/linux/commit/d082eb615133490ec26fa2efaa80ed4723860893
> Its practically the exact same implementation shiftfs would need,
> so in the far future, shitfs and snapshotfs can share the same
> export_operations.
> 
> 2) Regarding this part:
> +   /*
> +* this part is visible unshifted, so make sure no
> +* executables that could be used to give suid
> +* privileges
> +*/
> +   sb->s_iflags = SB_I_NOEXEC;
> 
> Why would you want to make the unshifted fs visible at all?
> Is there a requirement for container users to access the unshifted fs
> content? Is there a requirement for container admin to mount shitfted fs
> NOT from the root of the marked mount?
> 
> If those are not required, then I propose NOOP inode operations for
> the unshifted fs, specifically, empty readdir, just enough ops to be able
> to use the mark mount point as the shitfed mount source - no more.

This is part of the original implementation that I didn't touch with
these updates. Imo the mark mount is kind of kludgy, and I'd like to see
it done a different way.

A couple of alternatives have been suggested. One was to use xattrs for
marking, or I did a PoC with an older version of the new mount API
patches where an fsfd was passed to the less privileged context that it
could attach to its mount tree:

https://lkml.kernel.org/r/20180717133847.GB15620@ubuntu-xps13

Either of these can accomplish the same things as the mark mount with
better control over who can create an id-shifted mount of the subtree.

However if the mark mount is kept then no-op inode operations seems
reasonable to me.

Thanks,
Seth

[RFC PATCH 2/6] shiftfs: map inodes to lower fs inodes instead of dentries

2018-11-01 Thread Seth Forshee

Since shiftfs inodes map to dentries in the lower fs, two links
to the same lowerfs inode create separate inodes in shiftfs. This
causes problems for inotify, as a watch on one of these files in
shiftfs will not see changes made to the underlying inode via the
other file.

Fix this by updating shiftfs to map its inodes to corresponding
inodes in the lower fs. Inodes are cached using the pointer to
the lower fs inode as the hash value. This fixes a second inotify
problem whereby a watch is set on an inode, the dentry is evicted
from the cache, and events on a new dentry are not reported back
to the watch original inode.

Signed-off-by: Seth Forshee 
---
 fs/shiftfs.c | 105 ++-
 1 file changed, 79 insertions(+), 26 deletions(-)

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index 6028244c2f42..b179a1be7bc1 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -22,6 +22,7 @@ struct shiftfs_super_info {
 
 static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
   struct dentry *dentry);
+static void shiftfs_init_inode(struct inode *inode, umode_t mode);
 
 enum {
OPT_MARK,
@@ -278,15 +279,27 @@ static void shiftfs_fill_inode(struct inode *inode, 
struct dentry *dentry)
inode->i_opflags |= IOP_NOFOLLOW;
 
inode->i_mapping = reali->i_mapping;
-   inode->i_private = dentry;
+   inode->i_private = reali;
+   set_nlink(inode, reali->i_nlink);
+}
+
+static int shiftfs_inode_test(struct inode *inode, void *data)
+{
+   return inode->i_private == data;
+}
+
+static int shiftfs_inode_set(struct inode *inode, void *data)
+{
+   inode->i_private = data;
+   return 0;
 }
 
 static int shiftfs_make_object(struct inode *dir, struct dentry *dentry,
   umode_t mode, const char *symlink,
   struct dentry *hardlink, bool excl)
 {
-   struct dentry *real = dir->i_private, *new = dentry->d_fsdata;
-   struct inode *reali = real->d_inode, *newi;
+   struct dentry *new = dentry->d_fsdata;
+   struct inode *reali = dir->i_private, *inode, *newi;
const struct inode_operations *iop = reali->i_op;
int err;
const struct cred *oldcred, *newcred;
@@ -310,9 +323,14 @@ static int shiftfs_make_object(struct inode *dir, struct 
dentry *dentry,
return -EINVAL;
 
 
-   newi = shiftfs_new_inode(dentry->d_sb, mode, NULL);
-   if (!newi)
-   return -ENOMEM;
+   if (hardlink) {
+   inode = d_inode(hardlink);
+   ihold(inode);
+   } else {
+   inode = shiftfs_new_inode(dentry->d_sb, mode, NULL);
+   if (!inode)
+   return -ENOMEM;
+   }
 
oldcred = shiftfs_new_creds(&newcred, dentry->d_sb);
 
@@ -341,16 +359,33 @@ static int shiftfs_make_object(struct inode *dir, struct 
dentry *dentry,
if (err)
goto out_dput;
 
-   shiftfs_fill_inode(newi, new);
+   if (hardlink) {
+   WARN_ON(inode->i_private != new->d_inode);
+   inc_nlink(inode);
+   } else {
+   shiftfs_fill_inode(inode, new);
+
+   newi = inode_insert5(inode, (unsigned long)new->d_inode,
+shiftfs_inode_test, shiftfs_inode_set,
+new->d_inode);
+   if (newi != inode) {
+   pr_warn_ratelimited("shiftfs: newly created inode found 
in cache\n");
+   iput(inode);
+   inode = newi;
+   }
+   }
+
+   if (inode->i_state & I_NEW)
+   unlock_new_inode(inode);
 
-   d_instantiate(dentry, newi);
+   d_instantiate(dentry, inode);
 
new = NULL;
-   newi = NULL;
+   inode = NULL;
 
  out_dput:
dput(new);
-   iput(newi);
+   iput(inode);
inode_unlock(reali);
 
return err;
@@ -386,8 +421,8 @@ static int shiftfs_symlink(struct inode *dir, struct dentry 
*dentry,
 
 static int shiftfs_rm(struct inode *dir, struct dentry *dentry, bool rmdir)
 {
-   struct dentry *real = dir->i_private, *new = dentry->d_fsdata;
-   struct inode *reali = real->d_inode;
+   struct dentry *new = dentry->d_fsdata;
+   struct inode *reali = dir->i_private;
int err;
const struct cred *oldcred, *newcred;
 
@@ -400,6 +435,13 @@ static int shiftfs_rm(struct inode *dir, struct dentry 
*dentry, bool rmdir)
else
err = vfs_unlink(reali, new, NULL);
 
+   if (!err) {
+   if (rmdir)
+   clear_nlink(d_inode(dentry));
+   else
+   drop_nlink(d_inode(dentry));
+   }
+
shiftfs_old_creds(oldcred, &newcred);
inode_u

[RFC PATCH 3/6] shiftfs: copy inode attrs up from underlying fs

2018-11-01 Thread Seth Forshee

Not all inode permission checks go through the permission
callback, e.g. some checks related to file capabilities. Always
copy up the inode attrs to ensure these checks work as expected.

Also introduce helpers helpers for shifting kernel ids from one
user ns to another, as this is an operation that is going to be
repeated.

Signed-off-by: Seth Forshee 
---
 fs/shiftfs.c | 30 +-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index b179a1be7bc1..556594988dd2 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -266,6 +266,33 @@ static int shiftfs_xattr_set(const struct xattr_handler 
*handler,
return shiftfs_setxattr(dentry, inode, name, value, size, flags);
 }
 
+static kuid_t shift_kuid(struct user_namespace *from, struct user_namespace 
*to,
+kuid_t kuid)
+{
+   uid_t uid = from_kuid(from, kuid);
+   return make_kuid(to, uid);
+}
+
+static kgid_t shift_kgid(struct user_namespace *from, struct user_namespace 
*to,
+kgid_t kgid)
+{
+   gid_t gid = from_kgid(from, kgid);
+   return make_kgid(to, gid);
+}
+
+static void shiftfs_copyattr(struct inode *from, struct inode *to)
+{
+   struct user_namespace *from_ns = from->i_sb->s_user_ns;
+   struct user_namespace *to_ns = to->i_sb->s_user_ns;
+
+   to->i_uid = shift_kuid(from_ns, to_ns, from->i_uid);
+   to->i_gid = shift_kgid(from_ns, to_ns, from->i_gid);
+   to->i_mode = from->i_mode;
+   to->i_atime = from->i_atime;
+   to->i_mtime = from->i_mtime;
+   to->i_ctime = from->i_ctime;
+}
+
 static void shiftfs_fill_inode(struct inode *inode, struct dentry *dentry)
 {
struct inode *reali;
@@ -278,6 +305,7 @@ static void shiftfs_fill_inode(struct inode *inode, struct 
dentry *dentry)
if (!reali->i_op->get_link)
inode->i_opflags |= IOP_NOFOLLOW;
 
+   shiftfs_copyattr(reali, inode);
inode->i_mapping = reali->i_mapping;
inode->i_private = reali;
set_nlink(inode, reali->i_nlink);
@@ -573,7 +601,7 @@ static int shiftfs_setattr(struct dentry *dentry, struct 
iattr *attr)
return err;
 
/* all OK, reflect the change on our inode */
-   setattr_copy(d_inode(dentry), attr);
+   shiftfs_copyattr(reali, d_inode(dentry));
return 0;
 }
 
-- 
2.19.1

[RFC PATCH 4/6] shiftfs: translate uids using s_user_ns from lower fs

2018-11-01 Thread Seth Forshee

Do not assume that ids from the lower filesystem are from
init_user_ns. Instead, translate them from that filesystem's
s_user_ns and then to the shiftfs user ns.

Signed-off-by: Seth Forshee 
---
 fs/shiftfs.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index 556594988dd2..226c03d8588b 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -613,6 +613,8 @@ static int shiftfs_getattr(const struct path *path, struct 
kstat *stat,
struct inode *reali = real->d_inode;
const struct inode_operations *iop = reali->i_op;
struct path newpath = { .mnt = path->dentry->d_sb->s_fs_info, .dentry = 
real };
+   struct user_namespace *from_ns = reali->i_sb->s_user_ns;
+   struct user_namespace *to_ns = inode->i_sb->s_user_ns;
int err = 0;
 
if (iop->getattr)
@@ -624,8 +626,8 @@ static int shiftfs_getattr(const struct path *path, struct 
kstat *stat,
return err;
 
/* transform the underlying id */
-   stat->uid = make_kuid(inode->i_sb->s_user_ns, __kuid_val(stat->uid));
-   stat->gid = make_kgid(inode->i_sb->s_user_ns, __kgid_val(stat->gid));
+   stat->uid = shift_kuid(from_ns, to_ns, stat->uid);
+   stat->gid = shift_kgid(from_ns, to_ns, stat->gid);
return 0;
 }
 
-- 
2.19.1

[RFC PATCH 5/6] shiftfs: add support for posix acls

2018-11-01 Thread Seth Forshee

Signed-off-by: Seth Forshee 
---
 fs/Kconfig   |  10 +++
 fs/shiftfs.c | 185 +++
 2 files changed, 195 insertions(+)

diff --git a/fs/Kconfig b/fs/Kconfig
index 392c5a41a9f9..691f3c4fc7eb 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -121,6 +121,16 @@ config SHIFT_FS
  unprivileged containers can use this to mount root volumes
  using this technique.
 
+config SHIFT_FS_POSIX_ACL
+   bool "shiftfs POSIX Access Control Lists"
+   depends on SHIFT_FS
+   select FS_POSIX_ACL
+   help
+ POSIX Access Control Lists (ACLs) support permissions for users and
+ groups beyond the owner/group/world scheme.
+
+ If you don't know what Access Control Lists are, say N.
+
 menu "Caches"
 
 source "fs/fscache/Kconfig"
diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index 226c03d8588b..b19af7b2fe75 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -13,6 +13,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 struct shiftfs_super_info {
struct vfsmount *mnt;
@@ -631,6 +633,183 @@ static int shiftfs_getattr(const struct path *path, 
struct kstat *stat,
return 0;
 }
 
+#ifdef CONFIG_SHIFT_FS_POSIX_ACL
+
+static int
+shift_acl_ids(struct user_namespace *from, struct user_namespace *to,
+ struct posix_acl *acl)
+{
+   int i;
+
+   for (i = 0; i < acl->a_count; i++) {
+   struct posix_acl_entry *e = &acl->a_entries[i];
+   switch(e->e_tag) {
+   case ACL_USER:
+   e->e_uid = shift_kuid(from, to, e->e_uid);
+   if (!uid_valid(e->e_uid))
+   return -EOVERFLOW;
+   break;
+   case ACL_GROUP:
+   e->e_gid = shift_kgid(from, to, e->e_gid);
+   if (!gid_valid(e->e_gid))
+   return -EOVERFLOW;
+   break;
+   }
+   }
+   return 0;
+}
+
+static void
+shift_acl_xattr_ids(struct user_namespace *from, struct user_namespace *to,
+   void *value, size_t size)
+{
+   struct posix_acl_xattr_header *header = value;
+   struct posix_acl_xattr_entry *entry = (void *)(header + 1), *end;
+   int count;
+   kuid_t kuid;
+   kgid_t kgid;
+
+   if (!value)
+   return;
+   if (size < sizeof(struct posix_acl_xattr_header))
+   return;
+   if (header->a_version != cpu_to_le32(POSIX_ACL_XATTR_VERSION))
+   return;
+
+   count = posix_acl_xattr_count(size);
+   if (count < 0)
+   return;
+   if (count == 0)
+   return;
+
+   for (end = entry + count; entry != end; entry++) {
+   switch(le16_to_cpu(entry->e_tag)) {
+   case ACL_USER:
+   kuid = make_kuid(&init_user_ns, 
le32_to_cpu(entry->e_id));
+   kuid = shift_kuid(from, to, kuid);
+   entry->e_id = cpu_to_le32(from_kuid(&init_user_ns, 
kuid));
+   break;
+   case ACL_GROUP:
+   kgid = make_kgid(&init_user_ns, 
le32_to_cpu(entry->e_id));
+   kgid = shift_kgid(from, to, kgid);
+   entry->e_id = cpu_to_le32(from_kgid(&init_user_ns, 
kgid));
+   break;
+   default:
+   break;
+   }
+   }
+}
+
+static struct posix_acl *shiftfs_get_acl(struct inode *inode, int type)
+{
+   struct inode *reali = inode->i_private;
+   const struct cred *oldcred, *newcred;
+   struct posix_acl *real_acl, *acl = NULL;
+   struct user_namespace *from_ns = reali->i_sb->s_user_ns;
+   struct user_namespace *to_ns = inode->i_sb->s_user_ns;
+   int size;
+   int err;
+
+   if (!IS_POSIXACL(reali))
+   return NULL;
+
+   oldcred = shiftfs_new_creds(&newcred, inode->i_sb);
+   real_acl = get_acl(reali, type);
+   shiftfs_old_creds(oldcred, &newcred);
+
+   if (real_acl && !IS_ERR(acl)) {
+   /* XXX: export posix_acl_clone? */
+   size = sizeof(struct posix_acl) +
+  real_acl->a_count * sizeof(struct posix_acl_entry);
+   acl = kmemdup(acl, size, GFP_KERNEL);
+   posix_acl_release(real_acl);
+
+   if (!acl)
+   return ERR_PTR(-ENOMEM);
+
+   refcount_set(&acl->a_refcount, 1);
+
+   err = shift_acl_ids(from_ns, to_ns, acl);
+   if (err) {
+   kfree(acl);
+   return ERR_PTR(err);
+   }
+   }
+
+   return acl;
+}
+
+static int
+shiftfs_posix_acl_xattr_get(const struct xattr_handler *handler,
+

[RFC PATCH 6/6] shiftfs: support nested shiftfs mounts

2018-11-01 Thread Seth Forshee

shiftfs mounts cannot be nested for two reasons -- global
CAP_SYS_ADMIN is required to set up a mark mount, and a single
functional shiftfs mount meets the filesystem stacking depth
limit.

The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel
ids in a mount must be within that mount's s_user_ns, so all that
is needed is CAP_SYS_ADMIN within that s_user_ns.

The stack depth issue can be worked around with a couple of
adjustments. First, a mark mount doesn't really need to count
against the stacking depth as it doesn't contribute to the call
stack depth during filesystem operations. Therefore the mount
over the mark mount only needs to count as one more than the
lower filesystems stack depth.

Second, when the lower mount is shiftfs we can just skip down to
that mount's lower filesystem and shift ids relative to that.
There is no reason to shift ids twice, and the lower path has
already been marked safe for id shifting by a user privileged
towards all ids in that mount's user ns.

Signed-off-by: Seth Forshee 
---
 fs/shiftfs.c | 68 +++-
 1 file changed, 46 insertions(+), 22 deletions(-)

diff --git a/fs/shiftfs.c b/fs/shiftfs.c
index b19af7b2fe75..008ace2842b9 100644
--- a/fs/shiftfs.c
+++ b/fs/shiftfs.c
@@ -930,7 +930,7 @@ static int shiftfs_fill_super(struct super_block *sb, void 
*raw_data,
struct shiftfs_data *data = raw_data;
char *name = kstrdup(data->path, GFP_KERNEL);
int err = -ENOMEM;
-   struct shiftfs_super_info *ssi = NULL;
+   struct shiftfs_super_info *ssi = NULL, *mp_ssi;
struct path path;
struct dentry *dentry;
 
@@ -946,11 +946,7 @@ static int shiftfs_fill_super(struct super_block *sb, void 
*raw_data,
if (err)
goto out;
 
-   /* to mark a mount point, must be real root */
-   if (ssi->mark && !capable(CAP_SYS_ADMIN))
-   goto out;
-
-   /* else to mount a mark, must be userns admin */
+   /* to mount a mark, must be userns admin */
if (!ssi->mark && !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
goto out;
 
@@ -962,41 +958,66 @@ static int shiftfs_fill_super(struct super_block *sb, 
void *raw_data,
 
if (!S_ISDIR(path.dentry->d_inode->i_mode)) {
err = -ENOTDIR;
-   goto out_put;
-   }
-
-   sb->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;
-   if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
-   printk(KERN_ERR "shiftfs: maximum stacking depth exceeded\n");
-   err = -EINVAL;
-   goto out_put;
+   goto out_put_path;
}
 
if (ssi->mark) {
+   struct super_block *lower_sb = path.mnt->mnt_sb;
+
+   /* to mark a mount point, must root wrt lower s_user_ns */
+   if (!ns_capable(lower_sb->s_user_ns, CAP_SYS_ADMIN))
+   goto out_put_path;
+
+
/*
 * this part is visible unshifted, so make sure no
 * executables that could be used to give suid
 * privileges
 */
sb->s_iflags = SB_I_NOEXEC;
-   ssi->mnt = path.mnt;
-   dentry = path.dentry;
-   } else {
-   struct shiftfs_super_info *mp_ssi;
 
+   /*
+* Handle nesting of shiftfs mounts by referring this mark
+* mount back to the original mark mount. This is more
+* efficient and alleviates concerns about stack depth.
+*/
+   if (lower_sb->s_magic == SHIFTFS_MAGIC) {
+   mp_ssi = lower_sb->s_fs_info;
+
+   /* Doesn't make sense to mark a mark mount */
+   if (mp_ssi->mark) {
+   err = -EINVAL;
+   goto out_put_path;
+   }
+
+   ssi->mnt = mntget(mp_ssi->mnt);
+   dentry = dget(path.dentry->d_fsdata);
+   } else {
+   ssi->mnt = mntget(path.mnt);
+   dentry = dget(path.dentry);
+   }
+   } else {
/*
 * this leg executes if we're admin capable in
 * the namespace, so be very careful
 */
if (path.dentry->d_sb->s_magic != SHIFTFS_MAGIC)
-   goto out_put;
+   goto out_put_path;
mp_ssi = path.dentry->d_sb->s_fs_info;
if (!mp_ssi->mark)
-   goto out_put;
+   goto out_put_path;
ssi->mnt = mntget(mp_ssi->mnt);
dentry = dget(path.dentry->d_fsdata);
-

[RFC PATCH 1/6] shiftfs: uid/gid shifting bind mount

2018-11-01 Thread Seth Forshee

From: James Bottomley 

This allows any subtree to be uid/gid shifted and bound elsewhere.  It
does this by operating simlarly to overlayfs.  Its primary use is for
shifting the underlying uids of filesystems used to support
unpriviliged (uid shifted) containers.  The usual use case here is
that the container is operating with an uid shifted unprivileged root
but sometimes needs to make use of or work with a filesystem image
that has root at real uid 0.

The mechanism is to allow any subordinate mount namespace to mount a
shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only allowing
it to mount marked subtrees (using the -o mark option as root).  Once
mounted, the subtree is mapped via the super block user namespace so
that the interior ids of the mounting user namespace are the ids
written to the filesystem.

Signed-off-by: James Bottomley 
[ saf: use designated initializers for path declarations to fix errors
  with struct randomization ]
Signed-off-by: Seth Forshee 

---

v3 - update to 4.14 (d_real changes)
v1 - based on original shiftfs with uid mappings now done via s_user_ns
v2 - fix revalidation of dentries
 add inode aliasing
---
 fs/Kconfig |   8 +
 fs/Makefile|   1 +
 fs/shiftfs.c   | 783 +
 include/uapi/linux/magic.h |   2 +
 4 files changed, 794 insertions(+)
 create mode 100644 fs/shiftfs.c

diff --git a/fs/Kconfig b/fs/Kconfig
index ac474a61be37..392c5a41a9f9 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -113,6 +113,14 @@ source "fs/autofs/Kconfig"
 source "fs/fuse/Kconfig"
 source "fs/overlayfs/Kconfig"
 
+config SHIFT_FS
+   tristate "UID/GID shifting overlay filesystem for containers"
+   help
+ This filesystem can overlay any mounted filesystem and shift
+ the uid/gid the files appear at.  The idea is that
+ unprivileged containers can use this to mount root volumes
+ using this technique.
+
 menu "Caches"
 
 source "fs/fscache/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index 293733f61594..d0222f3816bd 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -128,3 +128,4 @@ obj-y   += exofs/ # Multiple 
modules
 obj-$(CONFIG_CEPH_FS)  += ceph/
 obj-$(CONFIG_PSTORE)   += pstore/
 obj-$(CONFIG_EFIVAR_FS)+= efivarfs/
+obj-$(CONFIG_SHIFT_FS) += shiftfs.o
diff --git a/fs/shiftfs.c b/fs/shiftfs.c
new file mode 100644
index ..6028244c2f42
--- /dev/null
+++ b/fs/shiftfs.c
@@ -0,0 +1,783 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct shiftfs_super_info {
+   struct vfsmount *mnt;
+   struct user_namespace *userns;
+   bool mark;
+};
+
+static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode,
+  struct dentry *dentry);
+
+enum {
+   OPT_MARK,
+   OPT_LAST,
+};
+
+/* global filesystem options */
+static const match_table_t tokens = {
+   { OPT_MARK, "mark" },
+   { OPT_LAST, NULL }
+};
+
+static const struct cred *shiftfs_get_up_creds(struct super_block *sb)
+{
+   struct shiftfs_super_info *ssi = sb->s_fs_info;
+   struct cred *cred = prepare_creds();
+
+   if (!cred)
+   return NULL;
+
+   cred->fsuid = KUIDT_INIT(from_kuid(sb->s_user_ns, cred->fsuid));
+   cred->fsgid = KGIDT_INIT(from_kgid(sb->s_user_ns, cred->fsgid));
+   put_user_ns(cred->user_ns);
+   cred->user_ns = get_user_ns(ssi->userns);
+
+   return cred;
+}
+
+static const struct cred *shiftfs_new_creds(const struct cred **newcred,
+   struct super_block *sb)
+{
+   const struct cred *cred = shiftfs_get_up_creds(sb);
+
+   *newcred = cred;
+
+   if (cred)
+   cred = override_creds(cred);
+   else
+   printk(KERN_ERR "shiftfs: Credential override failed: no 
memory\n");
+
+   return cred;
+}
+
+static void shiftfs_old_creds(const struct cred *oldcred,
+ const struct cred **newcred)
+{
+   if (!*newcred)
+   return;
+
+   revert_creds(oldcred);
+   put_cred(*newcred);
+}
+
+static int shiftfs_parse_options(struct shiftfs_super_info *ssi, char *options)
+{
+   char *p;
+   substring_t args[MAX_OPT_ARGS];
+
+   ssi->mark = false;
+
+   while ((p = strsep(&options, ",")) != NULL) {
+   int token;
+
+   if (!*p)
+   continue;
+
+   token = match_token(p, tokens, args);
+   switch (token) {
+   case OPT_MARK:
+   ssi->mark = true;
+   break;
+   default:
+

[RFC PATCH 0/6] shiftfs fixes and enhancements

2018-11-01 Thread Seth Forshee

I've done some work to fix and enhance shiftfs for a number of use
cases, so that we would have an idea what a more full-featured shiftfs
would look like. I'm intending for these to serve as a point of
reference for discussing id shifting mounts/filesystems at plumbers in a
couple of weeks [1].

Note that these are based on 4.18, and I've added a small fix to James'
most recent patch to fix a build issue there. To work with 4.19 they
will need a number of updates due to changes in the vfs.

The features I focused on fixing in or adding to shiftfs in these
patches are inotify, file capabilities, posix acls, and nesting. These
are all now working for at least simple use cases, but further testing
and cleanups are needed before I'd consider these finished. I also kept
all the changes within shiftfs, but some of the code might belong in the
vfs instead (in particular some of the posix acl code).

I've also pushed these patches to:

git://git.kernel.org/pub/scm/linux/kernel/git/sforshee/linux.git shiftfs

Thanks,
Seth

[1] https://linuxplumbersconf.org/event/2/contributions/212/

---

James Bottomley (1):
  shiftfs: uid/gid shifting bind mount

Seth Forshee (5):
  shiftfs: map inodes to lower fs inodes instead of dentries
  shiftfs: copy inode attrs up from underlying fs
  shiftfs: translate uids using s_user_ns from lower fs
  shiftfs: add support for posix acls
  shiftfs: support nested shiftfs mounts

 fs/Kconfig |   18 +
 fs/Makefile|1 +
 fs/shiftfs.c   | 1075 
 include/uapi/linux/magic.h |2 +
 4 files changed, 1096 insertions(+)
 create mode 100644 fs/shiftfs.c

Re: [PATCH v3 0/1] shiftfs: uid/gid shifting filesystem

2018-07-17 Thread Seth Forshee

+Cc David

On Fri, Jun 15, 2018 at 02:35:14PM -0700, James Bottomley wrote:
> This is a repost of the v2 patch updated for the d_real changes
> 
> For those who want to test it out, there's a git tree here
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/jejb/binfmt_misc.git
> 
> on the shiftfs-v3 branch
> 
> v2:
> 
> This is a rewrite of the original shiftfs code to make use of super
> block user namespaces.  I've also removed the mappings passed in as
> mount options in favour of using the mappings in s_user_ns.  The upshot
> is that it probably needs retesting for all the bugs people found,
> since there's a lot of new code, and the use case has changed.  Now, to
> use it, you have to mark the filesystems you want to be mountable
> inside a user namespace as root:
> 
> mount -t shiftfs -o mark  
> 
> The origin should be inaccessible to the unprivileged user, and the
> access to the  can be controlled by the usual filesystem
> permissions.  Once this is done, any user who can get access to the
>  can do (as the local user namespace root):
> 
> mount -t shiftfs  

David, I wanted to pull you in here based on something you said on the
most recent filesystem context thread (thought it would make more sense
here rather than piggypacking on that already massive thread).

> I want to be able to add support for a bunch of things:
> 
>  (1) UID, GID and Project ID mapping/translation.  I want to be able to
>  install a translation table of some sort on the superblock to translate
>  source identifiers (which may be foreign numeric UIDs/GIDs, text names,
>  GUIDs) into system identifiers.  This needs to be done before the
>  superblock is published[*].
> 
>  Note that this may, for example, involve using the context and the
>  superblock held therein to issue an RPC to a server to look up
>  translations.
> 
>  [*] By "published" I mean made available through mount so that other
>userspace processes can access it by path.
> 
>  Maybe specifying a translation range element with something like:
> 
>   write(fd, "t uid   ");
> 
>  The translation information also needs to propagate over an automount in
>  some circumstances.
> 
>  (2) Namespace configuration.  I want to be able to tell the superblock
>  creation process what namespaces should be applied when it created (in
>  particular the userns and netns) for containerisation purposes, e.g.:
> 
>   write(fd, "n user= net=");

There's some obvious overlap between shiftfs and (1), but also important
differences. Primarily that shiftfs tries to make something that looks
like a bind mount rather than applying the mappings to new superblocks
for arbitrary filesystems.

I've already been playing with shiftfs on top of the filesystem context
patches, because I thought it would allow getting rid of the
intermediate "mark" mount described above. I have a hacky proof of
concept implementation that I've pushed to the shiftfs-fscontext branch
of

 git://git.kernel.org/pub/scm/linux/kernel/git/sforshee/linux.git

Basically the idea is that the more privileged "host" context can create
the fs fd and set the source on it to "bless" a subtree for id shifted
mounting, and the less privileged "client" context can use the fd to do
the mount (test program below). But I had to mess with fc->user_ns to
ensure s_user_ns gets set correctly, and it would likely be nicer to do
something more like (2) above.

The idea is that we need the more privileged "host" context to bless the
subtree being id shifted before actually executing the mount in the less
privileged "client" context. I'm doing this by having the host set the
source, then have the client use the fd to create the superblock (my
test program is below). This leads to some undesirable changing of the
fs contexts user ns in shiftfs (so that s_user_ns is the client's
namespace), which could likely be eliminated by doing something like
what's described in (2) and having the super block created on the host
side.

However, maybe these things are similar enough to settle on a common
solution, such as supporting id mapping at the vfsmount level.

Seth

---

#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define __NR_move_mount 336
#define __NR_fsopen 337
#define __NR_fsmount338

#define FSOPEN_CLOEXEC  0x0001
#define FSMOUNT_CLOEXEC 0x0001
#define MOVE_MOUNT_F_EMPTY_PATH 0x0004

static int move_mount(int from_dfd, const char *from_pathname,
  int to_dfd, const char *to_pathname, unsigned int flags)
{
return syscall(__NR_move_mount, from_dfd, from_pathname, to_dfd, 
to_pathname,
   flags);
}

static int fsopen(const char *fs_name, unsigned int flags)
{
return syscall(__NR_fsopen, fs_name, flags);
}

static int fsmount(int fsfd, unsigned int flags, unsigned in

Re: [REVIEW][PATCH 2/6] vfs: Allow userns root to call mknod on owned filesystems.

2018-05-24 Thread Seth Forshee

On Thu, May 24, 2018 at 11:55:45AM -0500, Eric W. Biederman wrote:
> Seth Forshee  writes:
> 
> > On Wed, May 23, 2018 at 06:25:34PM -0500, Eric W. Biederman wrote:
> >> These filesystems already always set SB_I_NODEV so mknod will not be
> >> useful for gaining control of any devices no matter their permissions.
> >> This will allow overlayfs and applications to fakeroot to use device
> >> nodes to represent things on disk.
> >> 
> >> Signed-off-by: "Eric W. Biederman" 
> >
> > For a normal filesystem this does seem safe enough.
> >
> > However, I'd also like to see us allow unprivileged mounting for
> > overlayfs, and there we need to worry about whether this would allow a
> > mknod in an underlying filesystem which should not be allowed. That
> > mknod will be subject to this same check in the underlying filesystem
> > using the credentials of the user that mounted the overaly fs, which
> > should be sufficient to ensure that the mknod is permitted.
> 
> Sufficient to ensure the mknod is not permitted on the underlying
> filesystem.  I believe you mean.

Right, or in other words with the relaxed capability check a user still
could not use an overlayfs mount in a user namespace to mknod in a
filesystem when that user couldn't otherwise mknod in that filesystem.
Sorry if I wasn't clear.

> 
> > Thus this looks okay to me.
> >
> > Acked-by: Seth Forshee 
> 
> Eric
>

Re: [REVIEW][PATCH 2/6] vfs: Allow userns root to call mknod on owned filesystems.

2018-05-24 Thread Seth Forshee

On Wed, May 23, 2018 at 06:25:34PM -0500, Eric W. Biederman wrote:
> These filesystems already always set SB_I_NODEV so mknod will not be
> useful for gaining control of any devices no matter their permissions.
> This will allow overlayfs and applications to fakeroot to use device
> nodes to represent things on disk.
> 
> Signed-off-by: "Eric W. Biederman" 

For a normal filesystem this does seem safe enough.

However, I'd also like to see us allow unprivileged mounting for
overlayfs, and there we need to worry about whether this would allow a
mknod in an underlying filesystem which should not be allowed. That
mknod will be subject to this same check in the underlying filesystem
using the credentials of the user that mounted the overaly fs, which
should be sufficient to ensure that the mknod is permitted.

Thus this looks okay to me.

Acked-by: Seth Forshee

Re: [REVIEW][PATCH 1/6] vfs: Don't allow changing the link count of an inode with an invalid uid or gid

2018-05-24 Thread Seth Forshee

On Wed, May 23, 2018 at 06:25:33PM -0500, Eric W. Biederman wrote:
> Changing the link count of an inode via unlink or link will cause a
> write back of that inode.  If the uids or gids are invalid (aka not known
> to the kernel) writing the inode back may change the uid or gid in the
> filesystem.   To prevent possible filesystem and to avoid the need for
> filesystem maintainers to worry about it don't allow operations on
> inodes with an invalid uid or gid.
> 
> Signed-off-by: "Eric W. Biederman" 

Acked-by: Seth Forshee

Re: [PATCH] fuse: Ensure posix acls are translated outside of init_user_ns

2018-05-08 Thread Seth Forshee

On Fri, May 04, 2018 at 11:47:28AM -0500, Eric W. Biederman wrote:
> 
> Ensure the translation happens by failing to read or write
> posix acls when the filesystem has not indicated it supports
> posix acls.
> 
> This ensures that modern cached posix acl support is available
> and used when dealing with posix acls.  This is important
> because only that path has the code to convernt the uids and
> gids in posix acls into the user namespace of a fuse filesystem.
> 
> Signed-off-by: "Eric W. Biederman" 
> ---
> 
> Miklos after several attempts to handle this better last cycle.  I
> figure we should go with the stupid version for now.  I think I know
> how to do better but I don't want that to gate forward progress on
> fully unprivileged fuse mounts.  Especially as this is the last known
> issue to deal with.

This seems reasonable as a short-term measure.

Acked-by: Seth Forshee

Re: [RFC PATCH v3 2/2] ima: force re-appraisal on filesystems with FS_IMA_NO_CACHE

2018-01-22 Thread Seth Forshee

On Mon, Jan 22, 2018 at 05:24:52PM +0100, Alban Crequy wrote:
> From: Alban Crequy 
> 
> This patch forces files to be re-measured, re-appraised and re-audited
> on file systems with the feature flag FS_IMA_NO_CACHE. In that way,
> cached integrity results won't be used.
> 
> How to test this:
> 
> The test I did was using a patched version of the memfs FUSE driver
> [1][2] and two very simple "hello-world" programs [4] (prog1 prints
> "hello world: 1" and prog2 prints "hello world: 2").
> 
> I copy prog1 and prog2 in the fuse-memfs mount point, execute them and
> check the sha1 hash in
> "/sys/kernel/security/ima/ascii_runtime_measurements".
> 
> My patch on the memfs FUSE driver added a backdoor command to serve
> prog1 when the kernel asks for prog2 or vice-versa. In this way, I can
> exec prog1 and get it to print "hello world: 2" without ever replacing
> the file via the VFS, so the kernel is not aware of the change.
> 
> The test was done using the branch "alban/fuse-flag-ima-nocache-v3" [3].
> 
> Step by step test procedure:
> 
> 1. Mount the memfs FUSE using [2]:
> rm -f  /tmp/memfs-switch* ; memfs -L DEBUG  /mnt/memfs
> 
> 2. Copy prog1 and prog2 using [4]
> cp prog1 /mnt/memfs/prog1
> cp prog2 /mnt/memfs/prog2
> 
> 3. Lookup the files and let the FUSE driver to keep the handles open:
> dd if=/mnt/memfs/prog1 bs=1 | (read -n 1 x ; sleep 3600 ) &
> dd if=/mnt/memfs/prog2 bs=1 | (read -n 1 x ; sleep 3600 ) &
> 
> 4. Check the 2 programs work correctly:
> $ /mnt/memfs/prog1
> hello world: 1
> $ /mnt/memfs/prog2
> hello world: 2
> 
> 5. Check the measurements for prog1 and prog2:
> $ sudo cat /sys/kernel/security/ima/ascii_runtime_measurements \
> | grep /mnt/memfs/prog
> 10 [...] ima-ng sha1:ac14c9268cd2[...] /mnt/memfs/prog1
> 10 [...] ima-ng sha1:799cb5d1e06d[...] /mnt/memfs/prog2
> 
> 6. Use the backdoor command in my patched memfs to redirect file
> operations on file handle 3 to file handle 2:
> rm -f  /tmp/memfs-switch* ; touch /tmp/memfs-switch-3-2
> 
> 7. Check how the FUSE driver serves different content for the files:
> $ /mnt/memfs/prog1
> hello world: 2
> $ /mnt/memfs/prog2
> hello world: 2
> 
> 8. Check the measurements:
> sudo cat /sys/kernel/security/ima/ascii_runtime_measurements \
> | grep /mnt/memfs/prog
> 
> Without the patch, there are no new measurements, despite the FUSE
> driver having served different executables.
> 
> With the patch, I can see additional measurements for prog1 and prog2
> with the hashes reversed when the FUSE driver served the alternative
> content.
> 
> [1] https://github.com/bbengfort/memfs
> [2] https://github.com/kinvolk/memfs/commits/alban/switch-files
> [3] https://github.com/kinvolk/linux/commits/alban/fuse-flag-ima-nocache-v3
> [4] https://github.com/kinvolk/fuse-userns-patches/commit/cf1f5750cab0
> 
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-integr...@vger.kernel.org
> Cc: linux-security-mod...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: Miklos Szeredi 
> Cc: Alexander Viro 
> Cc: Mimi Zohar 
> Cc: Dmitry Kasatkin 
> Cc: James Morris 
> Cc: "Serge E. Hallyn" 
> Cc: Seth Forshee 
> Cc: Christoph Hellwig 
> Tested-by: Dongsu Park 
> Signed-off-by: Alban Crequy 

I like this approach as it doesn't require any IMA policy changes to be
effective. I'm not sure about the flag name (in my mind it would
describe the fs property and not be specifically giving instructions to
IMA), but if others are happy with it I won't complain.

Acked-by: Seth Forshee 

> ---
>  security/integrity/ima/ima_main.c | 24 ++--
>  1 file changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/security/integrity/ima/ima_main.c 
> b/security/integrity/ima/ima_main.c
> index 6d78cb26784d..8870a7bbe9b9 100644
> --- a/security/integrity/ima/ima_main.c
> +++ b/security/integrity/ima/ima_main.c
> @@ -24,6 +24,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "ima.h"
>  
> @@ -228,9 +229,28 @@ static int process_measurement(struct file *file, char 
> *buf, loff_t size,
>IMA_APPRAISE_SUBMASK | IMA_APPRAISED_SUBMASK |
>IMA_ACTION_FLAGS);
>  
> - if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags))
> - /* reset all flags if ima_inode_setxattr was called */
> + /*
> +  * Reset the measure, appraise and audit cached flags either if:
> +  * - ima_inode_setxattr was called, or
> +  * - based on filesystem feat

Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns

2018-01-17 Thread Seth Forshee

On Wed, Jan 17, 2018 at 07:56:59PM +0100, Alban Crequy wrote:
> On Wed, Jan 17, 2018 at 3:29 PM, Seth Forshee
>  wrote:
> > On Wed, Jan 17, 2018 at 11:59:06AM +0100, Alban Crequy wrote:
> >> [Adding Tejun, David, Tom for question about cuse]
> >>
> >> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park  wrote:
> >> > From: Seth Forshee 
> >> >
> >> > In order to support mounts from namespaces other than
> >> > init_user_ns, fuse must translate uids and gids to/from the
> >> > userns of the process servicing requests on /dev/fuse. This
> >> > patch does that, with a couple of restrictions on the namespace:
> >> >
> >> >  - The userns for the fuse connection is fixed to the namespace
> >> >from which /dev/fuse is opened.
> >> >
> >> >  - The namespace must be the same as s_user_ns.
> >> >
> >> > These restrictions simplify the implementation by avoiding the
> >> > need to pass around userns references and by allowing fuse to
> >> > rely on the checks in inode_change_ok for ownership changes.
> >> > Either restriction could be relaxed in the future if needed.
> >> >
> >> > For cuse the namespace used for the connection is also simply
> >> > current_user_ns() at the time /dev/cuse is opened.
> >>
> >> Was a use case discussed for using cuse in a new unprivileged userns?
> >>
> >> I ran some tests yesterday with cusexmp [1] and I could add a new char
> >> device as an unprivileged user with:
> >>
> >> $ unshare -U -r -m sh -c 'mount --bind /mnt/cuse /dev/cuse ; cusexmp
> >> --maj=99 --min=30 --name=foo
> >>
> >> where /mnt/cuse is previously mknod'ed correctly and chmod'ed 777.
> >> Then, I could see the new device:
> >>
> >> $ cat /proc/devices | grep foo
> >>  99 foo
> >>
> >> On normal distros, we don't have a /mnt/cuse chmod'ed 777 but still it
> >> seems dangerous if the dev node can be provided otherwise and if we
> >> don't have a use case for it.
> >>
> >> Thoughts?
> >
> > I can't remember the specific reasons, but I had concluded that letting
> > unprivileged users use cuse within a user namespace isn't safe. But
> > having a cuse device node usable by regular users at all is equally
> > unsafe I suspect,
> 
> This makes sense.
> 
> > so I don't think your example demonstrates any problem
> > specific to user namespaces. There shouldn't be any way to use a user
> > namespace to gain access permissions towards /dev/cuse, otherwise we
> > have bigger problems than cuse to worry about.
> 
> From my tests, the patch seem safe but I don't fully understand why that is.
> 
> I am not trying to gain more permissions towards /dev/cuse but to
> create another cuse char file from within the unprivileged userns. I
> tested the scenario by patching the memfs userspace FUSE driver to
> generate the char device whenever the file is named "cuse" (turning
> the regular file into a char device with the cuse major/minor behind
> the scene):
> 
> $ unshare -U -r -m
> # memfs /mnt/memfs &
> # ls -l /mnt/memfs
> # echo -n > /mnt/memfs/cuse
> -bash: /mnt/memfs/cuse: Input/output error
> # ls -l /mnt/memfs/cuse
> crwxrwxrwx. 1 root root 10, 203 Jan 17 18:24 /mnt/memfs/cuse
> # cat /mnt/memfs/cuse
> cat: /mnt/memfs/cuse: Permission denied
> 
> But then, I could not use that char device, even though it seems to
> have the correct major/minor and permissions. The kernel FUSE code
> seems to call init_special_inode() to handle character devices. I
> don't understand why it seems to be safe.

Because for new mounts in non-init user namespaces alloc_super() sets
SB_I_NODEV flag in s_iflags, which disallows opening device nodes in
that filesystem.

Seth

Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns

2018-01-17 Thread Seth Forshee

On Wed, Jan 17, 2018 at 11:59:06AM +0100, Alban Crequy wrote:
> [Adding Tejun, David, Tom for question about cuse]
> 
> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park  wrote:
> > From: Seth Forshee 
> >
> > In order to support mounts from namespaces other than
> > init_user_ns, fuse must translate uids and gids to/from the
> > userns of the process servicing requests on /dev/fuse. This
> > patch does that, with a couple of restrictions on the namespace:
> >
> >  - The userns for the fuse connection is fixed to the namespace
> >from which /dev/fuse is opened.
> >
> >  - The namespace must be the same as s_user_ns.
> >
> > These restrictions simplify the implementation by avoiding the
> > need to pass around userns references and by allowing fuse to
> > rely on the checks in inode_change_ok for ownership changes.
> > Either restriction could be relaxed in the future if needed.
> >
> > For cuse the namespace used for the connection is also simply
> > current_user_ns() at the time /dev/cuse is opened.
> 
> Was a use case discussed for using cuse in a new unprivileged userns?
> 
> I ran some tests yesterday with cusexmp [1] and I could add a new char
> device as an unprivileged user with:
> 
> $ unshare -U -r -m sh -c 'mount --bind /mnt/cuse /dev/cuse ; cusexmp
> --maj=99 --min=30 --name=foo
> 
> where /mnt/cuse is previously mknod'ed correctly and chmod'ed 777.
> Then, I could see the new device:
> 
> $ cat /proc/devices | grep foo
>  99 foo
> 
> On normal distros, we don't have a /mnt/cuse chmod'ed 777 but still it
> seems dangerous if the dev node can be provided otherwise and if we
> don't have a use case for it.
> 
> Thoughts?

I can't remember the specific reasons, but I had concluded that letting
unprivileged users use cuse within a user namespace isn't safe. But
having a cuse device node usable by regular users at all is equally
unsafe I suspect, so I don't think your example demonstrates any problem
specific to user namespaces. There shouldn't be any way to use a user
namespace to gain access permissions towards /dev/cuse, otherwise we
have bigger problems than cuse to worry about.

Seth

Re: Memory hotplug regression in 4.13

2017-12-22 Thread Seth Forshee

On Fri, Dec 22, 2017 at 10:12:40AM -0600, Seth Forshee wrote:
> On Fri, Dec 22, 2017 at 03:49:25PM +0100, Michal Hocko wrote:
> > On Mon 18-12-17 15:53:20, Michal Hocko wrote:
> > > On Fri 01-12-17 08:23:27, Seth Forshee wrote:
> > > > On Mon, Sep 25, 2017 at 02:58:25PM +0200, Michal Hocko wrote:
> > > > > On Thu 21-09-17 00:40:34, Seth Forshee wrote:
> > > [...]
> > > > > > It seems I don't have that kernel anymore, but I've got a 4.14-rc1 
> > > > > > build
> > > > > > and the problem still occurs there. It's pointing to the call to
> > > > > > __builtin_memcpy in memcpy (include/linux/string.h line 340), which 
> > > > > > we
> > > > > > get to via wp_page_copy -> cow_user_page -> copy_user_highpage.
> > > > > 
> > > > > Hmm, this is interesting. That would mean that we have successfully
> > > > > mapped the destination page but its memory is still not accessible.
> > > > > 
> > > > > Right now I do not see how the patch you have bisected to could make 
> > > > > any
> > > > > difference because it only postponed the onlining to be independent 
> > > > > but
> > > > > your config simply onlines automatically so there shouldn't be any
> > > > > semantic change. Maybe there is some sort of off-by-one or something.
> > > > > 
> > > > > I will try to investigate some more. Do you think it would be possible
> > > > > to configure kdump on your system and provide me with the vmcore in 
> > > > > some
> > > > > way?
> > > > 
> > > > Sorry, I got busy with other stuff and this kind of fell off my radar.
> > > > It came to my attention again recently though.
> > > 
> > > Apology on my side. This has completely fall of my radar.
> > > 
> > > > I was looking through the hotplug rework changes, and I noticed that
> > > > 32-bit x86 previously was using ZONE_HIGHMEM as a default but after the
> > > > rework it doesn't look like it's possible for memory to be associated
> > > > with ZONE_HIGHMEM when onlining. So I made the change below against 4.14
> > > > and am now no longer seeing the oopses.
> > > 
> > > Thanks a lot for debugging! Do I read the above correctly that the
> > > current code simply returns ZONE_NORMAL and maps an unrelated pfn into
> > > this zone and that leads to later blowups? Could you attach the fresh
> > > boot dmesg output please?
> > > 
> > > > I'm sure this isn't the correct fix, but I think it does confirm that
> > > > the problem is that the memory should be associated with ZONE_HIGHMEM
> > > > but is not.
> > > 
> > > 
> > > Yes, the fix is not quite right. HIGHMEM is not a _kernel_ memory
> > > zone. The kernel cannot access that memory directly. It is essentially a
> > > movable zone from the hotplug API POV. We simply do not have any way to
> > > tell into which zone we want to online this memory range in.
> > > Unfortunately both zones _can_ be present. It would require an explicit
> > > configuration (movable_node and a NUMA hoptlugable nodes running in 32b
> > > or and movable memory configured explicitly on the kernel command line).
> > > 
> > > The below patch is not really complete but I would rather start simple.
> > > Maybe we do not even have to care as most 32b users will never use both
> > > zones at the same time. I've placed a warning to learn about those.
> > > 
> > > Does this pass your testing?
> > 
> > Any chances to test this?
> 
> Yes, I should get to testing it soon. I'm working through a backlog of
> things I need to get done and this just hasn't quite made it to the top.

I started by testing vanilla 4.15-rc4 with a vm that has several memory
slots already populated at boot. With that I no longer get an oops,
however while /sys/devices/system/memory/*/online is 1 it looks like the
memory isn't being used. With your patch the behavior is the same. I'm
attaching dmesg from both kernels.

Thanks,
Seth

> 
> > > ---
> > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > > index 262bfd26baf9..18fec18bdb60 100644
> > > --- a/mm/memory_hotplug.c
> > > +++ b/mm/memory_hotplug.c
> > > @@ -855,12 +855,29 @@ static struct zone *default_kernel_zone_for_pfn(int 
> > > nid, unsigned long s

Re: Memory hotplug regression in 4.13

2017-12-22 Thread Seth Forshee

On Fri, Dec 22, 2017 at 03:49:25PM +0100, Michal Hocko wrote:
> On Mon 18-12-17 15:53:20, Michal Hocko wrote:
> > On Fri 01-12-17 08:23:27, Seth Forshee wrote:
> > > On Mon, Sep 25, 2017 at 02:58:25PM +0200, Michal Hocko wrote:
> > > > On Thu 21-09-17 00:40:34, Seth Forshee wrote:
> > [...]
> > > > > It seems I don't have that kernel anymore, but I've got a 4.14-rc1 
> > > > > build
> > > > > and the problem still occurs there. It's pointing to the call to
> > > > > __builtin_memcpy in memcpy (include/linux/string.h line 340), which we
> > > > > get to via wp_page_copy -> cow_user_page -> copy_user_highpage.
> > > > 
> > > > Hmm, this is interesting. That would mean that we have successfully
> > > > mapped the destination page but its memory is still not accessible.
> > > > 
> > > > Right now I do not see how the patch you have bisected to could make any
> > > > difference because it only postponed the onlining to be independent but
> > > > your config simply onlines automatically so there shouldn't be any
> > > > semantic change. Maybe there is some sort of off-by-one or something.
> > > > 
> > > > I will try to investigate some more. Do you think it would be possible
> > > > to configure kdump on your system and provide me with the vmcore in some
> > > > way?
> > > 
> > > Sorry, I got busy with other stuff and this kind of fell off my radar.
> > > It came to my attention again recently though.
> > 
> > Apology on my side. This has completely fall of my radar.
> > 
> > > I was looking through the hotplug rework changes, and I noticed that
> > > 32-bit x86 previously was using ZONE_HIGHMEM as a default but after the
> > > rework it doesn't look like it's possible for memory to be associated
> > > with ZONE_HIGHMEM when onlining. So I made the change below against 4.14
> > > and am now no longer seeing the oopses.
> > 
> > Thanks a lot for debugging! Do I read the above correctly that the
> > current code simply returns ZONE_NORMAL and maps an unrelated pfn into
> > this zone and that leads to later blowups? Could you attach the fresh
> > boot dmesg output please?
> > 
> > > I'm sure this isn't the correct fix, but I think it does confirm that
> > > the problem is that the memory should be associated with ZONE_HIGHMEM
> > > but is not.
> > 
> > 
> > Yes, the fix is not quite right. HIGHMEM is not a _kernel_ memory
> > zone. The kernel cannot access that memory directly. It is essentially a
> > movable zone from the hotplug API POV. We simply do not have any way to
> > tell into which zone we want to online this memory range in.
> > Unfortunately both zones _can_ be present. It would require an explicit
> > configuration (movable_node and a NUMA hoptlugable nodes running in 32b
> > or and movable memory configured explicitly on the kernel command line).
> > 
> > The below patch is not really complete but I would rather start simple.
> > Maybe we do not even have to care as most 32b users will never use both
> > zones at the same time. I've placed a warning to learn about those.
> > 
> > Does this pass your testing?
> 
> Any chances to test this?

Yes, I should get to testing it soon. I'm working through a backlog of
things I need to get done and this just hasn't quite made it to the top.

> > ---
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 262bfd26baf9..18fec18bdb60 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -855,12 +855,29 @@ static struct zone *default_kernel_zone_for_pfn(int 
> > nid, unsigned long start_pfn
> > return &pgdat->node_zones[ZONE_NORMAL];
> >  }
> >  
> > +static struct zone *default_movable_zone_for_pfn(int nid)
> > +{
> > +   /*
> > +* Please note that 32b HIGHMEM systems might have 2 movable zones
> > +* actually so we have to check for both. This is rather ugly hack
> > +* to enforce using Highmem on those systems but we do not have a
> > +* good user API to tell into which movable zone we should online.
> > +* WARN if we have a movable zone which is not highmem.
> > +*/
> > +#ifdef CONFIG_HIGHMEM
> > +   WARN_ON_ONCE(!zone_movable_is_highmem());
> > +   return &NODE_DATA(nid)->node_zones[ZONE_HIGHMEM];
> > +#else
> > +   return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
> > +#endi

Re: Memory hotplug regression in 4.13

2017-12-01 Thread Seth Forshee

On Mon, Sep 25, 2017 at 02:58:25PM +0200, Michal Hocko wrote:
> On Thu 21-09-17 00:40:34, Seth Forshee wrote:
> > On Wed, Sep 20, 2017 at 11:29:31AM +0200, Michal Hocko wrote:
> > > Hi,
> > > I am currently at a conference so I will most probably get to this next
> > > week but I will try to ASAP.
> > > 
> > > On Tue 19-09-17 11:41:14, Seth Forshee wrote:
> > > > Hi Michal,
> > > > 
> > > > I'm seeing oopses in various locations when hotplugging memory in an x86
> > > > vm while running a 32-bit kernel. The config I'm using is attached. To
> > > > reproduce I'm using kvm with the memory options "-m
> > > > size=512M,slots=3,maxmem=2G". Then in the qemu monitor I run:
> > > > 
> > > >   object_add memory-backend-ram,id=mem1,size=512M
> > > >   device_add pc-dimm,id=dimm1,memdev=mem1
> > > > 
> > > > Not long after that I'll see an oops, not always in the same location
> > > > but most often in wp_page_copy, like this one:
> > > 
> > > This is rather surprising. How do you online the memory?
> > 
> > The kernel has CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y.
> 
> OK, so the memory gets online automagically at the time when it is
> hotadded. Could you send the full dmesg?
> 
> > > > [   24.673623] BUG: unable to handle kernel paging request at d000
> > > > [   24.675569] IP: wp_page_copy+0xa8/0x660
> > > 
> > > could you resolve the IP into the source line?
> > 
> > It seems I don't have that kernel anymore, but I've got a 4.14-rc1 build
> > and the problem still occurs there. It's pointing to the call to
> > __builtin_memcpy in memcpy (include/linux/string.h line 340), which we
> > get to via wp_page_copy -> cow_user_page -> copy_user_highpage.
> 
> Hmm, this is interesting. That would mean that we have successfully
> mapped the destination page but its memory is still not accessible.
> 
> Right now I do not see how the patch you have bisected to could make any
> difference because it only postponed the onlining to be independent but
> your config simply onlines automatically so there shouldn't be any
> semantic change. Maybe there is some sort of off-by-one or something.
> 
> I will try to investigate some more. Do you think it would be possible
> to configure kdump on your system and provide me with the vmcore in some
> way?

Sorry, I got busy with other stuff and this kind of fell off my radar.
It came to my attention again recently though.

I was looking through the hotplug rework changes, and I noticed that
32-bit x86 previously was using ZONE_HIGHMEM as a default but after the
rework it doesn't look like it's possible for memory to be associated
with ZONE_HIGHMEM when onlining. So I made the change below against 4.14
and am now no longer seeing the oopses.

I'm sure this isn't the correct fix, but I think it does confirm that
the problem is that the memory should be associated with ZONE_HIGHMEM
but is not.

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d4b5f29906b9..fddc134c5c3b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -833,6 +833,12 @@ void __ref move_pfn_range_to_zone(struct zone *zone,
set_zone_contiguous(zone);
 }
 
+#ifdef CONFIG_HIGHMEM
+static enum zone_type default_zone = ZONE_HIGHMEM;
+#else
+static enum zone_type default_zone = ZONE_NORMAL;
+#endif
+
 /*
  * Returns a default kernel memory zone for the given pfn range.
  * If no kernel zone covers this pfn range it will automatically go
@@ -844,14 +850,14 @@ static struct zone *default_kernel_zone_for_pfn(int nid, 
unsigned long start_pfn
struct pglist_data *pgdat = NODE_DATA(nid);
int zid;
 
-   for (zid = 0; zid <= ZONE_NORMAL; zid++) {
+   for (zid = 0; zid <= default_zone; zid++) {
struct zone *zone = &pgdat->node_zones[zid];
 
if (zone_intersects(zone, start_pfn, nr_pages))
return zone;
}
 
-   return &pgdat->node_zones[ZONE_NORMAL];
+   return &pgdat->node_zones[default_zone];
 }
 
 static inline struct zone *default_zone_for_pfn(int nid, unsigned long 
start_pfn,

[PATCH] selftests/powerpc: Use snprintf to construct DSCR sysfs interface paths

2017-09-28 Thread Seth Forshee

Currently sprintf is used, and while paths should never exceed
the size of the buffer it is theoretically possible since
dirent.d_name is 256 bytes. As a result this trips
-Wformat-overflow, and since the test is built with -Wall -Werror
the causes the build to fail. Switch to using snprintf and skip
any paths which are too long for the filename buffer.

Signed-off-by: Seth Forshee 
---
 tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c 
b/tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c
index 17fb1b43c320..1899bd85121f 100644
--- a/tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c
+++ b/tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c
@@ -53,6 +53,8 @@ static int check_all_cpu_dscr_defaults(unsigned long val)
}
 
while ((dp = readdir(sysfs))) {
+   int len;
+
if (!(dp->d_type & DT_DIR))
continue;
if (!strcmp(dp->d_name, "cpuidle"))
@@ -60,7 +62,9 @@ static int check_all_cpu_dscr_defaults(unsigned long val)
if (!strstr(dp->d_name, "cpu"))
continue;
 
-   sprintf(file, "%s%s/dscr", CPU_PATH, dp->d_name);
+   len = snprintf(file, LEN_MAX, "%s%s/dscr", CPU_PATH, 
dp->d_name);
+   if (len >= LEN_MAX)
+   continue;
if (access(file, F_OK))
continue;
 
-- 
2.14.1

[PATCH] powerpc: Always initialize input array when calling epapr_hypercall()

2017-09-28 Thread Seth Forshee

Several callers to epapr_hypercall() pass an uninitialized stack
allocated array for the input arguments, presumably because they
have no input arguments. However this can produce errors like
this one

 arch/powerpc/include/asm/epapr_hcalls.h:470:42: error: 'in' may be used 
uninitialized in this function [-Werror=maybe-uninitialized]
  unsigned long register r3 asm("r3") = in[0];
~~^~~

Fix callers to this function to always zero-initialize the input
arguments array to prevent this.

Signed-off-by: Seth Forshee 
---
 arch/powerpc/include/asm/epapr_hcalls.h | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/epapr_hcalls.h 
b/arch/powerpc/include/asm/epapr_hcalls.h
index 334459ad145b..90863245df53 100644
--- a/arch/powerpc/include/asm/epapr_hcalls.h
+++ b/arch/powerpc/include/asm/epapr_hcalls.h
@@ -508,7 +508,7 @@ static unsigned long epapr_hypercall(unsigned long *in,
 
 static inline long epapr_hypercall0_1(unsigned int nr, unsigned long *r2)
 {
-   unsigned long in[8];
+   unsigned long in[8] = {0};
unsigned long out[8];
unsigned long r;
 
@@ -520,7 +520,7 @@ static inline long epapr_hypercall0_1(unsigned int nr, 
unsigned long *r2)
 
 static inline long epapr_hypercall0(unsigned int nr)
 {
-   unsigned long in[8];
+   unsigned long in[8] = {0};
unsigned long out[8];
 
return epapr_hypercall(in, out, nr);
@@ -528,7 +528,7 @@ static inline long epapr_hypercall0(unsigned int nr)
 
 static inline long epapr_hypercall1(unsigned int nr, unsigned long p1)
 {
-   unsigned long in[8];
+   unsigned long in[8] = {0};
unsigned long out[8];
 
in[0] = p1;
@@ -538,7 +538,7 @@ static inline long epapr_hypercall1(unsigned int nr, 
unsigned long p1)
 static inline long epapr_hypercall2(unsigned int nr, unsigned long p1,
unsigned long p2)
 {
-   unsigned long in[8];
+   unsigned long in[8] = {0};
unsigned long out[8];
 
in[0] = p1;
@@ -549,7 +549,7 @@ static inline long epapr_hypercall2(unsigned int nr, 
unsigned long p1,
 static inline long epapr_hypercall3(unsigned int nr, unsigned long p1,
unsigned long p2, unsigned long p3)
 {
-   unsigned long in[8];
+   unsigned long in[8] = {0};
unsigned long out[8];
 
in[0] = p1;
@@ -562,7 +562,7 @@ static inline long epapr_hypercall4(unsigned int nr, 
unsigned long p1,
unsigned long p2, unsigned long p3,
unsigned long p4)
 {
-   unsigned long in[8];
+   unsigned long in[8] = {0};
unsigned long out[8];
 
in[0] = p1;
-- 
2.14.1

Re: Memory hotplug regression in 4.13

2017-09-20 Thread Seth Forshee

On Wed, Sep 20, 2017 at 11:29:31AM +0200, Michal Hocko wrote:
> Hi,
> I am currently at a conference so I will most probably get to this next
> week but I will try to ASAP.
> 
> On Tue 19-09-17 11:41:14, Seth Forshee wrote:
> > Hi Michal,
> > 
> > I'm seeing oopses in various locations when hotplugging memory in an x86
> > vm while running a 32-bit kernel. The config I'm using is attached. To
> > reproduce I'm using kvm with the memory options "-m
> > size=512M,slots=3,maxmem=2G". Then in the qemu monitor I run:
> > 
> >   object_add memory-backend-ram,id=mem1,size=512M
> >   device_add pc-dimm,id=dimm1,memdev=mem1
> > 
> > Not long after that I'll see an oops, not always in the same location
> > but most often in wp_page_copy, like this one:
> 
> This is rather surprising. How do you online the memory?

The kernel has CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y.

> > [   24.673623] BUG: unable to handle kernel paging request at d000
> > [   24.675569] IP: wp_page_copy+0xa8/0x660
> 
> could you resolve the IP into the source line?

It seems I don't have that kernel anymore, but I've got a 4.14-rc1 build
and the problem still occurs there. It's pointing to the call to
__builtin_memcpy in memcpy (include/linux/string.h line 340), which we
get to via wp_page_copy -> cow_user_page -> copy_user_highpage.

Thanks,
Seth

Re: [PATCH] selftests/seccomp: Support glibc 2.26 siginfo_t.h

2017-09-07 Thread Seth Forshee

On Thu, Sep 07, 2017 at 04:32:46PM -0700, Kees Cook wrote:
> The 2.26 release of glibc changed how siginfo_t is defined, and the earlier
> work-around to using the kernel definition are no longer needed. The old
> way needs to stay around for a while, though.
> 
> Reported-by: Seth Forshee 
> Cc: Andy Lutomirski 
> Cc: Will Drewry 
> Cc: Shuah Khan 
> Cc: linux-kselft...@vger.kernel.org
> Cc: sta...@vger.kernel.org
> Signed-off-by: Kees Cook 
> ---
> Seth, can you double check this to confirm it works for you too? This builds
> and tests correctly for me on both Ubuntu 17.10 (-proposed) with glibc 2.26
> and with earlier distros with 2.24, etc.

It builds and tests correctly for me too, with both glibc 2.26 and 2.24.

Tested-by: Seth Forshee 

Thanks!

seccomp selftest fails to build with glibc 2.26

2017-09-07 Thread Seth Forshee

Hi Kees,

I'm seeing build failures with your seccomp selftest when using glibc
2.26. The first are related to changing macro names from __have_sig*_t
to __sig*_t_defined. But after defining those there are more conflicting
definitions. I was able to get it to build with the changes below,
however it's ugly so I'm hesitant to suggest that it's a fix (and I
haven't tested with older glibc either).

The full build output is a little lengthy, so I've pasted it at
http://pastebin.ubuntu.com/25486192/ rather than including it inline.

Thanks,
Seth


diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 03f1fa49..d234a3e5 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -11,6 +11,13 @@
 #define __have_sigval_t 1
 #define __have_sigevent_t 1
 
+/* These fix errors with glibc 2.26 */
+#define __siginfo_t_defined 1
+#define __sigval_t_defined 1
+#define __sigevent_t_defined 1
+#define _BITS_SIGINFO_CONSTS_H 1
+#define _BITS_SIGEVENT_CONSTS_H 1
+
 #include 
 #include 
 #include

[PATCH] scsi: aacraid: Don't copy uninitialized stack memory to userspace

2017-06-23 Thread Seth Forshee

Both aac_send_raw_srb() and aac_get_hba_info() may copy stack
allocated structs to userspace without initializing all members
of these structs. Clear out this memory to prevent information
leaks.

Fixes: 423400e64d377 ("scsi: aacraid: Include HBA direct interface")
Fixes: c799d519bf088 ("scsi: aacraid: Retrieve HBA host information ioctl")
Signed-off-by: Seth Forshee 
---
 drivers/scsi/aacraid/commctrl.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/aacraid/commctrl.c b/drivers/scsi/aacraid/commctrl.c
index d2f8d5954840..476ada6e39d0 100644
--- a/drivers/scsi/aacraid/commctrl.c
+++ b/drivers/scsi/aacraid/commctrl.c
@@ -950,6 +950,7 @@ static int aac_send_raw_srb(struct aac_dev* dev, void 
__user * arg)
&((struct aac_native_hba *)srbfib->hw_fib_va)->resp.err;
struct aac_srb_reply reply;
 
+   memset(&reply, 0, sizeof(reply));
reply.status = ST_OK;
if (srbfib->flags & FIB_CONTEXT_FLAG_FASTRESP) {
/* fast response */
@@ -1019,6 +1020,7 @@ static int aac_get_hba_info(struct aac_dev *dev, void 
__user *arg)
 {
struct aac_hba_info hbainfo;
 
+   memset(&hbainfo, 0, sizeof(hbainfo));
hbainfo.adapter_number  = (u8) dev->id;
hbainfo.system_io_bus_number= dev->pdev->bus->number;
hbainfo.device_number   = (dev->pdev->devfn >> 3);
-- 
2.11.0

Re: audit regressions in 4.11

2017-04-09 Thread Seth Forshee

On Sun, Apr 09, 2017 at 09:14:03AM -0400, Paul Moore wrote:
> On Sat, Apr 8, 2017 at 11:02 PM, Seth Forshee
>  wrote:
> > I've observed audit regressions in 4.11-rc when not using a userspace
> > audit daemon. The most obvious issue is that audit messages are not
> > appearing in dmesg anymore. If a sufficient number of audit messages are
> > generated the kernel will also start invoking the OOM killer.
> >
> > It looks like previously, when there's no auditd in userspace kauditd
> > would call kauditd_hold_skb(), which prints the message using printk and
> > either frees the skb or queues it (with a limit on the number of queued
> > skb's by default).
> >
> > Since 5b52330bbfe6 "audit: fix auditd/kernel connection state tracking"
> > when there's no auditd kauditd will instead use the retry queue, which
> > has no limit. But it will not process the retry queue when there's no
> > auditd, so messages pile up there indefinitely.
> 
> Hi Seth,
> 
> Thanks for the report.  Let me play with this and think on it for a
> bit, but looking at the code again I think the issue is that we check
> to see if auditd is connected at the top of the kauditd_thread() loop
> and if it isn't we skip right to the main_queue label and bypass the
> hold/retry queue processing which has the logic to ensure the retry
> queue is managed correctly.  My initial thinking is that the fix is to
> check and see if auditd is connected in kauditd_retry_skb(), if it
> isn't we skip the retry queue and call kauditd_hold_skb(), if auditd
> is connected we add the record to the retry queue (what we currently
> do).

Yeah, my first thought was to make this change:

kauditd_send_queue(sk, portid, &audit_queue, 1,
   kauditd_send_multicast_skb,
-  kauditd_retry_skb);
+  sk ? kauditd_retry_skb : kauditd_hold_skb);

However some scenarios could result in unbounded queueing on the hold
queue as well, so I'm not sure if that's quite enough.

Thanks,
Seth

audit regressions in 4.11

2017-04-08 Thread Seth Forshee

I've observed audit regressions in 4.11-rc when not using a userspace
audit daemon. The most obvious issue is that audit messages are not
appearing in dmesg anymore. If a sufficient number of audit messages are
generated the kernel will also start invoking the OOM killer.

It looks like previously, when there's no auditd in userspace kauditd
would call kauditd_hold_skb(), which prints the message using printk and
either frees the skb or queues it (with a limit on the number of queued
skb's by default).

Since 5b52330bbfe6 "audit: fix auditd/kernel connection state tracking"
when there's no auditd kauditd will instead use the retry queue, which
has no limit. But it will not process the retry queue when there's no
auditd, so messages pile up there indefinitely.

Thanks,
Seth

Re: [PATCH] vfs: Partially revert addition of cred override in follow_automount()

2017-02-22 Thread Seth Forshee

On Wed, Feb 22, 2017 at 01:36:32PM +, David Howells wrote:
> The following commit:
> 
>   commit aeaa4a79ff6a5ed912b7362f206cf8576fca538b
>   Author: Eric W. Biederman 
>   Date:   Sat Jul 23 11:20:44 2016 -0500
>   fs: Call d_automount with the filesystems creds
> 
> brackets the call to ->d_automount() with override_creds() and
> revert_creds(), setting the initial credentials for use whilst
> automounting.
> 
> This, however, breaks AFS as it's no longer able to access the calling
> process's keyrings to read the destination on a mountpoint.  This has gone
> unnoticed till now because, to this point, stat'ing or validating the inode
> caused the body of the mountpoint to be read into the pagecache (so that we
> could determine whether what we were looking at was really a mountpoint).
> 
> However, the page containing the mountpoint destination is merely *cached*
> and not pinned.  If it gets discarded and we try to read it in d_automount,
> we may fail because we have no authentication tokens available.
> 
> So, for the moment, revert the addition of override_creds() and
> revert_creds() and their variable.
> 
> Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds")
> Signed-off-by: David Howells 
> cc: Seth Forshee 
> cc: "Eric W. Biederman" 
> cc: Al Viro 

Eric's already applied a patch that should fix this problem. As far as I
know it's only been tested against NFS though, so you might want to test
that it also fixes the issue with AFS.

https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=93faccbbfa958a9668d3ab4e30f38dd205cee8d8

Thanks,
Seth

Re: cifs mount regression in 4.8 and 4.4 stable

2016-09-23 Thread Seth Forshee

On Thu, Sep 22, 2016 at 10:27:56AM -0500, Seth Forshee wrote:
> On Thu, Sep 22, 2016 at 04:17:09PM +0100, Sachin Prabhu wrote:
> > On Thu, 2016-09-22 at 10:09 -0500, Seth Forshee wrote:
> > > We've received reports from users of a cifs mount regression in our
> > > 4.4-based kernel, e.g. [1]. It is fixed by reverting the follwing
> > > commit
> > > from 4.8 which was applied to 4.4 stable:
> > > 
> > >  a6b5058 fs/cifs: make share unaccessible at root level mountable
> > > 
> > > Testing against 4.8-rc7 shows that the problem is present there as
> > > well.
> > > 
> > > Thanks,
> > > Seth
> > > 
> > > [1] http://bugs.launchpad.net/bugs/1626112
> > 
> > Hello Seth, 
> > 
> > We have identified some regressions introduced by the mentioned patch
> > These include
> 
> I saw those, but none of the ones already in Linus's tree fix the
> problem.
> 
> > a) mounting of DFS shares breaks. The fix is included in Steve's tree
> > at
> > https://git.samba.org/?p=sfrench/cifs-2.6.git;a=commit;h=de5233745cd59c
> > f5853d963ad216067788a87594
> 
> However this one isn't in Linus's tree yet.
> 
> I'm not certain what circumstances cause the failure, but based on what
> I see in dmesg it looks likely that this one is the fix. I'll get some
> testing with this applied.

Looks like this is is the fix we need. Thanks for your help.

Seth

Re: cifs mount regression in 4.8 and 4.4 stable

2016-09-22 Thread Seth Forshee

On Thu, Sep 22, 2016 at 04:17:09PM +0100, Sachin Prabhu wrote:
> On Thu, 2016-09-22 at 10:09 -0500, Seth Forshee wrote:
> > We've received reports from users of a cifs mount regression in our
> > 4.4-based kernel, e.g. [1]. It is fixed by reverting the follwing
> > commit
> > from 4.8 which was applied to 4.4 stable:
> > 
> >  a6b5058 fs/cifs: make share unaccessible at root level mountable
> > 
> > Testing against 4.8-rc7 shows that the problem is present there as
> > well.
> > 
> > Thanks,
> > Seth
> > 
> > [1] http://bugs.launchpad.net/bugs/1626112
> 
> Hello Seth, 
> 
> We have identified some regressions introduced by the mentioned patch
> These include

I saw those, but none of the ones already in Linus's tree fix the
problem.

> a) mounting of DFS shares breaks. The fix is included in Steve's tree
> at
> https://git.samba.org/?p=sfrench/cifs-2.6.git;a=commit;h=de5233745cd59c
> f5853d963ad216067788a87594

However this one isn't in Linus's tree yet.

I'm not certain what circumstances cause the failure, but based on what
I see in dmesg it looks likely that this one is the fix. I'll get some
testing with this applied.

Since the broken patch was applied to stable kernels, all of these
kernels need to receive those fixes as well.

Thanks,
Seth

cifs mount regression in 4.8 and 4.4 stable

2016-09-22 Thread Seth Forshee

We've received reports from users of a cifs mount regression in our
4.4-based kernel, e.g. [1]. It is fixed by reverting the follwing commit
from 4.8 which was applied to 4.4 stable:

 a6b5058 fs/cifs: make share unaccessible at root level mountable

Testing against 4.8-rc7 shows that the problem is present there as well.

Thanks,
Seth

[1] http://bugs.launchpad.net/bugs/1626112

[PATCH RESEND] xenbus: Use proc_create_mount_point() to create /proc/xen

2016-08-30 Thread Seth Forshee

Mounting proc in user namespace containers fails if the xenbus
filesystem is mounted on /proc/xen because this directory fails
the "permanently empty" test. proc_create_mount_point() exists
specifically to create such mountpoints in proc but is currently
proc-internal. Export this interface to modules, then use it in
xenbus when creating /proc/xen.

Acked-by: David Vrabel 
Signed-off-by: Seth Forshee 
---

Resending to add some Cc's I missed the first time.

 drivers/xen/xenbus/xenbus_probe.c | 2 +-
 fs/proc/generic.c | 1 +
 fs/proc/internal.h| 1 -
 include/linux/proc_fs.h   | 2 ++
 4 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_probe.c 
b/drivers/xen/xenbus/xenbus_probe.c
index 33a31cfef55d..b5c1dec4a7c2 100644
--- a/drivers/xen/xenbus/xenbus_probe.c
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -826,7 +826,7 @@ static int __init xenbus_init(void)
 * Create xenfs mountpoint in /proc for compatibility with
 * utilities that expect to find "xenbus" under "/proc/xen".
 */
-   proc_mkdir("xen", NULL);
+   proc_create_mount_point("xen");
 #endif
 
 out_error:
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index c633476616e0..be014c544d50 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -477,6 +477,7 @@ struct proc_dir_entry *proc_create_mount_point(const char 
*name)
}
return ent;
 }
+EXPORT_SYMBOL(proc_create_mount_point);
 
 struct proc_dir_entry *proc_create_data(const char *name, umode_t mode,
struct proc_dir_entry *parent,
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 7931c558c192..ff7259559d70 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -195,7 +195,6 @@ static inline bool is_empty_pde(const struct proc_dir_entry 
*pde)
 {
return S_ISDIR(pde->mode) && !pde->proc_iops;
 }
-struct proc_dir_entry *proc_create_mount_point(const char *name);
 
 /*
  * inode.c
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index b97bf2ef996e..8bd2f726436a 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -21,6 +21,7 @@ extern struct proc_dir_entry *proc_mkdir_data(const char *, 
umode_t,
  struct proc_dir_entry *, void *);
 extern struct proc_dir_entry *proc_mkdir_mode(const char *, umode_t,
  struct proc_dir_entry *);
+struct proc_dir_entry *proc_create_mount_point(const char *name);
  
 extern struct proc_dir_entry *proc_create_data(const char *, umode_t,
   struct proc_dir_entry *,
@@ -56,6 +57,7 @@ static inline struct proc_dir_entry *proc_symlink(const char 
*name,
struct proc_dir_entry *parent,const char *dest) { return NULL;}
 static inline struct proc_dir_entry *proc_mkdir(const char *name,
struct proc_dir_entry *parent) {return NULL;}
+static inline struct proc_dir_entry *proc_create_mount_point(const char *name) 
{ return NULL; }
 static inline struct proc_dir_entry *proc_mkdir_data(const char *name,
umode_t mode, struct proc_dir_entry *parent, void *data) { return NULL; 
}
 static inline struct proc_dir_entry *proc_mkdir_mode(const char *name,
-- 
2.7.4

Re: [Xen-devel] [PATCH] xenbus: Use proc_create_mount_point() to create /proc/xen

2016-08-30 Thread Seth Forshee

On Tue, Aug 30, 2016 at 04:00:03PM +0100, David Vrabel wrote:
> On 29/08/16 16:03, Seth Forshee wrote:
> > Mounting proc in user namespace containers fails if the xenbus
> > filesystem is mounted on /proc/xen because this directory fails
> > the "permanently empty" test. proc_create_mount_point() exists
> > specifically to create such mountpoints in proc but is currently
> > proc-internal. Export this interface to modules, then use it in
> > xenbus when creating /proc/xen.
> 
> Acked-by: David Vrabel 
> 
> This either needs to be acked by the fs maintainer or go via their tree
> but you don't appear to have Cc'd the relevant people or lists.

Huh. I use a script which uses get_maintainer.pl to add the relevant
maintainers and lists, but appaerntly that failed me this time. Even
running get_maintainer.pl by hand though it doesn't suggest
linux-fsdevel though, and it seems it should.

Thanks,
Seth

Re: [PATCH] xenbus: Use proc_create_mount_point() to create /proc/xen

2016-08-30 Thread Seth Forshee

On Tue, Aug 30, 2016 at 04:48:08PM +0200, Juergen Gross wrote:
> On 29/08/16 17:03, Seth Forshee wrote:
> > Mounting proc in user namespace containers fails if the xenbus
> > filesystem is mounted on /proc/xen because this directory fails
> > the "permanently empty" test. proc_create_mount_point() exists
> > specifically to create such mountpoints in proc but is currently
> > proc-internal. Export this interface to modules, then use it in
> > xenbus when creating /proc/xen.
> > 
> > Signed-off-by: Seth Forshee 
> > ---
> >  drivers/xen/xenbus/xenbus_probe.c | 2 +-
> >  fs/proc/generic.c | 1 +
> >  fs/proc/internal.h| 1 -
> >  include/linux/proc_fs.h   | 2 ++
> >  4 files changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/xen/xenbus/xenbus_probe.c 
> > b/drivers/xen/xenbus/xenbus_probe.c
> > index 33a31cfef55d..b5c1dec4a7c2 100644
> > --- a/drivers/xen/xenbus/xenbus_probe.c
> > +++ b/drivers/xen/xenbus/xenbus_probe.c
> > @@ -826,7 +826,7 @@ static int __init xenbus_init(void)
> >  * Create xenfs mountpoint in /proc for compatibility with
> >  * utilities that expect to find "xenbus" under "/proc/xen".
> >  */
> > -   proc_mkdir("xen", NULL);
> > +   proc_create_mount_point("xen");
> >  #endif
> >  
> >  out_error:
> > diff --git a/fs/proc/generic.c b/fs/proc/generic.c
> > index c633476616e0..be014c544d50 100644
> > --- a/fs/proc/generic.c
> > +++ b/fs/proc/generic.c
> > @@ -477,6 +477,7 @@ struct proc_dir_entry *proc_create_mount_point(const 
> > char *name)
> > }
> > return ent;
> >  }
> > +EXPORT_SYMBOL(proc_create_mount_point);
> 
> EXPORT_SYMBOL_GPL()?

Other similar sorts of calls in proc (proc_mkdir in particular) are
EXPORT_SYMBOL, so I guessed this one should follow suit. But if it
should be EXPORT_SYMOBL_GPL then that's fine too.

Thanks,
Seth

[PATCH] xenbus: Use proc_create_mount_point() to create /proc/xen

2016-08-29 Thread Seth Forshee

Mounting proc in user namespace containers fails if the xenbus
filesystem is mounted on /proc/xen because this directory fails
the "permanently empty" test. proc_create_mount_point() exists
specifically to create such mountpoints in proc but is currently
proc-internal. Export this interface to modules, then use it in
xenbus when creating /proc/xen.

Signed-off-by: Seth Forshee 
---
 drivers/xen/xenbus/xenbus_probe.c | 2 +-
 fs/proc/generic.c | 1 +
 fs/proc/internal.h| 1 -
 include/linux/proc_fs.h   | 2 ++
 4 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_probe.c 
b/drivers/xen/xenbus/xenbus_probe.c
index 33a31cfef55d..b5c1dec4a7c2 100644
--- a/drivers/xen/xenbus/xenbus_probe.c
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -826,7 +826,7 @@ static int __init xenbus_init(void)
 * Create xenfs mountpoint in /proc for compatibility with
 * utilities that expect to find "xenbus" under "/proc/xen".
 */
-   proc_mkdir("xen", NULL);
+   proc_create_mount_point("xen");
 #endif
 
 out_error:
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index c633476616e0..be014c544d50 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -477,6 +477,7 @@ struct proc_dir_entry *proc_create_mount_point(const char 
*name)
}
return ent;
 }
+EXPORT_SYMBOL(proc_create_mount_point);
 
 struct proc_dir_entry *proc_create_data(const char *name, umode_t mode,
struct proc_dir_entry *parent,
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 7931c558c192..ff7259559d70 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -195,7 +195,6 @@ static inline bool is_empty_pde(const struct proc_dir_entry 
*pde)
 {
return S_ISDIR(pde->mode) && !pde->proc_iops;
 }
-struct proc_dir_entry *proc_create_mount_point(const char *name);
 
 /*
  * inode.c
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index b97bf2ef996e..8bd2f726436a 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -21,6 +21,7 @@ extern struct proc_dir_entry *proc_mkdir_data(const char *, 
umode_t,
  struct proc_dir_entry *, void *);
 extern struct proc_dir_entry *proc_mkdir_mode(const char *, umode_t,
  struct proc_dir_entry *);
+struct proc_dir_entry *proc_create_mount_point(const char *name);
  
 extern struct proc_dir_entry *proc_create_data(const char *, umode_t,
   struct proc_dir_entry *,
@@ -56,6 +57,7 @@ static inline struct proc_dir_entry *proc_symlink(const char 
*name,
struct proc_dir_entry *parent,const char *dest) { return NULL;}
 static inline struct proc_dir_entry *proc_mkdir(const char *name,
struct proc_dir_entry *parent) {return NULL;}
+static inline struct proc_dir_entry *proc_create_mount_point(const char *name) 
{ return NULL; }
 static inline struct proc_dir_entry *proc_mkdir_data(const char *name,
umode_t mode, struct proc_dir_entry *parent, void *data) { return NULL; 
}
 static inline struct proc_dir_entry *proc_mkdir_mode(const char *name,
-- 
2.7.4

Re: [PATCH v4 18/21] fuse: Add support for pid namespaces

2016-07-20 Thread Seth Forshee

On Tue, Jul 19, 2016 at 07:44:11PM -0700, Sheng Yang wrote:
> On Tue, Apr 26, 2016 at 12:36 PM, Seth Forshee
>  wrote:
> > When the userspace process servicing fuse requests is running in
> > a pid namespace then pids passed via the fuse fd are not being
> > translated into that process' namespace. Translation is necessary
> > for the pid to be useful to that process.
> >
> > Since no use case currently exists for changing namespaces all
> > translations can be done relative to the pid namespace in use
> > when fuse_conn_init() is called. For fuse this translates to
> > mount time, and for cuse this is when /dev/cuse is opened. IO for
> > this connection from another namespace will return errors.
> >
> > Requests from processes whose pid cannot be translated into the
> > target namespace are not permitted, except for requests
> > allocated via fuse_get_req_nofail_nopages. For no-fail requests
> > in.h.pid will be 0 if the pid translation fails.
> 
> Hi Seth,
> 
> This patch caused a regression in our major container use case with
> FUSE in Ubuntu 16.04, as patch was checked in as Ubuntu Sauce in
> Ubuntu 4.4.0-6.21 kernel.
> 
> The use case is:
> 1. Create a Docker container.
> 2. Inside the container, start the FUSE backend, and mounted fs.
> 3. Following step 2 in the container, create a loopback device to map
> a file in the mounted fuse to create a block device, which will be
> available to the whole system.
> 
> It works well before this commit.
> 
> The use case is broken because no matter which namespace losetup runs,
> the real request from loopback device seems always come from init ns,
> thus it will be in different ns running fuse backend. So the request
> will got denied, because the ns running fuse won't able to see the
> things from higher level(level 0 in fact) pid namespace.
> 
> I think since init pid ns has ability to access any process in the
> system, it should able to access the fuse mounted by any pid namespace
> process as well.
> 
> What you think?

It sounds like we need to remove the restriction on accessing the
filesystem from a different pid namespace. I don't think this poses a
security problem. However there's no pid mapping that is usable by the
userspace fuse process, so what do we put in the fuse request? Probably
the only candidates are 0 and 0x.

So a question for the fuse developers - is one value or the other
preferrable for fuse_in_header.pid when the pid cannot be mapped, and is
this going to cause problems for any fuse filesystems? I suspect that
few filesystems actually look at the pid anyway, and already for a
filesystem mounted in a pid namespace the values being given to
userspace won't be correct for the namespace of the fuse process.

Seth

Re: Hang due to nfs letting tasks freeze with locked inodes

2016-07-11 Thread Seth Forshee

On Mon, Jul 11, 2016 at 07:03:31AM -0400, Jeff Layton wrote:
> On Mon, 2016-07-11 at 09:23 +0200, Michal Hocko wrote:
> > On Fri 08-07-16 10:27:38, Jeff Layton wrote:
> > > On Fri, 2016-07-08 at 16:23 +0200, Michal Hocko wrote:
> > > > On Fri 08-07-16 08:51:54, Jeff Layton wrote:
> > > > > 
> > > > > On Fri, 2016-07-08 at 14:22 +0200, Michal Hocko wrote:
> > > > [...]
> > > > > 
> > > > > > 
> > > > > > Apart from alternative Dave was mentioning in other email, what
> > > > > > is the
> > > > > > point to use freezable wait from this path in the first place?
> > > > > > 
> > > > > > nfs4_handle_exception does nfs4_wait_clnt_recover from the same
> > > > > > path and
> > > > > > that does wait_on_bit_action with TASK_KILLABLE so we are waiting
> > > > > > in two
> > > > > > different modes from the same path AFAICS. There do not seem to
> > > > > > be other
> > > > > > callers of nfs4_delay outside of nfs4_handle_exception. Sounds
> > > > > > like
> > > > > > something is not quite right here to me. If the nfs4_delay did
> > > > > > regular
> > > > > > wait then the freezing would fail as well but at least it would
> > > > > > be clear
> > > > > > who is the culrprit rather than having an indirect dependency.
> > > > > The codepaths involved there are a lot more complex than that
> > > > > unfortunately.
> > > > > 
> > > > > nfs4_delay is the function that we use to handle the case where the
> > > > > server returns NFS4ERR_DELAY. Basically telling us that it's too
> > > > > busy
> > > > > right now or has some transient error and the client should retry
> > > > > after
> > > > > a small, sliding delay.
> > > > > 
> > > > > That codepath could probably be made more freezer-safe. The typical
> > > > > case however, is that we've sent a call and just haven't gotten a
> > > > > reply. That's the trickier one to handle.
> > > > Why using a regular non-freezable wait would be a problem?
> > > 
> > > It has been a while since I looked at that code, but IIRC, that could
> > > block the freezer for up to 15s, which is a significant portion of the
> > > 20s that you get before the freezer gives up.
> > 
> > But how does that differ from the situation when the freezer has to give
> > up on the timeout because another task fails due to lock dependency.
> > 
> > As Trond and Dave have written in other emails. It is really danngerous
> > to freeze a task while it is holding locks and other resources.
> 
> It's not really dangerous if you're freezing every task on the host.
> Sure, you're freezing with locks held, but everything else is freezing
> too, so nothing will be contending for those locks.

Unless you have tasks either already waiting on those locks or that will
attaempt to lock them before calling try_to_freeze. That happens to be
the case in this cgroup freezer hang too, all the tasks stuck waiting on
the i_mutex are p being frozen.

> I'm not at all opposed to changing how all of that works. My only
> stipulation is that we not break the ability to reliably suspend a host
> that is actively using an NFS mount. If you can come up with a way to
> do that that also works for freezing cgroups, then I'm all for it.

The deadlock that we've seen should apply equally to suspend and to the
cgroup freezer. The only difference is that suspend will eventually time
out and abort the suspend whereas the cgroup freezer does not.

Seth

Re: Hang due to nfs letting tasks freeze with locked inodes

2016-07-08 Thread Seth Forshee

On Fri, Jul 08, 2016 at 09:53:30AM +1000, Dave Chinner wrote:
> On Wed, Jul 06, 2016 at 06:07:18PM -0400, Jeff Layton wrote:
> > On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> > > We're seeing a hang when freezing a container with an nfs bind mount while
> > > running iozone. Two iozone processes were hung with this stack trace.
> > > 
> > >  [] schedule+0x35/0x80
> > >  [] schedule_preempt_disabled+0xe/0x10
> > >  [] __mutex_lock_slowpath+0xb9/0x130
> > >  [] mutex_lock+0x1f/0x30
> > >  [] do_unlinkat+0x12b/0x2d0
> > >  [] SyS_unlink+0x16/0x20
> > >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > > 
> > > This seems to be due to another iozone thread frozen during unlink with
> > > this stack trace:
> > > 
> > >  [] __refrigerator+0x7a/0x140
> > >  [] nfs4_handle_exception+0x118/0x130 [nfsv4]
> > >  [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> > >  [] nfs_unlink+0x149/0x350 [nfs]
> > >  [] vfs_unlink+0xf1/0x1a0
> > >  [] do_unlinkat+0x279/0x2d0
> > >  [] SyS_unlink+0x16/0x20
> > >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > > 
> > > Since nfs is allowing the thread to be frozen with the inode locked it's
> > > preventing other threads trying to lock the same inode from freezing. It
> > > seems like a bad idea for nfs to be doing this.
> > > 
> > 
> > Yeah, known problem. Not a simple one to fix though.
> 
> Actually, it is simple to fix.
> 
>  not sys_sync(), to suspend filesystem operations>
> 
> i.e. the VFS blocks new operations from starting, and then then the
> NFS client simply needs to implement ->freeze_fs to drain all it's
> active operations before returning. Problem solved.

No, this won't solve my problem. We're not doing a full suspend, rather
using a freezer cgroup to freeze a subset of processes. We don't want to
want to fully freeze the filesystem.

Thanks,
Seth

Re: Hang due to nfs letting tasks freeze with locked inodes

2016-07-08 Thread Seth Forshee

On Fri, Jul 08, 2016 at 02:22:24PM +0200, Michal Hocko wrote:
> On Wed 06-07-16 18:07:18, Jeff Layton wrote:
> > On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> > > We're seeing a hang when freezing a container with an nfs bind mount while
> > > running iozone. Two iozone processes were hung with this stack trace.
> > > 
> > >  [] schedule+0x35/0x80
> > >  [] schedule_preempt_disabled+0xe/0x10
> > >  [] __mutex_lock_slowpath+0xb9/0x130
> > >  [] mutex_lock+0x1f/0x30
> > >  [] do_unlinkat+0x12b/0x2d0
> > >  [] SyS_unlink+0x16/0x20
> > >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > > 
> > > This seems to be due to another iozone thread frozen during unlink with
> > > this stack trace:
> > > 
> > >  [] __refrigerator+0x7a/0x140
> > >  [] nfs4_handle_exception+0x118/0x130 [nfsv4]
> > >  [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> > >  [] nfs_unlink+0x149/0x350 [nfs]
> > >  [] vfs_unlink+0xf1/0x1a0
> > >  [] do_unlinkat+0x279/0x2d0
> > >  [] SyS_unlink+0x16/0x20
> > >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > > 
> > > Since nfs is allowing the thread to be frozen with the inode locked it's
> > > preventing other threads trying to lock the same inode from freezing. It
> > > seems like a bad idea for nfs to be doing this.
> > > 
> > 
> > Yeah, known problem. Not a simple one to fix though.
> 
> Apart from alternative Dave was mentioning in other email, what is the
> point to use freezable wait from this path in the first place?
> 
> nfs4_handle_exception does nfs4_wait_clnt_recover from the same path and
> that does wait_on_bit_action with TASK_KILLABLE so we are waiting in two
> different modes from the same path AFAICS. There do not seem to be other
> callers of nfs4_delay outside of nfs4_handle_exception. Sounds like
> something is not quite right here to me. If the nfs4_delay did regular
> wait then the freezing would fail as well but at least it would be clear
> who is the culrprit rather than having an indirect dependency.

It turns out there are more paths than this one doing a freezable wait,
and they're all also killable. This leads me to a slightly different
question than yours, why nfs can give up waiting in the case of a signal
but not when the task is frozen.

I know the changes below aren't "correct," but I've been experimenting
with them anyway to see what would happen. So far things seem to be
fine, and the deadlock is gone. That should give you an idea of all the
places I found using a freezable wait.

Seth


diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index f714b98..62dbe59 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -77,8 +77,8 @@ nfs_fattr_to_ino_t(struct nfs_fattr *fattr)
  */
 int nfs_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-   freezable_schedule_unsafe();
-   if (signal_pending_state(mode, current))
+   schedule();
+   if (signal_pending_state(mode, current) || freezing(current))
return -ERESTARTSYS;
return 0;
 }
diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index cb28cce..2315183 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -35,9 +35,9 @@ nfs3_rpc_wrapper(struct rpc_clnt *clnt, struct rpc_message 
*msg, int flags)
res = rpc_call_sync(clnt, msg, flags);
if (res != -EJUKEBOX)
break;
-   
freezable_schedule_timeout_killable_unsafe(NFS_JUKEBOX_RETRY_TIME);
+   schedule_timeout_killable(NFS_JUKEBOX_RETRY_TIME);
res = -ERESTARTSYS;
-   } while (!fatal_signal_pending(current));
+   } while (!fatal_signal_pending(current) && !freezing(current));
return res;
 }
 
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 98a4415..0dad2fb 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -334,9 +334,8 @@ static int nfs4_delay(struct rpc_clnt *clnt, long *timeout)
 
might_sleep();
 
-   freezable_schedule_timeout_killable_unsafe(
-   nfs4_update_delay(timeout));
-   if (fatal_signal_pending(current))
+   schedule_timeout_killable(nfs4_update_delay(timeout));
+   if (fatal_signal_pending(current) || freezing(current))
res = -ERESTARTSYS;
return res;
 }
@@ -5447,7 +5446,7 @@ int nfs4_proc_delegreturn(struct inode *inode, struct 
rpc_cred *cred, const nfs4
 static unsigned long
 nfs4_set_lock_task_retry(unsigned long timeout)
 {
-   freezable_schedule_timeout_killable_unsafe(timeout);
+   schedule_timeout_killable(timeout);
timeout <<= 1;
if (timeout > NFS4_LOCK_MAXTIMEOUT)
return NFS4_LOCK_MAXTIMEOUT;
@@ -6148,7 +6147,7 @@ nfs4_proc_lock(struct file *filp,

Re: Hang due to nfs letting tasks freeze with locked inodes

2016-07-06 Thread Seth Forshee

On Wed, Jul 06, 2016 at 06:07:18PM -0400, Jeff Layton wrote:
> On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> > We're seeing a hang when freezing a container with an nfs bind mount while
> > running iozone. Two iozone processes were hung with this stack trace.
> > 
> >  [] schedule+0x35/0x80
> >  [] schedule_preempt_disabled+0xe/0x10
> >  [] __mutex_lock_slowpath+0xb9/0x130
> >  [] mutex_lock+0x1f/0x30
> >  [] do_unlinkat+0x12b/0x2d0
> >  [] SyS_unlink+0x16/0x20
> >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > 
> > This seems to be due to another iozone thread frozen during unlink with
> > this stack trace:
> > 
> >  [] __refrigerator+0x7a/0x140
> >  [] nfs4_handle_exception+0x118/0x130 [nfsv4]
> >  [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> >  [] nfs_unlink+0x149/0x350 [nfs]
> >  [] vfs_unlink+0xf1/0x1a0
> >  [] do_unlinkat+0x279/0x2d0
> >  [] SyS_unlink+0x16/0x20
> >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > 
> > Since nfs is allowing the thread to be frozen with the inode locked it's
> > preventing other threads trying to lock the same inode from freezing. It
> > seems like a bad idea for nfs to be doing this.
> > 
> 
> Yeah, known problem. Not a simple one to fix though.
> 
> > Can nfs do something different here to prevent this? Maybe use a
> > non-freezable sleep and let the operation complete, or else abort the
> > operation and return ERESTARTSYS?
> 
> The problem with letting the op complete is that often by the time you
> get to the point of trying to freeze processes, the network interfaces
> are already shut down. So the operation you're waiting on might never
> complete. Stuff like suspend operations on your laptop fail, leading to
> fun bug reports like: "Oh, my laptop burned to crisp inside my bag
> because the suspend never completed."
> 
> You could (in principle) return something like -ERESTARTSYS iff the
> call has not yet been transmitted. If it has already been transmitted,
> then you might end up sending the call a second time (but not as an RPC
> retransmission of course). If that call was non-idempotent then you end
> up with all of _those_ sorts of problems.
> 
> Also, -ERESTARTSYS is not quite right as it doesn't always cause the
> call to be restarted. It depends on the syscall. I think this would
> probably need some other sort of syscall-restart machinery plumbed in.

I don't really know much at all about how NFS works, so I hope you don't
mind indulging me in some questions.

What happens then if you suspend waiting for an op to complete and then
resume an hour later? Will it actually succeed or end up returning some
sort of "timed out" error?

If it's going to be an error (or even likely to be one) could the op
just be aborted immediately with an error code? It just seems like there
must be something better than potentially deadlocking the kernel.

Thanks,
Seth

Hang due to nfs letting tasks freeze with locked inodes

2016-07-06 Thread Seth Forshee

We're seeing a hang when freezing a container with an nfs bind mount while
running iozone. Two iozone processes were hung with this stack trace.

 [] schedule+0x35/0x80
 [] schedule_preempt_disabled+0xe/0x10
 [] __mutex_lock_slowpath+0xb9/0x130
 [] mutex_lock+0x1f/0x30
 [] do_unlinkat+0x12b/0x2d0
 [] SyS_unlink+0x16/0x20
 [] entry_SYSCALL_64_fastpath+0x16/0x71

This seems to be due to another iozone thread frozen during unlink with
this stack trace:

 [] __refrigerator+0x7a/0x140
 [] nfs4_handle_exception+0x118/0x130 [nfsv4]
 [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
 [] nfs_unlink+0x149/0x350 [nfs]
 [] vfs_unlink+0xf1/0x1a0
 [] do_unlinkat+0x279/0x2d0
 [] SyS_unlink+0x16/0x20
 [] entry_SYSCALL_64_fastpath+0x16/0x71

Since nfs is allowing the thread to be frozen with the inode locked it's
preventing other threads trying to lock the same inode from freezing. It
seems like a bad idea for nfs to be doing this.

Can nfs do something different here to prevent this? Maybe use a
non-freezable sleep and let the operation complete, or else abort the
operation and return ERESTARTSYS?

Thanks,
Seth

Re: [PATCH v4 03/21] fs: Allow sysfs and cgroupfs to share super blocks between user namespaces

2016-05-18 Thread Seth Forshee

On Wed, May 18, 2016 at 10:45:31AM -0500, Eric W. Biederman wrote:
> > But if we do that it violates some of the assumptions of the patch to
> > rework MNT_NODEV on your testing branch (and also those behind patch 2
> > in this series). Something will need to be changed there to prevent a
> > regression in mount behavior when a user ns tries to mount without
> > MNT_NODEV when the mount inherited from its parent has it set.
> 
> Thank you for pointing that out.  I will look into that.
> 
> I believe I know exactly what you are talking about.  Of the choices I
> think it is better to a minor localized change in the fs_fully_visible
> logic than it is to cause problems elsewhere.

Agreed.

> >> Apologies for not catching this earlier.
> >
> > Actually this is a more recent patch, so you possibly hadn't seen it
> > before.
> >
> >> I am looking at folding all of this into the patch that introduces
> >> sget_userns so that even bisects won't have regresssions.
> >
> > That's fine with me.
> 
> And thank you for keeping everything as separate patches.  That is at
> least helping me catch up.  Even if I don't agree that these things
> should be separate come merge time.

Honestly I probably would have squashed some of them into that first
patch myself if you hadn't already applied it to your testing branch, so
that's all just luck.

Keep in mind that I also have that patch for mqueue that isn't in this
series, and I haven't yet checked to see if the 4.7 merges introduce
anything which is going to require updating these patches. I was
planning to wait and send out updates after -rc1, but if you want that
stuff sooner just let me know.

Thanks,
Seth

Re: [PATCH v4 03/21] fs: Allow sysfs and cgroupfs to share super blocks between user namespaces

2016-05-17 Thread Seth Forshee

On Tue, May 17, 2016 at 05:39:33PM -0500, Eric W. Biederman wrote:
> Seth Forshee  writes:
> 
> > Both of these filesystems already have use cases for mounting the
> > same super block from multiple user namespaces. For sysfs this
> > happens when using criu for snapshotting a container, where sysfs
> > is mnounted in the containers network ns but the hosts user ns.
> > The cgroup filesystem shares the same super block for all mounts
> > of the same hierarchy regardless of the namespace.
> >
> > As a result, the restriction on mounting a super block from a
> > single user namespace creates regressions for existing uses of
> > these filesystems. For these specific filesystems this
> > restriction isn't really necessary since the backing store is
> > objects in kernel memory and thus the ids assigned from inodes
> > is not subject to translation relative to s_user_ns.
> >
> > Add a new filesystem flag, FS_USERNS_SHARE_SB, which when set
> > causes sget_userns() to skip the check of s_user_ns. Set this
> > flag for the sysfs and cgroup filesystems to fix the
> > regressions.
> 
> So this one needs to be sget_userns(..., &init_user_ns, ...).
> And not a new special case.

This is actually what I wanted to do, but based on a previous discussion
where I had suggested doing this (for a different reason) I came away
thinking you did not want it that way. So I'm happy with that change.

But if we do that it violates some of the assumptions of the patch to
rework MNT_NODEV on your testing branch (and also those behind patch 2
in this series). Something will need to be changed there to prevent a
regression in mount behavior when a user ns tries to mount without
MNT_NODEV when the mount inherited from its parent has it set.

> Apologies for not catching this earlier.

Actually this is a more recent patch, so you possibly hadn't seen it
before.

> I am looking at folding all of this into the patch that introduces
> sget_userns so that even bisects won't have regresssions.

That's fine with me.

Thanks,
Seth

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-16 Thread Seth Forshee

On Mon, May 16, 2016 at 11:42:46AM -0500, Eric W. Biederman wrote:
> Seth Forshee  writes:
> 
> > On Sat, May 14, 2016 at 09:21:55PM -0500, Eric W. Biederman wrote:
> >>   I have slowly been working with Seth Forshee on these issues as
> >>   the last thing I want is to introduce more security bugs right now.
> >>   Seth being a braver man than I am has already merged his changes into
> >>   the Ubuntu kernel.
> >
> > Maybe not quite so brave as you think. I also threw on a patch to
> > disable the feature unless explicitly enabled by a sys admin.
> >
> >> James I think you are missing the fact that all filesystems already have
> >> the make_kuid and make_kgid calls right where the data comes off disk,
> >> and the from_kuid and from_kgid calls right where the on-disk data is
> >> being created just before it goes on disk.  Which means that the actual
> >> impact on filesystems of the translation is trivial.
> >
> > It is fairly simple but a there's bit more that just id conversions to
> > change. With ext4 I found that there were mount options which needed to
> > be restricted, some capability checks to update, and access to external
> > journal devices must be checked. In all it wasn't a whole lot of changes
> > to the filesystem though. Fuse was a bit more involved, but the
> > complexities there won't apply to other filesystems.
> >
> >> Djalal if you could work with Seth I think that would be very useful.  I
> >> know I am dragging my heels there but I really hope I can dig in and get
> >> everything reviewed and merged soonish.
> >
> > That would make me very happy :-)
> 
> It has missed this merge window :( But I am hoping with am aiming to
> review them and get your patches (or modified versions of your patches)
> into my tree as soon after rc1 as humanly possible.
> 
> Part of that will have to be the fix for mqueuefs, that Docker just hit.

Yeah, I've got a patch that's been tested to fix the bug, so I'll send
new patches which include that before long.

Seth

Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

2016-05-16 Thread Seth Forshee

On Sat, May 14, 2016 at 09:21:55PM -0500, Eric W. Biederman wrote:
>   I have slowly been working with Seth Forshee on these issues as
>   the last thing I want is to introduce more security bugs right now.
>   Seth being a braver man than I am has already merged his changes into
>   the Ubuntu kernel.

Maybe not quite so brave as you think. I also threw on a patch to
disable the feature unless explicitly enabled by a sys admin.

> James I think you are missing the fact that all filesystems already have
> the make_kuid and make_kgid calls right where the data comes off disk,
> and the from_kuid and from_kgid calls right where the on-disk data is
> being created just before it goes on disk.  Which means that the actual
> impact on filesystems of the translation is trivial.

It is fairly simple but a there's bit more that just id conversions to
change. With ext4 I found that there were mount options which needed to
be restricted, some capability checks to update, and access to external
journal devices must be checked. In all it wasn't a whole lot of changes
to the filesystem though. Fuse was a bit more involved, but the
complexities there won't apply to other filesystems.

> Djalal if you could work with Seth I think that would be very useful.  I
> know I am dragging my heels there but I really hope I can dig in and get
> everything reviewed and merged soonish.

That would make me very happy :-)

I'm happy to look with Djalal for commonalities. I did skim his patches
before, and based on that all I really expect to find are things related
to permission checks when ids don't map. The rest seems fundamentally
different.

Seth

Re: [RFC v2 PATCH 3/8] fs: Treat foreign mounts as nosuid

2016-05-05 Thread Seth Forshee

On Wed, May 04, 2016 at 11:19:04PM +, Serge Hallyn wrote:
> Quoting Djalal Harouni (tix...@gmail.com):
> > If a process gets access to a mount from a different user
> > namespace, that process should not be able to take advantage of
> > setuid files or selinux entrypoints from that filesystem.  Prevent
> > this by treating mounts from other mount namespaces and those not
> > owned by current_user_ns() or an ancestor as nosuid.
> > 
> > This patch was just adapted from the original one that was written
> > by Andy Lutomirski 
> > https://www.redhat.com/archives/dm-devel/2016-April/msg00374.html
> 
> I'm not sure that this makes sense given what you're doing.  In the
> case of Seth's set, a filesystem is mounted specifically (and privately)
> in a user namespace.  We don't want for instance the initial user ns
> to find a link to a setuid-root exploit left in the container-mounted
> filesystem.
> 
> But you are having a parent user namespace mount the fs so that its
> children can all access the fs, uid-shifted for convenience.  Not
> allowing the child namespaces to make use of setuid-root does not
> seem applicable here.

Right, the problem addressed by this patch probably isn't relevant to
this sort of uid shifting.

But I think there's another problem that needs to be addressed.
bprm_fill_uid() still gets the ids for sxid files unshifted from the
inode. We already protect against sxid to any user not in
bprm->cred->user_ns, so it will just ignore the sxid instead of e.g.
suid as global root from the id shifted mount, which is good. What would
be wanted though is to use the shifted ids so that something like
suid-root ping in the container rootfs would work.

Seth

Re: [RFC PATCH 0/0] VFS:userns: support portable root filesystems

2016-05-04 Thread Seth Forshee

On Wed, May 04, 2016 at 01:21:46AM +0200, Djalal Harouni wrote:
> This RFC tries to explore how to support filesystem operations inside
> user namespace using only VFS and a per mount namespace solution. This
> allows to take advantage of user namespace separations without
> introducing any change at the filesystems level. All this is handled
> with the virtual view of mount namespaces.
> 
> 
> 1) Presentation:
> 
> 
> The main aim is to support portable root filesystems and allow containers,
> virtual machines and other cases to use the same root filesystem.
> Due to security reasons, filesystems can't be mounted inside user
> namespaces, and mounting them outside will not solve the problem since
> they will show up with the wrong UIDs/GIDs. Read and write operations
> will also fail and so on.
> 
> The current userspace solution is to automatically chown the whole root
> filesystem before starting a container, example:
> (host) init_user_ns  100:1065536  => (container) user_ns_X1 0:65535
> (host) init_user_ns  200:2065536  => (container) user_ns_Y1 0:65535
> (host) init_user_ns  300:3065536  => (container) user_ns_Z1 0:65535
> ...
> 
> Every time a chown is called, files are changed and so on... This
> prevents to have portable filesystems where you can throw anywhere
> and boot. Having an extra step to adapt the filesystem to the current
> mapping and persist it will not allow to verify its integrity, it makes
> snapshots and migration a bit harder, and probably other limitations...
> 
> It seems that there are multiple ways to allow user namespaces combine
> nicely with filesystems, but none of them is that easy. The bind mount
> and pin the user namespace during mount time will not work, bind mounts
> share the same super block, hence you may endup working on the wrong
> vfsmount context and there is no easy way to get out of that...
> 
> Using the user namespace in the super block seems the way to go, and
> there is the "Support fuse mounts in user namespaces" [1] patches which
> seem nice but perhaps too complex!? there is also the overlayfs solution,
> and finaly the VFS layer solution.

I'm not sure if you're meaning to propose your patches as an alternative
to mine or not, but I think they're orthogonal. My goal is to allow
containers in user namespaces to mount some subset of filesystem types
(not specifically container root filesystems, but in general), which
your patches won't enable. Your goal is to share a rootfs between
multiple containers with different uid/gid shifts, which my patches
don't help with.

> We present here a simple VFS solution, everything is packed inside VFS,
> filesystems don't need to know anything (except probably XFS, and special
> operations inside union filesystems). Currently it supports ext4, btrfs
> and overlayfs. Changes into filesystems are small, just parse the
> vfs_shift_uids and vfs_shift_gids options during mount and set the
> appropriate flags into the super_block structure.
> 
> 1) Filesystems don't need the FS_USERNS_MOUNT flag, so no user
> namespace mounting, they stay secure, nothing changes.
> 
> 2) The solution is based on VFS and mount namespaces, we use the user
> namespace of the containing mount namespace to check if we should shift
> UIDs/GIDs from/to virtual <=> on-disk view.
> If a filesystem was mounted with "vfs_shift_uids" and "vfs_shift_gids"
> options, and if it shows up inside a mount namespace that supports VFS
> UIDs/GIDs shifts then during each access we will remap UID/GID either
> to virtual or to on-disk view using simple helper functions to allow the
> access. In case the mount or current mount namespace do not support VFS
> UID/GID shifts, we fallback to the old behaviour, no shift is performed.
> 
> 3) inodes will always keep their original values which reflect the
> mapping inside init_user_ns which we consider the on-disk mapping.
> Therfore they will have a mapping from 0:65536 on-disk, these values are
> the persistent values that we have to write to the disk. We don't keep
> track of any UID/GID shift that was applied before. This gives
> portability and allows to use the previous mapping which was freed for
> another root filesystem...

Sorry, I haven't had time to look at the patches, but how are you
handling suid/sgid? Will the process get the ids in the inode or the
shifted ids?

Thanks,
Seth

Re: [PATCH 1/1] simplified security.nscapability xattr

2016-04-26 Thread Seth Forshee

On Fri, Apr 22, 2016 at 12:26:33PM -0500, serge.hal...@ubuntu.com wrote:
> From: Serge Hallyn 
> 
> This can only be set by root in his own namespace, and will
> only be respected by namespaces with that same root kuid
> mapped as root, or namespaces descended from it.
> 
> This allows a simple setxattr to work, allows tar/untar to
> work, and allows us to tar in one namespace and untar in
> another while preserving the capability, without risking
> leaking privilege into a parent namespace.
> 
> Signed-off-by: Serge Hallyn 

Seems like the simplest possible design which meets the requirements.

Acked-by: Seth Forshee

[PATCH v4 00/21] Support fuse mounts in user namespaces

2016-04-26 Thread Seth Forshee

Hi Eric,

Here's another update to my patches for mouning with fuse from
unpivileged user namespaces. The main change here is a fix for a build
failure when fuse is built as a module. As usual the series is also
available at:

 git://git.kernel.org/pub/scm/linux/kernel/git/sforshee/linux.git fuse-userns

Changes since v3:

 * Export current_in_userns() to fix an error when fuse is built as a
   module.
 * Add comment explaining the conditions for allowing CAP_CHOWN in
   s_user_ns to change the owner or group of an inode.
 * Added acks from Serge.

Thanks,
Seth

---

Andy Lutomirski (1):
  fs: Treat foreign mounts as nosuid

Pavel Tikhomirov (1):
  fs: fix a posible leak of allocated superblock

Seth Forshee (19):
  fs: Remove check of s_user_ns for existing mounts in
fs_fully_visible()
  fs: Allow sysfs and cgroupfs to share super blocks between user
namespaces
  block_dev: Support checking inode permissions in lookup_bdev()
  block_dev: Check permissions towards block device inode when mounting
  selinux: Add support for unprivileged mounts from user namespaces
  userns: Replace in_userns with current_in_userns
  Smack: Handle labels consistently in untrusted mounts
  fs: Check for invalid i_uid in may_follow_link()
  cred: Reject inodes with invalid ids in set_create_file_as()
  fs: Refuse uid/gid changes which don't map into s_user_ns
  fs: Update posix_acl support to handle user namespace mounts
  fs: Allow superblock owner to change ownership of inodes with
unmappable ids
  fs: Don't remove suid for CAP_FSETID in s_user_ns
  fs: Allow superblock owner to access do_remount_sb()
  capabilities: Allow privileged user in s_user_ns to set security.*
xattrs
  fuse: Add support for pid namespaces
  fuse: Support fuse filesystems outside of init_user_ns
  fuse: Restrict allow_other to the superblock's namespace or a
descendant
  fuse: Allow user namespace mounts

 drivers/md/bcache/super.c   |  2 +-
 drivers/md/dm-table.c   |  2 +-
 drivers/mtd/mtdsuper.c  |  2 +-
 fs/attr.c   | 73 -
 fs/block_dev.c  | 18 --
 fs/exec.c   |  2 +-
 fs/fuse/cuse.c  |  3 +-
 fs/fuse/dev.c   | 26 +++
 fs/fuse/dir.c   | 16 -
 fs/fuse/file.c  | 22 ++---
 fs/fuse/fuse_i.h| 10 +-
 fs/fuse/inode.c | 40 ++
 fs/inode.c  |  3 +-
 fs/kernfs/inode.c   |  2 ++
 fs/namei.c  |  2 +-
 fs/namespace.c  | 20 ---
 fs/posix_acl.c  | 67 +++--
 fs/proc/base.c  |  2 ++
 fs/proc/generic.c   |  3 ++
 fs/proc/proc_sysctl.c   |  2 ++
 fs/quota/quota.c|  2 +-
 fs/super.c  |  7 +++-
 fs/sysfs/mount.c|  3 +-
 fs/xattr.c  | 19 ---
 include/linux/fs.h  |  3 +-
 include/linux/mount.h   |  1 +
 include/linux/posix_acl_xattr.h | 17 +++---
 include/linux/uidgid.h  | 10 ++
 include/linux/user_namespace.h  |  6 ++--
 kernel/cgroup.c |  4 +--
 kernel/cred.c   |  2 ++
 kernel/user_namespace.c |  7 ++--
 security/commoncap.c| 22 +
 security/selinux/hooks.c| 25 +-
 security/smack/smack_lsm.c  | 29 ++--
 35 files changed, 355 insertions(+), 119 deletions(-)

Re: [PATCH v3 14/21] fs: Allow superblock owner to change ownership of inodes with unmappable ids

2016-04-26 Thread Seth Forshee

On Mon, Apr 25, 2016 at 03:30:47PM -0500, Serge E. Hallyn wrote:
> Quoting Seth Forshee (seth.fors...@canonical.com):
> > In a userns mount some on-disk inodes may have ids which do not
> > map into s_user_ns, in which case the in-kernel inodes are owned
> > by invalid users. The superblock owner should be able to change
> > attributes of these inodes but cannot. However it is unsafe to
> > grant the superblock owner privileged access to all inodes in the
> > superblock since proc, sysfs, etc. use DAC to protect files which
> > may not belong to s_user_ns. The problem is restricted to only
> > inodes where the owner or group is an invalid user.
> > 
> > We can work around this by allowing users with CAP_CHOWN in
> > s_user_ns to change an invalid owner or group id, so long as the
> > other id is either invalid or mappable in s_user_ns. After
> > changing ownership the user will be privileged towards the inode
> > and thus able to change other attributes.
> > 
> > As an precaution, checks for invalid ids are added to the proc
> > and kernfs setattr interfaces. These filesystems are not expected
> > to have inodes with invalid ids, but if it does happen any
> > setattr operations will return -EPERM.
> > 
> > Signed-off-by: Seth Forshee 
> 
> Acked-by: Serge Hallyn 
> 
> bug a request below,
> 
> > ---
> >  fs/attr.c | 47 +++
> >  fs/kernfs/inode.c |  2 ++
> >  fs/proc/base.c|  2 ++
> >  fs/proc/generic.c |  3 +++
> >  fs/proc/proc_sysctl.c |  2 ++
> >  5 files changed, 48 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/attr.c b/fs/attr.c
> > index 3cfaaac4a18e..a8b0931654a5 100644
> > --- a/fs/attr.c
> > +++ b/fs/attr.c
> > @@ -16,6 +16,43 @@
> >  #include 
> >  #include 
> >  
> > +static bool chown_ok(const struct inode *inode, kuid_t uid)
> > +{
> > +   struct user_namespace *user_ns;
> > +
> > +   if (uid_eq(current_fsuid(), inode->i_uid) && uid_eq(uid, inode->i_uid))
> > +   return true;
> > +   if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> > +   return true;
> > +
> > +   user_ns = inode->i_sb->s_user_ns;
> > +   if (!uid_valid(inode->i_uid) &&
> > +   (!gid_valid(inode->i_gid) || kgid_has_mapping(user_ns, 
> > inode->i_gid)) &&
> 
> This confused me to no end :)  Perhaps a "is_unmapped_valid_gid()" helper
> would make it clearer what this is meant to do?  Or else maybe a comment
> above chown_ok(), explaining that
> 
> 1. for a blockdev, the uid is converted at inode read so that it is
> either mapped or invalid
> 2. for sysfs / etc, uid can be valid but not mapped into the userns

Even with a helper a comment is probably helpful to explain why. I'll do
that first, then see if a helper would make things any clearer.
Honestly, I had to think about the helper name you proposed for a minute
before it made sense even though I already understood the code ;-)

[PATCH v4 10/21] fs: Check for invalid i_uid in may_follow_link()

2016-04-26 Thread Seth Forshee

Filesystem uids which don't map into a user namespace may result
in inode->i_uid being INVALID_UID. A symlink and its parent
could have different owners in the filesystem can both get
mapped to INVALID_UID, which may result in following a symlink
when this would not have otherwise been permitted when protected
symlinks are enabled.

Add a new helper function, uid_valid_eq(), and use this to
validate that the ids in may_follow_link() are both equal and
valid. Also add an equivalent helper for gids, which is
currently unused.

Signed-off-by: Seth Forshee 
Acked-by: Serge Hallyn 
---
 fs/namei.c |  2 +-
 include/linux/uidgid.h | 10 ++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index a29094c6f4a1..6fe8b0d8ca90 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -915,7 +915,7 @@ static inline int may_follow_link(struct nameidata *nd)
return 0;
 
/* Allowed if parent directory and link owner match. */
-   if (uid_eq(parent->i_uid, inode->i_uid))
+   if (uid_valid_eq(parent->i_uid, inode->i_uid))
return 0;
 
if (nd->flags & LOOKUP_RCU)
diff --git a/include/linux/uidgid.h b/include/linux/uidgid.h
index 03835522dfcb..e09529fe2668 100644
--- a/include/linux/uidgid.h
+++ b/include/linux/uidgid.h
@@ -117,6 +117,16 @@ static inline bool gid_valid(kgid_t gid)
return __kgid_val(gid) != (gid_t) -1;
 }
 
+static inline bool uid_valid_eq(kuid_t left, kuid_t right)
+{
+   return uid_eq(left, right) && uid_valid(left);
+}
+
+static inline bool gid_valid_eq(kgid_t left, kgid_t right)
+{
+   return gid_eq(left, right) && gid_valid(left);
+}
+
 #ifdef CONFIG_USER_NS
 
 extern kuid_t make_kuid(struct user_namespace *from, uid_t uid);
-- 
2.7.4

[PATCH v4 07/21] selinux: Add support for unprivileged mounts from user namespaces

2016-04-26 Thread Seth Forshee

Security labels from unprivileged mounts in user namespaces must
be ignored. Force superblocks from user namespaces whose labeling
behavior is to use xattrs to use mountpoint labeling instead.
For the mountpoint label, default to converting the current task
context into a form suitable for file objects, but also allow the
policy writer to specify a different label through policy
transition rules.

Pieced together from code snippets provided by Stephen Smalley.

Signed-off-by: Seth Forshee 
Acked-by: Stephen Smalley 
Acked-by: James Morris 
---
 security/selinux/hooks.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 1350167635cb..33beed3ac589 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -820,6 +820,28 @@ static int selinux_set_mnt_opts(struct super_block *sb,
goto out;
}
}
+
+   /*
+* If this is a user namespace mount, no contexts are allowed
+* on the command line and security labels must be ignored.
+*/
+   if (sb->s_user_ns != &init_user_ns) {
+   if (context_sid || fscontext_sid || rootcontext_sid ||
+   defcontext_sid) {
+   rc = -EACCES;
+   goto out;
+   }
+   if (sbsec->behavior == SECURITY_FS_USE_XATTR) {
+   sbsec->behavior = SECURITY_FS_USE_MNTPOINT;
+   rc = security_transition_sid(current_sid(), 
current_sid(),
+SECCLASS_FILE, NULL,
+&sbsec->mntpoint_sid);
+   if (rc)
+   goto out;
+   }
+   goto out_set_opts;
+   }
+
/* sets the context of the superblock for the fs being mounted. */
if (fscontext_sid) {
rc = may_context_mount_sb_relabel(fscontext_sid, sbsec, cred);
@@ -888,6 +910,7 @@ static int selinux_set_mnt_opts(struct super_block *sb,
sbsec->def_sid = defcontext_sid;
}
 
+out_set_opts:
rc = sb_finish_set_opts(sb);
 out:
mutex_unlock(&sbsec->lock);
-- 
2.7.4

[PATCH v4 12/21] fs: Refuse uid/gid changes which don't map into s_user_ns

2016-04-26 Thread Seth Forshee

Add checks to inode_change_ok to verify that uid and gid changes
will map into the superblock's user namespace. If they do not
fail with -EOVERFLOW. This cannot be overriden with ATTR_FORCE.

Signed-off-by: Seth Forshee 
Acked-by: Serge Hallyn 
---
 fs/attr.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/fs/attr.c b/fs/attr.c
index 25b24d0f6c88..3cfaaac4a18e 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -42,6 +42,17 @@ int inode_change_ok(const struct inode *inode, struct iattr 
*attr)
return error;
}
 
+   /*
+* Verify that uid/gid changes are valid in the target namespace
+* of the superblock. This cannot be overriden using ATTR_FORCE.
+*/
+   if (ia_valid & ATTR_UID &&
+   from_kuid(inode->i_sb->s_user_ns, attr->ia_uid) == (uid_t)-1)
+   return -EOVERFLOW;
+   if (ia_valid & ATTR_GID &&
+   from_kgid(inode->i_sb->s_user_ns, attr->ia_gid) == (gid_t)-1)
+   return -EOVERFLOW;
+
/* If force is set do it anyway. */
if (ia_valid & ATTR_FORCE)
return 0;
-- 
2.7.4

[PATCH v4 02/21] fs: Remove check of s_user_ns for existing mounts in fs_fully_visible()

2016-04-26 Thread Seth Forshee

fs_fully_visible() ignores MNT_LOCK_NODEV when FS_USERS_DEV_MOUNT
is not set for the filesystem, but there is a bug in the logic
that may cause mounting to fail. It is doing this only when the
existing mount is not in init_user_ns but should check the new
mount instead. But the new mount is always in a non-init
namespace when fs_fully_visible() is called, so that condition
can simply be removed.

Signed-off-by: Seth Forshee 
---
 fs/namespace.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index f20c82f91ecb..c133318bec35 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3234,8 +3234,7 @@ static bool fs_fully_visible(struct file_system_type 
*type, int *new_mnt_flags)
mnt_flags = mnt->mnt.mnt_flags;
if (mnt->mnt.mnt_sb->s_iflags & SB_I_NOEXEC)
mnt_flags &= ~(MNT_LOCK_NOSUID | MNT_LOCK_NOEXEC);
-   if (mnt->mnt.mnt_sb->s_user_ns != &init_user_ns &&
-   !(mnt->mnt.mnt_sb->s_type->fs_flags & FS_USERNS_DEV_MOUNT))
+   if (!(mnt->mnt.mnt_sb->s_type->fs_flags & FS_USERNS_DEV_MOUNT))
mnt_flags &= ~(MNT_LOCK_NODEV);
 
/* Verify the mount flags are equal to or more permissive
-- 
2.7.4

[PATCH v4 11/21] cred: Reject inodes with invalid ids in set_create_file_as()

2016-04-26 Thread Seth Forshee

Using INVALID_[UG]ID for the LSM file creation context doesn't
make sense, so return an error if the inode passed to
set_create_file_as() has an invalid id.

Signed-off-by: Seth Forshee 
Acked-by: Serge Hallyn 
---
 kernel/cred.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/cred.c b/kernel/cred.c
index 0c0cd8a62285..5f264fb5737d 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -689,6 +689,8 @@ EXPORT_SYMBOL(set_security_override_from_ctx);
  */
 int set_create_files_as(struct cred *new, struct inode *inode)
 {
+   if (!uid_valid(inode->i_uid) || !gid_valid(inode->i_gid))
+   return -EINVAL;
new->fsuid = inode->i_uid;
new->fsgid = inode->i_gid;
return security_kernel_create_files_as(new, inode);
-- 
2.7.4

[PATCH v4 03/21] fs: Allow sysfs and cgroupfs to share super blocks between user namespaces

2016-04-26 Thread Seth Forshee

Both of these filesystems already have use cases for mounting the
same super block from multiple user namespaces. For sysfs this
happens when using criu for snapshotting a container, where sysfs
is mounted in the containers network ns but the hosts user ns.
The cgroup filesystem shares the same super block for all mounts
of the same hierarchy regardless of the namespace.

As a result, the restriction on mounting a super block from a
single user namespace creates regressions for existing uses of
these filesystems. For these specific filesystems this
restriction isn't really necessary since the backing store is
objects in kernel memory and thus the ids assigned from inodes
is not subject to translation relative to s_user_ns.

Add a new filesystem flag, FS_USERNS_SHARE_SB, which when set
causes sget_userns() to skip the check of s_user_ns. Set this
flag for the sysfs and cgroup filesystems to fix the
regressions.

Signed-off-by: Seth Forshee 
Acked-by: Serge Hallyn 
---
 fs/super.c | 3 ++-
 fs/sysfs/mount.c   | 3 ++-
 include/linux/fs.h | 1 +
 kernel/cgroup.c| 4 ++--
 4 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 092a7828442e..ead156b44bf8 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -472,7 +472,8 @@ retry:
hlist_for_each_entry(old, &type->fs_supers, s_instances) {
if (!test(old, data))
continue;
-   if (user_ns != old->s_user_ns) {
+   if (!(type->fs_flags & FS_USERNS_SHARE_SB) &&
+   user_ns != old->s_user_ns) {
spin_unlock(&sb_lock);
if (s) {
up_write(&s->s_umount);
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index f3db82071cfb..9555accd4322 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -59,7 +59,8 @@ static struct file_system_type sysfs_fs_type = {
.name   = "sysfs",
.mount  = sysfs_mount,
.kill_sb= sysfs_kill_sb,
-   .fs_flags   = FS_USERNS_VISIBLE | FS_USERNS_MOUNT,
+   .fs_flags   = FS_USERNS_VISIBLE | FS_USERNS_MOUNT |
+ FS_USERNS_SHARE_SB,
 };
 
 int __init sysfs_init(void)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index be0f8023e28c..66a639ec1bc4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1988,6 +1988,7 @@ struct file_system_type {
 #define FS_USERNS_MOUNT8   /* Can be mounted by userns 
root */
 #define FS_USERNS_DEV_MOUNT16 /* A userns mount does not imply MNT_NODEV */
 #define FS_USERNS_VISIBLE  32  /* FS must already be visible */
+#define FS_USERNS_SHARE_SB 64  /* Allow sharing sb between userns-es */
 #define FS_RENAME_DOES_D_MOVE  32768   /* FS will handle d_move() during 
rename() internally. */
struct dentry *(*mount) (struct file_system_type *, int,
   const char *, void *);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 671dc05c0b0f..9c9aa27e531a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2247,14 +2247,14 @@ static struct file_system_type cgroup_fs_type = {
.name = "cgroup",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
-   .fs_flags = FS_USERNS_MOUNT,
+   .fs_flags = FS_USERNS_MOUNT | FS_USERNS_SHARE_SB,
 };
 
 static struct file_system_type cgroup2_fs_type = {
.name = "cgroup2",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
-   .fs_flags = FS_USERNS_MOUNT,
+   .fs_flags = FS_USERNS_MOUNT | FS_USERNS_SHARE_SB,
 };
 
 static char *cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t 
buflen,
-- 
2.7.4

[PATCH v4 08/21] userns: Replace in_userns with current_in_userns

2016-04-26 Thread Seth Forshee

All current callers of in_userns pass current_user_ns as the
first argument. Simplify by replacing in_userns with
current_in_userns which checks whether current_user_ns is in the
namespace supplied as an argument.

Signed-off-by: Seth Forshee 
Acked-by: James Morris 
Acked-by: Serge Hallyn 
---
 fs/namespace.c | 2 +-
 include/linux/user_namespace.h | 6 ++
 kernel/user_namespace.c| 6 +++---
 security/commoncap.c   | 2 +-
 4 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 6e9db4c166b4..0ad8e4a4f50b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3294,7 +3294,7 @@ bool mnt_may_suid(struct vfsmount *mnt)
 * in other namespaces.
 */
return !(mnt->mnt_flags & MNT_NOSUID) && check_mnt(real_mount(mnt)) &&
-  in_userns(current_user_ns(), mnt->mnt_sb->s_user_ns);
+  current_in_userns(mnt->mnt_sb->s_user_ns);
 }
 
 static struct ns_common *mntns_get(struct task_struct *task)
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index a43faa727124..9217169c64cb 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -72,8 +72,7 @@ extern ssize_t proc_projid_map_write(struct file *, const 
char __user *, size_t,
 extern ssize_t proc_setgroups_write(struct file *, const char __user *, 
size_t, loff_t *);
 extern int proc_setgroups_show(struct seq_file *m, void *v);
 extern bool userns_may_setgroups(const struct user_namespace *ns);
-extern bool in_userns(const struct user_namespace *ns,
- const struct user_namespace *target_ns);
+extern bool current_in_userns(const struct user_namespace *target_ns);
 #else
 
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
@@ -103,8 +102,7 @@ static inline bool userns_may_setgroups(const struct 
user_namespace *ns)
return true;
 }
 
-static inline bool in_userns(const struct user_namespace *ns,
-const struct user_namespace *target_ns)
+static inline bool current_in_userns(const struct user_namespace *target_ns)
 {
return true;
 }
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index dee3be5445da..68f594212759 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -942,10 +942,10 @@ bool userns_may_setgroups(const struct user_namespace *ns)
  * Returns true if @ns is the same namespace as or a descendant of
  * @target_ns.
  */
-bool in_userns(const struct user_namespace *ns,
-  const struct user_namespace *target_ns)
+bool current_in_userns(const struct user_namespace *target_ns)
 {
-   for (; ns; ns = ns->parent) {
+   struct user_namespace *ns;
+   for (ns = current_user_ns(); ns; ns = ns->parent) {
if (ns == target_ns)
return true;
}
diff --git a/security/commoncap.c b/security/commoncap.c
index a306c5d90709..e657227d221e 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -461,7 +461,7 @@ static int get_file_caps(struct linux_binprm *bprm, bool 
*effective, bool *has_c
 * explicit that capability bits are limited to s_user_ns and its
 * descendants.
 */
-   if (!in_userns(current_user_ns(), 
bprm->file->f_path.mnt->mnt_sb->s_user_ns))
+   if (!current_in_userns(bprm->file->f_path.mnt->mnt_sb->s_user_ns))
return 0;
 
rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
-- 
2.7.4

[PATCH v4 04/21] block_dev: Support checking inode permissions in lookup_bdev()

2016-04-26 Thread Seth Forshee

When looking up a block device by path no permission check is
done to verify that the user has access to the block device inode
at the specified path. In some cases it may be necessary to
check permissions towards the inode, such as allowing
unprivileged users to mount block devices in user namespaces.

Add an argument to lookup_bdev() to optionally perform this
permission check. A value of 0 skips the permission check and
behaves the same as before. A non-zero value specifies the mask
of access rights required towards the inode at the specified
path. The check is always skipped if the user has CAP_SYS_ADMIN.

All callers of lookup_bdev() currently pass a mask of 0, so this
patch results in no functional change. Subsequent patches will
add permission checks where appropriate.

Signed-off-by: Seth Forshee 
Acked-by: Serge Hallyn 
---
 drivers/md/bcache/super.c |  2 +-
 drivers/md/dm-table.c |  2 +-
 drivers/mtd/mtdsuper.c|  2 +-
 fs/block_dev.c| 13 ++---
 fs/quota/quota.c  |  2 +-
 include/linux/fs.h|  2 +-
 6 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index a296425a7270..e169739a0253 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1950,7 +1950,7 @@ static ssize_t register_bcache(struct kobject *k, struct 
kobj_attribute *attr,
  sb);
if (IS_ERR(bdev)) {
if (bdev == ERR_PTR(-EBUSY)) {
-   bdev = lookup_bdev(strim(path));
+   bdev = lookup_bdev(strim(path), 0);
mutex_lock(&bch_register_lock);
if (!IS_ERR(bdev) && bch_is_open(bdev))
err = "device already registered";
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index f9e8f0bef332..13f568d527b5 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -372,7 +372,7 @@ dev_t dm_get_dev_t(const char *path)
dev_t uninitialized_var(dev);
struct block_device *bdev;
 
-   bdev = lookup_bdev(path);
+   bdev = lookup_bdev(path, 0);
if (IS_ERR(bdev))
dev = name_to_dev_t(path);
else {
diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
index 20c02a3b7417..b5b60e1af31c 100644
--- a/drivers/mtd/mtdsuper.c
+++ b/drivers/mtd/mtdsuper.c
@@ -176,7 +176,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, 
int flags,
/* try the old way - the hack where we allowed users to mount
 * /dev/mtdblock$(n) but didn't actually _use_ the blockdev
 */
-   bdev = lookup_bdev(dev_name);
+   bdev = lookup_bdev(dev_name, 0);
if (IS_ERR(bdev)) {
ret = PTR_ERR(bdev);
pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 3e84d62d0a25..e9b937845bdb 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1431,7 +1431,7 @@ struct block_device *blkdev_get_by_path(const char *path, 
fmode_t mode,
struct block_device *bdev;
int err;
 
-   bdev = lookup_bdev(path);
+   bdev = lookup_bdev(path, 0);
if (IS_ERR(bdev))
return bdev;
 
@@ -1821,12 +1821,14 @@ EXPORT_SYMBOL(ioctl_by_bdev);
 /**
  * lookup_bdev  - lookup a struct block_device by name
  * @pathname:  special file representing the block device
+ * @mask:  rights to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
  *
  * Get a reference to the blockdevice at @pathname in the current
  * namespace if possible and return it.  Return ERR_PTR(error)
- * otherwise.
+ * otherwise.  If @mask is non-zero, check for access rights to the
+ * inode at @pathname.
  */
-struct block_device *lookup_bdev(const char *pathname)
+struct block_device *lookup_bdev(const char *pathname, int mask)
 {
struct block_device *bdev;
struct inode *inode;
@@ -1841,6 +1843,11 @@ struct block_device *lookup_bdev(const char *pathname)
return ERR_PTR(error);
 
inode = d_backing_inode(path.dentry);
+   if (mask != 0 && !capable(CAP_SYS_ADMIN)) {
+   error = __inode_permission(inode, mask);
+   if (error)
+   goto fail;
+   }
error = -ENOTBLK;
if (!S_ISBLK(inode->i_mode))
goto fail;
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 0f10ee9892ce..59223384b22e 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -799,7 +799,7 @@ static struct super_block *quotactl_block(const char __user 
*special, int cmd)
 
if (IS_ERR(tmp))
return ERR_CAST(tmp);
-   bdev = lookup_bdev(tmp->name);
+   bdev = lookup_bdev(tmp->name, 0);
putname(tmp);
if (IS_ERR(bdev))
return ERR_CAST(bdev);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 66a639ec1bc4..173b8adc6131 10

[PATCH v4 15/21] fs: Don't remove suid for CAP_FSETID in s_user_ns

2016-04-26 Thread Seth Forshee

Expand the check in should_remove_suid() to keep privileges for
CAP_FSETID in s_user_ns rather than init_user_ns.

Signed-off-by: Seth Forshee 
Acked-by: Serge Hallyn 
---
 fs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/inode.c b/fs/inode.c
index 69b8b526c194..cd52170f9117 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1690,7 +1690,8 @@ int should_remove_suid(struct dentry *dentry)
if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
kill |= ATTR_KILL_SGID;
 
-   if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
+   if (unlikely(kill && !ns_capable(dentry->d_sb->s_user_ns, CAP_FSETID) &&
+S_ISREG(mode)))
return kill;
 
return 0;
-- 
2.7.4

[PATCH v4 14/21] fs: Allow superblock owner to change ownership of inodes with unmappable ids

2016-04-26 Thread Seth Forshee

In a userns mount some on-disk inodes may have ids which do not
map into s_user_ns, in which case the in-kernel inodes are owned
by invalid users. The superblock owner should be able to change
attributes of these inodes but cannot. However it is unsafe to
grant the superblock owner privileged access to all inodes in the
superblock since proc, sysfs, etc. use DAC to protect files which
may not belong to s_user_ns. The problem is restricted to only
inodes where the owner or group is an invalid user.

We can work around this by allowing users with CAP_CHOWN in
s_user_ns to change an invalid owner or group id, so long as the
other id is either invalid or mappable in s_user_ns. After
changing ownership the user will be privileged towards the inode
and thus able to change other attributes.

As an precaution, checks for invalid ids are added to the proc
and kernfs setattr interfaces. These filesystems are not expected
to have inodes with invalid ids, but if it does happen any
setattr operations will return -EPERM.

Signed-off-by: Seth Forshee 
Acked-by: Serge Hallyn 
---
 fs/attr.c | 62 ---
 fs/kernfs/inode.c |  2 ++
 fs/proc/base.c|  2 ++
 fs/proc/generic.c |  3 +++
 fs/proc/proc_sysctl.c |  2 ++
 5 files changed, 63 insertions(+), 8 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 3cfaaac4a18e..06bb3f401559 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -16,6 +16,58 @@
 #include 
 #include 
 
+static bool chown_ok(const struct inode *inode, kuid_t uid)
+{
+   struct user_namespace *user_ns;
+
+   if (uid_eq(current_fsuid(), inode->i_uid) && uid_eq(uid, inode->i_uid))
+   return true;
+   if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+   return true;
+
+   /*
+* Inode uids/gids are of type kuid_t/kgid_t. As such, they can be
+* a) INVALID_UID/INVALID_GID, b) valid and mappable into
+* i_sb->s_user_ns, or c) valid but not mappable into
+* i_sb->s_user_ns.
+*
+* For filesystems on user-supplied media ids will either be (a) or
+* (b), so we permit CAP_CHOWN in s_user_ns to change INVALID_UID if
+* the gid meets these conditions (and vice versa for INVALID_GID).
+*
+* For psuedo-filesystems like proc or sysfs ids will be either (b)
+* or (c), so these conditions do not permit namespace-root to chown
+* in those filesystems.
+*/
+   user_ns = inode->i_sb->s_user_ns;
+   if (!uid_valid(inode->i_uid) &&
+   (!gid_valid(inode->i_gid) || kgid_has_mapping(user_ns, 
inode->i_gid)) &&
+   ns_capable(user_ns, CAP_CHOWN))
+   return true;
+
+   return false;
+}
+
+static bool chgrp_ok(const struct inode *inode, kgid_t gid)
+{
+   struct user_namespace *user_ns;
+
+   if (uid_eq(current_fsuid(), inode->i_uid) &&
+   (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
+   return true;
+   if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+   return true;
+
+   /* Logic here is the same as in chown_ok(); see comment there. */
+   user_ns = inode->i_sb->s_user_ns;
+   if (!gid_valid(inode->i_gid) &&
+   (!uid_valid(inode->i_uid) || kuid_has_mapping(user_ns, 
inode->i_uid)) &&
+   ns_capable(user_ns, CAP_CHOWN))
+   return true;
+
+   return false;
+}
+
 /**
  * inode_change_ok - check if attribute changes to an inode are allowed
  * @inode: inode to check
@@ -58,17 +110,11 @@ int inode_change_ok(const struct inode *inode, struct 
iattr *attr)
return 0;
 
/* Make sure a caller can chown. */
-   if ((ia_valid & ATTR_UID) &&
-   (!uid_eq(current_fsuid(), inode->i_uid) ||
-!uid_eq(attr->ia_uid, inode->i_uid)) &&
-   !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+   if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
return -EPERM;
 
/* Make sure caller can chgrp. */
-   if ((ia_valid & ATTR_GID) &&
-   (!uid_eq(current_fsuid(), inode->i_uid) ||
-   (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) 
&&
-   !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+   if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
return -EPERM;
 
/* Make sure a caller can chmod. */
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 16405ae88d2d..2e97a337ee5f 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -117,6 +117,8 @@ int kernfs_iop_setattr(struct dentry *dentry, struct iattr 
*iattr)
 
if (!kn)
return -EINVAL;
+   if (!uid_valid(inode->i_uid) || !gid_valid(inode->i_gid))
+   return -EPER

[PATCH v4 13/21] fs: Update posix_acl support to handle user namespace mounts

2016-04-26 Thread Seth Forshee

ids in on-disk ACLs should be converted to s_user_ns instead of
init_user_ns as is done now. This introduces the possibility for
id mappings to fail, and when this happens syscalls will return
EOVERFLOW.

Signed-off-by: Seth Forshee 
Acked-by: Serge Hallyn 
---
 fs/posix_acl.c  | 67 ++---
 fs/xattr.c  | 19 +---
 include/linux/posix_acl_xattr.h | 17 ---
 3 files changed, 70 insertions(+), 33 deletions(-)

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 711dd5170376..dac2842dd4cb 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -595,59 +595,77 @@ EXPORT_SYMBOL_GPL(posix_acl_create);
 /*
  * Fix up the uids and gids in posix acl extended attributes in place.
  */
-static void posix_acl_fix_xattr_userns(
+static int posix_acl_fix_xattr_userns(
struct user_namespace *to, struct user_namespace *from,
void *value, size_t size)
 {
posix_acl_xattr_header *header = (posix_acl_xattr_header *)value;
posix_acl_xattr_entry *entry = (posix_acl_xattr_entry *)(header+1), 
*end;
int count;
-   kuid_t uid;
-   kgid_t gid;
+   kuid_t kuid;
+   uid_t uid;
+   kgid_t kgid;
+   gid_t gid;
 
if (!value)
-   return;
+   return 0;
if (size < sizeof(posix_acl_xattr_header))
-   return;
+   return 0;
if (header->a_version != cpu_to_le32(POSIX_ACL_XATTR_VERSION))
-   return;
+   return 0;
 
count = posix_acl_xattr_count(size);
if (count < 0)
-   return;
+   return 0;
if (count == 0)
-   return;
+   return 0;
 
for (end = entry + count; entry != end; entry++) {
switch(le16_to_cpu(entry->e_tag)) {
case ACL_USER:
-   uid = make_kuid(from, le32_to_cpu(entry->e_id));
-   entry->e_id = cpu_to_le32(from_kuid(to, uid));
+   kuid = make_kuid(from, le32_to_cpu(entry->e_id));
+   if (!uid_valid(kuid))
+   return -EOVERFLOW;
+   uid = from_kuid(to, kuid);
+   if (uid == (uid_t)-1)
+   return -EOVERFLOW;
+   entry->e_id = cpu_to_le32(uid);
break;
case ACL_GROUP:
-   gid = make_kgid(from, le32_to_cpu(entry->e_id));
-   entry->e_id = cpu_to_le32(from_kgid(to, gid));
+   kgid = make_kgid(from, le32_to_cpu(entry->e_id));
+   if (!gid_valid(kgid))
+   return -EOVERFLOW;
+   gid = from_kgid(to, kgid);
+   if (gid == (gid_t)-1)
+   return -EOVERFLOW;
+   entry->e_id = cpu_to_le32(gid);
break;
default:
break;
}
}
+
+   return 0;
 }
 
-void posix_acl_fix_xattr_from_user(void *value, size_t size)
+int
+posix_acl_fix_xattr_from_user(struct user_namespace *target_ns, void *value,
+ size_t size)
 {
-   struct user_namespace *user_ns = current_user_ns();
-   if (user_ns == &init_user_ns)
-   return;
-   posix_acl_fix_xattr_userns(&init_user_ns, user_ns, value, size);
+   struct user_namespace *source_ns = current_user_ns();
+   if (source_ns == target_ns)
+   return 0;
+   return posix_acl_fix_xattr_userns(target_ns, source_ns, value, size);
 }
 
-void posix_acl_fix_xattr_to_user(void *value, size_t size)
+int
+posix_acl_fix_xattr_to_user(struct user_namespace *source_ns, void *value,
+   size_t size)
 {
-   struct user_namespace *user_ns = current_user_ns();
-   if (user_ns == &init_user_ns)
-   return;
-   posix_acl_fix_xattr_userns(user_ns, &init_user_ns, value, size);
+   struct user_namespace *target_ns = current_user_ns();
+   if (target_ns == source_ns)
+   return 0;
+   return posix_acl_fix_xattr_userns(target_ns, source_ns, value, size);
 }
 
 /*
@@ -780,7 +798,7 @@ posix_acl_xattr_get(const struct xattr_handler *handler,
if (acl == NULL)
return -ENODATA;
 
-   error = posix_acl_to_xattr(&init_user_ns, acl, value, size);
+   error = posix_acl_to_xattr(dentry->d_sb->s_user_ns, acl, value, size);
posix_acl_release(acl);
 
return error;
@@ -806,7 +824,8 @@ posix_acl_xattr_set(const struct xattr_handler *handler,
return -EPERM;
 
if (value) {
-   acl = posix_acl_from_xattr(&init_user_ns, value, size);
+   acl = posix_acl_from_xattr(dentry->d_sb->s_user_ns, value,
+

[PATCH v4 06/21] fs: Treat foreign mounts as nosuid

2016-04-26 Thread Seth Forshee

From: Andy Lutomirski 

If a process gets access to a mount from a different user
namespace, that process should not be able to take advantage of
setuid files or selinux entrypoints from that filesystem.  Prevent
this by treating mounts from other mount namespaces and those not
owned by current_user_ns() or an ancestor as nosuid.

This will make it safer to allow more complex filesystems to be
mounted in non-root user namespaces.

This does not remove the need for MNT_LOCK_NOSUID.  The setuid,
setgid, and file capability bits can no longer be abused if code in
a user namespace were to clear nosuid on an untrusted filesystem,
but this patch, by itself, is insufficient to protect the system
from abuse of files that, when execed, would increase MAC privilege.

As a more concrete explanation, any task that can manipulate a
vfsmount associated with a given user namespace already has
capabilities in that namespace and all of its descendents.  If they
can cause a malicious setuid, setgid, or file-caps executable to
appear in that mount, then that executable will only allow them to
elevate privileges in exactly the set of namespaces in which they
are already privileges.

On the other hand, if they can cause a malicious executable to
appear with a dangerous MAC label, running it could change the
caller's security context in a way that should not have been
possible, even inside the namespace in which the task is confined.

As a hardening measure, this would have made CVE-2014-5207 much
more difficult to exploit.

Signed-off-by: Andy Lutomirski 
Signed-off-by: Seth Forshee 
Acked-by: James Morris 
Acked-by: Serge Hallyn 
---
 fs/exec.c|  2 +-
 fs/namespace.c   | 13 +
 include/linux/mount.h|  1 +
 security/commoncap.c |  8 +++-
 security/selinux/hooks.c |  2 +-
 5 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index c4010b8207a1..706088dd0cb1 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1391,7 +1391,7 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
bprm->cred->euid = current_euid();
bprm->cred->egid = current_egid();
 
-   if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
+   if (!mnt_may_suid(bprm->file->f_path.mnt))
return;
 
if (task_no_new_privs(current))
diff --git a/fs/namespace.c b/fs/namespace.c
index c133318bec35..6e9db4c166b4 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3284,6 +3284,19 @@ found:
return visible;
 }
 
+bool mnt_may_suid(struct vfsmount *mnt)
+{
+   /*
+* Foreign mounts (accessed via fchdir or through /proc
+* symlinks) are always treated as if they are nosuid.  This
+* prevents namespaces from trusting potentially unsafe
+* suid/sgid bits, file caps, or security labels that originate
+* in other namespaces.
+*/
+   return !(mnt->mnt_flags & MNT_NOSUID) && check_mnt(real_mount(mnt)) &&
+  in_userns(current_user_ns(), mnt->mnt_sb->s_user_ns);
+}
+
 static struct ns_common *mntns_get(struct task_struct *task)
 {
struct ns_common *ns = NULL;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index f822c3c11377..54a594d49733 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -81,6 +81,7 @@ extern void mntput(struct vfsmount *mnt);
 extern struct vfsmount *mntget(struct vfsmount *mnt);
 extern struct vfsmount *mnt_clone_internal(struct path *path);
 extern int __mnt_is_readonly(struct vfsmount *mnt);
+extern bool mnt_may_suid(struct vfsmount *mnt);
 
 struct path;
 extern struct vfsmount *clone_private_mount(struct path *path);
diff --git a/security/commoncap.c b/security/commoncap.c
index 8912ef117faa..a306c5d90709 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -453,8 +453,14 @@ static int get_file_caps(struct linux_binprm *bprm, bool 
*effective, bool *has_c
if (!file_caps_enabled)
return 0;
 
-   if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
+   if (!mnt_may_suid(bprm->file->f_path.mnt))
return 0;
+
+   /*
+* This check is redundant with mnt_may_suid() but is kept to make
+* explicit that capability bits are limited to s_user_ns and its
+* descendants.
+*/
if (!in_userns(current_user_ns(), 
bprm->file->f_path.mnt->mnt_sb->s_user_ns))
return 0;
 
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 912deee3f01e..1350167635cb 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2234,7 +2234,7 @@ static int check_nnp_nosuid(const struct linux_binprm 
*bprm,
const struct task_security_struct *new_tsec)
 {
int nnp = (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS);
-   int nosuid = (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID);
+   int nosuid = !mnt_may_suid(bprm->file->f_path.mnt);
int rc;
 
if (!nnp && !nosuid)
-- 
2.7.4

[PATCH v4 19/21] fuse: Support fuse filesystems outside of init_user_ns

2016-04-26 Thread Seth Forshee

In order to support mounts from namespaces other than
init_user_ns, fuse must translate uids and gids to/from the
userns of the process servicing requests on /dev/fuse. This
patch does that, with a couple of restrictions on the namespace:

 - The userns for the fuse connection is fixed to the namespace
   from which /dev/fuse is opened.

 - The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the
need to pass around userns references and by allowing fuse to
rely on the checks in inode_change_ok for ownership changes.
Either restriction could be relaxed in the future if needed.

For cuse the namespace used for the connection is also simply
current_user_ns() at the time /dev/cuse is opened.

Signed-off-by: Seth Forshee 
---
 fs/fuse/cuse.c   |  3 ++-
 fs/fuse/dev.c| 13 -
 fs/fuse/dir.c| 14 +++---
 fs/fuse/fuse_i.h |  6 +-
 fs/fuse/inode.c  | 33 +
 5 files changed, 43 insertions(+), 26 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index c5b6b7165489..98ebd0f4fd4c 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -48,6 +48,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "fuse_i.h"
 
@@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct 
file *file)
if (!cc)
return -ENOMEM;
 
-   fuse_conn_init(&cc->fc);
+   fuse_conn_init(&cc->fc, current_user_ns());
 
fud = fuse_dev_alloc(&cc->fc);
if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 4e91b2ac25a7..8fa1ce934df3 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -127,8 +127,8 @@ static void __fuse_put_request(struct fuse_req *req)
 
 static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-   req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
-   req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
+   req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
+   req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
 }
 
@@ -186,7 +186,8 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn 
*fc, unsigned npages,
__set_bit(FR_WAITING, &req->flags);
if (for_background)
__set_bit(FR_BACKGROUND, &req->flags);
-   if (req->in.h.pid == 0) {
+   if (req->in.h.pid == 0 || req->in.h.uid == (uid_t)-1 ||
+   req->in.h.gid == (gid_t)-1) {
fuse_put_request(fc, req);
return ERR_PTR(-EOVERFLOW);
}
@@ -1248,7 +1249,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, 
struct file *file,
struct fuse_in *in;
unsigned reqsize;
 
-   if (task_active_pid_ns(current) != fc->pid_ns)
+   if (task_active_pid_ns(current) != fc->pid_ns ||
+   current_user_ns() != fc->user_ns)
return -EIO;
 
  restart:
@@ -1880,7 +1882,8 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
struct fuse_req *req;
struct fuse_out_header oh;
 
-   if (task_active_pid_ns(current) != fc->pid_ns)
+   if (task_active_pid_ns(current) != fc->pid_ns ||
+   current_user_ns() != fc->user_ns)
return -EIO;
 
if (nbytes < sizeof(struct fuse_out_header))
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 4b855b65d457..ecba75bf6640 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -841,8 +841,8 @@ static void fuse_fillattr(struct inode *inode, struct 
fuse_attr *attr,
stat->ino = attr->ino;
stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 0);
stat->nlink = attr->nlink;
-   stat->uid = make_kuid(&init_user_ns, attr->uid);
-   stat->gid = make_kgid(&init_user_ns, attr->gid);
+   stat->uid = make_kuid(fc->user_ns, attr->uid);
+   stat->gid = make_kgid(fc->user_ns, attr->gid);
stat->rdev = inode->i_rdev;
stat->atime.tv_sec = attr->atime;
stat->atime.tv_nsec = attr->atimensec;
@@ -1459,17 +1459,17 @@ static bool update_mtime(unsigned ivalid, bool 
trust_local_mtime)
return true;
 }
 
-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
-  bool trust_local_cmtime)
+static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+  struct fuse_setattr_in *arg, bool trust_local_cmtime)
 {
unsigned ivalid = iattr->ia_valid;
 
if (ivalid & ATTR_MODE)
arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
if (ivalid & ATTR_UID)
-   arg->valid |= FATTR_UID,arg->uid = from_kuid(&init_user_ns, 
iattr->ia_uid);
+   arg->valid |= F

[PATCH v4 20/21] fuse: Restrict allow_other to the superblock's namespace or a descendant

2016-04-26 Thread Seth Forshee

Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed
for a mount done with user namespace root permissions. In such
cases allow_other should not allow users outside the userns
to access the mount as doing so would give the unprivileged user
the ability to manipulate processes it would otherwise be unable
to manipulate. Restrict allow_other to apply to users in the same
userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a
module.

Signed-off-by: Seth Forshee 
Acked-by: Serge Hallyn 
Acked-by: Miklos Szeredi 
---
 fs/fuse/dir.c   | 2 +-
 kernel/user_namespace.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ecba75bf6640..1a6c5af49608 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1015,7 +1015,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
const struct cred *cred;
 
if (fc->flags & FUSE_ALLOW_OTHER)
-   return 1;
+   return current_in_userns(fc->user_ns);
 
cred = current_cred();
if (uid_eq(cred->euid, fc->user_id) &&
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 68f594212759..fa2294e14b77 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -951,6 +951,7 @@ bool current_in_userns(const struct user_namespace 
*target_ns)
}
return false;
 }
+EXPORT_SYMBOL(current_in_userns);
 
 static inline struct user_namespace *to_user_ns(struct ns_common *ns)
 {
-- 
2.7.4

[PATCH v4 21/21] fuse: Allow user namespace mounts

2016-04-26 Thread Seth Forshee

Signed-off-by: Seth Forshee 
Acked-by: Miklos Szeredi 
---
 fs/fuse/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 0a771145d853..254f1944ee98 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1199,7 +1199,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
 static struct file_system_type fuse_fs_type = {
.owner  = THIS_MODULE,
.name   = "fuse",
-   .fs_flags   = FS_HAS_SUBTYPE,
+   .fs_flags   = FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
.mount  = fuse_mount,
.kill_sb= fuse_kill_sb_anon,
 };
@@ -1231,7 +1231,7 @@ static struct file_system_type fuseblk_fs_type = {
.name   = "fuseblk",
.mount  = fuse_mount_blk,
.kill_sb= fuse_kill_sb_blk,
-   .fs_flags   = FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
+   .fs_flags   = FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
 };
 MODULE_ALIAS_FS("fuseblk");
 
-- 
2.7.4

[PATCH v4 17/21] capabilities: Allow privileged user in s_user_ns to set security.* xattrs

2016-04-26 Thread Seth Forshee

A privileged user in s_user_ns will generally have the ability to
manipulate the backing store and insert security.* xattrs into
the filesystem directly. Therefore the kernel must be prepared to
handle these xattrs from unprivileged mounts, and it makes little
sense for commoncap to prevent writing these xattrs to the
filesystem. The capability and LSM code have already been updated
to appropriately handle xattrs from unprivileged mounts, so it
is safe to loosen this restriction on setting xattrs.

The exception to this logic is that writing xattrs to a mounted
filesystem may also cause the LSM inode_post_setxattr or
inode_setsecurity callbacks to be invoked. SELinux will deny the
xattr update by virtue of applying mountpoint labeling to
unprivileged userns mounts, and Smack will deny the writes for
any user without global CAP_MAC_ADMIN, so loosening the
capability check in commoncap is safe in this respect as well.

Signed-off-by: Seth Forshee 
Acked-by: Serge Hallyn 
---
 security/commoncap.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/security/commoncap.c b/security/commoncap.c
index e657227d221e..12477afaa8ed 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -664,15 +664,17 @@ int cap_bprm_secureexec(struct linux_binprm *bprm)
 int cap_inode_setxattr(struct dentry *dentry, const char *name,
   const void *value, size_t size, int flags)
 {
+   struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
+
if (!strcmp(name, XATTR_NAME_CAPS)) {
-   if (!capable(CAP_SETFCAP))
+   if (!ns_capable(user_ns, CAP_SETFCAP))
return -EPERM;
return 0;
}
 
if (!strncmp(name, XATTR_SECURITY_PREFIX,
 sizeof(XATTR_SECURITY_PREFIX) - 1) &&
-   !capable(CAP_SYS_ADMIN))
+   !ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;
return 0;
 }
@@ -690,15 +692,17 @@ int cap_inode_setxattr(struct dentry *dentry, const char 
*name,
  */
 int cap_inode_removexattr(struct dentry *dentry, const char *name)
 {
+   struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
+
if (!strcmp(name, XATTR_NAME_CAPS)) {
-   if (!capable(CAP_SETFCAP))
+   if (!ns_capable(user_ns, CAP_SETFCAP))
return -EPERM;
return 0;
}
 
if (!strncmp(name, XATTR_SECURITY_PREFIX,
 sizeof(XATTR_SECURITY_PREFIX) - 1) &&
-   !capable(CAP_SYS_ADMIN))
+   !ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;
return 0;
 }
-- 
2.7.4

[PATCH v4 18/21] fuse: Add support for pid namespaces

2016-04-26 Thread Seth Forshee

When the userspace process servicing fuse requests is running in
a pid namespace then pids passed via the fuse fd are not being
translated into that process' namespace. Translation is necessary
for the pid to be useful to that process.

Since no use case currently exists for changing namespaces all
translations can be done relative to the pid namespace in use
when fuse_conn_init() is called. For fuse this translates to
mount time, and for cuse this is when /dev/cuse is opened. IO for
this connection from another namespace will return errors.

Requests from processes whose pid cannot be translated into the
target namespace are not permitted, except for requests
allocated via fuse_get_req_nofail_nopages. For no-fail requests
in.h.pid will be 0 if the pid translation fails.

File locking changes based on previous work done by Eric
Biederman.

Signed-off-by: Seth Forshee 
Acked-by: Miklos Szeredi 
---
 fs/fuse/dev.c| 19 +++
 fs/fuse/file.c   | 22 +-
 fs/fuse/fuse_i.h |  4 
 fs/fuse/inode.c  |  3 +++
 4 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index cbece1221417..4e91b2ac25a7 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 MODULE_ALIAS_MISCDEV(FUSE_MINOR);
 MODULE_ALIAS("devname:fuse");
@@ -124,11 +125,11 @@ static void __fuse_put_request(struct fuse_req *req)
atomic_dec(&req->count);
 }
 
-static void fuse_req_init_context(struct fuse_req *req)
+static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
-   req->in.h.pid = current->pid;
+   req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
 }
 
 void fuse_set_initialized(struct fuse_conn *fc)
@@ -181,10 +182,14 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn 
*fc, unsigned npages,
goto out;
}
 
-   fuse_req_init_context(req);
+   fuse_req_init_context(fc, req);
__set_bit(FR_WAITING, &req->flags);
if (for_background)
__set_bit(FR_BACKGROUND, &req->flags);
+   if (req->in.h.pid == 0) {
+   fuse_put_request(fc, req);
+   return ERR_PTR(-EOVERFLOW);
+   }
 
return req;
 
@@ -274,7 +279,7 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct 
fuse_conn *fc,
if (!req)
req = get_reserved_req(fc, file);
 
-   fuse_req_init_context(req);
+   fuse_req_init_context(fc, req);
__set_bit(FR_WAITING, &req->flags);
__clear_bit(FR_BACKGROUND, &req->flags);
return req;
@@ -1243,6 +1248,9 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, 
struct file *file,
struct fuse_in *in;
unsigned reqsize;
 
+   if (task_active_pid_ns(current) != fc->pid_ns)
+   return -EIO;
+
  restart:
spin_lock(&fiq->waitq.lock);
err = -EAGAIN;
@@ -1872,6 +1880,9 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
struct fuse_req *req;
struct fuse_out_header oh;
 
+   if (task_active_pid_ns(current) != fc->pid_ns)
+   return -EIO;
+
if (nbytes < sizeof(struct fuse_out_header))
return -EINVAL;
 
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 719924d6c706..b5c616c5ec98 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2067,7 +2067,8 @@ static int fuse_direct_mmap(struct file *file, struct 
vm_area_struct *vma)
return generic_file_mmap(file, vma);
 }
 
-static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
+static int convert_fuse_file_lock(struct fuse_conn *fc,
+ const struct fuse_file_lock *ffl,
  struct file_lock *fl)
 {
switch (ffl->type) {
@@ -2082,7 +2083,14 @@ static int convert_fuse_file_lock(const struct 
fuse_file_lock *ffl,
 
fl->fl_start = ffl->start;
fl->fl_end = ffl->end;
-   fl->fl_pid = ffl->pid;
+
+   /*
+* Convert pid into the caller's pid namespace. If the pid
+* does not map into the namespace fl_pid will get set to 0.
+*/
+   rcu_read_lock();
+   fl->fl_pid = pid_vnr(find_pid_ns(ffl->pid, fc->pid_ns));
+   rcu_read_unlock();
break;
 
default:
@@ -2131,7 +2139,7 @@ static int fuse_getlk(struct file *file, struct file_lock 
*fl)
args.out.args[0].value = &outarg;
err = fuse_simple_request(fc, &args);
if (!err)
-   err = convert_fuse_file_lock(&outarg.lk, fl);
+   err = convert_fuse_file_lock(fc, &outarg.lk

[PATCH v4 16/21] fs: Allow superblock owner to access do_remount_sb()

2016-04-26 Thread Seth Forshee

Superblock level remounts are currently restricted to global
CAP_SYS_ADMIN, as is the path for changing the root mount to
read only on umount. Loosen both of these permission checks to
also allow CAP_SYS_ADMIN in any namespace which is privileged
towards the userns which originally mounted the filesystem.

Signed-off-by: Seth Forshee 
Acked-by: "Eric W. Biederman" 
Acked-by: Serge Hallyn 
---
 fs/namespace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 0ad8e4a4f50b..575e3f8b34fd 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1510,7 +1510,7 @@ static int do_umount(struct mount *mnt, int flags)
 * Special case for "unmounting" root ...
 * we just try to remount it readonly.
 */
-   if (!capable(CAP_SYS_ADMIN))
+   if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
return -EPERM;
down_write(&sb->s_umount);
if (!(sb->s_flags & MS_RDONLY))
@@ -2207,7 +2207,7 @@ static int do_remount(struct path *path, int flags, int 
mnt_flags,
down_write(&sb->s_umount);
if (flags & MS_BIND)
err = change_mount_flags(path->mnt, flags);
-   else if (!capable(CAP_SYS_ADMIN))
+   else if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
err = -EPERM;
else
err = do_remount_sb(sb, flags, data, 0);
-- 
2.7.4

[PATCH v4 05/21] block_dev: Check permissions towards block device inode when mounting

2016-04-26 Thread Seth Forshee

Unprivileged users should not be able to mount block devices when
they lack sufficient privileges towards the block device inode.
Update blkdev_get_by_path() to validate that the user has the
required access to the inode at the specified path. The check
will be skipped for CAP_SYS_ADMIN, so privileged mounts will
continue working as before.

Signed-off-by: Seth Forshee 
Acked-by: Serge Hallyn 
---
 fs/block_dev.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index e9b937845bdb..2007040afb7b 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1429,9 +1429,14 @@ struct block_device *blkdev_get_by_path(const char 
*path, fmode_t mode,
void *holder)
 {
struct block_device *bdev;
+   int perm = 0;
int err;
 
-   bdev = lookup_bdev(path, 0);
+   if (mode & FMODE_READ)
+   perm |= MAY_READ;
+   if (mode & FMODE_WRITE)
+   perm |= MAY_WRITE;
+   bdev = lookup_bdev(path, perm);
if (IS_ERR(bdev))
return bdev;
 
-- 
2.7.4

[PATCH v4 09/21] Smack: Handle labels consistently in untrusted mounts

2016-04-26 Thread Seth Forshee

The SMACK64, SMACK64EXEC, and SMACK64MMAP labels are all handled
differently in untrusted mounts. This is confusing and
potentically problematic. Change this to handle them all the same
way that SMACK64 is currently handled; that is, read the label
from disk and check it at use time. For SMACK64 and SMACK64MMAP
access is denied if the label does not match smk_root. To be
consistent with suid, a SMACK64EXEC label which does not match
smk_root will still allow execution of the file but will not run
with the label supplied in the xattr.

Signed-off-by: Seth Forshee 
Acked-by: Casey Schaufler 
---
 security/smack/smack_lsm.c | 29 +++--
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index aa17198cd5f2..ca564590cc1b 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -919,6 +919,7 @@ static int smack_bprm_set_creds(struct linux_binprm *bprm)
struct inode *inode = file_inode(bprm->file);
struct task_smack *bsp = bprm->cred->security;
struct inode_smack *isp;
+   struct superblock_smack *sbsp;
int rc;
 
if (bprm->cred_prepared)
@@ -928,6 +929,11 @@ static int smack_bprm_set_creds(struct linux_binprm *bprm)
if (isp->smk_task == NULL || isp->smk_task == bsp->smk_task)
return 0;
 
+   sbsp = inode->i_sb->s_security;
+   if ((sbsp->smk_flags & SMK_SB_UNTRUSTED) &&
+   isp->smk_task != sbsp->smk_root)
+   return 0;
+
if (bprm->unsafe & (LSM_UNSAFE_PTRACE | LSM_UNSAFE_PTRACE_CAP)) {
struct task_struct *tracer;
rc = 0;
@@ -1725,6 +1731,7 @@ static int smack_mmap_file(struct file *file,
struct task_smack *tsp;
struct smack_known *okp;
struct inode_smack *isp;
+   struct superblock_smack *sbsp;
int may;
int mmay;
int tmay;
@@ -1736,6 +1743,10 @@ static int smack_mmap_file(struct file *file,
isp = file_inode(file)->i_security;
if (isp->smk_mmap == NULL)
return 0;
+   sbsp = file_inode(file)->i_sb->s_security;
+   if (sbsp->smk_flags & SMK_SB_UNTRUSTED &&
+   isp->smk_mmap != sbsp->smk_root)
+   return -EACCES;
mkp = isp->smk_mmap;
 
tsp = current_security();
@@ -3546,16 +3557,14 @@ static void smack_d_instantiate(struct dentry 
*opt_dentry, struct inode *inode)
if (rc >= 0)
transflag = SMK_INODE_TRANSMUTE;
}
-   if (!(sbsp->smk_flags & SMK_SB_UNTRUSTED)) {
-   /*
-* Don't let the exec or mmap label be "*" or "@".
-*/
-   skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
-   if (IS_ERR(skp) || skp == &smack_known_star ||
-   skp == &smack_known_web)
-   skp = NULL;
-   isp->smk_task = skp;
-   }
+   /*
+* Don't let the exec or mmap label be "*" or "@".
+*/
+   skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
+   if (IS_ERR(skp) || skp == &smack_known_star ||
+   skp == &smack_known_web)
+   skp = NULL;
+   isp->smk_task = skp;
 
skp = smk_fetch(XATTR_NAME_SMACKMMAP, inode, dp);
if (IS_ERR(skp) || skp == &smack_known_star ||
-- 
2.7.4

[PATCH v4 01/21] fs: fix a posible leak of allocated superblock

2016-04-26 Thread Seth Forshee

From: Pavel Tikhomirov 

We probably need to fix superblock leak in patch (v4 "fs: Add user
namesapace member to struct super_block"):

Imagine posible code path in sget_userns: we iterate through
type->fs_supers and do not find suitable sb, we drop sb_lock to
allocate s and go to retry. After we dropped sb_lock some other
task from different userns takes sb_lock, it is already in retry
stage and has s allocated, so it puts its s in type->fs_supers
list. So in retry we will find these sb in list and check it has
a different userns, and finally we will return without freeing s.

Signed-off-by: Pavel Tikhomirov 
Acked-by: Seth Forshee 
---
 fs/super.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 829841e0ae7e..092a7828442e 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -474,6 +474,10 @@ retry:
continue;
if (user_ns != old->s_user_ns) {
spin_unlock(&sb_lock);
+   if (s) {
+   up_write(&s->s_umount);
+   destroy_super(s);
+   }
return ERR_PTR(-EBUSY);
}
if (!grab_super(old))
-- 
2.7.4

Re: [PATCH] ext4: Fix check of dqget() return value in ext4_ioctl_setproject()

2016-04-22 Thread Seth Forshee

On Tue, Mar 29, 2016 at 08:01:03AM -0500, Seth Forshee wrote:
> A failed call to dqget() returns an ERR_PTR() and not null. Fix
> the check in ext4_ioctl_setproject() to handle this correctly.
> 
> Fixes: 9b7365fc1c82 ("ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface 
> support")
> Cc: sta...@vger.kernel.org # v4.5
> Signed-off-by: Seth Forshee 

Thought I'd check and see if this patch was forgotten, I sent it nearly
a month ago but the bug is still present.

Thanks,
Seth

1 2 3 4 5 6 >

1 - 100 of 558 matches

Mail list logo