[PATCH] tmpfs: disallow CONFIG_TMPFS_INODE64 on alpha
As with s390, alpha is a 64-bit architecture with a 32-bit ino_t. With CONFIG_TMPFS_INODE64=y tmpfs mounts will get 64-bit inode numbers and display "inode64" in the mount options, whereas passing "inode64" in the mount options will fail. This leads to erroneous behaviours such as this: # mkdir mnt # mount -t tmpfs nodev mnt # mount -o remount,rw mnt mount: /home/ubuntu/mnt: mount point not mounted or bad option. Prevent CONFIG_TMPFS_INODE64 from being selected on alpha. Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb") Cc: sta...@vger.kernel.org # v5.9+ Signed-off-by: Seth Forshee --- fs/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/Kconfig b/fs/Kconfig index 3347ec7bd837..da524c4d7b7e 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -203,7 +203,7 @@ config TMPFS_XATTR config TMPFS_INODE64 bool "Use 64-bit ino_t by default in tmpfs" - depends on TMPFS && 64BIT && !S390 + depends on TMPFS && 64BIT && !(S390 || ALPHA) default n help tmpfs has historically used only inode numbers as wide as an unsigned -- 2.29.2
Re: [PATCH] tmpfs: Disallow CONFIG_TMPFS_INODE64 on s390
On Sun, Feb 07, 2021 at 05:48:31PM +0300, Kirill A. Shutemov wrote: > On Fri, Feb 05, 2021 at 05:06:20PM -0600, Seth Forshee wrote: > > This feature requires ino_t be 64-bits, which is true for every > > 64-bit architecture but s390, so prevent this option from being > > selected there. > > Quick grep suggests the same for alpha. Am I wrong? No, it appears you are right. Looks like my grep missed alpha somehow. Andrew, do you prefer an additional patch or an updated version of the previous patch?
[PATCH] tmpfs: Disallow CONFIG_TMPFS_INODE64 on s390
This feature requires ino_t be 64-bits, which is true for every 64-bit architecture but s390, so prevent this option from being selected there. Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb") Cc: # v5.9+ Signed-off-by: Seth Forshee --- fs/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/Kconfig b/fs/Kconfig index aa4c12282301..3347ec7bd837 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -203,7 +203,7 @@ config TMPFS_XATTR config TMPFS_INODE64 bool "Use 64-bit ino_t by default in tmpfs" - depends on TMPFS && 64BIT + depends on TMPFS && 64BIT && !S390 default n help tmpfs has historically used only inode numbers as wide as an unsigned -- 2.29.2
Re: [PATCH] tmpfs: Don't use 64-bit inodes by defulat with 32-bit ino_t
On Fri, Feb 05, 2021 at 01:23:13PM -0800, Andrew Morton wrote: > On Fri, 5 Feb 2021 14:55:43 -0600 Seth Forshee > wrote: > > > On Fri, Feb 05, 2021 at 12:41:57PM -0800, Andrew Morton wrote: > > > On Fri, 5 Feb 2021 14:21:59 -0600 Seth Forshee > > > wrote: > > > > > > > Currently there seems to be an assumption in tmpfs that 64-bit > > > > architectures also have a 64-bit ino_t. This is not true; s390 at > > > > least has a 32-bit ino_t. With CONFIG_TMPFS_INODE64=y tmpfs > > > > mounts will get 64-bit inode numbers and display "inode64" in the > > > > mount options, but passing the "inode64" mount option will fail. > > > > This leads to the following behavior: > > > > > > > > # mkdir mnt > > > > # mount -t tmpfs nodev mnt > > > > # mount -o remount,rw mnt > > > > mount: /home/ubuntu/mnt: mount point not mounted or bad option. > > > > > > > > As mount sees "inode64" in the mount options and thus passes it > > > > in the options for the remount. > > > > > > > > Ideally CONFIG_TMPFS_INODE64 would depend on sizeof(ino_t) < 8, > > > > but I don't think it's possible to test for this (potentially > > > > CONFIG_ARCH_HAS_64BIT_INO_T or similar could be added, but I'm > > > > not sure whether or not that is wanted). So fix this by simply > > > > refusing to honor the CONFIG_TMPFS_INODE64 setting when > > > > sizeof(ino_t) < 8. > > > > > > How about changing s390 Kconfig so that CONFIG_TMPFS_INODE64 is not > > > enabled? > > > > I did do that for our config. I see the s390 defconfig has it enabled, > > so I will send a patch for that too. But the fact that it can be > > configured that way and that the code behaves badly still seems > > problematic. > > I meant > > --- a/fs/Kconfig~a > +++ a/fs/Kconfig > @@ -203,7 +203,7 @@ config TMPFS_XATTR > > config TMPFS_INODE64 > bool "Use 64-bit ino_t by default in tmpfs" > - depends on TMPFS && 64BIT > + depends on TMPFS && 64BIT && !S390 > default n > help > tmpfs has historically used only inode numbers as wide as an unsigned > _ Ah. s390 does appear to be the only architecture that doesn't use unsigned long for ino_t, so that seems good to me. I can send a patch if you want, but seems like you've already got one ... Seth
Re: [PATCH] tmpfs: Don't use 64-bit inodes by defulat with 32-bit ino_t
On Fri, Feb 05, 2021 at 12:41:57PM -0800, Andrew Morton wrote: > On Fri, 5 Feb 2021 14:21:59 -0600 Seth Forshee > wrote: > > > Currently there seems to be an assumption in tmpfs that 64-bit > > architectures also have a 64-bit ino_t. This is not true; s390 at > > least has a 32-bit ino_t. With CONFIG_TMPFS_INODE64=y tmpfs > > mounts will get 64-bit inode numbers and display "inode64" in the > > mount options, but passing the "inode64" mount option will fail. > > This leads to the following behavior: > > > > # mkdir mnt > > # mount -t tmpfs nodev mnt > > # mount -o remount,rw mnt > > mount: /home/ubuntu/mnt: mount point not mounted or bad option. > > > > As mount sees "inode64" in the mount options and thus passes it > > in the options for the remount. > > > > Ideally CONFIG_TMPFS_INODE64 would depend on sizeof(ino_t) < 8, > > but I don't think it's possible to test for this (potentially > > CONFIG_ARCH_HAS_64BIT_INO_T or similar could be added, but I'm > > not sure whether or not that is wanted). So fix this by simply > > refusing to honor the CONFIG_TMPFS_INODE64 setting when > > sizeof(ino_t) < 8. > > How about changing s390 Kconfig so that CONFIG_TMPFS_INODE64 is not enabled? I did do that for our config. I see the s390 defconfig has it enabled, so I will send a patch for that too. But the fact that it can be configured that way and that the code behaves badly still seems problematic. > > > Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb") > > Signed-off-by: Seth Forshee > > With a cc:stable, I assume? Yes, I forgot that. Thanks. Seth
[PATCH] tmpfs: Don't use 64-bit inodes by defulat with 32-bit ino_t
Currently there seems to be an assumption in tmpfs that 64-bit architectures also have a 64-bit ino_t. This is not true; s390 at least has a 32-bit ino_t. With CONFIG_TMPFS_INODE64=y tmpfs mounts will get 64-bit inode numbers and display "inode64" in the mount options, but passing the "inode64" mount option will fail. This leads to the following behavior: # mkdir mnt # mount -t tmpfs nodev mnt # mount -o remount,rw mnt mount: /home/ubuntu/mnt: mount point not mounted or bad option. As mount sees "inode64" in the mount options and thus passes it in the options for the remount. Ideally CONFIG_TMPFS_INODE64 would depend on sizeof(ino_t) < 8, but I don't think it's possible to test for this (potentially CONFIG_ARCH_HAS_64BIT_INO_T or similar could be added, but I'm not sure whether or not that is wanted). So fix this by simply refusing to honor the CONFIG_TMPFS_INODE64 setting when sizeof(ino_t) < 8. Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb") Signed-off-by: Seth Forshee --- mm/shmem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/shmem.c b/mm/shmem.c index 7c6b6d8f6c39..efde42acdc7a 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3733,7 +3733,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc) ctx->blocks = shmem_default_max_blocks(); if (!(ctx->seen & SHMEM_SEEN_INODES)) ctx->inodes = shmem_default_max_inodes(); - if (!(ctx->seen & SHMEM_SEEN_INUMS)) + if (!(ctx->seen & SHMEM_SEEN_INUMS) && sizeof(ino_t) >= 8) ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64); } else { sb->s_flags |= SB_NOUSER; -- 2.29.2
Re: [PATCH] x86: Disable CET instrumentation in the kernel
On Fri, Jan 29, 2021 at 06:07:55PM +0100, Borislav Petkov wrote: > On Fri, Jan 29, 2021 at 11:03:31AM -0600, Josh Poimboeuf wrote: > > On Fri, Jan 29, 2021 at 06:54:08PM +0200, Nikolay Borisov wrote: > > > > > > > > > On 29.01.21 г. 18:49 ч., Josh Poimboeuf wrote: > > > > Agreed, stable is a good idea. I think Nikolay saw it with GCC 9. > > > > > > > > > Yes I did, with the default Ubuntu compiler as well as the default gcc-10 > > > compiler: > > > > > > # gcc -v -Q -O2 --help=target | grep protection > > > > > > gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04) > > > COLLECT_GCC_OPTIONS='-v' '-Q' '-O2' '--help=target' '-mtune=generic' > > > '-march=x86-64' > > > /usr/lib/gcc/x86_64-linux-gnu/9/cc1 -v -imultiarch x86_64-linux-gnu > > > help-dummy -dumpbase help-dummy -mtune=generic -march=x86-64 -auxbase > > > help-dummy -O2 -version --help=target -fasynchronous-unwind-tables > > > -fstack-protector-strong -Wformat -Wformat-security > > > -fstack-clash-protection -fcf-protection -o /tmp/ccSecttk.s > > > GNU C17 (Ubuntu 9.3.0-17ubuntu1~20.04) version 9.3.0 (x86_64-linux-gnu) > > > compiled by GNU C version 9.3.0, GMP version 6.2.0, MPFR version 4.0.2, > > > MPC version 1.1.0, isl version isl-0.22.1-GMP > > > > > > > > > It has -fcf-protection turned on by default it seems. > > > > Yup, explains why I didn't see it: > > > > gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC) > > COLLECT_GCC_OPTIONS='-v' '-Q' '-O2' '--help=target' '-mtune=generic' > > '-march=x86-64' > > /usr/libexec/gcc/x86_64-redhat-linux/10/cc1 -v help-dummy -dumpbase > > help-dummy -mtune=generic -march=x86-64 -auxbase help-dummy -O2 -version > > --help=target -o /tmp/cclBz55H.s > > The fact that you triggered it with an Ubuntu gcc explains why the > original patch adding that switch: > > 29be86d7f9cb ("kbuild: add -fcf-protection=none when using retpoline flags") > > came from a Canonical. > > Adding the author to Cc for FYI. > > Seth, you can find this thread starting here: > > https://lkml.kernel.org/r/20210128215219.6kct3h2eiustncws@treble Thanks for the heads up. This still works fine for our needs. Acked-by: Seth Forshee
Re: BPF selftests build failure in 5.10-rc
On Wed, Dec 09, 2020 at 04:15:35PM -0800, Andrii Nakryiko wrote: > On Wed, Dec 9, 2020 at 2:24 PM Seth Forshee > wrote: > > > > Building the BPF selftests with clang 11, I'm getting the following > > error: > > > >CLNG-LLC [test_maps] profiler1.o > > In file included from progs/profiler1.c:6: > > progs/profiler.inc.h:260:17: error: use of unknown builtin > > '__builtin_preserve_enum_value' [-Wimplicit-function-declaration] > > int cgrp_id = bpf_core_enum_value(enum > > cgroup_subsys_id___local, > >^ > > > > /home/ubuntu/unstable/tools/testing/selftests/bpf/tools/include/bpf/bpf_core_read.h:179:2: > > note: expanded from macro 'bpf_core_enum_value' > > __builtin_preserve_enum_value(*(typeof(enum_type) *)enum_value, > > BPF_ENUMVAL_VALUE) > > ^ > > 1 error generated. > > llc: error: llc: :1:1: error: expected top-level entity > > BPF obj compilation failed > > Addressed by fb3558127cb6 ("bpf: Fix selftest compilation on clang 11") Great, thanks! > > > > > I see that test_core_reloc_enumval.c takes precautions around the use of > > __builtin_preserve_enum_value as it is currently only available in clang > > 12 nightlies. Is it possible to do something similar here? Though I see > > that the use of the builtin is not nearly so neatly localized as it is > > in test_core_reloc_enumval.c. > > > > Thanks, > > Seth
BPF selftests build failure in 5.10-rc
Building the BPF selftests with clang 11, I'm getting the following error: CLNG-LLC [test_maps] profiler1.o In file included from progs/profiler1.c:6: progs/profiler.inc.h:260:17: error: use of unknown builtin '__builtin_preserve_enum_value' [-Wimplicit-function-declaration] int cgrp_id = bpf_core_enum_value(enum cgroup_subsys_id___local, ^ /home/ubuntu/unstable/tools/testing/selftests/bpf/tools/include/bpf/bpf_core_read.h:179:2: note: expanded from macro 'bpf_core_enum_value' __builtin_preserve_enum_value(*(typeof(enum_type) *)enum_value, BPF_ENUMVAL_VALUE) ^ 1 error generated. llc: error: llc: :1:1: error: expected top-level entity BPF obj compilation failed I see that test_core_reloc_enumval.c takes precautions around the use of __builtin_preserve_enum_value as it is currently only available in clang 12 nightlies. Is it possible to do something similar here? Though I see that the use of the builtin is not nearly so neatly localized as it is in test_core_reloc_enumval.c. Thanks, Seth
Re: resolve_btfids breaks kernel cross-compilation
On Thu, Sep 17, 2020 at 11:14:06AM +0200, Jiri Olsa wrote: > On Thu, Sep 17, 2020 at 10:38:12AM +0200, Jiri Olsa wrote: > > On Thu, Sep 17, 2020 at 10:04:55AM +0200, Jiri Olsa wrote: > > > On Wed, Sep 16, 2020 at 02:47:33PM -0500, Seth Forshee wrote: > > > > The requirement to build resolve_btfids whenever CONFIG_DEBUG_INFO_BTF > > > > is enabled breaks some cross builds. For example, when building a 64-bit > > > > powerpc kernel on amd64 I get: > > > > > > > > Auto-detecting system features: > > > > ...libelf: [ [32mon[m ] > > > > ... zlib: [ [32mon[m ] > > > > ... bpf: [ [31mOFF[m ] > > > > > > > > BPF API too old > > > > make[6]: *** [Makefile:295: bpfdep] Error 1 > > > > > > > > The contents of tools/bpf/resolve_btfids/feature/test-bpf.make.output: > > > > > > > > In file included from > > > > /home/sforshee/src/u-k/unstable/tools/arch/powerpc/include/uapi/asm/bitsperlong.h:11, > > > > from /usr/include/asm-generic/int-ll64.h:12, > > > > from /usr/include/asm-generic/types.h:7, > > > > from /usr/include/x86_64-linux-gnu/asm/types.h:1, > > > > from > > > > /home/sforshee/src/u-k/unstable/tools/include/linux/types.h:10, > > > > from > > > > /home/sforshee/src/u-k/unstable/tools/include/uapi/linux/bpf.h:11, > > > > from test-bpf.c:3: > > > > > > > > /home/sforshee/src/u-k/unstable/tools/include/asm-generic/bitsperlong.h:14:2: > > > > error: #error Inconsistent word size. Check asm/bitsperlong.h > > > > 14 | #error Inconsistent word size. Check asm/bitsperlong.h > > > >| ^ > > > > > > > > This is because tools/arch/powerpc/include/uapi/asm/bitsperlong.h sets > > > > __BITS_PER_LONG based on the predefinied compiler macro __powerpc64__, > > > > which is not defined by the host compiler. What can we do to get cross > > > > builds working again? > > > > > > could you please share the command line and setup? > > > > I just reproduced.. checking on fix > > I still need to check on few things, but patch below should help It does help with the word size problem, thanks. > we might have a problem for cross builds with different endianity > than the host because libbpf does not support reading BTF data > with different endianity, and we get: > > BTFIDS vmlinux > libbpf: non-native ELF endianness is not supported Yes, I see this now when cross building for s390. Thanks, Seth > > jirka > > > --- > diff --git a/tools/bpf/resolve_btfids/Makefile > b/tools/bpf/resolve_btfids/Makefile > index a88cd4426398..d3c818b8d8d3 100644 > --- a/tools/bpf/resolve_btfids/Makefile > +++ b/tools/bpf/resolve_btfids/Makefile > @@ -1,5 +1,6 @@ > # SPDX-License-Identifier: GPL-2.0-only > include ../../scripts/Makefile.include > +include ../../scripts/Makefile.arch > > ifeq ($(srctree),) > srctree := $(patsubst %/,%,$(dir $(CURDIR))) > @@ -29,6 +30,7 @@ endif > AR = $(HOSTAR) > CC = $(HOSTCC) > LD = $(HOSTLD) > +ARCH = $(HOSTARCH) > > OUTPUT ?= $(srctree)/tools/bpf/resolve_btfids/ > >
resolve_btfids breaks kernel cross-compilation
The requirement to build resolve_btfids whenever CONFIG_DEBUG_INFO_BTF is enabled breaks some cross builds. For example, when building a 64-bit powerpc kernel on amd64 I get: Auto-detecting system features: ...libelf: [ [32mon[m ] ... zlib: [ [32mon[m ] ... bpf: [ [31mOFF[m ] BPF API too old make[6]: *** [Makefile:295: bpfdep] Error 1 The contents of tools/bpf/resolve_btfids/feature/test-bpf.make.output: In file included from /home/sforshee/src/u-k/unstable/tools/arch/powerpc/include/uapi/asm/bitsperlong.h:11, from /usr/include/asm-generic/int-ll64.h:12, from /usr/include/asm-generic/types.h:7, from /usr/include/x86_64-linux-gnu/asm/types.h:1, from /home/sforshee/src/u-k/unstable/tools/include/linux/types.h:10, from /home/sforshee/src/u-k/unstable/tools/include/uapi/linux/bpf.h:11, from test-bpf.c:3: /home/sforshee/src/u-k/unstable/tools/include/asm-generic/bitsperlong.h:14:2: error: #error Inconsistent word size. Check asm/bitsperlong.h 14 | #error Inconsistent word size. Check asm/bitsperlong.h | ^ This is because tools/arch/powerpc/include/uapi/asm/bitsperlong.h sets __BITS_PER_LONG based on the predefinied compiler macro __powerpc64__, which is not defined by the host compiler. What can we do to get cross builds working again? Thanks, Seth
test_bpf regressions on s390 since 5.4
The tests in lib/test_bpf.c were all passing in 5.4 when using the JIT, but some are failing in 5.7/5.8. Some of the failures are due to the removal of BPF_SIZE_MAX causing some expected failures to pass, which I have already send a patch for [1]. The remaining failures appear to be regressions. I haven't tried 5.5 or 5.6, so I'm not sure exactly when they first appeared. These are the tests which currently fail: test_bpf: #37 INT: MUL_X jited:1 ret -1 != 1 FAIL (1 times) test_bpf: #42 INT: SUB jited:1 ret -55 != 11 FAIL (1 times) test_bpf: #44 INT: MUL jited:1 ret 439084800 != 903446258 FAIL (1 times) test_bpf: #49 INT: shifts by register jited:1 ret -617 != -1 FAIL (1 times) test_bpf: #371 JNE signed compare, test 1 jited:1 ret 2 != 1 FAIL (1 times) test_bpf: #372 JNE signed compare, test 2 jited:1 ret 2 != 1 FAIL (1 times) test_bpf: #374 JNE signed compare, test 4 jited:1 ret 1 != 2 FAIL (1 times) test_bpf: #375 JNE signed compare, test 5 jited:1 ret 2 != 1 FAIL (1 times) Thanks, Seth [1] https://lore.kernel.org/lkml/20200716143931.330122-1-seth.fors...@canonical.com/
[PATCH] Revert "test_bpf: flag tests that cannot be jited on s390"
This reverts commit 3203c9010060806ff88c9989aeab4dc8d9a474dc. The s390 bpf JIT previously had a restriction on the maximum program size, which required some tests in test_bpf to be flagged as expected failures. The program size limitation has been removed, and the tests now pass, so these tests should no longer be flagged. Fixes: d1242b10ff03 ("s390/bpf: Remove JITed image size limitations") Signed-off-by: Seth Forshee --- lib/test_bpf.c | 20 1 file changed, 20 deletions(-) diff --git a/lib/test_bpf.c b/lib/test_bpf.c index a5fddf9ebcb7..ca7d635bccd9 100644 --- a/lib/test_bpf.c +++ b/lib/test_bpf.c @@ -5275,31 +5275,21 @@ static struct bpf_test tests[] = { { /* Mainly checking JIT here. */ "BPF_MAXINSNS: Ctx heavy transformations", { }, -#if defined(CONFIG_BPF_JIT_ALWAYS_ON) && defined(CONFIG_S390) - CLASSIC | FLAG_EXPECTED_FAIL, -#else CLASSIC, -#endif { }, { { 1, SKB_VLAN_PRESENT }, { 10, SKB_VLAN_PRESENT } }, .fill_helper = bpf_fill_maxinsns6, - .expected_errcode = -ENOTSUPP, }, { /* Mainly checking JIT here. */ "BPF_MAXINSNS: Call heavy transformations", { }, -#if defined(CONFIG_BPF_JIT_ALWAYS_ON) && defined(CONFIG_S390) - CLASSIC | FLAG_NO_DATA | FLAG_EXPECTED_FAIL, -#else CLASSIC | FLAG_NO_DATA, -#endif { }, { { 1, 0 }, { 10, 0 } }, .fill_helper = bpf_fill_maxinsns7, - .expected_errcode = -ENOTSUPP, }, { /* Mainly checking JIT here. */ "BPF_MAXINSNS: Jump heavy test", @@ -5350,28 +5340,18 @@ static struct bpf_test tests[] = { { "BPF_MAXINSNS: exec all MSH", { }, -#if defined(CONFIG_BPF_JIT_ALWAYS_ON) && defined(CONFIG_S390) - CLASSIC | FLAG_EXPECTED_FAIL, -#else CLASSIC, -#endif { 0xfa, 0xfb, 0xfc, 0xfd, }, { { 4, 0xababab83 } }, .fill_helper = bpf_fill_maxinsns13, - .expected_errcode = -ENOTSUPP, }, { "BPF_MAXINSNS: ld_abs+get_processor_id", { }, -#if defined(CONFIG_BPF_JIT_ALWAYS_ON) && defined(CONFIG_S390) - CLASSIC | FLAG_EXPECTED_FAIL, -#else CLASSIC, -#endif { }, { { 1, 0xbee } }, .fill_helper = bpf_fill_ld_abs_get_processor_id, - .expected_errcode = -ENOTSUPP, }, /* * LD_IND / LD_ABS on fragmented SKBs -- 2.27.0
Re: [PATCH] MAINTAINERS: remove obsolete entry after file renaming
On Sun, Jun 28, 2020 at 08:02:29PM +0200, Lukas Bulwahn wrote: > Commit f16861b12fa0 ("regulator: rename da903x to da903x-regulator") missed > to adjust the DIALOG SEMICONDUCTOR DRIVERS section in MAINTAINERS. > > Hence, ./scripts/get_maintainer.pl --self-test=patterns complains: > > warning: no file matchesF:drivers/regulator/da903x.c > > The da903x-regulator.c file is already covered by the pattern > drivers/regulator/da9???-regulator.[ch] in the section. > > So, simply remove the non-matching file entry in MAINTAINERS. > > Signed-off-by: Lukas Bulwahn I didn't think to check MAINTAINERS when renaming the file. This makes sense to me. Acked-by: Seth Forshee > --- > applies cleanly on next-20200626 > > Seth, please ack. > Mark, please pick this minor non-urgent patch into your -next tree. > > MAINTAINERS | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/MAINTAINERS b/MAINTAINERS > index 04fceaee5200..970136e262c2 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -5021,7 +5021,6 @@ F: drivers/mfd/da91??-*.c > F: drivers/pinctrl/pinctrl-da90??.c > F: drivers/power/supply/da9052-battery.c > F: drivers/power/supply/da91??-*.c > -F: drivers/regulator/da903x.c > F: drivers/regulator/da9???-regulator.[ch] > F: drivers/regulator/slg51000-regulator.[ch] > F: drivers/rtc/rtc-da90??.c > -- > 2.17.1 >
[PATCH] regulator: rename da903x to da903x-regulator
This module shares the same name as its parent PMIC driver, which confuses tools like kmod. Rename the regulator driver to avoid such problems. Signed-off-by: Seth Forshee --- drivers/regulator/Makefile | 2 +- drivers/regulator/{da903x.c => da903x-regulator.c} | 0 2 files changed, 1 insertion(+), 1 deletion(-) rename drivers/regulator/{da903x.c => da903x-regulator.c} (100%) diff --git a/drivers/regulator/Makefile b/drivers/regulator/Makefile index e8f163371071..0796e4a47afa 100644 --- a/drivers/regulator/Makefile +++ b/drivers/regulator/Makefile @@ -31,7 +31,7 @@ obj-$(CONFIG_REGULATOR_BD70528) += bd70528-regulator.o obj-$(CONFIG_REGULATOR_BD71828) += bd71828-regulator.o obj-$(CONFIG_REGULATOR_BD718XX) += bd718x7-regulator.o obj-$(CONFIG_REGULATOR_BD9571MWV) += bd9571mwv-regulator.o -obj-$(CONFIG_REGULATOR_DA903X) += da903x.o +obj-$(CONFIG_REGULATOR_DA903X) += da903x-regulator.o obj-$(CONFIG_REGULATOR_DA9052) += da9052-regulator.o obj-$(CONFIG_REGULATOR_DA9055) += da9055-regulator.o obj-$(CONFIG_REGULATOR_DA9062) += da9062-regulator.o diff --git a/drivers/regulator/da903x.c b/drivers/regulator/da903x-regulator.c similarity index 100% rename from drivers/regulator/da903x.c rename to drivers/regulator/da903x-regulator.c -- 2.27.0
Re: [PATCH v2] selftests/ftrace: Use printf instead of echo in kprobe syntax error tests
On Fri, May 29, 2020 at 03:26:06PM -0600, Shuah Khan wrote: > On 5/29/20 2:37 PM, Seth Forshee wrote: > > On Wed, Mar 04, 2020 at 04:20:09PM -0600, Seth Forshee wrote: > > > Test cases which use echo to write strings containing backslashes > > > fail with some shells, as echo's treatment of backslashes in > > > strings varies between shell implementations. Use printf instead, > > > as it should behave consistently across different shells. This > > > requires adjustments to the strings to escape \ and % characters. > > > ftrace_errlog_check() must also re-escape these characters after > > > processing them to remove ^ characters. > > > > > > Signed-off-by: Seth Forshee > > > > Ping. Someone just asked me about this patch, and I noticed that it > > hasn't been applied or received any feedback. > > > > I pulled in this patch from Masami: > > selftests/ftrace: Use printf for backslash included command > https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git/commit/?h=next&id=8e923a2168afd221ea26e3d9716f21e9578b5c4d > > Looks like a duplicate. > > Seth, > Is your patch still needed? Nope, Masami's patch seems to fix the issues addressed by my patch. Thanks!
Re: [PATCH v2] selftests/ftrace: Use printf instead of echo in kprobe syntax error tests
On Wed, Mar 04, 2020 at 04:20:09PM -0600, Seth Forshee wrote: > Test cases which use echo to write strings containing backslashes > fail with some shells, as echo's treatment of backslashes in > strings varies between shell implementations. Use printf instead, > as it should behave consistently across different shells. This > requires adjustments to the strings to escape \ and % characters. > ftrace_errlog_check() must also re-escape these characters after > processing them to remove ^ characters. > > Signed-off-by: Seth Forshee Ping. Someone just asked me about this patch, and I noticed that it hasn't been applied or received any feedback. > --- > Changes in v2: > - Escape backslashes for a couple of additional tests > > .../testing/selftests/ftrace/test.d/functions | 6 +++--- > .../test.d/kprobe/kprobe_syntax_errors.tc | 20 +-- > 2 files changed, 13 insertions(+), 13 deletions(-) > > diff --git a/tools/testing/selftests/ftrace/test.d/functions > b/tools/testing/selftests/ftrace/test.d/functions > index 5d4550591ff9..b38c6eb029e8 100644 > --- a/tools/testing/selftests/ftrace/test.d/functions > +++ b/tools/testing/selftests/ftrace/test.d/functions > @@ -114,11 +114,11 @@ yield() { > } > > ftrace_errlog_check() { # err-prefix command-with-error-pos-by-^ command-file > -pos=$(echo -n "${2%^*}" | wc -c) # error position > -command=$(echo "$2" | tr -d ^) > +pos=$(printf "${2%^*}" | wc -c) # error position > +command=$(printf "$2" | sed -e 's/\^//g' -e 's/%/%%/g' -e 's/\\//g') > echo "Test command: $command" > echo > error_log > -(! echo "$command" >> "$3" ) 2> /dev/null > +(! printf "$command" >> "$3" ) 2> /dev/null > grep "$1: error:" -A 3 error_log > N=$(tail -n 1 error_log | wc -c) > # " Command: " and "^\n" => 13 > diff --git > a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc > b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc > index ef1e9bafb098..039c03d230b9 100644 > --- a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc > +++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc > @@ -37,14 +37,14 @@ fi > > check_error 'p vfs_read ^$none_var' # BAD_VAR > > -check_error 'p vfs_read ^%none_reg' # BAD_REG_NAME > +check_error 'p vfs_read ^%%none_reg' # BAD_REG_NAME > check_error 'p vfs_read ^@12345678abcde' # BAD_MEM_ADDR > check_error 'p vfs_read ^@+10' # FILE_ON_KPROBE > > grep -q "imm-value" README && \ > -check_error 'p vfs_read arg1=\^x'# BAD_IMM > +check_error 'p vfs_read arg1=\\^x' # BAD_IMM > grep -q "imm-string" README && \ > -check_error 'p vfs_read arg1=\"abcd^'# IMMSTR_NO_CLOSE > +check_error 'p vfs_read arg1=\\"abcd^' # IMMSTR_NO_CLOSE > > check_error 'p vfs_read ^+0@0)' # DEREF_NEED_BRACE > check_error 'p vfs_read ^+0ab1(@0)' # BAD_DEREF_OFFS > @@ -80,7 +80,7 @@ check_error 'p vfs_read arg1=^' # > NO_ARG_BODY > # instruction boundary check is valid on x86 (at this moment) > case $(uname -m) in >x86_64|i[3456]86) > -echo 'p vfs_read' > kprobe_events > +printf 'p vfs_read' > kprobe_events > if grep -q FTRACE ../kprobes/list ; then > check_error 'p ^vfs_read+3' # BAD_INSN_BNDRY (only if > function-tracer is enabled) > fi > @@ -89,13 +89,13 @@ esac > > # multiprobe errors > if grep -q "Create/append/" README && grep -q "imm-value" README; then > -echo 'p:kprobes/testevent _do_fork' > kprobe_events > +printf 'p:kprobes/testevent _do_fork' > kprobe_events > check_error '^r:kprobes/testevent do_exit' # DIFF_PROBE_TYPE > -echo 'p:kprobes/testevent _do_fork abcd=\1' > kprobe_events > -check_error 'p:kprobes/testevent _do_fork ^bcd=\1' # DIFF_ARG_TYPE > -check_error 'p:kprobes/testevent _do_fork ^abcd=\1:u8' # DIFF_ARG_TYPE > -check_error 'p:kprobes/testevent _do_fork ^abcd=\"foo"' # DIFF_ARG_TYPE > -check_error '^p:kprobes/testevent _do_fork abcd=\1' # SAME_PROBE > +printf 'p:kprobes/testevent _do_fork abcd=\\1' > kprobe_events > +check_error 'p:kprobes/testevent _do_fork ^bcd=\\1' # DIFF_ARG_TYPE > +check_error 'p:kprobes/testevent _do_fork ^abcd=\\1:u8' # DIFF_ARG_TYPE > +check_error 'p:kprobes/testevent _do_fork ^abcd=\\"foo"'# DIFF_ARG_TYPE > +check_error '^p:kprobes/testevent _do_fork abcd=\\1' # SAME_PROBE > fi > > exit 0 > -- > 2.25.0 >
[PATCH] sched: Add __ASSEMBLY__ guards around struct clone_args
The addition of struct clone_args to uapi/linux/sched.h is not protected by __ASSEMBLY__ guards, causing a FTBFS for glibc on RISC-V. Add the guards to fix this. Fixes: 7f192e3cd316 ("fork: add clone3") Signed-off-by: Seth Forshee --- include/uapi/linux/sched.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h index b3105ac1381a..851ff1feadd5 100644 --- a/include/uapi/linux/sched.h +++ b/include/uapi/linux/sched.h @@ -33,6 +33,7 @@ #define CLONE_NEWNET 0x4000 /* New network namespace */ #define CLONE_IO 0x8000 /* Clone io context */ +#ifndef __ASSEMBLY__ /* * Arguments for the clone3 syscall */ @@ -46,6 +47,7 @@ struct clone_args { __aligned_u64 stack_size; __aligned_u64 tls; }; +#endif /* * Scheduling policies -- 2.20.1
Re: [PATCH 0/1] Small potential fix for shiftfs
On Thu, Aug 15, 2019 at 04:36:02PM +0200, Oleksandr Natalenko wrote: > Hey, people. > > I was lurking at shiftfs just out of curiosity and managed to bump into > a compiler warning that is (as I suppose) easily fixed by the subsequent > patch. > > Feel free to drag this into your Ubuntu tree if needed. I haven't played > with it yet, just compiling (because I'm looking for something that is > bindfs but in-kernel) :). Thanks for the patch. Christian has actually already sent a patch for this along with another patch which is still under review: https://lists.ubuntu.com/archives/kernel-team/2019-July/102449.html Also note that currently shiftfs is only in Ubuntu distro kernels, and Ubuntu-specific kernel patches should be directed at kernel-t...@lists.ubuntu.com rather than lkml. If you'll be at LPC, there's a session to discuss the future of upstreaming shiftfs that you might find interesting. Thanks! Seth
[PATCH v2] kbuild: add -fcf-protection=none when using retpoline flags
The gcc -fcf-protection=branch option is not compatible with -mindirect-branch=thunk-extern. The latter is used when CONFIG_RETPOLINE is selected, and this will fail to build with a gcc which has -fcf-protection=branch enabled by default. Adding -fcf-protection=none when building with retpoline enabled prevents such build failures. Signed-off-by: Seth Forshee --- Makefile | 6 ++ 1 file changed, 6 insertions(+) diff --git a/Makefile b/Makefile index 3e4868a6498b..73a94d1db2b6 100644 --- a/Makefile +++ b/Makefile @@ -878,6 +878,12 @@ KBUILD_CFLAGS += $(call cc-option,-Werror=designated-init) # change __FILE__ to the relative path from the srctree KBUILD_CFLAGS += $(call cc-option,-fmacro-prefix-map=$(srctree)/=) +# ensure -fcf-protection is disabled when using retpoline as it is +# incompatible with -mindirect-branch=thunk-extern +ifdef CONFIG_RETPOLINE +KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none,) +endif + # use the deterministic mode of AR if available KBUILD_ARFLAGS := $(call ar-option,D) -- 2.20.1
Re: [kbuild:kbuild 5/19] drivers/atm/eni.o: warning: objtool: eni_init_one()+0xe42: indirect call found in RETPOLINE build
On Wed, Jul 17, 2019 at 11:52:07AM +0900, Masahiro Yamada wrote: > On Wed, Jul 17, 2019 at 1:20 AM Josh Poimboeuf wrote: > > > > On Tue, Jul 16, 2019 at 07:42:49AM -0500, Seth Forshee wrote: > > > On Tue, Jul 16, 2019 at 03:57:24PM +0900, Masahiro Yamada wrote: > > > > (+ Josh Poimboeuf) > > > > > > > > On Tue, Jul 16, 2019 at 8:44 AM kbuild test robot > > > > wrote: > > > > > > > > > > tree: > > > > > https://kernel.googlesource.com/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git > > > > > kbuild > > > > > head: 0ff0c3753e06c0420c80dac1b0187a442b372acb > > > > > commit: 2eaf4e87ba258cc3f27e486cdf32d5ba76303c6f [5/19] kbuild: add > > > > > -fcf-protection=none to retpoline flags > > > > > config: x86_64-randconfig-s2-07160214 (attached as .config) > > > > > compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4 > > > > > reproduce: > > > > > git checkout 2eaf4e87ba258cc3f27e486cdf32d5ba76303c6f > > > > > # save the attached .config to linux build tree > > > > > make ARCH=x86_64 > > > > > > > > 0-day bot reports objtool warnings with the following applied: > > > > https://patchwork.kernel.org/patch/11037379/ > > > > > > > > I have no idea about objtool. > > > > > > > > Is it better to drop this patch for now? > > > > > > I'm surprised that the change would have any impact on a build with > > > gcc-4.9, since -fcf-protection seems to have been introduced in gcc-8. I > > > guess there's no full build log that would let us see the actual flags > > > passed to the compiler. > > > > > > I'll try to reproduce this result. If you think the patch should be > > > dropped in the meantime, that's fine. > > > > The problem with this patch is that it's breaking the following check in > > arch/x86/Makefile. GCC 4.9 doesn't support retpolines, so it's supposed > > to fail with the below error. > > > > ifdef CONFIG_RETPOLINE > > ifeq ($(RETPOLINE_CFLAGS),) > > @echo "You are building kernel with non-retpoline compiler." >&2 > > @echo "Please update your compiler." >&2 > > @false > > endif > > endif > > > > Maybe the flags should be placed in another variable other than > > RETPOLINE_CFLAGS. > > > > Josh, > Thanks. You are right. > > > Seth, > I think you can add the flag to KBUILD_CFLAGS. > > If you want to make sure this does not affect non-retpoline > build, you can surround the code with ifdef. > > ifdef CONFIG_RETPOLINE > KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none) > endif Thanks, I'll send an updated patch shortly. Seth
Re: [kbuild:kbuild 5/19] drivers/atm/eni.o: warning: objtool: eni_init_one()+0xe42: indirect call found in RETPOLINE build
On Tue, Jul 16, 2019 at 03:57:24PM +0900, Masahiro Yamada wrote: > (+ Josh Poimboeuf) > > On Tue, Jul 16, 2019 at 8:44 AM kbuild test robot wrote: > > > > tree: > > https://kernel.googlesource.com/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git > > kbuild > > head: 0ff0c3753e06c0420c80dac1b0187a442b372acb > > commit: 2eaf4e87ba258cc3f27e486cdf32d5ba76303c6f [5/19] kbuild: add > > -fcf-protection=none to retpoline flags > > config: x86_64-randconfig-s2-07160214 (attached as .config) > > compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4 > > reproduce: > > git checkout 2eaf4e87ba258cc3f27e486cdf32d5ba76303c6f > > # save the attached .config to linux build tree > > make ARCH=x86_64 > > 0-day bot reports objtool warnings with the following applied: > https://patchwork.kernel.org/patch/11037379/ > > I have no idea about objtool. > > Is it better to drop this patch for now? I'm surprised that the change would have any impact on a build with gcc-4.9, since -fcf-protection seems to have been introduced in gcc-8. I guess there's no full build log that would let us see the actual flags passed to the compiler. I'll try to reproduce this result. If you think the patch should be dropped in the meantime, that's fine. Thanks, Seth > > Thanks. > > > > > If you fix the issue, kindly add following tag > > Reported-by: kbuild test robot > > > > All warnings (new ones prefixed by >>): > > > >drivers/atm/eni.o: warning: objtool: eni_do_release()+0x1a: indirect > > call found in RETPOLINE build > >drivers/atm/eni.o: warning: objtool: do_tx()+0x1be: indirect call found > > in RETPOLINE build > >drivers/atm/eni.o: warning: objtool: eni_send()+0x15b: indirect call > > found in RETPOLINE build > >drivers/atm/eni.o: warning: objtool: eni_send()+0x1b4: indirect call > > found in RETPOLINE build > >drivers/atm/eni.o: warning: objtool: eni_send()+0x24d: indirect call > > found in RETPOLINE build > >drivers/atm/eni.o: warning: objtool: eni_int()+0xd1: indirect call found > > in RETPOLINE build > >drivers/atm/eni.o: warning: objtool: poll_rx.isra.16()+0x99: indirect > > call found in RETPOLINE build > >drivers/atm/eni.o: warning: objtool: poll_rx.isra.16()+0xf7: indirect > > call found in RETPOLINE build > >drivers/atm/eni.o: warning: objtool: poll_rx.isra.16()+0x20c: indirect > > call found in RETPOLINE build > >drivers/atm/eni.o: warning: objtool: poll_rx.isra.16()+0x266: indirect > > call found in RETPOLINE build > >drivers/atm/eni.o: warning: objtool: eni_ioctl()+0x54: indirect call > > found in RETPOLINE build > >drivers/atm/eni.o: warning: objtool: eni_tasklet()+0x3f7: indirect call > > found in RETPOLINE build > >drivers/atm/eni.o: warning: objtool: eni_tasklet()+0x420: indirect call > > found in RETPOLINE build > >drivers/atm/eni.o: warning: objtool: eni_tasklet()+0x62f: indirect call > > found in RETPOLINE build > >drivers/atm/eni.o: warning: objtool: eni_tasklet()+0x673: indirect call > > found in RETPOLINE build > > >> drivers/atm/eni.o: warning: objtool: eni_init_one()+0xe42: indirect call > > >> found in RETPOLINE build > >drivers/atm/eni.o: warning: objtool: do_rx_dma.constprop.28()+0xaf: > > indirect call found in RETPOLINE build > >drivers/atm/eni.o: warning: objtool: do_rx_dma.constprop.28()+0x49c: > > indirect call found in RETPOLINE build > > -- > >net//batman-adv/sysfs.o: warning: objtool: > > batadv_show_gw_sel_class()+0x6c: indirect call found in RETPOLINE build > > >> net//batman-adv/sysfs.o: warning: objtool: > > >> __batadv_store_uint_attr.isra.9.constprop.10()+0xb7: indirect call found > > >> in RETPOLINE build > >net//batman-adv/sysfs.o: warning: objtool: > > batadv_store_gw_sel_class()+0x8b: indirect call found in RETPOLINE build > > -- > >net//wireless/wext-compat.o: warning: objtool: > > cfg80211_wext_siwpower()+0xd7: indirect call found in RETPOLINE build > >net//wireless/wext-compat.o: warning: objtool: > > cfg80211_wext_siwpower()+0x257: indirect call found in RETPOLINE build > >net//wireless/wext-compat.o: warning: objtool: > > cfg80211_wext_siwpower()+0x2f4: indirect call found in RETPOLINE build > >net//wireless/wext-compat.o: warning: objtool: > > cfg80211_wext_siwretry()+0xfb: indirect call found in RETPOLINE build > >net//wireless/wext-compat.o: warning: objtool: > > cfg80211_wext_siwretry()+0x219: indirect call found in RETPOLINE build > >net//wireless/wext-compat.o: warning: objtool: > > cfg80211_wext_siwretry()+0x2b1: indirect call found in RETPOLINE build > >net//wireless/wext-compat.o: warning: objtool: > > cfg80211_wext_siwfrag()+0xce: indirect call found in RETPOLINE build > >net//wireless/wext-compat.o: warning: objtool: > > cfg80211_wext_siwfrag()+0x1f6: indirect call found in RETPOLINE build > >net//wireless/wext-compat.o: warning: objtool: > > cfg80211_wext_siwfrag()+0x2c8: indirect call found in RETPOLINE bui
[PATCH] kbuild: add -fcf-protection=none to retpoline flags
-mindirect-branch and -fcf-protection are not compatible, and so kernel builds fail with a gcc build where -fcf-protection is enabled by default. Add -fcf-protection=none to the retpoline flags to fix this. Signed-off-by: Seth Forshee --- Makefile | 4 1 file changed, 4 insertions(+) diff --git a/Makefile b/Makefile index 3e4868a6498b..050f11d19777 100644 --- a/Makefile +++ b/Makefile @@ -636,6 +636,10 @@ RETPOLINE_CFLAGS_CLANG := -mretpoline-external-thunk RETPOLINE_VDSO_CFLAGS_CLANG := -mretpoline RETPOLINE_CFLAGS := $(call cc-option,$(RETPOLINE_CFLAGS_GCC),$(call cc-option,$(RETPOLINE_CFLAGS_CLANG))) RETPOLINE_VDSO_CFLAGS := $(call cc-option,$(RETPOLINE_VDSO_CFLAGS_GCC),$(call cc-option,$(RETPOLINE_VDSO_CFLAGS_CLANG))) +# -mindirect-branch is incompatible with -fcf-protection, so ensure the +# latter is disabled +RETPOLINE_CFLAGS += $(call cc-option,-fcf-protection=none,) +RETPOLINE_VDSO_CFLAGS += $(call cc-option,-fcf-protection=none,) export RETPOLINE_CFLAGS export RETPOLINE_VDSO_CFLAGS -- 2.20.1
Re: [PATCH] x86/ima: require signed kernel modules
On Tue, Feb 05, 2019 at 01:52:21PM -0500, Mimi Zohar wrote: > On Tue, 2019-02-05 at 12:32 -0600, Seth Forshee wrote: > > On Tue, Feb 05, 2019 at 11:47:24AM -0500, Mimi Zohar wrote: > > > Hi Seth, > > > > > > On Tue, 2019-02-05 at 09:18 -0600, Seth Forshee wrote: > > > > On Thu, Jan 31, 2019 at 02:18:59PM -0500, Mimi Zohar wrote: > > > > > Require signed kernel modules on systems with secure boot mode > > > > > enabled. > > > > > > > > > > To coordinate between appended kernel module signatures and IMA > > > > > signatures, only define an IMA MODULE_CHECK policy rule if > > > > > CONFIG_MODULE_SIG is not enabled. > > > > > > > > > > This patch defines a function named set_module_sig_required() and > > > > > renames > > > > > is_module_sig_enforced() to is_module_sig_enforced_or_required(). The > > > > > call to set_module_sig_required() is dependent on > > > > > CONFIG_IMA_ARCH_POLICY > > > > > being enabled. > > > > > > > > > > Signed-off-by: Mimi Zohar > > > > > > > > With respect to interactions with the kernel lockdown patches, this > > > > looks better than the patches I saw previously. I don't feel like I know > > > > enough about what's going on with IMA to ack the patch, but I feel > > > > confident that it's at least not going to break signature enforcement > > > > for us. > > > > > > Thank you for testing! Could this be translated into a "tested-by" > > > "(for w/lockdown patches)"? > > > > Yeah, that's fine. To be clear about what I tested, I've confirmed that > > it doesn't interfere with requiring signed modules under lockdown with > > CONFIG_IMA_ARCH_POLICY=n and IMA appraisal enabled. > > > > Tested-by: Seth Forshee > > Oh! You've disabled the coordination of the two signature > verification methods. Any chance you could test with > "CONFIG_IMA_ARCH_POLICY" enabled? Ok, I've tested this now and it also seems to be working. However it does seem redundant to have CONFIG_IMA_ARCH_POLICY alongside lockdown, as it doesn't enforce anything not already being enforced by lockdown. Seth
Re: [PATCH] x86/ima: require signed kernel modules
On Tue, Feb 05, 2019 at 11:47:24AM -0500, Mimi Zohar wrote: > Hi Seth, > > On Tue, 2019-02-05 at 09:18 -0600, Seth Forshee wrote: > > On Thu, Jan 31, 2019 at 02:18:59PM -0500, Mimi Zohar wrote: > > > Require signed kernel modules on systems with secure boot mode enabled. > > > > > > To coordinate between appended kernel module signatures and IMA > > > signatures, only define an IMA MODULE_CHECK policy rule if > > > CONFIG_MODULE_SIG is not enabled. > > > > > > This patch defines a function named set_module_sig_required() and renames > > > is_module_sig_enforced() to is_module_sig_enforced_or_required(). The > > > call to set_module_sig_required() is dependent on CONFIG_IMA_ARCH_POLICY > > > being enabled. > > > > > > Signed-off-by: Mimi Zohar > > > > With respect to interactions with the kernel lockdown patches, this > > looks better than the patches I saw previously. I don't feel like I know > > enough about what's going on with IMA to ack the patch, but I feel > > confident that it's at least not going to break signature enforcement > > for us. > > Thank you for testing! Could this be translated into a "tested-by" > "(for w/lockdown patches)"? Yeah, that's fine. To be clear about what I tested, I've confirmed that it doesn't interfere with requiring signed modules under lockdown with CONFIG_IMA_ARCH_POLICY=n and IMA appraisal enabled. Tested-by: Seth Forshee
Re: [PATCH] x86/ima: require signed kernel modules
On Thu, Jan 31, 2019 at 02:18:59PM -0500, Mimi Zohar wrote: > Require signed kernel modules on systems with secure boot mode enabled. > > To coordinate between appended kernel module signatures and IMA > signatures, only define an IMA MODULE_CHECK policy rule if > CONFIG_MODULE_SIG is not enabled. > > This patch defines a function named set_module_sig_required() and renames > is_module_sig_enforced() to is_module_sig_enforced_or_required(). The > call to set_module_sig_required() is dependent on CONFIG_IMA_ARCH_POLICY > being enabled. > > Signed-off-by: Mimi Zohar With respect to interactions with the kernel lockdown patches, this looks better than the patches I saw previously. I don't feel like I know enough about what's going on with IMA to ack the patch, but I feel confident that it's at least not going to break signature enforcement for us. Thanks, Seth
Re: [RFC PATCH 6/6] shiftfs: support nested shiftfs mounts
On Fri, Nov 02, 2018 at 03:16:05PM +0200, Amir Goldstein wrote: > On Fri, Nov 2, 2018 at 2:44 PM Seth Forshee > wrote: > > > > On Fri, Nov 02, 2018 at 12:02:45PM +0200, Amir Goldstein wrote: > > > On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee > > > wrote: > > > > > > > > shiftfs mounts cannot be nested for two reasons -- global > > > > CAP_SYS_ADMIN is required to set up a mark mount, and a single > > > > functional shiftfs mount meets the filesystem stacking depth > > > > limit. > > > > > > > > The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel > > > > ids in a mount must be within that mount's s_user_ns, so all that > > > > is needed is CAP_SYS_ADMIN within that s_user_ns. > > > > > > > > The stack depth issue can be worked around with a couple of > > > > adjustments. First, a mark mount doesn't really need to count > > > > against the stacking depth as it doesn't contribute to the call > > > > stack depth during filesystem operations. Therefore the mount > > > > over the mark mount only needs to count as one more than the > > > > lower filesystems stack depth. > > > > > > That's true, but it also highlights the point that the "mark" sb is > > > completely unneeded and it really is just a nice little hack. > > > All the information it really stores is a lower mount reference, > > > a lower root dentry and a declarative statement "I am shiftable!". > > > > Seems I should have saved some of the things I said in my previous > > response for this one. As you no doubt gleaned from that email, I agree > > with this. > > > > > Come to think about it, "container shiftable" really isn't that different > > > from > > > NFS export with "no_root_squash" and auto mounted USB drive. > > > I mean the shifting itself is different of course, but the > > > declaration, not so much. > > > If I am allowing sudoers on another machine to mess with root owned > > > files visible > > > on my machine, I am pretty much have the same issues as container admins > > > accessing root owned files on my init_user_ns filesystem. In all those > > > cases, > > > I'd better not be executing suid binaries from the untrusted "external" > > > source. > > > > > > Instead of mounting a dummy filesystem to make the declaration, you could > > > get the same thing with: > > >mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC) > > > and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED > > > or whatnot) constant to uapi and manage to come up good man page > > > description. > > > > > > Then users could actually mount a filesystem in init_user_ns MS_EXTERN and > > > avoid the extra bind mount step (for a full filesystem tree export). > > > Declaring a mounted image MS_EXTERN has merits on its own even without > > > containers and shitfs, for example with pluggable storage. Other LSMs > > > could make > > > good use of that declaration. > > > > I'm missing how we figure out the target user ns in this scheme. We need > > a context with privileges towards the source path's s_user_ns to say > > it's okay to shift ids for the files under the source path, and then we > > need a target user ns for the id shifts. Currently the target is > > current_user_ns when the final shiftfs mount is created. > > > > So, how do we determine the target s_user_ns in your scheme? > > > > Unless I am completely misunderstanding shiftfs, I think we are saying the > same thing. You said you wish to get rid of the "mark" fs and that you had > a POC of implementing the "mark" using xattr. The PoC was using filesystem contexts and (an earlier version of) the new mount API, but yes I do think we're in agreement about the mark fs being awkward. > I'm just saying another option to implement the mark is using a super block > flag and you get the target s_user_ns from mnt_sb. > I did miss the fact that a mount flag is not enough, so that makes the bind > mount concept fail. Unless, maybe, the mount in the container is a slave > mount and the "shiftable" mark is set on the master. > I know too little about mount ns vs. user ns to tell if any of this makes > sense. We need a source and target user ns for the shifting, currently they are the lower filesystem's s_user_ns and t
Re: [RFC PATCH 6/6] shiftfs: support nested shiftfs mounts
On Fri, Nov 02, 2018 at 12:02:45PM +0200, Amir Goldstein wrote: > On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee > wrote: > > > > shiftfs mounts cannot be nested for two reasons -- global > > CAP_SYS_ADMIN is required to set up a mark mount, and a single > > functional shiftfs mount meets the filesystem stacking depth > > limit. > > > > The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel > > ids in a mount must be within that mount's s_user_ns, so all that > > is needed is CAP_SYS_ADMIN within that s_user_ns. > > > > The stack depth issue can be worked around with a couple of > > adjustments. First, a mark mount doesn't really need to count > > against the stacking depth as it doesn't contribute to the call > > stack depth during filesystem operations. Therefore the mount > > over the mark mount only needs to count as one more than the > > lower filesystems stack depth. > > That's true, but it also highlights the point that the "mark" sb is > completely unneeded and it really is just a nice little hack. > All the information it really stores is a lower mount reference, > a lower root dentry and a declarative statement "I am shiftable!". Seems I should have saved some of the things I said in my previous response for this one. As you no doubt gleaned from that email, I agree with this. > Come to think about it, "container shiftable" really isn't that different from > NFS export with "no_root_squash" and auto mounted USB drive. > I mean the shifting itself is different of course, but the > declaration, not so much. > If I am allowing sudoers on another machine to mess with root owned > files visible > on my machine, I am pretty much have the same issues as container admins > accessing root owned files on my init_user_ns filesystem. In all those cases, > I'd better not be executing suid binaries from the untrusted "external" > source. > > Instead of mounting a dummy filesystem to make the declaration, you could > get the same thing with: >mount(path, path, "none", MS_BIND | MS_EXTERN | MS_NOEXEC) > and all you need to do is add MS_EXTERN (MS_SHIFTABLE MS_UNTRUSTED > or whatnot) constant to uapi and manage to come up good man page description. > > Then users could actually mount a filesystem in init_user_ns MS_EXTERN and > avoid the extra bind mount step (for a full filesystem tree export). > Declaring a mounted image MS_EXTERN has merits on its own even without > containers and shitfs, for example with pluggable storage. Other LSMs could > make > good use of that declaration. I'm missing how we figure out the target user ns in this scheme. We need a context with privileges towards the source path's s_user_ns to say it's okay to shift ids for the files under the source path, and then we need a target user ns for the id shifts. Currently the target is current_user_ns when the final shiftfs mount is created. So, how do we determine the target s_user_ns in your scheme? > > > > > Second, when the lower mount is shiftfs we can just skip down to > > that mount's lower filesystem and shift ids relative to that. > > There is no reason to shift ids twice, and the lower path has > > already been marked safe for id shifting by a user privileged > > towards all ids in that mount's user ns. > > > > Signed-off-by: Seth Forshee > > --- > > fs/shiftfs.c | 68 +++- > > 1 file changed, 46 insertions(+), 22 deletions(-) > > > > diff --git a/fs/shiftfs.c b/fs/shiftfs.c > > index b19af7b2fe75..008ace2842b9 100644 > > --- a/fs/shiftfs.c > > +++ b/fs/shiftfs.c > > @@ -930,7 +930,7 @@ static int shiftfs_fill_super(struct super_block *sb, > > void *raw_data, > > struct shiftfs_data *data = raw_data; > > char *name = kstrdup(data->path, GFP_KERNEL); > > int err = -ENOMEM; > > - struct shiftfs_super_info *ssi = NULL; > > + struct shiftfs_super_info *ssi = NULL, *mp_ssi; > > struct path path; > > struct dentry *dentry; > > > > @@ -946,11 +946,7 @@ static int shiftfs_fill_super(struct super_block *sb, > > void *raw_data, > > if (err) > > goto out; > > > > - /* to mark a mount point, must be real root */ > > - if (ssi->mark && !capable(CAP_SYS_ADMIN)) > > - goto out; > > - > > - /* else to mount a mark, must be userns admin */ > > + /* to mount a mark, must be userns admin */ > > if (!ssi->mark &
Re: [RFC PATCH 0/6] shiftfs fixes and enhancements
On Fri, Nov 02, 2018 at 10:59:38AM +0200, Amir Goldstein wrote: > [cc: linux-unionfs > It should the mailing list for *all* "stacking fs". > We have a lot of common problems I think ;-) ] > > On Thu, Nov 1, 2018 at 11:49 PM Seth Forshee > wrote: > > > > I've done some work to fix and enhance shiftfs for a number of use > > cases, so that we would have an idea what a more full-featured shiftfs > > would look like. I'm intending for these to serve as a point of > > reference for discussing id shifting mounts/filesystems at plumbers in a > > couple of weeks [1]. > > > > Note that these are based on 4.18, and I've added a small fix to James' > > most recent patch to fix a build issue there. To work with 4.19 they > > will need a number of updates due to changes in the vfs. > > > > Seth, > > I like the way you addressed my concerns about nesting and stacking depth. > Will provide specific nits on patch. > > In preparation to the Plumbers talk (which I will not be attending), I wanted > to > get your opinion on the matters I brought up last time: > https://marc.info/?l=linux-fsdevel&m=153013920904844&w=2 I want the session at plumbers to not be a "talk" but more of a discussion of the sorts of things you raise below. But I'm also happy to talk about them here. > 1) Having seen what it takes to catch up with overlayfs w.r.t inotify bugs > and having peeked into 4.19 to see what work you still have lined up for you > to bring shitfs up to speed with vfs, did you have time to look into my > proposal > for sharing code with overlayfs in the manner that I have implemented the > snapshotfs POC? > https://github.com/amir73il/linux/commit/25416757f2ca47759f59b115e6461b11898c4f06 > > Even if you end up not saving a single line of code for shiftfs v1 > meaning that all shiftfs_inode_ops are completely separate implementation > from overlayfs inode ops, this may still be beneficial to shitfs in > the long run. > For example, you may, in fact, won't need to change anything to work with > v4.19. > shittfs (as an overlayfs alias) would use ovl_file_operations and > shiftfs_inode_ops. I don't recall seeing the shapshotfs patches before. If id shifting remains an overlay-style fs and not a feature of the vfs, then I absolutely think something like this will make life much easier. > Another example, from the top of my head, see what it took to add NFS export > support to snapshotfs, because of the code reuse with overlayfs: > https://github.com/amir73il/linux/commit/d082eb615133490ec26fa2efaa80ed4723860893 > Its practically the exact same implementation shiftfs would need, > so in the far future, shitfs and snapshotfs can share the same > export_operations. > > 2) Regarding this part: > + /* > +* this part is visible unshifted, so make sure no > +* executables that could be used to give suid > +* privileges > +*/ > + sb->s_iflags = SB_I_NOEXEC; > > Why would you want to make the unshifted fs visible at all? > Is there a requirement for container users to access the unshifted fs > content? Is there a requirement for container admin to mount shitfted fs > NOT from the root of the marked mount? > > If those are not required, then I propose NOOP inode operations for > the unshifted fs, specifically, empty readdir, just enough ops to be able > to use the mark mount point as the shitfed mount source - no more. This is part of the original implementation that I didn't touch with these updates. Imo the mark mount is kind of kludgy, and I'd like to see it done a different way. A couple of alternatives have been suggested. One was to use xattrs for marking, or I did a PoC with an older version of the new mount API patches where an fsfd was passed to the less privileged context that it could attach to its mount tree: https://lkml.kernel.org/r/20180717133847.GB15620@ubuntu-xps13 Either of these can accomplish the same things as the mark mount with better control over who can create an id-shifted mount of the subtree. However if the mark mount is kept then no-op inode operations seems reasonable to me. Thanks, Seth
[RFC PATCH 2/6] shiftfs: map inodes to lower fs inodes instead of dentries
Since shiftfs inodes map to dentries in the lower fs, two links to the same lowerfs inode create separate inodes in shiftfs. This causes problems for inotify, as a watch on one of these files in shiftfs will not see changes made to the underlying inode via the other file. Fix this by updating shiftfs to map its inodes to corresponding inodes in the lower fs. Inodes are cached using the pointer to the lower fs inode as the hash value. This fixes a second inotify problem whereby a watch is set on an inode, the dentry is evicted from the cache, and events on a new dentry are not reported back to the watch original inode. Signed-off-by: Seth Forshee --- fs/shiftfs.c | 105 ++- 1 file changed, 79 insertions(+), 26 deletions(-) diff --git a/fs/shiftfs.c b/fs/shiftfs.c index 6028244c2f42..b179a1be7bc1 100644 --- a/fs/shiftfs.c +++ b/fs/shiftfs.c @@ -22,6 +22,7 @@ struct shiftfs_super_info { static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode, struct dentry *dentry); +static void shiftfs_init_inode(struct inode *inode, umode_t mode); enum { OPT_MARK, @@ -278,15 +279,27 @@ static void shiftfs_fill_inode(struct inode *inode, struct dentry *dentry) inode->i_opflags |= IOP_NOFOLLOW; inode->i_mapping = reali->i_mapping; - inode->i_private = dentry; + inode->i_private = reali; + set_nlink(inode, reali->i_nlink); +} + +static int shiftfs_inode_test(struct inode *inode, void *data) +{ + return inode->i_private == data; +} + +static int shiftfs_inode_set(struct inode *inode, void *data) +{ + inode->i_private = data; + return 0; } static int shiftfs_make_object(struct inode *dir, struct dentry *dentry, umode_t mode, const char *symlink, struct dentry *hardlink, bool excl) { - struct dentry *real = dir->i_private, *new = dentry->d_fsdata; - struct inode *reali = real->d_inode, *newi; + struct dentry *new = dentry->d_fsdata; + struct inode *reali = dir->i_private, *inode, *newi; const struct inode_operations *iop = reali->i_op; int err; const struct cred *oldcred, *newcred; @@ -310,9 +323,14 @@ static int shiftfs_make_object(struct inode *dir, struct dentry *dentry, return -EINVAL; - newi = shiftfs_new_inode(dentry->d_sb, mode, NULL); - if (!newi) - return -ENOMEM; + if (hardlink) { + inode = d_inode(hardlink); + ihold(inode); + } else { + inode = shiftfs_new_inode(dentry->d_sb, mode, NULL); + if (!inode) + return -ENOMEM; + } oldcred = shiftfs_new_creds(&newcred, dentry->d_sb); @@ -341,16 +359,33 @@ static int shiftfs_make_object(struct inode *dir, struct dentry *dentry, if (err) goto out_dput; - shiftfs_fill_inode(newi, new); + if (hardlink) { + WARN_ON(inode->i_private != new->d_inode); + inc_nlink(inode); + } else { + shiftfs_fill_inode(inode, new); + + newi = inode_insert5(inode, (unsigned long)new->d_inode, +shiftfs_inode_test, shiftfs_inode_set, +new->d_inode); + if (newi != inode) { + pr_warn_ratelimited("shiftfs: newly created inode found in cache\n"); + iput(inode); + inode = newi; + } + } + + if (inode->i_state & I_NEW) + unlock_new_inode(inode); - d_instantiate(dentry, newi); + d_instantiate(dentry, inode); new = NULL; - newi = NULL; + inode = NULL; out_dput: dput(new); - iput(newi); + iput(inode); inode_unlock(reali); return err; @@ -386,8 +421,8 @@ static int shiftfs_symlink(struct inode *dir, struct dentry *dentry, static int shiftfs_rm(struct inode *dir, struct dentry *dentry, bool rmdir) { - struct dentry *real = dir->i_private, *new = dentry->d_fsdata; - struct inode *reali = real->d_inode; + struct dentry *new = dentry->d_fsdata; + struct inode *reali = dir->i_private; int err; const struct cred *oldcred, *newcred; @@ -400,6 +435,13 @@ static int shiftfs_rm(struct inode *dir, struct dentry *dentry, bool rmdir) else err = vfs_unlink(reali, new, NULL); + if (!err) { + if (rmdir) + clear_nlink(d_inode(dentry)); + else + drop_nlink(d_inode(dentry)); + } + shiftfs_old_creds(oldcred, &newcred); inode_u
[RFC PATCH 3/6] shiftfs: copy inode attrs up from underlying fs
Not all inode permission checks go through the permission callback, e.g. some checks related to file capabilities. Always copy up the inode attrs to ensure these checks work as expected. Also introduce helpers helpers for shifting kernel ids from one user ns to another, as this is an operation that is going to be repeated. Signed-off-by: Seth Forshee --- fs/shiftfs.c | 30 +- 1 file changed, 29 insertions(+), 1 deletion(-) diff --git a/fs/shiftfs.c b/fs/shiftfs.c index b179a1be7bc1..556594988dd2 100644 --- a/fs/shiftfs.c +++ b/fs/shiftfs.c @@ -266,6 +266,33 @@ static int shiftfs_xattr_set(const struct xattr_handler *handler, return shiftfs_setxattr(dentry, inode, name, value, size, flags); } +static kuid_t shift_kuid(struct user_namespace *from, struct user_namespace *to, +kuid_t kuid) +{ + uid_t uid = from_kuid(from, kuid); + return make_kuid(to, uid); +} + +static kgid_t shift_kgid(struct user_namespace *from, struct user_namespace *to, +kgid_t kgid) +{ + gid_t gid = from_kgid(from, kgid); + return make_kgid(to, gid); +} + +static void shiftfs_copyattr(struct inode *from, struct inode *to) +{ + struct user_namespace *from_ns = from->i_sb->s_user_ns; + struct user_namespace *to_ns = to->i_sb->s_user_ns; + + to->i_uid = shift_kuid(from_ns, to_ns, from->i_uid); + to->i_gid = shift_kgid(from_ns, to_ns, from->i_gid); + to->i_mode = from->i_mode; + to->i_atime = from->i_atime; + to->i_mtime = from->i_mtime; + to->i_ctime = from->i_ctime; +} + static void shiftfs_fill_inode(struct inode *inode, struct dentry *dentry) { struct inode *reali; @@ -278,6 +305,7 @@ static void shiftfs_fill_inode(struct inode *inode, struct dentry *dentry) if (!reali->i_op->get_link) inode->i_opflags |= IOP_NOFOLLOW; + shiftfs_copyattr(reali, inode); inode->i_mapping = reali->i_mapping; inode->i_private = reali; set_nlink(inode, reali->i_nlink); @@ -573,7 +601,7 @@ static int shiftfs_setattr(struct dentry *dentry, struct iattr *attr) return err; /* all OK, reflect the change on our inode */ - setattr_copy(d_inode(dentry), attr); + shiftfs_copyattr(reali, d_inode(dentry)); return 0; } -- 2.19.1
[RFC PATCH 4/6] shiftfs: translate uids using s_user_ns from lower fs
Do not assume that ids from the lower filesystem are from init_user_ns. Instead, translate them from that filesystem's s_user_ns and then to the shiftfs user ns. Signed-off-by: Seth Forshee --- fs/shiftfs.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/shiftfs.c b/fs/shiftfs.c index 556594988dd2..226c03d8588b 100644 --- a/fs/shiftfs.c +++ b/fs/shiftfs.c @@ -613,6 +613,8 @@ static int shiftfs_getattr(const struct path *path, struct kstat *stat, struct inode *reali = real->d_inode; const struct inode_operations *iop = reali->i_op; struct path newpath = { .mnt = path->dentry->d_sb->s_fs_info, .dentry = real }; + struct user_namespace *from_ns = reali->i_sb->s_user_ns; + struct user_namespace *to_ns = inode->i_sb->s_user_ns; int err = 0; if (iop->getattr) @@ -624,8 +626,8 @@ static int shiftfs_getattr(const struct path *path, struct kstat *stat, return err; /* transform the underlying id */ - stat->uid = make_kuid(inode->i_sb->s_user_ns, __kuid_val(stat->uid)); - stat->gid = make_kgid(inode->i_sb->s_user_ns, __kgid_val(stat->gid)); + stat->uid = shift_kuid(from_ns, to_ns, stat->uid); + stat->gid = shift_kgid(from_ns, to_ns, stat->gid); return 0; } -- 2.19.1
[RFC PATCH 5/6] shiftfs: add support for posix acls
Signed-off-by: Seth Forshee --- fs/Kconfig | 10 +++ fs/shiftfs.c | 185 +++ 2 files changed, 195 insertions(+) diff --git a/fs/Kconfig b/fs/Kconfig index 392c5a41a9f9..691f3c4fc7eb 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -121,6 +121,16 @@ config SHIFT_FS unprivileged containers can use this to mount root volumes using this technique. +config SHIFT_FS_POSIX_ACL + bool "shiftfs POSIX Access Control Lists" + depends on SHIFT_FS + select FS_POSIX_ACL + help + POSIX Access Control Lists (ACLs) support permissions for users and + groups beyond the owner/group/world scheme. + + If you don't know what Access Control Lists are, say N. + menu "Caches" source "fs/fscache/Kconfig" diff --git a/fs/shiftfs.c b/fs/shiftfs.c index 226c03d8588b..b19af7b2fe75 100644 --- a/fs/shiftfs.c +++ b/fs/shiftfs.c @@ -13,6 +13,8 @@ #include #include #include +#include +#include struct shiftfs_super_info { struct vfsmount *mnt; @@ -631,6 +633,183 @@ static int shiftfs_getattr(const struct path *path, struct kstat *stat, return 0; } +#ifdef CONFIG_SHIFT_FS_POSIX_ACL + +static int +shift_acl_ids(struct user_namespace *from, struct user_namespace *to, + struct posix_acl *acl) +{ + int i; + + for (i = 0; i < acl->a_count; i++) { + struct posix_acl_entry *e = &acl->a_entries[i]; + switch(e->e_tag) { + case ACL_USER: + e->e_uid = shift_kuid(from, to, e->e_uid); + if (!uid_valid(e->e_uid)) + return -EOVERFLOW; + break; + case ACL_GROUP: + e->e_gid = shift_kgid(from, to, e->e_gid); + if (!gid_valid(e->e_gid)) + return -EOVERFLOW; + break; + } + } + return 0; +} + +static void +shift_acl_xattr_ids(struct user_namespace *from, struct user_namespace *to, + void *value, size_t size) +{ + struct posix_acl_xattr_header *header = value; + struct posix_acl_xattr_entry *entry = (void *)(header + 1), *end; + int count; + kuid_t kuid; + kgid_t kgid; + + if (!value) + return; + if (size < sizeof(struct posix_acl_xattr_header)) + return; + if (header->a_version != cpu_to_le32(POSIX_ACL_XATTR_VERSION)) + return; + + count = posix_acl_xattr_count(size); + if (count < 0) + return; + if (count == 0) + return; + + for (end = entry + count; entry != end; entry++) { + switch(le16_to_cpu(entry->e_tag)) { + case ACL_USER: + kuid = make_kuid(&init_user_ns, le32_to_cpu(entry->e_id)); + kuid = shift_kuid(from, to, kuid); + entry->e_id = cpu_to_le32(from_kuid(&init_user_ns, kuid)); + break; + case ACL_GROUP: + kgid = make_kgid(&init_user_ns, le32_to_cpu(entry->e_id)); + kgid = shift_kgid(from, to, kgid); + entry->e_id = cpu_to_le32(from_kgid(&init_user_ns, kgid)); + break; + default: + break; + } + } +} + +static struct posix_acl *shiftfs_get_acl(struct inode *inode, int type) +{ + struct inode *reali = inode->i_private; + const struct cred *oldcred, *newcred; + struct posix_acl *real_acl, *acl = NULL; + struct user_namespace *from_ns = reali->i_sb->s_user_ns; + struct user_namespace *to_ns = inode->i_sb->s_user_ns; + int size; + int err; + + if (!IS_POSIXACL(reali)) + return NULL; + + oldcred = shiftfs_new_creds(&newcred, inode->i_sb); + real_acl = get_acl(reali, type); + shiftfs_old_creds(oldcred, &newcred); + + if (real_acl && !IS_ERR(acl)) { + /* XXX: export posix_acl_clone? */ + size = sizeof(struct posix_acl) + + real_acl->a_count * sizeof(struct posix_acl_entry); + acl = kmemdup(acl, size, GFP_KERNEL); + posix_acl_release(real_acl); + + if (!acl) + return ERR_PTR(-ENOMEM); + + refcount_set(&acl->a_refcount, 1); + + err = shift_acl_ids(from_ns, to_ns, acl); + if (err) { + kfree(acl); + return ERR_PTR(err); + } + } + + return acl; +} + +static int +shiftfs_posix_acl_xattr_get(const struct xattr_handler *handler, +
[RFC PATCH 6/6] shiftfs: support nested shiftfs mounts
shiftfs mounts cannot be nested for two reasons -- global CAP_SYS_ADMIN is required to set up a mark mount, and a single functional shiftfs mount meets the filesystem stacking depth limit. The CAP_SYS_ADMIN requirement can be relaxed. All of the kernel ids in a mount must be within that mount's s_user_ns, so all that is needed is CAP_SYS_ADMIN within that s_user_ns. The stack depth issue can be worked around with a couple of adjustments. First, a mark mount doesn't really need to count against the stacking depth as it doesn't contribute to the call stack depth during filesystem operations. Therefore the mount over the mark mount only needs to count as one more than the lower filesystems stack depth. Second, when the lower mount is shiftfs we can just skip down to that mount's lower filesystem and shift ids relative to that. There is no reason to shift ids twice, and the lower path has already been marked safe for id shifting by a user privileged towards all ids in that mount's user ns. Signed-off-by: Seth Forshee --- fs/shiftfs.c | 68 +++- 1 file changed, 46 insertions(+), 22 deletions(-) diff --git a/fs/shiftfs.c b/fs/shiftfs.c index b19af7b2fe75..008ace2842b9 100644 --- a/fs/shiftfs.c +++ b/fs/shiftfs.c @@ -930,7 +930,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data, struct shiftfs_data *data = raw_data; char *name = kstrdup(data->path, GFP_KERNEL); int err = -ENOMEM; - struct shiftfs_super_info *ssi = NULL; + struct shiftfs_super_info *ssi = NULL, *mp_ssi; struct path path; struct dentry *dentry; @@ -946,11 +946,7 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data, if (err) goto out; - /* to mark a mount point, must be real root */ - if (ssi->mark && !capable(CAP_SYS_ADMIN)) - goto out; - - /* else to mount a mark, must be userns admin */ + /* to mount a mark, must be userns admin */ if (!ssi->mark && !ns_capable(current_user_ns(), CAP_SYS_ADMIN)) goto out; @@ -962,41 +958,66 @@ static int shiftfs_fill_super(struct super_block *sb, void *raw_data, if (!S_ISDIR(path.dentry->d_inode->i_mode)) { err = -ENOTDIR; - goto out_put; - } - - sb->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1; - if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) { - printk(KERN_ERR "shiftfs: maximum stacking depth exceeded\n"); - err = -EINVAL; - goto out_put; + goto out_put_path; } if (ssi->mark) { + struct super_block *lower_sb = path.mnt->mnt_sb; + + /* to mark a mount point, must root wrt lower s_user_ns */ + if (!ns_capable(lower_sb->s_user_ns, CAP_SYS_ADMIN)) + goto out_put_path; + + /* * this part is visible unshifted, so make sure no * executables that could be used to give suid * privileges */ sb->s_iflags = SB_I_NOEXEC; - ssi->mnt = path.mnt; - dentry = path.dentry; - } else { - struct shiftfs_super_info *mp_ssi; + /* +* Handle nesting of shiftfs mounts by referring this mark +* mount back to the original mark mount. This is more +* efficient and alleviates concerns about stack depth. +*/ + if (lower_sb->s_magic == SHIFTFS_MAGIC) { + mp_ssi = lower_sb->s_fs_info; + + /* Doesn't make sense to mark a mark mount */ + if (mp_ssi->mark) { + err = -EINVAL; + goto out_put_path; + } + + ssi->mnt = mntget(mp_ssi->mnt); + dentry = dget(path.dentry->d_fsdata); + } else { + ssi->mnt = mntget(path.mnt); + dentry = dget(path.dentry); + } + } else { /* * this leg executes if we're admin capable in * the namespace, so be very careful */ if (path.dentry->d_sb->s_magic != SHIFTFS_MAGIC) - goto out_put; + goto out_put_path; mp_ssi = path.dentry->d_sb->s_fs_info; if (!mp_ssi->mark) - goto out_put; + goto out_put_path; ssi->mnt = mntget(mp_ssi->mnt); dentry = dget(path.dentry->d_fsdata); -
[RFC PATCH 1/6] shiftfs: uid/gid shifting bind mount
From: James Bottomley This allows any subtree to be uid/gid shifted and bound elsewhere. It does this by operating simlarly to overlayfs. Its primary use is for shifting the underlying uids of filesystems used to support unpriviliged (uid shifted) containers. The usual use case here is that the container is operating with an uid shifted unprivileged root but sometimes needs to make use of or work with a filesystem image that has root at real uid 0. The mechanism is to allow any subordinate mount namespace to mount a shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only allowing it to mount marked subtrees (using the -o mark option as root). Once mounted, the subtree is mapped via the super block user namespace so that the interior ids of the mounting user namespace are the ids written to the filesystem. Signed-off-by: James Bottomley [ saf: use designated initializers for path declarations to fix errors with struct randomization ] Signed-off-by: Seth Forshee --- v3 - update to 4.14 (d_real changes) v1 - based on original shiftfs with uid mappings now done via s_user_ns v2 - fix revalidation of dentries add inode aliasing --- fs/Kconfig | 8 + fs/Makefile| 1 + fs/shiftfs.c | 783 + include/uapi/linux/magic.h | 2 + 4 files changed, 794 insertions(+) create mode 100644 fs/shiftfs.c diff --git a/fs/Kconfig b/fs/Kconfig index ac474a61be37..392c5a41a9f9 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -113,6 +113,14 @@ source "fs/autofs/Kconfig" source "fs/fuse/Kconfig" source "fs/overlayfs/Kconfig" +config SHIFT_FS + tristate "UID/GID shifting overlay filesystem for containers" + help + This filesystem can overlay any mounted filesystem and shift + the uid/gid the files appear at. The idea is that + unprivileged containers can use this to mount root volumes + using this technique. + menu "Caches" source "fs/fscache/Kconfig" diff --git a/fs/Makefile b/fs/Makefile index 293733f61594..d0222f3816bd 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -128,3 +128,4 @@ obj-y += exofs/ # Multiple modules obj-$(CONFIG_CEPH_FS) += ceph/ obj-$(CONFIG_PSTORE) += pstore/ obj-$(CONFIG_EFIVAR_FS)+= efivarfs/ +obj-$(CONFIG_SHIFT_FS) += shiftfs.o diff --git a/fs/shiftfs.c b/fs/shiftfs.c new file mode 100644 index ..6028244c2f42 --- /dev/null +++ b/fs/shiftfs.c @@ -0,0 +1,783 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct shiftfs_super_info { + struct vfsmount *mnt; + struct user_namespace *userns; + bool mark; +}; + +static struct inode *shiftfs_new_inode(struct super_block *sb, umode_t mode, + struct dentry *dentry); + +enum { + OPT_MARK, + OPT_LAST, +}; + +/* global filesystem options */ +static const match_table_t tokens = { + { OPT_MARK, "mark" }, + { OPT_LAST, NULL } +}; + +static const struct cred *shiftfs_get_up_creds(struct super_block *sb) +{ + struct shiftfs_super_info *ssi = sb->s_fs_info; + struct cred *cred = prepare_creds(); + + if (!cred) + return NULL; + + cred->fsuid = KUIDT_INIT(from_kuid(sb->s_user_ns, cred->fsuid)); + cred->fsgid = KGIDT_INIT(from_kgid(sb->s_user_ns, cred->fsgid)); + put_user_ns(cred->user_ns); + cred->user_ns = get_user_ns(ssi->userns); + + return cred; +} + +static const struct cred *shiftfs_new_creds(const struct cred **newcred, + struct super_block *sb) +{ + const struct cred *cred = shiftfs_get_up_creds(sb); + + *newcred = cred; + + if (cred) + cred = override_creds(cred); + else + printk(KERN_ERR "shiftfs: Credential override failed: no memory\n"); + + return cred; +} + +static void shiftfs_old_creds(const struct cred *oldcred, + const struct cred **newcred) +{ + if (!*newcred) + return; + + revert_creds(oldcred); + put_cred(*newcred); +} + +static int shiftfs_parse_options(struct shiftfs_super_info *ssi, char *options) +{ + char *p; + substring_t args[MAX_OPT_ARGS]; + + ssi->mark = false; + + while ((p = strsep(&options, ",")) != NULL) { + int token; + + if (!*p) + continue; + + token = match_token(p, tokens, args); + switch (token) { + case OPT_MARK: + ssi->mark = true; + break; + default: +
[RFC PATCH 0/6] shiftfs fixes and enhancements
I've done some work to fix and enhance shiftfs for a number of use cases, so that we would have an idea what a more full-featured shiftfs would look like. I'm intending for these to serve as a point of reference for discussing id shifting mounts/filesystems at plumbers in a couple of weeks [1]. Note that these are based on 4.18, and I've added a small fix to James' most recent patch to fix a build issue there. To work with 4.19 they will need a number of updates due to changes in the vfs. The features I focused on fixing in or adding to shiftfs in these patches are inotify, file capabilities, posix acls, and nesting. These are all now working for at least simple use cases, but further testing and cleanups are needed before I'd consider these finished. I also kept all the changes within shiftfs, but some of the code might belong in the vfs instead (in particular some of the posix acl code). I've also pushed these patches to: git://git.kernel.org/pub/scm/linux/kernel/git/sforshee/linux.git shiftfs Thanks, Seth [1] https://linuxplumbersconf.org/event/2/contributions/212/ --- James Bottomley (1): shiftfs: uid/gid shifting bind mount Seth Forshee (5): shiftfs: map inodes to lower fs inodes instead of dentries shiftfs: copy inode attrs up from underlying fs shiftfs: translate uids using s_user_ns from lower fs shiftfs: add support for posix acls shiftfs: support nested shiftfs mounts fs/Kconfig | 18 + fs/Makefile|1 + fs/shiftfs.c | 1075 include/uapi/linux/magic.h |2 + 4 files changed, 1096 insertions(+) create mode 100644 fs/shiftfs.c
Re: [PATCH v3 0/1] shiftfs: uid/gid shifting filesystem
+Cc David On Fri, Jun 15, 2018 at 02:35:14PM -0700, James Bottomley wrote: > This is a repost of the v2 patch updated for the d_real changes > > For those who want to test it out, there's a git tree here > > git://git.kernel.org/pub/scm/linux/kernel/git/jejb/binfmt_misc.git > > on the shiftfs-v3 branch > > v2: > > This is a rewrite of the original shiftfs code to make use of super > block user namespaces. I've also removed the mappings passed in as > mount options in favour of using the mappings in s_user_ns. The upshot > is that it probably needs retesting for all the bugs people found, > since there's a lot of new code, and the use case has changed. Now, to > use it, you have to mark the filesystems you want to be mountable > inside a user namespace as root: > > mount -t shiftfs -o mark > > The origin should be inaccessible to the unprivileged user, and the > access to the can be controlled by the usual filesystem > permissions. Once this is done, any user who can get access to the > can do (as the local user namespace root): > > mount -t shiftfs David, I wanted to pull you in here based on something you said on the most recent filesystem context thread (thought it would make more sense here rather than piggypacking on that already massive thread). > I want to be able to add support for a bunch of things: > > (1) UID, GID and Project ID mapping/translation. I want to be able to > install a translation table of some sort on the superblock to translate > source identifiers (which may be foreign numeric UIDs/GIDs, text names, > GUIDs) into system identifiers. This needs to be done before the > superblock is published[*]. > > Note that this may, for example, involve using the context and the > superblock held therein to issue an RPC to a server to look up > translations. > > [*] By "published" I mean made available through mount so that other >userspace processes can access it by path. > > Maybe specifying a translation range element with something like: > > write(fd, "t uid "); > > The translation information also needs to propagate over an automount in > some circumstances. > > (2) Namespace configuration. I want to be able to tell the superblock > creation process what namespaces should be applied when it created (in > particular the userns and netns) for containerisation purposes, e.g.: > > write(fd, "n user= net="); There's some obvious overlap between shiftfs and (1), but also important differences. Primarily that shiftfs tries to make something that looks like a bind mount rather than applying the mappings to new superblocks for arbitrary filesystems. I've already been playing with shiftfs on top of the filesystem context patches, because I thought it would allow getting rid of the intermediate "mark" mount described above. I have a hacky proof of concept implementation that I've pushed to the shiftfs-fscontext branch of git://git.kernel.org/pub/scm/linux/kernel/git/sforshee/linux.git Basically the idea is that the more privileged "host" context can create the fs fd and set the source on it to "bless" a subtree for id shifted mounting, and the less privileged "client" context can use the fd to do the mount (test program below). But I had to mess with fc->user_ns to ensure s_user_ns gets set correctly, and it would likely be nicer to do something more like (2) above. The idea is that we need the more privileged "host" context to bless the subtree being id shifted before actually executing the mount in the less privileged "client" context. I'm doing this by having the host set the source, then have the client use the fd to create the superblock (my test program is below). This leads to some undesirable changing of the fs contexts user ns in shiftfs (so that s_user_ns is the client's namespace), which could likely be eliminated by doing something like what's described in (2) and having the super block created on the host side. However, maybe these things are similar enough to settle on a common solution, such as supporting id mapping at the vfsmount level. Seth --- #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #include #define __NR_move_mount 336 #define __NR_fsopen 337 #define __NR_fsmount338 #define FSOPEN_CLOEXEC 0x0001 #define FSMOUNT_CLOEXEC 0x0001 #define MOVE_MOUNT_F_EMPTY_PATH 0x0004 static int move_mount(int from_dfd, const char *from_pathname, int to_dfd, const char *to_pathname, unsigned int flags) { return syscall(__NR_move_mount, from_dfd, from_pathname, to_dfd, to_pathname, flags); } static int fsopen(const char *fs_name, unsigned int flags) { return syscall(__NR_fsopen, fs_name, flags); } static int fsmount(int fsfd, unsigned int flags, unsigned in
Re: [REVIEW][PATCH 2/6] vfs: Allow userns root to call mknod on owned filesystems.
On Thu, May 24, 2018 at 11:55:45AM -0500, Eric W. Biederman wrote: > Seth Forshee writes: > > > On Wed, May 23, 2018 at 06:25:34PM -0500, Eric W. Biederman wrote: > >> These filesystems already always set SB_I_NODEV so mknod will not be > >> useful for gaining control of any devices no matter their permissions. > >> This will allow overlayfs and applications to fakeroot to use device > >> nodes to represent things on disk. > >> > >> Signed-off-by: "Eric W. Biederman" > > > > For a normal filesystem this does seem safe enough. > > > > However, I'd also like to see us allow unprivileged mounting for > > overlayfs, and there we need to worry about whether this would allow a > > mknod in an underlying filesystem which should not be allowed. That > > mknod will be subject to this same check in the underlying filesystem > > using the credentials of the user that mounted the overaly fs, which > > should be sufficient to ensure that the mknod is permitted. > > Sufficient to ensure the mknod is not permitted on the underlying > filesystem. I believe you mean. Right, or in other words with the relaxed capability check a user still could not use an overlayfs mount in a user namespace to mknod in a filesystem when that user couldn't otherwise mknod in that filesystem. Sorry if I wasn't clear. > > > Thus this looks okay to me. > > > > Acked-by: Seth Forshee > > Eric >
Re: [REVIEW][PATCH 2/6] vfs: Allow userns root to call mknod on owned filesystems.
On Wed, May 23, 2018 at 06:25:34PM -0500, Eric W. Biederman wrote: > These filesystems already always set SB_I_NODEV so mknod will not be > useful for gaining control of any devices no matter their permissions. > This will allow overlayfs and applications to fakeroot to use device > nodes to represent things on disk. > > Signed-off-by: "Eric W. Biederman" For a normal filesystem this does seem safe enough. However, I'd also like to see us allow unprivileged mounting for overlayfs, and there we need to worry about whether this would allow a mknod in an underlying filesystem which should not be allowed. That mknod will be subject to this same check in the underlying filesystem using the credentials of the user that mounted the overaly fs, which should be sufficient to ensure that the mknod is permitted. Thus this looks okay to me. Acked-by: Seth Forshee
Re: [REVIEW][PATCH 1/6] vfs: Don't allow changing the link count of an inode with an invalid uid or gid
On Wed, May 23, 2018 at 06:25:33PM -0500, Eric W. Biederman wrote: > Changing the link count of an inode via unlink or link will cause a > write back of that inode. If the uids or gids are invalid (aka not known > to the kernel) writing the inode back may change the uid or gid in the > filesystem. To prevent possible filesystem and to avoid the need for > filesystem maintainers to worry about it don't allow operations on > inodes with an invalid uid or gid. > > Signed-off-by: "Eric W. Biederman" Acked-by: Seth Forshee
Re: [PATCH] fuse: Ensure posix acls are translated outside of init_user_ns
On Fri, May 04, 2018 at 11:47:28AM -0500, Eric W. Biederman wrote: > > Ensure the translation happens by failing to read or write > posix acls when the filesystem has not indicated it supports > posix acls. > > This ensures that modern cached posix acl support is available > and used when dealing with posix acls. This is important > because only that path has the code to convernt the uids and > gids in posix acls into the user namespace of a fuse filesystem. > > Signed-off-by: "Eric W. Biederman" > --- > > Miklos after several attempts to handle this better last cycle. I > figure we should go with the stupid version for now. I think I know > how to do better but I don't want that to gate forward progress on > fully unprivileged fuse mounts. Especially as this is the last known > issue to deal with. This seems reasonable as a short-term measure. Acked-by: Seth Forshee
Re: [RFC PATCH v3 2/2] ima: force re-appraisal on filesystems with FS_IMA_NO_CACHE
On Mon, Jan 22, 2018 at 05:24:52PM +0100, Alban Crequy wrote: > From: Alban Crequy > > This patch forces files to be re-measured, re-appraised and re-audited > on file systems with the feature flag FS_IMA_NO_CACHE. In that way, > cached integrity results won't be used. > > How to test this: > > The test I did was using a patched version of the memfs FUSE driver > [1][2] and two very simple "hello-world" programs [4] (prog1 prints > "hello world: 1" and prog2 prints "hello world: 2"). > > I copy prog1 and prog2 in the fuse-memfs mount point, execute them and > check the sha1 hash in > "/sys/kernel/security/ima/ascii_runtime_measurements". > > My patch on the memfs FUSE driver added a backdoor command to serve > prog1 when the kernel asks for prog2 or vice-versa. In this way, I can > exec prog1 and get it to print "hello world: 2" without ever replacing > the file via the VFS, so the kernel is not aware of the change. > > The test was done using the branch "alban/fuse-flag-ima-nocache-v3" [3]. > > Step by step test procedure: > > 1. Mount the memfs FUSE using [2]: > rm -f /tmp/memfs-switch* ; memfs -L DEBUG /mnt/memfs > > 2. Copy prog1 and prog2 using [4] > cp prog1 /mnt/memfs/prog1 > cp prog2 /mnt/memfs/prog2 > > 3. Lookup the files and let the FUSE driver to keep the handles open: > dd if=/mnt/memfs/prog1 bs=1 | (read -n 1 x ; sleep 3600 ) & > dd if=/mnt/memfs/prog2 bs=1 | (read -n 1 x ; sleep 3600 ) & > > 4. Check the 2 programs work correctly: > $ /mnt/memfs/prog1 > hello world: 1 > $ /mnt/memfs/prog2 > hello world: 2 > > 5. Check the measurements for prog1 and prog2: > $ sudo cat /sys/kernel/security/ima/ascii_runtime_measurements \ > | grep /mnt/memfs/prog > 10 [...] ima-ng sha1:ac14c9268cd2[...] /mnt/memfs/prog1 > 10 [...] ima-ng sha1:799cb5d1e06d[...] /mnt/memfs/prog2 > > 6. Use the backdoor command in my patched memfs to redirect file > operations on file handle 3 to file handle 2: > rm -f /tmp/memfs-switch* ; touch /tmp/memfs-switch-3-2 > > 7. Check how the FUSE driver serves different content for the files: > $ /mnt/memfs/prog1 > hello world: 2 > $ /mnt/memfs/prog2 > hello world: 2 > > 8. Check the measurements: > sudo cat /sys/kernel/security/ima/ascii_runtime_measurements \ > | grep /mnt/memfs/prog > > Without the patch, there are no new measurements, despite the FUSE > driver having served different executables. > > With the patch, I can see additional measurements for prog1 and prog2 > with the hashes reversed when the FUSE driver served the alternative > content. > > [1] https://github.com/bbengfort/memfs > [2] https://github.com/kinvolk/memfs/commits/alban/switch-files > [3] https://github.com/kinvolk/linux/commits/alban/fuse-flag-ima-nocache-v3 > [4] https://github.com/kinvolk/fuse-userns-patches/commit/cf1f5750cab0 > > Cc: linux-kernel@vger.kernel.org > Cc: linux-integr...@vger.kernel.org > Cc: linux-security-mod...@vger.kernel.org > Cc: linux-fsde...@vger.kernel.org > Cc: Miklos Szeredi > Cc: Alexander Viro > Cc: Mimi Zohar > Cc: Dmitry Kasatkin > Cc: James Morris > Cc: "Serge E. Hallyn" > Cc: Seth Forshee > Cc: Christoph Hellwig > Tested-by: Dongsu Park > Signed-off-by: Alban Crequy I like this approach as it doesn't require any IMA policy changes to be effective. I'm not sure about the flag name (in my mind it would describe the fs property and not be specifically giving instructions to IMA), but if others are happy with it I won't complain. Acked-by: Seth Forshee > --- > security/integrity/ima/ima_main.c | 24 ++-- > 1 file changed, 22 insertions(+), 2 deletions(-) > > diff --git a/security/integrity/ima/ima_main.c > b/security/integrity/ima/ima_main.c > index 6d78cb26784d..8870a7bbe9b9 100644 > --- a/security/integrity/ima/ima_main.c > +++ b/security/integrity/ima/ima_main.c > @@ -24,6 +24,7 @@ > #include > #include > #include > +#include > > #include "ima.h" > > @@ -228,9 +229,28 @@ static int process_measurement(struct file *file, char > *buf, loff_t size, >IMA_APPRAISE_SUBMASK | IMA_APPRAISED_SUBMASK | >IMA_ACTION_FLAGS); > > - if (test_and_clear_bit(IMA_CHANGE_XATTR, &iint->atomic_flags)) > - /* reset all flags if ima_inode_setxattr was called */ > + /* > + * Reset the measure, appraise and audit cached flags either if: > + * - ima_inode_setxattr was called, or > + * - based on filesystem feat
Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
On Wed, Jan 17, 2018 at 07:56:59PM +0100, Alban Crequy wrote: > On Wed, Jan 17, 2018 at 3:29 PM, Seth Forshee > wrote: > > On Wed, Jan 17, 2018 at 11:59:06AM +0100, Alban Crequy wrote: > >> [Adding Tejun, David, Tom for question about cuse] > >> > >> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park wrote: > >> > From: Seth Forshee > >> > > >> > In order to support mounts from namespaces other than > >> > init_user_ns, fuse must translate uids and gids to/from the > >> > userns of the process servicing requests on /dev/fuse. This > >> > patch does that, with a couple of restrictions on the namespace: > >> > > >> > - The userns for the fuse connection is fixed to the namespace > >> >from which /dev/fuse is opened. > >> > > >> > - The namespace must be the same as s_user_ns. > >> > > >> > These restrictions simplify the implementation by avoiding the > >> > need to pass around userns references and by allowing fuse to > >> > rely on the checks in inode_change_ok for ownership changes. > >> > Either restriction could be relaxed in the future if needed. > >> > > >> > For cuse the namespace used for the connection is also simply > >> > current_user_ns() at the time /dev/cuse is opened. > >> > >> Was a use case discussed for using cuse in a new unprivileged userns? > >> > >> I ran some tests yesterday with cusexmp [1] and I could add a new char > >> device as an unprivileged user with: > >> > >> $ unshare -U -r -m sh -c 'mount --bind /mnt/cuse /dev/cuse ; cusexmp > >> --maj=99 --min=30 --name=foo > >> > >> where /mnt/cuse is previously mknod'ed correctly and chmod'ed 777. > >> Then, I could see the new device: > >> > >> $ cat /proc/devices | grep foo > >> 99 foo > >> > >> On normal distros, we don't have a /mnt/cuse chmod'ed 777 but still it > >> seems dangerous if the dev node can be provided otherwise and if we > >> don't have a use case for it. > >> > >> Thoughts? > > > > I can't remember the specific reasons, but I had concluded that letting > > unprivileged users use cuse within a user namespace isn't safe. But > > having a cuse device node usable by regular users at all is equally > > unsafe I suspect, > > This makes sense. > > > so I don't think your example demonstrates any problem > > specific to user namespaces. There shouldn't be any way to use a user > > namespace to gain access permissions towards /dev/cuse, otherwise we > > have bigger problems than cuse to worry about. > > From my tests, the patch seem safe but I don't fully understand why that is. > > I am not trying to gain more permissions towards /dev/cuse but to > create another cuse char file from within the unprivileged userns. I > tested the scenario by patching the memfs userspace FUSE driver to > generate the char device whenever the file is named "cuse" (turning > the regular file into a char device with the cuse major/minor behind > the scene): > > $ unshare -U -r -m > # memfs /mnt/memfs & > # ls -l /mnt/memfs > # echo -n > /mnt/memfs/cuse > -bash: /mnt/memfs/cuse: Input/output error > # ls -l /mnt/memfs/cuse > crwxrwxrwx. 1 root root 10, 203 Jan 17 18:24 /mnt/memfs/cuse > # cat /mnt/memfs/cuse > cat: /mnt/memfs/cuse: Permission denied > > But then, I could not use that char device, even though it seems to > have the correct major/minor and permissions. The kernel FUSE code > seems to call init_special_inode() to handle character devices. I > don't understand why it seems to be safe. Because for new mounts in non-init user namespaces alloc_super() sets SB_I_NODEV flag in s_iflags, which disallows opening device nodes in that filesystem. Seth
Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
On Wed, Jan 17, 2018 at 11:59:06AM +0100, Alban Crequy wrote: > [Adding Tejun, David, Tom for question about cuse] > > On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park wrote: > > From: Seth Forshee > > > > In order to support mounts from namespaces other than > > init_user_ns, fuse must translate uids and gids to/from the > > userns of the process servicing requests on /dev/fuse. This > > patch does that, with a couple of restrictions on the namespace: > > > > - The userns for the fuse connection is fixed to the namespace > >from which /dev/fuse is opened. > > > > - The namespace must be the same as s_user_ns. > > > > These restrictions simplify the implementation by avoiding the > > need to pass around userns references and by allowing fuse to > > rely on the checks in inode_change_ok for ownership changes. > > Either restriction could be relaxed in the future if needed. > > > > For cuse the namespace used for the connection is also simply > > current_user_ns() at the time /dev/cuse is opened. > > Was a use case discussed for using cuse in a new unprivileged userns? > > I ran some tests yesterday with cusexmp [1] and I could add a new char > device as an unprivileged user with: > > $ unshare -U -r -m sh -c 'mount --bind /mnt/cuse /dev/cuse ; cusexmp > --maj=99 --min=30 --name=foo > > where /mnt/cuse is previously mknod'ed correctly and chmod'ed 777. > Then, I could see the new device: > > $ cat /proc/devices | grep foo > 99 foo > > On normal distros, we don't have a /mnt/cuse chmod'ed 777 but still it > seems dangerous if the dev node can be provided otherwise and if we > don't have a use case for it. > > Thoughts? I can't remember the specific reasons, but I had concluded that letting unprivileged users use cuse within a user namespace isn't safe. But having a cuse device node usable by regular users at all is equally unsafe I suspect, so I don't think your example demonstrates any problem specific to user namespaces. There shouldn't be any way to use a user namespace to gain access permissions towards /dev/cuse, otherwise we have bigger problems than cuse to worry about. Seth
Re: Memory hotplug regression in 4.13
On Fri, Dec 22, 2017 at 10:12:40AM -0600, Seth Forshee wrote: > On Fri, Dec 22, 2017 at 03:49:25PM +0100, Michal Hocko wrote: > > On Mon 18-12-17 15:53:20, Michal Hocko wrote: > > > On Fri 01-12-17 08:23:27, Seth Forshee wrote: > > > > On Mon, Sep 25, 2017 at 02:58:25PM +0200, Michal Hocko wrote: > > > > > On Thu 21-09-17 00:40:34, Seth Forshee wrote: > > > [...] > > > > > > It seems I don't have that kernel anymore, but I've got a 4.14-rc1 > > > > > > build > > > > > > and the problem still occurs there. It's pointing to the call to > > > > > > __builtin_memcpy in memcpy (include/linux/string.h line 340), which > > > > > > we > > > > > > get to via wp_page_copy -> cow_user_page -> copy_user_highpage. > > > > > > > > > > Hmm, this is interesting. That would mean that we have successfully > > > > > mapped the destination page but its memory is still not accessible. > > > > > > > > > > Right now I do not see how the patch you have bisected to could make > > > > > any > > > > > difference because it only postponed the onlining to be independent > > > > > but > > > > > your config simply onlines automatically so there shouldn't be any > > > > > semantic change. Maybe there is some sort of off-by-one or something. > > > > > > > > > > I will try to investigate some more. Do you think it would be possible > > > > > to configure kdump on your system and provide me with the vmcore in > > > > > some > > > > > way? > > > > > > > > Sorry, I got busy with other stuff and this kind of fell off my radar. > > > > It came to my attention again recently though. > > > > > > Apology on my side. This has completely fall of my radar. > > > > > > > I was looking through the hotplug rework changes, and I noticed that > > > > 32-bit x86 previously was using ZONE_HIGHMEM as a default but after the > > > > rework it doesn't look like it's possible for memory to be associated > > > > with ZONE_HIGHMEM when onlining. So I made the change below against 4.14 > > > > and am now no longer seeing the oopses. > > > > > > Thanks a lot for debugging! Do I read the above correctly that the > > > current code simply returns ZONE_NORMAL and maps an unrelated pfn into > > > this zone and that leads to later blowups? Could you attach the fresh > > > boot dmesg output please? > > > > > > > I'm sure this isn't the correct fix, but I think it does confirm that > > > > the problem is that the memory should be associated with ZONE_HIGHMEM > > > > but is not. > > > > > > > > > Yes, the fix is not quite right. HIGHMEM is not a _kernel_ memory > > > zone. The kernel cannot access that memory directly. It is essentially a > > > movable zone from the hotplug API POV. We simply do not have any way to > > > tell into which zone we want to online this memory range in. > > > Unfortunately both zones _can_ be present. It would require an explicit > > > configuration (movable_node and a NUMA hoptlugable nodes running in 32b > > > or and movable memory configured explicitly on the kernel command line). > > > > > > The below patch is not really complete but I would rather start simple. > > > Maybe we do not even have to care as most 32b users will never use both > > > zones at the same time. I've placed a warning to learn about those. > > > > > > Does this pass your testing? > > > > Any chances to test this? > > Yes, I should get to testing it soon. I'm working through a backlog of > things I need to get done and this just hasn't quite made it to the top. I started by testing vanilla 4.15-rc4 with a vm that has several memory slots already populated at boot. With that I no longer get an oops, however while /sys/devices/system/memory/*/online is 1 it looks like the memory isn't being used. With your patch the behavior is the same. I'm attaching dmesg from both kernels. Thanks, Seth > > > > --- > > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > > > index 262bfd26baf9..18fec18bdb60 100644 > > > --- a/mm/memory_hotplug.c > > > +++ b/mm/memory_hotplug.c > > > @@ -855,12 +855,29 @@ static struct zone *default_kernel_zone_for_pfn(int > > > nid, unsigned long s
Re: Memory hotplug regression in 4.13
On Fri, Dec 22, 2017 at 03:49:25PM +0100, Michal Hocko wrote: > On Mon 18-12-17 15:53:20, Michal Hocko wrote: > > On Fri 01-12-17 08:23:27, Seth Forshee wrote: > > > On Mon, Sep 25, 2017 at 02:58:25PM +0200, Michal Hocko wrote: > > > > On Thu 21-09-17 00:40:34, Seth Forshee wrote: > > [...] > > > > > It seems I don't have that kernel anymore, but I've got a 4.14-rc1 > > > > > build > > > > > and the problem still occurs there. It's pointing to the call to > > > > > __builtin_memcpy in memcpy (include/linux/string.h line 340), which we > > > > > get to via wp_page_copy -> cow_user_page -> copy_user_highpage. > > > > > > > > Hmm, this is interesting. That would mean that we have successfully > > > > mapped the destination page but its memory is still not accessible. > > > > > > > > Right now I do not see how the patch you have bisected to could make any > > > > difference because it only postponed the onlining to be independent but > > > > your config simply onlines automatically so there shouldn't be any > > > > semantic change. Maybe there is some sort of off-by-one or something. > > > > > > > > I will try to investigate some more. Do you think it would be possible > > > > to configure kdump on your system and provide me with the vmcore in some > > > > way? > > > > > > Sorry, I got busy with other stuff and this kind of fell off my radar. > > > It came to my attention again recently though. > > > > Apology on my side. This has completely fall of my radar. > > > > > I was looking through the hotplug rework changes, and I noticed that > > > 32-bit x86 previously was using ZONE_HIGHMEM as a default but after the > > > rework it doesn't look like it's possible for memory to be associated > > > with ZONE_HIGHMEM when onlining. So I made the change below against 4.14 > > > and am now no longer seeing the oopses. > > > > Thanks a lot for debugging! Do I read the above correctly that the > > current code simply returns ZONE_NORMAL and maps an unrelated pfn into > > this zone and that leads to later blowups? Could you attach the fresh > > boot dmesg output please? > > > > > I'm sure this isn't the correct fix, but I think it does confirm that > > > the problem is that the memory should be associated with ZONE_HIGHMEM > > > but is not. > > > > > > Yes, the fix is not quite right. HIGHMEM is not a _kernel_ memory > > zone. The kernel cannot access that memory directly. It is essentially a > > movable zone from the hotplug API POV. We simply do not have any way to > > tell into which zone we want to online this memory range in. > > Unfortunately both zones _can_ be present. It would require an explicit > > configuration (movable_node and a NUMA hoptlugable nodes running in 32b > > or and movable memory configured explicitly on the kernel command line). > > > > The below patch is not really complete but I would rather start simple. > > Maybe we do not even have to care as most 32b users will never use both > > zones at the same time. I've placed a warning to learn about those. > > > > Does this pass your testing? > > Any chances to test this? Yes, I should get to testing it soon. I'm working through a backlog of things I need to get done and this just hasn't quite made it to the top. > > --- > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > > index 262bfd26baf9..18fec18bdb60 100644 > > --- a/mm/memory_hotplug.c > > +++ b/mm/memory_hotplug.c > > @@ -855,12 +855,29 @@ static struct zone *default_kernel_zone_for_pfn(int > > nid, unsigned long start_pfn > > return &pgdat->node_zones[ZONE_NORMAL]; > > } > > > > +static struct zone *default_movable_zone_for_pfn(int nid) > > +{ > > + /* > > +* Please note that 32b HIGHMEM systems might have 2 movable zones > > +* actually so we have to check for both. This is rather ugly hack > > +* to enforce using Highmem on those systems but we do not have a > > +* good user API to tell into which movable zone we should online. > > +* WARN if we have a movable zone which is not highmem. > > +*/ > > +#ifdef CONFIG_HIGHMEM > > + WARN_ON_ONCE(!zone_movable_is_highmem()); > > + return &NODE_DATA(nid)->node_zones[ZONE_HIGHMEM]; > > +#else > > + return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE]; > > +#endi
Re: Memory hotplug regression in 4.13
On Mon, Sep 25, 2017 at 02:58:25PM +0200, Michal Hocko wrote: > On Thu 21-09-17 00:40:34, Seth Forshee wrote: > > On Wed, Sep 20, 2017 at 11:29:31AM +0200, Michal Hocko wrote: > > > Hi, > > > I am currently at a conference so I will most probably get to this next > > > week but I will try to ASAP. > > > > > > On Tue 19-09-17 11:41:14, Seth Forshee wrote: > > > > Hi Michal, > > > > > > > > I'm seeing oopses in various locations when hotplugging memory in an x86 > > > > vm while running a 32-bit kernel. The config I'm using is attached. To > > > > reproduce I'm using kvm with the memory options "-m > > > > size=512M,slots=3,maxmem=2G". Then in the qemu monitor I run: > > > > > > > > object_add memory-backend-ram,id=mem1,size=512M > > > > device_add pc-dimm,id=dimm1,memdev=mem1 > > > > > > > > Not long after that I'll see an oops, not always in the same location > > > > but most often in wp_page_copy, like this one: > > > > > > This is rather surprising. How do you online the memory? > > > > The kernel has CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y. > > OK, so the memory gets online automagically at the time when it is > hotadded. Could you send the full dmesg? > > > > > [ 24.673623] BUG: unable to handle kernel paging request at d000 > > > > [ 24.675569] IP: wp_page_copy+0xa8/0x660 > > > > > > could you resolve the IP into the source line? > > > > It seems I don't have that kernel anymore, but I've got a 4.14-rc1 build > > and the problem still occurs there. It's pointing to the call to > > __builtin_memcpy in memcpy (include/linux/string.h line 340), which we > > get to via wp_page_copy -> cow_user_page -> copy_user_highpage. > > Hmm, this is interesting. That would mean that we have successfully > mapped the destination page but its memory is still not accessible. > > Right now I do not see how the patch you have bisected to could make any > difference because it only postponed the onlining to be independent but > your config simply onlines automatically so there shouldn't be any > semantic change. Maybe there is some sort of off-by-one or something. > > I will try to investigate some more. Do you think it would be possible > to configure kdump on your system and provide me with the vmcore in some > way? Sorry, I got busy with other stuff and this kind of fell off my radar. It came to my attention again recently though. I was looking through the hotplug rework changes, and I noticed that 32-bit x86 previously was using ZONE_HIGHMEM as a default but after the rework it doesn't look like it's possible for memory to be associated with ZONE_HIGHMEM when onlining. So I made the change below against 4.14 and am now no longer seeing the oopses. I'm sure this isn't the correct fix, but I think it does confirm that the problem is that the memory should be associated with ZONE_HIGHMEM but is not. diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index d4b5f29906b9..fddc134c5c3b 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -833,6 +833,12 @@ void __ref move_pfn_range_to_zone(struct zone *zone, set_zone_contiguous(zone); } +#ifdef CONFIG_HIGHMEM +static enum zone_type default_zone = ZONE_HIGHMEM; +#else +static enum zone_type default_zone = ZONE_NORMAL; +#endif + /* * Returns a default kernel memory zone for the given pfn range. * If no kernel zone covers this pfn range it will automatically go @@ -844,14 +850,14 @@ static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn struct pglist_data *pgdat = NODE_DATA(nid); int zid; - for (zid = 0; zid <= ZONE_NORMAL; zid++) { + for (zid = 0; zid <= default_zone; zid++) { struct zone *zone = &pgdat->node_zones[zid]; if (zone_intersects(zone, start_pfn, nr_pages)) return zone; } - return &pgdat->node_zones[ZONE_NORMAL]; + return &pgdat->node_zones[default_zone]; } static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
[PATCH] selftests/powerpc: Use snprintf to construct DSCR sysfs interface paths
Currently sprintf is used, and while paths should never exceed the size of the buffer it is theoretically possible since dirent.d_name is 256 bytes. As a result this trips -Wformat-overflow, and since the test is built with -Wall -Werror the causes the build to fail. Switch to using snprintf and skip any paths which are too long for the filename buffer. Signed-off-by: Seth Forshee --- tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c b/tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c index 17fb1b43c320..1899bd85121f 100644 --- a/tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c +++ b/tools/testing/selftests/powerpc/dscr/dscr_sysfs_test.c @@ -53,6 +53,8 @@ static int check_all_cpu_dscr_defaults(unsigned long val) } while ((dp = readdir(sysfs))) { + int len; + if (!(dp->d_type & DT_DIR)) continue; if (!strcmp(dp->d_name, "cpuidle")) @@ -60,7 +62,9 @@ static int check_all_cpu_dscr_defaults(unsigned long val) if (!strstr(dp->d_name, "cpu")) continue; - sprintf(file, "%s%s/dscr", CPU_PATH, dp->d_name); + len = snprintf(file, LEN_MAX, "%s%s/dscr", CPU_PATH, dp->d_name); + if (len >= LEN_MAX) + continue; if (access(file, F_OK)) continue; -- 2.14.1
[PATCH] powerpc: Always initialize input array when calling epapr_hypercall()
Several callers to epapr_hypercall() pass an uninitialized stack allocated array for the input arguments, presumably because they have no input arguments. However this can produce errors like this one arch/powerpc/include/asm/epapr_hcalls.h:470:42: error: 'in' may be used uninitialized in this function [-Werror=maybe-uninitialized] unsigned long register r3 asm("r3") = in[0]; ~~^~~ Fix callers to this function to always zero-initialize the input arguments array to prevent this. Signed-off-by: Seth Forshee --- arch/powerpc/include/asm/epapr_hcalls.h | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/epapr_hcalls.h b/arch/powerpc/include/asm/epapr_hcalls.h index 334459ad145b..90863245df53 100644 --- a/arch/powerpc/include/asm/epapr_hcalls.h +++ b/arch/powerpc/include/asm/epapr_hcalls.h @@ -508,7 +508,7 @@ static unsigned long epapr_hypercall(unsigned long *in, static inline long epapr_hypercall0_1(unsigned int nr, unsigned long *r2) { - unsigned long in[8]; + unsigned long in[8] = {0}; unsigned long out[8]; unsigned long r; @@ -520,7 +520,7 @@ static inline long epapr_hypercall0_1(unsigned int nr, unsigned long *r2) static inline long epapr_hypercall0(unsigned int nr) { - unsigned long in[8]; + unsigned long in[8] = {0}; unsigned long out[8]; return epapr_hypercall(in, out, nr); @@ -528,7 +528,7 @@ static inline long epapr_hypercall0(unsigned int nr) static inline long epapr_hypercall1(unsigned int nr, unsigned long p1) { - unsigned long in[8]; + unsigned long in[8] = {0}; unsigned long out[8]; in[0] = p1; @@ -538,7 +538,7 @@ static inline long epapr_hypercall1(unsigned int nr, unsigned long p1) static inline long epapr_hypercall2(unsigned int nr, unsigned long p1, unsigned long p2) { - unsigned long in[8]; + unsigned long in[8] = {0}; unsigned long out[8]; in[0] = p1; @@ -549,7 +549,7 @@ static inline long epapr_hypercall2(unsigned int nr, unsigned long p1, static inline long epapr_hypercall3(unsigned int nr, unsigned long p1, unsigned long p2, unsigned long p3) { - unsigned long in[8]; + unsigned long in[8] = {0}; unsigned long out[8]; in[0] = p1; @@ -562,7 +562,7 @@ static inline long epapr_hypercall4(unsigned int nr, unsigned long p1, unsigned long p2, unsigned long p3, unsigned long p4) { - unsigned long in[8]; + unsigned long in[8] = {0}; unsigned long out[8]; in[0] = p1; -- 2.14.1
Re: Memory hotplug regression in 4.13
On Wed, Sep 20, 2017 at 11:29:31AM +0200, Michal Hocko wrote: > Hi, > I am currently at a conference so I will most probably get to this next > week but I will try to ASAP. > > On Tue 19-09-17 11:41:14, Seth Forshee wrote: > > Hi Michal, > > > > I'm seeing oopses in various locations when hotplugging memory in an x86 > > vm while running a 32-bit kernel. The config I'm using is attached. To > > reproduce I'm using kvm with the memory options "-m > > size=512M,slots=3,maxmem=2G". Then in the qemu monitor I run: > > > > object_add memory-backend-ram,id=mem1,size=512M > > device_add pc-dimm,id=dimm1,memdev=mem1 > > > > Not long after that I'll see an oops, not always in the same location > > but most often in wp_page_copy, like this one: > > This is rather surprising. How do you online the memory? The kernel has CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y. > > [ 24.673623] BUG: unable to handle kernel paging request at d000 > > [ 24.675569] IP: wp_page_copy+0xa8/0x660 > > could you resolve the IP into the source line? It seems I don't have that kernel anymore, but I've got a 4.14-rc1 build and the problem still occurs there. It's pointing to the call to __builtin_memcpy in memcpy (include/linux/string.h line 340), which we get to via wp_page_copy -> cow_user_page -> copy_user_highpage. Thanks, Seth
Re: [PATCH] selftests/seccomp: Support glibc 2.26 siginfo_t.h
On Thu, Sep 07, 2017 at 04:32:46PM -0700, Kees Cook wrote: > The 2.26 release of glibc changed how siginfo_t is defined, and the earlier > work-around to using the kernel definition are no longer needed. The old > way needs to stay around for a while, though. > > Reported-by: Seth Forshee > Cc: Andy Lutomirski > Cc: Will Drewry > Cc: Shuah Khan > Cc: linux-kselft...@vger.kernel.org > Cc: sta...@vger.kernel.org > Signed-off-by: Kees Cook > --- > Seth, can you double check this to confirm it works for you too? This builds > and tests correctly for me on both Ubuntu 17.10 (-proposed) with glibc 2.26 > and with earlier distros with 2.24, etc. It builds and tests correctly for me too, with both glibc 2.26 and 2.24. Tested-by: Seth Forshee Thanks!
seccomp selftest fails to build with glibc 2.26
Hi Kees, I'm seeing build failures with your seccomp selftest when using glibc 2.26. The first are related to changing macro names from __have_sig*_t to __sig*_t_defined. But after defining those there are more conflicting definitions. I was able to get it to build with the changes below, however it's ugly so I'm hesitant to suggest that it's a fix (and I haven't tested with older glibc either). The full build output is a little lengthy, so I've pasted it at http://pastebin.ubuntu.com/25486192/ rather than including it inline. Thanks, Seth diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c index 03f1fa49..d234a3e5 100644 --- a/tools/testing/selftests/seccomp/seccomp_bpf.c +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c @@ -11,6 +11,13 @@ #define __have_sigval_t 1 #define __have_sigevent_t 1 +/* These fix errors with glibc 2.26 */ +#define __siginfo_t_defined 1 +#define __sigval_t_defined 1 +#define __sigevent_t_defined 1 +#define _BITS_SIGINFO_CONSTS_H 1 +#define _BITS_SIGEVENT_CONSTS_H 1 + #include #include #include
[PATCH] scsi: aacraid: Don't copy uninitialized stack memory to userspace
Both aac_send_raw_srb() and aac_get_hba_info() may copy stack allocated structs to userspace without initializing all members of these structs. Clear out this memory to prevent information leaks. Fixes: 423400e64d377 ("scsi: aacraid: Include HBA direct interface") Fixes: c799d519bf088 ("scsi: aacraid: Retrieve HBA host information ioctl") Signed-off-by: Seth Forshee --- drivers/scsi/aacraid/commctrl.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/scsi/aacraid/commctrl.c b/drivers/scsi/aacraid/commctrl.c index d2f8d5954840..476ada6e39d0 100644 --- a/drivers/scsi/aacraid/commctrl.c +++ b/drivers/scsi/aacraid/commctrl.c @@ -950,6 +950,7 @@ static int aac_send_raw_srb(struct aac_dev* dev, void __user * arg) &((struct aac_native_hba *)srbfib->hw_fib_va)->resp.err; struct aac_srb_reply reply; + memset(&reply, 0, sizeof(reply)); reply.status = ST_OK; if (srbfib->flags & FIB_CONTEXT_FLAG_FASTRESP) { /* fast response */ @@ -1019,6 +1020,7 @@ static int aac_get_hba_info(struct aac_dev *dev, void __user *arg) { struct aac_hba_info hbainfo; + memset(&hbainfo, 0, sizeof(hbainfo)); hbainfo.adapter_number = (u8) dev->id; hbainfo.system_io_bus_number= dev->pdev->bus->number; hbainfo.device_number = (dev->pdev->devfn >> 3); -- 2.11.0
Re: audit regressions in 4.11
On Sun, Apr 09, 2017 at 09:14:03AM -0400, Paul Moore wrote: > On Sat, Apr 8, 2017 at 11:02 PM, Seth Forshee > wrote: > > I've observed audit regressions in 4.11-rc when not using a userspace > > audit daemon. The most obvious issue is that audit messages are not > > appearing in dmesg anymore. If a sufficient number of audit messages are > > generated the kernel will also start invoking the OOM killer. > > > > It looks like previously, when there's no auditd in userspace kauditd > > would call kauditd_hold_skb(), which prints the message using printk and > > either frees the skb or queues it (with a limit on the number of queued > > skb's by default). > > > > Since 5b52330bbfe6 "audit: fix auditd/kernel connection state tracking" > > when there's no auditd kauditd will instead use the retry queue, which > > has no limit. But it will not process the retry queue when there's no > > auditd, so messages pile up there indefinitely. > > Hi Seth, > > Thanks for the report. Let me play with this and think on it for a > bit, but looking at the code again I think the issue is that we check > to see if auditd is connected at the top of the kauditd_thread() loop > and if it isn't we skip right to the main_queue label and bypass the > hold/retry queue processing which has the logic to ensure the retry > queue is managed correctly. My initial thinking is that the fix is to > check and see if auditd is connected in kauditd_retry_skb(), if it > isn't we skip the retry queue and call kauditd_hold_skb(), if auditd > is connected we add the record to the retry queue (what we currently > do). Yeah, my first thought was to make this change: kauditd_send_queue(sk, portid, &audit_queue, 1, kauditd_send_multicast_skb, - kauditd_retry_skb); + sk ? kauditd_retry_skb : kauditd_hold_skb); However some scenarios could result in unbounded queueing on the hold queue as well, so I'm not sure if that's quite enough. Thanks, Seth
audit regressions in 4.11
I've observed audit regressions in 4.11-rc when not using a userspace audit daemon. The most obvious issue is that audit messages are not appearing in dmesg anymore. If a sufficient number of audit messages are generated the kernel will also start invoking the OOM killer. It looks like previously, when there's no auditd in userspace kauditd would call kauditd_hold_skb(), which prints the message using printk and either frees the skb or queues it (with a limit on the number of queued skb's by default). Since 5b52330bbfe6 "audit: fix auditd/kernel connection state tracking" when there's no auditd kauditd will instead use the retry queue, which has no limit. But it will not process the retry queue when there's no auditd, so messages pile up there indefinitely. Thanks, Seth
Re: [PATCH] vfs: Partially revert addition of cred override in follow_automount()
On Wed, Feb 22, 2017 at 01:36:32PM +, David Howells wrote: > The following commit: > > commit aeaa4a79ff6a5ed912b7362f206cf8576fca538b > Author: Eric W. Biederman > Date: Sat Jul 23 11:20:44 2016 -0500 > fs: Call d_automount with the filesystems creds > > brackets the call to ->d_automount() with override_creds() and > revert_creds(), setting the initial credentials for use whilst > automounting. > > This, however, breaks AFS as it's no longer able to access the calling > process's keyrings to read the destination on a mountpoint. This has gone > unnoticed till now because, to this point, stat'ing or validating the inode > caused the body of the mountpoint to be read into the pagecache (so that we > could determine whether what we were looking at was really a mountpoint). > > However, the page containing the mountpoint destination is merely *cached* > and not pinned. If it gets discarded and we try to read it in d_automount, > we may fail because we have no authentication tokens available. > > So, for the moment, revert the addition of override_creds() and > revert_creds() and their variable. > > Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds") > Signed-off-by: David Howells > cc: Seth Forshee > cc: "Eric W. Biederman" > cc: Al Viro Eric's already applied a patch that should fix this problem. As far as I know it's only been tested against NFS though, so you might want to test that it also fixes the issue with AFS. https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=93faccbbfa958a9668d3ab4e30f38dd205cee8d8 Thanks, Seth
Re: cifs mount regression in 4.8 and 4.4 stable
On Thu, Sep 22, 2016 at 10:27:56AM -0500, Seth Forshee wrote: > On Thu, Sep 22, 2016 at 04:17:09PM +0100, Sachin Prabhu wrote: > > On Thu, 2016-09-22 at 10:09 -0500, Seth Forshee wrote: > > > We've received reports from users of a cifs mount regression in our > > > 4.4-based kernel, e.g. [1]. It is fixed by reverting the follwing > > > commit > > > from 4.8 which was applied to 4.4 stable: > > > > > > a6b5058 fs/cifs: make share unaccessible at root level mountable > > > > > > Testing against 4.8-rc7 shows that the problem is present there as > > > well. > > > > > > Thanks, > > > Seth > > > > > > [1] http://bugs.launchpad.net/bugs/1626112 > > > > Hello Seth, > > > > We have identified some regressions introduced by the mentioned patch > > These include > > I saw those, but none of the ones already in Linus's tree fix the > problem. > > > a) mounting of DFS shares breaks. The fix is included in Steve's tree > > at > > https://git.samba.org/?p=sfrench/cifs-2.6.git;a=commit;h=de5233745cd59c > > f5853d963ad216067788a87594 > > However this one isn't in Linus's tree yet. > > I'm not certain what circumstances cause the failure, but based on what > I see in dmesg it looks likely that this one is the fix. I'll get some > testing with this applied. Looks like this is is the fix we need. Thanks for your help. Seth
Re: cifs mount regression in 4.8 and 4.4 stable
On Thu, Sep 22, 2016 at 04:17:09PM +0100, Sachin Prabhu wrote: > On Thu, 2016-09-22 at 10:09 -0500, Seth Forshee wrote: > > We've received reports from users of a cifs mount regression in our > > 4.4-based kernel, e.g. [1]. It is fixed by reverting the follwing > > commit > > from 4.8 which was applied to 4.4 stable: > > > > a6b5058 fs/cifs: make share unaccessible at root level mountable > > > > Testing against 4.8-rc7 shows that the problem is present there as > > well. > > > > Thanks, > > Seth > > > > [1] http://bugs.launchpad.net/bugs/1626112 > > Hello Seth, > > We have identified some regressions introduced by the mentioned patch > These include I saw those, but none of the ones already in Linus's tree fix the problem. > a) mounting of DFS shares breaks. The fix is included in Steve's tree > at > https://git.samba.org/?p=sfrench/cifs-2.6.git;a=commit;h=de5233745cd59c > f5853d963ad216067788a87594 However this one isn't in Linus's tree yet. I'm not certain what circumstances cause the failure, but based on what I see in dmesg it looks likely that this one is the fix. I'll get some testing with this applied. Since the broken patch was applied to stable kernels, all of these kernels need to receive those fixes as well. Thanks, Seth
cifs mount regression in 4.8 and 4.4 stable
We've received reports from users of a cifs mount regression in our 4.4-based kernel, e.g. [1]. It is fixed by reverting the follwing commit from 4.8 which was applied to 4.4 stable: a6b5058 fs/cifs: make share unaccessible at root level mountable Testing against 4.8-rc7 shows that the problem is present there as well. Thanks, Seth [1] http://bugs.launchpad.net/bugs/1626112
[PATCH RESEND] xenbus: Use proc_create_mount_point() to create /proc/xen
Mounting proc in user namespace containers fails if the xenbus filesystem is mounted on /proc/xen because this directory fails the "permanently empty" test. proc_create_mount_point() exists specifically to create such mountpoints in proc but is currently proc-internal. Export this interface to modules, then use it in xenbus when creating /proc/xen. Acked-by: David Vrabel Signed-off-by: Seth Forshee --- Resending to add some Cc's I missed the first time. drivers/xen/xenbus/xenbus_probe.c | 2 +- fs/proc/generic.c | 1 + fs/proc/internal.h| 1 - include/linux/proc_fs.h | 2 ++ 4 files changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/xen/xenbus/xenbus_probe.c b/drivers/xen/xenbus/xenbus_probe.c index 33a31cfef55d..b5c1dec4a7c2 100644 --- a/drivers/xen/xenbus/xenbus_probe.c +++ b/drivers/xen/xenbus/xenbus_probe.c @@ -826,7 +826,7 @@ static int __init xenbus_init(void) * Create xenfs mountpoint in /proc for compatibility with * utilities that expect to find "xenbus" under "/proc/xen". */ - proc_mkdir("xen", NULL); + proc_create_mount_point("xen"); #endif out_error: diff --git a/fs/proc/generic.c b/fs/proc/generic.c index c633476616e0..be014c544d50 100644 --- a/fs/proc/generic.c +++ b/fs/proc/generic.c @@ -477,6 +477,7 @@ struct proc_dir_entry *proc_create_mount_point(const char *name) } return ent; } +EXPORT_SYMBOL(proc_create_mount_point); struct proc_dir_entry *proc_create_data(const char *name, umode_t mode, struct proc_dir_entry *parent, diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 7931c558c192..ff7259559d70 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -195,7 +195,6 @@ static inline bool is_empty_pde(const struct proc_dir_entry *pde) { return S_ISDIR(pde->mode) && !pde->proc_iops; } -struct proc_dir_entry *proc_create_mount_point(const char *name); /* * inode.c diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h index b97bf2ef996e..8bd2f726436a 100644 --- a/include/linux/proc_fs.h +++ b/include/linux/proc_fs.h @@ -21,6 +21,7 @@ extern struct proc_dir_entry *proc_mkdir_data(const char *, umode_t, struct proc_dir_entry *, void *); extern struct proc_dir_entry *proc_mkdir_mode(const char *, umode_t, struct proc_dir_entry *); +struct proc_dir_entry *proc_create_mount_point(const char *name); extern struct proc_dir_entry *proc_create_data(const char *, umode_t, struct proc_dir_entry *, @@ -56,6 +57,7 @@ static inline struct proc_dir_entry *proc_symlink(const char *name, struct proc_dir_entry *parent,const char *dest) { return NULL;} static inline struct proc_dir_entry *proc_mkdir(const char *name, struct proc_dir_entry *parent) {return NULL;} +static inline struct proc_dir_entry *proc_create_mount_point(const char *name) { return NULL; } static inline struct proc_dir_entry *proc_mkdir_data(const char *name, umode_t mode, struct proc_dir_entry *parent, void *data) { return NULL; } static inline struct proc_dir_entry *proc_mkdir_mode(const char *name, -- 2.7.4
Re: [Xen-devel] [PATCH] xenbus: Use proc_create_mount_point() to create /proc/xen
On Tue, Aug 30, 2016 at 04:00:03PM +0100, David Vrabel wrote: > On 29/08/16 16:03, Seth Forshee wrote: > > Mounting proc in user namespace containers fails if the xenbus > > filesystem is mounted on /proc/xen because this directory fails > > the "permanently empty" test. proc_create_mount_point() exists > > specifically to create such mountpoints in proc but is currently > > proc-internal. Export this interface to modules, then use it in > > xenbus when creating /proc/xen. > > Acked-by: David Vrabel > > This either needs to be acked by the fs maintainer or go via their tree > but you don't appear to have Cc'd the relevant people or lists. Huh. I use a script which uses get_maintainer.pl to add the relevant maintainers and lists, but appaerntly that failed me this time. Even running get_maintainer.pl by hand though it doesn't suggest linux-fsdevel though, and it seems it should. Thanks, Seth
Re: [PATCH] xenbus: Use proc_create_mount_point() to create /proc/xen
On Tue, Aug 30, 2016 at 04:48:08PM +0200, Juergen Gross wrote: > On 29/08/16 17:03, Seth Forshee wrote: > > Mounting proc in user namespace containers fails if the xenbus > > filesystem is mounted on /proc/xen because this directory fails > > the "permanently empty" test. proc_create_mount_point() exists > > specifically to create such mountpoints in proc but is currently > > proc-internal. Export this interface to modules, then use it in > > xenbus when creating /proc/xen. > > > > Signed-off-by: Seth Forshee > > --- > > drivers/xen/xenbus/xenbus_probe.c | 2 +- > > fs/proc/generic.c | 1 + > > fs/proc/internal.h| 1 - > > include/linux/proc_fs.h | 2 ++ > > 4 files changed, 4 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/xen/xenbus/xenbus_probe.c > > b/drivers/xen/xenbus/xenbus_probe.c > > index 33a31cfef55d..b5c1dec4a7c2 100644 > > --- a/drivers/xen/xenbus/xenbus_probe.c > > +++ b/drivers/xen/xenbus/xenbus_probe.c > > @@ -826,7 +826,7 @@ static int __init xenbus_init(void) > > * Create xenfs mountpoint in /proc for compatibility with > > * utilities that expect to find "xenbus" under "/proc/xen". > > */ > > - proc_mkdir("xen", NULL); > > + proc_create_mount_point("xen"); > > #endif > > > > out_error: > > diff --git a/fs/proc/generic.c b/fs/proc/generic.c > > index c633476616e0..be014c544d50 100644 > > --- a/fs/proc/generic.c > > +++ b/fs/proc/generic.c > > @@ -477,6 +477,7 @@ struct proc_dir_entry *proc_create_mount_point(const > > char *name) > > } > > return ent; > > } > > +EXPORT_SYMBOL(proc_create_mount_point); > > EXPORT_SYMBOL_GPL()? Other similar sorts of calls in proc (proc_mkdir in particular) are EXPORT_SYMBOL, so I guessed this one should follow suit. But if it should be EXPORT_SYMOBL_GPL then that's fine too. Thanks, Seth
[PATCH] xenbus: Use proc_create_mount_point() to create /proc/xen
Mounting proc in user namespace containers fails if the xenbus filesystem is mounted on /proc/xen because this directory fails the "permanently empty" test. proc_create_mount_point() exists specifically to create such mountpoints in proc but is currently proc-internal. Export this interface to modules, then use it in xenbus when creating /proc/xen. Signed-off-by: Seth Forshee --- drivers/xen/xenbus/xenbus_probe.c | 2 +- fs/proc/generic.c | 1 + fs/proc/internal.h| 1 - include/linux/proc_fs.h | 2 ++ 4 files changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/xen/xenbus/xenbus_probe.c b/drivers/xen/xenbus/xenbus_probe.c index 33a31cfef55d..b5c1dec4a7c2 100644 --- a/drivers/xen/xenbus/xenbus_probe.c +++ b/drivers/xen/xenbus/xenbus_probe.c @@ -826,7 +826,7 @@ static int __init xenbus_init(void) * Create xenfs mountpoint in /proc for compatibility with * utilities that expect to find "xenbus" under "/proc/xen". */ - proc_mkdir("xen", NULL); + proc_create_mount_point("xen"); #endif out_error: diff --git a/fs/proc/generic.c b/fs/proc/generic.c index c633476616e0..be014c544d50 100644 --- a/fs/proc/generic.c +++ b/fs/proc/generic.c @@ -477,6 +477,7 @@ struct proc_dir_entry *proc_create_mount_point(const char *name) } return ent; } +EXPORT_SYMBOL(proc_create_mount_point); struct proc_dir_entry *proc_create_data(const char *name, umode_t mode, struct proc_dir_entry *parent, diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 7931c558c192..ff7259559d70 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -195,7 +195,6 @@ static inline bool is_empty_pde(const struct proc_dir_entry *pde) { return S_ISDIR(pde->mode) && !pde->proc_iops; } -struct proc_dir_entry *proc_create_mount_point(const char *name); /* * inode.c diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h index b97bf2ef996e..8bd2f726436a 100644 --- a/include/linux/proc_fs.h +++ b/include/linux/proc_fs.h @@ -21,6 +21,7 @@ extern struct proc_dir_entry *proc_mkdir_data(const char *, umode_t, struct proc_dir_entry *, void *); extern struct proc_dir_entry *proc_mkdir_mode(const char *, umode_t, struct proc_dir_entry *); +struct proc_dir_entry *proc_create_mount_point(const char *name); extern struct proc_dir_entry *proc_create_data(const char *, umode_t, struct proc_dir_entry *, @@ -56,6 +57,7 @@ static inline struct proc_dir_entry *proc_symlink(const char *name, struct proc_dir_entry *parent,const char *dest) { return NULL;} static inline struct proc_dir_entry *proc_mkdir(const char *name, struct proc_dir_entry *parent) {return NULL;} +static inline struct proc_dir_entry *proc_create_mount_point(const char *name) { return NULL; } static inline struct proc_dir_entry *proc_mkdir_data(const char *name, umode_t mode, struct proc_dir_entry *parent, void *data) { return NULL; } static inline struct proc_dir_entry *proc_mkdir_mode(const char *name, -- 2.7.4
Re: [PATCH v4 18/21] fuse: Add support for pid namespaces
On Tue, Jul 19, 2016 at 07:44:11PM -0700, Sheng Yang wrote: > On Tue, Apr 26, 2016 at 12:36 PM, Seth Forshee > wrote: > > When the userspace process servicing fuse requests is running in > > a pid namespace then pids passed via the fuse fd are not being > > translated into that process' namespace. Translation is necessary > > for the pid to be useful to that process. > > > > Since no use case currently exists for changing namespaces all > > translations can be done relative to the pid namespace in use > > when fuse_conn_init() is called. For fuse this translates to > > mount time, and for cuse this is when /dev/cuse is opened. IO for > > this connection from another namespace will return errors. > > > > Requests from processes whose pid cannot be translated into the > > target namespace are not permitted, except for requests > > allocated via fuse_get_req_nofail_nopages. For no-fail requests > > in.h.pid will be 0 if the pid translation fails. > > Hi Seth, > > This patch caused a regression in our major container use case with > FUSE in Ubuntu 16.04, as patch was checked in as Ubuntu Sauce in > Ubuntu 4.4.0-6.21 kernel. > > The use case is: > 1. Create a Docker container. > 2. Inside the container, start the FUSE backend, and mounted fs. > 3. Following step 2 in the container, create a loopback device to map > a file in the mounted fuse to create a block device, which will be > available to the whole system. > > It works well before this commit. > > The use case is broken because no matter which namespace losetup runs, > the real request from loopback device seems always come from init ns, > thus it will be in different ns running fuse backend. So the request > will got denied, because the ns running fuse won't able to see the > things from higher level(level 0 in fact) pid namespace. > > I think since init pid ns has ability to access any process in the > system, it should able to access the fuse mounted by any pid namespace > process as well. > > What you think? It sounds like we need to remove the restriction on accessing the filesystem from a different pid namespace. I don't think this poses a security problem. However there's no pid mapping that is usable by the userspace fuse process, so what do we put in the fuse request? Probably the only candidates are 0 and 0x. So a question for the fuse developers - is one value or the other preferrable for fuse_in_header.pid when the pid cannot be mapped, and is this going to cause problems for any fuse filesystems? I suspect that few filesystems actually look at the pid anyway, and already for a filesystem mounted in a pid namespace the values being given to userspace won't be correct for the namespace of the fuse process. Seth
Re: Hang due to nfs letting tasks freeze with locked inodes
On Mon, Jul 11, 2016 at 07:03:31AM -0400, Jeff Layton wrote: > On Mon, 2016-07-11 at 09:23 +0200, Michal Hocko wrote: > > On Fri 08-07-16 10:27:38, Jeff Layton wrote: > > > On Fri, 2016-07-08 at 16:23 +0200, Michal Hocko wrote: > > > > On Fri 08-07-16 08:51:54, Jeff Layton wrote: > > > > > > > > > > On Fri, 2016-07-08 at 14:22 +0200, Michal Hocko wrote: > > > > [...] > > > > > > > > > > > > > > > > > Apart from alternative Dave was mentioning in other email, what > > > > > > is the > > > > > > point to use freezable wait from this path in the first place? > > > > > > > > > > > > nfs4_handle_exception does nfs4_wait_clnt_recover from the same > > > > > > path and > > > > > > that does wait_on_bit_action with TASK_KILLABLE so we are waiting > > > > > > in two > > > > > > different modes from the same path AFAICS. There do not seem to > > > > > > be other > > > > > > callers of nfs4_delay outside of nfs4_handle_exception. Sounds > > > > > > like > > > > > > something is not quite right here to me. If the nfs4_delay did > > > > > > regular > > > > > > wait then the freezing would fail as well but at least it would > > > > > > be clear > > > > > > who is the culrprit rather than having an indirect dependency. > > > > > The codepaths involved there are a lot more complex than that > > > > > unfortunately. > > > > > > > > > > nfs4_delay is the function that we use to handle the case where the > > > > > server returns NFS4ERR_DELAY. Basically telling us that it's too > > > > > busy > > > > > right now or has some transient error and the client should retry > > > > > after > > > > > a small, sliding delay. > > > > > > > > > > That codepath could probably be made more freezer-safe. The typical > > > > > case however, is that we've sent a call and just haven't gotten a > > > > > reply. That's the trickier one to handle. > > > > Why using a regular non-freezable wait would be a problem? > > > > > > It has been a while since I looked at that code, but IIRC, that could > > > block the freezer for up to 15s, which is a significant portion of the > > > 20s that you get before the freezer gives up. > > > > But how does that differ from the situation when the freezer has to give > > up on the timeout because another task fails due to lock dependency. > > > > As Trond and Dave have written in other emails. It is really danngerous > > to freeze a task while it is holding locks and other resources. > > It's not really dangerous if you're freezing every task on the host. > Sure, you're freezing with locks held, but everything else is freezing > too, so nothing will be contending for those locks. Unless you have tasks either already waiting on those locks or that will attaempt to lock them before calling try_to_freeze. That happens to be the case in this cgroup freezer hang too, all the tasks stuck waiting on the i_mutex are p being frozen. > I'm not at all opposed to changing how all of that works. My only > stipulation is that we not break the ability to reliably suspend a host > that is actively using an NFS mount. If you can come up with a way to > do that that also works for freezing cgroups, then I'm all for it. The deadlock that we've seen should apply equally to suspend and to the cgroup freezer. The only difference is that suspend will eventually time out and abort the suspend whereas the cgroup freezer does not. Seth
Re: Hang due to nfs letting tasks freeze with locked inodes
On Fri, Jul 08, 2016 at 09:53:30AM +1000, Dave Chinner wrote: > On Wed, Jul 06, 2016 at 06:07:18PM -0400, Jeff Layton wrote: > > On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote: > > > We're seeing a hang when freezing a container with an nfs bind mount while > > > running iozone. Two iozone processes were hung with this stack trace. > > > > > > [] schedule+0x35/0x80 > > > [] schedule_preempt_disabled+0xe/0x10 > > > [] __mutex_lock_slowpath+0xb9/0x130 > > > [] mutex_lock+0x1f/0x30 > > > [] do_unlinkat+0x12b/0x2d0 > > > [] SyS_unlink+0x16/0x20 > > > [] entry_SYSCALL_64_fastpath+0x16/0x71 > > > > > > This seems to be due to another iozone thread frozen during unlink with > > > this stack trace: > > > > > > [] __refrigerator+0x7a/0x140 > > > [] nfs4_handle_exception+0x118/0x130 [nfsv4] > > > [] nfs4_proc_remove+0x7d/0xf0 [nfsv4] > > > [] nfs_unlink+0x149/0x350 [nfs] > > > [] vfs_unlink+0xf1/0x1a0 > > > [] do_unlinkat+0x279/0x2d0 > > > [] SyS_unlink+0x16/0x20 > > > [] entry_SYSCALL_64_fastpath+0x16/0x71 > > > > > > Since nfs is allowing the thread to be frozen with the inode locked it's > > > preventing other threads trying to lock the same inode from freezing. It > > > seems like a bad idea for nfs to be doing this. > > > > > > > Yeah, known problem. Not a simple one to fix though. > > Actually, it is simple to fix. > > not sys_sync(), to suspend filesystem operations> > > i.e. the VFS blocks new operations from starting, and then then the > NFS client simply needs to implement ->freeze_fs to drain all it's > active operations before returning. Problem solved. No, this won't solve my problem. We're not doing a full suspend, rather using a freezer cgroup to freeze a subset of processes. We don't want to want to fully freeze the filesystem. Thanks, Seth
Re: Hang due to nfs letting tasks freeze with locked inodes
On Fri, Jul 08, 2016 at 02:22:24PM +0200, Michal Hocko wrote: > On Wed 06-07-16 18:07:18, Jeff Layton wrote: > > On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote: > > > We're seeing a hang when freezing a container with an nfs bind mount while > > > running iozone. Two iozone processes were hung with this stack trace. > > > > > > [] schedule+0x35/0x80 > > > [] schedule_preempt_disabled+0xe/0x10 > > > [] __mutex_lock_slowpath+0xb9/0x130 > > > [] mutex_lock+0x1f/0x30 > > > [] do_unlinkat+0x12b/0x2d0 > > > [] SyS_unlink+0x16/0x20 > > > [] entry_SYSCALL_64_fastpath+0x16/0x71 > > > > > > This seems to be due to another iozone thread frozen during unlink with > > > this stack trace: > > > > > > [] __refrigerator+0x7a/0x140 > > > [] nfs4_handle_exception+0x118/0x130 [nfsv4] > > > [] nfs4_proc_remove+0x7d/0xf0 [nfsv4] > > > [] nfs_unlink+0x149/0x350 [nfs] > > > [] vfs_unlink+0xf1/0x1a0 > > > [] do_unlinkat+0x279/0x2d0 > > > [] SyS_unlink+0x16/0x20 > > > [] entry_SYSCALL_64_fastpath+0x16/0x71 > > > > > > Since nfs is allowing the thread to be frozen with the inode locked it's > > > preventing other threads trying to lock the same inode from freezing. It > > > seems like a bad idea for nfs to be doing this. > > > > > > > Yeah, known problem. Not a simple one to fix though. > > Apart from alternative Dave was mentioning in other email, what is the > point to use freezable wait from this path in the first place? > > nfs4_handle_exception does nfs4_wait_clnt_recover from the same path and > that does wait_on_bit_action with TASK_KILLABLE so we are waiting in two > different modes from the same path AFAICS. There do not seem to be other > callers of nfs4_delay outside of nfs4_handle_exception. Sounds like > something is not quite right here to me. If the nfs4_delay did regular > wait then the freezing would fail as well but at least it would be clear > who is the culrprit rather than having an indirect dependency. It turns out there are more paths than this one doing a freezable wait, and they're all also killable. This leads me to a slightly different question than yours, why nfs can give up waiting in the case of a signal but not when the task is frozen. I know the changes below aren't "correct," but I've been experimenting with them anyway to see what would happen. So far things seem to be fine, and the deadlock is gone. That should give you an idea of all the places I found using a freezable wait. Seth diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index f714b98..62dbe59 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -77,8 +77,8 @@ nfs_fattr_to_ino_t(struct nfs_fattr *fattr) */ int nfs_wait_bit_killable(struct wait_bit_key *key, int mode) { - freezable_schedule_unsafe(); - if (signal_pending_state(mode, current)) + schedule(); + if (signal_pending_state(mode, current) || freezing(current)) return -ERESTARTSYS; return 0; } diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c index cb28cce..2315183 100644 --- a/fs/nfs/nfs3proc.c +++ b/fs/nfs/nfs3proc.c @@ -35,9 +35,9 @@ nfs3_rpc_wrapper(struct rpc_clnt *clnt, struct rpc_message *msg, int flags) res = rpc_call_sync(clnt, msg, flags); if (res != -EJUKEBOX) break; - freezable_schedule_timeout_killable_unsafe(NFS_JUKEBOX_RETRY_TIME); + schedule_timeout_killable(NFS_JUKEBOX_RETRY_TIME); res = -ERESTARTSYS; - } while (!fatal_signal_pending(current)); + } while (!fatal_signal_pending(current) && !freezing(current)); return res; } diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c index 98a4415..0dad2fb 100644 --- a/fs/nfs/nfs4proc.c +++ b/fs/nfs/nfs4proc.c @@ -334,9 +334,8 @@ static int nfs4_delay(struct rpc_clnt *clnt, long *timeout) might_sleep(); - freezable_schedule_timeout_killable_unsafe( - nfs4_update_delay(timeout)); - if (fatal_signal_pending(current)) + schedule_timeout_killable(nfs4_update_delay(timeout)); + if (fatal_signal_pending(current) || freezing(current)) res = -ERESTARTSYS; return res; } @@ -5447,7 +5446,7 @@ int nfs4_proc_delegreturn(struct inode *inode, struct rpc_cred *cred, const nfs4 static unsigned long nfs4_set_lock_task_retry(unsigned long timeout) { - freezable_schedule_timeout_killable_unsafe(timeout); + schedule_timeout_killable(timeout); timeout <<= 1; if (timeout > NFS4_LOCK_MAXTIMEOUT) return NFS4_LOCK_MAXTIMEOUT; @@ -6148,7 +6147,7 @@ nfs4_proc_lock(struct file *filp,
Re: Hang due to nfs letting tasks freeze with locked inodes
On Wed, Jul 06, 2016 at 06:07:18PM -0400, Jeff Layton wrote: > On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote: > > We're seeing a hang when freezing a container with an nfs bind mount while > > running iozone. Two iozone processes were hung with this stack trace. > > > > [] schedule+0x35/0x80 > > [] schedule_preempt_disabled+0xe/0x10 > > [] __mutex_lock_slowpath+0xb9/0x130 > > [] mutex_lock+0x1f/0x30 > > [] do_unlinkat+0x12b/0x2d0 > > [] SyS_unlink+0x16/0x20 > > [] entry_SYSCALL_64_fastpath+0x16/0x71 > > > > This seems to be due to another iozone thread frozen during unlink with > > this stack trace: > > > > [] __refrigerator+0x7a/0x140 > > [] nfs4_handle_exception+0x118/0x130 [nfsv4] > > [] nfs4_proc_remove+0x7d/0xf0 [nfsv4] > > [] nfs_unlink+0x149/0x350 [nfs] > > [] vfs_unlink+0xf1/0x1a0 > > [] do_unlinkat+0x279/0x2d0 > > [] SyS_unlink+0x16/0x20 > > [] entry_SYSCALL_64_fastpath+0x16/0x71 > > > > Since nfs is allowing the thread to be frozen with the inode locked it's > > preventing other threads trying to lock the same inode from freezing. It > > seems like a bad idea for nfs to be doing this. > > > > Yeah, known problem. Not a simple one to fix though. > > > Can nfs do something different here to prevent this? Maybe use a > > non-freezable sleep and let the operation complete, or else abort the > > operation and return ERESTARTSYS? > > The problem with letting the op complete is that often by the time you > get to the point of trying to freeze processes, the network interfaces > are already shut down. So the operation you're waiting on might never > complete. Stuff like suspend operations on your laptop fail, leading to > fun bug reports like: "Oh, my laptop burned to crisp inside my bag > because the suspend never completed." > > You could (in principle) return something like -ERESTARTSYS iff the > call has not yet been transmitted. If it has already been transmitted, > then you might end up sending the call a second time (but not as an RPC > retransmission of course). If that call was non-idempotent then you end > up with all of _those_ sorts of problems. > > Also, -ERESTARTSYS is not quite right as it doesn't always cause the > call to be restarted. It depends on the syscall. I think this would > probably need some other sort of syscall-restart machinery plumbed in. I don't really know much at all about how NFS works, so I hope you don't mind indulging me in some questions. What happens then if you suspend waiting for an op to complete and then resume an hour later? Will it actually succeed or end up returning some sort of "timed out" error? If it's going to be an error (or even likely to be one) could the op just be aborted immediately with an error code? It just seems like there must be something better than potentially deadlocking the kernel. Thanks, Seth
Hang due to nfs letting tasks freeze with locked inodes
We're seeing a hang when freezing a container with an nfs bind mount while running iozone. Two iozone processes were hung with this stack trace. [] schedule+0x35/0x80 [] schedule_preempt_disabled+0xe/0x10 [] __mutex_lock_slowpath+0xb9/0x130 [] mutex_lock+0x1f/0x30 [] do_unlinkat+0x12b/0x2d0 [] SyS_unlink+0x16/0x20 [] entry_SYSCALL_64_fastpath+0x16/0x71 This seems to be due to another iozone thread frozen during unlink with this stack trace: [] __refrigerator+0x7a/0x140 [] nfs4_handle_exception+0x118/0x130 [nfsv4] [] nfs4_proc_remove+0x7d/0xf0 [nfsv4] [] nfs_unlink+0x149/0x350 [nfs] [] vfs_unlink+0xf1/0x1a0 [] do_unlinkat+0x279/0x2d0 [] SyS_unlink+0x16/0x20 [] entry_SYSCALL_64_fastpath+0x16/0x71 Since nfs is allowing the thread to be frozen with the inode locked it's preventing other threads trying to lock the same inode from freezing. It seems like a bad idea for nfs to be doing this. Can nfs do something different here to prevent this? Maybe use a non-freezable sleep and let the operation complete, or else abort the operation and return ERESTARTSYS? Thanks, Seth
Re: [PATCH v4 03/21] fs: Allow sysfs and cgroupfs to share super blocks between user namespaces
On Wed, May 18, 2016 at 10:45:31AM -0500, Eric W. Biederman wrote: > > But if we do that it violates some of the assumptions of the patch to > > rework MNT_NODEV on your testing branch (and also those behind patch 2 > > in this series). Something will need to be changed there to prevent a > > regression in mount behavior when a user ns tries to mount without > > MNT_NODEV when the mount inherited from its parent has it set. > > Thank you for pointing that out. I will look into that. > > I believe I know exactly what you are talking about. Of the choices I > think it is better to a minor localized change in the fs_fully_visible > logic than it is to cause problems elsewhere. Agreed. > >> Apologies for not catching this earlier. > > > > Actually this is a more recent patch, so you possibly hadn't seen it > > before. > > > >> I am looking at folding all of this into the patch that introduces > >> sget_userns so that even bisects won't have regresssions. > > > > That's fine with me. > > And thank you for keeping everything as separate patches. That is at > least helping me catch up. Even if I don't agree that these things > should be separate come merge time. Honestly I probably would have squashed some of them into that first patch myself if you hadn't already applied it to your testing branch, so that's all just luck. Keep in mind that I also have that patch for mqueue that isn't in this series, and I haven't yet checked to see if the 4.7 merges introduce anything which is going to require updating these patches. I was planning to wait and send out updates after -rc1, but if you want that stuff sooner just let me know. Thanks, Seth
Re: [PATCH v4 03/21] fs: Allow sysfs and cgroupfs to share super blocks between user namespaces
On Tue, May 17, 2016 at 05:39:33PM -0500, Eric W. Biederman wrote: > Seth Forshee writes: > > > Both of these filesystems already have use cases for mounting the > > same super block from multiple user namespaces. For sysfs this > > happens when using criu for snapshotting a container, where sysfs > > is mnounted in the containers network ns but the hosts user ns. > > The cgroup filesystem shares the same super block for all mounts > > of the same hierarchy regardless of the namespace. > > > > As a result, the restriction on mounting a super block from a > > single user namespace creates regressions for existing uses of > > these filesystems. For these specific filesystems this > > restriction isn't really necessary since the backing store is > > objects in kernel memory and thus the ids assigned from inodes > > is not subject to translation relative to s_user_ns. > > > > Add a new filesystem flag, FS_USERNS_SHARE_SB, which when set > > causes sget_userns() to skip the check of s_user_ns. Set this > > flag for the sysfs and cgroup filesystems to fix the > > regressions. > > So this one needs to be sget_userns(..., &init_user_ns, ...). > And not a new special case. This is actually what I wanted to do, but based on a previous discussion where I had suggested doing this (for a different reason) I came away thinking you did not want it that way. So I'm happy with that change. But if we do that it violates some of the assumptions of the patch to rework MNT_NODEV on your testing branch (and also those behind patch 2 in this series). Something will need to be changed there to prevent a regression in mount behavior when a user ns tries to mount without MNT_NODEV when the mount inherited from its parent has it set. > Apologies for not catching this earlier. Actually this is a more recent patch, so you possibly hadn't seen it before. > I am looking at folding all of this into the patch that introduces > sget_userns so that even bisects won't have regresssions. That's fine with me. Thanks, Seth
Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
On Mon, May 16, 2016 at 11:42:46AM -0500, Eric W. Biederman wrote: > Seth Forshee writes: > > > On Sat, May 14, 2016 at 09:21:55PM -0500, Eric W. Biederman wrote: > >> I have slowly been working with Seth Forshee on these issues as > >> the last thing I want is to introduce more security bugs right now. > >> Seth being a braver man than I am has already merged his changes into > >> the Ubuntu kernel. > > > > Maybe not quite so brave as you think. I also threw on a patch to > > disable the feature unless explicitly enabled by a sys admin. > > > >> James I think you are missing the fact that all filesystems already have > >> the make_kuid and make_kgid calls right where the data comes off disk, > >> and the from_kuid and from_kgid calls right where the on-disk data is > >> being created just before it goes on disk. Which means that the actual > >> impact on filesystems of the translation is trivial. > > > > It is fairly simple but a there's bit more that just id conversions to > > change. With ext4 I found that there were mount options which needed to > > be restricted, some capability checks to update, and access to external > > journal devices must be checked. In all it wasn't a whole lot of changes > > to the filesystem though. Fuse was a bit more involved, but the > > complexities there won't apply to other filesystems. > > > >> Djalal if you could work with Seth I think that would be very useful. I > >> know I am dragging my heels there but I really hope I can dig in and get > >> everything reviewed and merged soonish. > > > > That would make me very happy :-) > > It has missed this merge window :( But I am hoping with am aiming to > review them and get your patches (or modified versions of your patches) > into my tree as soon after rc1 as humanly possible. > > Part of that will have to be the fix for mqueuefs, that Docker just hit. Yeah, I've got a patch that's been tested to fix the bug, so I'll send new patches which include that before long. Seth
Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
On Sat, May 14, 2016 at 09:21:55PM -0500, Eric W. Biederman wrote: > I have slowly been working with Seth Forshee on these issues as > the last thing I want is to introduce more security bugs right now. > Seth being a braver man than I am has already merged his changes into > the Ubuntu kernel. Maybe not quite so brave as you think. I also threw on a patch to disable the feature unless explicitly enabled by a sys admin. > James I think you are missing the fact that all filesystems already have > the make_kuid and make_kgid calls right where the data comes off disk, > and the from_kuid and from_kgid calls right where the on-disk data is > being created just before it goes on disk. Which means that the actual > impact on filesystems of the translation is trivial. It is fairly simple but a there's bit more that just id conversions to change. With ext4 I found that there were mount options which needed to be restricted, some capability checks to update, and access to external journal devices must be checked. In all it wasn't a whole lot of changes to the filesystem though. Fuse was a bit more involved, but the complexities there won't apply to other filesystems. > Djalal if you could work with Seth I think that would be very useful. I > know I am dragging my heels there but I really hope I can dig in and get > everything reviewed and merged soonish. That would make me very happy :-) I'm happy to look with Djalal for commonalities. I did skim his patches before, and based on that all I really expect to find are things related to permission checks when ids don't map. The rest seems fundamentally different. Seth
Re: [RFC v2 PATCH 3/8] fs: Treat foreign mounts as nosuid
On Wed, May 04, 2016 at 11:19:04PM +, Serge Hallyn wrote: > Quoting Djalal Harouni (tix...@gmail.com): > > If a process gets access to a mount from a different user > > namespace, that process should not be able to take advantage of > > setuid files or selinux entrypoints from that filesystem. Prevent > > this by treating mounts from other mount namespaces and those not > > owned by current_user_ns() or an ancestor as nosuid. > > > > This patch was just adapted from the original one that was written > > by Andy Lutomirski > > https://www.redhat.com/archives/dm-devel/2016-April/msg00374.html > > I'm not sure that this makes sense given what you're doing. In the > case of Seth's set, a filesystem is mounted specifically (and privately) > in a user namespace. We don't want for instance the initial user ns > to find a link to a setuid-root exploit left in the container-mounted > filesystem. > > But you are having a parent user namespace mount the fs so that its > children can all access the fs, uid-shifted for convenience. Not > allowing the child namespaces to make use of setuid-root does not > seem applicable here. Right, the problem addressed by this patch probably isn't relevant to this sort of uid shifting. But I think there's another problem that needs to be addressed. bprm_fill_uid() still gets the ids for sxid files unshifted from the inode. We already protect against sxid to any user not in bprm->cred->user_ns, so it will just ignore the sxid instead of e.g. suid as global root from the id shifted mount, which is good. What would be wanted though is to use the shifted ids so that something like suid-root ping in the container rootfs would work. Seth
Re: [RFC PATCH 0/0] VFS:userns: support portable root filesystems
On Wed, May 04, 2016 at 01:21:46AM +0200, Djalal Harouni wrote: > This RFC tries to explore how to support filesystem operations inside > user namespace using only VFS and a per mount namespace solution. This > allows to take advantage of user namespace separations without > introducing any change at the filesystems level. All this is handled > with the virtual view of mount namespaces. > > > 1) Presentation: > > > The main aim is to support portable root filesystems and allow containers, > virtual machines and other cases to use the same root filesystem. > Due to security reasons, filesystems can't be mounted inside user > namespaces, and mounting them outside will not solve the problem since > they will show up with the wrong UIDs/GIDs. Read and write operations > will also fail and so on. > > The current userspace solution is to automatically chown the whole root > filesystem before starting a container, example: > (host) init_user_ns 100:1065536 => (container) user_ns_X1 0:65535 > (host) init_user_ns 200:2065536 => (container) user_ns_Y1 0:65535 > (host) init_user_ns 300:3065536 => (container) user_ns_Z1 0:65535 > ... > > Every time a chown is called, files are changed and so on... This > prevents to have portable filesystems where you can throw anywhere > and boot. Having an extra step to adapt the filesystem to the current > mapping and persist it will not allow to verify its integrity, it makes > snapshots and migration a bit harder, and probably other limitations... > > It seems that there are multiple ways to allow user namespaces combine > nicely with filesystems, but none of them is that easy. The bind mount > and pin the user namespace during mount time will not work, bind mounts > share the same super block, hence you may endup working on the wrong > vfsmount context and there is no easy way to get out of that... > > Using the user namespace in the super block seems the way to go, and > there is the "Support fuse mounts in user namespaces" [1] patches which > seem nice but perhaps too complex!? there is also the overlayfs solution, > and finaly the VFS layer solution. I'm not sure if you're meaning to propose your patches as an alternative to mine or not, but I think they're orthogonal. My goal is to allow containers in user namespaces to mount some subset of filesystem types (not specifically container root filesystems, but in general), which your patches won't enable. Your goal is to share a rootfs between multiple containers with different uid/gid shifts, which my patches don't help with. > We present here a simple VFS solution, everything is packed inside VFS, > filesystems don't need to know anything (except probably XFS, and special > operations inside union filesystems). Currently it supports ext4, btrfs > and overlayfs. Changes into filesystems are small, just parse the > vfs_shift_uids and vfs_shift_gids options during mount and set the > appropriate flags into the super_block structure. > > 1) Filesystems don't need the FS_USERNS_MOUNT flag, so no user > namespace mounting, they stay secure, nothing changes. > > 2) The solution is based on VFS and mount namespaces, we use the user > namespace of the containing mount namespace to check if we should shift > UIDs/GIDs from/to virtual <=> on-disk view. > If a filesystem was mounted with "vfs_shift_uids" and "vfs_shift_gids" > options, and if it shows up inside a mount namespace that supports VFS > UIDs/GIDs shifts then during each access we will remap UID/GID either > to virtual or to on-disk view using simple helper functions to allow the > access. In case the mount or current mount namespace do not support VFS > UID/GID shifts, we fallback to the old behaviour, no shift is performed. > > 3) inodes will always keep their original values which reflect the > mapping inside init_user_ns which we consider the on-disk mapping. > Therfore they will have a mapping from 0:65536 on-disk, these values are > the persistent values that we have to write to the disk. We don't keep > track of any UID/GID shift that was applied before. This gives > portability and allows to use the previous mapping which was freed for > another root filesystem... Sorry, I haven't had time to look at the patches, but how are you handling suid/sgid? Will the process get the ids in the inode or the shifted ids? Thanks, Seth
Re: [PATCH 1/1] simplified security.nscapability xattr
On Fri, Apr 22, 2016 at 12:26:33PM -0500, serge.hal...@ubuntu.com wrote: > From: Serge Hallyn > > This can only be set by root in his own namespace, and will > only be respected by namespaces with that same root kuid > mapped as root, or namespaces descended from it. > > This allows a simple setxattr to work, allows tar/untar to > work, and allows us to tar in one namespace and untar in > another while preserving the capability, without risking > leaking privilege into a parent namespace. > > Signed-off-by: Serge Hallyn Seems like the simplest possible design which meets the requirements. Acked-by: Seth Forshee
[PATCH v4 00/21] Support fuse mounts in user namespaces
Hi Eric, Here's another update to my patches for mouning with fuse from unpivileged user namespaces. The main change here is a fix for a build failure when fuse is built as a module. As usual the series is also available at: git://git.kernel.org/pub/scm/linux/kernel/git/sforshee/linux.git fuse-userns Changes since v3: * Export current_in_userns() to fix an error when fuse is built as a module. * Add comment explaining the conditions for allowing CAP_CHOWN in s_user_ns to change the owner or group of an inode. * Added acks from Serge. Thanks, Seth --- Andy Lutomirski (1): fs: Treat foreign mounts as nosuid Pavel Tikhomirov (1): fs: fix a posible leak of allocated superblock Seth Forshee (19): fs: Remove check of s_user_ns for existing mounts in fs_fully_visible() fs: Allow sysfs and cgroupfs to share super blocks between user namespaces block_dev: Support checking inode permissions in lookup_bdev() block_dev: Check permissions towards block device inode when mounting selinux: Add support for unprivileged mounts from user namespaces userns: Replace in_userns with current_in_userns Smack: Handle labels consistently in untrusted mounts fs: Check for invalid i_uid in may_follow_link() cred: Reject inodes with invalid ids in set_create_file_as() fs: Refuse uid/gid changes which don't map into s_user_ns fs: Update posix_acl support to handle user namespace mounts fs: Allow superblock owner to change ownership of inodes with unmappable ids fs: Don't remove suid for CAP_FSETID in s_user_ns fs: Allow superblock owner to access do_remount_sb() capabilities: Allow privileged user in s_user_ns to set security.* xattrs fuse: Add support for pid namespaces fuse: Support fuse filesystems outside of init_user_ns fuse: Restrict allow_other to the superblock's namespace or a descendant fuse: Allow user namespace mounts drivers/md/bcache/super.c | 2 +- drivers/md/dm-table.c | 2 +- drivers/mtd/mtdsuper.c | 2 +- fs/attr.c | 73 - fs/block_dev.c | 18 -- fs/exec.c | 2 +- fs/fuse/cuse.c | 3 +- fs/fuse/dev.c | 26 +++ fs/fuse/dir.c | 16 - fs/fuse/file.c | 22 ++--- fs/fuse/fuse_i.h| 10 +- fs/fuse/inode.c | 40 ++ fs/inode.c | 3 +- fs/kernfs/inode.c | 2 ++ fs/namei.c | 2 +- fs/namespace.c | 20 --- fs/posix_acl.c | 67 +++-- fs/proc/base.c | 2 ++ fs/proc/generic.c | 3 ++ fs/proc/proc_sysctl.c | 2 ++ fs/quota/quota.c| 2 +- fs/super.c | 7 +++- fs/sysfs/mount.c| 3 +- fs/xattr.c | 19 --- include/linux/fs.h | 3 +- include/linux/mount.h | 1 + include/linux/posix_acl_xattr.h | 17 +++--- include/linux/uidgid.h | 10 ++ include/linux/user_namespace.h | 6 ++-- kernel/cgroup.c | 4 +-- kernel/cred.c | 2 ++ kernel/user_namespace.c | 7 ++-- security/commoncap.c| 22 + security/selinux/hooks.c| 25 +- security/smack/smack_lsm.c | 29 ++-- 35 files changed, 355 insertions(+), 119 deletions(-)
Re: [PATCH v3 14/21] fs: Allow superblock owner to change ownership of inodes with unmappable ids
On Mon, Apr 25, 2016 at 03:30:47PM -0500, Serge E. Hallyn wrote: > Quoting Seth Forshee (seth.fors...@canonical.com): > > In a userns mount some on-disk inodes may have ids which do not > > map into s_user_ns, in which case the in-kernel inodes are owned > > by invalid users. The superblock owner should be able to change > > attributes of these inodes but cannot. However it is unsafe to > > grant the superblock owner privileged access to all inodes in the > > superblock since proc, sysfs, etc. use DAC to protect files which > > may not belong to s_user_ns. The problem is restricted to only > > inodes where the owner or group is an invalid user. > > > > We can work around this by allowing users with CAP_CHOWN in > > s_user_ns to change an invalid owner or group id, so long as the > > other id is either invalid or mappable in s_user_ns. After > > changing ownership the user will be privileged towards the inode > > and thus able to change other attributes. > > > > As an precaution, checks for invalid ids are added to the proc > > and kernfs setattr interfaces. These filesystems are not expected > > to have inodes with invalid ids, but if it does happen any > > setattr operations will return -EPERM. > > > > Signed-off-by: Seth Forshee > > Acked-by: Serge Hallyn > > bug a request below, > > > --- > > fs/attr.c | 47 +++ > > fs/kernfs/inode.c | 2 ++ > > fs/proc/base.c| 2 ++ > > fs/proc/generic.c | 3 +++ > > fs/proc/proc_sysctl.c | 2 ++ > > 5 files changed, 48 insertions(+), 8 deletions(-) > > > > diff --git a/fs/attr.c b/fs/attr.c > > index 3cfaaac4a18e..a8b0931654a5 100644 > > --- a/fs/attr.c > > +++ b/fs/attr.c > > @@ -16,6 +16,43 @@ > > #include > > #include > > > > +static bool chown_ok(const struct inode *inode, kuid_t uid) > > +{ > > + struct user_namespace *user_ns; > > + > > + if (uid_eq(current_fsuid(), inode->i_uid) && uid_eq(uid, inode->i_uid)) > > + return true; > > + if (capable_wrt_inode_uidgid(inode, CAP_CHOWN)) > > + return true; > > + > > + user_ns = inode->i_sb->s_user_ns; > > + if (!uid_valid(inode->i_uid) && > > + (!gid_valid(inode->i_gid) || kgid_has_mapping(user_ns, > > inode->i_gid)) && > > This confused me to no end :) Perhaps a "is_unmapped_valid_gid()" helper > would make it clearer what this is meant to do? Or else maybe a comment > above chown_ok(), explaining that > > 1. for a blockdev, the uid is converted at inode read so that it is > either mapped or invalid > 2. for sysfs / etc, uid can be valid but not mapped into the userns Even with a helper a comment is probably helpful to explain why. I'll do that first, then see if a helper would make things any clearer. Honestly, I had to think about the helper name you proposed for a minute before it made sense even though I already understood the code ;-)
[PATCH v4 10/21] fs: Check for invalid i_uid in may_follow_link()
Filesystem uids which don't map into a user namespace may result in inode->i_uid being INVALID_UID. A symlink and its parent could have different owners in the filesystem can both get mapped to INVALID_UID, which may result in following a symlink when this would not have otherwise been permitted when protected symlinks are enabled. Add a new helper function, uid_valid_eq(), and use this to validate that the ids in may_follow_link() are both equal and valid. Also add an equivalent helper for gids, which is currently unused. Signed-off-by: Seth Forshee Acked-by: Serge Hallyn --- fs/namei.c | 2 +- include/linux/uidgid.h | 10 ++ 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/fs/namei.c b/fs/namei.c index a29094c6f4a1..6fe8b0d8ca90 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -915,7 +915,7 @@ static inline int may_follow_link(struct nameidata *nd) return 0; /* Allowed if parent directory and link owner match. */ - if (uid_eq(parent->i_uid, inode->i_uid)) + if (uid_valid_eq(parent->i_uid, inode->i_uid)) return 0; if (nd->flags & LOOKUP_RCU) diff --git a/include/linux/uidgid.h b/include/linux/uidgid.h index 03835522dfcb..e09529fe2668 100644 --- a/include/linux/uidgid.h +++ b/include/linux/uidgid.h @@ -117,6 +117,16 @@ static inline bool gid_valid(kgid_t gid) return __kgid_val(gid) != (gid_t) -1; } +static inline bool uid_valid_eq(kuid_t left, kuid_t right) +{ + return uid_eq(left, right) && uid_valid(left); +} + +static inline bool gid_valid_eq(kgid_t left, kgid_t right) +{ + return gid_eq(left, right) && gid_valid(left); +} + #ifdef CONFIG_USER_NS extern kuid_t make_kuid(struct user_namespace *from, uid_t uid); -- 2.7.4
[PATCH v4 07/21] selinux: Add support for unprivileged mounts from user namespaces
Security labels from unprivileged mounts in user namespaces must be ignored. Force superblocks from user namespaces whose labeling behavior is to use xattrs to use mountpoint labeling instead. For the mountpoint label, default to converting the current task context into a form suitable for file objects, but also allow the policy writer to specify a different label through policy transition rules. Pieced together from code snippets provided by Stephen Smalley. Signed-off-by: Seth Forshee Acked-by: Stephen Smalley Acked-by: James Morris --- security/selinux/hooks.c | 23 +++ 1 file changed, 23 insertions(+) diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index 1350167635cb..33beed3ac589 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -820,6 +820,28 @@ static int selinux_set_mnt_opts(struct super_block *sb, goto out; } } + + /* +* If this is a user namespace mount, no contexts are allowed +* on the command line and security labels must be ignored. +*/ + if (sb->s_user_ns != &init_user_ns) { + if (context_sid || fscontext_sid || rootcontext_sid || + defcontext_sid) { + rc = -EACCES; + goto out; + } + if (sbsec->behavior == SECURITY_FS_USE_XATTR) { + sbsec->behavior = SECURITY_FS_USE_MNTPOINT; + rc = security_transition_sid(current_sid(), current_sid(), +SECCLASS_FILE, NULL, +&sbsec->mntpoint_sid); + if (rc) + goto out; + } + goto out_set_opts; + } + /* sets the context of the superblock for the fs being mounted. */ if (fscontext_sid) { rc = may_context_mount_sb_relabel(fscontext_sid, sbsec, cred); @@ -888,6 +910,7 @@ static int selinux_set_mnt_opts(struct super_block *sb, sbsec->def_sid = defcontext_sid; } +out_set_opts: rc = sb_finish_set_opts(sb); out: mutex_unlock(&sbsec->lock); -- 2.7.4
[PATCH v4 12/21] fs: Refuse uid/gid changes which don't map into s_user_ns
Add checks to inode_change_ok to verify that uid and gid changes will map into the superblock's user namespace. If they do not fail with -EOVERFLOW. This cannot be overriden with ATTR_FORCE. Signed-off-by: Seth Forshee Acked-by: Serge Hallyn --- fs/attr.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/fs/attr.c b/fs/attr.c index 25b24d0f6c88..3cfaaac4a18e 100644 --- a/fs/attr.c +++ b/fs/attr.c @@ -42,6 +42,17 @@ int inode_change_ok(const struct inode *inode, struct iattr *attr) return error; } + /* +* Verify that uid/gid changes are valid in the target namespace +* of the superblock. This cannot be overriden using ATTR_FORCE. +*/ + if (ia_valid & ATTR_UID && + from_kuid(inode->i_sb->s_user_ns, attr->ia_uid) == (uid_t)-1) + return -EOVERFLOW; + if (ia_valid & ATTR_GID && + from_kgid(inode->i_sb->s_user_ns, attr->ia_gid) == (gid_t)-1) + return -EOVERFLOW; + /* If force is set do it anyway. */ if (ia_valid & ATTR_FORCE) return 0; -- 2.7.4
[PATCH v4 02/21] fs: Remove check of s_user_ns for existing mounts in fs_fully_visible()
fs_fully_visible() ignores MNT_LOCK_NODEV when FS_USERS_DEV_MOUNT is not set for the filesystem, but there is a bug in the logic that may cause mounting to fail. It is doing this only when the existing mount is not in init_user_ns but should check the new mount instead. But the new mount is always in a non-init namespace when fs_fully_visible() is called, so that condition can simply be removed. Signed-off-by: Seth Forshee --- fs/namespace.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index f20c82f91ecb..c133318bec35 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3234,8 +3234,7 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags) mnt_flags = mnt->mnt.mnt_flags; if (mnt->mnt.mnt_sb->s_iflags & SB_I_NOEXEC) mnt_flags &= ~(MNT_LOCK_NOSUID | MNT_LOCK_NOEXEC); - if (mnt->mnt.mnt_sb->s_user_ns != &init_user_ns && - !(mnt->mnt.mnt_sb->s_type->fs_flags & FS_USERNS_DEV_MOUNT)) + if (!(mnt->mnt.mnt_sb->s_type->fs_flags & FS_USERNS_DEV_MOUNT)) mnt_flags &= ~(MNT_LOCK_NODEV); /* Verify the mount flags are equal to or more permissive -- 2.7.4
[PATCH v4 11/21] cred: Reject inodes with invalid ids in set_create_file_as()
Using INVALID_[UG]ID for the LSM file creation context doesn't make sense, so return an error if the inode passed to set_create_file_as() has an invalid id. Signed-off-by: Seth Forshee Acked-by: Serge Hallyn --- kernel/cred.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/cred.c b/kernel/cred.c index 0c0cd8a62285..5f264fb5737d 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -689,6 +689,8 @@ EXPORT_SYMBOL(set_security_override_from_ctx); */ int set_create_files_as(struct cred *new, struct inode *inode) { + if (!uid_valid(inode->i_uid) || !gid_valid(inode->i_gid)) + return -EINVAL; new->fsuid = inode->i_uid; new->fsgid = inode->i_gid; return security_kernel_create_files_as(new, inode); -- 2.7.4
[PATCH v4 03/21] fs: Allow sysfs and cgroupfs to share super blocks between user namespaces
Both of these filesystems already have use cases for mounting the same super block from multiple user namespaces. For sysfs this happens when using criu for snapshotting a container, where sysfs is mounted in the containers network ns but the hosts user ns. The cgroup filesystem shares the same super block for all mounts of the same hierarchy regardless of the namespace. As a result, the restriction on mounting a super block from a single user namespace creates regressions for existing uses of these filesystems. For these specific filesystems this restriction isn't really necessary since the backing store is objects in kernel memory and thus the ids assigned from inodes is not subject to translation relative to s_user_ns. Add a new filesystem flag, FS_USERNS_SHARE_SB, which when set causes sget_userns() to skip the check of s_user_ns. Set this flag for the sysfs and cgroup filesystems to fix the regressions. Signed-off-by: Seth Forshee Acked-by: Serge Hallyn --- fs/super.c | 3 ++- fs/sysfs/mount.c | 3 ++- include/linux/fs.h | 1 + kernel/cgroup.c| 4 ++-- 4 files changed, 7 insertions(+), 4 deletions(-) diff --git a/fs/super.c b/fs/super.c index 092a7828442e..ead156b44bf8 100644 --- a/fs/super.c +++ b/fs/super.c @@ -472,7 +472,8 @@ retry: hlist_for_each_entry(old, &type->fs_supers, s_instances) { if (!test(old, data)) continue; - if (user_ns != old->s_user_ns) { + if (!(type->fs_flags & FS_USERNS_SHARE_SB) && + user_ns != old->s_user_ns) { spin_unlock(&sb_lock); if (s) { up_write(&s->s_umount); diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c index f3db82071cfb..9555accd4322 100644 --- a/fs/sysfs/mount.c +++ b/fs/sysfs/mount.c @@ -59,7 +59,8 @@ static struct file_system_type sysfs_fs_type = { .name = "sysfs", .mount = sysfs_mount, .kill_sb= sysfs_kill_sb, - .fs_flags = FS_USERNS_VISIBLE | FS_USERNS_MOUNT, + .fs_flags = FS_USERNS_VISIBLE | FS_USERNS_MOUNT | + FS_USERNS_SHARE_SB, }; int __init sysfs_init(void) diff --git a/include/linux/fs.h b/include/linux/fs.h index be0f8023e28c..66a639ec1bc4 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1988,6 +1988,7 @@ struct file_system_type { #define FS_USERNS_MOUNT8 /* Can be mounted by userns root */ #define FS_USERNS_DEV_MOUNT16 /* A userns mount does not imply MNT_NODEV */ #define FS_USERNS_VISIBLE 32 /* FS must already be visible */ +#define FS_USERNS_SHARE_SB 64 /* Allow sharing sb between userns-es */ #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */ struct dentry *(*mount) (struct file_system_type *, int, const char *, void *); diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 671dc05c0b0f..9c9aa27e531a 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -2247,14 +2247,14 @@ static struct file_system_type cgroup_fs_type = { .name = "cgroup", .mount = cgroup_mount, .kill_sb = cgroup_kill_sb, - .fs_flags = FS_USERNS_MOUNT, + .fs_flags = FS_USERNS_MOUNT | FS_USERNS_SHARE_SB, }; static struct file_system_type cgroup2_fs_type = { .name = "cgroup2", .mount = cgroup_mount, .kill_sb = cgroup_kill_sb, - .fs_flags = FS_USERNS_MOUNT, + .fs_flags = FS_USERNS_MOUNT | FS_USERNS_SHARE_SB, }; static char *cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen, -- 2.7.4
[PATCH v4 08/21] userns: Replace in_userns with current_in_userns
All current callers of in_userns pass current_user_ns as the first argument. Simplify by replacing in_userns with current_in_userns which checks whether current_user_ns is in the namespace supplied as an argument. Signed-off-by: Seth Forshee Acked-by: James Morris Acked-by: Serge Hallyn --- fs/namespace.c | 2 +- include/linux/user_namespace.h | 6 ++ kernel/user_namespace.c| 6 +++--- security/commoncap.c | 2 +- 4 files changed, 7 insertions(+), 9 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 6e9db4c166b4..0ad8e4a4f50b 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3294,7 +3294,7 @@ bool mnt_may_suid(struct vfsmount *mnt) * in other namespaces. */ return !(mnt->mnt_flags & MNT_NOSUID) && check_mnt(real_mount(mnt)) && - in_userns(current_user_ns(), mnt->mnt_sb->s_user_ns); + current_in_userns(mnt->mnt_sb->s_user_ns); } static struct ns_common *mntns_get(struct task_struct *task) diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index a43faa727124..9217169c64cb 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -72,8 +72,7 @@ extern ssize_t proc_projid_map_write(struct file *, const char __user *, size_t, extern ssize_t proc_setgroups_write(struct file *, const char __user *, size_t, loff_t *); extern int proc_setgroups_show(struct seq_file *m, void *v); extern bool userns_may_setgroups(const struct user_namespace *ns); -extern bool in_userns(const struct user_namespace *ns, - const struct user_namespace *target_ns); +extern bool current_in_userns(const struct user_namespace *target_ns); #else static inline struct user_namespace *get_user_ns(struct user_namespace *ns) @@ -103,8 +102,7 @@ static inline bool userns_may_setgroups(const struct user_namespace *ns) return true; } -static inline bool in_userns(const struct user_namespace *ns, -const struct user_namespace *target_ns) +static inline bool current_in_userns(const struct user_namespace *target_ns) { return true; } diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index dee3be5445da..68f594212759 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -942,10 +942,10 @@ bool userns_may_setgroups(const struct user_namespace *ns) * Returns true if @ns is the same namespace as or a descendant of * @target_ns. */ -bool in_userns(const struct user_namespace *ns, - const struct user_namespace *target_ns) +bool current_in_userns(const struct user_namespace *target_ns) { - for (; ns; ns = ns->parent) { + struct user_namespace *ns; + for (ns = current_user_ns(); ns; ns = ns->parent) { if (ns == target_ns) return true; } diff --git a/security/commoncap.c b/security/commoncap.c index a306c5d90709..e657227d221e 100644 --- a/security/commoncap.c +++ b/security/commoncap.c @@ -461,7 +461,7 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c * explicit that capability bits are limited to s_user_ns and its * descendants. */ - if (!in_userns(current_user_ns(), bprm->file->f_path.mnt->mnt_sb->s_user_ns)) + if (!current_in_userns(bprm->file->f_path.mnt->mnt_sb->s_user_ns)) return 0; rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps); -- 2.7.4
[PATCH v4 04/21] block_dev: Support checking inode permissions in lookup_bdev()
When looking up a block device by path no permission check is done to verify that the user has access to the block device inode at the specified path. In some cases it may be necessary to check permissions towards the inode, such as allowing unprivileged users to mount block devices in user namespaces. Add an argument to lookup_bdev() to optionally perform this permission check. A value of 0 skips the permission check and behaves the same as before. A non-zero value specifies the mask of access rights required towards the inode at the specified path. The check is always skipped if the user has CAP_SYS_ADMIN. All callers of lookup_bdev() currently pass a mask of 0, so this patch results in no functional change. Subsequent patches will add permission checks where appropriate. Signed-off-by: Seth Forshee Acked-by: Serge Hallyn --- drivers/md/bcache/super.c | 2 +- drivers/md/dm-table.c | 2 +- drivers/mtd/mtdsuper.c| 2 +- fs/block_dev.c| 13 ++--- fs/quota/quota.c | 2 +- include/linux/fs.h| 2 +- 6 files changed, 15 insertions(+), 8 deletions(-) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index a296425a7270..e169739a0253 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1950,7 +1950,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr, sb); if (IS_ERR(bdev)) { if (bdev == ERR_PTR(-EBUSY)) { - bdev = lookup_bdev(strim(path)); + bdev = lookup_bdev(strim(path), 0); mutex_lock(&bch_register_lock); if (!IS_ERR(bdev) && bch_is_open(bdev)) err = "device already registered"; diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c index f9e8f0bef332..13f568d527b5 100644 --- a/drivers/md/dm-table.c +++ b/drivers/md/dm-table.c @@ -372,7 +372,7 @@ dev_t dm_get_dev_t(const char *path) dev_t uninitialized_var(dev); struct block_device *bdev; - bdev = lookup_bdev(path); + bdev = lookup_bdev(path, 0); if (IS_ERR(bdev)) dev = name_to_dev_t(path); else { diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c index 20c02a3b7417..b5b60e1af31c 100644 --- a/drivers/mtd/mtdsuper.c +++ b/drivers/mtd/mtdsuper.c @@ -176,7 +176,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags, /* try the old way - the hack where we allowed users to mount * /dev/mtdblock$(n) but didn't actually _use_ the blockdev */ - bdev = lookup_bdev(dev_name); + bdev = lookup_bdev(dev_name, 0); if (IS_ERR(bdev)) { ret = PTR_ERR(bdev); pr_debug("MTDSB: lookup_bdev() returned %d\n", ret); diff --git a/fs/block_dev.c b/fs/block_dev.c index 3e84d62d0a25..e9b937845bdb 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1431,7 +1431,7 @@ struct block_device *blkdev_get_by_path(const char *path, fmode_t mode, struct block_device *bdev; int err; - bdev = lookup_bdev(path); + bdev = lookup_bdev(path, 0); if (IS_ERR(bdev)) return bdev; @@ -1821,12 +1821,14 @@ EXPORT_SYMBOL(ioctl_by_bdev); /** * lookup_bdev - lookup a struct block_device by name * @pathname: special file representing the block device + * @mask: rights to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC) * * Get a reference to the blockdevice at @pathname in the current * namespace if possible and return it. Return ERR_PTR(error) - * otherwise. + * otherwise. If @mask is non-zero, check for access rights to the + * inode at @pathname. */ -struct block_device *lookup_bdev(const char *pathname) +struct block_device *lookup_bdev(const char *pathname, int mask) { struct block_device *bdev; struct inode *inode; @@ -1841,6 +1843,11 @@ struct block_device *lookup_bdev(const char *pathname) return ERR_PTR(error); inode = d_backing_inode(path.dentry); + if (mask != 0 && !capable(CAP_SYS_ADMIN)) { + error = __inode_permission(inode, mask); + if (error) + goto fail; + } error = -ENOTBLK; if (!S_ISBLK(inode->i_mode)) goto fail; diff --git a/fs/quota/quota.c b/fs/quota/quota.c index 0f10ee9892ce..59223384b22e 100644 --- a/fs/quota/quota.c +++ b/fs/quota/quota.c @@ -799,7 +799,7 @@ static struct super_block *quotactl_block(const char __user *special, int cmd) if (IS_ERR(tmp)) return ERR_CAST(tmp); - bdev = lookup_bdev(tmp->name); + bdev = lookup_bdev(tmp->name, 0); putname(tmp); if (IS_ERR(bdev)) return ERR_CAST(bdev); diff --git a/include/linux/fs.h b/include/linux/fs.h index 66a639ec1bc4..173b8adc6131 10
[PATCH v4 15/21] fs: Don't remove suid for CAP_FSETID in s_user_ns
Expand the check in should_remove_suid() to keep privileges for CAP_FSETID in s_user_ns rather than init_user_ns. Signed-off-by: Seth Forshee Acked-by: Serge Hallyn --- fs/inode.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/inode.c b/fs/inode.c index 69b8b526c194..cd52170f9117 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1690,7 +1690,8 @@ int should_remove_suid(struct dentry *dentry) if (unlikely((mode & S_ISGID) && (mode & S_IXGRP))) kill |= ATTR_KILL_SGID; - if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode))) + if (unlikely(kill && !ns_capable(dentry->d_sb->s_user_ns, CAP_FSETID) && +S_ISREG(mode))) return kill; return 0; -- 2.7.4
[PATCH v4 14/21] fs: Allow superblock owner to change ownership of inodes with unmappable ids
In a userns mount some on-disk inodes may have ids which do not map into s_user_ns, in which case the in-kernel inodes are owned by invalid users. The superblock owner should be able to change attributes of these inodes but cannot. However it is unsafe to grant the superblock owner privileged access to all inodes in the superblock since proc, sysfs, etc. use DAC to protect files which may not belong to s_user_ns. The problem is restricted to only inodes where the owner or group is an invalid user. We can work around this by allowing users with CAP_CHOWN in s_user_ns to change an invalid owner or group id, so long as the other id is either invalid or mappable in s_user_ns. After changing ownership the user will be privileged towards the inode and thus able to change other attributes. As an precaution, checks for invalid ids are added to the proc and kernfs setattr interfaces. These filesystems are not expected to have inodes with invalid ids, but if it does happen any setattr operations will return -EPERM. Signed-off-by: Seth Forshee Acked-by: Serge Hallyn --- fs/attr.c | 62 --- fs/kernfs/inode.c | 2 ++ fs/proc/base.c| 2 ++ fs/proc/generic.c | 3 +++ fs/proc/proc_sysctl.c | 2 ++ 5 files changed, 63 insertions(+), 8 deletions(-) diff --git a/fs/attr.c b/fs/attr.c index 3cfaaac4a18e..06bb3f401559 100644 --- a/fs/attr.c +++ b/fs/attr.c @@ -16,6 +16,58 @@ #include #include +static bool chown_ok(const struct inode *inode, kuid_t uid) +{ + struct user_namespace *user_ns; + + if (uid_eq(current_fsuid(), inode->i_uid) && uid_eq(uid, inode->i_uid)) + return true; + if (capable_wrt_inode_uidgid(inode, CAP_CHOWN)) + return true; + + /* +* Inode uids/gids are of type kuid_t/kgid_t. As such, they can be +* a) INVALID_UID/INVALID_GID, b) valid and mappable into +* i_sb->s_user_ns, or c) valid but not mappable into +* i_sb->s_user_ns. +* +* For filesystems on user-supplied media ids will either be (a) or +* (b), so we permit CAP_CHOWN in s_user_ns to change INVALID_UID if +* the gid meets these conditions (and vice versa for INVALID_GID). +* +* For psuedo-filesystems like proc or sysfs ids will be either (b) +* or (c), so these conditions do not permit namespace-root to chown +* in those filesystems. +*/ + user_ns = inode->i_sb->s_user_ns; + if (!uid_valid(inode->i_uid) && + (!gid_valid(inode->i_gid) || kgid_has_mapping(user_ns, inode->i_gid)) && + ns_capable(user_ns, CAP_CHOWN)) + return true; + + return false; +} + +static bool chgrp_ok(const struct inode *inode, kgid_t gid) +{ + struct user_namespace *user_ns; + + if (uid_eq(current_fsuid(), inode->i_uid) && + (in_group_p(gid) || gid_eq(gid, inode->i_gid))) + return true; + if (capable_wrt_inode_uidgid(inode, CAP_CHOWN)) + return true; + + /* Logic here is the same as in chown_ok(); see comment there. */ + user_ns = inode->i_sb->s_user_ns; + if (!gid_valid(inode->i_gid) && + (!uid_valid(inode->i_uid) || kuid_has_mapping(user_ns, inode->i_uid)) && + ns_capable(user_ns, CAP_CHOWN)) + return true; + + return false; +} + /** * inode_change_ok - check if attribute changes to an inode are allowed * @inode: inode to check @@ -58,17 +110,11 @@ int inode_change_ok(const struct inode *inode, struct iattr *attr) return 0; /* Make sure a caller can chown. */ - if ((ia_valid & ATTR_UID) && - (!uid_eq(current_fsuid(), inode->i_uid) || -!uid_eq(attr->ia_uid, inode->i_uid)) && - !capable_wrt_inode_uidgid(inode, CAP_CHOWN)) + if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid)) return -EPERM; /* Make sure caller can chgrp. */ - if ((ia_valid & ATTR_GID) && - (!uid_eq(current_fsuid(), inode->i_uid) || - (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) && - !capable_wrt_inode_uidgid(inode, CAP_CHOWN)) + if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid)) return -EPERM; /* Make sure a caller can chmod. */ diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c index 16405ae88d2d..2e97a337ee5f 100644 --- a/fs/kernfs/inode.c +++ b/fs/kernfs/inode.c @@ -117,6 +117,8 @@ int kernfs_iop_setattr(struct dentry *dentry, struct iattr *iattr) if (!kn) return -EINVAL; + if (!uid_valid(inode->i_uid) || !gid_valid(inode->i_gid)) + return -EPER
[PATCH v4 13/21] fs: Update posix_acl support to handle user namespace mounts
ids in on-disk ACLs should be converted to s_user_ns instead of init_user_ns as is done now. This introduces the possibility for id mappings to fail, and when this happens syscalls will return EOVERFLOW. Signed-off-by: Seth Forshee Acked-by: Serge Hallyn --- fs/posix_acl.c | 67 ++--- fs/xattr.c | 19 +--- include/linux/posix_acl_xattr.h | 17 --- 3 files changed, 70 insertions(+), 33 deletions(-) diff --git a/fs/posix_acl.c b/fs/posix_acl.c index 711dd5170376..dac2842dd4cb 100644 --- a/fs/posix_acl.c +++ b/fs/posix_acl.c @@ -595,59 +595,77 @@ EXPORT_SYMBOL_GPL(posix_acl_create); /* * Fix up the uids and gids in posix acl extended attributes in place. */ -static void posix_acl_fix_xattr_userns( +static int posix_acl_fix_xattr_userns( struct user_namespace *to, struct user_namespace *from, void *value, size_t size) { posix_acl_xattr_header *header = (posix_acl_xattr_header *)value; posix_acl_xattr_entry *entry = (posix_acl_xattr_entry *)(header+1), *end; int count; - kuid_t uid; - kgid_t gid; + kuid_t kuid; + uid_t uid; + kgid_t kgid; + gid_t gid; if (!value) - return; + return 0; if (size < sizeof(posix_acl_xattr_header)) - return; + return 0; if (header->a_version != cpu_to_le32(POSIX_ACL_XATTR_VERSION)) - return; + return 0; count = posix_acl_xattr_count(size); if (count < 0) - return; + return 0; if (count == 0) - return; + return 0; for (end = entry + count; entry != end; entry++) { switch(le16_to_cpu(entry->e_tag)) { case ACL_USER: - uid = make_kuid(from, le32_to_cpu(entry->e_id)); - entry->e_id = cpu_to_le32(from_kuid(to, uid)); + kuid = make_kuid(from, le32_to_cpu(entry->e_id)); + if (!uid_valid(kuid)) + return -EOVERFLOW; + uid = from_kuid(to, kuid); + if (uid == (uid_t)-1) + return -EOVERFLOW; + entry->e_id = cpu_to_le32(uid); break; case ACL_GROUP: - gid = make_kgid(from, le32_to_cpu(entry->e_id)); - entry->e_id = cpu_to_le32(from_kgid(to, gid)); + kgid = make_kgid(from, le32_to_cpu(entry->e_id)); + if (!gid_valid(kgid)) + return -EOVERFLOW; + gid = from_kgid(to, kgid); + if (gid == (gid_t)-1) + return -EOVERFLOW; + entry->e_id = cpu_to_le32(gid); break; default: break; } } + + return 0; } -void posix_acl_fix_xattr_from_user(void *value, size_t size) +int +posix_acl_fix_xattr_from_user(struct user_namespace *target_ns, void *value, + size_t size) { - struct user_namespace *user_ns = current_user_ns(); - if (user_ns == &init_user_ns) - return; - posix_acl_fix_xattr_userns(&init_user_ns, user_ns, value, size); + struct user_namespace *source_ns = current_user_ns(); + if (source_ns == target_ns) + return 0; + return posix_acl_fix_xattr_userns(target_ns, source_ns, value, size); } -void posix_acl_fix_xattr_to_user(void *value, size_t size) +int +posix_acl_fix_xattr_to_user(struct user_namespace *source_ns, void *value, + size_t size) { - struct user_namespace *user_ns = current_user_ns(); - if (user_ns == &init_user_ns) - return; - posix_acl_fix_xattr_userns(user_ns, &init_user_ns, value, size); + struct user_namespace *target_ns = current_user_ns(); + if (target_ns == source_ns) + return 0; + return posix_acl_fix_xattr_userns(target_ns, source_ns, value, size); } /* @@ -780,7 +798,7 @@ posix_acl_xattr_get(const struct xattr_handler *handler, if (acl == NULL) return -ENODATA; - error = posix_acl_to_xattr(&init_user_ns, acl, value, size); + error = posix_acl_to_xattr(dentry->d_sb->s_user_ns, acl, value, size); posix_acl_release(acl); return error; @@ -806,7 +824,8 @@ posix_acl_xattr_set(const struct xattr_handler *handler, return -EPERM; if (value) { - acl = posix_acl_from_xattr(&init_user_ns, value, size); + acl = posix_acl_from_xattr(dentry->d_sb->s_user_ns, value, +
[PATCH v4 06/21] fs: Treat foreign mounts as nosuid
From: Andy Lutomirski If a process gets access to a mount from a different user namespace, that process should not be able to take advantage of setuid files or selinux entrypoints from that filesystem. Prevent this by treating mounts from other mount namespaces and those not owned by current_user_ns() or an ancestor as nosuid. This will make it safer to allow more complex filesystems to be mounted in non-root user namespaces. This does not remove the need for MNT_LOCK_NOSUID. The setuid, setgid, and file capability bits can no longer be abused if code in a user namespace were to clear nosuid on an untrusted filesystem, but this patch, by itself, is insufficient to protect the system from abuse of files that, when execed, would increase MAC privilege. As a more concrete explanation, any task that can manipulate a vfsmount associated with a given user namespace already has capabilities in that namespace and all of its descendents. If they can cause a malicious setuid, setgid, or file-caps executable to appear in that mount, then that executable will only allow them to elevate privileges in exactly the set of namespaces in which they are already privileges. On the other hand, if they can cause a malicious executable to appear with a dangerous MAC label, running it could change the caller's security context in a way that should not have been possible, even inside the namespace in which the task is confined. As a hardening measure, this would have made CVE-2014-5207 much more difficult to exploit. Signed-off-by: Andy Lutomirski Signed-off-by: Seth Forshee Acked-by: James Morris Acked-by: Serge Hallyn --- fs/exec.c| 2 +- fs/namespace.c | 13 + include/linux/mount.h| 1 + security/commoncap.c | 8 +++- security/selinux/hooks.c | 2 +- 5 files changed, 23 insertions(+), 3 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index c4010b8207a1..706088dd0cb1 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1391,7 +1391,7 @@ static void bprm_fill_uid(struct linux_binprm *bprm) bprm->cred->euid = current_euid(); bprm->cred->egid = current_egid(); - if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID) + if (!mnt_may_suid(bprm->file->f_path.mnt)) return; if (task_no_new_privs(current)) diff --git a/fs/namespace.c b/fs/namespace.c index c133318bec35..6e9db4c166b4 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3284,6 +3284,19 @@ found: return visible; } +bool mnt_may_suid(struct vfsmount *mnt) +{ + /* +* Foreign mounts (accessed via fchdir or through /proc +* symlinks) are always treated as if they are nosuid. This +* prevents namespaces from trusting potentially unsafe +* suid/sgid bits, file caps, or security labels that originate +* in other namespaces. +*/ + return !(mnt->mnt_flags & MNT_NOSUID) && check_mnt(real_mount(mnt)) && + in_userns(current_user_ns(), mnt->mnt_sb->s_user_ns); +} + static struct ns_common *mntns_get(struct task_struct *task) { struct ns_common *ns = NULL; diff --git a/include/linux/mount.h b/include/linux/mount.h index f822c3c11377..54a594d49733 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -81,6 +81,7 @@ extern void mntput(struct vfsmount *mnt); extern struct vfsmount *mntget(struct vfsmount *mnt); extern struct vfsmount *mnt_clone_internal(struct path *path); extern int __mnt_is_readonly(struct vfsmount *mnt); +extern bool mnt_may_suid(struct vfsmount *mnt); struct path; extern struct vfsmount *clone_private_mount(struct path *path); diff --git a/security/commoncap.c b/security/commoncap.c index 8912ef117faa..a306c5d90709 100644 --- a/security/commoncap.c +++ b/security/commoncap.c @@ -453,8 +453,14 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c if (!file_caps_enabled) return 0; - if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID) + if (!mnt_may_suid(bprm->file->f_path.mnt)) return 0; + + /* +* This check is redundant with mnt_may_suid() but is kept to make +* explicit that capability bits are limited to s_user_ns and its +* descendants. +*/ if (!in_userns(current_user_ns(), bprm->file->f_path.mnt->mnt_sb->s_user_ns)) return 0; diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index 912deee3f01e..1350167635cb 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -2234,7 +2234,7 @@ static int check_nnp_nosuid(const struct linux_binprm *bprm, const struct task_security_struct *new_tsec) { int nnp = (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS); - int nosuid = (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID); + int nosuid = !mnt_may_suid(bprm->file->f_path.mnt); int rc; if (!nnp && !nosuid) -- 2.7.4
[PATCH v4 19/21] fuse: Support fuse filesystems outside of init_user_ns
In order to support mounts from namespaces other than init_user_ns, fuse must translate uids and gids to/from the userns of the process servicing requests on /dev/fuse. This patch does that, with a couple of restrictions on the namespace: - The userns for the fuse connection is fixed to the namespace from which /dev/fuse is opened. - The namespace must be the same as s_user_ns. These restrictions simplify the implementation by avoiding the need to pass around userns references and by allowing fuse to rely on the checks in inode_change_ok for ownership changes. Either restriction could be relaxed in the future if needed. For cuse the namespace used for the connection is also simply current_user_ns() at the time /dev/cuse is opened. Signed-off-by: Seth Forshee --- fs/fuse/cuse.c | 3 ++- fs/fuse/dev.c| 13 - fs/fuse/dir.c| 14 +++--- fs/fuse/fuse_i.h | 6 +- fs/fuse/inode.c | 33 + 5 files changed, 43 insertions(+), 26 deletions(-) diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c index c5b6b7165489..98ebd0f4fd4c 100644 --- a/fs/fuse/cuse.c +++ b/fs/fuse/cuse.c @@ -48,6 +48,7 @@ #include #include #include +#include #include "fuse_i.h" @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file) if (!cc) return -ENOMEM; - fuse_conn_init(&cc->fc); + fuse_conn_init(&cc->fc, current_user_ns()); fud = fuse_dev_alloc(&cc->fc); if (!fud) { diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 4e91b2ac25a7..8fa1ce934df3 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -127,8 +127,8 @@ static void __fuse_put_request(struct fuse_req *req) static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req) { - req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid()); - req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid()); + req->in.h.uid = from_kuid(fc->user_ns, current_fsuid()); + req->in.h.gid = from_kgid(fc->user_ns, current_fsgid()); req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns); } @@ -186,7 +186,8 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages, __set_bit(FR_WAITING, &req->flags); if (for_background) __set_bit(FR_BACKGROUND, &req->flags); - if (req->in.h.pid == 0) { + if (req->in.h.pid == 0 || req->in.h.uid == (uid_t)-1 || + req->in.h.gid == (gid_t)-1) { fuse_put_request(fc, req); return ERR_PTR(-EOVERFLOW); } @@ -1248,7 +1249,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file, struct fuse_in *in; unsigned reqsize; - if (task_active_pid_ns(current) != fc->pid_ns) + if (task_active_pid_ns(current) != fc->pid_ns || + current_user_ns() != fc->user_ns) return -EIO; restart: @@ -1880,7 +1882,8 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud, struct fuse_req *req; struct fuse_out_header oh; - if (task_active_pid_ns(current) != fc->pid_ns) + if (task_active_pid_ns(current) != fc->pid_ns || + current_user_ns() != fc->user_ns) return -EIO; if (nbytes < sizeof(struct fuse_out_header)) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 4b855b65d457..ecba75bf6640 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -841,8 +841,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr, stat->ino = attr->ino; stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 0); stat->nlink = attr->nlink; - stat->uid = make_kuid(&init_user_ns, attr->uid); - stat->gid = make_kgid(&init_user_ns, attr->gid); + stat->uid = make_kuid(fc->user_ns, attr->uid); + stat->gid = make_kgid(fc->user_ns, attr->gid); stat->rdev = inode->i_rdev; stat->atime.tv_sec = attr->atime; stat->atime.tv_nsec = attr->atimensec; @@ -1459,17 +1459,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime) return true; } -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg, - bool trust_local_cmtime) +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr, + struct fuse_setattr_in *arg, bool trust_local_cmtime) { unsigned ivalid = iattr->ia_valid; if (ivalid & ATTR_MODE) arg->valid |= FATTR_MODE, arg->mode = iattr->ia_mode; if (ivalid & ATTR_UID) - arg->valid |= FATTR_UID,arg->uid = from_kuid(&init_user_ns, iattr->ia_uid); + arg->valid |= F
[PATCH v4 20/21] fuse: Restrict allow_other to the superblock's namespace or a descendant
Unprivileged users are normally restricted from mounting with the allow_other option by system policy, but this could be bypassed for a mount done with user namespace root permissions. In such cases allow_other should not allow users outside the userns to access the mount as doing so would give the unprivileged user the ability to manipulate processes it would otherwise be unable to manipulate. Restrict allow_other to apply to users in the same userns used at mount or a descendant of that namespace. Also export current_in_userns() for use by fuse when built as a module. Signed-off-by: Seth Forshee Acked-by: Serge Hallyn Acked-by: Miklos Szeredi --- fs/fuse/dir.c | 2 +- kernel/user_namespace.c | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index ecba75bf6640..1a6c5af49608 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -1015,7 +1015,7 @@ int fuse_allow_current_process(struct fuse_conn *fc) const struct cred *cred; if (fc->flags & FUSE_ALLOW_OTHER) - return 1; + return current_in_userns(fc->user_ns); cred = current_cred(); if (uid_eq(cred->euid, fc->user_id) && diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 68f594212759..fa2294e14b77 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -951,6 +951,7 @@ bool current_in_userns(const struct user_namespace *target_ns) } return false; } +EXPORT_SYMBOL(current_in_userns); static inline struct user_namespace *to_user_ns(struct ns_common *ns) { -- 2.7.4
[PATCH v4 21/21] fuse: Allow user namespace mounts
Signed-off-by: Seth Forshee Acked-by: Miklos Szeredi --- fs/fuse/inode.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 0a771145d853..254f1944ee98 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -1199,7 +1199,7 @@ static void fuse_kill_sb_anon(struct super_block *sb) static struct file_system_type fuse_fs_type = { .owner = THIS_MODULE, .name = "fuse", - .fs_flags = FS_HAS_SUBTYPE, + .fs_flags = FS_HAS_SUBTYPE | FS_USERNS_MOUNT, .mount = fuse_mount, .kill_sb= fuse_kill_sb_anon, }; @@ -1231,7 +1231,7 @@ static struct file_system_type fuseblk_fs_type = { .name = "fuseblk", .mount = fuse_mount_blk, .kill_sb= fuse_kill_sb_blk, - .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE, + .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT, }; MODULE_ALIAS_FS("fuseblk"); -- 2.7.4
[PATCH v4 17/21] capabilities: Allow privileged user in s_user_ns to set security.* xattrs
A privileged user in s_user_ns will generally have the ability to manipulate the backing store and insert security.* xattrs into the filesystem directly. Therefore the kernel must be prepared to handle these xattrs from unprivileged mounts, and it makes little sense for commoncap to prevent writing these xattrs to the filesystem. The capability and LSM code have already been updated to appropriately handle xattrs from unprivileged mounts, so it is safe to loosen this restriction on setting xattrs. The exception to this logic is that writing xattrs to a mounted filesystem may also cause the LSM inode_post_setxattr or inode_setsecurity callbacks to be invoked. SELinux will deny the xattr update by virtue of applying mountpoint labeling to unprivileged userns mounts, and Smack will deny the writes for any user without global CAP_MAC_ADMIN, so loosening the capability check in commoncap is safe in this respect as well. Signed-off-by: Seth Forshee Acked-by: Serge Hallyn --- security/commoncap.c | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/security/commoncap.c b/security/commoncap.c index e657227d221e..12477afaa8ed 100644 --- a/security/commoncap.c +++ b/security/commoncap.c @@ -664,15 +664,17 @@ int cap_bprm_secureexec(struct linux_binprm *bprm) int cap_inode_setxattr(struct dentry *dentry, const char *name, const void *value, size_t size, int flags) { + struct user_namespace *user_ns = dentry->d_sb->s_user_ns; + if (!strcmp(name, XATTR_NAME_CAPS)) { - if (!capable(CAP_SETFCAP)) + if (!ns_capable(user_ns, CAP_SETFCAP)) return -EPERM; return 0; } if (!strncmp(name, XATTR_SECURITY_PREFIX, sizeof(XATTR_SECURITY_PREFIX) - 1) && - !capable(CAP_SYS_ADMIN)) + !ns_capable(user_ns, CAP_SYS_ADMIN)) return -EPERM; return 0; } @@ -690,15 +692,17 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name, */ int cap_inode_removexattr(struct dentry *dentry, const char *name) { + struct user_namespace *user_ns = dentry->d_sb->s_user_ns; + if (!strcmp(name, XATTR_NAME_CAPS)) { - if (!capable(CAP_SETFCAP)) + if (!ns_capable(user_ns, CAP_SETFCAP)) return -EPERM; return 0; } if (!strncmp(name, XATTR_SECURITY_PREFIX, sizeof(XATTR_SECURITY_PREFIX) - 1) && - !capable(CAP_SYS_ADMIN)) + !ns_capable(user_ns, CAP_SYS_ADMIN)) return -EPERM; return 0; } -- 2.7.4
[PATCH v4 18/21] fuse: Add support for pid namespaces
When the userspace process servicing fuse requests is running in a pid namespace then pids passed via the fuse fd are not being translated into that process' namespace. Translation is necessary for the pid to be useful to that process. Since no use case currently exists for changing namespaces all translations can be done relative to the pid namespace in use when fuse_conn_init() is called. For fuse this translates to mount time, and for cuse this is when /dev/cuse is opened. IO for this connection from another namespace will return errors. Requests from processes whose pid cannot be translated into the target namespace are not permitted, except for requests allocated via fuse_get_req_nofail_nopages. For no-fail requests in.h.pid will be 0 if the pid translation fails. File locking changes based on previous work done by Eric Biederman. Signed-off-by: Seth Forshee Acked-by: Miklos Szeredi --- fs/fuse/dev.c| 19 +++ fs/fuse/file.c | 22 +- fs/fuse/fuse_i.h | 4 fs/fuse/inode.c | 3 +++ 4 files changed, 39 insertions(+), 9 deletions(-) diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index cbece1221417..4e91b2ac25a7 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -19,6 +19,7 @@ #include #include #include +#include MODULE_ALIAS_MISCDEV(FUSE_MINOR); MODULE_ALIAS("devname:fuse"); @@ -124,11 +125,11 @@ static void __fuse_put_request(struct fuse_req *req) atomic_dec(&req->count); } -static void fuse_req_init_context(struct fuse_req *req) +static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req) { req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid()); req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid()); - req->in.h.pid = current->pid; + req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns); } void fuse_set_initialized(struct fuse_conn *fc) @@ -181,10 +182,14 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages, goto out; } - fuse_req_init_context(req); + fuse_req_init_context(fc, req); __set_bit(FR_WAITING, &req->flags); if (for_background) __set_bit(FR_BACKGROUND, &req->flags); + if (req->in.h.pid == 0) { + fuse_put_request(fc, req); + return ERR_PTR(-EOVERFLOW); + } return req; @@ -274,7 +279,7 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc, if (!req) req = get_reserved_req(fc, file); - fuse_req_init_context(req); + fuse_req_init_context(fc, req); __set_bit(FR_WAITING, &req->flags); __clear_bit(FR_BACKGROUND, &req->flags); return req; @@ -1243,6 +1248,9 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file, struct fuse_in *in; unsigned reqsize; + if (task_active_pid_ns(current) != fc->pid_ns) + return -EIO; + restart: spin_lock(&fiq->waitq.lock); err = -EAGAIN; @@ -1872,6 +1880,9 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud, struct fuse_req *req; struct fuse_out_header oh; + if (task_active_pid_ns(current) != fc->pid_ns) + return -EIO; + if (nbytes < sizeof(struct fuse_out_header)) return -EINVAL; diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 719924d6c706..b5c616c5ec98 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -2067,7 +2067,8 @@ static int fuse_direct_mmap(struct file *file, struct vm_area_struct *vma) return generic_file_mmap(file, vma); } -static int convert_fuse_file_lock(const struct fuse_file_lock *ffl, +static int convert_fuse_file_lock(struct fuse_conn *fc, + const struct fuse_file_lock *ffl, struct file_lock *fl) { switch (ffl->type) { @@ -2082,7 +2083,14 @@ static int convert_fuse_file_lock(const struct fuse_file_lock *ffl, fl->fl_start = ffl->start; fl->fl_end = ffl->end; - fl->fl_pid = ffl->pid; + + /* +* Convert pid into the caller's pid namespace. If the pid +* does not map into the namespace fl_pid will get set to 0. +*/ + rcu_read_lock(); + fl->fl_pid = pid_vnr(find_pid_ns(ffl->pid, fc->pid_ns)); + rcu_read_unlock(); break; default: @@ -2131,7 +2139,7 @@ static int fuse_getlk(struct file *file, struct file_lock *fl) args.out.args[0].value = &outarg; err = fuse_simple_request(fc, &args); if (!err) - err = convert_fuse_file_lock(&outarg.lk, fl); + err = convert_fuse_file_lock(fc, &outarg.lk
[PATCH v4 16/21] fs: Allow superblock owner to access do_remount_sb()
Superblock level remounts are currently restricted to global CAP_SYS_ADMIN, as is the path for changing the root mount to read only on umount. Loosen both of these permission checks to also allow CAP_SYS_ADMIN in any namespace which is privileged towards the userns which originally mounted the filesystem. Signed-off-by: Seth Forshee Acked-by: "Eric W. Biederman" Acked-by: Serge Hallyn --- fs/namespace.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 0ad8e4a4f50b..575e3f8b34fd 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1510,7 +1510,7 @@ static int do_umount(struct mount *mnt, int flags) * Special case for "unmounting" root ... * we just try to remount it readonly. */ - if (!capable(CAP_SYS_ADMIN)) + if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN)) return -EPERM; down_write(&sb->s_umount); if (!(sb->s_flags & MS_RDONLY)) @@ -2207,7 +2207,7 @@ static int do_remount(struct path *path, int flags, int mnt_flags, down_write(&sb->s_umount); if (flags & MS_BIND) err = change_mount_flags(path->mnt, flags); - else if (!capable(CAP_SYS_ADMIN)) + else if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN)) err = -EPERM; else err = do_remount_sb(sb, flags, data, 0); -- 2.7.4
[PATCH v4 05/21] block_dev: Check permissions towards block device inode when mounting
Unprivileged users should not be able to mount block devices when they lack sufficient privileges towards the block device inode. Update blkdev_get_by_path() to validate that the user has the required access to the inode at the specified path. The check will be skipped for CAP_SYS_ADMIN, so privileged mounts will continue working as before. Signed-off-by: Seth Forshee Acked-by: Serge Hallyn --- fs/block_dev.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index e9b937845bdb..2007040afb7b 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1429,9 +1429,14 @@ struct block_device *blkdev_get_by_path(const char *path, fmode_t mode, void *holder) { struct block_device *bdev; + int perm = 0; int err; - bdev = lookup_bdev(path, 0); + if (mode & FMODE_READ) + perm |= MAY_READ; + if (mode & FMODE_WRITE) + perm |= MAY_WRITE; + bdev = lookup_bdev(path, perm); if (IS_ERR(bdev)) return bdev; -- 2.7.4
[PATCH v4 09/21] Smack: Handle labels consistently in untrusted mounts
The SMACK64, SMACK64EXEC, and SMACK64MMAP labels are all handled differently in untrusted mounts. This is confusing and potentically problematic. Change this to handle them all the same way that SMACK64 is currently handled; that is, read the label from disk and check it at use time. For SMACK64 and SMACK64MMAP access is denied if the label does not match smk_root. To be consistent with suid, a SMACK64EXEC label which does not match smk_root will still allow execution of the file but will not run with the label supplied in the xattr. Signed-off-by: Seth Forshee Acked-by: Casey Schaufler --- security/smack/smack_lsm.c | 29 +++-- 1 file changed, 19 insertions(+), 10 deletions(-) diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c index aa17198cd5f2..ca564590cc1b 100644 --- a/security/smack/smack_lsm.c +++ b/security/smack/smack_lsm.c @@ -919,6 +919,7 @@ static int smack_bprm_set_creds(struct linux_binprm *bprm) struct inode *inode = file_inode(bprm->file); struct task_smack *bsp = bprm->cred->security; struct inode_smack *isp; + struct superblock_smack *sbsp; int rc; if (bprm->cred_prepared) @@ -928,6 +929,11 @@ static int smack_bprm_set_creds(struct linux_binprm *bprm) if (isp->smk_task == NULL || isp->smk_task == bsp->smk_task) return 0; + sbsp = inode->i_sb->s_security; + if ((sbsp->smk_flags & SMK_SB_UNTRUSTED) && + isp->smk_task != sbsp->smk_root) + return 0; + if (bprm->unsafe & (LSM_UNSAFE_PTRACE | LSM_UNSAFE_PTRACE_CAP)) { struct task_struct *tracer; rc = 0; @@ -1725,6 +1731,7 @@ static int smack_mmap_file(struct file *file, struct task_smack *tsp; struct smack_known *okp; struct inode_smack *isp; + struct superblock_smack *sbsp; int may; int mmay; int tmay; @@ -1736,6 +1743,10 @@ static int smack_mmap_file(struct file *file, isp = file_inode(file)->i_security; if (isp->smk_mmap == NULL) return 0; + sbsp = file_inode(file)->i_sb->s_security; + if (sbsp->smk_flags & SMK_SB_UNTRUSTED && + isp->smk_mmap != sbsp->smk_root) + return -EACCES; mkp = isp->smk_mmap; tsp = current_security(); @@ -3546,16 +3557,14 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode) if (rc >= 0) transflag = SMK_INODE_TRANSMUTE; } - if (!(sbsp->smk_flags & SMK_SB_UNTRUSTED)) { - /* -* Don't let the exec or mmap label be "*" or "@". -*/ - skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp); - if (IS_ERR(skp) || skp == &smack_known_star || - skp == &smack_known_web) - skp = NULL; - isp->smk_task = skp; - } + /* +* Don't let the exec or mmap label be "*" or "@". +*/ + skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp); + if (IS_ERR(skp) || skp == &smack_known_star || + skp == &smack_known_web) + skp = NULL; + isp->smk_task = skp; skp = smk_fetch(XATTR_NAME_SMACKMMAP, inode, dp); if (IS_ERR(skp) || skp == &smack_known_star || -- 2.7.4
[PATCH v4 01/21] fs: fix a posible leak of allocated superblock
From: Pavel Tikhomirov We probably need to fix superblock leak in patch (v4 "fs: Add user namesapace member to struct super_block"): Imagine posible code path in sget_userns: we iterate through type->fs_supers and do not find suitable sb, we drop sb_lock to allocate s and go to retry. After we dropped sb_lock some other task from different userns takes sb_lock, it is already in retry stage and has s allocated, so it puts its s in type->fs_supers list. So in retry we will find these sb in list and check it has a different userns, and finally we will return without freeing s. Signed-off-by: Pavel Tikhomirov Acked-by: Seth Forshee --- fs/super.c | 4 1 file changed, 4 insertions(+) diff --git a/fs/super.c b/fs/super.c index 829841e0ae7e..092a7828442e 100644 --- a/fs/super.c +++ b/fs/super.c @@ -474,6 +474,10 @@ retry: continue; if (user_ns != old->s_user_ns) { spin_unlock(&sb_lock); + if (s) { + up_write(&s->s_umount); + destroy_super(s); + } return ERR_PTR(-EBUSY); } if (!grab_super(old)) -- 2.7.4
Re: [PATCH] ext4: Fix check of dqget() return value in ext4_ioctl_setproject()
On Tue, Mar 29, 2016 at 08:01:03AM -0500, Seth Forshee wrote: > A failed call to dqget() returns an ERR_PTR() and not null. Fix > the check in ext4_ioctl_setproject() to handle this correctly. > > Fixes: 9b7365fc1c82 ("ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface > support") > Cc: sta...@vger.kernel.org # v4.5 > Signed-off-by: Seth Forshee Thought I'd check and see if this patch was forgotten, I sent it nearly a month ago but the bug is still present. Thanks, Seth