Re: fs_context-related oops in mainline
On Fri, Mar 15, 2019 at 02:24:30PM +, David Howells wrote: > Al Viro wrote: > > > - if (fc->user_ns) > > - put_user_ns(fc->user_ns); > > - fc->user_ns = get_user_ns(netns->user_ns); > > + if (netns) { > > + if (fc->user_ns) > > + put_user_ns(fc->user_ns); > > + fc->user_ns = get_user_ns(netns->user_ns); > > + } > > This begs the question why is sysfs using the current network namespace's idea > of the user namespace? Why not just use the one directly from current->cred? Because it gives access to that netns guts, presumably. In a saner world sysfs wouldn't _have_ netns-dependent bits; a separate per-netns filesystem would contain those, and be mounted separately. And yes, we do have way too many kinds of namespaces, along with filesystems that try to mix unrelated bits and lead to something that looks like Cthulhu's arse after an unfortunate accident with capsaicin suppository...
Re: fs_context-related oops in mainline
On Fri, Mar 15, 2019 at 02:24:30PM +, David Howells wrote: > Al Viro wrote: > > > - if (fc->user_ns) > > - put_user_ns(fc->user_ns); > > - fc->user_ns = get_user_ns(netns->user_ns); > > + if (netns) { > > + if (fc->user_ns) > > + put_user_ns(fc->user_ns); > > + fc->user_ns = get_user_ns(netns->user_ns); > > + } > > This begs the question why is sysfs using the current network namespace's idea > of the user namespace? Why not just use the one directly from current->cred? Ask the networking people that question, I have no idea :)
Re: fs_context-related oops in mainline
Al Viro wrote: > - if (fc->user_ns) > - put_user_ns(fc->user_ns); > - fc->user_ns = get_user_ns(netns->user_ns); > + if (netns) { > + if (fc->user_ns) > + put_user_ns(fc->user_ns); > + fc->user_ns = get_user_ns(netns->user_ns); > + } This begs the question why is sysfs using the current network namespace's idea of the user namespace? Why not just use the one directly from current->cred? David
Re: fs_context-related oops in mainline
On Fri, Mar 15, 2019 at 12:18:13PM +, Al Viro wrote: > On Fri, Mar 15, 2019 at 12:50:02PM +0100, Dominik Brodowski wrote: > > On Fri, Mar 15, 2019 at 11:44:45AM +, David Howells wrote: > > > Dominik Brodowski wrote: > > > > > > > [0.839322] RIP: 0010:sysfs_init_fs_context+0x82/0xd0 > > > > > > Could you load your kernel into gdb and then do: > > > > > > i li *sysfs_init_fs_context+0x82 > > > > Doesn't seem necessary as per my mail to Al a few seconds ago: > > kobj_ns_grab_current(KOBJ_NS_TYPE_NET) returns NULL, so > > > > fc->user_ns = get_user_ns(netns->user_ns); > > > > will try to dereference a null pointer. > > > > Thanks, > > Dominik > > > > Charming... It's netns being turned off, and the thing we'd missed > is that kobj_ns_current_may_mount() becomes constant true in that > setup. IOW, ns_capable(..., CAP_SYS_ADMIN) is suppressed entirely > in that case (well, aside of what may_mount() has done in > do_mount/sys_fsmount). > > So what we need is making that "change fc->user_ns" conditional > on netns != NULL, as in > > diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c > index 4cb21b558a85..1b56686ab178 100644 > --- a/fs/sysfs/mount.c > +++ b/fs/sysfs/mount.c > @@ -71,9 +71,11 @@ static int sysfs_init_fs_context(struct fs_context *fc) > kfc->magic = SYSFS_MAGIC; > fc->fs_private = kfc; > fc->ops = _fs_context_ops; > - if (fc->user_ns) > - put_user_ns(fc->user_ns); > - fc->user_ns = get_user_ns(netns->user_ns); > + if (netns) { > + if (fc->user_ns) > + put_user_ns(fc->user_ns); > + fc->user_ns = get_user_ns(netns->user_ns); > + } > fc->global = true; > return 0; > } That fixes the issue at hand, thanks (verified both on top of the problematic commit and on top of mainline). Thanks, Dominik
Re: fs_context-related oops in mainline
On Fri, Mar 15, 2019 at 12:50:02PM +0100, Dominik Brodowski wrote: > On Fri, Mar 15, 2019 at 11:44:45AM +, David Howells wrote: > > Dominik Brodowski wrote: > > > > > [0.839322] RIP: 0010:sysfs_init_fs_context+0x82/0xd0 > > > > Could you load your kernel into gdb and then do: > > > > i li *sysfs_init_fs_context+0x82 > > Doesn't seem necessary as per my mail to Al a few seconds ago: > kobj_ns_grab_current(KOBJ_NS_TYPE_NET) returns NULL, so > > fc->user_ns = get_user_ns(netns->user_ns); > > will try to dereference a null pointer. > > Thanks, > Dominik > Charming... It's netns being turned off, and the thing we'd missed is that kobj_ns_current_may_mount() becomes constant true in that setup. IOW, ns_capable(..., CAP_SYS_ADMIN) is suppressed entirely in that case (well, aside of what may_mount() has done in do_mount/sys_fsmount). So what we need is making that "change fc->user_ns" conditional on netns != NULL, as in diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c index 4cb21b558a85..1b56686ab178 100644 --- a/fs/sysfs/mount.c +++ b/fs/sysfs/mount.c @@ -71,9 +71,11 @@ static int sysfs_init_fs_context(struct fs_context *fc) kfc->magic = SYSFS_MAGIC; fc->fs_private = kfc; fc->ops = _fs_context_ops; - if (fc->user_ns) - put_user_ns(fc->user_ns); - fc->user_ns = get_user_ns(netns->user_ns); + if (netns) { + if (fc->user_ns) + put_user_ns(fc->user_ns); + fc->user_ns = get_user_ns(netns->user_ns); + } fc->global = true; return 0; }
Re: fs_context-related oops in mainline
On Fri, Mar 15, 2019 at 11:44:45AM +, David Howells wrote: > Dominik Brodowski wrote: > > > [0.839322] RIP: 0010:sysfs_init_fs_context+0x82/0xd0 > > Could you load your kernel into gdb and then do: > > i li *sysfs_init_fs_context+0x82 Doesn't seem necessary as per my mail to Al a few seconds ago: kobj_ns_grab_current(KOBJ_NS_TYPE_NET) returns NULL, so fc->user_ns = get_user_ns(netns->user_ns); will try to dereference a null pointer. Thanks, Dominik
Re: fs_context-related oops in mainline
On Fri, Mar 15, 2019 at 11:34:47AM +, Al Viro wrote: > On Fri, Mar 15, 2019 at 08:43:07AM +0100, Dominik Brodowski wrote: > > David, Al, > > > > commit 23bf1b6be9c2 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context") > > seems to have introduced a bug; at least that's the commit I bisected the > > following oops down to: As additional information: In sysfs_init_fs_context, we have kfc->ns_tag = netns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET); ... fc->user_ns = get_user_ns(netns->user_ns); With my config, kobj_ns_grab_current() returns NULL here, which obviously causes a null pointer dereference when trying to set fs->user_ns. Probably the most important line of the .config (attached): # CONFIG_NET is not set Thanks, Dominik # # Automatically generated file; DO NOT EDIT. # Linux/x86 5.0.0-rc2 Kernel Configuration # # # Compiler: gcc (GCC) 8.2.1 20181127 # CONFIG_CC_IS_GCC=y CONFIG_GCC_VERSION=80201 CONFIG_CLANG_VERSION=0 CONFIG_CC_HAS_ASM_GOTO=y CONFIG_IRQ_WORK=y CONFIG_BUILDTIME_EXTABLE_SORT=y CONFIG_THREAD_INFO_IN_TASK=y # # General setup # CONFIG_INIT_ENV_ARG_LIMIT=32 # CONFIG_COMPILE_TEST is not set CONFIG_LOCALVERSION="" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_BUILD_SALT="" CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_BZIP2=y CONFIG_HAVE_KERNEL_LZMA=y CONFIG_HAVE_KERNEL_XZ=y CONFIG_HAVE_KERNEL_LZO=y CONFIG_HAVE_KERNEL_LZ4=y # CONFIG_KERNEL_GZIP is not set # CONFIG_KERNEL_BZIP2 is not set # CONFIG_KERNEL_LZMA is not set CONFIG_KERNEL_XZ=y # CONFIG_KERNEL_LZO is not set # CONFIG_KERNEL_LZ4 is not set CONFIG_DEFAULT_HOSTNAME="(none)" CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y # CONFIG_CROSS_MEMORY_ATTACH is not set # CONFIG_USELIB is not set CONFIG_HAVE_ARCH_AUDITSYSCALL=y # # IRQ subsystem # CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_IRQ_SHOW=y CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_GENERIC_IRQ_MIGRATION=y CONFIG_IRQ_DOMAIN=y CONFIG_IRQ_DOMAIN_HIERARCHY=y CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y CONFIG_GENERIC_IRQ_RESERVATION_MODE=y CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y # CONFIG_GENERIC_IRQ_DEBUGFS is not set CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_ARCH_CLOCKSOURCE_DATA=y CONFIG_ARCH_CLOCKSOURCE_INIT=y CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y CONFIG_GENERIC_CMOS_UPDATE=y # # Timers subsystem # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ_COMMON=y # CONFIG_HZ_PERIODIC is not set CONFIG_NO_HZ_IDLE=y # CONFIG_NO_HZ_FULL is not set # CONFIG_NO_HZ is not set CONFIG_HIGH_RES_TIMERS=y # CONFIG_PREEMPT_NONE is not set # CONFIG_PREEMPT_VOLUNTARY is not set CONFIG_PREEMPT=y CONFIG_PREEMPT_COUNT=y # # CPU/Task time and stats accounting # CONFIG_TICK_CPU_ACCOUNTING=y # CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set # CONFIG_IRQ_TIME_ACCOUNTING is not set # CONFIG_BSD_PROCESS_ACCT is not set # CONFIG_PSI is not set # CONFIG_CPU_ISOLATION is not set # # RCU Subsystem # CONFIG_PREEMPT_RCU=y # CONFIG_RCU_EXPERT is not set CONFIG_SRCU=y CONFIG_TREE_SRCU=y CONFIG_TASKS_RCU=y CONFIG_RCU_STALL_COMMON=y CONFIG_RCU_NEED_SEGCBLIST=y CONFIG_BUILD_BIN2C=y CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y CONFIG_LOG_BUF_SHIFT=17 CONFIG_LOG_CPU_MAX_BUF_SHIFT=13 CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13 CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y CONFIG_ARCH_SUPPORTS_INT128=y # CONFIG_NUMA_BALANCING is not set CONFIG_CGROUPS=y # CONFIG_MEMCG is not set CONFIG_BLK_CGROUP=y # CONFIG_DEBUG_BLK_CGROUP is not set CONFIG_CGROUP_SCHED=y CONFIG_FAIR_GROUP_SCHED=y CONFIG_CFS_BANDWIDTH=y CONFIG_RT_GROUP_SCHED=y CONFIG_CGROUP_PIDS=y # CONFIG_CGROUP_RDMA is not set CONFIG_CGROUP_FREEZER=y CONFIG_CPUSETS=y CONFIG_PROC_PID_CPUSET=y CONFIG_CGROUP_DEVICE=y CONFIG_CGROUP_CPUACCT=y CONFIG_CGROUP_PERF=y CONFIG_CGROUP_BPF=y # CONFIG_CGROUP_DEBUG is not set CONFIG_SOCK_CGROUP_DATA=y CONFIG_NAMESPACES=y CONFIG_UTS_NS=y CONFIG_IPC_NS=y CONFIG_USER_NS=y CONFIG_PID_NS=y # CONFIG_CHECKPOINT_RESTORE is not set # CONFIG_SCHED_AUTOGROUP is not set # CONFIG_SYSFS_DEPRECATED is not set CONFIG_RELAY=y CONFIG_BLK_DEV_INITRD=y CONFIG_INITRAMFS_SOURCE="" CONFIG_RD_GZIP=y CONFIG_RD_BZIP2=y CONFIG_RD_LZMA=y CONFIG_RD_XZ=y CONFIG_RD_LZO=y CONFIG_RD_LZ4=y CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set CONFIG_SYSCTL=y CONFIG_ANON_INODES=y CONFIG_SYSCTL_EXCEPTION_TRACE=y CONFIG_HAVE_PCSPKR_PLATFORM=y CONFIG_BPF=y # CONFIG_EXPERT is not set CONFIG_MULTIUSER=y CONFIG_SGETMASK_SYSCALL=y CONFIG_SYSFS_SYSCALL=y CONFIG_FHANDLE=y CONFIG_POSIX_TIMERS=y CONFIG_PRINTK=y CONFIG_PRINTK_NMI=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_PCSPKR_PLATFORM=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_FUTEX_PI=y CONFIG_EPOLL=y CONFIG_SIGNALFD=y CONFIG_TIMERFD=y CONFIG_EVENTFD=y CONFIG_SHMEM=y CONFIG_AIO=y CONFIG_ADVISE_SYSCALLS=y CONFIG_MEMBARRIER=y CONFIG_KALLSYMS=y
Re: fs_context-related oops in mainline
Dominik Brodowski wrote: > [0.839322] RIP: 0010:sysfs_init_fs_context+0x82/0xd0 Could you load your kernel into gdb and then do: i li *sysfs_init_fs_context+0x82 David
Re: fs_context-related oops in mainline
On Fri, Mar 15, 2019 at 08:43:07AM +0100, Dominik Brodowski wrote: > David, Al, > > commit 23bf1b6be9c2 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context") > seems to have introduced a bug; at least that's the commit I bisected the > following oops down to: > This occurs while trying to mount sysfs in initramfs > > mount -n -t sysfs sysfs /sys > > All this obviously runs in qemu; config and further information are available > upon request. .config, please?