Re: [RFC PATCH] rootfs: force mounting rootfs as tmpfs
On 01/31/2018 10:22 PM, Mimi Zohar wrote:
> On Wed, 2018-01-31 at 21:03 -0500, Arvind Sankar wrote:
>> On Wed, Jan 31, 2018 at 05:48:20PM -0600, Rob Landley wrote:
>>> On 01/31/2018 04:07 PM, Mimi Zohar wrote:
>>>> On Wed, 2018-01-31 at 13:32 -0600, Rob Landley wrote:
>>>>> (The old "I configured in tmpfs and am using rootfs but I want that
>>>>> rootfs to be ramfs, not tmpfs" code doesn't seem to be a real-world
>>>>> concern, does it?)
>>>>
>>>> I must be missing something. Which systems don't specify "root=" on
>>>> the boot command line.
>>>
>>> Any system using initrd or initramfs?
>>
>> Don't a lot of initramfs setups use root= to tell the initramfs which
>> actual root file system to switch to after early boot?
>
> With your patch and specifying "root=tmpfs", dracut is complaining:
>
> dracut: FATAL: Don't know how to handle 'root=tmpfs'
> dracut: refusing to continue

"The kernel can't break this buggy userspace package."

"The kernel must give access to a new feature to this buggy userspace
package".

I think kernel policy asks you to pick one, but I could be wrong...

Rob
Re: [RFC PATCH] rootfs: force mounting rootfs as tmpfs
On 02/01/2018 09:55 AM, Mimi Zohar wrote:
> On Thu, 2018-02-01 at 09:20 -0600, Rob Landley wrote:
>
>>> With your patch and specifying "root=tmpfs", dracut is complaining:
>>>
>>> dracut: FATAL: Don't know how to handle 'root=tmpfs'
>>> dracut: refusing to continue
>>
>> [googles]... I do not understand why this package exists.
>>
>> If you're switching to another root filesystem, using a tool that
>> wikipedia[citation needed] says has no purpose but to switch to another
>> root filesystem, (so let's reproduce the kernel infrastructure in
>> userspace while leaving it in the kernel too)... why do you need
>> initramfs to be tmpfs? You're using it for half a second, then
>> discarding it, what's the point of it being tmpfs?
>
> Unlike the kernel image which is signed by the distros, the initramfs
> doesn't come signed, because it is built on the target system. Even
> if the initramfs did come signed, it is beneficial to measure and
> appraise the individual files in the initramfs.

You can still shoot yourself in the foot with tmpfs. People mount a /run
and a /tmp and then as a normal user you can go
https://twitter.com/landley/status/959103235305951233 and maybe the
default should be a little more clever there...

I'll throw it on the todo heap. :)

>> Sigh. If people are ok with having rootfs just be tmpfs whenever tmpfs
>> is configured in, even when you're then going to overmount it with
>> something else like you're doing, let's just _remove_ the test. If it
>> can be tmpfs, have it be tmpfs.
>
> Very much appreciated!

Not yet tested, but something like the attached? (Sorry for the
half-finished doc changes in there, I'm at work and have a 5 minute
break. I can test properly this evening if you don't get to it...)

Rob

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b98048b..a5b44b2 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3771,8 +3771,14 @@
 			debug-uart get routed to the D+ and D- pins of the usb
 			port and the regular usb controller gets disabled.
 
-	root=		[KNL] Root filesystem
-			See name_to_dev_t comment in init/do_mounts.c.
+	root=		[KNL] Fallback root filesystem when not using initramfs
+			If initramfs contains an /init file to run as PID 1 the
+			kernel ignores this setting. When initramfs doesn't have
+			/init (or whatever rdinit= points to) the kernel calls
+			prepare_namespace() in init/do_mounts.c to mount another
+			filesystem over / and chroot into it, then looks for
+			/sbin/init in there. (And /etc/init, /bin/init, and
+			/bin/sh for historical reasons.)
 
 	rootdelay=	[KNL] Delay (in seconds) to pause before attempting to
 			mount the root filesystem
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
index b176928..f3c57ba 100644
--- a/Documentation/filesystems/ramfs-rootfs-initramfs.txt
+++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt
@@ -67,6 +67,10 @@ A ramfs derivative called tmpfs was created to add size limits, and the ability
 to write the data to swap space. Normal users can be allowed write access to
 tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information.
 
+The kernel uses tmpfs for ramfs when CONFIG_TMPFS=y and no "root=" is
+specified in the kernel command line. If you can't stop yourself from
+specifying root= you can also use "root=tmpfs".
+
 What is rootfs?
 ---
 
@@ -236,22 +240,10 @@ An initramfs archive is a complete self-contained root filesystem for Linux.
 If you don't already understand what shared libraries, devices, and paths you
 need to get a minimal root filesystem up and running, here are some references:
-http://www.tldp.org/HOWTO/Bootdisk-HOWTO/
-http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html
-http://www.linuxfromscratch.org/lfs/view/stable/
-
-The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is
-designed to be a tiny C library to statically link early userspace
-code against, along with some related utilities. It is BSD licensed.
-I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net)
-myself. These are LGPL and GPL, respectively. (A self-contained initramfs
-package is planned for the busybox 1.3 release.)
-
-In theory you could use glibc, but that's not well suited for small embedded
-uses like this. (A "hello world" program statically linked against glibc is
-over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do
-name lookups, even when otherwise statically linked.)
+  http://www.tldp.org/HOWTO/Bootdisk-HOWTO/
+  http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html
+  http://www.linuxfromscratc
Re: [RFC PATCH] rootfs: force mounting rootfs as tmpfs
On 01/31/2018 10:22 PM, Mimi Zohar wrote:
> On Wed, 2018-01-31 at 21:03 -0500, Arvind Sankar wrote:
>> On Wed, Jan 31, 2018 at 05:48:20PM -0600, Rob Landley wrote:
>>> On 01/31/2018 04:07 PM, Mimi Zohar wrote:
>>>> On Wed, 2018-01-31 at 13:32 -0600, Rob Landley wrote:
>>>>> (The old "I configured in tmpfs and am using rootfs but I want that
>>>>> rootfs to be ramfs, not tmpfs" code doesn't seem to be a real-world
>>>>> concern, does it?)
>>>>
>>>> I must be missing something. Which systems don't specify "root=" on
>>>> the boot command line.
>>>
>>> Any system using initrd or initramfs?
>>
>> Don't a lot of initramfs setups use root= to tell the initramfs which
>> actual root file system to switch to after early boot?

You mean the option that _isn't_ passed through as an environment
variable (the way ROOT= would be) so you have to parse /proc/cmdline to
see if it was passed in?

If you really, really, really, really, really want to double down on the
"no, this is the button, it doesn't do what I thought but I will MAKE it
work" obsession, sure.

> With your patch and specifying "root=tmpfs", dracut is complaining:
>
> dracut: FATAL: Don't know how to handle 'root=tmpfs'
> dracut: refusing to continue

[googles]... I do not understand why this package exists.

If you're switching to another root filesystem, using a tool that
wikipedia[citation needed] says has no purpose but to switch to another
root filesystem, (so let's reproduce the kernel infrastructure in
userspace while leaving it in the kernel too)... why do you need
initramfs to be tmpfs? You're using it for half a second, then
discarding it, what's the point of it being tmpfs?

Sigh. If people are ok with having rootfs just be tmpfs whenever tmpfs
is configured in, even when you're then going to overmount it with
something else like you're doing, let's just _remove_ the test. If it
can be tmpfs, have it be tmpfs.

Rob
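
For readers following along: since root= does not arrive as an environment variable, an initramfs tool that wants its value has to dig it out of the command line text itself. A minimal sketch of that parsing step follows, under stated assumptions: find_root_arg() is a hypothetical helper name (not taken from dracut or the kernel), it works on a string such as the contents of /proc/cmdline, it returns the first match, and it ignores quoted arguments.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the value of the first "root=" argument in a kernel command line
 * string into out (always NUL-terminated, truncated to fit).
 * Returns 1 if a root= argument was found, 0 otherwise.
 * Hypothetical helper for illustration; quoting is not handled.
 */
int find_root_arg(const char *cmdline, char *out, size_t outlen)
{
	const char *p = cmdline;

	if (!outlen)
		return 0;

	while (*p) {
		/* Skip whitespace separating arguments. */
		while (*p == ' ' || *p == '\t' || *p == '\n')
			p++;
		if (!*p)
			break;

		/* Length of the current whitespace-delimited argument. */
		size_t len = strcspn(p, " \t\n");

		/* Only match when the argument itself starts with "root=". */
		if (len >= 5 && !strncmp(p, "root=", 5)) {
			size_t vlen = len - 5;

			if (vlen >= outlen)
				vlen = outlen - 1;
			memcpy(out, p + 5, vlen);
			out[vlen] = 0;
			return 1;
		}
		p += len;
	}
	return 0;
}
```

A caller would read /proc/cmdline into a buffer and pass it in; an argument such as rdinit= is not mistaken for root= because the match is anchored at the start of each argument.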
Re: [RFC PATCH] rootfs: force mounting rootfs as tmpfs
On 01/31/2018 04:07 PM, Mimi Zohar wrote:
> On Wed, 2018-01-31 at 13:32 -0600, Rob Landley wrote:
>> (The old "I configured in tmpfs and am using rootfs but I want that
>> rootfs to be ramfs, not tmpfs" code doesn't seem to be a real-world
>> concern, does it?)
>
> I must be missing something. Which systems don't specify "root=" on
> the boot command line.

Any system using initrd or initramfs?

I have one at https://github.com/landley/mkroot that doesn't, for
example. It's 600 lines of bash that builds simple Linux systems for a
bunch of different architectures, each with a qemu wrapper to boot it to
a shell prompt. And yes, it's using tmpfs for its initramfs, you can
tell because "grep rootfs /proc/mounts" gives a size. That's also where
I tested the patch I sent you.

The root= option specifies the filesystem to mount OVER rootfs. I.E.
it's the fallback root filesystem to mount when initramfs doesn't
contain an executable /init that can become PID 1.

If you DO have an /init in rootfs which the kernel manages to launch as
PID 1, the kernel code never reaches the part that uses the root=
argument. (Look for the call to prepare_namespace() in init/main.c,
notice how it's only called if it can't _already_ find "/init".)

That's why the test I added for initramfs vs initmpfs was "did they
specify root=", because if they did it means they're telling the kernel
what to mount over rootfs, so they're not staying in rootfs. That's what
that argument MEANS. They're telling init/main.c what fallback
filesystem to mount over rootfs _after_ failing to find /init in rootfs,
therefore they're not keeping rootfs as their root filesystem for
userspace.

That said, a lot of people don't understand how this works, and they set
root= to things like /dev/ram when using initrd because "we must set
this knob to something, this is something, therefore we must set this
knob to it". The fact setting root=/dev/random would have the exact same
effect doesn't seem to bother them, they had Done It and It Worked,
therefore it was the Right Thing To Do. QED.

The patch in my last message was me going "alright, if people can't NOT
twiddle the knob, even when doing it breaks things in an immediate and
obvious way, and a big DO NOT TOUCH sign won't dissuade them, just give
the knob an explicit 'off' setting that literally does the same thing as
not touching it at all would".

Your solution was to add a safety catch for the knob, which is edging
into Rube Goldberg territory if you ask me.

> If we want to include and restore xattrs,
> there needs to be a way of using tmpfs.

Yes, using tmpfs for initramfs is useful, that's why I submitted patches
to hook it up back in 2013. (Personally I find "cat /dev/zero >
/filename" _not_ hard locking your system instantly the most compelling
feature. Although I believe what motivated my initmpfs patches way back
when was somebody wanting to install an rpm into initramfs and the
installer failing because ramfs hasn't got a size so "df" always returns
zero.)

> Mimi

Rob
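
The "grep rootfs /proc/mounts gives a size" check mentioned above could be done programmatically along these lines. This is only a sketch: rootfs_line_has_size() is a hypothetical helper, and it assumes the rootfs entry has the usual "device mountpoint fstype options dump pass" layout with a size= option present when rootfs is tmpfs (as the message describes) and absent when it is ramfs.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if a /proc/mounts-style line describes the rootfs mount and
 * its option list contains a "size=" entry, 0 otherwise.
 * Hypothetical helper; the exact line layout is an assumption.
 */
int rootfs_line_has_size(const char *line)
{
	/* "rootfs / rootfs " is 16 characters: device, mountpoint, fstype. */
	if (strncmp(line, "rootfs / rootfs ", 16))
		return 0;

	const char *opts = line + 16;

	/* Walk the comma-separated options up to the next whitespace. */
	for (;;) {
		size_t n = strcspn(opts, ", \t\n");

		if (n >= 5 && !strncmp(opts, "size=", 5))
			return 1;
		if (opts[n] != ',')
			return 0;	/* end of the options field */
		opts += n + 1;
	}
}
```

Feeding it each line read from /proc/mounts and stopping at the first hit gives the same answer as eyeballing the grep output.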
Re: [RFC PATCH] rootfs: force mounting rootfs as tmpfs
On 01/30/2018 03:46 PM, Mimi Zohar wrote:
> Commit 16203a7a9422 ("initmpfs: make rootfs use tmpfs when CONFIG_TMPFS
> enabled") introduced using tmpfs as the rootfs filesystem. The use of
> tmpfs is limited to systems that do not specify "root=" on the boot
> command line.
>
> Without the check "!saved_root_name[0]", rootfs uses tmpfs. As there
> must be a valid reason for this check, this patch introduces a new boot
> command line option named "noramfs" to force rootfs to use tmpfs.
>
> Signed-off-by: Mimi Zohar <zo...@linux.vnet.ibm.com>

How about just:

diff --git a/init/do_mounts.c b/init/do_mounts.c
index 7cf4f6d..af66ede 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -632,8 +632,8 @@ int __init init_rootfs(void)
 	if (err)
 		return err;
 
-	if (IS_ENABLED(CONFIG_TMPFS) && !saved_root_name[0] &&
-		(!root_fs_names || strstr(root_fs_names, "tmpfs"))) {
+	if (IS_ENABLED(CONFIG_TMPFS) && (!saved_root_name[0] ||
+		!strcmp(saved_root_name, "tmpfs"))) {
 		err = shmem_init();
 		is_tmpfs = true;
 	} else {

(Obviously-signed-off-by: Rob Landley <r...@landley.net>)

I.E. if you somehow just can't stop yourself from specifying root= when
using rootfs, have "root=tmpfs" do what you want.

(The old "I configured in tmpfs and am using rootfs but I want that
rootfs to be ramfs, not tmpfs" code doesn't seem to be a real-world
concern, does it?)

> ---
> Documentation/admin-guide/kernel-parameters.txt | 2 ++
> init/do_mounts.c| 15 +--
> 2 files changed, 15 insertions(+), 2 deletions(-)

I suppose I should do a documentation update too. Lemme send a proper
one after work...

Rob

P.S. While I'm at it, I've meant to wire up rootflags= so you can
specify a memory limit other than 50% forever, I should do that too. And
resend my "make DEVTMPFS_MOUNT apply to initramfs" patch (with the
debian bug workaround)...
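
To make the behavioural difference in the diff above easier to see, here is a userspace model of the two conditions. It is a sketch, not kernel code: the function names are hypothetical, IS_ENABLED(CONFIG_TMPFS) is modelled as a plain bool, and saved_root_name / root_fs_names are passed in as parameters standing in for the kernel variables of the same names.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Model of the condition being removed: tmpfs only when no root= was
 * given, and the filesystem name list is either unset or mentions tmpfs.
 */
bool old_rootfs_is_tmpfs(bool config_tmpfs, const char *saved_root_name,
			 const char *root_fs_names)
{
	return config_tmpfs && !saved_root_name[0] &&
	       (!root_fs_names || strstr(root_fs_names, "tmpfs"));
}

/*
 * Model of the proposed condition: tmpfs when no root= was given, or
 * when root= is literally "tmpfs".
 */
bool new_rootfs_is_tmpfs(bool config_tmpfs, const char *saved_root_name)
{
	return config_tmpfs && (!saved_root_name[0] ||
	       !strcmp(saved_root_name, "tmpfs"));
}
```

With root=tmpfs on the command line the old test picks ramfs and the proposed one picks tmpfs; with any other root= value both pick ramfs, and with CONFIG_TMPFS off both always do. Note that the proposed test no longer looks at root_fs_names at all, which is the dropped "I want ramfs even though tmpfs is configured in" case the message refers to.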
Allnoconfig build still broken on x86-64 in today's git.
$ make clean && make allnoconfig && make
  HOSTCC  scripts/basic/fixdep
  HOSTCC  scripts/kconfig/conf.o
  HOSTCC  scripts/kconfig/zconf.tab.o
  HOSTLD  scripts/kconfig/conf
scripts/kconfig/conf --allnoconfig Kconfig
#
# configuration written to .config
#
Makefile:932: *** "Cannot generate ORC metadata for CONFIG_UNWINDER_ORC=y, please install libelf-dev, libelf-devel or elfutils-libelf-devel". Stop.
$ grep CONFIG_UNWINDER .config
CONFIG_UNWINDER_FRAME_POINTER=y
# CONFIG_UNWINDER_GUESS is not set
$

Still an unnecessary dependency that breaks the build even when it's
configged out.

Rob
Re: [PATCH v2 11/15] gen_init_cpio: add newcx format
On 01/24/2018 09:27 PM, Taras Kondratiuk wrote:
> diff --git a/usr/gen_init_cpio.c b/usr/gen_init_cpio.c
> index 7a2a6d85345d..78a47a5bdcb1 100644
> --- a/usr/gen_init_cpio.c
> +++ b/usr/gen_init_cpio.c
> @@ -10,6 +10,7 @@
>  #include
>  #include
>  #include
> +#include

You're adding an assert? Really?

>  	fputs(s, stdout);
> -	offset += 110;
> +	assert((offset & 3) == 0);
> +	offset += cpio_hdr_size;

Why?

Rob
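
For context on what the quoted assert checks: `(offset & 3) == 0` is true exactly when offset is a multiple of 4, which is the alignment the newc cpio format pads headers and file data to. A small sketch of that arithmetic follows; pad4() and align4() are hypothetical names for illustration, not functions from gen_init_cpio.c.

```c
#include <stddef.h>

/* Number of zero bytes needed to bring offset up to a multiple of 4. */
size_t pad4(size_t offset)
{
	return (4 - (offset & 3)) & 3;
}

/* Offset rounded up to the next multiple of 4 (unchanged if aligned). */
size_t align4(size_t offset)
{
	return offset + pad4(offset);
}
```

The old 110-byte newc header is not itself a multiple of 4 (pad4(110) is 2), so alignment there is restored only after the name is padded. The sketch illustrates the invariant being asserted; it does not answer whether an assert is the right way to enforce it.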
Re: [PATCH v2 01/15] Documentation: add newcx initramfs format description
On 01/24/2018 09:27 PM, Taras Kondratiuk wrote:
> diff --git a/Documentation/early-userspace/buffer-format.txt b/Documentation/early-userspace/buffer-format.txt
> index e1fd7f9dad16..d818df4f72dc 100644
> --- a/Documentation/early-userspace/buffer-format.txt
> +++ b/Documentation/early-userspace/buffer-format.txt

> +compressed and/or uncompressed cpio archives; arbitrary amounts
> +zero bytes (for padding) can be added between members.

Missing "of" between amounts and zero. (Yeah it was in the original, but
if you're touching it anyway...)

> +c_xattrs_size	8 bytes		Size of xattrs field
> +
> +Most of the fields match cpio_newc_header except c_mtime that contains
> +microseconds. c_chksum field is dropped.
> +
> +xattr_size is a total size of xattr_entry including 8 bytes of
> +xattr_size. xattr_size has the same hexadecimal ASCII encoding as other
> +fields of cpio header.

xattrs_size or xattr_size?

Total nitpicks, I know. :)

Rob
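
The "hexadecimal ASCII encoding" the quoted text refers to means each header field is a fixed-width run of hex digits rather than a binary integer. A minimal sketch of decoding one such field follows; parse_hex_field() is a hypothetical helper, and the width is assumed to be at most 16 digits so the value fits in 64 bits.

```c
#include <stddef.h>

/*
 * Decode a fixed-width hexadecimal ASCII field (no NUL terminator
 * needed) into *value. Returns 0 on success, -1 if any of the width
 * characters is not a hex digit. Assumes width <= 16.
 */
int parse_hex_field(const char *field, size_t width,
		    unsigned long long *value)
{
	unsigned long long v = 0;

	for (size_t i = 0; i < width; i++) {
		char c = field[i];
		unsigned digit;

		if (c >= '0' && c <= '9')
			digit = c - '0';
		else if (c >= 'a' && c <= 'f')
			digit = c - 'a' + 10;
		else if (c >= 'A' && c <= 'F')
			digit = c - 'A' + 10;
		else
			return -1;

		v = (v << 4) | digit;
	}
	*value = v;
	return 0;
}
```

An 8-byte field such as the quoted c_xattrs_size would be decoded with width 8; only the first width characters are examined, so the field can be read straight out of the header buffer.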
Re: [PATCH v2 01/15] Documentation: add newcx initramfs format description
On 01/25/2018 03:29 AM, Arnd Bergmann wrote: > On Thu, Jan 25, 2018 at 4:27 AM, Taras Kondratiukwrote: >> Many of the Linux security/integrity features are dependent on file >> metadata, stored as extended attributes (xattrs), for making decisions. >> These features need to be initialized during initcall and enabled as >> early as possible for complete security coverage. >> >> Initramfs (tmpfs) supports xattrs, but newc CPIO archive format does not >> support including them into the archive. >> >> This patch describes "extended" newc format (newcx) that is based on >> newc and has following changes: >> - extended attributes support >> - increased size of filesize to support files >4GB. >> - increased mtime field size to have usec precision and more than >> 32-bit of seconds. >> - removed unused checksum field. >> >> Signed-off-by: Taras Kondratiuk >> Signed-off-by: Mimi Zohar >> Signed-off-by: Victor Kamensky > > Ah nice, I like the extension of the time handling, that certainly > addresses one of the issues with y2038 that we have previously > hacked around in an ugly way (interpreting the 32-bit > number as unsigned). Taras and I exchanged email like a year ago working out format stuff, so I don't have any real complaints. My feedback's already worked in, and I can make toybox cpio support -h newcx as soon as the format's finalized and I get a free weekend. That said, I don't think -h newcx should emit (or recognize) the "TRAILER!!!1!" entry. That's kinda silly in-band signaling for 2018: files have a length, pipes provide EOF, and each cpiox entry starts with 6 bytes of c_magic anyway. (I stopped toybox from producing the TRAILER entry back in june, toybox commit 32550751997d, and the kernel consumes the resulting cpio just fine. All the trailer does is prevent you from concatenating cpio files, which is a feature multiple people asked me for.) 
> However, if this is to become a generally supported format
> for cpio files,

After Joerg Schilling dies (or admits solaris has) it might even make it
into posix.

> could we make it use nanosecond resolution
> instead? The issue that I see with microseconds is that
> storing a file in an archive and extracting it again would
> otherwise keep the mtime stamp /almost/ identical on file
> systems that have nanosecond resolution, but most of
> the time a comparison would indicate that the files are
> not the same.

I have no strong opinion on this? The tmpfs is still going to track
nanoseconds, this is just rounding when it populates them.

> Unfortunately, the range of a 64-bit nanoseconds counter
> is still a bit limited (584 years, or half of that if we make it
> signed). While this is clearly enough for the uses in
> initramfs, it still has a similar problem: someone creating
> a fake timestamp a long time in the past or future on
> a file system would lose information after going through
> cpio.

Hence microseconds. This came up in email when we were talking about
this (like a year ago) and I decided I didn't care. :)

64 bits of microseconds is about 584 millennia (+- 292 if signed), while
being accurate enough[1] that making a getpid() syscall probably takes
longer than that on our highest end boxen, let alone doing a dentry
lookup in the vfs (even if it's hot in cache).

Rob

[1] Is future proofing an issue here? The s-curve of moore's law started
bending down around y2k back when Intel had to recall its 1.13ghz
pentium III for having overclocked its own chip at the factory, and it's
pretty darn flat these days. Clock speeds first hit 4ghz 15 years ago
and haven't been back, most of the work since 2005 has been about
parallelism, and recent performance improvements are once again going to
pentium 4 pipeline length levels of absurdity, as meltdown/spectre
demonstrates (140 instructions of prefetch!??!?). Maybe intel will make
9 nanometer manufacturing work, but atomic limits are already an issue.

The problem with 1 second timestamps was you honestly could confuse
"make" about which file was newer once an exec() could complete in the
same second having done real work. That was the motivating issue causing
the change, going to nanoseconds was just the big hammer of "this is
large enough it won't matter again in our lifetimes". But nanosecond
time stamps are recording more jitter than useful information, and that
seems unlikely to change this century?
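
The range and rounding arguments in this exchange are easy to check with a few lines of arithmetic. The sketch below is a rough check only: it assumes an unsigned 64-bit counter and a mean Gregorian year, and the sample mtime value is invented for the example.

```python
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60  # mean Gregorian year


def range_years(bits, ticks_per_second):
    """Years covered by an unsigned counter of the given width and resolution."""
    return 2 ** bits / ticks_per_second / SECONDS_PER_YEAR


ns_years = range_years(64, 10 ** 9)
us_years = range_years(64, 10 ** 6)
print(f"64-bit nanoseconds:  about {int(ns_years)} years")
print(f"64-bit microseconds: about {int(us_years)} years")

# Round trip of a nanosecond mtime through a microsecond field
# (the timestamp itself is an arbitrary example value).
mtime_ns = 1_517_000_000_123_456_789
stored_us = mtime_ns // 1000          # what a usec field can hold
restored_ns = stored_us * 1000        # what extraction can put back
print("nanoseconds lost in the round trip:", mtime_ns - restored_ns)
```

The microsecond counter covers a thousand times the span of the nanosecond one, at the cost of the sub-microsecond digits that make a strict mtime comparison fail after extraction.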
Commit fc72ae40e303 broke x86-64 build environment.
You've made the ORC unwinder part of allnoconfig, which means trying to
build "make ARCH=x86_64 allnoconfig" requires installing a new package
(libelf-dev) or else the build breaks.

What's worse, if I go into menuconfig and switch it back to frame
pointer, the build STILL breaks:

  $ make -j 8
  Makefile:932: *** "Cannot generate ORC metadata for CONFIG_UNWINDER_ORC=y, please install libelf-dev, libelf-devel or elfutils-libelf-devel". Stop.
  $ grep UNWIND .config
  # CONFIG_UNWINDER_ORC is not set
  CONFIG_UNWINDER_FRAME_POINTER=y
  # CONFIG_UNWINDER_GUESS is not set

As far as I can tell, x86-64 doesn't build anymore without libelf-dev.
It's a new hard requirement for the build.

Why?

Rob
powerpc64 kernel panic if you disable CONFIG_PPC_TRANSACTIONAL_MEM?
I just added a ppc64 target to https://github.com/landley/mkroot which
means I built 4.14 with the attached miniconfig and ran it with the
attached qemu command line, and it works fine as is but if you remove
the transactional mem line from the config the kernel panics instead of
launching a shell prompt:

init[1]: unhandled signal 4 at 10001a04 nip 10001a04 lr 1002ebe8 code 1
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0004
CPU: 0 PID: 1 Comm: init Not tainted 4.14.0 #1
Call Trace:
[ce02fa40] [c04ba730] dump_stack+0xb0/0xf0 (unreliable)
[ce02fa80] [c00602a0] panic+0x138/0x2f8
[ce02fb20] [c006541c] do_exit+0xa9c/0xaa0
[ce02fbe0] [c00654d8] do_group_exit+0x58/0xf0
[ce02fc20] [c0073274] get_signal+0x1c4/0x6b0
[ce02fd10] [c00142a0] do_signal+0x60/0x290
[ce02fe00] [c001461c] do_notify_resume+0x8c/0xd0
[ce02fe30] [c000b630] ret_from_except_lite+0x5c/0x60
Rebooting in 1 seconds..

Rob

powerpc64le.miniconf
Description: Binary data

qemu-powerpc64le.sh
Description: Bourne shell script
Re: [J-core] [PATCH v5 00/22] sh: LANDISK and R2Dplus convert to device tree
On 11/17/2017 04:37 AM, John Paul Adrian Glaubitz wrote:
> Hi there!
>
> On 07/03/2016 06:46 PM, Yoshinori Sato wrote:
>> SH get devicetree support. But it not working on existing H/W.
>>
>> IO-DATA HDL-U (aka landisk) currently supported.
>> This H/W like SH7751 evaluation board. It's a best to use this as a
>> change base H/W.
>> RTS7751R2Dplus is QEMU-SH4 target. So easy trying.
>
> This patch series - which would make a huge improvement - is still not
> applied. It would be very useful to be able to test the device tree
> implementation with QEMU.
>
> Any of the SH maintainers can apply this?

It's Rich's call, but given that it's _from_ one of the sh maintainers,
sounds to me like it can just go in if it still applies? (If there's
bugfixes needed they can go in -rc2 or so, after this merge window.)

Given that qemu serial's been broken for 9 months now, I doubt this
would make anything worse. (I should really check Cedric's qemu fork to
see if he fixed that...)

Rob
Re: Regression: commit da029c11e6b1 broke toybox xargs.
On 11/03/2017 08:37 PM, Kees Cook wrote:
> We don't. (In fact, arg copying happens before we've even figured out
> which binfmt is involved.) I lifted it to just before the point of no
> return, but moving it before arg copying looks very hard (which
> contributed to why we went with the implementation we did).
>
>> So it's pretty painful to make the limits different for suid and
>> non-suid binaries.
>
> I would agree.

I think I know what to implement for toybox now: xargs should trust
libc's sysconf() to provide the common-case starting limit (subtracting
env space) then implement the fallback pipe-from-child thing to
iteratively try half the argument list when that fails.

Elliott's even cc'd so he can update bionic's sysconf for the new 10 meg
thing from the title commit. :)

Rob
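
The halving fallback described above can be sketched in a few lines of Python. This is only an illustration of the idea, not toybox's code: run_split() is a made-up name, the sysconf-based starting limit is left out, and Python's subprocess module already does the "pipe back from the forked child" part internally, surfacing a failed exec in the parent as an OSError whose errno is E2BIG.

```python
import errno
import subprocess


def run_split(command, args, run=subprocess.run):
    """Run command over args, halving the batch whenever exec reports E2BIG.

    Returns the list of argument batches that actually ran, in order.
    The run parameter exists so the demo can swap in a stand-in.
    """
    pending = [list(args)]
    ran = []
    while pending:
        batch = pending.pop(0)
        try:
            run(list(command) + batch)
        except OSError as err:
            if err.errno != errno.E2BIG:
                raise
            if len(batch) <= 1:
                # Nothing left to split: report it rather than loop forever.
                raise RuntimeError("single argument too long for exec") from err
            half = len(batch) // 2
            pending[:0] = [batch[:half], batch[half:]]  # retry halves, in order
            continue
        ran.append(batch)
    return ran


def tiny_exec(argv):
    """Stand-in for exec that rejects more than 3 words (demo only)."""
    if len(argv) > 3:
        raise OSError(errno.E2BIG, "Argument list too long")


print(run_split(["echo"], ["a", "b", "c", "d", "e"], run=tiny_exec))
```

Halving keeps the number of failed attempts logarithmic in the batch size, at the cost of batches that may be shorter than the kernel would have accepted.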
Re: Regression: commit da029c11e6b1 broke toybox xargs.
Correcting Elliott's email to google, not gmail. (Sorry, I'm in Tokyo
for work this month, almost over the jetlag...)

On 11/03/2017 08:07 PM, Linus Torvalds wrote:
> On Fri, Nov 3, 2017 at 4:58 PM, Rob Landley wrote:
>> On 11/02/2017 10:40 AM, Linus Torvalds wrote:
>>
>> But it boils down to "got the limit wrong, the exec failed after the
>> fork(), dynamic recovery from which is awkward so I'm trying to figure
>> out the right limit".

Sounds later like dynamic recovery is what you recommend. (Awkward
doesn't mean I can't do it.)

> I suspect we _do_ have to raise that limit, because clearly this is a
> regression, but I absolutely _detest_ the fact that a stupid
> _embedded_ OS thinks that it should have a bigger stack limit than
> stuff that runs on supercomputers.
>
> That just makes me go "there's something seriously wrong".

This was me trying not to assume what other people will do. I think
android's default is still 8mb (it was in M) but my test systems for
this are literally on the other side of the planet right now.

Google's internal frame of reference is very different from mine. I got
pointed at a podcast (Android Developers Backstage #53) where Elliott
and another android dev talked about toybox for a few minutes in the
second half, and they shared a chuckle over my complaint that
downloading AOSP takes 150 gigabytes _before_ it tries to build
anything, and only the largest machine I own can build it at all (and
that very slowly). It was just so alien to them that this would be a
_problem_...

> For something like "xargs", I'm actually really saddened by the stupid
> decision to think it's a single value. The whole and *only* reason for
> xargs to exist is to just get it right,

Which is what I was trying very hard to do. :(

> and the natural thing for
> xargs to do would be to not ask, but simply try to do the whole thing,
> and if you get E2BIG, you decide to split it in half or something
> until it works. That kind of approach would just make it work
> _without_ depending on some magic value.
>
> The fact that apparently xargs is too stupid to do that, and instead
> requires _SC_ARG_MAX to magically give it the "One True Value(tm)" is
> just all kinds of crap.

I'm writing this xargs, I can _make_ it do that, it just requires a pipe
back from the forked child to return status and is either slow (remove
one argument at a time) or inaccurate (cut it in half, result coulda
been longer). Either way xargs still needs an internal limit or
"yes | xargs" will try to fill all memory before ever calling exec().

The reason I wanted to support "exactly as big as possible" is that
calling a command as one invocation vs multiple invocations can change
behavior. Once you've decided to split, how BIG you split is much less
important, so falling back to an arbitrary limit would be fine except
I'd still have to check the stack size to see if it's _lower_ than that
arbitrary limit. (If you set the stack ulimit to 128k, which nommu
systems may wanna do, then the exec limit is 32k. It can be _anything_.)

And this limit is shared with environment variables so the problem might
be that your environment's pathological and you can't run this command
line with even one argument because envp ate all the space, but that's
another story and the user can wash it through env -i to make it work.
Except:

  $ env -i {A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P}=$(printf '%0*d' 130657) \
    env | wc -c

Says 2090560 (of 2097152), but 130658 says argument list too long when
it's only 16 more bytes of the ~6k we should have left (envp[]=17*8,
argc=2*8, argv[0]=4...) argc and it sounds like you're saying I should
just stop _trying_ to figure out exact up-front measurements.

So stacksize /4, then split in half each time, and if it strips down to
one argument that can't run, have an error message for that. Ok.

> Oh well. Enough ranting.
>
> What _is_ the stack limit when using toybox? Is it just entirely unlimited?

Answer to second question on ubuntu 14.04:

  landley@driftwood:~/linux/linux/fs$ ulimit -s 9
  landley@driftwood:~/linux/linux/fs$ ulimit -s 9

Anybody can call ulimit to expand it as a normal user, so effectively
yes it is unlimited. I have no IDEA what my users are gonna do. (If they
do something stupid it's their fault, but I don't necessarily get to say
what stupid is from here.)

Answer to first: the default is whatever I inherited from the Android
fork du jour it's running on.

The google developers seem to be drinking from a firehose of
contributions from the half-dozen phone companies trying to get code
upstream. Elliott presumably says no to what he can but they're hugely
outnumbered and there's politics I'm only dimly aware of (never having
worked for google and only having met Elliott for lunch once a couple
years ago when I was
Re: Regression: commit da029c11e6b1 broke toybox xargs.
On 11/02/2017 10:40 AM, Linus Torvalds wrote:
> On Wed, Nov 1, 2017 at 9:28 PM, Linus Torvalds wrote:
>>
>> Behavior changed. Things that test particular limits will get different
>> results. That's not breakage.
>>
>> Did an actual user application or script break?

Only due to getting the limit wrong. The actual failure's in the android
internal bugzilla I've never been able to read:

  http://lists.landley.net/pipermail/toybox-landley.net/2017-September/009167.html

But it boils down to "got the limit wrong, the exec failed after the
fork(), dynamic recovery from which is awkward so I'm trying to figure
out the right limit".

> Ahh. I should have read that email more carefully. If xargs broke,
> that _will_ break actual scripts, yes. Do you actually set the stack
> limit to insane values? Anybody using toybox really shouldn't be doing
> 32MB stacks.

Toybox is the default command line of android since M, which went 64 bit
in L, and the Pixel 2 phone has 4 gigs of ram. My goal with toybox is to
turn android into a self-hosting development environment no longer
cross-compiled from a PC (http://landley.net/talks/celf-2013.txt) so I'm
trying to implement a command line that can run the entire AOSP build.

I.E. I have no idea what people will do with it, and try not to get in
their way.

My problem here is it's hard to figure out what exec size the limit
_is_. There's a sysconf(_SC_ARG_MAX) which bionic and glibc are
currently returning as stack_limit/4, which is now too big and exec()
will error out after the fork. Musl is returning the 131072 limit from
2011-ish, meaning "/bin/echo $(printf '%0*d' 131071)" works but
"printf '%0*d' 131071 | xargs" fails, an inconsistency I was trying to
avoid. Maybe I don't have that luxury...

Each argument has its own limit separate from the argv+envp total limit,
but there's only one "size" you can query through sysconf, so the
querying API is insufficient at the design level.

Meanwhile under bash you can allocate and dirty 256 megabytes from the
command line with:

  echo $(printf '%0*d' $((1<<28)))

Because it's a shell builtin so there's no actual exec. (And if
https://sourceware.org/bugzilla/show_bug.cgi?id=17829 ever gets fixed
it'll go back to allowing INT_MAX.)

Posix is its usual helpful self, read conservatively
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/xargs.html
says to break the line at 2048 bytes.

> So I still do wonder if this actually breaks anything real, or just a
> test-suite or something?

I've cc'd Elliott, who would know. (He's the Android base os userspace
maintainer, he knows everything. Or can at least decode
http://b/65818597 .)

But this just broke my _fix_, not the earlier deployed stuff. I removed
the size measuring code when the 131072 limit went away, the bug was
there's a new limit I need to not hit, I tried to figure out what the
limit is now, confirmed that the various libc implementations don't
agree, then the actual kernel limit changed again while I was looking at
it.

> Linus

Should I just go back to hardwiring in 131072? It's no _less_ arbitrary
than 10 megs, and it sounds like getting it _right_ is unachievable.

Thanks,

Rob
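
The "one queryable size, two actual limits" problem can be illustrated with a small Python estimator. This is a sketch under stated assumptions, not a statement of what any kernel enforces: the 131072 per-string cap is the figure discussed in this thread, the 8-byte pointer size assumes a 64-bit target, whether pointers count toward the total varies, and exec_size() and why_too_big() are hypothetical names.

```python
import os

MAX_ARG_STRLEN = 131072  # assumed per-string cap, including the trailing NUL
POINTER_SIZE = 8         # assumes a 64-bit target


def exec_size(argv, env):
    """Conservative estimate: each string, its NUL, and its argv/envp pointer."""
    strings = [os.fsencode(a) for a in argv]
    strings += [os.fsencode(k) + b"=" + os.fsencode(v) for k, v in env.items()]
    return sum(len(s) + 1 + POINTER_SIZE for s in strings)


def why_too_big(argv, env, total_limit, per_string_limit=MAX_ARG_STRLEN):
    """Return a reason the exec would probably fail, or None if it looks fine."""
    for arg in argv:
        if len(os.fsencode(arg)) + 1 > per_string_limit:
            return "single argument over the per-string limit"
    if exec_size(argv, env) > total_limit:
        return "argv plus envp over the total limit"
    return None


# sysconf() can only supply the second of the two limits checked above.
total = os.sysconf("SC_ARG_MAX")
print(why_too_big(["echo", "0" * 131071], dict(os.environ), total))
print(why_too_big(["echo", "0" * 131072], dict(os.environ), total))
```

The second call fails the per-string check no matter how large the total limit is, which is the half of the problem a single sysconf() value cannot express.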
Regression: commit da029c11e6b1 broke toybox xargs.
Toybox has been trying to figure out how big an xargs is allowed to be
for a while:

  http://lists.landley.net/pipermail/toybox-landley.net/2017-October/009186.html

We're trying to avoid the case where you can run something from the
command line, but not through xargs. In theory this limit is
sysconf(_SC_ARG_MAX) which on bionic and glibc returns 1/4 RLIMIT_STACK
(in accordance with the prophecy fs/exec.c function get_arg_page()), but
that turns out to be too simple. There's also a 131071 byte limit on
each _individual_ argument, which I think I've tracked down to fs/exec.c
function setup_arg_pages() doing:

  stack_expand = 131072UL; /* randomly 32*4k (or 2*64k) pages */

And then it worked under ubuntu 14.04 but not current kernels. Why?
Because the above commit from Kees Cook broke it, by taking this:

include/uapi/linux/resource.h:
/*
 * Limit the stack by to some sane default: root can always
 * increase this limit if needed..  8MB seems reasonable.
 */
#define _STK_LIM	(8*1024*1024)

And hardwiring in a random adjustment as a "640k ought to be enough for
anybody" constant on TOP of the existing RLIMIT_STACK/4 check. Without
even adjusting the "oh of course root can make this bigger, this is just
a default value" comment where it's #defined.

Look, if you want to cap RLIMIT_STACK for suid binaries, go for it. The
existing code will notice and adapt. But this new commit is crazy and
arbitrary and introduces more random version dependencies (how is
sysconf() supposed to know the value, an #if/else staircase based on
kernel version in every libc)?

Please revert it,

Rob
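
For readers who want to see the numbers on their own system, here is a small Python sketch of the stack/4 rule described above. It is illustrative only: quarter_of_stack() and effective_limit() are made-up names, and the kernel_cap parameter stands in for whatever fixed constant a particular kernel version hardwires, which is exactly the value userspace has no way to query.

```python
import os
import resource


def quarter_of_stack(soft_limit):
    """Return a quarter of the stack rlimit, or None when it is unlimited."""
    if soft_limit == resource.RLIM_INFINITY:
        return None
    return soft_limit // 4


def effective_limit(soft_limit, kernel_cap=None):
    """Combine the stack/4 rule with an optional fixed kernel-side cap.

    kernel_cap is a hypothetical parameter: the caller has to know it
    from somewhere else, since no API reports it.
    """
    candidates = [v for v in (quarter_of_stack(soft_limit), kernel_cap)
                  if v is not None]
    return min(candidates) if candidates else None


soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
print("RLIMIT_STACK/4:      ", quarter_of_stack(soft))
print("sysconf(_SC_ARG_MAX):", os.sysconf("SC_ARG_MAX"))
```

With the default 8MB stack the first line prints 2097152. Whether the second line matches depends on which libc is underneath, which is the disagreement the message complains about.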
[PATCH 1/1] Change ping_group_range default to what Android's init script sets.
From: Rob Landley <r...@landley.net>

See message from the Android "native tools and libraries team" lead
(I.E. the maintainer of bionic, adb, toolbox, etc) at
http://lists.landley.net/pipermail/toybox-landley.net/2017-July/009103.html

Signed-off-by: Rob Landley <r...@landley.net>
---

 net/ipv4/af_inet.c |    8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e31108e..5b39a96 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1712,12 +1712,8 @@ static __net_init int inet_init_net(struct net *net)
 	net->ipv4.ip_local_ports.range[1] = 60999;
 
 	seqlock_init(&net->ipv4.ping_group_range.lock);
-	/*
-	 * Sane defaults - nobody may create ping sockets.
-	 * Boot scripts should set this to distro-specific group.
-	 */
-	net->ipv4.ping_group_range.range[0] = make_kgid(&init_user_ns, 1);
-	net->ipv4.ping_group_range.range[1] = make_kgid(&init_user_ns, 0);
+	net->ipv4.ping_group_range.range[0] = make_kgid(&init_user_ns, 0);
+	net->ipv4.ping_group_range.range[1] = make_kgid(&init_user_ns, 2147483647);
 
 	/* Default values for sysctl-controlled parameters.
 	 * We set them here, in case sysctl is not compiled.
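
What the two defaults mean for userspace can be sketched in Python. The range is inclusive, so the old "1 0" default is an empty range (nobody may create ping sockets) while "0 2147483647" admits every group. The helper names below are hypothetical, and the check is simplified: it tests a single gid, whereas the kernel also considers a process's supplementary groups.

```python
import socket


def parse_ping_group_range(text):
    """Parse the two whitespace-separated gids found in ping_group_range."""
    low, high = (int(field) for field in text.split())
    return low, high


def gid_may_ping(text, gid):
    """True if gid lies in the inclusive range; "1 0" is empty, so nobody does."""
    low, high = parse_ping_group_range(text)
    return low <= gid <= high


def try_ping_socket():
    """Attempt to open an unprivileged ICMP datagram socket (Linux only)."""
    try:
        socket.socket(socket.AF_INET, socket.SOCK_DGRAM,
                      socket.IPPROTO_ICMP).close()
        return True
    except OSError:
        return False


print("old default admits gid 1000:", gid_may_ping("1 0", 1000))
print("new default admits gid 1000:", gid_may_ping("0 2147483647", 1000))
print("ping socket available here: ", try_ping_socket())
```

On a running system the current value can be read from /proc/sys/net/ipv4/ping_group_range and passed to gid_may_ping() as-is.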
Re: [PATCH v3] Make initramfs honor CONFIG_DEVTMPFS_MOUNT
On 09/17/2017 08:51 AM, Henrique de Moraes Holschuh wrote:
> On Sat, 16 Sep 2017, Rob Landley wrote:
>> So, I added a workaround with a printk in hopes of embarrassing them into
>> someday fixing it.
>
> Oh, it will be fixed in Debian alright.

Cool! But part of the problem is people upgrade the kernel on existing
deployed root filesystems, some of which are a fork off of a fork off of
debian, so we won't exhaust the broken userspace for probably a couple years.

I'd put it in feature-removal-schedule.txt but Linus zapped that, so...

> I am just waiting the issue to
> settle a bit to file the bug reports, or maybe even send in the Debian
> patches myself (note that I am not responsible for the code in question,
> so I am not wearing a brown paperbag at this time). Even if I didn't do
> it, there are several other Debian Developers reading LKML that could do
> it (provided they noticed this specific thread and are aware of the
> situation) :p

There was a previous thread last merge window they didn't notice. I was
hoping the warning would be obvious enough. :)

> I can even push for the fixes to be accepted into the stable and
> oldstable branches of Debian, but that can take anything from a few
> weeks to several months, due to the way our stable releases work. But
> it would eventually happen.
>
> Whether such fixes will ever make it to LTS branches, especially
> Ubuntu's, *that* I don't know.

I have no idea what that powerpc system was, the guy didn't say...

Rob
Re: [PATCH v3] Make initramfs honor CONFIG_DEVTMPFS_MOUNT
On 09/14/2017 04:17 AM, Christophe LEROY wrote:
> Le 14/09/2017 à 01:51, Rob Landley a écrit :
>> From: Rob Landley <r...@landley.net>
>>
>> Make initramfs honor CONFIG_DEVTMPFS_MOUNT, and move
>> /dev/console open after devtmpfs mount.
>>
>> Add workaround for Debian bug that was copied by Ubuntu.
>
> Is that a bug only for Debian ? Why ?

Look down, specifically this bit:

>> v2 discussion:
>> http://lkml.iu.edu/hypermail/linux/kernel/1705.2/05611.html

That's some discussion of version 2 of this patch, which was merged for a
while last dev cycle, then backed out again because it triggered the same bug
in a number of system init scripts:

  http://lkml.iu.edu/hypermail/linux/kernel/1705.2/07072.html
  http://lkml.iu.edu/hypermail/linux/kernel/1705.3/01182.html
  http://lkml.iu.edu/hypermail/linux/kernel/1705.3/01505.html
  http://lkml.iu.edu/hypermail/linux/kernel/1705.3/01320.html

All of whom copied the broken error "recovery" path from debian. If they
checked whether it was already mounted, or didn't _blank_ the /dev directory
in response to mounting the exact same filesystem over itself giving -EBUSY,
the system would work fine.

Heck, if you built a kernel with a static /dev in initramfs and no devtmpfs
configured in, the script would break things exactly the same way. The
breakage is that script takes a hammer to a perfectly functional /dev
directory and then continues the boot with an empty /dev. That's bonkers.

> Why should a Debian bug be fixed by a workaround in the mainline kernel ?

That was my argument last time, and the answer was "Breaking userspace is
bad, mmmkay." Even when userspace is doing something REALLY OBVIOUSLY STUPID
and it is _clearly_ their fault, as long as they got there first they've
established the status quo and it doesn't matter how silly it is.

This was explicitly stated to me here:

  http://lkml.iu.edu/hypermail/linux/kernel/1705.3/03292.html

I.E. don't argue with me, argue with him. :)

So, I added a workaround with a printk in hopes of embarrassing them into
someday fixing it.

Rob
[PATCH v3] Make initramfs honor CONFIG_DEVTMPFS_MOUNT
From: Rob Landley <r...@landley.net>

Make initramfs honor CONFIG_DEVTMPFS_MOUNT, and move
/dev/console open after devtmpfs mount.

Add workaround for Debian bug that was copied by Ubuntu.

Signed-off-by: Rob Landley <r...@landley.net>
---

v2 discussion: http://lkml.iu.edu/hypermail/linux/kernel/1705.2/05611.html

 drivers/base/Kconfig | 14 ++++----------
 fs/namespace.c       | 14 ++++++++++++++
 init/main.c          | 15 +++++++++------
 3 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index f046d21..97352d4 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -48,16 +48,10 @@ config DEVTMPFS_MOUNT
 	bool "Automount devtmpfs at /dev, after the kernel mounted the rootfs"
 	depends on DEVTMPFS
 	help
-	  This will instruct the kernel to automatically mount the
-	  devtmpfs filesystem at /dev, directly after the kernel has
-	  mounted the root filesystem. The behavior can be overridden
-	  with the commandline parameter: devtmpfs.mount=0|1.
-	  This option does not affect initramfs based booting, here
-	  the devtmpfs filesystem always needs to be mounted manually
-	  after the rootfs is mounted.
-	  With this option enabled, it allows to bring up a system in
-	  rescue mode with init=/bin/sh, even when the /dev directory
-	  on the rootfs is completely empty.
+	  Automatically mount devtmpfs at /dev on the root filesystem, which
+	  lets the system to come up in rescue mode with [rd]init=/bin/sh.
+	  Override with devtmpfs.mount=0 on the commandline. Initramfs can
+	  create a /dev dir as needed, other rootfs needs the mount point.
 
 config STANDALONE
 	bool "Select only drivers that don't need compile-time external firmware"
diff --git a/fs/namespace.c b/fs/namespace.c
index f8893dc..06057d7 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2417,7 +2417,21 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
 	err = -EBUSY;
 	if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
 	    path->mnt->mnt_root == path->dentry)
+	{
+		if (IS_ENABLED(CONFIG_DEVTMPFS_MOUNT) &&
+		    !strcmp(path->mnt->mnt_sb->s_type->name, "devtmpfs"))
+		{
+			/* Debian's kernel config enables DEVTMPFS_MOUNT, then
+			   its initramfs setup script tries to mount devtmpfs
+			   again, and if the second mount-over-itself fails
+			   the script overmounts a tmpfs on /dev to hide the
+			   existing contents, then boot fails with empty /dev. */
+			printk(KERN_WARNING "Debian bug workaround for devtmpfs overmount.");
+
+			err = 0;
+		}
 		goto unlock;
+	}
 
 	err = -EINVAL;
 	if (d_is_symlink(newmnt->mnt.mnt_root))
diff --git a/init/main.c b/init/main.c
index 0ee9c686..0d8e5ec 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1065,12 +1065,6 @@ static noinline void __init kernel_init_freeable(void)
 
 	do_basic_setup();
 
-	/* Open the /dev/console on the rootfs, this should never fail */
-	if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
-		pr_err("Warning: unable to open an initial console.\n");
-
-	(void) sys_dup(0);
-	(void) sys_dup(0);
 	/*
 	 * check if there is an early userspace init.  If yes, let it do all
 	 * the work
@@ -1082,8 +1076,17 @@ static noinline void __init kernel_init_freeable(void)
 	if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
 		ramdisk_execute_command = NULL;
 		prepare_namespace();
+	} else if (IS_ENABLED(CONFIG_DEVTMPFS_MOUNT)) {
+		sys_mkdir("/dev", 0755);
+		devtmpfs_mount("/dev");
 	}
 
+	/* Open the /dev/console on the rootfs, this should never fail */
+	if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
+		pr_err("Warning: unable to open an initial console.\n");
+	(void) sys_dup(0);
+	(void) sys_dup(0);
+
 	/*
 	 * Ok, we have completed the initial bootup, and
 	 * we're essentially up and running. Get rid of the
Re: Patch 0727d35de ("Make initramfs honor CONFIG_DEVTMPFS_MOUNT") breaks boot
On 09/11/2017 06:45 AM, Petr Mladek wrote:
>> Except for the second printk line: If you boot with rdinit=/bin/hush
>> then the first time you mount -t devtmpfs /dev /dev after boot (with
>> CONFIG_DEVTMPFS_MOUNT already having mounted it), you get the 0 return
>> value but the last printk() doesn't output? The second and later times
>> you repeat it, both printk() lines are output.
>>
>> What's up with printk?
>>
>> (I added the second printk because the _first_ one wasn't outputting
>> that first time. Something is happening to flush the printk() queue
>> instead of writing it out?
>
> You need to add "\n" at the end of the line. Otherwise, it expects
> that the message would continue and puts it into a cont buffer.
> The buffer is flushed only when another non-continuous message
> is added.

Ah. The next one flushes the previous one, meaning when I repeat the command
I get the output I expected the second time but I'm seeing the _previous_
instance of it, not the current one.

> This problem is more visible since the commit 5c2992ee7fd8a29d0412
> ("printk: remove console flushing special cases for partial buffered
> lines").

Gotcha. My bad.

Thanks,

Rob
Re: execve(NULL, argv, envp) for nommu?
On 09/12/2017 06:30 AM, Geert Uytterhoeven wrote:
> Hi Rob,
>
> On Tue, Sep 12, 2017 at 12:48 PM, Rob Landley <r...@landley.net> wrote:
>> Your stack has pointers. Your heap has pointers. Your data and bss (once
>> initialized) can have pointers. These pointers can be in the middle of
>> malloc()'ed structures so no ELF table anywhere knows anything about
>> them. A long variable containing a value that _could_ point into one of
>> these ranges isn't guaranteed to _be_ a pointer, in which case adjusting
>> it is breakage. Tracking them all down and fixing up just the right ones
>> without missing any or changing data you shouldn't is REALLY HARD.
>
> Hence (make the compiler) never store pointers, only offsets relative to a
> base register. So after making copies of stack, data/bss, and heap, all you
> need to do is adjust these base registers for the child process.
> Nothing in main memory needs to be modified.

Ok, I'll bite.

How do you set a signal handler under this regime, since that needs to pass a
function pointer to the syscall? Have a different function pointer type for
when you want a real pointer instead of an offset pointer? Perhaps label them
"near" and "far" pointers, since there's precedent for that back under DOS?

When you call printf(), how does it accept both a "string constant" living in
rodata and a char array on the stack? Two printf functions with different
argument types? If it _does_ take an actual memory address rather than an
offset that isn't always vs the same segment then you've written pointers to
the stack...

You're also requiring static linking: shared libraries work just fine with
fdpic, but under your segment:offset addressing system all text has to be
relative to the same code segment.

Plus there's still the "fork() off of mozilla" problem that you may copy lots
of data just to immediately discard it as the common case (unless you'd still
use vfork() for most things), and you still need contiguous blocks of memory
for each segment (nommu is vulnerable to fragmentation, increasingly so as the
system stays up longer) so your fork() will fail where vfork() succeeds. But
that just makes it really slow and unreliable, rather than requiring a large
rewrite of the C language.

> Text accesses can be PC-relative => nothing to adjust.
> Local variable accesses are stack-relative => nothing to adjust.
> Data/bss accesses can be relative to a reserved register that stores the
> data base address => only adjust the base register, nothing in RAM to adjust.

Does this compiler setup you're describing actually exist?

Instead of making a minor adjustment to one system call, it's better to
extensively rewrite compilers and calling conventions, ignoring the way C
traditionally treats strings and arrays as pointers where pointers into data,
bss, heap, and stack are all used interchangeably...

> Heap accesses can be relative to a reserved register that stores the heap
> base address => only adjust the base register, nothing in RAM to adjust.

Query: if you implement a linked list ala:

  struct blah {
    struct blah *next;
    char *key, *value;
  };

If next points to a malloc(), key is a constant string in rodata, and value
was strchr(getenv(key), '=')+1 (with appropriate error checking of course),
how does your compiler know which segment each pointer in that structure is
offset from? (What segment IS your environment space relative to, anyway?
It's not the _current_ value of your stack pointer, that moves.)

How does your proposed compiler rewrite handle mmap()? You can do MAP_SHARED
just fine on nommu today, it's only MAP_PRIVATE that requires copy on write.
(Yes MAP_SHARED can be read only.)

You're aware that most heap implementations can have more than one underlying
mmap(), right?

  http://git.musl-libc.org/cgit/musl/tree/src/malloc/malloc.c#n320
  https://github.com/kraj/uClibc/blob/master/libc/stdlib/malloc/malloc.c#L121

So when you say _the_ heap base address above, which chunk are you referring
to?

Rob
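To make the linked list question concrete, here is a small sketch that builds one node whose three pointers land in three different regions. The struct is the one from the message; push_env() is a name invented for this example. One detail differs from the message's wording: getenv() already returns a pointer to the value part of the environment string, so the sketch stores its return value directly.

```c
#include <stdlib.h>
#include <string.h>

struct blah {
    struct blah *next;
    char *key, *value;
};

/* Prepend a node for environment variable key to the list at head.
 * Returns the new head, or head unchanged if the variable is unset
 * or the allocation fails. Nothing is copied: all three members just
 * record addresses that live in unrelated parts of the address space. */
struct blah *push_env(struct blah *head, char *key)
{
    char *val = getenv(key);
    struct blah *node;

    if (!val)
        return head;
    node = malloc(sizeof(*node));
    if (!node)
        return head;
    node->next = head;  /* heap address (or NULL) */
    node->key = key;    /* wherever the caller's string lives, e.g. rodata */
    node->value = val;  /* points into the process environment block */
    return node;
}
```

After push_env(NULL, "HOME") the single malloc()'ed node holds a NULL, an address in read-only data, and an address in the environment area. Nothing in the node says which is which, which is the point of the question above.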
Re: execve(NULL, argv, envp) for nommu?
On 09/11/2017 10:15 AM, Oleg Nesterov wrote:
> On 09/08, Rob Landley wrote:
>>
>> So is exec(NULL, argv, envp) a reasonable thing to want?
>
> I think that something like prctl(PR_OPEN_EXE_FILE) which does
>
>	dentry_open(current->mm->exe_file->path, O_PATH)
>
> and returns fd make more sense.
>
> Then you can do execveat(fd, "", ..., AT_EMPTY_PATH).

I'm all for it? That sounds like a cosmetic difference, a more verbose way of
achieving the same outcome.

(Of course now you've got a filehandle you can read xattrs and such through
from otherwise jailed contexts letting you do things you couldn't necessarily
do before, but I assume you know the security implications of that more than
I do. I tried to suggest something that _didn't_ create new capabilities,
just let nommu do a thing that mmu could already do.)

> But to be honest, I can't understand the problem, because I know nothing
> about nommu.
>
> You need to unblock parent sleeping in vfork(), and you can't do another
> fork (I don't understand why).

A nommu system doesn't have a memory management unit, so all addresses are
physical addresses. This means two processes can't see different things at the
same address: either they see the same thing or one of them can't see that
address (due to a range register making it).

Conventional fork() creates copy on write mappings of all the existing
writable memory of the parent process. So when the new PID dirties a page, the
old page gets copied by the fault handler. The problem isn't the copies
(that's just slow), the problem is two processes seeing different things at
the same address. That requires an MMU with a TLB loaded from page tables.

If you create _new_ mappings and copy the data over, they'll have different
addresses. But any pointers you copied will point to the _old_ addresses.
Finding and adjusting all those pointers to point to the new addresses instead
is basically the same problem as doing garbage collection in C.

Your stack has pointers. Your heap has pointers. Your data and bss (once
initialized) can have pointers. These pointers can be in the middle of
malloc()'ed structures so no ELF table anywhere knows anything about them. A
long variable containing a value that _could_ point into one of these ranges
isn't guaranteed to _be_ a pointer, in which case adjusting it is breakage.
Tracking them all down and fixing up just the right ones without missing any
or changing data you shouldn't is REALLY HARD.

The vfork() system call is what you use on nommu instead: it creates a child
process that uses its parent's memory mappings. The parent process is stopped
until the child calls _exit() or exec(), either of which means it stops using
those mappings and the parent can go back to using them without the two
stomping on each other.

(Usually they even share the same stack, so the child shouldn't return from
the function that called vfork() or it'll corrupt the stack for the parent
process. And be careful about changing local variables, the parent might see
the changes when it resumes. Some vfork() implementations provide a small new
stack, ala signal handlers or kernel interrupts, so you can't guarantee your
parent will see your local variable changes, but you still can't return from
the function that called vfork() in either case.)

So after calling vfork(), the child _must_ call exec() in order for there to
be two independent processes running at the same time. Until then, the parent
is stopped.

The real problem with implementing full fork() isn't the expense of copying
the data (although if you fork and exec from a mozilla style pig process, you
could copy hundreds of megabytes of data and then immediately discard it
again; that's why fork() doesn't usually do that; oh and on nommu systems you
need _contiguous_ memory blocks for the data because it can't collect
disparate pages together into a longer mapping, so this is actually a largeish
real-world issue on those systems, not merely slow and expensive.) The hard
problem is translating the pointers so the new mapping doesn't read/write
objects in the old mapping.

> Perhaps the child can create another thread? The main thread can exit
> after that and unblock the parent. Or perhaps even something like
> clone(CLONE_VM | CLONE_PARENT), I dunno...

Launching a new thread doesn't unblock the parent. A second vfork() from the
child wouldn't unblock the parent. Your mappings are still overcommitted, only
_exit() or execve() releases the child process's use of those mappings.

You can create threads on nommu because they're designed to share the same
mappings. In that case you're guaranteed a new stack, and not stomping the
parent's data is your problem. But if you exec() from a thread, posix says it
kills all the other threads:

  http://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html

And even without that, we're still in the "vfork
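For reference, the vfork() usage pattern described in this message (child does nothing except exec or _exit, parent stays stopped until then) looks roughly like this in userspace. It is a minimal sketch; spawn_and_wait() is a name made up for the example and the 127 exit code for a failed exec is just the usual shell convention.

```c
#define _DEFAULT_SOURCE  /* make glibc declare vfork() under strict -std modes */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] (searched for in PATH) as a child and return its exit status,
 * or -1 if the child could not be created or did not exit normally. */
int spawn_and_wait(char *const argv[])
{
    int status;
    pid_t pid = vfork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: running on the parent's memory, so touch nothing,
         * never return from this function, just exec or _exit. */
        execvp(argv[0], argv);
        _exit(127);  /* exec failed */
    }
    /* Parent: only resumes once the child has exec'd or exited. */
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Note that the child calls _exit() rather than exit() on failure: running atexit handlers or flushing stdio buffers from the child would scribble on state the parent still owns.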
Re: execve(NULL, argv, envp) for nommu?
On 09/11/2017 10:15 AM, Oleg Nesterov wrote:
> On 09/08, Rob Landley wrote:
>>
>> So is exec(NULL, argv, envp) a reasonable thing to want?
>
> I think that something like prctl(PR_OPEN_EXE_FILE) which does
>
>     dentry_open(current->mm->exe_file->path, O_PATH)
>
> and returns fd makes more sense.
>
> Then you can do execveat(fd, "", ..., AT_EMPTY_PATH).

I'm all for it? That sounds like a cosmetic difference, a more verbose
way of achieving the same outcome.

(Of course now you've got a filehandle you can read xattrs and such
through from otherwise jailed contexts letting you do things you
couldn't necessarily do before, but I assume you know the security
implications of that more than I do. I tried to suggest something that
_didn't_ create new capabilities, just let nommu do a thing that mmu
could already do.)

> But to be honest, I can't understand the problem, because I know nothing
> about nommu.
>
> You need to unblock parent sleeping in vfork(), and you can't do another
> fork (I don't understand why).

A nommu system doesn't have a memory management unit, so all addresses
are physical addresses. This means two processes can't see different
things at the same address: either they see the same thing or one of
them can't see that address (due to a range register masking it).

Conventional fork() creates copy on write mappings of all the existing
writable memory of the parent process. So when the new PID dirties a
page, the old page gets copied by the fault handler. The problem isn't
the copies (that's just slow), the problem is two processes seeing
different things at the same address. That requires an MMU with a TLB
loaded from page tables.

If you create _new_ mappings and copy the data over, they'll have
different addresses. But any pointers you copied will point to the _old_
addresses. Finding and adjusting all those pointers to point to the new
addresses instead is basically the same problem as doing garbage
collection in C.

Your stack has pointers. Your heap has pointers. Your data and bss (once
initialized) can have pointers. These pointers can be in the middle of
malloc()'ed structures so no ELF table anywhere knows anything about
them. A long variable containing a value that _could_ point into one of
these ranges isn't guaranteed to _be_ a pointer, in which case adjusting
it is breakage. Tracking them all down and fixing up just the right ones
without missing any or changing data you shouldn't is REALLY HARD.

The vfork() system call is what you use on nommu instead: it creates a
child process that uses its parent's memory mappings. The parent process
is stopped until the child calls _exit() or exec(), either of which
means it stops using those mappings and the parent can go back to using
them without the two stomping on each other.

(Usually they even share the same stack, so the child shouldn't return
from the function that called vfork() or it'll corrupt the stack for the
parent process. And be careful about changing local variables, the
parent might see the changes when it resumes. Some vfork()
implementations provide a small new stack, ala signal handlers or kernel
interrupts, so you can't guarantee your parent will see your local
variable changes, but you still can't return from the function that
called vfork() in either case.)

So after calling vfork(), the child _must_ call exec() in order for
there to be two independent processes running at the same time. Until
then, the parent is stopped.

The real problem with implementing full fork() isn't the expense of
copying the data (although if you fork and exec from a mozilla style pig
process, you could copy hundreds of megabytes of data and then
immediately discard it again; that's why fork() doesn't usually do that;
oh and on nommu systems you need _contiguous_ memory blocks for the data
because it can't collect disparate pages together into a longer mapping,
so this is actually a largeish real-world issue on those systems, not
merely slow and expensive). The hard problem is translating the pointers
so the new mapping doesn't read/write objects in the old mapping.

> Perhaps the child can create another thread? The main thread can exit
> after that and unblock the parent. Or perhaps even something like
> clone(CLONE_VM | CLONE_PARENT), I dunno...

Launching a new thread doesn't unblock the parent. A second vfork() from
the child wouldn't unblock the parent. Your mappings are still
overcommitted; only _exit() or execve() releases the child process's use
of those mappings.

You can create threads on nommu because they're designed to share the
same mappings. In that case you're guaranteed a new stack, and not
stomping the parent's data is your problem. But if you exec() from a
thread, posix says it kills all the other threads:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html

And even without that, we're still in the "vfork
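For readers who haven't used it, the vfork() pattern described in the message above can be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name spawn_and_wait() is hypothetical, and it assumes a libc that declares vfork() in <unistd.h>. The child does nothing except call execv() or _exit(), since until one of those happens it is borrowing the parent's memory and the parent stays stopped.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): run a program with
 * vfork() + execv() and return its exit status, or -1 on error.
 */
int spawn_and_wait(const char *path, char *const argv[])
{
	int status;
	pid_t pid = vfork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		/*
		 * Child: still using the parent's mappings, so do nothing
		 * here except exec, or _exit() if the exec fails. Never
		 * return from this function in the child.
		 */
		execv(path, argv);
		_exit(127);
	}

	/* Parent: only resumes after the child has exec'd or exited. */
	if (waitpid(pid, &status, 0) != pid)
		return -1;

	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with "/bin/sh" and an argv of { "sh", "-c", "exit 3", NULL } should return 3 on a typical Linux system. The 127 exit code for a failed exec is just the shell's convention, chosen here for familiarity.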
Re: Patch 0727d35de ("Make initramfs honor CONFIG_DEVTMPFS_MOUNT") breaks boot
Taking another stab at this old issue from last merge window...

> Rob Landley <r...@landley.net> writes:
>> On 05/23/2017 03:01 AM, Yury Norov wrote:
>>> On Mon, May 22, 2017 at 09:07:54PM -0500, Rob Landley wrote:
>>>> Your userspace mounted a tmpfs over /dev when it couldn't mount a second
>>>> identical instance of devtmpfs over itself. If you had a static /dev in
>>>> initramfs but didn't configure _in_ devtmpfs to your kernel, your broken
>>>> error path would have taken that out too with a pointless tmpfs mount.
>>>
>>> CONFIG_DEVTMPFS_MOUNT is enabled on my machine, so I think your
>>> suggestion is correct. But I didn't do that specifically - I run
>>> almost default kernel based on Ubuntu 14.04 config and environment.
>>
>> I.E. ubuntu has a bug: they enabled CONFIG_DEVTMPFS_MOUNT and then
>> launched an initramfs instead (which didn't do the automount they
>> requested so why request it), but if CONFIG_DEVTMPFS_MOUNT actually
>> starts working in initramfs they have an insane error path that breaks
>> the system, and does nothing _except_ break the system.
...

On 05/25/2017 01:13 AM, Michael Ellerman wrote:
> Hi Rob,
>
> This is breaking a bunch of my powerpc boxes, for the exact same
> reason, they use a config that has DEVTMPFS_MOUNT=y and that trips
> up the initramfs.

I've continued to use this locally but should probably make another stab
at submitting upstream. The obvious workaround until debian fixes its
100% obvious bug seems to be:

diff --git a/fs/namespace.c b/fs/namespace.c
index f8893dc..f57d5df 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2417,7 +2417,17 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
 	err = -EBUSY;
 	if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
 	    path->mnt->mnt_root == path->dentry)
+	{
+		if (IS_ENABLED(CONFIG_DEVTMPFS_MOUNT) &&
+		    !strcmp(path->mnt->mnt_sb->s_type->name, "devtmpfs"))
+		{
+			printk(KERN_WARNING "Debian bug workaround for devtmpfs overmount.");
+			printk(KERN_WARNING "This line doesn't output for some reason.");
+
+			err = 0;
+		}
 		goto unlock;
+	}
 
 	err = -EINVAL;
 	if (d_is_symlink(newmnt->mnt.mnt_root))

Except for the second printk line: If you boot with rdinit=/bin/hush
then the first time you mount -t devtmpfs /dev /dev after boot (with
CONFIG_DEVTMPFS_MOUNT already having mounted it), you get the 0 return
value but the last printk() doesn't output? The second and later times
you repeat it, both printk() lines are output.

What's up with printk? (I added the second printk because the _first_
one wasn't outputting that first time. Something is happening to flush
the printk() queue instead of writing it out? Built for x86-64,
miniconfig attached for reference. I tested commit 4dfc2788033d from
yesterday.)

Rob

# make ARCH=x86 allnoconfig KCONFIG_ALLCONFIG=x86_64.miniconf
# make ARCH=x86 -j $(nproc)
# boot arch/x86/boot/bzImage

CONFIG_64BIT=y
CONFIG_PCI=y
CONFIG_BLK_DEV_SD=y
CONFIG_ATA=y
CONFIG_ATA_SFF=y
CONFIG_ATA_BMDMA=y
CONFIG_ATA_PIIX=y
CONFIG_NET_VENDOR_INTEL=y
CONFIG_E1000=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_RTC_CLASS=y
# CONFIG_EMBEDDED is not set
CONFIG_EARLY_PRINTK=y
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_SCRIPT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_RD_GZIP=y
CONFIG_BLK_DEV_LOOP=y
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT2=y
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_UTF8=y
CONFIG_MISC_FILESYSTEMS=y
CONFIG_SQUASHFS=y
CONFIG_SQUASHFS_XATTR=y
CONFIG_SQUASHFS_ZLIB=y
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_NET=y
CONFIG_PACKET=y
CONFIG_UNIX=y
CONFIG_INET=y
CONFIG_IPV6=y
CONFIG_NETDEVICES=y
#CONFIG_NET_CORE=y
#CONFIG_NETCONSOLE=y
CONFIG_ETHERNET=y
Re: execve(NULL, argv, envp) for nommu?
On 09/05/2017 08:12 PM, Rob Landley wrote:
> On 09/05/2017 08:24 AM, Alan Cox wrote:
>>>> honoring the suid bit if people feel that way. I just wanna unblock
>>>> vfork() while still running this code.
>>
>> Would it make more sense to have a way to promote your vfork into a
>> fork when you hit these cases (I appreciate that fork on NOMMU has a much
>> higher performance cost as you start having to softmmu copy or swap
>> pages).
>
> It's not the performance cost, it's rewriting all the pointers.
>
> Without address translation, copying the existing mappings to a new
> range requires finding and adjusting every pointer to the old data,
> which you can do for the executable mappings in PIE* binaries, but
> tracking down all the pointers on the stack, heap, and in your global
> variables? Flaming pain.
>
> Making fork() work on nommu is basically the same problem as making
> garbage collection work in C on mmu. Thus those of us who defend vfork()
> from the people who don't understand why it exists periodically
> suggesting we remove it.

So is exec(NULL, argv, envp) a reasonable thing to want?

Rob
Re: execve(NULL, argv, envp) for nommu?
On 09/05/2017 08:24 AM, Alan Cox wrote:
>>> anymore. But I'm already _running_ this program. If I could fork() I
>>> could already get a second copy of the sucker and call main() again
>>> myself if necessary, but I can't, so...
>
> You can - ptrace 8)

Oh I can call clone() with various flags and try to fake it myself, it
just won't do what I want. :)

>>> honoring the suid bit if people feel that way. I just wanna unblock
>>> vfork() while still running this code.
>
> Would it make more sense to have a way to promote your vfork into a
> fork when you hit these cases (I appreciate that fork on NOMMU has a much
> higher performance cost as you start having to softmmu copy or swap
> pages).

It's not the performance cost, it's rewriting all the pointers.

Without address translation, copying the existing mappings to a new
range requires finding and adjusting every pointer to the old data,
which you can do for the executable mappings in PIE* binaries, but
tracking down all the pointers on the stack, heap, and in your global
variables? Flaming pain.

Making fork() work on nommu is basically the same problem as making
garbage collection work in C on mmu. Thus those of us who defend vfork()
from the people who don't understand why it exists periodically
suggesting we remove it.

> Alan

Rob

* or FDPIC, which is basically just PIE with 4 individually relocatable
text/data/rodata/bss segments instead of one big mapping you relocate as
a contiguous block; both work on nommu but fdpic can fit into more
fragmented memory, and because the segments are independent it lets
nommu share some segments between processes (code+rodata**) without
sharing others (data and bss). That's why nommu can't run normal elf but
can run PIE or FDPIC binaries. Or binflt which is the old a.out version.

** Don't ask me what happens when rodata contains a constant pointer to
a bss or data object. I'm guessing the compiler Does A Thing. Ask Rich
Felker?
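The "rewriting all the pointers" problem from the message above can be seen in miniature with a plain memcpy(). The following is only a toy sketch with hypothetical names (struct record, copy_still_points_at_original()), not anything from the thread: it copies a structure containing a pointer into itself and checks where the copied pointer ends up pointing.

```c
#include <string.h>

/* Hypothetical record holding a pointer into its own storage. */
struct record {
	int value;
	int *ptr;	/* set up to point at this record's own value field */
};

/*
 * Returns 1 if a byte-for-byte copy of the record still points at the
 * ORIGINAL's value field rather than at the copy's own field.
 */
int copy_still_points_at_original(void)
{
	struct record orig, copy;

	orig.value = 42;
	orig.ptr = &orig.value;

	/* This is all a "copy the mappings elsewhere" fork could do. */
	memcpy(&copy, &orig, sizeof(copy));

	return copy.ptr == &orig.value && copy.ptr != &copy.value;
}
```

The function returns 1: the copy's pointer still refers to the old location. With an MMU both processes can keep the same virtual addresses, so nothing needs fixing up. Without one, every such pointer would have to be found and adjusted, and nothing in C records which words in memory are pointers.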
Re: INITRAMFS_SOURCE broken by 6e19eded3684dc184181093af3bff2ff440f5b53?
On 08/08/2017 07:04 AM, Willy Tarreau wrote:
> Hi Thomas,
>
> On Tue, Aug 08, 2017 at 01:46:25PM +0200, Thomas Meyer wrote:
>> Hi,
>>
>> did the commit 6e19eded3684dc184181093af3bff2ff440f5b53 break a linux kernel
>> build with an included ramdisk?
>>
>> As far as I understand you must explicitly add rootfstype=ramfs to the kernel
>> command line to boot from the included ramfsdisk?
>>
>> bug or feature?
>
> Strange, I'm running my kernels with the modules packaged inside the initramfs
> and never met this problem even after this commit (my 4.9 kernels are still
> packaged this way and run fine). And yes, I do have TMPFS enabled. I can't
> tell whether tmpfs or ramfs was used however given that at this level I don't
> have all the tools available to report the FS type (and proc says "rootfs").
> Are you sure you're not missing anything ?

If your rootfs has a size= in /proc/mounts it's tmpfs, ala:

rootfs / rootfs rw,size=126564k,nr_inodes=31641 0 0

Rob
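The size= check suggested above could be automated along these lines. This is a minimal sketch, assuming the /proc/mounts line layout shown in the message (device, mount point, type, options, two numbers); the function name rootfs_line_is_tmpfs() is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: given one line from /proc/mounts, return 1 if it
 * describes a rootfs mounted on / whose options include "size=" (which,
 * per the message above, indicates tmpfs rather than ramfs), else 0.
 */
int rootfs_line_is_tmpfs(const char *line)
{
	char dev[256], dir[256], type[256], opts[1024];

	/* Fields: device, mount point, filesystem type, options. */
	if (sscanf(line, "%255s %255s %255s %1023s", dev, dir, type, opts) != 4)
		return 0;

	if (strcmp(dir, "/") != 0 || strcmp(type, "rootfs") != 0)
		return 0;

	/* "size=" may be the first option or follow a comma. */
	return !strncmp(opts, "size=", 5) || strstr(opts, ",size=") != NULL;
}
```

A caller would read /proc/mounts line by line with fgets() and pass each line in. The example line from the message is reported as tmpfs, while a bare "rootfs / rootfs rw 0 0" is not.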
ping/icmp sockets: define "sane".
The title is from this comment in net/ipv4:

	/*
	 * Sane defaults - nobody may create ping sockets.
	 * Boot scripts should set this to distro-specific group.
	 */

So in 2011 you added ICMP sockets, but made it so nobody could use them
without root performing a magic incantation at boot time. From the
original commit message:

> socket(2) is restricted to the group range specified in
> "/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
> that nobody (not even root) may create ping sockets.

Why? What's the point of NOT letting root use this? So ping programs
like busybox's can't use the new api as a drop-in replacement for the
old one because even if they keep the suid bit on the command, it won't
work?

I thought busybox was using it, but they ripped it back out in 2014:

https://git.busybox.net/busybox/commit/?id=f0058b1b1fe9

What is the point of creating a new api to do something root could
previously do in a safer way that doesn't require root access, and then
not even let root do it by default? What's the point?

I thought commit ba6b918ab234 removed this blockage, but instead it
moved it to a different file. Is ping flood from icmp somehow more
dangerous than UDP flood from an arbitrary user? What's the issue
requiring this elaborate infrastructure to render your new api so
useless busybox went BACK to the suid root version, and the ping in
ubuntu 14.04 is also suid root?

I ask because I'm finally getting around to implementing ping in toybox
and of course I was going to use the new API, and testing it on Ubuntu
didn't work, so I dug this mess up and boggled. Perhaps you could
explain it?

The Android guys say that yes they use this API, and make it available
to everybody, even from java:

http://lists.landley.net/pipermail/toybox-landley.net/2017-July/009101.html

The kernel patch to make it work is presumably just:

--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1712,12 +1712,8 @@ static __net_init int inet_init_net(struct net *net)
 	net->ipv4.ip_local_ports.range[1] = 60999;
 
 	seqlock_init(&net->ipv4.ping_group_range.lock);
-	/*
-	 * Sane defaults - nobody may create ping sockets.
-	 * Boot scripts should set this to distro-specific group.
-	 */
-	net->ipv4.ping_group_range.range[0] = make_kgid(&init_user_ns, 1);
-	net->ipv4.ping_group_range.range[1] = make_kgid(&init_user_ns, 0);
+	net->ipv4.ping_group_range.range[0] = make_kgid(&init_user_ns, 0);
+	net->ipv4.ping_group_range.range[1] = make_kgid(&init_user_ns, 65535);
 
 	/* Default values for sysctl-controlled parameters.
 	 * We set them here, in case sysctl is not compiled.

I'm tempted to put that diff into the toybox FAQ for people who want to
use toybox on vanilla linux. But first, I thought I'd ask for an
explanation of why it's explicitly, intentionally broken in the first
place?

You made a "safer" API to not require root access, and then made it so
even root can't use it. Why did you do that?

Rob
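To make the "1 0" default concrete, here is a small userspace sketch. It is an illustration under stated assumptions, not the kernel's code: gid_in_ping_range() is a hypothetical model that assumes the range check is a simple inclusive low <= gid <= high comparison, and open_ping_socket() is a hypothetical wrapper around the unprivileged ICMP socket call the message is talking about.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Hypothetical model of the ping_group_range check, ASSUMING the range
 * is inclusive at both ends. With low=1 and high=0 no gid can satisfy
 * both comparisons, which is why the default excludes even root (gid 0).
 */
int gid_in_ping_range(unsigned long low, unsigned long high,
		      unsigned long gid)
{
	return low <= gid && gid <= high;
}

/*
 * Hypothetical wrapper: try to open an unprivileged ICMP "ping" socket.
 * Returns a file descriptor on success, or -1 (with errno set) when the
 * caller's group is outside the configured ping_group_range.
 */
int open_ping_socket(void)
{
	return socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
}
```

Under that assumption, gid_in_ping_range(1, 0, 0) is false for root's group, while the patched range from the diff, gid_in_ping_range(0, 65535, 0), is true. Whether open_ping_socket() succeeds on a given machine depends on what its boot scripts wrote to /proc/sys/net/ipv4/ping_group_range.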
Re: [linux-next] PPC Lpar fail to boot with error hid: module verification failed: signature and/or required key missing - tainting kernel
On 05/25/2017 04:24 PM, Stephen Rothwell wrote:
> Hi Michael,
>
> On Thu, 25 May 2017 23:02:06 +1000 Michael Ellerman wrote:
>>
>> It'll be:
>>
>> ee35011fd032 ("initramfs: make initramfs honor CONFIG_DEVTMPFS_MOUNT")
>
> And Andrew has asked me to drop that patch from linux-next which will
> happen today.

What approach do the kernel developers suggest I take here?

I would have thought letting it soak in linux-next for a release so
people could fix userspace bugs would be the next step, but this sounds
like that's not an option?

Is the behavior the patch implements wrong?

Rob
Re: Patch 0727d35de ("Make initramfs honor CONFIG_DEVTMPFS_MOUNT") breaks boot
On 05/23/2017 06:08 PM, Yury Norov wrote:
>> It was 2 years ago, but AFAIR I took the Ubuntu image here:
>> http://cdimage.ubuntu.com/ubuntu-base/releases/14.04.1/release/ubuntu-base-14.04.1-core-arm64.tar.gz

Have you applied updates since then? (Maybe they fixed their init script
since 2 years ago?)

>> Kernel config is attached. I build the kernel with simple 'make'.
>>
>> Yury
>
> Sorry, config is here.

$ diff -u yury.conf /boot/config-4.4.0-78-generic | grep '^[-+]' | wc -l
10384

So that's not Ubuntu's current 14.04 kernel config.

$ diff -u yury.conf /boot/config-4.2.0-36-generic | grep '^[-+]' | wc -l
10212

And it's not the oldest Ubuntu 14.04 config I have lying around (from a
year ago).

$ cd linux && make defconfig
$ diff -u ~/yury.conf .config | grep '^[-+]' | wc -l
4369

It's much closer to the current defconfig, but still significantly
different.

So you're using a custom config, and can't switch off a symbol.

Rob
Re: Patch 0727d35de ("Make initramfs honor CONFIG_DEVTMPFS_MOUNT") breaks boot
On 05/23/2017 03:01 AM, Yury Norov wrote:
> On Mon, May 22, 2017 at 09:07:54PM -0500, Rob Landley wrote:
>> Your userspace mounted a tmpfs over /dev when it couldn't mount a second
>> identical instance of devtmpfs over itself. If you had a static /dev in
>> initramfs but didn't configure _in_ devtmpfs to your kernel, your broken
>> error path would have taken that out too with a pointless tmpfs mount.
>
> CONFIG_DEVTMPFS_MOUNT is enabled on my machine, so I think your
> suggestion is correct. But I didn't do that specifically - I run
> almost default kernel based on Ubuntu 14.04 config and environment.

I.E. ubuntu has a bug: they enabled CONFIG_DEVTMPFS_MOUNT and then
launched an initramfs instead (which didn't do the automount they
requested so why request it), but if CONFIG_DEVTMPFS_MOUNT actually
starts working in initramfs they have an insane error path that breaks
the system, and does nothing _except_ break the system.

> Grepping the kernel code shows that arc, arm, arm64, m86k, metag,
> mips, nios2, openrisc, parisc, powerpc, sh, tile, um, x86 and xetensa
> enable it by default.

Most of which Ubuntu doesn't support, so none of them could trigger the
broken error path in ubuntu's init script.

Wait, are you saying you're doing a "make defconfig" on x86-64 and
booting ubuntu from the result? (Or is this arm?) Is _that_ the config
you still haven't specified in this conversation? I thought you were
using the /boot/config-4.4.0-78-generic and friends ubuntu installs.
(Which yes, also switch this symbol on.)

I can add a "default n" line to drivers/base/Kconfig if "make defconfig"
is what you're building from. (This symbol never specified a default in
the first place, so I dunno which way it falls, but it's repeated in a
gazillion defconfig files and not present in others... meaning I still
dunno which way the default goes. When I do a "make defconfig" it uses
arch/x86/configs/x86_64_defconfig because the kernel has multiple
codepaths to accomplish the same thing. I'm not sure the built-in
"default y" lines are used at all anymore? What a mess...)

But again, I'm just guessing what config you're using because you still
haven't _said_. I'm still trying to guess what you're doing when you hit
Ubuntu's bug.

> So it means for me that (at least) users why run
> Ubuntu 14.04 will have bricked system one day after updating the
> kernel.

Unless when they build their new kernel they open up menuconfig and
switch this symbol off. Which can't be done because...?

Or you could add the devtmpfs.mount=0 argument to your kernel command
line, as documented in the CONFIG_DEVTMPFS_MOUNT menuconfig help text.

The kernel already provides multiple workarounds for Ubuntu's bug, and
the issue only hits people who are manually building a new kernel from
source. If ubuntu provides a new kernel, I assume they'll tweak their
config _and_ fix their initramfs error path (which is just plain wrong).

> If you say that currently CONFIG_DEVTMPFS_MOUNT is ignored by kernel,

It's not _me_ saying it, it's the kernel doing it. The patch is
conceptually a straightforward fix on the kernel side to make it _not_
ignore that symbol in that context.

> I think you cannot relay on it anymore because people may have it
> enabled or disabled randomly.

I expected configs would have it randomly set, but the bug here is a
broken error path that does something actively harmful rather than going
"oh, we got a static /dev from somewhere, let's just leave it alone".

This error path goes out of its way to blank the contents of /dev by
mounting an empty tmpfs over it and leaving it empty, and then
complaining that /dev is blank _because_it_blanked_it_. If Ubuntu meant
to intentionally halt the proceedings the script could have done that
explicitly. What did the author of that error path think would happen,
exactly?

> So the proper way is to remove broken
> config option and introduce new one. BTW, I see it is used once in
> drivers/base/devtmpfs.c.

How does removing the broken config option (or renaming it to
CONFIG_DEFTMPFS_UBUNTU_IS_BROKEN) _not_ impact systems that were
previously happily using it in the contexts where it already worked?

If it's too much to ask people to switch it off when it was previously
on (but shouldn't have been), how is asking them to manually switch it
back on when it was previously on and needs to stay on better?

(And if you arrange it so "make oldconfig" migrates the old symbol to
the new one automatically, how would that work around the broken error
path in ubuntu's initramfs script? The rename becomes a NOP.)

If you're saying it should default to "n" I can send a patch. If you
want me to tweak every arch/*/configs file that redundantly includes the
same darn symbol, I can do that too. (Makes the patch big but it's just
a sed invocation to do it.)

>>
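The two workarounds mentioned above (switching the symbol off at build time, or passing devtmpfs.mount=0 at boot) can be checked for with a small script. This is only a sketch: devtmpfs_automount is a hypothetical helper name, it assumes a .config-style file is available, and it assumes that when the command-line parameter appears more than once the last occurrence wins.

```shell
#!/bin/sh
# Sketch only: devtmpfs_automount is a hypothetical helper, not a kernel tool.
#   $1 = path to a kernel .config style file
#   $2 = kernel command line string (e.g. the contents of /proc/cmdline)
# Prints "on" if the kernel would try to automount devtmpfs, else "off".
devtmpfs_automount() {
    config=$1
    cmdline=$2

    # Build-time setting: on only if the symbol is set to y.
    if grep -q '^CONFIG_DEVTMPFS_MOUNT=y$' "$config"; then
        state=on
    else
        state=off
    fi

    # devtmpfs.mount=0|1 on the command line overrides the build-time
    # setting. Assumption: if it is given several times, the last one wins.
    for word in $cmdline; do
        case $word in
            devtmpfs.mount=0) state=off ;;
            devtmpfs.mount=1) state=on ;;
        esac
    done

    echo "$state"
}
```

On a running system it might be called as devtmpfs_automount "/boot/config-$(uname -r)" "$(cat /proc/cmdline)", assuming the distro installs its config under /boot.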
Re: Patch 0727d35de ("Make initramfs honor CONFIG_DEVTMPFS_MOUNT") breaks boot
On 05/22/2017 07:05 AM, Yury Norov wrote:
> Hi Rob,
>
> I found that next-20170522 fails to boot on arm64 machine with the
> following log:

I don't know anything about your kernel config (is CONFIG_DEVTMPFS_MOUNT
enabled or disabled?) or what userspace you're booting with, but it
seems I can guess:

> [...]
> [4.179509] Freeing unused kernel memory: 1088K
> Loading, please wait...

At this point, the kernel has launched init and your userspace is
running. During that boot, the kernel mounted devtmpfs on /dev (you
edited the part where it did that out of your boot log), but the next
line:

> mount: mounting udev on /dev failed: Device or resource busy

has an error that says you already have devtmpfs mounted on /dev, and
your userspace tries to mount devtmpfs on it _again_ and it fails
because you can't mount the exact same filesystem over itself due to a
sanity check in the kernel in fs/namespace.c line 2475 or so:

	/* Refuse the same filesystem on the same mount point */
	err = -EBUSY;
	if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
	    path->mnt->mnt_root == path->dentry)
		goto unlock;

> W: devtmpfs not available, falling back to tmpfs for /dev
> Couldn't get a file descriptor referring to the console

At which point your userspace does a "fixup" mounting something else
over the previously working devtmpfs, which succeeds (because you're
mounting a _different_ filesystem and not hitting the above sanity
test), thus breaking your userspace.

> Begin: Loading essential drivers ... done.
> Begin: Running /scripts/init-premount ... done.
> Begin: Mounting root file system ... Begin: Running
> /scripts/local-top ... done.
> chvt: can't open console

And then your userspace didn't notice for a while.

> Gave up waiting for root device. Common problems:
> - Boot args (cat /proc/cmdline)
>   - Check rootdelay= (did the system wait long enough?)
>   - Check root= (did the system wait for the right device?)
> - Missing modules (cat /proc/modules; ls /dev)
> chvt: can't open console
> ALERT! /dev/sda does not exist. Dropping to a shell!
> Couldn't get a file descriptor referring to the console

And then it died.

> BusyBox v1.21.1 (Ubuntu 1:1.21.0-1ubuntu1) built-in shell (ash)
> Enter 'help' for a list of built-in commands.
>
> (initramfs)
>
> Bisect points to your patch (attached below). If I revert it, everything
> becomes fine. If you need to know something more about my environment,
> feel free to ask me.

You were inappropriately specifying CONFIG_DEVTMPFS_MOUNT in your
config; now that it's no longer being ignored, your init script is
having an allergic reaction to it. Either yank it from your config or
fix your userspace.

It looks to me like my patch triggered a bug in your setup. Your
userspace mounted a tmpfs over /dev when it couldn't mount a second
identical instance of devtmpfs over itself. If you had a static /dev in
initramfs but didn't configure _in_ devtmpfs to your kernel, your broken
error path would have taken that out too with a pointless tmpfs mount.

By the way, _why_ are you mounting a tmpfs over /dev on _initramfs_?
That can already be tmpfs. (Commits 137fdcc18a59 through 6e19eded3684.)

Feel free to send more context if you think I'm wrong about this.

> Yury

Rob
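For illustration, an init script could look at what is already mounted on /dev before deciding to mount anything, instead of falling back to an empty tmpfs when the devtmpfs mount returns EBUSY. The following is a minimal sketch of that decision, not Ubuntu's actual script; mounted_fstype and dev_action are hypothetical names, and the mounts file is passed as a parameter so the logic can be exercised without root.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical, not taken from any distro.

# Print the filesystem type mounted on a mount point, according to a
# /proc/mounts style file. If several entries match, the last (topmost)
# one is printed. Prints nothing if the path is not a mount point.
#   $1 = mount point, $2 = mounts file (defaults to /proc/mounts)
mounted_fstype() {
    awk -v mp="$1" '$2 == mp { t = $3 } END { if (t != "") print t }' \
        "${2:-/proc/mounts}"
}

# Decide what to do about /dev:
#   keep          something (e.g. a kernel-automounted devtmpfs) is
#                 already mounted there, so leave it alone
#   try-devtmpfs  nothing is mounted on /dev yet, so a devtmpfs mount
#                 can be attempted; if that fails, a static /dev should
#                 still be left alone rather than covered by a tmpfs
#   $1 = mounts file (defaults to /proc/mounts)
dev_action() {
    fstype=$(mounted_fstype /dev "${1:-/proc/mounts}")
    if [ -n "$fstype" ]; then
        echo "keep"
    else
        echo "try-devtmpfs"
    fi
}
```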
[tip:x86/urgent] x86/boot: Use CROSS_COMPILE prefix for readelf
Commit-ID:  3780578761921f094179c6289072a74b2228c602
Gitweb:     http://git.kernel.org/tip/3780578761921f094179c6289072a74b2228c602
Author:     Rob Landley <r...@landley.net>
AuthorDate: Sat, 20 May 2017 15:03:29 -0500
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Sun, 21 May 2017 13:04:27 +0200

x86/boot: Use CROSS_COMPILE prefix for readelf

The boot code Makefile contains a straight 'readelf' invocation. This
causes build warnings in cross compile environments, when there is no
unprefixed readelf accessible via $PATH.

Add the missing $(CROSS_COMPILE) prefix.

[ tglx: Rewrote changelog ]

Fixes: 98f78525371b ("x86/boot: Refuse to build with data relocations")
Signed-off-by: Rob Landley <r...@landley.net>
Acked-by: Kees Cook <keesc...@chromium.org>
Cc: Jiri Kosina <jkos...@suse.cz>
Cc: Paul Bolle <pebo...@tiscali.nl>
Cc: "H.J. Lu" <hjl.to...@gmail.com>
Cc: sta...@vger.kernel.org
Link: http://lkml.kernel.org/r/ced18878-693a-9576-a024-113ef39a2...@landley.net
Signed-off-by: Thomas Gleixner <t...@linutronix.de>
---
 arch/x86/boot/compressed/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 44163e8..2c860ad 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -94,7 +94,7 @@ vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 quiet_cmd_check_data_rel = DATAREL $@
 define cmd_check_data_rel
 	for obj in $(filter %.o,$^); do \
-		readelf -S $$obj | grep -qF .rel.local && { \
+		${CROSS_COMPILE}readelf -S $$obj | grep -qF .rel.local && { \
 			echo "error: $$obj has data relocations!" >&2; \
 			exit 1; \
 		} || true; \
[PATCH] Make x86 use $TARGET-readelf like all the other arches.
From: Rob Landley <r...@landley.net>

My cross-compile environment doesn't provide an unprefixed readelf in
the $PATH, which works fine on every target but x86, where you get a
bunch of "/bin/sh: 1: readelf: not found" messages (but the result still
works anyway).

Signed-off-by: Rob Landley <r...@landley.net>
---
 arch/x86/boot/compressed/Makefile |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 44163e8..2c860ad 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -94,7 +94,7 @@ vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 quiet_cmd_check_data_rel = DATAREL $@
 define cmd_check_data_rel
 	for obj in $(filter %.o,$^); do \
-		readelf -S $$obj | grep -qF .rel.local && { \
+		${CROSS_COMPILE}readelf -S $$obj | grep -qF .rel.local && { \
 			echo "error: $$obj has data relocations!" >&2; \
 			exit 1; \
 		} || true; \
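Outside the Makefile, the same idea (derive the tool name from the CROSS_COMPILE prefix rather than assuming a bare readelf in $PATH) might look like the sketch below. cross_tool and has_data_relocs are hypothetical helper names; the readelf -S | grep -qF .rel.local pipeline is the one from the patch.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Print the name of a binutils tool with the cross-compile prefix applied.
# With CROSS_COMPILE unset or empty this is just the bare tool name.
#   $1 = tool name, e.g. readelf
cross_tool() {
    printf '%s%s\n' "${CROSS_COMPILE:-}" "$1"
}

# Succeed if the object file has a section whose name contains .rel.local,
# using the prefixed readelf (same check as the Makefile rule).
#   $1 = object file
has_data_relocs() {
    "$(cross_tool readelf)" -S "$1" | grep -qF .rel.local
}
```

For example, with CROSS_COMPILE=arm-linux-gnueabi- (an arbitrary example prefix) the check would run arm-linux-gnueabi-readelf, which is the tool a cross toolchain actually installs.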
Re: [PATCHv2] Make initramfs honor CONFIG_DEVTMPFS_MOUNT
On 05/16/2017 10:58 PM, Michael Ellerman wrote:
> Rob Landley <r...@landley.net> writes:
>
>> diff --git a/init/main.c b/init/main.c
>> index f866510..9ec09ff 100644
>> --- a/init/main.c
>> +++ b/init/main.c
>> @@ -1055,8 +1049,17 @@ static noinline void __init kernel_init_freeable(void)
>>  	if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
>>  		ramdisk_execute_command = NULL;
>>  		prepare_namespace();
>> +	} else if (IS_ENABLED(CONFIG_DEVTMPFS_MOUNT)) {
>> +		sys_mkdir("/dev", 0755);
>> +		devtmpfs_mount("/dev");
>>  	}
>>
>> +	/* Open the /dev/console on the rootfs, this should never fail */
>> +	if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
>
> Sorry to pile on,

The correct phrase is "bikeshed". (It's a verb now.)

> but while you're moving it do you want

_I_ don't, no. I intentionally moved it unmodified. If you want to
submit a patch on top of mine, be my guest.

> to update this fairly misleading comment.

Define "should". (I'll get the popcorn.)

> It definitely can fail, eg. if /dev/console doesn't exist, or if no
> console driver is registered.

/dev/console not existing in an initramfs created by pointing
CONFIG_INITRAMFS_SOURCE at a directory created by a normal user was
pretty much my initial motivation for poking at this area, yes.

That said, /dev/console should always exist. My patch was just finding a
different way for it to exist so the condition was satisfied. Meaning
the comment isn't exactly _wrong_, just really terse.

Feel free to submit a patch rephrasing it.

Rob
Re: [PATCHv2] Make initramfs honor CONFIG_DEVTMPFS_MOUNT
Andrew asked for "a more complete changelog" and I've had a reply window
open for _days_ trying to figure out what he wants. Maybe it's in the
following somewhere...

Otherwise the same v2 patch.

From: Rob Landley <r...@landley.net>

Make initramfs honor CONFIG_DEVTMPFS_MOUNT (fixing commit 2b2af54a5bb6
which didn't bother), move /dev/console open after devtmpfs mount, and
update help text.

Commit 456eeabab849 in 2005 made gen_initramfs_list (when run with no
arguments) spit out an 'example' config creating /dev and /dev/console.
The kernel accidentally(?) included this for many years when you didn't
specify initramfs contents, and of course grew dependencies on this
/dev/console node in the (often hidden) initramfs.

Commit c33df4eaaf41 in 2007 explicitly preserved this dependency.

Commit 2bd3a997befc in 2010 claimed it "removes the occasionally
problematic assumption that /dev/console exists from the boot code" but
actually just moved it later.

But nobody ever tested statically linking an initramfs. If you point
CONFIG_INITRAMFS_SOURCE at a directory running the build as a normal
user you _don't_ get a /dev/console (because you can't create it without
being root, and can't use the existing one out of /dev unless you create
your own initramfs list file), in which case init runs with
stdin/stdout/stderr closed and you get no output.

Eric's test case for his 2010 commit referenced above was:

    With this patch I was able to throw busybox on my /boot partition
    (which has no /dev directory) and boot into userspace without
    problems.

But it didn't work pointing CONFIG_INITRAMFS_SOURCE at a directory of
the same files. This provides the "automatically mounting devtmpfs on
/dev" workaround the earlier commit was trying to avoid.

Signed-off-by: Rob Landley <r...@landley.net>
---
 drivers/base/Kconfig | 14 --
 init/main.c | 15 +--
 2 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index d718ae4..74779ee 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -48,16 +48,10 @@ config DEVTMPFS_MOUNT
 	bool "Automount devtmpfs at /dev, after the kernel mounted the rootfs"
 	depends on DEVTMPFS
 	help
-	  This will instruct the kernel to automatically mount the
-	  devtmpfs filesystem at /dev, directly after the kernel has
-	  mounted the root filesystem. The behavior can be overridden
-	  with the commandline parameter: devtmpfs.mount=0|1.
-	  This option does not affect initramfs based booting, here
-	  the devtmpfs filesystem always needs to be mounted manually
-	  after the rootfs is mounted.
-	  With this option enabled, it allows to bring up a system in
-	  rescue mode with init=/bin/sh, even when the /dev directory
-	  on the rootfs is completely empty.
+	  Automatically mount devtmpfs at /dev on the root filesystem, which
+	  lets the system to come up in rescue mode with [rd]init=/bin/sh.
+	  Override with devtmpfs.mount=0 on the commandline. Initramfs can
+	  create a /dev dir as needed, other rootfs needs the mount point.
 
 config STANDALONE
 	bool "Select only drivers that don't need compile-time external firmware"
diff --git a/init/main.c b/init/main.c
index f866510..9ec09ff 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1038,12 +1038,6 @@ static noinline void __init kernel_init_freeable(void)
 
 	do_basic_setup();
 
-	/* Open the /dev/console on the rootfs, this should never fail */
-	if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
-		pr_err("Warning: unable to open an initial console.\n");
-
-	(void) sys_dup(0);
-	(void) sys_dup(0);
 	/*
 	 * check if there is an early userspace init. If yes, let it do all
 	 * the work
@@ -1055,8 +1049,17 @@ static noinline void __init kernel_init_freeable(void)
 	if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
 		ramdisk_execute_command = NULL;
 		prepare_namespace();
+	} else if (IS_ENABLED(CONFIG_DEVTMPFS_MOUNT)) {
+		sys_mkdir("/dev", 0755);
+		devtmpfs_mount("/dev");
 	}
 
+	/* Open the /dev/console on the rootfs, this should never fail */
+	if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
+		pr_err("Warning: unable to open an initial console.\n");
+	(void) sys_dup(0);
+	(void) sys_dup(0);
+
 	/*
 	 * Ok, we have completed the initial bootup, and
 	 * we're essentially up and running. Get rid of the
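As a point of comparison, the "create your own initramfs list file" route mentioned in the changelog can be sketched as follows. The dir and nod lines use the gen_init_cpio list format (character device 5,1 is /dev/console), which lets a non-root build describe the node without creating it. write_console_list is a hypothetical helper, and whether the resulting file is combined with a directory in CONFIG_INITRAMFS_SOURCE is left to the build setup.

```shell
#!/bin/sh
# Sketch only: write_console_list is a hypothetical helper.
# Writes a gen_init_cpio style list describing /dev and /dev/console, so
# the device node ends up in the initramfs even when the build runs as a
# normal user who cannot mknod.
#   $1 = output list file
write_console_list() {
    {
        # dir <name> <mode> <uid> <gid>
        echo "dir /dev 0755 0 0"
        # nod <name> <mode> <uid> <gid> <type> <major> <minor>
        echo "nod /dev/console 0600 0 0 c 5 1"
    } > "$1"
}
```

This only addresses the missing node at build time; the patch above takes the other route and has the kernel mount devtmpfs so the node exists at boot.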
Re: Is there an recommended way to refer to bitkeepr commits?
On 05/12/2017 12:49 PM, Andreas Schwab wrote:
> On Mai 12 2017, Rob Landley <r...@landley.net> wrote:
>
>> Last I checked I couldn't just "git push" the fullhist tree to
>> git.kernel.org because git graft didn't propagate right.
>
> Perhaps you could recreate them with git replace --graft. That creates
> replace objects that can be pushed and fetched. (They are stored in
> refs/replace, and must be pushed/fetched explicitly.)

It's the "must be pushed/fetched explicitly" part that I couldn't figure
out back when I tried it.

I inherited this tree from somebody who made it. I noticed its existence
because lwn.net covered it, and then 6 months later it had vanished
without trace (as so many things do). I reproduced it from the build
script (if you can't reproduce the experiment from initial starting
conditions, it's not science), went "look, cool thing", and hosted a
copy with an occasional repaint.

I would be _thrilled_ to hand it off to somebody who knows what they're
doing with git. I'm just unusually interested in computer history and
the preservation thereof. (https://landley.net/history/mirror).

Rob
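For what it's worth, the "pushed/fetched explicitly" step Andreas describes comes down to naming the refs/replace namespace in a refspec, since ordinary push and fetch only cover branches and tags. A minimal sketch, assuming the grafts were recreated with git replace --graft; the two wrapper names are hypothetical, the refspec itself is standard git syntax.

```shell
#!/bin/sh
# Sketch only: wrapper names are hypothetical. Run inside a git repository.

# Publish local replace refs (e.g. created by git replace --graft) to a
# remote.
#   $1 = remote name or URL
push_replace_refs() {
    git push "$1" 'refs/replace/*:refs/replace/*'
}

# Pull a remote's replace refs into the local repository so that
# git log and friends see the grafted history.
#   $1 = remote name or URL
fetch_replace_refs() {
    git fetch "$1" 'refs/replace/*:refs/replace/*'
}
```

Anyone cloning the tree would still have to run the fetch step (or add the refspec to their remote configuration) before the joined history shows up.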
Re: Is there an recommended way to refer to bitkeepr commits?
On 05/13/2017 04:35 AM, Thomas Gleixner wrote:
> On Fri, 12 May 2017, Eric W. Biederman wrote:
>> Which leaves me perplexed. The hashes from tglx's current tree:
>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
>> on kernel.org and the hashes in your full history tree differ.
>> Given that they are in theory the same tree this disturbs me.

The original build script used to make fullhist is at:

http://landley.net/kdocs/fullhist/make-full-linux-history.tgz

And his original description of what he did and why is at:

https://lwn.net/Articles/285366/

He mentioned something about rewriting dates?

I used the "graft" feature of git (thanks to Junio and people on #git
for the tip) to link them together. I also modified (via a
git-filter-branch) the dates of some commits as for instance all
commits from the Dave Jones's repo had the same date (23 Nov 2007).
For this I mainly used the timestamp info of files on kernel.org. The
script and info I used are also available on my website[2].

(I tried to read his conversion plumbing but it's in ocaml.)

Apparently he only considered the git commits in Linus's tree to be
worth preserving. I'd forgotten that part. (It was 9 years ago. I
remembered the pre-bitkeeper tree got edited but I forgot the other one
did too.)

>> Case in point in the commit connected to:
>> "[PATCH] linux-2.5.66-signal-cleanup.patch"
>> in tglx's tree is: da334d91ff7001d234863fc7692de1ff90bed57a
>
> That's the proper sha1 for my tree. I just verified it against the original
> tree which I still have in my archive.
>
>> *scratches my head*
>>
>> Something appears to have changed somewhere.
>
> Correct. That full history git rewrote the commits in my bitkeeper import.

I only checked that the current ones in Linus's tree were the same.
Nobody'd ever pointed me at a file hash in your conversion of bitkeeper
to git, so over the years I forgot that the date editing extended into
bitkeeper for some reason.

> history.git:
>
> commit 7a2deb32924142696b8174cdf9b38cd72a11fc96
> Author: Linus Torvalds
> Date: Mon Feb 4 17:40:40 2002 -0800
>
> Import changeset

February 4, 2002.

> full-history:
>
> commit 26245c315da55330cb25dbfdd80be62db41dedb2
> Author: linus1
> Date: Thu Jan 4 12:00:00 2001 -0600
>
> Import changeset

January 4, 2001.

According to https://www.kernel.org/pub/linux/kernel/v2.4/ January 4 2001
is when 2.4.0 was released. So yes, it looks like he rewrote these dates
to be correct.

I see what he did. Linus started his bitkeeper tree by importing 2.4.0
and then applying a year's worth of release diffs from 2.4.0 as
individual commits. That year+ worth of work was all dated February 4,
2002 in the repo, so the fullhist script went through and changed the
dates on those commits to match the release tarballs for those kernel
versions, and that changed the hashes in the rest of the history tree.

Upside: there's no longer a year+ hole in the commit dates (which makes
looking up associated mailing list posts a lot easier). Downside: this
changed the history.git commit hashes for the rest of that era. (I'd
missed that.)

> and as a consequence all other commits have different shas as well.

The most embarrassing part is that the ocaml plumbing appears to
occasionally leak host context when doing the conversion, specifically
from "git log 26245c315da5" (checking to make sure the fullhist tree's
dates make sense in context) I get:

commit 26245c315da55330cb25dbfdd80be62db41dedb2
Author: linus1
Date: Thu Jan 4 12:00:00 2001 -0600

    Import changeset

commit 13a80dffb74939e292b6e90e5d79dd26d577489f
Author: linus1
Date: Thu Jan 4 12:00:00 2001 -0600

    add prerelease patch to get a 2.4.0

commit 4c5b4d50bb08753433f5962bd926198fe2b7105d
Author: linus1
Date: Sun Dec 31 12:00:00 2000 -0600

That landley@driftood should not be there. Sigh.

I guess the question is which is more broken? I linked the build scripts
above if somebody else wants to modify or rerun them, but... lithp.

Do you prefer a year gap in the archive dates, or do you prefer to call
the history.git hashes canonical?

Rob
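For readers wondering why editing a date on one early commit changes every later hash: a commit's name is a SHA-1 over its own text (dates included) plus the names of its parents. The sketch below imitates that with sha1sum instead of git itself. The helper names, author identity, timestamps, and the "next change" message are made up for illustration, and it assumes sha1sum from coreutils is installed.

```sh
#!/bin/sh
# commit_id BODY: name a commit body the way git names commit objects,
# i.e. SHA-1 over the header "commit <byte count>\0" followed by the body.
# BODY is passed without its final newline; it is added back here.
commit_id() {
    len=$(printf '%s\n' "$1" | wc -c | tr -d ' ')
    { printf 'commit %s\0' "$len"; printf '%s\n' "$1"; } | sha1sum | cut -d' ' -f1
}

# The well-known empty tree, so no file contents are involved.
tree=4b825dc642cb6eb9a060e54bf8d69288fbee4904

# root_body DATE: a parentless commit whose only variable part is its date.
root_body() {
    printf 'tree %s\nauthor A <a@example.com> %s\ncommitter A <a@example.com> %s\n\nImport changeset' \
        "$tree" "$1" "$1"
}

# child_body PARENT DATE: a commit that records PARENT's hash in its text.
child_body() {
    printf 'tree %s\nparent %s\nauthor A <a@example.com> %s\ncommitter A <a@example.com> %s\n\nnext change' \
        "$tree" "$1" "$2" "$2"
}

# Same root commit under two illustrative timestamps.
old_root=$(commit_id "$(root_body '1012873240 -0800')")
new_root=$(commit_id "$(root_body '978631200 -0600')")

# The child's own date is identical in both histories; only its parent differs.
child_date='1012880000 -0800'
old_child=$(commit_id "$(child_body "$old_root" "$child_date")")
new_child=$(commit_id "$(child_body "$new_root" "$child_date")")

echo "root:  $old_root -> $new_root"
echo "child: $old_child -> $new_child"
```

Both lines print two different hashes: the root changes because its date changed, and the child changes only because the parent line inside it changed.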
Re: Is there an recommended way to refer to bitkeepr commits?
On 05/12/2017 09:45 AM, Eric W. Biederman wrote:
> Thomas Gleixner writes:
>
>> On Fri, 12 May 2017, Michael Ellerman wrote:
>>> Fixes: BKrev: 3e8e57a1JvR25MkFRNzoz85l2Gzccg ("[PATCH]
>>> linux-2.5.66-signal-cleanup.patch")
>>>
>>> In your tree that is c3c107051660 ("[PATCH]
>>> linux-2.5.66-signal-cleanup.patch"),
>>> but you don't have the 3e8e57a1JvR25MkFRNzoz85l2Gzccg revision recorded
>>> anywhere that I can see.
>>
>> That's correct. I did not include the BK revisions when I imported the
>> commits into the history git. I did not see any reason to do so. I still
>> have no idea what the value would have been or why anyone wants to
>> reference them at all.
>
> Thomas your import seems to be significantly better than the one I got
> my hands on years ago.
>
> I just know that if we were to do something similar today we would really
> want to preserve the existing git sha1 hashes somewhere because we
> refer to commits everywhere in the code.

Which is why the https://landley.net/kdocs/fullhist tree uses "git
graft", so the git commit numbers are the same. As Yoann Padioleau said:

> It's built from 3 other git repositories:
> - the one from Dave Jones from 0.01 to 2.4.0,
> - the one from tglx from 2.4.0 to 2.6.12,
> - the one from Linus Torvalds from 2.6.12 to now.

And the hashes in his tree were the same as in each of those trees, all
three of which are on git.kernel.org. If you "git pull" the fullhist
tree to current, it still uses the same hashes today.

(I think you can still reproduce it locally using his scripts, which I
mirrored. You'll have to manually re-tag those old commits from last
message, and reset the "upstream" to pull current from.)

Last I checked I couldn't just "git push" the fullhist tree to
git.kernel.org because git graft didn't propagate right. I had to start
people from a tarball. Local clones doing the hardlink thing worked fine
though. (Maybe that's changed?)

> So I was imagining that bitkeeper would be similar.

When Larry flounced and people lost access to the bitkeeper tool they
lost access to read the old data, so what the bitkeeper numbers were
became irrelevant. That's why nobody's cared before now.

You're looking for a consistent way to refer to old commits, but even
using bitkeeper numbers wouldn't fully solve that problem (it only goes
back to 2.4). Between the Dave Jones and tglx trees, there's complete
coverage back to 0.0.1. Yoann stitched them together, and I've kept a
current version.

I used to host it on kernel.org/doc until
http://lkml.iu.edu/hypermail/linux/kernel/1411.3/04693.html happened.
They've since deleted it, but it's GPL so anybody who wants to host a
mirror... :)

I'm traveling and not downloading a gigabyte through my phone tether
(darn tmobile 4 gig monthly tethering limit) but the date on the
https://landley.net/kdocs/local/linux-fullhist.tar.bz2 tarball is
February 2016 so I'm pretty sure that's 4.0 with the old major releases
tagged (ala last email).

Anybody who wants to mirror it somewhere more official (and presumably
.xz instead of .bz2) is welcome to. (I would if I still had rsync access
to kernel.org/doc, but alas I can't even get them to link
kernel.org/doc/Documentation from the page above it. It used to, they
accidentally deleted it, and nobody maintains the page anymore...)

> Especially since the copy of the bitkeeper
> import into git had appended to each commit a BKrev which I presume
> tacked back to the original source.
>
> If everyone who had imported the bitkeeper tree had done that it would
> not have mattered which bitkeeper import you were using they would all
> share a common identifier for commits. With that absent the robustness
> we have to allow looking things up in an alternate tree lies solely
> with the one line patch description.
>
> Compare the quoted lines above with what I have below. Every tree
> appears to have a different identifier.

The commits in the fullhist tree have been stable since at least
https://lwn.net/Articles/285366/ which was June 6, 2008. It's derived
from earlier trees with the same commits, and kept those commit hashes.

> Below is what I wound up doing, and have queued for the next merge
> window. Comments?

I've bisected or used the "git annotate... git annotate HASH^1... git
annotate NEXTHASH^1..." peeling trick back to some really old commits
over the years, which I've then referred to by submitter and date if it
wasn't in the current git tree. I'll happily give people hashes out of
the fullhist tree if they ask, but haven't assumed they're using it.

But if you're looking for an existing standard, this exists and predates
my use of it.

> Eric

Rob
[PATCHv2] Make initramfs honor CONFIG_DEVTMPFS_MOUNT
From: Rob Landley <r...@landley.net>

Make initramfs honor CONFIG_DEVTMPFS_MOUNT, move /dev/console open after
devtmpfs mount, and update help text.

Signed-off-by: Rob Landley <r...@landley.net>
---

 drivers/base/Kconfig | 14 --
 init/main.c | 15 +--
 2 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index d718ae4..74779ee 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -48,16 +48,10 @@ config DEVTMPFS_MOUNT
 	bool "Automount devtmpfs at /dev, after the kernel mounted the rootfs"
 	depends on DEVTMPFS
 	help
-	  This will instruct the kernel to automatically mount the
-	  devtmpfs filesystem at /dev, directly after the kernel has
-	  mounted the root filesystem. The behavior can be overridden
-	  with the commandline parameter: devtmpfs.mount=0|1.
-	  This option does not affect initramfs based booting, here
-	  the devtmpfs filesystem always needs to be mounted manually
-	  after the rootfs is mounted.
-	  With this option enabled, it allows to bring up a system in
-	  rescue mode with init=/bin/sh, even when the /dev directory
-	  on the rootfs is completely empty.
+	  Automatically mount devtmpfs at /dev on the root filesystem, which
+	  lets the system come up in rescue mode with [rd]init=/bin/sh.
+	  Override with devtmpfs.mount=0 on the commandline. Initramfs can
+	  create a /dev dir as needed, other rootfs needs the mount point.

 config STANDALONE
 	bool "Select only drivers that don't need compile-time external firmware"
diff --git a/init/main.c b/init/main.c
index f866510..9ec09ff 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1038,12 +1038,6 @@ static noinline void __init kernel_init_freeable(void)

 	do_basic_setup();

-	/* Open the /dev/console on the rootfs, this should never fail */
-	if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
-		pr_err("Warning: unable to open an initial console.\n");
-
-	(void) sys_dup(0);
-	(void) sys_dup(0);
 	/*
 	 * check if there is an early userspace init. If yes, let it do all
 	 * the work
@@ -1055,8 +1049,17 @@ static noinline void __init kernel_init_freeable(void)
 	if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
 		ramdisk_execute_command = NULL;
 		prepare_namespace();
+	} else if (IS_ENABLED(CONFIG_DEVTMPFS_MOUNT)) {
+		sys_mkdir("/dev", 0755);
+		devtmpfs_mount("/dev");
 	}

+	/* Open the /dev/console on the rootfs, this should never fail */
+	if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
+		pr_err("Warning: unable to open an initial console.\n");
+	(void) sys_dup(0);
+	(void) sys_dup(0);
+
 	/*
 	 * Ok, we have completed the initial bootup, and
 	 * we're essentially up and running. Get rid of the
Re: [PATCH] Make initramfs honor CONFIG_DEVTMPFS_MOUNT
On 05/09/2017 04:31 PM, Andrew Morton wrote:
> On Thu, 4 May 2017 16:09:06 -0500 Rob Landley <r...@landley.net> wrote:
>
>> From: Rob Landley <r...@landley.net>
>>
>> Make initramfs honor CONFIG_DEVTMPFS_MOUNT, and move
>> /dev/console open after devtmpfs mount.
>
>
> Could we please see complete description of the runtime effects of this
> change? How does it affect users? How does it benefit users?

It makes the behavior consistent. If you're going to have the config
symbol anyway, why is initramfs a second class citizen?

That said, I was fixing a specific bug when I started the patch: when
you statically link in an initramfs by pointing the kernel build at a
directory (so it makes its own cpio archive from that), if you're not
running the build as root you can't create dev/console in there and
there's no obvious way to add nodes (like you can by editing the
gen_initramfs_list output).

This means there's no /dev/console when init gets launched, so PID 1's
stdin/stdout/stderr go nowhere. Until your init script can open its own
console and redirect to it, you get no output if something goes wrong,
so debugging is fiddly and there's a hole where output gets lost.
Userspace can't close that hole.

When making the patch I did a version that mounted /proc /sys and
/dev/pts too, so rdinit=/bin/sh had pretty much its full environment
without an init script just like the DEVTMPFS_MOUNT option's help text
implied... but that seemed unlikely to be accepted. The console gap is a
problem userspace can't fix, the rest userspace can, so I did the
minimal thing.

> The DEVTMPFS_MOUNT Kconfig help (drivers/base/Kconfig) says:
>
> This option does not affect initramfs based booting, here
> the devtmpfs filesystem always needs to be mounted manually
> after the rootfs is mounted.
>
> which seems to no longer be correct?

Ah, sorry. I rewrote the help text and didn't include that file in the
diff. And rechecking I see the override part wasn't implemented by my
patch, I'll send a new one.

Rob
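For context, the userspace half of the workaround described above (an init script opening its own console and redirecting to it) might look roughly like the sketch below. The helper name attach_console is hypothetical and not part of the patch, and the mount command in the comment is an assumption about how a typical initramfs /init populates /dev.

```sh
#!/bin/sh
# attach_console [DEVICE]: point stdin/stdout/stderr of the current shell
# at DEVICE (default /dev/console). Returns 1 and changes nothing if the
# device node does not exist. Hypothetical helper for illustration.
attach_console() {
    dev=${1:-/dev/console}
    [ -c "$dev" ] || return 1
    exec <"$dev" >"$dev" 2>&1
}

# In an initramfs /init this would run right after something like
#   mount -t devtmpfs dev /dev
# has created the device nodes. Anything the script printed before that
# point went nowhere, which is the gap only the kernel can close.
```

The function only helps from the moment it runs, so an error in the first lines of /init (or a failure to exec /init's interpreter at all) still produces no output without the kernel-side change.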
Re: Is there an recommended way to refer to bitkeepr commits?
On 05/11/2017 01:59 AM, Michael Ellerman wrote:
> Linus Torvalds <torva...@linux-foundation.org> writes:
>
>> On Wed, May 10, 2017 at 3:04 PM, Eric W. Biederman
>> <ebied...@xmission.com> wrote:
>>>
>>> Thomas Gleixner appears to have a tree with all of those same commits
>>> except with the BKrev tags stripped out.
>>
>> That's the best import - so use that tree by Thomas, and just use the
>> git revision numbers in it (and say "tglx's linux-history tree" or
>> something).
>
> I've been using this one by Rob Landley which seems good:
>
> https://landley.net/kdocs/fullhist/
>
> It's grafted into the modern history so you can search seamlessly
> between the two which is pretty nice. I don't see any Bitkeeper tags
> though.

I went through and found/tagged the major old releases, did I forget to
upload a new tarball after that?

v0.0.1 cff5a6fb66765e90470f4d9ca2398da0ca3c75d5
v1.0.0 a068026b4a060e822892a64d5107fb58c45743ef
v1.2.0 8610c92442d125f165dc84e4a96f5cbc9b240484
v2.0.0 a374953c636bd91ea40b2d1e31af5405b90e8bf8
v2.2.0 bf330b5e3c471d0b67737c4822b0174ef4f89bed
v2.4.0 13a80dffb74939e292b6e90e5d79dd26d577489f
v2.6.0 4e9b4bc7a660962ae5f04f939469263b91cf95c2

Rob
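If a copy of the tree lacks those tags, the list above can be turned into tag commands mechanically. The following is a minimal sketch: the function name tag_commands is made up, the tag names and hashes are the ones from the message, and it only prints the commands rather than running them.

```sh
#!/bin/sh
# tag_commands: print (not run) one "git tag <name> <commit>" line per
# release in the list above. Pipe the output to sh from inside a fullhist
# checkout to actually create the tags.
tag_commands() {
    while read -r tag commit; do
        if [ -n "$tag" ]; then
            printf 'git tag %s %s\n' "$tag" "$commit"
        fi
    done <<'EOF'
v0.0.1 cff5a6fb66765e90470f4d9ca2398da0ca3c75d5
v1.0.0 a068026b4a060e822892a64d5107fb58c45743ef
v1.2.0 8610c92442d125f165dc84e4a96f5cbc9b240484
v2.0.0 a374953c636bd91ea40b2d1e31af5405b90e8bf8
v2.2.0 bf330b5e3c471d0b67737c4822b0174ef4f89bed
v2.4.0 13a80dffb74939e292b6e90e5d79dd26d577489f
v2.6.0 4e9b4bc7a660962ae5f04f939469263b91cf95c2
EOF
}

tag_commands
```

Printing first makes it easy to check the list against the tree before anything is written; git refuses to create a tag for a commit it cannot find.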
[PATCH] Clarify help text that compression applies to ramfs as well as legacy ramdisk.
From: Rob Landley <r...@landley.net>

Clarify help text that compression applies to ramfs as well as legacy ramdisk.

Signed-off-by: Rob Landley <r...@landley.net>
---

 usr/Kconfig | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/usr/Kconfig b/usr/Kconfig
index 572dcf7..d6f4633 100644
--- a/usr/Kconfig
+++ b/usr/Kconfig
@@ -46,7 +46,7 @@ config INITRAMFS_ROOT_GID
 	  If you are not sure, leave it set to "0".

 config RD_GZIP
-	bool "Support initial ramdisks compressed using gzip"
+	bool "Support initial ramdisk/ramfs compressed using gzip"
 	depends on BLK_DEV_INITRD
 	default y
 	select DECOMPRESS_GZIP
@@ -55,7 +55,7 @@ config RD_GZIP
 	  If unsure, say Y.

 config RD_BZIP2
-	bool "Support initial ramdisks compressed using bzip2"
+	bool "Support initial ramdisk/ramfs compressed using bzip2"
 	default y
 	depends on BLK_DEV_INITRD
 	select DECOMPRESS_BZIP2
@@ -64,7 +64,7 @@ config RD_BZIP2
 	  If unsure, say N.

 config RD_LZMA
-	bool "Support initial ramdisks compressed using LZMA"
+	bool "Support initial ramdisk/ramfs compressed using LZMA"
 	default y
 	depends on BLK_DEV_INITRD
 	select DECOMPRESS_LZMA
@@ -73,7 +73,7 @@ config RD_LZMA
 	  If unsure, say N.

 config RD_XZ
-	bool "Support initial ramdisks compressed using XZ"
+	bool "Support initial ramdisk/ramfs compressed using XZ"
 	depends on BLK_DEV_INITRD
 	default y
 	select DECOMPRESS_XZ
@@ -82,7 +82,7 @@ config RD_XZ
 	  If unsure, say N.

 config RD_LZO
-	bool "Support initial ramdisks compressed using LZO"
+	bool "Support initial ramdisk/ramfs compressed using LZO"
 	default y
 	depends on BLK_DEV_INITRD
 	select DECOMPRESS_LZO
@@ -91,7 +91,7 @@ config RD_LZO
 	  If unsure, say N.

 config RD_LZ4
-	bool "Support initial ramdisks compressed using LZ4"
+	bool "Support initial ramdisk/ramfs compressed using LZ4"
 	default y
 	depends on BLK_DEV_INITRD
 	select DECOMPRESS_LZ4
[PATCH] Teach INITRAMFS_ROOT_UID and INITRAMFS_ROOT_GID that -1 means "current user".
From: Rob Landley <r...@landley.net>

Teach INITRAMFS_ROOT_UID and INITRAMFS_ROOT_GID that -1 means "current user".

Signed-off-by: Rob Landley <r...@landley.net>
---

 scripts/gen_initramfs_list.sh | 2 ++
 usr/Kconfig | 12
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/scripts/gen_initramfs_list.sh b/scripts/gen_initramfs_list.sh
index 17fa901..7666fa1 100755
--- a/scripts/gen_initramfs_list.sh
+++ b/scripts/gen_initramfs_list.sh
@@ -268,10 +268,12 @@ while [ $# -gt 0 ]; do
 	case "$arg" in
 		"-u")	# map $1 to uid=0 (root)
 			root_uid="$1"
+			[ "$root_uid" = "-1" ] && root_uid=$(id -u || echo 0)
 			shift
 			;;
 		"-g")	# map $1 to gid=0 (root)
 			root_gid="$1"
+			[ "$root_gid" = "-1" ] && root_gid=$(id -g || echo 0)
 			shift
 			;;
 		"-d")	# display default initramfs list
diff --git a/usr/Kconfig b/usr/Kconfig
index 572dcf7..3b6ff16 100644
--- a/usr/Kconfig
+++ b/usr/Kconfig
@@ -26,10 +26,8 @@ config INITRAMFS_ROOT_UID
 	depends on INITRAMFS_SOURCE!=""
 	default "0"
 	help
-	  This setting is only meaningful if the INITRAMFS_SOURCE is
-	  contains a directory. Setting this user ID (UID) to something
-	  other than "0" will cause all files owned by that UID to be
-	  owned by user root in the initial ramdisk image.
+	  If INITRAMFS_SOURCE points to a directory, files owned by this UID
+	  (-1 = current user) will be owned by root in the resulting image.

 	  If you are not sure, leave it set to "0".

@@ -38,10 +36,8 @@ config INITRAMFS_ROOT_GID
 	depends on INITRAMFS_SOURCE!=""
 	default "0"
 	help
-	  This setting is only meaningful if the INITRAMFS_SOURCE is
-	  contains a directory. Setting this group ID (GID) to something
-	  other than "0" will cause all files owned by that GID to be
-	  owned by group root in the initial ramdisk image.
+	  If INITRAMFS_SOURCE points to a directory, files owned by this GID
+	  (-1 = current group) will be owned by root in the resulting image.

 	  If you are not sure, leave it set to "0".
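The two added shell lines can be tried in isolation. The sketch below wraps the same test in a function so both cases are visible; the name resolve_id is hypothetical and does not appear in gen_initramfs_list.sh.

```sh
#!/bin/sh
# resolve_id FLAG VALUE: print VALUE unchanged unless it is -1, in which
# case print the invoking user's uid (FLAG=-u) or gid (FLAG=-g), falling
# back to 0 if id fails. Hypothetical helper mirroring the patch's logic.
resolve_id() {
    if [ "$2" = "-1" ]; then
        id "$1" || echo 0
    else
        echo "$2"
    fi
}

root_uid=$(resolve_id -u -1)
root_gid=$(resolve_id -g -1)
echo "files owned by uid $root_uid / gid $root_gid would be recorded as root"
```

With -1 in the config, the same .config works for whichever unprivileged user runs the build, instead of hardcoding one person's numeric UID.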
[PATCH] Make initramfs honor CONFIG_DEVTMPFS_MOUNT
From: Rob Landley <r...@landley.net>

Make initramfs honor CONFIG_DEVTMPFS_MOUNT, and move /dev/console open after
devtmpfs mount.

Signed-off-by: Rob Landley <r...@landley.net>
---

 init/main.c |   15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/init/main.c b/init/main.c
index 2858be7..71ed0d7 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1016,12 +1016,6 @@ static noinline void __init kernel_init_freeable(void)
 
 	do_basic_setup();
 
-	/* Open the /dev/console on the rootfs, this should never fail */
-	if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
-		pr_err("Warning: unable to open an initial console.\n");
-
-	(void) sys_dup(0);
-	(void) sys_dup(0);
 	/*
 	 * check if there is an early userspace init.  If yes, let it do all
 	 * the work
@@ -1033,8 +1027,17 @@ static noinline void __init kernel_init_freeable(void)
 	if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
 		ramdisk_execute_command = NULL;
 		prepare_namespace();
+	} else if (IS_ENABLED(CONFIG_DEVTMPFS_MOUNT)) {
+		sys_mkdir("/dev", 0755);
+		sys_mount("dev", "dev", "devtmpfs", MS_SILENT, NULL);
 	}
 
+	/* Open the /dev/console on the rootfs, this should never fail */
+	if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
+		pr_err("Warning: unable to open an initial console.\n");
+	(void) sys_dup(0);
+	(void) sys_dup(0);
+
 	/*
 	 * Ok, we have completed the initial bootup, and
 	 * we're essentially up and running. Get rid of the
Re: [PATCH 1/3] futex: remove duplicated code
On 03/04/2017 07:05 AM, Russell King - ARM Linux wrote:
> On Fri, Mar 03, 2017 at 01:27:10PM +0100, Jiri Slaby wrote:
>> diff --git a/kernel/futex.c b/kernel/futex.c
>> index b687cb22301c..c5ff9850952f 100644
>> --- a/kernel/futex.c
>> +++ b/kernel/futex.c
>> @@ -1457,6 +1457,42 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int
>> nr_wake, u32 bitset)
>>  	return ret;
>>  }
>>
>> +static int futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
>> +{
>> +	int op = (encoded_op >> 28) & 7;
>> +	int cmp = (encoded_op >> 24) & 15;
>> +	int oparg = (encoded_op << 8) >> 20;
>> +	int cmparg = (encoded_op << 20) >> 20;
>
> Hmm. oparg and cmparg look like they're doing these shifts to get sign
> extension of the 12-bit values by assuming that "int" is 32-bit -
> probably worth a comment, or for safety, they should be "s32" so it's
> not dependent on the bit-width of "int".

I thought Linux depended on the LP64 standard for all architectures?

Standard: http://www.unix.org/whitepapers/64bit.html
Rationale: http://www.unix.org/version2/whatsnew/lp64_wp.html

So int has a defined bit width (32) on Linux?

Rob
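For illustration, here is a minimal userspace sketch of the width-independent sign extension Russell is asking for. It is not the code from the patch: the names `sext12`, `struct futex_op_fields`, and `decode_futex_op` are hypothetical, and the bit positions (op in bits 28..30, cmp in 24..27, oparg in 12..23, cmparg in 0..11) are inferred from the shifts in the quoted hunk.

```c
#include <stdint.h>

/* Hypothetical container for the decoded fields; not a kernel type. */
struct futex_op_fields {
	int op;
	int cmp;
	int32_t oparg;
	int32_t cmparg;
};

/*
 * Sign-extend the low 12 bits of v. The XOR/subtract form works on an
 * unsigned value, so it relies neither on "int" being 32 bits wide nor
 * on an arithmetic right shift of a negative number.
 */
static int32_t sext12(uint32_t v)
{
	v &= 0xfff;
	return (int32_t)(v ^ 0x800) - 0x800;
}

/* Decode using the same bit positions as the shifts in the quoted patch. */
struct futex_op_fields decode_futex_op(uint32_t encoded_op)
{
	struct futex_op_fields f;

	f.op = (encoded_op >> 28) & 7;
	f.cmp = (encoded_op >> 24) & 15;
	f.oparg = sext12(encoded_op >> 12);	/* bits 12..23 */
	f.cmparg = sext12(encoded_op);		/* bits 0..11 */
	return f;
}
```

With op=1, cmp=2, an oparg field of 0xfff, and a cmparg field of 0x005, this decodes oparg as -1 and cmparg as 5, which is the result the shift pair in the patch is meant to produce when "int" is 32 bits.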
Re: Runtime failure running sh:qemu in -next due to 'sh: fix copy_from_user()'
On 09/18/2016 10:17 AM, Rich Felker wrote:
> On Sat, Sep 17, 2016 at 11:40:28PM -0500, Rob Landley wrote:
>>
>>
>> On 09/16/2016 09:23 PM, Guenter Roeck wrote:
>>> On 09/16/2016 04:32 PM, Rich Felker wrote:
>>>>> 4.6.3 from kernel.org.
>>>>
>>>> That is utterly ancient and probably very buggy. I would recommend 5.x+
>>>> or at the very least 4.7 or 4.8.
>>>>
>>> Unfortunately that is the latest one available from kernel.org :-(.
>>> I'll try to build one myself.
>>
>> Rich, you really, really need to get an actual release version of
>> https://github.com/richfelker/musl-cross-make posted.
>
> What do you mean? Binaries? There are release tags, though it would
> probably be a good time to make another one.
>
> But this project (musl-cross-make) is not needed for building kernels
> -- stock gcc, any modern-ish version, should work fine. The canonical
> way (from prior to my involvement) to build sh* kernels is to use a
> gcc that supports any ISA level, and this can be done without multilib
> libgcc since the kernel provides its own libgcc replacement functions.

The above was an example of somebody using a broken toolchain because
there isn't a known-good reference toolchain for the architecture, which
the kernel maintainer is known to regression test against. Having such a
thing might help people distinguish "bug in kernel" from "bug in gcc".

> Rich

Rob
Re: Runtime failure running sh:qemu in -next due to 'sh: fix copy_from_user()'
On 09/16/2016 09:23 PM, Guenter Roeck wrote:
> On 09/16/2016 04:32 PM, Rich Felker wrote:
>>> 4.6.3 from kernel.org.
>>
>> That is utterly ancient and probably very buggy. I would recommend 5.x+
>> or at the very least 4.7 or 4.8.
>>
> Unfortunately that is the latest one available from kernel.org :-(.
> I'll try to build one myself.

Rich, you really, really need to get an actual release version of
https://github.com/richfelker/musl-cross-make posted.

Rob
Re: [RFC] fs: add userspace critical mounts event support
On 09/02/2016 07:20 PM, Luis R. Rodriguez wrote:
> kernel_read_file_from_path() can try to read a file from
> the system's filesystem. This is typically done for firmware
> for instance, which lives in /lib/firmware. One issue with
> this is that the kernel cannot know for sure when the real
> final /lib/firmware/ is ready, and even if you use initramfs
> drivers are currently initialized *first* prior to the initramfs
> kicking off.

Why?

> During init we run through all init calls first
> (do_initcalls()) and finally the initramfs is processed via
> prepare_namespace():

What's the downside of moving initramfs cpio extraction earlier in the
boot? I did some shuffling around of that code to make initmpfs work.
Does anybody know why initramfs extraction _before_ we initialize
drivers would be a bad thing? (The cpio is in memory, either linked into
the kernel or from the bootloader. No drivers are needed to extract it,
that's sort of the point.)

The only things I can think of are memory churn (large contiguous
physical page allocations), or if a driver somehow got us access to more
physical memory?

Rob
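To illustrate the "no drivers are needed" point, here is a minimal userspace sketch (not the kernel's unpacker) that walks a cpio archive held entirely in a memory buffer. It assumes an uncompressed archive in the "newc" format (magic "070701", 110-byte ASCII header, "TRAILER!!!" end marker); the names `hex8` and `cpio_count_entries` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NEWC_HDR_LEN 110	/* 6-byte magic + 13 fields of 8 hex digits */

/* Parse 8 ASCII hex digits into *out; returns 0 on success, -1 on bad input. */
static int hex8(const char *p, uint32_t *out)
{
	uint32_t v = 0;

	for (int i = 0; i < 8; i++) {
		char c = p[i];
		uint32_t d;

		if (c >= '0' && c <= '9')
			d = (uint32_t)(c - '0');
		else if (c >= 'a' && c <= 'f')
			d = (uint32_t)(c - 'a' + 10);
		else if (c >= 'A' && c <= 'F')
			d = (uint32_t)(c - 'A' + 10);
		else
			return -1;
		v = (v << 4) | d;
	}
	*out = v;
	return 0;
}

/*
 * Count the entries before the "TRAILER!!!" marker in a newc archive held
 * in buf[0..len). Returns -1 if the archive is malformed or truncated.
 * Only memory reads are involved; nothing here touches a device.
 */
int cpio_count_entries(const char *buf, size_t len)
{
	size_t off = 0;
	int count = 0;

	while (off <= len && len - off >= NEWC_HDR_LEN) {
		const char *h = buf + off;
		uint32_t filesize, namesize;
		size_t name_off, data_off;

		if (memcmp(h, "070701", 6) != 0)
			return -1;
		/* Field 6 is the file size, field 11 the name size (with NUL). */
		if (hex8(h + 6 + 6 * 8, &filesize) ||
		    hex8(h + 6 + 11 * 8, &namesize))
			return -1;

		name_off = off + NEWC_HDR_LEN;
		if (namesize == 0 || namesize > len - name_off)
			return -1;
		if (namesize == 11 &&
		    memcmp(buf + name_off, "TRAILER!!!", 11) == 0)
			return count;

		/* Name and file data are each padded to a 4-byte boundary. */
		data_off = (name_off + namesize + 3) & ~(size_t)3;
		if (data_off > len || filesize > len - data_off)
			return -1;
		off = (data_off + filesize + 3) & ~(size_t)3;
		count++;
	}
	return -1;	/* ran out of data before the trailer */
}
```

The walk needs only the buffer and its length, which is all the kernel has at that point as well; whether moving the real extraction earlier is safe is the open question in the message above, and this sketch does not answer it.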
Re: [PATCH] sh: Fix building j2_defconfig
On 08/16/2016 04:23 PM, Jason Cooper wrote:
> Hi Rob,
>
> On Tue, Aug 16, 2016 at 04:15:22PM -0500, Rob Landley wrote:
>> On 08/16/2016 10:41 AM, Jason Cooper wrote:
>>> When targeting the j2, we need to retain '-m2'. Previously, the
>>> Makefile blew out -m2 on the next line via :=.
>>>
>>> Fix this by s/:=/+=/ when building for the J2.
>>>
>>> Fixes: 5a846abad07f6 ("sh: add support for J-Core J2 processor")
>>> Signed-off-by: Jason Cooper <ja...@lakedaemon.net>
>>
>> Speaking of j2, any status on the missing pieces of infrastructure that
>> went in through other trees, without which booting hangs awaiting the
>> first interrupt?
>>
>> http://lists.j-core.org/pipermail/j-core/2016-August/000326.html
>>
>> It would be nice if the rest of the board support could make it in this
>> release. Which trees are they going through?
>
> I'm not aware of the status of other bits, but the irqchip driver can be
> found [1] in a stable, based off of v4.8-rc1, branch here:
>
> git://git.infradead.org/users/jcooper/linux.git irqchip/jcore

That's got the interrupt controller, and presumably Thomas' tree has the
timer. Is it likely to go upstream this dev cycle?

Basic j2 board support did, and as I said it hangs before userspace
without the rest of the interrupt controller and timer plumbing (which
are currently only used by this board).

The above message to the j-core list had an attached patch that adds the
missing bits to -rc2. I tested that patch and it worked for me:

Tested-by: Rob Landley <r...@landley.net>

I just checked the current git pull (not quite rc3) and vanilla is still
hanging at the same place, and the patch still applies cleanly.

I'm aware we're in bugfix-only mode, but "kernel hangs before launching
init" seems bug-ish to me.

Rob
Re: [PATCH] sh: Fix building j2_defconfig
On 08/16/2016 10:41 AM, Jason Cooper wrote:
> When targeting the j2, we need to retain '-m2'. Previously, the
> Makefile blew out -m2 on the next line via :=.
>
> Fix this by s/:=/+=/ when building for the J2.
>
> Fixes: 5a846abad07f6 ("sh: add support for J-Core J2 processor")
> Signed-off-by: Jason Cooper

Speaking of j2, any status on the missing pieces of infrastructure that
went in through other trees, without which booting hangs awaiting the
first interrupt?

http://lists.j-core.org/pipermail/j-core/2016-August/000326.html

It would be nice if the rest of the board support could make it in this
release. Which trees are they going through?

Rob