Re: Creating compressed backing_store as swapfile
On 11/5/2018 11:53 AM, valdis.kletni...@vt.edu wrote:
> On Mon, 05 Nov 2018 11:28:49 -0500, "Austin S. Hemmelgarn" said:
>> Also, it's probably worth noting that BTRFS doesn't need to decompress the entire file to read or write blocks in the middle, it splits the file into 128k blocks and compresses each of those independent of the others, so it can just decompress the 128k block that holds the actual block that's needed.
>
> Presumably it does something sane with block allocation for the now-compressed 128K that's presumably much smaller. Also, that limits the damage from writing to the middle of a compression unit.
>
> That *does* however increase the memory requirement - you can OOM or deadlock if your read/write from the swap needs an additional 128K for the compression buffer at an inconvenient time...

Indeed, and I can't really comment on how it might behave under those circumstances (the systems I did the testing on never saw memory pressure quite _that_ bad, and I had them set up to swap things out pretty early and really aggressively).
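As an aside, for anyone who wants to see those 128k compression units from userspace, FIEMAP is enough: on a btrfs file written with compression, extents come back flagged as encoded, and each such extent typically covers at most 128 KiB of logical data (the kernel is free to merge adjacent extents in the report, so treat this as a rough illustration rather than a guarantee). This is just an illustrative sketch, not something from the thread, and error handling is minimal:

/* fiemap-dump.c - list a file's extents; on btrfs, compressed extents
 * are reported with FIEMAP_EXTENT_ENCODED. Illustrative sketch only. */
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct fiemap *fm;
	int fd, i;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Room for up to 512 extent records. */
	fm = calloc(1, sizeof(*fm) + 512 * sizeof(struct fiemap_extent));
	fm->fm_length = ~0ULL;		/* map the whole file */
	fm->fm_extent_count = 512;
	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}
	for (i = 0; i < (int)fm->fm_mapped_extents; i++) {
		struct fiemap_extent *e = &fm->fm_extents[i];

		printf("logical %llu..%llu (%llu bytes)%s\n",
		       (unsigned long long)e->fe_logical,
		       (unsigned long long)(e->fe_logical + e->fe_length),
		       (unsigned long long)e->fe_length,
		       (e->fe_flags & FIEMAP_EXTENT_ENCODED) ?
		       " [compressed/encoded]" : "");
	}
	free(fm);
	close(fd);
	return 0;
}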
Re: POSIX violation by writeback error
On 2018-09-05 04:37, 焦晓冬 wrote:
> On Wed, Sep 5, 2018 at 4:04 PM Rogier Wolff wrote:
>> On Wed, Sep 05, 2018 at 09:39:58AM +0200, Martin Steigerwald wrote:
>>> Rogier Wolff - 05.09.18, 09:08:
>>>> So when a mail queuer puts mail in the mailq files and the mail processor can get them out of there intact, nobody is going to notice. (I know mail queuers should call fsync and report errors when that fails, but there are bound to be applications where calling fsync is not appropriate (*))
>>>
>>> AFAIK at least Postfix MDA only reports mail as being accepted over SMTP once fsync() on the mail file completed successfully. And I'd expect every sensible MDA to do this. I don't know how Dovecot MDA, which I currently use for sieve support, does this though.
>
> Is every implementation of a mail editor really going to call fsync()? Why would they call fsync(), when fsync() is meant to persist the file on disk, which is apparently unnecessary if the delivering-to-SMTP task won't start again after reboot?

Not mail clients, the actual servers. If they implement the SMTP standard correctly, they _have_ to call fsync() before they return that an email was accepted for delivery or relaying, because SMTP requires that a successful return means that the system can actually attempt delivery, which is not guaranteed if they haven't verified that it's actually written out to persistent storage.

>> Yes. That's why I added the remark that mailers will call fsync and know about it on the write side. I encountered a situation in the last few days that, when a developer runs into this while developing, would have caused him to write:
>>
>>   /* Calling this fsync causes unacceptable performance */
>>   // fsync (fd);
>>
>> I know of an application somewhere that does realtime gathering of call-records (number X called Y for Z seconds). They come in from a variety of sources, get de-duplicated, standardized, and written to files. Then different output modules push the data to the different consumers within the company. Billing among them. Now getting old data there would be pretty bad. And calling fsync all the time might have performance issues.
>
> That's the situation where "old data is really bad". But when apt-get upgrade replaces your /bin/sh and gets a write error, returning an error on subsequent reads is really bad. At this point, the /bin/sh may be partially old and partially new. Executing this corrupted binary is also dangerous, though.

But the system may still be usable in that state, while returning an error there guarantees it isn't. This is, in general, not the best example though, because no sane package manager directly overwrites _anything_, they all do some variation on replace-by-rename and call fsync _before_ renaming, so this situation is not realistically going to happen on any real system.
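For reference, the replace-by-rename pattern mentioned above looks roughly like the sketch below: write the new contents to a temporary file in the same directory, fsync() it and check the result, rename() it over the target, then fsync() the directory so the rename itself is durable. A write or fsync error is detected before the old, known-good file is ever replaced. The file names are purely illustrative, and a real implementation would also clean up the temporary file on failure, use a unique temp name, and retry short writes:

/* Sketch of atomic replace-by-rename with fsync before the rename. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int replace_file(const char *dir, const char *tmp, const char *dst,
			const void *buf, size_t len)
{
	int fd, dfd, ret = -1;

	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len)
		goto out_fd;
	/* A failure here means the new data never reached stable storage;
	 * the old file is still intact, so just report the error. */
	if (fsync(fd))
		goto out_fd;
	if (rename(tmp, dst))
		goto out_fd;
	/* Make the rename itself durable by syncing the directory. */
	dfd = open(dir, O_RDONLY | O_DIRECTORY);
	if (dfd < 0)
		goto out_fd;
	ret = fsync(dfd);
	close(dfd);
out_fd:
	close(fd);
	return ret;
}

int main(void)
{
	const char msg[] = "new contents\n";

	/* Hypothetical paths, just for demonstration. */
	return replace_file(".", "./config.tmp", "./config",
			    msg, strlen(msg)) ? 1 : 0;
}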
Re: [PATCH 1/2] x86: x86_64_defconfig: Enable KSM.
On 2018-06-14 18:50, Daniel Díaz wrote:
> As per the documentation, Kernel Samepage Merging (available since 2.6.32) is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y and activated via sysfs. More information can be found here: https://www.kernel.org/doc/Documentation/vm/ksm.txt
>
> When enabled in the kernel, the default is to not do anything at all, until it is activated at run-time with:
>   echo 1 > /sys/kernel/mm/ksm/run

As pointed out by a couple of others, this doesn't explain why this is a good idea. All you're doing here is giving a reason that it won't have a negative impact on most users. Two points that may be worth adding, but also don't really argue for it being a significant improvement:

* Pretty much all of the major distributions that use pre-built kernels have it enabled in their kernels (at minimum, Debian, Ubuntu, Fedora (and by extension RHEL and CentOS), openSUSE (and by extension SLES), Arch, and Alpine have it enabled), so enabling this in defconfig would bring it a bit closer to parity with distribution kernels in terms of core features.
* Software other than QEMU is starting to take advantage of it if available (for example, Netdata [1] can mark its in-memory TSDBs for deduplication, which usually cuts its memory usage roughly in half).

[1] https://my-netdata.io/
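It may also be worth noting in the commit message that KSM only ever scans memory an application has explicitly opted in with madvise(MADV_MERGEABLE), so the defconfig change stays inert unless both the application and the sysfs knob ask for it. A minimal opt-in sketch (the mapping size and fill pattern are arbitrary, purely for illustration):

/* Sketch: opt an anonymous mapping in to KSM merging. Nothing is merged
 * unless KSM is also enabled via /sys/kernel/mm/ksm/run. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 64 * 4096;	/* 64 pages, arbitrary for the example */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Fill the pages with identical data so they are candidates for
	 * merging once the KSM scanner runs. */
	memset(p, 0x5a, len);
	if (madvise(p, len, MADV_MERGEABLE)) {
		/* Fails with EINVAL on kernels built without CONFIG_KSM. */
		perror("madvise(MADV_MERGEABLE)");
		return 1;
	}
	/* Keep the process alive; progress can be watched in
	 * /sys/kernel/mm/ksm/pages_sharing. */
	pause();
	return 0;
}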
Re: Read-protected UEFI variables
On 2018-02-14 08:21, Benjamin Drung wrote:
> On Wednesday, 14.02.2018, 13:09 +, Ard Biesheuvel wrote:
>> On 14 February 2018 at 12:52, Benjamin Drung wrote:
>>> Hi,
>>>
>>> I am exploring the possibility to store SSH and other keys in UEFI variables for systems that do not have persistent storage. These systems boot via network and need individual SSH keys which ideally should not be distributed via network. The plan is to write a small daemon that starts at boot and gets the SSH keys from EFI variables to individualize the system with SSH keys. I plan to release the code as free software. Simple proof-of-concept code:
>>>
>>>   mount -t efivarfs none /sys/firmware/efi/efivars
>>>   for key in ssh_host_dsa_key ssh_host_ecdsa_key ssh_host_rsa_key; do
>>>     dd ibs=1 skip=4 if=/sys/firmware/efi/efivars/${key}-89df11f4-38e6-473e-ab43-b4406b76fba9 of=/etc/ssh/$key
>>>   done
>>>
>>> I am not the first person having the idea to use UEFI variables to store keys: https://www.usenix.org/conference/srecon17asia/program/presentation/korgachin
>>>
>>> There is one problem: The keys should be readable only by root. When mounting efivarfs, all variables have the permission 644 which makes them readable by all users. I have different ideas how to solve it:
>>>
>>> 1) Hard-code a list of GUIDs that should be only readable by root in the kernel module. These modules would also be not set to immutable.
>>> 2) Instead of hard-coding GUIDs, add a kernel module parameter to specify the GUIDs. Maybe have a default list in the kernel module.
>>> 3) Add a mount option to specify the protected GUIDs.
>>>
>>> Feedback is welcome.
>>
>> I'd consider a patch that makes the permissions a mount option for efivarfs, applying to all variables. The reason is that these variables shouldn't have been world readable in the first place, and I am reluctant to make this overly complex.
>
> Having some variables (like the Boot and BootOrder variables) world readable is useful. This allows normal users to run 'efibootmgr' to display the boot options.

Some variables maybe (ISTR variables for things like the system time-zone or the firmware locale settings, which _might_ be useful), but I would say the boot variables are not on that list. The only practical application for a regular (non-root) user to read those variables is to gather info for an attack on the system. Anybody who legitimately needs to access them (either for debugging the boot process, or changing the boot options) should have administrative access to the system anyway, and even then they will usually not need to read them.

In fact, most of the UEFI variables fall into the same category, but even more so, userspace has no legitimate reason to need to read them. You can get an absolutely insane amount of info out of them on some systems, most of which is a gold-mine for potential attackers. For the handful that may actually be useful to userspace, most would be needed only during system startup, and thus could safely be made readable by root only.

>> On the other hand, you should realize that UEFI was never designed to keep secrets, and so whether it is a good idea to put secrets in UEFI variables to begin with is dubious IMHO.
>
> If the UEFI is as secure as storing an unencrypted file on a hard drive, I am satisfied. Or do you have a better idea where to store the SSH keys for a diskless system that boots via network?

There really isn't any other option unless you're willing to put a small amount of flash storage in the system somehow (maybe a small USB flash drive connected directly to a USB header inside the system?).
As far as the security of UEFI variables, the same limitations as storing the data on an unencrypted hard drive apply, with the addition that it's much easier to get at them through stuff like Intel's AMT or IPMI than it is to read data off of the hard drive.
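For completeness, the reason the quoted proof-of-concept dd uses skip=4 is that efivarfs prepends a 4-byte attributes word to each variable's contents. The equivalent read in C looks like the sketch below; the variable name and GUID are simply the ones from the quoted example, and error handling is minimal:

/* Sketch: read one efivarfs variable, skipping the leading 4-byte
 * attributes word that efivarfs prepends to the variable data. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Example variable from the quoted proof of concept. */
	const char *path = "/sys/firmware/efi/efivars/"
		"ssh_host_rsa_key-89df11f4-38e6-473e-ab43-b4406b76fba9";
	uint32_t attrs;
	char buf[4096];
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* First 4 bytes are the UEFI variable attributes. */
	if (read(fd, &attrs, sizeof(attrs)) != sizeof(attrs)) {
		perror("read attrs");
		return 1;
	}
	fprintf(stderr, "attributes: 0x%08x\n", attrs);
	/* The rest is the raw variable payload (the key material here). */
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		if (write(STDOUT_FILENO, buf, (size_t)n) != n)
			break;
	close(fd);
	return 0;
}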
Re: [PATCH] efi: make EFI a menuconfig to ease disabling it all
On 2017-12-15 12:24, Vincent Legoll wrote:
> Hello,
>
>> This looks fine to me. Ard?
>>
>> Doesn't this break existing configs?
>
> Would adding a "default yes" on the new menuconfig be OK? If yes, I'd respin it for a v2.

Alternatively, would it not make some degree of sense to just turn the CONFIG_EFI symbol into the menuconfig? It already controls all the EFI-related stuff except GPT support (which should not be dependent on EFI support), so anyone who wants EFI support already has it enabled, and it would be a bit nicer to work with when actually configuring a kernel (it is rather annoying to have to effectively enable something twice). The same (in theory) goes for pretty much any other patch like this where there's already a config option controlling it all that just isn't a menuconfig.
Re: [PATCH v5 3/5] btrfs: Add zstd support
On 2017-08-09 22:39, Nick Terrell wrote:
> Add zstd compression and decompression support to BtrFS. zstd at its fastest level compresses almost as well as zlib, while offering much faster compression and decompression, approaching lzo speeds.
>
> I benchmarked btrfs with zstd compression against no compression, lzo compression, and zlib compression. I benchmarked two scenarios. Copying a set of files to btrfs, and then reading the files. Copying a tarball to btrfs, extracting it to btrfs, and then reading the extracted files. After every operation, I call `sync` and include the sync time. Between every pair of operations I unmount and remount the filesystem to avoid caching. The benchmark files can be found in the upstream zstd source repository under `contrib/linux-kernel/{btrfs-benchmark.sh,btrfs-extract-benchmark.sh}` [1] [2].
>
> I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor, 16 GB of RAM, and a SSD.
>
> The first compression benchmark is copying 10 copies of the unzipped Silesia corpus [3] into a BtrFS filesystem mounted with `-o compress-force=Method`. The decompression benchmark times how long it takes to `tar` all 10 copies into `/dev/null`. The compression ratio is measured by comparing the output of `df` and `du`. See the benchmark file [1] for details. I benchmarked multiple zstd compression levels, although the patch uses zstd level 1.
>
> | Method  | Ratio | Compression MB/s | Decompression speed |
> |---------|-------|------------------|---------------------|
> | None    | 0.99  | 504              | 686                 |
> | lzo     | 1.66  | 398              | 442                 |
> | zlib    | 2.58  | 65               | 241                 |
> | zstd 1  | 2.57  | 260              | 383                 |
> | zstd 3  | 2.71  | 174              | 408                 |
> | zstd 6  | 2.87  | 70               | 398                 |
> | zstd 9  | 2.92  | 43               | 406                 |
> | zstd 12 | 2.93  | 21               | 408                 |
> | zstd 15 | 3.01  | 11               | 354                 |
>
> The next benchmark first copies `linux-4.11.6.tar` [4] to btrfs. Then it measures the compression ratio, extracts the tar, and deletes the tar. Then it measures the compression ratio again, and `tar`s the extracted files into `/dev/null`. See the benchmark file [2] for details.
>
> | Method | Tar Ratio | Extract Ratio | Copy (s) | Extract (s) | Read (s) |
> |--------|-----------|---------------|----------|-------------|----------|
> | None   | 0.97      | 0.78          | 0.981    | 5.501       | 8.807    |
> | lzo    | 2.06      | 1.38          | 1.631    | 8.458       | 8.585    |
> | zlib   | 3.40      | 1.86          | 7.750    | 21.544      | 11.744   |
> | zstd 1 | 3.57      | 1.85          | 2.579    | 11.479      | 9.389    |
>
> [1] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-benchmark.sh
> [2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-extract-benchmark.sh
> [3] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
> [4] https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.11.6.tar.xz
>
> zstd source repository: https://github.com/facebook/zstd
>
> Signed-off-by: Nick Terrell

Considering how things went with the previous patch, I've been a bit more aggressive testing this one, but after now almost 72 hours of combined runtime for each of the architectures I have tests set up for with nothing breaking (well, nothing breaking that wasn't already breaking before this patch set, some of the raid56 tests are still failing semi-reliably as expected), I'd say this is reasonably stable and looks good overall.
> ---
> v2 -> v3:
> - Port upstream BtrFS commits e1ddce71d6, 389a6cfc2a, and 6acafd1eff
> - Change default compression level for BtrFS to 3
>
> v3 -> v4:
> - Add missing includes, which fixes the aarch64 build
> - Fix minor linter warnings
>
>  fs/btrfs/Kconfig           |   2 +
>  fs/btrfs/Makefile          |   2 +-
>  fs/btrfs/compression.c     |   1 +
>  fs/btrfs/compression.h     |   6 +-
>  fs/btrfs/ctree.h           |   1 +
>  fs/btrfs/disk-io.c         |   2 +
>  fs/btrfs/ioctl.c           |   6 +-
>  fs/btrfs/props.c           |   6 +
>  fs/btrfs/super.c           |  12 +-
>  fs/btrfs/sysfs.c           |   2 +
>  fs/btrfs/zstd.c            | 432 +
>  include/uapi/linux/btrfs.h |   8 +-
>  12 files changed, 468 insertions(+), 12 deletions(-)
>  create mode 100644 fs/btrfs/zstd.c
>
> diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
> index 80e9c18..a26c63b 100644
> --- a/fs/btrfs/Kconfig
> +++ b/fs/btrfs/Kconfig
> @@ -6,6 +6,8 @@ config BTRFS_FS
>  	select ZLIB_DEFLATE
>  	select LZO_COMPRESS
>  	select LZO_DECOMPRESS
> +	select ZSTD_COMPRESS
> +	select ZSTD_DECOMPRESS
>  	select RAID6_PQ
>  	select XOR_BLOCKS
>  	select SRCU
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index 128ce17..962a95a 100644
> [...]
Re: [PATCH v5 2/5] lib: Add zstd modules
On 2017-08-10 15:25, Hugo Mills wrote:
> On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote:
>> On 08/10/2017 04:30 AM, Eric Biggers wrote:
>>> These benchmarks are misleading because they compress the whole file as a single stream without resetting the dictionary, which isn't how data will typically be compressed in kernel mode. With filesystem compression the data has to be divided into small chunks that can each be decompressed independently. That eliminates one of the primary advantages of Zstandard (support for large dictionary sizes).
>>
>> I did btrfs benchmarks of kernel trees and other normal data sets as well. The numbers were in line with what Nick is posting here. zstd is a big win over both lzo and zlib from a btrfs point of view. It's true Nick's patches only support a single compression level in btrfs, but that's because btrfs doesn't have a way to pass in the compression ratio. It could easily be a mount option, it was just outside the scope of Nick's initial work.
>
> Could we please not add more mount options? I get that they're easy to implement, but it's a very blunt instrument. What we tend to see (with both nodatacow and compress) is people using the mount options, then asking for exceptions, discovering that they can't do that, and then falling back to doing it with attributes or btrfs properties. Could we just start with btrfs properties this time round, and cut out the mount option part of this cycle?

AFAIUI, the intent is to extend the compression type specification for both the mount options and the property, not to add a new mount option. I think we all agree that `mount -o compress=zstd3` is a lot better than `mount -o compress=zstd,compresslevel=3`.

> In the long run, it'd be great to see most of the btrfs-specific mount options get deprecated and ultimately removed entirely, in favour of attributes/properties, where feasible.

Are properties set on the root subvolume inherited properly? Because unless they are, we can't get the same semantics. Two other counter-arguments on completely removing BTRFS-specific mount options:

1. It's a lot easier and a lot more clearly defined to change things that affect global behavior of the FS by a remount than having to iterate everything in the FS to update properties. If I'm disabling autodefrag, I'd much rather just `mount -o remount,noautodefrag` than `find / -xdev -exec btrfs property set \{\} autodefrag false`, as the first will take effect for everything simultaneously and run exponentially quicker.

2. There are some things that don't make sense as per-object settings or are otherwise nonsensical on objects. Many, but not all, of the BTRFS-specific mount options fall into this category IMO, with the notable exception of compress[-force], [no]autodefrag, [no]datacow, and [no]datasum. Some other options do make sense as properties of the filesystem (commit, flushoncommit, {inode,space}_cache, max_inline, metadata_ratio, [no]ssd, and [no]treelog are such options), but many are one-off options that affect behavior on mount (like skip_balance, clear_cache, nologreplay, norecovery, usebackuproot, and subvol).
Re: [PATCH v5 2/5] lib: Add zstd modules
On 2017-08-10 13:24, Eric Biggers wrote:
> On Thu, Aug 10, 2017 at 07:32:18AM -0400, Austin S. Hemmelgarn wrote:
>> On 2017-08-10 04:30, Eric Biggers wrote:
>>> On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:
>>>> It can compress at speeds approaching lz4, and quality approaching lzma.
>>>
>>> Well, for a very loose definition of "approaching", and certainly not at the same time. I doubt there's a use case for using the highest compression levels in kernel mode --- especially the ones using zstd_opt.h.
>>
>> Large data-sets with WORM access patterns and infrequent writes immediately come to mind as a use case for the highest compression level. As a more specific example, the company I work for has a very large amount of documentation, and we keep all old versions. This is all stored on a file server which is currently using BTRFS. Once a document is written, it's almost never rewritten, so write performance only matters for the first write. However, they're read back pretty frequently, so we need good read performance. As of right now, the system is set to use LZO compression by default, and then when a new document is added, the previous version of that document gets re-compressed using zlib compression, which actually results in pretty significant space savings most of the time. I would absolutely love to use zstd compression with this system with the highest compression level, because most people don't care how long it takes to write the file out, but they do care how long it takes to read a file (even if it's an older version).
>
> This may be a reasonable use case, but note this cannot just be the regular "zstd" compression setting, since filesystem compression by default must provide reasonable performance for many different access patterns. See the patch in this series which actually adds zstd compression to btrfs; it only uses level 1. I do not see a patch which adds a higher compression mode. It would need to be a special setting like "zstdhc" that users could opt-in to on specific directories. It also would need to be compared to simply compressing in userspace. In many cases compressing in userspace is probably the better solution for the use case in question because it works on any filesystem, allows using any compression algorithm, and if random access is not needed it is possible to compress each file as a single stream (like a .xz file), which produces a much better compression ratio than the block-by-block compression that filesystems have to use.

There has been discussion as well as (I think) initial patches merged for support of specifying the compression level for algorithms which support multiple compression levels in BTRFS. I was actually under the impression that we had decided to use level 3 as the default for zstd, but that apparently isn't the case, and with the benchmark issues, it may not be once proper benchmarks are run.

Also, on the note of compressing in userspace, the use case I quoted at least can't do that because we have to deal with Windows clients and users have to be able to open files directly on said Windows clients. I entirely agree that real archival storage is better off using userspace compression, but sometimes real archival storage isn't an option.

> Note also that LZ4HC is in the kernel source tree currently but no one is using it vs. the regular LZ4. I think it is the kind of thing that sounded useful originally, but at the end of the day no one really wants to use it in kernel mode. I'd certainly be interested in actual patches, though.
Part of that is the fact that BTRFS is one of the only consumers (AFAIK) of this API that can freely choose all aspects of their usage, and the consensus here (which I don't agree with I might add) amounts to the argument that 'we already have compression with a compression ratio, we don't need more things like that'. I would personally love to see LZ4HC support in BTRFS (based on testing my own use cases, LZ4 is more deterministic than LZO for both compression and decompression, and most of the non archival usage I have of BTRFS benefits from determinism), but there's not any point in me writing up such a patch because it's almost certain to get rejected because BTRFS already has LZO. The main reason that zstd is getting considered at all is that the quoted benchmarks show clear benefits in decompression speed relative to zlib and far better compression ratios than LZO.
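One half-way point that already exists: btrfs honours the per-inode compression flag (the bit that chattr +c sets), so opting specific archival files or directories in to compression doesn't strictly require a new mount option; which algorithm and level that maps to is still whatever the filesystem decides. A rough, illustrative sketch of flipping that bit directly, equivalent to running chattr +c on the path:

/* Sketch: set the per-inode compression flag (the chattr +c bit) on a
 * file or directory, opting it in to filesystem compression on btrfs. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd, flags;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file-or-dir>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags)) {
		perror("FS_IOC_GETFLAGS");
		return 1;
	}
	flags |= FS_COMPR_FL;		/* same bit chattr +c sets */
	if (ioctl(fd, FS_IOC_SETFLAGS, &flags)) {
		perror("FS_IOC_SETFLAGS");
		return 1;
	}
	close(fd);
	return 0;
}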
Re: [PATCH v5 2/5] lib: Add zstd modules
On 2017-08-10 07:32, Austin S. Hemmelgarn wrote:
> On 2017-08-10 04:30, Eric Biggers wrote:
>> On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:
>>> It can compress at speeds approaching lz4, and quality approaching lzma.
>>
>> Well, for a very loose definition of "approaching", and certainly not at the same time. I doubt there's a use case for using the highest compression levels in kernel mode --- especially the ones using zstd_opt.h.
>
> Large data-sets with WORM access patterns and infrequent writes immediately come to mind as a use case for the highest compression level. As a more specific example, the company I work for has a very large amount of documentation, and we keep all old versions. This is all stored on a file server which is currently using BTRFS. Once a document is written, it's almost never rewritten, so write performance only matters for the first write. However, they're read back pretty frequently, so we need good read performance. As of right now, the system is set to use LZO compression by default, and then when a new document is added, the previous version of that document gets re-compressed using zlib compression, which actually results in pretty significant space savings most of the time. I would absolutely love to use zstd compression with this system with the highest compression level, because most people don't care how long it takes to write the file out, but they do care how long it takes to read a file (even if it's an older version).

Also didn't think to mention this, but I could see the max level being very popular for use with SquashFS root filesystems used in LiveCDs. Currently, they have to decide between read performance and image size, while zstd would provide both.
Re: [PATCH v5 2/5] lib: Add zstd modules
On 2017-08-10 04:30, Eric Biggers wrote:
> On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:
>> It can compress at speeds approaching lz4, and quality approaching lzma.
>
> Well, for a very loose definition of "approaching", and certainly not at the same time. I doubt there's a use case for using the highest compression levels in kernel mode --- especially the ones using zstd_opt.h.

Large data-sets with WORM access patterns and infrequent writes immediately come to mind as a use case for the highest compression level. As a more specific example, the company I work for has a very large amount of documentation, and we keep all old versions. This is all stored on a file server which is currently using BTRFS. Once a document is written, it's almost never rewritten, so write performance only matters for the first write. However, they're read back pretty frequently, so we need good read performance. As of right now, the system is set to use LZO compression by default, and then when a new document is added, the previous version of that document gets re-compressed using zlib compression, which actually results in pretty significant space savings most of the time. I would absolutely love to use zstd compression with this system with the highest compression level, because most people don't care how long it takes to write the file out, but they do care how long it takes to read a file (even if it's an older version).

>> The code was ported from the upstream zstd source repository.
>
> What version?
>
>> `linux/zstd.h` header was modified to match linux kernel style. The cross-platform and allocation code was stripped out. Instead zstd requires the caller to pass a preallocated workspace. The source files were clang-formatted [1] to match the Linux Kernel style as much as possible.
>
> It would be easier to compare to the upstream version if it was not all reformatted. There is a chance that bugs were introduced by Linux-specific changes, and it would be nice if they could be easily reviewed. (Also I don't know what clang-format settings you used, but there are still a lot of differences from the Linux coding style.)
>
>> I benchmarked zstd compression as a special character device. I ran zstd and zlib compression at several levels, as well as performing no compression, which measures the time spent copying the data to kernel space. Data is passed to the compressor 4096 B at a time. The benchmark file is located in the upstream zstd source repository under `contrib/linux-kernel/zstd_compress_test.c` [2].
>>
>> I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor, 16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is 211,988,480 B large. Run the following commands for the benchmark:
>>
>>   sudo modprobe zstd_compress_test
>>   sudo mknod zstd_compress_test c 245 0
>>   sudo cp silesia.tar zstd_compress_test
>>
>> The time is reported by the time of the userland `cp`. The MB/s is computed with 1,536,217,008 B / time(buffer size, hash) which includes the time to copy from userland. The Adjusted MB/s is computed with 1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)). The memory reported is the amount of memory the compressor requests.
>>
>> | Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
>> |----------|-----------|----------|-------|---------|----------|----------|
>> | none     | 211988480 | 0.100    | 1     | 2119.88 | -        | -        |
>> | zstd -1  | 73645762  | 1.044    | 2.878 | 203.05  | 224.56   | 1.23     |
>> | zstd -3  | 66988878  | 1.761    | 3.165 | 120.38  | 127.63   | 2.47     |
>> | zstd -5  | 65001259  | 2.563    | 3.261 | 82.71   | 86.07    | 2.86     |
>> | zstd -10 | 60165346  | 13.242   | 3.523 | 16.01   | 16.13    | 13.22    |
>> | zstd -15 | 58009756  | 47.601   | 3.654 | 4.45    | 4.46     | 21.61    |
>> | zstd -19 | 54014593  | 102.835  | 3.925 | 2.06    | 2.06     | 60.15    |
>> | zlib -1  | 77260026  | 2.895    | 2.744 | 73.23   | 75.85    | 0.27     |
>> | zlib -3  | 72972206  | 4.116    | 2.905 | 51.50   | 52.79    | 0.27     |
>> | zlib -6  | 68190360  | 9.633    | 3.109 | 22.01   | 22.24    | 0.27     |
>> | zlib -9  | 67613382  | 22.554   | 3.135 | 9.40    | 9.44     | 0.27     |
>
> These benchmarks are misleading because they compress the whole file as a single stream without resetting the dictionary, which isn't how data will typically be compressed in kernel mode. With filesystem compression the data has to be divided into small chunks that can each be decompressed independently. That eliminates one of the primary advantages of Zstandard (support for large dictionary sizes).
Re: bcache with existing ext4 filesystem
On 2017-07-26 13:41, Eric Wheeler wrote:
> On Wed, 26 Jul 2017, Pavel Machek wrote:
>>>>>> Hi! Unfortunately, that would mean shifting 400GB data 8KB forward, and compatibility problems. So I'd prefer adding the bcache superblock into the reserved space, so I can have caching _and_ compatibility with grub2 etc (and avoid a 400GB move):
>>>>>
>>>>> The common way to do that is to move the beginning of the partition, assuming your ext4 lives in a partition.
>>>>
>>>> Well... if I move the partition, grub2 (etc) will be unable to access data on it. (Plus I do not have free space before some of the partitions I'd like to be cached).
>>>
>>> Why not use dm-linear and prepend space for the bcache superblock? If this is your boot device, then you would need to write a custom initrd hook too.
>>
>> Thanks for a pointer. That would actually work, but I'd have to be very, very careful using it... ...because if I, or systemd or some kind of automounter sees the underlying device (sda4) and writes to it (it is valid ext4 after all), I'll have an inconsistent base device and cache ... and that will be asking for major problems (even in writethrough mode). Sigh. Gone are the days when distributions would only mount filesystems if you ask them to.

This is only an issue if:

1. You are crazy and have your desktop set up to auto-mount stuff on insertion (instead of on-access), or you're using a brain-dead desktop that doesn't let you turn this functionality off.

or:

2. You are mounting by filesystem label or filesystem UUID (/dev/mapper/* and /dev// links are more than sufficiently stable unless you're changing the storage stack all the time). This is of course almost always the case these days because for some reason the big distros assume that the device mapper links aren't stable or aren't safe to use.

or:

3. You're working with a multi-device BTRFS volume (and in that case, it's not an issue of auto-mounting it, but an issue of buggy kernel behavior combined with the 'scan everything for BTRFS' udev rule that's now upstream in udev).

> If this is a desktop, then I'm not sure what to suggest. But for a server with no GUI, turn off the grub2 osprober (GRUB_DISABLE_OS_PROBER="true" in /etc/sysconfig/grub). If this was LVM, then I would suggest also setting global_filter in lvm.conf. If you find other places that need poked to prevent automounting then please let me know!

Nope, just those and what I mentioned above under 1 and 2.

> As for ext4 feature bits, can they be arbitrarily named? (I think they are bits, so maybe not). Maybe propose a patch to ext4 to provide a "disable_mount" feature. This would prevent mounting altogether, and you would set/clear it when you care to. A strange feature indeed. Doing this as an obscure feature on a single filesystem doesn't quite seem right.

In the case of ext4, you can use the MMP feature (with the associated minor overhead) to enforce this.

> It might be better to have a block-layer device-mount mask so devices that are allowed to be mounted can be whitelisted on the kernel cmdline or something. blk.allow_mount=8:16,253:*,... or blk.disallow_mount=8:32 (or probably both). Just ideas, but it would be great to allow mounting only those major/minors which are authorized, particularly with increasingly complex block-device stacks.
If it's not explicitly created as a mount unit, fstab entry, or automount entry, and you're not using a brain-dead desktop, and you're not using FS labels or UUID's to mount, it _really_ isn't an issue except on BTRFS, and for that this would provide _no_ benefit, because the issue there results from confusion over which device to use combined with the assumption that a UUID stored on the device is sufficient to uniquely identify a filesystem.
Re: [PATCH v3 0/4] Add xxhash and zstd modules
On 2017-07-22 07:35, Adam Borowski wrote: On Fri, Jul 21, 2017 at 11:56:21AM -0400, Austin S. Hemmelgarn wrote: On 2017-07-20 17:27, Nick Terrell wrote: This patch set adds xxhash, zstd compression, and zstd decompression modules. It also adds zstd support to BtrFS and SquashFS. Each patch has relevant summaries, benchmarks, and tests. For patches 2-3, I've compile tested and had runtime testing running for about 18 hours now with no issues, so you can add: Tested-by: Austin S. Hemmelgarn <ahferro...@gmail.com> I assume you haven't tried it on arm64, right? I had no time to get 'round to it before, and just got the following build failure: CC fs/btrfs/zstd.o In file included from fs/btrfs/zstd.c:28:0: fs/btrfs/compression.h:39:2: error: unknown type name ‘refcount_t’ refcount_t pending_bios; ^~ scripts/Makefile.build:302: recipe for target 'fs/btrfs/zstd.o' failed It's trivially fixably by: --- a/fs/btrfs/zstd.c +++ b/fs/btrfs/zstd.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include "compression.h" after which it works fine, although half an hour of testing isn't exactly exhaustive. I did, and didn't hit this somehow... Off to go verify my tool-chain and scripts then... Alas, the armhf machine I ran stress tests (Debian archive rebuilds) on doesn't boot with 4.13-rc1 due to some unrelated regression, bisecting that would be quite painful so I did not try yet. I guess re-testing your patch set on 4.12, even with btrfs-for-4.13 (which it had for a while), wouldn't be of much help. So far, previous versions have been running for weeks, with no issue since you fixed workspace flickering. I also didn't see this, but I test on some seriously bare-bones configurations for both the 32-bit ARM tests I run. On further inspection, it looks like my scripts decided to use btrfs-for-4.13 as the base, not 4.13-rc1 like I thought they did, so I don't know anymore how helpful my testing may have been. On amd64 all is fine. I haven't tested SquashFS at all. Meow!
Re: [PATCH v3 0/4] Add xxhash and zstd modules
On 2017-07-20 17:27, Nick Terrell wrote: Hi all, This patch set adds xxhash, zstd compression, and zstd decompression modules. It also adds zstd support to BtrFS and SquashFS. Each patch has relevant summaries, benchmarks, and tests. Best, Nick Terrell For patches 2-3, I've compile tested and had runtime testing running for about 18 hours now with no issues, so you can add: Tested-by: Austin S. Hemmelgarn <ahferro...@gmail.com> For patch 1, I've only compile tested it, but had no issues and got no warnings about it when booting to test 2-4. For patch 4, I've compile tested it and done some really basic runtime testing using a couple of hand-crafted SquashFS images. It appears to run fine, but I've not done enough that I feel it warrants a Tested-by tag from me. Changelog: v1 -> v2: - Make pointer in lib/xxhash.c:394 non-const (1/4) - Use div_u64() for division of u64s (2/4) - Reduce stack usage of ZSTD_compressSequences(), ZSTD_buildSeqTable(), ZSTD_decompressSequencesLong(), FSE_buildDTable(), FSE_decompress_wksp(), HUF_writeCTable(), HUF_readStats(), HUF_readCTable(), HUF_compressWeights(), HUF_readDTableX2(), and HUF_readDTableX4() (2/4) - No zstd function uses more than 400 B of stack space (2/4) v2 -> v3: - Work around gcc-7 bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81388 (2/4) - Fix bug in dictionary compression from upstream commit cc1522351f (2/4) - Port upstream BtrFS commits e1ddce71d6, 389a6cfc2a, and 6acafd1eff (3/4) - Change default compression level for BtrFS to 3 (3/4) Nick Terrell (4): lib: Add xxhash module lib: Add zstd modules btrfs: Add zstd support squashfs: Add zstd support fs/btrfs/Kconfig |2 + fs/btrfs/Makefile |2 +- fs/btrfs/compression.c |1 + fs/btrfs/compression.h |6 +- fs/btrfs/ctree.h |1 + fs/btrfs/disk-io.c |2 + fs/btrfs/ioctl.c |6 +- fs/btrfs/props.c |6 + fs/btrfs/super.c | 12 +- fs/btrfs/sysfs.c |2 + fs/btrfs/zstd.c| 435 ++ fs/squashfs/Kconfig| 14 + fs/squashfs/Makefile |1 + fs/squashfs/decompressor.c |7 + fs/squashfs/decompressor.h |4 + fs/squashfs/squashfs_fs.h |1 + fs/squashfs/zstd_wrapper.c | 150 ++ include/linux/xxhash.h | 236 +++ include/linux/zstd.h | 1157 +++ include/uapi/linux/btrfs.h |8 +- lib/Kconfig| 11 + lib/Makefile |3 + lib/xxhash.c | 500 +++ lib/zstd/Makefile | 18 + lib/zstd/bitstream.h | 374 + lib/zstd/compress.c| 3479 lib/zstd/decompress.c | 2526 lib/zstd/entropy_common.c | 243 lib/zstd/error_private.h | 53 + lib/zstd/fse.h | 575 lib/zstd/fse_compress.c| 795 ++ lib/zstd/fse_decompress.c | 332 + lib/zstd/huf.h | 212 +++ lib/zstd/huf_compress.c| 770 ++ lib/zstd/huf_decompress.c | 960 lib/zstd/mem.h | 151 ++ lib/zstd/zstd_common.c | 75 + lib/zstd/zstd_internal.h | 250 lib/zstd/zstd_opt.h| 1014 + 39 files changed, 14382 insertions(+), 12 deletions(-) create mode 100644 fs/btrfs/zstd.c create mode 100644 fs/squashfs/zstd_wrapper.c create mode 100644 include/linux/xxhash.h create mode 100644 include/linux/zstd.h create mode 100644 lib/xxhash.c create mode 100644 lib/zstd/Makefile create mode 100644 lib/zstd/bitstream.h create mode 100644 lib/zstd/compress.c create mode 100644 lib/zstd/decompress.c create mode 100644 lib/zstd/entropy_common.c create mode 100644 lib/zstd/error_private.h create mode 100644 lib/zstd/fse.h create mode 100644 lib/zstd/fse_compress.c create mode 100644 lib/zstd/fse_decompress.c create mode 100644 lib/zstd/huf.h create mode 100644 lib/zstd/huf_compress.c create mode 100644 lib/zstd/huf_decompress.c create mode 100644 lib/zstd/mem.h create mode 100644 lib/zstd/zstd_common.c create mode 100644 
lib/zstd/zstd_internal.h create mode 100644 lib/zstd/zstd_opt.h -- 2.9.3
Re: [PATCH v3 0/4] Add xxhash and zstd modules
On 2017-07-21 07:16, Austin S. Hemmelgarn wrote: On 2017-07-20 17:27, Nick Terrell wrote: Well this is embarrassing, forgot to type anything before hitting send... Hi all, This patch set adds xxhash, zstd compression, and zstd decompression modules. It also adds zstd support to BtrFS and SquashFS. Each patch has relevant summaries, benchmarks, and tests. Best, Nick Terrell I've compile tested all of this on my end now, and have gotten some preliminary testing of patches 2-4, and everything looks good so far here. I'll reply to this if I end up finding issues, but it doesn't look like I will right now considering the testing has run over night without issue. Changelog: v1 -> v2: - Make pointer in lib/xxhash.c:394 non-const (1/4) - Use div_u64() for division of u64s (2/4) - Reduce stack usage of ZSTD_compressSequences(), ZSTD_buildSeqTable(), ZSTD_decompressSequencesLong(), FSE_buildDTable(), FSE_decompress_wksp(), HUF_writeCTable(), HUF_readStats(), HUF_readCTable(), HUF_compressWeights(), HUF_readDTableX2(), and HUF_readDTableX4() (2/4) - No zstd function uses more than 400 B of stack space (2/4) v2 -> v3: - Work around gcc-7 bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81388 (2/4) - Fix bug in dictionary compression from upstream commit cc1522351f (2/4) - Port upstream BtrFS commits e1ddce71d6, 389a6cfc2a, and 6acafd1eff (3/4) - Change default compression level for BtrFS to 3 (3/4) Nick Terrell (4): lib: Add xxhash module lib: Add zstd modules btrfs: Add zstd support squashfs: Add zstd support fs/btrfs/Kconfig |2 + fs/btrfs/Makefile |2 +- fs/btrfs/compression.c |1 + fs/btrfs/compression.h |6 +- fs/btrfs/ctree.h |1 + fs/btrfs/disk-io.c |2 + fs/btrfs/ioctl.c |6 +- fs/btrfs/props.c |6 + fs/btrfs/super.c | 12 +- fs/btrfs/sysfs.c |2 + fs/btrfs/zstd.c| 435 ++ fs/squashfs/Kconfig| 14 + fs/squashfs/Makefile |1 + fs/squashfs/decompressor.c |7 + fs/squashfs/decompressor.h |4 + fs/squashfs/squashfs_fs.h |1 + fs/squashfs/zstd_wrapper.c | 150 ++ include/linux/xxhash.h | 236 +++ include/linux/zstd.h | 1157 +++ include/uapi/linux/btrfs.h |8 +- lib/Kconfig| 11 + lib/Makefile |3 + lib/xxhash.c | 500 +++ lib/zstd/Makefile | 18 + lib/zstd/bitstream.h | 374 + lib/zstd/compress.c| 3479 lib/zstd/decompress.c | 2526 lib/zstd/entropy_common.c | 243 lib/zstd/error_private.h | 53 + lib/zstd/fse.h | 575 lib/zstd/fse_compress.c| 795 ++ lib/zstd/fse_decompress.c | 332 + lib/zstd/huf.h | 212 +++ lib/zstd/huf_compress.c| 770 ++ lib/zstd/huf_decompress.c | 960 lib/zstd/mem.h | 151 ++ lib/zstd/zstd_common.c | 75 + lib/zstd/zstd_internal.h | 250 lib/zstd/zstd_opt.h| 1014 + 39 files changed, 14382 insertions(+), 12 deletions(-) create mode 100644 fs/btrfs/zstd.c create mode 100644 fs/squashfs/zstd_wrapper.c create mode 100644 include/linux/xxhash.h create mode 100644 include/linux/zstd.h create mode 100644 lib/xxhash.c create mode 100644 lib/zstd/Makefile create mode 100644 lib/zstd/bitstream.h create mode 100644 lib/zstd/compress.c create mode 100644 lib/zstd/decompress.c create mode 100644 lib/zstd/entropy_common.c create mode 100644 lib/zstd/error_private.h create mode 100644 lib/zstd/fse.h create mode 100644 lib/zstd/fse_compress.c create mode 100644 lib/zstd/fse_decompress.c create mode 100644 lib/zstd/huf.h create mode 100644 lib/zstd/huf_compress.c create mode 100644 lib/zstd/huf_decompress.c create mode 100644 lib/zstd/mem.h create mode 100644 lib/zstd/zstd_common.c create mode 100644 lib/zstd/zstd_internal.h create mode 100644 lib/zstd/zstd_opt.h -- 2.9.3
Re: [PATCH v2 3/4] btrfs: Add zstd support
On 2017-07-07 23:07, Adam Borowski wrote: On Sat, Jul 08, 2017 at 01:40:18AM +0200, Adam Borowski wrote: On Fri, Jul 07, 2017 at 11:17:49PM +, Nick Terrell wrote: On 7/6/17, 9:32 AM, "Adam Borowski" wrote: Got a reproducible crash on amd64: Thanks for the bug report Adam! I'm looking into the failure, and haven't been able to reproduce it yet. I've built my kernel from your tree, and I ran your script with the kernel.tar tarball 100 times, but haven't gotten a failure yet. I have a few questions to guide my debugging. - How many cores are you running with? I’ve run the script with 1, 2, and 4 cores. - Which version of gcc are you using to compile the kernel? I’m using gcc-6.2.0-5ubuntu12. - Are the failures always in exactly the same place, and does it fail 100% of the time or just regularly? 6 cores -- all on bare metal. gcc-7.1.0-9. Lemme try with gcc-6, a different config or in a VM. I've tried the following: * gcc-6, defconfig (+btrfs obviously) * gcc-7, defconfig * gcc-6, my regular config * gcc-7, my regular config * gcc-7, debug + UBSAN + etc * gcc-7, defconfig, qemu-kvm with only 1 core Every build with gcc-7 reproduces the crash, every with gcc-6 does not. Got a GCC7 tool-chain built, and I can confirm this here too, tested with various numbers of cores ranging from 1-32 in a QEMU+KVM VM, with various combinations of debug options and other config switches.
Re: [Xen-devel] HPET enabled in BIOS, not presented as available_clocksource -- config, kernel code, &/or BIOS?
On 2017-05-13 19:17, PGNet Dev wrote: On 5/13/17 3:15 PM, Valentin Vidic wrote: Try booting without 'hpet=force,verbose clocksource=hpet' and it should select xen by default: Nope. Well, not quite ... With both 'hpet=force,verbose clocksource=hpet' removed, I end up with cat /sys/devices/system/clocksource/clocksource0/available_clocksource tsc xen cat /sys/devices/system/clocksource/clocksource0/current_clocksource tsc But with clocksource=xen *explicitly* added cat /sys/devices/system/clocksource/clocksource0/available_clocksource tsc xen cat /sys/devices/system/clocksource/clocksource0/current_clocksourcexen xen and in *console*, NOT dmesg, output, grep -i hpet tmp.txt (XEN) ACPI: HPET 9E8298F8, 0038 (r1 SUPERM SMCI--MB 1072009 AMI.5) (XEN) ACPI: HPET id: 0x8086a701 base: 0xfed0 (XEN) [VT-D] MSI HPET: :f0:0f.0 (XEN) Platform timer is 14.318MHz HPET [0.00] ACPI: HPET 0x9E8298F8 38 (v01 SUPERM SMCI--MB 01072009 AMI. 00 [0.00] ACPI: HPET id: 0x8086a701 base: 0xfed0 [0.00] ACPI: HPET 0x9E8298F8 38 (v01 SUPERM SMCI--MB 01072009 AMI. 00 [0.00] ACPI: HPET id: 0x8086a701 base: 0xfed0 [8.515245] hpet_acpi_add: no address or irqs in _CRS [8.515245] hpet_acpi_add: no address or irqs in _CRS (XEN) [2017-05-13 23:04:27] HVM1 save: HPET and dmesg | grep -i clocksource | grep -v line: [0.00] clocksource: refined-jiffies: mask: 0x max_cycles: 0x, max_idle_ns: 7645519600211568 ns [0.004000] clocksource: xen: mask: 0x max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns [0.375709] clocksource: jiffies: mask: 0x max_cycles: 0x, max_idle_ns: 764504178510 ns [4.656634] clocksource: Switched to clocksource xen [8.912897] clocksource: tsc: mask: 0x max_cycles: 0x2c94dffea94, max_idle_ns: 440795361700 ns jiffies, now? hm. no idea where that came from. and why the 'tsc' ? So I'm still unclear -- is this^ now, correctly "all" using MSI/HPET? That depends on what you mean by everything correctly using the HPET. Using clocksource=xen (or autoselecting it) will cause the kernel to get timing info from Xen. If you're running as a guest, this is absolutely what you want (unless you're using HVM), and with possible limited and extremely specific exceptions, this is almost certainly what you want in Domain-0 as well. Given that Xen is using the HPEt for timing itself, using clocksource=xen will result in Linux _indirectly_ using the HPET through Xen without involving the HPET driver (in essence, Xen is your HPET driver in this situation), which will get you essentially the same precision that you would get by using the HPET directly. So, if you just want the precision offered by the HPET, then yes, you are getting the same thing through the Xen clocksource.
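As a quick cross-check (not from the thread; just an illustrative userspace sketch), you can ask the kernel what granularity CLOCK_MONOTONIC advertises and measure the smallest step two back-to-back reads actually resolve, under whichever clocksource (xen, tsc, or hpet) is currently selected:

```c
/* Illustrative sketch; build with: cc clockres.c (add -lrt on older glibc) */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t ns(const struct timespec *t)
{
    return (uint64_t)t->tv_sec * 1000000000ULL + (uint64_t)t->tv_nsec;
}

int main(void)
{
    struct timespec res, a, b;
    uint64_t best = UINT64_MAX;

    /* Resolution the kernel advertises for this clock. */
    clock_getres(CLOCK_MONOTONIC, &res);
    printf("advertised resolution: %ld ns\n", (long)res.tv_nsec);

    /* Smallest non-zero delta between two consecutive reads gives a
     * rough feel for the precision the active clocksource delivers. */
    for (int i = 0; i < 100000; i++) {
        clock_gettime(CLOCK_MONOTONIC, &a);
        clock_gettime(CLOCK_MONOTONIC, &b);
        uint64_t d = ns(&b) - ns(&a);
        if (d && d < best)
            best = d;
    }
    printf("smallest observed step: %llu ns\n", (unsigned long long)best);
    return 0;
}
```

Running it before and after switching clocksource is an easy way to confirm that the xen clocksource delivers essentially the same precision as using the HPET directly.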
Re: [PATCH] btrfs: scrub: use do_div() for 64-by-32 division
On 2017-04-08 17:07, Adam Borowski wrote: Unbreaks ARM and possibly other 32-bit architectures. Fixes: 7d0ef8b4d: Btrfs: update scrub_parity to use u64 stripe_len Reported-by: Icenowy Zheng Signed-off-by: Adam Borowski --- You'd probably want to squash this with Liu's commit, to be nice to future bisects. Tested on amd64 where all is fine, and on arm (Odroid-U2) where scrub sometimes works, but, like most operations, randomly dies with some badness that doesn't look related: io_schedule, kunmap_high. That badness wasn't there in 4.11-rc5, needs investigating, but since it's not connected to our issue at hand, I consider this patch sort-of tested. fs/btrfs/scrub.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index b6fe1cd08048..95372e3679f3 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -2407,7 +2407,7 @@ static inline void __scrub_mark_bitmap(struct scrub_parity *sparity, start -= sparity->logic_start; start = div64_u64_rem(start, sparity->stripe_len, ); - offset /= sectorsize; + do_div(offset, sectorsize); nsectors = (int)len / sectorsize; if (offset + nsectors <= sparity->nsectors) { Also fixes things on: 32-bit MIPS (eb and el variants) 32-bit SPARC 32-bit PPC You can add my Tested-by if you want.
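For readers not familiar with the kernel's division helpers: a plain u64 / u32 division in C generates a call into libgcc's 64-bit divide routine, which the kernel does not provide on 32-bit architectures, so the build breaks at link time; do_div() performs the same division with an in-place quotient and a 32-bit remainder. A sketch of the calling convention follows (kernel build context assumed; the helper name bytes_to_sectors is made up for illustration and is not part of the patch):

```c
/* Sketch only - assumes a kernel build context, not userspace. */
#include <linux/types.h>
#include <asm/div64.h>

static u64 bytes_to_sectors(u64 nbytes, u32 sectorsize, u32 *tail)
{
	/*
	 * do_div() is a macro, not a function: it divides 'nbytes' in place
	 * (the quotient is written back into the variable) and the
	 * expression itself evaluates to the 32-bit remainder.  The divisor
	 * must fit in 32 bits; for a 64-bit divisor use div64_u64() or
	 * div64_u64_rem() instead, as the surrounding scrub code does.
	 */
	*tail = do_div(nbytes, sectorsize);

	return nbytes;		/* now the quotient: number of whole sectors */
}
```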
Re: [PATCH 00/24] Kernel lockdown
On 2017-04-05 16:14, David Howells wrote: These patches provide a facility by which a variety of avenues by which userspace can feasibly modify the running kernel image can be locked down. These include: (*) No unsigned modules and no modules for which can't validate the signature. (*) No use of ioperm(), iopl() and no writing to /dev/port. (*) No writing to /dev/mem or /dev/kmem. (*) No hibernation. (*) Restrict PCI BAR access. (*) Restrict MSR access. (*) No kexec_load(). (*) Certain ACPI restrictions. (*) Restrict debugfs interface to ASUS WMI. The lock-down can be configured to be triggered by the EFI secure boot status, provided the shim isn't insecure. The lock-down can be lifted by typing SysRq+x on a keyboard attached to the system. This has already been mentioned both in response to previous versions of this patch set, and by at least 2 people in response to a specific patch in this posting, but for any kind of proper security analysis, you need to better clarify your threat model. 'Prevent modification to the running kernel image' is a decent start on this, but at least some of the patches don't explain very well _how_ what you're disabling could be used to modify the running kernel image. Clarifying how something is a threat will help with verifying that you're correctly blocking the threat. Furthermore, why is the only way to enable this to boot in UEFI Secure Boot mode? Almost all of the hardening done here has general utility in hardening regular systems, and as such I'd suggest adding a command line option to enable kernel lock-down (which would greatly simplify testing), and a kconfig option to enforce it at build-time. In addition to all that, it would be nice to be able to disable all of the following at build time independent of the kernel lock-down state * The acpi_rsdp kernel parameter (I could easily see many distros building kernels with this disabled, it's insanely use-case specific). * IO port and resource reservation module parameters (this would actually be easier than having runtime blacklisting, and I could also easily see this being turned on by default by a number of distros). * TOICSERIAL (this one is more likely than the above two to break systems). And these would probably be useful as lockable sysctls that would be automatically locked disabled when the kernel is locked down: * ioperm/iopl (these can technically be blocked by seccomp or other means, but that is non-trivial to do). * Most of the other ACPI stuff (some of this is useful for troubleshooting, but is not normally used during regular operation). * PCI BAR access.
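To illustrate the "can technically be blocked by seccomp, but that is non-trivial" remark about ioperm()/iopl(), here is a hypothetical userspace sketch using libseccomp (an assumption; nothing like this is in the patch set). The filter only covers the calling process and its children, which is precisely why a system-wide lockdown switch is the simpler answer:

```c
/* Sketch, assuming libseccomp; x86 only. Build with: cc noports.c -lseccomp */
#include <errno.h>
#include <stdio.h>
#include <seccomp.h>
#include <sys/io.h>

int main(void)
{
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);   /* default: allow */
    if (!ctx)
        return 1;

    /* Make the raw port-I/O privilege syscalls fail with EPERM. */
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(ioperm), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(iopl), 0);

    if (seccomp_load(ctx) < 0) {
        seccomp_release(ctx);
        return 1;
    }
    seccomp_release(ctx);

    /* With the filter loaded this fails even for root. */
    if (ioperm(0x80, 1, 1) < 0)
        perror("ioperm");
    return 0;
}
```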
Re: [PATCH linux-next V2] tty: Disable default console blanking interval
On 2017-03-22 21:32, Scot Doyle wrote: On Wed, 22 Mar 2017, Tim Gardner wrote: BugLink: http://bugs.launchpad.net/bugs/869017 Console blanking is not enabling DPMS power saving (thereby negating any power-saving benefit), and is simply turning the screen content blank. This means that any crash output is invisible which is unhelpful on a server (virtual or otherwise). Furthermore, CRT burn in concerns should no longer govern the default case. Affected users could always set consoleblank on the kernel command line. Does screen blanking save some power by disabling the cursor blink? Unless you're dealing with ancient hardware, the difference in power usage is probably on the order of single digit micro-watts, which is not worth worrying about on almost anything you would expect to have a console display connected to.
Re: [PATCH linux-next] tty: Disable default console blanking interval
On 2017-03-22 09:50, Tim Gardner wrote: BugLink: http://bugs.launchpad.net/bugs/869017 Signed-off-by: Tim Gardner Cc: Greg Kroah-Hartman Cc: Jiri Slaby Cc: Adam Borowski Cc: Scot Doyle --- I'm not particularly knowledgable about console issues. Is a blaknking interval relevant in a post CRT world ? The argument in the bug description seems compelling. Burn-in still happens on at least LCD screens (not sure about anything else except DLP, where it doesn't happen unless it's a really crappy display), but on many of those it happens regardless of the contents of the display (I've actually got an old LCD display where the image is constantly dark because of having displayed so many hours of a black screen), but displaying a black screen instead of powering off the display doesn't really save any power on most modern displays and the fact that the screen isn't un-blanked when the kernel crashes is a pretty compelling argument against having it enabled by default IMO. drivers/tty/vt/vt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c index 5c4933b..9c99452 100644 --- a/drivers/tty/vt/vt.c +++ b/drivers/tty/vt/vt.c @@ -181,7 +181,7 @@ int console_blanked; static int vesa_blank_mode; /* 0:none 1:suspendV 2:suspendH 3:powerdown */ static int vesa_off_interval; -static int blankinterval = 10*60; +static int blankinterval; core_param(consoleblank, blankinterval, int, 0444); static DECLARE_WORK(console_work, console_callback);
Re: When will Linux support M2 on RAID ?
On 2017-03-07 10:15, Christoph Hellwig wrote: On Tue, Mar 07, 2017 at 09:50:22AM -0500, Austin S. Hemmelgarn wrote: He's referring to the RAID mode most modern Intel chipsets have, which (last I checked) Linux does not support completely and many OEM's are setting by default on new systems because it apparently provides better performance than AHCI even for a single device. It actually provides worse performance. What it does it that it shoves up to three nvme device bars into the bar of an AHCI device, and requires the OS to handle them all using a single driver. The Money's on crack at Intel decided to do that to provide their "valueable" RSTe IP (which is a windows ATA + RAID driver in a blob, which now has also grown a NVMe driver). The only remotely sane thing is to disable it in the bios, and burn all people involved with it. The next best thing is to provide a fake PCIe root port driver untangling this before it hits the driver, but unfortunately Intel is unwilling to either do this on their own or at least provide enough documentation for others to do it. For NVMe, yeah, it hurts performance horribly. For SATA devices though, it's hit or miss, some setups perform better, some perform worse. It does have one advantage though, it lets you put the C drive for a Windows install on a soft-RAID array insanely easily compared to trying to do so through Windows itself (although still significantly less easily that doing the equivalent on Linux...). The cynic in me is tempted to believe that the OEM's who are turning it on by default are trying to either: 1. Make their low-end systems look even crappier in terms of performance while adding to their marketing checklist (Of the systems I've seen that have this on by default, most were cheap ones with really low specs). 2. Actively make it harder to run anything but Windows on their hardware.
Re: When will Linux support M2 on RAID ?
On 2017-03-06 23:52, Christoph Hellwig wrote: On Sun, Mar 05, 2017 at 06:09:42PM -0800, David F. wrote: More and more systems are coming with M2 on RAID and Linux doesn't work unless you change the system out of RAID mode. This is becoming more and more of a problem. What is the status of Linux support for the new systems? Your message doesn't make sense at all. MD works on absolutely any Linux block device, and I've used it on plenty M.2 form factor devices - not that the form factor has anything to do with how Linux would treat a device. He's referring to the RAID mode most modern Intel chipsets have, which (last I checked) Linux does not support completely and many OEM's are setting by default on new systems because it apparently provides better performance than AHCI even for a single device. The bigger issue with this is that it's damn near impossible to get to the firmware on many new systems, and even if you can, some OEM's aren't even giving the option to use AHCI mode instead of the RAID mode. The whole M.2 thing is absolutely bogus though, my thought is that people are conflating the two because this switch to using RAID mode by default is happening concurrently with a general movement towards using M.2 devices as primary storage.
Re: Dell XPS13: MCE (Hardware Error) reported
On 2017-01-31 10:29, Paul Menzel wrote: Dear Borislav, dear Mario, On 01/27/17 18:16, mario.limoncie...@dell.com wrote: -Original Message- From: Borislav Petkov [mailto:b...@alien8.de] Sent: Friday, January 27, 2017 11:11 AM To: Paul MenzelCc: Ashok Raj ; Linux Kernel Mailing List ; Thorsten Leemhuis ; Len Brown ; Tony Luck ; Limonciello, Mario Subject: Re: Dell XPS13: MCE (Hardware Error) reported On Fri, Jan 27, 2017 at 02:35:16PM +0100, Paul Menzel wrote: Thank you very much for your help. After wasting my time with the Dell support over Twitter [1], where they basically also make you jump through hoops, Fun read that twitter page. Especially the bit about not supporting Linux but but you did ship this laptop with ubuntu. LOOL. FWIW, this is not the first time I've heard vendors trying to get away from dealing with bugs by playing the "we-dont-support-Linux" card. I think, in this case it was an honest mistake from the support person, which they corrected in a follow-up post. But the next conversion was a little disappointing as the problem was already diagnosed out on the LKML. As Mario replied, the Twitter people do not seem to know the free software world that well, that “normal” people actually are in contact with the developers. What surprises me though is, that the change-log for the firmware seems to be incomplete even for the Dell staff. Good to know when I go looking for a new laptop in the future. Unfortunately, there are not a lot of options. If you know a good vendor, please share. Dell is one of the view “big” vendors selling laptops with a GNU/Linux operating system installed. There are Google Chromebooks, which even have coreboot. Google’s Chromium team does a great job and have a great quality assurance. On par, or even better than Apple, I’d say. Some people need more powerful devices though. In Germany, I know of TUXEDO [1]. Also, Purism devices have GNU/Linux installed [2]. TUXEDO does not build the devices themselves, and gets template models(?) from Chinese manufactures. They only seem to start hiring the expertise to do Linux kernel and OS work themselves [3]. Currently, it’s mostly the stock Ubuntu for desktop with their artwork and some small optimizations. When it comes to laptops, I've recently become a fan of System76 [1]. Linux runs pretty much flawlessly on all the systems I've ever encountered from them without needing any special configuration, and they offer some pretty high-end systems. Their Oryx Pro laptop is particularly impressive, it's got hardware on par with a desktop bearing a similar price-tag, which is rare even for systems that ship with Windows per-installed. They're based in the US, but I'm pretty certain they ship worldwide, and like most of the better manufacturers these days, the systems are made-to-order with quite a lot in the way of options (you can even get an NVMe boot drive in their higher-end systems if you're willing to shell out more than a third of the base cost for the system on it). They also do desktops and servers, but it's generally cheaper in my experience for anything but a laptop to just build it yourself. Sorry you had that experience. I'll pass that on to get that messaging corrected. The team that does Linux support doesn't use twitter, they do phone and email support only. Thank you that’s good to know. I'm glad you got things sorted moving to the next FW version. Me too. Mario, there are at least two more firmware bugs [4][5][6]. Having fast suspend and resume says something about the quality of a device. 
If the Dell XPS13 9360 is supposed to compete with Apple devices, and Google Chromebooks, then this should be improved in my opinion. Do you have a suggestion on how to best get this solved? Do you want me to contact the support, or are there Dell employees with access to the Linux kernel Bugzilla bug tracker? Kind regards, Paul [1] https://www.tuxedocomputers.com/ [2] https://puri.sm/ [3] https://www.tuxedocomputers.com/Jobs-Karriere-Stellenausschreibungen.geek?x6a1ec=o8323ih26224vook9uu997r1a5 [4] https://github.molgen.mpg.de/mariux64/dell_xps13/ [5] https://bugzilla.kernel.org/show_bug.cgi?id=185611 [6] https://bugzilla.kernel.org/show_bug.cgi?id=185621 [1] https://system76.com/
Re: [PATCH v2] Add a "nosymlinks" mount option.
On 2016-11-16 16:18, Mattias Nissler wrote: I understand that silence suggests there's little interest, but here's some new information I discovered today that may justify reconsidering the patch: The BSDs already have exactly what I propose, the mount option is called "nosymfollow" there: https://github.com/freebsd/freebsd/blob/a41f4cc9a57cd74604ae7b051eec2f48865f18d6/sys/kern/vfs_lookup.c#L939 There's also some evidence on the net that people have been using said nosymfollow mount option to mitigate symlink attacks. I will comment that I do have interest in this (to the point that if this doesn't go into mainline, I'll probably add it to my local set of patches), though I'm not a maintainer or reviewer for the relevant area, so there's not much I can do except say that there is interest from at least one person who didn't work on the patch. That said, it would be nice if we matched the BSD option name (we do on pretty much every other core VFS-level mount option). On Mon, Oct 24, 2016 at 7:09 AM, Mattias Nissler wrote: Friendly ping - does this version of the patch have any chance of getting included in mainline? On Wed, Oct 19, 2016 at 2:31 PM, Mattias Nissler wrote: For mounts that have the new "nosymlinks" option, don't follow symlinks when resolving paths. The new option is similar in spirit to the existing "nodev", "noexec", and "nosuid" options. Note that symlinks may still be created on mounts where the "nosymlinks" option is present. readlink() remains functional, so user space code that is aware of symlinks can still choose to follow them explicitly. Setting the "nosymlinks" mount option helps prevent privileged writers from modifying files unintentionally in case there is an unexpected link along the accessed path. The "nosymlinks" option is thus useful as a defensive measure for systems that need to deal with untrusted file systems in privileged contexts.
Signed-off-by: Mattias Nissler --- fs/namei.c | 3 +++ fs/namespace.c | 9 ++--- fs/proc_namespace.c | 1 + fs/statfs.c | 2 ++ include/linux/mount.h | 3 ++- include/linux/statfs.h | 1 + include/uapi/linux/fs.h | 1 + 7 files changed, 16 insertions(+), 4 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 5b4eed2..4cddcf3 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1021,6 +1021,9 @@ const char *get_link(struct nameidata *nd) touch_atime(>link); } + if (nd->path.mnt->mnt_flags & MNT_NOSYMLINKS) + return ERR_PTR(-EACCES); + error = security_inode_follow_link(dentry, inode, nd->flags & LOOKUP_RCU); if (unlikely(error)) diff --git a/fs/namespace.c b/fs/namespace.c index e6c234b..deec84e 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2732,6 +2732,8 @@ long do_mount(const char *dev_name, const char __user *dir_name, mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME); if (flags & MS_RDONLY) mnt_flags |= MNT_READONLY; + if (flags & MS_NOSYMLINKS) + mnt_flags |= MNT_NOSYMLINKS; /* The default atime for remount is preservation */ if ((flags & MS_REMOUNT) && @@ -2741,9 +2743,10 @@ long do_mount(const char *dev_name, const char __user *dir_name, mnt_flags |= path.mnt->mnt_flags & MNT_ATIME_MASK; } - flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_BORN | - MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT | - MS_STRICTATIME | MS_NOREMOTELOCK); + flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_NOSYMLINKS | + MS_ACTIVE | MS_BORN | MS_NOATIME | MS_NODIRATIME | + MS_RELATIME | MS_KERNMOUNT | MS_STRICTATIME | + MS_NOREMOTELOCK); if (flags & MS_REMOUNT) retval = do_remount(, flags & ~MS_REMOUNT, mnt_flags, diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c index 3f1190d..a1949d9 100644 --- a/fs/proc_namespace.c +++ b/fs/proc_namespace.c @@ -67,6 +67,7 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt) { MNT_NOATIME, ",noatime" }, { MNT_NODIRATIME, ",nodiratime" }, { MNT_RELATIME, ",relatime" }, + { MNT_NOSYMLINKS, ",nosymlinks" }, { 0, NULL } }; const struct proc_fs_info *fs_infop; diff --git a/fs/statfs.c b/fs/statfs.c index 083dc0a..7ff7c32 100644 --- a/fs/statfs.c +++ b/fs/statfs.c @@ -27,6 +27,8 @@ static int flags_by_mnt(int mnt_flags) flags |= ST_NODIRATIME; if (mnt_flags & MNT_RELATIME) flags |= ST_RELATIME; + if (mnt_flags & MNT_NOSYMLINKS) + flags |= ST_NOSYMLINKS; return flags; } diff --git a/include/linux/mount.h b/include/linux/mount.h index 1172cce..5e302f4 100644 ---
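For a rough idea of how user space would exercise the option once a patch like the above is applied, here is a small sketch. It is only an illustration: MS_NOSYMLINKS does not exist in mainline headers, so the constant below is a stand-in for whatever value the patch assigns in include/uapi/linux/fs.h, and the mount point and symlink names are made up.

/*
 * Sketch only: MS_NOSYMLINKS is not a mainline flag; the value below is a
 * placeholder for whatever the patch defines in include/uapi/linux/fs.h.
 * Paths are hypothetical. Build: gcc -Wall -o nosymlinks-demo demo.c
 */
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

#ifndef MS_NOSYMLINKS
#define MS_NOSYMLINKS	(1 << 8)	/* placeholder value, see patch */
#endif

int main(void)
{
	/* Remount the untrusted filesystem with symlink following disabled
	 * (a real tool would preserve the other mount flags too). */
	if (mount(NULL, "/mnt/untrusted", NULL,
		  MS_REMOUNT | MS_NOSYMLINKS, NULL) < 0) {
		perror("remount");
		return 1;
	}

	/* Path resolution through a symlink should now fail with EACCES... */
	int fd = open("/mnt/untrusted/link-to-etc/passwd", O_RDONLY);
	if (fd < 0)
		printf("open through symlink: %s (EACCES expected)\n",
		       strerror(errno));
	else
		close(fd);

	/* ...while readlink() on the link itself keeps working, since it
	 * never follows the link. */
	char buf[256];
	ssize_t n = readlink("/mnt/untrusted/link-to-etc", buf, sizeof(buf) - 1);
	if (n >= 0) {
		buf[n] = '\0';
		printf("readlink still works: %s\n", buf);
	}
	return 0;
}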
Re: [PATCH] f2fs: support multiple devices
On 2016-11-09 21:29, Qu Wenruo wrote: At 11/10/2016 06:57 AM, Andreas Dilger wrote: On Nov 9, 2016, at 1:56 PM, Jaegeuk Kim wrote: This patch implements multiple devices support for f2fs. Given multiple devices by mkfs.f2fs, f2fs shows them entirely as one big volume under one f2fs instance. Internal block management is very simple, but we will modify block allocation and background GC policy to boost IO speed by exploiting them according to each device's speed. How will you integrate this into FIEMAP, since it is now possible that if a file is split across multiple devices it will return ambiguous block numbers for a file. I've been meaning to merge the FIEMAP handling in Lustre to support multiple devices in a single filesystem, so that this can be detected in userspace. struct ll_fiemap_extent { __u64 fe_logical; /* logical offset in bytes for the start of * the extent from the beginning of the file */ __u64 fe_physical; /* physical offset in bytes for the start * of the extent from the beginning of the disk */ __u64 fe_length; /* length in bytes for this extent */ __u64 fe_reserved64[2]; __u32 fe_flags;/* FIEMAP_EXTENT_* flags for this extent */ __u32 fe_device; /* device number for this extent */ __u32 fe_reserved[2]; }; Btrfs introduces a new layer for multi-device (even for a single device). So the fiemap returned by btrfs is never a real device bytenr, but a logical address in the btrfs logical address space. Much like traditional soft RAID. This is a really important point. BTRFS does a good job of segregating the layers here, so the file-level allocator really has very limited knowledge of the underlying storage, which in turn means that adding this to BTRFS would likely be a pretty invasive change for the FIEMAP implementation. This adds the 32-bit "fe_device" field, which would optionally be filled in by the filesystem (zero otherwise). It would return the kernel device number (i.e. st_dev), or for a network filesystem (with FIEMAP_EXTENT_NET set) this could just return an integer device number since the device number is meaningless (and may conflict) on a remote system. Since AFAIK Btrfs also has multiple device support there are an increasing number of places where this would be useful. AFAIK, btrfs multi-device is here due to scrub with its data/meta csum. It's also here for an attempt at parity with ZFS. Unlike device-mapper based multi-device, btrfs has csum so it can detect which mirror is correct. This makes btrfs scrub a little better than soft RAID. For example, for RAID1 if the two mirrors differ from each other, btrfs can find the correct one and rewrite it into the other mirror. And furthermore, btrfs supports snapshots and is faster than device-mapper based snapshots (LVM). This makes it a little more worthwhile to implement multi-device support in btrfs. But for f2fs, no data csum, no snapshots. I don't really see the point of using so much code to implement it, especially when we can use mdadm or LVM to implement it. I'd tend to agree on this, if it weren't for the fact that this looks to me like preparation for implementing storage tiering, which neither LVM nor MD have a good implementation of. Whether or not such functionality is worthwhile for the embedded systems that F2FS typically targets is another story of course. Not to mention btrfs multi-device support still has quite a lot of bugs, like scrub being able to corrupt correct data stripes. This sounds like you're lumping raid5/6 code in with the general multi-device code, which is not a good way of describing things for multiple reasons.
Pretty much, if you're using just raid1 mode, without compression, on reasonable storage devices, things are rock-solid relative to the rest of BTRFS. Yes, there is a bug with compression and multiple copies of things, but that requires a pretty spectacular device failure to manifest, and it impacts single device mode too (it happens in dup profiles as well as raid1). As far as the raid5/6 stuff, that shouldn't have been merged in the state it was in when it got merged, and should probably just be rewritten from the ground up. Personally speaking, I am not a fan of btrfs multi-device management, despite the above advantage. The complexity is really not worth it. (So I think XFS with LVM is much better than Btrfs considering the stability)
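As a point of reference for the fe_device discussion above, this is roughly what a user-space FIEMAP walk looks like today. The proposed fe_device field is not part of the mainline struct fiemap_extent, so the comment in the loop only marks where a multi-device-aware filesystem would report it under the proposal; everything else uses the existing FS_IOC_FIEMAP interface.

/*
 * Minimal FIEMAP walk over one file.  fe_device does not exist in the
 * mainline struct fiemap_extent; under the proposal it would be carved
 * out of the reserved fields, as noted below.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	/* Room for up to 32 extents in one call; real code loops for more. */
	size_t sz = sizeof(struct fiemap) + 32 * sizeof(struct fiemap_extent);
	struct fiemap *fm = calloc(1, sz);
	if (!fm) { perror("calloc"); return 1; }
	fm->fm_start = 0;
	fm->fm_length = ~0ULL;		/* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;
	fm->fm_extent_count = 32;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) { perror("FS_IOC_FIEMAP"); return 1; }

	for (unsigned i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *fe = &fm->fm_extents[i];
		printf("logical %llu physical %llu len %llu flags 0x%x\n",
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)fe->fe_physical,
		       (unsigned long long)fe->fe_length,
		       fe->fe_flags);
		/* With the proposed extension, an fe_device field here would
		 * say which member device fe_physical refers to; today the
		 * caller has no way to tell on a multi-device filesystem. */
	}
	free(fm);
	close(fd);
	return 0;
}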
Re: [PATCH 2/2] kbuild: add -fno-PIE
On 2016-11-04 10:39, Markus Trippelsdorf wrote: On 2016.11.04 at 15:24 +0100, Sebastian Andrzej Siewior wrote: On 2016-11-04 07:37:02 [-0400], Austin S. Hemmelgarn wrote: clued enough to have known better. Reassigning bug reports in question from gcc-6 to linux is beyond stupid; Balint is either being deliberately obtuse, or genuinely unable to imagine that somebody might be using the compiler _not_ for debian package builds. If it helps, you could point out that Gentoo's hardened profile's GCC builds use PIE by default and have absolutely zero issues building the Linux kernel without any special kernel patches to turn it off (and has been doing so for years). Interesting. So I took a look at Gentoo. They ship gcc 4.9.3 by default. They have their own PIE patch since it is not yet supported by gcc. And let me quote why that works for them: | This file will add -fstack-protector-all, -fstack-check, -fPIE, -pie and -z now | as default if the defines and the spec allow it. | Added a hack for gcc-specs-* in toolchain-funcs.eclass and _filter-hardened in flag-o-matic.eclass | to support older hardened GCC patches and we don't need to change the code on gcc-specs-* and _filter-hardened. | This will add some unsupported upstream commands options as -nopie and -nonow. | -D__KERNEL__ is added so we don't have -fPIE, -pie and -fstack-protector-all and -fstack-check when building kernels. | ESP_CC1_SPEC is added to CC1_SPEC. | ESP_CC1_STRICT_OVERFLOW_SPEC is added so we don't disable the strict-overflow check. | ESP_LINK_PIE_CHECK_SPEC check for -pie, -p, -pg, -profile and -static. | ENABLE_CRTBEGINP add support for crtbeginP.o, build -static with -fPIE or -fpie. I was thinking about asking doko for something similar but no. Looking at portage they have a few patches where they add -fno-PIE to some packages. Also disabling PIE based on __KERNEL__ does not look right. So no, Gentoo did no better. And according to Google, there are also people in the Arch Linux camp with the same problem. Gentoo's gcc 6 is completely masked and it does not reference the patch I quote above, so Gentoo will run into this problem once they enable gcc 6 and don't add the -D__KERNEL__ hack. Eventually Fedora and SUSE will migrate to PIE by default and by then we should cover all major distros, so even Al should be affected unless he decides not to update or is using something else. But why enable PIE by default? What additional security does one gain if e.g. "cat" or "dmesg" are position-independent executables? It also adds overhead (although this is smaller than it used to be on x86_64). So instead of doing the sane thing and adding -fPIE to long-running daemons only, now many projects have to add -fno-pie to avoid breakage. Actually, the number of things that don't work with PIE is pretty small (pretty much kernel code and some stuff that does in-line assembly). In general, it falls under the heading of 'secure by default'. Yeah, some stuff like cat and dmesg have essentially zero security benefit from being compiled with -fPIE, but once you start looking at pretty much anything that's doing any processing of potentially untrusted data (which includes anything that talks to the network), you do start having some security improvement.
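To make the "what does PIE actually buy" question concrete: the practical effect is that ASLR can relocate the executable's own text segment, not just its libraries, stack and heap. A tiny demonstration, with the build commands as comments (file name and binary names are arbitrary; results assume ASLR is enabled):

/*
 * With a PIE binary, the address of main() changes between runs under
 * ASLR; a non-PIE binary loads at a fixed address every time.
 *
 *   gcc -O2 -fno-pie -no-pie -o fixed  where.c
 *   gcc -O2 -fpie    -pie    -o moving where.c
 *   ./fixed; ./fixed      # prints the same address both times
 *   ./moving; ./moving    # prints a different address each run
 */
#include <stdio.h>

int main(void)
{
	printf("main() is loaded at %p\n", (void *)main);
	return 0;
}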
Re: [PATCH 2/2] kbuild: add -fno-PIE
On 2016-11-04 10:24, Sebastian Andrzej Siewior wrote: On 2016-11-04 07:37:02 [-0400], Austin S. Hemmelgarn wrote: clued enough to have known better. Reassigning bug reports in question from gcc-6 to linux is beyond stupid; Balint is either being deliberately obtuse, or genuinely unable to imagine that somebody might be using the compiler _not_ for debian package builds. If it helps, you could point out that Gentoo's hardened profile's GCC builds use PIE by default and have absolutely zero issues building the Linux kernel without any special kernel patches to turn it off (and has been doing so for years). Interesting. So I took a look at Gentoo. They ship gcc 4.9.3 by default. They have their own PIE patch since it is not yet supported by gcc. And let me quote why that works for them: | This file will add -fstack-protector-all, -fstack-check, -fPIE, -pie and -z now | as default if the defines and the spec allow it. | Added a hack for gcc-specs-* in toolchain-funcs.eclass and _filter-hardened in flag-o-matic.eclass | to support older hardened GCC patches and we don't need to change the code on gcc-specs-* and _filter-hardened. | This will add some unsupported upstream commands options as -nopie and -nonow. | -D__KERNEL__ is added so we don't have -fPIE, -pie and -fstack-protector-all and -fstack-check when building kernels. | ESP_CC1_SPEC is added to CC1_SPEC. | ESP_CC1_STRICT_OVERFLOW_SPEC is added so we don't disable the strict-overflow check. | ESP_LINK_PIE_CHECK_SPEC check for -pie, -p, -pg, -profile and -static. | ENABLE_CRTBEGINP add support for crtbeginP.o, build -static with -fPIE or -fpie. I was thinking about asking doko for something similar but no. Looking at portage they have a few patches where they add -fno-PIE to some packages. Also disabling PIE based on __KERNEL__ does not look right. So no, Gentoo did no better. And according to Google, there are also people in the Arch Linux camp with the same problem. Gentoo's gcc 6 is completely masked and it does not reference the patch I quote above, so Gentoo will run into this problem once they enable gcc 6 and don't add the -D__KERNEL__ hack. Eventually Fedora and SUSE will migrate to PIE by default and by then we should cover all major distros, so even Al should be affected unless he decides not to update or is using something else. While I don't agree with _how_ they worked around it, it still works correctly with no user intervention for pretty much every important case, and my point was more that it is possible to make this work without a kernel patch than 'Hey, it works over here, let's do what they're doing'. I would still argue that the root of the issue is how GCC handles options specified on the command line that conflict with its compile-time defaults. This is at least the second kernel-related case where things broke because GCC doesn't do the sensible thing and override defaults based on command-line options (there is (or was, not sure if it's been resolved yet or not) an issue too on MIPS with some other option that I can't remember right now). If option X and option Y are mutually exclusive, and option X is specified on the command line while option Y isn't, it should not use option Y regardless of whether or not it's the default, and possibly spit out a warning if it is the default (for PIC, yes, there probably should be a warning), not die.
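The conflict being described is easy to reproduce by hand on x86-64 once the compiler defaults to PIE; the exact diagnostic wording may vary between GCC versions, but the shape is the same one quoted in the patch description. The file name here is arbitrary and the file itself barely matters, since the failure happens before code generation depends on its contents:

/*
 * Reproducer for the default-PIE vs. -mcmodel=kernel clash (x86-64):
 *
 *   # with a compiler that defaults to -fPIE (e.g. Debian's gcc-6):
 *   gcc -c -mcmodel=kernel empty.c
 *   #   error: code model kernel does not support PIC mode
 *
 *   # what the kbuild patch effectively does:
 *   gcc -c -mcmodel=kernel -fno-PIE empty.c   # builds fine
 */
int kbuild_pie_probe;	/* any object will do; the name is arbitrary */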
Re: [PATCH 2/2] kbuild: add -fno-PIE
On 2016-11-03 21:08, Al Viro wrote: On Thu, Nov 03, 2016 at 04:50:55PM -0600, Ben Hutchings wrote: On Wed, 2016-11-02 at 18:20 +0100, Sebastian Andrzej Siewior wrote: Debian started to build the gcc with -fPIE by default so the kernel build ends before it starts properly with: |kernel/bounds.c:1:0: error: code model kernel does not support PIC mode Also add to KBUILD_AFLAGS due to: |gcc -Wp,-MD,arch/x86/entry/vdso/vdso32/.note.o.d … -mfentry -DCC_USING_FENTRY … vdso/vdso32/note.S |arch/x86/entry/vdso/vdso32/note.S:1:0: sorry, unimplemented: -mfentry isn’t supported for 32-bit in combination with -fpic [...] Unfortunately this isn't sufficient: Cannot use CONFIG_CC_STACKPROTECTOR_STRONG: -fstack-protector-strong available but compiler is broken /build/linux-4.9~rc3/Makefile:1069: recipe for target 'prepare-compiler-check' failed make[5]: *** [prepare-compiler-check] Error 1 scripts/gcc-x86_64-has-stack-protector.sh has its own list of options which will need to include -fno-PIE. That shit should be fixed in debian; no amount of kbuild changes will help on bisect going back to a state prior to those. IOW, no matter what we do kernel-side, that fuckup by doko, balint, et al. needs to be fixed in the debian gcc package. Al, seriously disappointed by that mess - debian gcc packagers are generally clued enough to have known better. Reassigning bug reports in question from gcc-6 to linux is beyond stupid; Balint is either being deliberately obtuse, or genuinely unable to imagine that somebody might be using the compiler _not_ for debian package builds. If it helps, you could point out that Gentoo's hardened profile's GCC builds use PIE by default and have absolutely zero issues building the Linux kernel without any special kernel patches to turn it off (and has been doing so for years).
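For background on why the stack-protector check trips here: scripts/gcc-x86_64-has-stack-protector.sh compiles a tiny probe with -mcmodel=kernel and greps the assembly for a %gs-based canary access. The sketch below shows roughly the shape of that probe and how it is invoked; the script's exact source may differ, so treat this as an approximation rather than a copy of the kernel's check.

/*
 * Rough shape of the probe behind scripts/gcc-x86_64-has-stack-protector.sh
 * (the real script pipes source of approximately this form through the
 * compiler), built along the lines of:
 *
 *   gcc -S -x c -O0 -mcmodel=kernel -fno-PIE -fstack-protector - -o - \
 *       < probe.c | grep -q "%gs"
 *
 * The check passes only if the emitted canary load uses the kernel's %gs
 * segment.  Without -fno-PIE, a default-PIE gcc-6 rejects -mcmodel=kernel
 * outright, so the check reports the stack protector as "broken" even
 * though the compiler itself is fine.
 */
int foo(void)
{
	char buf[200];		/* large enough to force a stack canary */

	buf[0] = 0;
	return buf[0];
}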
Re: [RFC] [PATCH] Add a "nolinks" mount option.
On 2016-10-17 09:02, Mattias Nissler wrote: OK, no more feedback thus far. Is there generally any interest in a mount option to avoid path name aliasing resulting in target file confusion? Perhaps a version that only disables symlinks instead of also hard-disabling files hard-linked to multiple locations (those are much lower risk for the situation I care about)? If there is interest, I'm happy to iterate the patch until it's accepted. If there's no interest, that's fine too - I'll then likely resort to moving the restrictions desired for Chrome OS into an LSM we compile into our kernels. I can see the symlink-related part potentially being useful in other cases, although if you do get rid of the hardlink portion, I'd suggest renaming the mount option to 'nosymlinks'. One use that comes to mind is securing multi-protocol file servers (for example, something serving both NFS and SMB) where at least one protocol doesn't directly handle symlinks (or there is inconsistency among the protocols in how they're handled). On Fri, Oct 14, 2016 at 6:22 PM, Mattias Nissler wrote: Forgot to mention: I realize my motivation is very specific to Chrome OS, however the nolinks option seemed useful also as a mitigation to generic privilege escalation symlink attacks, for cases where disabling symlinks/hardlinks is acceptable. On Fri, Oct 14, 2016 at 5:50 PM, Mattias Nissler wrote: On Fri, Oct 14, 2016 at 5:00 PM, Al Viro wrote: On Fri, Oct 14, 2016 at 03:55:15PM +0100, Al Viro wrote: Setting the "nolinks" mount option helps prevent privileged writers from modifying files unintentionally in case there is an unexpected link along the accessed path. The "nolinks" option is thus useful as a defensive measure against persistent exploits (i.e. a system getting re-exploited after a reboot) for systems that employ a read-only or dm-verity-protected rootfs. These systems prevent non-legit binaries from running after reboot. However, legit code typically still reads from and writes to a writable file system previously under full control of the attacker, who can place symlinks to trick file writes after reboot to target a file of their choice. "nolinks" fundamentally prevents this. Which parts of the tree would be on that "protected" rootfs and which would you mount with that option? Description above is rather vague and I'm not convinced that it actually buys you anything. Details, please... Apologies for the vague description, I'm happy to explain in detail. In case of Chrome OS, we have all binaries on a dm-verity rootfs, so an attacker can't modify any binaries. After reboot, everything except the rootfs is mounted noexec, so there's no way to re-gain code execution after reboot by modifying existing binaries or dropping new ones. We've seen multiple exploits now where the attacker worked around these limitations in two steps: 1. Before reboot, the attacker sets up symlinks on the writeable file system (called "stateful" file system), which are later accessed by legit boot code (such as init scripts) after reboot. For example, an init script that copies file A to B can be abused by an attacker by symlinking or hardlinking B to a location C of their choice, and placing desired data to be written to C in A. That gives the attacker a primitive to write data of their choice to a path of their choice after reboot. Note that this primitive may target locations _outside_ the stateful file system the attacker previously had control of. Particularly of interest are targets on /sys, but also tmpfs on /run etc. 2.
The second step for a successful attack is finding some legit code invoked in the boot flow that has a vulnerability exploitable by feeding it unexpected data. As an example, there are Linux userspace utilities that read config from /run which may contain shell commands that the utility executes, through which the attacker can gain code execution again. The purpose of the proposed patch is to raise the bar for the first step of the attack: writing arbitrary files after reboot. I'm intending to mount the stateful file system with the nolinks option (or otherwise prevent symlink traversal). This will help make sure that any legit writes taking place during boot in init scripts etc. go to the files intended by the developer, and can't be redirected by an attacker. Does this make more sense to you? PS: what the hell do restrictions on _following_ symlinks have to do with _creating_ hardlinks? I'm trying to imagine a threat model where both would apply or anything else beyond the word "link" they would have in common... The restriction is not on _creating_ hard links, but _opening_ hardlinks. The commonality is in the confusion between the file you're meaning to write vs. the file you actually end up writing to, which stems from the fact that as things stand a file can be accessible on other paths than its canonical one. For Chrome OS,
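For context on why a mount-wide option is attractive here rather than relying on per-open flags: the closest existing user-space mitigation, O_NOFOLLOW, only rejects a symlink as the final path component, so a planted symlink earlier in the path is still followed. A small illustration, with made-up paths standing in for the stateful filesystem scenario described above:

/*
 * O_NOFOLLOW only protects the last path component; an attacker-planted
 * symlink in the middle of the path is still traversed.  Paths below are
 * hypothetical.
 */
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Assume /stateful/evil is an attacker-created symlink to /etc. */

	/* Rejected: the final component itself is the symlink. */
	int fd = open("/stateful/evil", O_WRONLY | O_NOFOLLOW);
	if (fd < 0)
		printf("final-component symlink: %s (ELOOP expected)\n",
		       strerror(errno));

	/* NOT rejected: the symlink sits mid-path, so this can end up
	 * opening /etc/passwd despite O_NOFOLLOW. */
	fd = open("/stateful/evil/passwd", O_WRONLY | O_NOFOLLOW);
	if (fd >= 0) {
		printf("intermediate symlink was still followed\n");
		close(fd);
	}
	return 0;
}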
Re: md/raid1: Improve another size determination in setup_conf()
On 2016-10-07 06:50, SF Markus Elfring wrote: Linux has tons of issues, fixes for real problems are very welcome. Is a spectrum of software improvements to reconsider there? But coding style bike shedding is just a waste of time. Why do various software developers bother about coding style specifications at all then? Coding style is important, but patches that just fix coding style are a bad thing because they break things like `git blame` and run the risk of introducing new bugs without any net benefit to end users. This goes double for code you don't actually work on regularly or don't completely understand.
Re: [PATCH V3 00/11] block-throttle: add .high limit
On 2016-10-06 11:05, Paolo Valente wrote: On 6 Oct 2016, at 15:52, Austin S. Hemmelgarn <ahferro...@gmail.com> wrote: On 2016-10-06 08:50, Paolo Valente wrote: On 6 Oct 2016, at 13:57, Austin S. Hemmelgarn <ahferro...@gmail.com> wrote: On 2016-10-06 07:03, Mark Brown wrote: On Thu, Oct 06, 2016 at 10:04:41AM +0200, Linus Walleij wrote: On Tue, Oct 4, 2016 at 9:14 PM, Tejun Heo <t...@kernel.org> wrote: I get that bfq can be a good compromise on most desktop workloads and behave reasonably well for some server workloads with the slice expiration mechanism but it really isn't an IO resource partitioning mechanism. Not just desktops, also Android phones. So why not have BFQ as a separate scheduling policy upstream, alongside CFQ, deadline and noop? Right. We're already doing the per-usecase Kconfig thing for preemption. But maybe somebody already hates that and want to get rid of it, I don't know. Hannes also suggested going back to making BFQ a separate scheduler rather than replacing CFQ earlier, pointing out that it mitigates against the risks of changing CFQ substantially at this point (which seems to be the biggest issue here). ISTR that the original argument for this approach essentially amounted to: 'If it's so much better, why do we need both?'. Such an argument is valid only if the new design is better in all respects (which there isn't sufficient information to decide in this case), or the negative aspects are worth the improvements (which is too workload-specific to decide for something like this). All correct, apart from the workload-specific issue, which is not very clear to me. Over the last five years I have not found a single workload for which CFQ is better than BFQ, and none has been suggested. My point is that whether or not BFQ is better depends on the workload. You can't test for every workload, so you can't say definitively that BFQ is better for every workload. Yes At a minimum, there are workloads where the deadline and noop schedulers are better, but they're very domain-specific workloads. Definitely Based on the numbers from Shaohua, it looks like CFQ has better throughput than BFQ, and that will affect some workloads (for most, the improved fairness is worth the reduced throughput, but there probably are some cases where it isn't). Well, having no fairness, as with deadline and noop, but much less throughput than deadline and noop, doesn't sound much like the best scheduler for those workloads. With BFQ you have service guarantees, with noop or deadline you have maximum throughput. And with CFQ you have something in between, which is half of why I think CFQ is still worth keeping (the other half being the people who inevitably want to stay on CFQ). And TBH, deadline and noop only give good throughput with specific workloads (and in the case of noop, it's usually only useful on tiny systems where the overhead of scheduling is greater than the time saved by doing so (like some very low-power embedded systems), or when you have scheduling done elsewhere in the storage stack (like in a VM)). Anyway, leaving aside this fact, IMO the real problem here is that we are in a catch-22: "we want BFQ to replace CFQ, but, since CFQ is legacy code, then you cannot change, and thus replace, CFQ" I agree that that's part of the issue, but I also don't entirely agree with the reasoning on it.
Until blk-mq has proper I/O scheduling, people will continue to use CFQ, and based on the way things are going, it will be multiple months before that happens, whereas BFQ exists and is working now. Exactly! Thanks, Paolo
Re: [PATCH V3 00/11] block-throttle: add .high limit
On 2016-10-06 08:50, Paolo Valente wrote: Il giorno 06 ott 2016, alle ore 13:57, Austin S. Hemmelgarn <ahferro...@gmail.com> ha scritto: On 2016-10-06 07:03, Mark Brown wrote: On Thu, Oct 06, 2016 at 10:04:41AM +0200, Linus Walleij wrote: On Tue, Oct 4, 2016 at 9:14 PM, Tejun Heo <t...@kernel.org> wrote: I get that bfq can be a good compromise on most desktop workloads and behave reasonably well for some server workloads with the slice expiration mechanism but it really isn't an IO resource partitioning mechanism. Not just desktops, also Android phones. So why not have BFQ as a separate scheduling policy upstream, alongside CFQ, deadline and noop? Right. We're already doing the per-usecase Kconfig thing for preemption. But maybe somebody already hates that and want to get rid of it, I don't know. Hannes also suggested going back to making BFQ a separate scheduler rather than replacing CFQ earlier, pointing out that it mitigates against the risks of changing CFQ substantially at this point (which seems to be the biggest issue here). ISTR that the original argument for this approach essentially amounted to: 'If it's so much better, why do we need both?'. Such an argument is valid only if the new design is better in all respects (which there isn't sufficient information to decide in this case), or the negative aspects are worth the improvements (which is too workload specific to decide for something like this). All correct, apart from the workload-specific issue, which is not very clear to me. Over the last five years I have not found a single workload for which CFQ is better than BFQ, and none has been suggested. My point is that whether or not BFQ is better depends on the workload. You can't test for every workload, so you can't say definitively that BFQ is better for every workload. At a minimum, there are workloads where the deadline and noop schedulers are better, but they're very domain specific workloads. Based on the numbers from Shaohua, it looks like CFQ has better throughput than BFQ, and that will affect some workloads (for most, the improved fairness is worth the reduced throughput, but there probably are some cases where it isn't). Anyway, leaving aside this fact, IMO the real problem here is that we are in a catch-22: "we want BFQ to replace CFQ, but, since CFQ is legacy code, then you cannot change, and thus replace, CFQ" I agree that that's part of the issue, but I also don't entirely agree with the reasoning on it. Until blk-mq has proper I/O scheduling, people will continue to use CFQ, and based on the way things are going, it will be multiple months before that happens, whereas BFQ exists and is working now.
Re: [PATCH V3 00/11] block-throttle: add .high limit
On 2016-10-06 07:03, Mark Brown wrote: On Thu, Oct 06, 2016 at 10:04:41AM +0200, Linus Walleij wrote: On Tue, Oct 4, 2016 at 9:14 PM, Tejun Heo wrote: I get that bfq can be a good compromise on most desktop workloads and behave reasonably well for some server workloads with the slice expiration mechanism but it really isn't an IO resource partitioning mechanism. Not just desktops, also Android phones. So why not have BFQ as a separate scheduling policy upstream, alongside CFQ, deadline and noop? Right. We're already doing the per-usecase Kconfig thing for preemption. But maybe somebody already hates that and want to get rid of it, I don't know. Hannes also suggested going back to making BFQ a separate scheduler rather than replacing CFQ earlier, pointing out that it mitigates against the risks of changing CFQ substantially at this point (which seems to be the biggest issue here). ISTR that the original argument for this approach essentially amounted to: 'If it's so much better, why do we need both?'. Such an argument is valid only if the new design is better in all respects (which there isn't sufficient information to decide in this case), or the negative aspects are worth the improvements (which is too workload specific to decide for something like this).
Re: TRIM/UNMAP/DISCARD via ATA Passthrough
On 2016-09-17 01:14, James Bottomley wrote: On Fri, 2016-09-16 at 13:06 -0400, Austin S. Hemmelgarn wrote: On 2016-09-16 12:21, James Bottomley wrote: On Fri, 2016-09-16 at 11:53 -0400, Austin S. Hemmelgarn wrote: On 2016-09-16 07:16, Hannes Reinecke wrote: On 09/15/2016 10:52 PM, Jason A. Donenfeld wrote: Hi Martin, On Thu, Sep 15, 2016 at 6:07 PM, Martin K. Petersen But how do they signal that ATA passthrough is possible? Is there an ATA Information VPD page? Is REPORT SUPPORTED OPERATION CODES supported? We need really solid discovery data before we can entertain enabling something like this. `sg_opcodes` said invalid request, so I think there isn't REPORT SUPPORTED OPERATION CODES, and `sg_vpd -p ai` came up illegal too. However, sg_sat_identify worked reliably, which means a solid way of probing this would be to send IDENTIFY DEVICE ATA via SG_ATA_16 or SG_ATA_12. Let me know and I can give you access to the hardware if you're curious. Sadly, that's not sufficient. linux is not the only provider of an SATL (mpt3sas being the most prominent other one). And while they might support ATA_12/ATA_16, there is no indication that you can pass DSM TRIM that way. So it's better to not support it at all than to support it on hardware we can reliably identify? I get that having feature parity is a good thing, but the discussion isn't about providing support for all SATL devices, it's specifically about UAS connected SATL devices. Last I checked, mpt3sas doesn't do anything with UAS, which means it's kind of irrelevant WRT supporting this for UAS devices. We're getting a bit off topic on mptsas and it's eccentric SATL. The point is, you're asking for UAS devices which each have an internal SATL which you say potentially doesn't support discard. The three problems we have are 1. How do we identify if the UAS SATL doesn't support discard. If it does, we really don't want to cause further SATL related issues by bypassing it, so we need a way of telling this. 2. If the SATL doesn't support discard, will it reliably support the ATA_12 or ATA_16 pass through (and which one) .. we need a way of checking this because there are known SATLs that don't do pass through. 3. How do we actually configure it? Presumably if the SATL doesn't support discard, it also doesn't give us the useful mode page indications we use to configure TRIM, so we're going to have to do some pass through discovery as well. I assume by 'discard' here you're referring to SCSI UNMAP, as anything that supports ATA_12 or ATA_16 pass through correctly will support ATA TRIM/DISCARD on drives that support it. discard is the block layer terminology it's mapped per transport to UNMAP or WRITE SAME on SCSI and TRIM on ATA. I actually didn't know this. I'm not quite as knowledgeable about the block layer as I probably should be, and definitely not as up-to-date as I could be on the ATA and SCSI specs. If that's the case, then: 1. If SCSI UNMAP fails, it doesn't support UNMAP. This is of course non-trivial to verify safely (we pretty much have to assume it is supported if we have no clear indication it isn't, and then switch based on what happens the first time we try to use it). It's not quite that simple: to get us to configure discard in the first place, you have to indicate support in READ CAPACITY (16): the LBPME bit. The chances are your UAS SATL isn't setting this. OK, that makes sense. Given that though, is it known how something like that may react if you tried to issue an UNMAP or WRITE SAME command when it's not supported? 2. 
Unless there are SATL's out there that write garbage to the device or die when sent an ATA_12 or ATA_16 pass through command Yes, there are; the problems with USB devices that fail to speak standard versions of SCSI are legion. encapsulating an ATA DEVICE IDENTIFY command, this isn't an issue. Even if such SATL's exist, they can easily be blacklisted. 3. This isn't hard, a SATL which actually supports ATA pass through will almost always pass through the mode page unmodified. You mean the ATA Information VPD page? Yes, that's feasible because we already queried the supported VPD pages, so we can tell if this one's there. I kind of got my terminology confused here, and didn't proof-read properly. I'm not sure exactly what I was trying to refer to originally, but what I meant was that pretty much all UAS SATL's I've seen that support ATA pass through either have a proper ATA Information VPD page, or properly pass through ATA DEVICE IDENTIFY and related commands. On the note of UAS SATL's, all of them that I've seen fall into one of four categories: 1. Supports one or both of ATA_12 or ATA_16 pass through, and supports passing through ATA TRIM/DISCARD, but not SCSI UNMAP. 2. Supports one of ATA_12 or ATA_16 pass through, and does not support passing through ATA TRIM/DISCARD or translating SCSI UNMAP. All
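The discovery questions raised here map onto a handful of sg3_utils commands, so the probing can be reproduced by hand. A rough sketch, with /dev/sdb standing in for a UAS-attached disk; output differs per bridge, and on the SATLs under discussion several of these are expected to fail:
$ sg_readcap --long /dev/sdb     # does READ CAPACITY(16) report the LBPME bit?
$ sg_vpd --page=ai /dev/sdb      # is the ATA Information VPD page present?
$ sg_opcodes /dev/sdb            # is REPORT SUPPORTED OPERATION CODES implemented?
$ sg_sat_identify /dev/sdb       # does ATA IDENTIFY DEVICE get through via ATA pass-through?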
Re: TRIM/UNMAP/DISCARD via ATA Passthrough
On 2016-09-16 12:21, James Bottomley wrote: On Fri, 2016-09-16 at 11:53 -0400, Austin S. Hemmelgarn wrote: On 2016-09-16 07:16, Hannes Reinecke wrote: On 09/15/2016 10:52 PM, Jason A. Donenfeld wrote: Hi Martin, On Thu, Sep 15, 2016 at 6:07 PM, Martin K. Petersen But how do they signal that ATA passthrough is possible? Is there an ATA Information VPD page? Is REPORT SUPPORTED OPERATION CODES supported? We need really solid discovery data before we can entertain enabling something like this. `sg_opcodes` said invalid request, so I think there isn't REPORT SUPPORTED OPERATION CODES, and `sg_vpd -p ai` came up illegal too. However, sg_sat_identify worked reliably, which means a solid way of probing this would be to send IDENTIFY DEVICE ATA via SG_ATA_16 or SG_ATA_12. Let me know and I can give you access to the hardware if you're curious. Sadly, that's not sufficient. linux is not the only provider of an SATL (mpt3sas being the most prominent other one). And while they might support ATA_12/ATA_16, there is no indication that you can pass DSM TRIM that way. So it's better to not support it at all than to support it on hardware we can reliably identify? I get that having feature parity is a good thing, but the discussion isn't about providing support for all SATL devices, it's specifically about UAS connected SATL devices. Last I checked, mpt3sas doesn't do anything with UAS, which means it's kind of irrelevant WRT supporting this for UAS devices. We're getting a bit off topic on mptsas and it's eccentric SATL. The point is, you're asking for UAS devices which each have an internal SATL which you say potentially doesn't support discard. The three problems we have are 1. How do we identify if the UAS SATL doesn't support discard. If it does, we really don't want to cause further SATL related issues by bypassing it, so we need a way of telling this. 2. If the SATL doesn't support discard, will it reliably support the ATA_12 or ATA_16 pass through (and which one) .. we need a way of checking this because there are known SATLs that don't do pass through. 3. How do we actually configure it? Presumably if the SATL doesn't support discard, it also doesn't give us the useful mode page indications we use to configure TRIM, so we're going to have to do some pass through discovery as well. I assume by 'discard' here you're referring to SCSI UNMAP, as anything that supports ATA_12 or ATA_16 pass through correctly will support ATA TRIM/DISCARD on drives that support it. If that's the case, then: 1. If SCSI UNMAP fails, it doesn't support UNMAP. This is of course non-trivial to verify safely (we pretty much have to assume it is supported if we have no clear indication it isn't, and then switch based on what happens the first time we try to use it). 2. Unless there are SATL's out there that write garbage to the device or die when sent an ATA_12 or ATA_16 pass through command encapsulating an ATA DEVICE IDENTIFY command, this isn't an issue. Even if such SATL's exist, they can easily be blacklisted. 3. This isn't hard, a SATL which actually supports ATA pass through will almost always pass through the mode page unmodified. On the note of UAS SATL's, all of them that I've seen fall into one of four categories: 1. Supports one or both of ATA_12 or ATA_16 pass through, and supports passing through ATA TRIM/DISCARD, but not SCSI UNMAP. 2. Supports one of ATA_12 or ATA_16 pass through, and does not support passing through ATA TRIM/DISCARD or translating SCSI UNMAP. 
All devices I've seen that fit this will modify the ATA DEVICE IDENTIFY data so it doesn't report DISCARD support, or will simply return an error for DISCARD requests. I haven't seen any like this that were manufactured after UAS became standardized. 3. Supports neither ATA_12 nor ATA_16 pass through, and doesn't support UNMAP. 4. Like type 1, except it supports both pass through commands, and also properly translates SCSI UNMAP commands (I've only ever seen one of these, have no idea what chipset it had, and it was insanely expensive (upwards of 300 USD)). All we really can do anything about is category 1. Category 4 works with the current drivers, and we can't properly support it on category 2 or 3. All three devices I have right now are in category 1, I know a number of other people in a similar situation, and it sounds like Jason has at least one such device as well. Given that Windows does this (I've confirmed this with a hardware USB analyzer I borrowed from a friend), and that I've not seen anything since the UAS spec was actually released that falls into category 2 (and if I understand the spec correctly, such a device is actually not compliant with it anyway), I think it's probably safe to do this in Linux and just base the check on: 1. UAS (not some other SCSI transport) without UNMAP support. 2. Supports ATA_12 or ATA_16 pass through. 3. ATA
Re: TRIM/UNMAP/DISCARD via ATA Passthrough
On 2016-09-16 07:16, Hannes Reinecke wrote: On 09/15/2016 10:52 PM, Jason A. Donenfeld wrote: Hi Martin, On Thu, Sep 15, 2016 at 6:07 PM, Martin K. Petersen But how do they signal that ATA passthrough is possible? Is there an ATA Information VPD page? Is REPORT SUPPORTED OPERATION CODES supported? We need really solid discovery data before we can entertain enabling something like this. `sg_opcodes` said invalid request, so I think there isn't REPORT SUPPORTED OPERATION CODES, and `sg_vpd -p ai` came up illegal too. However, sg_sat_identify worked reliably, which means a solid way of probing this would be to send IDENTIFY DEVICE ATA via SG_ATA_16 or SG_ATA_12. Let me know and I can give you access to the hardware if you're curious. Sadly, that's not sufficient. linux is not the only provider of an SATL (mpt3sas being the most prominent other one). And while they might support ATA_12/ATA_16, there is no indication that you can pass DSM TRIM that way. So it's better to not support it at all than to support it on hardware we can reliably identify? I get that having feature parity is a good thing, but the discussion isn't about providing support for all SATL devices, it's specifically about UAS connected SATL devices. Last I checked, mpt3sas doesn't do anything with UAS, which means it's kind of irrelevant WRT supporting this for UAS devices. It's pretty easy to tell that something is a UAS device (the uas driver wouldn't be bound to it otherwise), so if we check that and then check whether or not IDENTIFY DEVICE ATA works when sent via SG_ATA_16 or SG_ATA_12, it should be relatively safe (ignoring of course the fact that there will inevitably be some brain-dead hardware that for some obscure reason translates the command into something that will corrupt data). I've got three USB 3.0 UAS SATA adapters (all ASMedia branded chips) that behave pretty much identically to what Jason is describing, so it appears that at least one brand behaves this way in a reliable and reproducible manner.
Re: [PATCH] logfs: remove from tree
On 2016-09-12 02:55, Artem Bityutskiy wrote: On Sun, 2016-09-11 at 15:04 +0200, Christoph Hellwig wrote: Logfs was introduced to the kernel in 2009, and hasn't seen any non drive-by changes since 2012, while having lots of unsolved issues including the complete lack of error handling, with more and more issues popping up without any fixes. The logfs.org domain has been bouncing mail, and the maintainer on the non-logfs.org domain hasn't responded to past queries either. Signed-off-by: Christoph Hellwig Back in 2008 logfs and UBIFS were in sort of competing projects. I remember we inspected logfs code and tested it - we did not find proper wear-levelling and bad block handling, we did not see proper error handling, and it exploded when we were running relatively simple tests. We indicated this here in a very humble way to avoid the "conflict of interest" perception: https://lkml.org/lkml/2008/3/31/117 I did not follow logfs since then, but I think there wasn't much development since then and all these issues are still there. I mean, unless I am horribly mistaken, logfs does not really have the basic features of a flash file system and there is no point keeping it in the tree and consuming people's time maintaining it. FWIW, I tried testing it about a year ago, and got similar results both from the tests and from trying to contact the maintainer.
Re: [Documentation] State of CPU controller in cgroup v2
On 2016-09-09 18:57, Tejun Heo wrote: Hello, again. On Mon, Sep 05, 2016 at 10:37:55AM -0700, Andy Lutomirski wrote: * It doesn't bring any practical benefits in terms of capability. Userland can trivially handle the system-root and namespace-roots in a symmetrical manner. Your idea of "trivially" doesn't match mine. You gave a use case in I suppose I wasn't clear enough. It is trivial in the sense that if the userland implements something which works for namespace-root, it would work the same in system-root without further modifications. which userspace might take advantage of root being special. If I was emphasizing the cases where userspace would have to deal with the inherent differences, and, when they don't, they can behave exactly the same way. userspace does that, then that userspace cannot be run in a container. This could be a problem for real users. Sure, "don't do that" is a *valid* answer, but it's not a very helpful answer. Great, now we agree that what's currently implemented is valid. I think you're still failing to recognize the inherent specialness of the system-root and how much unnecessary pain the removal of the exemption would cause at virtually no practical gain. I won't repeat the same backing points here. * It's an unncessary inconvenience, especially for cases where the cgroup agent isn't in control of boot, for partial usage cases, or just for playing with it. You say that I'm ignoring the same use case for namespace-scope but namespace-roots don't have the same hybrid function for partial and uncontrolled systems, so it's not clear why there even NEEDS to be strict symmetry. I think their functions are much closer than you think they are. I want a whole Linux distro to be able to run in a container. This means that useful things people do in a distro or initramfs or whatever should just work if containerized. There isn't much which is getting in the way of doing that. Again, something which follows no-internal-task rule would behave the same no matter where it is. The system-root is different in that it is exempt from the rule and thus is more flexible but that difference is serving the purpose of handling the inherent specialness of the system-root. AFAICS, it is the solution which causes the least amount of contortion and unnecessary inconvenience to userland. It's easy and understandable to get hangups on asymmetries or exemptions like this, but they also often are acceptable trade-offs. It's really frustrating to see you first getting hung up on "this must be wrong" and even after explanations repeating the same thing just in different ways. If there is something fundamentally wrong with it, sure, let's fix it, but what's actually broken? I'm not saying it's fundamentally wrong. I'm saying it's a design You were. that has a big wart, and that wart is unfortunate, and after thinking a bit, I'm starting to agree with PeterZ that this is problematic. It also seems fixable: the constraint could be relaxed. You've been pushing for enforcing the restriction on the system-root too and now are jumping to the opposite end. It's really frustrating that this is such a whack-a-mole game where you throw ideas without really thinking through them and only concede the bare minimum when all other logical avenues are closed off. Here, again, you seem to be stating a strong opinion when you haven't fully thought about it or tried to understand the reasons behind it. 
But, whatever, let's go there: Given the arguments that I laid out for the no-internal-tasks rule, how does the problem seem fixable through relaxing the constraint? Also, here's an idea to maybe make PeterZ happier: relax the restriction a bit per-controller. Currently (except for /), if you have subtree control enabled you can't have any processes in the cgroup. Could you change this so it only applies to certain controllers? If the cpu controller is entirely happy to have processes and cgroups as siblings, then maybe a cgroup with only cpu subtree control enabled could allow processes to exist. The document lists several reasons for not doing this and also that there is no known real world use case for such configuration. So, up until this point, we were talking about no-internal-tasks constraint. Isn't this the same thing? IIUC the constraint in question is that, if a non-root cgroup has subtree control on, then it can't have processes in it. This is the no-internal-tasks constraint, right? Yes, that is what no-internal-tasks rule is but I don't understand how that is the same thing as process granularity. Am I completely misunderstanding what you are trying to say here? And I still think that, at least for cpu, nothing at all goes wrong if you allow processes to exist in cgroups that have cpu set in subtree-control. If you confine it to the cpu controller, ignore anonymous consumptions, the rather ugly mapping between nice and weight values and the
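For the no-internal-tasks rule itself, here is a minimal sketch of what the constraint looks like from userspace, assuming a cgroup2 hierarchy mounted at /sys/fs/cgroup on a kernel where the cpu controller is available there and already listed in the root's cgroup.subtree_control; the cgroup name is a placeholder:
$ mkdir /sys/fs/cgroup/demo
$ echo $$ > /sys/fs/cgroup/demo/cgroup.procs              # put this shell into the cgroup
$ echo +cpu > /sys/fs/cgroup/demo/cgroup.subtree_control
# fails (typically EBUSY): a populated non-root cgroup cannot also distribute resources to children
$ echo $$ > /sys/fs/cgroup/cgroup.procs                   # move the shell back to the root
$ echo +cpu > /sys/fs/cgroup/demo/cgroup.subtree_control  # now permitted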
Re: bcache vs bcachefs
On 2016-09-06 20:55, Kent Overstreet wrote: On Tue, Sep 06, 2016 at 11:46:28AM +0200, Harald Dunkel wrote: Hi folks, I am pretty hesitant replacing the rock-solid ext4 by bcachefs on my servers. Meaning no offense, but surely I would prefer to have ext4 with a thin "SSD caching layer" over a completely different filesystem, potentially with a lot of teething troubles. Question: Is bcache EOL or can I rely on it for the next 5 to 10 years? bcache is not EOL - it's still receiving bugfixes. That said though, there's no reason to expect a long teething period with bcachefs, it's already more reliable than btrfs in single device mode. I'd be curious to see any actual data you have to back that up, especially regarding what kernel and userspace were involved with the BTRFS testing.
Re: Who reordered my disks (probably v4.8-rc1 problem)
On 2016-08-13 15:30, Pavel Machek wrote: Hi! It seems that in v4.8-rc0, /dev/sdX got reordered, and now USB devices are probed before SATA drivers. That is pretty anti-social. It broke my boot on my primary machine, and unfortunately due to BIOS problems (keyboard does not work when connected through a hub) it is less fun than it should be. The simple solution (which I'm surprised nobody else has suggested yet) is to build everything but the driver for your root device as a module. This will prevent anything else from getting enumerated first, and works regardless of whether or not you're using an initrd and independently of what root= format you use.
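From the build-configuration side, that suggestion can be applied with the in-tree scripts/config helper; a sketch assuming an AHCI-attached SATA root disk, with the symbol names being the usual ones for that setup rather than anything taken from Pavel's actual config:
$ cd ~/src/linux                          # kernel source tree; the path is a placeholder
$ scripts/config --enable SATA_AHCI       # keep the root disk's driver built in
$ scripts/config --module USB_STORAGE     # USB mass storage as a module
$ scripts/config --module USB_UAS         # UAS as a module
$ make olddefconfig
# With only the root disk's driver built in, nothing else can claim /dev/sda before it does.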
Re: [PATCH 0002/1285] Replace numeric parameter like 0444 with macro
On 2016-08-02 06:33, Baole Ni wrote: I find that the developers often just specified the numeric value when calling a macro which is defined with a parameter for access permission. As we know, these numeric value for access permission have had the corresponding macro, and that using macro can improve the robustness and readability of the code, thus, I suggest replacing the numeric parameter with the macro. Have you taken the time to consider _why_ this is common practice? In most cases, people who have been using UNIX-like systems for any reasonable period of time can mentally parse the octal permissions almost instantly, while it often takes them a while to figure out the macros. So, overall, this change actually hurts readability for a large number of people. Additionally, the argument of robustness is somewhat bogus too, the internal representation of POSIX DAC permissions is not going to change any time soon, if at all, so abstracting the values away isn't really improving robustness considering that what it's supposed to protect against won't ever happen. Beyond that, there are also a number of other issues with the set as a whole: 1. In many places, you haven't properly re-wrapped lines at the 80 column limit. 2. In many places, you haven't condensed macros like you should (for example, 0644 should be (S_IWUSR | S_IRUGO), not (S_IWUSR | S_IRUSR | S_IRGRP | S_IROTH), and this would help with the line length limit too). 3. You're missing subsystem tags in your patch subjects. 4. Patches shouldn't be per-file for changes like this, they should be per-subsystem. Dropping this many patches (you've sent more patches in a few hours than the whole list sees in a day on average) is a very good way to get people to ignore you.
Re: [PATCH 00/14] Present useful limits to user (v2)
On 2016-07-15 16:54, H. Peter Anvin wrote: On July 15, 2016 6:59:56 AM PDT, Peter Zijlstra wrote: On Fri, Jul 15, 2016 at 01:52:48PM +, Topi Miettinen wrote: On 07/15/16 12:43, Peter Zijlstra wrote: On Fri, Jul 15, 2016 at 01:35:47PM +0300, Topi Miettinen wrote: Hello, There are many basic ways to control processes, including capabilities, cgroups and resource limits. However, there are far fewer ways to find out useful values for the limits, except blind trial and error. This patch series attempts to fix that by giving at least a nice starting point from the highwater mark values of the resources in question. I looked where each limit is checked and added a call to update the mark nearby. And how is that useful? Setting things to the high watermark is basically the same as not setting the limit at all. What else would you use, too small limits? That question doesn't make sense. What's the point of setting a limit if it ends up being the same as no-limit (aka unlimited). If you cannot explain; and you have not so far; what use these values are, why would we look at the patches. One reason is to catch a malfunctioning process rather than dragging the whole system down with it. It could also be useful for development. Additionally, there are quite a few applications which don't gracefully handle memory allocation or process creation failures, either hanging, constantly retrying, or just dying when this happens. For such an application, you have to set the limit to the high watermark if you want them limited at all, otherwise they don't work. A classic example of this is the official client for Dropbox. If it can't start up all the insane number of threads it thinks it needs, then it just hangs. However, it's also a network service, and therefore is a reasonable target for hackers, so it makes sense to try and limit it. I've run into similar issues with quite a few 'desktop' services, both open and closed source. Looking at this another way, this is most useful for things that have a deterministic maximum resource usage under regular use, not something like a forking server which has a functionally unbounded maximum resource usage.
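The kind of per-process cap being discussed can be set from userspace with util-linux prlimit(1); a sketch where the PID and the numbers are made-up placeholders, the idea being that in practice they would come from an observed high watermark plus some headroom:
$ prlimit --pid 1234                      # show the limits currently in effect
$ prlimit --pid 1234 --nproc=512:1024     # processes/threads, soft:hard
$ prlimit --pid 1234 --nofile=1024:4096   # open file descriptors
$ prlimit --pid 1234 --as=2147483648      # address space, in bytes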
Re: [PATCH v4 0/5] /dev/random - a new approach
On 2016-06-22 01:16, Stephan Mueller wrote: Am Dienstag, 21. Juni 2016, 15:31:07 schrieb Austin S. Hemmelgarn: Hi Austin, Little data, interesting statement for results on 200+ systems including all major CPU arches all showing information leading in the same directions. Let me try rephrasing this to make it a bit clearer: 1. You have lots of data on server systems. 2. You have a significant amount of data on desktop/workstation type systems. 3. You have very little data on embedded systems. and here are your arguments: A. This works well on server systems. B. This works well on desktop systems. C. This works well on embedded systems. Arguments A and B are substantiated directly by points 1 and 2. Argument C is not substantiated thoroughly because of point 3. My complaint is about argument C given point 3. Then let me rephrase what I try to say: my RNG rests on the intrinsic functionality of CPUs. When I show that such intrinsic behavior is present in various architectures I show that there is a common ground for the basis of the RNG. You're not demonstrating it's inherently present in the architecture, your demonstrating that it's present for those particular micro-architectures you have tested. You're dealing with effects resulting from such low-level details of the CPU that you can't assume it happens for the whole architecture. In fact, your own results regarding the weak values from Pentium Celeron Mobile system reinforce that it's not an architectural effect at the ISA level, but results from low level designs. Given the naming of that particular CPU, it's almost certainly a P6 or NetBurst micro-arch, neither of which had particularly complex branch-prediction or pipelining or similar things, and more significantly, probably did not have HyperThreading, which I'd be willing to bet is a significant contributing factor on the Xeon and Core processors you've got results for. I tested on all CPUs of all large scale architectures (including the architectures that are commonly used for embedded devices) and demonstrated that the fundamental phenomenon the RNG rests on is present in all architectures. In all architectures you've tested. Note that technically, from a low-level perspective of something like this, different revisions of an ISA are different architectures, and when you start talking about licensed IP cores like ARM and PPC (and MIPS, and SPARC), different manufacturer's designs almost count separately. You're relying on complexities inherent in the CPU itself, which will be different between micro-architectures, and possibly even individual revisions of a specific model number. Just because the test gives good results on an ARMv6 or ARMv7 does not mean it will on an ARMv4, because there are enough differences in typical designs of ARM CPU's that you can't directly draw conclusions based on such a small sample size (you've tested at most 4 manufacturer's designs, and even then only one from each, and there are about 10 different companies making ARM chips, each selling dozens of slightly different CPU's). I do not care about the form factor of the test system server, desktop or embedded systems nor do I care about the number of attached devices -- the form factor and number of attached devices is the differentiator of what you call embedded vs server vs desktop. I don't care about form factor, I care about the CPU, and embedded systems almost always have simpler CPU designs (not including all the peripherals they love to add in on-die). 
Your own analysis indicates that you're getting entropy from the complex interaction of the different parts of the CPU. Such complexities are less present in simpler CPU designs. Heck, I have written a test that executes the RNG on bare metal (without OS and with only a keyboard as device present -- i.e. no interrupts are received apart from a keyboard), which demonstrates that the phenomenon is present. Furthermore, chapter 6 of my document analyzes the root cause of the RNG and here you see clearly that it has nothing to do with the size of the CPU or its attached devices or the size of RAM. And yet averages are higher for systems that have more CPU cores, and thus more complex CPU's. Prime examples of this are the UltraSPARC CPU's you've tested. Those have more SMP cores (and threads of execution) than just about anything else you've listed, and they have significantly higher values than almost anything else you list. The massive number of x86 tests shall demonstrate the common theme I see: the newer the CPU, the larger the phenomenon the RNG rests on. Except each iteration of x86 grows more complex branch-prediction and pipelining and other tricks to try and make the CPU process data faster. That is the source of almost all of the increase you're seeing in entropy for newer revisions. The same is not inherently true of embedded processors, especially ones designed for hard-real-time usage.
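To make the measurement both sides are arguing about concrete, here is a minimal sketch (my own illustration, not the actual Jitter RNG code; the workload and iteration counts are arbitrary) of timing a fixed piece of work with clock_gettime() and watching how the deltas wander. How much of that wander is attributable to micro-architectural complexity on a given CPU is exactly the point in dispute:

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
	volatile uint32_t sink = 0;
	uint64_t prev = now_ns();

	for (int i = 0; i < 16; i++) {
		/* Fixed workload; its execution time still varies run to run. */
		for (int j = 0; j < 1000; j++)
			sink += (uint32_t)j * j;

		uint64_t cur = now_ns();
		printf("delta[%2d] = %llu ns\n", i, (unsigned long long)(cur - prev));
		prev = cur;
	}
	return 0;
}

The interesting question is not whether these deltas vary, but how many bits of that variation survive on a simpler pipeline or under an adversarial host.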
Re: [PATCH v4 0/5] /dev/random - a new approach
On 2016-06-21 14:04, Stephan Mueller wrote: On Tuesday, 21 June 2016, 13:51:15, Austin S. Hemmelgarn wrote: 6. You have a significant lack of data regarding embedded systems, which is one of the two biggest segments of Linux's market share. You list no results for any pre-ARMv6 systems (Linux still runs on and is regularly used on ARMv4 CPU's, and it's worth also pointing out that the values on the ARMv6 systems are themselves below average), any MIPS systems other than 24k and 4k (which is not a good representation of modern embedded usage), any SPARC CPU's other than UltraSPARC (ideally you should have results on at least a couple of LEON systems as well), no tight-embedded PPC chips (PPC 440 processors are very widely used, as are the 7xx and 970 families, and Freescale's e series), and only one set of results for a tight-embedded x86 CPU (the Via Nano, you should ideally also have results on things like an Intel Quark). Overall, your test system selection is not entirely representative of actual Linux usage (yeah, there's a lot of x86 servers out there running Linux, but there's at least as many embedded systems running it too, even without including Android). Perfectly valid argument. But I programmed that RNG as a hobby -- I do not have the funds to buy all devices there are. I'm not complaining as much about the lack of data for such devices as I am about you stating that it will work fine for such devices when you have so little data to support those claims. Many of the devices you Little data, interesting statement for results on 200+ systems including all major CPU arches all showing information leading in the same directions. Let me try rephrasing this to make it a bit clearer: 1. You have lots of data on server systems. 2. You have a significant amount of data on desktop/workstation type systems. 3. You have very little data on embedded systems. and here are your arguments: A. This works well on server systems. B. This works well on desktop systems. C. This works well on embedded systems. Arguments A and B are substantiated directly by points 1 and 2. Argument C is not substantiated thoroughly because of point 3. My complaint is about argument C given point 3. I'm not saying you have insufficient data to support argument A or B, only that you have insufficient data to support argument C. Android barely counts as an embedded system anymore, as many Android phones can outperform most inexpensive desktop and laptop systems, and even some rather expensive laptops. This leaves the only systems that can be assumed without further information to be representative of embedded boards to be the ones running Genode, and possibly the MIPS systems, which is a total of about 10 results out of hundreds for servers and desktops/workstations.
Re: [PATCH v4 0/5] /dev/random - a new approach
On 2016-06-21 09:19, Tomas Mraz wrote: On Tue, 2016-06-21 at 09:05 -0400, Austin S. Hemmelgarn wrote: On 2016-06-20 14:32, Stephan Mueller wrote: [1] http://www.chronox.de/jent/doc/CPU-Jitter-NPTRNG.pdf Specific things I notice about this: 1. QEMU systems are reporting higher values than almost anything else with the same ISA. This makes sense, but you don't appear to have accounted for the fact that you can't trust almost any of the entropy in a VM unless you have absolute trust in the host system, because the host system can do whatever the hell it wants to you, including manipulating timings directly (with a little patience and some time spent working on it, you could probably get those numbers to show whatever you want just by manipulating scheduling parameters on the host OS for the VM software). You have to trust the host for anything, not just for the entropy in timings. This is a completely invalid argument unless you can present a method by which one guest can manipulate timings in another guest in such a way that _removes_ the inherent entropy from the host. When dealing with almost any type 2 hypervisor, it is fully possible for a user other than the one running the hypervisor to manipulate scheduling such that entropy is reduced. This does not imply that the user who is doing this has any other control over the target VM, and importantly, often does not require administrative access on the host, only regular user access. Such an attack is very difficult to effect outside of a clean-room environment, but is still possible. You can't use this to force generation of arbitrary data, but you can definitely starve a VM for entropy. By nature, something that relies on interrupt timings will be more impacted by such an attack than something that does not. In most cases, such an attack will be a DoS attack on the host as well (as that's the simplest way to do this). This is less of an issue with proper practices on a type 1 hypervisor, but is still possible there too (although pulling this off on at least Xen when you have proper VCPU isolation is functionally impossible without administrative access to the control domain).
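As a rough illustration of the kind of host-side interference being described, nothing more exotic is needed than an unprivileged host process competing for the same core as the hypervisor's VCPU thread. This is illustrative only; the CPU number below is an assumption, and whether this measurably flattens the timing distribution a guest sees (rather than merely slowing it down) is the point that would need to be demonstrated either way:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);	/* assume the VCPU thread of interest runs on CPU 0 */

	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		return 1;
	}

	/* Burn cycles on that core; no privileges or access to the guest needed. */
	for (;;)
		;
}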
Re: [PATCH v5 0/7] /dev/random - a new approach
On 2016-06-21 12:28, Stephan Mueller wrote: On Tuesday, 21 June 2016, 12:03:56, Austin S. Hemmelgarn wrote: Hi Austin, On 2016-06-21 03:32, Stephan Mueller wrote: On Tuesday, 21 June 2016, 09:12:07, Nikos Mavrogiannopoulos wrote: Hi Nikos, On Mon, Jun 20, 2016 at 5:43 PM, Stephan Mueller <smuel...@chronox.de> wrote: Personally, I don't really use /dev/random, nor would I recommend it for most application programmers. At this point, getrandom(2) really is the preferred interface unless you have some very specialized needs. I fully agree. But there are use cases for /dev/random, notably as a seed source for other DRNGs. Is that really the case? I believe all DRNG's use /dev/urandom anyway for seeding since they cannot afford indeterminate blocking. It would be a gain for everyone if /dev/random was the same as /dev/urandom in Linux. For standard approaches, this is true. But there are regulations, notably in the German realm, that /dev/random shall be used, at least partially (see AIS 20/31). Which just goes to show how utterly stupid some people who write laws and regulations are. Saying specifically that '/dev/random shall be used' does not enforce any improvement of entropic value in the data at all, it just coincidentally improves the theoretical quality of the data because of how Linux and some legacy UNIX systems are designed. This 'regulation' already provides zero benefit other than a placebo effect on at least OpenBSD, FreeBSD, and I'm pretty certain most other BSD derivatives, as /dev/random and /dev/urandom point to the same thing there (on OpenBSD it's an arcfour-based DRBG, FreeBSD does similar but uses a CSPRNG called Fortuna), and while I'm not certain, I believe AIX does likewise (although they use a design based on Yarrow). It is NOT utterly stupid, you should listen to the reasons. I'm not commenting about them wanting cryptographically secure entropy. My statement about the intelligence of the people who wrote the standard was commenting on their saying '/dev/random shall be used' instead of 'a cryptographically secure entropy source shall be used'. POSIX requires that the first statement imply the second, but the second does not require the first. Their choice to use the first indicates that they assume this will be the only implementation that will ever matter, and that POSIX systems will be the only thing anyone using the standard cares about, both of which are inherently poor assumptions outside of a completely closed system. Most standards proposed or mandated by bureaucracies have similar issues making bad assumptions about the context in which they will be used. The core reason is explained with the following analogy: /dev/urandom, for a short period of time, operates as a DRNG. For the sake of discussion, let us assume that it is a DRNG seeded with 256 bits of entropy. Now, you pull 256 bits out of it, you have 256 bits of entropy. You pull another 256 bits out of it -- what entropy do you have? You do NOT have *again* 256 bits of entropy. All you can say is that the entire generated 512 bits have collectively 256 bits of entropy. And so on and so forth. Now, if you want to say that your newly spawned DRNG is seeded with X amount of entropy, the DRNG nature of /dev/urandom makes such a statement a challenge. The easy way out of the challenge is to use /dev/random (I am aware of the fact that the DRNG has a computational strength, but it is not, well, entropy).
In addition, when using /dev/urandom, you have to show that at the time you seed the DRNG, it is fully seeded. That is a challenge at boot time - a challenge this entire thread revolves around. Yes, getrandom(2) is the way out, but not everybody uses it. Again, /dev/random does NOT have this challenge. The fact that that's how /dev/urandom works is a result of the implementation, I already listed at least 3 UNIX derivatives that do not work this way and provide cryptographically secure entropy from the same source stream for both /dev/random and /dev/urandom. Changing this on Linux would not break backwards compatibility as long as we provide sufficient entropy via /dev/random, because nobody depends on it blocking for anything other than ensuring they get good entropy, so if it always returns good cryptographically secure entropy, we never need to block unless the generator hasn't yet been seeded. Now, on top of that, there's still no absolute guarantee that what you get from /dev/random is actually cryptographically secure, but that's a separate issue.
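For reference, here is a minimal sketch of the getrandom(2) path mentioned above, which sidesteps the /dev/random versus /dev/urandom seeding question: with flags == 0 the call blocks only until the kernel's pool has been initialized once, and never again after that. The drng_init() consumer is hypothetical:

#include <stdio.h>
#include <string.h>
#include <sys/random.h>

int main(void)
{
	unsigned char seed[32];	/* 256-bit seed for a hypothetical application DRNG */

	if (getrandom(seed, sizeof(seed), 0) != (ssize_t)sizeof(seed)) {
		perror("getrandom");
		return 1;
	}

	/* drng_init(seed, sizeof(seed)); -- hypothetical consumer of the seed */
	memset(seed, 0, sizeof(seed));	/* best-effort scrub of the seed buffer */
	return 0;
}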
Re: [PATCH v4 0/5] /dev/random - a new approach
On 2016-06-21 13:23, Stephan Mueller wrote: On Tuesday, 21 June 2016, 13:18:33, Austin S. Hemmelgarn wrote: Hi Austin, You have to trust the host for anything, not just for the entropy in timings. This is a completely invalid argument unless you can present a method by which one guest can manipulate timings in another guest in such a way that _removes_ the inherent entropy from the host. When dealing with almost any type 2 hypervisor, it is fully possible for a user other than the one running the hypervisor to manipulate scheduling such that entropy is reduced. This does not imply that the Please re-read the document: Jitter RNG does not rest on scheduling. If you are running inside a VM, your interrupt timings depend on the hypervisor's scheduling, period. You may not directly rely on scheduling from the OS you are running on, but if you are doing anything timing related in a VM, you are at the mercy of the scheduling used by the hypervisor and whatever host OS may be running underneath it. In the attack I'm describing, the malicious user is not manipulating the guest OS's scheduling, they are manipulating the host system's scheduling.
Re: [PATCH v4 0/5] /dev/random - a new approach
On 2016-06-21 09:20, Stephan Mueller wrote: On Tuesday, 21 June 2016, 09:05:55, Austin S. Hemmelgarn wrote: Hi Austin, On 2016-06-20 14:32, Stephan Mueller wrote: On Monday, 20 June 2016, 13:07:32, Austin S. Hemmelgarn wrote: Hi Austin, On 2016-06-18 12:31, Stephan Mueller wrote: On Saturday, 18 June 2016, 10:44:08, Theodore Ts'o wrote: Hi Theodore, At the end of the day, with these devices you really badly need a hardware RNG. We can't generate randomness out of thin air. The only thing you really can do requires user space help, which is to generate keys lazily, or as late as possible, so you can gather as much entropy as you can --- and to feed in measurements from the WiFi (RSSI measurements, MAC addresses seen, etc.) This won't help much if you have an FBI van parked outside your house trying to carry out a TEMPEST attack, but hopefully it provides some protection against a remote attacker who isn't trying to carry out an on-premises attack. All my measurements on such small systems like MIPS or smaller/older ARMs do not seem to support that statement :-) Was this on real hardware, or in a virtual machine/emulator? Because if it's not on real hardware, you're harvesting entropy from the host system, not the emulated one. While I haven't done this with MIPS or ARM systems, I've taken similar measurements on SPARC64, x86_64, and PPC64 systems comparing real hardware and emulated hardware, and the emulated hardware _always_ has higher entropy, even when running the emulator on an identical CPU to the one being emulated and using KVM acceleration and passing through all the devices possible. Even if you were testing on real hardware, I'm still rather dubious, as every single test I've ever done on any hardware (SPARC, PPC, x86, ARM, and even PA-RISC) indicates that you can't harvest entropy as effectively from a smaller CPU compared to a large one, and this effect is significantly more pronounced on RISC systems. It was on real hardware. As part of my Jitter RNG project, I tested all major CPUs from small to big -- see Appendix F [1]. For MIPS/ARM, see the trailing part of the big table. [1] http://www.chronox.de/jent/doc/CPU-Jitter-NPTRNG.pdf Specific things I notice about this: 1. QEMU systems are reporting higher values than almost anything else with the same ISA. This makes sense, but you don't appear to have accounted for the fact that you can't trust almost any of the entropy in a VM unless you have absolute trust in the host system, because the host system can do whatever the hell it wants to you, including manipulating timings directly (with a little patience and some time spent working on it, you could probably get those numbers to show whatever you want just by manipulating scheduling parameters on the host OS for the VM software). I am not sure where you see QEMU systems listed there. That would be the ones which list 'QEMU Virtual CPU version X.Y' as the CPU string. The only things that return that in the CPUID data are either QEMU itself, or software that is based on QEMU. 2. Quite a few systems have a rather distressingly low lower bound and still get accepted by your algorithm (a number of the S/390 systems, and a handful of the AMD processors in particular). I am aware of that, but please read the entire documentation where the lower and upper boundary comes from and how the Jitter RNG really operates. There you will see that the lower boundary is just that: it will not be lower, but the common case is the upper boundary.
Talking about the common case is all well and good, but the lower bound still needs to be taken into account. If the test results aren't uniformly distributed within that interval, or even following a typical Gaussian distribution within it (which is what I and many other people would probably assume without the data later in the appendix), then you really need to mention this _before_ the table itself. Such information is very important, and not everyone has time to read everything. Furthermore, the use case of the Jitter RNG is to support the DRBG seeding with a very high reseed interval. 3. Your statement at the bottom of the table that 'all test systems at least un-optimized have a lower bound of 1 bit' is refuted by your own data, I count at least 2 data points where this is not the case. One of them is mentioned at the bottom as an outlier, and you have data to back this up listed in the table, but the other (MIPS 4Kec v4.8) is the only system of that specific type that you tested, and thus can't be claimed as an outlier. You are right, I have added more and more test results to the table without updating the statement below. I will fix that. But note, that there is a list below that statement providing explanations already. So, it is just that one statement that needs updating. 4. You state the S/390 systems gave different results when run un-optimized, but don't provide any data regarding
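As an aside on how such per-system lower bounds can be read, here is a rough sketch (my own, not taken from the paper; the delta values are made up for illustration) of the simplest min-entropy estimate one could apply to a set of timing deltas, the most-common-value estimate. Real estimators such as those in NIST SP 800-90B are considerably more involved:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
	return (x > y) - (x < y);
}

/* Most-common-value estimate: min-entropy = -log2(count of most frequent value / n). */
static double min_entropy_mcv(uint64_t *deltas, size_t n)
{
	size_t best = 1, run = 1;

	qsort(deltas, n, sizeof(*deltas), cmp_u64);
	for (size_t i = 1; i < n; i++) {
		run = (deltas[i] == deltas[i - 1]) ? run + 1 : 1;
		if (run > best)
			best = run;
	}
	return -log2((double)best / (double)n);
}

int main(void)
{
	/* Hypothetical timing deltas in nanoseconds, for illustration only. */
	uint64_t deltas[] = { 412, 398, 412, 455, 401, 412, 399, 431 };

	printf("MCV min-entropy estimate: %.3f bits per sample\n",
	       min_entropy_mcv(deltas, sizeof(deltas) / sizeof(deltas[0])));
	return 0;
}

A single figure like this says nothing about how the per-sample values are distributed between the lower and upper bound, which is exactly the complaint about presenting only the interval.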
Re: [PATCH v4 0/5] /dev/random - a new approach
On 2016-06-21 09:42, Pavel Machek wrote: Hi! 6. You have a significant lack of data regarding embedded systems, which is one of the two biggest segments of Linux's market share. You list no results for any pre-ARMv6 systems (Linux still runs on and is regularly used on ARMv4 CPU's, and it's worth also pointing out that the values on the Feel free to contribute more test results. I mean... you can't expect every person who wants to improve something in linux to test on everything in the world... can you? I was commenting less on the lack of results for such systems than on attempts to make statements about such systems based on a data-set that lacks information about such systems.