Re: Creating compressed backing_store as swapfile

2018-11-05 Thread Austin S. Hemmelgarn

On 11/5/2018 11:53 AM, valdis.kletni...@vt.edu wrote:

On Mon, 05 Nov 2018 11:28:49 -0500, "Austin S. Hemmelgarn" said:


Also, it's probably worth noting that BTRFS doesn't need to decompress
the entire file to read or write blocks in the middle: it splits the
file into 128k blocks and compresses each of those independently of the
others, so it can just decompress the 128k block that holds the actual
block that's needed.


Presumably it does something sane with block allocation for the now-compressed
128K that's presumably much smaller.  Also, that limits the damage from writing
to the middle of a compression unit.

That *does* however increase the memory requirement - you can OOM or
deadlock if your read/write from the swap needs an additional 128K for the
compression buffer at an inconvenient time...

Indeed, and I can't really comment on how it might behave under those 
circumstances (the systems I did the testing on never saw memory 
pressure quite _that_ bad, and I had them set up to swap things out 
pretty early and really aggressively).
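
As a rough illustration of the 128k compression-unit behavior mentioned at
the top of this message, the arithmetic for finding which unit has to be
decompressed for a given file offset looks like the following.  This is just
the offset math as a sketch, not actual btrfs code, and the example offset is
made up:

    #include <stdint.h>
    #include <stdio.h>

    #define COMPRESSION_UNIT (128 * 1024)  /* btrfs compresses in 128k chunks */

    int main(void)
    {
        uint64_t offset = 5 * 1024 * 1024 + 300;      /* example file offset */
        uint64_t unit   = offset / COMPRESSION_UNIT;  /* unit to decompress */
        uint64_t start  = unit * COMPRESSION_UNIT;    /* unit start in file */

        /* Only this one 128k unit needs decompressing to service the access. */
        printf("offset %llu -> unit %llu (bytes %llu..%llu)\n",
               (unsigned long long)offset, (unsigned long long)unit,
               (unsigned long long)start,
               (unsigned long long)(start + COMPRESSION_UNIT - 1));
        return 0;
    }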


Re: POSIX violation by writeback error

2018-09-05 Thread Austin S. Hemmelgarn

On 2018-09-05 04:37, 焦晓冬 wrote:

On Wed, Sep 5, 2018 at 4:04 PM Rogier Wolff  wrote:


On Wed, Sep 05, 2018 at 09:39:58AM +0200, Martin Steigerwald wrote:

Rogier Wolff - 05.09.18, 09:08:

So when a mail queuer puts mail in the mailq files and the mail processor
can get them out of there intact, nobody is going to notice.  (I know
mail queuers should call fsync and report errors when that fails, but
there are bound to be applications where calling fsync is not
appropriate (*))


AFAIK at least Postfix MDA only reports mail as being accepted over SMTP
once fsync() on the mail file completed successfully. And I'd expect
every sensible MDA to do this. I don't know how the Dovecot MDA, which I
currently use for sieve support, does this, though.




Is every mail client implementation really going to call fsync()? Why
would they call fsync(), when fsync() is meant to persist the file on
disk, which is apparently unnecessary if the SMTP delivery task won't
start again after a reboot?
Not mail clients, the actual servers.  If they implement the SMTP 
standard correctly, they _have_ to call fsync() before they return that 
an email was accepted for delivery or relaying, because SMTP requires 
that a successful return means that the system can actually attempt 
delivery, which is not guaranteed if they haven't verified that it's 
actually written out to persistent storage.



Yes. That's why I added the remark that mailers will call fsync and know
about it on the write side. I encountered a situation in the last few
days where a developer running into this while developing would have
been led to write:
   /* Calling this fsync causes unacceptable performance */
   // fsync (fd);

I know of an application somewhere that does real-time gathering of
call records (number X called Y for Z seconds). They come in from a
variety of sources, get de-duplicated, standardized, and written to
files. Then different output modules push the data to the different
consumers within the company. Billing among them.

Now getting old data there would be pretty bad. And calling fsync
all the time might have performance issues.

That's the situation where "old data is really bad".

But when apt-get upgrade replaces your /bin/sh and gets a write error,
returning an error on subsequent reads is really bad.


At this point, the /bin/sh may be partially old and partially new. Executing
this corrupted binary is also dangerous, though.
But the system may still be usable in that state, while returning an 
error there guarantees it isn't.  This is, in general, not the best 
example though, because no sane package manager directly overwrites 
_anything_; they all do some variation on replace-by-rename and call 
fsync _before_ renaming, so this situation is not realistically going to 
happen on any real system.
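
For reference, here is a minimal sketch in C of the replace-by-rename pattern
described above.  It is illustrative only (error handling trimmed, and the
paths, names, and buffer are made up for the example), not code from any
particular package manager:

    #include <fcntl.h>
    #include <unistd.h>

    /* Write a temporary file, make it durable, then rename it over the old
     * name.  Readers see either the old file or the complete new one, never
     * a partially written mix, and a write error is detected and reported
     * before the old file is replaced. */
    static int replace_file(const char *dir, const char *tmp,
                            const char *target, const void *data, size_t len)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0755);
        if (fd < 0)
            return -1;
        if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            unlink(tmp);
            return -1;          /* failure surfaces here, not on later reads */
        }
        close(fd);
        if (rename(tmp, target) != 0)
            return -1;
        /* fsync the containing directory so the rename itself survives a crash */
        int dfd = open(dir, O_RDONLY | O_DIRECTORY);
        if (dfd >= 0) {
            fsync(dfd);
            close(dfd);
        }
        return 0;
    }

A caller would use something like replace_file("/bin", "/bin/.sh.new",
"/bin/sh", buf, buflen); the key point is the ordering: write, fsync, rename.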


Re: [PATCH 1/2] x86: x86_64_defconfig: Enable KSM.

2018-06-15 Thread Austin S. Hemmelgarn

On 2018-06-14 18:50, Daniel Díaz wrote:

As per the documentation, Kernel Samepage Merging (available
since 2.6.32) is a memory-saving de-duplication feature,
enabled by CONFIG_KSM=y and activated via sysfs. More
information can be found here:
   https://www.kernel.org/doc/Documentation/vm/ksm.txt

When enabled in the kernel, the default is to not do anything
at all, until it is activated at run-time with:
   echo 1 > /sys/kernel/mm/ksm/run

As pointed out by a couple of others, this doesn't explain why this is a 
good idea.  All you're doing here is giving a reason that it won't have 
a negative impact on most users.


Two points that may be worth adding, but also don't really argue for it 
being a significant improvement:


* Pretty much all of the major distributions that use pre-built kernels 
have it enabled in their kernels (At minimum, Debian, Ubuntu, Fedora 
(and by extension RHEL and CentOS), openSUSE (and by extension SLES), 
Arch, and Alpine have it enabled), so enabling this in defconfig would 
bring it a bit closer to parity with distribution kernels in terms of 
core features.


* Software other than QEMU is starting to take advantage of it if 
available (for example, Netdata [1] can mark its in-memory TSDBs for 
deduplication, which usually cuts its memory usage roughly in half).



[1] https://my-netdata.io/
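
For context on how software like the above actually opts in: applications
flag candidate memory with madvise(MADV_MERGEABLE), and KSM then scans and
merges identical pages within those regions.  A minimal sketch (the 64 MiB
anonymous mapping is just an example):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 64 * 1024 * 1024;  /* example region size */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        /* Mark the region as mergeable.  This has no effect unless the
         * kernel was built with CONFIG_KSM=y and /sys/kernel/mm/ksm/run
         * has been set to 1. */
        if (madvise(buf, len, MADV_MERGEABLE) != 0)
            perror("madvise(MADV_MERGEABLE)");

        /* ... fill buf with data likely to contain duplicate pages ... */

        munmap(buf, len);
        return 0;
    }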


Re: Read-protected UEFI variables

2018-02-14 Thread Austin S. Hemmelgarn

On 2018-02-14 08:21, Benjamin Drung wrote:

On Wednesday, 14.02.2018, 13:09 +, Ard Biesheuvel wrote:

On 14 February 2018 at 12:52, Benjamin Drung wrote:

Hi,

I am exploring the possibility to store SSH and other keys in UEFI
variables for systems that do not have persistent storage. These
systems boot via network and need individual SSH keys which ideally
should not be distributed via network.

The plan is to write a small daemon that starts at boot and gets the
SSH keys from EFI variables to individualize the system with SSH keys.
I plan to release the code as free software. Simple proof-of-concept
code:

mount -t efivarfs none /sys/firmware/efi/efivars
for key in ssh_host_dsa_key ssh_host_ecdsa_key ssh_host_rsa_key; do
   dd ibs=1 skip=4 if=/sys/firmware/efi/efivars/${key}-89df11f4-38e6-473e-ab43-b4406b76fba9 of=/etc/ssh/$key
done

I am not the first person having the idea to use UEFI variables to
store keys:
https://www.usenix.org/conference/srecon17asia/program/presentation/korgachin

There is one problem: The keys should be readable only by root. When
mounting efivarfs, all variables have the permission 644 which makes
them readable by all users. I have different ideas how to solve it:

1) Hard-code a list of GUIDs that should be only readable by root in
the kernel module. These modules would also not be set to immutable.

2) Instead of hard-coding GUIDs, add a kernel module parameter to
specify the GUIDs. Maybe have a default list in the kernel module.

3) Add a mount option to specify the protected GUIDs.

Feedback is welcome.



I'd consider a patch that makes the permissions a mount option for
efivarfs, applying to all variables. The reason is that these
variables shouldn't have been world readable in the first place, and I
am reluctant to make this overly complex.


Having some variables (like the Boot and BootOrder variables) world
readable is useful. This allows normal users to run 'efibootmgr' to
display the boot options.
Some variables maybe (ISTR variables for things like the system 
time-zone or the firmware locale settings, which _might_ be useful), but 
I would say the boot variables are not on that list.  The only practical 
application for a regular (non-root) user to read those variables is to 
gather info for an attack on the system.  Anybody who legitimately needs 
to access them (either for debugging the boot process, or changing the 
boot options) should have administrative access to the system anyway, 
and even then they will usually not need to read them.


In fact, most of the UEFI variables fall into the same category, but 
even more so, userspace has no legitimate reason to need to read them. 
You can get an absolutely insane amount of info out of them on some 
systems, most of which is a gold-mine for potential attackers.  For the 
handful that may actually be useful to userspace, most would be needed 
only during system startup, and thus could safely be made readable by 
root only.



On the other hand, you should realize that UEFI was never designed to
keep secrets, and so whether it is a good idea to put secrets in UEFI
variables to begin with is dubious IMHO.


If the UEFI is as secure as storing an unencrypted file on a hard
drive, I am satisfied. Or do you have a better idea where to store the
SSH keys for a diskless system that boots via network?

There really isn't any other option unless you're willing to put a small 
amount of flash storage in the system somehow (maybe a small USB flash 
drive connected directly to a USB header inside the system?).  As far as 
the security of UEFI variables, the same limitations as storing the data 
on an unencrypted hard drive apply, with the addition that it's much 
easier to get at them through stuff like Intel's AMT or IPMI than it is 
to read data off of the hard drive.
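
As a side note on the dd proof-of-concept quoted earlier in this thread: the
4 bytes being skipped are the variable's UEFI attribute word, which efivarfs
prepends to the payload.  A rough C equivalent of one iteration of that loop
(using the same GUID as the quoted example) might look like:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* efivarfs exposes each variable as <name>-<guid>; the first 4 bytes
         * of the file are the attribute flags, the rest is the variable data. */
        const char *path = "/sys/firmware/efi/efivars/"
            "ssh_host_rsa_key-89df11f4-38e6-473e-ab43-b4406b76fba9";
        int fd = open(path, O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        uint32_t attributes;
        char buf[4096];
        ssize_t n;

        if (read(fd, &attributes, sizeof(attributes)) != (ssize_t)sizeof(attributes))
            return 1;                           /* same as dd's skip=4 */
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            fwrite(buf, 1, (size_t)n, stdout);  /* payload only */

        close(fd);
        return 0;
    }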


Re: [PATCH] efi: make EFI a menuconfig to ease disabling it all

2017-12-15 Thread Austin S. Hemmelgarn

On 2017-12-15 12:24, Vincent Legoll wrote:

Hello,


This looks fine to me. Ard?


Doesn't this break existing configs?


Would adding a "default yes" on the new menuconfig be OK?

If yes, I'd respin it for a v2

Alternatively, would it not make some degree of sense to just turn the 
CONFIG_EFI symbol into the menuconfig?  It already controls all the EFI 
related stuff except GPT support (which should not be dependent on EFI 
support), so anyone who wants EFI support already has it enabled, and it 
would be a bit nicer to work with when actually configuring a kernel (It 
is rather annoying to have to effectively enable something twice).


The same (in theory) goes for pretty much any other patch like this 
where there's already a config option controlling it all that just isn't 
a menuconfig.


Re: [PATCH v5 3/5] btrfs: Add zstd support

2017-08-11 Thread Austin S. Hemmelgarn

On 2017-08-09 22:39, Nick Terrell wrote:

Add zstd compression and decompression support to BtrFS. zstd at its
fastest level compresses almost as well as zlib, while offering much
faster compression and decompression, approaching lzo speeds.

I benchmarked btrfs with zstd compression against no compression, lzo
compression, and zlib compression. I benchmarked two scenarios. Copying
a set of files to btrfs, and then reading the files. Copying a tarball
to btrfs, extracting it to btrfs, and then reading the extracted files.
After every operation, I call `sync` and include the sync time.
Between every pair of operations I unmount and remount the filesystem
to avoid caching. The benchmark files can be found in the upstream
zstd source repository under
`contrib/linux-kernel/{btrfs-benchmark.sh,btrfs-extract-benchmark.sh}`
[1] [2].

I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD.

The first compression benchmark is copying 10 copies of the unzipped
Silesia corpus [3] into a BtrFS filesystem mounted with
`-o compress-force=Method`. The decompression benchmark times how long
it takes to `tar` all 10 copies into `/dev/null`. The compression ratio is
measured by comparing the output of `df` and `du`. See the benchmark file
[1] for details. I benchmarked multiple zstd compression levels, although
the patch uses zstd level 1.

| Method  | Ratio | Compression MB/s | Decompression MB/s |
|---------|-------|------------------|--------------------|
| None    |  0.99 |              504 |                686 |
| lzo     |  1.66 |              398 |                442 |
| zlib    |  2.58 |               65 |                241 |
| zstd 1  |  2.57 |              260 |                383 |
| zstd 3  |  2.71 |              174 |                408 |
| zstd 6  |  2.87 |               70 |                398 |
| zstd 9  |  2.92 |               43 |                406 |
| zstd 12 |  2.93 |               21 |                408 |
| zstd 15 |  3.01 |               11 |                354 |

The next benchmark first copies `linux-4.11.6.tar` [4] to btrfs. Then it
measures the compression ratio, extracts the tar, and deletes the tar.
Then it measures the compression ratio again, and `tar`s the extracted
files into `/dev/null`. See the benchmark file [2] for details.

| Method | Tar Ratio | Extract Ratio | Copy (s) | Extract (s) | Read (s) |
|--------|-----------|---------------|----------|-------------|----------|
| None   |      0.97 |          0.78 |    0.981 |       5.501 |    8.807 |
| lzo    |      2.06 |          1.38 |    1.631 |       8.458 |    8.585 |
| zlib   |      3.40 |          1.86 |    7.750 |      21.544 |   11.744 |
| zstd 1 |      3.57 |          1.85 |    2.579 |      11.479 |    9.389 |

[1] 
https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-benchmark.sh
[2] 
https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-extract-benchmark.sh
[3] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
[4] https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.11.6.tar.xz

zstd source repository: https://github.com/facebook/zstd

Signed-off-by: Nick Terrell 
Considering how things went with the previous patch, I've been a bit 
more aggressive testing this one, but after now almost 72 hours of 
combined runtime for each of the architectures I have tests set up for 
with nothing breaking (well, nothing breaking that wasn't already 
breaking before this patch set, some of the raid56 tests are still 
failing semi-reliably as expected), I'd say this is reasonably stable 
and looks good overall.

---
v2 -> v3:
- Port upstream BtrFS commits e1ddce71d6, 389a6cfc2a, and 6acafd1eff
- Change default compression level for BtrFS to 3

v3 -> v4:
- Add missing includes, which fixes the aarch64 build
- Fix minor linter warnings

  fs/btrfs/Kconfig   |   2 +
  fs/btrfs/Makefile  |   2 +-
  fs/btrfs/compression.c |   1 +
  fs/btrfs/compression.h |   6 +-
  fs/btrfs/ctree.h   |   1 +
  fs/btrfs/disk-io.c |   2 +
  fs/btrfs/ioctl.c   |   6 +-
  fs/btrfs/props.c   |   6 +
  fs/btrfs/super.c   |  12 +-
  fs/btrfs/sysfs.c   |   2 +
  fs/btrfs/zstd.c| 432 +
  include/uapi/linux/btrfs.h |   8 +-
  12 files changed, 468 insertions(+), 12 deletions(-)
  create mode 100644 fs/btrfs/zstd.c

diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index 80e9c18..a26c63b 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -6,6 +6,8 @@ config BTRFS_FS
select ZLIB_DEFLATE
select LZO_COMPRESS
select LZO_DECOMPRESS
+   select ZSTD_COMPRESS
+   select ZSTD_DECOMPRESS
select RAID6_PQ
select XOR_BLOCKS
select SRCU
diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 128ce17..962a95a 100644
--- 

Re: [PATCH v5 2/5] lib: Add zstd modules

2017-08-10 Thread Austin S. Hemmelgarn

On 2017-08-10 15:25, Hugo Mills wrote:

On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote:

On 08/10/2017 04:30 AM, Eric Biggers wrote:


These benchmarks are misleading because they compress the whole file as a
single stream without resetting the dictionary, which isn't how data will
typically be compressed in kernel mode.  With filesystem compression the data
has to be divided into small chunks that can each be decompressed independently.
That eliminates one of the primary advantages of Zstandard (support for large
dictionary sizes).


I did btrfs benchmarks of kernel trees and other normal data sets as
well.  The numbers were in line with what Nick is posting here.
zstd is a big win over both lzo and zlib from a btrfs point of view.

It's true Nick's patches only support a single compression level in
btrfs, but that's because btrfs doesn't have a way to pass in the
compression ratio.  It could easily be a mount option, it was just
outside the scope of Nick's initial work.


Could we please not add more mount options? I get that they're easy
to implement, but it's a very blunt instrument. What we tend to see
(with both nodatacow and compress) is people using the mount options,
then asking for exceptions, discovering that they can't do that, and
then falling back to doing it with attributes or btrfs properties.
Could we just start with btrfs properties this time round, and cut out
the mount option part of this cycle.
AFAIUI, the intent is to extend the compression type specification for 
both the mount options and the property, not to add a new mount option. 
I think we all agree that `mount -o compress=zstd3` is a lot better than 
`mount -o compress=zstd,compresslevel=3`.


In the long run, it'd be great to see most of the btrfs-specific
mount options get deprecated and ultimately removed entirely, in
favour of attributes/properties, where feasible.
Are properties set on the root subvolume inherited properly?  Because 
unless they are, we can't get the same semantics.


Two other counterarguments against completely removing BTRFS-specific mount 
options:
1. It's a lot easier and a lot more clearly defined to change things 
that affect global behavior of the FS by a remount than having to 
iterate everything in the FS to update properties.  If I'm disabling 
autodefrag, I'd much rather just `mount -o remount,noautodefrag` than 
`find / -xdev -exec btrfs property set \{\} autodefrag false`, as the 
first will take effect for everything simultaneously and run 
exponentially quicker.
2. There are some things that don't make sense as per-object settings or 
are otherwise nonsensical on objects.  Many, but not all, of the BTRFS 
specific mount options fall into this category IMO, with the notable 
exception of compress[-force], [no]autodefrag, [no]datacow, and 
[no]datasum.  Some other options do make sense as properties of the 
filesystem (commit, flushoncommit, {inode,space}_cache, max_inline, 
metadata_ratio, [no]ssd, and [no]treelog are such options), but many are 
one-off options that affect behavior on mount (like skip_balance, 
clear_cache, nologreplay, norecovery, usebackuproot, and subvol).


Re: [PATCH v5 2/5] lib: Add zstd modules

2017-08-10 Thread Austin S. Hemmelgarn

On 2017-08-10 13:24, Eric Biggers wrote:

On Thu, Aug 10, 2017 at 07:32:18AM -0400, Austin S. Hemmelgarn wrote:

On 2017-08-10 04:30, Eric Biggers wrote:

On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:


It can compress at speeds approaching lz4, and quality approaching lzma.


Well, for a very loose definition of "approaching", and certainly not at the
same time.  I doubt there's a use case for using the highest compression levels
in kernel mode --- especially the ones using zstd_opt.h.

Large data-sets with WORM access patterns and infrequent writes
immediately come to mind as a use case for the highest compression
level.

As a more specific example, the company I work for has a very large
amount of documentation, and we keep all old versions.  This is all
stored on a file server which is currently using BTRFS.  Once a
document is written, it's almost never rewritten, so write
performance only matters for the first write.  However, they're read
back pretty frequently, so we need good read performance.  As of
right now, the system is set to use LZO compression by default, and
then when a new document is added, the previous version of that
document gets re-compressed using zlib compression, which actually
results in pretty significant space savings most of the time.  I
would absolutely love to use zstd compression with this system with
the highest compression level, because most people don't care how
long it takes to write the file out, but they do care how long it
takes to read a file (even if it's an older version).


This may be a reasonable use case, but note this cannot just be the regular
"zstd" compression setting, since filesystem compression by default must provide
reasonable performance for many different access patterns.  See the patch in
this series which actually adds zstd compression to btrfs; it only uses level 1.
I do not see a patch which adds a higher compression mode.  It would need to be
a special setting like "zstdhc" that users could opt-in to on specific
directories.  It also would need to be compared to simply compressing in
userspace.  In many cases compressing in userspace is probably the better
solution for the use case in question because it works on any filesystem, allows
using any compression algorithm, and if random access is not needed it is
possible to compress each file as a single stream (like a .xz file), which
produces a much better compression ratio than the block-by-block compression
that filesystems have to use.
There has been discussion as well as (I think) initial patches merged 
for support of specifying the compression level for algorithms which 
support multiple compression levels in BTRFS.  I was actually under the 
impression that we had decided to use level 3 as the default for zstd, 
but that apparently isn't the case, and with the benchmark issues, it 
may not be once proper benchmarks are run.


Also, on the note of compressing in userspace, the use case I quoted at 
least can't do that because we have to deal with Windows clients and 
users have to be able to open files directly on said Windows clients.  I 
entirely agree that real archival storage is better off using userspace 
compression, but sometimes real archival storage isn't an option.


Note also that LZ4HC is in the kernel source tree currently but no one is using
it vs. the regular LZ4.  I think it is the kind of thing that sounded useful
originally, but at the end of the day no one really wants to use it in kernel
mode.  I'd certainly be interested in actual patches, though.
Part of that is the fact that BTRFS is one of the only consumers (AFAIK) 
of this API that can freely choose all aspects of their usage, and the 
consensus here (which I don't agree with, I might add) amounts to the 
argument that 'we already have  compression with a  compression 
ratio, we don't need more things like that'.  I would personally love to 
see LZ4HC support in BTRFS (based on testing my own use cases, LZ4 is 
more deterministic than LZO for both compression and decompression, and 
most of the non archival usage I have of BTRFS benefits from 
determinism), but there's not any point in me writing up such a patch 
because it's almost certain to get rejected because BTRFS already has 
LZO.  The main reason that zstd is getting considered at all is that the 
quoted benchmarks show clear benefits in decompression speed relative to 
zlib and far better compression ratios than LZO.


Re: [PATCH v5 2/5] lib: Add zstd modules

2017-08-10 Thread Austin S. Hemmelgarn

On 2017-08-10 07:32, Austin S. Hemmelgarn wrote:

On 2017-08-10 04:30, Eric Biggers wrote:

On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:


It can compress at speeds approaching lz4, and quality approaching lzma.


Well, for a very loose definition of "approaching", and certainly not at the
same time.  I doubt there's a use case for using the highest compression levels
in kernel mode --- especially the ones using zstd_opt.h.
Large data-sets with WORM access patterns and infrequent writes 
immediately come to mind as a use case for the highest compression level.


As a more specific example, the company I work for has a very large 
amount of documentation, and we keep all old versions.  This is all 
stored on a file server which is currently using BTRFS.  Once a document 
is written, it's almost never rewritten, so write performance only 
matters for the first write.  However, they're read back pretty 
frequently, so we need good read performance.  As of right now, the 
system is set to use LZO compression by default, and then when a new 
document is added, the previous version of that document gets 
re-compressed using zlib compression, which actually results in pretty 
significant space savings most of the time.  I would absolutely love to 
use zstd compression with this system with the highest compression 
level, because most people don't care how long it takes to write the 
file out, but they do care how long it takes to read a file (even if 
it's an older version).
Also didn't think to mention this, but I could see the max level being 
very popular for use with SquashFS root filesystems used in LiveCDs. 
Currently, they have to decide between read performance and image size, 
while zstd would provide both.


Re: [PATCH v5 2/5] lib: Add zstd modules

2017-08-10 Thread Austin S. Hemmelgarn

On 2017-08-10 04:30, Eric Biggers wrote:

On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:


It can compress at speeds approaching lz4, and quality approaching lzma.


Well, for a very loose definition of "approaching", and certainly not at the
same time.  I doubt there's a use case for using the highest compression levels
in kernel mode --- especially the ones using zstd_opt.h.
Large data-sets with WORM access patterns and infrequent writes 
immediately come to mind as a use case for the highest compression level.


As a more specific example, the company I work for has a very large 
amount of documentation, and we keep all old versions.  This is all 
stored on a file server which is currently using BTRFS.  Once a document 
is written, it's almost never rewritten, so write performance only 
matters for the first write.  However, they're read back pretty 
frequently, so we need good read performance.  As of right now, the 
system is set to use LZO compression by default, and then when a new 
document is added, the previous version of that document gets 
re-compressed using zlib compression, which actually results in pretty 
significant space savings most of the time.  I would absolutely love to 
use zstd compression with this system with the highest compression 
level, because most people don't care how long it takes to write the 
file out, but they do care how long it takes to read a file (even if 
it's an older version).




The code was ported from the upstream zstd source repository.


What version?


`linux/zstd.h` header was modified to match linux kernel style.
The cross-platform and allocation code was stripped out. Instead zstd
requires the caller to pass a preallocated workspace. The source files
were clang-formatted [1] to match the Linux Kernel style as much as
possible.


It would be easier to compare to the upstream version if it was not all
reformatted.  There is a chance that bugs were introduced by Linux-specific
changes, and it would be nice if they could be easily reviewed.  (Also I don't
know what clang-format settings you used, but there are still a lot of
differences from the Linux coding style.)



I benchmarked zstd compression as a special character device. I ran zstd
and zlib compression at several levels, as well as performing no
compression, which measures the time spent copying the data to kernel space.
Data is passed to the compressor 4096 B at a time. The benchmark file is
located in the upstream zstd source repository under
`contrib/linux-kernel/zstd_compress_test.c` [2].

I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is
211,988,480 B large. Run the following commands for the benchmark:

 sudo modprobe zstd_compress_test
 sudo mknod zstd_compress_test c 245 0
 sudo cp silesia.tar zstd_compress_test

The time is reported by the time of the userland `cp`.
The MB/s is computed with

 1,536,217,008 B / time(buffer size, hash)

which includes the time to copy from userland.
The Adjusted MB/s is computed with

 1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)).

The memory reported is the amount of memory the compressor requests.

| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |



These benchmarks are misleading because they compress the whole file as a
single stream without resetting the dictionary, which isn't how data will
typically be compressed in kernel mode.  With filesystem compression the data
has to be divided into small chunks that can each be decompressed independently.
That eliminates one of the primary advantages of Zstandard (support for large
dictionary sizes).


Re: [PATCH v5 2/5] lib: Add zstd modules

2017-08-10 Thread Austin S. Hemmelgarn

On 2017-08-10 04:30, Eric Biggers wrote:

On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:


It can compress at speeds approaching lz4, and quality approaching lzma.


Well, for a very loose definition of "approaching", and certainly not at the
same time.  I doubt there's a use case for using the highest compression levels
in kernel mode --- especially the ones using zstd_opt.h.
Large data-sets with WORM access patterns and infrequent writes 
immediately come to mind as a use case for the highest compression level.


As a more specific example, the company I work for has a very large 
amount of documentation, and we keep all old versions.  This is all 
stored on a file server which is currently using BTRFS.  Once a document 
is written, it's almost never rewritten, so write performance only 
matters for the first write.  However, they're read back pretty 
frequently, so we need good read performance.  As of right now, the 
system is set to use LZO compression by default, and then when a new 
document is added, the previous version of that document gets 
re-compressed using zlib compression, which actually results in pretty 
significant space savings most of the time.  I would absolutely love to 
use zstd compression with this system with the highest compression 
level, because most people don't care how long it takes to write the 
file out, but they do care how long it takes to read a file (even if 
it's an older version).




The code was ported from the upstream zstd source repository.


What version?


`linux/zstd.h` header was modified to match linux kernel style.
The cross-platform and allocation code was stripped out. Instead zstd
requires the caller to pass a preallocated workspace. The source files
were clang-formatted [1] to match the Linux Kernel style as much as
possible.


It would be easier to compare to the upstream version if it was not all
reformatted.  There is a chance that bugs were introduced by Linux-specific
changes, and it would be nice if they could be easily reviewed.  (Also I don't
know what clang-format settings you used, but there are still a lot of
differences from the Linux coding style.)



I benchmarked zstd compression as a special character device. I ran zstd
and zlib compression at several levels, as well as performing no
compression, which measure the time spent copying the data to kernel space.
Data is passed to the compresser 4096 B at a time. The benchmark file is
located in the upstream zstd source repository under
`contrib/linux-kernel/zstd_compress_test.c` [2].

I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is
211,988,480 B large. Run the following commands for the benchmark:

 sudo modprobe zstd_compress_test
 sudo mknod zstd_compress_test c 245 0
 sudo cp silesia.tar zstd_compress_test

The time is reported by the time of the userland `cp`.
The MB/s is computed with

 1,536,217,008 B / time(buffer size, hash)

which includes the time to copy from userland.
The Adjusted MB/s is computed with

 1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)).

The memory reported is the amount of memory the compressor requests.

| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 | 1     | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |



These benchmarks are misleading because they compress the whole file as a
single stream without resetting the dictionary, which isn't how data will
typically be compressed in kernel mode.  With filesystem compression the data
has to be divided into small chunks that can each be decompressed independently.
That eliminates one of the primary advantages of Zstandard (support for large
dictionary sizes).
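A rough way to see that effect from userspace, assuming the zstd 
command-line tool and coreutils are available (the small per-frame 
overhead inflates the chunked total slightly, but the gap in ratio is 
the point):

    # one stream, one window over the whole file
    zstd -19 -c silesia.tar | wc -c

    # independent 128 KiB chunks, roughly what per-extent filesystem
    # compression has to do
    mkdir chunks && split -a 3 -b 128K silesia.tar chunks/part.
    for f in chunks/part.*; do zstd -19 -c "$f"; done | wc -c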


Re: bcache with existing ext4 filesystem

2017-07-26 Thread Austin S. Hemmelgarn

On 2017-07-26 13:41, Eric Wheeler wrote:

On Wed, 26 Jul 2017, Pavel Machek wrote:


Hi!


Unfortunately, that would mean shifting 400GB data 8KB forward, and
compatibility problems. So I'd prefer adding bcache superblock into
the reserved space, so I can have caching _and_ compatibility with
grub2 etc (and avoid 400GB move):


The common way to do that is to move the beginning of the partition,
assuming your ext4 lives in a partition.


Well... if I move the partition, grub2 (etc) will be unable to access
data on it. (Plus I do not have free space before some of the
partitions I'd like to be cached).


Why not use dm-linear and prepend space for the bcache superblock?  If
this is your boot device, then you would need to write a custom
initrd hook too.
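A minimal sketch of that dm-linear layout (device names are 
placeholders, sizes are in 512-byte sectors, and the 16-sector header 
matches the 8 KiB shift mentioned above):

    SECTORS=$(blockdev --getsz /dev/sda4)
    printf '0 16 linear /dev/loop0 0\n16 %s linear /dev/sda4 0\n' "$SECTORS" \
        | dmsetup create bcache-backing

make-bcache -B would then be run against /dev/mapper/bcache-backing 
rather than against sda4 itself, and the initrd hook would recreate the 
mapping before bcache registers the device.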


Thanks for a pointer. That would actually work, but I'd have to be
very, very careful using it...

...because if I, or systemd or some kind of automounter sees the
underlying device (sda4) and writes to it (it is valid ext4 after
all), I'll have inconsistent base device and cache ... and that will
be asking for major problems (even in writethrough mode).


Sigh.  Gone are the days when distributions would only mount filesystems
if you ask them to.

This is only an issue if:
1. You are crazy and have your desktop set up to auto-mount stuff on 
insertion (instead of on-access), or you're using a brain-dead desktop 
that doesn't let you turn this functionality off.

or:
2. You are mounting by filesystem label or filesystem UUID 
(/dev/mapper/* and /dev// links are more than 
sufficiently stable unless you're changing the storage stack all the 
time).  This is of course almost always the case these days because for 
some reason the big distros assume that the device mapper links aren't 
stable or aren't safe to use.

or:
3. You're working with a multi-device BTRFS volume (and in that case, 
it's not an issue of auto-mounting it, but an issue of buggy kernel 
behavior combined with the 'scan everything for BTRFS' udev rule that's 
now upstream in udev).


If this is a desktop, then I'm not sure what to suggest.  But for server
with no GUI, turn off the grub2 osprober (GRUB_DISABLE_OS_PROBER="true" in
/etc/sysconfig/grub).  If this was LVM, then I would suggest also setting
global_filter in lvm.conf.  If you find other places that need poked to
prevent automounting then please let me know!

Nope, just those and what I mentioned above under 1 and 2.


As for ext4 feature bits, can they be arbitrarily named? (I think they are
bits, so maybe not).  Maybe propose a patch to ext4 to provide a
"disable_mount" feature.  This would prevent mounting altogether, and you
would set/clear it when you care to.  A strange feature indeed.  Doing
this as an obscure feature on a single filesystem doesn't quite seem
right.
In the case of ext4, you can use the MMP feature (with the associated 
minor overhead) to enforce this.
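For example, something like the following on the unmounted filesystem 
(device name is a placeholder):

    tune2fs -O mmp /dev/sda4

after which a second, concurrent mount attempt fails instead of silently 
stepping on the data.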



It might be better to have a block-layer device-mount mask so devices that
are allowed to be mounted can be whitelisted on the kernel cmdline or
something.  blk.allow_mount=8:16,253:*,...  or blk.disallow_mount=8:32 (or
probably both).

Just ideas, but it would be great to allow mounting only those
major/minors which are authorized, particularly with increasingly complex
block-device stacks.
If it's not explicitly created as a mount unit, fstab entry, or 
automount entry, and you're not using a brain-dead desktop, and you're 
not using FS labels or UUID's to mount, it _really_ isn't an issue 
except on BTRFS, and for that this would provide _no_ benefit, because 
the issue there results from confusion over which device to use combined 
with the assumption that a UUID stored on the device is sufficient to 
uniquely identify a filesystem.


Re: [PATCH v3 0/4] Add xxhash and zstd modules

2017-07-24 Thread Austin S. Hemmelgarn

On 2017-07-22 07:35, Adam Borowski wrote:

On Fri, Jul 21, 2017 at 11:56:21AM -0400, Austin S. Hemmelgarn wrote:

On 2017-07-20 17:27, Nick Terrell wrote:

This patch set adds xxhash, zstd compression, and zstd decompression
modules. It also adds zstd support to BtrFS and SquashFS.

Each patch has relevant summaries, benchmarks, and tests.


For patches 2-3, I've compile tested and had runtime testing running for
about 18 hours now with no issues, so you can add:

Tested-by: Austin S. Hemmelgarn <ahferro...@gmail.com>


I assume you haven't tried it on arm64, right?

I had no time to get 'round to it before, and just got the following build
failure:

   CC  fs/btrfs/zstd.o
In file included from fs/btrfs/zstd.c:28:0:
fs/btrfs/compression.h:39:2: error: unknown type name ‘refcount_t’
   refcount_t pending_bios;
   ^~
scripts/Makefile.build:302: recipe for target 'fs/btrfs/zstd.o' failed

It's trivially fixable by:
--- a/fs/btrfs/zstd.c
+++ b/fs/btrfs/zstd.c
@@ -24,6 +24,7 @@
  #include 
  #include 
  #include 
+#include <linux/refcount.h>
  #include 
  #include "compression.h"

after which it works fine, although half an hour of testing isn't exactly
exhaustive.

I did, and didn't hit this somehow...

Off to go verify my tool-chain and scripts then...



Alas, the armhf machine I ran stress tests (Debian archive rebuilds) on
doesn't boot with 4.13-rc1 due to some unrelated regression, bisecting that
would be quite painful so I did not try yet.  I guess re-testing your patch
set on 4.12, even with btrfs-for-4.13 (which it had for a while), wouldn't
be of much help.  So far, previous versions have been running for weeks,
with no issue since you fixed workspace flickering.
I also didn't see this, but I test on some seriously bare-bones 
configurations for both of the 32-bit ARM tests I run.  On further 
inspection, it looks like my scripts decided to use btrfs-for-4.13 as 
the base, not 4.13-rc1 like I thought they did, so I don't know anymore 
how helpful my testing may have been.



On amd64 all is fine.


I haven't tested SquashFS at all.


Meow!





Re: [PATCH v3 0/4] Add xxhash and zstd modules

2017-07-21 Thread Austin S. Hemmelgarn

On 2017-07-20 17:27, Nick Terrell wrote:

Hi all,

This patch set adds xxhash, zstd compression, and zstd decompression
modules. It also adds zstd support to BtrFS and SquashFS.

Each patch has relevant summaries, benchmarks, and tests.

Best,
Nick Terrell


For patches 2-3, I've compile tested and had runtime testing running for 
about 18 hours now with no issues, so you can add:


Tested-by: Austin S. Hemmelgarn <ahferro...@gmail.com>

For patch 1, I've only compile tested it, but had no issues and got no 
warnings about it when booting to test 2-4.


For patch 4, I've compile tested it and done some really basic runtime 
testing using a couple of hand-crafted SquashFS images.  It appears to 
run fine, but I've not done enough that I feel it warrants a Tested-by 
tag from me.


Changelog:

v1 -> v2:
- Make pointer in lib/xxhash.c:394 non-const (1/4)
- Use div_u64() for division of u64s (2/4)
- Reduce stack usage of ZSTD_compressSequences(), ZSTD_buildSeqTable(),
   ZSTD_decompressSequencesLong(), FSE_buildDTable(), FSE_decompress_wksp(),
   HUF_writeCTable(), HUF_readStats(), HUF_readCTable(),
   HUF_compressWeights(), HUF_readDTableX2(), and HUF_readDTableX4() (2/4)
- No zstd function uses more than 400 B of stack space (2/4)

v2 -> v3:
- Work around gcc-7 bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81388
   (2/4)
- Fix bug in dictionary compression from upstream commit cc1522351f (2/4)
- Port upstream BtrFS commits e1ddce71d6, 389a6cfc2a, and 6acafd1eff (3/4)
- Change default compression level for BtrFS to 3 (3/4)

Nick Terrell (4):
   lib: Add xxhash module
   lib: Add zstd modules
   btrfs: Add zstd support
   squashfs: Add zstd support

  fs/btrfs/Kconfig   |2 +
  fs/btrfs/Makefile  |2 +-
  fs/btrfs/compression.c |1 +
  fs/btrfs/compression.h |6 +-
  fs/btrfs/ctree.h   |1 +
  fs/btrfs/disk-io.c |2 +
  fs/btrfs/ioctl.c   |6 +-
  fs/btrfs/props.c   |6 +
  fs/btrfs/super.c   |   12 +-
  fs/btrfs/sysfs.c   |2 +
  fs/btrfs/zstd.c|  435 ++
  fs/squashfs/Kconfig|   14 +
  fs/squashfs/Makefile   |1 +
  fs/squashfs/decompressor.c |7 +
  fs/squashfs/decompressor.h |4 +
  fs/squashfs/squashfs_fs.h  |1 +
  fs/squashfs/zstd_wrapper.c |  150 ++
  include/linux/xxhash.h |  236 +++
  include/linux/zstd.h   | 1157 +++
  include/uapi/linux/btrfs.h |8 +-
  lib/Kconfig|   11 +
  lib/Makefile   |3 +
  lib/xxhash.c   |  500 +++
  lib/zstd/Makefile  |   18 +
  lib/zstd/bitstream.h   |  374 +
  lib/zstd/compress.c| 3479 
  lib/zstd/decompress.c  | 2526 
  lib/zstd/entropy_common.c  |  243 
  lib/zstd/error_private.h   |   53 +
  lib/zstd/fse.h |  575 
  lib/zstd/fse_compress.c|  795 ++
  lib/zstd/fse_decompress.c  |  332 +
  lib/zstd/huf.h |  212 +++
  lib/zstd/huf_compress.c|  770 ++
  lib/zstd/huf_decompress.c  |  960 
  lib/zstd/mem.h |  151 ++
  lib/zstd/zstd_common.c |   75 +
  lib/zstd/zstd_internal.h   |  250 
  lib/zstd/zstd_opt.h| 1014 +
  39 files changed, 14382 insertions(+), 12 deletions(-)
  create mode 100644 fs/btrfs/zstd.c
  create mode 100644 fs/squashfs/zstd_wrapper.c
  create mode 100644 include/linux/xxhash.h
  create mode 100644 include/linux/zstd.h
  create mode 100644 lib/xxhash.c
  create mode 100644 lib/zstd/Makefile
  create mode 100644 lib/zstd/bitstream.h
  create mode 100644 lib/zstd/compress.c
  create mode 100644 lib/zstd/decompress.c
  create mode 100644 lib/zstd/entropy_common.c
  create mode 100644 lib/zstd/error_private.h
  create mode 100644 lib/zstd/fse.h
  create mode 100644 lib/zstd/fse_compress.c
  create mode 100644 lib/zstd/fse_decompress.c
  create mode 100644 lib/zstd/huf.h
  create mode 100644 lib/zstd/huf_compress.c
  create mode 100644 lib/zstd/huf_decompress.c
  create mode 100644 lib/zstd/mem.h
  create mode 100644 lib/zstd/zstd_common.c
  create mode 100644 lib/zstd/zstd_internal.h
  create mode 100644 lib/zstd/zstd_opt.h

--
2.9.3





Re: [PATCH v3 0/4] Add xxhash and zstd modules

2017-07-21 Thread Austin S. Hemmelgarn

On 2017-07-21 07:16, Austin S. Hemmelgarn wrote:

On 2017-07-20 17:27, Nick Terrell wrote:

Well this is embarrassing, forgot to type anything before hitting send...

Hi all,

This patch set adds xxhash, zstd compression, and zstd decompression
modules. It also adds zstd support to BtrFS and SquashFS.

Each patch has relevant summaries, benchmarks, and tests.

Best,
Nick Terrell
I've compile tested all of this on my end now, and have gotten some 
preliminary testing of patches 2-4, and everything looks good so far 
here.  I'll reply to this if I end up finding issues, but it doesn't 
look like I will right now considering the testing has run over night 
without issue.


Changelog:

v1 -> v2:
- Make pointer in lib/xxhash.c:394 non-const (1/4)
- Use div_u64() for division of u64s (2/4)
- Reduce stack usage of ZSTD_compressSequences(), ZSTD_buildSeqTable(),
   ZSTD_decompressSequencesLong(), FSE_buildDTable(), 
FSE_decompress_wksp(),

   HUF_writeCTable(), HUF_readStats(), HUF_readCTable(),
   HUF_compressWeights(), HUF_readDTableX2(), and HUF_readDTableX4() 
(2/4)

- No zstd function uses more than 400 B of stack space (2/4)

v2 -> v3:
- Work around gcc-7 bug 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81388

   (2/4)
- Fix bug in dictionary compression from upstream commit cc1522351f (2/4)
- Port upstream BtrFS commits e1ddce71d6, 389a6cfc2a, and 6acafd1eff 
(3/4)

- Change default compression level for BtrFS to 3 (3/4)

Nick Terrell (4):
   lib: Add xxhash module
   lib: Add zstd modules
   btrfs: Add zstd support
   squashfs: Add zstd support

  fs/btrfs/Kconfig   |2 +
  fs/btrfs/Makefile  |2 +-
  fs/btrfs/compression.c |1 +
  fs/btrfs/compression.h |6 +-
  fs/btrfs/ctree.h   |1 +
  fs/btrfs/disk-io.c |2 +
  fs/btrfs/ioctl.c   |6 +-
  fs/btrfs/props.c   |6 +
  fs/btrfs/super.c   |   12 +-
  fs/btrfs/sysfs.c   |2 +
  fs/btrfs/zstd.c|  435 ++
  fs/squashfs/Kconfig|   14 +
  fs/squashfs/Makefile   |1 +
  fs/squashfs/decompressor.c |7 +
  fs/squashfs/decompressor.h |4 +
  fs/squashfs/squashfs_fs.h  |1 +
  fs/squashfs/zstd_wrapper.c |  150 ++
  include/linux/xxhash.h |  236 +++
  include/linux/zstd.h   | 1157 +++
  include/uapi/linux/btrfs.h |8 +-
  lib/Kconfig|   11 +
  lib/Makefile   |3 +
  lib/xxhash.c   |  500 +++
  lib/zstd/Makefile  |   18 +
  lib/zstd/bitstream.h   |  374 +
  lib/zstd/compress.c| 3479 


  lib/zstd/decompress.c  | 2526 
  lib/zstd/entropy_common.c  |  243 
  lib/zstd/error_private.h   |   53 +
  lib/zstd/fse.h |  575 
  lib/zstd/fse_compress.c|  795 ++
  lib/zstd/fse_decompress.c  |  332 +
  lib/zstd/huf.h |  212 +++
  lib/zstd/huf_compress.c|  770 ++
  lib/zstd/huf_decompress.c  |  960 
  lib/zstd/mem.h |  151 ++
  lib/zstd/zstd_common.c |   75 +
  lib/zstd/zstd_internal.h   |  250 
  lib/zstd/zstd_opt.h| 1014 +
  39 files changed, 14382 insertions(+), 12 deletions(-)
  create mode 100644 fs/btrfs/zstd.c
  create mode 100644 fs/squashfs/zstd_wrapper.c
  create mode 100644 include/linux/xxhash.h
  create mode 100644 include/linux/zstd.h
  create mode 100644 lib/xxhash.c
  create mode 100644 lib/zstd/Makefile
  create mode 100644 lib/zstd/bitstream.h
  create mode 100644 lib/zstd/compress.c
  create mode 100644 lib/zstd/decompress.c
  create mode 100644 lib/zstd/entropy_common.c
  create mode 100644 lib/zstd/error_private.h
  create mode 100644 lib/zstd/fse.h
  create mode 100644 lib/zstd/fse_compress.c
  create mode 100644 lib/zstd/fse_decompress.c
  create mode 100644 lib/zstd/huf.h
  create mode 100644 lib/zstd/huf_compress.c
  create mode 100644 lib/zstd/huf_decompress.c
  create mode 100644 lib/zstd/mem.h
  create mode 100644 lib/zstd/zstd_common.c
  create mode 100644 lib/zstd/zstd_internal.h
  create mode 100644 lib/zstd/zstd_opt.h

--
2.9.3







Re: [PATCH v3 0/4] Add xxhash and zstd modules

2017-07-21 Thread Austin S. Hemmelgarn

On 2017-07-20 17:27, Nick Terrell wrote:

Hi all,

This patch set adds xxhash, zstd compression, and zstd decompression
modules. It also adds zstd support to BtrFS and SquashFS.

Each patch has relevant summaries, benchmarks, and tests.

Best,
Nick Terrell

Changelog:

v1 -> v2:
- Make pointer in lib/xxhash.c:394 non-const (1/4)
- Use div_u64() for division of u64s (2/4)
- Reduce stack usage of ZSTD_compressSequences(), ZSTD_buildSeqTable(),
   ZSTD_decompressSequencesLong(), FSE_buildDTable(), FSE_decompress_wksp(),
   HUF_writeCTable(), HUF_readStats(), HUF_readCTable(),
   HUF_compressWeights(), HUF_readDTableX2(), and HUF_readDTableX4() (2/4)
- No zstd function uses more than 400 B of stack space (2/4)

v2 -> v3:
- Work around gcc-7 bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81388
   (2/4)
- Fix bug in dictionary compression from upstream commit cc1522351f (2/4)
- Port upstream BtrFS commits e1ddce71d6, 389a6cfc2a, and 6acafd1eff (3/4)
- Change default compression level for BtrFS to 3 (3/4)

Nick Terrell (4):
   lib: Add xxhash module
   lib: Add zstd modules
   btrfs: Add zstd support
   squashfs: Add zstd support

  fs/btrfs/Kconfig   |2 +
  fs/btrfs/Makefile  |2 +-
  fs/btrfs/compression.c |1 +
  fs/btrfs/compression.h |6 +-
  fs/btrfs/ctree.h   |1 +
  fs/btrfs/disk-io.c |2 +
  fs/btrfs/ioctl.c   |6 +-
  fs/btrfs/props.c   |6 +
  fs/btrfs/super.c   |   12 +-
  fs/btrfs/sysfs.c   |2 +
  fs/btrfs/zstd.c|  435 ++
  fs/squashfs/Kconfig|   14 +
  fs/squashfs/Makefile   |1 +
  fs/squashfs/decompressor.c |7 +
  fs/squashfs/decompressor.h |4 +
  fs/squashfs/squashfs_fs.h  |1 +
  fs/squashfs/zstd_wrapper.c |  150 ++
  include/linux/xxhash.h |  236 +++
  include/linux/zstd.h   | 1157 +++
  include/uapi/linux/btrfs.h |8 +-
  lib/Kconfig|   11 +
  lib/Makefile   |3 +
  lib/xxhash.c   |  500 +++
  lib/zstd/Makefile  |   18 +
  lib/zstd/bitstream.h   |  374 +
  lib/zstd/compress.c| 3479 
  lib/zstd/decompress.c  | 2526 
  lib/zstd/entropy_common.c  |  243 
  lib/zstd/error_private.h   |   53 +
  lib/zstd/fse.h |  575 
  lib/zstd/fse_compress.c|  795 ++
  lib/zstd/fse_decompress.c  |  332 +
  lib/zstd/huf.h |  212 +++
  lib/zstd/huf_compress.c|  770 ++
  lib/zstd/huf_decompress.c  |  960 
  lib/zstd/mem.h |  151 ++
  lib/zstd/zstd_common.c |   75 +
  lib/zstd/zstd_internal.h   |  250 
  lib/zstd/zstd_opt.h| 1014 +
  39 files changed, 14382 insertions(+), 12 deletions(-)
  create mode 100644 fs/btrfs/zstd.c
  create mode 100644 fs/squashfs/zstd_wrapper.c
  create mode 100644 include/linux/xxhash.h
  create mode 100644 include/linux/zstd.h
  create mode 100644 lib/xxhash.c
  create mode 100644 lib/zstd/Makefile
  create mode 100644 lib/zstd/bitstream.h
  create mode 100644 lib/zstd/compress.c
  create mode 100644 lib/zstd/decompress.c
  create mode 100644 lib/zstd/entropy_common.c
  create mode 100644 lib/zstd/error_private.h
  create mode 100644 lib/zstd/fse.h
  create mode 100644 lib/zstd/fse_compress.c
  create mode 100644 lib/zstd/fse_decompress.c
  create mode 100644 lib/zstd/huf.h
  create mode 100644 lib/zstd/huf_compress.c
  create mode 100644 lib/zstd/huf_decompress.c
  create mode 100644 lib/zstd/mem.h
  create mode 100644 lib/zstd/zstd_common.c
  create mode 100644 lib/zstd/zstd_internal.h
  create mode 100644 lib/zstd/zstd_opt.h

--
2.9.3





Re: [PATCH v2 3/4] btrfs: Add zstd support

2017-07-10 Thread Austin S. Hemmelgarn

On 2017-07-07 23:07, Adam Borowski wrote:

On Sat, Jul 08, 2017 at 01:40:18AM +0200, Adam Borowski wrote:

On Fri, Jul 07, 2017 at 11:17:49PM +, Nick Terrell wrote:

On 7/6/17, 9:32 AM, "Adam Borowski"  wrote:

Got a reproducible crash on amd64:



Thanks for the bug report Adam! I'm looking into the failure, and haven't
been able to reproduce it yet. I've built my kernel from your tree, and
I ran your script with the kernel.tar tarball 100 times, but haven't gotten
a failure yet.



I have a few questions to guide my debugging.

- How many cores are you running with? I’ve run the script with 1, 2, and 4 
cores.
- Which version of gcc are you using to compile the kernel? I’m using 
gcc-6.2.0-5ubuntu12.
- Are the failures always in exactly the same place, and does it fail 100%
   of the time or just regularly?


6 cores -- all on bare metal.  gcc-7.1.0-9.
Lemme try with gcc-6, a different config or in a VM.


I've tried the following:
* gcc-6, defconfig (+btrfs obviously)
* gcc-7, defconfig
* gcc-6, my regular config
* gcc-7, my regular config
* gcc-7, debug + UBSAN + etc
* gcc-7, defconfig, qemu-kvm with only 1 core

Every build with gcc-7 reproduces the crash, every with gcc-6 does not.

Got a GCC7 tool-chain built, and I can confirm this here too, tested 
with various numbers of cores ranging from 1-32 in a QEMU+KVM VM, with 
various combinations of debug options and other config switches.


Re: [Xen-devel] HPET enabled in BIOS, not presented as available_clocksource -- config, kernel code, &/or BIOS?

2017-05-15 Thread Austin S. Hemmelgarn

On 2017-05-13 19:17, PGNet Dev wrote:

On 5/13/17 3:15 PM, Valentin Vidic wrote:

Try booting without 'hpet=force,verbose clocksource=hpet' and it should
select xen by default:


Nope. Well, not quite ...

With both

'hpet=force,verbose clocksource=hpet'

removed, I end up with

cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc xen
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

But with

clocksource=xen

*explicitly* added

cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc xen
cat /sys/devices/system/clocksource/clocksource0/current_clocksourcexen
xen

and in *console*, NOT dmesg, output,

grep -i hpet tmp.txt
(XEN) ACPI: HPET 9E8298F8, 0038 (r1 SUPERM SMCI--MB  1072009 
AMI.5)
(XEN) ACPI: HPET id: 0x8086a701 base: 0xfed0
(XEN) [VT-D] MSI HPET: :f0:0f.0
(XEN) Platform timer is 14.318MHz HPET
[0.00] ACPI: HPET 0x9E8298F8 38 (v01 SUPERM 
SMCI--MB 01072009 AMI. 00
[0.00] ACPI: HPET id: 0x8086a701 base: 0xfed0
[0.00] ACPI: HPET 0x9E8298F8 38 (v01 SUPERM 
SMCI--MB 01072009 AMI. 00
[0.00] ACPI: HPET id: 0x8086a701 base: 0xfed0
[8.515245] hpet_acpi_add: no address or irqs in _CRS
[8.515245] hpet_acpi_add: no address or irqs in _CRS
(XEN) [2017-05-13 23:04:27] HVM1 save: HPET



and

dmesg | grep -i clocksource | grep -v line:
[0.00] clocksource: refined-jiffies: mask: 0x 
max_cycles: 0x, max_idle_ns: 7645519600211568 ns
[0.004000] clocksource: xen: mask: 0x 
max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[0.375709] clocksource: jiffies: mask: 0x 
max_cycles: 0x, max_idle_ns: 764504178510 ns
[4.656634] clocksource: Switched to clocksource xen
[8.912897] clocksource: tsc: mask: 0x 
max_cycles: 0x2c94dffea94, max_idle_ns: 440795361700 ns

jiffies, now? hm. no idea where that came from. and why the 'tsc' ?

So I'm still unclear -- is this^ now, correctly "all" using MSI/HPET?
That depends on what you mean by everything correctly using the HPET. 
Using clocksource=xen (or autoselecting it) will cause the kernel to get 
timing info from Xen.  If you're running as a guest, this is absolutely 
what you want (unless you're using HVM), and with possible limited and 
extremely specific exceptions, this is almost certainly what you want in 
Domain-0 as well.  Given that Xen is using the HPET for timing itself, 
using clocksource=xen will result in Linux _indirectly_ using the HPET 
through Xen without involving the HPET driver (in essence, Xen is your 
HPET driver in this situation), which will get you essentially the same 
precision that you would get by using the HPET directly.


So, if you just want the precision offered by the HPET, then yes, you 
are getting the same thing through the Xen clocksource.
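If you want to double-check that from Dom0, the hypervisor's own boot 
log records which platform timer it picked, e.g.:

    xl dmesg | grep -i 'Platform timer'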


Re: [PATCH] btrfs: scrub: use do_div() for 64-by-32 division

2017-04-10 Thread Austin S. Hemmelgarn

On 2017-04-08 17:07, Adam Borowski wrote:

Unbreaks ARM and possibly other 32-bit architectures.

Fixes: 7d0ef8b4d: Btrfs: update scrub_parity to use u64 stripe_len
Reported-by: Icenowy Zheng 
Signed-off-by: Adam Borowski 
---
You'd probably want to squash this with Liu's commit, to be nice to future
bisects.

Tested on amd64 where all is fine, and on arm (Odroid-U2) where scrub
sometimes works, but, like most operations, randomly dies with some badness
that doesn't look related: io_schedule, kunmap_high.  That badness wasn't
there in 4.11-rc5, needs investigating, but since it's not connected to our
issue at hand, I consider this patch sort-of tested.

 fs/btrfs/scrub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b6fe1cd08048..95372e3679f3 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2407,7 +2407,7 @@ static inline void __scrub_mark_bitmap(struct scrub_parity *sparity,
 
 	start -= sparity->logic_start;
 	start = div64_u64_rem(start, sparity->stripe_len, &offset);
-	offset /= sectorsize;
+	do_div(offset, sectorsize);
 	nsectors = (int)len / sectorsize;
 
 	if (offset + nsectors <= sparity->nsectors) {


Also fixes things on:
32-bit MIPS (eb and el variants)
32-bit SPARC
32-bit PPC

You can add my Tested-by if you want.


Re: [PATCH 00/24] Kernel lockdown

2017-04-07 Thread Austin S. Hemmelgarn

On 2017-04-05 16:14, David Howells wrote:


These patches provide a facility by which a variety of avenues by which
userspace can feasibly modify the running kernel image can be locked down.
These include:

 (*) No unsigned modules and no modules for which we can't validate the
 signature.

 (*) No use of ioperm(), iopl() and no writing to /dev/port.

 (*) No writing to /dev/mem or /dev/kmem.

 (*) No hibernation.

 (*) Restrict PCI BAR access.

 (*) Restrict MSR access.

 (*) No kexec_load().

 (*) Certain ACPI restrictions.

 (*) Restrict debugfs interface to ASUS WMI.

The lock-down can be configured to be triggered by the EFI secure boot
status, provided the shim isn't insecure.  The lock-down can be lifted by
typing SysRq+x on a keyboard attached to the system.
This has already been mentioned both in response to previous versions of 
this patch set, and by at least 2 people in response to a specific patch 
in this posting, but for any kind of proper security analysis, you need 
to better clarify your threat model.  'Prevent modification to the 
running kernel image' is a decent start on this, but at least some of 
the patches don't explain very well _how_ what you're disabling could be 
used to modify the running kernel image.  Clarifying how something is a 
threat will help with verifying that you're correctly blocking the threat.


Furthermore, why is the only way to enable this to boot in UEFI Secure 
Boot mode?  Almost all of the hardening done here has general utility in 
hardening regular systems, and as such I'd suggest adding a command line 
option to enable kernel lock-down (which would greatly simplify 
testing), and a kconfig option to enforce it at build-time.


In addition to all that, it would be nice to be able to disable all of 
the following at build time independent of the kernel lock-down state
* The acpi_rsdp kernel parameter (I could easily see many distros 
building kernels with this disabled, it's insanely use-case specific).
* IO port and resource reservation module parameters (this would 
actually be easier than having runtime blacklisting, and I could also 
easily see this being turned on by default by a number of distros).

* TIOCSSERIAL (this one is more likely than the above two to break systems).

And these would probably be useful as lockable sysctls that would be 
automatically locked to the disabled state when the kernel is locked down:
* ioperm/iopl (these can technically be blocked by seccomp or other 
means, but that is non-trivial to do).
* Most of the other ACPI stuff (some of this is useful for 
troubleshooting, but is not normally used during regular operation).

* PCI BAR access.


Re: [PATCH linux-next V2] tty: Disable default console blanking interval

2017-03-23 Thread Austin S. Hemmelgarn

On 2017-03-22 21:32, Scot Doyle wrote:

On Wed, 22 Mar 2017, Tim Gardner wrote:

BugLink: http://bugs.launchpad.net/bugs/869017

Console blanking is not enabling DPMS power saving (thereby negating any
power-saving benefit), and is simply turning the screen content blank. This
means that any crash output is invisible which is unhelpful on a server
(virtual or otherwise).

Furthermore, CRT burn in concerns should no longer govern the default case.
Affected users could always set consoleblank on the kernel command line.


Does screen blanking save some power by disabling the cursor blink?

Unless you're dealing with ancient hardware, the difference in power 
usage is probably on the order of single digit micro-watts, which is not 
worth worrying about on almost anything you would expect to have a 
console display connected to.


Re: [PATCH linux-next] tty: Disable default console blanking interval

2017-03-22 Thread Austin S. Hemmelgarn

On 2017-03-22 09:50, Tim Gardner wrote:

BugLink: http://bugs.launchpad.net/bugs/869017

Signed-off-by: Tim Gardner 
Cc: Greg Kroah-Hartman 
Cc: Jiri Slaby 
Cc: Adam Borowski 
Cc: Scot Doyle 
---

I'm not particularly knowledgeable about console issues. Is a blanking interval
relevant in a post CRT world ? The argument in the bug description seems
compelling.
Burn-in still happens on at least LCD screens (not sure about anything 
else except DLP, where it doesn't happen unless it's a really crappy 
display), but on many of those it happens regardless of the contents of 
the display.  I've actually got an old LCD display where the image is 
constantly dark because of having displayed so many hours of a black 
screen.  Either way, displaying a black screen instead of powering off 
the display doesn't really save any power on most modern displays, and 
the fact that the screen isn't un-blanked when the kernel crashes is a 
pretty compelling argument against having it enabled by default IMO.
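Anyone who actually wants blanking can still opt back in explicitly; for 
example, with a 10-minute interval:

    # at boot: add consoleblank=600 to the kernel command line
    # at runtime, on a virtual console (older util-linux spells it -blank):
    setterm --blank 10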


 drivers/tty/vt/vt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 5c4933b..9c99452 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -181,7 +181,7 @@ int console_blanked;

 static int vesa_blank_mode; /* 0:none 1:suspendV 2:suspendH 3:powerdown */
 static int vesa_off_interval;
-static int blankinterval = 10*60;
+static int blankinterval;
 core_param(consoleblank, blankinterval, int, 0444);

 static DECLARE_WORK(console_work, console_callback);





Re: When will Linux support M2 on RAID ?

2017-03-07 Thread Austin S. Hemmelgarn

On 2017-03-07 10:15, Christoph Hellwig wrote:

On Tue, Mar 07, 2017 at 09:50:22AM -0500, Austin S. Hemmelgarn wrote:

He's referring to the RAID mode most modern Intel chipsets have, which (last
I checked) Linux does not support completely and many OEM's are setting by
default on new systems because it apparently provides better performance
than AHCI even for a single device.


It actually provides worse performance.  What it does is that it shoves
up to three NVMe device BARs into the BAR of an AHCI device, and
requires the OS to handle them all using a single driver.  The monkeys
on crack at Intel decided to do that to provide their "valuable" RSTe
IP (which is a Windows ATA + RAID driver in a blob, which now has also
grown an NVMe driver).  The only remotely sane thing is to disable it
in the BIOS, and burn all people involved with it.  The next best thing
is to provide a fake PCIe root port driver untangling this before it
hits the driver, but unfortunately Intel is unwilling to either do this
on their own or at least provide enough documentation for others to do
it.

For NVMe, yeah, it hurts performance horribly.  For SATA devices though, 
it's hit or miss: some setups perform better, some perform worse.


It does have one advantage though: it lets you put the C drive for a 
Windows install on a soft-RAID array insanely easily compared to trying 
to do so through Windows itself (although still significantly less 
easily than doing the equivalent on Linux...).


The cynic in me is tempted to believe that the OEM's who are turning it 
on by default are trying to either:
1. Make their low-end systems look even crappier in terms of performance 
while adding to their marketing checklist (Of the systems I've seen that 
have this on by default, most were cheap ones with really low specs).

2. Actively make it harder to run anything but Windows on their hardware.


Re: When will Linux support M2 on RAID ?

2017-03-07 Thread Austin S. Hemmelgarn

On 2017-03-06 23:52, Christoph Hellwig wrote:

On Sun, Mar 05, 2017 at 06:09:42PM -0800, David F. wrote:

More and more systems are coming with M2 on RAID and Linux doesn't
work unless you change the system out of RAID mode.  This is becoming
more and more of a problem.   What is the status of Linux support for
the new systems?


Your message doesn't make sense at all.  MD works on absolutely any
Linux block device, and I've used it on plenty M.2 form factor devices -
not that the form factor has anything to do with how Linux would treat
a device.

He's referring to the RAID mode most modern Intel chipsets have, which 
(last I checked) Linux does not support completely and many OEM's are 
setting by default on new systems because it apparently provides better 
performance than AHCI even for a single device.  The bigger issue with 
this is that it's damn near impossible to get to the firmware on many 
new systems, and even if you can, some OEM's aren't even giving the 
option to use AHCI mode instead of the RAID mode.


The whole M.2 thing is absolutely bogus though; my thought is that 
people are conflating the two because this switch to using RAID mode by 
default is happening concurrently with a general movement towards using 
M.2 devices as primary storage.


Re: Dell XPS13: MCE (Hardware Error) reported

2017-01-31 Thread Austin S. Hemmelgarn

On 2017-01-31 10:29, Paul Menzel wrote:

Dear Borislav, dear Mario,


On 01/27/17 18:16, mario.limoncie...@dell.com wrote:

-Original Message-
From: Borislav Petkov [mailto:b...@alien8.de]
Sent: Friday, January 27, 2017 11:11 AM
To: Paul Menzel 
Cc: Ashok Raj ; Linux Kernel Mailing List ; Thorsten Leemhuis ; Len
Brown ; Tony Luck ;
Limonciello, Mario 
Subject: Re: Dell XPS13: MCE (Hardware Error) reported

On Fri, Jan 27, 2017 at 02:35:16PM +0100, Paul Menzel wrote:

Thank you very much for your help. After wasting my time with the Dell
support over Twitter [1], where they basically also make you jump
through hoops,


Fun read that twitter page. Especially the bit about not supporting
Linux but
but you did ship this laptop with ubuntu. LOOL.

FWIW, this is not the first time I've heard vendors trying to get
away from
dealing with bugs by playing the "we-dont-support-Linux" card.


I think, in this case it was an honest mistake from the support person,
which they corrected in a follow-up post.

But the next conversion was a little disappointing as the problem was
already diagnosed out on the LKML.

As Mario replied, the Twitter people do not seem to know the free
software world that well, that “normal” people actually are in contact
with the developers.

What surprises me though is, that the change-log for the firmware seems
to be incomplete even for the Dell staff.


Good to know when I go looking for a new laptop in the future.


Unfortunately, there are not a lot of options. If you know a good
vendor, please share.

Dell is one of the few “big” vendors selling laptops with a GNU/Linux
operating system installed.

There are Google Chromebooks, which even have coreboot. Google’s
Chromium team does a great job and has great quality assurance. On
par with, or even better than, Apple, I’d say.

Some people need more powerful devices though.

In Germany, I know of TUXEDO [1]. Also, Purism devices have GNU/Linux
installed [2].

TUXEDO does not build the devices themselves, and gets template
models(?) from Chinese manufacturers. They only now seem to be starting
to hire the expertise to do Linux kernel and OS work themselves [3].
Currently, it’s mostly the stock Ubuntu for desktop with their artwork
and some small optimizations.
When it comes to laptops, I've recently become a fan of System76 [1]. 
Linux runs pretty much flawlessly on all the systems I've ever 
encountered from them without needing any special configuration, and 
they offer some pretty high-end systems.  Their Oryx Pro laptop is 
particularly impressive: it's got hardware on par with a desktop bearing 
a similar price-tag, which is rare even for systems that ship with 
Windows pre-installed.  They're based in the US, but I'm pretty certain 
they ship worldwide, and like most of the better manufacturers these 
days, the systems are made-to-order with quite a lot in the way of 
options (you can even get an NVMe boot drive in their higher-end systems 
if you're willing to shell out more than a third of the base cost for 
the system on it).


They also do desktops and servers, but it's generally cheaper in my 
experience for anything but a laptop to just build it yourself.



Sorry you had that experience.  I'll pass that on to get that messaging
corrected.  The team that does Linux support doesn't use twitter, they
do phone and email support only.


Thank you that’s good to know.


I'm glad you got things sorted moving to the next FW version.


Me too.

Mario, there are at least two more firmware bugs [4][5][6]. Having fast
suspend and resume says something about the quality of a device. If the
Dell XPS13 9360 is supposed to compete with Apple devices, and Google
Chromebooks, then this should be improved in my opinion. Do you have a
suggestion on how to best get this solved? Do you want me to contact the
support, or are there Dell employees with access to the Linux kernel
Bugzilla bug tracker?


Kind regards,

Paul


[1] https://www.tuxedocomputers.com/
[2] https://puri.sm/
[3]
https://www.tuxedocomputers.com/Jobs-Karriere-Stellenausschreibungen.geek?x6a1ec=o8323ih26224vook9uu997r1a5

[4] https://github.molgen.mpg.de/mariux64/dell_xps13/
[5] https://bugzilla.kernel.org/show_bug.cgi?id=185611
[6] https://bugzilla.kernel.org/show_bug.cgi?id=185621

[1] https://system76.com/



Re: [PATCH v2] Add a "nosymlinks" mount option.

2016-11-17 Thread Austin S. Hemmelgarn

On 2016-11-16 16:18, Mattias Nissler wrote:

I understand that silence suggests there's little interest, but here's
some new information I discovered today that may justify reconsidering
the patch:

The BSDs already have exactly what I propose, the mount option is
called "nosymfollow" there:
https://github.com/freebsd/freebsd/blob/a41f4cc9a57cd74604ae7b051eec2f48865f18d6/sys/kern/vfs_lookup.c#L939

There's also some evidence on the net that people have been using said
nosymfollow mount option to mitigate symlink attacks.
I will comment that I do have interest in this (to the point that if 
this doesn't go into mainline, I'll probably add it to my local set of 
patches), though I'm not a maintainer or reviewer for the relevant area, 
so there's not much I can do except say that there is interest from at 
least one person who didn't work on the patch.


That said, it would be nice if we matched the BSD option name (we do on 
pretty much every other core VFS level mount option).
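
To make the described semantics concrete from userspace, here is a minimal 
sketch.  Everything in it that isn't in the patch is an assumption: the 
device and mount point are made up, and MS_NOSYMLINKS is the flag this 
patch adds to include/uapi/linux/fs.h (its numeric value isn't visible in 
the posted diff, so the fallback define below is only a placeholder).

/* Sketch of the proposed behaviour: on a mount carrying the new flag,
 * path resolution through a symlink fails with EACCES, while creating
 * the symlink and calling readlink() on it still work.
 */
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mount.h>

#ifndef MS_NOSYMLINKS
#define MS_NOSYMLINKS 0	/* placeholder; the real value comes from the patched uapi header */
#endif

int main(void)
{
	char buf[256];
	ssize_t n;

	/* hypothetical scratch filesystem mounted with the new option */
	if (mount("/dev/sdb1", "/mnt/scratch", "ext4", MS_NOSYMLINKS, NULL))
		perror("mount");

	/* creating symlinks is still allowed on such a mount... */
	if (symlink("/mnt/scratch/real", "/mnt/scratch/alias") && errno != EEXIST)
		perror("symlink");

	/* ...but following one during path resolution is not */
	if (open("/mnt/scratch/alias", O_RDONLY) < 0)
		printf("open via symlink failed: %s\n", strerror(errno));

	/* readlink() remains functional for symlink-aware code */
	n = readlink("/mnt/scratch/alias", buf, sizeof(buf) - 1);
	if (n > 0) {
		buf[n] = '\0';
		printf("readlink: %s\n", buf);
	}
	return 0;
}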


On Mon, Oct 24, 2016 at 7:09 AM, Mattias Nissler  wrote:

Friendly ping - does this version of the patch have any chance on
getting included in mainline?

On Wed, Oct 19, 2016 at 2:31 PM, Mattias Nissler  wrote:

For mounts that have the new "nosymlinks" option, don't follow
symlinks when resolving paths. The new option is similar in spirit to
the existing "nodev", "noexec", and "nosuid" options.

Note that symlinks may still be created on mounts where the
"nosymlinks" option is present. readlink() remains functional, so user
space code that is aware of symlinks can still choose to follow them
explicitly.

Setting the "nosymlinks" mount option helps prevent privileged writers
from modifying files unintentionally in case there is an unexpected
link along the accessed path. The "nosymlinks" option is thus useful
as a defensive measure for systems that need to deal with untrusted
file systems in privileged contexts.

Signed-off-by: Mattias Nissler 
---
 fs/namei.c  | 3 +++
 fs/namespace.c  | 9 ++---
 fs/proc_namespace.c | 1 +
 fs/statfs.c | 2 ++
 include/linux/mount.h   | 3 ++-
 include/linux/statfs.h  | 1 +
 include/uapi/linux/fs.h | 1 +
 7 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 5b4eed2..4cddcf3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1021,6 +1021,9 @@ const char *get_link(struct nameidata *nd)
touch_atime(&last->link);
}

+   if (nd->path.mnt->mnt_flags & MNT_NOSYMLINKS)
+   return ERR_PTR(-EACCES);
+
error = security_inode_follow_link(dentry, inode,
   nd->flags & LOOKUP_RCU);
if (unlikely(error))
diff --git a/fs/namespace.c b/fs/namespace.c
index e6c234b..deec84e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2732,6 +2732,8 @@ long do_mount(const char *dev_name, const char __user 
*dir_name,
mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
if (flags & MS_RDONLY)
mnt_flags |= MNT_READONLY;
+   if (flags & MS_NOSYMLINKS)
+   mnt_flags |= MNT_NOSYMLINKS;

/* The default atime for remount is preservation */
if ((flags & MS_REMOUNT) &&
@@ -2741,9 +2743,10 @@ long do_mount(const char *dev_name, const char __user 
*dir_name,
mnt_flags |= path.mnt->mnt_flags & MNT_ATIME_MASK;
}

-   flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_BORN |
-  MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
-  MS_STRICTATIME | MS_NOREMOTELOCK);
+   flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_NOSYMLINKS |
+  MS_ACTIVE | MS_BORN | MS_NOATIME | MS_NODIRATIME |
+  MS_RELATIME | MS_KERNMOUNT | MS_STRICTATIME |
+  MS_NOREMOTELOCK);

if (flags & MS_REMOUNT)
retval = do_remount(&path, flags & ~MS_REMOUNT, mnt_flags,
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 3f1190d..a1949d9 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -67,6 +67,7 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount 
*mnt)
{ MNT_NOATIME, ",noatime" },
{ MNT_NODIRATIME, ",nodiratime" },
{ MNT_RELATIME, ",relatime" },
+   { MNT_NOSYMLINKS, ",nosymlinks" },
{ 0, NULL }
};
const struct proc_fs_info *fs_infop;
diff --git a/fs/statfs.c b/fs/statfs.c
index 083dc0a..7ff7c32 100644
--- a/fs/statfs.c
+++ b/fs/statfs.c
@@ -27,6 +27,8 @@ static int flags_by_mnt(int mnt_flags)
flags |= ST_NODIRATIME;
if (mnt_flags & MNT_RELATIME)
flags |= ST_RELATIME;
+   if (mnt_flags & MNT_NOSYMLINKS)
+   flags |= ST_NOSYMLINKS;
return flags;
 }

diff --git a/include/linux/mount.h b/include/linux/mount.h
index 1172cce..5e302f4 100644
--- 

Re: [PATCH] f2fs: support multiple devices

2016-11-10 Thread Austin S. Hemmelgarn

On 2016-11-09 21:29, Qu Wenruo wrote:



At 11/10/2016 06:57 AM, Andreas Dilger wrote:

On Nov 9, 2016, at 1:56 PM, Jaegeuk Kim  wrote:


This patch implements multiple devices support for f2fs.
Given multiple devices by mkfs.f2fs, f2fs shows them entirely as one big
volume under one f2fs instance.

Internal block management is very simple, but we will modify block
allocation and background GC policy to boost IO speed by exploiting them
according to each device speed.


How will you integrate this into FIEMAP, since it is now possible if a
file is split across multiple devices then it will return ambiguous block
numbers for a file.  I've been meaning to merge the FIEMAP handling in
Lustre to support multiple devices in a single filesystem, so that this
can be detected in userspace.

struct ll_fiemap_extent {
__u64 fe_logical;  /* logical offset in bytes for the start of
* the extent from the beginning of the file
*/
__u64 fe_physical; /* physical offset in bytes for the start
* of the extent from the beginning of the
disk
*/
__u64 fe_length;   /* length in bytes for this extent */
__u64 fe_reserved64[2];
__u32 fe_flags;/* FIEMAP_EXTENT_* flags for this extent */
__u32 fe_device;   /* device number for this extent */
__u32 fe_reserved[2];
};


Btrfs introduce a new layer for multi-device (even for single device).

So fiemap returned by btrfs is never real device bytenr, but logical
address in btrfs logical address space.
Much like traditional soft RAID.
This is a really important point.  BTRFS does a good job of segregating 
the layers here, so the file-level allocator really has very limited 
knowledge of the underlying storage, which in turn means that adding 
this to BTRFS would likely be a pretty invasive change for the FIEMAP 
implementation.
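
For anyone who wants to see this from userspace, a minimal sketch of the 
existing FIEMAP ioctl (FS_IOC_FIEMAP with the stock struct fiemap_extent 
from <linux/fiemap.h>, i.e. without the proposed fe_device field): on 
btrfs, the fe_physical values it reports are addresses in the filesystem's 
logical address space rather than per-device offsets, which is exactly the 
layering being described here.

/* Minimal FIEMAP dump: prints logical/physical offsets for the first
 * few extents of a file.  Only standard uapi structures are used.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	unsigned int i, count = 32;	/* only look at the first 32 extents */
	struct fiemap *fm;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	fm = calloc(1, sizeof(*fm) + count * sizeof(struct fiemap_extent));
	if (!fm)
		return 1;
	fm->fm_start = 0;
	fm->fm_length = ~0ULL;			/* whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush delalloc first */
	fm->fm_extent_count = count;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}
	for (i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *fe = &fm->fm_extents[i];

		printf("logical %llu physical %llu length %llu flags 0x%x\n",
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)fe->fe_physical,
		       (unsigned long long)fe->fe_length,
		       fe->fe_flags);
	}
	free(fm);
	close(fd);
	return 0;
}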




This adds the 32-bit "fe_device" field, which would optionally be filled
in by the filesystem (zero otherwise).  It would return the kernel device
number (i.e. st_dev), or for network filesystem (with FIEMAP_EXTENT_NET
set) this could just return an integer device number since the device
number is meaningless (and may conflict) on a remote system.

Since AFAIK Btrfs also has multiple device support there are an
increasing
number of places where this would be useful.


AFAIK, btrfs multi-device is here due to scrub with its data/meta csum.

It's also here for an attempt at parity with ZFS.


Unlike device-mapper based multi-device, btrfs has csum so it can detect
which mirror is correct.
This makes btrfs scrub a little better than soft raid.
For example, for RAID1, if the two mirrors differ from each other, btrfs can
find the correct one and rewrite it into the other mirror.

And furthermore, btrfs supports snapshots and is faster than
device-mapper based snapshots (LVM).
This makes it a little more worthy to implement multi-device support in
btrfs.


But for f2fs, there is no data csum and no snapshot support.
I don't really see the point of using so much code to implement it,
especially when we can use mdadm or LVM to implement it.
I'd tend to agree on this, if it weren't for the fact that this looks to 
me like preparation for implementing storage tiering, which neither LVM 
nor MD has a good implementation of.  Whether or not such functionality 
is worthwhile for the embedded systems that F2FS typically targets is 
another story of course.



Not to mention btrfs multi-device support still has quite a lot of bugs,
like scrub can corrupt correct data stripes.
This sounds like you're lumping raid5/6 code in with the general 
multi-device code, which is not a good way of describing things for 
multiple reasons.  Pretty much, if you're using just raid1 mode, without 
compression, on reasonable storage devices, things are rock-solid 
relative to the rest of BTRFS.


Yes, there is a bug with compression and multiple copies of things, but 
that requires a pretty spectacular device failure to manifest, and it 
impacts single device mode too (it happens in dup profiles as well as 
raid1).  As far as the raid5/6 stuff, that shouldn't have been merged in 
the state it was in when it got merged, and should probably just be 
rewritten from the ground up.


Personally speaking, I am not a fan of btrfs multi-device management,
despite the above advantage, as the complexity is really not worth it.
(So I think XFS with LVM is much better than Btrfs considering the
stability.)




Re: [PATCH 2/2] kbuild: add -fno-PIE

2016-11-04 Thread Austin S. Hemmelgarn

On 2016-11-04 10:39, Markus Trippelsdorf wrote:

On 2016.11.04 at 15:24 +0100, Sebastian Andrzej Siewior wrote:

On 2016-11-04 07:37:02 [-0400], Austin S. Hemmelgarn wrote:

clued enough to have known better.  Reassigning bug reports in question
from gcc-6 to linux is beyond stupid; Balint is either being deliberately
obtuse, or genuinely unable to imagine that somebody might be using the
compiler _not_ for debian package builds.


If it helps, you could point out that Gentoo's hardened profile's GCC builds
use PIE by default and have absolutely zero issues building the Linux kernel
without any special kernel patches to turn it off (and has been doing so for
years).


Interesting. So I took a look at Gentoo. They ship gcc 4.9.3 by default.
They have their own PIE patch since it is not yet supported by gcc. And
let me quote why that works for them:

| This file will add -fstack-protector-all, -fstack-check, -fPIE, -pie and -z 
now
| as default if the defines and the spec allow it.
| Added a hack for gcc-specs-* in toolchain-funcs.eclass and _filter-hardened 
in flag-o-matic.eclass
| to support older hardened GCC patches and we don't need to change the code on 
gcc-specs-* and _filter-hardened.
| This will add some unsupported upstream commands options as -nopie and -nonow.
| -D__KERNEL__ is added so we don't have -fPIE, -pie and -fstack-protector-all 
and -fstack-check when building kernels.
| ESP_CC1_SPEC is added to CC1_SPEC.
| ESP_CC1_STRICT_OVERFLOW_SPEC is added so we don't disable the strict-overflow 
check.
| ESP_LINK_PIE_CHECK_SPEC check for -pie, -p, -pg, -profile and -static.
| ENABLE_CRTBEGINP add support for crtbeginP.o, build -static with -fPIE or 
-fpie.

I was thinking about asking doko for something similar but no. Looking at
portage they have a few patches where they add -fno-PIE to some packages.
Also disabling PIE based on __KERNEL__ does not look right. So no, Gentoo
did no better.
And according to Google, there are also people in the Arch Linux camp
with the same problem. Gentoo's gcc 6 is completely masked and it does
not reference the patch I quote above so Gentoo will run into this
problem once they enable gcc 6 and don't add the -D__KERNEL__ hack.
Eventually Fedora and SUSE will migrate to PIE by default and by then we
should cover all major distros so even Al should be affected unless he
decides not to update or is using something else.


But why enable PIE by default? What additional security does one gain if
e.g. "cat" or "dmesg" are position-independent executables?
It also adds overhead (although this is smaller than it used to be on
x86_64).
So instead of doing the sane thing and adding -fPIE to long running
daemons only, now many project have to add -fno-pie to avoid breakage.

Actually, the number of things that don't work with PIE is pretty small 
(pretty much kernel code and some stuff that does in-line assembly).


In general, it falls under the heading of 'secure by default'.  Yeah, 
some stuff like cat and dmesg have essentially zero security benefit 
from being compiled with -fPIE, but once you start looking at pretty 
much anything that's doing any processing of potentially untrusted data 
(which includes anything that talks to the network), you do start having 
some security improvement.


Re: [PATCH 2/2] kbuild: add -fno-PIE

2016-11-04 Thread Austin S. Hemmelgarn

On 2016-11-04 10:24, Sebastian Andrzej Siewior wrote:

On 2016-11-04 07:37:02 [-0400], Austin S. Hemmelgarn wrote:

clued enough to have known better.  Reassigning bug reports in question
from gcc-6 to linux is beyond stupid; Balint is either being deliberately
obtuse, or genuinely unable to imagine that somebody might be using the
compiler _not_ for debian package builds.


If it helps, you could point out that Gentoo's hardened profile's GCC builds
use PIE by default and have absolutely zero issues building the Linux kernel
without any special kernel patches to turn it off (and has been doing so for
years).


Interesting. So I took a look at Gentoo. They ship gcc 4.9.3 by default.
They have their own PIE patch since it is not yet supported by gcc. And
let me quote why that works for them:

| This file will add -fstack-protector-all, -fstack-check, -fPIE, -pie and -z 
now
| as default if the defines and the spec allow it.
| Added a hack for gcc-specs-* in toolchain-funcs.eclass and _filter-hardened 
in flag-o-matic.eclass
| to support older hardened GCC patches and we don't need to change the code on 
gcc-specs-* and _filter-hardened.
| This will add some unsupported upstream commands options as -nopie and -nonow.
| -D__KERNEL__ is added so we don't have -fPIE, -pie and -fstack-protector-all 
and -fstack-check when building kernels.
| ESP_CC1_SPEC is added to CC1_SPEC.
| ESP_CC1_STRICT_OVERFLOW_SPEC is added so we don't disable the strict-overflow 
check.
| ESP_LINK_PIE_CHECK_SPEC check for -pie, -p, -pg, -profile and -static.
| ENABLE_CRTBEGINP add support for crtbeginP.o, build -static with -fPIE or 
-fpie.

I was thinking about asking doko for something similar but no. Looking at
portage they have a few patches where they add -fno-PIE to some packages.
Also disabling PIE based on __KERNEL__ does not look right. So no, Gentoo
did no better.
And according to Google, there are also people in the Arch Linux camp
with the same problem. Gentoo's gcc 6 is completely masked and it does
not reference the patch I quote above so Gentoo will run into this
problem once they enable gcc 6 and don't add the -D__KERNEL__ hack.
Eventually Fedora and SUSE will migrate to PIE by default and by then we
should cover all major distros so even Al should be affected unless he
decides not to update or is using something else.
While I don't agree with _how_ they worked around it, it still works 
correctly with no user intervention for pretty much every important 
case, and my point was more that it is possible to make this work 
without a kernel patch than 'Hey, it works over here, let's do what 
they're doing'.


I would still argue that the root of the issue is how GCC handles 
options specified on the command line that conflict with its compile-time 
defaults.  This is at least the second kernel-related case where 
things broke because GCC doesn't do the sensible thing and override 
defaults based on command line options (there is, or was, an issue on 
MIPS too with some other option that I can't remember right now; I'm 
not sure if it's been resolved yet).  If option X and option Y are 
mutually exclusive, and option X is specified on the command line while 
option Y isn't, GCC should not use option Y regardless of whether or 
not it's the default, and should possibly spit out a warning if it is 
the default (for PIC, yes, there probably should be a warning), not die.


Re: [PATCH 2/2] kbuild: add -fno-PIE

2016-11-04 Thread Austin S. Hemmelgarn

On 2016-11-03 21:08, Al Viro wrote:

On Thu, Nov 03, 2016 at 04:50:55PM -0600, Ben Hutchings wrote:

On Wed, 2016-11-02 at 18:20 +0100, Sebastian Andrzej Siewior wrote:

Debian started to build the gcc with -fPIE by default so the kernel
build ends before it starts properly with:
|kernel/bounds.c:1:0: error: code model kernel does not support PIC mode


Also add to KBUILD_AFLAGSi due to:

|gcc -Wp,-MD,arch/x86/entry/vdso/vdso32/.note.o.d … -mfentry -DCC_USING_FENTRY 
… vdso/vdso32/note.S
|arch/x86/entry/vdso/vdso32/note.S:1:0: sorry, unimplemented: -mfentry isn’t 
supported for 32-bit in c ombination with -fpic

[...]

Unfortunately this isn't sufficient:

Cannot use CONFIG_CC_STACKPROTECTOR_STRONG: -fstack-protector-strong available 
but compiler is broken
/build/linux-4.9~rc3/Makefile:1069: recipe for target 'prepare-compiler-check' 
failed
make[5]: *** [prepare-compiler-check] Error 1

scripts/gcc-x86_64-has-stack-protector.sh has its own list of options
which will need to include -fno-PIE.


That shit should be fixed in debian; no amount of kbuild changes will help
on bisect going back to a state prior to those.  IOW, no matter what we
do kernel-side, that fuckup by doku, balint, et.al. needs to be fixed in
debian gcc package.

Al, seriously disappointed by that mess - debian gcc packagers are generally
clued enough to have known better.  Reassigning bug reports in question
from gcc-6 to linux is beyond stupid; Balint is either being deliberately
obtuse, or genuinely unable to imagine that somebody might be using the
compiler _not_ for debian package builds.

If it helps, you could point out that Gentoo's hardened profile's GCC 
builds use PIE by default and have absolutely zero issues building the 
Linux kernel without any special kernel patches to turn it off (and has 
been doing so for years).


Re: [RFC] [PATCH] Add a "nolinks" mount option.

2016-10-17 Thread Austin S. Hemmelgarn

On 2016-10-17 09:02, Mattias Nissler wrote:

OK, no more feedback thus far. Is there generally any interest in a
mount option to avoid path name aliasing resulting in target file
confusion? Perhaps a version that only disables symlinks instead of
also hard-disabling files hard-linked to multiple locations (those are
much lower risk for the situation I care about)?

If there is interest, I'm happy to iterate the patch until it's
accepted. If there's no interest, that's fine too - I'll then likely
resort to moving the restrictions desired for Chrome OS into an LSM we
compile into our kernels.
I can see the symlink related part potentially being useful in other 
cases, although if you do get rid of the hardlink portion, I'd suggest 
renaming the mount option to 'nosymlinks'.


One use that comes to mind is securing multi-protocol file servers (for 
example, something serving both NFS and SMB) where at least one protocol 
doesn't directly handle symlinks (or there is inconsistency among the 
protocols in how they're handled).
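
One thing worth spelling out, since it's the reason a per-mount knob is 
attractive at all: the existing per-call defence, O_NOFOLLOW, only rejects 
a symlink as the final path component, so a link planted in an 
intermediate directory still redirects the write.  A minimal sketch of 
that gap, with entirely made-up paths:

/* Demonstrates that O_NOFOLLOW only guards the last path component.
 * All paths here are hypothetical.
 */
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
	/* attacker (before reboot): make /stateful/dir point elsewhere */
	if (mkdir("/run/victim", 0755) && errno != EEXIST)
		perror("mkdir");
	if (symlink("/run/victim", "/stateful/dir") && errno != EEXIST)
		perror("symlink");

	/* legit boot code (after reboot): a "careful" open with O_NOFOLLOW
	 * still follows the symlink in the middle of the path; only a
	 * symlink at the final component ("config") would fail with ELOOP.
	 */
	int fd = open("/stateful/dir/config",
		      O_WRONLY | O_CREAT | O_NOFOLLOW, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* this write lands in /run/victim/config, not on the stateful fs */
	if (write(fd, "data\n", 5) != 5)
		perror("write");
	close(fd);
	return 0;
}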


On Fri, Oct 14, 2016 at 6:22 PM, Mattias Nissler  wrote:

Forgot to mention: I realize my motivation is very specific to Chrome
OS, however the nolinks option seemed useful also as a mitigation to
generic privilege escalation symlink attacks, for cases where
disabling symlinks/hardlinks is acceptable.

On Fri, Oct 14, 2016 at 5:50 PM, Mattias Nissler  wrote:

On Fri, Oct 14, 2016 at 5:00 PM, Al Viro  wrote:


On Fri, Oct 14, 2016 at 03:55:15PM +0100, Al Viro wrote:

Setting the "nolinks" mount option helps prevent privileged writers
from modifying files unintentionally in case there is an unexpected
link along the accessed path. The "nolinks" option is thus useful as a
defensive measure against persistent exploits (i.e. a system getting
re-exploited after a reboot) for systems that employ a read-only or
dm-verity-protected rootfs. These systems prevent non-legit binaries
from running after reboot. However, legit code typically still reads
from and writes to a writable file system previously under full
control of the attacker, who can place symlinks to trick file writes
after reboot to target a file of their choice. "nolinks" fundamentally
prevents this.


Which parts of the tree would be on that "protected" rootfs and which would
you mount with that option?  Description above is rather vague and I'm
not convinced that it actually buys you anything.  Details, please...


Apologies for the vague description, I'm happy to explain in detail.

In case of Chrome OS, we have all binaries on a dm-verity rootfs, so
an attacker can't modify any binaries. After reboot, everything except
the rootfs is mounted noexec, so there's no way to re-gain code
execution after reboot by modifying existing binaries or dropping new
ones.

We've seen multiple exploits now where the attacker worked around
these limitations in two steps:

1. Before reboot, the attacker sets up symlinks on the writeable file
system (called "stateful" file system), which are later accessed by
legit boot code (such as init scripts) after reboot. For example, an
init script that copies file A to B can be abused by an attacker by
symlinking or hardlinking B to a location C of their choice, and
placing desired data to be written to C in A. That gives the attacker
a primitive to write data of their choice to a path of their choice
after reboot. Note that this primitive may target locations _outside_
the stateful file system the attacker previously had control of.
Particularly of interest are targets on /sys, but also tmpfs on /run
etc.

2. The second step for a successful attack is finding some legit code
invoked in the boot flow that has a vulnerability exploitable by
feeding it unexpected data. As an example, there are Linux userspace
utilities that read config from /run which may contain shell commands
that the utility executes, through which the attacker can gain code
execution again.

The purpose of the proposed patch is to raise the bar for the first
step of the attack: Writing arbitrary files after reboot. I'm
intending to mount the stateful file system with the nolinks option
(or otherwise prevent symlink traversal). This will help make sure
that any legit writes taking place during boot in init scripts etc. go
to the files intended by the developer, and can't be redirected by an
attacker.

Does this make more sense to you?




PS: what the hell do restrictions on _following_ symlinks have to do with
_creating_ hardlinks?  I'm trying to imagine a threat model where both would apply or
anything else beyond the word "link" they would have in common...


The restriction is not on _creating_ hard links, but _opening_
hardlinks. The commonality is in the confusion between the file you're
meaning to write vs. the file you actually end up writing to, which
stems from the fact that as things stand a file can be accessible on
other paths than its canonical one. For Chrome OS, 


Re: md/raid1: Improve another size determination in setup_conf()

2016-10-07 Thread Austin S. Hemmelgarn

On 2016-10-07 06:50, SF Markus Elfring wrote:

Linux has tons of issues, fixes for real problems are very welcome.


Is a spectrum of software improvements to reconsider there?



But coding style bike shedding is just a waste of time.


Why do various software developers bother about coding style specifications
at all then?
Coding style is important, but patches that just fix coding style are a 
bad thing because they break things like `git blame` and run the risk of 
introducing new bugs without any net benefit to end users.  This goes 
double for code you don't actually work on regularly or don't completely 
understand.



Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-06 Thread Austin S. Hemmelgarn

On 2016-10-06 11:05, Paolo Valente wrote:



Il giorno 06 ott 2016, alle ore 15:52, Austin S. Hemmelgarn 
<ahferro...@gmail.com> ha scritto:

On 2016-10-06 08:50, Paolo Valente wrote:



Il giorno 06 ott 2016, alle ore 13:57, Austin S. Hemmelgarn 
<ahferro...@gmail.com> ha scritto:

On 2016-10-06 07:03, Mark Brown wrote:

On Thu, Oct 06, 2016 at 10:04:41AM +0200, Linus Walleij wrote:

On Tue, Oct 4, 2016 at 9:14 PM, Tejun Heo <t...@kernel.org> wrote:



I get that bfq can be a good compromise on most desktop workloads and
behave reasonably well for some server workloads with the slice
expiration mechanism but it really isn't an IO resource partitioning
mechanism.



Not just desktops, also Android phones.



So why not have BFQ as a separate scheduling policy upstream,
alongside CFQ, deadline and noop?


Right.


We're already doing the per-usecase Kconfig thing for preemption.
But maybe somebody already hates that and want to get rid of it,
I don't know.


Hannes also suggested going back to making BFQ a separate scheduler
rather than replacing CFQ earlier, pointing out that it mitigates
against the risks of changing CFQ substantially at this point (which
seems to be the biggest issue here).


ISTR that the original argument for this approach essentially amounted to: 'If 
it's so much better, why do we need both?'.

Such an argument is valid only if the new design is better in all respects 
(which there isn't sufficient information to decide in this case), or the 
negative aspects are worth the improvements (which is too workload specific to 
decide for something like this).


All correct, apart from the workload-specific issue, which is not very clear to 
me. Over the last five years I have not found a single workload for which CFQ 
is better than BFQ, and none has been suggested.

My point is that whether or not BFQ is better depends on the workload. You 
can't test for every workload, so you can't say definitively that BFQ is better 
for every workload.


Yes


 At a minimum, there are workloads where the deadline and noop schedulers are 
better, but they're very domain specific workloads.


Definitely


 Based on the numbers from Shaohua, it looks like CFQ has better throughput 
than BFQ, and that will affect some workloads (for most, the improved fairness 
is worth the reduced throughput, but there probably are some cases where it 
isn't).


Well, no fairness, as with deadline and noop, but with much less throughput
than deadline and noop; that doesn't sound much like the best scheduler for
those workloads.  With BFQ you have service guarantees, with noop or
deadline you have maximum throughput.
And with CFQ you have something in between, which is half of why I think 
CFQ is still worth keeping (the other half being the people who 
inevitably want to stay on CFQ).  And TBH, deadline and noop only give 
good throughput with specific workloads (and in the case of noop, it's 
usually only useful on tiny systems where the overhead of scheduling is 
greater than the time saved by doing so (like some very low power 
embedded systems), or when you have scheduling done elsewhere in the 
storage stack (like in a VM)).
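
For reference, the choice among those schedulers is already made per block
device through sysfs; a rough C sketch (untested, device name and scheduler
are placeholders, needs root) of reading the available elevators and switching
one device over:

/* Query and set the I/O scheduler for one (hypothetical) block device via
 * the legacy sysfs interface, e.g. keep CFQ/BFQ on the system disk while a
 * VM's virtual disk uses noop or deadline. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/block/sda/queue/scheduler";  /* placeholder device */
    char line[256] = "";

    FILE *f = fopen(path, "r");
    if (!f) { perror("fopen"); return 1; }
    if (fgets(line, sizeof(line), f))
        printf("available (current in brackets): %s", line);  /* e.g. "noop deadline [cfq]" */
    fclose(f);

    f = fopen(path, "w");
    if (!f) { perror("fopen for write"); return 1; }
    /* Writing one of the listed names switches the elevator for this device. */
    if (fputs("deadline\n", f) == EOF || fflush(f) == EOF)
        perror("switching scheduler");
    fclose(f);
    return 0;
}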




Anyway, leaving aside this fact, IMO the real problem here is that we are in a catch-22: 
"we want BFQ to replace CFQ, but, since CFQ is legacy code, then you cannot change, 
and thus replace, CFQ"

I agree that that's part of the issue, but I also don't entirely agree with the 
reasoning on it.  Until blk-mq has proper I/O scheduling, people will continue 
to use CFQ, and based on the way things are going, it will be multiple months 
before that happens, whereas BFQ exists and is working now.


Exactly!

Thanks,
Paolo





Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-06 Thread Austin S. Hemmelgarn

On 2016-10-06 08:50, Paolo Valente wrote:



Il giorno 06 ott 2016, alle ore 13:57, Austin S. Hemmelgarn 
<ahferro...@gmail.com> ha scritto:

On 2016-10-06 07:03, Mark Brown wrote:

On Thu, Oct 06, 2016 at 10:04:41AM +0200, Linus Walleij wrote:

On Tue, Oct 4, 2016 at 9:14 PM, Tejun Heo <t...@kernel.org> wrote:



I get that bfq can be a good compromise on most desktop workloads and
behave reasonably well for some server workloads with the slice
expiration mechanism but it really isn't an IO resource partitioning
mechanism.



Not just desktops, also Android phones.



So why not have BFQ as a separate scheduling policy upstream,
alongside CFQ, deadline and noop?


Right.


We're already doing the per-usecase Kconfig thing for preemption.
But maybe somebody already hates that and want to get rid of it,
I don't know.


Hannes also suggested going back to making BFQ a separate scheduler
rather than replacing CFQ earlier, pointing out that it mitigates
against the risks of changing CFQ substantially at this point (which
seems to be the biggest issue here).


ISTR that the original argument for this approach essentially amounted to: 'If 
it's so much better, why do we need both?'.

Such an argument is valid only if the new design is better in all respects 
(which there isn't sufficient information to decide in this case), or the 
negative aspects are worth the improvements (which is too workload specific to 
decide for something like this).


All correct, apart from the workload-specific issue, which is not very clear to 
me. Over the last five years I have not found a single workload for which CFQ 
is better than BFQ, and none has been suggested.
My point is that whether or not BFQ is better depends on the workload. 
You can't test for every workload, so you can't say definitively that 
BFQ is better for every workload.  At a minimum, there are workloads 
where the deadline and noop schedulers are better, but they're very 
domain specific workloads.  Based on the numbers from Shaohua, it looks 
like CFQ has better throughput than BFQ, and that will affect some 
workloads (for most, the improved fairness is worth the reduced 
throughput, but there probably are some cases where it isn't).


Anyway, leaving aside this fact, IMO the real problem here is that we are in a catch-22: 
"we want BFQ to replace CFQ, but, since CFQ is legacy code, then you cannot change, 
and thus replace, CFQ"
I agree that that's part of the issue, but I also don't entirely agree 
with the reasoning on it.  Until blk-mq has proper I/O scheduling, 
people will continue to use CFQ, and based on the way things are going, 
it will be multiple months before that happens, whereas BFQ exists and 
is working now.


Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-06 Thread Austin S. Hemmelgarn

On 2016-10-06 07:03, Mark Brown wrote:

On Thu, Oct 06, 2016 at 10:04:41AM +0200, Linus Walleij wrote:

On Tue, Oct 4, 2016 at 9:14 PM, Tejun Heo  wrote:



I get that bfq can be a good compromise on most desktop workloads and
behave reasonably well for some server workloads with the slice
expiration mechanism but it really isn't an IO resource partitioning
mechanism.



Not just desktops, also Android phones.



So why not have BFQ as a separate scheduling policy upstream,
alongside CFQ, deadline and noop?


Right.


We're already doing the per-usecase Kconfig thing for preemption.
But maybe somebody already hates that and want to get rid of it,
I don't know.


Hannes also suggested going back to making BFQ a separate scheduler
rather than replacing CFQ earlier, pointing out that it mitigates
against the risks of changing CFQ substantially at this point (which
seems to be the biggest issue here).

ISTR that the original argument for this approach essentially amounted 
to: 'If it's so much better, why do we need both?'.


Such an argument is valid only if the new design is better in all 
respects (which there isn't sufficient information to decide in this 
case), or the negative aspects are worth the improvements (which is too 
workload specific to decide for something like this).


Re: TRIM/UNMAP/DISCARD via ATA Passthrough

2016-09-19 Thread Austin S. Hemmelgarn

On 2016-09-17 01:14, James Bottomley wrote:

On Fri, 2016-09-16 at 13:06 -0400, Austin S. Hemmelgarn wrote:

On 2016-09-16 12:21, James Bottomley wrote:

On Fri, 2016-09-16 at 11:53 -0400, Austin S. Hemmelgarn wrote:

On 2016-09-16 07:16, Hannes Reinecke wrote:

On 09/15/2016 10:52 PM, Jason A. Donenfeld wrote:

Hi Martin,

On Thu, Sep 15, 2016 at 6:07 PM, Martin K. Petersen

But how do they signal that ATA passthrough is possible? Is
there an ATA Information VPD page? Is REPORT SUPPORTED
OPERATION CODES supported?

We need really solid discovery data before we can entertain
enabling something like this.


`sg_opcodes` said invalid request, so I think there isn't REPORT
SUPPORTED OPERATION CODES, and `sg_vpd -p ai` came up illegal
too.

However, sg_sat_identify worked reliably, which means a solid way
of probing this would be to send IDENTIFY DEVICE ATA via
SG_ATA_16 or SG_ATA_12.

Let me know and I can give you access to the hardware if you're
curious.


Sadly, that's not sufficient.
linux is not the only provider of an SATL (mpt3sas being the most
prominent other one).
And while they might support ATA_12/ATA_16, there is no indication
that you can pass DSM TRIM that way.

So it's better to not support it at all than to support it on
hardware we can reliably identify?

I get that having feature parity is a good thing, but the discussion
isn't about providing support for all SATL devices, it's specifically
about UAS connected SATL devices.  Last I checked, mpt3sas doesn't do
anything with UAS, which means it's kind of irrelevant WRT supporting
this for UAS devices.


We're getting a bit off topic on mptsas and its eccentric SATL.

The point is, you're asking for UAS devices which each have an internal
SATL which you say potentially doesn't support discard.  The three
problems we have are

   1. How do we identify if the UAS SATL doesn't support discard.  If it
  does, we really don't want to cause further SATL related issues by
  bypassing it, so we need a way of telling this.
   2. If the SATL doesn't support discard, will it reliably support the
  ATA_12 or ATA_16 pass through (and which one) .. we need a way of
  checking this because there are known SATLs that don't do pass
  through.
   3. How do we actually configure it?  Presumably if the SATL doesn't
  support discard, it also doesn't give us the useful mode page
  indications we use to configure TRIM, so we're going to have to do
  some pass through discovery as well.

I assume by 'discard' here you're referring to SCSI UNMAP, as
anything that supports ATA_12 or ATA_16 pass through correctly will
support ATA TRIM/DISCARD on drives that support it.


discard is the block layer terminology it's mapped per transport to
UNMAP or WRITE SAME on SCSI and TRIM on ATA.
I actually didn't know this.  I'm not quite as knowledgeable about the 
block layer as I probably should be, and definitely not as up-to-date as 
I could be on the ATA and SCSI specs.



  If that's the case, then:
1. If SCSI UNMAP fails, it doesn't support UNMAP.  This is of course
non-trivial to verify safely (we pretty much have to assume it is
supported if we have no clear indication it isn't, and then switch
based on what happens the first time we try to use it).


It's not quite that simple: to get us to configure discard in the first
place, you have to indicate support in READ CAPACITY (16): the LBPME
bit.  The chances are your UAS SATL isn't setting this.
OK, that makes sense.  Given that though, is it known how something like 
that may react if you tried to issue an UNMAP or WRITE SAME command when 
it's not supported?
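
For reference, a rough sketch (untested; /dev/sdX is a placeholder) of checking
that gating bit directly: READ CAPACITY (16) goes out as SERVICE ACTION IN (16)
over SG_IO, and LBPME is bit 7 of byte 14 of the returned parameter data:

/* Issue READ CAPACITY (16) through SG_IO and report whether the device sets
 * LBPME, the bit that lets sd configure discard at all. */
#include <fcntl.h>
#include <scsi/sg.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    unsigned char cdb[16] = { 0 }, sense[32] = { 0 }, resp[32] = { 0 };
    struct sg_io_hdr hdr;
    int fd = open(argc > 1 ? argv[1] : "/dev/sdX", O_RDONLY | O_NONBLOCK);

    if (fd < 0) { perror("open"); return 1; }

    cdb[0]  = 0x9e;           /* SERVICE ACTION IN (16) */
    cdb[1]  = 0x10;           /* service action: READ CAPACITY (16) */
    cdb[13] = sizeof(resp);   /* allocation length (low byte) */

    memset(&hdr, 0, sizeof(hdr));
    hdr.interface_id    = 'S';
    hdr.cmd_len         = sizeof(cdb);
    hdr.cmdp            = cdb;
    hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    hdr.dxferp          = resp;
    hdr.dxfer_len       = sizeof(resp);
    hdr.sbp             = sense;
    hdr.mx_sb_len       = sizeof(sense);
    hdr.timeout         = 5000;   /* ms */

    if (ioctl(fd, SG_IO, &hdr) < 0 ||
        hdr.status || hdr.host_status || hdr.driver_status) {
        fprintf(stderr, "READ CAPACITY (16) failed\n");
        close(fd);
        return 1;
    }

    printf("LBPME=%d LBPRZ=%d\n", (resp[14] >> 7) & 1, (resp[14] >> 6) & 1);
    close(fd);
    return 0;
}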



2. Unless there are SATL's out there that write garbage to the device
or die when sent an ATA_12 or ATA_16 pass through command



Yes, there are; the problems with USB devices that fail to speak
standard versions of SCSI are legion.





encapsulating an ATA DEVICE IDENTIFY command, this isn't an issue.
 Even if such SATL's exist, they can easily be blacklisted.
3. This isn't hard, a SATL which actually supports ATA pass through
will almost always pass through the mode page unmodified.


You mean the ATA Information VPD page?  Yes, that's feasible because we
already queried the supported VPD pages, so we can tell if this one's
there.
I kind of got my terminology confused here, and didn't proof-read 
properly.  I'm not sure exactly what I was trying to refer to 
originally, but what I meant was that pretty much all UAS SATL's I've 
seen that support ATA pass through either have a proper ATA Information 
VPD page, or properly pass through ATA DEVICE IDENTIFY and related commands.



On the note of UAS SATL's, all of them that I've seen fall into one
of four categories:
1. Supports one or both of ATA_12 or ATA_16 pass through, and
supports passing through ATA TRIM/DISCARD, but not SCSI UNMAP.
2. Supports one of ATA_12 or ATA_16 pass through, and does not
support passing through ATA TRIM/DISCARD or translating SCSI UNMAP.
 All

Re: TRIM/UNMAP/DISCARD via ATA Passthrough

2016-09-16 Thread Austin S. Hemmelgarn

On 2016-09-16 12:21, James Bottomley wrote:

On Fri, 2016-09-16 at 11:53 -0400, Austin S. Hemmelgarn wrote:

On 2016-09-16 07:16, Hannes Reinecke wrote:

On 09/15/2016 10:52 PM, Jason A. Donenfeld wrote:

Hi Martin,

On Thu, Sep 15, 2016 at 6:07 PM, Martin K. Petersen

But how do they signal that ATA passthrough is possible? Is
there an ATA Information VPD page? Is REPORT SUPPORTED
OPERATION CODES supported?

We need really solid discovery data before we can entertain
enabling something like this.


`sg_opcodes` said invalid request, so I think there isn't REPORT
SUPPORTED OPERATION CODES, and `sg_vpd -p ai` came up illegal
too.

However, sg_sat_identify worked reliably, which means a solid way
of probing this would be to send IDENTIFY DEVICE ATA via
SG_ATA_16 or SG_ATA_12.

Let me know and I can give you access to the hardware if you're
curious.


Sadly, that's not sufficient.
linux is not the only provider of an SATL (mpt3sas being the most
prominent other one).
And while they might support ATA_12/ATA_16, there is no indication
that you can pass DSM TRIM that way.

So it's better to not support it at all than to support it on
hardware we can reliably identify?

I get that having feature parity is a good thing, but the discussion
isn't about providing support for all SATL devices, it's specifically
about UAS connected SATL devices.  Last I checked, mpt3sas doesn't do
anything with UAS, which means it's kind of irrelevant WRT supporting
this for UAS devices.


We're getting a bit off topic on mptsas and its eccentric SATL.

The point is, you're asking for UAS devices which each have an internal
SATL which you say potentially doesn't support discard.  The three
problems we have are

   1. How do we identify if the UAS SATL doesn't support discard.  If it
  does, we really don't want to cause further SATL related issues by
  bypassing it, so we need a way of telling this.
   2. If the SATL doesn't support discard, will it reliably support the
  ATA_12 or ATA_16 pass through (and which one) .. we need a way of
  checking this because there are known SATLs that don't do pass
  through.
   3. How do we actually configure it?  Presumably if the SATL doesn't
  support discard, it also doesn't give us the useful mode page
  indications we use to configure TRIM, so we're going to have to do
  some pass through discovery as well.
I assume by 'discard' here you're referring to SCSI UNMAP, as anything 
that supports ATA_12 or ATA_16 pass through correctly will support ATA 
TRIM/DISCARD on drives that support it.  If that's the case, then:
1. If SCSI UNMAP fails, it doesn't support UNMAP.  This is of course 
non-trivial to verify safely (we pretty much have to assume it is 
supported if we have no clear indication it isn't, and then switch based 
on what happens the first time we try to use it).
2. Unless there are SATL's out there that write garbage to the device or 
die when sent an ATA_12 or ATA_16 pass through command encapsulating an 
ATA DEVICE IDENTIFY command, this isn't an issue.  Even if such SATL's 
exist, they can easily be blacklisted.
3. This isn't hard, a SATL which actually supports ATA pass through will 
almost always pass through the mode page unmodified.


On the note of UAS SATL's, all of them that I've seen fall into one of 
four categories:
1. Supports one or both of ATA_12 or ATA_16 pass through, and supports 
passing through ATA TRIM/DISCARD, but not SCSI UNMAP.
2. Supports one of ATA_12 or ATA_16 pass through, and does not support 
passing through ATA TRIM/DISCARD or translating SCSI UNMAP.  All devices 
I've seen that fit this will modify the ATA DEVICE IDENTIFY data so it 
doesn't report DISCARD support, or will simply return an error for 
DISCARD requests.  I haven't seen any like this that were manufactured 
after UAS became standardized.
3. Supports neither ATA_12 or ATA_16 pass through, and doesn't support 
UNMAP.
4. Like type 1, except it supports both pass through commands, and also 
properly translates SCSI UNMAP commands (I've only ever seen one of 
these, have no idea what chipset it had, and it was insanely expensive 
(upwards of 300 USD)).
All we really can do anything about is category 1.  Category 4 works 
with the current drivers, and we can't properly support it on category 2 
or 3.  All three devices I have right now are in category 1, I know a 
number of other people in a similar situation, and it sounds like Jason 
has at least one such device as well.  Given that Windows does this 
(I've confirmed this with a hardware USB analyzer I borrowed from a 
friend), and that I've not seen anything since the UAS spec was actually 
released that falls into category 2 (and if I understand the spec 
correctly, such a device is actually not compliant with it anyway), I 
think it's probably safe to do this in Linux and just base the check on:

1. UAS (not some other SCSI transport) without UNMAP support.
2. Supports ATA_12 or ATA_16 pass through.
3. ATA

Re: TRIM/UNMAP/DISCARD via ATA Passthrough

2016-09-16 Thread Austin S. Hemmelgarn

On 2016-09-16 07:16, Hannes Reinecke wrote:

On 09/15/2016 10:52 PM, Jason A. Donenfeld wrote:

Hi Martin,

On Thu, Sep 15, 2016 at 6:07 PM, Martin K. Petersen

But how do they signal that ATA passthrough is possible? Is there an ATA
Information VPD page? Is REPORT SUPPORTED OPERATION CODES supported?

We need really solid discovery data before we can entertain enabling
something like this.


`sg_opcodes` said invalid request, so I think there isn't REPORT
SUPPORTED OPERATION CODES, and `sg_vpd -p ai` came up illegal too.

However, sg_sat_identify worked reliably, which means a solid way of
probing this would be to send IDENTIFY DEVICE ATA via SG_ATA_16 or
SG_ATA_12.

Let me know and I can give you access to the hardware if you're curious.


Sadly, that's not sufficient.
linux is not the only provider of an SATL (mpt3sas being the most
prominent other one).
And while they might support ATA_12/ATA_16, there is no indication that
you can pass DSM TRIM that way.
So it's better to not support it at all than to support it on hardware 
we can reliably identify?


I get that having feature parity is a good thing, but the discussion 
isn't about providing support for all SATL devices, it's specifically 
about UAS connected SATL devices.  Last I checked, mpt3sas doesn't do 
anything with UAS, which means it's kind of irrelevant WRT supporting 
this for UAS devices.


It's pretty easy to tell that something is a UAS device (the uas driver 
wouldn't be bound to it otherwise), so if we check that and then check 
whether or not IDENTIFY DEVICE ATA works when sent via SG_ATA_16 or 
SG_ATA_12, it should be relatively safe (ignoring of course the fact 
that there will inevitably be some brain-dead hardware that for some 
obscure reason translates the command into something that will corrupt 
data).  I've got three USB 3.0 UAS SATA adapters (all ASMedia branded 
chips) that behave pretty much identically to what Jason is describing, 
so it appears that at least one brand behaves this way in a reliable and 
reproducible manner.
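
A concrete version of that probe, roughly what sg_sat_identify already does,
could look like the sketch below (untested; /dev/sdX is a placeholder): wrap
IDENTIFY DEVICE in an ATA PASS-THROUGH (16) CDB, send it with SG_IO, and check
IDENTIFY word 169 bit 0 for DSM TRIM support:

/* Probe a SATL for ATA_16 pass-through and TRIM support in IDENTIFY data. */
#include <fcntl.h>
#include <scsi/sg.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    unsigned char cdb[16] = { 0 }, sense[32] = { 0 }, id[512] = { 0 };
    struct sg_io_hdr hdr;
    int fd = open(argc > 1 ? argv[1] : "/dev/sdX", O_RDONLY | O_NONBLOCK);

    if (fd < 0) { perror("open"); return 1; }

    cdb[0]  = 0x85;       /* ATA PASS-THROUGH (16) */
    cdb[1]  = 4 << 1;     /* protocol: PIO Data-In */
    cdb[2]  = 0x0e;       /* T_DIR=to host, BYT_BLOK=1, T_LENGTH=sector count */
    cdb[6]  = 1;          /* sector count: one 512-byte block of IDENTIFY data */
    cdb[14] = 0xec;       /* ATA command: IDENTIFY DEVICE */

    memset(&hdr, 0, sizeof(hdr));
    hdr.interface_id    = 'S';
    hdr.cmd_len         = sizeof(cdb);
    hdr.cmdp            = cdb;
    hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    hdr.dxferp          = id;
    hdr.dxfer_len       = sizeof(id);
    hdr.sbp             = sense;
    hdr.mx_sb_len       = sizeof(sense);
    hdr.timeout         = 5000;   /* ms */

    if (ioctl(fd, SG_IO, &hdr) < 0 ||
        hdr.status || hdr.host_status || hdr.driver_status) {
        fprintf(stderr, "SATL did not accept ATA_16 pass-through\n");
        close(fd);
        return 1;
    }

    /* IDENTIFY data is little-endian 16-bit words; word 169 bit 0 is
     * "DATA SET MANAGEMENT (TRIM) supported". */
    uint16_t w169 = id[2 * 169] | (id[2 * 169 + 1] << 8);
    printf("ATA_16 pass-through works, TRIM %ssupported\n", (w169 & 1) ? "" : "not ");
    close(fd);
    return 0;
}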


Re: [PATCH] logfs: remove from tree

2016-09-12 Thread Austin S. Hemmelgarn

On 2016-09-12 02:55, Artem Bityutskiy wrote:

On Sun, 2016-09-11 at 15:04 +0200, Christoph Hellwig wrote:

Logfs was introduced to the kernel in 2009, and hasn't seen any non
drive-by changes since 2012, while having lots of unsolved issues
including the complete lack of error handling, with more and more
issues popping up without any fixes.

The logfs.org domain has been bouncing mail, and the maintainer
on the non-logfs.org domain hasn't responded to past queries either.

Signed-off-by: Christoph Hellwig 


Back in 2008 logfs and UBIFS were in sort of competing projects. I
remember we inspected logfs code and tested it - we did not find proper
wear-levelling and bad block handling, we did not see proper error
handling, and it exploded when we were running relatively simple tests.
We indicated this here in a very humble way to avoid the "conflict of
interest" perseption:

https://lkml.org/lkml/2008/3/31/117

I did not follow logfs since then, but I think there wasn't much
development since then and all these issue are still there. I mean,
unless I am horribly mistaken, logfs does not really have the basic
features of a flash file system and there is no point keeping it in the
tree and consuming people's time maintaining it.

FWIW, I tried testing it about a year ago, and got similar results both 
from the tests and from trying to contact the maintainer.




Re: [Documentation] State of CPU controller in cgroup v2

2016-09-12 Thread Austin S. Hemmelgarn

On 2016-09-09 18:57, Tejun Heo wrote:

Hello, again.

On Mon, Sep 05, 2016 at 10:37:55AM -0700, Andy Lutomirski wrote:

* It doesn't bring any practical benefits in terms of capability.
  Userland can trivially handle the system-root and namespace-roots in
  a symmetrical manner.


Your idea of "trivially" doesn't match mine.  You gave a use case in


I suppose I wasn't clear enough.  It is trivial in the sense that if
the userland implements something which works for namespace-root, it
would work the same in system-root without further modifications.


which userspace might take advantage of root being special.  If


I was emphasizing the cases where userspace would have to deal with
the inherent differences, and, when they don't, they can behave
exactly the same way.


userspace does that, then that userspace cannot be run in a container.
This could be a problem for real users.  Sure, "don't do that" is a
*valid* answer, but it's not a very helpful answer.


Great, now we agree that what's currently implemented is valid.  I
think you're still failing to recognize the inherent specialness of
the system-root and how much unnecessary pain the removal of the
exemption would cause at virtually no practical gain.  I won't repeat
the same backing points here.


* It's an unnecessary inconvenience, especially for cases where the
  cgroup agent isn't in control of boot, for partial usage cases, or
  just for playing with it.

You say that I'm ignoring the same use case for namespace-scope but
namespace-roots don't have the same hybrid function for partial and
uncontrolled systems, so it's not clear why there even NEEDS to be
strict symmetry.


I think their functions are much closer than you think they are.  I
want a whole Linux distro to be able to run in a container.  This
means that useful things people do in a distro or initramfs or
whatever should just work if containerized.


There isn't much which is getting in the way of doing that.  Again,
something which follows no-internal-task rule would behave the same no
matter where it is.  The system-root is different in that it is exempt
from the rule and thus is more flexible but that difference is serving
the purpose of handling the inherent specialness of the system-root.
AFAICS, it is the solution which causes the least amount of contortion
and unnecessary inconvenience to userland.


It's easy and understandable to get hangups on asymmetries or
exemptions like this, but they also often are acceptable trade-offs.
It's really frustrating to see you first getting hung up on "this must
be wrong" and even after explanations repeating the same thing just in
different ways.

If there is something fundamentally wrong with it, sure, let's fix it,
but what's actually broken?


I'm not saying it's fundamentally wrong.  I'm saying it's a design


You were.


that has a big wart, and that wart is unfortunate, and after thinking
a bit, I'm starting to agree with PeterZ that this is problematic.  It
also seems fixable: the constraint could be relaxed.


You've been pushing for enforcing the restriction on the system-root
too and now are jumping to the opposite end.  It's really frustrating
that this is such a whack-a-mole game where you throw ideas without
really thinking through them and only concede the bare minimum when
all other logical avenues are closed off.  Here, again, you seem to be
stating a strong opinion when you haven't fully thought about it or
tried to understand the reasons behind it.

But, whatever, let's go there: Given the arguments that I laid out for
the no-internal-tasks rule, how does the problem seem fixable through
relaxing the constraint?


Also, here's an idea to maybe make PeterZ happier: relax the
restriction a bit per-controller.  Currently (except for /), if you
have subtree control enabled you can't have any processes in the
cgroup.  Could you change this so it only applies to certain
controllers?  If the cpu controller is entirely happy to have
processes and cgroups as siblings, then maybe a cgroup with only cpu
subtree control enabled could allow processes to exist.


The document lists several reasons for not doing this and also that
there is no known real world use case for such configuration.


So, up until this point, we were talking about no-internal-tasks
constraint.


Isn't this the same thing?  IIUC the constraint in question is that,
if a non-root cgroup has subtree control on, then it can't have
processes in it.  This is the no-internal-tasks constraint, right?


Yes, that is what no-internal-tasks rule is but I don't understand how
that is the same thing as process granularity.  Am I completely
misunderstanding what you are trying to say here?
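
As an illustration of the constraint being discussed, here is a rough C sketch
(untested; it assumes a cgroup2 mount at /sys/fs/cgroup, root privileges, and a
controller such as memory already listed in the parent's cgroup.controllers;
memory is used only because cpu is the controller this thread is about not yet
having in v2):

/* The no-internal-tasks rule from userspace: a populated non-root cgroup
 * cannot enable a domain controller for its children; move its processes
 * into a leaf child first and the same write goes through. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_str(const char *path, const char *s)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, s, strlen(s));
    close(fd);
    return n < 0 ? -1 : 0;
}

int main(void)
{
    char pid[32];

    snprintf(pid, sizeof(pid), "%d\n", getpid());
    mkdir("/sys/fs/cgroup/demo", 0755);

    /* Put ourselves into the new cgroup, making it hold an internal process. */
    write_str("/sys/fs/cgroup/demo/cgroup.procs", pid);

    /* The kernel rejects this while 'demo' has member processes. */
    if (write_str("/sys/fs/cgroup/demo/cgroup.subtree_control", "+memory\n") < 0)
        perror("enable +memory while cgroup has processes");

    /* Move the process into a leaf child, then the same write succeeds. */
    mkdir("/sys/fs/cgroup/demo/leaf", 0755);
    write_str("/sys/fs/cgroup/demo/leaf/cgroup.procs", pid);
    if (write_str("/sys/fs/cgroup/demo/cgroup.subtree_control", "+memory\n") == 0)
        printf("+memory enabled once 'demo' holds only child cgroups\n");
    return 0;
}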


And I still think that, at least for cpu, nothing at all goes wrong if
you allow processes to exist in cgroups that have cpu set in
subtree-control.


If you confine it to the cpu controller, ignore anonymous
consumptions, the rather ugly mapping between nice and weight values
and the 

Re: bcache vs bcachefs

2016-09-07 Thread Austin S. Hemmelgarn

On 2016-09-06 20:55, Kent Overstreet wrote:

On Tue, Sep 06, 2016 at 11:46:28AM +0200, Harald Dunkel wrote:

Hi folks,

I am pretty hesitant replacing the rock-solid ext4 by bcachefs on my servers.
Meaning no offense, but surely I would prefer to have ext4 with a thin "SSD
caching layer" over a completely different filesystem, potentially with alot
of teething troubles.

Question: Is bcache EOL or can I rely on it for the next 5 to 10 years?


bcache is not EOL - it's still receiving bugfixes.

That said though, there's no reason to expect a long teething period with
bcachefs, it's already more reliable than btrfs in single device mode.

I'd be curious to see any actual data you have to back that up, 
especially regarding what kernel and userspace were involved with the 
BTRFS testing.


Re: Who reordered my disks (probably v4.8-rc1 problem)

2016-08-15 Thread Austin S. Hemmelgarn

On 2016-08-13 15:30, Pavel Machek wrote:

Hi!

It seems that in v4.8-rc0, /dev/sdX got reordered, and now USB devices
are probed before SATA drivers. That is pretty anti-social. It
broke my boot on my primary machine, and unfortunately due to BIOS
problems (keyboard does not work when connected through a hub) it is
less fun than it should be.

The simple solution (which I'm surprised nobody else has suggested yet) 
is to build everything but the driver for your root device as a module. 
This will prevent anything else from getting enumerated first, and works 
regardless of whether or not you're using an initrd and independently of 
what root= format you use.




Re: [PATCH 0002/1285] Replace numeric parameter like 0444 with macro

2016-08-02 Thread Austin S. Hemmelgarn

On 2016-08-02 06:33, Baole Ni wrote:

I find that developers often just specify the numeric value
when calling a macro which is defined with a parameter for access permission.
As we know, these numeric values for access permission have corresponding macros,
and using the macros can improve the robustness and readability of the code,
thus, I suggest replacing the numeric parameters with the macros.
Have you taken the time to consider _why_ this is common practice?  In 
most cases, people who have been using UNIX-like systems for any 
reasonable period of time can mentally parse the octal permissions 
almost instantly, while it often takes them a while to figure out the 
macros.  So, overall, this change actually hurts readability for a large 
number of people.


Additionally, the argument of robustness is somewhat bogus too, the 
internal representation of POSIX DAC permissions is not going to change 
any time soon, if at all, so abstracting the values away isn't really 
improving robustness considering that what it's supposed to protect 
against won't ever happen.
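
(For reference, here is a small userspace sketch of the equivalence being
argued about.  S_IRUGO is a kernel-internal shorthand, so it is reconstructed
from the POSIX macros here rather than taken from a header; everything else
is standard <sys/stat.h>.)

/* Compile-time check that the octal modes and the symbolic spellings
 * name exactly the same permission bits. */
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

#define S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)   /* kernel shorthand, 0444 */

int main(void)
{
        static_assert(S_IRUGO == 0444, "0444 is world-readable");
        static_assert((S_IWUSR | S_IRUGO) == 0644, "0644 is rw-r--r--");
        static_assert((S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH) == 0755,
                      "0755 is rwxr-xr-x");
        printf("octal modes and macro spellings match\n");
        return 0;
}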


Beyond that, there are also a number of other issues with the set as a 
whole:
1. In many places, you haven't properly re-wrapped lines at the 80 
column limit.
2. In many places, you haven't condensed macros like you should (for 
example, 0644 should be (S_IWUSR | S_IRUGO), not (S_IWUSR | S_IRUSR | 
S_IRGRP | S_IROTH), and this would help with the line length limit too).

3. You're missing subsystem tags in your patch subjects.
4. Patches shouldn't be per-file for changes like this, they should be 
per-subsystem.  Dropping this many patches (you've sent more patches in 
a few hours than the whole list sees in a day on average) is a very good way 
to get people to ignore you.




Re: [PATCH 00/14] Present useful limits to user (v2)

2016-07-18 Thread Austin S. Hemmelgarn

On 2016-07-15 16:54, H. Peter Anvin wrote:

On July 15, 2016 6:59:56 AM PDT, Peter Zijlstra  wrote:

On Fri, Jul 15, 2016 at 01:52:48PM +, Topi Miettinen wrote:

On 07/15/16 12:43, Peter Zijlstra wrote:

On Fri, Jul 15, 2016 at 01:35:47PM +0300, Topi Miettinen wrote:

Hello,

There are many basic ways to control processes, including capabilities,
cgroups and resource limits. However, there are far fewer ways to find out
useful values for the limits, except blind trial and error.

This patch series attempts to fix that by giving at least a nice starting
point from the highwater mark values of the resources in question.
I looked where each limit is checked and added a call to update the mark
nearby.


And how is that useful? Setting things to the high watermark is
basically the same as not setting the limit at all.


What else would you use, too small limits?


That question doesn't make sense.

What's the point of setting a limit if it ends up being the same as
no-limit (aka unlimited).

If you cannot explain; and you have not so far; what use these values
are, why would we look at the patches.


One reason is to catch a malfunctioning process rather than dragging the whole 
system down with it.  It could also be useful for development.

Additionally, there are quite a few applications which don't gracefully 
handle memory allocation or process creation failures, either hanging, 
constantly retrying, or just dying when this happens.  For such an 
application, you have to set the limit to the high watermark if you want 
them limited at all, otherwise they don't work.  A classic example of 
this is the official client for Dropbox.  If it can't start up all the 
insane number of threads it thinks it needs, then it just hangs. 
However, it's also a network service, and therefore is a reasonable 
target for hackers, so it makes sense to try and limit it.  I've run 
into similar issues with quite a few 'desktop' services, both open and 
closed source.


Looking at this another way, this is most useful for things that have a 
deterministic maximum resource usage under regular use, not something 
like a forking server which has a functionally unbounded maximum 
resource usage.
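
(To make the "observe, add headroom, then limit" pattern from the discussion
concrete, here is a minimal userspace sketch.  The choice of RLIMIT_NOFILE and
the 64-descriptor headroom are illustrative assumptions.  Note that counting
/proc/self/fd only gives current usage, not a true high-water mark -- which is
precisely the gap this patch series is trying to close by exporting the
kernel's own peak values.)

/* Count currently open descriptors, then pin the soft RLIMIT_NOFILE
 * just above that, keeping the existing hard limit. */
#include <dirent.h>
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        DIR *d = opendir("/proc/self/fd");
        struct dirent *de;
        rlim_t used = 0;

        if (!d) {
                perror("opendir");
                return 1;
        }
        while ((de = readdir(d)) != NULL)
                if (de->d_name[0] != '.')       /* skip "." and ".." */
                        used++;
        closedir(d);

        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
                perror("getrlimit");
                return 1;
        }
        rlim_t want = used + 64;                /* observed usage plus headroom */
        rl.rlim_cur = (want < rl.rlim_max) ? want : rl.rlim_max;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
                perror("setrlimit");
        else
                printf("RLIMIT_NOFILE soft limit set to %llu (%llu in use)\n",
                       (unsigned long long)rl.rlim_cur,
                       (unsigned long long)used);
        return 0;
}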


Re: [PATCH v4 0/5] /dev/random - a new approach

2016-06-22 Thread Austin S. Hemmelgarn

On 2016-06-22 01:16, Stephan Mueller wrote:

Am Dienstag, 21. Juni 2016, 15:31:07 schrieb Austin S. Hemmelgarn:

Hi Austin,


Little data, interesting statement for results on 200+ systems including
all major CPU arches all showing information leading in the same
directions.

Let me try rephrasing this to make it a bit clearer:
1. You have lots of data on server systems.
2. You have a significant amount of data on desktop/workstation type
systems.
3. You have very little data on embedded systems.

and here are your arguments:
A. This works well on server systems.
B. This works well on desktop systems.
C. This works well on embedded systems.

Arguments A and B are substantiated directly by points 1 and 2.
Argument C is not substantiated thoroughly because of point 3.
My complaint is about argument C given point 3.


Then let me rephrase what I try to say: my RNG rests on the intrinsic
functionality of CPUs. When I show that such intrinsic behavior is present in
various architectures I show that there is a common ground for the basis of
the RNG.
You're not demonstrating it's inherently present in the architecture, 
you're demonstrating that it's present for those particular 
micro-architectures you have tested.  You're dealing with effects 
resulting from such low-level details of the CPU that you can't assume 
it happens for the whole architecture.  In fact, your own results 
regarding the weak values from Pentium Celeron Mobile system reinforce 
that it's not an architectural effect at the ISA level, but results from 
low level designs.  Given the naming of that particular CPU, it's almost 
certainly a P6 or NetBurst micro-arch, neither of which had particularly 
complex branch-prediction or pipelining or similar things, and more 
significantly, probably did not have HyperThreading, which I'd be 
willing to bet is a significant contributing factor on the Xeon and Core 
processors you've got results for.


I tested on all CPUs of all large scale architectures (including the
architectures that are commonly used for embedded devices) and demonstrated
that the fundamental phenomenon the RNG rests on is present in all
architectures.
In all architectures you've tested.  Note that technically, from a 
low-level perspective of something like this, different revisions of an 
ISA are different architectures, and when you start talking about 
licensed IP cores like ARM and PPC (and MIPS, and SPARC), different 
manufacturer's designs almost count separately.  You're relying on 
complexities inherent in the CPU itself, which will be different between 
micro-architectures, and possibly even individual revisions of a 
specific model number.  Just because the test gives good results on an 
ARMv6 or ARMv7 does not mean it will on an ARMv4, because there are 
enough differences in typical designs of ARM CPU's that you can't 
directly draw conclusions based on such a small sample size (you've 
tested at most 4 manufacturer's designs, and even then only one from 
each, and there are about 10 different companies making ARM chips, each 
selling dozens of slightly different CPU's).


I do not care about the form factor of the test system server, desktop or
embedded systems nor do I care about the number of attached devices -- the
form factor and number of attached devices is the differentiator of what you
call embedded vs server vs desktop.
I don't care about form factor, I care about the CPU, and embedded 
systems almost always have simpler CPU designs (not including all the 
peripherals they love to add in on-die).  Your own analysis indicates 
that your getting entropy from the complex interaction of the different 
parts of the CPU.  Such complexities are less present in simpler CPU 
designs.
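
(As a rough illustration of the execution-time jitter being argued about --
this is not Stephan's Jitter RNG, just a toy measurement assuming only that
CLOCK_MONOTONIC is available -- timing a fixed chunk of work repeatedly and
looking at the spread of the deltas gives a feel for how much variation a
given CPU actually produces, which is exactly what differs between simple and
complex micro-architectures.)

/* Toy jitter measurement: how much does the duration of fixed work vary? */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t now_ns(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
        enum { ROUNDS = 1000 };
        uint64_t min = UINT64_MAX, max = 0;
        volatile uint64_t sink = 0;

        for (int i = 0; i < ROUNDS; i++) {
                uint64_t start = now_ns();
                for (int j = 0; j < 4096; j++)  /* fixed amount of work */
                        sink += (uint64_t)j * 2654435761u;
                uint64_t delta = now_ns() - start;
                if (delta < min) min = delta;
                if (delta > max) max = delta;
        }
        /* The min..max spread is a (very coarse) view of the jitter. */
        printf("fixed work took %llu..%llu ns over %d rounds\n",
               (unsigned long long)min, (unsigned long long)max, ROUNDS);
        return 0;
}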


Heck, I have written a test that executes the RNG on bare metal (without OS
and with only a keyboard as device present -- i.e no interrupts are received
apart from a keyboard), which demonstrates that the phenomenon is present.

Furthermore, chapter 6 of my document analyzes the root cause of the RNG and
here you see clearly that it has nothing to do with the size of the CPU or its
attached devices or the size of RAM.
And yet averages are higher for systems that have more CPU cores, and 
thus more complex CPU's.  Prime examples of this are the UltraSPARC 
CPU's you've tested.  Those have more SMP cores (and threads of 
execution) than just about anything else you've listed, and they have 
significantly higher values than almost anything else you list.


The massive number of x86 tests shall demonstrate the common theme I see: the
newer the CPU the larger the phenomenon is the RNG rests on.
Except each iteration of x86 grows more complex branch-prediction and 
pipe-lining and other tricks to try and make the CPU process data 
faster.  That is the source of almost all of the increase you're seeing in 
entropy for newer revisions.  The same is not inherently true of 
embedded processors, especially ones designed for hard-real-time usage

Re: [PATCH v4 0/5] /dev/random - a new approach

2016-06-21 Thread Austin S. Hemmelgarn

On 2016-06-21 14:04, Stephan Mueller wrote:

Am Dienstag, 21. Juni 2016, 13:51:15 schrieb Austin S. Hemmelgarn:

6. You have a significant lack of data regarding embedded systems, which
is one of the two biggest segments of Linux's market share.  You list no
results for any pre-ARMv6 systems (Linux still runs on and is regularly
used on ARMv4 CPU's, and it's worth also pointing out that the values on
the ARMv6 systems are themselves below average), any MIPS systems other
than 24k and 4k (which is not a good representation of modern embedded
usage), any SPARC CPU's other than UltraSPARC (ideally you should have
results on at least a couple of LEON systems as well), no tight-embedded
PPC chips (PPC 440 processors are very widely used, as are the 7xx and
970 families, and Freescale's e series), and only one set of results for
a tight-embedded x86 CPU (the Via Nano, you should ideally also have
results on things like an Intel Quark).  Overall, your test system
selection is not entirely representative of actual Linux usage (yeah,
there's a lot of x86 servers out there running Linux, there's at least
as many embedded systems running it too though, even without including
Android).


Perfectly valid argument. But I programmed that RNG as a hobby -- I do not
have the funds to buy all devices there are.


I'm not complaining as much about the lack of data for such devices as I
am about you stating that it will work fine for such devices when you
have so little data to support those claims.  Many of the devices you


Little data, interesting statement for results on 200+ systems including all
major CPU arches all showing information leading in the same directions.


Let me try rephrasing this to make it a bit clearer:
1. You have lots of data on server systems.
2. You have a significant amount of data on desktop/workstation type 
systems.

3. You have very little data on embedded systems.

and here are your arguments:
A. This works well on server systems.
B. This works well on desktop systems.
C. This works well on embedded systems.

Arguments A and B are substantiated directly by points 1 and 2. 
Argument C is not substantiated thoroughly because of point 3.

My complaint is about argument C given point 3.

I'm not saying you have insufficient data to support argument A or B, 
only that you have insufficient data to support argument C.


Android barely counts as an embedded system anymore, as many Android 
phones can outperform most inexpensive desktop and laptop systems, and 
even some rather expensive laptops.  This leaves the only systems that 
can be assumed without further information to be representative of 
embedded boards to be the ones running Genode, and possibly the MIPS 
systems, which is a total of about 10 results out of hundreds for 
servers and desktops/workstations.


Re: [PATCH v4 0/5] /dev/random - a new approach

2016-06-21 Thread Austin S. Hemmelgarn

On 2016-06-21 09:19, Tomas Mraz wrote:

On Út, 2016-06-21 at 09:05 -0400, Austin S. Hemmelgarn wrote:

On 2016-06-20 14:32, Stephan Mueller wrote:


[1] http://www.chronox.de/jent/doc/CPU-Jitter-NPTRNG.pdf

Specific things I notice about this:
1. QEMU systems are reporting higher values than almost anything else
with the same ISA.  This makes sense, but you don't appear to have
accounted for the fact that you can't trust almost any of the entropy in
a VM unless you have absolute trust in the host system, because the host
system can do whatever the hell it wants to you, including manipulating
timings directly (with a little patience and some time spent working on
it, you could probably get those numbers to show whatever you want just
by manipulating scheduling parameters on the host OS for the VM
software).


You have to trust the host for anything, not just for the entropy in
timings. This is completely invalid argument unless you can present a
method that one guest can manipulate timings in other guest in such a
way that _removes_ the inherent entropy from the host.
When dealing with almost any type 2 hypervisor, it is fully possible for 
a user other than the one running the hypervisor to manipulate 
scheduling such that entropy is reduced.  This does not imply that the 
user who is doing this has any other control over the target VM, and 
importantly, often does not require administrative access on the host, 
only regular user access.  Such an attack is very difficult to effect 
outside of a clean-room environment, but is still possible.  You can't 
use this to force generation of arbitrary data, but you can definitely 
starve a VM for entropy.  By nature, something that relies on interrupt 
timings will be more impacted by such an attack than something that does 
not.


In most cases, such an attack will be a DoS attack on the host as well 
(as that's the simplest way to do this).  This is less of an issue with 
proper practices on a type 1 hypervisor, but is still possible there too 
(although pulling this off on at least Xen when you have proper VCPU 
isolation is functionally impossible without administrative access to 
the control domain).


Re: [PATCH v5 0/7] /dev/random - a new approach

2016-06-21 Thread Austin S. Hemmelgarn

On 2016-06-21 12:28, Stephan Mueller wrote:

Am Dienstag, 21. Juni 2016, 12:03:56 schrieb Austin S. Hemmelgarn:

Hi Austin,


On 2016-06-21 03:32, Stephan Mueller wrote:

Am Dienstag, 21. Juni 2016, 09:12:07 schrieb Nikos Mavrogiannopoulos:

Hi Nikos,


On Mon, Jun 20, 2016 at 5:43 PM, Stephan Mueller <smuel...@chronox.de>


wrote:

Personally, I don't really use /dev/random, nor would I recommend it
for most application programmers.  At this point, getrandom(2) really
is the preferred interface unless you have some very specialized
needs.


I fully agree. But there are use cases for /dev/random, notably as a
seed
source for other DRNG.


Is that really the case? I believe all DRNG's use /dev/urandom anyway
for seeding since they cannot afford indeterminate blocking. It would
be a gain for everyone if /dev/random was the same as /dev/urandom in
Linux.


For standard approaches, this is true. But there are regulations, notably
in the German realm, /dev/random shall be used, at least partially (see
AIS 20/31).


Which just goes to show how utterly stupid some people who write laws
and regulations are.  Saying specifically that '/dev/random shall be
used' does not enforce any improvement of entropic value in the data at
all, it just coincidentally improves the theoretical quality of the data
because of how Linux and some legacy UNIX systems are designed.  This
'regulation' already provides zero benefit other than a placebo effect
on at least OpenBSD, FreeBSD, and I'm pretty certain most other BSD
derivatives, as /dev/random and /dev/urandom point to the same thing
there (on OpenBSD it's an arcfour based drbg, FreeBSD does similar but
uses a CSPRNG called Fortuna), and while I'm not certain, I believe AIX
does likewise (although they use a design based on yarrow).


It is NOT utterly stupid, you should listen to the reasons.
I'm not commenting about them wanting cryptographically secure entropy. 
My statement about the intelligence of the people who wrote the standard 
was commenting on their saying '/dev/random shall be used' instead of 'a 
cryptographically secure entropy source shall be used'.  POSIX requires 
that the first statement imply the second, but the second does not 
require the first.  Their choice to use the first indicates that they 
assume this will be the only implementation that will ever matter, and 
that POSIX systems will be the only thing anyone using the standard 
cares about, both of which are inherently poor assumptions outside of a 
completely closed system.  Most standards proposed or mandated by 
bureaucracies have similar issues making bad assumptions about the 
context in which they will be used.



The core reason is
explained with the following analogy:

/dev/urandom, for a short period of time, operates as a DRNG. For the sake of
discussion, let us assume that it is a DRNG which is, for the sake of
discussion, seeded with 256 bits of entropy.

Now, you pull 256 bits out of it, you have 256 bits of entropy.

You pull again 256 bit out of it -- what entropy do you have? You do NOT have
*again* 256 bits of entropy. All you can say is the entire generated 512 bits
have collectively 256 bits of entropy. And so on and so forth.

Now, if you want to say that your newly spawned DRNG is seeded with X amount
of entropy, the DRNG nature of /dev/urandom makes such a statement a
challenge. The easy way out of the challenge is to use /dev/random (I am aware
of the fact that the DRNG has a computational strength, but it is not, well,
entropy).

In addition, when using /dev/urandom, you have to show that at the time you
seed the DRNG, it is fully seeded. That is a challenge at boot time - a
challenge this entire thread revolves around. Yes, getrandom(2) is the way
out, but not everybody uses it. Again, /dev/random does NOT have this
challenge.
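
(Restating the accounting argument above a bit more compactly -- this is an
editorial formalization, not something from the thread: with a deterministic
generator G seeded by S, any amount of output can never carry more entropy
than the seed did.)

\[
  H(O_1, O_2, \ldots, O_n) \;\le\; H(S) = 256 \text{ bits},
  \qquad O_i = G_i(S) \text{ for deterministic } G,
\]

so the first 256-bit block may be worth up to 256 bits of entropy, but every
further block adds nothing beyond that, which is the point being made about
re-seeding from /dev/urandom.
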
The fact that that's how /dev/urandom works is a result of the 
implementation, I already listed at least 3 UNIX derivatives that do not 
work this way and provide cryptographically secure entropy from the same 
source stream for both /dev/random and /dev/urandom.  Changing this on 
Linux would not break backwards compatibility as long as we provide 
sufficient entropy via /dev/random, because nobody depends on it 
blocking for anything other than ensuring they get good entropy, so if 
it always returns good cryptographically secure entropy, we never need 
to block unless the generator hasn't yet been seeded.  Now, on top of 
that, there's still no absolute guarantee that what you get from 
/dev/random is actually cryptographically secure, but that's a separate 
issue.
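
(Since getrandom(2) keeps coming up as the preferred interface for exactly
this seeding problem, here is a minimal sketch of using it to seed a
user-space DRNG.  The 32-byte seed size is an illustrative choice; with flags
set to 0 the call draws from the urandom pool but blocks until that pool has
been initialized, which sidesteps the "is it seeded yet at boot" question
without inheriting /dev/random's blocking behavior.  On glibc older than 2.25
the wrapper is missing and syscall(SYS_getrandom, ...) is needed instead.)

/* Pull seed material for a user-space DRBG/DRNG via getrandom(2). */
#include <stdio.h>
#include <sys/random.h>

int main(void)
{
        unsigned char seed[32];
        ssize_t got = getrandom(seed, sizeof(seed), 0);

        if (got != (ssize_t)sizeof(seed)) {
                perror("getrandom");
                return 1;
        }
        /* ... feed 'seed' into whatever DRNG is being instantiated ... */
        printf("got %zd bytes of seed material\n", got);
        return 0;
}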


Re: [PATCH v4 0/5] /dev/random - a new approach

2016-06-21 Thread Austin S. Hemmelgarn

On 2016-06-21 13:23, Stephan Mueller wrote:

Am Dienstag, 21. Juni 2016, 13:18:33 schrieb Austin S. Hemmelgarn:

Hi Austin,


You have to trust the host for anything, not just for the entropy in
timings. This is completely invalid argument unless you can present a
method that one guest can manipulate timings in other guest in such a
way that _removes_ the inherent entropy from the host.


When dealing with almost any type 2 hypervisor, it is fully possible for
a user other than the one running the hypervisor to manipulate
scheduling such that entropy is reduced.  This does not imply that the


Please re-read the document: Jitter RNG does not rest on scheduling.
If you are running inside a VM, your interrupt timings depend on the 
hypervisor's scheduling, period.  You may not directly rely on 
scheduling from the OS you are running on, but if you are doing anything 
timing related in a VM, you are at the mercy of the scheduling used by 
the hypervisor and whatever host OS that may be running on.


In the attack I'm describing, the malicious user is not manipulating the 
guest OS's scheduling, they are manipulating the host system's scheduling.




Re: [PATCH v4 0/5] /dev/random - a new approach

2016-06-21 Thread Austin S. Hemmelgarn

On 2016-06-21 09:20, Stephan Mueller wrote:

Am Dienstag, 21. Juni 2016, 09:05:55 schrieb Austin S. Hemmelgarn:

Hi Austin,


On 2016-06-20 14:32, Stephan Mueller wrote:

Am Montag, 20. Juni 2016, 13:07:32 schrieb Austin S. Hemmelgarn:

Hi Austin,


On 2016-06-18 12:31, Stephan Mueller wrote:

Am Samstag, 18. Juni 2016, 10:44:08 schrieb Theodore Ts'o:

Hi Theodore,


At the end of the day, with these devices you really badly need a
hardware RNG.  We can't generate randomness out of thin air.  The only
thing you really can do requires user space help, which is to generate
keys lazily, or as late as possible, so you can gather as much entropy
as you can --- and to feed in measurements from the WiFi (RSSI
measurements, MAC addresses seen, etc.)  This won't help much if you
have an FBI van parked outside your house trying to carry out a
TEMPEST attack, but hopefully it provides some protection against a
remote attacker who isn't try to carry out an on-premises attack.


All my measurements on such small systems like MIPS or smaller/older
ARMs
do not seem to support that statement :-)


Was this on real hardware, or in a virtual machine/emulator?  Because if
it's not on real hardware, you're harvesting entropy from the host
system, not the emulated one.  While I haven't done this with MIPS or
ARM systems, I've taken similar measurements on SPARC64, x86_64, and
PPC64 systems comparing real hardware and emulated hardware, and the
emulated hardware _always_ has higher entropy, even when running the
emulator on an identical CPU to the one being emulated and using KVM
acceleration and passing through all the devices possible.

Even if you were testing on real hardware, I'm still rather dubious, as
every single test I've ever done on any hardware (SPARC, PPC, x86, ARM,
and even PA-RISC) indicates that you can't harvest entropy as
effectively from a smaller CPU compared to a large one, and this effect
is significantly more pronounced on RISC systems.


It was on real hardware. As part of my Jitter RNG project, I tested all
major CPUs from small to big -- see Appendix F [1]. For MIPS/ARM, see the
trailing part of the big table.

[1] http://www.chronox.de/jent/doc/CPU-Jitter-NPTRNG.pdf


Specific things I notice about this:
1. QEMU systems are reporting higher values than almost anything else
with the same ISA.  This makes sense, but you don't appear to have
accounted for the fact that you can't trust almost any of the entropy in
a VM unless you have absolute trust in the host system, because the host
system can do whatever the hell it wants to you, including manipulating
timings directly (with a little patience and some time spent working on
it, you could probably get those numbers to show whatever you want just
by manipulating scheduling parameters on the host OS for the VM software).


I am not sure where you see QEMU systems listed there.
That would be the ones which list 'QEMU Virtual CPU version X.Y' as the 
CPU string.  The only things that return that in the CPUID data are 
either QEMU itself, or software that is based on QEMU.



2. Quite a few systems have a rather distressingly low lower bound and
still get accepted by your algorithm (a number of the S/390 systems, and
a handful of the AMD processors in particular).


I am aware of that, but please read the entire documentation where the lower
and upper boundary comes from and how the Jitter RNG really operates. There
you will see that the lower boundary is just that: it will not be lower, but
the common case is the upper boundary.
Talking about the common case is all well and good, but the lower bound 
still needs to be taken into account.  If the test results aren't 
uniformly distributed within that interval, or even following a typical 
Gaussian distribution within it (which is what I and many other people 
would probably assume without the data later in the appendix), then you 
really need to mention this _before_ the table itself.  Such information 
is very important, and not everyone has time to read everything.


Furthermore, the use case of the Jitter RNG is to support the DRBG seeding
with a very high reseed interval.


3. Your statement at the bottom of the table that 'all test systems at
least un-optimized have a lower bound of 1 bit' is refuted by your own
data, I count at least 2 data points where this is not the case.  One of
them is mentioned at the bottom as an outlier, and you have data to back
this up listed in the table, but the other (MIPS 4Kec v4.8) is the only
system of that specific type that you tested, and thus can't be claimed
as an outlier.


You are right, I have added more and more test results to the table without
updating the statement below. I will fix that.

But note, that there is a list below that statement providing explanations
already. So, it is just that one statement that needs updating.


4. You state the S/390 systems gave different results when run
un-optimized, but don't provide any data regarding

Re: [PATCH v4 0/5] /dev/random - a new approach

2016-06-21 Thread Austin S. Hemmelgarn

On 2016-06-21 09:42, Pavel Machek wrote:

Hi!


6. You have a significant lack of data regarding embedded systems, which is
one of the two biggest segments of Linux's market share.  You list no
results for any pre-ARMv6 systems (Linux still runs on and is regularly used
on ARMv4 CPU's, and it's worth also pointing out that the values on
the


Feel free to contribute more test results.

I mean... you can't expect every person who wants to improve something
in linux to test on everything in the world... can you?

I was commenting less on the lack of results for such systems than on 
attempts to make statements about such systems based on a data-set that 
lacks information about such systems.



