Re: Parent transid verify failed (and more): BTRFS for data storage in Xen VM setup

2021-04-11 Thread Adam Borowski
On Sun, Apr 11, 2021 at 12:10:34PM +0500, Roman Mamedov wrote:
> On Sat, 10 Apr 2021 17:06:22 -0600
> Chris Murphy  wrote:
> 
> > Right. The block device (partition containing the Btrfs file system)
> > must be exclusively used by one kernel, host or guest. Dom0 or DomU.
> > Can't be both.
> > 
> > The only exception I'm aware of is virtiofs or virtio-9p, but I
> > haven't messed with that stuff yet.
> 
> If you want an FS that allows a block device to be mounted by multiple
> machines at the same time, there are a few:
> https://en.wikipedia.org/wiki/Clustered_file_system#Shared-disk_file_system

All of those use some kind of lock manager, though.
So in no case can you simply mount the same device twice.

-- 
⢀⣴⠾⠻⢶⣦⠀ .--[ Makefile ]
⣾⠁⢠⠒⠀⣿⡁ # beware of races
⢿⡄⠘⠷⠚⠋⠀ all: pillage burn
⠈⠳⣄ `


Fwd: btrfs-progs: libbtrfsutil is under LGPL-3.0 and statically linked into btrfs

2021-03-17 Thread Adam Borowski
This is https://bugs.debian.org/985400

- Forwarded message from Claudius Heine  -

Dear Maintainer,

I looked into the licenses of the btrfs-progs project and found that the
libbtrfsutil library is licensed under LGPL-3.0-or-later [1].

The `copyright` file does not show this.

IANAL, but I think since `btrfs` (under GPL-2.0-only [2]) links to
`libbtrfsutil` statically, this might cause a license conflict. See [3].
This would be the part that might require upstream fixing.

regards,
Claudius

[1] https://github.com/kdave/btrfs-progs/blob/master/libbtrfsutil/btrfsutil.h
[2] https://github.com/kdave/btrfs-progs/blob/master/btrfs.c
[3] http://gplv3.fsf.org/dd3-faq#gpl-compat-matrix


Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

2021-03-13 Thread Adam Borowski
On Sat, Mar 13, 2021 at 11:24:00AM -0500, Neal Gompa wrote:
> On Sat, Mar 13, 2021 at 8:09 AM Adam Borowski  wrote:
> >
> > On Wed, Mar 10, 2021 at 02:26:43PM +, Matthew Wilcox wrote:
> > > On Wed, Mar 10, 2021 at 08:21:59AM -0600, Goldwyn Rodrigues wrote:
> > > > DAX on btrfs has been attempted[1]. Of course, we could not
> > >
> > > But why?  A completeness fetish?  I don't understand why you decided
> > > to do this work.
> >
> > * xfs can snapshot only single files, btrfs entire subvolumes
> > * btrfs-send|receive
> > * enumeration of changed parts of a file
> 
> XFS cannot do snapshots since it lacks metadata COW. XFS reflinking is
> primarily for space efficiency.

A reflink is a single-file snapshot.

My work team really wants this very patchset -- reflinks on DAX allow
backups and/or checkpointing, without stopping the world (there's a single
file, "pool", here).

Besides, you can still get poor-man's whole-subvolume(/directory)
snapshots by manually walking the tree and reflinking everything.
That's not atomic -- but rsync isn't atomic either.  That's enough for
eg. dnf/dpkg purposes.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢰⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ NOBODY expects the Spanish Inquisition!
⠈⠳⣄


Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

2021-03-13 Thread Adam Borowski
On Wed, Mar 10, 2021 at 02:26:43PM +, Matthew Wilcox wrote:
> On Wed, Mar 10, 2021 at 08:21:59AM -0600, Goldwyn Rodrigues wrote:
> > DAX on btrfs has been attempted[1]. Of course, we could not
> 
> But why?  A completeness fetish?  I don't understand why you decided
> to do this work.

* xfs can snapshot only single files, btrfs entire subvolumes
* btrfs-send|receive
* enumeration of changed parts of a file


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢠⠒⠀⣿⡁ productivity.  You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so.  I recommend Skepticism
⠈⠳⣄ (funeral doom metal).


Re: Btrfs progs release 5.11

2021-03-07 Thread Adam Borowski
On Fri, Mar 05, 2021 at 02:36:05PM +0100, David Sterba wrote:
> btrfs-progs version 5.11 has been released.

W: btrfs-progs source: absolute-symbolic-link-target-in-source 
ci/images/ci-centos-7-x86_64/docker-build -> 
/home/dsterba/labs/btrfs-progs/ci/images/docker-build
W: btrfs-progs source: absolute-symbolic-link-target-in-source 
ci/images/ci-centos-7-x86_64/docker-run -> 
/home/dsterba/labs/btrfs-progs/ci/images/docker-run
W: btrfs-progs source: absolute-symbolic-link-target-in-source 
ci/images/ci-centos-8-x86_64/docker-build -> 
/home/dsterba/labs/btrfs-progs/ci/images/docker-build
W: btrfs-progs source: absolute-symbolic-link-target-in-source 
ci/images/ci-centos-8-x86_64/docker-run -> 
/home/dsterba/labs/btrfs-progs/ci/images/docker-run
W: btrfs-progs source: absolute-symbolic-link-target-in-source 
ci/images/ci-openSUSE-Leap-15.2-x86_64/docker-build -> 
/home/dsterba/labs/btrfs-progs/ci/images/docker-build
W: btrfs-progs source: absolute-symbolic-link-target-in-source 
ci/images/ci-openSUSE-Leap-15.2-x86_64/docker-run -> 
/home/dsterba/labs/btrfs-progs/ci/images/docker-run
W: btrfs-progs source: absolute-symbolic-link-target-in-source 
ci/images/ci-openSUSE-tumbleweed-x86_64/docker-build -> 
/home/dsterba/labs/btrfs-progs/ci/images/docker-build
W: btrfs-progs source: absolute-symbolic-link-target-in-source 
ci/images/ci-openSUSE-tumbleweed-x86_64/docker-run -> 
/home/dsterba/labs/btrfs-progs/ci/images/docker-run

Somehow, I can't find /home/dsterba/ on my machine :þ


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
⠈⠳⣄


Re: Why do we need these mount options?

2021-01-16 Thread Adam Borowski
On Sat, Jan 16, 2021 at 10:39:51AM +0300, Andrei Borzenkov wrote:
> 15.01.2021 06:54, Zygo Blaxell wrote:
> > On the other hand, I'm in favor of deprecating the whole discard option
> > and going with fstrim instead.  discard in its current form tends to
> > increase write wear rather than decrease it, especially on metadata-heavy
> > workloads.  discard is roughly equivalent to running fstrim thousands
> > of times a day, which is clearly bad for many (most?  all?) SSDs.
> 
> My (probably naive) understanding so far was that trim on SSD marks
> areas as "unused" which means SSD need to copy less residual data from
> erase block when reusing it. Assuming TRIM unit is (significantly)
> smaller than erase block.
> 
> I would appreciate if you elaborate how trim results in more write on SSD?

The areas are not only marked as unused, but also zeroed.  To keep the
zeroing semantic, every discard must be persisted, thus requiring a write
to the SSD's metadata (not btrfs metadata) area.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ .--[ Makefile ]
⣾⠁⢠⠒⠀⣿⡁ # beware of races
⢿⡄⠘⠷⠚⠋⠀ all: pillage burn
⠈⠳⣄ `


[PATCH] btrfs-progs: fix unterminated long opts for send

2020-12-25 Thread Adam Borowski
Any use of a long option would crash.

Signed-off-by: Adam Borowski 
---
 cmds/send.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/cmds/send.c b/cmds/send.c
index b8e3ba12..3bfc69f5 100644
--- a/cmds/send.c
+++ b/cmds/send.c
@@ -496,7 +496,8 @@ static int cmd_send(const struct cmd_struct *cmd, int argc, char **argv)
	static const struct option long_options[] = {
		{ "verbose", no_argument, NULL, 'v' },
		{ "quiet", no_argument, NULL, 'q' },
-		{ "no-data", no_argument, NULL, GETOPT_VAL_SEND_NO_DATA }
+		{ "no-data", no_argument, NULL, GETOPT_VAL_SEND_NO_DATA },
+		{ NULL, 0, NULL, 0 }
	};
	int c = getopt_long(argc, argv, "vqec:f:i:p:", long_options, NULL);
 
-- 
2.30.0.rc2



[PATCH] btrfs-progs: a bunch of typo fixes

2020-12-25 Thread Adam Borowski
Signed-off-by: Adam Borowski 
---
 CHANGES  | 2 +-
 Documentation/btrfs-balance.asciidoc | 2 +-
 Documentation/btrfs-man5.asciidoc| 4 ++--
 INSTALL  | 2 +-
 README.md| 2 +-
 cmds/filesystem-usage.c  | 2 +-
 crypto/hash-speedtest.c  | 6 +++---
 kernel-lib/radix-tree.c  | 2 +-
 m4/ax_gcc_version.m4 | 2 +-
 tests/README.md  | 2 +-
 10 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/CHANGES b/CHANGES
index e974dc58..16da2863 100644
--- a/CHANGES
+++ b/CHANGES
@@ -85,7 +85,7 @@ btrfs-progs-5.6 (2020-04-05)
   * fixes:
 * restore: proper mirror iteration on decompression error
 * restore: make symlink messages less noisy
-* check: handle holes at the begining or end of file
+* check: handle holes at the beginning or end of file
 * fix xxhash output on big endian machines
 * receive: fix lookup of subvolume by uuid in case it was already
   received before
diff --git a/Documentation/btrfs-balance.asciidoc b/Documentation/btrfs-balance.asciidoc
index d94719a0..7ba88671 100644
--- a/Documentation/btrfs-balance.asciidoc
+++ b/Documentation/btrfs-balance.asciidoc
@@ -37,7 +37,7 @@ The filters can be used to perform following actions:
 
 The filters can be applied to a combination of block group types (data,
 metadata, system). Note that changing only the 'system' type needs the force
-option. Otherwise 'system' gets automatically converted whenver 'metadata'
+option. Otherwise 'system' gets automatically converted whenever 'metadata'
 profile is converted.
 
 When metadata redundancy is reduced (eg. from RAID1 to single) the force option
diff --git a/Documentation/btrfs-man5.asciidoc b/Documentation/btrfs-man5.asciidoc
index 65352009..9016f400 100644
--- a/Documentation/btrfs-man5.asciidoc
+++ b/Documentation/btrfs-man5.asciidoc
@@ -34,7 +34,7 @@ per-subvolume 'nodatacow', 'nodatasum', or 'compress' using mount options. This
 should eventually be fixed, but it has proved to be difficult to implement
 correctly within the Linux VFS framework.
 
-Mount options are processed in order, only the last occurence of an option
+Mount options are processed in order, only the last occurrence of an option
 takes effect and may disable other options due to constraints (see eg.
 'nodatacow' and 'compress'). The output of 'mount' command shows which options
 have been applied.
@@ -868,7 +868,7 @@ refers to what `xfs_io`(8) provides:
 'append only', same as the attribute
 
 *s*::
-'synchronous updates', same as the atribute 'S'
+'synchronous updates', same as the attribute 'S'
 
 *A*::
 'no atime updates', same as the attribute
diff --git a/INSTALL b/INSTALL
index 470ceebd..e2b6c7c3 100644
--- a/INSTALL
+++ b/INSTALL
@@ -14,7 +14,7 @@ For the btrfs-convert utility:
 - e2fsprogs - ext2/ext3/ext4 file system libraries, or called e2fslibs
 - libreiserfscore - reiserfs file system library version >= 3.6.27
 
-Optionally, the checksums based on cryptographic hashes can be implemeted by
+Optionally, the checksums based on cryptographic hashes can be implemented by
 external libraries. Builtin implementations are provided in case the library
 dependencies are not desired.
 
diff --git a/README.md b/README.md
index 5d8e9b55..79421fd4 100644
--- a/README.md
+++ b/README.md
@@ -84,7 +84,7 @@ the patches meet some criteria (often lacking in github contributions):
 
 Source code coding style and preferences follow the
[kernel coding style](https://www.kernel.org/doc/html/latest/process/coding-style.html).
-You can find the editor settins in `.editorconfig` and use the
+You can find the editor settings in `.editorconfig` and use the
 [EditorConfig](https://editorconfig.org/) plugin to let your editor use that,
 or update your editor settings manually.
 
diff --git a/cmds/filesystem-usage.c b/cmds/filesystem-usage.c
index ab60d769..717d436b 100644
--- a/cmds/filesystem-usage.c
+++ b/cmds/filesystem-usage.c
@@ -370,7 +370,7 @@ static void get_raid56_space_info(struct btrfs_ioctl_space_args *sargs,
*max_data_ratio = rt;
 
/*
-* size is the total disk(s) space occuped by a chunk
+* size is the total disk(s) space occupied by a chunk
 * the product of 'size' and  '*_ratio' is "in average"
 * the disk(s) space used by the data
 */
diff --git a/crypto/hash-speedtest.c b/crypto/hash-speedtest.c
index 09d309d2..132ca3aa 100644
--- a/crypto/hash-speedtest.c
+++ b/crypto/hash-speedtest.c
@@ -75,9 +75,9 @@ int main(int argc, char **argv) {
crc32c_optimization_init();
memset(buf, 0, 4096);
 
-   printf("

Re: [PATCH v4 08/12] btrfs-progs: add option for checksum type to mkfs

2019-09-24 Thread Adam Borowski
On Tue, Sep 24, 2019 at 04:26:53PM +0200, David Sterba wrote:
> On Tue, Sep 03, 2019 at 05:00:42PM +0200, Johannes Thumshirn wrote:
> > Add an option to mkfs to specify which checksum algorithm will be used for
> > the filesystem.
> > 
> > Signed-off-by: Johannes Thumshirn 
> 
> I'll change the option to '-c' so we have the most common options as
> lowercase letters.

-c is used for compression elsewhere, I'd rather avoid this confusion.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ A MAP07 (Dead Simple) raspberry tincture recipe: 0.5l 95% alcohol,
⣾⠁⢠⠒⠀⣿⡁ 1kg raspberries, 0.4kg sugar; put into a big jar for 1 month.
⢿⡄⠘⠷⠚⠋⠀ Filter out and throw away the fruits (can dump them into a cake,
⠈⠳⣄ etc), let the drink age at least 3-6 months.


docbook45 is gone

2019-09-03 Thread Adam Borowski
Hi!
I'm afraid that asciidoctor 2.0 dropped support for docbook45.  The
explanation given is here:
https://github.com/asciidoctor/asciidoctor/issues/3005

This makes btrfs-progs fail to build unless docs are off, with:
asciidoctor: FAILED: missing converter for backend 'docbook45'. Processing aborted.

Naively bumping the backend to docbook5 makes the output fail to pass
validation.

I don't know a thing about docbook nor asciidoc, thus I can't fix this
myself.  kdave: you did the conversion, could you save us now?


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄ to the city of his birth to die.


Re: [PATCH v2 0/4] Support xxhash64 checksums

2019-08-26 Thread Adam Borowski
On Mon, Aug 26, 2019 at 08:27:15AM -0400, Austin S. Hemmelgarn wrote:
> On 2019-08-23 13:08, Adam Borowski wrote:
> > the improved collision
> > resistance of xxhash64 is not a strong argument, since if you intend to
> > dedupe you want a crypto hash anyway, so you don't need to verify.
> 
> The improved collision resistance is a roughly 10 orders of magnitude
> reduction in the chance of a collision.  That may not matter for most, but
> it's a significant improvement for anybody operating at large enough scale
> that media errors are commonplace.

Hash size doesn't matter vs media errors.  You don't have billions of
mismatches: the first one is a cause of alarm, so 1-in-4294967296 chance of
failing to notice it hardly ever matters (even though it _can_ happen in
real life as opposed to collisions below).

I can think of a bigger hash useful in three cases:
* recovering from a split-brain RAID
* recovering from one disk of a RAID having had a large piece scribbled upon
* finding candidates for deduplication (but see below why not 64-bit)

> Also, you would still need to verify even if you're using whatever the
> fanciest new collision resistant cryptographic hash is, because the number
> of possible input values is still more than _nine thousand_ orders of
> magnitude larger than the total number of output values even if we use a
> 512-bit cryptographic hash.

You're underestimating how rare crypto-strength hash collisions are.

There are two scenarios: unintentional, and malicious.

Let's go with unintentional first: the age of the Universe is 2^58.5
seconds.  The fastest disk (non-pmem) is NVMe-connected Optane, at 24
IOPS.  That's 2^17.8.  With a 256-bit hash, the mass of machines needed for
a single expected collision within the age of Universe exceeds the mass of
observable Universe itself.

So, malicious.  We demand a non-broken hash, which in crypto speak means
there's no known attack better than brute force.  An iterative approach is
right out; the best space-time tradeoff is birthday attack, which requires
storage size akin to the root of # of combinations (ie, half the hash
length).  It's drastically better: at current best storage densities, you'd
need only the mass of the Earth.

Please let me know when you'll build that Earth-sized computer, so I can
migrate from weak SHA256 to eg. BLAKE2b.

On the other hand, computers and memories get hit by cosmic rays, thermal
noise, and so on at a non-negligible rate.  Any theoretical chance of a hash
collision is dwarfed by flaws of technology we have.  Or, eg, by the chance
that you'll get hit by multiple lightning strikes the next time you leave your
house.

Thus: no, you don't need to recheck after SHA256.
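The back-of-the-envelope numbers above follow from the standard birthday bound; as a sketch (P is the collision probability, n the hash width in bits, q the number of independently hashed blocks):

```latex
% Birthday bound for an n-bit hash over q random inputs:
P_{\text{coll}} \approx \frac{q^{2}}{2^{\,n+1}}
% so one expected collision needs on the order of
q \approx 2^{(n+1)/2} = 2^{128.5} \quad\text{for } n = 256,
% while a single machine hashing at 2^{17.8}\,\text{blocks/s} for the
% age of the Universe (2^{58.5}\,\text{s}) produces only
q_{\text{machine}} = 2^{17.8} \cdot 2^{58.5} = 2^{76.3}.
```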


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋  The root of a real enemy is an imaginary friend.
⠈⠳⣄


Re: [PATCH v2 0/4] Support xxhash64 checksums

2019-08-23 Thread Adam Borowski
On Fri, Aug 23, 2019 at 09:43:22AM +, Paul Jones wrote:
> > > Am Do., 22. Aug. 2019 um 16:41 Uhr schrieb Holger Hoffstätte
> > > :
> > > > but how does btrfs benefit from this compared to using crc32-intel?
> > >
> > > As i know, crc32c  is as far as ~3x faster than xxhash. But xxHash was
> > > created with a differend design goal.
> > > If you using a cpu without hardware crc32 support, xxHash provides you
> > > a maximum portability and speed. Look at arm, mips, power, etc. or old
> > > intel cpus like Core 2 Duo.
> > 
> > I've got a modified version of smhasher
> > (https://github.com/PeeJay/smhasher) that tests speed and cryptographics
> > of various hashing functions.
> 
> I forgot to add xxhash32
>  
> Crc32 Software -  379.91 MiB/sec
> Crc32 Hardware - 7338.60 MiB/sec
> XXhash64 Software - 12094.40 MiB/sec
> XXhash32 Software - 6060.11 MiB/sec
> 
> Testing done on a 1st Gen Ryzen. Impressive numbers from XXhash64.

Newest biggest Threadripper (2990WX, no 3* version released yet):
crc32  -   492.75 MiB/sec
crc32hw-  9447.37 MiB/sec
crc64  -  1959.51 MiB/sec
xxhash32   -  7479.29 MiB/sec
xxhash64   - 14911.58 MiB/sec

An old Skylake (i7-6700):
crc32  -   359.32 MiB/sec
crc32hw- 21119.68 MiB/sec
crc64  -  1656.34 MiB/sec
xxhash32   -  5989.87 MiB/sec
xxhash64   - 11949.41 MiB/sec

Cascade Lake:
crc32hw 1.92× as fast as xxhash64.

So you want crc32hw on Intel, xxhash64 on AMD.

crc32 also allows going back to old kernels; the improved collision
resistance of xxhash64 is not a strong argument, since if you intend to
dedupe you want a crypto hash anyway, so you don't need to verify.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋  The root of a real enemy is an imaginary friend.
⠈⠳⣄


Re: btrfs on RHEL7 (kernel 3.10.0) production ready?

2019-08-03 Thread Adam Borowski
On Sat, Aug 03, 2019 at 12:09:28PM +0200, Ulli Horlacher wrote:
> I have RHEL 7 and CentOS 7.6 servers with kernel 3.10.0 and btrfs-progs v4.9.1
> Is btrfs there ready for production usage(*)?

Hell no!

It's a truly ancient kernel, from the times btrfs wasn't considered stable.
There are a lot of backports atop it, which for a quickly evolving filesystem
are pretty unsafe unless there's a large team doing that work.

And unlike SLES, there's none.  This led to notorious breakages, with Red
Hat finally declaring btrfs unsupported on their distributions.

Thus, if you care about your data at all, please use a modern kernel.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Imagine there are bandits in your house, your kid is bleeding out,
⢿⡄⠘⠷⠚⠋⠀ the house is on fire, and seven big-ass trumpets are playing in the
⠈⠳⣄ sky.  Your cat demands food.  The priority should be obvious...


[PATCH] btrfs-progs: check: fix option parsing for -E

2019-07-27 Thread Adam Borowski
> On Mon, Jun 17, 2019 at 06:45:48PM +0200, Adam Borowski wrote:
> > It has a mandatory argument, thus it always crashed.
>
> Applied, thanks.

Seems like this has fallen through the cracks -- could you please re-apply?

--8x--8x--8x--8x--8x--8x--8x--8x--8x--8x--8x--8x--8x--8x--8x--8x--8x--8x--8x--

It has a mandatory argument, thus it always crashed.

Signed-off-by: Adam Borowski 
---
 check/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/check/main.c b/check/main.c
index 0cc6fdba..866da8dc 100644
--- a/check/main.c
+++ b/check/main.c
@@ -9867,7 +9867,7 @@ static int cmd_check(const struct cmd_struct *cmd, int argc, char **argv)
{ NULL, 0, NULL, 0}
};
 
-   c = getopt_long(argc, argv, "as:br:pEQ", long_options, NULL);
+   c = getopt_long(argc, argv, "as:br:pE:Q", long_options, NULL);
if (c < 0)
break;
switch(c) {
-- 
2.22.0



[PATCH] btrfs-progs: fix a printf format string fatal warning

2019-07-13 Thread Adam Borowski
At least in Debian, default build flags include -Werror=format-security,
for good reasons in most cases.  Here, the string comes from strftime --
and though I don't suspect any locale would be crazy enough to have %X
include a '%' char, the compiler has no way to know that.

Signed-off-by: Adam Borowski 
---
 common/format-output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/common/format-output.c b/common/format-output.c
index c5f1b51f..98fb8607 100644
--- a/common/format-output.c
+++ b/common/format-output.c
@@ -280,7 +280,7 @@ void fmt_print(struct format_ctx *fctx, const char* key, ...)
 
localtime_r(&ts, &tm);
strftime(tstr, 256, "%Y-%m-%d %X %z", &tm);
-   printf(tstr);
+   printf("%s", tstr);
} else {
putchar('-');
}
-- 
2.22.0



[PATCH] btrfs-progs: check: fix option parsing for -E

2019-06-17 Thread Adam Borowski
It has a mandatory argument, thus it always crashed.

Signed-off-by: Adam Borowski 
---
 check/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/check/main.c b/check/main.c
index 731c21d3..b2f0c810 100644
--- a/check/main.c
+++ b/check/main.c
@@ -9923,7 +9923,7 @@ int cmd_check(int argc, char **argv)
{ NULL, 0, NULL, 0}
};
 
-   c = getopt_long(argc, argv, "as:br:pEQ", long_options, NULL);
+   c = getopt_long(argc, argv, "as:br:pE:Q", long_options, NULL);
if (c < 0)
break;
switch(c) {
-- 
2.20.1



Re: Citation Needed: BTRFS Failure Resistance

2019-05-23 Thread Adam Borowski
On Thu, May 23, 2019 at 10:24:28AM -0600, Chris Murphy wrote:
> On Thu, May 23, 2019 at 5:19 AM Austin S. Hemmelgarn
> > BTRFS explicitly requests write barriers to prevent that type of
> > reordering of writes from happening, and it's actually pretty unusual on
> > modern hardware for those write barriers to not be honored unless the
> > user is doing something stupid (like mounting with 'nobarrier' or using
> > LVM with write barrier support disabled).
> 
> 'man xfs'
> 
>barrier|nobarrier
>   Note: This option has been deprecated as of kernel
> v4.10; in that version, integrity operations are always performed and
> the mount option is ignored.  These mount options will be removed no
> earlier than kernel v4.15.
> 
> Since they're getting rid of it, I wonder if it's sane for most any
> sane file system use case.

A volatile filesystem: one that you're willing to rebuild from scratch (or
backups) on power loss.  This includes any filesystem in a volatile VM.

Example use case: a build machine, where the build filesystem wants btrfs
for snapshots (the build environment takes several minutes to recreate), yet with
the environment recreated weekly, a crash can be considered an additional
start of a week. :)

Or, some clusters consider a crashed node to be dead and needing rebuild;
the filesystem's contents will be cloned from a master anyway.

In all of these cases, fsyncs can be ignored as well.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢰⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ At least spammers get it right: "Hello beautiful!".
⠈⠳⣄


Re: [PATCH 12/18] btrfs: allow MAP_SYNC mmap

2019-05-23 Thread Adam Borowski
On Thu, May 23, 2019 at 03:44:49PM +0200, Jan Kara wrote:
> On Mon 29-04-19 12:26:43, Goldwyn Rodrigues wrote:
> > From: Adam Borowski 
> > 
> > Used by userspace to detect DAX.
> > [rgold...@suse.com: Added CONFIG_FS_DAX around mmap_supported_flags]
> 
> Why the CONFIG_FS_DAX bit? Your mmap(2) implementation understands
> implications of MAP_SYNC flag and that's all that's needed to set
> .mmap_supported_flags.

Good point.

Also, that check will need to be updated when the pmem-virtio patchset goes
in.

> > Signed-off-by: Adam Borowski 
> > Signed-off-by: Goldwyn Rodrigues 
> > ---
> >  fs/btrfs/file.c | 4 
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 9d5a3c99a6b9..362a9cf9dcb2 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -16,6 +16,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include "ctree.h"
> >  #include "disk-io.h"
> >  #include "transaction.h"
> > @@ -3319,6 +3320,9 @@ const struct file_operations btrfs_file_operations = {
> > .splice_read= generic_file_splice_read,
> > .write_iter = btrfs_file_write_iter,
> > .mmap   = btrfs_file_mmap,
> > +#ifdef CONFIG_FS_DAX
> > +   .mmap_supported_flags = MAP_SYNC,
> > +#endif
> > .open   = btrfs_file_open,
> > .release= btrfs_release_file,
> > .fsync  = btrfs_sync_file,
> > -- 
> > 2.16.4
> > 
> -- 
> Jan Kara 
> SUSE Labs, CR
> 

-- 
⢀⣴⠾⠻⢶⣦⠀ Latin:   meow 4 characters, 4 columns,  4 bytes
⣾⠁⢠⠒⠀⣿⡁ Greek:   μεου 4 characters, 4 columns,  8 bytes
⢿⡄⠘⠷⠚⠋  Runes:   ᛗᛖᛟᚹ 4 characters, 4 columns, 12 bytes
⠈⠳⣄ Chinese: 喵   1 character,  2 columns,  3 bytes <-- best!


Re: [PATCH 00/17] Add support for SHA-256 checksums

2019-05-17 Thread Adam Borowski
On Fri, May 17, 2019 at 09:07:03PM +0200, Johannes Thumshirn wrote:
> On Fri, May 17, 2019 at 08:36:23PM +0200, Diego Calleja wrote:
> > If btrfs needs an algorithm with a good performance/security ratio, I would
> > suggest considering BLAKE2 [1]. It is based on the BLAKE algorithm that made
> > it to the final round in the SHA3 competition, it is considered pretty secure
> > (above SHA2 at least), and it was designed to take advantage of modern CPU
> > features and be as fast as possible - it even beats SHA1 in that regard. It is
> > not currently in the kernel but Wireguard uses it and will add an
> > implementation when it's merged (but Wireguard doesn't use the crypto layer
> > for some reason...)
> 
> SHA3 is on my list of other candidates to look at for a performance
> evaluation. As for BLAKE2 I haven't done too much research on it and I'm not a
> cryptographer so I have to trust FIPS et al.

"Trust FIPS" is the main problem here.  Until recently, FIPS certification
required implementing this nice random generator:
https://en.wikipedia.org/wiki/Dual_EC_DRBG

Thus, a good part of people are reluctant to use hash functions chosen by
NIST (and published as FIPS).

BLAKE2 is also a good deal faster on most hardware:
https://bench.cr.yp.to/results-sha3.html
Even with sha_ni, SHA256 wins only on Zen AMDs: sha_ni equipped Intels have
superior SIMD thus BLAKE2 is still faster.  And without sha_ni, the
difference is drastic.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Latin:   meow 4 characters, 4 columns,  4 bytes
⣾⠁⢠⠒⠀⣿⡁ Greek:   μεου 4 characters, 4 columns,  8 bytes
⢿⡄⠘⠷⠚⠋  Runes:   ᛗᛖᛟᚹ 4 characters, 4 columns, 12 bytes
⠈⠳⣄ Chinese: 喵   1 character,  2 columns,  3 bytes <-- best!


Re: BTRFS Raid 5 Space missing - ideas ?

2019-04-20 Thread Adam Borowski
On Sat, Apr 20, 2019 at 12:46:16PM +0200, Juergen Sauer wrote:
> I wish you happy Easter days in advance :)

Same to you!

> During my tests with BTRFS as Raid5 setup, I found a curious little
> "problem".

> Total devices 3 FS bytes used 9.98TiB
> devid1 size 9.09TiB used 4.99TiB path /dev/sdb1
> devid2 size 5.46TiB used 4.99TiB path /dev/sdc1
> devid3 size 5.46TiB used 4.99TiB path /dev/sde1

> All partitions sdb1 sdc1 sde1 are the same size: 9.0 TiB. But BTRFS is
> not using the bigger space on sdc1, sde1; there is only 5.46 TiB used,
> even though there are 9.0 TiB available, so 4.0 TiB are unused.

It's working as expected: while btrfs does RAID per block group rather than
per whole block device, there's no way to place a raid5 block group in a way
that doesn't require at least 3 devices.  This means that with a 3-disk setup
the space utilized on each device will be only as big as the smallest one.

This is also the case for raid1 on 2-disk, and for raid10 on 4-disk.

Btrfs can use uneven disks only when it has some freedom how to place the
data around.

There's a tool that lets you visualize space utilization:
http://carfax.org.uk/btrfs-usage/
or a command-line implementation:
btrfs-space-calculator (package python[3]-btrfs)


By the way, you can greatly improve performance and safety by switching
metadata profile to raid1: "btrfs bal start -mraid1".  RAID5 is very slow
for random writes, which is nearly all metadata write access; RAID1 doesn't
suffer from this problem -- and metadata tends to be only around 1-2% of
space so having it take a bit more doesn't hurt.

It would also solve your utilization problem, except that metadata uses so
little space.  Having mixed block groups means the space not taken by RAID5
can be recovered by taking twice as much from sdb1 as from each of sdc1
and sde1:

sdb1 * * * * * * * * * * * * * * * * * * * * *
sdc1 * * * * * * * * * * *
sde1 * * * * * * * * * *
(each RAID1 block group is either sdb1+sdc1 or sdb1+sde1)


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Did ya know that typing "test -j8" instead of "ctest -j8"
⢿⡄⠘⠷⠚⠋⠀ will make your testsuite pass much faster, and fix bugs?
⠈⠳⣄


Re: [PATCH] btrfs: allow MAP_SYNC mmap

2019-03-28 Thread Adam Borowski
[kdave: like the rest of btrfs+DAX patchset, this is WIP of course]

> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 196c8f37ff9d..8efdb040bc1d 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -16,6 +16,7 @@
> +#include 

> + .mmap_supported_flags = MAP_SYNC,

With this, userspace at least thinks that DAX works.  Whether it's
actually crash-safe, will require a lot of review that's outside of my
areas of knowledge.

Most of the PMDK's test suite passes, but at least pmemspoil test fails.
There's also (not sure if in pmemspoil or earlier):

[  652.053976] RIP: 0010:btrfs_free_reserved_data_space_noquota+0xe6/0xf0
[  652.060683] Code: b8 00 48 8b 45 00 48 85 c0 75 d7 41 c6 45 00 00 5b 5d 41 
5c 41 5d 41 5e 41 5f c3 48 89 c1 48 f7 d9 48 39 ca 0f 83 6a ff ff ff <0f> 0b 31 
c0 e9 64 ff ff ff 90 41 55 49 89 fd 41 54 49 89 f4 55 53
[  652.079767] RSP: :99450807fc38 EFLAGS: 00010287
[  652.085078] RAX: f000 RBX: 0030 RCX: 1000
[  652.092330] RDX:  RSI: 0030 RDI: 89965030a000
[  652.099583] RBP: 0030 R08: 99450807fc28 R09: ae6bc01a
[  652.106836] R10: c76c9f5f8600 R11: 0011 R12: 00300fff
[  652.114089] R13: 89965030a000 R14: 89964a82d000 R15: 0002
[  652.121342] FS:  7fee0adcc800() GS:899655bc() 
knlGS:
[  652.129566] CS:  0010 DS:  ES:  CR0: 80050033
[  652.135405] CR2: 7fee0a70 CR3: 0007742d6006 CR4: 003606e0
[  652.142657] DR0:  DR1:  DR2: 
[  652.149822] DR3:  DR6: fffe0ff0 DR7: 0400
[  652.157149] Call Trace:
[  652.159550]  btrfs_free_reserved_data_space+0x46/0x60
[  652.164755]  btrfs_iomap_end+0xf4/0x150
[  652.168654]  dax_iomap_pte_fault.isra.41+0x201/0x8f0
[  652.173712]  btrfs_dax_fault+0x3c/0xa0
[  652.177523]  __do_fault+0x2f/0x90
[  652.180891]  __handle_mm_fault+0x9fa/0xed0
[  652.185056]  __do_page_fault+0x242/0x4c0
[  652.188955]  ? page_fault+0x8/0x30
[  652.192484]  page_fault+0x1e/0x30
[  652.195854] RIP: 0033:0x7fee0b340efc
[  652.199400] Code: 9d 48 81 fa 80 00 00 00 77 19 c5 fe 7f 07 c5 fe 7f 47 20 
c5 fe 7f 44 17 e0 c5 fe 7f 44 17 c0 c5 f8 77 c3 48 8d 8f 80 00 00 00  fe 7f 
07 48 83 e1 80 c5 fe 7f 44 17 e0 c5 fe 7f 47 20 c5 fe 7f
[  652.218563] RSP: 002b:7ffcd02d58e8 EFLAGS: 00010202
[  652.223786] RAX: 7fee0a70 RBX: 7fee0a6fffc0 RCX: 7fee0a700080
[  652.231026] RDX: 0800 RSI:  RDI: 7fee0a70
[  652.238353] RBP: 7fee0a401b38 R08: 000f R09: 0001
[  652.245606] R10: 000d R11: 7fee0b360bf0 R12: 0800
[  652.252865] R13: 7fee0b4a6eec R14: 7fee0a70 R15: 0180
[  652.260119] ---[ end trace b6545baf6cf711c6 ]---


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Did ya know that typing "test -j8" instead of "ctest -j8"
⢿⡄⠘⠷⠚⠋⠀ will make your testsuite pass much faster, and fix bugs?
⠈⠳⣄


[PATCH] btrfs: allow MAP_SYNC mmap

2019-03-28 Thread Adam Borowski
Used by userspace to detect DAX.

Signed-off-by: Adam Borowski 
---
 fs/btrfs/file.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 196c8f37ff9d..8efdb040bc1d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -3320,6 +3321,7 @@ const struct file_operations btrfs_file_operations = {
.splice_read= generic_file_splice_read,
.write_iter = btrfs_file_write_iter,
.mmap   = btrfs_file_mmap,
+   .mmap_supported_flags = MAP_SYNC,
.open   = btrfs_file_open,
.release= btrfs_release_file,
.fsync  = btrfs_sync_file,
-- 
2.20.1



Re: [PATCH v2 00/15] btrfs dax support

2019-03-27 Thread Adam Borowski
On Tue, Mar 26, 2019 at 02:09:08PM -0500, Goldwyn Rodrigues wrote:
> This patch set adds support for dax on the BTRFS filesystem.

This patchset doesn't seem to support MAP_SYNC, which is the usual way to
use (and detect) DAX.  Basically, it requests for page faults to be
synchronous -- ie, when a page fault returns, the mapping points to actual
memory rather than to some buffer that'll be written back to the destination
at some point in the future.

Also, not really understanding these parts of the kernel, I can't tell if
the snapshots are atomic.  Ie, while the kernel walks over pages to set
mprotect flags, the process does two writes:
   RRRWW (R=ro W=rw)
A   B
The write at A causes a page fault, which clones the page, CoWing it and
letting the write into only one of the replicas.  After this, write to B
happens before the mprotect, thus goes into both replicas -- and despite
the process having issued proper memory barriers, the other replica has
B but not A.  To fix this, earlier page faults can't get finalized until
all mprotects are in place.  (I'm writing this as a query rather than a
problem report -- I'm an ignoramus here.)


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Did ya know that typing "test -j8" instead of "ctest -j8"
⢿⡄⠘⠷⠚⠋⠀ will make your testsuite pass much faster, and fix bugs?
⠈⠳⣄


Re: [PATCH 01/15] btrfs: create a mount option for dax

2019-03-27 Thread Adam Borowski
On Tue, Mar 26, 2019 at 12:10:01PM -0700, Matthew Wilcox wrote:
> On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> > The dax option is restricted to non multi-device mounts.
> > dax interacts with the device directly instead of using bio, so
> > all bio-hooks which we use for multi-device cannot be performed
> > here. While regular read/writes could be manipulated with
> > RAID0/1, mmap() is still an issue.
> > 
> > Auto-setting free space tree, because dealing with free space
> > inode (specifically readpages) is a nightmare.
> > Auto-setting nodatasum because we don't get callback for writing
> > checksums after mmap()s.
> 
> Congratulations on getting the bear to dance.  But why?
> 
> To me, the point of btrfs is all the cool stuff it does with built-in
> checksumming and snapshots and RAID and so on.  DAX doesn't let you do
> any of that, so why would somebody want to use btrfs to manage DAX?

If I read this correctly (I merely glanced at it), this patchset _does_
provide the full snapshot functionality.  This is something other
filesystems don't allow: ext4 has no CoW at all, and IIRC on XFS reflinks
and DAX are mutually exclusive.

Obviously, the usual btrfs way of CoWing every write would remove all
(write) upsides of DAX, thus NOCOW (ie, CoW once) is the way to go: a page
fault should happen no more than once per page per snapshot.


On the other hand, checksumming seems useless to me.  Data corruption can
happen either in transit or at rest.  For at rest, disks already have their
own checksums -- and [NV]DIMMs have ECC.  On the other hand, the majority of
the time when someone seeks help on the btrfs mailing list, it turns out to
be a matter of bad RAM, bad motherboard or bad cabling.  This doesn't apply
to pmem.  The usual path is:

   CPU
|<--->memory
|
  SATA controller
|
(SATA cable)
|
  disk

The data goes to memory (very unlikely to remain in the cache before
getting checksummed), then has to travel all the way down.  On the other
hand, the path on pmem is:

  CPU
   |<--->memory

So the data written by userspace goes to memory... and that's it.


As for multi-device, at least single block groups would be very nice (to
have a filesystem than spans regions) and easyish to implement, while RAID0
might spoil hugepage fun but may still be straightforward.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Did ya know that typing "test -j8" instead of "ctest -j8"
⢿⡄⠘⠷⠚⠋⠀ will make your testsuite pass much faster, and fix bugs?
⠈⠳⣄


Re: [PATCH URGENT v1.1 0/2] btrfs-progs: Fix the nobarrier behavior of write

2019-03-27 Thread Adam Borowski
On Wed, Mar 27, 2019 at 05:46:50PM +0800, Qu Wenruo wrote:
> This urgent patchset can be fetched from github:
> https://github.com/adam900710/btrfs-progs/tree/flush_super
> Which is based on v4.20.2.
> 
> Before this patch, btrfs-progs writes to the fs has no barrier at all.
> All metadata and superblock are just buffered write, no barrier between
> super blocks and metadata writes at all.
> 
> No wonder why even clear space cache can cause serious transid
> corruption to the originally good fs.
> 
> Please merge this fix as soon as possible as I really don't want to see
> btrfs-progs corrupting any fs any more.

How often does this happen in practice?  I'm slightly incredulous about
btrfs-progs crashing often.  Especially since pwrite() is buffered on the
kernel side, we'd need a _kernel_ crash (usually a power loss) to break
consistency.  Obviously, a potential data loss bug is always something that
needs fixing, I'm just wondering about severity.

Or do I understand this wrong?

Asking because Dimitri John Ledkov stepped down as Debian's maintainer of
this package, and I'm taking up the mantle (with Nicholas D Steeves being
around) -- modulo any updates other than important bug fixes being on hold
because of Debian's freeze.  Thus, I wonder if this is important enough to
ask for a freeze exception.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Did ya know that typing "test -j8" instead of "ctest -j8"
⢿⡄⠘⠷⠚⠋⠀ will make your testsuite pass much faster, and fix bugs?
⠈⠳⣄


[PATCH resend 2/2] btrfs-progs: defrag: open files RO on new enough kernels

2019-02-25 Thread Adam Borowski
Defragging an executable conflicts both way with it being run, resulting in
ETXTBSY.  This either makes defrag fail or prevents the program from being
executed.

Kernels 4.19-rc1 and later allow defragging files you could have possibly
opened rw, even if the passed descriptor is ro (commit 616d374efa23).

Signed-off-by: Adam Borowski 
---
 cmds-filesystem.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index b8beec13..0eb052dc 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -39,12 +40,14 @@
 #include "list_sort.h"
 #include "disk-io.h"
 #include "help.h"
+#include "fsfeatures.h"
 
 /*
  * for btrfs fi show, we maintain a hash of fsids we've already printed.
  * This way we don't print dups if a given FS is mounted more than once.
  */
 static struct seen_fsid *seen_fsid_hash[SEEN_FSID_HASH_SIZE] = {NULL,};
+static mode_t defrag_open_mode = O_RDONLY;
 
 static const char * const filesystem_cmd_group_usage[] = {
"btrfs filesystem []  []",
@@ -880,7 +883,7 @@ static int defrag_callback(const char *fpath, const struct 
stat *sb,
if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) {
if (defrag_global_verbose)
printf("%s\n", fpath);
-   fd = open(fpath, O_RDWR);
+   fd = open(fpath, defrag_open_mode);
if (fd < 0) {
goto error;
}
@@ -917,6 +920,9 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int compress_type = BTRFS_COMPRESS_NONE;
DIR *dirstream;
 
+   if (get_running_kernel_version() < KERNEL_VERSION(4,19,0))
+   defrag_open_mode = O_RDWR;
+
/*
 * Kernel has a different default (256K) that is supposed to be safe,
 * but it does not defragment very well. The 32M will likely lead to
@@ -1017,7 +1023,7 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int defrag_err = 0;
 
dirstream = NULL;
-   fd = open_file_or_dir(argv[i], &dirstream);
+   fd = open_file_or_dir3(argv[i], &dirstream, defrag_open_mode);
if (fd < 0) {
error("cannot open %s: %m", argv[i]);
ret = -errno;
-- 
2.20.1



[PATCH resend 1/2] btrfs-progs: fix kernel version parsing on some versions past 3.0

2019-02-25 Thread Adam Borowski
The code fails if the third section is missing (like "4.18") or is followed
by anything but "." or "-".  This happens for example if we're not exactly
at a tag and CONFIG_LOCALVERSION_AUTO=n (which results in "4.18.5+").

Signed-off-by: Adam Borowski 
---
 fsfeatures.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fsfeatures.c b/fsfeatures.c
index 13ad0308..68653739 100644
--- a/fsfeatures.c
+++ b/fsfeatures.c
@@ -216,11 +216,8 @@ u32 get_running_kernel_version(void)
return (u32)-1;
version |= atoi(tmp) << 8;
tmp = strtok_r(NULL, ".", &saveptr);
-   if (tmp) {
-   if (!string_is_numerical(tmp))
-   return (u32)-1;
+   if (tmp && string_is_numerical(tmp))
version |= atoi(tmp);
-   }
 
return version;
 }
-- 
2.20.1



Re: [RFC PATCH 0/6] Allow setting file birth time with utimensat()

2019-02-17 Thread Adam Borowski
On Sun, Feb 17, 2019 at 06:35:25PM +0200, Boaz Harrosh wrote:
> On 15/02/19 00:06, Dave Chinner wrote:
> > So you're adding an interface that allows users to change the create
> > time of files without needing any privileges?

> > Inode create time is forensic metadata in XFS  - information we use
> > for sequence of event and inode lifetime analysis during examination
> > of broken filesystem images and systems that have been broken into.

> I think the difference in opinion here is that there are two totally
> different BTIme out in the world. For two somewhat opposite motivations
> and it seems they both try to be crammed into the same on disk space.
> 
> One - Author creation time
> Two - Local creation time

> So it looks like both sides are correct trying to preserve their own guy?

I'd say that [2] is too easily gameable to be worth the effort.  You can
just change it on the disk.  That right now it'd take some skill to find the
right place to edit doesn't matter -- a tool to update the btime against
your wishes would need to be written just once.  Unlike btrfs, XFS doesn't
even have a chain of checksums all the way to the root.

On the other hand, [1] has a lot of uses.  It can also be preserved in
backups and version control (svnt and git-restore-mtime could be easily
extended).

I'd thus go with [1] -- any uses for [2] are better delegated to filesystem
specific tools.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Have you accepted Khorne as your lord and saviour?
⠈⠳⣄


Re: RAID56 Warning on "multiple serious data-loss bugs"

2019-01-28 Thread Adam Borowski
On Mon, Jan 28, 2019 at 03:23:28PM +, Supercilious Dude wrote:
> On Mon, 28 Jan 2019 at 01:18, Qu Wenruo  wrote:
> >
> > So for current upstream kernel, there should be no major problem despite
> > write hole.
> 
> 
> Can you please elaborate on the implications of the write-hole? Does
> it mean that the transaction currently in-flight might be lost but the
> filesystem is otherwise intact?

No, losing the in-flight transaction is normal operation of every modern
filesystem -- in fact, you _want_ the transaction to be lost instead of
partially torn.

The write hole means corruption of a random _old_ piece of data.

It can be fatal (ie, lead to data loss) if two errors happen together:
* the stripe is degraded
* there's unexpected crash/power loss

Every RAID implementation (not just btrfs) suffers from the write hole
unless some special, costly, precaution is being taken.  Those include
journaling, plug extents, varying-width stripes (ZFS: RAIDZ).  The two
former require effectively writing small writes twice, the latter degrades
small writes to RAID1 as disk capacity goes.

The write hole affects only writes that neighbour some old (ie, not from the
current transaction) data in the same stripe -- as long as everything in a
single stripe belongs to no more than one transaction, all is fine.  

> How does it interact with data and metadata being stored with a different
> profile (one with write hole and one without)?

If there's unrecoverable error due to write hole, you lose a single stripe
worth.  For data, this means a single piece of a file is beyond repair.  For
metadata, you lose a potentially large swatch of the filesystem -- and as
tree nodes close to the root get rewritten the most, a total filesystem loss
is pretty likely.  To make things worse, while data writes are mostly linear
(for small files, btrfs batches writes from the same transaction), metadata
is strewn all around, mixing pieces of different importance and different
age.  RAID5 (all implementations) is also very slow for random writes (such
as btrfs metadata), thus you really want RAID1 metadata both for safety and
performance.  Metadata being only around 1-2% of disk space, the only upside
of RAID5 (better use of capacity) doesn't really matter.

Ie: RAID1 is a clear winner for btrfs metadata; mixing profiles for data vs
metadata is safe.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Remember, the S in "IoT" stands for Security, while P stands
⢿⡄⠘⠷⠚⠋⠀ for Privacy.
⠈⠳⣄


Re: compress-force not really forcing compression?

2018-12-22 Thread Adam Borowski
On Sun, Dec 23, 2018 at 12:24:02AM +, Paul Jones wrote:
> > IMHO the more pertinent question is :
> > 
> > If a file has portions which are not easily compressible does that imply all
> > future writes are also incompressible. IMO no, so I think what will be 
> > prudent
> > is remove FORCE_COMPRESS altogether and make the code act as if it's
> > always on.
> > 
> > Any opinions?
> 
> 
> That is a good idea.  If I turn on compression I would expect everything
> to be compressed, except in cases where there is no size benefit.

I expect that the vast majority of files consist of blocks of similar
compressibility.  Thus, finding a block that fails to compress strongly
suggests other blocks are either incompressible as well or compress only
minimally.  Refusing to waste time, electricity and fragmentation in such
case is a good default, I think.

But, if you believe this should be changed, there's an easy experiment you
can try: for all files on your filesystem, chop every file into 128KB pieces
and compress each of them with your chosen algorithm.  Noting the compressed
size of every block in a file that had at least one block fail to compress
would give us some data.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄ to the city of his birth to die.


Re: SATA/SAS mixed pool

2018-12-14 Thread Adam Borowski
On Fri, Dec 14, 2018 at 05:14:37AM +, Duncan wrote:
> Adam Borowski posted on Thu, 13 Dec 2018 08:29:05 +0100 as excerpted:
> > On Wed, Dec 12, 2018 at 09:31:02PM -0600, Nathan Dehnel wrote:
> >> Is it possible/safe to replace a SATA drive in a btrfs RAID10 pool with
> >> an SAS drive?
> > 
> > For btrfs, a block device is a block device, it's not "racist".
> > You can freely mix and/or replace.  If you want to, say, extend a SD
> > card with NBD to remote spinning rust, it works well -- tested :p
> 
> FWIW (mostly for other readers not so much this particular case) the 
> known exception/caveat to that is USB block devices, which do tend to 
> have problems, tho some hardware is fine.

Yeah, but the problem doesn't come from btrfs not supporting or ill
supporting USB, just from the devices themselves being flaky.  If they
supported the spec correctly, all would be ok.

An example might be NBD from one of my machines that has an incredibly bad
network driver -- it drops packets whenever there's even a bit of memory
pressure.  That's ok on RX (no different from packet being dropped on the
wire, the sender will retransmit) but unacceptable on TX -- it should have
slept instead; NBD (reasonably) can't handle this and destroys the block
device.  Yet btrfs can handle such an unexpected but clean disconnect just
fine -- not even a reboot is needed, I need to unmount, restart NBD then
remount.  From the filesystem's point of view this is exactly equivalent
to a power loss -- that the in-memory copy tried to do some writes
afterwards doesn't matter.

So it's not just whether the device fails, it's about _how_ it fails.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄ to the city of his birth to die.


Re: SATA/SAS mixed pool

2018-12-12 Thread Adam Borowski
On Wed, Dec 12, 2018 at 09:31:02PM -0600, Nathan Dehnel wrote:
> Is it possible/safe to replace a SATA drive in a btrfs RAID10 pool
> with an SAS drive?

For btrfs, a block device is a block device, it's not "racist".
You can freely mix and/or replace.  If you want to, say, extend a SD
card with NBD to remote spinning rust, it works well -- tested :p


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄ to the city of his birth to die.


Re: [PATCH 01/10] btrfs: create a mount option for dax

2018-12-05 Thread Adam Borowski
On Wed, Dec 05, 2018 at 02:43:03PM +0200, Nikolay Borisov wrote:
> One question below though .
> 
> > +++ b/fs/btrfs/super.c
> > @@ -739,6 +741,17 @@ int btrfs_parse_options(struct btrfs_fs_info *info, 
> > char *options,
> > case Opt_user_subvol_rm_allowed:
> > btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
> > break;
> > +#ifdef CONFIG_FS_DAX
> > +   case Opt_dax:
> > +   if (btrfs_super_num_devices(info->super_copy) > 1) {
> > +   btrfs_info(info,
> > +  "dax not supported for multi-device 
> > btrfs partition\n");
> 
> What prevents supporting dax for multiple devices so long as all devices
> are dax?

As I mentioned in a separate mail, most profiles are either redundant
(RAID0), require hardware support (RAID1, DUP) or are impossible (RAID5,
RAID6).

But, "single" profile multi-device would be useful and actually provide
something other dax-supporting filesystems don't have: combining multiple
devices into one logical piece.

On the other hand, DUP profiles need to be banned.  In particular, the
filesystem you mount might have existing DUP block groups.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄ to the city of his birth to die.


Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Adam Borowski
On Wed, Dec 05, 2018 at 06:28:25AM -0600, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs.

Yay!

> I understand there have been previous attempts at it.  However, I wanted
> to make sure copy-on-write (COW) works on dax as well.

btrfs' usual use of CoW and DAX are thoroughly in conflict.

The very point of DAX is to have writes not go through the kernel, you
mmap the file then do all writes right to the pmem, flushing when needed
(without hitting the kernel) and having the processor+memory persist what
you wrote.

CoW via page faults are fine -- pmem is closer to memory than disk, and this
means the kernel will ask the filesystem for an extent to place the new page
in, copy the contents and let the process play with it.  But real btrfs CoW
would mean we'd need to page fault on ᴇᴠᴇʀʏ ꜱɪɴɢʟᴇ ᴡʀɪᴛᴇ.

Delaying CoW until the next commit doesn't help -- you'd need to store the
dirty page in DRAM then write it, which goes against the whole concept of
DAX.

Only way I see would be to CoW once then pretend the page is nodatacow until
the next commit, when we checksum it, add to the metadata trees, and mark
for CoWing on the next write.  Lots of complexity, and you still need to
copy the whole thing every commit (so no gain).

Ie, we're in nodatacow land.  CoW for metadata is fine.

> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:
> 
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because btrfs kernel module does not get back control after
> an mmap() writes. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

Per the above, it sounds like nodatacow (ie, "cow once") would be needed.

> 2. Currently, a user can continue writing on "old" extents of an mmaped file
> after a snapshot has been created. How can we enforce writes to be directed
> to new extents after snapshots have been created? Do we keep a list of
> all mmap()s, and re-mmap them after a snapshot?

Same as for any other memory that's shared: when a new instance of sharing
is added (a snapshot/reflink in our case), you deny writes, causing a page
fault on the next attempt.  "pmem" is named "ᴘersistent ᴍᴇᴍory" for a
reason...

> Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
> command line parameter.

Might be more useful to use a bigger piece of the "disk" than 2G, it's not
in the danger area though.

Also note that it's utterly pointless to use any RAID modes; multi-dev
single is fine, DUP counts as RAID here.
* RAID0 is already done better in hardware (interleave)
* RAID1 would require hardware support, replication isn't easy
* RAID5/6 are impossible

What would make sense, is disabling dax for any files that are not marked as
nodatacow.  This way, unrelated files can still use checksums or
compression, while only files meant as a pmempool or otherwise by a
pmem-aware program would have dax writes (you can still give read-only pages
that CoW to DRAM).  This way we can have write dax for only a subset of
files, and full set of btrfs features for the rest.  Write dax is dangerous
for programs that have no specific support: the vast majority of
database-like programs rely on page-level atomicity while pmem gives you
cacheline/word atomicity only; torn writes mean data loss.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄ to the city of his birth to die.


Re: [PATCH 7/9] btrfs-progs: Fix Wmaybe-uninitialized warning

2018-12-04 Thread Adam Borowski
On Tue, Dec 04, 2018 at 01:17:04PM +0100, David Sterba wrote:
> On Fri, Nov 16, 2018 at 03:54:24PM +0800, Qu Wenruo wrote:
> > The only location is the following code:
> > 
> > int level = path->lowest_level + 1;
> > BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
> > while(level < BTRFS_MAX_LEVEL) {
> > slot = path->slots[level] + 1;
> > ...
> > }
> > path->slots[level] = slot;
> > 
> > Again, it's the stupid compiler needs some hint for the fact that
> > we will always enter the while loop for at least once, thus @slot should
> > always be initialized.
> 
> Harsh words for the compiler, and I say not deserved. The same code
> pasted to kernel a built with the same version does not report the
> warning, so it's apparently a missing annotation of BUG_ON in
> btrfs-progs that does not give the right hint.

It'd be nice if the C language provided a kind of a while loop that executes
at least once...

-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄ to the city of his birth to die.


[PATCH RESEND 1/2] btrfs-progs: fix kernel version parsing on some versions past 3.0

2018-11-21 Thread Adam Borowski
The code fails if the third section is missing (like "4.18") or is followed
by anything but "." or "-".  This happens for example if we're not exactly
at a tag and CONFIG_LOCALVERSION_AUTO=n (which results in "4.18.5+").

Signed-off-by: Adam Borowski 
---
 fsfeatures.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fsfeatures.c b/fsfeatures.c
index 7d85d60f..66111bf4 100644
--- a/fsfeatures.c
+++ b/fsfeatures.c
@@ -216,11 +216,8 @@ u32 get_running_kernel_version(void)
return (u32)-1;
version |= atoi(tmp) << 8;
tmp = strtok_r(NULL, ".", &saveptr);
-   if (tmp) {
-   if (!string_is_numerical(tmp))
-   return (u32)-1;
+   if (tmp && string_is_numerical(tmp))
version |= atoi(tmp);
-   }
 
return version;
 }
-- 
2.19.1



[PATCH RESEND-v3 2/2] btrfs-progs: defrag: open files RO on new enough kernels

2018-11-21 Thread Adam Borowski
Defragging an executable conflicts both way with it being run, resulting in
ETXTBSY.  This either makes defrag fail or prevents the program from being
executed.

Kernels 4.19-rc1 and later allow defragging files you could have possibly
opened rw, even if the passed descriptor is ro (commit 616d374efa23).

Signed-off-by: Adam Borowski 
---
v2: more eloquent description; root can't defrag RO on old kernels (unlike
dedupe)
v3: more eloquentier description; s/defrag_ro/defrag_open_mode/

 cmds-filesystem.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index d1af21ee..c67bf5da 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -39,12 +40,14 @@
 #include "list_sort.h"
 #include "disk-io.h"
 #include "help.h"
+#include "fsfeatures.h"
 
 /*
  * for btrfs fi show, we maintain a hash of fsids we've already printed.
  * This way we don't print dups if a given FS is mounted more than once.
  */
 static struct seen_fsid *seen_fsid_hash[SEEN_FSID_HASH_SIZE] = {NULL,};
+static mode_t defrag_open_mode = O_RDONLY;
 
 static const char * const filesystem_cmd_group_usage[] = {
"btrfs filesystem []  []",
@@ -878,7 +881,7 @@ static int defrag_callback(const char *fpath, const struct 
stat *sb,
if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) {
if (defrag_global_verbose)
printf("%s\n", fpath);
-   fd = open(fpath, O_RDWR);
+   fd = open(fpath, defrag_open_mode);
if (fd < 0) {
goto error;
}
@@ -915,6 +918,9 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int compress_type = BTRFS_COMPRESS_NONE;
DIR *dirstream;
 
+   if (get_running_kernel_version() < KERNEL_VERSION(4,19,0))
+   defrag_open_mode = O_RDWR;
+
/*
 * Kernel has a different default (256K) that is supposed to be safe,
 * but it does not defragment very well. The 32M will likely lead to
@@ -1015,7 +1021,7 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int defrag_err = 0;
 
dirstream = NULL;
-   fd = open_file_or_dir(argv[i], &dirstream);
+   fd = open_file_or_dir3(argv[i], &dirstream, defrag_open_mode);
if (fd < 0) {
error("cannot open %s: %m", argv[i]);
ret = -errno;
-- 
2.19.1



Re: Filesystem mounts fine but hangs on access

2018-11-04 Thread Adam Borowski
On Sun, Nov 04, 2018 at 06:29:06PM +, Duncan wrote:
> So do consider adding noatime to your mount options if you haven't done 
> so already.  AFAIK, the only /semi-common/ app that actually uses atimes 
> these days is mutt (for read-message tracking), and then not for mbox, so 
> you should be safe to at least test turning it off.

To the contrary, mutt uses atimes only for mbox.
 
> And YMMV, but if you do use mutt or something else that uses atimes, I'd 
> go so far as to recommend finding an alternative, replacing either btrfs 
> (because as I said, relatime is arguably enough on a traditional non-COW 
> filesystem) or whatever it is that uses atimes, your call, because IMO it 
> really is that big a deal.

Fortunately, mutt's use could be fixed by teaching it to touch atimes
manually.  And that's already done, for both forks (vanilla and neomutt).


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Have you heard of the Amber Road?  For thousands of years, the
⣾⠁⢰⠒⠀⣿⡁ Romans and co valued amber, hauled through the Europe over the
⢿⡄⠘⠷⠚⠋⠀ mountains and along the Vistula, from Gdańsk.  To where it came
⠈⠳⣄ together with silk (judging by today's amber stalls).


Re: python-btrfs v10 preview... detailed usage reporting and a tutorial

2018-10-07 Thread Adam Borowski
On Mon, Oct 08, 2018 at 02:03:44AM +0200, Hans van Kranenburg wrote:
> And yes, when promoting things like the new show_usage example to
> programs that are easily available, users will probably start parsing
> the output of them with sed and awk which is a total abomination and the
> absolute opposite of the purpose of the library. So be it. Let it go. :D
> "The code never bothered me any way".

It's not like some deranged person would parse the output of, say, show_file
in Perl...
 
> The interesting question that remains is where the result should go.
> 
> btrfs-heatmap is a thing of its own now, but it's a bit of the "show
> case" example using the lib, with its own collection of documentation
> and even possibility to script it again.
> 
> Shipping the 'binaries' in the python3-btrfs package wouldn't be the
> right thing, so where should they go? apt-get install btrfs-moar-utils-yolo?

At least in Debian, moving executables between packages is a matter of
versioned Replaces (+Conflicts: old), so if any point you decide differently
it's not a problem.  So btrfs-moar-utils-yolo should work well.

> Or should btrfs-progs start to use this to accelerate improvement for
> providing a richer collection of useful progs for things that are not on
> essential level (like, you won't need them inside initramfs, so they can
> use python)?

You might want your own package that's agile and btrfs-progs for things
declared to be rock stable (WRT command-line API, not neccesarily stability
of code).

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 10 people enter a bar: 1 who understands binary,
⢿⡄⠘⠷⠚⠋⠀ 1 who doesn't, D who prefer to write it as hex,
⠈⠳⣄ and 1 who narrowly avoided an off-by-one error.


Re: python-btrfs v10 preview... detailed usage reporting and a tutorial

2018-09-23 Thread Adam Borowski
On Sun, Sep 23, 2018 at 11:54:12PM +0200, Hans van Kranenburg wrote:
> Two examples have been added, which use the new code. I would appreciate
> extra testing. Please try them and see if the reported numbers make sense:
> 
> space_calculator.py
> ---
> Best to be initially described as a CLI version of the well-known
> webbased btrfs space calculator by Hugo. ;] Throw a few disk sizes at
> it, choose data and metadata profile and see how much space you would
> get to store actual data.
> 
> See commit message "Add example to calculate usable and wasted space"
> for example output.
> 
> show_usage.py
> -
> The contents of the old show_usage.py example that simply showed a list
> of block groups are replaced with a detailed usage report of an existing
> filesystem.

I wonder, perhaps at least some of the examples could be elevated to
commands meant to be run by end-user?  Ie, installing them to /usr/bin/,
dropping the extension?  They'd probably need less generic names, though.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 10 people enter a bar:
⣾⠁⢰⠒⠀⣿⡁ • 1 who understands binary,
⢿⡄⠘⠷⠚⠋⠀ • 1 who doesn't,
⠈⠳⣄ • and E who prefer to write it as hex.


Re: Transactional btrfs

2018-09-08 Thread Adam Borowski
On Sat, Sep 08, 2018 at 08:45:47PM +, Martin Raiber wrote:
> Am 08.09.2018 um 18:24 schrieb Adam Borowski:
> > On Thu, Sep 06, 2018 at 06:08:33AM -0400, Austin S. Hemmelgarn wrote:
> >> On 2018-09-06 03:23, Nathan Dehnel wrote:
> >>> So I guess my question is, does btrfs support atomic writes across
> >>> multiple files? Or is anyone interested in such a feature?
> >>>
> >> I'm fairly certain that it does not currently, but in theory it would not 
> >> be
> >> hard to add.

> >> However, if this were extended to include rename, unlink, touch, and a
> >> handful of other VFS operations, then I can easily think of a few dozen use
> >> cases.  Package managers in particular would likely be very interested in
> >> being able to atomically rename a group of files as a single transaction, 
> >> as
> >> it would make their job _much_ easier.

> > I wonder, what about:
> > sync; mount -o remount,commit=999,flushoncommit
> > eatmydata apt dist-upgrade
> > sync; mount -o remount,commit=30,noflushoncommit
> >
> > Obviously, this gets fooled by fsyncs, and makes the transaction affect the
> > whole system (if you have unrelated writes they won't get committed until
> > the end of transaction).  Then there are nocow files, but you already made
> > the decision to disable most features of btrfs for them.

> Now combine this with snapshot root, then on success rename exchange to
> root and you are there.

No need: no unsuccessful transactions ever get written to the disk.
(Not counting unreachable stuff.)

> Btrfs had in the past TRANS_START and TRANS_END ioctls (for ceph, I
> think), but no rollback (and therefore no error handling incl. ENOSPC).
> 
> If you want to look at a working file system transaction mechanism, you
> should look at transactional NTFS (TxF). They are writing they are
> deprecating it, so it's perhaps not very widely used. Windows uses it
> for updates, I think.

You're talking about multiple simultaneous transactions, they have a massive
complexity cost.  And btrfs is already ridiculously complex.  I don't really
see a good way to tie this with the POSIX API without some serious
rethinking.

dpkg can already recover from a properly returned error (although not as
nicely as a full rollback); what is fatal for it is having its status
database corrupted/out of sync.  That's why it does a multiple fsync dance
and keeps fully rewriting its files over and over and over.

Atomic operations are pretty useful even without a rollback: you still need
to be able to handle failure, but not a crash.

> Specifically for btrfs, the problem would be that it really needs to
> support multiple simultaneous writers, otherwise one transaction can
> block the whole system.

My dirty hack above doesn't suffer from such a block: it only suffers from
compromising durability of concurrent writers.  During that userspace
transaction, there are no commits until it finishes; this means that if
there's unrelated activity it may suffer from losing writes that were done
between transaction start and crash.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17]
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
⠈⠳⣄ • use glitches to walk on water [Mt14:25-26]


Re: Transactional btrfs

2018-09-08 Thread Adam Borowski
On Thu, Sep 06, 2018 at 06:08:33AM -0400, Austin S. Hemmelgarn wrote:
> On 2018-09-06 03:23, Nathan Dehnel wrote:
> > So I guess my question is, does btrfs support atomic writes across
> > multiple files? Or is anyone interested in such a feature?
> > 
> I'm fairly certain that it does not currently, but in theory it would not be
> hard to add.
> 
> Realistically, the only cases I can think of where cross-file atomic
> _writes_ would be of any benefit are database systems.
> 
> However, if this were extended to include rename, unlink, touch, and a
> handful of other VFS operations, then I can easily think of a few dozen use
> cases.  Package managers in particular would likely be very interested in
> being able to atomically rename a group of files as a single transaction, as
> it would make their job _much_ easier.

I wonder, what about:
sync; mount -o remount,commit=999,flushoncommit
eatmydata apt dist-upgrade
sync; mount -o remount,commit=30,noflushoncommit

Obviously, this gets fooled by fsyncs, and makes the transaction affect the
whole system (if you have unrelated writes they won't get committed until
the end of transaction).  Then there are nocow files, but you already made
the decision to disable most features of btrfs for them.

So unless something forces a commit, this should already work, giving
cross-file atomic writes, renames and so on.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17]
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
⠈⠳⣄ • use glitches to walk on water [Mt14:25-26]


Re: dduper - Offline btrfs deduplication tool

2018-09-07 Thread Adam Borowski
On Fri, Sep 07, 2018 at 09:27:28AM +0530, Lakshmipathi.G wrote:
> > One question:
> > Why not ioctl_fideduperange?
> > i.e. you kill most of benefits from that ioctl - atomicity.
> > 
> I plan to add fideduperange as an option too. User can
> choose between fideduperange and ficlonerange call.
> 
> If I'm not wrong, with fideduperange, kernel performs
> comparsion check before dedupe. And it will increase
> time to dedupe files.

You already read the files to md5sum them, so you have no speed gain.
You get nasty data-losing races, and risk collisions as well.  md5sum is
safe against random occurrences (compared e.g. to the chance of lightning
hitting you today), but is exploitable by a hostile user.  On the other
hand, full bit-to-bit comparison is faster and 100% safe.

You can't skip verification -- the checksums are only 32-bit.  Two distinct
extents have a 1:4G chance of colliding, which means you can expect one
false positive with 64K extents, rising quadratically as the number of files
grows.
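(Not part of the original mail: a quick back-of-the-envelope check of the
quoted numbers. With 32-bit checksums each pair of distinct extents collides
with probability about 2**-32, so the birthday bound gives roughly
n*(n-1)/2 / 2**32 expected false positives.)

```python
# Expected number of 32-bit checksum collisions among n extents
# (birthday-bound estimate; an illustration, not btrfs code).
def expected_collisions(n: int, bits: int = 32) -> float:
    pairs = n * (n - 1) // 2          # unordered pairs of extents
    return pairs / 2 ** bits          # each pair collides w.p. ~2**-bits

print(expected_collisions(2 ** 16))   # 64K extents -> ~0.5, as claimed
print(expected_collisions(2 ** 18))   # 4x the extents -> ~16x the collisions
```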


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄ preimage for double rot13!


[PATCH v3] btrfs-progs: defrag: open files RO on new enough kernels

2018-09-03 Thread Adam Borowski
Defragging an executable conflicts both ways with it being run, resulting in
ETXTBSY.  This either makes defrag fail or prevents the program from being
executed.

Kernels 4.19-rc1 and later allow defragging files you could have possibly
opened rw, even if the passed descriptor is ro (commit 616d374efa23).

Signed-off-by: Adam Borowski 
---
v2: more eloquent description; root can't defrag RO on old kernels (unlike
dedupe)
v3: more eloquentier description; s/defrag_ro/defrag_open_mode/

 cmds-filesystem.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 06c8311b..99e2aec0 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -39,12 +40,14 @@
 #include "list_sort.h"
 #include "disk-io.h"
 #include "help.h"
+#include "fsfeatures.h"
 
 /*
  * for btrfs fi show, we maintain a hash of fsids we've already printed.
  * This way we don't print dups if a given FS is mounted more than once.
  */
 static struct seen_fsid *seen_fsid_hash[SEEN_FSID_HASH_SIZE] = {NULL,};
+static mode_t defrag_open_mode = O_RDONLY;
 
 static const char * const filesystem_cmd_group_usage[] = {
"btrfs filesystem []  []",
@@ -877,7 +880,7 @@ static int defrag_callback(const char *fpath, const struct 
stat *sb,
if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) {
if (defrag_global_verbose)
printf("%s\n", fpath);
-   fd = open(fpath, O_RDWR);
+   fd = open(fpath, defrag_open_mode);
if (fd < 0) {
goto error;
}
@@ -914,6 +917,9 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int compress_type = BTRFS_COMPRESS_NONE;
DIR *dirstream;
 
+   if (get_running_kernel_version() < KERNEL_VERSION(4,19,0))
+   defrag_open_mode = O_RDWR;
+
/*
 * Kernel has a different default (256K) that is supposed to be safe,
 * but it does not defragment very well. The 32M will likely lead to
@@ -1014,7 +1020,7 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int defrag_err = 0;
 
dirstream = NULL;
-   fd = open_file_or_dir(argv[i], &dirstream);
+   fd = open_file_or_dir3(argv[i], &dirstream, defrag_open_mode);
if (fd < 0) {
error("cannot open %s: %m", argv[i]);
ret = -errno;
-- 
2.19.0.rc1



[PATCH v2] btrfs-progs: defrag: open files RO on new enough kernels

2018-09-03 Thread Adam Borowski
Defragging an executable conflicts both ways with it being run, resulting in
ETXTBSY.  This either makes defrag fail or prevents the program from being
executed.

Kernels 4.19-rc1 and later allow defragging files you could have possibly
opened rw, even if the passed descriptor is ro.

Signed-off-by: Adam Borowski 
---
v2: more eloquent description; root can't defrag RO on old kernels (unlike
dedupe)


 cmds-filesystem.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 06c8311b..17e992a3 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -39,12 +40,14 @@
 #include "list_sort.h"
 #include "disk-io.h"
 #include "help.h"
+#include "fsfeatures.h"
 
 /*
  * for btrfs fi show, we maintain a hash of fsids we've already printed.
  * This way we don't print dups if a given FS is mounted more than once.
  */
 static struct seen_fsid *seen_fsid_hash[SEEN_FSID_HASH_SIZE] = {NULL,};
+static mode_t defrag_ro = O_RDONLY;
 
 static const char * const filesystem_cmd_group_usage[] = {
"btrfs filesystem []  []",
@@ -877,7 +880,7 @@ static int defrag_callback(const char *fpath, const struct 
stat *sb,
if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) {
if (defrag_global_verbose)
printf("%s\n", fpath);
-   fd = open(fpath, O_RDWR);
+   fd = open(fpath, defrag_ro);
if (fd < 0) {
goto error;
}
@@ -914,6 +917,9 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int compress_type = BTRFS_COMPRESS_NONE;
DIR *dirstream;
 
+   if (get_running_kernel_version() < KERNEL_VERSION(4,19,0))
+   defrag_ro = O_RDWR;
+
/*
 * Kernel has a different default (256K) that is supposed to be safe,
 * but it does not defragment very well. The 32M will likely lead to
@@ -1014,7 +1020,7 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int defrag_err = 0;
 
dirstream = NULL;
-   fd = open_file_or_dir(argv[i], &dirstream);
+   fd = open_file_or_dir3(argv[i], &dirstream, defrag_ro);
if (fd < 0) {
error("cannot open %s: %m", argv[i]);
ret = -errno;
-- 
2.19.0.rc1



Re: [PATCH 2/2] btrfs-progs: defrag: open files RO on new enough kernels or if root

2018-09-03 Thread Adam Borowski
On Mon, Sep 03, 2018 at 02:04:23PM +0300, Nikolay Borisov wrote:
> On  3.09.2018 13:14, Adam Borowski wrote:
> > -   fd = open(fpath, O_RDWR);
> > +   fd = open(fpath, defrag_ro);
> 
> Looking at the kernel code I think this is in fact incorrect, because in
> ioctl.c we have:
> 
> if (!(file->f_mode & FMODE_WRITE)) {
> 
> ret = -EINVAL;
> 
> goto out;
> 
> }
> 
> So it seems a hard requirement to have opened a file for RW when you
> want to defragment it.

Oif!  I confused this with dedup, which does allow root to dedup RO even on
old kernels.  Good catch.


-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17]
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
⠈⠳⣄ • use glitches to walk on water [Mt14:25-26]


Re: [PATCH 2/2] btrfs-progs: defrag: open files RO on new enough kernels or if root

2018-09-03 Thread Adam Borowski
On Mon, Sep 03, 2018 at 02:01:21PM +0300, Nikolay Borisov wrote:
> On  3.09.2018 13:14, Adam Borowski wrote:
> > Fixes ETXTBSY races.
> 
> You have to be more eloquent than that and explain at least one race
> condition.

If you try to defrag an executable that's currently running:

ERROR: cannot open XXX: Text file busy
total 1 failures

If you try to run an executable that's being defragged:

-bash: XXX: Text file busy

The former tends to be a long-lasting condition but has only benign fallout
(executables almost never get fragmented, not recompressing a single file is
not the end of the world), the latter is only a brief window of time but has
potential for data loss.
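
(Aside, not in the original mail: the underlying kernel behaviour can be
reproduced without btrfs at all -- any write-mode open of a currently
running binary fails the same way the old O_RDWR defrag open did. File
names below are invented.)

```shell
# Reproduce ETXTBSY without btrfs: opening a running binary for write
# fails, just like defrag's old O_RDWR open.  Paths are illustrative.
cd "$(mktemp -d)"
cp "$(command -v sleep)" ./busy   # a private copy we may clobber
./busy 30 & pid=$!                # run it so its text pages are mapped
echo overwrite > ./busy           # fails: "Text file busy" (ETXTBSY)
kill "$pid"
```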

> > +static mode_t defrag_ro = O_RDONLY;
> 
> This brings no value whatsoever, just use O_RDONLY directly

On old kernels it gets overwritten with:

> > +   if (get_running_kernel_version() < KERNEL_VERSION(4,19,0) && getuid())
> > +   defrag_ro = O_RDWR;


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17]
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
⠈⠳⣄ • use glitches to walk on water [Mt14:25-26]


Re: IO errors when building RAID1.... ?

2018-09-03 Thread Adam Borowski
On Sun, Sep 02, 2018 at 09:15:25PM -0600, Chris Murphy wrote:
> For > 10 years drive firmware handles bad sector remapping internally.
> It remaps the sector logical address to a reserve physical sector.
> 
> NTFS and ext[234] have a means of accepting a list of bad sectors, and
> will avoid using them. Btrfs doesn't. But also ZFS, XFS, APFS, HFS+
> and I think even FAT, lack this capability.


FAT does have it: cluster entry value FF7 (FAT12)/FFF7 (FAT16)/... marks a
bad cluster.



-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17]
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
⠈⠳⣄ • use glitches to walk on water [Mt14:25-26]


[PATCH 1/2] btrfs-progs: fix kernel version parsing on some versions past 3.0

2018-09-03 Thread Adam Borowski
The code fails if the third section is missing (like "4.18") or is followed
by anything but "." or "-".  This happens for example if we're not exactly
at a tag and CONFIG_LOCALVERSION_AUTO=n (which results in "4.18.5+").

Signed-off-by: Adam Borowski 
---
 fsfeatures.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fsfeatures.c b/fsfeatures.c
index 7d85d60f..66111bf4 100644
--- a/fsfeatures.c
+++ b/fsfeatures.c
@@ -216,11 +216,8 @@ u32 get_running_kernel_version(void)
return (u32)-1;
version |= atoi(tmp) << 8;
tmp = strtok_r(NULL, ".", &saveptr);
-   if (tmp) {
-   if (!string_is_numerical(tmp))
-   return (u32)-1;
+   if (tmp && string_is_numerical(tmp))
version |= atoi(tmp);
-   }
 
return version;
 }
-- 
2.19.0.rc1
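
(For illustration only -- my re-implementation of the patched logic, not
the actual btrfs-progs C code: a missing or non-numeric third component,
as in "4.18" or "4.18.5+", is now treated as patchlevel 0 instead of
failing the whole parse.)

```python
# Mirror of the fixed get_running_kernel_version() behaviour.
# KERNEL_VERSION packs a release as (major << 16) | (minor << 8) | patch.
def parse_kver(release: str) -> int:
    fields = release.split(".", 2)
    version = (int(fields[0]) << 16) | (int(fields[1]) << 8)
    if len(fields) == 3 and fields[2].isdigit():
        version |= int(fields[2])     # only if purely numeric
    return version

assert parse_kver("4.18")    == 0x041200   # third section missing: OK
assert parse_kver("4.18.5+") == 0x041200   # trailer ignored, not an error
assert parse_kver("4.18.5")  == 0x041205
```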



[PATCH 2/2] btrfs-progs: defrag: open files RO on new enough kernels or if root

2018-09-03 Thread Adam Borowski
Fixes ETXTBSY races.

Signed-off-by: Adam Borowski 
---
 cmds-filesystem.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 06c8311b..4c9df69f 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -39,12 +40,14 @@
 #include "list_sort.h"
 #include "disk-io.h"
 #include "help.h"
+#include "fsfeatures.h"
 
 /*
  * for btrfs fi show, we maintain a hash of fsids we've already printed.
  * This way we don't print dups if a given FS is mounted more than once.
  */
 static struct seen_fsid *seen_fsid_hash[SEEN_FSID_HASH_SIZE] = {NULL,};
+static mode_t defrag_ro = O_RDONLY;
 
 static const char * const filesystem_cmd_group_usage[] = {
"btrfs filesystem []  []",
@@ -877,7 +880,7 @@ static int defrag_callback(const char *fpath, const struct 
stat *sb,
if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) {
if (defrag_global_verbose)
printf("%s\n", fpath);
-   fd = open(fpath, O_RDWR);
+   fd = open(fpath, defrag_ro);
if (fd < 0) {
goto error;
}
@@ -914,6 +917,9 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int compress_type = BTRFS_COMPRESS_NONE;
DIR *dirstream;
 
+   if (get_running_kernel_version() < KERNEL_VERSION(4,19,0) && getuid())
+   defrag_ro = O_RDWR;
+
/*
 * Kernel has a different default (256K) that is supposed to be safe,
 * but it does not defragment very well. The 32M will likely lead to
@@ -1014,7 +1020,7 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int defrag_err = 0;
 
dirstream = NULL;
-   fd = open_file_or_dir(argv[i], &dirstream);
+   fd = open_file_or_dir3(argv[i], &dirstream, defrag_ro);
if (fd < 0) {
error("cannot open %s: %m", argv[i]);
ret = -errno;
-- 
2.19.0.rc1



Re: lazytime mount option—no support in Btrfs

2018-08-21 Thread Adam Borowski
On Mon, Aug 20, 2018 at 08:16:16AM -0400, Austin S. Hemmelgarn wrote:
> Also, slightly OT, but atimes are not where the real benefit is here for
> most people.  No sane software other than mutt uses atimes (and mutt's use
> of them is not sane, but that's a different argument)

Right.  There are two competing forks of mutt: neomutt and vanilla:
https://github.com/neomutt/neomutt/commit/816095bfdb72caafd8845e8fb28cbc8c6afc114f
https://gitlab.com/dops/mutt/commit/489a1c394c29e4b12b705b62da413f322406326f

So this has already been taken care of.

> so pretty much everyone who wants to avoid the overhead from them can just
> use the `noatime` mount option.

atime updates (including relatime) are bad not only for performance; they
also explode the disk space used by snapshots (btrfs, LVM, ...) -- to the
tune of ~5% per snapshot for some non-crafted loads.  And they are bad for
media with low write endurance (SD cards, as used by most SoCs).

Thus, atime needs to die.

> The real benefit for most people is with mtimes, for which there is no
> other way to limit the impact they have on performance.

With btrfs, any write already triggers metadata update (except nocow), thus
there's little benefit of lazytime for mtimes.
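
(An illustrative fstab line for acting on the above -- UUID and mount
point invented:)

```
# /etc/fstab -- example entry only:
UUID=0123abcd-0000-0000-0000-000000000000  /data  btrfs  noatime  0  0
```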


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17]
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
⠈⠳⣄ • use glitches to walk on water [Mt14:25-26]


Re: [RESEND][PATCH v5 0/2] vfs: better dedupe permission check

2018-08-07 Thread Adam Borowski
On Tue, Aug 07, 2018 at 02:49:47PM -0700, Mark Fasheh wrote:
> Hi Andrew,
> 
> Could I please have these patches upstreamed or at least put in a tree for
> more public testing? They've hit fsdevel a few times now, I have links to
> the discussions in the change log below.

> The first patch expands our check to allow dedupe of a file if the
> user owns it or otherwise would be allowed to write to it.
[...]
> The other problem we have is also related to forcing the user to open
> target files for write - A process trying to exec a file currently
> being deduped gets ETXTBUSY. The answer (as above) is to allow them to
> open the targets ro - root can already do this. There was a patch from
> Adam Borowski to fix this back in 2016

> The 2nd patch fixes our return code for permission denied to be
> EPERM. For some reason we're returning EINVAL - I think that's
> probably my fault. At any rate, we need to be returning something
> descriptive of the actual problem, otherwise callers see EINVAL and
> can't really make a valid determination of what's gone wrong.

Note that the counterpart of these two patches for BTRFS_IOC_DEFRAG, which
fixes the same issues, is included in btrfs' for-next, slated for 4.19. 
While technically dedupe and defrag are independent, there would be somewhat
less confusion if both behave the same in the same kernel version.

Thus, it'd be nice if you would consider taking this.  Should be safe:
even the permission check is paranoid.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ So a Hungarian gypsy mountainman, lumberjack by day job,
⣾⠁⢰⠒⠀⣿⡁ brigand by, uhm, hobby, invented a dish: goulash on potato
⢿⡄⠘⠷⠚⠋⠀ pancakes.  Then the Polish couldn't decide which of his
⠈⠳⣄ adjectives to use for the dish's name.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS and databases

2018-08-01 Thread Adam Borowski
On Wed, Aug 01, 2018 at 05:45:15AM +0200, MegaBrutal wrote:
> But there is still one question that I can't get over: if you store a
> database (e.g. MySQL), would you prefer having a BTRFS volume mounted
> with nodatacow, or would you just simply use ext4?
> 
> I know that with nodatacow, I take away most of the benefits of BTRFS
> (those are actually hurting database performance – the exact CoW
> nature that is elsewhere a blessing, with databases it's a drawback).
> But are there any advantages of still sticking to BTRFS for a database
> albeit CoW is disabled, or should I just return to the old and
> reliable ext4 for those applications?

Is this database performance-critical?

If yes, you'd want ext4 -- nocow is a crappy ext4 lookalike, with no
benefits of btrfs.  Or, if you snapshot it, you get bad fragmentation yet no
checksums/etc.

If no, regular cow (especially with autodefrag) will be enough.  Sure, this
particular load won't be as performant (mysql really loves fsync, which is
an anathema to btrfs), but you get all the data safety improvements,
frequent cheap backups, and so on.

Thus: if the server's primary purpose is that database, you don't want
btrfs.  If the database is merely incidental, not microoptimizing it will
save a lot of your time.

In neither case nocow is a good idea.  Especially if raid (!= 0) is
involved.


Meow!
-- 
// If you believe in so-called "intellectual property", please immediately
// cease using counterfeit alphabets.  Instead, contact the nearest temple
// of Amon, whose priests will provide you with scribal services for all
// your writing needs, for Reasonable And Non-Discriminatory prices.


Re: [PATCH resend 1/2] btrfs: allow defrag on a file opened ro that has rw permissions

2018-07-24 Thread Adam Borowski
On Mon, Jul 23, 2018 at 05:26:24PM +0200, David Sterba wrote:
> On Wed, Jul 18, 2018 at 12:08:59AM +0200, Adam Borowski wrote:
(Combined with as-folded)

| | btrfs: allow defrag on a file opened read-only that has rw permissions
| |
> > Requiring a rw descriptor conflicts both ways with exec, returning ETXTBSY
> > whenever you try to defrag a program that's currently being run, or
> > causing intermittent exec failures on a live system being defragged.
> > 
> > As defrag doesn't change the file's contents in any way, there's no reason
> > to consider it a rw operation.  Thus, let's check only whether the file
> > could have been opened rw.  Such access control is still needed as
> > currently defrag can use extra disk space, and might trigger bugs.
<-
| | We give EINVAL when the request is invalid; here it's ok but merely the
| | user has insufficient privileges.  Thus, this return value reflects the
| | error better -- as discussed in the identical case for dedupe.
| |
| | According to codesearch.debian.net, no userspace program distinguishes
| | these values beyond strerror().
| |
| | Signed-off-by: Adam Borowski 
| | Reviewed-by: David Sterba 
| | [ fold the EPERM patch from Adam ]
| | Signed-off-by: David Sterba 

[...]
> So, I'll add the patch to 4.19 queue. It's small and isolated change so
> a revert would be easy in case we find something bad. The 2nd patch
> should be IMHO part of this change as it's logical to return the error
> code in the patch that modifies the user visible behaviour.

A nitpick: the new commit message has a dangling pointer "this" to the title
of the commit that was squashed.  It was:

| btrfs: defrag: return EPERM not EINVAL when only permissions fail

It'd be nice if it could be inserted in some form in the place I marked with
an arrow.

But then, commit messages are not vital.  The actual functionality patch has
been applied correctly.  And thanks for adding the comment.


Meow!
-- 
// If you believe in so-called "intellectual property", please immediately
// cease using counterfeit alphabets.  Instead, contact the nearest temple
// of Amon, whose priests will provide you with scribal services for all
// your writing needs, for Reasonable And Non-Discriminatory prices.


[PATCH resend 1/2] btrfs: allow defrag on a file opened ro that has rw permissions

2018-07-17 Thread Adam Borowski
Requiring a rw descriptor conflicts both ways with exec, returning ETXTBSY
whenever you try to defrag a program that's currently being run, or
causing intermittent exec failures on a live system being defragged.

As defrag doesn't change the file's contents in any way, there's no reason
to consider it a rw operation.  Thus, let's check only whether the file
could have been opened rw.  Such access control is still needed as
currently defrag can use extra disk space, and might trigger bugs.

Signed-off-by: Adam Borowski 
---
 fs/btrfs/ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 43ecbe620dea..01c150b6ab62 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2941,7 +2941,8 @@ static int btrfs_ioctl_defrag(struct file *file, void 
__user *argp)
ret = btrfs_defrag_root(root);
break;
case S_IFREG:
-   if (!(file->f_mode & FMODE_WRITE)) {
+   if (!capable(CAP_SYS_ADMIN) &&
+   inode_permission(inode, MAY_WRITE)) {
ret = -EINVAL;
goto out;
}
-- 
2.18.0



[PATCH resend 2/2] btrfs: defrag: return EPERM not EINVAL when only permissions fail

2018-07-17 Thread Adam Borowski
We give EINVAL when the request is invalid; here it's ok but merely the
user has insufficient privileges.  Thus, this return value reflects the
error better -- as discussed in the identical case for dedupe.

According to codesearch.debian.net, no userspace program distinguishes
these values beyond strerror().

Signed-off-by: Adam Borowski 
---
 fs/btrfs/ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 01c150b6ab62..e96e3c3caca1 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2943,7 +2943,7 @@ static int btrfs_ioctl_defrag(struct file *file, void 
__user *argp)
case S_IFREG:
if (!capable(CAP_SYS_ADMIN) &&
inode_permission(inode, MAY_WRITE)) {
-   ret = -EINVAL;
+   ret = -EPERM;
goto out;
}
 
-- 
2.18.0



[PATCH resend 0/2] btrfs: fix races between exec and defrag

2018-07-17 Thread Adam Borowski
Hi!
Here's a ping for a patch to fix ETXTBSY races between defrag and exec, just
like the dedupe counterpart.  Unlike that one, which is shared among multiple
filesystems and thus lives in Al Viro's land, this one is btrfs-only.

Attached: a simple tool to fragment a file, by ten O_SYNC rewrites of length
1 at random positions; racy vs concurrent writes or execs, but it shouldn't
damage the file otherwise.

Also attached: a preliminary patch for -progs; it still lacks a check for the
kernel version, but to add such a check we'd need to know which kernels
actually permit ro defrag for non-root.

No man page patch -- there's no man page to be patched...


Meow!
-- 
// If you believe in so-called "intellectual property", please immediately
// cease using counterfeit alphabets.  Instead, contact the nearest temple
// of Amon, whose priests will provide you with scribal services for all
// your writing needs, for Reasonable And Non-Discriminatory prices.
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

static void die(const char *txt, ...) __attribute__((format (printf, 1, 2)));
static void die(const char *txt, ...)
{
fprintf(stderr, "fragme: ");

va_list ap;
va_start(ap, txt);
vfprintf(stderr, txt, ap);
va_end(ap);

exit(1);
}

static uint64_t rnd(uint64_t max)
{
__uint128_t r;
if (syscall(SYS_getrandom, &r, sizeof(r), 0)==-1)
die("getrandom(): %m\n");
return r%max;
}

int main(int argc, char **argv)
{
if (argc!=2)
die("Usage: fragme \n");

int fd = open(argv[1], O_RDWR|O_SYNC);
if (fd == -1)
die("open(\"%s\"): %m\n", argv[1]);
off_t size = lseek(fd, 0, SEEK_END);
if (size == -1)
die("lseek(SEEK_END): %m\n");

for (int i=0; i<10; ++i)
{
off_t off = rnd(size);
char b;
if (lseek(fd, off, SEEK_SET) != off)
die("lseek for read: %m\n");
if (read(fd, &b, 1) != 1)
die("read(%lu): %m\n", off);
if (lseek(fd, off, SEEK_SET) != off)
die("lseek for write: %m\n");
if (write(fd, &b, 1) != 1)
    die("write: %m\n");
}

return 0;
}
From d040af09adb03daadbba4336700f40425a860320 Mon Sep 17 00:00:00 2001
From: Adam Borowski 
Date: Tue, 28 Nov 2017 01:00:21 +0100
Subject: [PATCH] defrag: open files RO

NOT FOR MERGING -- requires kernel versioning

Fixes ETXTBSY races.

Signed-off-by: Adam Borowski 
---
 cmds-filesystem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 30a50bf5..7eb6b7bb 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -876,7 +876,7 @@ static int defrag_callback(const char *fpath, const struct stat *sb,
 	if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) {
 		if (defrag_global_verbose)
 			printf("%s\n", fpath);
-		fd = open(fpath, O_RDWR);
+		fd = open(fpath, O_RDONLY);
 		if (fd < 0) {
 			goto error;
 		}
@@ -1012,7 +1012,7 @@ static int cmd_filesystem_defrag(int argc, char **argv)
 		int defrag_err = 0;
 
 		dirstream = NULL;
-		fd = open_file_or_dir(argv[i], &dirstream);
+		fd = open_file_or_dir3(argv[i], &dirstream, O_RDONLY);
 		if (fd < 0) {
 			error("cannot open %s: %m", argv[i]);
 			ret = -errno;
-- 
2.18.0



Re: unsolvable technical issues?

2018-06-28 Thread Adam Borowski
On Wed, Jun 27, 2018 at 08:50:11PM +0200, waxhead wrote:
> Chris Murphy wrote:
> > On Thu, Jun 21, 2018 at 5:13 PM, waxhead  wrote:
> > > According to this:
> > > 
> > > https://stratis-storage.github.io/StratisSoftwareDesign.pdf
> > > Page 4 , section 1.2
> > > 
> > > It claims that BTRFS still have significant technical issues that may 
> > > never
> > > be resolved.
> > > Could someone shed some light on exactly what these technical issues might
> > > be?! What are BTRFS biggest technical problems?
> > 
> > 
> > I think it's appropriate to file an issue and ask what they're
> > referring to. It very well might be use case specific to Red Hat.
> > https://github.com/stratis-storage/stratis-storage.github.io/issues

> https://github.com/stratis-storage/stratis-storage.github.io/issues/1
> 
> Apparently the author have toned down the wording a bit, this confirm that
> the claim was without basis and probably based on "popular myth".
> The document the PDF links to is not yet updated.

It's a company whose profits rely on users choosing it over anything that
competes.  Adding propaganda to a public document is a natural thing for
them to do.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ There's an easy way to tell toy operating systems from real ones.
⣾⠁⢰⠒⠀⣿⡁ Just look at how their shipped fonts display U+1F52B, this makes
⢿⡄⠘⠷⠚⠋⠀ the intended audience obvious.  It's also interesting to see OSes
⠈⠳⣄ go back and forth wrt their intended target.


Re: [PATCH RFC] btrfs: Do extra device generation check at mount time

2018-06-28 Thread Adam Borowski
On Thu, Jun 28, 2018 at 03:04:43PM +0800, Qu Wenruo wrote:
> There is a reporter considering btrfs raid1 has a major design flaw
> which can't handle nodatasum files.
> 
> Despite his incorrect expectation, btrfs indeed doesn't handle device
> generation mismatch well.
> 
> This means if one devices missed and re-appeared, even its generation
> no longer matches with the rest device pool, btrfs does nothing to it,
> but treat it as normal good device.
> 
> At least let's detect such generation mismatch and avoid mounting the
> fs.

Uhm, that'd be a nasty regression for the regular (no-nodatacow) case. 
The vast majority of data is fine, and extents that were written while a
device was missing will either be placed elsewhere (if the filesystem knew
it was degraded), or a read of the stale copy will notice a wrong checksum
and automatically recover (if the device was still falsely believed to be
good at write time).

We currently don't have selective scrub yet so resyncing such single-copy
extents is costly, but 1. all will be fine if the data is read, 2. it's
possible to add such a smart resync in the future, far better than a
write-intent bitmap can do.

To do the latter, we can note the last generation the filesystem was known
to be fully coherent (ie, all devices were successfully flushed with no
mysterious write failures), then run selective scrub (perhaps even
automatically) when the filesystem is no longer degraded.  There's some
extra complexity with 3- or 4-way RAID (multiple levels of degradation) but
a single number would help even there.

But even currently, without the above not-yet-written recovery, it's
reasonably safe to continue without scrub -- it's a case of running
partially degraded when the bad copy is already known to be suspicious.

For no-nodatacow data and metadata, that is.

> Currently there is no automatic rebuild yet, which means if users find
> device generation mismatch error message, they can only mount the fs
> using "device" and "degraded" mount option (if possible), then replace
> the offending device to manually "rebuild" the fs.

As nodatacow already means "I don't care about this data, or have another
way of recovering it", I don't quite get why we would drop existing
auto-recovery for a common transient failure case.

If you're paranoid, perhaps some bit "this filesystem has some nodatacow
data on it" could warrant such a block, but it would still need to be
overridable _without_ a need for replace.  There's also the problem that
systemd marks its journal nodatacow (despite it having infamously bad
handling of failures!), and too many distributions infect their default
installs with systemd, meaning such a bit would be on in most cases.

But why would I put all my other data at risk, just because there's a
nodatacow file?  There's a big difference between scrubbing when only a few
transactions worth of data is suspicious and completely throwing away a
mostly-good replica to replace it from the now fully degraded copy.

> I totally understand that, generation based solution can't handle
> split-brain case (where 2 RAID1 devices get mounted degraded separately)
> at all, but at least let's handle what we can do.

Generation does well at least unless both devices were mounted elsewhere
and got the exact same number of transactions; the problem is that nodatacow
doesn't bump the generation number.

> The best way to solve the problem is to make btrfs treat such lower gen
> devices as some kind of missing device, and queue an automatic scrub for
> that device.
> But that's a lot of extra work, at least let's start from detecting such
> problem first.

I wonder if there's some way to treat problematic nodatacow files as
degraded only?

Nodatacow misses most of btrfs mechanisms, thus to get it done right you'd
need to pretty much copy all of md's logic, with a write-intent bitmap or an
equivalent.
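
For reference, the md-style mechanism mentioned here is tiny in principle.
A toy model of a write-intent bitmap (purely illustrative -- nothing like
this exists in btrfs today, and md's real bitmap is persisted on disk and
far more involved):

```c
#include <stdint.h>

/* Toy write-intent bitmap: one bit per fixed-size region.  A bit is set
 * before writing the region and cleared once all mirrors are in sync,
 * so after a crash only the set regions need resyncing. */
#define REGIONS 64

struct wi_bitmap {
	uint64_t bits;	/* 1 = region may be out of sync across mirrors */
};

static void wi_mark_dirty(struct wi_bitmap *b, unsigned region)
{
	b->bits |= (uint64_t)1 << (region % REGIONS);
}

static void wi_mark_clean(struct wi_bitmap *b, unsigned region)
{
	b->bits &= ~((uint64_t)1 << (region % REGIONS));
}

static int wi_needs_resync(const struct wi_bitmap *b, unsigned region)
{
	return !!(b->bits & ((uint64_t)1 << (region % REGIONS)));
}
```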


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ There's an easy way to tell toy operating systems from real ones.
⣾⠁⢰⠒⠀⣿⡁ Just look at how their shipped fonts display U+1F52B, this makes
⢿⡄⠘⠷⠚⠋⠀ the intended audience obvious.  It's also interesting to see OSes
⠈⠳⣄ go back and forth wrt their intended target.


[PATCH] defrag: open files RO

2018-05-21 Thread Adam Borowski
NOT FOR MERGING -- requires kernel versioning

Fixes EXTXBSY races.

Signed-off-by: Adam Borowski 
---
 cmds-filesystem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 30a50bf5..7eb6b7bb 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -876,7 +876,7 @@ static int defrag_callback(const char *fpath, const struct 
stat *sb,
if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) {
if (defrag_global_verbose)
printf("%s\n", fpath);
-   fd = open(fpath, O_RDWR);
+   fd = open(fpath, O_RDONLY);
if (fd < 0) {
goto error;
}
@@ -1012,7 +1012,7 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int defrag_err = 0;
 
dirstream = NULL;
-   fd = open_file_or_dir(argv[i], &dirstream);
+   fd = open_file_or_dir3(argv[i], &dirstream, O_RDONLY);
if (fd < 0) {
error("cannot open %s: %m", argv[i]);
ret = -errno;
-- 
2.17.0



[PATCH 2/2] btrfs: defrag: return EPERM not EINVAL when only permissions fail

2018-05-21 Thread Adam Borowski
We give EINVAL when the request is invalid; here the request is fine but
the user merely has insufficient privileges.  Thus EPERM reflects the
error better -- as discussed in the identical case for dedupe.

According to codesearch.debian.net, no userspace program distinguishes
these values beyond strerror().

Signed-off-by: Adam Borowski 
---
 fs/btrfs/ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b75db9d72106..ae6a110987a7 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2563,7 +2563,7 @@ static int btrfs_ioctl_defrag(struct file *file, void 
__user *argp)
case S_IFREG:
if (!capable(CAP_SYS_ADMIN) &&
inode_permission(inode, MAY_WRITE)) {
-   ret = -EINVAL;
+   ret = -EPERM;
goto out;
}
 
-- 
2.17.0



[PATCH 1/2] btrfs: allow defrag on a file opened ro that has rw permissions

2018-05-21 Thread Adam Borowski
Requiring a rw descriptor conflicts both ways with exec, returning ETXTBSY
whenever you try to defrag a program that's currently being run, or
causing intermittent exec failures on a live system being defragged.

As defrag doesn't change the file's contents in any way, there's no reason
to consider it a rw operation.  Thus, let's check only whether the file
could have been opened rw.  Such access control is still needed as
currently defrag can use extra disk space, and might trigger bugs.

Signed-off-by: Adam Borowski 
---
 fs/btrfs/ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 632e26d6f7ce..b75db9d72106 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2561,7 +2561,8 @@ static int btrfs_ioctl_defrag(struct file *file, void 
__user *argp)
ret = btrfs_defrag_root(root);
break;
case S_IFREG:
-   if (!(file->f_mode & FMODE_WRITE)) {
+   if (!capable(CAP_SYS_ADMIN) &&
+   inode_permission(inode, MAY_WRITE)) {
ret = -EINVAL;
goto out;
}
-- 
2.17.0



[PATCH 0/2] btrfs: fix races between exec and defrag

2018-05-21 Thread Adam Borowski
Hi!
Here's a patch to fix ETXTBSY races between defrag and exec -- similar to
what was just submitted for dedupe, even to the point of being followed by
a second patch that replaces EINVAL with EPERM.

As defrag is not something you're going to do on files you don't write, I
skipped complex rules and I'm sending the original version of the patch
as-is.  It has stewed in my tree for two years (long story...), tested on
multiple machines.

Attached: a simple tool to fragment a file, by ten O_SYNC rewrites of length
1 at random positions; racy vs concurrent writes or execs but shouldn't
damage the file otherwise.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 
⢿⡄⠘⠷⠚⠋⠀ Certified airhead; got the CT scan to prove that!
⠈⠳⣄ 
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/syscall.h>

static void die(const char *txt, ...) __attribute__((format (printf, 1, 2)));
static void die(const char *txt, ...)
{
	fprintf(stderr, "fragme: ");

	va_list ap;
	va_start(ap, txt);
	vfprintf(stderr, txt, ap);
	va_end(ap);

	exit(1);
}

static uint64_t rnd(uint64_t max)
{
	__uint128_t r;
	if (syscall(SYS_getrandom, &r, sizeof(r), 0)==-1)
		die("getrandom(): %m\n");
	return r%max;
}

int main(int argc, char **argv)
{
	if (argc!=2)
		die("Usage: fragme <file>\n");

	int fd = open(argv[1], O_RDWR|O_SYNC);
	if (fd == -1)
		die("open(\"%s\"): %m\n", argv[1]);
	off_t size = lseek(fd, 0, SEEK_END);
	if (size == -1)
		die("lseek(SEEK_END): %m\n");

	for (int i=0; i<10; ++i)
	{
		off_t off = rnd(size);
		char b;
		if (lseek(fd, off, SEEK_SET) != off)
			die("lseek for read: %m\n");
		if (read(fd, &b, 1) != 1)
			die("read(%lu): %m\n", off);
		if (lseek(fd, off, SEEK_SET) != off)
			die("lseek for write: %m\n");
		if (write(fd, &b, 1) != 1)
			die("write: %m\n");
	}

	return 0;
}


Re: [PATCH v2 2/3] btrfs: balance: add args info during start and resume

2018-05-16 Thread Adam Borowski
On Wed, May 16, 2018 at 10:57:57AM +0300, Nikolay Borisov wrote:
> On 16.05.2018 05:51, Anand Jain wrote:
> > Balance args info is an important information to be reviewed for the
> > system audit. So this patch adds it to the kernel log.
> > 
> > Example:
> > 
> > -> btrfs bal start -dprofiles='raid1|single',convert=raid5 
> > -mprofiles='raid1|single',convert=raid5 /btrfs
> > 
> >  kernel: BTRFS info (device sdb): balance: start data profiles=raid1|single 
> > convert=raid5 metadata profiles=raid1|single convert=raid5 system 
> > profiles=raid1|single convert=raid5
> > 
> > -> btrfs bal start -dprofiles=raid5,convert=single 
> > -mprofiles='raid1|single',convert=raid5 --background /btrfs
> > 
> >  kernel: BTRFS info (device sdb): balance: start data profiles=raid5 
> > convert=single metadata profiles=raid1|single convert=raid5 system 
> > profiles=raid1|single convert=raid5
> > 
> > Signed-off-by: Anand Jain 
> 
> Why can't this code be part of progs, the bctl which you are parsing is
> constructed from the arguments passed from users space? I think you are
> adding way too much string parsing code to the kernel and this is never
> a good sign, since it's very easy to trip.

progs are not the only way to start a balance, they're merely the most
widespread one.  For example, Hans van Kranenburg has some smarter scripts
among his tools -- currently only of "example" quality, but quite useful
already.  "balance_least_used" works in greedy order (least used first) with
nice verbose output.  It's not unlikely he or someone else improves this
further.  Thus, I really think the logging should be kernel side.

On the other hand, the string producing (not parsing) code is so repetitive
that it indeed could use some refactoring.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 
⢿⡄⠘⠷⠚⠋⠀ Certified airhead; got the CT scan to prove that!
⠈⠳⣄ 


Re: [PATCH 1/2] vfs: allow dedupe of user owned read-only files

2018-05-13 Thread Adam Borowski
On Sun, May 13, 2018 at 06:16:53PM +, Mark Fasheh wrote:
> On Sat, May 12, 2018 at 04:49:20AM +0200, Adam Borowski wrote:
> > On Fri, May 11, 2018 at 12:26:50PM -0700, Mark Fasheh wrote:
> > > The permission check in vfs_dedupe_file_range() is too coarse - We
> > > only allow dedupe of the destination file if the user is root, or
> > > they have the file open for write.
> > > 
> > > This effectively limits a non-root user from deduping their own
> > > read-only files. As file data during a dedupe does not change,
> > > this is unexpected behavior and this has caused a number of issue
> > > reports.
[...]
> > > So change the check so we allow dedupe on the target if:
> > > 
> > > - the root or admin is asking for it
> > > - the owner of the file is asking for the dedupe
> > > - the process has write access
> > 
> > I submitted a similar patch in May 2016, yet it has never been applied
> > despite multiple pings, with no NAK.  My version allowed dedupe if:
> > - the root or admin is asking for it
> > - the file has w permission (on the inode -- ie, could have been opened rw)
> 
> Ahh, yes I see that now. I did wind up acking it too :)
> > 
> > I like this new version better than mine: "root or owner or w" is more
> > Unixy than "could have been opened w".
> 
> I agree, IMHO the behavior in this patch is intuitive. What we had before
> would surprise users.

Actually, there's one reason to still consider "could have been opened w":
with it, deduplication programs can simply open the file r and not care
about ETXTBSY at all.  Otherwise, every program needs to stat() and have
logic to pick the proper argument to the open() call (r if owner/root,
rw or w if not).
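
That caller-side logic under the "root or owner or w" rule looks roughly
like this (a sketch of what a dedupe tool would have to do; the helper name
is made up, this is userspace code, not a kernel change):

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Decide how a dedupe tool should open a destination file under the
 * "root or owner or w" rule: root and the owner may pass an O_RDONLY
 * descriptor (and thus never hit ETXTBSY on running binaries); anyone
 * else must open for write so the descriptor has FMODE_WRITE.
 */
int dedupe_open_flags(const struct stat *st, uid_t fsuid)
{
	if (fsuid == 0 || st->st_uid == fsuid)
		return O_RDONLY;
	return O_RDWR;
}
```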

I also have a sister patch: btrfs_ioctl_defrag wants the same change, for
the very same reason.  But, let's discuss dedupe first to avoid unnecessary
round trips.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 
⢿⡄⠘⠷⠚⠋⠀ Certified airhead; got the CT scan to prove that!
⠈⠳⣄ 


Re: [PATCH 2/2] vfs: dedupe should return EPERM if permission is not granted

2018-05-13 Thread Adam Borowski
On Fri, May 11, 2018 at 05:06:34PM -0700, Darrick J. Wong wrote:
> On Fri, May 11, 2018 at 12:26:51PM -0700, Mark Fasheh wrote:
> > Right now we return EINVAL if a process does not have permission to dedupe a
> > file. This was an oversight on my part. EPERM gives a true description of
> > the nature of our error, and EINVAL is already used for the case that the
> > filesystem does not support dedupe.

> > -   info->status = -EINVAL;
> > +   info->status = -EPERM;
> 
> Hmm, are we allowed to change this aspect of the kabi after the fact?
> 
> Granted, we're only trading one error code for another, but will the
> existing users of this care?  xfs_io won't and I assume duperemove won't
> either, but what about bees? :)

There's more:
https://codesearch.debian.net/search?q=FILE_EXTENT_SAME

This includes only software that has been packaged for Debian (notably, not
bees), but that gives enough interesting coverage.  And none of these cases
discriminate between error codes -- they merely report them to the user.

Thus, I can't think of a downside of making the error code more accurate.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 
⢿⡄⠘⠷⠚⠋⠀ Certified airhead; got the CT scan to prove that!
⠈⠳⣄ 


Re: [PATCH 1/2] vfs: allow dedupe of user owned read-only files

2018-05-11 Thread Adam Borowski
On Fri, May 11, 2018 at 12:26:50PM -0700, Mark Fasheh wrote:
> The permission check in vfs_dedupe_file_range() is too coarse - We
> only allow dedupe of the destination file if the user is root, or
> they have the file open for write.
> 
> This effectively limits a non-root user from deduping their own
> read-only files. As file data during a dedupe does not change,
> this is unexpected behavior and this has caused a number of issue
> reports. For an example, see:
> 
> https://github.com/markfasheh/duperemove/issues/129
> 
> So change the check so we allow dedupe on the target if:
> 
> - the root or admin is asking for it
> - the owner of the file is asking for the dedupe
> - the process has write access

I submitted a similar patch in May 2016, yet it has never been applied
despite multiple pings, with no NAK.  My version allowed dedupe if:
- the root or admin is asking for it
- the file has w permission (on the inode -- ie, could have been opened rw)

There was a request to include in xfstests a test case for the ETXTBSY race
this patch fixes, but there's no reasonable way to make such a test case:
the race condition is not a bug, it's write-xor-exec working as designed.

Another idea discussed was about possibly just allowing everyone who can
open the file to deduplicate it, as the file contents are not modified in
any way.  Zygo Blaxell expressed a concern that it could be used by an
unprivileged user who can trigger a crash to abuse writeout bugs.

I like this new version better than mine: "root or owner or w" is more
Unixy than "could have been opened w".

> Signed-off-by: Mark Fasheh 
> ---
>  fs/read_write.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index c4eabbfc90df..77986a2e2a3b 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -2036,7 +2036,8 @@ int vfs_dedupe_file_range(struct file *file, struct 
> file_dedupe_range *same)
>  
>   if (info->reserved) {
>   info->status = -EINVAL;
> - } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
> + } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE) ||
> +  uid_eq(current_fsuid(), dst->i_uid))) {
I had:
  + } else if (!(is_admin || !inode_permission(dst, MAY_WRITE))) {
>   info->status = -EINVAL;
>   } else if (file->f_path.mnt != dst_file->f_path.mnt) {
>   info->status = -EXDEV;
> -- 

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 
⢿⡄⠘⠷⠚⠋⠀ Certified airhead; got the CT scan to prove that!
⠈⠳⣄ 


Re: [PATCH 2/2] btrfs: add balance args info during start and resume

2018-04-26 Thread Adam Borowski
On Thu, Apr 26, 2018 at 04:01:29PM +0800, Anand Jain wrote:
> Balance args info is an important information to be reviewed on the
> system under audit. So this patch adds that.

This kept annoying me.  Thanks a lot!

-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 
⢿⡄⠘⠷⠚⠋⠀ Certified airhead; got the CT scan to prove that!
⠈⠳⣄ 


Re: [wiki] Please clarify how to check whether barriers are properly implemented in hardware

2018-04-02 Thread Adam Borowski
On Mon, Apr 02, 2018 at 10:07:01PM +, Hugo Mills wrote:
> On Mon, Apr 02, 2018 at 06:03:00PM -0400, Fedja Beader wrote:
> > Is there some testing utility for this? Is there a way to extract this/tell 
> > with a high enough certainty from datasheets/other material before purchase?
> 
>Given that not implementing barriers is basically a bug in the
> hardware [for SATA or SAS], I don't think anyone's going to specify
> anything other than "fully suppors barriers" in their datasheets.
> 
>I don't know of a testing tool. It may not be obvious that barriers
> aren't being honoured without doing things like power-failure testing.

And you'd need to do a lot of power-cycling during writes, with various
write patterns -- unless you have a case of "let's lie about barriers to
make benchmarks better than the competition" where barriers are consistently
absent, it might be a genuine bug in a well-meaning controller that at least
tries but sometimes fails.  The intentional case is usually easy to
detect -- but just wait to get volkswagenized. :/


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 
⢿⡄⠘⠷⠚⠋⠀ ... what's the frequency of that 5V DC?
⠈⠳⣄


Re: Question, will ls -l eventually be able to show subvolumes?

2018-03-30 Thread Adam Borowski
On Fri, Mar 30, 2018 at 10:42:10AM +0100, Pete wrote:
> I've just notice work going on to make rmdir be able to delete
> subvolumes.  Is there an intent to allow ls -l to display directories as
> subvolumes?

That's entirely up to coreutils guys.


Meow!
-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ


Re: Raid1 volume stuck as read-only: How to dump, recreate and restore its content?

2018-03-11 Thread Adam Borowski
On Sun, Mar 11, 2018 at 11:28:08PM +0700, Andreas Hild wrote:
> Following a physical disk failure of a RAID1 array, I tried to mount
> the remaining volume of a root partition with "-o degraded". For some
> reason it ended up as read-only as described here:
> https://btrfs.wiki.kernel.org/index.php/Gotchas#raid1_volumes_only_mountable_once_RW_if_degraded
> 
> 
> How to precisely do this: dump, recreate and restore its contents?
> Could someone please provided more details how to recover this volume
> safely?

> Linux debian 4.9.0-4-amd64 #1 SMP Debian 4.9.65-3 (2017-12-03) x86_64 
> GNU/Linux

> [ 1313.279140] BTRFS warning (device sdb2): missing devices (1)
> exceeds the limit (0), writeable mount is not allowed

I'd recommend instead going with kernel 4.14 or newer (available in
stretch-backports), which handles this case well without the need to
restore.  If there's no actual data loss (there shouldn't be, it's RAID1
with only a single device missing), you can mount degraded normally, then
balance the data onto the new disk.

Recovery with 4.9 is unpleasant.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


Re: zerofree btrfs support?

2018-03-10 Thread Adam Borowski
On Sat, Mar 10, 2018 at 07:37:22PM +0500, Roman Mamedov wrote:
> Note you can use it on HDDs too, even without QEMU and the like: via using LVM
> "thin" volumes. I use that on a number of machines, the benefit is that since
> TRIMed areas are "stored nowhere", those partitions allow for incredibly fast
> block-level backups, as it doesn't have to physically read in all the free
> space, let alone any stale data in there. LVM snapshots are also way more
> efficient with thin volumes, which helps during backup.

Since we're on a btrfs mailing list, if you use qemu, you really want
sparse format:raw instead of qcow2 or preallocated raw.  This also works
great with TRIM.

> > Back then it didn't seem to work.
> 
> It works, just not with some of the QEMU virtualized disk device drivers.
> You don't need to use qemu-img to manually dig holes either, it's all
> automatic.

It works only with scsi and virtio-scsi drivers.  Most qemu setups use
either ide (ouch!) or virtio-blk.

You'd obviously want virtio-scsi; note that defconfig enables virtio-blk but
not virtio-scsi; I assume most distribution kernels have both.  It's a bit
tedious to switch between the two as -blk is visible as /dev/vda while -scsi
as /dev/sda.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


Re: zerofree btrfs support?

2018-03-10 Thread Adam Borowski
On Sat, Mar 10, 2018 at 03:55:25AM +0100, Christoph Anton Mitterer wrote:
> Just wondered... was it ever planned (or is there some equivalent) to
> get support for btrfs in zerofree?

Do you want zerofree for thin storage optimization, or for security?

For the former, you can use fstrim; this is enough on any modern SSD; on HDD
you can rig the block device to simulate TRIM by writing zeroes.  I'm sure
one of dm-* can do this, if not -- should be easy to add, there's also
qemu-nbd which allows control over discard, but incurs a performance penalty
compared to playing with the block layer.
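
For the thin-storage case, fstrim itself is just a thin wrapper around the
FITRIM ioctl, which any program can issue on a mounted filesystem; a sketch
(error handling condensed -- it returns -1 with errno set, e.g. EOPNOTSUPP
when the device or filesystem doesn't support discard):

```c
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <errno.h>

/*
 * Discard all free space on the filesystem containing `fd` (an fd of
 * any directory on it), like fstrim(8) does.  Returns bytes trimmed,
 * or -1 with errno set.
 */
static int64_t trim_all_free_space(int fd)
{
	struct fstrim_range range = {
		.start = 0,
		.len = UINT64_MAX,	/* whole filesystem */
		.minlen = 0,
	};

	if (ioctl(fd, FITRIM, &range) < 0)
		return -1;
	return (int64_t)range.len;	/* kernel reports bytes trimmed */
}
```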

For zerofree for security, you'd need defrag (to dislodge partial pinned
extents) first, and do a full balance to avoid data left in metadata nodes
and in blocks beyond file ends (note that zerofree doesn't do this on
traditional filesystems either).


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


Re: Metadata / Data on Heterogeneous Media

2018-02-15 Thread Adam Borowski
On Thu, Feb 15, 2018 at 12:15:49PM -0500, Ellis H. Wilson III wrote:
> In discussing the performance of various metadata operations over the past
> few days I've had this idea in the back of my head, and wanted to see if
> anybody had already thought about it before (likely, I would guess).
>
> It appears based on this page:
> https://btrfs.wiki.kernel.org/index.php/Btrfs_design
> that data and metadata in BTRFS are fairly well isolated from one another,
> particularly in the case of large files.  This appears reinforced by a
> recent comment from Qu ("...btrfs strictly
> split metadata and data usage...").
> 
> Yet, while there are plenty of options to RAID0/1/10/etc across generally
> homogeneous media types, there doesn't appear to be any functionality (at
> least that I can find) to segment different BTRFS internals to different
> types of devices.  E.G., place metadata trees and extent block groups on
> SSD, and data trees and extent block groups on HDD(s).
> 
> Is this something that has already been considered (and if so, implemented,
> which would make me extremely happy)?  Is it feasible it is hasn't been
> approached yet?  I admit my internal knowledge of BTRFS is fleeting, though
> I'm trying to work on that daily at this time, so forgive me if this is
> unapproachable for obvious architectural reasons.

Considered: many times.  It's an obvious improvement, and one that shouldn't
even be that hard to implement.  What remains is SMoC then SMoR (a Simple
Matter of Coding, then a Simple Matter of Review), but both of those are in
short supply.

After the maximum size of inline extents has been lowered, there's no real
point in putting different types of metadata or not-really-metadata on
different media: thus, existing split of data -vs- metadata block groups is
fine.

What you'd want is an ability to tell the block allocator that metadata
block groups should prefer device[s] A, while data ones, device[s] B.

Right now, the allocator's algorithm is: any new allocation is placed on the
device that has the most available space, with the 2nd/etc copy of a RAID
chunk obviously excluding the devices the earlier copies have already been
placed on.  This is optimal wrt not wasting space, but doesn't always provide
the best performance, especially when devices' speeds vary.  There are also
other downsides: usual RAID10 has a 2/3 chance of tolerating two missing
devices, while btrfs RAID10 almost guarantees massive data loss with two
missing devices.

Thus, allowing to specify an allocation policy that alters this algorithm
would be the way to go.
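
The current policy plus the proposed preference hint fits in a few lines;
a toy model of the per-chunk device selection (btrfs's real allocator is
more involved, and the `prefer` flag is the hypothetical policy knob, not
an existing feature):

```c
#include <stdint.h>

struct toy_dev {
	uint64_t free;		/* unallocated bytes */
	int prefer;		/* hypothetical hint: prefer this device
				 * for the chunk type being allocated */
	int used_in_stripe;	/* already holds a copy of this chunk */
};

/*
 * Pick the device for the next chunk copy: most free space wins, but
 * any eligible preferred device beats any non-preferred one.
 * Returns an index, or -1 if no device is eligible.
 */
static int pick_device(const struct toy_dev *d, int n)
{
	int best = -1;
	for (int i = 0; i < n; i++) {
		if (d[i].used_in_stripe || d[i].free == 0)
			continue;
		if (best < 0 ||
		    d[i].prefer > d[best].prefer ||
		    (d[i].prefer == d[best].prefer && d[i].free > d[best].free))
			best = i;
	}
	return best;
}
```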


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ The bill with 3 years prison for mentioning Polish concentration
⣾⠁⢰⠒⠀⣿⡁ camps is back.  What about KL Warschau (operating until 1956)?
⢿⡄⠘⠷⠚⠋⠀ Zgoda?  Łambinowice?  Most ex-German KLs?  If those were "soviet
⠈⠳⣄ puppets", Bereza Kartuska?  Sikorski's camps in UK (thanks Brits!)?


Re: btrfs-cleaner / snapshot performance analysis

2018-02-11 Thread Adam Borowski
On Sun, Feb 11, 2018 at 12:31:42PM +0300, Andrei Borzenkov wrote:
> 11.02.2018 04:02, Hans van Kranenburg writes:
> >> - /dev/sda6 / btrfs
> >> rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot
> >> 0 0
> > 
> > Note that changes on atime cause writes to metadata, which means cowing
> > metadata blocks and unsharing them from a previous snapshot, only when
> > using the filesystem, not even when changing things (!).
> 
> With relatime atime is updated only once after file was changed. So your
> description is not entirely accurate and things should not be that
> dramatic unless files are continuously being changed.

Alas, that's untrue.  relatime updates happen if:
* the file has been written after it was last read, or
* previous atime was older than 24 hours

Thus, you get at least one unshare per inode per day, which is also the most
widespread frequency of both snapshotting and cronjobs.

Fortunately, most uses of atime are gone, thus it's generally safe to
disable it completely.
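
The relatime rule above is simple enough to state as code; a simplified
model of the decision the kernel makes on each read (whole seconds only,
roughly following the kernel's atime_needs_update() logic):

```c
#include <time.h>

/*
 * Does a read at `now` trigger an atime update under relatime?
 * Yes if the file was modified or changed since it was last read,
 * or if the recorded atime is at least 24 hours old.
 */
static int relatime_needs_update(time_t atime, time_t mtime,
				 time_t ctime_, time_t now)
{
	if (mtime >= atime || ctime_ >= atime)
		return 1;	/* written since last read */
	if (now - atime >= 24 * 60 * 60)
		return 1;	/* daily refresh -- the unshare per day */
	return 0;
}
```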


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ The bill with 3 years prison for mentioning Polish concentration
⣾⠁⢰⠒⠀⣿⡁ camps is back.  What about KL Warschau (operating until 1956)?
⢿⡄⠘⠷⠚⠋⠀ Zgoda?  Łambinowice?  Most ex-German KLs?  If those were "soviet
⠈⠳⣄ puppets", Bereza Kartuska?  Sikorski's camps in UK (thanks Brits!)?


Re: degraded permanent mount option

2018-01-29 Thread Adam Borowski
On Mon, Jan 29, 2018 at 09:54:04AM +0100, Tomasz Pala wrote:
> On Sun, Jan 28, 2018 at 17:00:46 -0700, Chris Murphy wrote:
> 
> > systemd can't possibly need to know more information than a person
> > does in the exact same situation in order to do the right thing. No
> > human would wait 10 minutes, let alone literally the heat death of the
> > planet for "all devices have appeared" but systemd will. And it does
> 
> We're already repeating - systemd waits for THE btrfs-compound-device,
> not ALL the block-devices.

Because there is NO compound device.  You can't wait for something that
doesn't exist.  The user wants a filesystem, not some mythical compound
device, and as knowing whether we have enough devices requires doing most of
the mount work, we may as well complete the mount instead of backing off and
reporting, only for userspace to racily repeat that work.

> Just like it 'waits' for someone to plug USB pendrive in.

Plugging an USB pendrive is an event -- there's no user request.  On the
other hand, we already know we want to mount -- the user requested so either
by booting ("please mount everything in fstab") or by an explicit mount
command.

So any event (the user's request) has already happened.  A rc system, of
which systemd is one, knows whether we reached the "want root filesystem" or
"want secondary filesystems" stage.  Once you're there, you can issue the
mount() call and let the kernel do the work.

> It is a btrfs choice to not expose compound device as separate one (like
> every other device manager does)

Btrfs is not a device manager, it's a filesystem.

> it is a btrfs drawback that doesn't provice anything else except for this
> IOCTL with it's logic

How can it provide you with something it doesn't yet have?  If you want the
information, call mount().  And as others in this thread have mentioned,
what, pray tell, would you want to know "would a mount succeed?" for if you
don't want to mount?

> it is a btrfs drawback that there is nothing to push assembling into "OK,
> going degraded" state

The way to do so is to timeout, then retry with -o degraded.

> I've told already - pretend the /dev/sda1 device doesn't
> exist until assembled.

It does... you're confusing a block device (a _part_ of the filesystem) with
the filesystem itself.  MD takes a bunch of such block devices and provides
you with another block devices, btrfs takes a bunch of block devices and
provides you with a filesystem.

> If this overlapping usage was designed with 'easier mounting' on mind,
> this is simply bad design.

No other rc system but systemd has a problem.

> > that by its own choice, its own policy. That's the complaint. It's
> > choosing to do something a person wouldn't do, given identical
> > available information.
> 
> You are expecting systemd to mix in functions of kernel and udev.
> There is NO concept of 'assembled stuff' in systemd AT ALL.
> There is NO concept of 'waiting' in udev AT ALL.
> If you want to do some crazy interlayer shortcuts just implement btrfsd.

No, I don't want systemd, or any userspace daemon, to try knowing kernel
stuff better than the kernel.  Just call mount(), and that's it.

Let me explain via a car analogy.  There is a flood that covers many roads,
the phone network is unreliable, and you want to drive to help relatives at
place X.

You can ask someone who was there yesterday how to get there (ie, ask a
device; it can tell you "when I was a part of the filesystem last time, its
layout was such and such").  Usually, this is reliable (you don't reshape an
array every day), but if there's flooding (you're contemplating a degraded
mount), yesterday's data being stale shouldn't be a surprise.

So, you climb into the car and drive.  It's possible that the road you
wanted to take has changed, it's also possible some other roads you didn't
even know about are now driveable.  Once you have X in sight, do you retrace
all the way home, tell your mom (systemd) who's worrying but has no way to
help, that the road is clear, and only then get to X?  Or do you stop,
search for a spot with working phone coverage to phone mom asking for
advice, despite her having no information you don't have?  The reasonable
thing to do (and what all other rc systems do) is to get to X, help the
relatives, and only then tell mom that all is ok.

But with mom wanting to control everything, things can go worse.  If you,
without mom's prior knowledge (the user typed "mount" by hand) manage to
find a side road to X, she shouldn't tell you "I hear you telling me you're
at X -- as the road is flooded, that's impossible, so get home this instant"
(ie, systemd deciding the filesystem is not complete, despite it being
already mounted).

> > There's nothing the kernel is doing that's
> > telling systemd to wait for goddamn ever.
> 
> There's nothing the kernel is doing that's
> telling udev there IS a degraded device assembled to be used.

Because there is no device.

> There's nothing a userspace-thing is doing that's
> 

Re: degraded permanent mount option

2018-01-27 Thread Adam Borowski
On Sat, Jan 27, 2018 at 03:36:48PM +0100, Goffredo Baroncelli wrote:
> I think that the real problem relies that the mounting a btrfs filesystem
> cannot be a responsibility of systemd (or whichever rc-system). 
> Unfortunately in the past it was thought that it would be sufficient to
> assemble a devices list in the kernel, then issue a simple mount...

Yeah... every device that comes online may have its own idea what devices
are part of the filesystem.  There's also a quite separate question whether
we have enough chunks for a degraded mount (implemented by Qu), which
requires reading the chunk tree.

> In the past[*] I proposed a mount helper, which would perform all the
> device registering and mounting in degraded mode (depending by the
> option).  My idea is that all the policies should be placed only in one
> place.  Now some policies are in the kernel, some in udev, some in
> systemd...  It is a mess.  And if something goes wrong, you have to look
> to several logs to understand which/where is the problem..

Since most of the logic needs to be in the kernel anyway, I believe it'd be
best to keep as much as possible in the kernel, and let the userspace
request at most "try regular/degraded mount, block/don't block".  Anything
else would be duplicating functionality.

> I have to point out that there is not a sane default for mounting in
> degraded mode or not.  May be that now RAID1/10 are "mount-degraded"
> friendly, so it would be a sane default; but for other (raid5/6) I think
> that this is not mature enough.  And it is possible to exist hybrid
> filesystem (both RAID1/10 and RAID5/6)

Not yet: if one of the devices comes a bit late, btrfs won't let it into the
filesystem yet (patches to do so have been proposed), and if you run
degraded for even a moment, a very lengthy action is required.  That lengthy
action could be improved -- we can note the last generation when the raid
was complete[1], and scrub/balance only extents newer than that[2] -- but
that's a SMOC then SMOR, and I don't see volunteers yet.

Thus, auto-degrading without a hearty timeout first is currently sitting
strongly in the "do not want" land.

> Mounting in degraded mode would be better for a root filesystem, than a
> non-root one (think about remote machine)

I for one use ext4-on-md for root, and btrfs raid for the actual data.  It's
not like production servers see much / churn anyway.


Meow!

[1]. Extra fun for raid6 (or possible future raid1×N where N>2 modes):
there's "fully complete", "degraded missing A", "degraded missing B",
"degraded missing A and B".

[2]. NOCOW extents would require an artificial generation bump upon writing
to whenever the level of degradeness changes.
-- 
⢀⣴⠾⠻⢶⣦⠀ The bill with 3 years prison for mentioning Polish concentration
⣾⠁⢰⠒⠀⣿⡁ camps is back.  What about KL Warschau (operating until 1956)?
⢿⡄⠘⠷⠚⠋⠀ Zgoda?  Łambinowice?  Most ex-German KLs?  If those were "soviet
⠈⠳⣄ puppets", Bereza Kartuska?  Sikorski's camps in UK (thanks Brits!)?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degraded permanent mount option

2018-01-27 Thread Adam Borowski
On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
> 
> >> I just tested to boot with a single drive (raid1 degraded), even with
> >> degraded option in fstab and grub, unable to boot !  The boot process
> >> stop on initramfs.
> >> 
> >> Is there a solution to boot with systemd and degraded array ?
> > 
> > No. It is finger pointing. Both btrfs and systemd developers say
> > everything is fine from their point of view.

It's quite obvious who's the culprit: every single remaining rc system
manages to mount degraded btrfs without problems.  They just don't try to
outsmart the kernel.

> Treating btrfs volume as ready by systemd would open a window of
> opportunity when volume would be mounted degraded _despite_ all the
> components are (meaning: "would soon") be ready - just like Chris Murphy
> wrote; provided there is -o degraded somewhere.

For this reason, currently hardcoding -o degraded isn't a wise choice.  This
might change once autoresync and devices coming back at runtime are
implemented.

> This is not a systemd issue, but apparently btrfs design choice to allow
> using any single component device name also as volume name itself.

And what other user interface would you propose?  The only alternative I see
is inventing a device manager (like you're implying below that btrfs does),
which would needlessly complicate the usual, single-device, case.
 
> If btrfs pretends to be device manager it should expose more states,

But it doesn't pretend to.

> especially "ready to be mounted, but not fully populated" (i.e.
> "degraded mount possible"). Then systemd could _fallback_ after timing
> out to degraded mount automatically according to some systemd-level
> option.

You're assuming that btrfs somehow knows this itself.  Contrary to systemd's
bogus assumption that counting devices tells you whether a degraded or
non-degraded mount is possible, it is in general not possible to know
whether a mount attempt will succeed without actually trying.

Compare with the 4.14 chunk check patchset by Qu -- in the past, btrfs did
naive counting of this kind, it had to be replaced by actually checking
whether at least one copy of every block group is actually present.

An example scenario: you have a 3-device filesystem, sda sdb sdc.  Suddenly,
sda goes offline due to a loose cable, controller hiccup, evil fairies, or
something of this kind.  The sysadmin notices this, rushes in with an
USB-attached disk (sdd), rebalances.  After reboot, sda works well (or got
its cable reseated, etc), while sdd either got accidentally removed or is
just slow to initialize (USB...).  So, systemd asks sda how many devices
there are, answer is "3" (sdb and sdc would answer the same, BTW).  It can
even ask for UUIDs -- all devices are present.  So, mount will succeed,
right?
 
> Unless there is *some* signalling from btrfs, there is really not much
> systemd can *safely* do.

Btrfs already tells everything it knows.  To learn more, you need to do most
of the mount process (whether you continue or abort is another matter). 
This can't be done sanely from outside the kernel.  Adding finer control
would be reasonable ("wait and block" vs "try and return immediately") but
that's about all.  It's be also wrong to have a different interface for
daemon X than for humans.

Ie, the thing systemd can safely do is to stop trying to rule everything,
and refrain from telling the user whether he can mount something or not.
And especially, from unmounting after the user mounts manually...


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ The bill with 3 years prison for mentioning Polish concentration
⣾⠁⢰⠒⠀⣿⡁ camps is back.  What about KL Warschau (operating until 1956)?
⢿⡄⠘⠷⠚⠋⠀ Zgoda?  Łambinowice?  Most ex-German KLs?  If those were "soviet
⠈⠳⣄ puppets", Bereza Kartuska?  Sikorski's camps in UK (thanks Brits!)?


Re: hang in btrfs_async_reclaim_metadata_space

2018-01-10 Thread Adam Borowski
On Sun, Jan 07, 2018 at 01:17:19PM +0200, Nikolay Borisov wrote:
> On  6.01.2018 07:10, Adam Borowski wrote:
> > Hi!
> > I got a reproducible infinite hang, reliably triggered by the testsuite of
> > "flatpak"; fails on at least 4.15-rc6, 4.9.75, and on another machine with
> > Debian's 4.14.2-1.
> > 
> > [580632.355107] INFO: task kworker/u8:2:11105 blocked for more than 120 
> > seconds.
> > [580632.355120]   Not tainted 4.14.0-1-amd64 #1 Debian 4.14.2-1
> > [580632.355124] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> > this message.
> > [580632.355129] kworker/u8:2D0 11105  2 0x8000
> > [580632.355176] Workqueue: events_unbound 
> > btrfs_async_reclaim_metadata_space [btrfs]
> > [580632.355179] Call Trace:
> > [580632.355192]  __schedule+0x28e/0x880
> > [580632.355196]  schedule+0x2c/0x80
> > [580632.355200]  wb_wait_for_completion+0x64/0x90
> > [580632.355205]  ? finish_wait+0x80/0x80
> > [580632.355207]  __writeback_inodes_sb_nr+0xa1/0xd0
> > [580632.355210]  writeback_inodes_sb_nr+0x10/0x20
> > [580632.355235]  flush_space+0x3ed/0x520 [btrfs]
> > [580632.355238]  ? pick_next_task_fair+0x158/0x590
> > [580632.355242]  ? __switch_to+0x1f3/0x460
> > [580632.355267]  btrfs_async_reclaim_metadata_space+0xf6/0x4a0 [btrfs]
> > [580632.355278]  process_one_work+0x198/0x390
> > [580632.355281]  worker_thread+0x35/0x3c0
> > [580632.355284]  kthread+0x125/0x140
> > [580632.355287]  ? process_one_work+0x390/0x390
> > [580632.355289]  ? kthread_create_on_node+0x70/0x70
> > [580632.355292]  ? SyS_exit_group+0x14/0x20
> > [580632.355295]  ret_from_fork+0x25/0x30
> > 
> > The machines are distinct enough that this probably should happen
> > everywhere:
> > 
> > AMD Phenom2, SSD, noatime,compress=lzo,space_cache=v2
> > Intel Braswell, rust, noatime,autodefrag,space_cache=v2

It does cause data loss: while everything seems to work ok, files that are
written to while there's this stuck worker become size 0 after rebooting.
Only after a longish time other processes start getting stuck as well.

For this reason, I was reluctant to try on a real system -- but somehow I
don't seem to be able to reproduce on minimal VMs.

> Provide output of echo w > /proc/sysrq-trigger when the hang happens.

[ 5679.403833] INFO: task kworker/u12:2:9904 blocked for more than 120 seconds.
[ 5679.413938]   Not tainted 4.15.0-rc7-debug-00137-g13f8e1b5cc83 #1
[ 5679.423336] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 5679.434136] kworker/u12:2   D0  9904  2 0x8000
[ 5679.442528] Workqueue: writeback wb_workfn (flush-btrfs-1)
[ 5679.450809] Call Trace:
[ 5679.455920]  ? __schedule+0x1a5/0x6e0
[ 5679.462251]  ? check_preempt_curr+0x74/0x80
[ 5679.468950]  ? atomic_t_wait+0x50/0x50
[ 5679.475134]  schedule+0x23/0x80
[ 5679.480670]  bit_wait+0x8/0x50
[ 5679.485985]  __wait_on_bit+0x3d/0x80
[ 5679.491828]  ? atomic_t_wait+0x50/0x50
[ 5679.497726]  __inode_wait_for_writeback+0x9e/0xc0
[ 5679.504531]  ? bit_waitqueue+0x30/0x30
[ 5679.510292]  inode_wait_for_writeback+0x18/0x30
[ 5679.516782]  evict+0xa4/0x180
[ 5679.521596]  btrfs_run_delayed_iputs+0x61/0xb0
[ 5679.527842]  btrfs_commit_transaction+0x7b0/0x8c0
[ 5679.534310]  ? start_transaction+0xa0/0x390
[ 5679.540151]  __writeback_single_inode+0x168/0x1b0
[ 5679.546477]  writeback_sb_inodes+0x1be/0x420
[ 5679.552264]  wb_writeback+0xe0/0x1d0
[ 5679.557292]  wb_workfn+0x7d/0x2c0
[ 5679.561984]  ? __switch_to+0x17c/0x370
[ 5679.567045]  process_one_work+0x1a7/0x340
[ 5679.572286]  worker_thread+0x26/0x3f0
[ 5679.577043]  ? create_worker+0x190/0x190
[ 5679.582007]  kthread+0x107/0x120
[ 5679.586198]  ? kthread_create_worker_on_cpu+0x40/0x40
[ 5679.592200]  ? kthread_create_worker_on_cpu+0x40/0x40
[ 5679.598122]  ret_from_fork+0x1f/0x30
[ 5679.602549] INFO: task testlibrary:12647 blocked for more than 120 seconds.
[ 5679.610434]   Not tainted 4.15.0-rc7-debug-00137-g13f8e1b5cc83 #1
[ 5679.617758] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 5679.626453] testlibrary D0 12647  12606 0x
[ 5679.632812] Call Trace:
[ 5679.636038]  ? __schedule+0x1a5/0x6e0
[ 5679.640483]  ? bdi_split_work_to_wbs+0x159/0x290
[ 5679.645881]  schedule+0x23/0x80
[ 5679.649790]  wb_wait_for_completion+0x39/0x70
[ 5679.654901]  ? wait_woken+0x80/0x80
[ 5679.659092]  __writeback_inodes_sb_nr+0x95/0xa0
[ 5679.664325]  sync_filesystem+0x21/0x80
[ 5679.668797]  SyS_syncfs+0x44/0x90
[ 5679.672809]  entry_SYSCALL_64_fastpath+0x17/0x70
[ 5679.678136] RIP: 0033:0x7fe82599e057
[ 5679.682405] RSP: 002b:7ffdfda6cf18 EFLAGS: 0202
[ 5679.682431] INFO: task pool:14436 blocked for more than 120 second

hang in btrfs_async_reclaim_metadata_space

2018-01-05 Thread Adam Borowski
Hi!
I got a reproducible infinite hang, reliably triggered by the testsuite of
"flatpak"; fails on at least 4.15-rc6, 4.9.75, and on another machine with
Debian's 4.14.2-1.

[580632.355107] INFO: task kworker/u8:2:11105 blocked for more than 120 seconds.
[580632.355120]   Not tainted 4.14.0-1-amd64 #1 Debian 4.14.2-1
[580632.355124] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
[580632.355129] kworker/u8:2D0 11105  2 0x8000
[580632.355176] Workqueue: events_unbound btrfs_async_reclaim_metadata_space 
[btrfs]
[580632.355179] Call Trace:
[580632.355192]  __schedule+0x28e/0x880
[580632.355196]  schedule+0x2c/0x80
[580632.355200]  wb_wait_for_completion+0x64/0x90
[580632.355205]  ? finish_wait+0x80/0x80
[580632.355207]  __writeback_inodes_sb_nr+0xa1/0xd0
[580632.355210]  writeback_inodes_sb_nr+0x10/0x20
[580632.355235]  flush_space+0x3ed/0x520 [btrfs]
[580632.355238]  ? pick_next_task_fair+0x158/0x590
[580632.355242]  ? __switch_to+0x1f3/0x460
[580632.355267]  btrfs_async_reclaim_metadata_space+0xf6/0x4a0 [btrfs]
[580632.355278]  process_one_work+0x198/0x390
[580632.355281]  worker_thread+0x35/0x3c0
[580632.355284]  kthread+0x125/0x140
[580632.355287]  ? process_one_work+0x390/0x390
[580632.355289]  ? kthread_create_on_node+0x70/0x70
[580632.355292]  ? SyS_exit_group+0x14/0x20
[580632.355295]  ret_from_fork+0x25/0x30

The machines are distinct enough that this probably should happen
everywhere:

AMD Phenom2, SSD, noatime,compress=lzo,space_cache=v2
Intel Braswell, rust, noatime,autodefrag,space_cache=v2


Meow!
-- 
// If you believe in so-called "intellectual property", please immediately
// cease using counterfeit alphabets.  Instead, contact the nearest temple
// of Amon, whose priests will provide you with scribal services for all
// your writing needs, for Reasonable And Non-Discriminatory prices.


Re: Unexpected raid1 behaviour

2017-12-19 Thread Adam Borowski
On Mon, Dec 18, 2017 at 03:28:14PM -0700, Chris Murphy wrote:
> On Mon, Dec 18, 2017 at 1:49 AM, Anand Jain  wrote:
> >  Agreed. IMO degraded-raid1-single-chunk is an accidental feature
> >  caused by [1], which we should revert back, since..
> >- balance (to raid1 chunk) may fail if FS is near full
> >- recovery (to raid1 chunk) will take more writes as compared
> >  to recovery under degraded raid1 chunks
> 
> The advantage of writing single chunks when degraded, is in the case
> where a missing device returns (is readded, intact). Catching up that
> device with the first drive, is a manual but simple invocation of
> 'btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft'   The
> alternative is a full balance or full scrub. It's pretty tedious for
> big arrays.
> 
> mdadm uses bitmap=internal for any array larger than 100GB for this
> reason, avoiding full resync.
> 
> 'btrfs sub find' will list all *added* files since an arbitrarily
> specified generation; but not deletions.

This is fine as scrub cares about extents not files.  The newer generation
of metadata doesn't have a reference to the deleted extent anymore.

Selective scrub hasn't been implemented, but it should be pretty
straightforward -- unless nocow is involved.  Correct me if I'm wrong, but I
believe there's no way to tell which copy of a nocow extent is the good one.
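As a sketch, the soft-convert invocation quoted above could be wrapped like this (the function name and mountpoint are illustrative, not part of btrfs-progs):

```shell
# Wrapper around the soft-convert balance quoted above; the ",soft"
# filter makes balance skip chunks already in the target profile, so
# only the single chunks written while degraded get rewritten.
resync_after_readd() {
    mnt=$1
    btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft "$mnt"
}
```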


Meow!
-- 
// If you believe in so-called "intellectual property", please immediately
// cease using counterfeit alphabets.  Instead, contact the nearest temple
// of Amon, whose priests will provide you with scribal services for all
// your writing needs, for Reasonable And Non-Discriminatory prices.


[PATCH] fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at

2017-12-12 Thread Adam Borowski
This link is replicated in most filesystems' config stanzas.  Referring
to an archived version of that site is pointless as it mostly deals with
patches; user documentation is available elsewhere.

Signed-off-by: Adam Borowski 
---
Sending this as one piece; if you guys would instead prefer this chopped
into tiny per-filesystem bits, please say so.


 Documentation/filesystems/ext2.txt |  2 --
 Documentation/filesystems/ext4.txt |  7 +++
 fs/9p/Kconfig  |  3 ---
 fs/Kconfig |  6 +-
 fs/btrfs/Kconfig   |  3 ---
 fs/ceph/Kconfig|  3 ---
 fs/cifs/Kconfig| 15 +++
 fs/ext2/Kconfig|  6 +-
 fs/ext4/Kconfig|  3 ---
 fs/f2fs/Kconfig|  6 +-
 fs/hfsplus/Kconfig |  3 ---
 fs/jffs2/Kconfig   |  6 +-
 fs/jfs/Kconfig |  3 ---
 fs/reiserfs/Kconfig|  6 +-
 fs/xfs/Kconfig |  3 ---
 15 files changed, 15 insertions(+), 60 deletions(-)

diff --git a/Documentation/filesystems/ext2.txt 
b/Documentation/filesystems/ext2.txt
index 55755395d3dc..81c0becab225 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -49,12 +49,10 @@ sb=nUse alternate 
superblock at this location.
 
 user_xattr Enable "user." POSIX Extended Attributes
(requires CONFIG_EXT2_FS_XATTR).
-   See also http://acl.bestbits.at
 nouser_xattr   Don't support "user." extended attributes.
 
 aclEnable POSIX Access Control Lists support
(requires CONFIG_EXT2_FS_POSIX_ACL).
-   See also http://acl.bestbits.at
 noacl  Don't support POSIX ACLs.
 
 nobh   Do not attach buffer_heads to file pagecache.
diff --git a/Documentation/filesystems/ext4.txt 
b/Documentation/filesystems/ext4.txt
index 75236c0c2ac2..8cd63e16f171 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -202,15 +202,14 @@ inode_readahead_blks=nThis tuning parameter controls 
the maximum
the buffer cache.  The default value is 32 blocks.
 
 nouser_xattr   Disables Extended User Attributes.  See the
-   attr(5) manual page and http://acl.bestbits.at/
-   for more information about extended attributes.
+   attr(5) manual page for more information about
+   extended attributes.
 
 noacl  This option disables POSIX Access Control List
support. If ACL support is enabled in the kernel
configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL is
enabled by default on mount. See the acl(5) manual
-   page and http://acl.bestbits.at/ for more information
-   about acl.
+   page for more information about acl.
 
 bsddf  (*) Make 'df' act like BSD.
 minixdfMake 'df' act like Minix.
diff --git a/fs/9p/Kconfig b/fs/9p/Kconfig
index 6489e1fc1afd..11045d8e356a 100644
--- a/fs/9p/Kconfig
+++ b/fs/9p/Kconfig
@@ -25,9 +25,6 @@ config 9P_FS_POSIX_ACL
  POSIX Access Control Lists (ACLs) support permissions for users and
  groups beyond the owner/group/world scheme.
 
- To learn more about Access Control Lists, visit the POSIX ACLs for
- Linux website <http://acl.bestbits.at/>.
-
  If you don't know what Access Control Lists are, say N
 
 endif
diff --git a/fs/Kconfig b/fs/Kconfig
index 7aee6d699fd6..0ed56752f208 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -167,17 +167,13 @@ config TMPFS_POSIX_ACL
  files for sound to work properly.  In short, if you're not sure,
  say Y.
 
- To learn more about Access Control Lists, visit the POSIX ACLs for
- Linux website <http://acl.bestbits.at/>.
-
 config TMPFS_XATTR
bool "Tmpfs extended attributes"
depends on TMPFS
default n
help
  Extended attributes are name:value pairs associated with inodes by
- the kernel or by users (see the attr(5) manual page, or visit
- <http://acl.bestbits.at/> for details).
+ the kernel or by users (see the attr(5) manual page for details).
 
  Currently this enables support for the trusted.* and
  security.* namespaces.
diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index 2e558227931a..273351ee4c46 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -38,9 +38,6 @@ config BTRFS_FS_POSIX_ACL
  POSIX Access C

Re: exclusive subvolume space missing

2017-12-03 Thread Adam Borowski
On Sun, Dec 03, 2017 at 01:45:45AM +, Duncan wrote:
> Tomasz Pala posted on Sat, 02 Dec 2017 18:18:19 +0100 as excerpted:
> >> I got ~500 small files (100-500 kB) updated partially in regular
> >> intervals:
> >> 
> >> # du -Lc **/*.rrd | tail -n1
> >> 105Mtotal
> 
> FWIW, I've no idea what rrd files, or rrdcached (from the grandparent post)
> are (other than that a quick google suggests that it's...
> round-robin-database...

Basically: preallocate a file; its size doesn't change from then on.  Every
few minutes, write several bytes into the file, slowly advancing.

This is indeed the worst possible case for btrfs, and nocow doesn't help the
slightest as the database doesn't wrap around before a typical snapshot
interval.
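The write pattern can be emulated with plain shell tools (file size and step are made up for illustration); on a CoW filesystem, each of these tiny in-place overwrites allocates a fresh small extent:

```shell
# Emulate the rrd pattern: preallocate once, then overwrite a few bytes
# in place at slowly advancing offsets; the file size never changes.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1024 count=100 2>/dev/null    # preallocate 100 KiB
off=0
for i in 1 2 3; do
    # conv=notrunc overwrites in place instead of truncating
    printf 'sample' | dd of="$f" bs=1 seek="$off" conv=notrunc 2>/dev/null
    off=$((off + 512))
done
size=$(wc -c < "$f")    # still exactly 102400 bytes
rm -f "$f"
```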

> Meanwhile, /because/ nocow has these complexities along with others (nocow
> automatically turns off data checksumming and compression for the files
> too), and the fact that they nullify some of the big reasons people might
> choose btrfs in the first place, I actually don't recommend setting
> nocow in the first place -- if usage is such than a file needs nocow,
> my thinking is that btrfs isn't a particularly good hosting choice for
> that file in the first place, a more traditional rewrite-in-place
> filesystem is likely to be a better fit.

I'd say that the only good use for nocow is "I wish I had placed this file
on a non-btrfs, but it'd be too much hassle to repartition".

If you snapshot nocow at all, you get the worst of both worlds.
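For completeness: nocow is set with chattr +C, and it takes effect only on empty files, so the usual approach is to flag the directory before creating files in it (the helper name and path below are illustrative):

```shell
# Illustrative nocow setup; chattr +C affects only files created after
# the flag is set, so flag the directory and let new files inherit it.
setup_nocow_dir() {
    dir=$1
    mkdir -p "$dir"
    chattr +C "$dir"
}
```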


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Mozilla's Hippocritic Oath: "Keep trackers off your trail"
⣾⠁⢰⠒⠀⣿⡁ blah blah evading "tracking technology" blah blah
⢿⡄⠘⠷⠚⠋⠀ "https://click.e.mozilla.org/?qs=e7bb0dcf14b1013fca3820...";
⠈⠳⣄ (same for all links)


Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent

2017-11-27 Thread Adam Borowski
On Tue, Nov 28, 2017 at 08:51:07AM +0800, Qu Wenruo wrote:
> On 2017-11-27 22:22, David Sterba wrote:
> > On Mon, Nov 27, 2017 at 02:23:49PM +0100, Adam Borowski wrote:
> >> On 4.15-rc1, I get the following failure:
> >>
> >> BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688
> >> slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline
> >> extent, have 134 expect 281474976710677
> > 
> > By a quick look at suspiciously large number
> > 
> > hex(281474976710677) = 0x1000000000015
> > 
> > may be a bitflip, but 0x15 does not match 134, so there could be
> > something else involved in the corruption.
> 
> That's a known bug, fixed by that patch which is not merged yet.
> 
> https://patchwork.kernel.org/patch/10047489/

This helped, thanks!


喵!
-- 
⢀⣴⠾⠻⢶⣦⠀ Mozilla's Hippocritical Oath: "Keep trackers off your trail"
⣾⠁⢰⠒⠀⣿⡁ blah blah evading "tracking technology" blah blah
⢿⡄⠘⠷⠚⠋⠀ "https://click.e.mozilla.org/?qs=e7bb0dcf14b1013fca3820...";
⠈⠳⣄ (same for all links)


splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent

2017-11-27 Thread Adam Borowski
Hi!
On 4.15-rc1, I get the following failure:

BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688
slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline
extent, have 134 expect 281474976710677

Repeatable every boot attempt.  4.14 and earlier boot fine; btrfs check
(progs 4.13.3) doesn't find any badness either.

[   11.347451] BTRFS info (device sda1): use lzo compression
[   11.352914] BTRFS info (device sda1): using free space tree
[] Activating lvm and md swap...[ ok done.
[] Checking file systems...fsck from util-l
[   11.660352] BTRFS critical (device sda1): corrupt leaf: root=1 
block=3820662898688 slot=43 ino=35691 file_offset=0, invalid ram_bytes for 
uncompressed inline extent, have 134 expect 281474976710677
inux 2.30.2
[   11.678550] BTRFS info (device sda1): leaf 3820662898688 total ptrs 103 free 
space 4350
[   11.687767]  item 0 key (35663 12 32909) itemoff 16263 itemsize 20
[   11.695021]  item 1 key (35663 108 0) itemoff 15751 itemsize 512
[   11.701274]  inline extent data size 509
 ok
[   11.705704]  item 2 key (35664 1 0) itemoff 15591 itemsize 160
[   11.713034]  inode generation 1292 size 509 mode 100644
[   11.718811]  item 3 key (35664 12 32909) itemoff 15571 itemsize 20
[   11.725113]  item 4 key (35664 108 0) itemoff 15059 itemsize 512
[   11.732168]  inline extent data size 509
done.
[   11.736280]  item 5 key (35665 1 0) itemoff 14899 itemsize 160
[   11.742681]  inode generation 1292 size 457 mode 100644
[   11.748275]  item 6 key (35665 12 32909) itemoff 14879 itemsize 20
[   11.754584]  item 7 key (35665 108 0) itemoff 14411 itemsize 468
[   11.760674]  inline extent data size 457
[   11.764780]  item 8 key (35666 1 0) itemoff 14251 itemsize 160
[   11.770711]  inode generation 1292 size 533 mode 100644
[]
[   11.776145]  item 9 key (35666 12 32909) itemoff 14231 itemsize 20
Cleaning up temp
[   11.783069]  item 10 key (35666 108 0) itemoff 13697 itemsize 534
orary files...
[   11.790666]  inline extent data size 533
[   11.795980]  item 11 key (35668 1 0) itemoff 13537 itemsize 160
[   11.801989]  inode generation 1292 size 319 mode 100644
 /tmp
[   11.807413]  item 12 key (35668 12 32909) itemoff 13517 itemsize 20
[   11.814250]  item 13 key (35668 108 0) itemoff 13247 itemsize 270
[   11.820512]  inline extent data size 319
[   11.825539]  item 14 key (35669 1 0) itemoff 13087 itemsize 160
[   11.831577]  inode generation 1292 size 375 mode 100644
[   11.837149]  item 15 key (35669 12 32909) itemoff 13067 itemsize 20
[   11.843873]  item 16 key (35669 108 0) itemoff 12783 itemsize 284
 ok 
[   11.850098]  inline extent data size 375
[   11.855579]  item 17 key (35670 1 0) itemoff 12623 itemsize 160
[   11.862861]  inode generation 1292 size 168 mode 100644
.
[   11.869194]  item 18 key (35670 12 33512) itemoff 12588 itemsize 35
[   11.876467]  item 19 key (35670 108 0) itemoff 12399 itemsize 189
[   11.883564]  inline extent data size 168
[   11.888551]  item 20 key (35676 1 0) itemoff 12239 itemsize 160
[   11.895421]  inode generation 1292 size 512 mode 100600
[   11.901719]  item 21 key (35676 12 32911) itemoff 12218 itemsize 21
[   11.909045]  item 22 key (35676 108 0) itemoff 11685 itemsize 533
[   11.916136]  inline extent data size 512
[   11.921125]  item 23 key (35685 1 0) itemoff 11525 itemsize 160
[   11.928047]  inode generation 1292 size 32128 mode 100644
[   11.934553]  item 24 key (35685 12 32783) itemoff 11508 itemsize 17
[] Mounting 
[   11.941874]  item 25 key (35685 108 0) itemoff 11455 itemsize 53
local filesystem
[   11.949377]  extent data disk bytenr 3757990555648 nr 4096
s...
[   11.956471]  extent data offset 0 nr 4096 ram 4096
[   11.962383]  item 26 key (35685 108 4096) itemoff 11402 itemsize 53
[   11.969704]  extent data disk bytenr 3755041128448 nr 4096
[   11.976324]  extent data offset 4096 nr 24576 ram 32768
[   11.982686]  item 27 key (35685 108 28672) itemoff 11349 itemsize 53
[   11.990140]  extent data disk bytenr 3749090922496 nr 4096
[   11.996786]  extent data offset 0 nr 4096 ram 4096
[   12.002732]  item 28 key (35686 1 0) itemoff 11189 itemsize 160
[   12.009755]  inode generation 1292 size 5023 mode 100644
[   12.016204]  item 29 key (35686 12 32783) itemoff 11165 itemsize 24
[   12.023576]  item 30 key (35686 108 0) itemoff 2 itemsize 53
[   12.030665]  extent data disk bytenr 3651995594752 nr 4096
[   12.037298]  extent data offset 0 nr 8192 ram 8192
[   12.043184]  item 31 key (35687 1 0) itemoff 10952 itemsize 160
[   12.050181]  inode generation 1292 size 293168 mode 100664
[   12.056793]  item 32 key (35687 12 32783) itemoff 10935 itemsize 17
[   12.064082]  item 33 key (35687 108 0) itemoff 10882 itemsize 53
[   12.071154]  extent data disk bytenr 3752

Re: Unrecoverable scrub errors

2017-11-17 Thread Adam Borowski
On Fri, Nov 17, 2017 at 08:19:11PM -0700, Chris Murphy wrote:
> On Fri, Nov 17, 2017 at 8:41 AM, Nazar Mokrynskyi  
> wrote:
> 
> >> [551049.038718] BTRFS warning (device dm-2): checksum error at logical 
> >> 470069460992 on dev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> >> 942238048: metadata leaf (level 0) in tree 985
> >> [551049.038720] BTRFS warning (device dm-2): checksum error at logical 
> >> 470069460992 on dev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> >> 942238048: metadata leaf (level 0) in tree 985
> >> [551049.038723] BTRFS error (device dm-2): bdev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 
> >> 0, flush 0, corrupt 1, gen 0
> >> [551049.039634] BTRFS warning (device dm-2): checksum error at logical 
> >> 470069526528 on dev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> >> 942238176: metadata leaf (level 0) in tree 985
> >> [551049.039635] BTRFS warning (device dm-2): checksum error at logical 
> >> 470069526528 on dev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> >> 942238176: metadata leaf (level 0) in tree 985
> >> [551049.039637] BTRFS error (device dm-2): bdev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 
> >> 0, flush 0, corrupt 2, gen 0
> >> [551049.413114] BTRFS error (device dm-2): unable to fixup (regular) error 
> >> at logical 470069460992 on dev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> 
> These are metadata errors. Are there any other storage stack related
> errors in the previous 2-5 minutes, such as read errors (UNC) or SATA
> link reset messages?
> 
> >Maybe I can find snapshot that contains file with wrong checksum and
> > remove corresponding snapshot or something like that?
> 
> It's not a file. It's metadata leaf.

Just for the record: had this be a data block (ie, a non-inline file
extent), the dmesg message would include one of filenames that refer to that
extent.  To clear the error, you'd need to remove all such files.

> >> nazar-pc@nazar-pc ~> sudo btrfs filesystem df /media/Backup
> >> Data, single: total=879.01GiB, used=877.24GiB
> >> System, DUP: total=40.00MiB, used=128.00KiB
> >> Metadata, DUP: total=20.50GiB, used=18.96GiB
> >> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> Metadata is DUP, but both copies have corruption. Kinda strange. But I
> don't know how close the DUP copies are to each other, if possibly a
> big enough media defect can explain this.

The original post mentioned SSD (but was unclear if _this_ filesystem is
backed by one).  If so, DUP is nearly worthless as both copies will be
written to physical cells next to each other, no matter what positions the
FTL shows them at.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Imagine there are bandits in your house, your kid is bleeding out,
⢿⡄⠘⠷⠚⠋⠀ the house is on fire, and seven big-ass trumpets are playing in the
⠈⠳⣄ sky.  Your cat demands food.  The priority should be obvious...
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: A partially failing disk in raid0 needs replacement

2017-11-14 Thread Adam Borowski
On Tue, Nov 14, 2017 at 10:36:22AM +0200, Klaus Agnoletti wrote:
> I used to have 3x2TB in a btrfs in raid0. A few weeks ago, one of the
> 2TB disks started giving me I/O errors in dmesg like this:
> 
> [388659.188988] Add. Sense: Unrecovered read error - auto reallocate failed

Alas, chances to recover anything are pretty slim.  That's RAID0 metadata
for you.

On the other hand, losing any non-trivial file while being able to gape at
intact metadata isn't that much better, thus -mraid0 isn't completely
unreasonable.

> To fix it, it ended up with me adding a new 6TB disk and trying to
> delete the failing 2TB disks.
> 
> That didn't go so well; apparently, the delete command aborts when
> ever it encounters I/O errors. So now my raid0 looks like this:
> 
> klaus@box:~$ sudo btrfs fi show
> [sudo] password for klaus:
> Label: none  uuid: 5db5f82c-2571-4e62-a6da-50da0867888a
> Total devices 4 FS bytes used 5.14TiB
> devid1 size 1.82TiB used 1.78TiB path /dev/sde
> devid2 size 1.82TiB used 1.78TiB path /dev/sdf
> devid3 size 0.00B used 1.49TiB path /dev/sdd
> devid4 size 5.46TiB used 305.21GiB path /dev/sdb

> Obviously, I want /dev/sdd emptied and deleted from the raid.
> 
> So how do I do that?
> 
> I thought of three possibilities myself. I am sure there are more,
> given that I am in no way a btrfs expert:
> 
> 1)Try to force a deletion of /dev/sdd where btrfs copies all intact
> data to the other disks
> 2) Somehow re-balances the raid so that sdd is emptied, and then deleted
> 3) converting into a raid1, physically removing the failing disk,
> simulating a hard error, starting the raid degraded, and converting it
> back to raid0 again.

There's hardly any intact data: roughly 2/3 of chunks have half of their
blocks on the failed disk, densely interspersed.  Even worse, metadata
required to map those blocks to files is gone, too: if we naively assume
there's only a single tree, a tree node is intact only if it and every
single node on the path to the root is intact.  In practice, this means
it's a total filesystem loss.
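The "intact path to the root" argument can be put into rough numbers with a
toy model (illustrative assumptions only: each block survives independently
with probability p, and there is a single tree of small depth):

```python
# Toy model of the argument above: assume each metadata block survives
# independently with probability p, and a leaf is usable only if it and
# every node on the path to the root survive.  For a leaf at depth d,
# that's p**(d + 1).  All numbers are illustrative, not measurements.

def reachable_fraction(p, depth):
    """Probability that a leaf at the given depth is still reachable."""
    return p ** (depth + 1)

# With one of three RAID0 disks gone, roughly 2/3 of blocks survive:
p = 2 / 3
for depth in range(1, 5):
    print(f"depth {depth}: {reachable_fraction(p, depth):.1%} of leaves reachable")
```

Even at modest depths most leaves become unreachable, which is why the
practical outcome above is described as a total filesystem loss.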

> How do you guys think I should go about this? Given that it's a raid0
> for a reason, it's not the end of the world losing all data, but I'd
> really prefer losing as little as possible, obviously.

As the disk isn't _completely_ gone, there's a slim chance of some stuff
requiring only still-readable sectors.  Probably a waste of time to try
to recover, though.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Laws we want back: Poland, Dz.U. 1921 nr.30 poz.177 (also Dz.U. 
⣾⠁⢰⠒⠀⣿⡁ 1920 nr.11 poz.61): Art.2: An official, guilty of accepting a gift
⢿⡄⠘⠷⠚⠋⠀ or another material benefit, or a promise thereof, [in matters
⠈⠳⣄ relevant to duties], shall be punished by death by shooting.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: updatedb does not index /home when /home is Btrfs

2017-11-04 Thread Adam Borowski
On Sat, Nov 04, 2017 at 09:26:36AM +0300, Andrei Borzenkov wrote:
> 04.11.2017 07:49, Adam Borowski пишет:
> > On Fri, Nov 03, 2017 at 06:15:53PM -0600, Chris Murphy wrote:
> >> Ancient bug, still seems to be a bug.
> >> https://bugzilla.redhat.com/show_bug.cgi?id=906591
> >>
> >> The issue is that updatedb by default will not index bind mounts, but
> >> by default on Fedora and probably other distros, put /home on a
> >> subvolume and then mount that subvolume which is in effect a bind
> >> mount.
> >>
> >> There's a lot of early discussion in 2013 about it, but then it's
> >> dropped off the radar as nobody has any ideas how to fix this in
> >> mlocate.
> > 
> > I don't see how this would be a bug in btrfs.  The same happens if you
> > bind-mount /home (or individual homes), which is a valid and non-rare setup.
> 
> It is the problem *on* btrfs because - as opposed to normal bind mount -
> those mount points do *not* refer to the same content.

They don't refer to the same content in a "normal" bind mount either.

> As was commented in mentioned bug report:
> 
> mount -o subvol=root /dev/sdb1 /root
> mount -o subvol=foo /dev/sdb1 /root/foo
> mount -o subvol=bar /dev/sdb1 /root/bar
> 
> Both /root/foo and /root/bar will be skipped even though they are not
> accessible via any other path (on mounted filesystem)

losetup -D
truncate -s 4G junk
losetup -f junk
mkfs.ext4 /dev/loop0
mkdir -p foo bar
mount /dev/loop0 foo
mkdir foo/bar
touch foo/fileA foo/bar/fileB
mount --bind foo/bar bar
umount foo

> It is a problem *of* btrfs because it does not offer any easy way to
> distinguish between subvolume mount and bind mount. If you are aware of
> one, please comment on mentioned bug report.

Well, subvolume mounts are indistinguishable from bind mounts because they
_are_ bind mounts.  You merely don't need to mount the "master" first.

The only way such a "master" mount is special is that, on most filesystems,
its root was accessible at least at some point (but it might no longer be,
thanks to chroot, pivot_root, etc).
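One way to see that subvolume mounts really are bind mounts: in
/proc/self/mountinfo both show up the same way, with a "root" field (4th
column) giving the path inside the filesystem, so a tool like mlocate has
nothing structural to tell them apart.  A minimal sketch; the sample lines
are made up for illustration but follow the mountinfo field layout:

```python
# Parse the "root" and "mount point" fields out of mountinfo-style lines.
# Line 31 is a btrfs subvolume mount, line 32 an ext4 bind mount -- note
# that both have a non-"/" root field and are otherwise indistinguishable.
SAMPLE = """\
30 1 0:25 /root / rw,noatime shared:1 - btrfs /dev/sdb1 rw,subvol=/root
31 30 0:25 /foo /root/foo rw,noatime shared:2 - btrfs /dev/sdb1 rw,subvol=/foo
32 30 8:1 /bar /root/bar rw,noatime shared:3 - ext4 /dev/sda1 rw
"""

def mount_roots(mountinfo):
    """Return (mount point, root-within-filesystem) pairs."""
    out = []
    for line in mountinfo.splitlines():
        fields = line.split()
        out.append((fields[4], fields[3]))  # mount point, root
    return out

for mnt, root in mount_roots(SAMPLE):
    print(mnt, "->", root)
```

On a real system you would read /proc/self/mountinfo instead of the
hard-coded sample.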


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Laws we want back: Poland, Dz.U. 1921 nr.30 poz.177 (also Dz.U. 
⣾⠁⢰⠒⠀⣿⡁ 1920 nr.11 poz.61): Art.2: An official, guilty of accepting a gift
⢿⡄⠘⠷⠚⠋⠀ or another material benefit, or a promise thereof, [in matters
⠈⠳⣄ relevant to duties], shall be punished by death by shooting.


Re: updatedb does not index /home when /home is Btrfs

2017-11-03 Thread Adam Borowski
On Fri, Nov 03, 2017 at 06:15:53PM -0600, Chris Murphy wrote:
> Ancient bug, still seems to be a bug.
> https://bugzilla.redhat.com/show_bug.cgi?id=906591
> 
> The issue is that updatedb by default will not index bind mounts, but
> by default on Fedora and probably other distros, put /home on a
> subvolume and then mount that subvolume which is in effect a bind
> mount.
> 
> There's a lot of early discussion in 2013 about it, but then it's
> dropped off the radar as nobody has any ideas how to fix this in
> mlocate.

I don't see how this would be a bug in btrfs.  The same happens if you
bind-mount /home (or individual homes), which is a valid and non-rare setup.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Laws we want back: Poland, Dz.U. 1921 nr.30 poz.177 (also Dz.U. 
⣾⠁⢰⠒⠀⣿⡁ 1920 nr.11 poz.61): Art.2: An official, guilty of accepting a gift
⢿⡄⠘⠷⠚⠋⠀ or another material benefit, or a promise thereof, [in matters
⠈⠳⣄ relevant to duties], shall be punished by death by shooting.


Re: Problem with file system

2017-11-03 Thread Adam Borowski
On Fri, Nov 03, 2017 at 04:03:44PM -0600, Chris Murphy wrote:
> On Tue, Oct 31, 2017 at 5:28 AM, Austin S. Hemmelgarn
>  wrote:
> 
> > If you're running on an SSD (or thinly provisioned storage, or something
> > else which supports discards) and have the 'discard' mount option enabled,
> > then there is no backup metadata tree (this issue was mentioned on the list
> > a while ago, but nobody ever replied),
> 
> 
> This is a really good point. I've been running discard mount option
> for some time now without problems, in a laptop with Samsung
> Electronics Co Ltd NVMe SSD Controller SM951/PM951.
> 
> However, just trying btrfs-debug-tree -b on a specific block address
> for any of the backup root trees listed in the super, only the current
> one returns a valid result.  All others fail with checksum errors. And
> even the good one fails with checksum errors within seconds as a new
> tree is created, the super updated, and Btrfs considers the old root
> tree disposable and subject to discard.
> 
> So absolutely if I were to have a problem, probably no rollback for
> me. This seems to totally obviate a fundamental part of Btrfs design.

How is this an issue?  Discard is issued only once we're positive there's no
reference to the freed blocks anywhere.  At that point, they're also open
for reuse, thus they can be arbitrarily scribbled upon.

Unless your hardware is seriously broken (such as lying about barriers,
which is nearly-guaranteed data loss on btrfs anyway), there's no way the
filesystem will ever reference such blocks.  The corpses of old trees that
are left lying around with no discard can at most be used for manual
forensics, but whether a given block will have been overwritten or not is
a matter of pure luck.

For rollbacks, there are snapshots.  Once a transaction has been fully
committed, the old version is considered gone.

>  because it's already been discarded.
> > This is ideally something which should be addressed (we need some sort of
> > discard queue for handling in-line discards), but it's not easy to address.
> 
> Discard data extents, don't discard metadata extents? Or put them on a
> substantial delay.

Why would you special-case metadata?  Metadata that points to overwritten or
discarded blocks is of no use either.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Laws we want back: Poland, Dz.U. 1921 nr.30 poz.177 (also Dz.U. 
⣾⠁⢰⠒⠀⣿⡁ 1920 nr.11 poz.61): Art.2: An official, guilty of accepting a gift
⢿⡄⠘⠷⠚⠋⠀ or another material benefit, or a promise thereof, [in matters
⠈⠳⣄ relevant to duties], shall be punished by death by shooting.


Re: [PATCH] btrfs: avoid misleading talk about "compression level 0"

2017-10-25 Thread Adam Borowski
On Wed, Oct 25, 2017 at 03:23:11PM +0200, David Sterba wrote:
> On Sat, Oct 21, 2017 at 06:49:01PM +0200, Adam Borowski wrote:
> > Many compressors do assign a meaning to level 0: either null compression or
> > the lowest possible level.  This differs from our "unset thus default".
> > Thus, let's not unnecessarily confuse users.
> 
> I agree 'level 0' confusing, however I'd like to keep the level
> mentioned in the message.
> 
> We could add
> 
> #define   BTRFS_COMPRESSION_ZLIB_DEFAULT  3
> 
> and use it in btrfs_compress_str2level.

I considered this but every algorithm has a different default, thus we'd
need separate cases for zlib vs zstd, while lzo has no settable level at
all.  Still, this is just some extra lines of code, thus doable.
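The per-algorithm defaults could look roughly like this (sketched in Python
for brevity; the real change would be C in fs/btrfs).  zlib's default of 3
comes from the thread; treating zstd's default as 3 is an assumption here,
and lzo has no settable level at all:

```python
# Illustrative sketch of per-algorithm default levels.  Level 0 is the
# "unset" sentinel, exactly the confusion the patch message talks about,
# and maps to the algorithm's own default.
DEFAULT_LEVEL = {"zlib": 3, "zstd": 3, "lzo": None}  # None: no levels

def effective_level(algo, requested):
    """Map an unset level (0 or None) to the per-algorithm default."""
    if DEFAULT_LEVEL[algo] is None:     # lzo: level is meaningless
        return None
    return requested if requested else DEFAULT_LEVEL[algo]
```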

> > Signed-off-by: Adam Borowski 
> > ---
> >  fs/btrfs/super.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > index f9d4522336db..144fabfbd246 100644
> > --- a/fs/btrfs/super.c
> > +++ b/fs/btrfs/super.c
> > @@ -551,7 +551,9 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
> >   compress_force != saved_compress_force)) ||
> > (!btrfs_test_opt(info, COMPRESS) &&
> >  no_compress == 1)) {
> > -   btrfs_info(info, "%s %s compression, level %d",
> > +   btrfs_printk(info, info->compress_level ?
> > +  KERN_INFO"%s %s compression, level %d" :
> > +  KERN_INFO"%s %s compression",
> 
> Please keep using btrfs_info, the KERN_INFO prefix would not work here.
> btrfs_printk prepends the filesystem description and the message level
> must be at the beginning.

Seems to work for me:
[   14.072575] BTRFS info (device sda1): use lzo compression
with identical colors as other info messages next to it.

But if we're to expand this code, ternary operators would get too hairy,
thus this can go at least for clarity.

> >(compress_force) ? "force" : "use",
> >compress_type, info->compress_level);
> > }
> 

-- 
⢀⣴⠾⠻⢶⣦⠀ Laws we want back: Poland, Dz.U. 1921 nr.30 poz.177 (also Dz.U. 
⣾⠁⢰⠒⠀⣿⡁ 1920 nr.11 poz.61): Art.2: An official, guilty of accepting a gift
⢿⡄⠘⠷⠚⠋⠀ or another material benefit, or a promise thereof, [in matters
⠈⠳⣄ relevant to duties], shall be punished by death by shooting.


Re: SLES 11 SP4: can't mount btrfs

2017-10-21 Thread Adam Borowski
On Sat, Oct 21, 2017 at 01:46:06PM +0200, Lentes, Bernd wrote:
> - Am 21. Okt 2017 um 4:31 schrieb Duncan 1i5t5.dun...@cox.net:
> > Lentes, Bernd posted on Fri, 20 Oct 2017 20:40:15 +0200 as excerpted:
> > 
> >> Is it generally possible to restore a btrfs partition from a tape backup
> >> ?
> >> I'm just starting, and I'm asking myself. What is about the subvolumes ?
> >> This information isn't stored in files, but in the fs ? This is not on a
> >> file-based backup on a tape.
> > 
> > Yes it's possible to restore a btrfs partition from tape backup, /if/ you
> > backed up the partition itself, not just the files on top of it.

Which is usually a quite bad idea: unless you shut down (or remount ro) the
filesystem in question, the data _will_ be corrupted, and in the case of
btrfs, this kind of corruption tends to be fatal.  You also back up all the
unused space (trim greatly recommended), and the backup process takes ages
as it needs to read everything.

An efficient block-level backup of btrfs _would_ be possible as it can
nicely enumerate blocks touched since generation X, but AFAIK no one has
written such a program yet.  It'd also be corruption-free if done in two
passes: first a racy copy, then fsfreeze(), then a copy of just the newest
updates.
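A minimal sketch of that two-pass scheme, with a plain generation counter
standing in for btrfs transaction generations (a toy model, not a real
implementation):

```python
# Two-pass backup: pass 1 copies everything while writers race with it;
# after "fsfreeze" the source is quiescent and pass 2 re-copies only the
# entries whose generation is newer than where pass 1 started.
source = {"a": 1, "b": 1, "c": 1}       # path -> generation (last change)
backup = {}
pass1_start_gen = max(source.values())  # remember where pass 1 began

# pass 1: racy full copy -- "b" changes *after* pass 1 already copied it
backup["a"] = source["a"]
backup["b"] = source["b"]
source["b"] = 2                         # concurrent writer bumps "b"
backup["c"] = source["c"]

# fsfreeze(): writers stopped; pass 2 re-copies anything newer
for path, gen in source.items():
    if gen > pass1_start_gen:
        backup[path] = gen

print("backup consistent:", backup == source)
```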

> > Otherwise, as you deduce, you get the files, but not the snapshot history
> > or relationship, nor the subvolumes, which will look to normal file-level
> > backup software (that is, backup software not designed with btrfs-
> > specifics like subvolumes, which if it did, would likely use btrfs send/
> > receive at least optionally) like normal directories.

If the backup software does incrementals well, this is not as bad as it
sounds.  While rsync takes half an hour just to stat() a typical small piece
of spinning rust (obviously depending on # of files), that's still in the
acceptable range.  That backup software can then be told to back up every
snapshot in turn.  You still lose reflinks between unrelated subvolumes but
those tend to be quite rare -- and you can re-dedupe.

> i apprehend that i have just a file based backup.  We use EMC Networker
> (version 8.1 or 8.2), and from what i read in the net i think it does not
> support BTRFS.  So i have to reinstall, which is maybe not the worst,
> because i'm thinking about using SLES 11 SP3.
> 
> What i know now is that i can't rely on our EMC backup.
> What would you propose to backup a complete btrfs partition
> (https://btrfs.wiki.kernel.org/index.php/Incremental_Backup) ?
> We have a NAS with propable enough space, and the servers aren't used
> heavily over night.  So using one of the mentioned tools in a cronjob over
> night is possible.

> Which tool do you recommend ?

It depends on what you use subvolumes for.

While a simple file-base backup may be inadequate for the general case, for
most actual uses it works well or at least well enough.  Only if you're
doing something special, bothering with the complexity might be worth it.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Laws we want back: Poland, Dz.U. 1921 nr.30 poz.177 (also Dz.U. 
⣾⠁⢰⠒⠀⣿⡁ 1920 nr.11 poz.61): Art.2: An official, guilty of accepting a gift
⢿⡄⠘⠷⠚⠋⠀ or another material benefit, or a promise thereof, [in matters
⠈⠳⣄ relevant to duties], shall be punished by death by shooting.


[PATCH] btrfs: avoid misleading talk about "compression level 0"

2017-10-21 Thread Adam Borowski
Many compressors do assign a meaning to level 0: either null compression or
the lowest possible level.  This differs from our "unset thus default".
Thus, let's not unnecessarily confuse users.

Signed-off-by: Adam Borowski 
---
 fs/btrfs/super.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f9d4522336db..144fabfbd246 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -551,7 +551,9 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
  compress_force != saved_compress_force)) ||
(!btrfs_test_opt(info, COMPRESS) &&
 no_compress == 1)) {
-   btrfs_info(info, "%s %s compression, level %d",
+   btrfs_printk(info, info->compress_level ?
+  KERN_INFO"%s %s compression, level %d" :
+  KERN_INFO"%s %s compression",
   (compress_force) ? "force" : "use",
   compress_type, info->compress_level);
}
-- 
2.15.0.rc1



Re: Is it safe to use btrfs on top of different types of devices?

2017-10-18 Thread Adam Borowski
On Wed, Oct 18, 2017 at 07:30:55AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-10-17 16:21, Adam Borowski wrote:
> > > > It's a single-device filesystem, thus disconnects are obviously fatal.  
> > > > But,
> > > > they never caused even a single bit of damage (as scrub goes), thus 
> > > > proving
> > > > btrfs handles this kind of disconnects well.  Unlike times past, the 
> > > > kernel
> > > > doesn't get confused thus no reboot is needed, merely an unmount, 
> > > > "service
> > > > nbd-client restart", mount, restart the rebuild jobs.
> > > That's expected behavior though.  _Single_ device BTRFS has nothing to get
> > > out of sync most of the time, the only time there's any possibility of an
> > > issue is when you die after writing the first copy of a block that's in a
> > > dup profile chunk, but even that is not very likely to cause problems
> > > (you'll just lose at most the last  worth of data).
> > 
> > How come?  In a DUP profile, the writes are: chunk 1, chunk2, barrier,
> > superblock.  The two prior writes may be arbitrarily reordered -- both
> > between each other or even individual sectors inside the chunks, but unless
> > the disk lies about barriers, there's no way to have any corruption, thus
> > running scrub is not needed.
> If the device dies after writing chunk 1 but before the barrier, you end up
> needing scrub.  How much of a failure window is present is largely a
> function of how fast the device is, but there is a failure window there.

CoW is there to ensure there is _no_ failure window.  The new content
doesn't matter until there are live pointers to it -- from the filesystem's
point of view we merely scribbled something on an unused part of the block
device.  Only after all pieces are in place (as ensured by the barrier), the
superblock is updated with a reference to the new metadata->data chain.

Thus, no matter when a disconnect happens, after a crash you get either
uncorrupted old version or uncorrupted new version.

No scrub is ever needed for this reason on single device or on RAID1 that
didn't run degraded.
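The commit sequence described above can be modelled in a few lines (a toy
model: the final superblock update is treated as atomic here, which the real
code ensures with checksums and generation numbers):

```python
# Toy model of the CoW commit order: new data goes to unused space first,
# a barrier makes it durable, and only the final superblock pointer flip
# makes it live.  A "crash" after any step leaves either the complete old
# version or the complete new version -- never a mix.
class Disk:
    def __init__(self):
        self.blocks = {"A": "old data"}
        self.superblock = "A"           # points at the live tree

    def commit(self, new_data, crash_after=None):
        self.blocks["B"] = new_data     # step 1: CoW write to unused space
        if crash_after == 1:
            return                      # crash before the barrier
        # step 2: barrier -- new blocks durable before the pointer flips
        self.superblock = "B"           # step 3: atomic pointer update

    def read(self):
        return self.blocks[self.superblock]

d = Disk()
d.commit("new data", crash_after=1)
print(d.read())  # crash before the pointer flip: still "old data"
d.commit("new data")
print(d.read())  # completed commit: "new data"
```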


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Imagine there are bandits in your house, your kid is bleeding out,
⢿⡄⠘⠷⠚⠋⠀ the house is on fire, and seven big-ass trumpets are playing in the
⠈⠳⣄ sky.  Your cat demands food.  The priority should be obvious...


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-17 Thread Adam Borowski
On Tue, Oct 17, 2017 at 03:19:09PM -0400, Austin S. Hemmelgarn wrote:
> On 2017-10-17 13:06, Adam Borowski wrote:
> > On Tue, Oct 17, 2017 at 08:40:20AM -0400, Austin S. Hemmelgarn wrote:
> > > On 2017-10-17 07:42, Zoltan wrote:
> > > > On Tue, Oct 17, 2017 at 1:26 PM, Austin S. Hemmelgarn
> > > >  wrote:
> > > > 
> > > > > I forget sometimes that people insist on storing large volumes of 
> > > > > data on
> > > > > unreliable storage...
> > > > 
> > > > In my opinion the unreliability of the storage is the exact reason for
> > > > wanting to use raid1. And I think any problem one encounters with an
> > > > unreliable disk can likely happen with more reliable ones as well,
> > > > only less frequently, so if I don't feel comfortable using raid1 on an
> > > > unreliable medium then I wouldn't trust it on a more reliable one
> > > > either.
> > 
> > > The thing is that you need some minimum degree of reliability in the other
> > > components in the storage stack for it to be viable to use any given 
> > > storage
> > > technology.  If you don't meet that minimum degree of reliability, then 
> > > you
> > > can't count on the reliability guarantees of the storage technology.
> > 
> > The thing is, reliability guarantees required vary WILDLY depending on your
> > particular use cases.  On one hand, there's "even an one-minute downtime
> > would cost us mucho $$$s, can't have that!" -- on the other, "it died?
> > Okay, we got backups, lemme restore it after the weekend".
> Yes, but if you are in the second case, you arguably don't need replication,
> and would be better served by improving the reliability of your underlying
> storage stack than trying to work around it's problems. Even in that case,
> your overall reliability is still constrained by the least reliable
> component (in more idiomatic terms 'a chain is only as strong as it's
> weakest link').

MD can handle this case well, there's no reason btrfs shouldn't do that too.
A RAID is not akin to a serially connected chain; it's a parallel-connected
one: while pieces of a broken second chain hanging from the first don't make
it strictly more resilient than having just a single chain, in the general
case it _is_ more reliable even if the other chain is weaker.

Don't we have a patchset that deals with marking a device as failed at
runtime floating on the mailing list?  I did not look at those patches yet,
but they are a step in this direction.

> Using replication with a reliable device and a questionable device is
> essentially the same as trying to add redundancy to a machine by adding an
> extra linkage that doesn't always work and can get in the way of the main
> linkage it's supposed to be protecting from failure.  Yes, it will work most
> of the time, but the system is going to be less reliable than it is without
> the 'redundancy'.

That's the current state of btrfs, but the design is sound, and reaching
more than parity with MD is a matter of implementation.

> > Thus, I switched the machine to NBD (albeit it sucks on 100Mbit eth).  Alas,
> > the network driver allocates memory with GFP_NOIO which causes NBD
> > disconnects (somehow, this doesn't ever happen on swap where GFP_NOIO would
> > be obvious but on regular filesystem where throwing out userspace memory is
> > safe).  The disconnects happen around once per week.
> Somewhat off-topic, but you might try looking at ATAoE as an alternative,
> it's more reliable in my experience (if you've got a reliable network),
> gives better performance (there's less protocol overhead than NBD, and it
> runs on top of layer 2 instead of layer 4)

I've tested it -- not on the Odroid-U2 but on Pine64 (fully working GbE). 
NBD delivers 108MB/sec in a linear transfer, ATAoE is lucky to break
40MB/sec, same target (Qnap-253a, spinning rust), both in default
configuration without further tuning.  NBD is over IPv6 for that extra 20
bytes per packet overhead.

Also, NBD can be encrypted or arbitrarily routed.

> > It's a single-device filesystem, thus disconnects are obviously fatal.  But,
> > they never caused even a single bit of damage (as scrub goes), thus proving
> > btrfs handles this kind of disconnects well.  Unlike times past, the kernel
> > doesn't get confused thus no reboot is needed, merely an unmount, "service
> > nbd-client restart", mount, restart the rebuild jobs.
> That's expected behavior though.  _Single_ device BTRFS has nothing to get
> out of sync most of the time, the only time there's any possibility of an
> issue is when you die after writing the first copy of a block that's in a
> dup profile chunk, but even that is not very likely to cause problems
> (you'll just lose at most the last  worth of data).

Re: Is it safe to use btrfs on top of different types of devices?

2017-10-17 Thread Adam Borowski
On Tue, Oct 17, 2017 at 08:40:20AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-10-17 07:42, Zoltan wrote:
> > On Tue, Oct 17, 2017 at 1:26 PM, Austin S. Hemmelgarn
> >  wrote:
> > 
> > > I forget sometimes that people insist on storing large volumes of data on
> > > unreliable storage...
> > 
> > In my opinion the unreliability of the storage is the exact reason for
> > wanting to use raid1. And I think any problem one encounters with an
> > unreliable disk can likely happen with more reliable ones as well,
> > only less frequently, so if I don't feel comfortable using raid1 on an
> > unreliable medium then I wouldn't trust it on a more reliable one
> > either.

> The thing is that you need some minimum degree of reliability in the other
> components in the storage stack for it to be viable to use any given storage
> technology.  If you don't meet that minimum degree of reliability, then you
> can't count on the reliability guarantees of the storage technology.

The thing is, reliability guarantees required vary WILDLY depending on your
particular use cases.  On one hand, there's "even a one-minute downtime
would cost us mucho $$$s, can't have that!" -- on the other, "it died? 
Okay, we got backups, lemme restore it after the weekend".

Lemme tell you a btrfs blockdev disconnects story.
I have an Odroid-U2, a cheap ARM SoC that, despite being 5 years old and
costing mere $79 (+$89 eMMC...) still beats the performance of much newer
SoCs that have far better theoretical specs, including subsequent Odroids.
After ~1.5 year of CPU-bound stress tests for one program, I switched this
machine to doing Debian package rebuilds, 24/7/365¼, for QA purposes.
Being a moron, I did not realize until pretty late that high parallelism to
keep all cores utilized is still a net performance loss when a memory-hungry
package goes into a swappeathon, even despite the latter being fairly rare.
Thus, I can say disk utilization was pretty much 100%, with almost as much
writing as reading.  The eMMC card endured all of this until very recently
(nowadays it sadly throws errors from time to time).

Thus, I switched the machine to NBD (albeit it sucks on 100Mbit eth).  Alas,
the network driver allocates memory with GFP_NOIO which causes NBD
disconnects (somehow, this doesn't ever happen on swap where GFP_NOIO would
be obvious but on regular filesystem where throwing out userspace memory is
safe).  The disconnects happen around once per week.

It's a single-device filesystem, thus disconnects are obviously fatal.  But,
they never caused even a single bit of damage (as scrub goes), thus proving
btrfs handles this kind of disconnects well.  Unlike times past, the kernel
doesn't get confused thus no reboot is needed, merely an unmount, "service
nbd-client restart", mount, restart the rebuild jobs.

I also can recreate this filesystem and the build environment on it with
just a few commands, thus, unlike /, there's no need for backups.  But I
had no need to recreate it yet.

This is single-device not RAID5, but it's a good example for an use case
where an unreliable storage medium is acceptable (even if the GFP_NOIO issue
is still worth fixing).


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Imagine there are bandits in your house, your kid is bleeding out,
⢿⡄⠘⠷⠚⠋⠀ the house is on fire, and seven big-ass trumpets are playing in the
⠈⠳⣄ sky.  Your cat demands food.  The priority should be obvious...


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-16 Thread Adam Borowski
On Mon, Oct 16, 2017 at 01:27:40PM -0400, Austin S. Hemmelgarn wrote:
> On 2017-10-16 12:57, Zoltan wrote:
> > On Mon, Oct 16, 2017 at 1:53 PM, Austin S. Hemmelgarn wrote:
> In an ideal situation, scrubbing should not be an 'only if needed' thing,
> even for a regular array that isn't dealing with USB issues. From a
> practical perspective, there's no way to know for certain if a scrub is
> needed short of reading every single file in the filesystem in it's
> entirety, at which point, you're just better off running a scrub (because if
> you _do_ need to scrub, you'll end up reading everything twice).

> [...]  There are three things to deal with here:
> 1. Latent data corruption caused either by bit rot, or by a half-write (that
> is, one copy got written successfully, then the other device disappeared
> _before_ the other copy got written).
> 2. Single chunks generated when the array is degraded.
> 3. Half-raid1 chunks generated by newer kernels when the array is degraded.

Note that any of the above other than bit rot affects only very recent data.
If we keep a record of the last known-good generation, all of that can be
enumerated, allowing us to make a selective scrub that checks only a small
part of the disk.  A linear read of an 8TB disk takes 14 hours...

If we ever get auto-recovery, this is a fine candidate.
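A sketch of that selective-scrub idea: given the last known-good generation,
only blocks written after it can be affected, so they are the only ones
worth checking (the block layout here is made up for illustration):

```python
# Only blocks with a generation newer than the last known-good one can
# be half-written or missing a mirror copy; everything older was fully
# committed before the device dropped out.
blocks = [{"addr": i, "generation": g}
          for i, g in enumerate([5, 5, 6, 7, 9, 9, 10, 10])]

last_known_good = 8
suspect = [b for b in blocks if b["generation"] > last_known_good]
print(f"scrub {len(suspect)} of {len(blocks)} blocks")  # scrub 4 of 8 blocks
```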

> Scrub will fix problem 1 because that's what it's designed to fix.  it will
> also fix problem 3, since that behaves just like problem 1 from a
> higher-level perspective.  It won't fix problem 2 though, as it doesn't look
> at chunk types (only if the data in the chunk doesn't have the correct
> number of valid copies).

Here not even tracking generations is required: a soft convert balance
touches only bad chunks.  Again, would work well for auto-recovery, as it's
a no-op if all is well.
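What "soft" buys here, sketched as a toy model: the convert balance skips
chunks already in the target profile, so on a healthy array it rewrites
nothing.  (The actual command would be something like
`btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt`.)

```python
# Toy model of the "soft" filter: a convert balance with "soft" rewrites
# only chunks NOT already in the target profile.
def soft_convert(chunks, target):
    """Return indices of chunks a soft convert would rewrite."""
    return [i for i, profile in enumerate(chunks) if profile != target]

# A "single" chunk left over from a degraded mount gets picked up...
print(soft_convert(["raid1", "raid1", "single", "raid1"], "raid1"))  # [2]
# ...while a healthy array is untouched:
print(soft_convert(["raid1"] * 3, "raid1"))                          # []
```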

> In contrast, the balance command you quoted won't fix issue 1 (because it
> doesn't validate checksums or check that data has the right number of
> copies), or issue 3 (because it's been told to only operate on non-raid1
> chunks), but it will fix issue 2.
> 
> In comparison to both of the above, a full balance without filters will fix
> all three issues, although it will do so less efficiently (in terms of both
> time and disk usage) than running a soft-conversion balance followed by a
> scrub.

"less efficiently" is an understatement.  Scrub gets a good part of
theoretical linear speed, while I just had a single metadata block take
14428 seconds to balance.

> In the case of normal usage, device disconnects are rare, so you should
> generally be more worried about latent data corruption.

Yeah, but certain setups (anything USB, for example) get disconnects quite
often.  It would be nice to get them right.  MD, thanks to its write-intent
bitmap, can recover almost instantly; btrfs could do even better -- but the
code to do so isn't written yet.
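The MD mechanism mentioned can be added to an existing array with a single
command (the device name is an example):

```shell
# With an internal write-intent bitmap, a resync after a transient
# disconnect touches only the regions marked dirty, not the whole disk:
mdadm --grow --bitmap=internal /dev/md0
```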

> monitor the kernel log to watch for device disconnects, remount the
> filesystem when the device reconnects, and then run the balance command
> followed by a scrub.  With most hardware I've seen, USB disconnects tend to
> be relatively frequent unless you're using very high quality cabling and
> peripheral devices.  If, however, they happen less than once a day most of
> the time, just set up the log monitor to remount, and set the balance and
> scrub commands on the schedule I suggested above for normal usage.

A day-long recovery for an event that happens daily isn't a particularly
enticing prospect.
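For what it's worth, the monitor described in the quoted text could be
sketched roughly like this; the trigger string, device and mount point are
all assumptions to adjust for the actual setup:

```shell
#!/bin/sh
# Watch kernel messages for USB disconnects, then remount and repair.
# 'USB disconnect' as a pattern, /mnt/array and the raid1 profile are
# examples only.
journalctl -kf | grep --line-buffered 'USB disconnect' | while read -r _; do
        sleep 10        # give the device time to reappear
        mount -o remount /mnt/array || continue
        btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt/array
        btrfs scrub start -B /mnt/array
done
```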

-- 
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢰⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ I was born a dumb, ugly and work-loving kid, then I got swapped on
⠈⠳⣄ the maternity ward.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Give up on bcache?

2017-09-26 Thread Adam Borowski
On Tue, Sep 26, 2017 at 11:33:19PM +0500, Roman Mamedov wrote:
> On Tue, 26 Sep 2017 16:50:00 + (UTC)
> Ferry Toth  wrote:
> 
> > https://www.phoronix.com/scan.php?page=article&item=linux414-bcache-
> > raid&num=2
> > 
> > I think it might be idle hopes to think bcache can be used as a ssd cache 
> > for btrfs to significantly improve performance..
> 
> My personal real-world experience shows that SSD caching -- with lvmcache --
> does indeed significantly improve performance of a large Btrfs filesystem with
> slowish base storage.
> 
> And that article, sadly, only demonstrates once again the general mediocre
> quality of Phoronix content: it is an astonishing oversight to not check out
> lvmcache in the same setup, to at least try to draw some useful conclusion, is
> it Bcache that is strangely deficient, or SSD caching as a general concept
> does not work well in the hardware setup utilized.

Also, it looks as if Phoronix' tests don't stress metadata at all.  Btrfs is
all about metadata; speeding it up greatly helps most workloads.
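For anyone who wants to run the comparison Phoronix skipped, a minimal
lvmcache setup looks something like this (VG/LV/device names are examples):

```shell
# Assumes vg0 already contains the slow LV 'data' and the SSD has been
# added to the VG as a PV:
lvcreate -L 100G -n cache0 vg0 /dev/nvme0n1p1
lvcreate -L 1G -n cache0meta vg0 /dev/nvme0n1p1
lvconvert --type cache-pool --poolmetadata vg0/cache0meta vg0/cache0
lvconvert --type cache --cachepool vg0/cache0 vg0/data
```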

A pipe-dream wishlist would be:
* store and access master copy of metadata on SSD only
* pin all data blocks referenced by generations not yet mirrored
* slowly copy over metadata to HDD

-- 
⢀⣴⠾⠻⢶⣦⠀ We domesticated dogs 36000 years ago; together we chased
⣾⠁⢰⠒⠀⣿⡁ animals, hung out and licked or scratched our private parts.
⢿⡄⠘⠷⠚⠋⠀ Cats domesticated us 9500 years ago, and immediately we got
⠈⠳⣄ agriculture, towns then cities. -- whitroth on /.


Re: qemu-kvm VM died during partial raid1 problems of btrfs

2017-09-18 Thread Adam Borowski
On Wed, Sep 13, 2017 at 08:21:01AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-09-12 17:13, Adam Borowski wrote:
> > On Tue, Sep 12, 2017 at 04:12:32PM -0400, Austin S. Hemmelgarn wrote:
> > > On 2017-09-12 16:00, Adam Borowski wrote:
> > > > Noted.  Both Marat's and my use cases, though, involve VMs that are off 
> > > > most
> > > > of the time, and at least for me, turned on only to test something.
> > > > Touching mtime makes rsync run again, and it's freaking _slow_: worse 
> > > > than
> > > > 40 minutes for a 40GB VM (source:SSD target:deduped HDD).
> > > 40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if
> > > you're going direct to a hard drive.  I get better performance than that 
> > > on
> > > my somewhat pathetic NUC based storage cluster (I get roughly 20 MB/s 
> > > there,
> > > but it's for archival storage so I don't really care).  I'm actually 
> > > curious
> > > what the exact rsync command you are using is (you can obviously redact
> > > paths as you see fit), as the only way I can think of that it should be 
> > > that
> > > slow is if you're using both --checksum (but if you're using this, you can
> > > tell rsync to skip the mtime check, and that issue goes away) and 
> > > --inplace,
> > > _and_ your HDD is slow to begin with.
> >
> > rsync -axX --delete --inplace --numeric-ids /mnt/btr1/qemu/ 
> > mordor:$BASE/qemu
> > The target is single, compress=zlib SAMSUNG HD204UI, 34976 hours old but
> > with nothing notable on SMART, in a Qnap 253a, kernel 4.9.
> compress=zlib is probably your biggest culprit.  As odd as this sounds, I'd
> suggest switching that to lzo (seriously, the performance difference is
> ludicrous), and then setting up a cron job (or systemd timer) to run defrag
> over things to switch to zlib.  As a general point of comparison, we do
> archival backups to a file server running BTRFS where I work, and the
> archiving process runs about four to ten times faster if we take this
> approach (LZO for initial compression, then recompress using defrag once the
> initial transfer is done) than just using zlib directly.

Turns out that lzo is actually the slowest, but only by a bit.

I tried a different disk in the same Qnap; also an old disk, but 7200 rpm
rather than 5400.  Mostly empty, only a handful of subvolumes, not much
reflinking.  I made three separate copies, ran fallocate -d on them, upgraded
Windows inside the VM, then:

[/mnt/btr1/qemu]$ for x in none lzo zlib;do time rsync -axX --delete --inplace --numeric-ids win10.img mordor:/SOME/DIR/$x/win10.img;done

real    31m37.459s
user    27m21.587s
sys     2m16.210s

real    33m28.258s
user    27m19.745s
sys     2m17.642s

real    32m57.058s
user    27m24.297s
sys     2m17.640s

Note the "user" values.  So rsync does something bad on the source side.
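One thing worth testing: that "user" time is consistent with rsync's
rolling-checksum delta-transfer algorithm grinding on the sender, which
--whole-file disables:

```shell
# -W/--whole-file skips the delta-transfer computation entirely and just
# streams the file; paths mirror the runs above:
rsync -axX -W --delete --inplace --numeric-ids win10.img mordor:/SOME/DIR/none/win10.img
```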

Despite fragmentation, reads on the source are not a problem:

[/mnt/btr1/qemu]$ time cat win10.img > /dev/null

real    1m28.815s
user    0m0.061s
sys     0m48.094s
[/mnt/btr1/qemu]$ /usr/sbin/filefrag win10.img 
win10.img: 63682 extents found
[/mnt/btr1/qemu]$ btrfs fi def win10.img
[/mnt/btr1/qemu]$ /usr/sbin/filefrag win10.img 
win10.img: 18015 extents found
[/mnt/btr1/qemu]$ time cat win10.img > /dev/null

real    1m17.879s
user    0m0.076s
sys     0m37.757s

> `--inplace` is probably not helping (especially if most of the file changed,
> on BTRFS, it actually is marginally more efficient to just write out a whole
> new file and then replace the old one with a rename if you're rewriting most
> of the file), but is probably not as much of an issue as compress=zlib.

Yeah, scp + dedupe would run faster.  For deduplication, instead of
duperemove it'd be better to call file_extent_same on the first 128K, then
the second, ... -- without even hashing the blocks beforehand.

Not that this particular VM takes enough backup space to make spending too
much time worthwhile, but it's a good test case for performance issues like
this.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity.  You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so.  I recommend Skepticism
⠈⠳⣄ (funeral doom metal).


[RFC PATCH 1/3] btrfs: allow to set compression level for zlib

2017-09-15 Thread Adam Borowski
From: David Sterba 

Preliminary support for setting compression level for zlib, the
following works:

$ mount -o compress=zlib            # default
$ mount -o compress=zlib0           # same
$ mount -o compress=zlib9           # level 9, slower sync, less data
$ mount -o compress=zlib1           # level 1, faster sync, more data
$ mount -o remount,compress=zlib3   # level set by remount

The level is visible in the same format in /proc/mounts. Level set via
file property does not work yet.

Required patch: "btrfs: prepare for extensions in compression options"

Signed-off-by: David Sterba 
---
 fs/btrfs/compression.c | 20 +++-
 fs/btrfs/compression.h |  6 +-
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/inode.c   |  5 -
 fs/btrfs/lzo.c |  5 +
 fs/btrfs/super.c   |  7 +--
 fs/btrfs/zlib.c| 12 +++-
 7 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index b51d23f5cafa..70a50194fcf5 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -867,6 +867,11 @@ static void free_workspaces(void)
  * Given an address space and start and length, compress the bytes into @pages
  * that are allocated on demand.
  *
+ * @type_level is encoded algorithm and level, where level 0 means whatever
+ * default the algorithm chooses and is opaque here;
+ * - compression algo are 0-3
+ * - the level are bits 4-7
+ *
  * @out_pages is an in/out parameter, holds maximum number of pages to allocate
  * and returns number of actually allocated pages
  *
@@ -881,7 +886,7 @@ static void free_workspaces(void)
  * @max_out tells us the max number of bytes that we're allowed to
  * stuff into pages
  */
-int btrfs_compress_pages(int type, struct address_space *mapping,
+int btrfs_compress_pages(unsigned int type_level, struct address_space *mapping,
 u64 start, struct page **pages,
 unsigned long *out_pages,
 unsigned long *total_in,
@@ -889,9 +894,11 @@ int btrfs_compress_pages(int type, struct address_space *mapping,
 {
struct list_head *workspace;
int ret;
+   int type = type_level & 0xF;
 
workspace = find_workspace(type);
 
+   btrfs_compress_op[type - 1]->set_level(workspace, type_level);
ret = btrfs_compress_op[type-1]->compress_pages(workspace, mapping,
  start, pages,
  out_pages,
@@ -1081,3 +1088,14 @@ int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end)
 
return ret;
 }
+
+unsigned int btrfs_compress_str2level(const char *str)
+{
+   if (strncmp(str, "zlib", 4) != 0)
+   return 0;
+
+   if ('1' <= str[4] && str[4] <= '9' )
+   return str[4] - '0';
+
+   return 0;
+}
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index d2781ff8f994..da20755ebf21 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -76,7 +76,7 @@ struct compressed_bio {
 void btrfs_init_compress(void);
 void btrfs_exit_compress(void);
 
-int btrfs_compress_pages(int type, struct address_space *mapping,
+int btrfs_compress_pages(unsigned int type_level, struct address_space *mapping,
 u64 start, struct page **pages,
 unsigned long *out_pages,
 unsigned long *total_in,
@@ -95,6 +95,8 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 int mirror_num, unsigned long bio_flags);
 
+unsigned btrfs_compress_str2level(const char *str);
+
 enum btrfs_compression_type {
BTRFS_COMPRESS_NONE  = 0,
BTRFS_COMPRESS_ZLIB  = 1,
@@ -124,6 +126,8 @@ struct btrfs_compress_op {
  struct page *dest_page,
  unsigned long start_byte,
  size_t srclen, size_t destlen);
+
+   void (*set_level)(struct list_head *ws, unsigned int type);
 };
 
 extern const struct btrfs_compress_op btrfs_zlib_compress;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5a8933da39a7..dd07a7ef234c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -791,6 +791,7 @@ struct btrfs_fs_info {
 */
unsigned long pending_changes;
unsigned long compress_type:4;
+   unsigned int compress_level;
int commit_interval;
/*
 * It is a suggestive number, the read side is safe even it gets a
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 128f3e58634f..28201b924575 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -530,7 +530,10 @@ static noinline void compress_file_range(struct inode *inode,
 */
extent_range_clear_dirty_for_io(inode, start, end);
redirty 
