Re: Will "btrfs check --repair" fix the mounting problem?

2015-12-15 Thread Ivan Sizov
2015-12-15 1:42 GMT+00:00 Qu Wenruo :
> You'll see output like the following:
> Well block 29491200(gen: 5 level: 0) seems good, and it matches superblock
> Well block 29376512(gen: 4 level: 0) seems good, but generation/level
> doesn't match, want gen: 5 level: 0
>
> The matching one is not what you're looking for.
> Try one whose generation is a little smaller than that of the matching one.
>
> Then use btrfsck to test whether it's OK:
> $ btrfsck -r <bytenr> /dev/sda1
>
> Try 2~5 times with bytenrs whose generation is near that of the matching one.
> With luck, you will find one that doesn't crash btrfsck.
>
> And if that doesn't produce many errors, then you can try btrfsck --repair -r
> <bytenr> to fix it and then try mounting.

I've found a root that doesn't produce a backtrace, but extent/chunk
allocation errors were found:

$ sudo btrfsck --tree-root 535461888 /dev/sda1
parent transid verify failed on 535461888 wanted 21154 found 21150
parent transid verify failed on 535461888 wanted 21154 found 21150
Ignoring transid failure
checking extents
parent transid verify failed on 459292672 wanted 21148 found 21153
parent transid verify failed on 459292672 wanted 21148 found 21153
Ignoring transid failure
bad block 459292672
Errors found in extent allocation tree or chunk allocation
parent transid verify failed on 459292672 wanted 21148 found 21153

Should I ignore those errors and run btrfsck --repair? Or is
--init-extent-tree needed?

-- 
Ivan Sizov


[PATCH] Btrfs-progs: ftw_add_entry_size: Round up file size to sectorsize

2015-12-15 Thread Chandan Rajendra
ftw_add_entry_size() assumes 4k as the block size of the underlying filesystem,
and hence the file sizes computed are incorrect for filesystems with a non-4k
sectorsize. Fix this by rounding up file sizes to the sectorsize.

Signed-off-by: Chandan Rajendra 
---
 mkfs.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mkfs.c b/mkfs.c
index c58ab2f..88c2289 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1031,16 +1031,15 @@ out:
  * This ignores symlinks with unreadable targets and subdirs that can't
  * be read.  It's a best-effort to give a rough estimate of the size of
  * a subdir.  It doesn't guarantee that prepopulating btrfs from this
- * tree won't still run out of space. 
- *
- * The rounding up to 4096 is questionable.  Previous code used du -B 4096.
+ * tree won't still run out of space.
  */
 static u64 global_total_size;
+static u64 fs_block_size;
 static int ftw_add_entry_size(const char *fpath, const struct stat *st,
  int type)
 {
if (type == FTW_F || type == FTW_D)
-   global_total_size += round_up(st->st_size, 4096);
+   global_total_size += round_up(st->st_size, fs_block_size);
 
return 0;
 }
@@ -1060,6 +1059,7 @@ static u64 size_sourcedir(char *dir_name, u64 sectorsize,
allocated_meta_size / default_chunk_size;
 
global_total_size = 0;
+   fs_block_size = sectorsize;
ret = ftw(dir_name, ftw_add_entry_size, 10);
dir_size = global_total_size;
if (ret < 0) {
-- 
2.1.0
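
For illustration, a minimal standalone sketch of the rounding change (not part of the patch; round_up_u64() stands in for the kernel's round_up() macro, and the sizes are invented):

/* Sketch: accumulating an entry's size rounded up to the fs sector
 * size. Assumes the alignment is a power of two, as btrfs requires. */
#include <stdint.h>
#include <stdio.h>

static uint64_t round_up_u64(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

int main(void)
{
	uint64_t st_size = 8;	/* an 8-byte file, as in stat's st_size */

	/* Old behaviour: hardcoded 4k, an undersized estimate on e.g.
	 * a 64k-sectorsize filesystem (ppc64 with 64k pages). */
	printf("4k  rounding: %llu\n",
	       (unsigned long long)round_up_u64(st_size, 4096));

	/* New behaviour: round up to the actual sectorsize. */
	printf("64k rounding: %llu\n",
	       (unsigned long long)round_up_u64(st_size, 65536));
	return 0;
}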



Re: [PATCH 0/3] btrfs-progs: fix file restore to lost+found bug

2015-12-15 Thread David Sterba
On Tue, Dec 08, 2015 at 11:06:28AM +0900, Naohiro Aota wrote:
> On Tue, Dec 8, 2015 at 12:35 AM, David Sterba  wrote:
> > On Mon, Dec 07, 2015 at 11:59:19AM +0900, Naohiro Aota wrote:
> >> > But I only see the first 2 patches in maillist...
> >> > The last test case seems missing?
> >>
> >> Maybe the last patch is too large to post to the list? Even if it gets
> >> smaller, 130260 bytes seems a bit large.
> >>
> >> How should I handle this? Put my repo somewhere and wait for a maintainer
> >> to pull it?
> >
> > Please send it to me directly. The image will be available in
> > btrfs-progs git and we don't necessarily need the copy in the
> > mailinglist.
> 
> Sure. I'll send it to you.

Applied, thanks.


[PATCH 0/4] btrfs: return all mirrors whether need_raid_map is set or not

2015-12-15 Thread Zhao Lei
__btrfs_map_block() should return all mirrors in the WRITE,
REQ_GET_READ_MIRRORS, and RECOVERY cases, whether or not need_raid_map
is set.

need_raid_map is only used to control whether bbio->raid_map is set.

The current code happens to work because there is only one caller that
can trigger the above bug, which is readahead, and that function happens
to bail out when fewer mirrors are returned.

But after we fix __btrfs_map_block(), readahead will really work, and
will exit with a warning at another bug.
This patchset fixes __btrfs_map_block() and disables raid56 readahead
temporarily (actually, it is already disabled by this bug); I'll fix
raid56 readahead next.

Zhao Lei (4):
  btrfs: Disable raid56 readahead
  btrfs: return all mirrors whether need_raid_map is set or not
  btrfs: Small cleanup for get index_srcdev loop
  btrfs: Use direct way to determine raid56 write/recover mode

 fs/btrfs/reada.c   |  5 +
 fs/btrfs/volumes.c | 50 --
 2 files changed, 29 insertions(+), 26 deletions(-)

-- 
1.8.5.1





[PATCH 3/4] btrfs: Small cleanup for get index_srcdev loop

2015-12-15 Thread Zhao Lei
1: Invert the condition in the loop to reduce indentation.
2: Combine the btrfs_put_bbio() calls into one place to make the logic cleaner.

Signed-off-by: Zhao Lei 
---
 fs/btrfs/volumes.c | 42 --
 1 file changed, 20 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4ee429b..367e8ec 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5368,35 +5368,33 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 * target drive.
 */
for (i = 0; i < tmp_num_stripes; i++) {
-   if (tmp_bbio->stripes[i].dev->devid == srcdev_devid) {
-   /*
-* In case of DUP, in order to keep it
-* simple, only add the mirror with the
-* lowest physical address
-*/
-   if (found &&
-   physical_of_found <=
-tmp_bbio->stripes[i].physical)
-   continue;
-   index_srcdev = i;
-   found = 1;
-   physical_of_found =
-   tmp_bbio->stripes[i].physical;
-   }
+   if (tmp_bbio->stripes[i].dev->devid != srcdev_devid)
+   continue;
+
+   /*
+* In case of DUP, in order to keep it simple, only add
+* the mirror with the lowest physical address
+*/
+   if (found &&
+   physical_of_found <= tmp_bbio->stripes[i].physical)
+   continue;
+
+   index_srcdev = i;
+   found = 1;
+   physical_of_found = tmp_bbio->stripes[i].physical;
}
 
-   if (found) {
-   mirror_num = index_srcdev + 1;
-   patch_the_first_stripe_for_dev_replace = 1;
-   physical_to_patch_in_first_stripe = physical_of_found;
-   } else {
+   btrfs_put_bbio(tmp_bbio);
+
+   if (!found) {
WARN_ON(1);
ret = -EIO;
-   btrfs_put_bbio(tmp_bbio);
goto out;
}
 
-   btrfs_put_bbio(tmp_bbio);
+   mirror_num = index_srcdev + 1;
+   patch_the_first_stripe_for_dev_replace = 1;
+   physical_to_patch_in_first_stripe = physical_of_found;
} else if (mirror_num > map->num_stripes) {
mirror_num = 0;
}
-- 
1.8.5.1





[PATCH 1/4] btrfs: Disable raid56 readahead

2015-12-15 Thread Zhao Lei
Raid56 readahead cannot work in the current code: reada_find_extent()
will show the warning for bbio->num_stripes > BTRFS_MAX_MIRRORS, because
raid56 has parity stripes, which increase bbio->num_stripes.

The reason we haven't seen the above warning is another bug in
__btrfs_map_block(), which makes raid56 readahead do nothing.

Before we fix the bug in __btrfs_map_block(), we need to disable raid56
readahead temporarily, to avoid the above warning.

Signed-off-by: Zhao Lei 
---
 fs/btrfs/reada.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 619f929..7bbd656 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -363,6 +363,11 @@ static struct reada_extent *reada_find_extent(struct btrfs_root *root,
if (ret || !bbio || length < blocksize)
goto error;
 
+   if (bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
+   /* Current code can not support RAID56 yet */
+   goto error;
+   }
+
if (bbio->num_stripes > BTRFS_MAX_MIRRORS) {
btrfs_err(root->fs_info,
   "readahead: more than %d copies not supported",
-- 
1.8.5.1





[PATCH 4/4] btrfs: Use direct way to determine raid56 write/recover mode

2015-12-15 Thread Zhao Lei
The old code used bbio->raid_map to determine whether we are in a raid56
write/recover operation, because bbio->map_type didn't exist at that
time and we had to use the above workaround.

Now we have a direct way to check this condition, so use it to get rid
of the function-relative data and make the code more readable.

Signed-off-by: Zhao Lei 
---
 fs/btrfs/volumes.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 367e8ec..d411444 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6056,7 +6056,8 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
bbio->fs_info = root->fs_info;
atomic_set(&bbio->stripes_pending, bbio->num_stripes);
 
-   if (bbio->raid_map) {
+   if ((bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) &&
+   ((rw & WRITE) || (mirror_num > 1))) {
/* In this case, map_length has been set to the length of
   a single stripe; not the whole write */
if (rw & WRITE) {
-- 
1.8.5.1





[PATCH 2/4] btrfs: return all mirrors whether need_raid_map is set or not

2015-12-15 Thread Zhao Lei
__btrfs_map_block() should return all mirrors in the WRITE,
REQ_GET_READ_MIRRORS, and RECOVERY cases, whether or not need_raid_map
is set.

need_raid_map is only used to control whether bbio->raid_map is set.

The current code happens to work because there is only one caller that
can trigger the above bug, which is readahead, and that function happens
to bail out when fewer mirrors are returned.

Signed-off-by: Zhao Lei 
---
 fs/btrfs/volumes.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a6df8fd..4ee429b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5464,9 +5464,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
}
 
} else if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
-   if (need_raid_map &&
-   ((rw & (REQ_WRITE | REQ_GET_READ_MIRRORS)) ||
-mirror_num > 1)) {
+   if ((rw & (REQ_WRITE | REQ_GET_READ_MIRRORS)) ||
+   mirror_num > 1) {
			/* push stripe_nr back to the start of the full stripe */
stripe_nr = div_u64(raid56_full_stripe_start,
stripe_len * nr_data_stripes(map));
-- 
1.8.5.1





Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay

2015-12-15 Thread Austin S. Hemmelgarn

On 2015-12-14 18:34, Christoph Anton Mitterer wrote:

On Mon, 2015-12-14 at 15:20 -0500, Austin S. Hemmelgarn wrote:

On 2015-12-14 14:44, Christoph Anton Mitterer wrote:

On Mon, 2015-12-14 at 14:33 -0500, Austin S. Hemmelgarn wrote:

The traditional reasoning was that read-only meant that users couldn't
change anything

Which is where I'd, however, count the atime changes too.
The atimes wouldn't change magically, but only because the user started
some program, configured some daemon, etc. ... which reads/writes/etc.
the file.

But reading the file is allowed, which is where this starts to get
ambiguous.

Why?
Because according to POSIX, when a file gets read, the atime gets 
updated.  Except that POSIX doesn't specify what happens if the 
filesystem is mounted read-only, but the underlying block device is 
writable.



Reading a file updates the atime (and in fact, this is the
way that most stuff that uses them cares about them), but even a ro
mount allows reading the file.

As I just wrote in the other post, at least for btrfs (haven't checked
ext/xfs due to being... well... lazy O:-) ) the ro mount option or an ro
snapshot seems to mean: no atime updates even if mounted with
strictatime (or maybe I just did something stupid when checking, so
better double-check)



The traditional meaning of ro on UNIX
was (AFAIUI) that directory structure couldn't change, new files
couldn't be created, existing files couldn't be deleted, flags on the
inodes couldn't be changed, and file data couldn't be changed.  TBH, I'm
not even certain that atime updates on ro filesystems were even an
intentional thing in the first place; it really sounds to me like the
type of thing that somebody forgot to put in a permissions check for,
and then people thought it was a feature.

Well in the end it probably doesn't matter how it came to existence,...
rather what it should be and what it actually is.
Knowing how you got where you are is pretty important for figuring out 
how to not end up there again :)

As said, I, personally, from the user PoV, would say soft-ro already
includes no dates on files being modifiable (including atime), as I'd
consider these a property of the file.
However anyone else may of course see that differently and at the same
time be smarter than I am.
AFAIK, the original versions of UNIX had no touch command or utime()
syscall, so ctime, mtime, and atime were these things that just got
magically updated by the system (ctime is still this way), and thus
weren't something that was considered user modification to the filesystem.



Also,



even with noatime, I'm pretty sure the VFS updates the atime every time
the mtime changes

I've just checked and no, it doesn't:
   File: ‘subvol/FILE’
   Size: 8          Blocks: 16         IO Block: 4096   regular file
Device: 30h/48d     Inode: 257         Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
Access: 2015-12-15 00:01:46.452007798 +0100
Modify: 2015-12-15 00:31:26.579511816 +0100
Change: 2015-12-15 00:31:26.579511816 +0100

(rw,noatime mounted,... mtime is more recent than atime)
Hmm, I could have sworn that updating the mtime on a file would force an 
atime update.  /me checks documentation.  OK, I was thinking of 
relatime, which updates the atime if it's older than the mtime or ctime.



  (because not doing so would be somewhat stupid, and
you're writing the inode anyway), which technically means that stuff
could work around this by opening the file, truncating it to the size it
already is, and then closing it.

Hmm I don't have a strong opinion here... it sounds "stupid" from the
technical point of view in that it *could* write the atime and that
wouldn't cost much.
OTOH, that would make things more ambiguous as to when atimes change and
when they don't... (they'd only change on writes, never on reads,...)
So I think it's good as it is... and it matches the name, which is
noatime - and not noatime-unless-on-writes ;-)
Except there are still ways to update the atime even on a filesystem 
mounted noatime.  For example, on _every_ POSIX compliant system out 
there (and Linux itself is mostly POSIX compliant, it's primarily the 
userspace that isn't), you can update the atime using the utime() system 
call, unless the filesystem is read-only.
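
To make that last point concrete, here is a minimal userspace sketch (illustrative only, not from the thread; the filename is made up) that sets a file's atime explicitly. It succeeds on a noatime mount but fails with EROFS on a read-only mount:

/* Sketch: explicitly setting the atime via utimensat(2). */
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	struct timespec times[2];

	times[0].tv_sec = 0;		/* new atime: the epoch, for demo */
	times[0].tv_nsec = 0;
	times[1].tv_sec = 0;
	times[1].tv_nsec = UTIME_OMIT;	/* leave the mtime untouched */

	if (utimensat(AT_FDCWD, "FILE", times, 0) != 0)
		perror("utimensat");	/* EROFS on a read-only mount */
	return 0;
}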



Re: attacking btrfs filesystems via UUID collisions?

2015-12-15 Thread Austin S. Hemmelgarn

On 2015-12-14 16:26, Chris Murphy wrote:

On Mon, Dec 14, 2015 at 6:23 AM, Austin S. Hemmelgarn
 wrote:


Agreed, if you can't substantiate _why_ it's bad practice, then you aren't
making a valid argument.  The fact that there is software that doesn't
handle it well would say to me based on established practice that that
software is what's broken, not common practice.


The automobile was invented and, due to the ensuing chaos, the common
practice of doing whatever the F you wanted came to an end in favor of
rules of the road and traffic lights. I'm sure some people went
ballistic, but for the most part things were much better without the
brokenness of prior common practice.
Except for one thing:  Automobiles actually provide a measurable 
significant benefit to society.  What specific benefit does embedding 
the filesystem UUID in the metadata actually provide?


So the fact we're going to have this problem with all file systems
that incorporate the volume UUID into the metadata stream, tells me
that the very rudimentary common practice of using dd needs to go
away, in general practice. I've already said data recovery (including
forensics) and sticking drives away on a shelf could be reasonable.


The assumption that a UUID is actually unique is an inherently flawed one,
because it depends both on the method of generation guaranteeing it's unique
(and none of the defined methods guarantee that), and a distinct absence of
malicious intent.


http://www.ietf.org/rfc/rfc4122.txt
"A UUID is 128 bits long, and can guarantee uniqueness across space and time."

Also see security considerations in section 6.

Both aspects ignore the facts that:
Version 1 is easy to cause a collision with (MAC addresses are by no 
means unique, and are easy to spoof, and so are timestamps).
Version 2 is relatively easy to cause a collision with, because UID and 
GID numbers are a fixed-size namespace.
Version 3 is slightly better, but still not by any means unique, because 
you just have to guess the seed string (or find a collision for it).
Version 4 is probably the hardest to get a collision with, but only if 
you are using a true RNG, and even then, 122 bits of entropy is not much 
protection.
Version 5 has the same issues as Version 3, but is more secure against 
hash collisions.


In general, you should only use UUID's when either:
a. You have absolutely 100% complete control of the storage of them, 
such that you can guarantee they don't get reused.

b. They can be guaranteed to be relatively unique for the system using them.




On that note, why exactly is it better to make the filesystem UUID such an
integral part of the filesystem?  The other thing I'm reading out of this
all, is that by writing a total of 64 bytes to a specific location in a
single disk in a multi-device BTRFS filesystem, you can make the whole
filesystem fall apart, which is absolutely absurd.



OK maybe I'm  missing something.

1. UUID is 128 bits. So where are you getting the additional 48 bytes from?
2. The volume UUID is in every superblock, which for all practical
purposes means at least two instances of that UUID per device.

Are you saying the file system falls apart when changing just one of
those volume UUIDs in one superblock? And how does it fall apart? I'd
say all volume UUID instances (each superblock, on every device)
should be checked and if any of them mismatch then fail to mount.
You're right, it would probably take writing all the SB's (although I'm 
not 100% certain that we actually check that the SB UUID's match).
The extra bytes, which I grossly miscalculated, are for the SB checksum, 
which would have to be updated to match the new SB.


There could be some leveraging of the device WWN, or absent that its
serial number, propagated into all of the volume's devices (cross
referencing each other's devid to WWN or serial). And then that way
there's a way to differentiate. In the dd case, there would be
mismatching real device WWN/serial number and the one written in
metadata on all drives, including the copy. This doesn't say what
policy should happen next, just that at least it's known there's a
mismatch.

That gets tricky too, because for example you have stuff like flat files 
used as filesystem images.


However, if we then use some separate UUID (possibly hashed off of the 
file location) in place of the device serial/WWN, that could 
theoretically provide some better protection.  The obvious solution in 
the case of a mismatch would be to refuse the mount until either the 
issue is fixed using the tools, or the user specifies some particular 
mount option to either fix it automatically, or ignore copies with a 
mismatching serial.




Re: attacking btrfs filesystems via UUID collisions?

2015-12-15 Thread Hugo Mills
On Tue, Dec 15, 2015 at 08:54:01AM -0500, Austin S. Hemmelgarn wrote:
> On 2015-12-14 16:26, Chris Murphy wrote:
> >On Mon, Dec 14, 2015 at 6:23 AM, Austin S. Hemmelgarn
> > wrote:
> >>
> >>Agreed, if you can't substantiate _why_ it's bad practice, then you aren't
> >>making a valid argument.  The fact that there is software that doesn't
> >>handle it well would say to me based on established practice that that
> >>software is what's broken, not common practice.
> >
> >The automobile is invented and due to the ensuing chaos, common
> >practice of doing whatever the F you wanted came to an end in favor of
> >rules of the road and traffic lights. I'm sure some people went
> >ballistic, but for the most part things were much better without the
> >brokenness or prior common practice.
> Except for one thing:  Automobiles actually provide a measurable
> significant benefit to society.  What specific benefit does
> embedding the filesystem UUID in the metadata actually provide?

   That one's easy to answer. It deals with a major issue that
reiserfs had: if you have a filesystem with another filesystem image
stored on it, reiserfsck could end up deciding that both the metadata
blocks of the main filesystem *and* the metadata blocks of the image
were part of the same FS (because they're on the same block device),
and so would splice both filesystems into one, generally complaining
loudly along the way that there was a lot of corruption present that
it was trying to fix.

   Putting the UUID of the FS into the metadata blocks means that the
kind of low-level check/repair attempt which scans for "stuff that
looks like metadata" can at least distinguish between the stuff that's
really metadata and the stuff that's just data that looks like
metadata.
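
   A rough sketch of the kind of filter being described (the header
layout here is hypothetical, purely for illustration, not the actual
btrfs on-disk format):

/* Sketch: a block-scanning repair tool keeps a candidate metadata
 * block only if the fsid embedded in its header matches the
 * filesystem being repaired; data that merely *looks* like metadata
 * (e.g. an fs image stored as a file) carries a different fsid. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FSID_SIZE 16

struct block_header {			/* hypothetical layout */
	uint8_t  csum[32];
	uint8_t  fsid[FSID_SIZE];	/* UUID of the owning filesystem */
	uint64_t bytenr;
};

static bool block_belongs_to_fs(const struct block_header *hdr,
				const uint8_t expected_fsid[FSID_SIZE])
{
	return memcmp(hdr->fsid, expected_fsid, FSID_SIZE) == 0;
}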

   Hugo.

> >So the fact we're going to have this problem with all file systems
> >that incorporate the volume UUID into the metadata stream, tells me
> >that the very rudimentary common practice of using dd needs to go
> >away, in general practice. I've already said data recovery (including
> >forensics) and sticking drives away on a shelf could be reasonable.
> >
> >>The assumption that a UUID is actually unique is an inherently flawed one,
> >>because it depends both on the method of generation guaranteeing it's unique
> >>(and none of the defined methods guarantee that), and a distinct absence of
> >>malicious intent.
> >
> >http://www.ietf.org/rfc/rfc4122.txt
> >"A UUID is 128 bits long, and can guarantee uniqueness across space and 
> >time."
> >
> >Also see security considerations in section 6.
> Both aspects ignore the facts that:
> Version 1 is easy to cause a collision with (MAC addresses are by no
> means unique, and are easy to spoof, and so are timestamps).
> Version 2 is relatively easy to cause a collision with, because UID
> and GID numbers are a fixed size namespace.
> Version 3 is slightly better, but still not by any means unique
> because you just have to guess the seed string (or a collision for
> it).
> Version 4 is probably the hardest to get a collision with, but only
> if you are using a true RNG, and even then, 122 bits of entropy is
> not much protection.
> Version 5 has the same issues as Version 3, but is more secure
> against hash collisions.
> 
> In general, you should only use UUID's when either:
> a. You have absolutely 100% complete control of the storage of them,
> such that you can guarantee they don't get reused.
> b. They can be guaranteed to be relatively unique for the system using them.
> >
> >
> >>On that note, why exactly is it better to make the filesystem UUID such an
> >>integral part of the filesystem?  The other thing I'm reading out of this
> >>all, is that by writing a total of 64 bytes to a specific location in a
> >>single disk in a multi-device BTRFS filesystem, you can make the whole
> >>filesystem fall apart, which is absolutely absurd.
> >
> >
> >OK maybe I'm  missing something.
> >
> >1. UUID is 128 bits. So where are you getting the additional 48 bytes from?
> >2. The volume UUID is in every superblock, which for all practical
> >purposes means at least two instances of that UUID per device.
> >
> >Are you saying the file system falls apart when changing just one of
> >those volume UUIDs in one superblock? And how does it fall apart? I'd
> >say all volume UUID instances (each superblock, on every device)
> >should be checked and if any of them mismatch then fail to mount.
> You're right, it would probably take writing all the SB's (although
> I'm not 100% certain that we actually check that the SB UUID's
> match).
> The extra bytes, which I grossly miscalculated, are for the SB
> checksum, which would have to be updated to match the new SB.
> >
> >There could be some leveraging of the device WWN, or absent that its
> >serial number, propogated into all of the volume's devices (cross
> >referencing each other's devid to WWN or serial). And then that way
> >there's a way to differentiate. In the dd case, there woul

Re: attacking btrfs filesystems via UUID collisions?

2015-12-15 Thread Austin S. Hemmelgarn

On 2015-12-14 19:08, Christoph Anton Mitterer wrote:

On Mon, 2015-12-14 at 08:23 -0500, Austin S. Hemmelgarn wrote:

The reason that this isn't quite as high of a concern is because
performing this attack requires either root access, or direct
physical
access to the hardware, and in either case, your system is already
compromised.

No necessarily.
Apart from the ATM image (where most people wouldn't call it
compromised, just because it's openly accessible on the street)
Um, no, you don't have direct physical access to the hardware with an 
ATM, at least, not unless you are going to take apart the cover and 
anything else in your way (and probably set off internal alarms).  And 
even without that, it's still possible to DoS an ATM without much 
effort.  Most of them have a 3.5mm headphone jack for TTS for people 
with poor vision, and that's more than enough to overload at least part 
of the system with a relatively simple to put together bit of 
electronics that would cost you less than 10 USD.

imagine you're running a VM hosting service, where you allow users to
upload images and have them deployed.
In the "cheap" case these will end up as regular files, where they
couldn't do any harm (even with colliding UUIDs)... but even there one
would have to expect that the hypervisor admin may losetup them for
whatever reason.
But if you offer more professional services, you may give your clients
e.g. direct access to some storage backend, which is then probably
also seen on the host by its kernel.
And here we already have the case that a client could remotely trigger
such a collision.
In that particular situation, it's not relevant unless the host admin 
goes to mount them.  UUID collisions are only an issue if the 
filesystems get mounted.


And remember, things only sound far-fetched until they actually happen
the first time ;)



I still think that that isn't a sufficient excuse for not fixing the
issue, as there are a number of non-security related issues that can
result from this (there are some things that are common practice with
LVM or mdraid that can't be done with BTRFS because of this).

Sure, I guess we agree on that,...



Apart from that, btrfs should be a general purpose fs, and not just
a
desktop or server fs.
So edge cases like forensics (where it's common that you create bitwise
identical images) shouldn't be forgotten either.

While I would normally agree, there are ways to work around this in
the
forensics case that don't work for any other case (namely, if BTRFS
is
built as a module, you can unmount everything, unload the module,
reload
it, and only scan the devices you want).

see below (*)



On that note, why exactly is it better to make the filesystem UUID
such
an integral part of the filesystem?

Well I think it's a proper way to e.g. handle the multi-device case.
You have n devices, you want to differentiate them,... using a pseudo-random
UUID is surely better than giving them numbers.
That's debatable, the same issues are obviously present in both cases 
(individual numbers can collide too).

Same for the fs UUID, e.g. when used for mounting devices whose paths
aren't stable.
In the case of a sanely designed system using LVM for example, device 
paths are stable.


As said before, using the UUID isn't the problem - not protecting
against collisions is.

No, the issues are:
1. We assume that the UUID will be unique for the life of the 
filesystem, which is not a safe assumption.

2. We don't sanely handle things if it isn't unique.




The other thing I'm reading out of
this all, is that by writing a total of 64 bytes to a specific
location
in a single disk in a multi-device BTRFS filesystem, you can make the
whole filesystem fall apart, which is absolutely absurd.

Well,... I don't think that writing *into* the filesystem is covered by
common practice anymore.
For end users, I agree.  Part of the discussion involves attacks on the 
system, and for an attacker it's not a far stretch to write directly to 
the block device if possible (and it's even common practice for 
bypassing permission checks done in the VFS layer).


In UNIX, a device (which holds the filesystem) is a file. Therefore one
can argue: if one copies/duplicates one file (i.e. the fs), neither of
the two's contents should get corrupted.
But if you actively write *into* the file by yourself,... then you're
simply on your own; either you know what you do, or you may just
corrupt *that* specific file. Of course it should again not lead to any
of its clones becoming corrupted as well.
My point is that by changing the UUID in a superblock (and properly 
updating the checksum for the superblock), you can trivially break a 
multi-device filesystem.  And it's a whole lot easier to do that than it 
is to do the equivalent for LVM.




And some recovery situations (think along the lines of no recovery
disk,
and you only have busybox or something similar to work with).

(*) which is however also, why you may not be able to unmount the
de

Re: attacking btrfs filesystems via UUID collisions?

2015-12-15 Thread Austin S. Hemmelgarn

On 2015-12-15 09:18, Hugo Mills wrote:

On Tue, Dec 15, 2015 at 08:54:01AM -0500, Austin S. Hemmelgarn wrote:

On 2015-12-14 16:26, Chris Murphy wrote:

On Mon, Dec 14, 2015 at 6:23 AM, Austin S. Hemmelgarn
 wrote:


Agreed, if you can't substantiate _why_ it's bad practice, then you aren't
making a valid argument.  The fact that there is software that doesn't
handle it well would say to me based on established practice that that
software is what's broken, not common practice.


The automobile is invented and due to the ensuing chaos, common
practice of doing whatever the F you wanted came to an end in favor of
rules of the road and traffic lights. I'm sure some people went
ballistic, but for the most part things were much better without the
brokenness or prior common practice.

Except for one thing:  Automobiles actually provide a measurable
significant benefit to society.  What specific benefit does
embedding the filesystem UUID in the metadata actually provide?


That one's easy to answer. It deals with a major issue that
reiserfs had: if you have a filesystem with another filesystem image
stored on it, reiserfsck could end up deciding that both the metadata
blocks of the main filesystem *and* the metadata blocks of the image
were part of the same FS (because they're on the same block device),
and so would splice both filesystems into one, generally complaining
loudly along the way that there was a lot of corruption present that
it was trying to fix.
IIRC, that was because of the way the SB was designed, and is why other 
filesystems have a UUID in the superblock.


I probably should have been clearer with my statement, what I meant was:
What specific benefit does using the UUID for multi-device filesystems 
to identify the various devices provide?




Re: attacking btrfs filesystems via UUID collisions?

2015-12-15 Thread Hugo Mills
On Tue, Dec 15, 2015 at 09:27:12AM -0500, Austin S. Hemmelgarn wrote:
> On 2015-12-15 09:18, Hugo Mills wrote:
> >On Tue, Dec 15, 2015 at 08:54:01AM -0500, Austin S. Hemmelgarn wrote:
> >>On 2015-12-14 16:26, Chris Murphy wrote:
> >>>On Mon, Dec 14, 2015 at 6:23 AM, Austin S. Hemmelgarn
> >>> wrote:
> 
> Agreed, if you can't substantiate _why_ it's bad practice, then you 
> aren't
> making a valid argument.  The fact that there is software that doesn't
> handle it well would say to me based on established practice that that
> software is what's broken, not common practice.
> >>>
> >>>The automobile is invented and due to the ensuing chaos, common
> >>>practice of doing whatever the F you wanted came to an end in favor of
> >>>rules of the road and traffic lights. I'm sure some people went
> >>>ballistic, but for the most part things were much better without the
> >>>brokenness or prior common practice.
> >>Except for one thing:  Automobiles actually provide a measurable
> >>significant benefit to society.  What specific benefit does
> >>embedding the filesystem UUID in the metadata actually provide?
> >
> >That one's easy to answer. It deals with a major issue that
> >reiserfs had: if you have a filesystem with another filesystem image
> >stored on it, reiserfsck could end up deciding that both the metadata
> >blocks of the main filesystem *and* the metadata blocks of the image
> >were part of the same FS (because they're on the same block device),
> >and so would splice both filesystems into one, generally complaining
> >loudly along the way that there was a lot of corruption present that
> >it was trying to fix.
> IIRC, that was because of the way the SB was designed, and is why
> other filesystems have a UUID in the superblock.
> 
> I probably should have been clearer with my statement, what I meant was:
> What specific benefit does using the UUID for multi-device
> filesystems to identify the various devices provide?

   Well, given a bunch of block devices, how do you identify which
ones to use for each of the (unknown number of) filesystems in the
system?

   You can either use some kind of config file, which is going to get
out of date as device enumeration orders change or as devices are
added/deleted from the FS, or you can try to identify the devices that
belong together automatically in some way. btrfs uses the latter
option (with the former option kind of supported using the device=
mount option). The use of a UUID isn't fundamental to the latter
process, but anything that you replaced the UUID with would have the
same issues that we're seeing here -- make a duplicate of the device
at the block level, and you get additional devices that look like they
should be part of the FS.

   The question is not how you avoid duplicating the UUIDs, but how
you identify that there are duplicates present, and how you deal with
that issue once you've detected them. This is complicated by the fact
that it's perfectly legitimate to have two block devices in the system
that identify themselves as the same device for the same filesystem --
this happens when they're different views of the same underlying
storage through multipathing.

   I would suggest trying to migrate to a state where detecting more
than one device with the same UUID and devid is cause to prevent the
FS from mounting, unless there's also a "mount_duplicates_yes_i_
know_this_is_dangerous_and_i_know_what_im_doing" mount flag present,
for the multipathing people. That will break existing userspace
behaviour for the multipathing case, but the migration can probably be
managed. (e.g. NFS has successfully changed default behaviour for one
of its mount options in the last few(?) years).
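
   As a rough illustration of that policy (hypothetical structures, not
actual btrfs code), the assembly-time check might look like:

/* Sketch: refuse assembly when two scanned devices claim the same
 * (fsid, devid) pair, unless an explicit override is given (the
 * multipath case, where duplicates are legitimate). */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct scanned_dev {			/* hypothetical, for illustration */
	uint8_t  fsid[16];		/* filesystem UUID from the superblock */
	uint64_t devid;			/* the device's id within the fs */
};

static bool ok_to_assemble(const struct scanned_dev *devs, int n,
			   bool allow_duplicates)
{
	for (int i = 0; i < n; i++)
		for (int j = i + 1; j < n; j++)
			if (devs[i].devid == devs[j].devid &&
			    memcmp(devs[i].fsid, devs[j].fsid, 16) == 0)
				return allow_duplicates;
	return true;
}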

   Hugo.

-- 
Hugo Mills | I think that everything darkling says is actually a
hugo@... carfax.org.uk | joke. It's just that we haven't worked out most of
http://carfax.org.uk/  | them yet.
PGP: E2AB1DE4  |Vashka




Re: dear developers, can we have notdatacow + checksumming, plz?

2015-12-15 Thread Austin S. Hemmelgarn

On 2015-12-14 22:15, Christoph Anton Mitterer wrote:

On Mon, 2015-12-14 at 09:16 -0500, Austin S. Hemmelgarn wrote:

When one starts to get a bit deeper into btrfs (from the admin/end-
user
side) one sooner or later stumbles across the recommendation/need
to
use nodatacow for certain types of data (DBs, VM images, etc.) and
the
reason, AFAIU, being the inherent fragmentation that comes along
with
the CoW, which is especially noticeable for those types of files
with
lots of random internal writes.

It is worth pointing out that in the case of DBs at least, this is
because at least some of them do COW internally to provide the
transactional semantics that are required for many workloads.

Guess that also applies to some VM images then, IIRC qcow2 does CoW.

Yep, and I think that VMWare's image format does too.





a) for performance reasons (when I consider our research software
which
often has IO as the limiting factor and where we want as much IO
being
used by actual programs as possible)...

There are other things that can be done to improve this.  I would
assume
of course that you're already doing some of them (stuff like using
dedicated storage controller cards instead of the stuff on the
motherboard), but some things often get overlooked, like actually
taking
the time to fine-tune the I/O scheduler for the workload (Linux has
particularly brain-dead default settings for CFQ, and the deadline
I/O
scheduler is only good in hard-real-time usage or on small hard
drives
that actually use spinning disks).

Well sure, I think we'd done most of this and have dedicated
controllers, at least of a quality that funding allows us ;-)
But regardless of how much one tunes, and how good the hardware is: if
you'd then always lose a fraction of your overall IO, be it just
5%, to defragging these types of files, one may actually want to avoid
this at all, for which nodatacow seems *the* solution.
nodatacow only works for that if the file is pre-allocated; if it isn't, 
then it still ends up fragmented.




The big argument for defragmenting a SSD is that it makes it such
that
you require fewer I/O requests to the device to read a file

I've had read about that too, but since I haven't had much personal
experience or measurements in that respect, I didn't list it :)
I can't give any real numbers, but I've seen noticeable performance 
improvements on good SSD's (Intel, Samsung, and Crucial) when making 
sure that things are defragmented.



The problem is not entirely the lack of COW semantics, it's also the
fact that it's impossible to implement an atomic write on a hard
disk.

Sure... but that's just the same for the nodatacow writes of data.
(And the same, AFAIU, for CoW itself, just that we'd notice any
corruption in case of a crash due to the CoWed nature of the fs and
could go back to the last generation).
Yes, but it's also the reason that using either COW or a log-structured 
filesystem (like NILFS2, LogFS, or I think F2FS) is important for 
consistency.




but I wouldn't know that relational DBs really do checksumming of the
data.

All the ones I know of except GDBM and BerkDB do in fact provide the
option of checksumming.  It's pretty much mandatory if you want to be
considered for usage in financial, military, or medical applications.

Hmm I see... PostgreSQL seems to have it since 9.3 ... didn't know
that... only crc16, but at least something.



Long story short, it does happen every now and then, that a scrub
shows
file errors, for neither the RAID was broken, nor there were any
block
errors reported by the disks, or anything suspicious in SMART.
In other words, silent block corruption.

Or a transient error in system RAM that ECC didn't catch, or an
undetected error in the physical link layer to the disks, or an error in
the disk cache or controller, or any number of other things.

Well sure,... I was referring to these particular cases, where silent
block corruption was the most likely reason.
The data was reproducibly read identical, which probably rules out bad
RAM or controller, etc.



   BTRFS
could only protect against some cases, not all (for example, if you
have
a big enough error in RAM that ECC doesn't catch it, you've got
serious
issues that just about nothing short of a cold reboot can save you
from).

Sure, I haven't claimed, that checksumming for no-CoWed data is a
solution for everything.



But, AFAIU, not doing CoW, while not having a journal (or does it
have
one for these cases???) almost certainly means that the data (not
necessarily the fs) will be inconsistent in case of a crash during
a
no-CoWed write anyway, right?
Wouldn't it be basically like ext2?

Kind of, but not quite.  Even with nodatacow, metadata is still COW,
which is functionally as safe as a traditional journaling filesystem
like XFS or ext4.

Sure, I was referring to the data part only, should have made that more
clear.



Absolute worst case scenario for both nodatacow on
BTRFS, and a traditional journaling filesystem,

Re: attacking btrfs filesystems via UUID collisions?

2015-12-15 Thread Austin S. Hemmelgarn

On 2015-12-15 09:42, Hugo Mills wrote:

On Tue, Dec 15, 2015 at 09:27:12AM -0500, Austin S. Hemmelgarn wrote:

On 2015-12-15 09:18, Hugo Mills wrote:

On Tue, Dec 15, 2015 at 08:54:01AM -0500, Austin S. Hemmelgarn wrote:

On 2015-12-14 16:26, Chris Murphy wrote:

On Mon, Dec 14, 2015 at 6:23 AM, Austin S. Hemmelgarn
 wrote:


Agreed, if you can't substantiate _why_ it's bad practice, then you aren't
making a valid argument.  The fact that there is software that doesn't
handle it well would say to me based on established practice that that
software is what's broken, not common practice.


The automobile is invented and due to the ensuing chaos, common
practice of doing whatever the F you wanted came to an end in favor of
rules of the road and traffic lights. I'm sure some people went
ballistic, but for the most part things were much better without the
brokenness or prior common practice.

Except for one thing:  Automobiles actually provide a measurable
significant benefit to society.  What specific benefit does
embedding the filesystem UUID in the metadata actually provide?


That one's easy to answer. It deals with a major issue that
reiserfs had: if you have a filesystem with another filesystem image
stored on it, reiserfsck could end up deciding that both the metadata
blocks of the main filesystem *and* the metadata blocks of the image
were part of the same FS (because they're on the same block device),
and so would splice both filesystems into one, generally complaining
loudly along the way that there was a lot of corruption present that
it was trying to fix.

IIRC, that was because of the way the SB was designed, and is why
other filesystems have a UUID in the superblock.

I probably should have been clearer with my statement, what I meant was:
What specific benefit does using the UUID for multi-device
filesystems to identify the various devices provide?


Well, given a bunch of block devices, how do you identify which
ones to use for each of the (unknown number of) filesystems in the
system?

You can either use some kind of config file, which is going to get
out of date as device enumeration orders change or as devices are
added/deleted from the FS, or you can try to identify the devices that
belong together automatically in some way. btrfs uses the latter
option (with the former option kind of supported using the device=
mount option). The use of a UUID isn't fundamental to the latter
process, but anything that you replaced the UUID with would have the
same issues that we're seeing here -- make a duplicate of the device
at the block level, and you get additional devices that look like they
should be part of the FS.

The question is not how you avoid duplicating the UUIDs, but how
you identify that there are duplicates present, and how you deal with
that issue once you've detected them. This is complicated by the fact
that it's perfectly legitimate to have two block devices in the system
that identify themselves as the same device for the same filesystem --
this happens when they're different views of the same underlying
storage through multipathing.

I would suggest trying to migrate to a state where detecting more
than one device with the same UUID and devid is cause to prevent the
FS from mounting, unless there's also a "mount_duplicates_yes_i_
know_this_is_dangerous_and_i_know_what_im_doing" mount flag present,
for the multipathing people. That will break existing userspace
behaviour for the multipathing case, but the migration can probably be
managed. (e.g. NFS has successfully changed default behaviour for one
of its mount options in the last few(?) years).
May I propose the alternative option of adding a flag to tell mount to 
_only_ use the devices specified in the options?  That would allow 
people to work around the common issues (multipath, dm-cache, etc.), and 
would give people who have stable device enumeration a way to mitigate 
the possibility of an attack.




Btrfs: check for empty bitmap list in setup_cluster_bitmaps

2015-12-15 Thread Chris Mason
Dave Jones found a warning from kasan in setup_cluster_bitmaps()

==
BUG: KASAN: stack-out-of-bounds in setup_cluster_bitmap+0xc4/0x5a0 at
addr 88039bef6828
Read of size 8 by task nfsd/1009
page:ea000e6fbd80 count:0 mapcount:0 mapping:  (null)
index:0x0
flags: 0x8000()
page dumped because: kasan: bad access detected
CPU: 1 PID: 1009 Comm: nfsd Tainted: GW
4.4.0-rc3-backup-debug+ #1
880065647b50 6bb712c2 88039bef6640 a680a43e
004559c0 88039bef66c8 a62638d1 a61121c0
8803a5769de8 0296 8803a5769df0 00046280
Call Trace:
[] dump_stack+0x4b/0x6d
[] kasan_report_error+0x501/0x520
[] ? debug_show_all_locks+0x1e0/0x1e0
[] kasan_report+0x58/0x60
[] ? rb_last+0x10/0x40
[] ? setup_cluster_bitmap+0xc4/0x5a0
[] __asan_load8+0x5d/0x70
[] setup_cluster_bitmap+0xc4/0x5a0
[] ? setup_cluster_no_bitmap+0x6a/0x400
[] btrfs_find_space_cluster+0x4b6/0x640
[] ? btrfs_alloc_from_cluster+0x4e0/0x4e0
[] ? btrfs_return_cluster_to_free_space+0x9e/0xb0
[] ? _raw_spin_unlock+0x27/0x40
[] find_free_extent+0xba1/0x1520

Andrey noticed this was because we were doing list_first_entry on a list
that might be empty.  Rework the tests a bit so we don't do that.

Signed-off-by: Chris Mason 
Reported-by: Andrey Ryabinin 
Reported-by:  Dave Jones 

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 0948d34..e6fc7d9 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2972,7 +2972,7 @@ setup_cluster_bitmap(struct btrfs_block_group_cache *block_group,
 u64 cont1_bytes, u64 min_bytes)
 {
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
-   struct btrfs_free_space *entry;
+   struct btrfs_free_space *entry = NULL;
int ret = -ENOSPC;
u64 bitmap_offset = offset_to_bitmap(ctl, offset);
 
@@ -2983,8 +2983,10 @@ setup_cluster_bitmap(struct btrfs_block_group_cache *block_group,
 * The bitmap that covers offset won't be in the list unless offset
 * is just its start offset.
 */
-   entry = list_first_entry(bitmaps, struct btrfs_free_space, list);
-   if (entry->offset != bitmap_offset) {
+   if (!list_empty(bitmaps))
+   entry = list_first_entry(bitmaps, struct btrfs_free_space,
+    list);
+
+   if (!entry || entry->offset != bitmap_offset) {
entry = tree_search_offset(ctl, bitmap_offset, 1, 0);
if (entry && list_empty(&entry->list))
list_add(&entry->list, bitmaps);
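
For reference, a minimal standalone illustration (not kernel code; the structures are simplified stand-ins) of why list_first_entry() on an empty list is dangerous: an empty head points to itself, so the "first entry" resolves to the container of the head itself, i.e. adjacent stack memory, which is exactly the out-of-bounds stack read KASAN flagged:

#include <stddef.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

struct item {
	long payload;
	struct list_head list;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

int main(void)
{
	struct list_head head = { &head, &head };	/* empty list */

	/* list_first_entry(&head, struct item, list) expands to: */
	struct item *first = container_of(head.next, struct item, list);

	/* "first" now points into the stack near "head", not at a real
	 * item; reading first->payload would be out of bounds. */
	printf("bogus entry at %p\n", (void *)first);
	return 0;
}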


Re: Btrfs: check for empty bitmap list in setup_cluster_bitmaps

2015-12-15 Thread Josef Bacik

On 12/15/2015 12:08 PM, Chris Mason wrote:

Dave Jones found a warning from kasan in setup_cluster_bitmaps()

==
BUG: KASAN: stack-out-of-bounds in setup_cluster_bitmap+0xc4/0x5a0 at
addr 88039bef6828
Read of size 8 by task nfsd/1009
page:ea000e6fbd80 count:0 mapcount:0 mapping:  (null)
index:0x0
flags: 0x8000()
page dumped because: kasan: bad access detected
CPU: 1 PID: 1009 Comm: nfsd Tainted: GW
4.4.0-rc3-backup-debug+ #1
880065647b50 6bb712c2 88039bef6640 a680a43e
004559c0 88039bef66c8 a62638d1 a61121c0
8803a5769de8 0296 8803a5769df0 00046280
Call Trace:
[] dump_stack+0x4b/0x6d
[] kasan_report_error+0x501/0x520
[] ? debug_show_all_locks+0x1e0/0x1e0
[] kasan_report+0x58/0x60
[] ? rb_last+0x10/0x40
[] ? setup_cluster_bitmap+0xc4/0x5a0
[] __asan_load8+0x5d/0x70
[] setup_cluster_bitmap+0xc4/0x5a0
[] ? setup_cluster_no_bitmap+0x6a/0x400
[] btrfs_find_space_cluster+0x4b6/0x640
[] ? btrfs_alloc_from_cluster+0x4e0/0x4e0
[] ? btrfs_return_cluster_to_free_space+0x9e/0xb0
[] ? _raw_spin_unlock+0x27/0x40
[] find_free_extent+0xba1/0x1520

Andrey noticed this was because we were doing list_first_entry on a list
that might be empty.  Rework the tests a bit so we don't do that.

Signed-off-by: Chris Mason 
Reprorted-by: Andrey Ryabinin 
Reported-by:  Dave Jones 

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 0948d34..e6fc7d9 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2972,7 +2972,7 @@ setup_cluster_bitmap(struct btrfs_block_group_cache *block_group,
 u64 cont1_bytes, u64 min_bytes)
  {
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
-   struct btrfs_free_space *entry;
+   struct btrfs_free_space *entry = NULL;
int ret = -ENOSPC;
u64 bitmap_offset = offset_to_bitmap(ctl, offset);

@@ -2983,8 +2983,10 @@ setup_cluster_bitmap(struct btrfs_block_group_cache *block_group,
 * The bitmap that covers offset won't be in the list unless offset
 * is just its start offset.
 */


Just above this we have an if (ctl->total_bitmaps == 0) return NULL; 
check that should make this useless, which means we're screwing up our 
ctl->total_bitmaps counter somehow.  We should probably figure out why 
that is happening.  Thanks,


Josef


Re: !PageLocked BUG_ON hit in clear_page_dirty_for_io

2015-12-15 Thread Filipe Manana
On Tue, Dec 15, 2015 at 12:03 AM, Chris Mason  wrote:
> On Tue, Dec 08, 2015 at 11:25:28PM -0500, Dave Jones wrote:
>> Not sure if I've already reported this one, but I've been seeing this
>> a lot this last couple days.
>>
>> kernel BUG at mm/page-writeback.c:2654!
>> invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
>
> We ended up discussing this in more detail on lkml, but I'll summarize
> here.
>
> There were two problems.  First lock_page() might not actually lock the
> page in v4.4-rc4, it can bail out if a signal is pending.  This got
> fixed just before v4.4-rc5, so if you were on rc4, upgrade asap.
>
> Second, prepare_pages had a bug for single page writes:
>
> From f0be89af049857bcc537a53fe2a2fae080e7a5bd Mon Sep 17 00:00:00 2001
> From: Chris Mason 
> Date: Mon, 14 Dec 2015 15:40:44 -0800
> Subject: [PATCH] Btrfs: check prepare_uptodate_page() error code earlier
>
> prepare_pages() may end up calling prepare_uptodate_page() twice if our
> write only spans a single page.  But if the first call returns an error,
> our page will be unlocked and it's not safe to call it again.
>
> This bug goes all the way back to 2011, and it's not something commonly
> hit.
>
> While we're here, add a more explicit check for the page being truncated
> away.  The bare lock_page() alone is protected only by good thoughts and
> i_mutex, which we're sure to regret eventually.
>
> Reported-by: Dave Jones 
> Signed-off-by: Chris Mason 

Reviewed-by: Filipe Manana 

> ---
>  fs/btrfs/file.c | 18 ++
>  1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 72e7346..0f09526 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1291,7 +1291,8 @@ out:
>   * on error we return an unlocked page and the error value
>   * on success we return a locked page and 0
>   */
> -static int prepare_uptodate_page(struct page *page, u64 pos,
> +static int prepare_uptodate_page(struct inode *inode,
> +struct page *page, u64 pos,
>  bool force_uptodate)
>  {
> int ret = 0;
> @@ -1306,6 +1307,10 @@ static int prepare_uptodate_page(struct page *page, u64 pos,
> unlock_page(page);
> return -EIO;
> }
> +   if (page->mapping != inode->i_mapping) {
> +   unlock_page(page);
> +   return -EAGAIN;
> +   }
> }
> return 0;
>  }
> @@ -1324,6 +1329,7 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
> int faili;
>
> for (i = 0; i < num_pages; i++) {
> +again:
> pages[i] = find_or_create_page(inode->i_mapping, index + i,
>mask | __GFP_WRITE);
> if (!pages[i]) {
> @@ -1333,13 +1339,17 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
> }
>
> if (i == 0)
> -   err = prepare_uptodate_page(pages[i], pos,
> +   err = prepare_uptodate_page(inode, pages[i], pos,
> force_uptodate);
> -   if (i == num_pages - 1)
> -   err = prepare_uptodate_page(pages[i],
> +   if (!err && i == num_pages - 1)
> +   err = prepare_uptodate_page(inode, pages[i],
> pos + write_bytes, false);
> if (err) {
> page_cache_release(pages[i]);
> +   if (err == -EAGAIN) {
> +   err = 0;
> +   goto again;
> +   }
> faili = i - 1;
> goto fail;
> }
> --
> 2.4.6
>


Re: Btrfs: check for empty bitmap list in setup_cluster_bitmaps

2015-12-15 Thread Chris Mason
On Tue, Dec 15, 2015 at 01:37:01PM -0500, Josef Bacik wrote:
> On 12/15/2015 12:08 PM, Chris Mason wrote:
> >Dave Jones found a warning from kasan in setup_cluster_bitmaps()
> >
> >==
> >BUG: KASAN: stack-out-of-bounds in setup_cluster_bitmap+0xc4/0x5a0 at
> >addr 88039bef6828
> >Read of size 8 by task nfsd/1009
> >page:ea000e6fbd80 count:0 mapcount:0 mapping:  (null)
> >index:0x0
> >flags: 0x8000()
> >page dumped because: kasan: bad access detected
> >CPU: 1 PID: 1009 Comm: nfsd Tainted: GW
> >4.4.0-rc3-backup-debug+ #1
> >880065647b50 6bb712c2 88039bef6640 a680a43e
> >004559c0 88039bef66c8 a62638d1 a61121c0
> >8803a5769de8 0296 8803a5769df0 00046280
> >Call Trace:
> >[] dump_stack+0x4b/0x6d
> >[] kasan_report_error+0x501/0x520
> >[] ? debug_show_all_locks+0x1e0/0x1e0
> >[] kasan_report+0x58/0x60
> >[] ? rb_last+0x10/0x40
> >[] ? setup_cluster_bitmap+0xc4/0x5a0
> >[] __asan_load8+0x5d/0x70
> >[] setup_cluster_bitmap+0xc4/0x5a0
> >[] ? setup_cluster_no_bitmap+0x6a/0x400
> >[] btrfs_find_space_cluster+0x4b6/0x640
> >[] ? btrfs_alloc_from_cluster+0x4e0/0x4e0
> >[] ? btrfs_return_cluster_to_free_space+0x9e/0xb0
> >[] ? _raw_spin_unlock+0x27/0x40
> >[] find_free_extent+0xba1/0x1520
> >
> >Andrey noticed this was because we were doing list_first_entry on a list
> >that might be empty.  Rework the tests a bit so we don't do that.
> >
> >Signed-off-by: Chris Mason 
> >Reported-by: Andrey Ryabinin 
> >Reported-by: Dave Jones 
> >
> >diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> >index 0948d34..e6fc7d9 100644
> >--- a/fs/btrfs/free-space-cache.c
> >+++ b/fs/btrfs/free-space-cache.c
> >@@ -2972,7 +2972,7 @@ setup_cluster_bitmap(struct btrfs_block_group_cache *block_group,
> >  u64 cont1_bytes, u64 min_bytes)
> >  {
> > struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
> >-struct btrfs_free_space *entry;
> >+struct btrfs_free_space *entry = NULL;
> > int ret = -ENOSPC;
> > u64 bitmap_offset = offset_to_bitmap(ctl, offset);
> >
> >@@ -2983,8 +2983,10 @@ setup_cluster_bitmap(struct btrfs_block_group_cache *block_group,
> >  * The bitmap that covers offset won't be in the list unless offset
> >  * is just its start offset.
> >  */
> 
> Just above this we have an if (ctl->total_bitmaps == 0) return NULL; check
> that should make this useless, which means we're screwing up our
> ctl->total_bitmaps counter somehow.  We should probably figure out why that
> is happening.  Thanks,

My best explanation is that btrfs_bitmap_cluster() takes the bitmap out
of the rbtree without dropping ctl->total_bitmaps.  So,
setup_cluster_no_bitmap() can't find it.  This should require mixed
allocation modes to trigger.

Another path is that during btrfs_write_out_cache() we'll pull entries
out.  My relatively new code allows that to happen before commit now, so
it might happen then.

-chris



Re: Still not production ready

2015-12-15 Thread Chris Mason
On Mon, Dec 14, 2015 at 10:08:16AM +0800, Qu Wenruo wrote:
> 
> 
> Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
> >Hi!
> >
> >For me it is still not production ready.
> 
> Yes, this is the *FACT* and not everyone has a good reason to deny it.
> 
> >Again I ran into:
> >
> >btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random
> >write into big file
> >https://bugzilla.kernel.org/show_bug.cgi?id=90401
> 
> Not sure about the guidelines for other filesystems, but it will attract
> more devs' attention if it is posted to the mailing list.
> 
> >
> >
> >No matter whether SLES 12 uses it as default for root, no matter whether
> >Fujitsu and Facebook use it: I will not let this onto any customer machine
> >without lots and lots of underprovisioning and rigorous free space 
> >monitoring.
> >Actually I will renew my recommendations in my trainings to be careful with
> >BTRFS.
> >
> > From my experience the monitoring would check for:
> >
> >merkaba:~> btrfs fi show /home
> >Label: 'home'  uuid: […]
> > Total devices 2 FS bytes used 156.31GiB
> > devid1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
> > devid2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home
> >
> >If "used" is same as "size" then make big fat alarm. It is not sufficient for
> >it to happen. It can run for quite some time just fine without any issues, 
> >but
> >I never have seen a kworker thread using 100% of one core for extended period
> >of time blocking everything else on the fs without this condition being met.
> >
> 
> And special advice on device size from myself:
> Don't use devices over 100G but less than 500G.
> Over 100G will lead btrfs to use big chunks, where data chunks can be at
> most 10G and metadata 1G.
> 
> I have seen a lot of users with a 100~200G device hit unbalanced
> chunk allocation (a 10G data chunk easily takes the last available space,
> leaving later metadata nowhere to be stored)

Maybe we should tune things so the size of the chunk is based on the
space remaining instead of the total space?

> 
> And unfortunately, your fs is already in the dangerous zone.
> (And you are using RAID1, which means it's the same as one 170G btrfs with
> SINGLE data/meta)
> 
> >
> >In addition to that last time I tried it aborts scrub any of my BTRFS
> >filesstems. Reported in another thread here that got completely ignored so
> >far. I think I could go back to 4.2 kernel to make this work.

We'll pick this thread up again, the ones that get fixed the fastest are
the ones that we can easily reproduce.  The rest need a lot of think
time.

-chris


Re: Still not production ready

2015-12-15 Thread Martin Steigerwald
On Tuesday, 15 December 2015, 16:59:58 CET, Chris Mason wrote:
> On Mon, Dec 14, 2015 at 10:08:16AM +0800, Qu Wenruo wrote:
> > Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
> > >Hi!
> > >
> > >For me it is still not production ready.
> > 
> > Yes, this is the *FACT* and not everyone has a good reason to deny it.
> > 
> > >Again I ran into:
> > >
> > >btrfs kworker thread uses up 100% of a Sandybridge core for minutes on
> > >random write into big file
> > >https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > 
> > Not sure about the guidelines for other filesystems, but it will attract
> > more devs' attention if it is posted to the mailing list.
> > 
> > >No matter whether SLES 12 uses it as default for root, no matter whether
> > >Fujitsu and Facebook use it: I will not let this onto any customer
> > >machine
> > >without lots and lots of underprovisioning and rigorous free space
> > >monitoring. Actually I will renew my recommendations in my trainings to
> > >be careful with BTRFS.
> > >
> > > From my experience the monitoring would check for:
> > >merkaba:~> btrfs fi show /home
> > >Label: 'home'  uuid: […]
> > >
> > > Total devices 2 FS bytes used 156.31GiB
> > > devid1 size 170.00GiB used 164.13GiB path
> > > /dev/mapper/msata-home
> > > devid2 size 170.00GiB used 164.13GiB path
> > > /dev/mapper/sata-home
> > >
> > >If "used" is same as "size" then make big fat alarm. It is not sufficient
> > >for it to happen. It can run for quite some time just fine without any
> > >issues, but I never have seen a kworker thread using 100% of one core
> > >for extended period of time blocking everything else on the fs without
> > >this condition being met.
> > 
> > And special advice on device size from myself:
> > Don't use devices over 100G but less than 500G.
> > Over 100G will lead btrfs to use big chunks, where data chunks can be at
> > most 10G and metadata 1G.
> > 
> > I have seen a lot of users with a 100~200G device hit unbalanced
> > chunk allocation (a 10G data chunk easily takes the last available space,
> > leaving later metadata nowhere to be stored)
> 
> Maybe we should tune things so the size of the chunk is based on the
> space remaining instead of the total space?

Still, on my filesystem there was over 1 GiB free in metadata chunks, so…

… my theory still is: BTRFS has trouble finding free space within chunks at
some point.

> > And unfortunately, your fs is already in the dangerous zone.
> > (And you are using RAID1, which means it's the same as one 170G btrfs with
> > SINGLE data/meta)
> > 
> > >In addition to that last time I tried it aborts scrub any of my BTRFS
> > >filesstems. Reported in another thread here that got completely ignored
> > >so
> > >far. I think I could go back to 4.2 kernel to make this work.
> 
> We'll pick this thread up again, the ones that get fixed the fastest are
> the ones that we can easily reproduce.  The rest need a lot of think
> time.

I understand. Maybe I just wanted to see at least some sort of a reaction.

I now have 4.4-rc5 running; the boot crash I had appears to be fixed. Oh, and
I see that scrubbing / at least worked now:

merkaba:~> btrfs scrub status -d /
scrub status for […]
scrub device /dev/dm-5 (id 1) history
scrub started at Wed Dec 16 00:13:20 2015 and finished after 00:01:42
total bytes scrubbed: 23.94GiB with 0 errors
scrub device /dev/mapper/msata-debian (id 2) history
scrub started at Wed Dec 16 00:13:20 2015 and finished after 00:01:34
total bytes scrubbed: 23.94GiB with 0 errors

Okay, I'll test the other ones tomorrow; maybe this one is fixed meanwhile.

Yay!

Thanks,
-- 
Martin


Re: [4.3-rc4] scrubbing aborts before finishing (probably solved)

2015-12-15 Thread Martin Steigerwald
On Monday, 14 December 2015, 08:59:59 CET, Martin Steigerwald wrote:
> On Wednesday, 25 November 2015, 16:35:39 CET, you wrote:
> > On Saturday, 31 October 2015, 12:10:37 CET, Martin Steigerwald wrote:
> > > On Thursday, 22 October 2015, 10:41:15 CET, Martin Steigerwald wrote:
> > > > I get this:
> > > > 
> > > > merkaba:~> btrfs scrub status -d /
> > > > scrub status for […]
> > > > scrub device /dev/mapper/sata-debian (id 1) history
> > > > 
> > > > scrub started at Thu Oct 22 10:05:49 2015 and was aborted
> > > > after
> > > > 00:00:00
> > > > total bytes scrubbed: 0.00B with 0 errors
> > > > 
> > > > scrub device /dev/dm-2 (id 2) history
> > > > 
> > > > scrub started at Thu Oct 22 10:05:49 2015 and was aborted
> > > > after
> > > > 00:01:30
> > > > total bytes scrubbed: 23.81GiB with 0 errors
> > > > 
> > > > For / scrub aborts for sata SSD immediately.
> > > > 
> > > > For /home scrub aborts for both SSDs at some time.
> > > > 
> > > > merkaba:~> btrfs scrub status -d /home
> > > > scrub status for […]
> > > > scrub device /dev/mapper/msata-home (id 1) history
> > > > 
> > > > scrub started at Thu Oct 22 10:09:37 2015 and was aborted
> > > > after
> > > > 00:01:31
> > > > total bytes scrubbed: 22.03GiB with 0 errors
> > > > 
> > > > scrub device /dev/dm-3 (id 2) history
> > > > 
> > > > scrub started at Thu Oct 22 10:09:37 2015 and was aborted
> > > > after
> > > > 00:03:34
> > > > total bytes scrubbed: 53.30GiB with 0 errors
> > > > 
> > > > Also single volume BTRFS is affected:
> > > > 
> > > > merkaba:~> btrfs scrub status /daten
> > > > scrub status for […]
> > > > 
> > > > scrub started at Thu Oct 22 10:36:38 2015 and was aborted
> > > > after
> > > > 00:00:00
> > > > total bytes scrubbed: 0.00B with 0 errors
> > > > 
> > > > No errors in dmesg, btrfs device stat or smartctl -a.
> > > > 
> > > > Any known issue?
> > > 
> > > I am still seeing this in 4.3-rc7. It happens so that on one SSD BTRFS
> > > doesn´t even start scrubbing. But in the end it aborts it scrubbing
> > > anyway.
> > > 
> > > I do not see any other issue so far. But I would really like to be able
> > > to
> > > scrub my BTRFS filesystems completely again. Any hints? Any further
> > > information needed?
> > > 
> > > merkaba:~> btrfs scrub status -d /
> > > scrub status for […]
> > > scrub device /dev/dm-5 (id 1) history
> > > 
> > > scrub started at Sat Oct 31 11:58:45 2015, running for 00:00:00
> > > total bytes scrubbed: 0.00B with 0 errors
> > > 
> > > scrub device /dev/mapper/msata-debian (id 2) status
> > > 
> > > scrub started at Sat Oct 31 11:58:45 2015, running for 00:00:20
> > > total bytes scrubbed: 5.27GiB with 0 errors
> > > 
> > > merkaba:~> btrfs scrub status -d /
> > > scrub status for […]
> > > scrub device /dev/dm-5 (id 1) history
> > > 
> > > scrub started at Sat Oct 31 11:58:45 2015, running for 00:00:00
> > > total bytes scrubbed: 0.00B with 0 errors
> > > 
> > > scrub device /dev/mapper/msata-debian (id 2) status
> > > 
> > > scrub started at Sat Oct 31 11:58:45 2015, running for 00:00:25
> > > total bytes scrubbed: 6.59GiB with 0 errors
> > > 
> > > merkaba:~> btrfs scrub status -d /
> > > scrub status for […]
> > > scrub device /dev/dm-5 (id 1) history
> > > 
> > > scrub started at Sat Oct 31 11:58:45 2015, running for 00:00:00
> > > total bytes scrubbed: 0.00B with 0 errors
> > > 
> > > scrub device /dev/mapper/msata-debian (id 2) status
> > > 
> > > scrub started at Sat Oct 31 11:58:45 2015, running for 00:01:25
> > > total bytes scrubbed: 21.97GiB with 0 errors
> > > 
> > > merkaba:~> btrfs scrub status -d /
> > > scrub status for […]
> > > scrub device /dev/dm-5 (id 1) history
> > > 
> > > scrub started at Sat Oct 31 11:58:45 2015 and was aborted after
> > > 
> > > 00:00:00 total bytes scrubbed: 0.00B with 0 errors
> > > scrub device /dev/mapper/msata-debian (id 2) history
> > > 
> > > scrub started at Sat Oct 31 11:58:45 2015 and was aborted after
> > > 
> > > 00:01:32 total bytes scrubbed: 23.63GiB with 0 errors
> > > 
> > > 
> > > For the sake of it I am going to btrfs check one of the filesystem where
> > > BTRFS aborts scrubbing (which is all of the laptop filesystems, not only
> > > the RAID 1 one).
> > > 
> > > I will use the /daten filesystem as I can unmount it during laptop
> > > runtime
> > > easily. There scrubbing aborts immediately:
> > > 
> > > merkaba:~> btrfs scrub start /daten
> > > scrub started on /daten, fsid […] (pid=13861)
> > > merkaba:~> btrfs scrub status /daten
> > > scrub status for […]
> > > 
> > > scrub started at Sat Oct 31 12:04:25 2015 and was aborted after
> > > 
> > > 00:00:00 total bytes scrubbed: 0.00B with 0 errors
> > > 
> > > It is single device:
> > > 
> > > merkaba:~> btrfs 

ERROR: did not find source subvol

2015-12-15 Thread Chris Murphy
kernel 4.2.6-301.fc23.x86_64
btrfs-progs-4.2.2-1.fc23.x86_64

This is a new one for me. Two new Btrfs volumes (one single profile,
one 2x raid1) both created with this btrfs-progs. And then

[root@f23a chrisbackup]# btrfs send everything-20150922/ | btrfs
receive /mnt/br1-500g/
At subvol everything-20150922/
At subvol everything-20150922
ERROR: did not find source subvol.

There are no kernel messages.

[root@f23a chrisbackup]# du -sh everything-20150922/
324G    everything-20150922/
[root@f23a chrisbackup]# du -sh /mnt /b

[root@f23a chrisbackup]# du -sh /mnt/br1-500g/everything-20150922/
322G    /mnt/br1-500g/everything-20150922/

HUH, looks like 2G is missing on the receive side. So it got
interrupted for some reason?

btrfs check (v4.3.1) comes up clean on the source, as does a scrub.

So should I retry with -v on the send or the receive side or both?


-- 
Chris Murphy


Re: Will "btrfs check --repair" fix the mounting problem?

2015-12-15 Thread Qu Wenruo



Ivan Sizov wrote on 2015/12/15 09:34 +0000:

2015-12-15 1:42 GMT+00:00 Qu Wenruo :

You'll see output like the following:
Well block 29491200(gen: 5 level: 0) seems good, and it matches superblock
Well block 29376512(gen: 4 level: 0) seems good, but generation/level
doesn't match, want gen: 5 level: 0

The match one is not what you're looking for.
Try the one whose generation is a little smaller than match one.

Then use btrfsck to test if it's OK:
$ btrfsck -r  /dev/sda1

Try 2~5 times with bytenr whose generation is near the match one.
If you're in good luck, you will find one doesn't crash btrfsck.

And if that doesn't produce much error, then you can try btrfsck --repair -r
 to fix it and try mount.


I've found a root that doesn't produce a backtrace. But extent/chunk
allocation errors were found:

$ sudo btrfsck --tree-root 535461888 /dev/sda1
parent transid verify failed on 535461888 wanted 21154 found 21150
parent transid verify failed on 535461888 wanted 21154 found 21150
Ignoring transid failure
checking extents
parent transid verify failed on 459292672 wanted 21148 found 21153
parent transid verify failed on 459292672 wanted 21148 found 21153


Transid failure is OK.


Ignoring transid failure
bad block 459292672
Errors found in extent allocation tree or chunk allocation
parent transid verify failed on 459292672 wanted 21148 found 21153

Should I ignore those errors and run btrfsck --repair? Or
--init-extent-tree is needed?


Did btrfsck have any other complaints?
And what is the generation difference between the root you're using and
the one in the superblock?
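
(Worked out from the transcript above, assuming those messages describe the
root being tested: the superblock wants generation 21154 while block
535461888 was found at generation 21150, i.e. a difference of
21154 - 21150 = 4.)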


If the generation difference is larger than 1, I'd recommend not running
'--repair' or '--init-extent-tree'.


If the difference is only 1, and btrfsck doesn't report problems other
than transid errors, it's worth trying --repair or --init-extent-tree.


But there is *NO* guarantee, and it may still make the case worse.

Thanks,
Qu




Re: Still not production ready

2015-12-15 Thread Qu Wenruo



Chris Mason wrote on 2015/12/15 16:59 -0500:

On Mon, Dec 14, 2015 at 10:08:16AM +0800, Qu Wenruo wrote:



Martin Steigerwald wrote on 2015/12/13 23:35 +0100:

Hi!

For me it is still not production ready.


Yes, this is the *FACT* and not everyone has a good reason to deny it.


Again I ran into:

btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random
write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401


Not sure about the guidelines for other filesystems, but it will attract
more devs' attention if it is posted to the mailing list.




No matter whether SLES 12 uses it as default for root, no matter whether
Fujitsu and Facebook use it: I will not let this onto any customer machine
without lots and lots of underprovisioning and rigorous free space monitoring.
Actually I will renew my recommendations in my trainings to be careful with
BTRFS.

 From my experience the monitoring would check for:

merkaba:~> btrfs fi show /home
Label: 'home'  uuid: […]
 Total devices 2 FS bytes used 156.31GiB
 devid1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
 devid2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home

If "used" is same as "size" then make big fat alarm. It is not sufficient for
it to happen. It can run for quite some time just fine without any issues, but
I never have seen a kworker thread using 100% of one core for extended period
of time blocking everything else on the fs without this condition being met.



And special advice on device size from myself:
Don't use devices over 100G but less than 500G.
Over 100G will lead btrfs to use big chunks, where data chunks can be at
most 10G and metadata 1G.

I have seen a lot of users with a 100~200G device hit unbalanced
chunk allocation (a 10G data chunk easily takes the last available space,
leaving later metadata nowhere to be stored)


Maybe we should tune things so the size of the chunk is based on the
space remaining instead of the total space?


I submitted such a patch before.
David pointed out that such behavior will cause a lot of small
fragmented chunks in the last several GB,
which may make balance behavior less predictable than before.


At least, we can just change the current 10% chunk size limit to 5% to
make the problem harder to trigger.

It's a simple and easy solution.
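
For reference, a hedged sketch of where that limit lives, from memory of
the 4.x-era __btrfs_alloc_chunk() in fs/btrfs/volumes.c (the 5% line is
illustrative only, not a submitted patch):

	/* today data chunks are capped at ~10% of all writable bytes: */
	max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
			     max_chunk_size);
	/* a 5% cap could use the finer-grained helper instead: */
	max_chunk_size = min(div_factor_fine(fs_devices->total_rw_bytes, 5),
			     max_chunk_size);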

Another cause of the problem is that we underestimated the chunk-size
jump for filesystems at the borderline of big chunks.


For 99G, the chunk size limit is 1G, and it needs 99 data chunks to
fully cover the fs.

But for 100G, it only needs 10 chunks to cover the fs.
And the fs would need to be 990G to match that chunk count again.

The sudden drop in chunk count is the root cause.

So we'd better reconsider both the big-chunk size limit and the chunk
size limit to find a balanced solution for it.


Thanks,
Qu




And unfortunately, your fs is already in the dangerous zone.
(And you are using RAID1, which means it's the same as one 170G btrfs with
SINGLE data/meta)



In addition to that last time I tried it aborts scrub any of my BTRFS
filesstems. Reported in another thread here that got completely ignored so
far. I think I could go back to 4.2 kernel to make this work.


We'll pick this thread up again, the ones that get fixed the fastest are
the ones that we can easily reproduce.  The rest need a lot of think
time.

-chris







Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay

2015-12-15 Thread Qu Wenruo



David Sterba wrote on 2015/12/14 18:32 +0100:

On Thu, Dec 10, 2015 at 10:34:06AM +0800, Qu Wenruo wrote:

Introduce a new mount option "nologreplay" to co-operate with "ro" mount
option to get real readonly mount, like "norecovery" in ext* and xfs.

Since the new parse_options() need to check new flags at remount time,
so add a new parameter for parse_options().

Signed-off-by: Qu Wenruo 
Reviewed-by: Chandan Rajendra 
Tested-by: Austin S. Hemmelgarn 


I've read the discussions around the change and from the user's POV I'd
suggest to add another mount option that would be just an alias for any
mount options that would implement the 'hard-ro' semantics.

Say it's called 'nowr'. Now it would imply 'nologreplay', but may cover
more options in the future.

  mount -o ro,nowr /dev/sdx /mnt

would work when switching kernels.



That would be nice.

I'd like to forward the idea/discussion to the general filesystem
mailing lists, not only the btrfs list.


Such behavior had better be coordinated among all filesystems (at least
xfs, ext4, and btrfs).


One sad example: we can't use a 'norecovery' mount option to disable
log replay in btrfs, as there is already a 'recovery' mount option.


So I hope we can have a unified mount option across mainline filesystems.

Thanks,
Qu




Re: Btrfs: check for empty bitmap list in setup_cluster_bitmaps

2015-12-15 Thread Manish
Hi Chris,

I have one coding comment for this patch.

The following lines can be merged into one:
  if (!list_empty(bitmaps))
 entry = list_first_entry(bitmaps, struct btrfs_free_space, list);


The new change is as below:

entry = list_first_entry_or_null(bitmaps, struct btrfs_free_space, list);
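
For context, a hedged sketch of how the merged form could sit in
setup_cluster_bitmap() -- the surrounding lines are paraphrased from memory
of Chris's fix rather than copied from the tree; list_first_entry_or_null()
is the stock include/linux/list.h helper that yields NULL on an empty list:

	entry = list_first_entry_or_null(bitmaps, struct btrfs_free_space,
					 list);
	if (!entry || entry->offset != bitmap_offset) {
		entry = tree_search_offset(ctl, bitmap_offset, 1, 0);
		if (entry && list_empty(&entry->list))
			list_add(&entry->list, bitmaps);
	}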


Manish




Re: Still not production ready

2015-12-15 Thread Liu Bo
On Wed, Dec 16, 2015 at 09:20:45AM +0800, Qu Wenruo wrote:
> 
> 
> Chris Mason wrote on 2015/12/15 16:59 -0500:
> >On Mon, Dec 14, 2015 at 10:08:16AM +0800, Qu Wenruo wrote:
> >>
> >>
> >>Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
> >>>Hi!
> >>>
> >>>For me it is still not production ready.
> >>
> >>Yes, this is the *FACT* and not everyone has a good reason to deny it.
> >>
> >>>Again I ran into:
> >>>
> >>>btrfs kworker thread uses up 100% of a Sandybridge core for minutes on 
> >>>random
> >>>write into big file
> >>>https://bugzilla.kernel.org/show_bug.cgi?id=90401
> >>
> >>Not sure about the guidelines for other filesystems, but it will attract
> >>more devs' attention if it is posted to the mailing list.
> >>
> >>>
> >>>
> >>>No matter whether SLES 12 uses it as default for root, no matter whether
> >>>Fujitsu and Facebook use it: I will not let this onto any customer machine
> >>>without lots and lots of underprovisioning and rigorous free space 
> >>>monitoring.
> >>>Actually I will renew my recommendations in my trainings to be careful with
> >>>BTRFS.
> >>>
> >>> From my experience the monitoring would check for:
> >>>
> >>>merkaba:~> btrfs fi show /home
> >>>Label: 'home'  uuid: […]
> >>> Total devices 2 FS bytes used 156.31GiB
> >>> devid1 size 170.00GiB used 164.13GiB path 
> >>> /dev/mapper/msata-home
> >>> devid2 size 170.00GiB used 164.13GiB path 
> >>> /dev/mapper/sata-home
> >>>
> >>>If "used" is same as "size" then make big fat alarm. It is not sufficient 
> >>>for
> >>>it to happen. It can run for quite some time just fine without any issues, 
> >>>but
> >>>I never have seen a kworker thread using 100% of one core for extended 
> >>>period
> >>>of time blocking everything else on the fs without this condition being 
> >>>met.
> >>>
> >>
> >>And special advice on device size from myself:
> >>Don't use devices over 100G but less than 500G.
> >>Over 100G will lead btrfs to use big chunks, where data chunks can be at
> >>most 10G and metadata 1G.
> >>
> >>I have seen a lot of users with a 100~200G device hit unbalanced
> >>chunk allocation (a 10G data chunk easily takes the last available space,
> >>leaving later metadata nowhere to be stored)
> >
> >Maybe we should tune things so the size of the chunk is based on the
> >space remaining instead of the total space?
> 
> I submitted such a patch before.
> David pointed out that such behavior will cause a lot of small fragmented
> chunks in the last several GB,
> which may make balance behavior less predictable than before.
> 
> 
> At least, we can just change the current 10% chunk size limit to 5% to make
> the problem harder to trigger.
> It's a simple and easy solution.
> 
> Another cause of the problem is that we underestimated the chunk-size jump
> for filesystems at the borderline of big chunks.
> 
> For 99G, the chunk size limit is 1G, and it needs 99 data chunks to fully
> cover the fs.
> But for 100G, it only needs 10 chunks to cover the fs.
> And the fs would need to be 990G to match that chunk count again.

max_stripe_size is fixed at 1GB and the chunk size is stripe_size *
data_stripes;
may I ask how your partition gets a 10GB chunk?


Thanks,

-liubo
 

> 
> The sudden drop in chunk count is the root cause.
> 
> So we'd better reconsider both the big-chunk size limit and the chunk size
> limit to find a balanced solution for it.
> 
> Thanks,
> Qu
> >
> >>
> >>And unfortunately, your fs is already in the dangerous zone.
> >>(And you are using RAID1, which means it's the same as one 170G btrfs with
> >>SINGLE data/meta)
> >>
> >>>
> >>>In addition to that last time I tried it aborts scrub any of my BTRFS
> >>>filesstems. Reported in another thread here that got completely ignored so
> >>>far. I think I could go back to 4.2 kernel to make this work.
> >
> >We'll pick this thread up again, the ones that get fixed the fastest are
> >the ones that we can easily reproduce.  The rest need a lot of think
> >time.
> >
> >-chris
> >
> >
> 
> 


Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay

2015-12-15 Thread Christoph Anton Mitterer
On Wed, 2015-12-16 at 09:36 +0800, Qu Wenruo wrote:
> One sad example: we can't use a 'norecovery' mount option to disable
> log replay in btrfs, as there is already a 'recovery' mount option.
I think "norecovery" would anyway not really fit... the name should
rather indicated, that from the filesystem side, nothing changes the
underlying device's contents.
"norecovery" would just tell, that no recovery options would be tried,
however, any other changes (optimisations, etc.) could still go on.

David's "nowr" is already, better, though it could be misinterpreted as
no write/read (as wr being rw swapped), so perhaps "nowrites" would be
better... but that again may be considered just another name for "ro".

So perhaps one could do something that includes "dev", like "rodev",
"ro-dev", or "immutable-dev"... or, instead of "dev", "devs" to cover
multi-device cases.
OTOH, the devices aren't really set "ro" (as in blockdev --setro).

Maybe "nodevwrites" or "no-dev-writes" or one of these with "device"
not abbreviated?


Many programs have a "--dry-run" option, but I kinda don't like
"drymount" or something like that.


Going by the above, I'd personally like "nodevwrites" the most.


Oh and Qu's idea of coordinating that with the other filesystems is
surely good.


Cheers,
Chris.




Re: Still not production ready

2015-12-15 Thread Qu Wenruo



Liu Bo wrote on 2015/12/15 17:53 -0800:

On Wed, Dec 16, 2015 at 09:20:45AM +0800, Qu Wenruo wrote:



Chris Mason wrote on 2015/12/15 16:59 -0500:

On Mon, Dec 14, 2015 at 10:08:16AM +0800, Qu Wenruo wrote:



Martin Steigerwald wrote on 2015/12/13 23:35 +0100:

Hi!

For me it is still not production ready.


Yes, this is the *FACT* and not everyone has a good reason to deny it.


Again I ran into:

btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random
write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401


Not sure about the guidelines for other filesystems, but it will attract
more devs' attention if it is posted to the mailing list.




No matter whether SLES 12 uses it as default for root, no matter whether
Fujitsu and Facebook use it: I will not let this onto any customer machine
without lots and lots of underprovisioning and rigorous free space monitoring.
Actually I will renew my recommendations in my trainings to be careful with
BTRFS.

 From my experience the monitoring would check for:

merkaba:~> btrfs fi show /home
Label: 'home'  uuid: […]
 Total devices 2 FS bytes used 156.31GiB
 devid1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
 devid2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home

If "used" is same as "size" then make big fat alarm. It is not sufficient for
it to happen. It can run for quite some time just fine without any issues, but
I never have seen a kworker thread using 100% of one core for extended period
of time blocking everything else on the fs without this condition being met.



And special advice on device size from myself:
Don't use devices over 100G but less than 500G.
Over 100G will lead btrfs to use big chunks, where data chunks can be at
most 10G and metadata 1G.

I have seen a lot of users with a 100~200G device hit unbalanced
chunk allocation (a 10G data chunk easily takes the last available space,
leaving later metadata nowhere to be stored)


Maybe we should tune things so the size of the chunk is based on the
space remaining instead of the total space?


I submitted such a patch before.
David pointed out that such behavior will cause a lot of small fragmented
chunks in the last several GB,
which may make balance behavior less predictable than before.


At least, we can just change the current 10% chunk size limit to 5% to make
the problem harder to trigger.
It's a simple and easy solution.

Another cause of the problem is that we underestimated the chunk-size jump
for filesystems at the borderline of big chunks.

For 99G, the chunk size limit is 1G, and it needs 99 data chunks to fully
cover the fs.
But for 100G, it only needs 10 chunks to cover the fs.
And the fs would need to be 990G to match that chunk count again.


max_stripe_size is fixed at 1GB and the chunk size is stripe_size *
data_stripes;
may I ask how your partition gets a 10GB chunk?


Oh, it seems that I remembered the wrong size.
After checking the code, yes you're right.
A stripe won't be larger than 1G, so my assumption above is totally wrong.
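
(Spelling the arithmetic out, as a hedged reading of the allocator rather
than a quote from it: data chunk size = stripe size × number of data
stripes, with the data stripe size capped at 1G. So single/DUP/RAID1 data
chunks top out at 1G, and a 10G data chunk would need something like RAID0
striped across ten devices.)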

And the problem is not in the 10% limit.

Please forget it.

Thanks,
Qu




Thanks,

-liubo




The sudden drop in chunk count is the root cause.

So we'd better reconsider both the big-chunk size limit and the chunk size
limit to find a balanced solution for it.

Thanks,
Qu




And unfortunately, your fs is already in the dangerous zone.
(And you are using RAID1, which means it's the same as one 170G btrfs with
SINGLE data/meta)



In addition to that last time I tried it aborts scrub any of my BTRFS
filesstems. Reported in another thread here that got completely ignored so
far. I think I could go back to 4.2 kernel to make this work.


We'll pick this thread up again, the ones that get fixed the fastest are
the ones that we can easily reproduce.  The rest need a lot of think
time.

-chris













Re: Still not production ready

2015-12-15 Thread Liu Bo
On Wed, Dec 16, 2015 at 10:19:00AM +0800, Qu Wenruo wrote:
> 
> 
> Liu Bo wrote on 2015/12/15 17:53 -0800:
> >On Wed, Dec 16, 2015 at 09:20:45AM +0800, Qu Wenruo wrote:
> >>
> >>
> >>Chris Mason wrote on 2015/12/15 16:59 -0500:
> >>>On Mon, Dec 14, 2015 at 10:08:16AM +0800, Qu Wenruo wrote:
> 
> 
> Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
> >Hi!
> >
> >For me it is still not production ready.
> 
> Yes, this is the *FACT* and not everyone has a good reason to deny it.
> 
> >Again I ran into:
> >
> >btrfs kworker thread uses up 100% of a Sandybridge core for minutes on 
> >random
> >write into big file
> >https://bugzilla.kernel.org/show_bug.cgi?id=90401
> 
> Not sure about the guidelines for other filesystems, but it will attract
> more devs' attention if it is posted to the mailing list.
> 
> >
> >
> >No matter whether SLES 12 uses it as default for root, no matter whether
> >Fujitsu and Facebook use it: I will not let this onto any customer 
> >machine
> >without lots and lots of underprovisioning and rigorous free space 
> >monitoring.
> >Actually I will renew my recommendations in my trainings to be careful 
> >with
> >BTRFS.
> >
> > From my experience the monitoring would check for:
> >
> >merkaba:~> btrfs fi show /home
> >Label: 'home'  uuid: […]
> > Total devices 2 FS bytes used 156.31GiB
> > devid1 size 170.00GiB used 164.13GiB path 
> > /dev/mapper/msata-home
> > devid2 size 170.00GiB used 164.13GiB path 
> > /dev/mapper/sata-home
> >
> >If "used" is same as "size" then make big fat alarm. It is not 
> >sufficient for
> >it to happen. It can run for quite some time just fine without any 
> >issues, but
> >I never have seen a kworker thread using 100% of one core for extended 
> >period
> >of time blocking everything else on the fs without this condition being 
> >met.
> >
> 
> And special advice on device size from myself:
> Don't use devices over 100G but less than 500G.
> Over 100G will lead btrfs to use big chunks, where data chunks can be at
> most 10G and metadata 1G.
> 
> I have seen a lot of users with a 100~200G device hit unbalanced
> chunk allocation (a 10G data chunk easily takes the last available space,
> leaving later metadata nowhere to be stored)
> >>>
> >>>Maybe we should tune things so the size of the chunk is based on the
> >>>space remaining instead of the total space?
> >>
> >>I submitted such a patch before.
> >>David pointed out that such behavior will cause a lot of small fragmented
> >>chunks in the last several GB,
> >>which may make balance behavior less predictable than before.
> >>
> >>
> >>At least, we can just change the current 10% chunk size limit to 5% to make
> >>the problem harder to trigger.
> >>It's a simple and easy solution.
> >>
> >>Another cause of the problem is that we underestimated the chunk-size jump
> >>for filesystems at the borderline of big chunks.
> >>
> >>For 99G, the chunk size limit is 1G, and it needs 99 data chunks to fully
> >>cover the fs.
> >>But for 100G, it only needs 10 chunks to cover the fs.
> >>And the fs would need to be 990G to match that chunk count again.
> >
> >max_stripe_size is fixed at 1GB and the chunk size is stripe_size * 
> >data_stripes,
> >may I know how your partition gets a 10GB chunk?
> 
> Oh, it seems that I remembered the wrong size.
> After checking the code, yes you're right.
> A stripe won't be larger than 1G, so my assumption above is totally wrong.
> 
> And the problem is not in the 10% limit.
> 
> Please forget it.

No problem, glad to see people talking about the space issue again.

Thanks,

-liubo
> 
> Thanks,
> Qu
> 
> >
> >
> >Thanks,
> >
> >-liubo
> >
> >
> >>
> >>The sudden drop in chunk count is the root cause.
> >>
> >>So we'd better reconsider both the big-chunk size limit and the chunk size
> >>limit to find a balanced solution for it.
> >>
> >>Thanks,
> >>Qu
> >>>
> 
> And unfortunately, your fs is already in the dangerous zone.
> (And you are using RAID1, which means it's the same as one 170G btrfs with
> SINGLE data/meta)
> 
> >
> >In addition to that last time I tried it aborts scrub any of my BTRFS
> >filesstems. Reported in another thread here that got completely ignored 
> >so
> >far. I think I could go back to 4.2 kernel to make this work.
> >>>
> >>>We'll pick this thread up again, the ones that get fixed the fastest are
> >>>the ones that we can easily reproduce.  The rest need a lot of think
> >>>time.
> >>>
> >>>-chris
> >>>
> >>>
> >>
> >>
> >
> >
> 
> 

[PATCH] Btrfs: fix output of compression message in btrfs_parse_options()

2015-12-15 Thread Tsutomu Itoh
The compression message might not be output correctly: as the transcripts
below show, remounting with 'compress' after 'compress-force' prints no
message at all. Fix it.

[[before fix]]

# mount -o compress /dev/sdb3 /test3
[  996.874264] BTRFS info (device sdb3): disk space caching is enabled
[  996.874268] BTRFS: has skinny extents
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)

# mount -o remount,compress-force /dev/sdb3 /test3
[ 1035.075017] BTRFS info (device sdb3): force zlib compression
[ 1035.075021] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/)

# mount -o remount,compress /dev/sdb3 /test3
[ 1053.679092] BTRFS info (device sdb3): disk space caching is enabled
[root@luna compress-info]# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)

[[after fix]]

# mount -o compress /dev/sdb3 /test3
[  401.021753] BTRFS info (device sdb3): use zlib compression
[  401.021758] BTRFS info (device sdb3): disk space caching is enabled
[  401.021760] BTRFS: has skinny extents
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)

# mount -o remount,compress-force /dev/sdb3 /test3
[  439.824624] BTRFS info (device sdb3): force zlib compression
[  439.824629] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/)

# mount -o remount,compress /dev/sdb3 /test3
[  459.918430] BTRFS info (device sdb3): use zlib compression
[  459.918434] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)

Signed-off-by: Tsutomu Itoh 
---
 fs/btrfs/disk-io.c |  2 +-
 fs/btrfs/super.c   | 21 ++---
 2 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 974be09..dcc1f15 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2709,7 +2709,7 @@ int open_ctree(struct super_block *sb,
 * In the long term, we'll store the compression type in the super
 * block, and it'll be used for per file compression control.
 */
-   fs_info->compress_type = BTRFS_COMPRESS_ZLIB;
+   fs_info->compress_type = BTRFS_COMPRESS_NONE;
 
ret = btrfs_parse_options(tree_root, options);
if (ret) {
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 24154e4..e2e8a54 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -381,6 +381,8 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
int ret = 0;
char *compress_type;
bool compress_force = false;
+   enum btrfs_compression_type saved_compress_type;
+   bool saved_compress_force;
 
cache_gen = btrfs_super_cache_generation(root->fs_info->super_copy);
if (cache_gen)
@@ -458,6 +460,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
/* Fallthrough */
case Opt_compress:
case Opt_compress_type:
+   saved_compress_type = info->compress_type;
+   saved_compress_force =
+   btrfs_test_opt(root, FORCE_COMPRESS);
if (token == Opt_compress ||
token == Opt_compress_force ||
strcmp(args[0].from, "zlib") == 0) {
@@ -475,6 +480,7 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
btrfs_set_fs_incompat(info, COMPRESS_LZO);
} else if (strncmp(args[0].from, "no", 2) == 0) {
compress_type = "no";
+   info->compress_type = BTRFS_COMPRESS_NONE;
btrfs_clear_opt(info->mount_opt, COMPRESS);
btrfs_clear_opt(info->mount_opt, FORCE_COMPRESS);
compress_force = false;
@@ -484,14 +490,8 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
}
 
if (compress_force) {
-   btrfs_set_and_info(root, FORCE_COMPRESS,
-  "force %s compression",
-  compress_type);
+   btrfs_set_opt(info->mount_opt, FORCE_COMPRESS);
} else {
-   if (!btrfs_test_opt(root, COMPRESS))
-   btrfs_info(root->fs_info,
-  "btrfs: use %s compression",
-  compress_type);
/*
   

need to recover large file

2015-12-15 Thread Langhorst, Brad
Hi:

I just screwed up… I spent the last 3 weeks generating a 400G file (a
genome assembly).
I went to back it up and swapped the arguments to tar (tar Jcf my_precious
my_precious.tar.xz), so what was once 400G is now 108 bytes of xz header -
argh.

This is on a 6-device btrfs filesystem.

I immediately unmounted the fs (had to cd / first).

After a bit of searching I found Chris Mason’s post about using
btrfs-debug-tree -R:

root tree: 25606900367360 level 1
chunk tree: 25606758596608 level 1
extent tree key (EXTENT_TREE ROOT_ITEM 0) 25606900383744 level 2
device tree key (DEV_TREE ROOT_ITEM 0) 25606865108992 level 1
fs tree key (FS_TREE ROOT_ITEM 0) 22983583956992 level 1
checksum tree key (CSUM_TREE ROOT_ITEM 0) 25606922682368 level 2
uuid tree key (UUID_TREE ROOT_ITEM 0) 22984609513472 level 0
data reloc tree key (DATA_RELOC_TREE ROOT_ITEM 0) 22984615477248 level 0
btrfs root backup slot 0
tree root gen 21613 block 25606891274240
extent root gen 21613 block 25606891323392
chunk root gen 21612 block 25606758596608
device root gen 21612 block 25606865108992
csum root gen 21613 block 25606891503616
fs root gen 21611 block 22983583956992
1712858198016 used 5400011280384 total 6 devices
btrfs root backup slot 1
tree root gen 21614 block 25606900367360
extent root gen 21614 block 25606900383744
chunk root gen 21612 block 25606758596608
device root gen 21612 block 25606865108992
csum root gen 21614 block 25606922682368
fs root gen 21611 block 22983583956992
1712858198016 used 5400011280384 total 6 devices
btrfs root backup slot 2
tree root gen 21611 block 25606857605120
extent root gen 21611 block 22983584268288
chunk root gen 21595 block 25606758612992
device root gen 21601 block 22983580794880
csum root gen 21611 block 22983584333824
fs root gen 21611 block 22983583956992
1712971542528 used 5400011280384 total 6 devices
btrfs root backup slot 3
tree root gen 21612 block 25606874546176
extent root gen 21612 block 25606880575488
chunk root gen 21612 block 25606758596608
device root gen 21612 block 25606865108992
csum root gen 21612 block 25606890864640
fs root gen 21611 block 22983583956992
1712971444224 used 5400011280384 total 6 devices
total bytes 5400011280384
bytes used 1712858198016
uuid b13d3dc1-f287-483c-8b7d-b142f31fe6df
Btrfs v3.12

I found the oldest generation and grabbed its tree root gen block like this:

sudo btrfs restore -o -v -t 25606857605120 --path-regex ^/\(\|deer\(\|/masurca\(\|/quorum_mer_db.jf\)\)\)$ /dev/sdd1 /tmp/recover/
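
For what it's worth, that --path-regex has to match every parent directory
as well as the target; unescaped it reads
^/(|deer(|/masurca(|/quorum_mer_db.jf)))$, where the empty alternations let
it match exactly "/", "/deer", "/deer/masurca", and
"/deer/masurca/quorum_mer_db.jf".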

Unfortunately, I only recovered the post-error file.

I also read that one can use btrfs-find-root to get a list of roots to
recover from, and just ran btrfs-find-root on one of the underlying disks,
but I get an error:

"Super think's the tree root is at 25606900367360, chunk root 25606758596608
Went past the fs size, exiting”
Someone else was able to get past this by commenting out that error … so I
tried recompiling without those lines.

Super think's the tree root is at 25606900367360, chunk root 25606758596608
Well block 24729718243328 seems great, but generation doesn't match, have=21582, want=21614 level 1
Well block 24729718292480 seems great, but generation doesn't match, have=21582, want=21614 level 0
Well block 24729718308864 seems great, but generation doesn't match, have=21582, want=21614 level 0
Well block 24729718407168 seems great, but generation doesn't match, have=21582, want=21614 level 0
Well block 24729944670208 seems great, but generation doesn't match, have=21583, want=21614 level 1
Well block 24729944719360 seems great, but generation doesn't match, have=21583, want=21614 level 0
Well block 24729944735744 seems great, but generation doesn't match, have=21583, want=21614 level 0
Well block 24729944817664 seems great, but generation doesn't match, have=21583, want=21614 level 0
Well block 24730048708608 seems great, but generation doesn't match, have=21584, want=21614 level 1
Well block 24730048724992 seems great, but generation doesn't match, have=21584, want=21614 level 0
Well block 24730048741376 seems great, but generation doesn't match, have=21584, want=21614 level 0
Well block 24730048757760 seems great, but generation doesn't match, have=21584, want=21614 level 0
Well block 24730048774144 seems great, but generation doesn't match, have=21584, want=21614 level 0
Well block 24730048823296 seems great, but generation doesn't match, have=21584, want=21614 level 0
Well block 24730132348928 seems great, but generation doesn't match, have=21585, want=21614 level 1
Well block 24730132414464 seems great, but generation doesn't mat

Re: need to recover large file

2015-12-15 Thread Michael Darling
... I didn't read all of your original post at first, because I
haven't been into those internals.  Now that I have, I see it seems to
be using 6 devices, so you might have to use one hard drive six times
the size of a partition (others can say if this will work), or 6 other
hard drives to make the backups.  Although, I'm not sure what raid
configuration there may be, which could reduce the number of copies you
had to make.

On Tue, Dec 15, 2015 at 10:59 PM, Michael Darling  wrote:
>
> Or, even better yet, just unplug the drive completely and let someone more 
> knowledgeable with btrfs say my dd suggestion works, as long as you don't 
> mount either when they're both plugged in.  I know the urge to just work on 
> something, but bad recovery attempts can make recoverable data lost forever.
>
> On Tue, Dec 15, 2015 at 10:56 PM, Michael Darling  wrote:
>>
>> First thing first, if the file is as important as it sounds like.  Since 
>> there's no physical problem with the hard drive, go get (or if you have one 
>> laying around, use) another hard drive that is at least as big as the 
>> partition that file was on.  Then completely unmount the original partition. 
>>  Do a dd copy of the entire partition it was on.
>>
>> Then DO NOT mount either partition.  You do NOT want to mount a btrfs 
>> partition when there's a dd clone hooked up, because the UUID's are the same 
>> on both.
>>
>> Turn off the machine, pull the new backup copy, then go back to work on 
>> recovery attempts.  If recovery goes wrong, restore the fresh backup copy by 
>> another dd.  Make sure again to NOT mount either while more than one copy is 
>> plugged in.
>>
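
A hedged illustration of the kind of clone being suggested (device names
are placeholders and the bs/conv choices are just one reasonable option,
not taken from the mails above; both copies must stay unmounted
throughout):

# dd if=/dev/sdd1 of=/dev/sde1 bs=4M conv=noerror,sync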


Re: ERROR: did not find source subvol

2015-12-15 Thread Chris Murphy
On Tue, Dec 15, 2015 at 5:35 PM, Chris Murphy  wrote:
> kernel 4.2.6-301.fc23.x86_64
> btrfs-progs-4.2.2-1.fc23.x86_64
>
> This is a new one for me. Two new Btrfs volumes (one single profile,
> one 2x raid1) both created with this btrfs-progs. And then
>
> [root@f23a chrisbackup]# btrfs send everything-20150922/ | btrfs
> receive /mnt/br1-500g/
> At subvol everything-20150922/
> At subvol everything-20150922
> ERROR: did not find source subvol.
>
> There are no kernel messages.
>
> [root@f23a chrisbackup]# du -sh everything-20150922/
> 324Geverything-20150922/
> [root@f23a chrisbackup]# du -sh /mnt /b
>
> [root@f23a chrisbackup]# du -sh /mnt/br1-500g/everything-20150922/
> 322G/mnt/br1-500g/everything-20150922/
>
> HUH, looks like 2G is missing on the receive side. So it got
> interrupted for some reason?

Files are definitely missing. -v reveals no new information. So I'm confused.

Next, I switched to kernel 4.4rc5 and btrfs-progs 4.3.1 and I'm not
seeing this problem. Same file systems, same subvolume.


-- 
Chris Murphy