Re: Will "btrfs check --repair" fix the mounting problem?
2015-12-15 1:42 GMT+00:00 Qu Wenruo:
> You'll see output like the following:
> Well block 29491200(gen: 5 level: 0) seems good, and it matches superblock
> Well block 29376512(gen: 4 level: 0) seems good, but generation/level
> doesn't match, want gen: 5 level: 0
>
> The matching one is not what you're looking for.
> Try one whose generation is a little smaller than the matching one.
>
> Then use btrfsck to test if it's OK:
> $ btrfsck -r <bytenr> /dev/sda1
>
> Try 2~5 times with bytenrs whose generation is near the matching one.
> If you're in luck, you will find one that doesn't crash btrfsck.
>
> And if that doesn't produce many errors, then you can try
> btrfsck --repair -r <bytenr> to fix it and try to mount.

I've found a root that doesn't produce a backtrace, but extent/chunk
allocation errors were found:

$ sudo btrfsck --tree-root 535461888 /dev/sda1
parent transid verify failed on 535461888 wanted 21154 found 21150
parent transid verify failed on 535461888 wanted 21154 found 21150
Ignoring transid failure
checking extents
parent transid verify failed on 459292672 wanted 21148 found 21153
parent transid verify failed on 459292672 wanted 21148 found 21153
Ignoring transid failure
bad block 459292672
Errors found in extent allocation tree or chunk allocation
parent transid verify failed on 459292672 wanted 21148 found 21153

Should I ignore those errors and run btrfsck --repair? Or is
--init-extent-tree needed?

--
Ivan Sizov
[PATCH] Btrfs-progs: ftw_add_entry_size: Round up file size to sectorsize
ftw_add_entry_size() assumes 4k as the block size of the underlying
filesystem, and hence the file sizes computed are incorrect for non-4k
sectorsized filesystems. Fix this by rounding up file sizes to the
sectorsize.

Signed-off-by: Chandan Rajendra
---
 mkfs.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mkfs.c b/mkfs.c
index c58ab2f..88c2289 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1031,16 +1031,15 @@ out:
  * This ignores symlinks with unreadable targets and subdirs that can't
  * be read.  It's a best-effort to give a rough estimate of the size of
  * a subdir.  It doesn't guarantee that prepopulating btrfs from this
- * tree won't still run out of space.
- *
- * The rounding up to 4096 is questionable.  Previous code used du -B 4096.
+ * tree won't still run out of space.
  */
 static u64 global_total_size;
+static u64 fs_block_size;
 static int ftw_add_entry_size(const char *fpath, const struct stat *st,
                               int type)
 {
         if (type == FTW_F || type == FTW_D)
-                global_total_size += round_up(st->st_size, 4096);
+                global_total_size += round_up(st->st_size, fs_block_size);
         return 0;
 }

@@ -1060,6 +1059,7 @@ static u64 size_sourcedir(char *dir_name, u64 sectorsize,
                 allocated_meta_size / default_chunk_size;

         global_total_size = 0;
+        fs_block_size = sectorsize;
         ret = ftw(dir_name, ftw_add_entry_size, 10);
         dir_size = global_total_size;
         if (ret < 0) {
--
2.1.0
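(For readers following along: round_up() here is the usual power-of-two
rounding helper; btrfs-progs defines a helper along these lines, though
treat the exact macro below as an assumption. A minimal userspace sketch
shows why hard-coding 4096 miscounts on a 64KiB-sectorsize filesystem.)

#include <stdio.h>
#include <stdint.h>

/* Power-of-two rounding helper in the style the patch relies on;
 * "align" must be a power of two, as sectorsize always is. */
#define round_up(x, align) (((x) + (align) - 1) & ~((uint64_t)(align) - 1))

int main(void)
{
        /* A 5000-byte file occupies two 4k blocks but one full 64k block. */
        uint64_t size = 5000;

        printf("4k:  %llu\n", (unsigned long long)round_up(size, 4096));  /* 8192 */
        printf("64k: %llu\n", (unsigned long long)round_up(size, 65536)); /* 65536 */
        return 0;
}

On a 64k-sectorsize filesystem the file really consumes 65536 bytes on
disk, so rounding to 4096 undercounts the space mkfs needs to reserve.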
Re: [PATCH 0/3] btrfs-progs: fix file restore to lost+found bug
On Tue, Dec 08, 2015 at 11:06:28AM +0900, Naohiro Aota wrote:
> On Tue, Dec 8, 2015 at 12:35 AM, David Sterba wrote:
> > On Mon, Dec 07, 2015 at 11:59:19AM +0900, Naohiro Aota wrote:
> >> > But I only see the first 2 patches in the mailing list...
> >> > The last test case seems missing?
> >>
> >> Maybe the last patch is too large to post to the list? Even if it
> >> gets smaller, 130260 bytes seems to be a bit large.
> >>
> >> How should I handle this? Put my repo somewhere and wait for a
> >> maintainer to pull it?
> >
> > Please send it to me directly. The image will be available in
> > btrfs-progs git and we don't necessarily need the copy in the
> > mailing list.
>
> Sure. I'll send it to you.

Applied, thanks.
[PATCH 0/4] btrfs: return all mirrors whether need_raid_map is set or not
__btrfs_map_block() should return all mirrors in the WRITE,
REQ_GET_READ_MIRRORS, and RECOVERY cases, whether need_raid_map is set
or not; need_raid_map is only used to control whether bbio->raid_map
gets set.

The current code works by accident, because there is only one caller
that can trigger the above bug, which is readahead, and that path
happened to bail out with fewer mirrors. But once __btrfs_map_block()
is fixed, readahead will really run, and will then trip a warning from
another bug.

This patchset fixes __btrfs_map_block() and disables raid56 readahead
temporarily (in practice it is already disabled by this bug); I'll fix
raid56 readahead next.

Zhao Lei (4):
  btrfs: Disable raid56 readahead
  btrfs: return all mirrors whether need_raid_map is set or not
  btrfs: Small cleanup for get index_srcdev loop
  btrfs: Use direct way to determine raid56 write/recover mode

 fs/btrfs/reada.c   |  5 +++++
 fs/btrfs/volumes.c | 50 ++++++++++++++++++++--------------------------
 2 files changed, 29 insertions(+), 26 deletions(-)

--
1.8.5.1
[PATCH 3/4] btrfs: Small cleanup for get index_srcdev loop
1: Adjust the loop condition to reduce the indentation level.
2: Move btrfs_put_bbio() up so the two call sites combine, which makes
   the logic cleaner.

Signed-off-by: Zhao Lei
---
 fs/btrfs/volumes.c | 42 ++++++++++++++++++++----------------------
 1 file changed, 20 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4ee429b..367e8ec 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5368,35 +5368,33 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
                  * target drive.
                  */
                 for (i = 0; i < tmp_num_stripes; i++) {
-                        if (tmp_bbio->stripes[i].dev->devid == srcdev_devid) {
-                                /*
-                                 * In case of DUP, in order to keep it
-                                 * simple, only add the mirror with the
-                                 * lowest physical address
-                                 */
-                                if (found &&
-                                    physical_of_found <=
-                                     tmp_bbio->stripes[i].physical)
-                                        continue;
-                                index_srcdev = i;
-                                found = 1;
-                                physical_of_found =
-                                        tmp_bbio->stripes[i].physical;
-                        }
+                        if (tmp_bbio->stripes[i].dev->devid != srcdev_devid)
+                                continue;
+
+                        /*
+                         * In case of DUP, in order to keep it simple, only add
+                         * the mirror with the lowest physical address
+                         */
+                        if (found &&
+                            physical_of_found <= tmp_bbio->stripes[i].physical)
+                                continue;
+
+                        index_srcdev = i;
+                        found = 1;
+                        physical_of_found = tmp_bbio->stripes[i].physical;
                 }

-                if (found) {
-                        mirror_num = index_srcdev + 1;
-                        patch_the_first_stripe_for_dev_replace = 1;
-                        physical_to_patch_in_first_stripe = physical_of_found;
-                } else {
+                btrfs_put_bbio(tmp_bbio);
+
+                if (!found) {
                         WARN_ON(1);
                         ret = -EIO;
-                        btrfs_put_bbio(tmp_bbio);
                         goto out;
                 }

-                btrfs_put_bbio(tmp_bbio);
+                mirror_num = index_srcdev + 1;
+                patch_the_first_stripe_for_dev_replace = 1;
+                physical_to_patch_in_first_stripe = physical_of_found;
         } else if (mirror_num > map->num_stripes) {
                 mirror_num = 0;
         }
--
1.8.5.1
[PATCH 1/4] btrfs: Disable raid56 readahead
Raid56 readahead cannot work with the current code: reada_find_extent()
will trigger the bbio->num_stripes > BTRFS_MAX_MIRRORS warning, because
raid56 has parity stripes, which add to bbio->num_stripes.

The reason we haven't seen the above error is another bug in
__btrfs_map_block(), which makes raid56 readahead do nothing. Before we
fix the bug in __btrfs_map_block(), we need to disable raid56 readahead
temporarily, to avoid the above warning.

Signed-off-by: Zhao Lei
---
 fs/btrfs/reada.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 619f929..7bbd656 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -363,6 +363,11 @@ static struct reada_extent *reada_find_extent(struct btrfs_root *root,
         if (ret || !bbio || length < blocksize)
                 goto error;

+        if (bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
+                /* Current code can not support RAID56 yet */
+                goto error;
+        }
+
         if (bbio->num_stripes > BTRFS_MAX_MIRRORS) {
                 btrfs_err(root->fs_info,
                         "readahead: more than %d copies not supported",
--
1.8.5.1
[PATCH 4/4] btrfs: Use direct way to determine raid56 write/recover mode
The old code used bbio->raid_map to determine whether we are in a
raid56 write/recover operation, because we didn't have bbio->map_type
at that time and had to resort to that workaround.

Now we have a direct way to check this condition, so use it, get rid of
the reliance on function-specific data, and make the code more
readable.

Signed-off-by: Zhao Lei
---
 fs/btrfs/volumes.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 367e8ec..d411444 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6056,7 +6056,8 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
         bbio->fs_info = root->fs_info;
         atomic_set(&bbio->stripes_pending, bbio->num_stripes);

-        if (bbio->raid_map) {
+        if ((bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) &&
+            ((rw & WRITE) || (mirror_num > 1))) {
                 /* In this case, map_length has been set to the length of
                    a single stripe; not the whole write */
                 if (rw & WRITE) {
--
1.8.5.1
[PATCH 2/4] btrfs: return all mirrors whether need_raid_map is set or not
__btrfs_map_block() should return all mirrors in the WRITE,
REQ_GET_READ_MIRRORS, and RECOVERY cases, whether need_raid_map is set
or not; need_raid_map is only used to control whether bbio->raid_map
gets set.

The current code works by accident, because there is only one caller
that can trigger the above bug, which is readahead, and that path
happened to bail out with fewer mirrors.

Signed-off-by: Zhao Lei
---
 fs/btrfs/volumes.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a6df8fd..4ee429b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5464,9 +5464,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
                 }

         } else if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
-                if (need_raid_map &&
-                    ((rw & (REQ_WRITE | REQ_GET_READ_MIRRORS)) ||
-                     mirror_num > 1)) {
+                if ((rw & (REQ_WRITE | REQ_GET_READ_MIRRORS)) ||
+                    mirror_num > 1) {
                         /* push stripe_nr back to the start of the full stripe */
                         stripe_nr = div_u64(raid56_full_stripe_start,
                                         stripe_len * nr_data_stripes(map));
--
1.8.5.1
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
On 2015-12-14 18:34, Christoph Anton Mitterer wrote:
> On Mon, 2015-12-14 at 15:20 -0500, Austin S. Hemmelgarn wrote:
>> On 2015-12-14 14:44, Christoph Anton Mitterer wrote:
>>> On Mon, 2015-12-14 at 14:33 -0500, Austin S. Hemmelgarn wrote:
>>>> The traditional reasoning was that read-only meant that users
>>>> couldn't change anything
>>> Where I'd however count the atime changes too. The atimes wouldn't
>>> change magically, but only because the user started some program,
>>> configured some daemon, etc. ... which reads/writes/etc. the file.
>> But reading the file is allowed, which is where this starts to get
>> ambiguous. Why? Because according to POSIX, when a file gets read,
>> the atime gets updated. Except that POSIX doesn't specify what
>> happens if the filesystem is mounted read-only but the underlying
>> block device is writable. Reading a file updates the atime (and in
>> fact, this is the way that most stuff that uses atimes cares about
>> them), but even a ro mount allows reading the file.
> As I just wrote in the other post, at least for btrfs (haven't checked
> ext/xfs due to being... well... lazy O:-) ) the ro mount option or an
> ro snapshot seems to mean: no atime updates even if mounted with
> strictatime (or maybe I just did something stupid when checking, so
> better double check).
>> The traditional meaning of ro on UNIX was (AFAIUI) that the directory
>> structure couldn't change, new files couldn't be created, existing
>> files couldn't be deleted, flags on the inodes couldn't be changed,
>> and file data couldn't be changed. TBH, I'm not even certain that
>> atime updates on ro filesystems were an intentional thing in the
>> first place; it really sounds to me like the type of thing that
>> somebody forgot to put in a permissions check for, and then people
>> thought it was a feature.
> Well, in the end it probably doesn't matter how it came to
> existence,... rather what it should be and what it actually is.

Knowing how you got where you are is pretty important for figuring out
how to not end up there again :)

> As said, I, personally, from the user PoV, would say soft-ro already
> includes no dates on files being modifiable (including atime), as I'd
> consider these a property of the file. However, anyone else may of
> course see that differently and at the same time be smarter than I am.

AFAIK, the original versions of UNIX had no touch command or utime()
syscall, so ctime, mtime, and atime were things that just got magically
updated by the system (ctime is still this way), and thus weren't
considered user modification to the filesystem.

>> Also, even with noatime, I'm pretty sure the VFS updates the atime
>> every time the mtime changes
> I've just checked and no, it doesn't:
>   File: 'subvol/FILE'
>   Size: 8    Blocks: 16    IO Block: 4096    regular file
> Device: 30h/48d    Inode: 257    Links: 1
> Access: (0644/-rw-r--r--)  Uid: (0/root)  Gid: (0/root)
> Access: 2015-12-15 00:01:46.452007798 +0100
> Modify: 2015-12-15 00:31:26.579511816 +0100
> Change: 2015-12-15 00:31:26.579511816 +0100
> (rw,noatime mounted,... mtime is more recent than atime)

Hmm, I could have sworn that updating the mtime on a file would force
an atime update.

\me checks documentation.

OK, I was thinking of relatime, which updates the atime if it's older
than the mtime or ctime.

>> (because not doing so would be somewhat stupid, and you're writing
>> the inode anyway), which technically means that stuff could work
>> around this by opening the file, truncating it to the size it
>> already is, and then closing it.
> Hmm, I don't have a strong opinion here...
> It sounds "stupid" from the technical point in that it *could* write
> the atime and that wouldn't cost much. OTOH, that would make things
> more ambiguous as to when atimes change and when not... (they'd only
> change on writes, never on reads,...) So I think it's good as it is...
> and it matches the name, which is noatime - and not
> noatime-unless-on-writes ;-)

Except there are still ways to update the atime even on a filesystem
mounted noatime. For example, on _every_ POSIX compliant system out
there (and Linux itself is mostly POSIX compliant, it's primarily the
userspace that isn't), you can update the atime using the utime()
system call, unless the filesystem is read-only.
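(A concrete illustration of that last point; the sketch below uses
utimensat(), the modern form of utime(), and assumes a POSIX.1-2008
system. An explicit timestamp write like this is not suppressed by the
noatime mount option, which only skips the implicit read-side update;
only a read-only mount makes it fail.)

#include <stdio.h>
#include <fcntl.h>
#include <sys/stat.h>

/*
 * Explicitly set the atime to "now" while leaving the mtime untouched.
 * This is an explicit metadata write, not an implicit read-side update,
 * so noatime does not stop it; only a read-only mount does.
 */
static int touch_atime(const char *path)
{
        struct timespec times[2] = {
                { .tv_nsec = UTIME_NOW  },      /* atime := now */
                { .tv_nsec = UTIME_OMIT },      /* mtime := unchanged */
        };

        return utimensat(AT_FDCWD, path, times, 0);
}

int main(int argc, char **argv)
{
        if (argc > 1 && touch_atime(argv[1]) != 0) {
                perror("utimensat");    /* EROFS on a read-only mount */
                return 1;
        }
        return 0;
}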
Re: attacking btrfs filesystems via UUID collisions?
On 2015-12-14 16:26, Chris Murphy wrote:
> On Mon, Dec 14, 2015 at 6:23 AM, Austin S. Hemmelgarn wrote:
>> Agreed, if you can't substantiate _why_ it's bad practice, then you
>> aren't making a valid argument. The fact that there is software that
>> doesn't handle it well would say to me based on established practice
>> that that software is what's broken, not common practice.
> The automobile is invented and due to the ensuing chaos, common
> practice of doing whatever the F you wanted came to an end in favor of
> rules of the road and traffic lights. I'm sure some people went
> ballistic, but for the most part things were much better without the
> brokenness of prior common practice.

Except for one thing: automobiles actually provide a measurable,
significant benefit to society. What specific benefit does embedding
the filesystem UUID in the metadata actually provide?

> So the fact we're going to have this problem with all file systems
> that incorporate the volume UUID into the metadata stream tells me
> that the very rudimentary common practice of using dd needs to go
> away, in general practice. I've already said data recovery (including
> forensics) and sticking drives away on a shelf could be reasonable.
>
>> The assumption that a UUID is actually unique is an inherently flawed
>> one, because it depends both on the method of generation guaranteeing
>> it's unique (and none of the defined methods guarantee that), and a
>> distinct absence of malicious intent.
> http://www.ietf.org/rfc/rfc4122.txt
> "A UUID is 128 bits long, and can guarantee uniqueness across space
> and time."
>
> Also see security considerations in section 6.

Both aspects ignore the facts that:
Version 1 is easy to cause a collision with (MAC addresses are by no
means unique, and are easy to spoof, and so are timestamps).
Version 2 is relatively easy to cause a collision with, because UID and
GID numbers are a fixed-size namespace.
Version 3 is slightly better, but still not by any means unique,
because you just have to guess the seed string (or a collision for it).
Version 4 is probably the hardest to get a collision with, but only if
you are using a true RNG, and even then, 122 bits of entropy is not
much protection.
Version 5 has the same issues as Version 3, but is more secure against
hash collisions.

In general, you should only use UUIDs when either:
a. You have absolutely 100% complete control of the storage of them,
such that you can guarantee they don't get reused.
b. They can be guaranteed to be relatively unique for the system using
them.

>> On that note, why exactly is it better to make the filesystem UUID
>> such an integral part of the filesystem? The other thing I'm reading
>> out of this all, is that by writing a total of 64 bytes to a specific
>> location in a single disk in a multi-device BTRFS filesystem, you can
>> make the whole filesystem fall apart, which is absolutely absurd.
> OK maybe I'm missing something.
>
> 1. UUID is 128 bits. So where are you getting the additional 48 bytes
> from?
> 2. The volume UUID is in every superblock, which for all practical
> purposes means at least two instances of that UUID per device.
>
> Are you saying the file system falls apart when changing just one of
> those volume UUIDs in one superblock? And how does it fall apart? I'd
> say all volume UUID instances (each superblock, on every device)
> should be checked, and if any of them mismatch then fail to mount.

You're right, it would probably take writing all the SBs (although I'm
not 100% certain that we actually check that the SB UUIDs match).
The extra bytes, which I grossly miscalculated, are for the SB
checksum, which would have to be updated to match the new SB.

> There could be some leveraging of the device WWN, or absent that its
> serial number, propagated into all of the volume's devices (cross
> referencing each other's devid to WWN or serial). And then that way
> there's a way to differentiate. In the dd case, there would be
> mismatching real device WWN/serial number and the one written in
> metadata on all drives, including the copy. This doesn't say what
> policy should happen next, just that at least it's known there's a
> mismatch.

That gets tricky too, because for example you have stuff like flat
files used as filesystem images. However, if we then use some separate
UUID (possibly hashed off of the file location) in place of the device
serial/WWN, that could theoretically provide some better protection.
The obvious solution in the case of a mismatch would be to refuse the
mount until either the issue is fixed using the tools, or the user
specifies some particular mount option to either fix it automatically
or ignore copies with a mismatching serial.
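(To make the superblock arithmetic above concrete: the primary btrfs
superblock lives at a fixed 64KiB offset, and per the published on-disk
format the 16-byte fsid immediately follows the 32-byte checksum field.
The offsets below are assumptions taken from that documentation, not
verified against the btrfs sources; this is only a sketch of how little
on-disk data pins down the volume identity being discussed.)

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

/* Assumed layout: primary superblock at 64KiB, fsid right after the
 * 32-byte csum field at the start of the superblock. */
#define BTRFS_SB_OFFSET 65536
#define FSID_OFFSET     (BTRFS_SB_OFFSET + 32)

int main(int argc, char **argv)
{
        uint8_t fsid[16];
        int fd, i;

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 ||
            pread(fd, fsid, sizeof(fsid), FSID_OFFSET) != (ssize_t)sizeof(fsid))
                return 1;

        /* Print the volume UUID that every metadata block references. */
        for (i = 0; i < 16; i++)
                printf("%02x", fsid[i]);
        printf("\n");
        close(fd);
        return 0;
}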
Re: attacking btrfs filesystems via UUID collisions?
On Tue, Dec 15, 2015 at 08:54:01AM -0500, Austin S. Hemmelgarn wrote:
> On 2015-12-14 16:26, Chris Murphy wrote:
> > On Mon, Dec 14, 2015 at 6:23 AM, Austin S. Hemmelgarn wrote:
> >>
> >> Agreed, if you can't substantiate _why_ it's bad practice, then you
> >> aren't making a valid argument. The fact that there is software that
> >> doesn't handle it well would say to me based on established practice
> >> that that software is what's broken, not common practice.
> >
> > The automobile is invented and due to the ensuing chaos, common
> > practice of doing whatever the F you wanted came to an end in favor
> > of rules of the road and traffic lights. I'm sure some people went
> > ballistic, but for the most part things were much better without the
> > brokenness of prior common practice.
> Except for one thing: automobiles actually provide a measurable,
> significant benefit to society. What specific benefit does
> embedding the filesystem UUID in the metadata actually provide?

That one's easy to answer. It deals with a major issue that reiserfs
had: if you have a filesystem with another filesystem image stored on
it, reiserfsck could end up deciding that both the metadata blocks of
the main filesystem *and* the metadata blocks of the image were part of
the same FS (because they're on the same block device), and so would
splice both filesystems into one, generally complaining loudly along
the way that there was a lot of corruption present that it was trying
to fix.

Putting the UUID of the FS into the metadata blocks means that the kind
of low-level check/repair attempt which scans for "stuff that looks
like metadata" can at least distinguish between the stuff that's really
metadata and the stuff that's just data that looks like metadata.

Hugo.

> > So the fact we're going to have this problem with all file systems
> > that incorporate the volume UUID into the metadata stream tells me
> > that the very rudimentary common practice of using dd needs to go
> > away, in general practice. I've already said data recovery (including
> > forensics) and sticking drives away on a shelf could be reasonable.
> >
> >> The assumption that a UUID is actually unique is an inherently
> >> flawed one, because it depends both on the method of generation
> >> guaranteeing it's unique (and none of the defined methods guarantee
> >> that), and a distinct absence of malicious intent.
> >
> > http://www.ietf.org/rfc/rfc4122.txt
> > "A UUID is 128 bits long, and can guarantee uniqueness across space
> > and time."
> >
> > Also see security considerations in section 6.
> Both aspects ignore the facts that:
> Version 1 is easy to cause a collision with (MAC addresses are by no
> means unique, and are easy to spoof, and so are timestamps).
> Version 2 is relatively easy to cause a collision with, because UID
> and GID numbers are a fixed-size namespace.
> Version 3 is slightly better, but still not by any means unique,
> because you just have to guess the seed string (or a collision for
> it).
> Version 4 is probably the hardest to get a collision with, but only
> if you are using a true RNG, and even then, 122 bits of entropy is
> not much protection.
> Version 5 has the same issues as Version 3, but is more secure
> against hash collisions.
>
> In general, you should only use UUIDs when either:
> a. You have absolutely 100% complete control of the storage of them,
> such that you can guarantee they don't get reused.
> b. They can be guaranteed to be relatively unique for the system
> using them.
> >
> >> On that note, why exactly is it better to make the filesystem UUID
> >> such an integral part of the filesystem? The other thing I'm reading
> >> out of this all, is that by writing a total of 64 bytes to a
> >> specific location in a single disk in a multi-device BTRFS
> >> filesystem, you can make the whole filesystem fall apart, which is
> >> absolutely absurd.
> >
> > OK maybe I'm missing something.
> >
> > 1. UUID is 128 bits. So where are you getting the additional 48
> > bytes from?
> > 2. The volume UUID is in every superblock, which for all practical
> > purposes means at least two instances of that UUID per device.
> >
> > Are you saying the file system falls apart when changing just one of
> > those volume UUIDs in one superblock? And how does it fall apart? I'd
> > say all volume UUID instances (each superblock, on every device)
> > should be checked, and if any of them mismatch then fail to mount.
> You're right, it would probably take writing all the SBs (although
> I'm not 100% certain that we actually check that the SB UUIDs match).
> The extra bytes, which I grossly miscalculated, are for the SB
> checksum, which would have to be updated to match the new SB.
> >
> > There could be some leveraging of the device WWN, or absent that its
> > serial number, propagated into all of the volume's devices (cross
> > referencing each other's devid to WWN or serial). And then that way
> > there's a way to differentiate. In the dd case, there would be
> > mismatching real device WWN/serial number and the one written in
> > metadata on all drives, including the copy.
Re: attacking btrfs filesystems via UUID collisions?
On 2015-12-14 19:08, Christoph Anton Mitterer wrote:
> On Mon, 2015-12-14 at 08:23 -0500, Austin S. Hemmelgarn wrote:
>> The reason that this isn't quite as high of a concern is because
>> performing this attack requires either root access, or direct
>> physical access to the hardware, and in either case, your system is
>> already compromised.
> Not necessarily. Apart from the ATM image (where most people wouldn't
> call it compromised, just because it's openly accessible on the
> street)

Um, no, you don't have direct physical access to the hardware with an
ATM, at least not unless you are going to take apart the cover and
anything else in your way (and probably set off internal alarms). And
even without that, it's still possible to DoS an ATM without much
effort. Most of them have a 3.5mm headphone jack for TTS for people
with poor vision, and that's more than enough to overload at least part
of the system with a relatively simple to put together bit of
electronics that would cost you less than 10 USD.

> imagine you're running a VM hosting service, where you allow users to
> upload images and have them deployed. In the "cheap" case these will
> end up as regular files, where they couldn't do any harm (even with
> colliding UUIDs)... but even there one would have to expect that the
> hypervisor admin may losetup them for whichever reason.
> But if you offer more professional services, you may give your clients
> e.g. direct access to some storage backend, which is then probably
> also seen on the host by its kernel. And here we already have the case
> that a client could remotely trigger such a collision.

In that particular situation, it's not relevant unless the host admin
goes to mount them. UUID collisions are only an issue if the
filesystems get mounted.

> And remember, things only sound far-fetched until they actually happen
> the first time ;)
>
>> I still think that that isn't a sufficient excuse for not fixing the
>> issue, as there are a number of non-security related issues that can
>> result from this (there are some things that are common practice with
>> LVM or mdraid that can't be done with BTRFS because of this).
> Sure, I guess we agree on that,...
>
> Apart from that, btrfs should be a general purpose fs, and not just a
> desktop or server fs. So edge cases like forensics (where it's common
> that you create bitwise identical images) shouldn't be forgotten
> either.

While I would normally agree, there are ways to work around this in the
forensics case that don't work for any other case (namely, if BTRFS is
built as a module, you can unmount everything, unload the module,
reload it, and only scan the devices you want).

> see below (*)
>
>> On that note, why exactly is it better to make the filesystem UUID
>> such an integral part of the filesystem?
> Well, I think it's a proper way to e.g. handle the multi-device case.
> You have n devices, you want to tell them apart,... using a
> pseudo-random UUID is surely better than giving them numbers.

That's debatable; the same issues are obviously present in both cases
(individual numbers can collide too).

> Same for the fs UUID, e.g. when used for mounting devices whose paths
> aren't stable.

In the case of a sanely designed system using LVM for example, device
paths are stable.

> As said before, using the UUID isn't the problem - not protecting
> against collisions is.

No, the issues are:
1. We assume that the UUID will be unique for the life of the
filesystem, which is not a safe assumption.
2. We don't sanely handle things if it isn't unique.

>> The other thing I'm reading out of this all, is that by writing a
>> total of 64 bytes to a specific location in a single disk in a
>> multi-device BTRFS filesystem, you can make the whole filesystem fall
>> apart, which is absolutely absurd.
> Well,... I don't think that writing *into* the filesystem is covered
> by common practice anymore.

For end users, I agree. Part of the discussion involves attacks on the
system, though, and for an attacker it's not a far stretch to write
directly to the block device if possible (it's even common practice for
bypassing permission checks done in the VFS layer).

> In UNIX, a device (which holds the filesystem) is a file. Therefore
> one can argue: if one copies/duplicates one file (i.e. the fs),
> neither of the two's contents should get corrupted. But if you
> actively write *into* the file by yourself,... then you're simply on
> your own; either you know what you do, or you may just corrupt *that*
> specific file. Of course it should again not lead to any of its clones
> becoming corrupted as well.

My point is that by changing the UUID in a superblock (and properly
updating the checksum for the superblock), you can trivially break a
multi-device filesystem. And it's a whole lot easier to do that than it
is to do the equivalent for LVM. And some recovery situations (think
along the lines of no recovery disk, where you only have busybox or
something similar to work with) may require doing exactly that by hand.

> (*) which is, however, also why you may not be able to unmount the
> devices
Re: attacking btrfs filesystems via UUID collisions?
On 2015-12-15 09:18, Hugo Mills wrote:
> On Tue, Dec 15, 2015 at 08:54:01AM -0500, Austin S. Hemmelgarn wrote:
>> On 2015-12-14 16:26, Chris Murphy wrote:
>>> On Mon, Dec 14, 2015 at 6:23 AM, Austin S. Hemmelgarn wrote:
>>>> Agreed, if you can't substantiate _why_ it's bad practice, then you
>>>> aren't making a valid argument. The fact that there is software
>>>> that doesn't handle it well would say to me based on established
>>>> practice that that software is what's broken, not common practice.
>>> The automobile is invented and due to the ensuing chaos, common
>>> practice of doing whatever the F you wanted came to an end in favor
>>> of rules of the road and traffic lights. I'm sure some people went
>>> ballistic, but for the most part things were much better without the
>>> brokenness of prior common practice.
>> Except for one thing: automobiles actually provide a measurable,
>> significant benefit to society. What specific benefit does embedding
>> the filesystem UUID in the metadata actually provide?
> That one's easy to answer. It deals with a major issue that reiserfs
> had: if you have a filesystem with another filesystem image stored on
> it, reiserfsck could end up deciding that both the metadata blocks of
> the main filesystem *and* the metadata blocks of the image were part
> of the same FS (because they're on the same block device), and so
> would splice both filesystems into one, generally complaining loudly
> along the way that there was a lot of corruption present that it was
> trying to fix.

IIRC, that was because of the way the SB was designed, and is why other
filesystems have a UUID in the superblock.

I probably should have been clearer with my statement; what I meant
was: what specific benefit does using the UUID for multi-device
filesystems to identify the various devices provide?
Re: attacking btrfs filesystems via UUID collisions?
On Tue, Dec 15, 2015 at 09:27:12AM -0500, Austin S. Hemmelgarn wrote:
> On 2015-12-15 09:18, Hugo Mills wrote:
> > On Tue, Dec 15, 2015 at 08:54:01AM -0500, Austin S. Hemmelgarn wrote:
> >> On 2015-12-14 16:26, Chris Murphy wrote:
> >>> On Mon, Dec 14, 2015 at 6:23 AM, Austin S. Hemmelgarn wrote:
> >>>> Agreed, if you can't substantiate _why_ it's bad practice, then
> >>>> you aren't making a valid argument. The fact that there is
> >>>> software that doesn't handle it well would say to me based on
> >>>> established practice that that software is what's broken, not
> >>>> common practice.
> >>>
> >>> The automobile is invented and due to the ensuing chaos, common
> >>> practice of doing whatever the F you wanted came to an end in
> >>> favor of rules of the road and traffic lights. I'm sure some
> >>> people went ballistic, but for the most part things were much
> >>> better without the brokenness of prior common practice.
> >> Except for one thing: automobiles actually provide a measurable,
> >> significant benefit to society. What specific benefit does
> >> embedding the filesystem UUID in the metadata actually provide?
> >
> > That one's easy to answer. It deals with a major issue that
> > reiserfs had: if you have a filesystem with another filesystem image
> > stored on it, reiserfsck could end up deciding that both the metadata
> > blocks of the main filesystem *and* the metadata blocks of the image
> > were part of the same FS (because they're on the same block device),
> > and so would splice both filesystems into one, generally complaining
> > loudly along the way that there was a lot of corruption present that
> > it was trying to fix.
> IIRC, that was because of the way the SB was designed, and is why
> other filesystems have a UUID in the superblock.
>
> I probably should have been clearer with my statement; what I meant
> was: what specific benefit does using the UUID for multi-device
> filesystems to identify the various devices provide?

Well, given a bunch of block devices, how do you identify which ones to
use for each of the (unknown number of) filesystems in the system? You
can either use some kind of config file, which is going to get out of
date as device enumeration orders change or as devices are added/deleted
from the FS, or you can try to identify the devices that belong together
automatically in some way. btrfs uses the latter option (with the former
option kind of supported using the device= mount option).

The use of a UUID isn't fundamental to the latter process, but anything
that you replaced the UUID with would have the same issues that we're
seeing here -- make a duplicate of the device at the block level, and
you get additional devices that look like they should be part of the FS.

The question is not how you avoid duplicating the UUIDs, but how you
identify that there are duplicates present, and how you deal with that
issue once you've detected them. This is complicated by the fact that
it's perfectly legitimate to have two block devices in the system that
identify themselves as the same device for the same filesystem -- this
happens when they're different views of the same underlying storage
through multipathing.

I would suggest trying to migrate to a state where detecting more than
one device with the same UUID and devid is cause to prevent the FS from
mounting, unless there's also a
"mount_duplicates_yes_i_know_this_is_dangerous_and_i_know_what_im_doing"
mount flag present, for the multipathing people.

That will break existing userspace behaviour for the multipathing case,
but the migration can probably be managed. (e.g. NFS has successfully
changed default behaviour for one of its mount options in the last
few(?) years).

Hugo.

--
Hugo Mills             | I think that everything darkling says is actually a
hugo@... carfax.org.uk | joke. It's just that we haven't worked out most of
http://carfax.org.uk/  | them yet.
PGP: E2AB1DE4          |                                               Vashka
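(A sketch of the refusal policy Hugo proposes, written against a
hypothetical device-scan table; the structure and function names below
are illustrative only and do not correspond to actual btrfs-progs or
kernel code.)

#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <stdint.h>

/* One entry per scanned block device claiming membership in a volume. */
struct scanned_dev {
        uint8_t  fsid[16];      /* volume UUID from the superblock */
        uint64_t devid;         /* per-device id within the volume */
        const char *path;
};

static bool same_identity(const struct scanned_dev *a,
                          const struct scanned_dev *b)
{
        return a->devid == b->devid &&
               memcmp(a->fsid, b->fsid, sizeof(a->fsid)) == 0;
}

/* Returns true if assembly may proceed: refuse when two devices claim
 * the same (fsid, devid) pair, unless the admin explicitly opted in
 * (the multipathing case). */
static bool check_duplicates(const struct scanned_dev *devs, size_t n,
                             bool allow_duplicates_opt_in)
{
        for (size_t i = 0; i < n; i++)
                for (size_t j = i + 1; j < n; j++)
                        if (same_identity(&devs[i], &devs[j]) &&
                            !allow_duplicates_opt_in)
                                return false;   /* refuse to mount */
        return true;
}

int main(void)
{
        struct scanned_dev devs[2] = {
                { { 0xab }, 1, "/dev/sda1" },
                { { 0xab }, 1, "/dev/sdb1" },   /* dd clone of sda1 */
        };

        return check_duplicates(devs, 2, false) ? 0 : 1;  /* exits 1: refused */
}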
Re: dear developers, can we have notdatacow + checksumming, plz?
On 2015-12-14 22:15, Christoph Anton Mitterer wrote:
> On Mon, 2015-12-14 at 09:16 -0500, Austin S. Hemmelgarn wrote:
>>> When one starts to get a bit deeper into btrfs (from the
>>> admin/end-user side) one sooner or later stumbles across the
>>> recommendation/need to use nodatacow for certain types of data (DBs,
>>> VM images, etc.), the reason, AFAIU, being the inherent fragmentation
>>> that comes along with the CoW, which is especially noticeable for
>>> those types of files with lots of random internal writes.
>> It is worth pointing out that in the case of DBs at least, this is
>> because at least some of them do COW internally to provide the
>> transactional semantics that are required for many workloads.
> Guess that also applies to some VM images then; IIRC qcow2 does CoW.

Yep, and I think that VMware's image format does too.

>>> a) for performance reasons (when I consider our research software
>>> which often has IO as the limiting factor and where we want as much
>>> IO being used by actual programs as possible)...
>> There are other things that can be done to improve this. I would
>> assume of course that you're already doing some of them (stuff like
>> using dedicated storage controller cards instead of the stuff on the
>> motherboard), but some things often get overlooked, like actually
>> taking the time to fine-tune the I/O scheduler for the workload
>> (Linux has particularly brain-dead default settings for CFQ, and the
>> deadline I/O scheduler is only good in hard-real-time usage or on
>> small hard drives that actually use spinning disks).
> Well sure, I think we've done most of this and have dedicated
> controllers, at least of a quality that funding allows us ;-)
> But regardless of how much one tunes, and how good the hardware is:
> if you'd then always lose a fraction of your overall IO, and be it
> just 5%, to defragging these types of files, one may actually want to
> avoid this at all, for which nodatacow seems *the* solution.

nodatacow only works for that if the file is pre-allocated; if it
isn't, then it still ends up fragmented.

>> The big argument for defragmenting a SSD is that it makes it such
>> that you require fewer I/O requests to the device to read a file
> I've read about that too, but since I haven't had much personal
> experience or measurements in that respect, I didn't list it :)

I can't give any real numbers, but I've seen noticeable performance
improvements on good SSDs (Intel, Samsung, and Crucial) when making
sure that things are defragmented.

>> The problem is not entirely the lack of COW semantics, it's also the
>> fact that it's impossible to implement an atomic write on a hard
>> disk.
> Sure... but that's just the same for the nodatacow writes of data.
> (And the same, AFAIU, for CoW itself, just that we'd notice any
> corruption in case of a crash due to the CoWed nature of the fs and
> could go back to the last generation.)

Yes, but it's also the reason that using either COW or a log-structured
filesystem (like NILFS2, LogFS, or I think F2FS) is important for
consistency.

>>> but I wouldn't know that relational DBs really do checksumming of
>>> the data.
>> All the ones I know of except GDBM and BerkDB do in fact provide the
>> option of checksumming. It's pretty much mandatory if you want to be
>> considered for usage in financial, military, or medical applications.
> Hmm I see... PostgreSQL seems to have it since 9.3... didn't know
> that... only crc16, but at least something.
>
>>> Long story short, it does happen every now and then, that a scrub
>>> shows file errors, where neither the RAID was broken, nor were there
>>> any block errors reported by the disks, nor anything suspicious in
>>> SMART. In other words, silent block corruption.
>> Or a transient error in system RAM that ECC didn't catch, or an
>> undetected error in the physical link layer to the disks, or an error
>> in the disk cache or controller, or any number of other things.
> Well sure,... I was referring to these particular cases, where silent
> block corruption was the most likely reason. The data was reproducibly
> read identical, which probably rules out bad RAM or controller, etc.

BTRFS could only protect against some cases, not all (for example, if
you have a big enough error in RAM that ECC doesn't catch it, you've
got serious issues that just about nothing short of a cold reboot can
save you from).

> Sure, I haven't claimed that checksumming for no-CoWed data is a
> solution for everything. But, AFAIU, not doing CoW while not having a
> journal (or does it have one for these cases???) almost certainly
> means that the data (not necessarily the fs) will be inconsistent in
> case of a crash during a no-CoWed write anyway, right? Wouldn't it be
> basically like ext2?
>> Kind of, but not quite. Even with nodatacow, metadata is still COW,
>> which is functionally as safe as a traditional journaling filesystem
>> like XFS or ext4.
> Sure, I was referring to the data part only; I should have made that
> more clear.

Absolute worst case scenario for both nodatacow on BTRFS, and a
traditional journaling filesystem,
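(For reference on the checksum strengths discussed above: btrfs's
per-block data checksum is a CRC32C. Below is a minimal bitwise sketch;
a real implementation would be table-driven or use the SSE4.2 crc32
instruction, and btrfs's exact seeding conventions may differ.)

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78), the
 * same checksum family btrfs applies to each data block. This form is
 * only meant to show how cheap the check is. */
static uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
{
        const uint8_t *p = buf;

        crc = ~crc;
        while (len--) {
                crc ^= *p++;
                for (int k = 0; k < 8; k++)
                        crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
        }
        return ~crc;
}

int main(void)
{
        const char data[] = "123456789";

        /* The standard check value for CRC32C("123456789") is 0xE3069283. */
        printf("%08x\n", crc32c(0, data, sizeof(data) - 1));
        return 0;
}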
Re: attacking btrfs filesystems via UUID collisions?
On 2015-12-15 09:42, Hugo Mills wrote:
> On Tue, Dec 15, 2015 at 09:27:12AM -0500, Austin S. Hemmelgarn wrote:
>> On 2015-12-15 09:18, Hugo Mills wrote:
>>> On Tue, Dec 15, 2015 at 08:54:01AM -0500, Austin S. Hemmelgarn wrote:
>>>> On 2015-12-14 16:26, Chris Murphy wrote:
>>>>> On Mon, Dec 14, 2015 at 6:23 AM, Austin S. Hemmelgarn wrote:
>>>>>> Agreed, if you can't substantiate _why_ it's bad practice, then
>>>>>> you aren't making a valid argument. The fact that there is
>>>>>> software that doesn't handle it well would say to me based on
>>>>>> established practice that that software is what's broken, not
>>>>>> common practice.
>>>>> The automobile is invented and due to the ensuing chaos, common
>>>>> practice of doing whatever the F you wanted came to an end in
>>>>> favor of rules of the road and traffic lights. I'm sure some
>>>>> people went ballistic, but for the most part things were much
>>>>> better without the brokenness of prior common practice.
>>>> Except for one thing: automobiles actually provide a measurable,
>>>> significant benefit to society. What specific benefit does
>>>> embedding the filesystem UUID in the metadata actually provide?
>>> That one's easy to answer. It deals with a major issue that
>>> reiserfs had: if you have a filesystem with another filesystem image
>>> stored on it, reiserfsck could end up deciding that both the
>>> metadata blocks of the main filesystem *and* the metadata blocks of
>>> the image were part of the same FS (because they're on the same
>>> block device), and so would splice both filesystems into one,
>>> generally complaining loudly along the way that there was a lot of
>>> corruption present that it was trying to fix.
>> IIRC, that was because of the way the SB was designed, and is why
>> other filesystems have a UUID in the superblock.
>>
>> I probably should have been clearer with my statement; what I meant
>> was: what specific benefit does using the UUID for multi-device
>> filesystems to identify the various devices provide?
> Well, given a bunch of block devices, how do you identify which ones
> to use for each of the (unknown number of) filesystems in the system?
> You can either use some kind of config file, which is going to get
> out of date as device enumeration orders change or as devices are
> added/deleted from the FS, or you can try to identify the devices
> that belong together automatically in some way. btrfs uses the latter
> option (with the former option kind of supported using the device=
> mount option).
>
> The use of a UUID isn't fundamental to the latter process, but
> anything that you replaced the UUID with would have the same issues
> that we're seeing here -- make a duplicate of the device at the block
> level, and you get additional devices that look like they should be
> part of the FS.
>
> The question is not how you avoid duplicating the UUIDs, but how you
> identify that there are duplicates present, and how you deal with
> that issue once you've detected them. This is complicated by the fact
> that it's perfectly legitimate to have two block devices in the
> system that identify themselves as the same device for the same
> filesystem -- this happens when they're different views of the same
> underlying storage through multipathing.
>
> I would suggest trying to migrate to a state where detecting more
> than one device with the same UUID and devid is cause to prevent the
> FS from mounting, unless there's also a
> "mount_duplicates_yes_i_know_this_is_dangerous_and_i_know_what_im_doing"
> mount flag present, for the multipathing people.
>
> That will break existing userspace behaviour for the multipathing
> case, but the migration can probably be managed. (e.g. NFS has
> successfully changed default behaviour for one of its mount options
> in the last few(?) years).

May I propose the alternative option of adding a flag to tell mount to
_only_ use the devices specified in the options? That would allow
people to work around the common issues (multipath, dm-cache, etc.),
and would provide a way for people who have stable device enumeration
to mitigate the possibility of an attack.
Btrfs: check for empty bitmap list in setup_cluster_bitmaps
Dave Jones found a warning from kasan in setup_cluster_bitmaps()

==================================================================
BUG: KASAN: stack-out-of-bounds in setup_cluster_bitmap+0xc4/0x5a0 at
addr 88039bef6828
Read of size 8 by task nfsd/1009
page:ea000e6fbd80 count:0 mapcount:0 mapping: (null) index:0x0
flags: 0x8000()
page dumped because: kasan: bad access detected
CPU: 1 PID: 1009 Comm: nfsd Tainted: G    W    4.4.0-rc3-backup-debug+ #1
 880065647b50 6bb712c2 88039bef6640 a680a43e
 004559c0 88039bef66c8 a62638d1 a61121c0
 8803a5769de8 0296 8803a5769df0 00046280
Call Trace:
[] dump_stack+0x4b/0x6d
[] kasan_report_error+0x501/0x520
[] ? debug_show_all_locks+0x1e0/0x1e0
[] kasan_report+0x58/0x60
[] ? rb_last+0x10/0x40
[] ? setup_cluster_bitmap+0xc4/0x5a0
[] __asan_load8+0x5d/0x70
[] setup_cluster_bitmap+0xc4/0x5a0
[] ? setup_cluster_no_bitmap+0x6a/0x400
[] btrfs_find_space_cluster+0x4b6/0x640
[] ? btrfs_alloc_from_cluster+0x4e0/0x4e0
[] ? btrfs_return_cluster_to_free_space+0x9e/0xb0
[] ? _raw_spin_unlock+0x27/0x40
[] find_free_extent+0xba1/0x1520

Andrey noticed this was because we were doing list_first_entry on a
list that might be empty.  Rework the tests a bit so we don't do that.

Signed-off-by: Chris Mason
Reported-by: Andrey Ryabinin
Reported-by: Dave Jones

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 0948d34..e6fc7d9 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2972,7 +2972,7 @@ setup_cluster_bitmap(struct btrfs_block_group_cache *block_group,
                      u64 cont1_bytes, u64 min_bytes)
 {
         struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
-        struct btrfs_free_space *entry;
+        struct btrfs_free_space *entry = NULL;
         int ret = -ENOSPC;
         u64 bitmap_offset = offset_to_bitmap(ctl, offset);

@@ -2983,8 +2983,10 @@ setup_cluster_bitmap(struct btrfs_block_group_cache *block_group,
          * The bitmap that covers offset won't be in the list unless offset
          * is just its start offset.
          */
-        entry = list_first_entry(bitmaps, struct btrfs_free_space, list);
-        if (entry->offset != bitmap_offset) {
+        if (!list_empty(bitmaps))
+                entry = list_first_entry(bitmaps, struct btrfs_free_space, list);
+
+        if (!entry || entry->offset != bitmap_offset) {
                 entry = tree_search_offset(ctl, bitmap_offset, 1, 0);
                 if (entry && list_empty(&entry->list))
                         list_add(&entry->list, bitmaps);
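(A standalone userspace re-creation of the bug class this patch fixes,
with simplified stand-ins for the kernel's list macros: on an empty
list the head's next pointer points back at the stack-allocated head
itself, so an unguarded list_first_entry() manufactures a bogus entry
overlaying the stack, which is exactly the stack-out-of-bounds access
KASAN reported. The sketch shows the guarded form the fix uses.)

#include <stdio.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_first_entry(head, type, member) \
        container_of((head)->next, type, member)
#define list_empty(head) ((head)->next == (head))

struct free_space {
        unsigned long offset;
        struct list_head list;
};

int main(void)
{
        struct list_head bitmaps = { &bitmaps, &bitmaps };  /* empty list */
        struct free_space *entry = NULL;

        /* Guarded form, matching the fix: only take the first entry if
         * the list is non-empty; otherwise entry stays NULL and the
         * caller falls back to a different lookup. */
        if (!list_empty(&bitmaps))
                entry = list_first_entry(&bitmaps, struct free_space, list);

        printf(entry ? "got entry\n" : "list was empty, no entry\n");
        return 0;
}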
Re: Btrfs: check for empty bitmap list in setup_cluster_bitmaps
On 12/15/2015 12:08 PM, Chris Mason wrote:
> Dave Jones found a warning from kasan in setup_cluster_bitmaps()
>
> ==================================================================
> BUG: KASAN: stack-out-of-bounds in setup_cluster_bitmap+0xc4/0x5a0 at
> addr 88039bef6828
> Read of size 8 by task nfsd/1009
> page:ea000e6fbd80 count:0 mapcount:0 mapping: (null) index:0x0
> flags: 0x8000()
> page dumped because: kasan: bad access detected
> CPU: 1 PID: 1009 Comm: nfsd Tainted: G    W    4.4.0-rc3-backup-debug+ #1
>  880065647b50 6bb712c2 88039bef6640 a680a43e
>  004559c0 88039bef66c8 a62638d1 a61121c0
>  8803a5769de8 0296 8803a5769df0 00046280
> Call Trace:
> [] dump_stack+0x4b/0x6d
> [] kasan_report_error+0x501/0x520
> [] ? debug_show_all_locks+0x1e0/0x1e0
> [] kasan_report+0x58/0x60
> [] ? rb_last+0x10/0x40
> [] ? setup_cluster_bitmap+0xc4/0x5a0
> [] __asan_load8+0x5d/0x70
> [] setup_cluster_bitmap+0xc4/0x5a0
> [] ? setup_cluster_no_bitmap+0x6a/0x400
> [] btrfs_find_space_cluster+0x4b6/0x640
> [] ? btrfs_alloc_from_cluster+0x4e0/0x4e0
> [] ? btrfs_return_cluster_to_free_space+0x9e/0xb0
> [] ? _raw_spin_unlock+0x27/0x40
> [] find_free_extent+0xba1/0x1520
>
> Andrey noticed this was because we were doing list_first_entry on a
> list that might be empty.  Rework the tests a bit so we don't do that.
>
> Signed-off-by: Chris Mason
> Reported-by: Andrey Ryabinin
> Reported-by: Dave Jones
>
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 0948d34..e6fc7d9 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -2972,7 +2972,7 @@ setup_cluster_bitmap(struct btrfs_block_group_cache *block_group,
>                       u64 cont1_bytes, u64 min_bytes)
>  {
>          struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
> -        struct btrfs_free_space *entry;
> +        struct btrfs_free_space *entry = NULL;
>          int ret = -ENOSPC;
>          u64 bitmap_offset = offset_to_bitmap(ctl, offset);
>
> @@ -2983,8 +2983,10 @@ setup_cluster_bitmap(struct btrfs_block_group_cache *block_group,
>           * The bitmap that covers offset won't be in the list unless offset
>           * is just its start offset.
>           */

Just above this we have an

        if (ctl->total_bitmaps == 0)
                return NULL;

check that should make this useless, which means we're screwing up our
ctl->total_bitmaps counter somehow. We should probably figure out why
that is happening. Thanks,

Josef
Re: !PageLocked BUG_ON hit in clear_page_dirty_for_io
On Tue, Dec 15, 2015 at 12:03 AM, Chris Mason wrote:
> On Tue, Dec 08, 2015 at 11:25:28PM -0500, Dave Jones wrote:
>> Not sure if I've already reported this one, but I've been seeing this
>> a lot this last couple days.
>>
>> kernel BUG at mm/page-writeback.c:2654!
>> invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
>
> We ended up discussing this in more detail on lkml, but I'll summarize
> here.
>
> There were two problems.  First lock_page() might not actually lock the
> page in v4.4-rc4, it can bail out if a signal is pending.  This got
> fixed just before v4.4-rc5, so if you were on rc4, upgrade asap.
>
> Second, prepare_pages had a bug for single page writes:
>
> From f0be89af049857bcc537a53fe2a2fae080e7a5bd Mon Sep 17 00:00:00 2001
> From: Chris Mason
> Date: Mon, 14 Dec 2015 15:40:44 -0800
> Subject: [PATCH] Btrfs: check prepare_uptodate_page() error code earlier
>
> prepare_pages() may end up calling prepare_uptodate_page() twice if our
> write only spans a single page.  But if the first call returns an error,
> our page will be unlocked and it's not safe to call it again.
>
> This bug goes all the way back to 2011, and it's not something commonly
> hit.
>
> While we're here, add a more explicit check for the page being truncated
> away.  The bare lock_page() alone is protected only by good thoughts and
> i_mutex, which we're sure to regret eventually.
>
> Reported-by: Dave Jones
> Signed-off-by: Chris Mason

Reviewed-by: Filipe Manana

> ---
>  fs/btrfs/file.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 72e7346..0f09526 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1291,7 +1291,8 @@ out:
>   * on error we return an unlocked page and the error value
>   * on success we return a locked page and 0
>   */
> -static int prepare_uptodate_page(struct page *page, u64 pos,
> +static int prepare_uptodate_page(struct inode *inode,
> +                                 struct page *page, u64 pos,
>                                   bool force_uptodate)
>  {
>          int ret = 0;
> @@ -1306,6 +1307,10 @@ static int prepare_uptodate_page(struct page *page, u64 pos,
>                          unlock_page(page);
>                          return -EIO;
>                  }
> +                if (page->mapping != inode->i_mapping) {
> +                        unlock_page(page);
> +                        return -EAGAIN;
> +                }
>          }
>          return 0;
>  }
> @@ -1324,6 +1329,7 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
>          int faili;
>
>          for (i = 0; i < num_pages; i++) {
> +again:
>                  pages[i] = find_or_create_page(inode->i_mapping, index + i,
>                                                 mask | __GFP_WRITE);
>                  if (!pages[i]) {
> @@ -1333,13 +1339,17 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
>                  }
>
>                  if (i == 0)
> -                        err = prepare_uptodate_page(pages[i], pos,
> +                        err = prepare_uptodate_page(inode, pages[i], pos,
>                                                      force_uptodate);
> -                if (i == num_pages - 1)
> -                        err = prepare_uptodate_page(pages[i],
> +                if (!err && i == num_pages - 1)
> +                        err = prepare_uptodate_page(inode, pages[i],
>                                                      pos + write_bytes, false);
>                  if (err) {
>                          page_cache_release(pages[i]);
> +                        if (err == -EAGAIN) {
> +                                err = 0;
> +                                goto again;
> +                        }
>                          faili = i - 1;
>                          goto fail;
>                  }
> --
> 2.4.6
Re: Btrfs: check for empty bitmap list in setup_cluster_bitmaps
On Tue, Dec 15, 2015 at 01:37:01PM -0500, Josef Bacik wrote:
> On 12/15/2015 12:08 PM, Chris Mason wrote:
> > Dave Jones found a warning from kasan in setup_cluster_bitmaps()
> >
> > ==================================================================
> > BUG: KASAN: stack-out-of-bounds in setup_cluster_bitmap+0xc4/0x5a0 at
> > addr 88039bef6828
> > Read of size 8 by task nfsd/1009
> > page:ea000e6fbd80 count:0 mapcount:0 mapping: (null) index:0x0
> > flags: 0x8000()
> > page dumped because: kasan: bad access detected
> > CPU: 1 PID: 1009 Comm: nfsd Tainted: G    W    4.4.0-rc3-backup-debug+ #1
> >  880065647b50 6bb712c2 88039bef6640 a680a43e
> >  004559c0 88039bef66c8 a62638d1 a61121c0
> >  8803a5769de8 0296 8803a5769df0 00046280
> > Call Trace:
> > [] dump_stack+0x4b/0x6d
> > [] kasan_report_error+0x501/0x520
> > [] ? debug_show_all_locks+0x1e0/0x1e0
> > [] kasan_report+0x58/0x60
> > [] ? rb_last+0x10/0x40
> > [] ? setup_cluster_bitmap+0xc4/0x5a0
> > [] __asan_load8+0x5d/0x70
> > [] setup_cluster_bitmap+0xc4/0x5a0
> > [] ? setup_cluster_no_bitmap+0x6a/0x400
> > [] btrfs_find_space_cluster+0x4b6/0x640
> > [] ? btrfs_alloc_from_cluster+0x4e0/0x4e0
> > [] ? btrfs_return_cluster_to_free_space+0x9e/0xb0
> > [] ? _raw_spin_unlock+0x27/0x40
> > [] find_free_extent+0xba1/0x1520
> >
> > Andrey noticed this was because we were doing list_first_entry on a
> > list that might be empty.  Rework the tests a bit so we don't do that.
> >
> > Signed-off-by: Chris Mason
> > Reported-by: Andrey Ryabinin
> > Reported-by: Dave Jones
> >
> > diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> > index 0948d34..e6fc7d9 100644
> > --- a/fs/btrfs/free-space-cache.c
> > +++ b/fs/btrfs/free-space-cache.c
> > @@ -2972,7 +2972,7 @@ setup_cluster_bitmap(struct btrfs_block_group_cache *block_group,
> >                       u64 cont1_bytes, u64 min_bytes)
> >  {
> >          struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
> > -        struct btrfs_free_space *entry;
> > +        struct btrfs_free_space *entry = NULL;
> >          int ret = -ENOSPC;
> >          u64 bitmap_offset = offset_to_bitmap(ctl, offset);
> >
> > @@ -2983,8 +2983,10 @@ setup_cluster_bitmap(struct btrfs_block_group_cache *block_group,
> >           * The bitmap that covers offset won't be in the list unless offset
> >           * is just its start offset.
> >           */
>
> Just above this we have an if (ctl->total_bitmaps == 0) return NULL;
> check that should make this useless, which means we're screwing up our
> ctl->total_bitmaps counter somehow. We should probably figure out why
> that is happening. Thanks,

My best explanation is that btrfs_bitmap_cluster() takes the bitmap out
of the rbtree without dropping ctl->total_bitmaps. So,
setup_cluster_no_bitmap() can't find it. This should require mixed
allocation modes to trigger.

Another path is that during btrfs_write_out_cache() we'll pull entries
out. My relatively new code allows that to happen before commit now, so
it might happen then.

-chris
Re: Still not production ready
On Mon, Dec 14, 2015 at 10:08:16AM +0800, Qu Wenruo wrote: > > > Martin Steigerwald wrote on 2015/12/13 23:35 +0100: > >Hi! > > > >For me it is still not production ready. > > Yes, this is the *FACT* and not everyone has a good reason to deny it. > > >Again I ran into: > > > >btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random > >write into big file > >https://bugzilla.kernel.org/show_bug.cgi?id=90401 > > Not sure about guideline for other fs, but it will attract more dev's > attention if it can be posted to maillist. > > > > > > >No matter whether SLES 12 uses it as default for root, no matter whether > >Fujitsu and Facebook use it: I will not let this onto any customer machine > >without lots and lots of underprovisioning and rigorous free space > >monitoring. > >Actually I will renew my recommendations in my trainings to be careful with > >BTRFS. > > > > From my experience the monitoring would check for: > > > >merkaba:~> btrfs fi show /home > >Label: 'home' uuid: […] > > Total devices 2 FS bytes used 156.31GiB > > devid1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home > > devid2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home > > > >If "used" is same as "size" then make big fat alarm. It is not sufficient for > >it to happen. It can run for quite some time just fine without any issues, > >but > >I never have seen a kworker thread using 100% of one core for extended period > >of time blocking everything else on the fs without this condition being met. > > > > And specially advice on the device size from myself: > Don't use devices over 100G but less than 500G. > Over 100G will leads btrfs to use big chunks, where data chunks can be at > most 10G and metadata to be 1G. > > I have seen a lot of users with about 100~200G device, and hit unbalanced > chunk allocation (10G data chunk easily takes the last available space and > makes later metadata no where to store) Maybe we should tune things so the size of the chunk is based on the space remaining instead of the total space? > > And unfortunately, your fs is already in the dangerous zone. > (And you are using RAID1, which means it's the same as one 170G btrfs with > SINGLE data/meta) > > > > >In addition to that last time I tried it aborts scrub any of my BTRFS > >filesstems. Reported in another thread here that got completely ignored so > >far. I think I could go back to 4.2 kernel to make this work. We'll pick this thread up again, the ones that get fixed the fastest are the ones that we can easily reproduce. The rest need a lot of think time. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
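Martin's "used == size" alarm can be automated. Below is a minimal userspace sketch of such a monitor, assuming the BTRFS_IOC_FS_INFO and BTRFS_IOC_DEV_INFO ioctls exported in <linux/btrfs.h>; the default path and the output format are placeholders, and error handling is trimmed for brevity:

	#include <stdio.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/btrfs.h>

	int main(int argc, char **argv)
	{
		struct btrfs_ioctl_fs_info_args fs;
		__u64 id;
		int fd = open(argc > 1 ? argv[1] : "/home", O_RDONLY);

		memset(&fs, 0, sizeof(fs));
		if (fd < 0 || ioctl(fd, BTRFS_IOC_FS_INFO, &fs) < 0) {
			perror("fs info");
			return 1;
		}
		for (id = 0; id <= fs.max_id; id++) {
			struct btrfs_ioctl_dev_info_args dev;

			memset(&dev, 0, sizeof(dev));
			dev.devid = id;
			if (ioctl(fd, BTRFS_IOC_DEV_INFO, &dev) < 0)
				continue;	/* hole in the devid space */
			printf("devid %llu used %llu of %llu\n",
			       (unsigned long long)dev.devid,
			       (unsigned long long)dev.bytes_used,
			       (unsigned long long)dev.total_bytes);
			/* "used" here is chunk allocation, per btrfs fi show */
			if (dev.bytes_used >= dev.total_bytes)
				fprintf(stderr,
					"ALARM: devid %llu fully allocated\n",
					(unsigned long long)dev.devid);
		}
		close(fd);
		return 0;
	}

Run it from cron against each mount point; a nonzero ALARM line corresponds to Martin's "used is same as size" condition.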
Re: Still not production ready
Am Dienstag, 15. Dezember 2015, 16:59:58 CET schrieb Chris Mason:
> On Mon, Dec 14, 2015 at 10:08:16AM +0800, Qu Wenruo wrote:
> > Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
> > >Hi!
> > >
> > >For me it is still not production ready.
> >
> > Yes, this is the *FACT* and not everyone has a good reason to deny it.
> >
> > >Again I ran into:
> > >
> > >btrfs kworker thread uses up 100% of a Sandybridge core for minutes on
> > >random write into big file
> > >https://bugzilla.kernel.org/show_bug.cgi?id=90401
> >
> > Not sure about guideline for other fs, but it will attract more dev's
> > attention if it can be posted to maillist.
> >
> > >No matter whether SLES 12 uses it as default for root, no matter whether
> > >Fujitsu and Facebook use it: I will not let this onto any customer machine
> > >without lots and lots of underprovisioning and rigorous free space
> > >monitoring. Actually I will renew my recommendations in my trainings to
> > >be careful with BTRFS.
> > >
> > > From my experience the monitoring would check for:
> > >
> > >merkaba:~> btrfs fi show /home
> > >Label: 'home'  uuid: […]
> > >    Total devices 2 FS bytes used 156.31GiB
> > >    devid    1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
> > >    devid    2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home
> > >
> > >If "used" is same as "size" then make big fat alarm. It is not sufficient
> > >for it to happen. It can run for quite some time just fine without any
> > >issues, but I never have seen a kworker thread using 100% of one core
> > >for extended period of time blocking everything else on the fs without
> > >this condition being met.
> >
> > And specially advice on the device size from myself:
> > Don't use devices over 100G but less than 500G.
> > Over 100G will leads btrfs to use big chunks, where data chunks can be at
> > most 10G and metadata to be 1G.
> >
> > I have seen a lot of users with about 100~200G device, and hit unbalanced
> > chunk allocation (10G data chunk easily takes the last available space and
> > makes later metadata no where to store)
>
> Maybe we should tune things so the size of the chunk is based on the
> space remaining instead of the total space?

Still, on my filesystem there was over 1 GiB free in the metadata chunks, so…

… my theory remains: BTRFS has trouble finding free space within chunks at some point.

> > And unfortunately, your fs is already in the dangerous zone.
> > (And you are using RAID1, which means it's the same as one 170G btrfs with
> > SINGLE data/meta)
> >
> > >In addition to that last time I tried it aborts scrub any of my BTRFS
> > >filesstems. Reported in another thread here that got completely ignored
> > >so far. I think I could go back to 4.2 kernel to make this work.
>
> We'll pick this thread up again, the ones that get fixed the fastest are
> the ones that we can easily reproduce. The rest need a lot of think
> time.

I understand. Maybe I just wanted to see at least some sort of a reaction.

I now have 4.4-rc5 running, and the boot crash I had appears to be fixed.
Oh, and I see that scrubbing / at least worked now:

merkaba:~> btrfs scrub status -d /
scrub status for […]
scrub device /dev/dm-5 (id 1) history
	scrub started at Wed Dec 16 00:13:20 2015 and finished after 00:01:42
	total bytes scrubbed: 23.94GiB with 0 errors
scrub device /dev/mapper/msata-debian (id 2) history
	scrub started at Wed Dec 16 00:13:20 2015 and finished after 00:01:34
	total bytes scrubbed: 23.94GiB with 0 errors

Okay, I'll test the other ones tomorrow; maybe this one is fixed meanwhile. Yay!

Thanks,
--
Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [4.3-rc4] scrubbing aborts before finishing (probably solved)
Am Montag, 14. Dezember 2015, 08:59:59 CET schrieb Martin Steigerwald: > Am Mittwoch, 25. November 2015, 16:35:39 CET schrieben Sie: > > Am Samstag, 31. Oktober 2015, 12:10:37 CET schrieb Martin Steigerwald: > > > Am Donnerstag, 22. Oktober 2015, 10:41:15 CET schrieb Martin Steigerwald: > > > > I get this: > > > > > > > > merkaba:~> btrfs scrub status -d / > > > > scrub status for […] > > > > scrub device /dev/mapper/sata-debian (id 1) history > > > > > > > > scrub started at Thu Oct 22 10:05:49 2015 and was aborted > > > > after > > > > 00:00:00 > > > > total bytes scrubbed: 0.00B with 0 errors > > > > > > > > scrub device /dev/dm-2 (id 2) history > > > > > > > > scrub started at Thu Oct 22 10:05:49 2015 and was aborted > > > > after > > > > 00:01:30 > > > > total bytes scrubbed: 23.81GiB with 0 errors > > > > > > > > For / scrub aborts for sata SSD immediately. > > > > > > > > For /home scrub aborts for both SSDs at some time. > > > > > > > > merkaba:~> btrfs scrub status -d /home > > > > scrub status for […] > > > > scrub device /dev/mapper/msata-home (id 1) history > > > > > > > > scrub started at Thu Oct 22 10:09:37 2015 and was aborted > > > > after > > > > 00:01:31 > > > > total bytes scrubbed: 22.03GiB with 0 errors > > > > > > > > scrub device /dev/dm-3 (id 2) history > > > > > > > > scrub started at Thu Oct 22 10:09:37 2015 and was aborted > > > > after > > > > 00:03:34 > > > > total bytes scrubbed: 53.30GiB with 0 errors > > > > > > > > Also single volume BTRFS is affected: > > > > > > > > merkaba:~> btrfs scrub status /daten > > > > scrub status for […] > > > > > > > > scrub started at Thu Oct 22 10:36:38 2015 and was aborted > > > > after > > > > 00:00:00 > > > > total bytes scrubbed: 0.00B with 0 errors > > > > > > > > No errors in dmesg, btrfs device stat or smartctl -a. > > > > > > > > Any known issue? > > > > > > I am still seeing this in 4.3-rc7. It happens so that on one SSD BTRFS > > > doesn´t even start scrubbing. But in the end it aborts it scrubbing > > > anyway. > > > > > > I do not see any other issue so far. But I would really like to be able > > > to > > > scrub my BTRFS filesystems completely again. Any hints? Any further > > > information needed? 
> > > > > > merkaba:~> btrfs scrub status -d / > > > scrub status for […] > > > scrub device /dev/dm-5 (id 1) history > > > > > > scrub started at Sat Oct 31 11:58:45 2015, running for 00:00:00 > > > total bytes scrubbed: 0.00B with 0 errors > > > > > > scrub device /dev/mapper/msata-debian (id 2) status > > > > > > scrub started at Sat Oct 31 11:58:45 2015, running for 00:00:20 > > > total bytes scrubbed: 5.27GiB with 0 errors > > > > > > merkaba:~> btrfs scrub status -d / > > > scrub status for […] > > > scrub device /dev/dm-5 (id 1) history > > > > > > scrub started at Sat Oct 31 11:58:45 2015, running for 00:00:00 > > > total bytes scrubbed: 0.00B with 0 errors > > > > > > scrub device /dev/mapper/msata-debian (id 2) status > > > > > > scrub started at Sat Oct 31 11:58:45 2015, running for 00:00:25 > > > total bytes scrubbed: 6.59GiB with 0 errors > > > > > > merkaba:~> btrfs scrub status -d / > > > scrub status for […] > > > scrub device /dev/dm-5 (id 1) history > > > > > > scrub started at Sat Oct 31 11:58:45 2015, running for 00:00:00 > > > total bytes scrubbed: 0.00B with 0 errors > > > > > > scrub device /dev/mapper/msata-debian (id 2) status > > > > > > scrub started at Sat Oct 31 11:58:45 2015, running for 00:01:25 > > > total bytes scrubbed: 21.97GiB with 0 errors > > > > > > merkaba:~> btrfs scrub status -d / > > > scrub status for […] > > > scrub device /dev/dm-5 (id 1) history > > > > > > scrub started at Sat Oct 31 11:58:45 2015 and was aborted after > > > > > > 00:00:00 total bytes scrubbed: 0.00B with 0 errors > > > scrub device /dev/mapper/msata-debian (id 2) history > > > > > > scrub started at Sat Oct 31 11:58:45 2015 and was aborted after > > > > > > 00:01:32 total bytes scrubbed: 23.63GiB with 0 errors > > > > > > > > > For the sake of it I am going to btrfs check one of the filesystem where > > > BTRFS aborts scrubbing (which is all of the laptop filesystems, not only > > > the RAID 1 one). > > > > > > I will use the /daten filesystem as I can unmount it during laptop > > > runtime > > > easily. There scrubbing aborts immediately: > > > > > > merkaba:~> btrfs scrub start /daten > > > scrub started on /daten, fsid […] (pid=13861) > > > merkaba:~> btrfs scrub status /daten > > > scrub status for […] > > > > > > scrub started at Sat Oct 31 12:04:25 2015 and was aborted after > > > > > > 00:00:00 total bytes scrubbed: 0.00B with 0 errors > > > > > > It is single device: > > > > > > merkaba:~> btrfs
ERROR: did not find source subvol
kernel 4.2.6-301.fc23.x86_64
btrfs-progs-4.2.2-1.fc23.x86_64

This is a new one for me. Two new Btrfs volumes (one single profile, one
2x raid1) both created with this btrfs-progs. And then

[root@f23a chrisbackup]# btrfs send everything-20150922/ | btrfs receive /mnt/br1-500g/
At subvol everything-20150922/
At subvol everything-20150922
ERROR: did not find source subvol.

There are no kernel messages.

[root@f23a chrisbackup]# du -sh everything-20150922/
324G	everything-20150922/
[root@f23a chrisbackup]# du -sh /mnt /b
[root@f23a chrisbackup]# du -sh /mnt/br1-500g/everything-20150922/
322G	/mnt/br1-500g/everything-20150922/

HUH, looks like 2G is missing on the receive side. So it got interrupted for some reason?

btrfs check (v4.3.1) comes up clean on the source, as does a scrub.

So should I retry with -v on the send or the receive side or both?

--
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Will "btrfs check --repair" fix the mounting problem?
Ivan Sizov wrote on 2015/12/15 09:34 +: 2015-12-15 1:42 GMT+00:00 Qu Wenruo : You'll see output like the following: Well block 29491200(gen: 5 level: 0) seems good, and it matches superblock Well block 29376512(gen: 4 level: 0) seems good, but generation/level doesn't match, want gen: 5 level: 0 The match one is not what you're looking for. Try the one whose generation is a little smaller than match one. Then use btrfsck to test if it's OK: $ btrfsck -r /dev/sda1 Try 2~5 times with bytenr whose generation is near the match one. If you're in good luck, you will find one doesn't crash btrfsck. And if that doesn't produce much error, then you can try btrfsck --repair -r to fix it and try mount. I've found a root that doesn't produce backtrace. But extent/chunk allocation errors was found: $ sudo btrfsck --tree-root 535461888 /dev/sda1 parent transid verify failed on 535461888 wanted 21154 found 21150 parent transid verify failed on 535461888 wanted 21154 found 21150 Ignoring transid failure checking extents parent transid verify failed on 459292672 wanted 21148 found 21153 parent transid verify failed on 459292672 wanted 21148 found 21153 Transid failure is OK. Ignoring transid failure bad block 459292672 Errors found in extent allocation tree or chunk allocation parent transid verify failed on 459292672 wanted 21148 found 21153 Should I ignore those errors and run btrfsck --repair? Or --init-extent-tree is needed? Did it btrfsck has other complain? And how is the generation difference between the one you're using and the one in superblock? If the generation difference is larger than 1, I'd recommend not to run '--repair' nor '--init-extent-tree' If the difference is only 1, and btrfsck doesn't report problems other than transid error, I'd like to try --repair or --init-extent-tree. But there is *NO* guarantee and it may still make case worse. Thanks, Qu -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Still not production ready
Chris Mason wrote on 2015/12/15 16:59 -0500:
> On Mon, Dec 14, 2015 at 10:08:16AM +0800, Qu Wenruo wrote:
>> Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
>>> Hi!
>>>
>>> For me it is still not production ready.
>>
>> Yes, this is the *FACT* and not everyone has a good reason to deny it.
>>
>>> Again I ran into:
>>>
>>> btrfs kworker thread uses up 100% of a Sandybridge core for minutes on
>>> random write into big file
>>> https://bugzilla.kernel.org/show_bug.cgi?id=90401
>>
>> Not sure about guideline for other fs, but it will attract more dev's
>> attention if it can be posted to maillist.
>>
>>> No matter whether SLES 12 uses it as default for root, no matter whether
>>> Fujitsu and Facebook use it: I will not let this onto any customer machine
>>> without lots and lots of underprovisioning and rigorous free space
>>> monitoring. Actually I will renew my recommendations in my trainings to
>>> be careful with BTRFS.
>>>
>>> From my experience the monitoring would check for:
>>>
>>> merkaba:~> btrfs fi show /home
>>> Label: 'home'  uuid: […]
>>>     Total devices 2 FS bytes used 156.31GiB
>>>     devid    1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
>>>     devid    2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home
>>>
>>> If "used" is same as "size" then make big fat alarm. It is not sufficient
>>> for it to happen. It can run for quite some time just fine without any
>>> issues, but I never have seen a kworker thread using 100% of one core for
>>> extended period of time blocking everything else on the fs without this
>>> condition being met.
>>
>> And specially advice on the device size from myself:
>> Don't use devices over 100G but less than 500G.
>> Over 100G will leads btrfs to use big chunks, where data chunks can be at
>> most 10G and metadata to be 1G.
>>
>> I have seen a lot of users with about 100~200G device, and hit unbalanced
>> chunk allocation (10G data chunk easily takes the last available space and
>> makes later metadata no where to store)
>
> Maybe we should tune things so the size of the chunk is based on the
> space remaining instead of the total space?

I submitted such a patch before. David pointed out that such behavior would create a lot of small fragmented chunks in the last several GB, which may make balance behavior less predictable than before.

At least we can change the current 10% chunk size limit to 5%, to make the problem harder to trigger. It's a simple and easy solution.

Another cause of the problem is that we underestimated the jump in chunk size for filesystems at the big-chunk borderline.

At 99G, the chunk size limit is 1G, and it takes 99 data chunks to fully cover the fs. But at 100G, only 10 chunks cover the fs, and the fs would need to be 990G before it again took that many chunks.

The sudden drop in chunk count is the root cause.

So we'd better reconsider both the big-chunk threshold and the chunk size limit to find a balanced solution.

Thanks,
Qu

>> And unfortunately, your fs is already in the dangerous zone.
>> (And you are using RAID1, which means it's the same as one 170G btrfs with
>> SINGLE data/meta)
>>
>>> In addition to that last time I tried it aborts scrub any of my BTRFS
>>> filesstems. Reported in another thread here that got completely ignored
>>> so far. I think I could go back to 4.2 kernel to make this work.
>
> We'll pick this thread up again, the ones that get fixed the fastest are
> the ones that we can easily reproduce. The rest need a lot of think
> time.
>
> -chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
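For concreteness, Chris's suggestion above might take a shape like the sketch below — a purely hypothetical helper, with the function name, the 10% fraction, and the clamp values invented for illustration (this is not the btrfs allocator's code):

	#include <linux/types.h>

	/*
	 * Hypothetical: cap each new data chunk at a fraction of the
	 * space still unallocated, clamped so the tail of the device is
	 * neither starved of metadata room nor shattered into tiny
	 * chunks (David's objection to the earlier patch).
	 */
	static u64 chunk_size_limit(u64 bytes_unallocated)
	{
		u64 limit = bytes_unallocated / 10;	/* 10% of what is left */

		if (limit < 256ULL * 1024 * 1024)	/* floor: 256M */
			limit = 256ULL * 1024 * 1024;
		if (limit > 1024ULL * 1024 * 1024)	/* ceiling: 1G stripe cap */
			limit = 1024ULL * 1024 * 1024;
		return limit;
	}

Because the limit shrinks as the device fills, the 99G-versus-100G cliff described above flattens out instead of jumping by a factor of ten at the borderline.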
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
David Sterba wrote on 2015/12/14 18:32 +0100:
> On Thu, Dec 10, 2015 at 10:34:06AM +0800, Qu Wenruo wrote:
>> Introduce a new mount option "nologreplay" to co-operate with "ro" mount
>> option to get real readonly mount, like "norecovery" in ext* and xfs.
>>
>> Since the new parse_options() need to check new flags at remount time,
>> so add a new parameter for parse_options().
>>
>> Signed-off-by: Qu Wenruo
>> Reviewed-by: Chandan Rajendra
>> Tested-by: Austin S. Hemmelgarn
>
> I've read the discussions around the change and from the user's POV I'd
> suggest to add another mount option that would be just an alias for any
> mount options that would implement the 'hard-ro' semantics. Say it's
> called 'nowr'. Now it would imply 'nologreplay', but may cover more
> options in the future.
>
> mount -o ro,nowr /dev/sdx /mnt
>
> would work when switching kernels.

That would be nice.

I'd like to forward the idea/discussion to the wider filesystem mailing lists, not only the btrfs list. Such behavior is best coordinated across all (at least xfs, ext4, and btrfs) filesystems.

One sad example: we can't use a 'norecovery' mount option to disable log replay in btrfs, because a 'recovery' mount option already exists.

So I hope we can have a unified mount option across mainline filesystems.

Thanks,
Qu
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
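The mechanics of the option under discussion are small. A rough sketch of how such a token is wired into the mount-option parser, abridged from the v3 patch's approach — the NOLOGREPLAY flag name follows the patch, btrfs_set_and_info() is the helper from fs/btrfs/ctree.h, and the surrounding option table is elided:

	#include <linux/parser.h>

	enum { Opt_nologreplay, Opt_err };

	static const match_table_t tokens = {
		{ Opt_nologreplay, "nologreplay" },
		{ Opt_err, NULL },
	};

	/* called for each comma-separated token of the mount options */
	static int parse_one_token(struct btrfs_root *root, char *p)
	{
		substring_t args[MAX_OPT_ARGS];

		switch (match_token(p, tokens, args)) {
		case Opt_nologreplay:
			/* sets the bit in fs_info->mount_opt and logs it */
			btrfs_set_and_info(root, NOLOGREPLAY,
					   "disabling log replay at mount time");
			return 0;
		default:
			return -EINVAL;
		}
	}

Whatever alias ("nowr" or otherwise) comes out of the naming discussion would simply set the same bit from a second token entry.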
Re: Btrfs: check for empty bitmap list in setup_cluster_bitmaps
Hi Chris,

One coding comment on this patch. The following lines can be merged into a single call:

	if (!list_empty(bitmaps))
		entry = list_first_entry(bitmaps,
					 struct btrfs_free_space, list);

New change as below:

	entry = list_first_entry_or_null(bitmaps,
					 struct btrfs_free_space, list);

Manish
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
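A self-contained illustration of Manish's point: list_first_entry_or_null() is the stock <linux/list.h> helper that folds the emptiness check into the lookup and returns NULL for an empty list, which is exactly what the fix needs. The struct here is a stand-in with only the fields the sketch uses:

	#include <linux/list.h>

	struct btrfs_free_space_sketch {
		struct list_head list;
		/* other fields elided for the sketch */
	};

	/*
	 * Equivalent to:
	 *	entry = NULL;
	 *	if (!list_empty(bitmaps))
	 *		entry = list_first_entry(bitmaps,
	 *				struct btrfs_free_space_sketch, list);
	 */
	static struct btrfs_free_space_sketch *
	first_bitmap(struct list_head *bitmaps)
	{
		return list_first_entry_or_null(bitmaps,
				struct btrfs_free_space_sketch, list);
	}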
Re: Still not production ready
On Wed, Dec 16, 2015 at 09:20:45AM +0800, Qu Wenruo wrote: > > > Chris Mason wrote on 2015/12/15 16:59 -0500: > >On Mon, Dec 14, 2015 at 10:08:16AM +0800, Qu Wenruo wrote: > >> > >> > >>Martin Steigerwald wrote on 2015/12/13 23:35 +0100: > >>>Hi! > >>> > >>>For me it is still not production ready. > >> > >>Yes, this is the *FACT* and not everyone has a good reason to deny it. > >> > >>>Again I ran into: > >>> > >>>btrfs kworker thread uses up 100% of a Sandybridge core for minutes on > >>>random > >>>write into big file > >>>https://bugzilla.kernel.org/show_bug.cgi?id=90401 > >> > >>Not sure about guideline for other fs, but it will attract more dev's > >>attention if it can be posted to maillist. > >> > >>> > >>> > >>>No matter whether SLES 12 uses it as default for root, no matter whether > >>>Fujitsu and Facebook use it: I will not let this onto any customer machine > >>>without lots and lots of underprovisioning and rigorous free space > >>>monitoring. > >>>Actually I will renew my recommendations in my trainings to be careful with > >>>BTRFS. > >>> > >>> From my experience the monitoring would check for: > >>> > >>>merkaba:~> btrfs fi show /home > >>>Label: 'home' uuid: […] > >>> Total devices 2 FS bytes used 156.31GiB > >>> devid1 size 170.00GiB used 164.13GiB path > >>> /dev/mapper/msata-home > >>> devid2 size 170.00GiB used 164.13GiB path > >>> /dev/mapper/sata-home > >>> > >>>If "used" is same as "size" then make big fat alarm. It is not sufficient > >>>for > >>>it to happen. It can run for quite some time just fine without any issues, > >>>but > >>>I never have seen a kworker thread using 100% of one core for extended > >>>period > >>>of time blocking everything else on the fs without this condition being > >>>met. > >>> > >> > >>And specially advice on the device size from myself: > >>Don't use devices over 100G but less than 500G. > >>Over 100G will leads btrfs to use big chunks, where data chunks can be at > >>most 10G and metadata to be 1G. > >> > >>I have seen a lot of users with about 100~200G device, and hit unbalanced > >>chunk allocation (10G data chunk easily takes the last available space and > >>makes later metadata no where to store) > > > >Maybe we should tune things so the size of the chunk is based on the > >space remaining instead of the total space? > > Submitted such patch before. > David pointed out that such behavior will cause a lot of small fragmented > chunks at last several GB. > Which may make balance behavior not as predictable as before. > > > At least, we can just change the current 10% chunk size limit to 5% to make > such problem less easier to trigger. > It's a simple and easy solution. > > Another cause of the problem is, we understated the chunk size change for fs > at the borderline of big chunk. > > For 99G, its chunk size limit is 1G, and it needs 99 data chunks to fully > cover the fs. > But for 100G, it only needs 10 chunks to covert the fs. > And it need to be 990G to match the number again. max_stripe_size is fixed at 1GB and the chunk size is stripe_size * data_stripes, may I know how your partition gets a 10GB chunk? Thanks, -liubo > > The sudden drop of chunk number is the root cause. > > So we'd better reconsider both the big chunk size limit and chunk size limit > to find a balanaced solution for it. > > Thanks, > Qu > > > >> > >>And unfortunately, your fs is already in the dangerous zone. 
> >>(And you are using RAID1, which means it's the same as one 170G btrfs with > >>SINGLE data/meta) > >> > >>> > >>>In addition to that last time I tried it aborts scrub any of my BTRFS > >>>filesstems. Reported in another thread here that got completely ignored so > >>>far. I think I could go back to 4.2 kernel to make this work. > > > >We'll pick this thread up again, the ones that get fixed the fastest are > >the ones that we can easily reproduce. The rest need a lot of think > >time. > > > >-chris > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
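Liu Bo's arithmetic, spelled out as a small sketch — the stripe counts in the comment are illustrative examples, not a claim about any particular filesystem:

	#include <linux/types.h>

	/*
	 * Chunk size = stripe size * number of data stripes, with the
	 * per-stripe size capped at 1G.  So SINGLE and RAID1 data chunks
	 * top out at 1G; only striped profiles across several devices
	 * grow past that (e.g. RAID0 over 4 devices -> a 4G chunk),
	 * which is why a 10G data chunk shouldn't occur here.
	 */
	static u64 data_chunk_size(int num_data_stripes)
	{
		u64 max_stripe_size = 1024ULL * 1024 * 1024;	/* 1G cap */

		return max_stripe_size * num_data_stripes;
	}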
Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay
On Wed, 2015-12-16 at 09:36 +0800, Qu Wenruo wrote:
> One sad example: we can't use a 'norecovery' mount option to disable
> log replay in btrfs, because a 'recovery' mount option already exists.

I think "norecovery" would anyway not really fit... the name should rather indicate that, from the filesystem's side, nothing changes the underlying device's contents. "norecovery" would just say that no recovery is attempted; any other changes (optimisations, etc.) could still go on.

David's "nowr" is already better, though it could be misread as no write/read ("wr" being "rw" swapped), so perhaps "nowrites" would be better... but that again may be taken as just another name for "ro".

So perhaps something that includes "dev", like "rodev", "ro-dev", or "immutable-dev"... or "devs" instead of "dev" to cover multi-device cases. OTOH, the devices aren't really set "ro" (as in blockdev --setro).

Maybe "nodevwrites" or "no-dev-writes", or one of these with "device" not abbreviated?

Many programs have a "--dry-run" option, but I kinda don't like "drymount" or something like that.

Going by the above, I'd personally like "nodevwrites" the most.

Oh, and Qu's idea of coordinating this with the other filesystems is surely good.

Cheers,
Chris.
Re: Still not production ready
Liu Bo wrote on 2015/12/15 17:53 -0800: On Wed, Dec 16, 2015 at 09:20:45AM +0800, Qu Wenruo wrote: Chris Mason wrote on 2015/12/15 16:59 -0500: On Mon, Dec 14, 2015 at 10:08:16AM +0800, Qu Wenruo wrote: Martin Steigerwald wrote on 2015/12/13 23:35 +0100: Hi! For me it is still not production ready. Yes, this is the *FACT* and not everyone has a good reason to deny it. Again I ran into: btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file https://bugzilla.kernel.org/show_bug.cgi?id=90401 Not sure about guideline for other fs, but it will attract more dev's attention if it can be posted to maillist. No matter whether SLES 12 uses it as default for root, no matter whether Fujitsu and Facebook use it: I will not let this onto any customer machine without lots and lots of underprovisioning and rigorous free space monitoring. Actually I will renew my recommendations in my trainings to be careful with BTRFS. From my experience the monitoring would check for: merkaba:~> btrfs fi show /home Label: 'home' uuid: […] Total devices 2 FS bytes used 156.31GiB devid1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home devid2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home If "used" is same as "size" then make big fat alarm. It is not sufficient for it to happen. It can run for quite some time just fine without any issues, but I never have seen a kworker thread using 100% of one core for extended period of time blocking everything else on the fs without this condition being met. And specially advice on the device size from myself: Don't use devices over 100G but less than 500G. Over 100G will leads btrfs to use big chunks, where data chunks can be at most 10G and metadata to be 1G. I have seen a lot of users with about 100~200G device, and hit unbalanced chunk allocation (10G data chunk easily takes the last available space and makes later metadata no where to store) Maybe we should tune things so the size of the chunk is based on the space remaining instead of the total space? Submitted such patch before. David pointed out that such behavior will cause a lot of small fragmented chunks at last several GB. Which may make balance behavior not as predictable as before. At least, we can just change the current 10% chunk size limit to 5% to make such problem less easier to trigger. It's a simple and easy solution. Another cause of the problem is, we understated the chunk size change for fs at the borderline of big chunk. For 99G, its chunk size limit is 1G, and it needs 99 data chunks to fully cover the fs. But for 100G, it only needs 10 chunks to covert the fs. And it need to be 990G to match the number again. max_stripe_size is fixed at 1GB and the chunk size is stripe_size * data_stripes, may I know how your partition gets a 10GB chunk? Oh, it seems that I remembered the wrong size. After checking the code, yes you're right. A stripe won't be larger than 1G, so my assumption above is totally wrong. And the problem is not in the 10% limit. Please forget it. Thanks, Qu Thanks, -liubo The sudden drop of chunk number is the root cause. So we'd better reconsider both the big chunk size limit and chunk size limit to find a balanaced solution for it. Thanks, Qu And unfortunately, your fs is already in the dangerous zone. (And you are using RAID1, which means it's the same as one 170G btrfs with SINGLE data/meta) In addition to that last time I tried it aborts scrub any of my BTRFS filesstems. 
Reported in another thread here that got completely ignored so far. I think I could go back to 4.2 kernel to make this work. We'll pick this thread up again, the ones that get fixed the fastest are the ones that we can easily reproduce. The rest need a lot of think time. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Still not production ready
On Wed, Dec 16, 2015 at 10:19:00AM +0800, Qu Wenruo wrote: > > > Liu Bo wrote on 2015/12/15 17:53 -0800: > >On Wed, Dec 16, 2015 at 09:20:45AM +0800, Qu Wenruo wrote: > >> > >> > >>Chris Mason wrote on 2015/12/15 16:59 -0500: > >>>On Mon, Dec 14, 2015 at 10:08:16AM +0800, Qu Wenruo wrote: > > > Martin Steigerwald wrote on 2015/12/13 23:35 +0100: > >Hi! > > > >For me it is still not production ready. > > Yes, this is the *FACT* and not everyone has a good reason to deny it. > > >Again I ran into: > > > >btrfs kworker thread uses up 100% of a Sandybridge core for minutes on > >random > >write into big file > >https://bugzilla.kernel.org/show_bug.cgi?id=90401 > > Not sure about guideline for other fs, but it will attract more dev's > attention if it can be posted to maillist. > > > > > > >No matter whether SLES 12 uses it as default for root, no matter whether > >Fujitsu and Facebook use it: I will not let this onto any customer > >machine > >without lots and lots of underprovisioning and rigorous free space > >monitoring. > >Actually I will renew my recommendations in my trainings to be careful > >with > >BTRFS. > > > > From my experience the monitoring would check for: > > > >merkaba:~> btrfs fi show /home > >Label: 'home' uuid: […] > > Total devices 2 FS bytes used 156.31GiB > > devid1 size 170.00GiB used 164.13GiB path > > /dev/mapper/msata-home > > devid2 size 170.00GiB used 164.13GiB path > > /dev/mapper/sata-home > > > >If "used" is same as "size" then make big fat alarm. It is not > >sufficient for > >it to happen. It can run for quite some time just fine without any > >issues, but > >I never have seen a kworker thread using 100% of one core for extended > >period > >of time blocking everything else on the fs without this condition being > >met. > > > > And specially advice on the device size from myself: > Don't use devices over 100G but less than 500G. > Over 100G will leads btrfs to use big chunks, where data chunks can be at > most 10G and metadata to be 1G. > > I have seen a lot of users with about 100~200G device, and hit unbalanced > chunk allocation (10G data chunk easily takes the last available space and > makes later metadata no where to store) > >>> > >>>Maybe we should tune things so the size of the chunk is based on the > >>>space remaining instead of the total space? > >> > >>Submitted such patch before. > >>David pointed out that such behavior will cause a lot of small fragmented > >>chunks at last several GB. > >>Which may make balance behavior not as predictable as before. > >> > >> > >>At least, we can just change the current 10% chunk size limit to 5% to make > >>such problem less easier to trigger. > >>It's a simple and easy solution. > >> > >>Another cause of the problem is, we understated the chunk size change for fs > >>at the borderline of big chunk. > >> > >>For 99G, its chunk size limit is 1G, and it needs 99 data chunks to fully > >>cover the fs. > >>But for 100G, it only needs 10 chunks to covert the fs. > >>And it need to be 990G to match the number again. > > > >max_stripe_size is fixed at 1GB and the chunk size is stripe_size * > >data_stripes, > >may I know how your partition gets a 10GB chunk? > > Oh, it seems that I remembered the wrong size. > After checking the code, yes you're right. > A stripe won't be larger than 1G, so my assumption above is totally wrong. > > And the problem is not in the 10% limit. > > Please forget it. No problem, glad to see people talking about the space issue again. 
Thanks, -liubo > > Thanks, > Qu > > > > > > >Thanks, > > > >-liubo > > > > > >> > >>The sudden drop of chunk number is the root cause. > >> > >>So we'd better reconsider both the big chunk size limit and chunk size limit > >>to find a balanaced solution for it. > >> > >>Thanks, > >>Qu > >>> > > And unfortunately, your fs is already in the dangerous zone. > (And you are using RAID1, which means it's the same as one 170G btrfs with > SINGLE data/meta) > > > > >In addition to that last time I tried it aborts scrub any of my BTRFS > >filesstems. Reported in another thread here that got completely ignored > >so > >far. I think I could go back to 4.2 kernel to make this work. > >>> > >>>We'll pick this thread up again, the ones that get fixed the fastest are > >>>the ones that we can easily reproduce. The rest need a lot of think > >>>time. > >>> > >>>-chris > >>> > >>> > >> > >> > >>-- > >>To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > >>the body of a message to majord...@vger.kernel.org > >>More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > -- To unsubscribe from this list: send the line "un
[PATCH] Btrfs: fix output of compression message in btrfs_parse_options()
The compression message might not be correctly output. Fix it. [[before fix]] # mount -o compress /dev/sdb3 /test3 [ 996.874264] BTRFS info (device sdb3): disk space caching is enabled [ 996.874268] BTRFS: has skinny extents # mount | grep /test3 /dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/) # mount -o remount,compress-force /dev/sdb3 /test3 [ 1035.075017] BTRFS info (device sdb3): force zlib compression [ 1035.075021] BTRFS info (device sdb3): disk space caching is enabled # mount | grep /test3 /dev/sdb3 on /test3 type btrfs (rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/) # mount -o remount,compress /dev/sdb3 /test3 [ 1053.679092] BTRFS info (device sdb3): disk space caching is enabled [root@luna compress-info]# mount | grep /test3 /dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/) [[after fix]] # mount -o compress /dev/sdb3 /test3 [ 401.021753] BTRFS info (device sdb3): use zlib compression [ 401.021758] BTRFS info (device sdb3): disk space caching is enabled [ 401.021760] BTRFS: has skinny extents # mount | grep /test3 /dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/) # mount -o remount,compress-force /dev/sdb3 /test3 [ 439.824624] BTRFS info (device sdb3): force zlib compression [ 439.824629] BTRFS info (device sdb3): disk space caching is enabled # mount | grep /test3 /dev/sdb3 on /test3 type btrfs (rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/) # mount -o remount,compress /dev/sdb3 /test3 [ 459.918430] BTRFS info (device sdb3): use zlib compression [ 459.918434] BTRFS info (device sdb3): disk space caching is enabled # mount | grep /test3 /dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/) Signed-off-by: Tsutomu Itoh --- fs/btrfs/disk-io.c | 2 +- fs/btrfs/super.c | 21 ++--- 2 files changed, 15 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 974be09..dcc1f15 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2709,7 +2709,7 @@ int open_ctree(struct super_block *sb, * In the long term, we'll store the compression type in the super * block, and it'll be used for per file compression control. 
*/ - fs_info->compress_type = BTRFS_COMPRESS_ZLIB; + fs_info->compress_type = BTRFS_COMPRESS_NONE; ret = btrfs_parse_options(tree_root, options); if (ret) { diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 24154e4..e2e8a54 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -381,6 +381,8 @@ int btrfs_parse_options(struct btrfs_root *root, char *options) int ret = 0; char *compress_type; bool compress_force = false; + enum btrfs_compression_type saved_compress_type; + bool saved_compress_force; cache_gen = btrfs_super_cache_generation(root->fs_info->super_copy); if (cache_gen) @@ -458,6 +460,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options) /* Fallthrough */ case Opt_compress: case Opt_compress_type: + saved_compress_type = info->compress_type; + saved_compress_force = + btrfs_test_opt(root, FORCE_COMPRESS); if (token == Opt_compress || token == Opt_compress_force || strcmp(args[0].from, "zlib") == 0) { @@ -475,6 +480,7 @@ int btrfs_parse_options(struct btrfs_root *root, char *options) btrfs_set_fs_incompat(info, COMPRESS_LZO); } else if (strncmp(args[0].from, "no", 2) == 0) { compress_type = "no"; + info->compress_type = BTRFS_COMPRESS_NONE; btrfs_clear_opt(info->mount_opt, COMPRESS); btrfs_clear_opt(info->mount_opt, FORCE_COMPRESS); compress_force = false; @@ -484,14 +490,8 @@ int btrfs_parse_options(struct btrfs_root *root, char *options) } if (compress_force) { - btrfs_set_and_info(root, FORCE_COMPRESS, - "force %s compression", - compress_type); + btrfs_set_opt(info->mount_opt, FORCE_COMPRESS); } else { - if (!btrfs_test_opt(root, COMPRESS)) - btrfs_info(root->fs_info, - "btrfs: use %s compression", - compress_type); /*
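The hunk above is cut off just as the new message logic begins. Based on the saved_compress_type/saved_compress_force variables the patch introduces, the intended tail is presumably along these lines — an assumption sketched from the commit message and the mount logs shown, not the verbatim remainder of the patch:

	/* inside btrfs_parse_options(), continuing the Opt_compress case:
	 * print the compression message only when the effective setting
	 * actually changed, so a remount neither swallows the
	 * "use %s compression" line nor repeats stale state */
	if (compress_force != saved_compress_force ||
	    info->compress_type != saved_compress_type)
		btrfs_info(root->fs_info, "%s %s compression",
			   compress_force ? "force" : "use",
			   compress_type);

Together with initializing fs_info->compress_type to BTRFS_COMPRESS_NONE in open_ctree(), the first "-o compress" mount is now seen as a change and logged, as the before/after transcripts show.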
need to recover large file
Hi: I just screwed up… spent the last 3 weeks generting a 400G file (genome assembly) . Went to back it up and swapped the arguments to tar (tar Jcf my_precious my_precious.tar.xz) what was once 400G is now 108 bytes of xz header - argh. This is on a 6-volume btrfs filesystem. I immediately unmounted the fs (had to cd / first). After a bit of searching I found Chris Mason’s post about using btrfs-debug-tree -R root tree: 25606900367360 level 1 chunk tree: 25606758596608 level 1 extent tree key (EXTENT_TREE ROOT_ITEM 0) 25606900383744 level 2 device tree key (DEV_TREE ROOT_ITEM 0) 25606865108992 level 1 fs tree key (FS_TREE ROOT_ITEM 0) 22983583956992 level 1 checksum tree key (CSUM_TREE ROOT_ITEM 0) 25606922682368 level 2 uuid tree key (UUID_TREE ROOT_ITEM 0) 22984609513472 level 0 data reloc tree key (DATA_RELOC_TREE ROOT_ITEM 0) 22984615477248 level 0 btrfs root backup slot 0 tree root gen 21613 block 25606891274240 extent root gen 21613 block 25606891323392 chunk root gen 21612 block 25606758596608 device root gen 21612 block 25606865108992 csum root gen 21613 block 25606891503616 fs root gen 21611 block 22983583956992 1712858198016 used 5400011280384 total 6 devices btrfs root backup slot 1 tree root gen 21614 block 25606900367360 extent root gen 21614 block 25606900383744 chunk root gen 21612 block 25606758596608 device root gen 21612 block 25606865108992 csum root gen 21614 block 25606922682368 fs root gen 21611 block 22983583956992 1712858198016 used 5400011280384 total 6 devices btrfs root backup slot 2 tree root gen 21611 block 25606857605120 extent root gen 21611 block 22983584268288 chunk root gen 21595 block 25606758612992 device root gen 21601 block 22983580794880 csum root gen 21611 block 22983584333824 fs root gen 21611 block 22983583956992 1712971542528 used 5400011280384 total 6 devices btrfs root backup slot 3 tree root gen 21612 block 25606874546176 extent root gen 21612 block 25606880575488 chunk root gen 21612 block 25606758596608 device root gen 21612 block 25606865108992 csum root gen 21612 block 25606890864640 fs root gen 21611 block 22983583956992 1712971444224 used 5400011280384 total 6 devices total bytes 5400011280384 bytes used 1712858198016 uuid b13d3dc1-f287-483c-8b7d-b142f31fe6df Btrfs v3.12 if found the oldest generation and grabbed the tree root gen block like this sudo btrfs restore -o -v -t 25606857605120 --path-regex ^/\(\|deer\(\|/masurca\(\|/quorum_mer_db.jf\)\)\)$ /dev/sdd1 /tmp/recover/ unfortunately, I only recovered the post error file. I also read that one can use btrfs-find-root to get a list of files to recover and just ran btrfs-find-root on one of the underlying disks but I get an error "Super think's the tree root is at 25606900367360, chunk root 25606758596608 Went past the fs size, exiting” Someone else was able to get past this by commenting that error … so i tried recompiling without those lines. 
Super think's the tree root is at 25606900367360, chunk root 25606758596608 Well block 24729718243328 seems great, but generation doesn't match, have=21582, want=21614 level 1 Well block 24729718292480 seems great, but generation doesn't match, have=21582, want=21614 level 0 Well block 24729718308864 seems great, but generation doesn't match, have=21582, want=21614 level 0 Well block 24729718407168 seems great, but generation doesn't match, have=21582, want=21614 level 0 Well block 24729944670208 seems great, but generation doesn't match, have=21583, want=21614 level 1 Well block 24729944719360 seems great, but generation doesn't match, have=21583, want=21614 level 0 Well block 24729944735744 seems great, but generation doesn't match, have=21583, want=21614 level 0 Well block 24729944817664 seems great, but generation doesn't match, have=21583, want=21614 level 0 Well block 24730048708608 seems great, but generation doesn't match, have=21584, want=21614 level 1 Well block 24730048724992 seems great, but generation doesn't match, have=21584, want=21614 level 0 Well block 24730048741376 seems great, but generation doesn't match, have=21584, want=21614 level 0 Well block 24730048757760 seems great, but generation doesn't match, have=21584, want=21614 level 0 Well block 24730048774144 seems great, but generation doesn't match, have=21584, want=21614 level 0 Well block 24730048823296 seems great, but generation doesn't match, have=21584, want=21614 level 0 Well block 24730132348928 seems great, but generation doesn't match, have=21585, want=21614 level 1 Well block 24730132414464 seems great, but generation doesn't mat
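The "Went past the fs size, exiting" bail-out being compiled out is, roughly, a guard of the following shape — a hypothetical paraphrase, with names and loop structure invented for illustration rather than copied from the btrfs-progs source:

	#include <stdio.h>
	#include <stdint.h>

	/* scan metadata block offsets for old tree roots */
	static void scan_for_roots(uint64_t total_bytes, uint64_t nodesize)
	{
		uint64_t offset = 0;

		while (1) {
			if (offset > total_bytes) {
				/* The guard the poster compiled out.  Logical
				 * addresses on a multi-device fs are not
				 * bounded by the raw byte total, so this can
				 * fire before older tree roots are reached. */
				printf("Went past the fs size, exiting\n");
				break;
			}
			/* ...map offset through the chunk tree, read the
			 * block, and compare its generation/level against
			 * the superblock's expectations... */
			offset += nodesize;
		}
	}

With the guard removed, the scan keeps emitting "generation doesn't match" candidates like those above, which is what feeds btrfs restore -t with older tree-root bytenrs to try.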
Re: need to recover large file
... I didn't read all of your original post at first, because I hadn't been into those internals. Now that I have, I see it seems to be using 6 devices, so you might need one hard drive six times the size of a partition (others can say if this will work), or 6 other hard drives, to make the backups. Although I'm not sure what raid configuration there may be, which could reduce the number of copies you had to make.

On Tue, Dec 15, 2015 at 10:59 PM, Michael Darling wrote:
> Or, even better yet, just unplug the drive completely and let someone more
> knowledgeable with btrfs say my dd suggestion works, as long as you don't
> mount either when they're both plugged in. I know the urge to just work on
> something, but bad recovery attempts can make recoverable data lost forever.
>
> On Tue, Dec 15, 2015 at 10:56 PM, Michael Darling wrote:
>> First things first, if the file is as important as it sounds. Since
>> there's no physical problem with the hard drive, go get (or, if you have
>> one laying around, use) another hard drive that is at least as big as the
>> partition that file was on. Then completely unmount the original partition.
>> Do a dd copy of the entire partition it was on.
>>
>> Then DO NOT mount either partition. You do NOT want to mount a btrfs
>> partition when there's a dd clone hooked up, because the UUIDs are the same
>> on both.
>>
>> Turn off the machine, pull the new backup copy, then go back to work on
>> recovery attempts. If recovery goes wrong, restore the fresh backup copy by
>> another dd. Make sure again NOT to mount either while more than one copy is
>> plugged in.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ERROR: did not find source subvol
On Tue, Dec 15, 2015 at 5:35 PM, Chris Murphy wrote: > kernel 4.2.6-301.fc23.x86_64 > btrfs-progs-4.2.2-1.fc23.x86_64 > > This is a new one for me. Two new Btrfs volumes (one single profile, > one 2x raid1) both created with this btrfs-progs. And then > > [root@f23a chrisbackup]# btrfs send everything-20150922/ | btrfs > receive /mnt/br1-500g/ > At subvol everything-20150922/ > At subvol everything-20150922 > ERROR: did not find source subvol. > > There are no kernel messages. > > [root@f23a chrisbackup]# du -sh everything-20150922/ > 324Geverything-20150922/ > [root@f23a chrisbackup]# du -sh /mnt /b > > [root@f23a chrisbackup]# du -sh /mnt/br1-500g/everything-20150922/ > 322G/mnt/br1-500g/everything-20150922/ > > HUH, looks like 2G is missing on the receive side. So it got > interrupted for some reason? Files are definitely missing. -v reveals no new information. So I'm confused. Next, I switched to kernel 4.4rc5 and btrfs-progs 4.3.1 and I'm not seeing this problem. Same file systems, same subvolume. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html