[PATCH] btrfs-progs: Enhance chunk validation check
Enhance chunk validation:
1) Num_stripes
   We already have such a check, but only for the super block sys chunk
   array.
   Now check all on-disk chunks.

2) Chunk logical
   It should be aligned to sector size.
   This behavior should be *DOUBLE CHECKED* for 64K sector size like
   PPC64 or AArch64.
   Maybe we can find some hidden bugs.

3) Chunk length
   Same as chunk logical, should be aligned to sector size.

4) Stripe length
   It should be a power of 2.

5) Chunk type
   Any bit outside of TYPE_MASK | PROFILE_MASK is invalid.

With all these much stricter rules, several fuzzed images reported on the
mailing list should no longer cause btrfsck errors.

Reported-by: Vegard Nossum
Signed-off-by: Qu Wenruo
---
 disk-io.c |  2 --
 utils.h   |  7 +++
 volumes.c | 29 -
 3 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/disk-io.c b/disk-io.c
index 7a63b91..83bdb27 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -40,8 +40,6 @@
 #define BTRFS_BAD_LEVEL			(-3)
 #define BTRFS_BAD_NRITEMS		(-4)
 
-#define IS_ALIGNED(x, a)		(((x) & ((typeof(x))(a) - 1)) == 0)
-
 /* Calculate max possible nritems for a leaf/node */
 static u32 max_nritems(u8 level, u32 nodesize)
 {
diff --git a/utils.h b/utils.h
index 493c2e4..7740fc2 100644
--- a/utils.h
+++ b/utils.h
@@ -24,6 +24,8 @@
 #include
 #include
 
+#define IS_ALIGNED(x, a)		(((x) & ((typeof(x))(a) - 1)) == 0)
+
 #define BTRFS_MKFS_SYSTEM_GROUP_SIZE (4 * 1024 * 1024)
 #define BTRFS_MKFS_SMALL_VOLUME_SIZE (1024 * 1024 * 1024)
 #define BTRFS_MKFS_DEFAULT_NODE_SIZE 16384
@@ -246,6 +248,11 @@ static inline u64 div_factor(u64 num, int factor)
 	return num;
 }
 
+static inline int is_power_of_2(unsigned long n)
+{
+	return (n != 0 && ((n & (n - 1)) == 0));
+}
+
 int btrfs_tree_search2_ioctl_supported(int fd);
 int btrfs_check_nodesize(u32 nodesize, u32 sectorsize, u64 features);
 
diff --git a/volumes.c b/volumes.c
index 492dcd2..a94be0e 100644
--- a/volumes.c
+++ b/volumes.c
@@ -1591,6 +1591,7 @@ static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
 	struct cache_extent *ce;
 	u64 logical;
 	u64 length;
+	u64 stripe_len;
 	u64 devid;
 	u8 uuid[BTRFS_UUID_SIZE];
 	int num_stripes;
@@ -1599,6 +1600,33 @@ static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
 
 	logical = key->offset;
 	length = btrfs_chunk_length(leaf, chunk);
+	stripe_len = btrfs_chunk_stripe_len(leaf, chunk);
+	num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
+	/* Validation check */
+	if (!num_stripes) {
+		error("invalid chunk num_stripes: %u", num_stripes);
+		return -EIO;
+	}
+	if (!IS_ALIGNED(logical, root->sectorsize)) {
+		error("invalid chunk logical %llu", logical);
+		return -EIO;
+	}
+	if (!length || !IS_ALIGNED(length, root->sectorsize)) {
+		error("invalid chunk length %llu", length);
+		return -EIO;
+	}
+	if (!is_power_of_2(stripe_len)) {
+		error("invalid chunk stripe length: %llu", stripe_len);
+		return -EIO;
+	}
+	if (~(BTRFS_BLOCK_GROUP_TYPE_MASK | BTRFS_BLOCK_GROUP_PROFILE_MASK) &
+	    btrfs_chunk_type(leaf, chunk)) {
+		error("unrecognized chunk type: %llu",
+		      ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
+			BTRFS_BLOCK_GROUP_PROFILE_MASK) &
+		      btrfs_chunk_type(leaf, chunk));
+		return -EIO;
+	}
 
 	ce = search_cache_extent(_tree->cache_tree, logical);
 
@@ -1607,7 +1635,6 @@ static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
 		return 0;
 	}
 
-	num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
 	map = kmalloc(btrfs_map_lookup_size(num_stripes), GFP_NOFS);
 	if (!map)
 		return -ENOMEM;
--
2.6.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
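The alignment and power-of-2 rules in the patch above can be tried out in userspace. What follows is only a minimal sketch: the macro and `is_power_of_2()` mirror the ones in the diff, while `sector_aligned()` and the sample values in the note below are assumptions made up for illustration and are not part of btrfs-progs.

```c
#include <stdint.h>

/* Same test as the patch's IS_ALIGNED; only meaningful when a is a power of two. */
#define IS_ALIGNED(x, a) (((x) & ((__typeof__(x))(a) - 1)) == 0)

/* Same test as the patch's is_power_of_2: exactly one bit is set. */
static inline int is_power_of_2(unsigned long n)
{
	return (n != 0 && ((n & (n - 1)) == 0));
}

/* Hypothetical helper: is a chunk start or length a multiple of sectorsize? */
static inline int sector_aligned(uint64_t value, uint32_t sectorsize)
{
	return IS_ALIGNED(value, sectorsize);
}
```

A value such as 20480 passes `sector_aligned()` with a 4096-byte sector size but fails with 65536, which seems to be the kind of case the request to double check 64K sector sizes has in mind.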
[PATCH v2] Btrfs: Check metadata redundancy on balance
Resending as previous comments did not need any changes.

Currently BTRFS allows you to make bad choices of data and metadata levels.
For example -d raid1 -m raid0 means you can only use half your total disk
space, but will lose everything if 1 disk fails. It should give a warning
in these cases.

This patch is a follow up to
[PATCH v2] btrfs-progs: check metadata redundancy
in order to cover the case of using balance to convert to such a set of
raid levels.

A simple example to hit this is to create a single device fs, which will
default to single:dup, then to add a second device and attempt to convert
to raid1 with the command
btrfs balance start -dconvert=raid1 /mnt
This will result in a filesystem with raid1:dup, which will not survive
the loss of one drive. I personally don't see why the tools should allow
this, but in the previous thread a warning was considered sufficient.

Changes in v2
Use btrfs_get_num_tolerated_disk_barrier_failures()

Signed-off-by: Sam Tygier

From: Sam Tygier
Date: Sat, 3 Oct 2015 16:43:48 +0100
Subject: [PATCH] Btrfs: Check metadata redundancy on balance

When converting a filesystem via balance, check that the metadata mode
is at least as redundant as the data mode. For example, give a warning
when:
-dconvert=raid1 -mconvert=single
---
 fs/btrfs/volumes.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6fc73586..40247e9 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3584,6 +3584,12 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 		}
 	} while (read_seqretry(_info->profiles_lock, seq));
 
+	if (btrfs_get_num_tolerated_disk_barrier_failures(bctl->meta.target) <
+		btrfs_get_num_tolerated_disk_barrier_failures(bctl->data.target)) {
+		btrfs_info(fs_info,
+		"Warning: metadata has lower redundancy than data\n");
+	}
+
 	if (bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) {
 		fs_info->num_tolerated_disk_barrier_failures = min(
 			btrfs_calc_num_tolerated_disk_barrier_failures(fs_info),
--
2.4.3
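The comparison the patch performs can be illustrated outside the kernel. This is a rough sketch under stated assumptions: the `enum profile` names and both functions are hypothetical, and the per-profile failure counts are my understanding of what each profile survives, not values read out of btrfs_get_num_tolerated_disk_barrier_failures().

```c
/* Hypothetical profile identifiers for this sketch; not the kernel's flag bits. */
enum profile {
	PROFILE_SINGLE,
	PROFILE_DUP,
	PROFILE_RAID0,
	PROFILE_RAID1,
	PROFILE_RAID10,
	PROFILE_RAID5,
	PROFILE_RAID6
};

/* Number of whole-device losses a profile is assumed to survive. */
static int tolerated_device_failures(enum profile p)
{
	switch (p) {
	case PROFILE_RAID1:
	case PROFILE_RAID10:
	case PROFILE_RAID5:
		return 1;
	case PROFILE_RAID6:
		return 2;
	default:
		/* single, dup and raid0: losing one device loses data */
		return 0;
	}
}

/* Returns 1 when metadata would survive fewer device losses than data. */
static int metadata_less_redundant(enum profile data, enum profile meta)
{
	return tolerated_device_failures(meta) < tolerated_device_failures(data);
}
```

With these assumptions the raid1:dup case from the description is flagged, because dup keeps both copies on the same device, while single:dup is not.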
[PATCH v3 2/2] btrfs: Enhance chunk validation check
Enhance chunk validation:
1) Num_stripes
   We already have such a check, but only for the super block sys chunk
   array.
   Now check all on-disk chunks.

2) Chunk logical
   It should be aligned to sector size.
   This behavior should be *DOUBLE CHECKED* for 64K sector size like
   PPC64 or AArch64.
   Maybe we can find some hidden bugs.

3) Chunk length
   Same as chunk logical, should be aligned to sector size.

4) Stripe length
   It should be a power of 2.

5) Chunk type
   Any bit outside of TYPE_MASK | PROFILE_MASK is invalid.

With all these much stricter rules, several fuzzed images reported on the
mailing list should no longer cause a kernel panic.

Reported-by: Vegard Nossum
Signed-off-by: Qu Wenruo
---
v3:
  Fix a typo which forgot to return -EIO after num_stripes check.
---
 fs/btrfs/volumes.c | 33 -
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9ea345f..bda84be 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6199,6 +6199,7 @@ static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
 	struct extent_map *em;
 	u64 logical;
 	u64 length;
+	u64 stripe_len;
 	u64 devid;
 	u8 uuid[BTRFS_UUID_SIZE];
 	int num_stripes;
@@ -6207,6 +6208,37 @@ static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
 
 	logical = key->offset;
 	length = btrfs_chunk_length(leaf, chunk);
+	stripe_len = btrfs_chunk_stripe_len(leaf, chunk);
+	num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
+	/* Validation check */
+	if (!num_stripes) {
+		btrfs_err(root->fs_info, "invalid chunk num_stripes: %u",
+			  num_stripes);
+		return -EIO;
+	}
+	if (!IS_ALIGNED(logical, root->sectorsize)) {
+		btrfs_err(root->fs_info,
+			  "invalid chunk logical %llu", logical);
+		return -EIO;
+	}
+	if (!length || !IS_ALIGNED(length, root->sectorsize)) {
+		btrfs_err(root->fs_info,
+			  "invalid chunk length %llu", length);
+		return -EIO;
+	}
+	if (!is_power_of_2(stripe_len)) {
+		btrfs_err(root->fs_info, "invalid chunk stripe length: %llu",
+			  stripe_len);
+		return -EIO;
+	}
+	if (~(BTRFS_BLOCK_GROUP_TYPE_MASK | BTRFS_BLOCK_GROUP_PROFILE_MASK) &
+	    btrfs_chunk_type(leaf, chunk)) {
+		btrfs_err(root->fs_info, "unrecognized chunk type: %llu",
+			  ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
+			    BTRFS_BLOCK_GROUP_PROFILE_MASK) &
+			  btrfs_chunk_type(leaf, chunk));
+		return -EIO;
+	}
 
 	read_lock(_tree->map_tree.lock);
 	em = lookup_extent_mapping(_tree->map_tree, logical, 1);
@@ -6223,7 +6255,6 @@ static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
 	em = alloc_extent_map();
 	if (!em)
 		return -ENOMEM;
-	num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
 	map = kmalloc(map_lookup_size(num_stripes), GFP_NOFS);
 	if (!map) {
 		free_extent_map(em);
--
2.6.3
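Rule 5 of the patch above (chunk type) is a plain bit-mask test, sketched below for illustration. The `BG_*` names are hypothetical stand-ins for the BTRFS_BLOCK_GROUP_* flags, and the bit positions are given as I recall them from the btrfs headers, so treat the exact values as an assumption; `unknown_chunk_type_bits()` is likewise invented for this example.

```c
#include <stdint.h>

/* Assumed bit layout of the block group flags (illustrative names). */
#define BG_DATA     (1ULL << 0)
#define BG_SYSTEM   (1ULL << 1)
#define BG_METADATA (1ULL << 2)
#define BG_RAID0    (1ULL << 3)
#define BG_RAID1    (1ULL << 4)
#define BG_DUP      (1ULL << 5)
#define BG_RAID10   (1ULL << 6)
#define BG_RAID5    (1ULL << 7)
#define BG_RAID6    (1ULL << 8)

#define BG_TYPE_MASK    (BG_DATA | BG_SYSTEM | BG_METADATA)
#define BG_PROFILE_MASK (BG_RAID0 | BG_RAID1 | BG_DUP | \
			 BG_RAID10 | BG_RAID5 | BG_RAID6)

/*
 * Return the bits of a chunk type that fall outside both masks.
 * Zero means every bit is recognized; non-zero is what the patch rejects.
 */
static uint64_t unknown_chunk_type_bits(uint64_t type)
{
	return type & ~(BG_TYPE_MASK | BG_PROFILE_MASK);
}
```

A "single" chunk carries no profile bit at all, so a bare `BG_DATA` passes the check; a fuzzed type with a stray high bit set returns exactly that stray bit, which is the value the error message in the patch prints.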
Re: [PATCH] btrfs: Introduce new mount option to disable tree log replay
On 2015-12-08 01:08, Qu Wenruo wrote:
> Austin S Hemmelgarn wrote on 2015/12/07 11:36 -0500:
>> On 2015-12-07 01:06, Qu Wenruo wrote:
>>> Introduce a new mount option "nologreplay" to co-operate with "ro" mount
>>> option to get real readonly mount, like "norecovery" in ext* and xfs.
>>>
>>> Since the new parse_options() need to check new flags at remount time,
>>> so add a new parameter for parse_options().
>>
>> Passes xfstests and a handful of other things that I really should just
>> take the time to integrate into xfstests, so:
>>
>> Tested-by: Austin S. Hemmelgarn
>
> Thanks for the test.
>
> But I'm afraid you may need to test v2 patch again, as the v2 changed
> some behavior.

That's OK, I should have results for you some time later today.
Re: [PATCH] btrfs: Introduce new mount option to disable tree log replay
On 2015-12-07 18:06, Eric Sandeen wrote:
> On 12/7/15 2:54 PM, Christoph Anton Mitterer wrote:
>
> ...
>
>> 2) a section that describes "ro" in btrfs-mount(5) which describes that
>> normal "ro" alone may cause changes on the device and which then refers
>> to hard-ro and/or the list of options (currently nologreplay) which are
>> required right now to make it truly ro.
>>
>> I think this is important as an end-user probably expects "ro" to be
>> truly ro,
>
> Yeah, I don't know that this is true. It hasn't been true for over a
> decade (2?), with the most widely-used filesystem in linux history, i.e.
> ext3. So if btrfs wants to go on this re-education crusade, more power
> to you, but I don't know that it's really a fight worth fighting. ;)

Actually, AFAICT, it's been at least 4.5 decades. Last I checked, this
dates back to the original UNIX filesystems, which still updated atimes
even when mounted RO.

Despite this, it really isn't a widely known or well documented behavior
outside of developers, forensic specialists, and people who have had to
deal with the implications it has on data recovery. There really isn't
any way that the user would know about it without being explicitly told,
and it's something that can have a serious impact on being able to
recover a broken filesystem. TBH, I really feel that _every_
filesystem's documentation should have something about how to make it
mount truly read-only, even if it's just a reference to how to mark the
block device read-only.
Re: [PATCH v2] btrfs: Introduce new mount option to disable tree log replay
On Tuesday 08 Dec 2015 14:10:33 Qu Wenruo wrote:
> Introduce a new mount option "nologreplay" to co-operate with "ro" mount
> option to get real readonly mount, like "norecovery" in ext* and xfs.
>
> Since the new parse_options() need to check new flags at remount time,
> so add a new parameter for parse_options().
>
> Signed-off-by: Qu Wenruo
> ---
> v2:
>   Make RO check mandatory for btrfs_parse_options().
>   Add btrfs_show_options() support for nologreplay.
>
>   Document for btrfs-mount(5) will follow after the patch being merged.
> ---
>  Documentation/filesystems/btrfs.txt |  7 +++
>  fs/btrfs/ctree.h                    |  4 +++-
>  fs/btrfs/disk-io.c                  |  7 ---
>  fs/btrfs/super.c                    | 29 +
>  4 files changed, 39 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/filesystems/btrfs.txt
> b/Documentation/filesystems/btrfs.txt index c772b47..7ad5b93 100644
> --- a/Documentation/filesystems/btrfs.txt
> +++ b/Documentation/filesystems/btrfs.txt
> @@ -168,6 +168,13 @@ Options with (*) are default options and will not show
> in the mount options. notreelog
> 	Enable/disable the tree logging used for fsync and O_SYNC writes.
>
> +  nologreplay
> +	Disable the log tree replay at mount time to prevent devices get
> +	modified. Must be use with 'ro' mount option.
> +	A filesystem mounted with the 'nologreplay' option cannot
> +	transition to a read-write mount via remount,rw - the filesystem
> +	must be unmounted and remounted if read-write access is desired.
> +

Maybe the following is slightly better ...

Disable the log tree replay at mount time to prevent the filesystem from
getting modified. Must be used with the 'ro' mount option.
A filesystem mounted with the 'nologreplay' option cannot
transition to a read-write mount via remount,rw - the filesystem
must be unmounted and mounted back again if read-write access is desired.

Aside from the above, everything else looks good to me.

Reviewed-by: Chandan Rajendra

-- 
chandan
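The two rules in the option text quoted above (nologreplay must be paired with ro, and such a mount cannot be remounted read-write) can be written down as a small check. This is only a sketch of the stated behavior, not the code from the patch: the `OPT_*` bits, both function names, and the choice of -EINVAL as the error are assumptions made for illustration.

```c
#include <errno.h>

/* Hypothetical option bits; the kernel tracks mount options differently. */
#define OPT_RO          (1u << 0)
#define OPT_NOLOGREPLAY (1u << 1)

/* Mount-time rule: nologreplay is only accepted together with ro. */
static int check_mount_options(unsigned int opts)
{
	if ((opts & OPT_NOLOGREPLAY) && !(opts & OPT_RO))
		return -EINVAL;
	return 0;
}

/*
 * Remount rule: a filesystem currently mounted with nologreplay may not
 * switch to read-write; it has to be unmounted and mounted again.
 */
static int check_remount(unsigned int current_opts, unsigned int new_opts)
{
	if ((current_opts & OPT_NOLOGREPLAY) && !(new_opts & OPT_RO))
		return -EINVAL;
	return check_mount_options(new_opts);
}
```

Under these assumptions `check_remount(OPT_RO | OPT_NOLOGREPLAY, 0)` is rejected, while an ordinary ro mount can still be remounted read-write.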
Re: [PATCH 1/4] locks: new locks_mandatory_area calling convention
On Tue, Dec 08, 2015 at 04:05:04AM +, Al Viro wrote:
> Where the hell would truncate(2) get struct file, anyway? IOW, the inode
> argument is _not_ pointless; re-added.

Oh, right. Interestingly, it seems like xfstests has no coverage of this
code path at all.
Re: Fixing recursive fault and parent transid verify failed
Alistair Grant posted on Tue, 08 Dec 2015 06:55:04 +1100 as excerpted: > On Mon, Dec 07, 2015 at 01:48:47PM +, Duncan wrote: >> Alistair Grant posted on Mon, 07 Dec 2015 21:02:56 +1100 as excerpted: >> >> > I think I'll try the btrfs restore as a learning exercise, and to >> > check the contents of my backup (I don't trust my memory, so >> > something could have changed since the last backup). >> >> Trying btrfs restore is an excellent idea. It'll make things far >> easier if you have to use it for real some day. >> >> Note that while I see your kernel is reasonably current (4.2 series), I >> don't know what btrfs-progs ubuntu ships. There have been some marked >> improvements to restore somewhat recently, checking the wiki >> btrfs-progs release-changelog list says 4.0 brought optional metadata >> restore, 4.0.1 added --symlinks, and 4.2.3 fixed a symlink path check >> off-by-one error. (And don't use 4.1.1 as its mkfs.btrfs is broken and >> produces invalid filesystems.) So you'll want at least progs 4.0 to >> get the optional metadata restoration, and 4.2.3 to get full symlinks >> restoration support. >> >> > Ubuntu 15.10 comes with btrfs-progs v4.0. It looks like it is easy > enough to compile and install the latest version from > git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git so > I'll do that. > > Should I stick to 4.2.3 or use the latest 4.3.1? I generally use the latest myself, but recommend as a general guideline that at minimum, a userspace version series matching that of your kernel be used, as if the usual kernel recommendations (within two kernel series of either current or LTS, so presently 4.2 or 4.3 for current or 3.18 or 4.1 for LTS) are followed, that will keep userspace reasonably current as well, and the userspace of a particular version was being developed concurrently with the kernel of the same series, so they're relatively in sync. So with a 4.2 kernel, I'd suggest at least a 4.2 userspace. 
If you want the latest, as I generally do, and are willing to put up with
occasional bleeding edge bugs like that broken mkfs.btrfs in 4.1.1, by
all means, use the latest, but otherwise, the general same series as your
kernel guideline is quite acceptable.

The exception would be if you're trying to fix or recover from a broken
filesystem, in which case the very latest tends to have the best chance
at fixing things, since it has fixes for (or lacking that, at least
detection of) the latest round of discovered bugs, that older versions
will lack.

While btrfs restore does fall into the recover from broken category, we
know from the changelogs that nothing specific has gone into it since the
mentioned 4.2.3 symlink off-by-one fix, so while I would recommend at
least that since you are going to be working with restore, there's no
urgent need for 4.3.0 or 4.3.1 if you're more comfortable with the older
version. (In fact, while I knew I was on 4.3.something, I just had to
run btrfs version, to check whether it was 4.3 or 4.3.1, myself. FWIW,
it was 4.3.1.)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Scrub: no space left on device
Howdy,

Why would scrub need space and why would it cancel if there isn't enough of
it?
(kernel 4.3)

/etc/cron.daily/btrfs-scrub:
btrfs scrub start -Bd /dev/mapper/cryptroot
scrub device /dev/mapper/cryptroot (id 1) done
	scrub started at Mon Dec 7 01:35:08 2015 and finished after 258 seconds
	total bytes scrubbed: 130.84GiB with 0 errors
btrfs scrub start -Bd /dev/mapper/pool1
ERROR: scrubbing /dev/mapper/pool1 failed for device id 1 (No space left on device)
scrub device /dev/mapper/pool1 (id 1) canceled

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
Re: Scrub on btrfs single device only to detect errors, not correct them?
Jon Panozzo posted on Mon, 07 Dec 2015 08:43:14 -0600 as excerpted: [On single-device dup data] > Thanks for the additional feedback. Two follow-up questions to this is: > > Can the --mixed option only be applied when first creating the fs, or > can you simply add this to the balance command to take an existing > filesystem and add this to it? Mixed-bg mode has to be done at btrfs creation. It changes the way btrfs handles chunks, and doing that _live_, with a non-zero time during which both modes are active, would be... complex and an invitation to all sorts of race bugs, to put it mildly. > So it sounds like there are really three ways to enable scrub to repair > errors on a btrfs single device (please confirm): Yes. > 1) mkfs.btrfs with the --mixed option This would be my current preferred to filesystem sizes of a quarter to perhaps a half terabyte on spinning rust, and some people are known to use mixed for exactly this reason, tho it's not particularly well tested at the terabyte scale filesystem level, where as a result you might uncover some unusual bugs. > 2) create two partitions on a single phys device, > then present them as logical devices (maybe a loopback or something) > and create a btrfs raid1 for both data/metadata No special loopback, etc, required. Btrfs deploys just fine on pretty much any block device as presented by the kernel, including both partitions and LVM volumes, the two ways single physical devices are likely to be presented as multiple logical devices. In fact I use btrfs on partitions here, tho in my case it's two devices partitioned up identically, with raid1 across the parallel partitions on each device, instead of using multiple partitions on the same physical device, which is what we're talking about here. 
This option will be rather inefficient on spinning rust as the write head will have to write one copy to the one partition, then reposition itself to write the second copy to the other partition, and that repositioning is non-zero time on spinning rust, but there's no such repositioning latency on SSDs, where it might actually be faster than mixed-mode, tho I'm unaware of any benchmarking to find out. Despite the inefficiency, both partitions and btrfs raid1 are separately well tested and their combined use on a single device should introduce no race conditions that wouldn't have been found by previous separate usage, so this would be my current preferred at filesystem sizes over a half terabyte on spinning rust, or on SSDs with their zero seek times. But writing /will/ be slow on spinning rust, particularly with partition sizes of a half-TiB or larger each, as that write-mode seek-time will be /nasty/. That said, again, there are people known to be using this mode, and it's a viable choice in deployments such as laptops where physical multi- device isn't an option, but the additional reliability of pair-copy data is highly desirable. > 3) wait for the patch in process to allow for btrfs single devices to > support dup mode for data This should be the preferred mode in the future, tho as with any new btrfs feature, it'll probably take a couple kernel versions after initial introduction for the most critical bugs in the new feature to be found and duly exterminated, so I'd consider anyone using it the first kernel cycle or two after introduction to be volunteering as guinea pigs. That said, the individual components of this feature have been in btrfs for some time and are well tested by now, so I'd expect the introduction of this feature to be rather smoother than many. For the much more disruptive raid56 mode, I suggested a guinea-pig time of a year, five kernel cycles, for instance, and that turned out to be about right. 
(Interestingly enough, that put raid56 mode feature stability at the soon to be released kernel 4.4, which is scheduled to be a long-term-support release, so the raid56 mode stability timing worked out rather well, tho I had no idea 4.4 would be an LTS when I originally predicted the year's settle-time.) > Is that about right? =:^) One further caveat regarding SSDs. On SSDs, many commonly deployed FTLs do dedup. Sandforce firmware, where dedup is sold as a feature, is known for this. If the firmware is doing dedup, then duplicated data /or/ metadata at the filesystem level is simply being deduped at the physical device firmware level, so you end up with only one physical copy in any case, and filesystem efforts to provide redundancy only end up costing CPU cycles at both the filesystem and device-firmware levels, all for naught. This is a big reason why mkfs.btrfs on a single device defaults to single metadata if it detects an SSD, despite the normally preferred dup metadata default. So if you're deploying on SSDs using sandforce firmware or otherwise known to do dedup at the FTL, don't bother with any of the above as the firmware will be simply defeating your efforts at
[PATCH] btrfs: don't use slab cache for struct btrfs_delalloc_work
Although we prefer to use separate caches for various structs, it seems
better not to do that for struct btrfs_delalloc_work. Objects of this
type are allocated rarely, when transaction commit calls
btrfs_start_delalloc_roots, requesting delayed iputs.

The objects are temporary (with some IO involved) but still allocated
and freed within __start_delalloc_inodes. Memory allocation failure is
handled.

The slab cache is empty most of the time (observed on several systems),
so if we need to allocate a new slab object, the first one has to
allocate a full page. In a potential case of low memory conditions this
might fail with higher probability compared to using the generic slab
caches.

Signed-off-by: David Sterba
---
 fs/btrfs/inode.c | 14 ++
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 994490d5fa64..eeae851427fe 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -77,7 +77,6 @@ static const struct file_operations btrfs_dir_file_operations;
 static struct extent_io_ops btrfs_extent_io_ops;
 
 static struct kmem_cache *btrfs_inode_cachep;
-static struct kmem_cache *btrfs_delalloc_work_cachep;
 struct kmem_cache *btrfs_trans_handle_cachep;
 struct kmem_cache *btrfs_transaction_cachep;
 struct kmem_cache *btrfs_path_cachep;
@@ -9174,8 +9173,6 @@ void btrfs_destroy_cachep(void)
 		kmem_cache_destroy(btrfs_path_cachep);
 	if (btrfs_free_space_cachep)
 		kmem_cache_destroy(btrfs_free_space_cachep);
-	if (btrfs_delalloc_work_cachep)
-		kmem_cache_destroy(btrfs_delalloc_work_cachep);
 }
 
 int btrfs_init_cachep(void)
@@ -9210,13 +9207,6 @@ int btrfs_init_cachep(void)
 	if (!btrfs_free_space_cachep)
 		goto fail;
 
-	btrfs_delalloc_work_cachep = kmem_cache_create("btrfs_delalloc_work",
-			sizeof(struct btrfs_delalloc_work), 0,
-			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
-			NULL);
-	if (!btrfs_delalloc_work_cachep)
-		goto fail;
-
 	return 0;
 fail:
 	btrfs_destroy_cachep();
@@ -9461,7 +9451,7 @@ struct btrfs_delalloc_work *btrfs_alloc_delalloc_work(struct inode *inode,
 {
 	struct btrfs_delalloc_work *work;
 
-	work = kmem_cache_zalloc(btrfs_delalloc_work_cachep, GFP_NOFS);
+	work = kmalloc(sizeof(*work), GFP_NOFS);
 	if (!work)
 		return NULL;
 
@@ -9480,7 +9470,7 @@ struct btrfs_delalloc_work *btrfs_alloc_delalloc_work(struct inode *inode,
 void btrfs_wait_and_free_delalloc_work(struct btrfs_delalloc_work *work)
 {
 	wait_for_completion(>completion);
-	kmem_cache_free(btrfs_delalloc_work_cachep, work);
+	kfree(work);
 }
 
 /*
--
1.8.4.5
Re: Scrub on btrfs single device only to detect errors, not correct them?
Austin S Hemmelgarn posted on Mon, 07 Dec 2015 10:39:05 -0500 as excerpted: > On 2015-12-07 10:12, Jon Panozzo wrote: >> This is what I was thinking as well. In my particular use-case, parity >> is only really used today to reconstruct an entire device due to a >> device failure. I think if btrfs scrub detected errors on a single >> device, I could do a "reverse reconstruct" where instead of syncing TO >> the parity disk, I sync FROM the parity disk TO the btrfs single device >> with the error, replacing physical blocks that are out of sync with >> parity (thus repairing the scrub-found errrors). The downside to this >> approach is I would have to perform the reverse-sync against the entire >> btrfs block device, which could be much more time-consuming than if I >> could single out the specific block addresses and just sync those. >> That said, I guess option A is better than no option at all. >> >> I would be curious if any of the devs or other members of this mailing >> list have tried to correlate btrfs internal block addresses to a true >> block-address on the device being used. Any interesting articles / >> links that show how to do this? Not expecting much, but if someone >> does know, I'd be very grateful. > I think there is a tool in btrfs-progs to do it, but I've never used it, > and you would still need to get scrub to spit out actual error addresses > for you. btrfs-debug-tree is what you're looking for. =:^) As I understand things, the complexity is due to btrfs' chunk abstraction, along with the multi-device feature. On a normal filesystem, byte or block addresses are mapped linearly to absolute filesystem byte address and there's just the one device to worry about, so there's effectively little or no translation to be done. On btrfs by contrast, block addresses map into chunks, also known as block groups, which are designed to be more or less arbitrarily relocatable within the filesystem using balance (originally called the restriper). 
Further, these block groups can be single, striped across multiple
devices (raid0 and the 0 side of raid10), duplicated on the same device
(dup) or across multiple devices (only two devices currently,
N-way-mirroring is on the roadmap, raid1 and the 1 side of raid10), or
striped with parity (raid5 and 6).

So while block addresses can map more or less linearly into block groups,
btrfs has to maintain an entirely new layer of abstraction mapping in
addition, that tells the filesystem where to look for that block group,
that is, on what device (or across what devices if striped), and at what
absolute bytenr offset into the device.

And again, keep in mind that even with a constant single/dup/raid mapping
and even in the simplest single mode on single device, balance can and
does more or less arbitrarily dynamically relocate block groups within
the filesystem, so the mapping you see today may or may not be the
mapping you see tomorrow, depending on whether a balance was run in the
mean time.

Obviously the devs are going to need a tool to help them debug this
additional complexity, and that's where btrfs-debug-tree comes in. =:^)

But for "ordinary mortal admins", yes, btrfs is open source and
btrfs-debug-tree is available for those that want to use it, but once
they realize the complexity, most (including me) are going to simply be
content to treat it as a black box and not worry too much about
investigating its innards.

So while specific block and/or byte mapping can be done and there's tools
available for and appropriate to the task, it's the type of thing most
admins are very content to treat as a black box and leave well enough
alone, once they understand the complexities involved.

"Btrfs, while he might use it, it ain't your grandfather's filesystem!"
(TM) =:^)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."
Richard Stallman
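To make the extra mapping layer described above a little more concrete, here is a rough sketch of translating a logical address into a device and physical offset for the two simplest cases, a single (or first-copy dup) chunk and a raid0-style striped chunk. Everything in it is an assumption for illustration: `struct stripe`, `struct chunk_map` and `map_logical()` are hypothetical simplifications, not the on-disk chunk item or the kernel's mapping code, and raid1/10/5/6 are not handled.

```c
#include <stdint.h>

/* Hypothetical, simplified stripe: where one piece of a chunk lives. */
struct stripe {
	uint64_t devid;   /* device holding this stripe */
	uint64_t offset;  /* physical byte offset on that device */
};

/* Hypothetical, simplified chunk description. */
struct chunk_map {
	uint64_t logical;     /* start of the chunk in logical address space */
	uint64_t length;      /* chunk length in bytes */
	uint64_t stripe_len;  /* stripe unit, used only when striped */
	int num_stripes;
	int striped;          /* 0: single or dup, 1: raid0-style */
	struct stripe stripes[4];
};

/*
 * Translate a logical address into (devid, physical offset).
 * Returns 0 on success, -1 if the address is outside the chunk or the
 * chunk description is unusable. For dup, only the first copy is reported.
 */
static int map_logical(const struct chunk_map *c, uint64_t logical,
		       uint64_t *devid, uint64_t *physical)
{
	uint64_t off, stripe_nr, idx;

	if (logical < c->logical || logical - c->logical >= c->length)
		return -1;
	off = logical - c->logical;

	if (!c->striped) {
		/* Linear case: same offset into the first stripe. */
		*devid = c->stripes[0].devid;
		*physical = c->stripes[0].offset + off;
		return 0;
	}

	if (c->stripe_len == 0 || c->num_stripes <= 0)
		return -1;

	/* Striped case: stripe units rotate across the devices. */
	stripe_nr = off / c->stripe_len;
	idx = stripe_nr % (uint64_t)c->num_stripes;
	*devid = c->stripes[idx].devid;
	*physical = c->stripes[idx].offset +
		    (stripe_nr / (uint64_t)c->num_stripes) * c->stripe_len +
		    off % c->stripe_len;
	return 0;
}
```

Since balance can relocate block groups, any table fed into a helper like this would only be valid until the next balance, which is the caveat raised in the message.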
Re: Scrub: no space left on device
On 08/12/2015 16:06, Marc MERLIN wrote:
> Howdy,
>
> Why would scrub need space and why would it cancel if there isn't enough of
> it?
> (kernel 4.3)
>
> /etc/cron.daily/btrfs-scrub:
> btrfs scrub start -Bd /dev/mapper/cryptroot
> scrub device /dev/mapper/cryptroot (id 1) done
> 	scrub started at Mon Dec 7 01:35:08 2015 and finished after 258 seconds
> 	total bytes scrubbed: 130.84GiB with 0 errors
> btrfs scrub start -Bd /dev/mapper/pool1
> ERROR: scrubbing /dev/mapper/pool1 failed for device id 1 (No space left on
> device)
> scrub device /dev/mapper/pool1 (id 1) canceled

I can't be sure (not-a-dev), but one possibility that comes to mind is
that if an error is detected, writes must be done on the device. The
repair might not be done in-place but with CoW, and even if the error is
not repaired for lack of redundancy, IIRC each device tracks the number
of errors detected, so I assume this is written somewhere (system or
metadata chunks most probably).

Best regards,

Lionel
Re: Scrub: no space left on device
On 08/12/2015 16:37, Holger Hoffstätte wrote:
> On 12/08/15 16:06, Marc MERLIN wrote:
>> Howdy,
>>
>> Why would scrub need space and why would it cancel if there isn't enough of
>> it?
>> (kernel 4.3)
>>
>> /etc/cron.daily/btrfs-scrub:
>> btrfs scrub start -Bd /dev/mapper/cryptroot
>> scrub device /dev/mapper/cryptroot (id 1) done
>> 	scrub started at Mon Dec 7 01:35:08 2015 and finished after 258 seconds
>> 	total bytes scrubbed: 130.84GiB with 0 errors
>> btrfs scrub start -Bd /dev/mapper/pool1
>> ERROR: scrubbing /dev/mapper/pool1 failed for device id 1 (No space left on
>> device)
>> scrub device /dev/mapper/pool1 (id 1) canceled
> Scrub rewrites metadata (apparently even in -r aka readonly mode), and that
> can lead to temporary metadata expansion (stuff gets COWed around); it's
> a bit surprising but makes sense if you think about it.

How long must I think about it until it makes sense? :-)

Sorry, I'm not sure why metadata is rewritten if no error is detected.
I have several theories but lack information: is the fact that no error
has been detected stored somewhere? Is scrub using some kind of internal
temporary snapshot(s) to avoid interfering with other operations? Some
other reason I didn't think about?

Lionel
Re: Scrub: no spae left on device
On Tue, Dec 08, 2015 at 04:46:32PM +0100, Lionel Bouton wrote:
> On 08/12/2015 16:37, Holger Hoffstätte wrote:
> > On 12/08/15 16:06, Marc MERLIN wrote:
> >> Howdy,
> >>
> >> Why would scrub need space and why would it cancel if there isn't enough of
> >> it?
> >> (kernel 4.3)
> >>
> >> /etc/cron.daily/btrfs-scrub:
> >> btrfs scrub start -Bd /dev/mapper/cryptroot
> >> scrub device /dev/mapper/cryptroot (id 1) done
> >>         scrub started at Mon Dec 7 01:35:08 2015 and finished after 258 seconds
> >>         total bytes scrubbed: 130.84GiB with 0 errors
> >> btrfs scrub start -Bd /dev/mapper/pool1
> >> ERROR: scrubbing /dev/mapper/pool1 failed for device id 1 (No space left
> >> on device)
> >> scrub device /dev/mapper/pool1 (id 1) canceled
> > Scrub rewrites metadata (apparently even in -r aka readonly mode), and that
> > can lead to temporary metadata expansion (stuff gets COWed around); it's
> > a bit surprising but makes sense if you think about it.
>
> How long must I think about it until it makes sense? :-)
>
> Sorry I'm not sure why metadata is rewritten if no error is detected.
> I've several theories but lack information: is the fact that no error
> has been detected stored somewhere? is scrub using some kind of internal
> temporary snapshot(s) to avoid interfering with other operations? other
> reason I didn't think about?

Yeah, I was also wondering why metadata should be rewritten on a single device scrub. Does not make sense to me.

And this is what I got:

legolas:~# btrfs balance start -musage=10 -v /mnt/btrfs_pool1/
Dumping filters: flags 0x6, state 0x0, force is off
  METADATA (flags 0x2): balancing, usage=10
  SYSTEM (flags 0x2): balancing, usage=10
ERROR: error during balancing '/mnt/btrfs_pool1/' - No space left on device
There may be more info in syslog - try dmesg | tail

Ok, that sucks.

legolas:~# btrfs balance start -musage=0 -v /mnt/btrfs_pool1/
Dumping filters: flags 0x6, state 0x0, force is off
  METADATA (flags 0x2): balancing, usage=0
  SYSTEM (flags 0x2): balancing, usage=0
Done, had to relocate 0 out of 618 chunks

This worked. Mmmh, I thought this wouldn't be necessary anymore in 4.3 kernels?

legolas:~# btrfs balance start -musage=10 -v /mnt/btrfs_pool1
Dumping filters: flags 0x6, state 0x0, force is off
  METADATA (flags 0x2): balancing, usage=10
  SYSTEM (flags 0x2): balancing, usage=10
Done, had to relocate 1 out of 618 chunks

And now I'm back in business...

Still, this is a bit disappointing and at the very least very unexpected in 4.3.

legolas:~# btrfs fi df /mnt/btrfs_pool1
Data, single: total=604.88GiB, used=520.09GiB
System, DUP: total=32.00MiB, used=96.00KiB
Metadata, DUP: total=5.00GiB, used=4.17GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

legolas:~# btrfs fi show /mnt/btrfs_pool1
Label: 'btrfs_pool1'  uuid: 5ee24229-2431-448a-868e-2c325d10bfa7
        Total devices 1 FS bytes used 524.26GiB
        devid    1 size 615.01GiB used 614.94GiB path /dev/mapper/pool1

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
Re: Scrub: no spae left on device
On Tue, Dec 08, 2015 at 05:24:16PM +0100, Holger Hoffstätte wrote:
> On 12/08/15 17:06, Marc MERLIN wrote:
> > Label: 'btrfs_pool1'  uuid: 5ee24229-2431-448a-868e-2c325d10bfa7
> >         Total devices 1 FS bytes used 524.26GiB
> >         devid    1 size 615.01GiB used 614.94GiB path /dev/mapper/pool1
>
> This is what I was alluding to. You could have started a -dusage balance
> *before* the scrub so that one or several data chunks get freed.
> Balancing metadata when you're out of space accomplishes nothing and only
> will very likely fail, just as you saw. You have ~90GB usable space, but
> that space is spread over chunks with low utilisation.

Yes, my partition got a bit full, I freed up space, and unfortunately we still don't have a background rebalance to fix this, so I did run a manual one.

But my filesystem was usable; I was writing to it just fine. I was just very surprised that scrub needed to rewrite blocks on a single-disk device.

You could make the case that scrub and a usage=0 balance should be run together.

In the meantime, I upgraded my script:
http://marc.merlins.org/perso/btrfs/2014-03.html#Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair
http://marc.merlins.org/linux/scripts/btrfs-scrub

I figured there is no good reason not to run a usage=20 balance on metadata and data every night.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
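The ordering discussed in this thread (drop empty chunks, compact lightly used ones, then scrub) could be sketched as below. This is not Marc's actual script: the function name `maintenance_commands` and the default threshold of 20 are illustrative assumptions, and the sketch only builds the command lines without running anything.

```python
def maintenance_commands(mountpoint, device, usage=20):
    """Return the btrfs commands to run, in order, as argument lists."""
    return [
        # First pass: only chunks with 0% usage, i.e. completely empty ones.
        ["btrfs", "balance", "start", "-dusage=0", "-musage=0", mountpoint],
        # Second pass: data and metadata chunks at most `usage` percent full.
        ["btrfs", "balance", "start",
         f"-dusage={usage}", f"-musage={usage}", mountpoint],
        # Finally scrub in the foreground (-B) with per-device statistics (-d).
        ["btrfs", "scrub", "start", "-Bd", device],
    ]


# Print the plan for the filesystem from this thread (nothing is executed).
for cmd in maintenance_commands("/mnt/btrfs_pool1", "/dev/mapper/pool1"):
    print(" ".join(cmd))
```

Running the balance passes before the scrub follows Holger's advice to free chunks first; whether a nightly usage=20 pass is worthwhile depends on the workload.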
Re: Scrub: no spae left on device
On 12/08/15 16:06, Marc MERLIN wrote:
> Howdy,
>
> Why would scrub need space and why would it cancel if there isn't enough of
> it?
> (kernel 4.3)
>
> /etc/cron.daily/btrfs-scrub:
> btrfs scrub start -Bd /dev/mapper/cryptroot
> scrub device /dev/mapper/cryptroot (id 1) done
>         scrub started at Mon Dec 7 01:35:08 2015 and finished after 258 seconds
>         total bytes scrubbed: 130.84GiB with 0 errors
> btrfs scrub start -Bd /dev/mapper/pool1
> ERROR: scrubbing /dev/mapper/pool1 failed for device id 1 (No space left on
> device)
> scrub device /dev/mapper/pool1 (id 1) canceled

Scrub rewrites metadata (apparently even in -r aka readonly mode), and that can lead to temporary metadata expansion (stuff gets COWed around); it's a bit surprising but makes sense if you think about it.

The fact that you ENOSPCed means that the fs was probably already fully allocated. If it bothers you, a subsequent balance with -musage=10 should vacuum things up.

Alternatively just keep using the filesystem; eventually the empty metadata chunks should be collected, on the next remount at the latest.

tl;dr: Never allocate all the chunks. Yes, this needs more graceful handling.

-h
Re: Scrub: no spae left on device
On 2015-12-08 10:06, Marc MERLIN wrote:
> Howdy,
>
> Why would scrub need space and why would it cancel if there isn't enough of
> it?
> (kernel 4.3)

Wild guess here, but maybe scrub unconditionally updates the error counters, regardless of whether any errors were found or not?
Re: Scrub: no spae left on device
On 12/08/15 16:46, Lionel Bouton wrote:
> On 08/12/2015 16:37, Holger Hoffstätte wrote:
>> On 12/08/15 16:06, Marc MERLIN wrote:
>>> Howdy,
>>>
>>> Why would scrub need space and why would it cancel if there isn't enough of
>>> it?
>>> (kernel 4.3)
>>>
>>> /etc/cron.daily/btrfs-scrub:
>>> btrfs scrub start -Bd /dev/mapper/cryptroot
>>> scrub device /dev/mapper/cryptroot (id 1) done
>>>         scrub started at Mon Dec 7 01:35:08 2015 and finished after 258 seconds
>>>         total bytes scrubbed: 130.84GiB with 0 errors
>>> btrfs scrub start -Bd /dev/mapper/pool1
>>> ERROR: scrubbing /dev/mapper/pool1 failed for device id 1 (No space left on
>>> device)
>>> scrub device /dev/mapper/pool1 (id 1) canceled
>> Scrub rewrites metadata (apparently even in -r aka readonly mode), and that
>> can lead to temporary metadata expansion (stuff gets COWed around); it's
>> a bit surprising but makes sense if you think about it.
>
> How long must I think about it until it makes sense? :-)
>
> Sorry I'm not sure why metadata is rewritten if no error is detected.
> I've several theories but lack information: is the fact that no error
> has been detected stored somewhere? is scrub using some kind of internal
> temporary snapshot(s) to avoid interfering with other operations? other
> reason I didn't think about?

Well... I have no idea what the historical motivation for this behaviour was, though I can make up at least two: rewriting known-good checksums generally (since you know they are good at this very moment), and, in case of error, avoiding the area where the block error occurred (read errors on rust are often clustered and affect entire tracks).

That's really all I know. I agree it's surprising, especially since it happens by default and also in -r mode, which might be considered a bug.

-h
Re: [PATCH 1/4] locks: new locks_mandatory_area calling convention
On Tue, Dec 08, 2015 at 03:54:53PM +0100, Christoph Hellwig wrote:
> On Tue, Dec 08, 2015 at 04:05:04AM +, Al Viro wrote:
> > Where the hell would truncate(2) get struct file, anyway? IOW, the inode
> > argument is _not_ pointless; re-added.
>
> Oh, right. Interestingly it seems like xfstests has no coverage of this
> code path at all.

LTP does (ftruncate04)...
Re: Scrub: no spae left on device
On 12/08/15 17:06, Marc MERLIN wrote:
> Label: 'btrfs_pool1'  uuid: 5ee24229-2431-448a-868e-2c325d10bfa7
>         Total devices 1 FS bytes used 524.26GiB
>         devid    1 size 615.01GiB used 614.94GiB path /dev/mapper/pool1

This is what I was alluding to. You could have started a -dusage balance *before* the scrub so that one or several data chunks get freed. Balancing metadata when you're out of space accomplishes nothing and only will very likely fail, just as you saw. You have ~90GB usable space, but that space is spread over chunks with low utilisation.

-h
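The "~90GB" figure can be checked against the `btrfs fi df` numbers Marc posted earlier in the thread. The following is a small sketch of that arithmetic; the helper names `to_bytes` and `slack_by_type` are made up for illustration, and the parsing assumes the output format exactly as quoted in this thread.

```python
import re

# `btrfs fi df` output as posted earlier in the thread.
FI_DF = """\
Data, single: total=604.88GiB, used=520.09GiB
System, DUP: total=32.00MiB, used=96.00KiB
Metadata, DUP: total=5.00GiB, used=4.17GiB
"""

UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def to_bytes(text):
    """Convert a size such as '604.88GiB' to a number of bytes."""
    match = re.fullmatch(r"([\d.]+)(B|KiB|MiB|GiB|TiB)", text)
    return float(match.group(1)) * UNITS[match.group(2)]


def slack_by_type(fi_df_output):
    """Map each block-group type to its allocated-but-unused bytes."""
    slack = {}
    for line in fi_df_output.splitlines():
        match = re.match(r"(\w+), \w+: total=(\S+), used=(\S+)", line)
        if match:
            kind, total, used = match.groups()
            slack[kind] = to_bytes(total) - to_bytes(used)
    return slack


slack = slack_by_type(FI_DF)
print(f"unused space inside data chunks: {slack['Data'] / 2**30:.2f} GiB")
print(f"unallocated space on the device: {615.01 - 614.94:.2f} GiB")
```

Roughly 84.79 GiB (about 91 GB) sits unused inside already allocated data chunks, while only about 0.07 GiB of the device is unallocated. Counting the DUP metadata and system chunks twice, 604.88 + 2 × 5.00 + 2 × 0.03 GiB comes to about 614.94 GiB, which matches the "used" figure reported by `btrfs fi show`.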
Re: [PATCH v2] btrfs: Introduce new mount option to disable tree log replay
On 2015-12-08 01:10, Qu Wenruo wrote:
> Introduce a new mount option "nologreplay" to co-operate with "ro" mount
> option to get real readonly mount, like "norecovery" in ext* and xfs.
>
> Since the new parse_options() need to check new flags at remount time,
> so add a new parameter for parse_options().
>
> Signed-off-by: Qu Wenruo
> ---
> v2:
>   Make RO check mandatory for btrfs_parse_options().
>   Add btrfs_show_options() support for nologreplay.
>
>   Document for btrfs-mount(5) will follow after the patch being merged.

Same set of tests I ran against the last version, still no issues, so:

Tested-by: Austin S. Hemmelgarn
Re: [PATCH] btrfs: Introduce new mount option to disable tree log replay
On Tue, 2015-12-08 at 07:15 -0500, Austin S Hemmelgarn wrote:
> Despite this, it really isn't a widely known or well documented behavior
> outside of developers, forensic specialists, and people who have had to
> deal with the implications it has on data recovery. There really isn't
> any way that the user would know about it without being explicitly told,
> and it's something that can have a serious impact on being able to
> recover a broken filesystem. TBH, I really feel that _every_
> filesystem's documentation should have something about how to make it
> mount truly read-only, even if it's just a reference to how to mark the
> block device read-only.

Exactly what I meant. The developers here should definitely consider that every normal end user may easily end up in the role of, e.g., a forensics specialist (especially with btrfs ;-) ) when attempting recovery after corruption.

I don't think that "it has always been improperly documented" (i.e. the "ro" option) is a good excuse to continue doing it that way =)

Cheers,
Chris.
Re: [PATCH] btrfs: Introduce new mount option to disable tree log replay
On 2015-12-08 14:20, Christoph Anton Mitterer wrote:
> On Tue, 2015-12-08 at 07:15 -0500, Austin S Hemmelgarn wrote:
>> Despite this, it really isn't a widely known or well documented behavior
>> outside of developers, forensic specialists, and people who have had to
>> deal with the implications it has on data recovery. There really isn't
>> any way that the user would know about it without being explicitly told,
>> and it's something that can have a serious impact on being able to
>> recover a broken filesystem. TBH, I really feel that _every_
>> filesystem's documentation should have something about how to make it
>> mount truly read-only, even if it's just a reference to how to mark the
>> block device read-only.
> Exactly what I've meant. And the developers here, should definitely
> consider that every normal end-user, may easily assume the role of e.g.
> a forensics specialist (especially with btrfs ;-) ), when recovery in
> case of corruptions is tried.
>
> I don't think that "it has always been improperly documented" (i.e. the
> "ro" option) is a good excuse to continue doing it that way =)

Agreed, 'but it's always been that way' is never a valid argument, and the fact that people who have been working on UNIX for decades know it doesn't mean that it's something that people will just inherently know. The only reason it was that way to begin with is because it was assumed that everyone dealing with computers had a huge amount of domain specific knowledge of them (this was a valid assumption back in 1970, it hasn't been a valid assumption since at least 1990). Stuff that seems obvious to people who have been working on it for years isn't necessarily obvious to people who have limited experience with it (I recently had to explain to a friend who had almost no networking background how IP addresses are just an abstraction for MAC addresses, and how it's not possible to block WiFi access based on an IP address; it took me three tries and eventually making the analogy of street addresses being an abstraction for geographical coordinates before he finally got it).

TBH, the only reason I knew about this rather annoying detail of filesystem implementation before using BTRFS is because of dealing with shared storage on VMs (that was an interesting week of debugging and restoring backups before I finally figured out what was going on).
btrfs scrub can neither start nor cancel
I just tried this script:
http://marc.merlins.org/perso/btrfs/2014-03.html#Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair

but I did not pass the directory where the filesystem is mounted.

Next I called it correctly: btrfs-scrub /t4
I also tried btrfs scrub start / cancel directly, but I am not really sure what I did in which order.

Anyway, now I can neither cancel nor start btrfs scrub. Rebooting did not help.

Running unmodified Linux 4.3

It seems like scrub stopped and did not clean up. Maybe because:
Dec 8 21:07:41 s5 kernel: [17833.840868] btrfs[23746]: segfault at ff98 ip 004079e1 sp 7fffafa27510 error 5 in btrfs[40+53000]

How can I now clean this up?

root@s5:~# btrfs --version
Btrfs v3.12

root@s5:~# btrfs scrub status /t4
scrub status for 700900de-e35f-4264-8f5d-1b2b249a5c3a
        scrub started at Tue Dec 8 21:05:31 2015, running for 20 seconds
        total bytes scrubbed: 3.09GiB with 0 errors

root@s5:~# btrfs scrub cancel /t4
ERROR: scrub cancel failed on /t4: not running

root@s5:~# btrfs scrub start /t4
ERROR: scrub is already running.
To cancel use 'btrfs scrub cancel /t4'.
To see the status use 'btrfs scrub status [-d] /t4'.

--
Wolfgang
Re: btrfs scrub can neither start nor cancel
On Tue, Dec 08, 2015 at 09:46:48PM +0100, Wolfgang Rohdewald wrote:
> I just tried this script:
> http://marc.merlins.org/perso/btrfs/2014-03.html#Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair
>
> but I did not pass the directory where the filesystem is mounted.
>
> Next I called it correctly: btrfs-scrub /t4
> I also tried btrfs scrub start / cancel directly, but
> I am not really sure what I did in which order.
>
> Anyway now I can neither cancel nor start btrfs scrub. Rebooting did not help.

It might be that the userspace tools have got confused and left behind a lock/pid/progress file in /var/lib/btrfs/

Take a look in there and see if there's anything that you can delete to good effect?

Hugo.

> Running unmodified Linux 4.3
>
> It seems like scrub stopped and did not clean up. Maybe because:
> Dec 8 21:07:41 s5 kernel: [17833.840868] btrfs[23746]: segfault at
> ff98 ip 004079e1 sp 7fffafa27510 error 5 in
> btrfs[40+53000]
>
> How can I now clean this up?
>
> root@s5:~# btrfs --version
> Btrfs v3.12
>
> root@s5:~# btrfs scrub status /t4
> scrub status for 700900de-e35f-4264-8f5d-1b2b249a5c3a
>         scrub started at Tue Dec 8 21:05:31 2015, running for 20 seconds
>         total bytes scrubbed: 3.09GiB with 0 errors
>
> root@s5:~# btrfs scrub cancel /t4
> ERROR: scrub cancel failed on /t4: not running
>
> root@s5:~# btrfs scrub start /t4
> ERROR: scrub is already running.
> To cancel use 'btrfs scrub cancel /t4'.
> To see the status use 'btrfs scrub status [-d] /t4'.

--
Hugo Mills             | Go not to the elves for counsel, for they will say
hugo@... carfax.org.uk | both no and yes.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |
Re: Fixing recursive fault and parent transid verify failed
On Tue, Dec 08, 2015 at 03:25:14PM +, Duncan wrote:
> Alistair Grant posted on Tue, 08 Dec 2015 06:55:04 +1100 as excerpted:
>
> > On Mon, Dec 07, 2015 at 01:48:47PM +, Duncan wrote:
> >> Alistair Grant posted on Mon, 07 Dec 2015 21:02:56 +1100 as excerpted:
> >>
> >> > I think I'll try the btrfs restore as a learning exercise, and to
> >> > check the contents of my backup (I don't trust my memory, so
> >> > something could have changed since the last backup).
> >>
> >> Trying btrfs restore is an excellent idea. It'll make things far
> >> easier if you have to use it for real some day.
> >>
> >> Note that while I see your kernel is reasonably current (4.2 series), I
> >> don't know what btrfs-progs ubuntu ships. There have been some marked
> >> improvements to restore somewhat recently, checking the wiki
> >> btrfs-progs release-changelog list says 4.0 brought optional metadata
> >> restore, 4.0.1 added --symlinks, and 4.2.3 fixed a symlink path check
> >> off-by-one error. (And don't use 4.1.1 as its mkfs.btrfs is broken and
> >> produces invalid filesystems.) So you'll want at least progs 4.0 to
> >> get the optional metadata restoration, and 4.2.3 to get full symlinks
> >> restoration support.
> >>
> >> ...

Thanks again Duncan for your assistance.

I plugged the ext4 drive I planned to use for the recovery into the machine and immediately got a couple of errors, which makes me wonder whether there is a hardware problem with the machine somewhere. So I decided to move to another machine to do the recovery.

So I'm now recovering on Arch Linux 4.1.13-1 with btrfs-progs v4.3.1 (the latest version from archlinuxarm.org).

Attempting:

sudo btrfs restore -S -m -v /dev/sdb /mnt/btrfs-recover/ 2>&1 | tee btrfs-recover.log

only recovered 53 of the more than 106,000 files that should be available.

The log is available at:
https://www.dropbox.com/s/p8bi6b8b27s9mhv/btrfs-recover.log?dl=0

I did attempt btrfs-find-root, but couldn't make sense of the output:
https://www.dropbox.com/s/qm3h2f7c6puvd4j/btrfs-find-root.log?dl=0

Simply mounting the drive, then re-mounting it read only, and rsync'ing the files to the backup drive recovered 97,974 files before crashing. If anyone is interested, I've uploaded a photo of the console to:
https://www.dropbox.com/s/xbrp6hiah9y6i7s/rsync%20crash.jpg?dl=0

I'm currently running a hashdeep audit between the recovered files and the backup to see how the recovery went.

If you'd like me to try any other tests, I'll keep the damaged file system for at least the next day or so.

Thanks again for all your assistance,
Alistair
Re: btrfs scrub can neither start nor cancel
On Tuesday, 8 December 2015, 20:51:08, Hugo Mills wrote:
> On Tue, Dec 08, 2015 at 09:46:48PM +0100, Wolfgang Rohdewald wrote:
> > Anyway now I can neither cancel nor start btrfs scrub. Rebooting did not
> > help.
>
> It might be that the userspace tools has got confused and left
> behind a lock/pid/progress file in /var/lib/btrfs/
>
> Take a look in there and see if there's anything that you can
> delete to good effect?

root@s5:/var/lib/btrfs# ls -l
total 4
srwxr-xr-x 1 root root   0 Dec 8 21:05 scrub.progress.700900de-e35f-4264-8f5d-1b2b249a5c3a
-rw------- 1 root root 394 Dec 8 21:05 scrub.status.700900de-e35f-4264-8f5d-1b2b249a5c3a

That fixed it, thanks!

I would have expected such temporary files to be deleted at reboot, so to me this looks like a bug in user space.

--
Wolfgang
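A cautious way to look for such leftovers before deleting anything is sketched below. The file name patterns are taken from the directory listing in this message; the helper `scrub_state_files` is hypothetical, it only lists files, and whether removing them is safe in every situation is not established by this thread.

```python
import tempfile
from pathlib import Path


def scrub_state_files(state_dir, fs_uuid):
    """Return the scrub.progress/scrub.status entries for one filesystem UUID."""
    names = (f"scrub.progress.{fs_uuid}", f"scrub.status.{fs_uuid}")
    directory = Path(state_dir)
    return [directory / name for name in names if (directory / name).exists()]


# Demo against a throwaway directory instead of the real /var/lib/btrfs.
uuid = "700900de-e35f-4264-8f5d-1b2b249a5c3a"
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / f"scrub.status.{uuid}").write_text("")
    for path in scrub_state_files(tmp, uuid):
        print("leftover:", path.name)
```

On a real system one would pass `/var/lib/btrfs` and the UUID shown by `btrfs scrub status`, and first confirm that no scrub is actually running.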
Re: [RFC] Btrfs device and pool management (wip)
On Mon, 2015-11-30 at 13:17 -0700, Chris Murphy wrote:
> On Mon, Nov 30, 2015 at 7:51 AM, Austin S Hemmelgarn wrote:
>
> > General thoughts on this:
> > 1. If there's a write error, we fail unconditionally right now. It would be
> > nice to have a configurable number of retries before failing.
>
> I'm unconvinced. I pretty much immediately do not trust a block device
> that fails even a single write, and I'd expect the file system to
> quickly get confused if it can't rely on flushing pending writes to
> that device.

From my large-amounts-of-storage-admin PoV,... I'd say it would be nice to have more knobs to control when exactly a device is considered no longer perfectly fine, which can include several different stages like:

- perhaps unreliable
  e.g. maybe the device shows SMART problems or there were correctable read and/or write errors under a certain threshold (either in total, or per time period).
  Then I could imagine that one can control whether the device is:
  - continued to be used normally until certain error thresholds are exceeded.
  - placed in a mode where data is still written to it, but only when there's a duplicate on at least one other good device,... so the device would be used as a read pool. Maybe optionally, data already on the device is auto-replicated to good devices.
  - taken offline (perhaps only to be automatically reused in case of emergency, as a hot spare, when the fs knows that otherwise it's even more likely that data would be lost soon).

- failed
  the threshold from above has been reached; the fs suspects the device will fail completely soon.
  Possible knobs would include how aggressively the fs tries to move data off the device. How often should retries be made? In case the other devices are under high IO load, what percentage should be used to get the still-working data off the bad device (i.e. up to 100%, meaning "rather stop any other IO, just to move the data to good devices ASAP")?

- dead
  accesses don't work anymore at all and the fs shouldn't even waste time trying to read/recover data from it.

It would also make sense to allow tuning what conditions need to be met to e.g. consider a drive unreliable (e.g. which SMART errors?) and to allow an admin to manually place a drive in a certain state (e.g. SMART would still be good, no IO errors so far, but the drive is 5 years old and I'd rather consider it unreliable).

That's, to some extent, what we at our LHC Tier-2 do at higher levels (partly simply by human management, partly via the storage management system we use (dCache), partly by RAID and other tools and scripting).

In any case, though,... any of these knobs should IMHO default to the most conservative settings. In other words: if a device shows the slightest hint of being unstable/unreliable/failed... it should be considered bad, and no new data should go on it (if necessary, because not enough other devices are left, the fs should go ro). The only thing I wouldn't have an opinion on is: should the fs go ro and do nothing, waiting for a human to decide what's next, or should it go ro and (if possible) try to move data off the bad device (per default)?

Generally, a filesystem should be safe per default (which is why I see the issue in the other thread, with the corruption/security leaks in case of UUID collisions, as quite a showstopper). From the admin side, I don't want to be required to make it safe,.. my interaction should rather only be needed to tune things.

Of course I'm aware that btrfs brings several techniques which make it unavoidable that more maintenance is put into the filesystem, but, per default, this should be minimised as far as possible.

Cheers,
Chris.
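The stages proposed in this message could be modelled roughly as follows. Nothing here corresponds to an existing btrfs interface: the state names follow the message, while `classify`, `soft_limit`, and `hard_limit` are hypothetical names chosen only to illustrate the idea of tunable thresholds with conservative defaults.

```python
from enum import Enum


class DeviceState(Enum):
    GOOD = "good"
    UNRELIABLE = "perhaps unreliable"
    FAILED = "failed"
    DEAD = "dead"


def classify(error_count, responds, soft_limit=0, hard_limit=10):
    """Map an error count to a state; limits are hypothetical admin knobs.

    The default soft_limit of 0 is the conservative setting: a single
    error is already enough to treat the device as unreliable.
    """
    if not responds:
        return DeviceState.DEAD        # accesses no longer work at all
    if error_count > hard_limit:
        return DeviceState.FAILED      # expected to fail completely soon
    if error_count > soft_limit:
        return DeviceState.UNRELIABLE  # some errors, still below hard_limit
    return DeviceState.GOOD


print(classify(error_count=1, responds=True).value)
```

An admin who trusts a device more could raise `soft_limit`; the manual override mentioned in the message would simply bypass `classify` and set the state directly.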
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Hey Hugo,

On Thu, 2015-11-26 at 00:33 +, Hugo Mills wrote:
> Answering the second part first, no, it can't.

Thanks so far :)

> The issue is that nodatacow bypasses the transactional nature of
> the FS, making changes to live data immediately. This then means that
> if you modify a nodatacow file, the csum for that modified section is
> out of date, and won't be back in sync again until the latest
> transaction is committed. So you can end up with an inconsistent
> filesystem if there's a crash between the two events.

Sure,... (and btw: is there some kind of journal planned for nodatacow'ed files?),... but why not simply try to write an updated checksum after the modified section has been flushed to disk... of course there's no guarantee that both are consistent in case of a crash (but that's also the case without any checksum)... but at least one would have the csum protection against everything else (block errors and the like) in case no crash occurs?

> > For me the checksumming is actually the most important part of btrfs
> > (not that I wouldn't like its other features as well)... so turning it
> > off is something I really would want to avoid.
> >
> > Plus it opens questions like: When there are no checksums, how can it
> > (in the RAID cases) decide which block is the good one in case of
> > corruptions?
> It doesn't decide -- both copies look equally good, because there's
> no checksum, so if you read the data, the FS will return whatever data
> was on the copy it happened to pick.

Hmm I see... so one gets basically the behaviour of RAID. Isn't that kind of a big loss? I always considered the guarantee against block errors and the like one of the big and basic features of btrfs. It seems that for certain (not too unimportant) cases (DBs, VMs) one has to choose between two evils: losing the guaranteed consistency via checksums... or basically running into severe trouble (like Mitch's reported fragmentation issues).

> > 3) When I would actually disable datacow for e.g. a subvolume that
> > holds VMs or DBs... what are all the implications?
> > Obviously no checksumming, but what happens if I snapshot such a
> > subvolume or if I send/receive it?
>
> After snapshotting, modifications are CoWed precisely once, and
> then it reverts to nodatacow again. This means that making a snapshot
> of a nodatacow object will cause it to fragment as writes are made to
> it.

I see... something that should possibly go into some advanced admin documentation (if not already in). It basically means that one must ensure that any such files (VM images, DB data dirs) are already created with nodatacow (perhaps on a subvolume which is mounted as such).

> > 4) Duncan mentioned that defrag (and I guess that's also for auto-
> > defrag) isn't ref-link aware...
> > Isn't that somehow a complete showstopper?
> It is, but the one attempt at dealing with it caused massive data
> corruption, and it was turned off again.

So... does this mean that it's still planned to be implemented some day, or has it been given up forever? And is it (hopefully) also planned to be implemented for reflinks when compression is added/changed/removed?

Given that you (or Duncan?,... sorry, I sometimes mix up which of you said exactly what, since both of you are notoriously helpful :-) ) mentioned that autodefrag basically fails with larger files,... and given that it seems to be quite important for btrfs not to be fragmented too heavily, it sounds a bit as if anything that uses (multiple) reflinks (e.g. snapshots) cannot really be used very well.

> autodefrag, however, has
> always been snapshot aware and snapshot safe, and would be the
> recommended approach here.

Ahhh... so autodefrag *is* snapshot aware, and that's basically why the suggestion is (AFAIU) that it's turned on, right? So, I'm afraid O:-), that triggers a follow-up question: why isn't it the default? Or in other words, what are its drawbacks (e.g. other cases where ref-links would be broken up,... or issues with compression)?

And also, when I now activate it on an already populated fs, will it also defrag any old files (even if they're not rewritten or so)?

I tried to look for some general (rather "for dummies" than for core developers) description of how defrag and autodefrag work... but couldn't find anything in the usual places... :-(

btw: The wiki (https://btrfs.wiki.kernel.org/index.php/UseCases#How_do_I_defragment_many_files.3F) doesn't mention that auto-defrag doesn't suffer from that problem.

> (Actually, it was broken in the same
> incident I just described -- but fixed again when the broken patches
> were reverted).

So it just couldn't be fixed (hopefully: yet) for the (manual) online defragmentation?!

> > 5) Especially keeping (4) in mind but also the other comments in from
> > Duncan and Austin...
> > Is auto-defrag now recommended to be generally used?
>
> Absolutely, yes.

I see... well, I'll probably wait
Re: Scrub: no spae left on device
Marc MERLIN posted on Tue, 08 Dec 2015 08:06:15 -0800 as excerpted:

> On Tue, Dec 08, 2015 at 04:46:32PM +0100, Lionel Bouton wrote:
>> On 08/12/2015 16:37, Holger Hoffstätte wrote:
>> > On 12/08/15 16:06, Marc MERLIN wrote:
>> >>
>> >> Why would scrub need space and why would it cancel if there isn't
>> >> enough of it? (kernel 4.3)
>> >>
>> >> btrfs scrub start -Bd /dev/mapper/pool1
>> >> ERROR: scrubbing /dev/mapper/pool1 failed for device id 1
>> >> (No space left on device)
>> >> scrub device /dev/mapper/pool1 (id 1) canceled
>> > Scrub rewrites metadata (apparently even in -r aka readonly mode),
>> > and that can lead to temporary metadata expansion (stuff gets COWed
>> > around); it's a bit surprising but makes sense if you think about it.

Are you sure about that? My / is mounted ro by default, and if I try to scrub it in normal mode, it'll error out due to read-only. But I can run a read-only scrub just fine, and if I find errors, I simply mount it writable and redo the scrub without the -r. (My / is only 8 GiB, under half used including metadata, on a fast SSD, so scrubs complete in under 30 seconds, and doing a read-only scrub followed by a mount-writable and a second fixing scrub if necessary is trivial.)

>> Sorry I'm not sure why metadata is rewritten if no error is detected.

But scrub will of course do copy-on-write if there's an error, and it's possible that on initialization it checks for space to do a few cows if necessary, before it actually checks for the -r read-only flag. I try to leave at least enough unallocated space to do a balance, which of course, except for -dusage=0 (or -musage=0), writes a new chunk to rewrite existing chunks into, so I'd be unlikely to ever get close enough to out of space to trigger the possible initialization-time space warning, and thus wouldn't know whether it has one or whether it comes before the -r check, or not.

> And this is what I got:
> legolas:~# btrfs balance start -musage=10 -v /mnt/btrfs_pool1/
> Dumping filters: flags 0x6, state 0x0, force is off
>   METADATA (flags 0x2): balancing, usage=10
>   SYSTEM (flags 0x2): balancing, usage=10
> ERROR: error during balancing '/mnt/btrfs_pool1/' - No space left on
> device There may be more info in syslog - try dmesg | tail
>
> Ok, that sucks.
>
> legolas:~# btrfs balance start -musage=0 -v /mnt/btrfs_pool1/
> Dumping filters: flags 0x6, state 0x0, force is off
>   METADATA (flags 0x2): balancing, usage=0
>   SYSTEM (flags 0x2): balancing, usage=0
> Done, had to relocate 0 out of 618 chunks
>
> This worked. Mmmh, I thought this wouldn't be necessary anymore in 4.3
> kernels?

Well, it said it had to relocate zero chunks, so it _appears_ that it didn't do anything, which would be expected on reasonably current kernels as they already clean up zero-usage chunks automatically. *BUT*...

> legolas:~# btrfs balance start -musage=10 -v /mnt/btrfs_pool1
> Dumping filters: flags 0x6, state 0x0, force is off
>   METADATA (flags 0x2): balancing, usage=10
>   SYSTEM (flags 0x2): balancing, usage=10
> Done, had to relocate 1 out of 618 chunks

... if it did nothing in the -musage=0 case above, why did the -musage=10 case fail before, but succeed after?

That's a very good question I don't have an answer to. Good question for the devs and others that actually read code.

Meanwhile, note that if it relocates only a single chunk (of non-zero usage), under normal circumstances it'll take exactly the same amount of space as before, because it'd allocate a new chunk of exactly the same size as the one it was rewriting.

However, once remaining unallocated space gets tight enough, it starts allocating smaller than normal chunks, which may be what happened this time. Presumably that chunk was originally allocated when the filesystem still had much more unallocated free space, so it was a standard size chunk. When it was rewritten, unallocated space was much tighter, so a smaller chunk would likely be written, which would then be rather fuller than it was previously, as it would have the same amount of metadata in it, but be a smaller chunk.

And, perhaps partially answering my own question above, the balance with -musage=0 somehow triggered a space reevaluation, thus allowing the -musage=10 balance to run afterward when it wouldn't before, even though the -musage=0 didn't actually relocate (to /dev/null as they'd be empty, IOW, delete) any empty chunks.

But... it still shouldn't happen, as if -musage=0 didn't relocate anything, it shouldn't trigger a space reevaluation that -musage=10 wouldn't trigger on its own, so while this might partially answer what happened, it does nothing to explain /why/ it happened. I'd call it a bug in the balance code, as the result of the -musage=10 should be exactly the same before and after, because the -musage=0 didn't actually relocate/delete anything.

> And now I'm back in business...
>
> Still, this is a bit disappointing and at the
Missing half of available space (resend)
Hi all. I'm trying to figure out why my btrfs file system doesn't show all the available space. I currently have four 4TB drives set up as a raid6 array, so I would expect to see a total available data size slightly under 8TB (two drives for data + two drives for parity). The 'btrfs fi df' command consistently shows a total size of around 3TB, and says that space is almost completely full. Here's my current system information...

===
root@selene:~# uname -a
Linux selene.dhampton.net 3.19.0-32-generic #37~14.04.1-Ubuntu SMP Thu Oct 22 09:41:40 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

root@selene:~# btrfs --version
Btrfs v3.12

root@selene:~# btrfs fi show /video
Label: none uuid: 74a4c4fa-9e83-465a-850d-cc089ecd00f6
    Total devices 4 FS bytes used 3.12TiB
    devid 1 size 3.64TiB used 1.58TiB path /dev/vdb
    devid 2 size 3.64TiB used 1.58TiB path /dev/vda
    devid 3 size 3.64TiB used 1.58TiB path /dev/vdc
    devid 4 size 3.64TiB used 1.58TiB path /dev/vdd
Btrfs v3.12

root@selene:~# btrfs fi df /video
Data, RAID6: total=3.15TiB, used=3.11TiB
System, RAID6: total=64.00MiB, used=352.00KiB
Metadata, RAID6: total=5.00GiB, used=3.73GiB
unknown, single: total=512.00MiB, used=1.07MiB

root@selene:~# df -h /video
Filesystem Size Used Avail Use% Mounted on
/dev/vda 15T 3.2T 8.3T 28% /video
===

I have tried issuing the command "btrfs filesystem resize :max /video" on each devid in the array, and also tried balancing the array. None of these commands changed the indication that the file system is almost full.

I'm wondering if the problem is because this file system began as a two drive raid1 array, and I later added the other two drives and used the 'btrfs balance' command to convert to raid6.

Any suggestions on what I can try to get the 'btrfs fi df' command to show me more available space? Did I forget a command when I converted the raid1 array to raid6?

Alternatively, can I trust the numbers in the standard df command? The 'used' number seems right but the 'avail' number seems high.
If i can provide any more information to help figure out what's happening, please ask. Thanks. David [0.00] Initializing cgroup subsys cpuset [0.00] Initializing cgroup subsys cpu [0.00] Initializing cgroup subsys cpuacct [0.00] Linux version 3.19.0-32-generic (buildd@lgw01-43) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #37~14.04.1-Ubuntu SMP Thu Oct 22 09:41:40 UTC 2015 (Ubuntu 3.19.0-32.37~14.04.1-generic 3.19.8-ckt7) [0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-3.19.0-32-generic root=UUID=b9fb1104-f681-4664-b0c3-b17db28d9d68 ro quiet splash vt.handoff=7 [0.00] KERNEL supported cpus: [0.00] Intel GenuineIntel [0.00] AMD AuthenticAMD [0.00] Centaur CentaurHauls [0.00] e820: BIOS-provided physical RAM map: [0.00] BIOS-e820: [mem 0x-0x0009fbff] usable [0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved [0.00] BIOS-e820: [mem 0x000f-0x000f] reserved [0.00] BIOS-e820: [mem 0x0010-0x3fffdfff] usable [0.00] BIOS-e820: [mem 0x3fffe000-0x3fff] reserved [0.00] BIOS-e820: [mem 0xfeffc000-0xfeff] reserved [0.00] BIOS-e820: [mem 0xfffc-0x] reserved [0.00] NX (Execute Disable) protection: active [0.00] SMBIOS 2.4 present. 
[0.00] DMI: Red Hat KVM, BIOS 0.5.1 01/01/2011 [0.00] Hypervisor detected: KVM [0.00] e820: update [mem 0x-0x0fff] usable ==> reserved [0.00] e820: remove [mem 0x000a-0x000f] usable [0.00] AGP: No AGP bridge found [0.00] e820: last_pfn = 0x3fffe max_arch_pfn = 0x4 [0.00] MTRR default type: write-back [0.00] MTRR fixed ranges enabled: [0.00] 0-9 write-back [0.00] A-B uncachable [0.00] C-F write-protect [0.00] MTRR variable ranges enabled: [0.00] 0 base 8000 mask 3FFF8000 uncachable [0.00] 1 disabled [0.00] 2 disabled [0.00] 3 disabled [0.00] 4 disabled [0.00] 5 disabled [0.00] 6 disabled [0.00] 7 disabled [0.00] PAT configuration [0-7]: WB WC UC- UC WB WC UC- UC [0.00] found SMP MP-table at [mem 0x000f1ff0-0x000f1fff] mapped at [880f1ff0] [0.00] Scanning 1 areas for low memory corruption [0.00] Base memory trampoline at [88099000] 99000 size 24576 [0.00] init_memory_mapping: [mem 0x-0x000f] [0.00] [mem 0x-0x000f] page 4k [0.00] BRK [0x01fd4000, 0x01fd4fff] PGTABLE [0.00] BRK [0x01fd5000, 0x01fd5fff] PGTABLE [0.00] BRK [0x01fd6000, 0x01fd6fff] PGTABLE [0.00] init_memory_mapping: [mem
Re: Missing half of available space (resend)
On Tue, Dec 8, 2015 at 10:02 PM, David Hampton wrote:
> The 'btrfs fi df' command consistently shows a total size of around 3TB, and
> says that space is almost completely full.

and

> root@selene:~# btrfs fi df /video
> Data, RAID6: total=3.15TiB, used=3.11TiB

The "total=3.15TiB" means "there's a total of 3.15TiB allocated for data chunks using raid6 profile" and of that 3.11TiB is used.

btrfs fi df doesn't ever show how much is free or available. You can get an estimate of that by using 'btrfs fi usage' instead.

> root@selene:~# df -h /video
> Filesystem Size Used Avail Use% Mounted on
> /dev/vda 15T 3.2T 8.3T 28% /video

That's about right although it seems it's slightly overestimating the available free space.

--
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
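The figures quoted in this thread can be cross-checked with some back-of-the-envelope arithmetic. The sketch below is only an illustration and not how btrfs or df compute anything: it assumes a raid6 profile keeps (n - 2) of every n devices' worth of space for data, it ignores metadata and system chunks, and the helper name raid6_usable is made up for this example. The per-device 'size' and 'used' values are the ones from the 'btrfs fi show' output in the original report.

```python
# Rough cross-check of the numbers in this thread (all values in TiB).
# Assumption: raid6 stores data on (n - 2) of n devices per stripe;
# metadata/system chunks and chunk-size rounding are ignored.


def raid6_usable(raw_per_device, num_devices):
    """Data capacity when raw_per_device is taken from each device (raid6)."""
    if num_devices < 3:
        raise ValueError("this raid6 estimate assumes at least 3 devices")
    return raw_per_device * (num_devices - 2)


NUM_DEVICES = 4
DEVICE_SIZE = 3.64       # 'size' per devid in 'btrfs fi show'
DEVICE_ALLOCATED = 1.58  # 'used' per devid in 'btrfs fi show'

# Data capacity if every byte of every device were allocated to raid6 chunks.
total_capacity = raid6_usable(DEVICE_SIZE, NUM_DEVICES)

# Data capacity of the chunks allocated so far.
allocated = raid6_usable(DEVICE_ALLOCATED, NUM_DEVICES)

# Data capacity still obtainable from the unallocated part of each device.
unallocated = raid6_usable(DEVICE_SIZE - DEVICE_ALLOCATED, NUM_DEVICES)

# Raw unallocated space across all devices, with no parity deducted.
raw_unallocated = (DEVICE_SIZE - DEVICE_ALLOCATED) * NUM_DEVICES

print(f"capacity if fully allocated: {total_capacity:.2f} TiB")
print(f"allocated so far (data):     {allocated:.2f} TiB")
print(f"still allocatable (data):    {unallocated:.2f} TiB")
print(f"raw unallocated (no parity): {raw_unallocated:.2f} TiB")
```

Under these assumptions the allocated figure lands close to the 3.15TiB 'total' that 'btrfs fi df' reports for data, which fits the explanation that 'total' means allocated rather than available. The raw unallocated figure is close to the 'Avail' column of df, while the parity-adjusted one is about half of that; treat this as a plausibility check only.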
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
On 2015-11-27 00:08, Duncan wrote: > Christoph Anton Mitterer posted on Thu, 26 Nov 2015 01:23:59 +0100 as > excerpted: >> 1) AFAIU, the fragmentation problem exists especially for those files >> that see many random writes, especially, but not limited to, big files. >> Now that databases and VMs are affected by this, is probably broadly >> known in the meantime (well at least by people on that list). >> But I'd guess there are n other cases where such IO patterns can happen >> which one simply never notices, while the btrfs continues to degrade. > > The two other known cases are: > > 1) Bittorrent download files, where the full file size is preallocated > (and I think fsynced), then the torrent client downloads into it a chunk > at a time. Okay, sounds obvious. > The more general case would be any time a file of some size is > preallocated and then written into more or less randomly, the problem > being the preallocation, which on traditional rewrite-in-place > filesystems helps avoid fragmentation (as well as ensuring space to save > the full file), but on COW-based filesystems like btrfs, triggers exactly > the fragmentation it was trying to avoid. Is it really just the case when the file storage *is* actually fully pre-allocated? Cause that wouldn't (necessarily) be the case for e.g. VM images (e.g. qcow2, or raw images when these are sparse files). Or is it rather any case where, in larger file, many random (file internal) writes occur? > arranging to > have the client write into a dir with the nocow attribute set, so newly > created torrent files inherit it and do rewrite-in-place, is highly > recommended. At the IMHO pretty high expense of loosing the checksumming :-( Basically loosing half of the main functionalities that make btrfs interesting for me. > It's also worth noting that once the download is complete, the files > aren't going to be rewritten any further, and thus can be moved out of > the nocow-set download dir and treated normally. Sure... 
but this requires manual intervention. For databases, will e.g. the vacuuming maintenance tasks solve the fragmentation issues (cause I guess at least when doing full vacuuming, it will rewrite the files). > The problem is much reduced in newer systemd, which is btrfs aware and in > fact uses btrfs-specific features such as subvolumes in a number of cases > (creating subvolumes rather than directories where it makes sense in some > shipped tmpfiles.d config files, for instance), if it's running on > btrfs. Hmm doesn't seem really good to me if systemd would do that, cause it then excludes any such files from being snapshot. > For the journal, I /think/ (see the next paragraph) that it now > sets the journal files nocow, and puts them in a dedicated subvolume so > snapshots of the parent won't snapshot the journals, thereby helping to > avoid the snapshot-triggered cow1 issue. The same here, kinda disturbing if systemd would decide that on it's own, i.e. excluding files from being checksum protected... >> So is there any general approach towards this? > The general case is that for normal desktop users, it doesn't tend to be > a problem, as they don't do either large VMs or large databases, Well depends a bit on how one defines the "normal desktop user",... for e.g. developers or more "power users" it's probably not so unlikely that they do run local VMs for testing or whatever. > and > small ones such as the sqlite files generated by firefox and various > email clients are handled quite well by autodefrag, with that general > desktop usage being its primary target. Which is however not yet the default... 
> For server usage and the more technically inclined workstation users who > are running VMs and larger databases, the general feeling seems to be > that those adminning such systems are, or should be, technically inclined > enough to do their research and know when measures such as nocow and > limited snapshotting along with manual defrags where necessary, are > called for. mhh... well it's perhaps simple to expect that knowledge for few things like VMs, DBs and that like... but there are countless of software systems, many of them being more or less like a black box, at least with respect to their internals. It feels a bit, if there should be some tools provided by btrfs, which tell the users which files are likely problematic and should be nodatacow'ed > And if they don't originally, they find out when they start > researching why performance isn't what they expected and what to do about > it. =:^) Which can take quite a while to be found out... >> And what are the actual possible consequences? Is it just that fs gets >> slower (due to the fragmentation) or may I even run into other issues to >> the point the space is eaten up or the fs becomes basically unusable? > It's primarily a performance issue, tho in severe cases it can also be a > scaling issue, to the point that maintenance tasks such
!PageLocked BUG_ON hit in clear_page_dirty_for_io
Not sure if I've already reported this one, but I've been seeing this a lot this last couple days. kernel BUG at mm/page-writeback.c:2654! invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN CPU: 1 PID: 2566 Comm: trinity-c1 Tainted: GW 4.4.0-rc4-think+ #14 task: 880462811b80 ti: 8800cd808000 task.ti: 8800cd808000 RIP: 0010:[] [] clear_page_dirty_for_io+0x180/0x1d0 RSP: 0018:8800cd80fa00 EFLAGS: 00010246 RAX: 880c RBX: ea0011098a00 RCX: 8800cd80fbb7 RDX: dc00 RSI: 110019b01f76 RDI: ea0011098a00 RBP: 8800cd80fa20 R08: 880453272000 R09: R10: R11: R12: 88045326f2c0 R13: 88046272a310 R14: R15: 0001 FS: 7f186573d700() GS:880468a0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 010dd580 CR3: 00046261c000 CR4: 001406e0 Stack: 0001 88046272a310 88046272a310 8800cd80fa90 c03891b5 8800cd80fb30 880402400040 88045326f0e8 1000 88045326ed88 8800cd80fbb0 Call Trace: [] lock_and_cleanup_extent_if_need+0xa5/0x260 [btrfs] [] __btrfs_buffered_write+0x324/0x8a0 [btrfs] [] ? btrfs_dirty_pages+0xf0/0xf0 [btrfs] [] ? generic_file_direct_write+0x2ac/0x2c0 [] ? generic_file_read_iter+0xa00/0xa00 [] btrfs_file_write_iter+0x6dd/0x800 [btrfs] [] __vfs_write+0x21d/0x260 [] ? __vfs_read+0x260/0x260 [] ? __lock_is_held+0x92/0xd0 [] ? preempt_count_sub+0xc1/0x120 [] ? percpu_down_read+0x57/0xa0 [] ? __sb_start_write+0xb4/0xf0 [] vfs_write+0xf6/0x260 [] SyS_write+0xbf/0x160 [] ? SyS_read+0x160/0x160 [] ? trace_hardirqs_on_thunk+0x17/0x19 [] entry_SYSCALL_64_fastpath+0x12/0x6b Code: 61 01 49 8d bd f0 00 00 00 8d 14 c5 08 00 00 00 e8 b6 cd 31 00 f6 c7 02 74 20 e8 8c 41 ec ff 53 9d b8 01 00 00 00 e9 1d ff ff ff <0f> 0b 48 89 df e8 b6 f5 ff ff e9 41 ff ff ff 53 9d e8 0a e7 eb That BUG is.. 2653 2654 BUG_ON(!PageLocked(page)); 2655 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: subvols and parents - how?
On Fri, 2015-11-27 at 02:02 +, Duncan wrote:
> > Uhm, I don't get the big security advantage here... whether nested or
> > manually mounted to a subdir,... if the permissions are insecure I'll
> > have a problem... if they're secure, then not.
> Consider a setuid-root binary with a recently publicized but patched on
> your system vuln. But if you have root snapshots from before the patch
> and those snapshots are nested below root, then they're always
> accessible. If the path to the vulnerable setuid is as user accessible
> as it likely was in its original location, then anyone with login access
> to the system is likely to be able to run it from the snapshot... and
> will be able to get root due to the vuln.
Hmm good point... I think it would be great if you could add that scenario somewhere to the documentation. :-)
Based on that one can easily think about more/similar examples... device files that had too permissive modes set, and were snapshotted like that... and so on.

I think that's another example why it would be nice if btrfs had something (per subvolume) like ext4's default mount options (I mean the ones stored in the superblock).

Not only would it allow the userland tools to do things like "adding noatime" by default on snapshots (at least ro snapshots), so that one can have them nested and still doesn't suffer from the previously discussed writes-on-read-amplifications... it would also allow setting things like nodev, noexec, nosuid and the like on subvols... and again it would make the whole thing practically usable with nested subvols.

Where would be the appropriate place to record that as a feature request? Simply here on the list?

Cheers,
Chris.
Re: attacking btrfs filesystems via UUID collisions?
On Sun, 2015-12-06 at 22:34 +0800, Qu Wenruo wrote: > Not sure about LVM/MD, but they should suffer the same UUID conflict > problem. Well I had that actually quite often in LVM (i.e. same UUIDs visible on the same system), basically because we made clones from one template VM image and when that is normally booted, LVM doesn't allow to change the UUIDs of already active PV/VG/LVs (or maybe just some of these three, forgot the details) But there was never any issue, LVM on the host system, when one set was already used, continues to use that just fine and the toolset reports which it would use (more below). > The only idea I have can only enhance the behavior, but never fix it. > For example, if found multiple btrfs devices with same devid, just > refuse to mount. > And for already mounted btrfs, ignore any duplicated fsid/devid. Well I think that's already a perfectly valid solution... basically the idea that I had before. I'd call that a 100% fix, not just a workaround. If then the tools (i.e. btrfstune) allows to change the UUID of the duplicate set of devices (perhaps again with the necessity to specify each of them via device=/dev/sda,etc.) I'd be completely happy again,... and the show could get on ;) > The problem can get even tricky for case like device missing for a > while > and appear again case. I had thought about that too: a) In the non-malicious case, this could e.g. mean that a device from a btrfs RAID was missing and a clone with the same UUID / dev ID get's added to the system Possible consequences, AFAICS: - The data is simply auto-rebuilt on the clone. - Some corruptions occur when the clone is older, and data that was only on the newer device is now missing (not sure if this can happen at all or whether generation IDs prevent it). b) In the malicious/attack case, one possible scenario could be: A device is missing from a btrfs RAID... the machine is left unattended. An attacker comes plugs in the USB stick with the missing UUID. 
Is the rebuild (and thus data leakage) now happening automatically? In any case though, a simply solution could be, that not automatic assemblies happen per default, and the people who still want to do that, are properly warned about the possible implications in the docs. > But just as you mentioned, it *IS* a real problem, and we should need > to > enhance it. Should one (or I) add this as a ticket to the kernel bugzilla, or as an entry to the btrfs wiki? > I'd like to see how LVM/DM behaves first, at least as a reference if > they are really so safe. Well that's very simple to check, I did it here for the LV case only: root@lcg-lrz-admin:~# truncate -s 1G image1 root@lcg-lrz-admin:~# losetup -f image1 root@lcg-lrz-admin:~# pvcreate /dev/loop0 Physical volume "/dev/loop0" successfully created root@lcg-lrz-admin:~# losetup -d /dev/loop0 root@lcg-lrz-admin:~# cp image1 image2 root@lcg-lrz-admin:~# losetup -f image1 root@lcg-lrz-admin:~# pvscan PV /dev/sdb VG vg_data lvm2 [50,00 GiB / 0free] PV /dev/sda1VG vg_system lvm2 [9,99 GiB / 0free] PV /dev/loop0 lvm2 [1,00 GiB] Total: 3 [60,99 GiB] / in use: 2 [59,99 GiB] / in no VG: 1 [1,00 GiB] root@lcg-lrz-admin:~# losetup -f image2 root@lcg-lrz-admin:~# pvscan Found duplicate PV tSK9Cdpw6bcmocZnxFPD6ThNz1opRXsB: using /dev/loop1 not /dev/loop0 PV /dev/sdb VG vg_data lvm2 [50,00 GiB / 0free] PV /dev/sda1VG vg_system lvm2 [9,99 GiB / 0free] PV /dev/loop1 lvm2 [1,00 GiB] Total: 3 [60,99 GiB] / in use: 2 [59,99 GiB] / in no VG: 1 [1,00 GiB] Obviously, with PVs alone, there is no "x is already used" case. As one can see it just says it would ignore one of them, which I think is rather stupid in that particular case (i.e. non of the devices already used somehow), because it probably just "randomly" decides which is to be used, which is ambiguous. > And what will rescan show if they are not active? 
My experience was always (it's just quite late and I don't want to simulate everything right now, which is trivial anyway): - It shows warnings about the duplicates in the tools - It continues to use the already active devices (if any) - Unfortunately, while the kernel continues to use the already used devices, the toolset may use other device (kinda stupid, but at least it warns and the already used devices seem to be still properly used): continuation from the setup above: root@lcg-lrz-admin:~# losetup -d /dev/loop1 (now only image1 is seen as loop0) root@lcg-lrz-admin:~# vgcreate vg_test /dev/loop0 Volume group "vg_test" successfully created root@lcg-lrz-admin:~# lvcreate -n test vg_test -l 100 Logical volume "test" created root@lcg-lrz-admin:~# mkfs.ext4 /dev/vg_test/test mke2fs 1.42.12 (29-Aug-2014) ... root@lcg-lrz-admin:~# mount /dev/vg_test/test /mnt/ root@lcg-lrz-admin:~# losetup -a /dev/loop0: [64768]:518297 (/root/image1) root@lcg-lrz-admin:~# losetup -f image2 root@lcg-lrz-admin:~# vgs
Re: kernel call trace during send/receive
Hey. Hmm I guess no one has any clue about that error? Well it seems at least that an fsck over the receiving fs passes through without any error. Cheers, Chris. On Fri, 2015-11-27 at 02:49 +0100, Christoph Anton Mitterer wrote: > Hey. > > Just got the following during send/receiving a big snapshot from one > btrfs to another fresh one. > > Both under kernel 4.2.6, tools 4.3 > > The send/receive seems to continue however... > > Any ideas what that means? > > Cheers, > Chris. > > Nov 27 01:52:36 heisenberg kernel: [ cut here ] > > Nov 27 01:52:36 heisenberg kernel: WARNING: CPU: 7 PID: 18086 at > /build/linux-CrHvZ_/linux-4.2.6/fs/btrfs/send.c:5794 > btrfs_ioctl_send+0x661/0x1120 [btrfs]() > Nov 27 01:52:36 heisenberg kernel: Modules linked in: ext4 mbcache > jbd2 nls_utf8 nls_cp437 vfat fat uas vhost_net vhost macvtap macvlan > xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 > iptable_nat nf_nat_ipv4 nf_nat xt_tcpudp tun bridge stp llc fuse ccm > ebtable_filter ebtables seqiv ecb drbg ansi_cprng algif_skcipher md4 > algif_hash af_alg binfmt_misc xfrm_user xfrm4_tunnel tunnel4 ipcomp > xfrm_ipcomp esp4 ah4 cpufreq_userspace cpufreq_powersave > cpufreq_stats cpufreq_conservative ip6t_REJECT nf_reject_ipv6 > nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_policy > ipt_REJECT nf_reject_ipv4 xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 > xt_multiport xt_conntrack nf_conntrack iptable_filter ip_tables > x_tables joydev rtsx_pci_ms rtsx_pci_sdmmc mmc_core memstick iTCO_wdt > iTCO_vendor_support x86_pkg_temp_thermal > Nov 27 01:52:36 heisenberg kernel: intel_powerclamp intel_rapl > iosf_mbi coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul evdev > deflate ctr psmouse serio_raw twofish_generic pcspkr btusb btrtl > btbcm btintel bluetooth crc16 uvcvideo videobuf2_vmalloc > videobuf2_memops videobuf2_core v4l2_common videodev media > twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common > sg arc4 camellia_generic iwldvm 
mac80211 iwlwifi cfg80211 rtsx_pci > rfkill camellia_aesni_avx_x86_64 snd_hda_codec_hdmi tpm_tis tpm > 8250_fintek camellia_x86_64 snd_hda_codec_realtek > snd_hda_codec_generic processor battery fujitsu_laptop i2c_i801 ac > lpc_ich serpent_avx_x86_64 mfd_core snd_hda_intel snd_hda_codec > snd_hda_core snd_hwdep snd_pcm shpchp snd_timer e1000e snd soundcore > i915 ptp pps_core video button drm_kms_helper drm thermal_sys mei_me > Nov 27 01:52:36 heisenberg kernel: i2c_algo_bit mei > serpent_sse2_x86_64 xts serpent_generic blowfish_generic > blowfish_x86_64 blowfish_common cast5_avx_x86_64 cast5_generic > cast_common des_generic cbc cmac xcbc rmd160 sha512_ssse3 > sha512_generic sha256_ssse3 sha256_generic hmac crypto_null af_key > xfrm_algo loop parport_pc ppdev lp parport autofs4 dm_crypt dm_mod > md_mod btrfs xor raid6_pq uhci_hcd usb_storage sd_mod crc32c_intel > aesni_intel aes_x86_64 glue_helper ahci lrw gf128mul ablk_helper > libahci cryptd libata ehci_pci xhci_pci ehci_hcd scsi_mod xhci_hcd > usbcore usb_common > Nov 27 01:52:36 heisenberg kernel: CPU: 7 PID: 18086 Comm: btrfs Not > tainted 4.2.0-1-amd64 #1 Debian 4.2.6-1 > Nov 27 01:52:36 heisenberg kernel: Hardware name: FUJITSU LIFEBOOK > E782/FJNB23E, BIOS Version 1.11 05/24/2012 > Nov 27 01:52:36 heisenberg kernel: a02e6260 > 8154e2f6 > Nov 27 01:52:36 heisenberg kernel: 8106e5b1 880235a3c42c > 7ffd3d3796c0 8802f0e5c000 > Nov 27 01:52:36 heisenberg kernel: 0004 88010543c500 > a02d2d81 88041e5ebb00 > Nov 27 01:52:36 heisenberg kernel: Call Trace: > Nov 27 01:52:36 heisenberg kernel: [] ? > dump_stack+0x40/0x50 > Nov 27 01:52:36 heisenberg kernel: [] ? > warn_slowpath_common+0x81/0xb0 > Nov 27 01:52:36 heisenberg kernel: [] ? > btrfs_ioctl_send+0x661/0x1120 [btrfs] > Nov 27 01:52:36 heisenberg kernel: [] ? > __alloc_pages_nodemask+0x194/0x9e0 > Nov 27 01:52:36 heisenberg kernel: [] ? > btrfs_ioctl+0x26c/0x2a10 [btrfs] > Nov 27 01:52:36 heisenberg kernel: [] ? 
> sched_move_task+0xca/0x1d0 > Nov 27 01:52:36 heisenberg kernel: [] ? > cpumask_next_and+0x2e/0x50 > Nov 27 01:52:36 heisenberg kernel: [] ? > select_task_rq_fair+0x23f/0x5c0 > Nov 27 01:52:36 heisenberg kernel: [] ? > enqueue_task_fair+0x387/0x1120 > Nov 27 01:52:36 heisenberg kernel: [] ? > native_sched_clock+0x24/0x80 > Nov 27 01:52:36 heisenberg kernel: [] ? > sched_clock+0x5/0x10 > Nov 27 01:52:36 heisenberg kernel: [] ? > do_vfs_ioctl+0x2c3/0x4a0 > Nov 27 01:52:36 heisenberg kernel: [] ? > _do_fork+0x146/0x3a0 > Nov 27 01:52:36 heisenberg kernel: [] ? > SyS_ioctl+0x76/0x90 > Nov 27 01:52:36 heisenberg kernel: [] ? > system_call_fast_compare_end+0xc/0x6b > Nov 27 01:52:36 heisenberg kernel: ---[ end trace f5fa91e2672eead0 ]- > -- smime.p7s Description: S/MIME cryptographic signature
Re: attacking btrfs filesystems via UUID collisions? (was: Subvolume UUID, data corruption?)
On Sun, 2015-12-06 at 04:06 +, Duncan wrote: > There's actually a number of USB-based hardware and software vulns > out > there, from the under $10 common-component-capacitor-based charge- > and-zap > (charges off the 5V USB line, zaps the port with several hundred > volts > reverse-polarity, if the machine survives the first pulse and > continues > supplying 5V power, repeat...), to the ones that act like USB-based > input > devices and "type" in whatever commands, to simple USB-boot to a > forensic > distro and let you inspect attached hardware (which is where the > encrypted > storage comes in, they've got everything that's not encrypted), > to the plain old fashioned boot-sector viruses that quickly jump to > everything else on the system that's not boot-sector protected and/or > secure-boot locked, to... Well this is all well known - at least to security folks ;) - but to be quite honest: Not an excuse for allowing even more attack surface, in this case via the filesystem. One will *always* find a weaker element in the security chain, and could always argue with that not to fixe one's own issues. "Well, there's no need to fix that possible collision-data-leakage- issue in btrfs[0]! Why? Well an attacker could still simply abduct the bank manager, torture him for hours until he gives any secret with pleasure" ;-) > Which is why most people in the know say if you have unsupervised > physical > access, you effectively own the machine and everything on it, at > least > that's not encrypted. Sorry, I wouldn't say so. Ultimately you're of course right, which is why my fully-dm-crypted notebook is never left alone when it runs (cold boot or USB firmware attacks)... but in practise things are a bit different I think. Take the ATM example. Or take real world life in big computing centres. Fact is, many people have usually access, from the actual main personell, over electricians to the cleaning personnel. 
Whacking a device or attacking it via USB firmware tricks is of course possible for them, but it's much more likely to be noticed (making noise, taking time and so on),... so there is no need to give another attack surface by this.

> If you haven't been keeping up, you really have some reading to do. If
> you're plugging in untrusted USB devices, seriously, a thumb drive with a
> few duplicated btrfs UUIDs is the least of your worries!
Well as I've said, getting that in via USB may be only one way. We're already so far that GNOME automounts devices when plugged in... who says the next step isn't that this happens remotely in some form, e.g. btrfs-image on dropbox, automounted by nautilus.
Okay, that may be a bit constructed, but it should demonstrate that there could be plenty of ways for that to happen, which we don't even think of (and usually these are the worst in security).

You said it's basically not fixable in btrfs:
It's absolutely clear that I'm no btrfs expert (or even developer), but my poor man's approach, which I think I've written before, doesn't seem so impossible, does it?
1) Don't simply "activate" btrfs devices that are found but rather:
2) Check if there are other devices of the same fs UUID + device ID, or more generally said: check if there are any collisions
3) If there are, and some of them are already active, continue to use them, don't activate the newly appeared ones
4) If there are, and none of them are already active, refuse to activate *any* of them unless the user manually instructs to do so via device= like options.

> BTW, this is documented (in somewhat simpler "do not do XX" form) on the
> wiki, gotchas page.
>
> https://btrfs.wiki.kernel.org/index.php/Gotchas#Block-level_copies_of_devices
I know, but it doesn't really tell all possible consequences, and again, it's unlikely that the end-user (even if possibly heavily affected by it) will stumble over that.

Cheers,
Chris.

[0] Assuming there is actually one, I haven't really verified that and base it solely on what people told me, namely that basically arbitrary corruptions may happen on both devices.
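The four-step proposal in the message above can be written down as a small decision function. This is a minimal sketch of the proposed policy only, not btrfs code and not how the kernel or btrfs-progs currently behave; the Device type, the plan_activation name, the decision strings and the user_selected parameter (standing in for devices named explicitly via device= like options) are all hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical view of a scanned block device."""
    path: str
    fsid: str             # filesystem UUID found on the device
    devid: int            # device ID within that filesystem
    active: bool = False  # already in use by a mounted filesystem


def plan_activation(devices, user_selected=()):
    """Return {path: decision} following steps 1-4 of the proposal.

    Decisions: "keep" (already active), "activate", "ignore" (duplicate of
    an active device), "refuse" (ambiguous duplicate, needs user input).
    """
    # Step 2: group by (fs UUID, device ID) to detect collisions.
    groups = defaultdict(list)
    for dev in devices:
        groups[(dev.fsid, dev.devid)].append(dev)

    decisions = {}
    for members in groups.values():
        if len(members) == 1:
            dev = members[0]
            decisions[dev.path] = "keep" if dev.active else "activate"
            continue

        any_active = any(d.active for d in members)
        # Step 4 only applies if the user picked exactly one candidate.
        chosen = [d for d in members if d.path in user_selected]
        for dev in members:
            if dev.active:
                decisions[dev.path] = "keep"
            elif any_active:
                decisions[dev.path] = "ignore"    # step 3
            elif len(chosen) == 1 and dev is chosen[0]:
                decisions[dev.path] = "activate"  # step 4, explicit choice
            else:
                decisions[dev.path] = "refuse"    # step 4, default
    return decisions


devices = [
    Device("/dev/sda", "fs-a", 1, active=True),
    Device("/dev/sdc", "fs-a", 1),    # clone of an active device
    Device("/dev/loop0", "fs-b", 1),  # two inactive clones
    Device("/dev/loop1", "fs-b", 1),
    Device("/dev/sdd", "fs-c", 1),    # no collision
]
for path, decision in plan_activation(devices).items():
    print(path, decision)
```

The design choice worth noting is that an ambiguous pair with nothing active is refused as a whole, which differs from the "randomly pick one" behaviour complained about for pvscan elsewhere in this thread.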
Re: Missing half of available space (resend)
On Tue, 2015-12-08 at 22:27 -0700, Chris Murphy wrote:
> On Tue, Dec 8, 2015 at 10:02 PM, David Hampton wrote:
> > The 'btrfs fi df' command consistently shows a total size of around
> > 3TB, and says that space is almost completely full.
>
> and
>
> > root@selene:~# btrfs fi df /video
> > Data, RAID6: total=3.15TiB, used=3.11TiB
>
> The "total=3.15TiB" means "there's a total of 3.15TiB allocated for
> data chunks using raid6 profile" and of that 3.11TiB is used.
>
> btrfs fi df doesn't ever show how much is free or available.

I think I get it. The numbers in the 'df' command don't show the total number of chunks that exist, only the subset of those chunks that have been allocated to something.

> You can get an estimate of that by using 'btrfs fi usage' instead.

Seems I need to upgrade my tools. That command was added in 3.18 and I only have the 3.12 tools.

> > root@selene:~# df -h /video
> > Filesystem Size Used Avail Use% Mounted on
> > /dev/vda 15T 3.2T 8.3T 28% /video
>
> That's about right although it seems it's slightly overestimating the
> available free space.

Thanks. Makes me feel a lot better.

David
Re: subvols and parents - how?
On Fri, 2015-11-27 at 01:02 +, Duncan wrote:
[snip snap]
> #1 could be a pain to setup if you weren't actually mounting it
> previously, just relying on the nested tree, AND...
>
> #2 The point I was trying to make, now, to mount it you'll mount not a
> native nested subvol, and not a directly available sibling
> 5/subvols/home, but you'll actually be reaching into an entirely
> different nesting structure to grab something down inside, mounting
> 5/subvols/root/home subvolume nesting down inside the direct
> 5/subvols/root sibling subvol.

Okay, so your main point was basically "keeping things administrable"...

> one of which was that everything that the package manager installs
> should be on the same partition with the installed-package database,
> so if it has to be restored from backup, at least if it's all old, at
> least it's all equally old, and the package database actually matches
> what's on the system because it's in the same backup!

I basically agree, though I'd allow a few exceptions, like database-like data that is sometimes stored in /var/ and that doesn't need to be consistent with anything but itself, e.g. static web pages (/var/www), a PostgreSQL DB, an sks keyserver DB, and so on.

btw: What's the proper way of merging / splitting subvols? E.g. consider I have:

5
|
+--root (subvol)
   |
   +-- var (no subvol)

And say I want to split off var/www into a subvol. One obvious way would be with mv (and AFAIU that would keep my reflinks with clones, if any), but that also means that anything accessing /var/www probably needs downtime.

Is it planned to have a special function that basically says: "make dir foo and anything below (except nested subvols) a subvol named foo, immediately and atomically"? And similarly vice-versa, a special function that says: "make subvol foo and anything below (except nested subvols) a dir of the parent subvol named foo, immediately and atomically"?
That could be handy for real-world administration, especially when one wants to avoid downtime.

btw: A few days ago, either Hugo or you thought that mv'ing a subvol would change its UUID, but my test (which was with coreutils 8.3 -> no reflinked mv) seemed to show it wouldn't. There was no further reply then... so am I right that the UUID wouldn't change?

> The same idea applies here. Once you start reaching into nested subvols
> to get the deeper nested subvols you're trying to mount, it's too much
> and you're just begging to get it wrong under the extreme pressures of
> a disaster recovery.

Well, apparently you overlooked the extremely simple and reliable solution: leaving a tiny little note on your desk saying something like: "dear boss, things are screwed up, I'm on vacation now..." ;-)

Thanks,
Chris.