[check] set metadata extent size of tree block extents
When scanning extents, we didn't set num_bytes when visiting a tree
block extent.  On the corrupted filesystem I was trying to fix, this
caused an extent's size to be guessed as zero, so we'd compute end as
start - 1, which tripped insert_state's BUG_ON(end < start).

Signed-off-by: Alexandre Oliva <ol...@gnu.org>
---
 cmds-check.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 0165fba..e563354 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -5208,9 +5208,10 @@ static int process_extent_item(struct btrfs_root *root,
 	ei = btrfs_item_ptr(eb, slot, struct btrfs_extent_item);
 	refs = btrfs_extent_refs(eb, ei);
-	if (btrfs_extent_flags(eb, ei) & BTRFS_EXTENT_FLAG_TREE_BLOCK)
+	if (btrfs_extent_flags(eb, ei) & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
 		metadata = 1;
-	else
+		num_bytes = root->leafsize;
+	} else
 		metadata = 0;
 
 	add_extent_rec(extent_cache, NULL, 0, key.objectid, num_bytes,

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
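The failure arithmetic described above can be condensed into a small sketch.  The function names below are illustrative, not btrfs's own; the only assumptions are the [start, start + num_bytes - 1] range convention the message describes and that tree block extents take their size from the tree block size (root->leafsize at the time), as the patch does:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Extent ranges are stored as [start, end] with
 * end = start + num_bytes - 1, so a zero-sized extent yields
 * end == start - 1 and trips insert_state's BUG_ON(end < start). */
static int extent_range_valid(u64 start, u64 num_bytes)
{
	u64 end = start + num_bytes - 1;
	return end >= start;	/* fails when num_bytes == 0 */
}

/* The fix in the patch, in miniature: a tree block extent item does not
 * carry a usable size, so num_bytes must come from the tree block size
 * instead of being left at zero (its guessed value). */
static u64 tree_block_num_bytes(u64 guessed_num_bytes, int is_tree_block,
				u64 leafsize)
{
	if (is_tree_block)
		return leafsize;
	return guessed_num_bytes;
}
```

With the fix, the range computed for a tree block extent is always non-empty, so insert_state's invariant holds.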
after crash, btrfs attempts to clean up extent it has already cleaned up
Are there others getting errors like $SUBJECT, described in more detail
at https://bugzilla.kernel.org/show_bug.cgi?id=112561 ?

If my theory is correct, workloads involving lots of snapshots, such as
Ceph OSDs, might run into it quite often.  Although I could recover from
a few such metadata corruptions by hand when btrfs check --repair
couldn't fix them, it's quite cumbersome.

I wonder if a change like this, made conditional on a mount option,
would be considered appropriate.  I considered making it conditional on
-o recovery, but ended up just making it unconditional for my own
temporary use.

As for fixing metadata corruption by hand, I've been thinking it might
be useful to have some tool to help navigate and change metadata,
extract files and whatnot, much like debugfs for ext* filesystems.
Would others find it useful?  Is anyone else already working on such a
thing?

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index cadacf6..849765a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6356,7 +6356,7 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 				"unable to find ref byte nr %llu parent %llu root %llu owner %llu offset %llu",
 				bytenr, parent, root_objectid, owner_objectid,
 				owner_offset);
-			btrfs_abort_transaction(trans, extent_root, ret);
+			ret = 0; /*btrfs_abort_transaction(trans, extent_root, ret);*/
 			goto out;
 		} else {
 			btrfs_abort_transaction(trans, extent_root, ret);
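The "conditional on a mount option" idea floated above could take roughly this shape.  Everything here is hypothetical for illustration: `BTRFS_MOUNT_IGNORE_MISSING_REFS` and the helper are invented names, not existing btrfs interfaces, and the real decision would live in `__btrfs_free_extent` next to the abort:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mount-option bit; real btrfs mount options are bits in
 * fs_info->mount_opt tested via btrfs_test_opt(). */
#define BTRFS_MOUNT_IGNORE_MISSING_REFS (1ULL << 30)

struct fs_opts { uint64_t mount_opt; };

/* When the ref to delete is already gone: with the option set, treat it
 * as already cleaned up and continue; otherwise keep today's behavior
 * and propagate the error (leading to an aborted transaction). */
static int handle_missing_ref(const struct fs_opts *fs, int ret)
{
	if (fs->mount_opt & BTRFS_MOUNT_IGNORE_MISSING_REFS)
		return 0;	/* skip the stale ref, keep going */
	return ret;		/* default: propagate and abort */
}
```

That would keep the default strict while giving recovery-minded admins an explicit opt-in, rather than the unconditional `ret = 0` in the patch.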
non-atomic xattr replacement in btrfs = rsync random errors
A few days ago, I started using rsync batches to archive old copies of
ceph OSD snapshots, for certain kinds of disaster recovery.  This seems
to exercise an unexpected race condition in rsync, which happens to
expose what appears to be a race condition in btrfs, causing random
scary but harmless errors when replaying the rsync batches.

strace has revealed that the two rsync processes running concurrently to
apply the batch both attempt to access xattrs of the same directory
concurrently.  I understand rsync is supposed to avoid this, but
something's going wrong with that.

Here's the smoking gun, snipped from strace -p 27251 -p 27253 -o
smoking.gun, where both processes are started from a single
rsync --read-batch=- -aHAX --del ... run:

0: 27251 stat("osd/0.6ed_head/DIR_D/DIR_E/DIR_6", <unfinished ...>
1: 27253 stat("osd/0.6ed_head/DIR_D/DIR_E/DIR_6", {st_mode=S_IFDIR|0755, st_size=5470, ...}) = 0
2: 27251 <... stat resumed> {st_mode=S_IFDIR|0755, st_size=5470, ...}) = 0
3: 27253 llistxattr("osd/0.6ed_head/DIR_D/DIR_E/DIR_6", "user.cephos.phash.contents\0", 1024) = 27
4: 27251 llistxattr("osd/0.6ed_head/DIR_D/DIR_E/DIR_6", <unfinished ...>
5: 27253 lsetxattr("osd/0.6ed_head/DIR_D/DIR_E/DIR_6", "user.cephos.phash.contents", "\x01F\x00\x00\x00\x00\x00\x00\x00\x0f\x00\x00\x00\x03\x00\x00", 17, 0 <unfinished ...>
6: 27251 <... llistxattr resumed> "user.cephos.phash.contents\0", 1024) = 27
7: 27251 lgetxattr("osd/0.6ed_head/DIR_D/DIR_E/DIR_6", "user.cephos.phash.contents", 0x0, 0) = -1 ENODATA (No data available)
8: 27253 <... lsetxattr resumed> ) = 0
9: 27253 utimensat(AT_FDCWD, "osd/0.6ed_head/DIR_D/DIR_E/DIR_6", {UTIME_NOW, {1407992261, 0}}, AT_SYMLINK_NOFOLLOW) = 0
a: 27251 write(2, "rsync: get_xattr_data: lgetxattr"..., 181) = 181

Lines 0-2, 3-6 and 5-8 show concurrent access of both rsync processes to
the same directory.
This wouldn't be a problem, not even for replaying batches, for the
lsetxattr would put the intended xattr value in there regardless of
whether the scanner saw the xattr value before or after that.

What makes the problem visible is that btrfs appears to have a race in
its handling of xattr replacement, leaving a window between the removal
of the old value and the insertion of the new one, as shown by lines
5-8: line 3 shows the attribute existed before, and line 7 shows it has
disappeared while the lsetxattr that replaces it (lines 5-8) is still in
flight.  If rsync tries hard enough to hit this window, the lgetxattr
concurrent with the lsetxattr eventually hits it, and then rsync reports
an error:

rsync: get_xattr_data: lgetxattr("/media/px/snapshots/cluster/20141102-to-20140816/osd/0.6ed_head/DIR_D/DIR_E/DIR_6","user.cephos.phash.contents",0) failed: No data available (61)

In the end, rsync exits with a nonzero status, even though nothing
really wrong went on and the tree ended up looking just as it was
supposed to.

Now, I'm a bit concerned, because the btrfs race condition, if exercised
on security-related xattrs or ACLs, could cause data to become visible
that shouldn't be, which could turn this into a locally exploitable
security issue.  Sure enough, nobody goes nuts repeatedly changing the
ACLs of a dir or file containing information that should be guarded by
them, so as to increase the likelihood that an attacker succeeds in
accessing the data, but still...  I don't think the temporary removal of
the xattr for subsequent reinsertion should be visible at all.

I'm sorry for reporting a potential security issue like this, but by the
time it occurred to me that it might have security implications, I'd
already mentioned the problem on #btrfs at FreeNode, so the horse was
out of the barn already :-(

I hope this helps,
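A toy model of the window described above, assuming replacement is implemented as delete-then-insert (this is deliberately not btrfs's xattr code; the single-slot store and function names are illustrative only):

```c
#include <assert.h>
#include <string.h>

/* One xattr slot.  A delete-then-insert replace clears `present`
 * between the two steps, so a concurrent reader observes the attribute
 * as missing (ENODATA) -- exactly the window lines 5-8 of the strace
 * expose.  An atomic replace updates the value in place, so readers see
 * either the old or the new value, never its absence. */
struct slot { int present; char value[32]; };

static void nonatomic_replace_step1(struct slot *s)
{
	s->present = 0;		/* old value removed... */
}

static void nonatomic_replace_step2(struct slot *s, const char *v)
{
	strncpy(s->value, v, sizeof(s->value) - 1);
	s->present = 1;		/* ...new value inserted later */
}

static void atomic_replace(struct slot *s, const char *v)
{
	strncpy(s->value, v, sizeof(s->value) - 1);
	s->present = 1;		/* never cleared along the way */
}

static int reader_sees_attr(const struct slot *s)
{
	return s->present;
}
```

The fix, whatever form it takes in btrfs proper, amounts to making the on-disk and in-memory transition look like `atomic_replace` to concurrent lookups.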
Re: non-atomic xattr replacement in btrfs = rsync random errors
[dropping rs...@lists.samba.org, as it rejects posts from
non-subscribers; refer to
https://bugzilla.samba.org/show_bug.cgi?id=10925 instead]

On Nov  6, 2014, Alexandre Oliva <ol...@gnu.org> wrote:

> What makes the problem visible is that btrfs appears to have a race in
> its handling of xattr replacement, leaving a window between the
> removal of the old value and the insertion of the new one

The bugs described above occurred with rsync-3.1.0-5.fc20.x86_64 and
kernel-libre-3.16.7-200.fc20.gnu.x86_64.  The btrfs code in kernel-libre
is unchanged from the corresponding Fedora kernel.  The distro is BLAG
200k/x86_64, under development.
Re: btrfs: add -k option to filesystem df
On Aug 30, 2014, Shriramana Sharma <samj...@gmail.com> wrote:

> But somehow I feel the name of the long option could be made better
> than --kbytes which is not exactly descriptive of what it
> accomplishes.  IIUC so far only bytes are displayed right?

--kbytes displays KiBs, whereas the preexisting code chooses whatever
magnitude is most suitable to presenting the size in a human-friendly
way.

I'd be happy to drop the long option, to follow GNU df's practice:
there's no argument-less long option equivalent to -k there.
fixes for btrfs check --repair
I got a faulty memory module a while ago, and it ran for a while,
corrupting a number of filesystems on that server.  Most of the
corruption is long gone, as the filesystems (ceph osds) were
reconstructed, but I tried really hard to avoid having to rebuild one
4TB filesystem from scratch, since it was still fully operational.  I
failed, but in the process I ran into and fixed two btrfs check --repair
bugs.

I gave up when removing an old snapshot caused the delayed refs
processing to abort because it couldn't find a ref to delete, whereas
btrfs check --repair completed successfully without fixing anything.
Mounting the apparently-clean filesystem would still run into the same
delayed refs error, and trying to map the logical extent back to a file
produced an error.  Since the filesystem was far too big to preserve,
even in metadata only, I didn't, and proceeded to mkfs.btrfs right away.

Here are the patches.

repair: remove recowed entry from the to-recow list

From: Alexandre Oliva <ol...@gnu.org>

If we attempt to repair a filesystem with metadata blocks that need
recowing, we'll get into an infinite loop, repeatedly recowing the first
entry in the list without ever removing it from the list.  Oops.  Fixed.

Signed-off-by: Alexandre Oliva <ol...@gnu.org>
---
 cmds-check.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/cmds-check.c b/cmds-check.c
index 268e588..66c982f 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -6760,6 +6760,7 @@ int cmd_check(int argc, char **argv)
 			eb = list_first_entry(&root->fs_info->recow_ebs,
 					      struct extent_buffer, recow);
+			list_del_init(&eb->recow);
 			ret = recow_extent_buffer(root, eb);
 			if (ret)
 				break;

check: do not dereference tree_refs as data_refs

From: Alexandre Oliva <ol...@gnu.org>

In a filesystem corrupted by a faulty memory module, btrfsck would get
very confused, attempting to access backrefs that weren't data backrefs
as if they were.
Besides invoking undefined behavior, by accessing
potentially-uninitialized data past the end of objects, or with dynamic
types unrelated to the static types held in the corresponding memory, it
used offsets and lengths from such fields that did not correspond to
anything in the filesystem proper.

Moving up the test for full backrefs, and checking that a backref is a
data backref before casting it, avoided the crash I was running into,
but that was not enough to make the filesystem complete a successful
repair.

Signed-off-by: Alexandre Oliva <ol...@gnu.org>
---
 cmds-check.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 66c982f..319dd2b 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -4781,15 +4781,17 @@ static int verify_backrefs(struct btrfs_trans_handle *trans,
 		return 0;
 
 	list_for_each_entry(back, &rec->backrefs, list) {
+		if (back->full_backref || !back->is_data)
+			continue;
+
 		dback = (struct data_backref *)back;
+
 		/*
 		 * We only pay attention to backrefs that we found a real
 		 * backref for.
 		 */
 		if (dback->found_ref == 0)
 			continue;
-		if (back->full_backref)
-			continue;
 
 		/*
 		 * For now we only catch when the bytes don't match, not the
@@ -4905,6 +4907,9 @@ static int verify_backrefs(struct btrfs_trans_handle *trans,
 	 * references and fix up the ones that don't match.
 	 */
 	list_for_each_entry(back, &rec->backrefs, list) {
+		if (back->full_backref || !back->is_data)
+			continue;
+
 		dback = (struct data_backref *)back;
 
 		/*
@@ -4913,8 +4918,6 @@ static int verify_backrefs(struct btrfs_trans_handle *trans,
 		 */
 		if (dback->found_ref == 0)
 			continue;
-		if (back->full_backref)
-			continue;
 
 		if (dback->bytes == best->bytes &&
 		    dback->disk_bytenr == best->bytenr)
@@ -5134,14 +5137,16 @@ static int find_possible_backrefs(struct btrfs_trans_handle *trans,
 	int ret;
 
 	list_for_each_entry(back, &rec->backrefs, list) {
+		/* Don't care about full backrefs (poor unloved backrefs) */
+		if (back->full_backref || !back->is_data)
+			continue;
+
 		dback = (struct data_backref *)back;
 
 		/* We found this one, we don't need to do a lookup */
 		if (dback->found_ref)
 			continue;
-		/* Don't care about full backrefs (poor unloved backrefs) */
-		if (back->full_backref)
-			continue;
+
 		key.objectid = dback->root;
 		key.type = BTRFS_ROOT_ITEM_KEY;
 		key.offset = (u64)-1;
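The first patch's bug, the list that never drains, is worth seeing in miniature.  The sketch below uses a plain singly-linked list instead of the kernel's `list_head` (so names and shapes are illustrative, not btrfs's), but the shape of the bug and of the fix is the same: unlink the entry (what `list_del_init` does in the patch) before, or right after, taking it:

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *next; };

/* Buggy shape: keeps taking the first entry but never unlinks it, so
 * the head never advances.  max_iters is only here so the demo
 * terminates; the real loop spins forever. */
static int drain_without_del(struct node **head, int max_iters)
{
	int processed = 0;
	while (*head && max_iters-- > 0) {
		/* recow_extent_buffer(*head); */
		processed++;	/* *head is still the same entry */
	}
	return processed;
}

/* Fixed shape: unlink the entry before processing it. */
static int drain_with_del(struct node **head)
{
	int processed = 0;
	while (*head) {
		struct node *n = *head;
		*head = n->next;	/* the list_del_init analogue */
		/* recow_extent_buffer(n); */
		processed++;
	}
	return processed;
}
```

In the buggy version, work is "done" indefinitely on one entry; in the fixed one, the loop visits each entry exactly once and terminates when the list is empty.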
btrfs: add -k option to filesystem df
Introduce support for df to print sizes in KiB, in a way that is easy to
extend to other bases.  The man page is also updated, and fixed in that
it made it seem like multiple paths were accepted.

Signed-off-by: Alexandre Oliva <ol...@gnu.org>
---
 Documentation/btrfs-filesystem.txt |  4 +++-
 cmds-filesystem.c                  | 26 +++++++++++++++++++++---
 utils.c                            | 29 ++++++++++++++++++++++++---
 utils.h                            |  1 +
 4 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/Documentation/btrfs-filesystem.txt b/Documentation/btrfs-filesystem.txt
index c9c0b00..70ba4b8 100644
--- a/Documentation/btrfs-filesystem.txt
+++ b/Documentation/btrfs-filesystem.txt
@@ -17,8 +17,10 @@ resizing, defragment.
 SUBCOMMAND
 ----------
-*df* <path> [<path>...]::
+*df* [--kbytes] <path>::
 Show space usage information for a mount point.
++
+If '-k' or '--kbytes' is passed, sizes will be printed in KiB.
 
 *show* [--mounted|--all-devices|<path>|<uuid>|<device>|<label>]::
 Show the btrfs filesystem with some additional info.

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 7e8ca95..737fcf3 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -113,8 +113,9 @@ static const char * const filesystem_cmd_group_usage[] = {
 };
 
 static const char * const cmd_df_usage[] = {
-	"btrfs filesystem df <path>",
+	"btrfs filesystem df [-k] <path>",
 	"Show space usage information for a mount point",
+	"-k|--kbytes  show sizes in KiB",
 	NULL
 };
 
@@ -226,10 +227,29 @@ static int cmd_df(int argc, char **argv)
 	char *path;
 	DIR *dirstream = NULL;
 
-	if (check_argc_exact(argc, 2))
+	while (1) {
+		int long_index;
+		static struct option long_options[] = {
+			{ "kbytes", no_argument, NULL, 'k'},
+			{ NULL, no_argument, NULL, 0 },
+		};
+		int c = getopt_long(argc, argv, "k", long_options,
+				    &long_index);
+		if (c < 0)
+			break;
+		switch (c) {
+		case 'k':
+			pretty_size_force_base(1024);
+			break;
+		default:
+			usage(cmd_df_usage);
+		}
+	}
+
+	if (check_argc_max(argc, optind + 1))
 		usage(cmd_df_usage);
 
-	path = argv[1];
+	path = argv[optind];
 
 	fd = open_file_or_dir(path, &dirstream);
 	if (fd < 0) {

diff --git a/utils.c b/utils.c
index 6c09366..f760d1b 100644
--- a/utils.c
+++ b/utils.c
@@ -1377,19 +1377,43 @@ out:
 }
 
 static char *size_strs[] = { "", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"};
+
+u64 forced_base = 0;
+int pretty_size_force_base(u64 base)
+{
+	u64 check = 1;
+	while (check < base)
+		check *= 1024;
+	if (check != base && base)
+		return -1;
+	forced_base = base;
+	return 0;
+}
+
 int pretty_size_snprintf(u64 size, char *str, size_t str_bytes)
 {
 	int num_divs = 0;
+	u64 last_size = size;
 	float fraction;
 
 	if (str_bytes == 0)
 		return 0;
 
-	if( size < 1024 ){
+	if( forced_base ){
+		u64 base = forced_base;
+		while (base > 1) {
+			base /= 1024;
+			last_size = size;
+			size /= 1024;
+			num_divs++;
+		}
+		if (num_divs < 2)
+			return snprintf(str, str_bytes, "%llu%s",
+					(unsigned long long)size,
+					size_strs[num_divs]);
+		goto check;
+	} else if( size < 1024 ){
 		fraction = size;
 		num_divs = 0;
 	} else {
-		u64 last_size = size;
 		num_divs = 0;
 		while(size >= 1024){
 			last_size = size;
@@ -1397,6 +1421,7 @@ int pretty_size_snprintf(u64 size, char *str, size_t str_bytes)
 			num_divs ++;
 		}
 
+ check:
 		if (num_divs >= ARRAY_SIZE(size_strs)) {
 			str[0] = '\0';
 			return -1;

diff --git a/utils.h b/utils.h
index fd25126..bbcb042 100644
--- a/utils.h
+++ b/utils.h
@@ -71,6 +71,7 @@ int check_mounted_where(int fd, const char *file, char *where, int size,
 int btrfs_device_already_in_root(struct btrfs_root *root, int fd,
 				 int super_offset);
 
+int pretty_size_force_base(u64 base);
 int pretty_size_snprintf(u64 size, char *str, size_t str_bytes);
 #define pretty_size(size) 	\
 	({			\
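The effect of forcing the base can be shown with a self-contained sketch.  This is a simplification of the patch's forced-base path (the integer-KiB case), not the patch's code itself; `format_kbytes` is an illustrative name and assumes plain truncating integer division, as dividing by 1024 once does:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static const char *size_strs[] = { "", "KiB", "MiB", "GiB", "TiB" };

/* With a forced base of 1024, every size is printed as a whole number
 * of KiB, instead of the "most human-friendly" unit the default path
 * picks.  One division by 1024, no fractional part. */
static int format_kbytes(uint64_t size, char *str, size_t len)
{
	return snprintf(str, len, "%llu%s",
			(unsigned long long)(size / 1024), size_strs[1]);
}
```

So a 3 MiB value comes out as "3072KiB" rather than "3.00MiB", which is what makes the output stable enough to feed into scripts.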
[PATCH] [btrfs] add volid to failed csum messages
The failed csum messages generated by btrfs mention the inode number,
but on filesystems with multiple subvolumes, that's not enough to
identify the file.  I've added the subvolume objectid to the messages so
that they're more complete.

I also noticed that the extent/offset information printed for the file
isn't always presented correctly.  Indeed, when we print an offset that
could be fed to inspect-internal logical-resolve, we used the term
"offset", which doesn't make it clear that it's a logical offset,
whereas when we print a physical disk offset, as in compression.c, we
used the term "extent", which incorrectly implied it to be a logical
offset.  I've renamed them to "lofst" and "phofst", which are hopefully
clearer.  Ideally, we'd uniformly print logical offsets in these
messages, but presumably that information isn't readily available in
check_compressed_csum.

I haven't quite tested this beyond building it (I don't have a sure way
to trigger csum errors :-), but AFAICT the objectid I've added is the
same number that one can pass to mount as subvolid, or look up in the
btrfs subvol list table.
Signed-off-by: Alexandre Oliva <ol...@gnu.org>
---
 fs/btrfs/compression.c |  8 +++++---
 fs/btrfs/inode.c       | 12 ++++++++----
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index b01fb6c..9f095b3 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -129,9 +129,11 @@ static int check_compressed_csum(struct inode *inode,
 		if (csum != *cb_sum) {
 			btrfs_info(BTRFS_I(inode)->root->fs_info,
-				   "csum failed ino %llu extent %llu csum %u wanted %u mirror %d",
-				   btrfs_ino(inode), disk_start, csum, *cb_sum,
-				   cb->mirror_num);
+				   "csum failed ino %llu vol %llu phofst %llu csum %u wanted %u mirror %d",
+				   btrfs_ino(inode),
+				   BTRFS_I(inode)->root->root_key.objectid,
+				   disk_start, csum, *cb_sum,
+				   cb->mirror_num);
 			ret = -EIO;
 			goto fail;
 		}

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d3d4448..cc32b84 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2829,8 +2829,10 @@ good:
 
 zeroit:
 	if (__ratelimit(&_rs))
-		btrfs_info(root->fs_info, "csum failed ino %llu off %llu csum %u expected csum %u",
-			   btrfs_ino(page->mapping->host), start, csum, csum_expected);
+		btrfs_info(root->fs_info, "csum failed ino %llu vol %llu lofst %llu csum %u expected csum %u",
+			   btrfs_ino(page->mapping->host),
+			   root->root_key.objectid,
+			   start, csum, csum_expected);
 	memset(kaddr + offset, 1, end - start + 1);
 	flush_dcache_page(page);
 	kunmap_atomic(kaddr);
@@ -6981,8 +6983,10 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 			flush_dcache_page(bvec->bv_page);
 			if (csum != csums[i]) {
-				btrfs_err(root->fs_info, "csum failed ino %llu off %llu csum %u expected csum %u",
-					  btrfs_ino(inode), start, csum,
+				btrfs_err(root->fs_info, "csum failed ino %llu vol %llu lofst %llu csum %u expected csum %u",
+					  btrfs_ino(inode),
+					  root->root_key.objectid,
+					  start, csum,
 					  csums[i]);
 				err = -EIO;
 			}
Re: btrfs raid5
On Oct 22, 2013, Duncan <1i5t5.dun...@cox.net> wrote:

> the quick failure should they try raid56 in its current state simply
> alerts them to the problem they already had.

What quick failure?  There's no such thing in place AFAIK.  It seems to
do all the work properly; the limitations in the current implementation
will only show up when an I/O error kicks in.

I can't see any indication, in existing announcements, that recovery
from I/O errors in raid56 is missing, let alone that it's so utterly and
completely broken that it will freeze the entire filesystem and require
a forced reboot to unmount the filesystem and make any other data in it
accessible again.  That's far, far worse than the general state of
btrfs, and it's not a documented limitation of raid56, so how would
someone be expected to know about it?  It certainly isn't obvious from a
cursory look at the code either.
Re: btrfs raid5
On Oct 22, 2013, Duncan <1i5t5.dun...@cox.net> wrote:

> This is because there's a hole in the recovery process in case of a
> lost device, making it dangerous to use except for the pure test-case.

It's not just that; any I/O error in raid56 chunks will trigger a BUG
and make the filesystem unusable until the next reboot, because the
mirror number is zero.  I wrote this patch last week, just before
leaving on a trip, and I was happy to find out it enabled a
frequently-failing disk to hold a filesystem that turned out to be
surprisingly reliable!

btrfs: some progress in raid56 recovery

From: Alexandre Oliva <ol...@gnu.org>

This patch is WIP, but it has enabled a raid6 filesystem on a bad disk
(frequent read failures at random blocks) to work flawlessly for a
couple of weeks, instead of hanging the entire filesystem upon the first
read error.

One of the problems is that we have the mirror number set to zero on
most raid56 reads.  That's unexpected, for mirror numbers start at one.
I couldn't quite figure out where to fix the mirror number in the bio
construction, but by simply refraining from failing when the mirror
number is zero, I found out we end up retrying the read with the next
mirror, which becomes a read retry that, on my bad disk, often succeeds.
So, that was the first win.

After that, I had to make a few further tweaks so that other BUG_ONs
wouldn't hit, and we'd instead fail the read altogether; i.e., in the
extent_io layer, we still don't repair/rewrite the raid56 blocks, nor do
we attempt to rebuild bad blocks out of the other blocks in the stride.

In a few cases in which the read retry didn't succeed, I'd get an extent
cksum verify failure, which I regarded as ok.  What did surprise me was
that, for some of these failures, but not all, the raid56 recovery code
would kick in and rebuild the bad block, so that we'd get the correct
data back in spite of the cksum failure and the bad block.
I'm still puzzled by that; I can't explain what I'm observing, but
surely the correct data is coming out of somewhere ;-)

Another oddity I noticed is that sometimes the mirror numbers appear to
be totally out of range; I suspect there might be some type mismatch or
out-of-range memory access that causes some other information to be read
as a mirror number from bios or somesuch.  I couldn't track that down
yet.

As it stands, although I know this still doesn't kick in the recovery or
repair code at the right place, the patch is usable on its own, and it
is surely an improvement over the current state of raid56 in btrfs, so
it might be a good idea to put it in.  So far, I've put more than 1TB of
data on that failing disk, with 16 partitions on raid6, and somehow I
got all the data back successfully: every file passed an md5sum check,
in spite of tons of I/O errors in the process.

Signed-off-by: Alexandre Oliva <ol...@gnu.org>
---
 fs/btrfs/extent_io.c | 17 ++++++++++-------
 fs/btrfs/raid56.c    | 18 ++++++++++++++++--
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fe443fe..4a592a3 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2061,11 +2061,11 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
 	int ret;
 
-	BUG_ON(!mirror_num);
-
 	/* we can't repair anything in raid56 yet */
 	if (btrfs_is_parity_mirror(map_tree, logical, length, mirror_num))
-		return 0;
+		return -EIO;
+
+	BUG_ON(!mirror_num);
 
 	bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
 	if (!bio)
@@ -2157,7 +2157,6 @@ static int clean_io_failure(u64 start, struct page *page)
 		return 0;
 
 	failrec = (struct io_failure_record *)(unsigned long) private_failure;
-	BUG_ON(!failrec->this_mirror);
 
 	if (failrec->in_validation) {
 		/* there was no real error, just free the record */
@@ -2167,6 +2166,12 @@ static int clean_io_failure(u64 start, struct page *page)
 		goto out;
 	}
 
+	if (!failrec->this_mirror) {
+		pr_debug("clean_io_failure: failrec->this_mirror not set, assuming %llu not repaired\n",
+			 failrec->start);
+		goto out;
+	}
+
 	spin_lock(&BTRFS_I(inode)->io_tree.lock);
 	state = find_first_extent_bit_state(&BTRFS_I(inode)->io_tree,
 					    failrec->start,
@@ -2338,7 +2343,9 @@ static int bio_readpage_error(struct bio *failed_bio, struct page *page,
 	 * everything for repair_io_failure to do the rest for us.
 	 */
 	if (failrec->in_validation) {
-		BUG_ON(failrec->this_mirror != failed_mirror);
+		if (failrec->this_mirror != failed_mirror)
+			pr_debug("bio_readpage_error: this_mirror equals failed_mirror: %i\n
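The retry semantics the patch leans on, "mirror numbers start at one, and a failed read is retried with the next mirror", can be sketched as below.  This is an illustrative reduction, not btrfs's mirror-selection code; in particular, treating mirror 0 as "no specific mirror known, start from mirror 1" is exactly the behavior the patch falls back to instead of BUG()ing:

```c
#include <assert.h>

/* Pick the mirror to try after a failed read.  failed_mirror == 0
 * means the failing bio carried no meaningful mirror number (the
 * unexpected raid56 case above); otherwise advance to the next copy,
 * wrapping around past num_copies. */
static int next_mirror(int failed_mirror, int num_copies)
{
	if (failed_mirror <= 0)
		return 1;		/* mirrors are 1-based */
	if (failed_mirror + 1 > num_copies)
		return 1;		/* wrap around */
	return failed_mirror + 1;
}
```

On the bad disk described above, that retry, rather than any actual raid56 rebuild, is what turned most read failures into successes.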
Re: Q: Why subvolumes?
On Jul 23, 2013, Jerome Haltom <was...@cogito.cx> wrote:

> Why not just create the new dev_id on the destination snapshot of any
> directory?  That way the snapshot can share inodes with its source.

Agreed.  Nothing stops us from implementing snapshotting of any
directory whatsoever: all it takes is to take a snapshot of the
subvolume enclosing the directory we want to snapshot, remove everything
that's not in the requested directory from the snapshot, and make that
directory the root of the snapshot.

The only tricky bit here AFAICT is to arrange for the non-snapshotted
subtree components to be cleaned up in the background.  If we had some
primitive to unlink an entire subtree and clean it up in the background,
we could use that.
Re: I/O errors block the entire filesystem
On May 14, 2013, Liu Bo <bo.li@oracle.com> wrote:

>> In one of the failures that caused machine load spikes, I tried to
>> collect info on active processes with perf top and SysRq-T, but
>> nothing there seemed to explain the spike.  Thoughts on how to figure
>> out what's causing this?

> Although I've seen your solution patch in this thread, I'm still
> curious about this scenario; could you please share the reproducer
> script or something?

I'm afraid I don't have one.  I just use the filesystem on various
disks, with ceph osds and other non-ceph subvolumes and files, and
occasionally I run into one of these bad blocks and the filesystem gets
into these odd states.

> I guess that you're using '-l 64k -n 64k' for mkfs.btrfs

That is correct, but IIUC this should only affect metadata, and metadata
recovery from the DUP block works.  It's data (single copy) that fails
as described.
Re: I/O errors block the entire filesystem
On May 15, 2013, Josef Bacik <jba...@fusionio.com> wrote:

> So this should only happen in the case that you are on a dm device it
> looks like, is that how you are running?

That was my first thought, but no, I'm using partitions out of the SATA
disks directly.  I even checked for stray dm devices out of fake raid or
somesuch, but the dm modules were not even loaded, and perusing
/sys/block confirmed the “scsi” devices are actual ATA disks.

Further investigation suggested that, when individual 512-byte blocks
are read from a disk (that's the block size reported by the kernel), the
underlying disk driver is supposed to inform the upper layer of what it
could read by updating the bio_vec bits in precisely the observed way.
Re: I/O errors block the entire filesystem
On Apr  4, 2013, Alexandre Oliva <ol...@gnu.org> wrote:

> I've been trying to figure out the btrfs I/O stack to try to
> understand why, sometimes (but not always), after a failure to read a
> (data non-replicated) block from the disk, the file being accessed
> becomes permanently locked, and the filesystem, unmountable.

So, after some further investigation, we could determine that the
problem was that end_bio_extent_readpage would unlock_extent_cached only
part of the page, because it had previously computed whole_page as zero,
because of the nonzero bv_offset.

So I started hunting for some place that would set up the bio with
partial pages, and I failed.  I was already suspecting some race
condition or other form of corruption of the bvec before it got to
end_bio_extent_readpage when I realized that the bv_offset was always a
multiple of 512 bytes, and that it represented the offset into the 4KiB
page of the sector that had failed to read.

So I started hunting for places that modified bv_offset, and I found
blk_update_request in block/blk-core.c, where the very error message
reporting the failed sector was output.

The conclusion is that we cannot assume the bvec is unmodified between
our submitting the bio and our getting an error back.  OTOH, I don't see
that we ever set up bvecs that do not correspond to whole pages.
Indeed, my attempts to catch such situations with a wrapper around
bio_add_page got no hits whatsoever, which suggests we could just do
away with the whole_page computation, and take
bv_offset + bv_len == PAGE_CACHE_SIZE as indicating a whole-page read.

With this patch, after a read error, I get an EIO rather than a process
hang that causes further attempts to access the file to hang, generally
in a non-interruptible way.  Yay!
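The analysis above reduces to a small model.  The struct and helper names below are illustrative stand-ins (not the kernel's `struct bio_vec` machinery); the assumptions are the ones stated in the message: reads are issued as whole 4KiB pages, and after a partial failure blk_update_request advances bv_offset past the completed sectors and shrinks bv_len to match, so their sum still equals the page size:

```c
#include <assert.h>

enum { PAGE_CACHE_SIZE = 4096, SECTOR = 512 };

struct bvec { unsigned bv_offset, bv_len; };

/* What blk_update_request effectively does to the first bvec after
 * done_sectors sectors of a page completed and the next one failed. */
static void advance_after_error(struct bvec *bv, unsigned done_sectors)
{
	bv->bv_offset += done_sectors * SECTOR;
	bv->bv_len -= done_sectors * SECTOR;
}

/* The old whole_page test: wrongly concludes "partial page" for any
 * bvec that blk_update_request has advanced. */
static int old_whole_page(const struct bvec *bv)
{
	return bv->bv_offset == 0 && bv->bv_len == PAGE_CACHE_SIZE;
}

/* The test the message proposes: still recognizes the full-page read
 * even after the bvec was advanced. */
static int new_whole_page(const struct bvec *bv)
{
	return bv->bv_offset + bv->bv_len == PAGE_CACHE_SIZE;
}
```

Under the old test, an advanced bvec makes end_bio_extent_readpage unlock only part of the page's range, leaving the page locked forever; under the new one, the error path still unlocks the whole page and the reader gets its EIO.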
btrfs: do away with non-whole_page extent I/O From: Alexandre Oliva ol...@gnu.org end_bio_extent_readpage computes whole_page based on bv_offset and bv_len, without taking into account that blk_update_request may modify them when some of the blocks to be read into a page produce a read error. This would cause the read to unlock only part of the file range associated with the page, which would in turn leave the entire page locked, which would not only keep the process blocked instead of returning -EIO to it, but also prevent any further access to the file. It turns out that btrfs always issues whole-page reads and writes. The special handling of non-whole_page appears to be a mistake or a left-over from a time when this wasn't the case. Indeed, end_bio_extent_writepage distinguished between whole_page and non-whole_page writes but behaved identically in both cases! I've replaced the whole_page computations with warnings, just to be sure that we're not issuing partial page reads or writes. The warnings should probably just go away some time. 
Signed-off-by: Alexandre Oliva ol...@gnu.org
---
 fs/btrfs/extent_io.c | 85 ++++++++++++++-----------------------------------
 1 file changed, 30 insertions(+), 55 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cdee391..f44b033 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1873,28 +1873,6 @@ static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
 }

 /*
- * helper function to unlock a page if all the extents in the tree
- * for that page are unlocked
- */
-static void check_page_locked(struct extent_io_tree *tree, struct page *page)
-{
-	u64 start = page_offset(page);
-	u64 end = start + PAGE_CACHE_SIZE - 1;
-	if (!test_range_bit(tree, start, end, EXTENT_LOCKED, 0, NULL))
-		unlock_page(page);
-}
-
-/*
- * helper function to end page writeback if all the extents
- * in the tree for that page are done with writeback
- */
-static void check_page_writeback(struct extent_io_tree *tree,
-				 struct page *page)
-{
-	end_page_writeback(page);
-}
-
-/*
  * When IO fails, either with EIO or csum verification fails, we
  * try other mirrors that might have a good copy of the data.  This
  * io_failure_record is used to record state as we go through all the
@@ -2323,19 +2301,24 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
 	struct extent_io_tree *tree;
 	u64 start;
 	u64 end;
-	int whole_page;

 	do {
 		struct page *page = bvec->bv_page;
 		tree = &BTRFS_I(page->mapping->host)->io_tree;

-		start = page_offset(page) + bvec->bv_offset;
-		end = start + bvec->bv_len - 1;
+		/* We always issue full-page reads, but if some block
+		 * in a page fails to read, blk_update_request() will
+		 * advance bv_offset and adjust bv_len to compensate.
+		 * Print a warning for nonzero offsets, and an error
+		 * if they don't add up to a full page.  */
+		if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE)
+			printk("%s page write in btrfs with offset %u and length %u\n",
+			       bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE
+			       ? KERN_ERR "partial" : KERN_INFO "incomplete",
+			       bvec->bv_offset, bvec->bv_len);

-		if (bvec->bv_offset == 0
I/O errors block the entire filesystem
I've been trying to figure out the btrfs I/O stack to try to understand why, sometimes (but not always), after a failure to read a (data non-replicated) block from the disk, the file being accessed becomes permanently locked, and the filesystem, unmountable. Sometimes (but not always) it's possible to kill the process that accessed the file, and sometimes (but not always) the failure causes the machine load to skyrocket by 60+ processes. In one of the failures that caused machine load spikes, I tried to collect info on active processes with perf top and SysRq-T, but nothing there seemed to explain the spike. Thoughts on how to figure out what's causing this?

Another weirdness I noticed is that, after a single read failure, btree_io_failed_hook gets called multiple times, until io_pages gets down to zero. This seems wrong: I think it should only be called once when a single block fails, rather than having that single failure get all pending pages marked as failed, no?

Here are some instrumented dumps I collected from one occurrence of the scenario described in the previous paragraph (it didn't cause a load spike). Only one disk block had a read failure. At the end, I enclose the patch that got those dumps printed, the result of several iterations in which one failure led me to find another function to instrument.

end_request: I/O error, dev sdd, sector 183052083
btrfs: bdev /dev/sdd4 errs: wr 0, rd 3, flush 0, corrupt 0, gen 0
btrfs_end_bio orig -EIO 1 0 pending 0 end a0240820,a020c2d0
end_workqueue_bio err -5 bi_rw 0
ata5: EH complete
end_workqueue_fn err -5 end_io a020c2d0,a0231080
btree_io_failed_hook failed_mirror 1 io_pages 15 readahead 0
end_bio_extent_readpage err -5 failed_hook a020bed0 ret -5
btree_io_failed_hook failed_mirror 1 io_pages 14 readahead 0
end_bio_extent_readpage err -5 failed_hook a020bed0 ret -5
[...repeat both msgs with io_pages decremented one at a time...]
btree_io_failed_hook failed_mirror 1 io_pages 0 readahead 0
end_bio_extent_readpage err -5 failed_hook a020bed0 ret -5

(no further related messages)

Be verbose about the path followed after an I/O error

From: Alexandre Oliva lxol...@fsfla.org
---
 fs/btrfs/disk-io.c   | 22 ++++++++++++++++++--
 fs/btrfs/extent_io.c |  6 ++++++
 fs/btrfs/volumes.c   | 31 +++++++++++++++++++++++--
 3 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6d19a0a..20f9828 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -659,13 +659,18 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
 {
 	struct extent_buffer *eb;
 	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
+	long io_pages;
+	bool readahead;

 	eb = (struct extent_buffer *)page->private;
 	set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
 	eb->read_mirror = failed_mirror;
-	atomic_dec(&eb->io_pages);
-	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
+	io_pages = atomic_dec_return(&eb->io_pages);
+	if ((readahead = test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags)))
 		btree_readahead_hook(root, eb, eb->start, -EIO);
+	printk(KERN_ERR
+	       "btree_io_failed_hook failed_mirror %i io_pages %li readahead %i\n",
+	       failed_mirror, io_pages, readahead);

 	return -EIO;	/* we fixed nothing */
 }

@@ -674,6 +679,12 @@ static void end_workqueue_bio(struct bio *bio, int err)
 	struct end_io_wq *end_io_wq = bio->bi_private;
 	struct btrfs_fs_info *fs_info;

+	if (err) {
+		printk(KERN_ERR
+		       "end_workqueue_bio err %i bi_rw %lx\n",
+		       err, (unsigned long)bio->bi_rw);
+	}
+
 	fs_info = end_io_wq->info;
 	end_io_wq->error = err;
 	end_io_wq->work.func = end_workqueue_fn;
@@ -1647,6 +1658,13 @@ static void end_workqueue_fn(struct btrfs_work *work)
 	fs_info = end_io_wq->info;
 	error = end_io_wq->error;
+
+	if (error) {
+		printk(KERN_ERR
+		       "end_workqueue_fn err %i end_io %p,%p\n",
+		       error, bio->bi_end_io, end_io_wq->end_io);
+	}
+
 	bio->bi_private = end_io_wq->private;
 	bio->bi_end_io = end_io_wq->end_io;
 	kfree(end_io_wq);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cdee391..355b24e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2422,6 +2422,9 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 		if (!uptodate && tree->ops &&
 		    tree->ops->readpage_io_failed_hook) {
 			ret = tree->ops->readpage_io_failed_hook(page, mirror);
+			printk(KERN_ERR
+			       "end_bio_extent_readpage err %i failed_hook %p ret %i\n",
+			       err, tree->ops->readpage_io_failed_hook, ret);
 			if (!ret && !err &&
 			    test_bit(BIO_UPTODATE, &bio->bi_flags))
 				uptodate = 1;
@@ -2437,6 +2440,9 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 			 * remain responsible for that page.
 			 */
 			ret = bio_readpage_error(bio, page, start, end, mirror, NULL);
+			printk(KERN_ERR
+			       "end_bio_extent_readpage err %i readpage_error ret %i\n",
+			       err, ret
Re: corruption of active mmapped files in btrfs snapshots
On Mar 25, 2013, Chris Mason chris.ma...@fusionio.com wrote: This patch changes our compression code to call clear_page_dirty_for_io before we compress, and then redirty the pages if the compression fails. Alexandre, many thanks for tracking this down into a well defined use case. Thanks for the patch, it's run flawlessly since I started gradually rolling it out onto my ceph OSDs on Monday! Ship it! :-)
Re: corruption of active mmapped files in btrfs snapshots
On Mar 22, 2013, Chris Mason clma...@fusionio.com wrote: Quoting Samuel Just (2013-03-22 13:06:41) Incomplete writes for leveldb should just result in lost updates, not corruption. In this case, I think Alexandre is scanning for zeros in the file. Yup, the symptom is zeros at the end of a page, with nonzeros on the subsequent page, which indicates that the writes to the previous page were dropped. What I actually do is to iterate over the entire database, which will error out when the block header is found to be corrupted. I use this program I wrote (also hereby provided under GNU GPLv3+) to check the database for corruption.

#include <assert.h>
#include <string.h>
#include <iostream>
#include <leveldb/db.h>

int main(int argc, char *argv[]) {
  bool paranoid = false;
  bool dump = false;
  bool repair = false;
  bool quiet = false;
  int i = 0;
  int errors = 0;

  if (argc == 1) {
  usage:
    std::cout << "usage: [flags] dbname [flags] ..." << std::endl
	      << "-d --dump	dump database contents" << std::endl
	      << "-r --repair	repair database" << std::endl
	      << "-p --paranoid	enable paranoid mode" << std::endl
	      << "-l --lax	disable paranoid mode (default)" << std::endl
	      << "-q --quiet	enable quiet mode" << std::endl
	      << "-v --verbose	disable quiet mode (default)" << std::endl
	      << "-h --help	show this message and exit" << std::endl
	      << "dbname	check, dump and repair" << std::endl
	      << std::endl
	      << "exit status is the number of errors" << std::endl;
    return errors;
  }

  for (i++; i < argc; i++) {
    if (argv[i][0] == '-') {
      if (strcmp (argv[i], "--dump") == 0
	  || strcmp (argv[i], "-d") == 0)
	dump = true;
      else if (strcmp (argv[i], "--repair") == 0
	       || strcmp (argv[i], "-r") == 0)
	repair = true;
      else if (strcmp (argv[i], "--paranoid") == 0
	       || strcmp (argv[i], "-p") == 0)
	paranoid = true;
      else if (strcmp (argv[i], "--lax") == 0
	       || strcmp (argv[i], "-l") == 0)
	paranoid = false;
      else if (strcmp (argv[i], "--quiet") == 0
	       || strcmp (argv[i], "-q") == 0)
	quiet = true;
      else if (strcmp (argv[i], "--verbose") == 0
	       || strcmp (argv[i], "-v") == 0)
	quiet = false;
      else if (strcmp (argv[i], "--help") == 0
	       || strcmp (argv[i], "-h") == 0)
	goto usage;
      else {
	std::cerr << "unrecognized option: " << argv[i] << std::endl;
	goto usage;
      }
    } else {
      if (!quiet)
	std::cout << argv[i] << std::endl;
      leveldb::DB* db;
      leveldb::Options options;
      options.paranoid_checks = paranoid;
      leveldb::Status status = leveldb::DB::Open(options, argv[i], &db);
      bool bad = false;
      if (!status.ok()) {
	std::cerr << status.ToString() << std::endl;
	bad = true;
      } else {
	leveldb::ReadOptions rdopt;
	rdopt.verify_checksums = paranoid;
	rdopt.fill_cache = false;
	leveldb::Iterator* it = db->NewIterator(rdopt);
	int count = 0;
	try {
	  for (it->SeekToFirst(); it->Valid(); it->Next()) {
	    count++;
	    if (dump)
	      std::cout << it->key().ToString() << ": "
			<< it->value().ToString() << std::endl;
	    else if (!quiet && count % 1000 == 0)
	      std::cout << count << " entries\r" << std::flush;
	  }
	  if (!it->status().ok()) {
	    std::cerr << it->status().ToString() << std::endl;
	    bad = true;
	  }
	} catch (...) {
	  std::cerr << "caught an exception" << std::endl;
	}
	delete it;
	if (!quiet)
	  std::cout << count << " entries" << std::endl;
      }
      delete db;
      if (bad) {
	errors++;
	if (repair) {
	  if (!quiet)
	    std::cout << "repairing..." << std::endl;
	  status = leveldb::RepairDB(argv[i], options);
	  if (!status.ok()) {
	    std::cerr << status.ToString() << std::endl;
	    errors++;
	  }
	} else if (!quiet)
	  std::cout << "use --repair to repair" << std::endl;
      }
    }
  }
  return errors;
}
Re: corruption of active mmapped files in btrfs snapshots
On Mar 22, 2013, David Sterba dste...@suse.cz wrote: I've reproduced this without compression, with autodefrag on. I don't have autodefrag on, unless it's enabled by default on 3.8.3 or on the for-linus tree.
Re: corruption of active mmapped files in btrfs snapshots
On Mar 22, 2013, Chris Mason clma...@fusionio.com wrote: Are you using compression in btrfs or just in leveldb? btrfs lzo compression. I'd like to take snapshots out of the picture for a minute. That's understandable, I guess, but I don't know that anyone has ever got the problem without snapshots. I mean, even when the master copy of the database got corrupted, snapshots of the subvol containing it were being taken every now and again, because that's the way ceph works. Even back when I noticed corruption of Firefox _CACHE_* files, snapshots taken for archival were involved. So, unless the program happens to trigger the problem with the -DNOSNAPS option about as easily as it did without it, I guess we may not have a choice but to keep snapshots in the picture. We need some way to synchronize the leveldb with snapshotting I purposefully refrained from doing that, because AFAICT ceph doesn't do that. Once I failed to trigger the problem with Sync calls, and determined ceph only syncs the leveldb logs before taking its snapshots, I went without syncing and finally succeeded in triggering the bug in snapshots, by simulating very similar snapshotting and mmapping conditions to those generated by ceph. I haven't managed to trigger the corruption of the master subvol yet with the test program, but I already knew its corruption didn't occur as often as that of the snapshots, and since it smells like two slightly different symptoms of the same bug, I decided to leave the test program at that.
Re: corruption of active mmapped files in btrfs snapshots
On Mar 19, 2013, Alexandre Oliva ol...@gnu.org wrote: On Mar 19, 2013, Alexandre Oliva ol...@gnu.org wrote: that is being processed inside the snapshot. This doesn't explain why the master database occasionally gets similarly corrupted, does it? Actually, scratch this bit for now. I don't really have proof that the master database actually gets corrupted while it's in use

Scratch the “scratch this”. The master database actually gets corrupted, and it's with recently-created files, created after earlier known-good snapshots. So, it can't really be orphan processing, can it?

Some more info from the errors and instrumentation:

- no data syncing on the affected files is taking place. it's just memcpy()ing data in 4KiB-sized chunks onto mmap()ed areas, munmap()ing it, growing the file with ftruncate and mapping a subsequent chunk for further output

- the NULs at the end of pages do NOT occur at munmap/mmap boundaries as I suspected at first, but they do coincide with the end of extents that are smaller than the maximum compressed extent size. So, something's making btrfs flush pages to disk before the pages are completely written (which is fine in principle), but apparently failing to pick up subsequent changes to the pages (eek!)
Re: corruption of active mmapped files in btrfs snapshots
);) ;
      totalsize += size;
    }
    printf("\r%i blocks, %llu total size\n", blocks, totalsize);
#if NOBGCMP
    if (system("cmp snaptest./??")) {
      printf ("\ncmp error: %s\n", strerror (errno));
      break;
    }
#endif
  }
Re: corruption of active mmapped files in btrfs snapshots
On Mar 19, 2013, Chris Mason clma...@fusionio.com wrote: My guess is the truncate is creating an orphan item Would it, even though the truncate is used to grow rather than to shrink the file? that is being processed inside the snapshot. This doesn't explain why the master database occasionally gets similarly corrupted, does it? Is it possible to create a smaller leveldb unit test that we might use to exercise all of this? I suppose we can even do away with leveldb altogether, using only a PosixMmapFile object, as created by PosixEnv::NewWritableFile (all of this is defined in leveldb's util/env_posix.cc), to exercise the creation and growth of multiple files, one at a time, taking btrfs snapshots at random in between the writes. This ought to suffice. One thing I'm yet to check is whether ceph uses the sync leveldb WriteOption, to determine whether or not to call the file object's Sync member function in the test; this would bring fdatasync and msync calls into the picture, that would otherwise be left entirely out of the test.
Re: corruption of active mmapped files in btrfs snapshots
On Mar 19, 2013, Sage Weil s...@inktank.com wrote: There is a set of unit tests in the leveldb source tree that ought to do the trick: git clone https://code.google.com/p/leveldb/ But these don't create btrfs snapshots.
Re: corruption of active mmapped files in btrfs snapshots
On Mar 19, 2013, Alexandre Oliva ol...@gnu.org wrote: that is being processed inside the snapshot. This doesn't explain why the master database occasionally gets similarly corrupted, does it? Actually, scratch this bit for now. I don't really have proof that the master database actually gets corrupted while it's in use, rather than having inherited corruption on a server restart, that rolls back to the most recent snapshot and replays the osd journal on it. It could be that the used snapshot is corrupted in a way that doesn't manifest itself immediately, or that it gets corrupted afterwards with your delayed-orphan theory. I wrote a test that exercises leveldb's PosixMmapFile with highly compressible appends of varying sizes, as well as syncs and btrfs snapshots at random, but I haven't been able to trigger the problem with it (yet?). I'm now instrumenting the failing code to try to collect more data. It looks like, even though ceph does use leveldb's sync option in some situations, the syncs don't seem to get all to the data files, only to the leveldb logs.
corruption of active mmapped files in btrfs snapshots
For quite a while, I've experienced oddities with snapshotted Firefox _CACHE_00?_ files, whose checksums (and contents) would change after the btrfs snapshot was taken, and would even change depending on how the file was brought to memory (e.g., rsyncing it to backup storage vs checking its md5sum before or after the rsync). This only affected these cache files, so I didn't give it too much attention. A similar problem seems to affect the leveldb databases maintained by ceph within the periodic snapshots it takes of its object storage volumes. I'm told others using ceph on filesystems other than btrfs are not observing this problem, which makes me think it's not memory corruption within ceph itself.

I've looked into this for a bit, and I'm now inclined to believe it has to do with some bad interaction of mmap and snapshots; I'm not sure the fact that the filesystem has compression enabled has any effect, but that's certainly a possibility.

leveldb does not modify file contents once they're initialized, it only appends to files, ftruncate()ing them to about a MB early on, mmap()ping that in and memcpy()ing blocks of various sizes to the end of the output buffer, occasionally msync()ing the maps, or running fdatasync if it didn't msync a map before munmap()ping it. If it runs out of space in a map, it munmap()s the previously mapped range, truncates the file to a larger size, then maps in the new tail of the file, starting at the page it should append to next.

What I'm observing is that some btrfs snapshots taken by ceph osds, containing the leveldb database, are corrupted, causing crashes during the use of the database. I've scripted regular checks of osd snapshots, saving the last-known-good database along with the first one that displays the corruption.
Studying about two dozen failures over the weekend, that took place on all of 13 btrfs-based osds on 3 servers running btrfs as in 3.8.3(-gnu), I noticed that all of the corrupted databases had a similar pattern: a stream of NULs of varying sizes at the end of a page, starting at a block boundary (leveldb doesn't do page-sized blocking, so blocks can start anywhere in a page), and ending close to the beginning of the next page, although not exactly at the page boundary; 20 bytes past the page boundary seemed to be the most common size, but the occasional presence of NULs in the database contents makes it harder to tell for sure.

The stream of NULs ended in the middle of a database block (meaning it was not the beginning of a subsequent database block written later; the beginning of the database block was partially replaced with NULs). Furthermore, the checksum fails to match on this one partially-NULed block. Since the checksum is computed just before the block and the checksum trailer are memcpy()ed to the mmap()ed area, it is a certainty that the block was copied entirely to the right place at some point, and if part of it became zeros, it's either because the modification was partially lost, or because the mmapped buffer was partially overwritten.

The fact that all instances of corruption I looked at were correct right to the end of one block boundary, and then all zeros instead of the beginning of the subsequent block to the end of that page, makes a failure to write that modified page seem more likely in my mind (more so given the Firefox _CACHE_ file oddities in snapshots); intense memory pressure at the time of the corruption also seems to favor this possibility. Now, it could be that btrfs requires those who modify SHARED mmap()ed files to take some precaution, along the lines of msync MS_ASYNC, so as to make sure that data makes it to a subsequent snapshot, and leveldb does not take this sort of precaution.
However, I noticed that the unexpected stream of zeros after a prior block and before the rest of the subsequent block *remains* in subsequent snapshots, which to me indicates the page update is effectively lost. This explains why even the running osd, which operates on the “current” subvolumes from which snapshots for recovery are taken, occasionally crashes because of database corruption, and will later fail to restart from an earlier snapshot due to that same corruption.

Does this problem sound familiar to anyone else? Should mmaped-file writers in general do more than umount or msync to ensure changes make it to subsequent snapshots that are supposed to be consistent? Any tips on where to start looking so as to fix the problem, or even to confirm that the problem is indeed in btrfs?

TIA,
Re: corruption of active mmapped files in btrfs snapshots
While I wrote the previous email, a smoking gun formed in one of my servers: a snapshot that had passed a database consistency check turned out to be corrupted when I tried to rollback to it! Since the snapshot was not modified in any way between the initial scripted check and the later manual check, the problem must be in btrfs. On Mar 18, 2013, Alexandre Oliva ol...@gnu.org wrote: I've scripted regular checks of osd snapshots, saving the last-known-good database along with the first one that displays the corruption. Studying about two dozen failures over the weekend, that took place on all of 13 btrfs-based osds on 3 servers running btrfs as in 3.8.3(-gnu), I noticed that all of the corrupted databases had a similar pattern: a stream of NULs of varying sizes at the end of a page, starting at a block boundary (leveldb doesn't do page-sized blocking, so blocks can start anywhere in a page), and ending close to the beginning of the next page, although not exactly at the page boundary; 20 bytes past the page boundary seemed to be the most common size, but the occasional presence of NULs in the database contents makes it harder to tell for sure. Additional corrupted snapshots collected today have confirmed this pattern, except that today I got several corrupted files with non-NULs right at the beginning of the page following the one that marked the beginning of the corrupted database block.
Re: corruption of active mmapped files in btrfs snapshots
On Mar 18, 2013, Chris Mason chris.ma...@fusionio.com wrote: A few questions. Does leveldb use O_DIRECT and mmap together? No, it doesn't use O_DIRECT at all. Its I/O interface is very simplified: it just opens each new file (database chunks limited to 2MB) with O_CREAT|O_RDWR|O_TRUNC, and then uses ftruncate, mmap, msync, munmap and fdatasync. It doesn't seem to modify data once it's written; it only appends. Reading data back from it uses a completely different class interface, using separate descriptors and using pread only. (the source of a write being pages that are mmap'd from somewhere else) AFAICT the source of the memcpy()s that append to the file are malloc()ed memory. That's the most likely place for this kind of problem. Also, you mention crc errors. Are those reported by btrfs or are they application level crcs. These are CRCs leveldb computes and writes out after each db block. No btrfs CRC errors are reported in this process.
Re: collapse concurrent forced allocations
On Feb 23, 2013, Alexandre Oliva ol...@gnu.org wrote: On Feb 22, 2013, Josef Bacik jba...@fusionio.com wrote: So I understand what you are getting at, but I think you are doing it wrong. If we're calling with CHUNK_ALLOC_FORCE, but somebody has already started to allocate with CHUNK_ALLOC_NO_FORCE, we'll reset the space_info->force_alloc to our original caller's CHUNK_ALLOC_FORCE. But that's ok, do_chunk_alloc will set space_info->force_alloc to CHUNK_ALLOC_NO_FORCE at the end, when it succeeds allocating, and then anyone else waiting on the mutex to try to allocate will load the NO_FORCE from space_info. So we only really care about making sure a chunk is actually allocated, instead of doing this flag shuffling we should just do

	if (space_info->chunk_alloc) {
		spin_unlock(&space_info->lock);
		wait_event(!space_info->chunk_alloc);
		return 0;

I looked a bit further into it. I think this would work if we had a wait_queue for space_info->chunk_alloc. We don't, so the mutex interface is probably the best we can do.

OTOH, I found out we seem to get into an allocate spree when a large file is being quickly created, such as when creating a ceph journal or making a copy of a multi-GB file. I suppose btrfs is just trying to allocate contiguous space for the file, but unfortunately there doesn't seem to be a fallback for allocation failure: as soon as data allocation fails and space_info is set as full, the large write fails and the filesystem becomes full, without even trying to use non-contiguous storage. Isn't that a bug?

I've also been trying to track down why, on a single-data filesystem, (compressed?) data reads that fail because of bad blocks also spike the CPU load and lock the file that failed to map in and the entire filesystem, so that the only way to recover is to force a reboot. Does this sound familiar to anyone?
Re: collapse concurrent forced allocations
On Feb 22, 2013, Josef Bacik jba...@fusionio.com wrote: So I understand what you are getting at, but I think you are doing it wrong. If we're calling with CHUNK_ALLOC_FORCE, but somebody has already started to allocate with CHUNK_ALLOC_NO_FORCE, we'll reset the space_info->force_alloc to our original caller's CHUNK_ALLOC_FORCE. But that's ok, do_chunk_alloc will set space_info->force_alloc to CHUNK_ALLOC_NO_FORCE at the end, when it succeeds allocating, and then anyone else waiting on the mutex to try to allocate will load the NO_FORCE from space_info. So we only really care about making sure a chunk is actually allocated, instead of doing this flag shuffling we should just do

	if (space_info->chunk_alloc) {
		spin_unlock(&space_info->lock);
		wait_event(!space_info->chunk_alloc);
		return 0;
	}

Sorry, I don't follow.
collapse concurrent forced allocations (was: Re: clear chunk_alloc flag on retryable failure)
On Feb 21, 2013, Alexandre Oliva ol...@gnu.org wrote: What I saw in that function also happens to explain why in some cases I see filesystems allocate a huge number of chunks that remain unused (leading to the scenario above, of not having more chunks to allocate). It happens for data and metadata, but not necessarily both.

I'm guessing some thread sets the force_alloc flag on the corresponding space_info, and then several threads trying to get disk space end up attempting to allocate a new chunk concurrently. All of them will see the force_alloc flag and bump their local copy of force up to the level they see first, and they won't clear it even if another thread succeeds in allocating a chunk, thus clearing the force flag. Then each thread that observed the force flag will, on its turn, force the allocation of a new chunk. And any threads that come in while it does that will see the force flag still set and pick it up, and so on.

This sounds like a problem to me, but... what should the correct behavior be? Clear force_alloc once we copy it to a local force? Reset force to the incoming value on every loop? Set the flag to our incoming force at first, clear our local copy, and take it back from the space_info once we determine that we are the thread that's going to perform the allocation? I think a slight variant of the last option makes the most sense, so I implemented it in the patch below.

From: Alexandre Oliva ol...@gnu.org

btrfs: consume force_alloc in the first thread to chunk_alloc

Even if multiple threads in do_chunk_alloc look at force_alloc and see a force flag, it suffices that one of them consumes the flag. Arrange for an incoming force argument to make it to force_alloc in the case of concurrent calls, so that it is consumed only by the first thread to get to allocate after the initial request.
Signed-off-by: Alexandre Oliva ol...@gnu.org
---
 fs/btrfs/extent-tree.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6ee89d5..66283f7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3574,8 +3574,12 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
 again:
 	spin_lock(&space_info->lock);
+
+	/* Bring force_alloc to force and tentatively consume it. */
 	if (force < space_info->force_alloc)
 		force = space_info->force_alloc;
+	space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
+
 	if (space_info->full) {
 		spin_unlock(&space_info->lock);
 		return 0;
@@ -3586,6 +3590,10 @@ again:
 		return 0;
 	} else if (space_info->chunk_alloc) {
 		wait_for_alloc = 1;
+		/* Reset force_alloc so that it's consumed by the
+		 * first thread that completes the allocation. */
+		space_info->force_alloc = force;
+		force = CHUNK_ALLOC_NO_FORCE;
 	} else {
 		space_info->chunk_alloc = 1;
 	}
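[Editorial note: the flag hand-off in the patch above is easier to follow as a sequential userspace reduction. The sketch below is illustrative only; the struct and function names are made up and only the flag shuffling from the patch is kept, with locking and the actual allocation omitted.]

```c
#include <stdbool.h>

enum chunk_alloc_force {
	CHUNK_ALLOC_NO_FORCE = 0,
	CHUNK_ALLOC_LIMITED  = 1,
	CHUNK_ALLOC_FORCE    = 2,
};

/* Hypothetical stand-in for the relevant space_info fields. */
struct space_info_flags {
	int  force_alloc;   /* pending force level, if any */
	bool chunk_alloc;   /* someone is already allocating */
};

/* Entry step of do_chunk_alloc, reduced to the flag shuffling: bring
 * force up to force_alloc, tentatively consume it, and if another
 * thread is already allocating, park our force back in the space_info
 * so the eventual allocator consumes it instead of us. */
int chunk_alloc_enter(struct space_info_flags *s, int force, bool *wait)
{
	if (force < s->force_alloc)
		force = s->force_alloc;
	s->force_alloc = CHUNK_ALLOC_NO_FORCE;  /* tentatively consumed */

	if (s->chunk_alloc) {
		*wait = true;
		s->force_alloc = force;         /* hand off to the allocator */
		force = CHUNK_ALLOC_NO_FORCE;
	} else {
		*wait = false;
		s->chunk_alloc = true;          /* we are the allocator */
	}
	return force;                           /* force level we act with */
}
```

Running two "threads" in sequence shows the point: the first caller picks up a pending CHUNK_ALLOC_FORCE and becomes the allocator; a second caller arriving mid-allocation parks its force level back in the space_info rather than forcing a second chunk itself.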
ceph-on-btrfs inline-cow regression fix for 3.4.3
Hi, Greg,

There's a btrfs regression in 3.4 that's causing a lot of grief to ceph-on-btrfs users like myself. This small and nice patch cures it. It's in Linus' master already. I've been running it on top of 3.4.2, and it would be very convenient for me if this could be in 3.4.3.

Although the patch mentions ENOSPC, the fix has nothing to do with disk-full conditions; it's more along the lines of not finding enough room for inline data contents and/or failing to split the btree nodes to make room for it. I don't know that anyone knows for sure, but without this patch what we get is a horrible error that can only be fixed with a reboot. Yeah, not even umount/mount will make the filesystem writable again. The fix makes us return an error condition in this case, one that callers are prepared to deal with.

I know btrfs hasn't had maintenance fixes in stable series, but Chris Mason tells me the only reason is that nobody stepped up to do so. Given my interest, I might as well give it a try ;-)

Thanks,

From 2adcac1a7331d93a17285804819caa96070b231f Mon Sep 17 00:00:00 2001
From: Josef Bacik jo...@redhat.com
Date: Wed, 23 May 2012 16:10:14 -0400
Subject: [PATCH] Btrfs: fall back to non-inline if we don't have enough space

If cow_file_range_inline fails with ENOSPC we abort the transaction, which isn't very nice. This really shouldn't be happening anyways, but there's no sense in making it a horrible error when we can easily just go allocate normal data space for this stuff.
Thanks,

Signed-off-by: Josef Bacik jo...@redhat.com
---
 fs/btrfs/inode.c | 5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0298928..92df0a5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -257,10 +257,13 @@ static noinline int cow_file_range_inline(struct btrfs_trans_handle *trans,
 	ret = insert_inline_extent(trans, root, inode, start,
 				   inline_len, compressed_size,
 				   compress_type, compressed_pages);
-	if (ret) {
+	if (ret && ret != -ENOSPC) {
 		btrfs_abort_transaction(trans, root, ret);
 		return ret;
+	} else if (ret == -ENOSPC) {
+		return 1;
 	}
+
 	btrfs_delalloc_release_metadata(inode, end + 1 - start);
 	btrfs_drop_extent_cache(inode, start, aligned_end - 1, 0);
 	return 0;
-- 
1.7.7.6
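[Editorial note: the error-mapping pattern in the patch above can be shown in isolation. In this sketch the _sim names are made up; insert_inline_sim merely stands in for insert_inline_extent so the control flow can run in userspace.]

```c
#include <errno.h>

/* Stand-in for insert_inline_extent: just echoes the outcome we want
 * to simulate (0 on success, a negative errno on failure). */
static int insert_inline_sim(int outcome)
{
	return outcome;
}

/* The pattern from the patch: a hard failure propagates (and in the
 * kernel would abort the transaction); -ENOSPC is mapped to the
 * positive "fall back to a regular, non-inline extent" result that the
 * caller already understands. */
int cow_file_range_inline_sim(int outcome)
{
	int ret = insert_inline_sim(outcome);

	if (ret && ret != -ENOSPC) {
		/* btrfs_abort_transaction(...) would run here */
		return ret;     /* hard error: propagate */
	} else if (ret == -ENOSPC) {
		return 1;       /* tell the caller to use a normal extent */
	}
	return 0;               /* inlined successfully */
}
```

The key point is that only genuinely unexpected errors take the abort path; the expected "doesn't fit inline" case becomes an ordinary fallback signal.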
avoid redundant block group free-space checks
It was pointed out to me that the test for enough free space in a block group was wrong in that it would skip a block group that had most of its free space reserved by a cluster. I offer two mutually exclusive, (so far) very lightly tested patches to address this problem. One moves the test to the middle of the clustered allocation logic, between the release of the cluster and the attempt to create a new cluster, with some ugliness due to more indentation, locking operations and testing. The other, that I like better but haven't given any significant amount of testing yet, only performs the test when we fall back to unclustered allocation, relying on btrfs_find_space_cluster to test for enough free space early (it does); it also arranges for the cluster in the current block group to be released before we try unclustered allocation. From f1d4d6212a4cfb2fde6a15780d9b337319d3d1e1 Mon Sep 17 00:00:00 2001 From: Alexandre Oliva lxol...@fsfla.org Date: Mon, 12 Dec 2011 04:33:33 -0200 Subject: [PATCH] Btrfs: delay block group's free space test within allocator If a block group has a cluster, we don't want to test its free space when the cluster has taken an unknown amount of free space. Delay the free space test after failing to allocate from the cluster and releasing it. 
Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br --- fs/btrfs/extent-tree.c | 37 - 1 files changed, 20 insertions(+), 17 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 05e1386..1de4c47 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5277,15 +5277,6 @@ alloc: if (unlikely(block_group-ro)) goto loop; - spin_lock(block_group-free_space_ctl-tree_lock); - if (cached - block_group-free_space_ctl-free_space - num_bytes + empty_cluster + empty_size) { - spin_unlock(block_group-free_space_ctl-tree_lock); - goto loop; - } - spin_unlock(block_group-free_space_ctl-tree_lock); - /* * Ok we want to try and use the cluster allocator, so * lets look there @@ -5323,6 +5314,7 @@ alloc: } refill_cluster: BUG_ON(used_block_group != block_group); + /* If we are on LOOP_NO_EMPTY_SIZE, we can't * set up a new clusters, so lets just skip it * and let the allocator find whatever block @@ -5332,17 +5324,29 @@ refill_cluster: * anything, so we are likely way too * fragmented for the clustering stuff to find * anything. 
*/ - if (loop = LOOP_NO_EMPTY_SIZE) { + if (loop = LOOP_NO_EMPTY_SIZE) spin_unlock(last_ptr-refill_lock); -goto unclustered_alloc; + else { +/* + * this cluster didn't work out, free + * it and start over + */ +btrfs_return_cluster_to_free_space(NULL, last_ptr); } + } - /* - * this cluster didn't work out, free it and - * start over - */ - btrfs_return_cluster_to_free_space(NULL, last_ptr); + spin_lock(block_group-free_space_ctl-tree_lock); + if (cached + block_group-free_space_ctl-free_space + num_bytes + empty_cluster + empty_size) { + spin_unlock(block_group-free_space_ctl-tree_lock); + if (last_ptr loop LOOP_NO_EMPTY_SIZE) +spin_unlock(last_ptr-refill_lock); + goto loop; + } + spin_unlock(block_group-free_space_ctl-tree_lock); + if (last_ptr loop LOOP_NO_EMPTY_SIZE) { /* allocate a cluster in this block group */ ret = btrfs_find_space_cluster(trans, root, block_group, last_ptr, @@ -5382,7 +5386,6 @@ refill_cluster: goto loop; } -unclustered_alloc: offset = btrfs_find_space_for_alloc(block_group, search_start, num_bytes, empty_size); /* -- 1.7.4.4 From 72c9239effd15c7c921c5265e860a14084e1f13e Mon Sep 17 00:00:00 2001 From: Alexandre Oliva lxol...@fsfla.org Date: Mon, 12 Dec 2011 04:48:19 -0200 Subject: [PATCH 1/9] Btrfs: test free space only for unclustered allocation Since the clustered allocation may be taking extents from a different block group, there's no point in spin-locking and testing the current block group free space before attempting to allocate space from a cluster, even more so when we might refrain from even trying the cluster in the current block group because, after the cluster was set up, not enough free space remained. Furthermore, cluster creation attempts fail fast when the block group doesn't have enough free space, so the test was completely superfluous. 
I've move the free space test past the cluster allocation attempt, where it is more useful, and arranged for a cluster in the current block group to be released before trying an unclustered allocation, when we reach the LOOP_NO_EMPTY_SIZE stage, so that the free space in the cluster stands a chance of being combined with additional free space in the block group so as to succeed in the allocation attempt. Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br --- fs/btrfs/extent-tree.c | 34 +++--- 1 files changed, 23 insertions
Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list
On Nov 29, 2011, Christian Brunner c...@muc.de wrote: When I'm doing heavy reading in our ceph cluster, the load and wait-io on the patched servers is higher than on the unpatched ones.

That's unexpected.

This seems to be coming from btrfs-endio-1, a kernel thread that had not caught my attention on unpatched systems yet.

I suppose I could wave my hands while explaining that you're getting higher data throughput, so it's natural that it would take up more resources, but that explanation doesn't satisfy me. I suppose allocation might have got slightly more CPU-intensive in some cases, as we now use bitmaps where before we'd only use the cheaper-to-allocate extents. But that's unsatisfying as well. Do you have any idea what's going on here?

Sorry, not really.

(Please note that the filesystem is still unmodified - metadata overhead is large).

Speaking of metadata overhead, I found out that the bitmap-enabling patch is not enough for a metadata balance to get rid of excess metadata block groups. I had to apply patch #16 to get that again. It sort of makes sense: without patch 16, too often will we get to the end of the list of metadata block groups and advance from LOOP_FIND_IDEAL to LOOP_CACHING_WAIT (skipping NOWAIT after we've cached free space for all block groups), and if we get to the end of that loop as well (how? I couldn't quite figure it out, but it only seems to happen under high contention) we'll advance to LOOP_ALLOC_CHUNK and end up unnecessarily allocating a new chunk. Patch 16 makes sure we don't jump ahead during LOOP_CACHING_WAIT, so we won't get new chunks unless they can really help us keep the system going.

-- Alexandre Oliva, freedom fighter http://FSFLA.org/~lxoliva/ You must be the change you wish to see in the world. -- Gandhi Be Free!
[PATCH] Btrfs: initialize new bitmaps' list
We're failing to create clusters with bitmaps because setup_cluster_no_bitmap checks that the list is empty before inserting the bitmap entry in the list for setup_cluster_bitmap, but the list field is only initialized when the entry is restored from the on-disk free space cache, or when it is written out to disk.

Besides a potential race condition due to the multiple uses of the list field, filesystem performance severely degrades over time: as we use up all non-bitmap free extents, the try-to-set-up-cluster dance is done at every metadata block allocation. For every block group, we fail to set up a cluster, and after failing on them all up to twice, we fall back to the much slower unclustered allocation.

To make matters worse, before the unclustered allocation, we try to create new block groups until we reach the 1% threshold, which introduces additional bitmaps and thus block groups that we'll iterate over at each metadata block request.

---
 fs/btrfs/free-space-cache.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 33fa4bb..4642c42 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1470,6 +1470,7 @@ static void add_new_bitmap(struct btrfs_free_space_ctl *ctl,
 {
 	info->offset = offset_to_bitmap(ctl, offset);
 	info->bytes = 0;
+	INIT_LIST_HEAD(&info->list);
 	link_free_space(ctl, info);
 	ctl->total_bitmaps++;
-- 
1.7.4.4
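[Editorial note: to see why the missing INIT_LIST_HEAD matters, here is a minimal userspace copy of the kernel's circular list primitives. This is an illustrative reimplementation, not the kernel's list.h: an entry whose list field was never initialized holds garbage, so an emptiness check on it, like the one setup_cluster_no_bitmap performs, is meaningless until the field is initialized.]

```c
#include <stdbool.h>

/* Minimal circular doubly-linked list, kernel-style: an empty list is
 * a node whose next pointer points back at itself. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* True only for an initialized, unlinked node (or empty list head);
 * calling this on an uninitialized node reads garbage pointers. */
static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Insert n right after head. */
static void list_add(struct list_head *n, struct list_head *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}
```

After INIT_LIST_HEAD, list_empty reliably answers "is this entry on a list?", which is exactly the invariant the one-line fix restores for freshly created bitmap entries.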
[PATCH 03/20] Btrfs: fix comment typo
---
 fs/btrfs/extent-tree.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5d86877..bc0f13d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5304,7 +5304,7 @@ alloc:
 		/*
 		 * whoops, this cluster doesn't actually point to
 		 * this block group.  Get a ref on the block
-		 * group is does point to and try again
+		 * group it does point to and try again
 		 */
 		if (!last_ptr_loop && last_ptr->block_group &&
 		    last_ptr->block_group != block_group &&
-- 
1.7.4.4
[PATCH 00/20] Here's my current btrfs patchset
The first 11 patches are relatively simple fixes or improvements that I suppose could make it even into 3.2 (02 is particularly essential, to avoid progressive performance degradation and metadata space waste in the default clustered allocation strategy).

Patch 12 and its complement 15, and also 19, are debugging aids that helped me track down the problem fixed in 02.

Patch 13 is a revised version of the larger-clusters patch I posted before, which adds to the earlier version a micro-optimization of the bitmap computations.

Patches 14 to 20 are probably not suitable for inclusion, and are provided only for reference, although I'm still undecided on 16: it seems to me to make sense to stick to the ordered list and index instead of jumping to the current cluster's block group, but it may also make sense performance-wise to start at the current cluster and advance from there. We still do that, as long as we find a cluster to begin with, but I'm yet to double-check on the race that causes multiple subsequent releases/creations of clusters under heavy load. I'm sure I saw it, and I no longer do, but now I'm no longer sure whether this is the patch that fixed it, or about the details of how we came about that scenario.

Patches 14, 17, 18 and 20 were posted before, and I'm probably dropping them from future patchsets unless I find them to be still useful.
Alexandre Oliva (20):
  Btrfs: enable removal of second disk with raid1 metadata
  Btrfs: initialize new bitmaps' list
  Btrfs: fix comment typo
  Btrfs: reset cluster's max_size when creating bitmap cluster
  Btrfs: start search for new cluster at the beginning of the block group
  Btrfs: skip block groups without enough space for a cluster
  Btrfs: don't set up allocation result twice
  Btrfs: try to allocate from cluster even at LOOP_NO_EMPTY_SIZE
  Btrfs: skip allocation attempt from empty cluster
  Btrfs: report reason for failed relocation
  Btrfs: note when a bitmap is skipped because its list is in use
  Btrfs: introduce verbose debug mode for patched clustered allocation recovery
  Btrfs: revamp clustered allocation logic
  Btrfs: introduce option to rebalance only metadata
  Btrfs: activate allocation debugging
  Btrfs: try cluster but don't advance in search list
  Btrfs: introduce -o cluster and -o nocluster
  Btrfs: add -o mincluster option
  Btrfs: log when a bitmap is rejected for a cluster
  Btrfs: don't waste metadata block groups for clustered allocation

 fs/btrfs/ctree.h            |   3 +-
 fs/btrfs/extent-tree.c      | 297 ---
 fs/btrfs/free-space-cache.c | 132 ++-
 fs/btrfs/ioctl.c            |   2 +
 fs/btrfs/ioctl.h            |   3 +
 fs/btrfs/relocation.c       |   8 +
 fs/btrfs/super.c            |  31 -
 fs/btrfs/volumes.c          |  39 +-
 fs/btrfs/volumes.h          |   1 +
 9 files changed, 369 insertions(+), 147 deletions(-)

-- 
1.7.4.4
[PATCH 16/20] Btrfs: try cluster but don't advance in search list
When we find an existing cluster, we switch to its block group as the current block group, possibly skipping multiple blocks in the process. Furthermore, under heavy contention, multiple threads may fail to allocate from a cluster and then release just-created clusters just to proceed to create new ones in a different block group. This patch tries to allocate from an existing cluster regardless of its block group, and doesn't switch to that group, instead proceeding to try to allocate a cluster from the group it was iterating before the attempt. Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br --- fs/btrfs/extent-tree.c | 76 +--- 1 files changed, 33 insertions(+), 43 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 66edda2..7064979 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5174,11 +5174,11 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans, struct btrfs_root *root = orig_root-fs_info-extent_root; struct btrfs_free_cluster *last_ptr = NULL; struct btrfs_block_group_cache *block_group = NULL; + struct btrfs_block_group_cache *used_block_group; int empty_cluster = 2 * 1024 * 1024; int allowed_chunk_alloc = 0; int done_chunk_alloc = 0; struct btrfs_space_info *space_info; - int last_ptr_loop = 0; int loop = 0; int index = 0; int alloc_type = (data BTRFS_BLOCK_GROUP_DATA) ? 
@@ -5245,6 +5245,7 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans, ideal_cache: block_group = btrfs_lookup_block_group(root-fs_info, search_start); + used_block_group = block_group; if (debug 1) printk(KERN_DEBUG btrfs %x.%i: ideal cache block %llx\n, debugid, loop, @@ -5286,6 +5287,7 @@ search: u64 offset; int cached; + used_block_group = block_group; btrfs_get_block_group(block_group); search_start = block_group-key.objectid; @@ -5380,13 +5382,20 @@ alloc: * people trying to start a new cluster */ spin_lock(last_ptr-refill_lock); - if (!last_ptr-block_group || - last_ptr-block_group-ro || - !block_group_bits(last_ptr-block_group, data)) + used_block_group = last_ptr-block_group; + if (used_block_group != block_group + (!used_block_group || +used_block_group-ro || +!block_group_bits(used_block_group, data))) { + used_block_group = block_group; goto refill_cluster; + } + + if (used_block_group != block_group) + btrfs_get_block_group(used_block_group); - offset = btrfs_alloc_from_cluster(block_group, last_ptr, -num_bytes, search_start); + offset = btrfs_alloc_from_cluster(used_block_group, + last_ptr, num_bytes, used_block_group-key.objectid); if (offset) { /* we have a block, we're done */ spin_unlock(last_ptr-refill_lock); @@ -5398,36 +5407,15 @@ alloc: printk(KERN_DEBUG btrfs %x.%i: failed cluster alloc\n, debugid, loop); - spin_lock(last_ptr-lock); - /* -* whoops, this cluster doesn't actually point to -* this block group. Get a ref on the block -* group it does point to and try again -*/ - if (!last_ptr_loop last_ptr-block_group - last_ptr-block_group != block_group - index = -get_block_group_index(last_ptr-block_group)) { - - btrfs_put_block_group(block_group); - block_group = last_ptr-block_group; - btrfs_get_block_group(block_group); - spin_unlock(last_ptr-lock); - spin_unlock(last_ptr-refill_lock); - - last_ptr_loop = 1; - search_start = block_group-key.objectid; - /* -* we know this block group is properly -* in the list
[PATCH 07/20] Btrfs: don't set up allocation result twice
We store the allocation start and length twice in ins, once right after the other, but with intervening calls that may prevent the duplicate from being optimized out by the compiler. Remove one of the assignments.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/extent-tree.c | 3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 525ff20..24eef3a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5412,9 +5412,6 @@ checks:
 		goto loop;
 	}
 
-	ins->objectid = search_start;
-	ins->offset = num_bytes;
-
 	if (offset < search_start)
 		btrfs_add_free_space(block_group, offset,
 				     search_start - offset);
-- 
1.7.4.4
[PATCH 04/20] Btrfs: reset cluster's max_size when creating bitmap cluster
The field that indicates the size of the largest contiguous chunk of free space in the cluster is not initialized when setting up bitmaps; it's only increased when we find a larger contiguous chunk. We end up retaining a larger value than appropriate for highly-fragmented clusters, which may cause pointless searches for large contiguous groups, and even cause clusters that do not meet the density requirements to be set up.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/free-space-cache.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index ff179b1..ec23d43 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2320,6 +2320,7 @@ again:
 		if (!found) {
 			start = i;
+			cluster->max_size = 0;
 			found = true;
 		}
-- 
1.7.4.4
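[Editorial note: the quantity max_size tracks is "largest contiguous run of free space seen while scanning the current bitmap". The hypothetical helper below (not btrfs code) computes that run for a 64-bit bitmap; since the value is a per-window maximum, carrying it over from a previous, less fragmented window would overstate what the current cluster can actually deliver, which is why the one-line reset in the patch matters.]

```c
#include <stdint.h>

/* Return the length, in bits, of the longest contiguous run of set
 * bits in a 64-bit bitmap (set bit = free sector, in this analogy). */
unsigned max_contig_run(uint64_t bits)
{
	unsigned best = 0, run = 0;

	for (int i = 0; i < 64; i++) {
		if (bits & (UINT64_C(1) << i)) {
			run++;
			if (run > best)
				best = run;
		} else {
			run = 0;   /* fragmentation breaks the run */
		}
	}
	return best;
}
```

A heavily fragmented bitmap can have plenty of free bits in total yet a tiny maximum run, so a stale max_size from an earlier scan would let an unsuitable cluster pass the density check.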
[PATCH 19/20] Btrfs: log when a bitmap is rejected for a cluster
---
 fs/btrfs/free-space-cache.c | 10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 953f7dd..0151274 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2316,6 +2316,16 @@ again:
 		i = next_zero;
 	}
 
+	if (!found_bits && total_found)
+		printk(KERN_INFO "btrfs: bitmap %llx want:%llx min:%llx cont:%llx start:%llx max:%llx total:%llx\n",
+		       (unsigned long long)entry->offset,
+		       (unsigned long long)bytes,
+		       (unsigned long long)min_bytes,
+		       (unsigned long long)cont1_bytes,
+		       (unsigned long long)(start * block_group->sectorsize),
+		       (unsigned long long)cluster->max_size,
+		       (unsigned long long)(total_found * block_group->sectorsize));
+
 	if (!found_bits)
 		return -ENOSPC;
-- 
1.7.4.4
[PATCH 18/20] Btrfs: add -o mincluster option
With -o mincluster, we save the location of the last successful allocation, so as to emulate some of the cluster allocation logic (though not non-bitmap preference) without actually going through the exercise of allocating clusters. Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br --- fs/btrfs/extent-tree.c | 16 +--- fs/btrfs/free-space-cache.c |1 + fs/btrfs/super.c| 17 + 3 files changed, 27 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 7ddbf9b..3c649fe 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5172,7 +5172,7 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans, { int ret = 0; struct btrfs_root *root = orig_root-fs_info-extent_root; - struct btrfs_free_cluster *last_ptr = NULL; + struct btrfs_free_cluster *last_ptr = NULL, *save_ptr = NULL; struct btrfs_block_group_cache *block_group = NULL; struct btrfs_block_group_cache *used_block_group; int empty_cluster = 2 * 1024 * 1024; @@ -5219,8 +5219,16 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans, debug = 1; debugid = atomic_inc_return(debugcnt); last_ptr = root-fs_info-meta_alloc_cluster; - if (!btrfs_test_opt(root, SSD)) - empty_cluster = 64 * 1024; + if (!btrfs_test_opt(root, SSD)) { + /* !SSD SSD_SPREAD == -o mincluster. 
*/ + if (btrfs_test_opt(root, SSD_SPREAD)) { + save_ptr = last_ptr; + hint_byte = save_ptr-window_start; + last_ptr = NULL; + use_cluster = false; + } else + empty_cluster = 64 * 1024; + } } if ((data BTRFS_BLOCK_GROUP_DATA) use_cluster @@ -5556,6 +5564,8 @@ checks: btrfs_add_free_space(used_block_group, offset, search_start - offset); BUG_ON(offset search_start); + if (save_ptr) + save_ptr-window_start = search_start + num_bytes; if (used_block_group != block_group) btrfs_put_block_group(used_block_group); btrfs_put_block_group(block_group); diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index 3aa56e4..953f7dd 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -2579,6 +2579,7 @@ void btrfs_init_free_cluster(struct btrfs_free_cluster *cluster) cluster-max_size = 0; INIT_LIST_HEAD(cluster-block_group_list); cluster-block_group = NULL; + cluster-window_start = 0; } int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group, diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 26b13d7..32fe064 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -165,7 +165,7 @@ enum { Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed, Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache, Opt_no_space_cache, Opt_recovery, - Opt_nocluster, Opt_cluster, Opt_err, + Opt_nocluster, Opt_cluster, Opt_mincluster, Opt_err, }; static match_table_t tokens = { @@ -202,6 +202,7 @@ static match_table_t tokens = { {Opt_recovery, recovery}, {Opt_nocluster, nocluster}, {Opt_cluster, cluster}, + {Opt_mincluster, mincluster}, {Opt_err, NULL}, }; @@ -407,6 +408,11 @@ int btrfs_parse_options(struct btrfs_root *root, char *options) printk(KERN_INFO btrfs: enabling alloc clustering\n); btrfs_clear_opt(info-mount_opt, NO_ALLOC_CLUSTER); break; + case Opt_mincluster: + printk(KERN_INFO btrfs: enabling minimal alloc clustering\n); + btrfs_clear_opt(info-mount_opt, NO_ALLOC_CLUSTER); + btrfs_set_opt(info-mount_opt, 
SSD_SPREAD); + break; case Opt_err: printk(KERN_INFO btrfs: unrecognized mount option '%s'\n, p); @@ -706,9 +712,12 @@ static int btrfs_show_options(struct seq_file *seq, struct vfsmount *vfs) } if (btrfs_test_opt(root, NOSSD)) seq_puts(seq, ,nossd); - if (btrfs_test_opt(root, SSD_SPREAD)) - seq_puts(seq, ,ssd_spread); - else if (btrfs_test_opt(root, SSD)) + if (btrfs_test_opt(root, SSD_SPREAD)) { + if (btrfs_test_opt(root, SSD)) + seq_puts(seq, ,ssd_spread); + else + seq_puts(seq, ,mincluster); + } else if (btrfs_test_opt(root
[PATCH 14/20] Btrfs: introduce option to rebalance only metadata
Experimental patch to be able to compact only the metadata after excessive block groups are created. I guess it should be implemented as a balance option rather than a separate ioctl, but this was good enough for me to try it. Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br --- fs/btrfs/ioctl.c |2 ++ fs/btrfs/ioctl.h |3 +++ fs/btrfs/volumes.c | 33 - fs/btrfs/volumes.h |1 + 4 files changed, 34 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index a90e749..6f53983 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -3077,6 +3077,8 @@ long btrfs_ioctl(struct file *file, unsigned int return btrfs_ioctl_dev_info(root, argp); case BTRFS_IOC_BALANCE: return btrfs_balance(root-fs_info-dev_root); + case BTRFS_IOC_BALANCE_METADATA: + return btrfs_balance_metadata(root-fs_info-dev_root); case BTRFS_IOC_CLONE: return btrfs_ioctl_clone(file, arg, 0, 0, 0); case BTRFS_IOC_CLONE_RANGE: diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h index 252ae99..46bc428 100644 --- a/fs/btrfs/ioctl.h +++ b/fs/btrfs/ioctl.h @@ -277,4 +277,7 @@ struct btrfs_ioctl_logical_ino_args { #define BTRFS_IOC_LOGICAL_INO _IOWR(BTRFS_IOCTL_MAGIC, 36, \ struct btrfs_ioctl_ino_path_args) +#define BTRFS_IOC_BALANCE_METADATA _IOW(BTRFS_IOCTL_MAGIC, 37, \ + struct btrfs_ioctl_vol_args) + #endif diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 7b348c2..db4397d 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -2084,7 +2084,7 @@ static u64 div_factor(u64 num, int factor) return num; } -int btrfs_balance(struct btrfs_root *dev_root) +static int btrfs_balance_skip(struct btrfs_root *dev_root, u64 skip_type) { int ret; struct list_head *devices = dev_root-fs_info-fs_devices-devices; @@ -2096,6 +2096,9 @@ int btrfs_balance(struct btrfs_root *dev_root) struct btrfs_root *chunk_root = dev_root-fs_info-chunk_root; struct btrfs_trans_handle *trans; struct btrfs_key found_key; + struct btrfs_chunk *chunk; + u64 chunk_type; + bool skip; if 
(dev_root-fs_info-sb-s_flags MS_RDONLY) return -EROFS; @@ -2165,11 +2168,21 @@ int btrfs_balance(struct btrfs_root *dev_root) if (found_key.offset == 0) break; + if (skip_type) { + chunk = btrfs_item_ptr(path-nodes[0], path-slots[0], + struct btrfs_chunk); + chunk_type = btrfs_chunk_type(path-nodes[0], chunk); + skip = (chunk_type skip_type); + } else + skip = false; + btrfs_release_path(path); - ret = btrfs_relocate_chunk(chunk_root, - chunk_root-root_key.objectid, - found_key.objectid, - found_key.offset); + + ret = (skip ? 0 : + btrfs_relocate_chunk(chunk_root, + chunk_root-root_key.objectid, + found_key.objectid, + found_key.offset)); if (ret ret != -ENOSPC) goto error; key.offset = found_key.offset - 1; @@ -2181,6 +2194,16 @@ error: return ret; } +int btrfs_balance(struct btrfs_root *dev_root) +{ + return btrfs_balance_skip(dev_root, 0); +} + +int btrfs_balance_metadata(struct btrfs_root *dev_root) +{ + return btrfs_balance_skip(dev_root, BTRFS_BLOCK_GROUP_DATA); +} + /* * shrinking a device means finding all of the device extents past * the new size, and then following the back refs to the chunks. diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 78f2d4d..6844010 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -229,6 +229,7 @@ struct btrfs_device *btrfs_find_device(struct btrfs_root *root, u64 devid, int btrfs_shrink_device(struct btrfs_device *device, u64 new_size); int btrfs_init_new_device(struct btrfs_root *root, char *path); int btrfs_balance(struct btrfs_root *dev_root); +int btrfs_balance_metadata(struct btrfs_root *dev_root); int btrfs_chunk_readonly(struct btrfs_root *root, u64 chunk_offset); int find_free_dev_extent(struct btrfs_trans_handle *trans, struct btrfs_device *device, u64 num_bytes, -- 1.7.4.4 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 05/20] Btrfs: start search for new cluster at the beginning of the block group
Instead of starting at zero (offset is always zero at this point), request a cluster starting at search_start, which denotes the beginning of the current block group.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/extent-tree.c | 6 ++----
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index bc0f13d..7edb9e6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5287,10 +5287,8 @@ alloc:
 		spin_lock(&last_ptr->refill_lock);
 		if (last_ptr->block_group &&
 		    (last_ptr->block_group->ro ||
-		    !block_group_bits(last_ptr->block_group, data))) {
-			offset = 0;
+		    !block_group_bits(last_ptr->block_group, data)))
 			goto refill_cluster;
-		}
 
 		offset = btrfs_alloc_from_cluster(block_group, last_ptr,
 						  num_bytes, search_start);
@@ -5341,7 +5339,7 @@ refill_cluster:
 			/* allocate a cluster in this block group */
 			ret = btrfs_find_space_cluster(trans, root,
 					       block_group, last_ptr,
-					       offset, num_bytes,
+					       search_start, num_bytes,
 					       empty_cluster + empty_size);
 			if (ret == 0) {
 				/*
-- 
1.7.4.4
[PATCH 20/20] Btrfs: don't waste metadata block groups for clustered allocation
We try to maintain about 1% of the filesystem space as free space in data block groups, but we need not do that for metadata, since we only allocate one block at a time.

This patch also moves the adjustment of flags that accounts for mixed data/metadata block groups into the block protected by the spin lock, before the point at which we now look at flags to decide whether or not we should keep the free space buffer.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/extent-tree.c | 24 +++++++++++++-----------
 1 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3c649fe..cce452d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3228,7 +3228,7 @@ static void force_metadata_allocation(struct btrfs_fs_info *info)
 
 static int should_alloc_chunk(struct btrfs_root *root,
 			      struct btrfs_space_info *sinfo, u64 alloc_bytes,
-			      int force)
+			      u64 flags, int force)
 {
 	struct btrfs_block_rsv *global_rsv = &root->fs_info->global_block_rsv;
 	u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
@@ -3246,10 +3246,10 @@ static int should_alloc_chunk(struct btrfs_root *root,
 	num_allocated += global_rsv->size;
 
 	/*
-	 * in limited mode, we want to have some free space up to
+	 * in limited mode, we want to have some free data space up to
 	 * about 1% of the FS size.
 	 */
-	if (force == CHUNK_ALLOC_LIMITED) {
+	if (force == CHUNK_ALLOC_LIMITED && (flags & BTRFS_BLOCK_GROUP_DATA)) {
 		thresh = btrfs_super_total_bytes(root->fs_info->super_copy);
 		thresh = max_t(u64, 64 * 1024 * 1024,
 			       div_factor_fine(thresh, 1));
@@ -3310,7 +3310,16 @@ again:
 		return 0;
 	}
 
-	if (!should_alloc_chunk(extent_root, space_info, alloc_bytes, force)) {
+	/*
+	 * If we have mixed data/metadata chunks we want to make sure we keep
+	 * allocating mixed chunks instead of individual chunks.
+	 */
+	if (btrfs_mixed_space_info(space_info))
+		flags |= (BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA);
+
+	if (!should_alloc_chunk(extent_root, space_info, alloc_bytes,
+				flags, force)) {
+		space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
 		spin_unlock(&space_info->lock);
 		return 0;
 	} else if (space_info->chunk_alloc) {
@@ -3336,13 +3345,6 @@ again:
 	}
 
 	/*
-	 * If we have mixed data/metadata chunks we want to make sure we keep
-	 * allocating mixed chunks instead of individual chunks.
-	 */
-	if (btrfs_mixed_space_info(space_info))
-		flags |= (BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA);
-
-	/*
 	 * if we're doing a data chunk, go ahead and make sure that
 	 * we keep a reasonable number of metadata chunks allocated in the
 	 * FS as well.
-- 
1.7.4.4
[PATCH 09/20] Btrfs: skip allocation attempt from empty cluster
If we don't have a cluster, don't bother trying to allocate from it, jumping right away to the attempt to set up a new cluster.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/extent-tree.c | 6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9eec362..92e640b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5280,9 +5280,9 @@ alloc:
 			 * people trying to start a new cluster
 			 */
 			spin_lock(&last_ptr->refill_lock);
-			if (last_ptr->block_group &&
-			    (last_ptr->block_group->ro ||
-			    !block_group_bits(last_ptr->block_group, data)))
+			if (!last_ptr->block_group ||
+			    last_ptr->block_group->ro ||
+			    !block_group_bits(last_ptr->block_group, data))
 				goto refill_cluster;
 
 			offset = btrfs_alloc_from_cluster(block_group, last_ptr,
-- 
1.7.4.4
[PATCH 12/20] Btrfs: introduce verbose debug mode for patched clustered allocation recovery
This patch adds several debug messages that helped me track down problems in the cluster allocation logic. All the messages are disabled by default, so that they're optimized away, but enabling the commented-out settings of debug brings up some helpful messages.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/extent-tree.c | 148 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 147 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 92e640b..823ab22 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5073,6 +5073,88 @@ enum btrfs_loop_type {
 	LOOP_NO_EMPTY_SIZE = 4,
 };
 
+/* ??? Move to free-space-cache.c? */
+static void
+btrfs_dump_free_space_tree(const char *kern, int debugid, int loop,
+			   int detailed, const char *what, const char *what2,
+			   unsigned long long prev, struct rb_node *node)
+{
+	struct btrfs_free_space *entry;
+	int entries = 0, frags = 0;
+	unsigned long long size = 0;
+	unsigned long bits = 0, i, p, q;
+
+	if (detailed)
+		printk("%sbtrfs %x.%i: %s %s %llx:\n",
+		       kern, debugid, loop, what, what2, prev);
+
+	while (node) {
+		entries++;
+		entry = rb_entry(node, struct btrfs_free_space, offset_index);
+		node = rb_next(&entry->offset_index);
+
+		size += entry->bytes;
+
+		if (detailed)
+			printk("%sbtrfs %x.%i: +%llx,%llx%s\n",
+			       kern, debugid, loop,
+			       (long long)(entry->offset - prev),
+			       (unsigned long long)entry->bytes,
+			       entry->bitmap
+			       ? (detailed > 1 ? ":" : " bitmap") : "");
+
+		if (!entry->bitmap)
+			continue;
+
+		i = 0;
+#define BITS_PER_BITMAP (PAGE_CACHE_SIZE * 8)
+		do {
+			p = i;
+			i = find_next_bit(entry->bitmap, BITS_PER_BITMAP, i);
+			q = i;
+			i = find_next_zero_bit(entry->bitmap, BITS_PER_BITMAP, i);
+
+			if (i != q)
+				frags++;
+			bits += i - q;
+
+			if (detailed > 1)
+				printk("%sbtrfs %x.%i: b+%lx,%lx\n",
+				       kern, debugid, loop, q - p, i - q);
+		} while (i < BITS_PER_BITMAP);
+#undef BITS_PER_BITMAP
+	}
+
+	if (detailed)
+		printk("%sbtrfs %x.%i: entries %x size %llx bits %lx frags %x\n",
+		       kern, debugid, loop, entries, size, bits, frags);
+	else
+		printk("%sbtrfs %x.%i: %s %s %llx: e:%x s:%llx b:%lx f:%x\n",
+		       kern, debugid, loop, what, what2,
+		       prev, entries, size, bits, frags);
+}
+
+static void
+btrfs_dump_cluster(const char *kern, int debugid, int loop, int detailed,
+		   const char *what, struct btrfs_free_cluster *cluster)
+{
+	spin_lock(&cluster->lock);
+
+	btrfs_dump_free_space_tree(kern, debugid, loop,
+				   detailed, what, "cluster",
+				   cluster->window_start,
+				   rb_first(&cluster->root));
+
+	spin_unlock(&cluster->lock);
+}
+
+static void
+btrfs_dump_block_group_free_space(const char *kern, int debugid, int loop,
+				  int detailed, const char *what,
+				  struct btrfs_block_group_cache *block_group)
+{
+	btrfs_dump_free_space_tree(kern, debugid, loop,
+				   detailed, what, "block group",
+				   block_group->key.objectid,
+				   rb_first(&block_group->free_space_ctl->free_space_offset));
+}
+
 /*
  * walks the btree of allocated extents and find a hole of a given size.
  * The key ins is changed to record the hole:
@@ -5108,6 +5190,9 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 	bool have_caching_bg = false;
 	u64 ideal_cache_percent = 0;
 	u64 ideal_cache_offset = 0;
+	int debug = 0;
+	int debugid = 0;
+	static atomic_t debugcnt;
 
 	WARN_ON(num_bytes < root->sectorsize);
 	btrfs_set_key_type(ins, BTRFS_EXTENT_ITEM_KEY);
@@ -5131,6 +5216,8 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 		allowed_chunk_alloc = 1;
 
 	if ((data & BTRFS_BLOCK_GROUP_METADATA) && use_cluster) {
+		/* debug = 1; */
+		debugid = atomic_inc_return(&debugcnt);
 		last_ptr = &root->fs_info->meta_alloc_cluster;
 		if (!btrfs_test_opt(root, SSD))
 			empty_cluster = 64 * 1024;
@@ -5158,6 +5245,10 @@ static noinline int
[PATCH 02/20] Btrfs: initialize new bitmaps' list
We're failing to create clusters with bitmaps because setup_cluster_no_bitmap checks that the list is empty before inserting the bitmap entry in the list for setup_cluster_bitmap, but the list field is only initialized when it is restored from the on-disk free space cache, or when it is written out to disk.

Besides a potential race condition due to the multiple uses of the list field, filesystem performance severely degrades over time: as we use up all non-bitmap free extents, the try-to-set-up-cluster dance is done at every metadata block allocation. For every block group, we fail to set up a cluster, and after failing on them all up to twice, we fall back to the much slower unclustered allocation. To make matters worse, before the unclustered allocation, we try to create new block groups until we reach the 1% threshold, which introduces additional bitmaps and thus block groups that we'll iterate over at each metadata block request.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/free-space-cache.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 6e5b7e4..ff179b1 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1470,6 +1470,7 @@ static void add_new_bitmap(struct btrfs_free_space_ctl *ctl,
 {
 	info->offset = offset_to_bitmap(ctl, offset);
 	info->bytes = 0;
+	INIT_LIST_HEAD(&info->list);
 	link_free_space(ctl, info);
 	ctl->total_bitmaps++;
-- 
1.7.4.4
[PATCH 15/20] Btrfs: activate allocation debugging
Activate various messages that help track down clustered allocation problems, which are disabled and optimized out by default.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/extent-tree.c | 6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 823ab22..66edda2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5216,7 +5216,7 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 		allowed_chunk_alloc = 1;
 
 	if ((data & BTRFS_BLOCK_GROUP_METADATA) && use_cluster) {
-		/* debug = 1; */
+		debug = 1;
 		debugid = atomic_inc_return(&debugcnt);
 		last_ptr = &root->fs_info->meta_alloc_cluster;
 		if (!btrfs_test_opt(root, SSD))
@@ -5393,7 +5393,7 @@ alloc:
 			goto checks;
 		}
 
-		/* debug = 2; */
+		debug = 2;
 		if (debug > 1)
 			printk(KERN_DEBUG "btrfs %x.%i: failed cluster alloc\n",
 			       debugid, loop);
@@ -5446,7 +5446,7 @@ refill_cluster:
 			 * this cluster didn't work out, free it and
 			 * start over
 			 */
-			/* debug = 2; */
+			debug = 2;
 			if ((debug > 1 || (debug && last_ptr->block_group)) &&
 			    last_ptr->window_start)
 				btrfs_dump_cluster(KERN_DEBUG, debugid, loop, 0,
						   "drop", last_ptr);
 			btrfs_return_cluster_to_free_space(NULL, last_ptr);
-- 
1.7.4.4
[PATCH 08/20] Btrfs: try to allocate from cluster even at LOOP_NO_EMPTY_SIZE
If we reach LOOP_NO_EMPTY_SIZE, we won't even try to use a cluster that others might have set up. Odds are that there won't be one, but if someone else succeeded in setting it up, we might as well use it, even if we don't try to set up a cluster again.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/extent-tree.c | 26 ++++++++++++++++++--------
 1 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 24eef3a..9eec362 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5271,15 +5271,10 @@ alloc:
 		spin_unlock(&block_group->free_space_ctl->tree_lock);
 
 		/*
-		 * Ok we want to try and use the cluster allocator, so lets look
-		 * there, unless we are on LOOP_NO_EMPTY_SIZE, since we will
-		 * have tried the cluster allocator plenty of times at this
-		 * point and not have found anything, so we are likely way too
-		 * fragmented for the clustering stuff to find anything, so lets
-		 * just skip it and let the allocator find whatever block it can
-		 * find
+		 * Ok we want to try and use the cluster allocator, so
+		 * lets look there
 		 */
-		if (last_ptr && loop < LOOP_NO_EMPTY_SIZE) {
+		if (last_ptr) {
 			/*
 			 * the refill lock keeps out other
 			 * people trying to start a new cluster
@@ -5328,6 +5323,20 @@ alloc:
 			}
 			spin_unlock(&last_ptr->lock);
 refill_cluster:
+			/* If we are on LOOP_NO_EMPTY_SIZE, we can't
+			 * set up a new cluster, so lets just skip it
+			 * and let the allocator find whatever block
+			 * it can find.  If we reach this point, we
+			 * will have tried the cluster allocator
+			 * plenty of times and not have found
+			 * anything, so we are likely way too
+			 * fragmented for the clustering stuff to find
+			 * anything.
+			 */
+			if (loop >= LOOP_NO_EMPTY_SIZE) {
+				spin_unlock(&last_ptr->refill_lock);
+				goto unclustered_alloc;
+			}
+
 			/*
 			 * this cluster didn't work out, free it and
 			 * start over
@@ -5375,6 +5384,7 @@ refill_cluster:
 			goto loop;
 		}
 
+unclustered_alloc:
 		offset = btrfs_find_space_for_alloc(block_group, search_start,
 						    num_bytes, empty_size);
 		/*
-- 
1.7.4.4
[PATCH 01/20] Btrfs: enable removal of second disk with raid1 metadata
Enable removal of a second disk even if that requires conversion of metadata from raid1 to dup, but not when data would lose replication.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/volumes.c | 6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c37433d..7b348c2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1290,12 +1290,16 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 		goto out;
 	}
 
-	if ((all_avail & BTRFS_BLOCK_GROUP_RAID1) &&
+	if ((root->fs_info->avail_data_alloc_bits & BTRFS_BLOCK_GROUP_RAID1) &&
 	    root->fs_info->fs_devices->num_devices <= 2) {
 		printk(KERN_ERR "btrfs: unable to go below two "
 		       "devices on raid1\n");
 		ret = -EINVAL;
 		goto out;
+	} else if ((all_avail & BTRFS_BLOCK_GROUP_RAID1) &&
+		   root->fs_info->fs_devices->num_devices <= 2) {
+		printk(KERN_ERR "btrfs: going below two devices "
+		       "will switch metadata from raid1 to dup\n");
 	}
 
 	if (strcmp(device_path, "missing") == 0) {
-- 
1.7.4.4
[PATCH 10/20] Btrfs: report reason for failed relocation
btrfs filesystem balance sometimes fails on corrupted filesystems, but without any information that explains what the failure was, to help track down the problem. This patch adds logging for nearly all error conditions that may cause relocation to fail.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/relocation.c | 8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index dff29d5..15a2270 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2496,6 +2496,7 @@ static int do_relocation(struct btrfs_trans_handle *trans,
 		if (!upper->eb) {
 			ret = btrfs_search_slot(trans, root, key, path, 0, 1);
 			if (ret < 0) {
+				printk(KERN_INFO "btrfs: searching slot %llu failed: %i\n", key->objectid, -ret);
 				err = ret;
 				break;
 			}
@@ -2543,6 +2544,7 @@ static int do_relocation(struct btrfs_trans_handle *trans,
 			btrfs_tree_unlock(eb);
 			free_extent_buffer(eb);
 			if (ret < 0) {
+				printk(KERN_INFO "btrfs: cow slot failed: %i\n", -ret);
 				err = ret;
 				goto next;
 			}
@@ -2730,6 +2732,7 @@ static int relocate_tree_block(struct btrfs_trans_handle *trans,
 	BUG_ON(node->processed);
 	root = select_one_root(trans, node);
 	if (root == ERR_PTR(-ENOENT)) {
+		printk(KERN_INFO "btrfs: could not find a root to update\n");
 		update_processed_blocks(rc, node);
 		goto out;
 	}
@@ -2756,6 +2759,8 @@ static int relocate_tree_block(struct btrfs_trans_handle *trans,
 		btrfs_release_path(path);
 		if (ret > 0)
 			ret = 0;
+		if (ret < 0)
+			printk(KERN_INFO "btrfs: failed to search slot %llu: %i\n", key->objectid, -ret);
 	}
 	if (!ret)
 		update_processed_blocks(rc, node);
@@ -2813,12 +2818,14 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans,
 					      block->level, block->bytenr);
 		if (IS_ERR(node)) {
 			err = PTR_ERR(node);
+			printk(KERN_INFO "btrfs: failed to build backref tree for key %llu byte %llu: %i\n", block->key.objectid, block->bytenr, -err);
 			goto out;
 		}
 
 		ret = relocate_tree_block(trans, rc, node, &block->key,
 					  path);
 		if (ret < 0) {
+			printk(KERN_INFO "btrfs: failed to relocate tree block: %i\n", -ret);
 			if (ret != -EAGAIN || rb_node == rb_first(blocks))
 				err = ret;
 			goto out;
@@ -3770,6 +3777,7 @@ restart:
 		ret = relocate_tree_blocks(trans, rc, &blocks);
 		if (ret < 0) {
 			if (ret != -EAGAIN) {
+				printk(KERN_INFO "btrfs: failed to relocate blocks for key %llu: %i\n", key.objectid, -ret);
 				err = ret;
 				break;
 			}
-- 
1.7.4.4
[PATCH 06/20] Btrfs: skip block groups without enough space for a cluster
We test whether a block group has enough free space to hold the requested block, but when we're doing clustered allocation, we can save some cycles by testing whether it has enough room for the cluster upfront; otherwise we end up attempting to set up a cluster and failing. Only in the NO_EMPTY_SIZE loop do we attempt an unclustered allocation, and by then we'll have zeroed the cluster size, so this patch won't stop us from using the block group as a last resort.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/extent-tree.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 7edb9e6..525ff20 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5264,7 +5264,7 @@ alloc:
 		spin_lock(&block_group->free_space_ctl->tree_lock);
 		if (cached &&
 		    block_group->free_space_ctl->free_space <
-		    num_bytes + empty_size) {
+		    num_bytes + empty_cluster + empty_size) {
 			spin_unlock(&block_group->free_space_ctl->tree_lock);
 			goto loop;
 		}
-- 
1.7.4.4
[PATCH 11/20] Btrfs: note when a bitmap is skipped because its list is in use
Bitmap lists serve two purposes: recording the order of loading/saving on-disk free space caches, and setting up a list of bitmaps to try to set up a cluster from. Complain if a list is unexpectedly busy.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/free-space-cache.c | 7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index ec23d43..dd7fe43 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -904,6 +904,7 @@ int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode,
 			goto out_nospc;
 
 		if (e->bitmap) {
+			BUG_ON(!list_empty(&e->list));
 			list_add_tail(&e->list, &bitmap_list);
 			bitmaps++;
 		}
@@ -2380,6 +2381,9 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
 	while (entry->bitmap) {
 		if (list_empty(&entry->list))
 			list_add_tail(&entry->list, bitmaps);
+		else if (entry->bitmap)
+			printk(KERN_ERR "btrfs: not using (busy?!?) bitmap %lli\n",
+			       (unsigned long long)entry->offset);
 		node = rb_next(&entry->offset_index);
 		if (!node)
 			return -ENOSPC;
@@ -2402,6 +2406,9 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
 		if (entry->bitmap) {
 			if (list_empty(&entry->list))
 				list_add_tail(&entry->list, bitmaps);
+			else
+				printk(KERN_ERR "btrfs: not using (busy?!?) bitmap %lli\n",
+				       (unsigned long long)entry->offset);
 			continue;
 		}
-- 
1.7.4.4
[PATCH 17/20] Btrfs: introduce -o cluster and -o nocluster
Introduce -o nocluster to disable the use of clusters for extent allocation, and -o cluster to reverse it.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/ctree.h       |  3 ++-
 fs/btrfs/extent-tree.c |  2 +-
 fs/btrfs/super.c       | 16 ++++++++++++++--
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 04a5dfc..1deaf2d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -971,7 +971,7 @@ struct btrfs_fs_info {
 	 * is required instead of the faster short fsync log commits
 	 */
 	u64 last_trans_log_full_commit;
-	unsigned long mount_opt:20;
+	unsigned long mount_opt:28;
 	unsigned long compress_type:4;
 	u64 max_inline;
 	u64 alloc_start;
@@ -1413,6 +1413,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_AUTO_DEFRAG		(1 << 16)
 #define BTRFS_MOUNT_INODE_MAP_CACHE	(1 << 17)
 #define BTRFS_MOUNT_RECOVERY		(1 << 18)
+#define BTRFS_MOUNT_NO_ALLOC_CLUSTER	(1 << 19)
 
 #define btrfs_clear_opt(o, opt)		((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)		((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 7064979..7ddbf9b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5186,7 +5186,7 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 	bool found_uncached_bg = false;
 	bool failed_cluster_refill = false;
 	bool failed_alloc = false;
-	bool use_cluster = true;
+	bool use_cluster = !btrfs_test_opt(root, NO_ALLOC_CLUSTER);
 	bool have_caching_bg = false;
 	u64 ideal_cache_percent = 0;
 	u64 ideal_cache_offset = 0;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8bd9d6d..26b13d7 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -164,7 +164,8 @@ enum {
 	Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard,
 	Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
 	Opt_enospc_debug, Opt_subvolrootid, Opt_defrag,
-	Opt_inode_cache, Opt_no_space_cache, Opt_recovery, Opt_err,
+	Opt_inode_cache, Opt_no_space_cache, Opt_recovery,
+	Opt_nocluster, Opt_cluster, Opt_err,
 };
 
 static match_table_t tokens = {
@@ -199,6 +200,8 @@ static match_table_t tokens = {
 	{Opt_inode_cache, "inode_cache"},
 	{Opt_no_space_cache, "nospace_cache"},
 	{Opt_recovery, "recovery"},
+	{Opt_nocluster, "nocluster"},
+	{Opt_cluster, "cluster"},
 	{Opt_err, NULL},
 };
 
@@ -390,12 +393,19 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 			btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
 			break;
 		case Opt_defrag:
-			printk(KERN_INFO "btrfs: enabling auto defrag");
+			printk(KERN_INFO "btrfs: enabling auto defrag\n");
 			btrfs_set_opt(info->mount_opt, AUTO_DEFRAG);
 			break;
 		case Opt_recovery:
 			printk(KERN_INFO "btrfs: enabling auto recovery");
 			btrfs_set_opt(info->mount_opt, RECOVERY);
+		case Opt_nocluster:
+			printk(KERN_INFO "btrfs: disabling alloc clustering\n");
+			btrfs_set_opt(info->mount_opt, NO_ALLOC_CLUSTER);
+			break;
+		case Opt_cluster:
+			printk(KERN_INFO "btrfs: enabling alloc clustering\n");
+			btrfs_clear_opt(info->mount_opt, NO_ALLOC_CLUSTER);
+			break;
 		case Opt_err:
@@ -722,6 +732,8 @@ static int btrfs_show_options(struct seq_file *seq, struct vfsmount *vfs)
 		seq_puts(seq, ",autodefrag");
 	if (btrfs_test_opt(root, INODE_MAP_CACHE))
 		seq_puts(seq, ",inode_cache");
+	if (btrfs_test_opt(root, NO_ALLOC_CLUSTER))
+		seq_puts(seq, ",nocluster");
 	return 0;
 }
-- 
1.7.4.4
[PATCH 13/20] Btrfs: revamp clustered allocation logic
Parameterize clusters on minimum total size, minimum chunk size and minimum contiguous size for at least one chunk, without limits on cluster, window or gap sizes. Don't tolerate any fragmentation for SSD_SPREAD; accept it for metadata, but try to keep data dense.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/free-space-cache.c | 112 ++++++++++++++++++-------------------------
 1 files changed, 49 insertions(+), 63 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index dd7fe43..3aa56e4 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2284,23 +2284,23 @@ out:
 static int btrfs_bitmap_cluster(struct btrfs_block_group_cache *block_group,
 				struct btrfs_free_space *entry,
 				struct btrfs_free_cluster *cluster,
-				u64 offset, u64 bytes, u64 min_bytes)
+				u64 offset, u64 bytes,
+				u64 cont1_bytes, u64 min_bytes)
 {
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	unsigned long next_zero;
 	unsigned long i;
-	unsigned long search_bits;
-	unsigned long total_bits;
+	unsigned long want_bits;
+	unsigned long min_bits;
 	unsigned long found_bits;
 	unsigned long start = 0;
 	unsigned long total_found = 0;
 	int ret;
-	bool found = false;
 
 	i = offset_to_bit(entry->offset, block_group->sectorsize,
 			  max_t(u64, offset, entry->offset));
-	search_bits = bytes_to_bits(bytes, block_group->sectorsize);
-	total_bits = bytes_to_bits(min_bytes, block_group->sectorsize);
+	want_bits = bytes_to_bits(bytes, block_group->sectorsize);
+	min_bits = bytes_to_bits(min_bytes, block_group->sectorsize);
 
 again:
 	found_bits = 0;
@@ -2309,7 +2309,7 @@ again:
 	     i = find_next_bit(entry->bitmap, BITS_PER_BITMAP, i + 1)) {
 		next_zero = find_next_zero_bit(entry->bitmap,
 					       BITS_PER_BITMAP, i);
-		if (next_zero - i >= search_bits) {
+		if (next_zero - i >= min_bits) {
 			found_bits = next_zero - i;
 			break;
 		}
@@ -2319,10 +2319,9 @@ again:
 	if (!found_bits)
 		return -ENOSPC;
 
-	if (!found) {
+	if (!total_found) {
 		start = i;
 		cluster->max_size = 0;
-		found = true;
 	}
 
 	total_found += found_bits;
@@ -2330,13 +2329,8 @@ again:
 	if (cluster->max_size < found_bits * block_group->sectorsize)
 		cluster->max_size = found_bits * block_group->sectorsize;
 
-	if (total_found < total_bits) {
-		i = find_next_bit(entry->bitmap, BITS_PER_BITMAP, next_zero);
-		if (i - start > total_bits * 2) {
-			total_found = 0;
-			cluster->max_size = 0;
-			found = false;
-		}
+	if (total_found < want_bits || cluster->max_size < cont1_bytes) {
+		i = next_zero + 1;
 		goto again;
 	}
 
@@ -2352,23 +2346,23 @@ again:
 /*
  * This searches the block group for just extents to fill the cluster with.
+ * Try to find a cluster with at least bytes total bytes, at least one
+ * extent of cont1_bytes, and other clusters of at least min_bytes.
  */
 static noinline int
 setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
 			struct btrfs_free_cluster *cluster,
 			struct list_head *bitmaps, u64 offset, u64 bytes,
-			u64 min_bytes)
+			u64 cont1_bytes, u64 min_bytes)
 {
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	struct btrfs_free_space *first = NULL;
 	struct btrfs_free_space *entry = NULL;
-	struct btrfs_free_space *prev = NULL;
 	struct btrfs_free_space *last;
 	struct rb_node *node;
 	u64 window_start;
 	u64 window_free;
 	u64 max_extent;
-	u64 max_gap = 128 * 1024;
 
 	entry = tree_search_offset(ctl, offset, 0, 1);
 	if (!entry)
@@ -2378,8 +2372,8 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
 	 * We don't want bitmaps, so just move along until we find a normal
 	 * extent entry.
 	 */
-	while (entry->bitmap) {
-		if (list_empty(&entry->list))
+	while (entry->bitmap || entry->bytes < min_bytes) {
+		if (entry->bitmap && list_empty(&entry->list))
 			list_add_tail(&entry->list, bitmaps);
 		else if (entry->bitmap)
 			printk(KERN_ERR "btrfs: not using (busy?!?) bitmap %lli\n",
@@ -2395,12 +2389,9 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
 	max_extent = entry->bytes
[PATCH] Btrfs: don't waste metadata block groups for clustered allocation
We try to maintain about 1% of the filesystem space as free space in data block groups, but we need not do that for metadata, since we only allocate one block at a time.

This patch also moves the adjustment of flags that accounts for mixed data/metadata block groups into the block protected by the spin lock, and before the point in which we now look at flags to decide whether or not we should keep the free space buffer.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/extent-tree.c | 26 ++++++++++++++------------
 1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 75bafe9..b3ec6c3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3228,7 +3228,7 @@ static void force_metadata_allocation(struct btrfs_fs_info *info)
 
 static int should_alloc_chunk(struct btrfs_root *root,
 			      struct btrfs_space_info *sinfo, u64 alloc_bytes,
-			      int force)
+			      u64 flags, int force)
 {
 	struct btrfs_block_rsv *global_rsv = &root->fs_info->global_block_rsv;
 	u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
@@ -3246,10 +3246,10 @@ static int should_alloc_chunk(struct btrfs_root *root,
 	num_allocated += global_rsv->size;
 
 	/*
-	 * in limited mode, we want to have some free space up to
+	 * in limited mode, we want to have some free data space up to
 	 * about 1% of the FS size.
 	 */
-	if (force == CHUNK_ALLOC_LIMITED) {
+	if (force == CHUNK_ALLOC_LIMITED && (flags & BTRFS_BLOCK_GROUP_DATA)) {
 		thresh = btrfs_super_total_bytes(root->fs_info->super_copy);
 		thresh = max_t(u64, 64 * 1024 * 1024,
 			       div_factor_fine(thresh, 1));
@@ -3310,7 +3310,16 @@ again:
 		return 0;
 	}
 
-	if (!should_alloc_chunk(extent_root, space_info, alloc_bytes, force)) {
+	/*
+	 * If we have mixed data/metadata chunks we want to make sure we keep
+	 * allocating mixed chunks instead of individual chunks.
+	 */
+	if (btrfs_mixed_space_info(space_info))
+		flags |= (BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA);
+
+	if (!should_alloc_chunk(extent_root, space_info, alloc_bytes,
+				flags, force)) {
+		space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
 		spin_unlock(&space_info->lock);
 		return 0;
 	} else if (space_info->chunk_alloc) {
@@ -3336,13 +3345,6 @@ again:
 	}
 
 	/*
-	 * If we have mixed data/metadata chunks we want to make sure we keep
-	 * allocating mixed chunks instead of individual chunks.
-	 */
-	if (btrfs_mixed_space_info(space_info))
-		flags |= (BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA);
-
-	/*
 	 * if we're doing a data chunk, go ahead and make sure that
 	 * we keep a reasonable number of metadata chunks allocated in the
 	 * FS as well.
@@ -5312,7 +5314,7 @@ alloc:
 			/*
 			 * whoops, this cluster doesn't actually point to
 			 * this block group.  Get a ref on the block
-			 * group is does point to and try again
+			 * group it does point to and try again
 			 */
 			if (!last_ptr_loop && last_ptr->block_group &&
 			    last_ptr->block_group != block_group
-- 
1.7.4.4
Re: Don't prevent removal of devices that break raid reqs
On Nov 11, 2011, Chris Mason chris.ma...@oracle.com wrote: On Thu, Nov 10, 2011 at 05:32:48PM -0200, Alexandre Oliva wrote: Instead of preventing the removal of devices that would render existing raid10 or raid1 impossible, warn but go ahead with it; the rebalancing code is smart enough to use different block group types. We'll need a --force or some kind. There are definitely cases users have wanted to do this but it is rarely a good idea ;) Even if it's just metadata that will turn from raid1 to dup, as in the revised patch below? From 276b1af70556bf5bdbaa1f81cb630d6c83962323 Mon Sep 17 00:00:00 2001 From: Alexandre Oliva lxol...@fsfla.org Date: Tue, 8 Nov 2011 12:33:11 -0200 Subject: [PATCH 1/8] Btrfs: enable removal of second disk with raid1 metadata Enable removal of a second disk even if that requires conversion of metadata from raid1 to dup, but not when data would lose replication. Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br --- fs/btrfs/volumes.c |6 +- 1 files changed, 5 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index c37433d..7b348c2 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1290,12 +1290,16 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path) goto out; } - if ((all_avail BTRFS_BLOCK_GROUP_RAID1) + if ((root-fs_info-avail_data_alloc_bits BTRFS_BLOCK_GROUP_RAID1) root-fs_info-fs_devices-num_devices = 2) { printk(KERN_ERR btrfs: unable to go below two devices on raid1\n); ret = -EINVAL; goto out; + } else if ((all_avail BTRFS_BLOCK_GROUP_RAID1) + root-fs_info-fs_devices-num_devices = 2) { + printk(KERN_ERR btrfs: going below two devices + will switch metadata from raid1 to dup\n); } if (strcmp(device_path, missing) == 0) { -- 1.7.4.4 -- Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/ You must be the change you wish to see in the world. -- Gandhi Be Free! -- http://FSFLA.org/ FSF Latin America board member Free Software Evangelist Red Hat Brazil Compiler Engineer
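The decision logic of the revised patch can be sketched as a standalone check. This is an illustration only: the flag bit, error value, and names are stand-ins, not the kernel's volumes.c. Removal is refused only when *data* is raid1 and we would drop below two devices; when only metadata/system would be affected, it warns about the raid1-to-dup conversion and proceeds.

```c
#include <stdint.h>
#include <stdio.h>

#define RAID1_BIT     0x10   /* stand-in for BTRFS_BLOCK_GROUP_RAID1 */
#define EINVAL_SKETCH 22

/* Hypothetical sketch of the revised btrfs_rm_device() check. */
int rm_device_check_sketch(uint64_t data_alloc_bits, uint64_t all_avail_bits,
                           int num_devices)
{
    if ((data_alloc_bits & RAID1_BIT) && num_devices <= 2)
        return -EINVAL_SKETCH;   /* data would lose replication: refuse */
    if ((all_avail_bits & RAID1_BIT) && num_devices <= 2)
        fprintf(stderr,
                "going below two devices will switch metadata from raid1 to dup\n");
    return 0;                    /* warn but allow the removal */
}
```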
Re: Revamp cluster allocation logic
On Nov 10, 2011, Alexandre Oliva ol...@lsd.ic.unicamp.br wrote: These are patches I posted before, except these are based on cmason's for-linus. Reposting at josef's request. Reposting again, at josef's request, this time consolidating the 3 patches into one. From 349a2a26d97c6497f7e4df55b1bdb2f93a673376 Mon Sep 17 00:00:00 2001 From: Alexandre Oliva lxol...@fsfla.org Date: Fri, 14 Oct 2011 12:10:36 -0300 Subject: [PATCH 4/8] Btrfs: revamp clustered allocation logic Parameterize clusters on minimum total size, minimum chunk size and minimum contiguous size for at least one chunk, without limits on cluster, window or gap sizes. Don't tolerate any fragmentation for SSD_SPREAD; accept it for metadata, but try to keep data dense. Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br --- fs/btrfs/free-space-cache.c | 114 ++ 1 files changed, 49 insertions(+), 65 deletions(-) diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index 181760f..7fe88b5 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -2271,23 +2271,23 @@ out: static int btrfs_bitmap_cluster(struct btrfs_block_group_cache *block_group, struct btrfs_free_space *entry, struct btrfs_free_cluster *cluster, -u64 offset, u64 bytes, u64 min_bytes) +u64 offset, u64 bytes, +u64 cont1_bytes, u64 min_bytes) { struct btrfs_free_space_ctl *ctl = block_group-free_space_ctl; unsigned long next_zero; unsigned long i; - unsigned long search_bits; - unsigned long total_bits; + unsigned long want_bits; + unsigned long min_bits; unsigned long found_bits; unsigned long start = 0; unsigned long total_found = 0; int ret; - bool found = false; i = offset_to_bit(entry-offset, block_group-sectorsize, max_t(u64, offset, entry-offset)); - search_bits = bytes_to_bits(bytes, block_group-sectorsize); - total_bits = bytes_to_bits(min_bytes, block_group-sectorsize); + want_bits = bytes_to_bits(bytes, block_group-sectorsize); + min_bits = bytes_to_bits(min_bytes, block_group-sectorsize); again: 
found_bits = 0; @@ -2296,7 +2296,7 @@ again: i = find_next_bit(entry-bitmap, BITS_PER_BITMAP, i + 1)) { next_zero = find_next_zero_bit(entry-bitmap, BITS_PER_BITMAP, i); - if (next_zero - i = search_bits) { + if (next_zero - i = min_bits) { found_bits = next_zero - i; break; } @@ -2306,23 +2306,16 @@ again: if (!found_bits) return -ENOSPC; - if (!found) { + if (!total_found) start = i; - found = true; - } total_found += found_bits; if (cluster-max_size found_bits * block_group-sectorsize) cluster-max_size = found_bits * block_group-sectorsize; - if (total_found total_bits) { + if (total_found want_bits || cluster-max_size cont1_bytes) { i = find_next_bit(entry-bitmap, BITS_PER_BITMAP, next_zero); - if (i - start total_bits * 2) { - total_found = 0; - cluster-max_size = 0; - found = false; - } goto again; } @@ -2338,23 +2331,23 @@ again: /* * This searches the block group for just extents to fill the cluster with. + * Try to find a cluster with at least bytes total bytes, at least one + * extent of cont1_bytes, and other clusters of at least min_bytes. */ static noinline int setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group, struct btrfs_free_cluster *cluster, struct list_head *bitmaps, u64 offset, u64 bytes, - u64 min_bytes) + u64 cont1_bytes, u64 min_bytes) { struct btrfs_free_space_ctl *ctl = block_group-free_space_ctl; struct btrfs_free_space *first = NULL; struct btrfs_free_space *entry = NULL; - struct btrfs_free_space *prev = NULL; struct btrfs_free_space *last; struct rb_node *node; u64 window_start; u64 window_free; u64 max_extent; - u64 max_gap = 128 * 1024; entry = tree_search_offset(ctl, offset, 0, 1); if (!entry) @@ -2364,8 +2357,8 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group, * We don't want bitmaps, so just move along until we find a normal * extent entry. 
*/ - while (entry-bitmap) { - if (list_empty(entry-list)) + while (entry-bitmap || entry-bytes min_bytes) { + if (entry-bitmap list_empty(entry-list)) list_add_tail(entry-list, bitmaps); node = rb_next(entry-offset_index); if (!node) @@ -2378,12 +2371,9 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group, max_extent = entry-bytes; first = entry; last = entry; - prev = entry; - while (window_free = min_bytes) { - node = rb_next(entry-offset_index); - if (!node) - return -ENOSPC; + for (node = rb_next(entry-offset_index); node; + node = rb_next(entry-offset_index)) { entry = rb_entry(node, struct btrfs_free_space, offset_index); if (entry-bitmap) { @@ -2392,26 +2382,18 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group, continue; } - /* - * we haven't filled the empty size and the window is - * very large
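In outline, the revised bitmap scan collects runs of free bits that are at least min_bits long until it has gathered want_bits in total *and* at least one run covers cont1_bytes. The sketch below is illustrative only: it uses one byte per bit for clarity instead of the kernel's packed bitmaps and find_next_bit()/find_next_zero_bit(), and the names are made up.

```c
#include <stddef.h>

/* Hypothetical sketch of the reworked btrfs_bitmap_cluster() scan:
 * bitmap[i] nonzero means bit i is free. Returns 0 and the cluster
 * start on success, -1 (standing in for -ENOSPC) otherwise. */
int find_bitmap_cluster(const unsigned char *bitmap, size_t nbits,
                        size_t want_bits, size_t cont1_bits,
                        size_t min_bits, size_t *start_out)
{
    size_t i = 0, total = 0, max_run = 0, start = 0;
    int started = 0;

    while (i < nbits) {
        if (!bitmap[i]) {          /* skip used bits */
            i++;
            continue;
        }
        size_t j = i;
        while (j < nbits && bitmap[j])
            j++;
        size_t run = j - i;
        if (run >= min_bits) {     /* runs below min_bits are ignored */
            if (!started) {
                start = i;
                started = 1;
            }
            total += run;
            if (run > max_run)
                max_run = run;
            /* done once we have enough in total and one run
             * large enough for the contiguous requirement */
            if (total >= want_bits && max_run >= cont1_bits) {
                *start_out = start;
                return 0;
            }
        }
        i = j;
    }
    return -1;
}
```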
report relocation failures
I've had some corrupted filesystems that failed to balance and to remove devices. It was slightly annoying that btrfs would exit with a nonzero status, but no information about the error was logged anywhere. This patch introduces some error reporting, catching the one error I was running into: -ENOENT while looking for a backref, presumably because of outdated metadata that ended up being used as if it were still live. I ended up losing the filesystem before I could figure out exactly what the problem was, but with this information it would hopefully not take as long to track it down.

From 2bbc4ae372f8ca31701db8ed0cf8e15edf76311e Mon Sep 17 00:00:00 2001
From: Alexandre Oliva lxol...@fsfla.org
Date: Wed, 16 Nov 2011 01:25:06 -0200
Subject: [PATCH 6/8] Btrfs: report reason for failed relocation

btrfs filesystem balance sometimes fails on corrupted filesystems, but without any information that explains what the failure was, to help track down the problem. This patch adds logging for nearly all error conditions that may cause relocation to fail.
Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br --- fs/btrfs/relocation.c |8 1 files changed, 8 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index dff29d5..15a2270 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -2496,6 +2496,7 @@ static int do_relocation(struct btrfs_trans_handle *trans, if (!upper-eb) { ret = btrfs_search_slot(trans, root, key, path, 0, 1); if (ret 0) { +printk(KERN_INFO btrfs: searching slot %llu failed: %i\n, key-objectid, -ret); err = ret; break; } @@ -2543,6 +2544,7 @@ static int do_relocation(struct btrfs_trans_handle *trans, btrfs_tree_unlock(eb); free_extent_buffer(eb); if (ret 0) { +printk(KERN_INFO btrfs: cow slot failed: %i\n, -ret); err = ret; goto next; } @@ -2730,6 +2732,7 @@ static int relocate_tree_block(struct btrfs_trans_handle *trans, BUG_ON(node-processed); root = select_one_root(trans, node); if (root == ERR_PTR(-ENOENT)) { + printk(KERN_INFO btrfs: could not find a root to update\n); update_processed_blocks(rc, node); goto out; } @@ -2756,6 +2759,8 @@ static int relocate_tree_block(struct btrfs_trans_handle *trans, btrfs_release_path(path); if (ret 0) ret = 0; + if (ret 0) +printk(KERN_INFO btrfs: failed to search slot %llu: %i\n, key-objectid, -ret); } if (!ret) update_processed_blocks(rc, node); @@ -2813,12 +2818,14 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans, block-level, block-bytenr); if (IS_ERR(node)) { err = PTR_ERR(node); + printk(KERN_INFO btrfs: failed to build backref tree for key %llu byte %llu: %i\n, block-key.objectid, block-bytenr, -err); goto out; } ret = relocate_tree_block(trans, rc, node, block-key, path); if (ret 0) { + printk(KERN_INFO btrfs: failed to relocate tree block: %i\n, -ret); if (ret != -EAGAIN || rb_node == rb_first(blocks)) err = ret; goto out; @@ -3770,6 +3777,7 @@ restart: ret = relocate_tree_blocks(trans, rc, blocks); if (ret 0) { if (ret != -EAGAIN) { + printk(KERN_INFO btrfs: failed to relocate blocks 
for key %llu: %i\n, key.objectid, -ret); err = ret; break; } -- 1.7.4.4 -- Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/ You must be the change you wish to see in the world. -- Gandhi Be Free! -- http://FSFLA.org/ FSF Latin America board member Free Software Evangelist Red Hat Brazil Compiler Engineer
Re: Introduce option to rebalance only metadata
On Nov 15, 2011, Ilya Dryomov idryo...@gmail.com wrote:
> And the exact command to mimic your patch is btrfs fi restripe start -m <mount point>

Thanks. I wasn't aware of the restripe patch when I wrote this Quick Hack (TM).
revised -o nocluster, and -o cluster to reverse on remount
Here's a revised version of the -o nocluster patch, updated for cmason's for-linus branch, and a separate -o cluster option that enables one to enable and disable this option on remount. One thing I'm not sure is whether -o remount,nocluster will release a cluster that may have been allocated before the remount. Please keep that in mind before merging the patch. From a3323c03f1b3d2cfeb4905268d117426232d4a3b Mon Sep 17 00:00:00 2001 From: Alexandre Oliva lxol...@fsfla.org Date: Sat, 29 Oct 2011 02:20:55 -0200 Subject: [PATCH 4/8] Disable clustered allocation with -o nocluster Introduce -o nocluster to disable the use of clusters for extent allocation. Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br --- fs/btrfs/ctree.h |1 + fs/btrfs/extent-tree.c |2 +- fs/btrfs/super.c | 11 +-- 3 files changed, 11 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index b9ba59f..324df91 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1410,6 +1410,7 @@ struct btrfs_ioctl_defrag_range_args { #define BTRFS_MOUNT_AUTO_DEFRAG (1 16) #define BTRFS_MOUNT_INODE_MAP_CACHE (1 17) #define BTRFS_MOUNT_RECOVERY (1 18) +#define BTRFS_MOUNT_NO_ALLOC_CLUSTER (1 19) #define btrfs_clear_opt(o, opt) ((o) = ~BTRFS_MOUNT_##opt) #define btrfs_set_opt(o, opt) ((o) |= BTRFS_MOUNT_##opt) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 18ea90c..767edac 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5051,7 +5051,7 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans, bool found_uncached_bg = false; bool failed_cluster_refill = false; bool failed_alloc = false; - bool use_cluster = true; + bool use_cluster = !btrfs_test_opt(root, NO_ALLOC_CLUSTER); bool have_caching_bg = false; u64 ideal_cache_percent = 0; u64 ideal_cache_offset = 0; diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index dcd5aef..988e697 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -164,7 +164,8 @@ enum { Opt_notreelog, 
Opt_ratio, Opt_flushoncommit, Opt_discard, Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed, Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, - Opt_inode_cache, Opt_no_space_cache, Opt_recovery, Opt_err, + Opt_inode_cache, Opt_no_space_cache, Opt_recovery, + Opt_nocluster, Opt_err, }; static match_table_t tokens = { @@ -199,6 +200,7 @@ static match_table_t tokens = { {Opt_inode_cache, inode_cache}, {Opt_no_space_cache, no_space_cache}, {Opt_recovery, recovery}, + {Opt_nocluster, nocluster}, {Opt_err, NULL}, }; @@ -390,12 +392,15 @@ int btrfs_parse_options(struct btrfs_root *root, char *options) btrfs_set_opt(info-mount_opt, ENOSPC_DEBUG); break; case Opt_defrag: - printk(KERN_INFO btrfs: enabling auto defrag); + printk(KERN_INFO btrfs: enabling auto defrag\n); btrfs_set_opt(info-mount_opt, AUTO_DEFRAG); break; case Opt_recovery: printk(KERN_INFO btrfs: enabling auto recovery); btrfs_set_opt(info-mount_opt, RECOVERY); + case Opt_nocluster: + printk(KERN_INFO btrfs: disabling alloc clustering\n); + btrfs_set_opt(info-mount_opt, NO_ALLOC_CLUSTER); break; case Opt_err: printk(KERN_INFO btrfs: unrecognized mount option @@ -721,6 +726,8 @@ static int btrfs_show_options(struct seq_file *seq, struct vfsmount *vfs) seq_puts(seq, ,autodefrag); if (btrfs_test_opt(root, INODE_MAP_CACHE)) seq_puts(seq, ,inode_cache); + if (btrfs_test_opt(root, NO_ALLOC_CLUSTER)) + seq_puts(seq, ,nocluster); return 0; } -- 1.7.4.4 From 572ec833d94278e7eda7c274962165c70d9154e5 Mon Sep 17 00:00:00 2001 From: Alexandre Oliva lxol...@fsfla.org Date: Sun, 6 Nov 2011 23:51:08 -0200 Subject: [PATCH 5/8] Add -o cluster, so that nocluster can be disabled with remount. 
Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br --- fs/btrfs/super.c |7 ++- 1 files changed, 6 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 988e697..2baba99 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -165,7 +165,7 @@ enum { Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed, Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache, Opt_no_space_cache, Opt_recovery, - Opt_nocluster, Opt_err, + Opt_nocluster, Opt_cluster, Opt_err, }; static match_table_t tokens = { @@ -201,6 +201,7 @@ static match_table_t tokens = { {Opt_no_space_cache, no_space_cache}, {Opt_recovery, recovery}, {Opt_nocluster, nocluster}, + {Opt_cluster, cluster}, {Opt_err, NULL}, }; @@ -402,6 +403,10 @@ int btrfs_parse_options(struct btrfs_root *root, char *options) printk(KERN_INFO btrfs: disabling alloc clustering\n); btrfs_set_opt(info-mount_opt, NO_ALLOC_CLUSTER); break; + case Opt_cluster: + printk(KERN_INFO btrfs: enabling alloc clustering\n); + btrfs_clear_opt(info-mount_opt, NO_ALLOC_CLUSTER); + break; case Opt_err: printk(KERN_INFO btrfs: unrecognized mount
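The option is just one bit in mount_opt, which is why a later `cluster` can undo an earlier `nocluster` on remount: "nocluster" sets the bit, "cluster" clears it, and the last option parsed wins. A minimal sketch of that pairing, mirroring the btrfs_set_opt/btrfs_clear_opt macros (names and the bit value are illustrative, and the real parser of course handles many more options):

```c
#include <string.h>

#define MOUNT_NO_ALLOC_CLUSTER (1UL << 19)   /* illustrative bit */

/* Sketch of the nocluster/cluster handling in btrfs_parse_options():
 * "nocluster" sets the flag, "cluster" clears it. */
int parse_cluster_option_sketch(unsigned long *mount_opt, const char *opt)
{
    if (strcmp(opt, "nocluster") == 0) {
        *mount_opt |= MOUNT_NO_ALLOC_CLUSTER;    /* btrfs_set_opt */
        return 0;
    }
    if (strcmp(opt, "cluster") == 0) {
        *mount_opt &= ~MOUNT_NO_ALLOC_CLUSTER;   /* btrfs_clear_opt */
        return 0;
    }
    return -1;   /* not one of the two options sketched here */
}
```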
Don't prevent removal of devices that break raid reqs
Instead of preventing the removal of devices that would render existing raid10 or raid1 impossible, warn but go ahead with it; the rebalancing code is smart enough to use different block group types. Should the refusal remain, so that we'd only proceed with a newly-introduced --force option or so? Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br --- fs/btrfs/volumes.c | 12 1 files changed, 4 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 4d5b29f..507afca 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1281,18 +1281,14 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path) if ((all_avail BTRFS_BLOCK_GROUP_RAID10) root-fs_info-fs_devices-num_devices = 4) { - printk(KERN_ERR btrfs: unable to go below four devices - on raid10\n); - ret = -EINVAL; - goto out; + printk(KERN_ERR btrfs: going below four devices + will turn raid10 into raid1\n); } if ((all_avail BTRFS_BLOCK_GROUP_RAID1) root-fs_info-fs_devices-num_devices = 2) { - printk(KERN_ERR btrfs: unable to go below two - devices on raid1\n); - ret = -EINVAL; - goto out; + printk(KERN_ERR btrfs: going below two devices + will lose raid1 redundancy\n); } if (strcmp(device_path, missing) == 0) { -- 1.7.4.4 -- Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/ You must be the change you wish to see in the world. -- Gandhi Be Free! -- http://FSFLA.org/ FSF Latin America board member Free Software Evangelist Red Hat Brazil Compiler Engineer -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Introduce option to rebalance only metadata
Experimental patch to be able to compact only the metadata after clustered allocation allocated lots of unnecessary metadata block groups. It's also useful to measure performance differences between -o cluster and -o nocluster. I guess it should be implemented as a balance option rather than a separate ioctl, but this was good enough for me to try it. Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br --- fs/btrfs/ioctl.c |2 ++ fs/btrfs/ioctl.h |3 +++ fs/btrfs/volumes.c | 33 - fs/btrfs/volumes.h |1 + 4 files changed, 34 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 4a34c47..69bf6f2 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -3074,6 +3074,8 @@ long btrfs_ioctl(struct file *file, unsigned int return btrfs_ioctl_dev_info(root, argp); case BTRFS_IOC_BALANCE: return btrfs_balance(root-fs_info-dev_root); + case BTRFS_IOC_BALANCE_METADATA: + return btrfs_balance_metadata(root-fs_info-dev_root); case BTRFS_IOC_CLONE: return btrfs_ioctl_clone(file, arg, 0, 0, 0); case BTRFS_IOC_CLONE_RANGE: diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h index 252ae99..46bc428 100644 --- a/fs/btrfs/ioctl.h +++ b/fs/btrfs/ioctl.h @@ -277,4 +277,7 @@ struct btrfs_ioctl_logical_ino_args { #define BTRFS_IOC_LOGICAL_INO _IOWR(BTRFS_IOCTL_MAGIC, 36, \ struct btrfs_ioctl_ino_path_args) +#define BTRFS_IOC_BALANCE_METADATA _IOW(BTRFS_IOCTL_MAGIC, 37, \ + struct btrfs_ioctl_vol_args) + #endif diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index f8e29431..4d5b29f 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -2077,7 +2077,7 @@ static u64 div_factor(u64 num, int factor) return num; } -int btrfs_balance(struct btrfs_root *dev_root) +static int btrfs_balance_skip(struct btrfs_root *dev_root, u64 skip_type) { int ret; struct list_head *devices = dev_root-fs_info-fs_devices-devices; @@ -2089,6 +2089,9 @@ int btrfs_balance(struct btrfs_root *dev_root) struct btrfs_root *chunk_root = dev_root-fs_info-chunk_root; struct 
btrfs_trans_handle *trans; struct btrfs_key found_key; + struct btrfs_chunk *chunk; + u64 chunk_type; + bool skip; if (dev_root-fs_info-sb-s_flags MS_RDONLY) return -EROFS; @@ -2158,11 +2161,21 @@ int btrfs_balance(struct btrfs_root *dev_root) if (found_key.offset == 0) break; + if (skip_type) { + chunk = btrfs_item_ptr(path-nodes[0], path-slots[0], + struct btrfs_chunk); + chunk_type = btrfs_chunk_type(path-nodes[0], chunk); + skip = (chunk_type skip_type); + } else + skip = false; + btrfs_release_path(path); - ret = btrfs_relocate_chunk(chunk_root, - chunk_root-root_key.objectid, - found_key.objectid, - found_key.offset); + + ret = (skip ? 0 : + btrfs_relocate_chunk(chunk_root, + chunk_root-root_key.objectid, + found_key.objectid, + found_key.offset)); if (ret ret != -ENOSPC) goto error; key.offset = found_key.offset - 1; @@ -2174,6 +2187,16 @@ error: return ret; } +int btrfs_balance(struct btrfs_root *dev_root) +{ + return btrfs_balance_skip(dev_root, 0); +} + +int btrfs_balance_metadata(struct btrfs_root *dev_root) +{ + return btrfs_balance_skip(dev_root, BTRFS_BLOCK_GROUP_DATA); +} + /* * shrinking a device means finding all of the device extents past * the new size, and then following the back refs to the chunks. 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index ab5b1c4..c467499 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -223,6 +223,7 @@ struct btrfs_device *btrfs_find_device(struct btrfs_root *root, u64 devid, int btrfs_shrink_device(struct btrfs_device *device, u64 new_size); int btrfs_init_new_device(struct btrfs_root *root, char *path); int btrfs_balance(struct btrfs_root *dev_root); +int btrfs_balance_metadata(struct btrfs_root *dev_root); int btrfs_chunk_readonly(struct btrfs_root *root, u64 chunk_offset); int find_free_dev_extent(struct btrfs_trans_handle *trans, struct btrfs_device *device, u64 num_bytes, -- 1.7.4.4 -- Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/ You must be the change you wish to see in the world. -- Gandhi Be Free! -- http
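The heart of the hack is the skip test: a chunk is relocated only when its type has none of the skip_type bits set, so passing BTRFS_BLOCK_GROUP_DATA leaves data chunks in place and rewrites only metadata/system chunks. A hedged sketch of that selection (flag values and names illustrative, not the kernel's):

```c
#include <stdint.h>
#include <stddef.h>

#define GROUP_DATA      0x1   /* illustrative chunk-type bits */
#define GROUP_METADATA  0x4

/* Count how many chunks a skip-aware balance would relocate:
 * chunks whose type intersects skip_type are left alone. */
size_t chunks_to_relocate(const uint64_t *chunk_types, size_t n,
                          uint64_t skip_type)
{
    size_t moved = 0;
    for (size_t i = 0; i < n; i++)
        if (!(chunk_types[i] & skip_type))
            moved++;
    return moved;
}
```

With skip_type zero this degenerates to the plain btrfs_balance() behavior of relocating everything, which is exactly how the patch implements the old entry point.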
Re: Introduce option to rebalance only metadata
On Nov 10, 2011, Alexandre Oliva ol...@lsd.ic.unicamp.br wrote: Experimental patch to be able to compact only the metadata after clustered allocation allocated lots of unnecessary metadata block groups. It's also useful to measure performance differences between -o cluster and -o nocluster. I guess it should be implemented as a balance option rather than a separate ioctl, but this was good enough for me to try it. And here's a corresponding patch for the btrfs program, on a (probably very old) btrfs-progs tree. From 8765d64f95966eec28cad83bd870fc2270afaebd Mon Sep 17 00:00:00 2001 From: Alexandre Oliva lxol...@fsfla.org Date: Thu, 10 Nov 2011 17:35:29 -0200 Subject: [PATCH] Introduce balance-md to balance metadata only. Patch for btrfs to use a separate experimental IOCTL to rebalance only metadata block groups. Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br --- btrfs.c |4 btrfs_cmds.c | 25 + btrfs_cmds.h |1 + ioctl.h |3 +++ 4 files changed, 33 insertions(+), 0 deletions(-) diff --git a/btrfs.c b/btrfs.c index 46314cf..9edaebe 100644 --- a/btrfs.c +++ b/btrfs.c @@ -95,6 +95,10 @@ static struct Command commands[] = { filesystem balance, path\n Balance the chunks across the device. }, + { do_balance_md, 1, + filesystem balance-md, path\n + Balance the chunks across the device. 
+ }, { do_scan, 999, device scan, [device [device..]\n Scan all device for or the passed device for a btrfs\n diff --git a/btrfs_cmds.c b/btrfs_cmds.c index 8031c58..b8f4c05 100644 --- a/btrfs_cmds.c +++ b/btrfs_cmds.c @@ -776,6 +776,31 @@ int do_balance(int argc, char **argv) } return 0; } + +int do_balance_md(int argc, char **argv) +{ + + int fdmnt, ret=0; + struct btrfs_ioctl_vol_args args; + char *path = argv[1]; + + fdmnt = open_file_or_dir(path); + if (fdmnt 0) { + fprintf(stderr, ERROR: can't access to '%s'\n, path); + return 12; + } + + memset(args, 0, sizeof(args)); + ret = ioctl(fdmnt, BTRFS_IOC_BALANCE_METADATA, args); + close(fdmnt); + if(ret0){ + fprintf(stderr, ERROR: balancing '%s'\n, path); + + return 19; + } + return 0; +} + int do_remove_volume(int nargs, char **args) { diff --git a/btrfs_cmds.h b/btrfs_cmds.h index 7bde191..96cab6d 100644 --- a/btrfs_cmds.h +++ b/btrfs_cmds.h @@ -23,6 +23,7 @@ int do_defrag(int argc, char **argv); int do_show_filesystem(int nargs, char **argv); int do_add_volume(int nargs, char **args); int do_balance(int nargs, char **argv); +int do_balance_md(int nargs, char **argv); int do_remove_volume(int nargs, char **args); int do_scan(int nargs, char **argv); int do_resize(int nargs, char **argv); diff --git a/ioctl.h b/ioctl.h index 776d7a9..5210c0b 100644 --- a/ioctl.h +++ b/ioctl.h @@ -169,4 +169,7 @@ struct btrfs_ioctl_space_args { #define BTRFS_IOC_DEFAULT_SUBVOL _IOW(BTRFS_IOCTL_MAGIC, 19, u64) #define BTRFS_IOC_SPACE_INFO _IOWR(BTRFS_IOCTL_MAGIC, 20, \ struct btrfs_ioctl_space_args) + +#define BTRFS_IOC_BALANCE_METADATA _IOW(BTRFS_IOCTL_MAGIC, 37, \ + struct btrfs_ioctl_vol_args) #endif -- 1.7.4.4 -- Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/ You must be the change you wish to see in the world. -- Gandhi Be Free! -- http://FSFLA.org/ FSF Latin America board member Free Software Evangelist Red Hat Brazil Compiler Engineer
Revamp cluster allocation logic
These are patches I posted before, except these are based on cmason's for-linus. Reposting at josef's request. From c8036334e5a033a6ca0963e8fb716d03b1945158 Mon Sep 17 00:00:00 2001 From: Alexandre Oliva lxol...@fsfla.org Date: Fri, 14 Oct 2011 12:10:36 -0300 Subject: [PATCH 1/8] Revamp btrfs cluster creation logic. Parameterized clusters on minimum total size and minimum chunk size, without an upper bound. Don't tolerate fragmentation for SSD_SPREAD; accept some fragmentation for metadata but try to keep data dense. Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br --- fs/btrfs/free-space-cache.c | 64 +++--- 1 files changed, 35 insertions(+), 29 deletions(-) diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index 7a15fcf..7572396 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -2273,8 +2273,8 @@ static int btrfs_bitmap_cluster(struct btrfs_block_group_cache *block_group, struct btrfs_free_space_ctl *ctl = block_group-free_space_ctl; unsigned long next_zero; unsigned long i; - unsigned long search_bits; - unsigned long total_bits; + unsigned long want_bits; + unsigned long min_bits; unsigned long found_bits; unsigned long start = 0; unsigned long total_found = 0; @@ -2283,8 +2283,8 @@ static int btrfs_bitmap_cluster(struct btrfs_block_group_cache *block_group, i = offset_to_bit(entry-offset, block_group-sectorsize, max_t(u64, offset, entry-offset)); - search_bits = bytes_to_bits(bytes, block_group-sectorsize); - total_bits = bytes_to_bits(min_bytes, block_group-sectorsize); + want_bits = bytes_to_bits(bytes, block_group-sectorsize); + min_bits = bytes_to_bits(min_bytes, block_group-sectorsize); again: found_bits = 0; @@ -2293,7 +2293,7 @@ again: i = find_next_bit(entry-bitmap, BITS_PER_BITMAP, i + 1)) { next_zero = find_next_zero_bit(entry-bitmap, BITS_PER_BITMAP, i); - if (next_zero - i = search_bits) { + if (next_zero - i = min_bits) { found_bits = next_zero - i; break; } @@ -2313,9 +2313,9 @@ again: if 
(cluster-max_size found_bits * block_group-sectorsize) cluster-max_size = found_bits * block_group-sectorsize; - if (total_found total_bits) { + if (total_found want_bits) { i = find_next_bit(entry-bitmap, BITS_PER_BITMAP, next_zero); - if (i - start total_bits * 2) { + if (i - start want_bits * 2) { total_found = 0; cluster-max_size = 0; found = false; @@ -2361,8 +2361,8 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group, * We don't want bitmaps, so just move along until we find a normal * extent entry. */ - while (entry-bitmap) { - if (list_empty(entry-list)) + while (entry-bitmap || entry-bytes min_bytes) { + if (entry-bitmap list_empty(entry-list)) list_add_tail(entry-list, bitmaps); node = rb_next(entry-offset_index); if (!node) @@ -2377,10 +2377,8 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group, last = entry; prev = entry; - while (window_free = min_bytes) { - node = rb_next(entry-offset_index); - if (!node) - return -ENOSPC; + for (node = rb_next(entry-offset_index); node; + node = rb_next(entry-offset_index)) { entry = rb_entry(node, struct btrfs_free_space, offset_index); if (entry-bitmap) { @@ -2389,12 +2387,19 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group, continue; } + if (entry-bytes min_bytes) + continue; + /* * we haven't filled the empty size and the window is * very large. reset and try again */ if (entry-offset - (prev-offset + prev-bytes) max_gap || - entry-offset - window_start (min_bytes * 2)) { + entry-offset - window_start (window_free * 2)) { + /* We got a cluster of the requested size, + we're done. 
*/ + if (window_free = bytes) +break; first = entry; window_start = entry-offset; window_free = entry-bytes; @@ -2409,6 +2414,9 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group, prev = entry; } + if (window_free bytes) + return -ENOSPC; + cluster-window_start = first-offset; node = first-offset_index; @@ -2422,7 +2430,7 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group, entry = rb_entry(node, struct btrfs_free_space, offset_index); node = rb_next(entry-offset_index); - if (entry-bitmap) + if (entry-bitmap || entry-bytes min_bytes) continue; rb_erase(entry-offset_index, ctl-free_space_offset); @@ -2504,7 +2512,7 @@ search: /* * here we try to find a cluster of blocks in a block group. The goal - * is to find at least bytes free and up to empty_size + bytes free. + * is to find at least bytes+empty_size. * We might not find them all in one contiguous area. * * returns zero and sets up cluster if things worked out, otherwise @@ -2522,19 +2530,16 @@ int btrfs_find_space_cluster(struct btrfs_trans_handle *trans, u64 min_bytes
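The extent-based side of the rework is a sliding window over free extents sorted by offset. The sketch below is illustrative only, under stated simplifications: it uses a flat array instead of the kernel's rbtree, ignores bitmaps and max_extent tracking, and the restart heuristic is reduced to "the window's span exceeds twice the free space gathered so far".

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t offset;
    uint64_t bytes;
} free_extent;

/* Hypothetical sketch of the revamped setup_cluster_no_bitmap():
 * skip entries smaller than min_bytes, restart the window when it
 * grows too sparse, and succeed once it holds at least `bytes`. */
int setup_cluster_sketch(const free_extent *e, size_t n,
                         uint64_t bytes, uint64_t min_bytes,
                         uint64_t *window_start)
{
    uint64_t window_free = 0;
    size_t first = 0;

    for (size_t i = 0; i < n; i++) {
        if (e[i].bytes < min_bytes)
            continue;                       /* too small to track */
        if (window_free == 0 ||
            e[i].offset - e[first].offset > window_free * 2) {
            first = i;                      /* start or restart window */
            window_free = e[i].bytes;
        } else {
            window_free += e[i].bytes;
        }
        if (window_free >= bytes)
            break;                          /* window is big enough */
    }
    if (window_free < bytes)
        return -1;                          /* -ENOSPC in the kernel */
    *window_start = e[first].offset;
    return 0;
}
```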
Record end of metadata allocation
So I'm trying to figure out what it is that makes clustered allocation so much faster than unclustered allocation. E.g., for a nearly quiescent filesystem with as little as 90MB of metadata, balance-md (from another patch I posted today) takes some 4.5 seconds (worst case 6s, best case 4s) with clustered allocation, while with -o nocluster it takes some 6.5s (best case 6s, worst case 7s). With -o mincluster, introduced by the patch below (by no means intended for merging, it's far too hackish), it's some 0.1s faster than with -o nocluster, but nothing really significant, and I didn't even take care of locking last_ptr. So I conclude it's not remembering the search starting point that makes -o cluster faster. Anyhow, since this is slightly faster than unclustered allocation, I suppose we could introduce something along these lines for the -o nocluster case, no?

From c16a9e53e41e7616e4498534eea25ca1f396d7b4 Mon Sep 17 00:00:00 2001
From: Alexandre Oliva lxol...@fsfla.org
Date: Thu, 10 Nov 2011 20:55:40 -0200
Subject: [PATCH 9/9] Add -o mincluster option.

If this option is enabled, save the location of the last successful allocation, so as to emulate some of the cluster allocation logic (though not the non-bitmap preference) without actually going through the exercise of allocating clusters.
Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/extent-tree.c      | 16 +---
 fs/btrfs/free-space-cache.c |  1 +
 fs/btrfs/super.c            | 17 +
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4da27be..caa73b2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5053,7 +5053,7 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 {
 	int ret = 0;
 	struct btrfs_root *root = orig_root->fs_info->extent_root;
-	struct btrfs_free_cluster *last_ptr = NULL;
+	struct btrfs_free_cluster *last_ptr = NULL, *save_ptr = NULL;
 	struct btrfs_block_group_cache *block_group = NULL;
 	int empty_cluster = 2 * 1024 * 1024;
 	int allowed_chunk_alloc = 0;
@@ -5095,8 +5095,16 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 	if (data & BTRFS_BLOCK_GROUP_METADATA && use_cluster) {
 		last_ptr = &root->fs_info->meta_alloc_cluster;
-		if (!btrfs_test_opt(root, SSD))
-			empty_cluster = 64 * 1024;
+		if (!btrfs_test_opt(root, SSD)) {
+			/* !SSD && SSD_SPREAD == -o mincluster. */
+			if (btrfs_test_opt(root, SSD_SPREAD)) {
+				save_ptr = last_ptr;
+				hint_byte = save_ptr->window_start;
+				last_ptr = NULL;
+				use_cluster = false;
+			} else
+				empty_cluster = 64 * 1024;
+		}
 	}
 
 	if ((data & BTRFS_BLOCK_GROUP_DATA) && use_cluster &&
@@ -5402,6 +5410,8 @@ checks:
 			btrfs_add_free_space(block_group, offset,
 					     search_start - offset);
 		BUG_ON(offset > search_start);
+		if (save_ptr)
+			save_ptr->window_start = search_start + num_bytes;
 		btrfs_put_block_group(block_group);
 		break;
 loop:
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index afd1129..2706369 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2576,6 +2576,7 @@ void btrfs_init_free_cluster(struct btrfs_free_cluster *cluster)
 	cluster->max_size = 0;
 	INIT_LIST_HEAD(&cluster->block_group_list);
 	cluster->block_group = NULL;
+	cluster->window_start = 0;
 }
 
 int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 2baba99..dd76fa4 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -165,7 +165,7 @@ enum {
 	Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
 	Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
 	Opt_no_space_cache, Opt_recovery,
-	Opt_nocluster, Opt_cluster, Opt_err,
+	Opt_nocluster, Opt_cluster, Opt_mincluster, Opt_err,
 };
 
 static match_table_t tokens = {
@@ -202,6 +202,7 @@ static match_table_t tokens = {
 	{Opt_recovery, "recovery"},
 	{Opt_nocluster, "nocluster"},
 	{Opt_cluster, "cluster"},
+	{Opt_mincluster, "mincluster"},
 	{Opt_err, NULL},
 };
 
@@ -407,6 +408,11 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 			printk(KERN_INFO "btrfs: enabling alloc clustering\n");
 			btrfs_clear_opt(info->mount_opt, NO_ALLOC_CLUSTER);
 			break;
+		case Opt_mincluster:
+			printk(KERN_INFO "btrfs: enabling minimal alloc clustering\n");
+			btrfs_clear_opt(info->mount_opt, NO_ALLOC_CLUSTER);
+			btrfs_set_opt(info->mount_opt, SSD_SPREAD);
+			break;
 		case Opt_err:
 			printk(KERN_INFO "btrfs: unrecognized mount option '%s'\n", p);
@@ -705,9 +711,12 @@ static int btrfs_show_options(struct seq_file *seq, struct vfsmount *vfs)
 	}
 	if (btrfs_test_opt(root, NOSSD))
 		seq_puts(seq, ",nossd");
-	if (btrfs_test_opt(root, SSD_SPREAD))
-		seq_puts(seq, ",ssd_spread");
-	else if (btrfs_test_opt(root, SSD))
+	if (btrfs_test_opt(root, SSD_SPREAD)) {
+		if (btrfs_test_opt(root, SSD))
+			seq_puts(seq, ",ssd_spread");
+		else
Re: corrupted btrfs after suspend2ram uncorrectable with scrub
Hi, Gustavo,

On Nov 1, 2011, Gustavo Sverzut Barbieri <barbi...@gmail.com> wrote:

> btrfs csum failed ino 2957021 extent 85041815552 csum 667310679 wanted 0 mirror 0

> Is there any way to recover it? :-S

Did you try mounting without data checksums?
Re: Patches for BTRFS (mail-server slow down in 3.0 and more)
On Oct 31, 2011, David Sterba <d...@jikos.cz> wrote:

> On Mon, Oct 31, 2011 at 02:19:18AM -0200, Alexandre Oliva wrote:
>> On Oct 29, 2011, Chris Mason <chris.ma...@oracle.com> wrote:
>>> The last one isn't a bad idea, but please do make a real mount option
>>> for it ;)

>> Like this?

>> @@ -195,6 +195,7 @@ static match_table_t tokens = {
>>  	{Opt_subvolrootid, "subvolrootid=%d"},
>>  	{Opt_defrag, "autodefrag"},
>>  	{Opt_inode_cache, "inode_cache"},
>> +	{Opt_nocluster, "nocluster"},
>>  	{Opt_err, NULL},

> How about 'no_alloc_cluster' ?

I considered that, too, but choosing the option name was the most
difficult part of the patch :-) I ended up going for the shorter name,
just to get the conversation started ;-) I don't feel strongly about it.
Re: Patches for BTRFS (mail-server slow down in 3.0 and more)
On Oct 29, 2011, Chris Mason <chris.ma...@oracle.com> wrote:

> The last one isn't a bad idea, but please do make a real mount option
> for it ;)

Like this?

From af086e7b88637be5c9806181a1d70db9c645cb50 Mon Sep 17 00:00:00 2001
From: Alexandre Oliva <lxol...@fsfla.org>
Date: Sat, 29 Oct 2011 02:20:55 -0200
Subject: [PATCH 4/4] Disable clustered allocation with -o nocluster

Introduce -o nocluster to disable the use of clusters for extent
allocation.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/extent-tree.c |  2 +-
 fs/btrfs/super.c       | 11 +--
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 03912c5..b1138fb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1363,6 +1363,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_ENOSPC_DEBUG	(1 << 15)
 #define BTRFS_MOUNT_AUTO_DEFRAG		(1 << 16)
 #define BTRFS_MOUNT_INODE_MAP_CACHE	(1 << 17)
+#define BTRFS_MOUNT_NO_ALLOC_CLUSTER	(1 << 18)
 
 #define btrfs_clear_opt(o, opt)		((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)		((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f5be06a..5d7c9a7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4886,7 +4886,7 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 	bool found_uncached_bg = false;
 	bool failed_cluster_refill = false;
 	bool failed_alloc = false;
-	bool use_cluster = true;
+	bool use_cluster = !btrfs_test_opt(root, NO_ALLOC_CLUSTER);
 	u64 ideal_cache_percent = 0;
 	u64 ideal_cache_offset = 0;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 15634d4..57c7bb1 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -162,7 +162,7 @@ enum {
 	Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard,
 	Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
 	Opt_enospc_debug, Opt_subvolrootid, Opt_defrag,
-	Opt_inode_cache, Opt_err,
+	Opt_inode_cache, Opt_nocluster, Opt_err,
 };
 
 static match_table_t tokens = {
@@ -195,6 +195,7 @@ static match_table_t tokens = {
 	{Opt_subvolrootid, "subvolrootid=%d"},
 	{Opt_defrag, "autodefrag"},
 	{Opt_inode_cache, "inode_cache"},
+	{Opt_nocluster, "nocluster"},
 	{Opt_err, NULL},
 };
 
@@ -378,9 +379,13 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 			btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
 			break;
 		case Opt_defrag:
-			printk(KERN_INFO "btrfs: enabling auto defrag");
+			printk(KERN_INFO "btrfs: enabling auto defrag\n");
 			btrfs_set_opt(info->mount_opt, AUTO_DEFRAG);
 			break;
+		case Opt_nocluster:
+			printk(KERN_INFO "btrfs: disabling alloc clustering\n");
+			btrfs_set_opt(info->mount_opt, NO_ALLOC_CLUSTER);
+			break;
 		case Opt_err:
 			printk(KERN_INFO "btrfs: unrecognized mount option '%s'\n", p);
@@ -729,6 +734,8 @@ static int btrfs_show_options(struct seq_file *seq, struct vfsmount *vfs)
 		seq_puts(seq, ",autodefrag");
 	if (btrfs_test_opt(root, INODE_MAP_CACHE))
 		seq_puts(seq, ",inode_cache");
+	if (btrfs_test_opt(root, NO_ALLOC_CLUSTER))
+		seq_puts(seq, ",nocluster");
 	return 0;
 }
-- 
1.7.4.4
Re: Patches for BTRFS (mail-server slow down in 3.0 and more)
On Oct 28, 2011, Marcel Lohmann <mar...@malowa.de> wrote:

> I would really appreciate if you could send me the patches.

Here are the patches I mentioned on IRC. I've sent two of them to Josef
for him to push upstream, but I'm not sure he posted them here, for I'm
not on the list (yet?). The other two are newer, and the last one is
definitely not for inclusion (just for testing, or as a temporary
work-around).

I've been using the first 3 with some success on a couple of mail
servers: I haven't hit the ridiculous slow-downs from frequent
unsuccessful calls of setup_cluster_no_bitmap after a while, like I did
with 3.0 (and 3.1) any more.

However, the excess use of metadata that I've experienced on Ceph OSDs
isn't fixed by them. A btrfs balance with the first 3 still has 22GB of
metadata block groups even though only 4.1GB of metadata are in use, or
19GB of metadata with only 2GB of metadata in use. With the 4th patch
and -o clear_cache, the first rebalancing of the 22GB-metadata
filesystem got it down to 8GB; the second fs is still rebalancing ~800GB
(wishlist mental note: introduce some means to rebalance only the
metadata).

Here are the patches, against 3.1-libre (should apply cleanly on 3.1).

---BeginMessage---

Parameterize clusters on minimum total size and minimum chunk size,
without an upper bound. Don't tolerate fragmentation for SSD_SPREAD;
accept some fragmentation for metadata, but try to keep data dense.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/free-space-cache.c | 64 +++---
 1 files changed, 35 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 41ac927..4973816 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2092,8 +2092,8 @@ static int btrfs_bitmap_cluster(struct btrfs_block_group_cache *block_group,
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	unsigned long next_zero;
 	unsigned long i;
-	unsigned long search_bits;
-	unsigned long total_bits;
+	unsigned long want_bits;
+	unsigned long min_bits;
 	unsigned long found_bits;
 	unsigned long start = 0;
 	unsigned long total_found = 0;
@@ -2102,8 +2102,8 @@ static int btrfs_bitmap_cluster(struct btrfs_block_group_cache *block_group,
 	i = offset_to_bit(entry->offset, block_group->sectorsize,
			  max_t(u64, offset, entry->offset));
-	search_bits = bytes_to_bits(bytes, block_group->sectorsize);
-	total_bits = bytes_to_bits(min_bytes, block_group->sectorsize);
+	want_bits = bytes_to_bits(bytes, block_group->sectorsize);
+	min_bits = bytes_to_bits(min_bytes, block_group->sectorsize);
 
 again:
 	found_bits = 0;
@@ -2112,7 +2112,7 @@ again:
 	     i = find_next_bit(entry->bitmap, BITS_PER_BITMAP, i + 1)) {
 		next_zero = find_next_zero_bit(entry->bitmap,
					       BITS_PER_BITMAP, i);
-		if (next_zero - i >= search_bits) {
+		if (next_zero - i >= min_bits) {
 			found_bits = next_zero - i;
 			break;
 		}
@@ -2132,9 +2132,9 @@ again:
 	if (cluster->max_size < found_bits * block_group->sectorsize)
 		cluster->max_size = found_bits * block_group->sectorsize;
 
-	if (total_found < total_bits) {
+	if (total_found < want_bits) {
 		i = find_next_bit(entry->bitmap, BITS_PER_BITMAP, next_zero);
-		if (i - start > total_bits * 2) {
+		if (i - start > want_bits * 2) {
 			total_found = 0;
 			cluster->max_size = 0;
 			found = false;
@@ -2180,8 +2180,8 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
 	 * We don't want bitmaps, so just move along until we find a normal
 	 * extent entry.
 	 */
-	while (entry->bitmap) {
-		if (list_empty(&entry->list))
+	while (entry->bitmap || entry->bytes < min_bytes) {
+		if (entry->bitmap && list_empty(&entry->list))
 			list_add_tail(&entry->list, bitmaps);
 		node = rb_next(&entry->offset_index);
 		if (!node)
@@ -2196,10 +2196,8 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
 	last = entry;
 	prev = entry;
 
-	while (window_free <= min_bytes) {
-		node = rb_next(&entry->offset_index);
-		if (!node)
-			return -ENOSPC;
+	for (node = rb_next(&entry->offset_index); node;
+	     node = rb_next(&entry->offset_index)) {
 		entry = rb_entry(node, struct btrfs_free_space, offset_index);
 
 		if (entry->bitmap) {
@@ -2208,12 +2206,19 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
 			continue
Re: “bio too big” regression and silent data corruption in 3.0
Here's some additional information and work-arounds.

On Aug 7, 2011, Alexandre Oliva <ol...@lsd.ic.unicamp.br> wrote:

> A bit of investigation showed that max_hw_sectors for the USB disk was
> 120, much lower than the internal SATA and PATA disks.

FWIW, overriding /sys/class/block/sd*/queue/max_sectors_kb of all disks
used by the filesystem to the lowest max_hw_sectors_kb works around this
problem, at least as long as you don't hit it before you get a chance to
change the setting.

> Raid0 block groups were created to hold data from single block groups
> and, if it couldn't create big-enough raid0 blocks because *any* of
> the other disks was nearly-full, removal would fail.

AFAICT this was my misunderstanding of the situation. Apparently btrfs
can rebalance the disk space in other partitions so as to create raid0
blocks during removal. However, in my case it didn't, because there was
some metadata inconsistency in the partition I was trying to remove that
led to block tree checksum errors being printed when it hit that part of
the partition, aborting the removal. The checksum errors were likely
caused by the "bio too big" problem.

> it appears to be impossible to go back from RAID1 to DUP metadata once
> you temporarily add a second disk, and any metadata block group
> happens to be allocated before you remove it (why couldn't it go back
> to DUP, rather than refusing the removal outright, which prevents even
> single block groups from being moved?)

FWIW, I disabled the test that refuses to shrink a filesystem containing
RAID1 to a single disk and issued such a request while running this
modified kernel, and it completed successfully and perfectly. Can we
change it from hard error to warning?

> 5. This long message reminded me that another machine that has been
> running 3.0 seems to have got *much* slower recently.

I thought it had to do with the 98%-full filesystem (though 40GB
available for new block group allocations would seem to be plenty), and
the constant metadata activity caused by Ceph creating and removing
snapshots all the time. AFAICT it had to do with extended attributes
(heavily used by Ceph), which caused a large number of metadata block
groups to be allocated, even though only a tiny fraction of the space in
them ended up being used. I've observed this in two of the Ceph object
stores.

I've also noticed that rsyncing the OSDs with all extended attributes
(-A -X) caused the source to use up a *lot* of CPU and take far longer
than without. I don't know why that is, but getfattr --dump at the
source and setfattr --restore at the target does pretty much the same,
without incurring such large CPU and time costs, so there's something to
be improved somewhere, in rsync and/or in btrfs.
Re: “bio too big” regression and silent data corruption in 3.0
On Aug 7, 2011, Alexandre Oliva <ol...@lsd.ic.unicamp.br> wrote:

> tl;dr version: 3.0 produces “bio too big” dmesg entries and silently
> corrupts data in “meta-raid1/data-single” configurations on disks with
> different max_hw_sectors, where 2.6.38 worked fine.

FWIW, I just got the same problem with 2.6.38. No idea how I hadn't hit
it before, but it's not a 3.0 regression, just a regular (but IMHO very
serious) bug.
Re: “bio too big” regression and silent data corruption in 3.0
On Aug 7, 2011, Alexandre Oliva <ol...@lsd.ic.unicamp.br> wrote:

> 2. Removing a partition from the filesystem (say, the external disk)
> didn't relocate “single” block groups as such to other disks, as
> expected.

/me reads some code and resets expectations about RAID0 in btrfs ;-)

update_block_group_flags is what does this. It doesn't care what was
chosen when the filesystem was created, it just forces RAID0 if more
than 1 disk remains:

	/* turn single device chunks into raid0 */
	return stripped | BTRFS_BLOCK_GROUP_RAID0;

Is this really intended? Given my current understanding that RAID0
doesn't mean striping over all disks, but only over two disks, I guess I
might even be interested in it, but... I still think the user's choice
should be honored, but I don't see where the choice is stored (if it is
at all).

> I wonder, why can't btrfs mark at least mounted partitions as busy, in
> much the same way that swap, md and various filesystems do, to avoid
> such accidental reuses?

Heh. And *unmark* them when they're removed, too... As in, it won't let
me create a new filesystem in a partition that was just removed from a
filesystem, if that was the partition listed in /etc/mtab.
Re: “bio too big” regression and silent data corruption in 3.0
On Aug 7, 2011, Alexandre Oliva <ol...@lsd.ic.unicamp.br> wrote:

> in very much the same way that it appears to be impossible to go back
> from RAID1 to DUP metadata once you temporarily add a second disk, and
> any metadata block group happens to be allocated before you remove it
> (why couldn't it go back to DUP, rather than refusing the removal
> outright, which prevents even single block groups from being moved?)

Which also appears to be intentional. The code to support this is right
there in update_block_group_flags, but btrfs_rm_device refuses to let it
do its job, denying the removal attempt right away, without any means to
bypass the test.

Could at least an option to bypass the test be introduced, through say a
mount option, some /sys setting, whatever?
help recover from unmountable btrfs
After running one too many times into “parent transid verify failed”
errors that prevent a filesystem from being mounted, I found out how to
adjust some system blocks so that the kernel could get past that check
and mount the filesystem. In one case, I could get all the data I wanted
from the filesystem; in another, many checksums failed and I ended up
throwing it all away, so no guarantees.

mpiechotka's running into the problem and bringing it up on IRC prompted
me to post, for wider consumption, this patch for btrfsck, which will
tell you what to do to make the filesystem mountable again.

Add verbosity to btrfsck so that we can manually recover from a failure
to update the roots.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 disk-io.c | 41 +
 volumes.c |  3 +++
 2 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/disk-io.c b/disk-io.c
index a6e1000..6860c26 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -87,6 +87,22 @@ int csum_tree_block_size(struct extent_buffer *buf, u16 csum_size,
 		printk("checksum verify failed on %llu wanted %X found %X\n",
 		       (unsigned long long)buf->start,
 		       *((int *)result), *((char *)buf->data));
+		if (csum_size == 4) {
+			fprintf(stderr, "dd if=(fd %i) bs=1c skip=%llu count=4 | od -t x1:\n%02x %02x %02x %02x\n",
+				buf->fd,
+				(unsigned long long)buf->dev_bytenr,
+				(__u8)buf->data[0],
+				(__u8)buf->data[1],
+				(__u8)buf->data[2],
+				(__u8)buf->data[3]);
+			fprintf(stderr, "printf \"\\x%02x\\x%02x\\x%02x\\x%02x\" | dd of=(fd %i) bs=1c seek=%llu conv=notrunc count=4\n",
+				(__u8)result[0],
+				(__u8)result[1],
+				(__u8)result[2],
+				(__u8)result[3],
+				buf->fd,
+				(unsigned long long)buf->dev_bytenr);
+		}
 		free(result);
 		return 1;
 	}
@@ -165,6 +181,31 @@ static int verify_parent_transid(struct extent_io_tree *io_tree,
 	       (unsigned long long)eb->start,
 	       (unsigned long long)parent_transid,
 	       (unsigned long long)btrfs_header_generation(eb));
+	fprintf(stderr, "dd if=(fd %i) bs=1c skip=%llu count=8 | od -t x1:\n%02x %02x %02x %02x %02x %02x %02x %02x\n",
+		eb->fd,
+		(unsigned long long)eb->dev_bytenr
+		+ offsetof (struct btrfs_header, generation),
+		(__u8)eb->data[offsetof (struct btrfs_header, generation)],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 1],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 2],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 3],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 4],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 5],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 6],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 7]);
+	btrfs_set_header_generation(eb, parent_transid);
+	fprintf(stderr, "printf \"\\x%02x\\x%02x\\x%02x\\x%02x\\x%02x\\x%02x\\x%02x\\x%02x\" | dd of=(fd %i) bs=1c seek=%llu conv=notrunc count=8\n",
+		(__u8)eb->data[offsetof (struct btrfs_header, generation)],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 1],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 2],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 3],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 4],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 5],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 6],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 7],
+		eb->fd,
+		(unsigned long long)eb->dev_bytenr
+		+ offsetof (struct btrfs_header, generation));
 	ret = 1;
 out:
 	clear_extent_buffer_uptodate(io_tree, eb);
diff --git a/volumes.c b/volumes.c
index 7671855..c30a3ba 100644
--- a/volumes.c
+++ b/volumes.c
@@ -188,6 +188,9 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices, int flags)
 		device->fd = fd;
 		if (flags == O_RDWR)
 			device->writeable = 1;
+		fprintf(stderr, "Device %llu (%s) opened in fd %i\n",
+			(unsigned long long)device->devid,
+			device->name, device->fd