[check] set metadata extent size of tree block extents

2016-03-05 Thread Alexandre Oliva
When scanning extents, we didn't set num_bytes when visiting a tree
block extent.  On the corrupted filesystem I was trying to fix, this
caused an extent to have its size guessed as zero, so we'd compute end
as start-1, which caused us to hit insert_state's BUG_ON(end<start).
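For reference, the failure boils down to this (an illustrative sketch only, with names as used in the btrfs-progs extent code; it is not part of the patch):

u64 start = key.objectid;
u64 end = start + num_bytes - 1;	/* num_bytes == 0  =>  end == start - 1 */
/* insert_state() then trips its BUG_ON(end < start). */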

Signed-off-by: Alexandre Oliva <ol...@gnu.org>
---
 cmds-check.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 0165fba..e563354 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -5208,9 +5208,10 @@ static int process_extent_item(struct btrfs_root *root,
 
ei = btrfs_item_ptr(eb, slot, struct btrfs_extent_item);
refs = btrfs_extent_refs(eb, ei);
-   if (btrfs_extent_flags(eb, ei) & BTRFS_EXTENT_FLAG_TREE_BLOCK)
+   if (btrfs_extent_flags(eb, ei) & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
metadata = 1;
-   else
+   num_bytes = root->leafsize;
+   } else
metadata = 0;
 
add_extent_rec(extent_cache, NULL, 0, key.objectid, num_bytes,

-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


after crash, btrfs attempts to clean up extent it has already cleaned up

2016-02-22 Thread Alexandre Oliva
Are there others getting errors like $SUBJECT, described in more detail
at https://bugzilla.kernel.org/show_bug.cgi?id=112561

If my theory is correct, workloads involving lots of snapshots, such as
Ceph OSDs, might run into it quite often.

Although I could recover from a few such metadata corruptions by hand
when btrfs check --repair couldn't fix them, it's quite cumbersome.  I
wonder if a change like this, made conditional on a mount option, would
be considered appropriate.  I considered making it conditional on -o
recovery, but ended up just making it unconditional for my own temporary
use.
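For illustration, the conditional could look something like this (a sketch only, not the patch below; it assumes the existing -o recovery flag and the 4.x-era btrfs_test_opt(root, ...) form):

/* Hedged sketch: only skip the abort when mounted with -o recovery. */
if (btrfs_test_opt(extent_root, RECOVERY)) {
	btrfs_warn(extent_root->fs_info,
		   "tolerating missing ref byte nr %llu", bytenr);
	ret = 0;
} else {
	btrfs_abort_transaction(trans, extent_root, ret);
}
goto out;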


As for fixing metadata corruption by hand, I've been thinking it might
be useful to have some tool to help navigate and change metadata,
extract files and whatnot, much like debugfs for ext* filesystems.
Would others find it useful?  Is anyone else already working on such a
thing?


diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index cadacf6..849765a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6356,7 +6356,7 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 			"unable to find ref byte nr %llu parent %llu root %llu  owner %llu offset %llu",
 			bytenr, parent, root_objectid, owner_objectid,
 			owner_offset);
-   btrfs_abort_transaction(trans, extent_root, ret);
+   ret = 0; /*btrfs_abort_transaction(trans, extent_root, ret);*/
goto out;
} else {
btrfs_abort_transaction(trans, extent_root, ret);


-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


non-atomic xattr replacement in btrfs => rsync random errors

2014-11-06 Thread Alexandre Oliva
A few days ago, I started using rsync batches to archive old copies of
ceph OSD snapshots for certain kinds of disaster recovery.  This seems
to exercise an unexpected race condition in rsync, which happens to
expose what appears to be a race condition in btrfs, causing random
scary but harmless errors when replaying the rsync batches.


strace has revealed that the two rsync processes running concurrently to
apply the batch both attempt to access xattrs of the same directory
concurrently.  I understand rsync is supposed to avoid this, but
something's going wrong with that.  Here's the smoking gun, snipped from
strace -p 27251 -p 27253 -o smoking.gun, where both processes are
started from a single rsync --read-batch=- -aHAX --del ... run:

0: 27251 stat("osd/0.6ed_head/DIR_D/DIR_E/DIR_6",  <unfinished ...>
1: 27253 stat("osd/0.6ed_head/DIR_D/DIR_E/DIR_6", {st_mode=S_IFDIR|0755, st_size=5470, ...}) = 0
2: 27251 <... stat resumed> {st_mode=S_IFDIR|0755, st_size=5470, ...}) = 0
3: 27253 llistxattr("osd/0.6ed_head/DIR_D/DIR_E/DIR_6", "user.cephos.phash.contents\0", 1024) = 27
4: 27251 llistxattr("osd/0.6ed_head/DIR_D/DIR_E/DIR_6",  <unfinished ...>
5: 27253 lsetxattr("osd/0.6ed_head/DIR_D/DIR_E/DIR_6", "user.cephos.phash.contents", "\x01F\x00\x00\x00\x00\x00\x00\x00\x0f\x00\x00\x00\x03\x00\x00", 17, 0 <unfinished ...>
6: 27251 <... llistxattr resumed> "user.cephos.phash.contents\0", 1024) = 27
7: 27251 lgetxattr("osd/0.6ed_head/DIR_D/DIR_E/DIR_6", "user.cephos.phash.contents", 0x0, 0) = -1 ENODATA (No data available)
8: 27253 <... lsetxattr resumed> ) = 0
9: 27253 utimensat(AT_FDCWD, "osd/0.6ed_head/DIR_D/DIR_E/DIR_6", {UTIME_NOW, {1407992261, 0}}, AT_SYMLINK_NOFOLLOW) = 0
a: 27251 write(2, "rsync: get_xattr_data: lgetxattr"..., 181) = 181

Lines 0-2, 3-6 and 5-8 show concurrent access by both rsync processes
to the same directory.  This wouldn't be a problem, not even for
replaying batches, for the lsetxattr would put the intended xattr value
in there regardless of whether the scanner saw the xattr value before or
after that.


What makes the problem visible is that btrfs appears to have a race in
its handling of xattr replacement, leaving a window between the removal
of the old value and the insertion of the new one, as shown by lines
5-8.  Line 3 shows that the attribute existed before, and lines 5-8 show it
disappearing at line 7, while the lsetxattr that replaces it is still in flight.

If rsync tries hard enough to hit this window, the lgetxattr concurrent
to the lsetxattr eventually hits, and then rsync reports an error:

rsync: get_xattr_data: 
lgetxattr(/media/px/snapshots/cluster/20141102-to-20140816/osd/0.6ed_head/DIR_D/DIR_E/DIR_6,user.cephos.phash.contents,0)
 failed: No data available (61)

At the end, it exits with a nonzero status, even though nothing really
went wrong and the tree ended up looking just as it was supposed to.
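To probe the suspected window independently of rsync, a standalone program along these lines should do (a sketch only; the directory and attribute name are placeholders, and it assumes a directory on a btrfs mount; build with gcc -pthread):

/* Sketch of a probe for the suspected non-atomic xattr replacement
 * window: one thread keeps replacing the same xattr value, the other
 * polls it and reports transient ENODATA, which should never happen if
 * replacement were atomic.  Path and attribute name are placeholders. */
#include <stdio.h>
#include <errno.h>
#include <pthread.h>
#include <sys/xattr.h>

static const char *path = "testdir";		/* placeholder: a dir on btrfs */
static const char *name = "user.test.attr";	/* placeholder xattr name */

static void *writer(void *arg)
{
	char value[16] = "0123456789abcde";
	(void)arg;
	for (;;)
		if (lsetxattr(path, name, value, sizeof(value), 0))
			perror("lsetxattr");
	return NULL;
}

int main(void)
{
	pthread_t t;
	char buf[64];

	if (lsetxattr(path, name, "seed", 4, 0)) {	/* make sure it exists */
		perror("initial lsetxattr");
		return 1;
	}
	pthread_create(&t, NULL, writer, NULL);
	for (;;)
		if (lgetxattr(path, name, buf, sizeof(buf)) < 0 && errno == ENODATA)
			puts("hit the window: xattr transiently missing");
	return 0;
}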


Now, I'm a bit concerned because the btrfs race condition, if exercised
on security-related xattrs or ACLs, could cause data to become visible
that shouldn't, which could turn this into a locally exploitable
security issue.  Granted, nobody goes around repeatedly changing the
ACLs of a dir or file containing information that should be guarded by
them just to increase the likelihood that an attacker succeeds in
accessing the data, but still...  I don't think the temporary removal of
the xattr for its subsequent reinsertion should be visible at all.

I'm sorry for reporting a potential security issue like that, but by the
time it occurred to me that it might have potential security
implications, I'd already mentioned the problem on #btrfs at FreeNode,
so the horse was out of the barn already :-(


I hope this helps,



Re: non-atomic xattr replacement in btrfs => rsync random errors

2014-11-06 Thread Alexandre Oliva
[dropping rs...@lists.samba.org, it rejects posts from non-subscribers;
 refer to https://bugzilla.samba.org/show_bug.cgi?id=10925 instead]

On Nov  6, 2014, Alexandre Oliva ol...@gnu.org wrote:

> What makes the problem visible is that btrfs appears to have a race in
> its handling of xattr replacement, leaving a window between the removal
> of the old value and the insertion of the new one

The bugs described above occurred with rsync-3.1.0-5.fc20.x86_64 and
kernel-libre-3.16.7-200.fc20.gnu.x86_64.  The btrfs code in kernel-libre
is unchanged from the corresponding Fedora kernel.  The distro is BLAG
200k/x86_64, under development.



Re: btrfs: add -k option to filesystem df

2014-08-31 Thread Alexandre Oliva
On Aug 30, 2014, Shriramana Sharma samj...@gmail.com wrote:

> But somehow I feel the name of the long option could be made better than
> --kbytes which is not exactly descriptive of what it accomplishes. IIUC so
> far only bytes are displayed right?

kbytes displays KiBs, whereas the preexisting code chooses a magnitude
most suitable to present the size in a human-friendly way.  I'd be happy
to drop the long option, to follow GNU df's practice: there's no long
option (without arguments) equivalent to -k there.



fixes for btrfs check --repair

2014-08-30 Thread Alexandre Oliva
I got a faulty memory module a while ago, and it ran for a while,
corrupting a number of filesystems on that server.  Most of the
corruption is long gone, as the filesystems (ceph osds) were
reconstructed, but I tried really hard to avoid having to rebuild one
4TB filesystem from scratch, since it was still fully operational.  I
failed, but in the process, I ran into and fixed two btrfs check
--repair bugs.  I gave up when removing an old snapshot caused the
delayed refs processing to abort because it couldn't find a ref to
delete, whereas btrfs check --repair completed successfully without
fixing anything.  Mounting the apparently-clean filesystem would still
run into the same delayed refs error, but trying to map the logical
extent back to a file produced an error.  Since it was far too big to
preserve, even in metadata only, I didn't, and proceeded to mkfs.btrfs
right away.

Here are the patches.

repair: remove recowed entry from the to-recow list

From: Alexandre Oliva ol...@gnu.org

If we attempt to repair a filesystem with metadata blocks that need
recowing, we'll get into an infinite loop repeatedly recowing the
first entry in the list, without ever removing it from the list.
Oops.  Fixed.

Signed-off-by: Alexandre Oliva ol...@gnu.org
---
 cmds-check.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/cmds-check.c b/cmds-check.c
index 268e588..66c982f 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -6760,6 +6760,7 @@ int cmd_check(int argc, char **argv)
 
 		eb = list_first_entry(&root->fs_info->recow_ebs,
 				      struct extent_buffer, recow);
+		list_del_init(&eb->recow);
 		ret = recow_extent_buffer(root, eb);
 		if (ret)
 			break;
check: do not dereference tree_refs as data_refs

From: Alexandre Oliva ol...@gnu.org

In a filesystem corrupted by a faulty memory module, btrfsck would get
very confused attempting to access backrefs that weren't data backrefs
as if they were.  Besides invoking undefined behavior for accessing
potentially-uninitialized data past the end of objects, or with
dynamic types unrelated with the static types held in the
corresponding memory, it used offsets and lengths from such fields
that did not correspond to anything in the filesystem proper.

Moving the test for full backrefs and checking that they're data
backrefs earlier avoided the crash I was running into, but that was
not enough to make the filesystem complete a successful repair.

Signed-off-by: Alexandre Oliva ol...@gnu.org
---
 cmds-check.c |   19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 66c982f..319dd2b 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -4781,15 +4781,17 @@ static int verify_backrefs(struct btrfs_trans_handle *trans,
 		return 0;
 
 	list_for_each_entry(back, &rec->backrefs, list) {
+		if (back->full_backref || !back->is_data)
+			continue;
+
 		dback = (struct data_backref *)back;
+
 		/*
 		 * We only pay attention to backrefs that we found a real
 		 * backref for.
 		 */
 		if (dback->found_ref == 0)
 			continue;
-		if (back->full_backref)
-			continue;
 
 		/*
 		 * For now we only catch when the bytes don't match, not the
@@ -4905,6 +4907,9 @@ static int verify_backrefs(struct btrfs_trans_handle *trans,
 	 * references and fix up the ones that don't match.
 	 */
 	list_for_each_entry(back, &rec->backrefs, list) {
+		if (back->full_backref || !back->is_data)
+			continue;
+
 		dback = (struct data_backref *)back;
 
 		/*
@@ -4913,8 +4918,6 @@ static int verify_backrefs(struct btrfs_trans_handle *trans,
 		 */
 		if (dback->found_ref == 0)
 			continue;
-		if (back->full_backref)
-			continue;
 
 		if (dback->bytes == best->bytes &&
 		    dback->disk_bytenr == best->bytenr)
@@ -5134,14 +5137,16 @@ static int find_possible_backrefs(struct btrfs_trans_handle *trans,
 	int ret;
 
 	list_for_each_entry(back, &rec->backrefs, list) {
+		/* Don't care about full backrefs (poor unloved backrefs) */
+		if (back->full_backref || !back->is_data)
+			continue;
+
 		dback = (struct data_backref *)back;
 
 		/* We found this one, we don't need to do a lookup */
 		if (dback->found_ref)
 			continue;
-		/* Don't care about full backrefs (poor unloved backrefs) */
-		if (back->full_backref)
-			continue;
+
 		key.objectid = dback->root;
 		key.type = BTRFS_ROOT_ITEM_KEY;
 		key.offset = (u64)-1;




btrfs: add -k option to filesystem df

2014-08-30 Thread Alexandre Oliva
Introduce support for df to print sizes in KiB, easy to extend to other
bases.

The man page is also updated and fixed in that it made it seem like
multiple paths were accepted.

Signed-off-by: Alexandre Oliva ol...@gnu.org
---
 Documentation/btrfs-filesystem.txt |4 +++-
 cmds-filesystem.c  |   26 +++---
 utils.c|   29 +++--
 utils.h|1 +
 4 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/Documentation/btrfs-filesystem.txt 
b/Documentation/btrfs-filesystem.txt
index c9c0b00..70ba4b8 100644
--- a/Documentation/btrfs-filesystem.txt
+++ b/Documentation/btrfs-filesystem.txt
@@ -17,8 +17,10 @@ resizing, defragment.
 
 SUBCOMMAND
 --
-*df* <path> [<path>...]::
+*df* [--kbytes] <path>::
 Show space usage information for a mount point.
++
+If '-k' or '--kbytes' is passed, sizes will be printed in KiB.
 
 *show* [--mounted|--all-devices|path|uuid|device|label]::
 Show the btrfs filesystem with some additional info.
diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 7e8ca95..737fcf3 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -113,8 +113,9 @@ static const char * const filesystem_cmd_group_usage[] = {
 };
 
 static const char * const cmd_df_usage[] = {
-	"btrfs filesystem df <path>",
+	"btrfs filesystem df [-k] <path>",
 	"Show space usage information for a mount point",
+	"-k|--kbytes		show disk spaces in KB",
 	NULL
 };
 
@@ -226,10 +227,29 @@ static int cmd_df(int argc, char **argv)
 	char *path;
 	DIR  *dirstream = NULL;
 
-	if (check_argc_exact(argc, 2))
+	while (1) {
+		int long_index;
+		static struct option long_options[] = {
+			{ "kbytes", no_argument, NULL, 'k'},
+			{ NULL, no_argument, NULL, 0 },
+		};
+		int c = getopt_long(argc, argv, "k", long_options,
+				    &long_index);
+		if (c < 0)
+			break;
+		switch (c) {
+		case 'k':
+			pretty_size_force_base (1024);
+			break;
+		default:
+			usage(cmd_df_usage);
+		}
+	}
+
+	if (check_argc_max(argc, optind + 1))
 		usage(cmd_df_usage);
 
-	path = argv[1];
+	path = argv[optind];
 
 	fd = open_file_or_dir(path, &dirstream);
 	if (fd < 0) {
diff --git a/utils.c b/utils.c
index 6c09366..f760d1b 100644
--- a/utils.c
+++ b/utils.c
@@ -1377,19 +1377,43 @@ out:
 }
 
 static char *size_strs[] = { "", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"};
+u64 forced_base = 0;
+int pretty_size_force_base(u64 base)
+{
+	u64 check = 1;
+	while (check < base)
+		check *= 1024;
+	if (check != base && base)
+		return -1;
+	forced_base = base;
+	return 0;
+}
 int pretty_size_snprintf(u64 size, char *str, size_t str_bytes)
 {
 	int num_divs = 0;
+	u64 last_size = size;
 	float fraction;
 
 	if (str_bytes == 0)
 		return 0;
 
-	if( size < 1024 ){
+	if( forced_base ){
+		u64 base = forced_base;
+		while (base > 1) {
+			base /= 1024;
+			last_size = size;
+			size /= 1024;
+			num_divs++;
+		}
+		if (num_divs < 2)
+			return snprintf(str, str_bytes, "%llu%s",
+					(unsigned long long)size,
+					size_strs[num_divs]);
+		goto check;
+	} else if( size < 1024 ){
 		fraction = size;
 		num_divs = 0;
 	} else {
-		u64 last_size = size;
 		num_divs = 0;
 		while(size >= 1024){
 			last_size = size;
@@ -1397,6 +1421,7 @@ int pretty_size_snprintf(u64 size, char *str, size_t str_bytes)
 		num_divs ++;
 	}
 
+	check:
 	if (num_divs >= ARRAY_SIZE(size_strs)) {
 		str[0] = '\0';
 		return -1;
diff --git a/utils.h b/utils.h
index fd25126..bbcb042 100644
--- a/utils.h
+++ b/utils.h
@@ -71,6 +71,7 @@ int check_mounted_where(int fd, const char *file, char *where, int size,
 int btrfs_device_already_in_root(struct btrfs_root *root, int fd,
 int super_offset);
 
+int pretty_size_force_base(u64 base);
 int pretty_size_snprintf(u64 size, char *str, size_t str_bytes);
 #define pretty_size(size)  \
({  \



[PATCH] [btrfs] add volid to failed csum messages

2014-05-20 Thread Alexandre Oliva
The failed csum messages generated by btrfs mention the inode number,
but on filesystems with multiple subvolumes, that's not enough to
identify the file.  I've added the subvolume's root objectid to the
messages so that they're more complete.

I also noticed that the extent/offset information printed for the file
isn't always correct.  Indeed, when we print an offset that could be
fed to inspect-internal logical-resolve, we used the term offset, which
doesn't make it clear that it's a logical offset, whereas when we print a
physical disk offset, as in compression.c, we used the term extent,
which incorrectly implied it to be a logical offset.  I've renamed
them to lofst and phofst, which are hopefully clearer.  Ideally, we'd
uniformly print logical offsets in these messages, but presumably the
information isn't readily available for check_compressed_csum.

I haven't quite tested this beyond building it (I don't have a sure way
to trigger csum errors :-), but AFAICT the objectid I've added is the
same number that one can pass to mount as subvolid, or look up in the
output of btrfs subvol list.

Signed-off-by: Alexandre Oliva ol...@gnu.org
---
 fs/btrfs/compression.c |8 +---
 fs/btrfs/inode.c   |   12 
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index b01fb6c..9f095b3 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -129,9 +129,11 @@ static int check_compressed_csum(struct inode *inode,
 
if (csum != *cb_sum) {
 		btrfs_info(BTRFS_I(inode)->root->fs_info,
-			   "csum failed ino %llu extent %llu csum %u wanted %u mirror %d",
-			   btrfs_ino(inode), disk_start, csum, *cb_sum,
-			   cb->mirror_num);
+			   "csum failed ino %llu vol %llu phofst %llu csum %u wanted %u mirror %d",
+			   btrfs_ino(inode),
+			   BTRFS_I(inode)->root->root_key.objectid,
+			   disk_start, csum, *cb_sum,
+			   cb->mirror_num);
ret = -EIO;
goto fail;
}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d3d4448..cc32b84 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2829,8 +2829,10 @@ good:
 
 zeroit:
 	if (__ratelimit(&_rs))
-		btrfs_info(root->fs_info, "csum failed ino %llu off %llu csum %u expected csum %u",
-			   btrfs_ino(page->mapping->host), start, csum, csum_expected);
+		btrfs_info(root->fs_info, "csum failed ino %llu vol %llu lofst %llu csum %u expected csum %u",
+			   btrfs_ino(page->mapping->host),
+			   root->root_key.objectid,
+			   start, csum, csum_expected);
memset(kaddr + offset, 1, end - start + 1);
flush_dcache_page(page);
kunmap_atomic(kaddr);
@@ -6981,8 +6983,10 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 
 		flush_dcache_page(bvec->bv_page);
 		if (csum != csums[i]) {
-			btrfs_err(root->fs_info, "csum failed ino %llu off %llu csum %u expected csum %u",
-				  btrfs_ino(inode), start, csum,
+			btrfs_err(root->fs_info, "csum failed ino %llu vol %llu lofst %llu csum %u expected csum %u",
+				  btrfs_ino(inode),
+				  root->root_key.objectid,
+				  start, csum,
 				  csums[i]);
err = -EIO;
}




Re: btrfs raid5

2013-10-26 Thread Alexandre Oliva
On Oct 22, 2013, Duncan 1i5t5.dun...@cox.net wrote:

> the quick failure should they try raid56 in its current state simply
> alerts them to the problem they already had.

What quick failure?  There's no such thing in place AFAIK.  It seems to
do all the work properly; the limitations in the current implementation
will only show up when an I/O error kicks in.  I can't see any
indication, in existing announcements, that recovery from I/O errors in
raid56 is missing, let alone that it's so utterly and completely broken
that it will freeze the entire filesystem and require a forced reboot to
unmount the filesystem and make any other data in it accessible again.

That's far, far worse than the general state of btrfs, and that's not a
documented limitation of raid56, so how would someone be expected to
know about it?  It certainly isn't obvious by having a cursory look at
the code either.



Re: btrfs raid5

2013-10-22 Thread Alexandre Oliva
On Oct 22, 2013, Duncan 1i5t5.dun...@cox.net wrote:

> This is because there's a hole in the recovery process in case of a
> lost device, making it dangerous to use except for the pure test-case.

It's not just that; any I/O error in raid56 chunks will trigger a BUG
and make the filesystem unusable until the next reboot, because the
mirror number is zero.  I wrote this patch last week, just before
leaving on a trip, and I was happy to find out it enabled a
frequently-failing disk to hold a filesystem that turned out to be
surprisingly reliable!


btrfs: some progress in raid56 recovery

From: Alexandre Oliva ol...@gnu.org

This patch is WIP, but it has enabled a raid6 filesystem on a bad disk
(frequent read failures at random blocks) to work flawlessly for a
couple of weeks, instead of hanging the entire filesystem upon the
first read error.

One of the problems is that we have the mirror number set to zero on
most raid56 reads.  That's unexpected, for mirror numbers start at
one.  I couldn't quite figure out where to fix the mirror number in
the bio construction, but by simply refraining from failing when the
mirror number is zero, I found out we end up retrying the read with
the next mirror, which becomes a read retry that, on my bad disk,
often succeeds.  So, that was the first win.

After that, I had to make a few further tweaks so that other BUG_ONs
wouldn't hit, and we'd instead fail the read altogether, i.e., in the
extent_io layer, we still don't repair/rewrite the raid56 blocks, nor
do we attempt to rebuild bad blocks out of the other blocks in the
stride.  In a few cases in which the read retry didn't succeed, I'd
get an extent cksum verify failure, which I regarded as ok.

What did surprise me was that, for some of these failures, but not
all, the raid56 recovery code would kick in and rebuild the bad block,
so that we'd get the correct data back in spite of the cksum failure
and the bad block.  I'm still puzzled by that; I can't explain what
I'm observing, but surely the correct data is coming out of somewhere
;-)

Another oddity I noticed is that sometimes the mirror numbers appear
to be totally out of range; I suspect there might be some type
mismatch or out-of-range memory access that causes some other
information to be read as a mirror number from bios or somesuch.  I
couldn't track that down yet.

As it stands, although I know this still doesn't kick in the recovery
or repair code at the right place, the patch is usable on its own, and
it is surely an improvement over the current state of raid56 in btrfs,
so it might be a good idea to put it in.  So far, I've put more than
1TB of data on that failing disk with 16 partitions on raid6, and
somehow I got all the data back successfully: every file passed an
md5sum check, in spite of tons of I/O errors in the process.

Signed-off-by: Alexandre Oliva ol...@gnu.org
---
 fs/btrfs/extent_io.c |   17 -
 fs/btrfs/raid56.c|   18 ++
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fe443fe..4a592a3 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2061,11 +2061,11 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
int ret;
 
-   BUG_ON(!mirror_num);
-
/* we can't repair anything in raid56 yet */
if (btrfs_is_parity_mirror(map_tree, logical, length, mirror_num))
-   return 0;
+   return -EIO;
+
+   BUG_ON(!mirror_num);
 
bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
if (!bio)
@@ -2157,7 +2157,6 @@ static int clean_io_failure(u64 start, struct page *page)
return 0;
 
 	failrec = (struct io_failure_record *)(unsigned long) private_failure;
-	BUG_ON(!failrec->this_mirror);
 
 	if (failrec->in_validation) {
/* there was no real error, just free the record */
@@ -2167,6 +2166,12 @@ static int clean_io_failure(u64 start, struct page *page)
goto out;
}
 
+	if (!failrec->this_mirror) {
+		pr_debug("clean_io_failure: failrec->this_mirror not set, "
+			 "assuming %llu not repaired\n",
+			 failrec->start);
+		goto out;
+	}
 
 	spin_lock(&BTRFS_I(inode)->io_tree.lock);
 	state = find_first_extent_bit_state(&BTRFS_I(inode)->io_tree,
 					    failrec->start,
@@ -2338,7 +2343,9 @@ static int bio_readpage_error(struct bio *failed_bio, struct page *page,
 	 * everything for repair_io_failure to do the rest for us.
 	 */
 	if (failrec->in_validation) {
-		BUG_ON(failrec->this_mirror != failed_mirror);
+		if (failrec->this_mirror != failed_mirror)
+			pr_debug("bio_readpage_error: this_mirror equals failed_mirror: %i\n"

Re: Q: Why subvolumes?

2013-08-04 Thread Alexandre Oliva
On Jul 23, 2013, Jerome Haltom was...@cogito.cx wrote:

> Why not just create the new dev_id on the destination snapshot of any
> directory? That way the snapshot can share inodes with its source.

Agreed.  Nothing stops us from implementing snapshotting of any
directory whatsoever: all it takes is to take a snapshot of the
subvolume enclosing the directory we want to snapshot, removing
everything that's not in the requested directory from the snapshot, and
making that directory the root of the snapshot.  The only tricky bit
here AFAICT is to arrange for the non-snapshotted subtree components to
be cleaned up in background.  If we had some primitive to unlink an
entire subtree and clean it up in background we could use that.



Re: I/O errors block the entire filesystem

2013-05-15 Thread Alexandre Oliva
On May 14, 2013, Liu Bo bo.li@oracle.com wrote:

>> In one of the failures that caused machine load spikes, I tried to
>> collect info on active processes with perf top and SysRq-T, but nothing
>> there seemed to explain the spike.  Thoughts on how to figure out what's
>> causing this?
>
> Although I've seen your solution patch in this thread, I'm still curious
> about this scenario, could you please share the reproducer script or
> something?

I'm afraid I don't have one.  I just use the filesystem on various
disks, with ceph osds and other non-ceph subvolumes and files, and
occasionally I run into one of these bad blocks and the filesystem gets
into these odd states.

> I guess that you're using '-l 64k -n 64k' for mkfs.btrfs

That is correct, but IIUC this should only affect metadata, and metadata
recovery from the DUP block works.  It's data (single copy) that fails
as described.



Re: I/O errors block the entire filesystem

2013-05-15 Thread Alexandre Oliva
On May 15, 2013, Josef Bacik jba...@fusionio.com wrote:

> So this should only happen in the case that you are on a dm device it looks
> like, is that how you are running?

That was my first thought, but no, I'm using partitions out of the SATA
disks directly.  I even checked for stray dm out of fake raid or
somesuch, but the dm modules were not even loaded, and perusing
/sys/block confirms the “scsi” devices are actual ATA disks.

Further investigation suggested that when individual 512-byte blocks are
read from a disk (that's the block size reported by the kernel), the
underlying disk driver is supposed to inform the upper layer about what
it could read by updating the bio_vec bits in precisely the observed
way.



Re: I/O errors block the entire filesystem

2013-05-11 Thread Alexandre Oliva
On Apr  4, 2013, Alexandre Oliva ol...@gnu.org wrote:

> I've been trying to figure out the btrfs I/O stack to try to understand
> why, sometimes (but not always), after a failure to read a (data
> non-replicated) block from the disk, the file being accessed becomes
> permanently locked, and the filesystem, unmountable.

So, after some further investigation, we could determine that the
problem was that end_bio_extent_readpage would unlock_extent_cached only
part of the page, because it had previously computed whole_page as zero
because of the nonzero bv_offset.

So I started hunting for some place that would set up the bio with
partial pages, and I failed.

I was already suspecting some race condition or other form of corruption
of the bvec before it got to end_bio_extent_readpage when I realized
that the bv_offset was always a multiple of 512 bytes, and it
represented the offset into the 4KiB page that the sector that failed to
read was going to occupy.

So I started hunting for places that modified bv_offset, and I found
blk_update_request in block/blk-core.c, where the very error message
reporting the failed sector was output.

The conclusion is that we cannot assume bvec is unmodified between our
submitting the bio and our getting an error back.

OTOH, I don't see that we ever set up bvecs that do not correspond to
whole pages.  Indeed, my attempts to catch such situations with a
wrapper around bio_add_page got no hits whatsoever, which suggests we
could just do away with the whole_page computation, and take
bv_offset+bv_len == PAGE_CACHE_SIZE as the requested read size.

With this patch, after a read error, I get an EIO rather than a process
hang that causes further attempts to access the file to hang, generally
in a non-interruptible way.  Yay!


btrfs: do away with non-whole_page extent I/O

From: Alexandre Oliva ol...@gnu.org

end_bio_extent_readpage computes whole_page based on bv_offset and
bv_len, without taking into account that blk_update_request may modify
them when some of the blocks to be read into a page produce a read
error.  This would cause the read to unlock only part of the file
range associated with the page, which would in turn leave the entire
page locked, which would not only keep the process blocked instead of
returning -EIO to it, but also prevent any further access to the file.

It turns out that btrfs always issues whole-page reads and writes.
The special handling of non-whole_page appears to be a mistake or a
left-over from a time when this wasn't the case.  Indeed,
end_bio_extent_writepage distinguished between whole_page and
non-whole_page writes but behaved identically in both cases!

I've replaced the whole_page computations with warnings, just to be
sure that we're not issuing partial page reads or writes.  The
warnings should probably just go away some time.

Signed-off-by: Alexandre Oliva ol...@gnu.org
---
 fs/btrfs/extent_io.c |   85 ++
 1 file changed, 30 insertions(+), 55 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cdee391..f44b033 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1873,28 +1873,6 @@ static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
 }
 
 /*
- * helper function to unlock a page if all the extents in the tree
- * for that page are unlocked
- */
-static void check_page_locked(struct extent_io_tree *tree, struct page *page)
-{
-	u64 start = page_offset(page);
-	u64 end = start + PAGE_CACHE_SIZE - 1;
-	if (!test_range_bit(tree, start, end, EXTENT_LOCKED, 0, NULL))
-		unlock_page(page);
-}
-
-/*
- * helper function to end page writeback if all the extents
- * in the tree for that page are done with writeback
- */
-static void check_page_writeback(struct extent_io_tree *tree,
- struct page *page)
-{
-	end_page_writeback(page);
-}
-
-/*
  * When IO fails, either with EIO or csum verification fails, we
  * try other mirrors that might have a good copy of the data.  This
  * io_failure_record is used to record state as we go through all the
@@ -2323,19 +2301,24 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
 	struct extent_io_tree *tree;
 	u64 start;
 	u64 end;
-	int whole_page;
 
 	do {
 		struct page *page = bvec->bv_page;
 		tree = &BTRFS_I(page->mapping->host)->io_tree;
 
-		start = page_offset(page) + bvec->bv_offset;
-		end = start + bvec->bv_len - 1;
+		/* We always issue full-page reads, but if some block
+		 * in a page fails to read, blk_update_request() will
+		 * advance bv_offset and adjust bv_len to compensate.
+		 * Print a warning for nonzero offsets, and an error
+		 * if they don't add up to a full page.  */
+		if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE)
+			printk("%s page write in btrfs with offset %u and length %u\n",
+			       bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE
+			       ? KERN_ERR "partial" : KERN_INFO "incomplete",
+			       bvec->bv_offset, bvec->bv_len);
 
-		if (bvec->bv_offset == 0

I/O errors block the entire filesystem

2013-04-04 Thread Alexandre Oliva
I've been trying to figure out the btrfs I/O stack to try to understand
why, sometimes (but not always), after a failure to read a (data
non-replicated) block from the disk, the file being accessed becomes
permanently locked, and the filesystem, unmountable.

Sometimes (but not always) it's possible to kill the process that
accessed the file, and sometimes (but not always) the failure causes
the machine load to skyrocket by 60+ processes.

In one of the failures that caused machine load spikes, I tried to
collect info on active processes with perf top and SysRq-T, but nothing
there seemed to explain the spike.  Thoughts on how to figure out what's
causing this?

Another weirdness I noticed is that, after a single read failure,
btree_io_failed_hook gets called multiple times, until io_pages gets
down to zero.  This seems wrong: I think it should only be called once
when a single block fails, rather than having that single failure get
all pending pages marked as failed, no?

Here are some instrumented dumps I collected from one occurrence of the
scenario described in the previous paragraph (it didn't cause a load
spike).  Only one disk block had a read failure.  At the end, I enclose
the patch that got those dumps printed, the result of several iterations
in which one failure led me to find another function to instrument.

end_request: I/O error, dev sdd, sector 183052083
btrfs: bdev /dev/sdd4 errs: wr 0, rd 3, flush 0, corrupt 0, gen 0
btrfs_end_bio orig -EIO 1  0 pending 0 end a0240820,a020c2d0
end_workqueue_bio err -5 bi_rw 0
ata5: EH complete
end_workqueue_fn err -5 end_io a020c2d0,a0231080
btree_io_failed_hook failed_mirror 1 io_pages 15 readahead 0
end_bio_extent_readpage err -5 failed_hook a020bed0 ret -5
btree_io_failed_hook failed_mirror 1 io_pages 14 readahead 0
end_bio_extent_readpage err -5 failed_hook a020bed0 ret -5
[...repeat both msgs with io_pages decremented one at a time...]
btree_io_failed_hook failed_mirror 1 io_pages 0 readahead 0
end_bio_extent_readpage err -5 failed_hook a020bed0 ret -5
(no further related messages)

Be verbose about the path followed after an I/O error

From: Alexandre Oliva lxol...@fsfla.org


---
 fs/btrfs/disk-io.c   |   22 --
 fs/btrfs/extent_io.c |6 ++
 fs/btrfs/volumes.c   |   31 +--
 3 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6d19a0a..20f9828 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -659,13 +659,18 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
 {
 	struct extent_buffer *eb;
 	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
+	long io_pages;
+	bool readahead;
 
 	eb = (struct extent_buffer *)page->private;
 	set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
 	eb->read_mirror = failed_mirror;
-	atomic_dec(&eb->io_pages);
-	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
+	io_pages = atomic_dec_return(&eb->io_pages);
+	if ((readahead = test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags)))
 		btree_readahead_hook(root, eb, eb->start, -EIO);
+	printk(KERN_ERR
+	       "btree_io_failed_hook failed_mirror %i io_pages %li readahead %i\n",
+	       failed_mirror, io_pages, readahead);
 	return -EIO;	/* we fixed nothing */
 }
 
@@ -674,6 +679,12 @@ static void end_workqueue_bio(struct bio *bio, int err)
 	struct end_io_wq *end_io_wq = bio->bi_private;
 	struct btrfs_fs_info *fs_info;
 
+	if (err) {
+		printk(KERN_ERR
+		       "end_workqueue_bio err %i bi_rw %lx\n",
+		       err, (unsigned long)bio->bi_rw);
+	}
+
 	fs_info = end_io_wq->info;
 	end_io_wq->error = err;
 	end_io_wq->work.func = end_workqueue_fn;
@@ -1647,6 +1658,13 @@ static void end_workqueue_fn(struct btrfs_work *work)
 	fs_info = end_io_wq->info;
 
 	error = end_io_wq->error;
+
+	if (error) {
+		printk(KERN_ERR
+		       "end_workqueue_fn err %i end_io %p,%p\n",
+		       error, bio->bi_end_io, end_io_wq->end_io);
+	}
+
 	bio->bi_private = end_io_wq->private;
 	bio->bi_end_io = end_io_wq->end_io;
 	kfree(end_io_wq);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cdee391..355b24e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2422,6 +2422,9 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 
 		if (!uptodate && tree->ops && tree->ops->readpage_io_failed_hook) {
 			ret = tree->ops->readpage_io_failed_hook(page, mirror);
+			printk(KERN_ERR
+			       "end_bio_extent_readpage err %i failed_hook %p ret %i\n",
+			       err, tree->ops->readpage_io_failed_hook, ret);
 			if (!ret && !err &&
 			    test_bit(BIO_UPTODATE, &bio->bi_flags))
 				uptodate = 1;
@@ -2437,6 +2440,9 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 			 * remain responsible for that page.
 			 */
 			ret = bio_readpage_error(bio, page, start, end, mirror, NULL);
+			printk(KERN_ERR
+			       "end_bio_extent_readpage err %i readpage_error ret %i\n",
+			       err, ret

Re: corruption of active mmapped files in btrfs snapshots

2013-03-29 Thread Alexandre Oliva
On Mar 25, 2013, Chris Mason chris.ma...@fusionio.com wrote:

> This patch changes our compression code to call clear_page_dirty_for_io
> before we compress, and then redirty the pages if the compression fails.
>
> Alexandre, many thanks for tracking this down into a well defined use
> case.

Thanks for the patch, it's run flawlessly since I started gradually
rolling it out onto my ceph OSDs on Monday!  Ship it! :-)



Re: corruption of active mmapped files in btrfs snapshots

2013-03-23 Thread Alexandre Oliva
On Mar 22, 2013, Chris Mason clma...@fusionio.com wrote:

> Quoting Samuel Just (2013-03-22 13:06:41)
>> Incomplete writes for leveldb should just result in lost updates, not
>> corruption.
>
> In this case, I think Alexandre is scanning for zeros in the file.

Yup, the symptom is zeros at the end of a page, with nonzeros on the
subsequent page, which indicates that the writes to the previous page
were dropped.

What I actually do is to iterate over the entire database, which will
error out when the block header is found to be corrupted.  I use this
program I wrote (also hereby provided under GNU GPLv3+) to check the
database for corruption.

#include <assert.h>
#include <string.h>	/* for strcmp */
#include <iostream>
#include <leveldb/db.h>

int main(int argc, char *argv[]) {
  bool paranoid = false;
  bool dump = false;
  bool repair = false;
  bool quiet = false;
  int i = 0;
  int errors = 0;

  if (argc == 1) {
  usage:
    std::cout << "usage: [flags] dbname [flags] ..." << std::endl
	      << "   -d --dump     dump database contents" << std::endl
	      << "   -r --repair   repair database" << std::endl
	      << "   -p --paranoid enable paranoid mode" << std::endl
	      << "   -l --lax      disable paranoid mode (default)" << std::endl
	      << "   -q --quiet    enable quiet mode" << std::endl
	      << "   -v --verbose  disable quiet mode (default)" << std::endl
	      << "   -h --help     show this message and exit" << std::endl
	      << "   dbname        check, dump and repair" << std::endl
	      << std::endl
	      << "exit status is the number of errors" << std::endl;
    return errors;
  }

  for (i++; i < argc; i++) {
    if (argv[i][0] == '-') {
      if (strcmp (argv[i], "--dump") == 0
	  || strcmp (argv[i], "-d") == 0)
	dump = true;
      else if (strcmp (argv[i], "--repair") == 0
	       || strcmp (argv[i], "-r") == 0)
	repair = true;
      else if (strcmp (argv[i], "--paranoid") == 0
	       || strcmp (argv[i], "-p") == 0)
	paranoid = true;
      else if (strcmp (argv[i], "--lax") == 0
	       || strcmp (argv[i], "-l") == 0)
	paranoid = false;
      else if (strcmp (argv[i], "--quiet") == 0
	       || strcmp (argv[i], "-q") == 0)
	quiet = true;
      else if (strcmp (argv[i], "--verbose") == 0
	       || strcmp (argv[i], "-v") == 0)
	quiet = false;
      else if (strcmp (argv[i], "--help") == 0
	       || strcmp (argv[i], "-h") == 0)
	goto usage;
      else {
	std::cerr << "unrecognized option: " << argv[i] << std::endl;
	goto usage;
      }
    } else {
      if (!quiet)
	std::cout << argv[i] << std::endl;

      leveldb::DB* db;
      leveldb::Options options;
      options.paranoid_checks = paranoid;
      leveldb::Status status = leveldb::DB::Open(options, argv[i], &db);
      bool bad = false;

      if (!status.ok()) {
	std::cerr << status.ToString() << std::endl;
	bad = true;
      } else {
	leveldb::ReadOptions rdopt;
	rdopt.verify_checksums = paranoid;
	rdopt.fill_cache = false;
	leveldb::Iterator* it = db->NewIterator(rdopt);
	int count = 0;
	try {
	  for (it->SeekToFirst(); it->Valid(); it->Next()) {
	    count++;
	    if (dump)
	      std::cout << it->key().ToString() << ": "
			<< it->value().ToString() << std::endl;
	    else if (!quiet && count % 1000 == 0)
	      std::cout << count << " entries\r" << std::flush;
	  }
	  if (!it->status().ok()) {
	    std::cerr << it->status().ToString() << std::endl;
	    bad = true;
	  }
	} catch (...) {
	  std::cerr << "caught an exception" << std::endl;
	}
	delete it;
	if (!quiet)
	  std::cout << count << " entries" << std::endl;
      }

      delete db;

      if (bad) {
	errors++;
	if (repair) {
	  if (!quiet)
	    std::cout << "repairing..." << std::endl;
	  status = RepairDB(argv[i], options);
	  if (!status.ok()) {
	    std::cerr << status.ToString() << std::endl;
	    errors++;
	  }
	} else if (!quiet)
	  std::cout << "use --repair to repair" << std::endl;
      }
    }
  }

  return errors;
}
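For what it's worth, it should build with something along the lines of g++ -o leveldb-check check.cc -lleveldb (add -lpthread if your leveldb build needs it), assuming the leveldb headers and library are installed.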




Re: corruption of active mmapped files in btrfs snapshots

2013-03-23 Thread Alexandre Oliva
On Mar 22, 2013, David Sterba dste...@suse.cz wrote:

> I've reproduced this without compression, with autodefrag on.

I don't have autodefrag on, unless it's enabled by default on 3.8.3 or
on the for-linus tree.



Re: corruption of active mmapped files in btrfs snapshots

2013-03-22 Thread Alexandre Oliva
On Mar 22, 2013, Chris Mason clma...@fusionio.com wrote:

> Are you using compression in btrfs or just in leveldb?

btrfs lzo compression.

> I'd like to take snapshots out of the picture for a minute.

That's understandable, I guess, but I don't know that anyone has ever
got the problem without snapshots.  I mean, even when the master copy of
the database got corrupted, snapshots of the subvol containing it were
being taken every now and again, because that's the way ceph works.
Even back when I noticed corruption of firefox _CACHE_* files, snapshots
taken for archival were involved.  So, unless the program happens to
trigger the problem with the -DNOSNAPS option about as easily as it did
without it, I guess we may not have a choice but to keep snapshots in
the picture.

> We need some way to synchronize the leveldb with snapshotting

I purposefully refrained from doing that, because AFAICT ceph doesn't do
that.  Once I failed to trigger the problem with Sync calls, and
determined ceph only syncs the leveldb logs before taking its snapshots,
I went without syncing and finally succeeded in triggering the bug in
snapshots, by simulating very similar snapshotting and mmaping
conditions to those generated by ceph.  I haven't managed to trigger the
corruption of the master subvol yet with the test program, but I already
knew its corruption didn't occur as often as that of the snapshots, and
since it smells like two slightly different symptoms of the same bug, I
decided to leave the test program at that.



Re: corruption of active mmapped files in btrfs snapshots

2013-03-21 Thread Alexandre Oliva
On Mar 19, 2013, Alexandre Oliva ol...@gnu.org wrote:

> On Mar 19, 2013, Alexandre Oliva ol...@gnu.org wrote:
>> that is being processed inside the snapshot.
>
> This doesn't explain why the master database occasionally gets similarly
> corrupted, does it?
>
> Actually, scratch this bit for now.  I don't really have proof that the
> master database actually gets corrupted while it's in use

Scratch the “scratch this”.  The master database actually gets
corrupted, and it's with recently-created files, created after earlier
known-good snapshots.  So, it can't really be orphan processing, can it?

Some more info from the errors and instrumentation:

- no data syncing on the affected files is taking place.  it's just
  memcpy()ing data in 4KiB-sized chunks onto mmap()ed areas,
  munmap()ing it, growing the file with ftruncate and mapping a
  subsequent chunk for further output

- the NULs at the end of pages do NOT occur at munmap/mmap boundaries as
  I suspected at first, but they do coincide with the end of extents
  that are smaller than the maximum compressed extent size.  So,
  something's making btrfs flush pages to disk before the pages are
  completely written (which is fine in principle), but apparently
  failing to pick up subsequent changes to the pages (eek!)



Re: corruption of active mmapped files in btrfs snapshots

2013-03-21 Thread Alexandre Oliva
);)
	;

totalsize += size;
  }

  printf("\r%i blocks, %llu total size\n",
	 blocks, totalsize);

#if NOBGCMP
  if (system("cmp snaptest./??")) {
    printf ("\ncmp error: %s\n", strerror (errno));
break;
  }
#endif
}



Re: corruption of active mmapped files in btrfs snapshots

2013-03-19 Thread Alexandre Oliva
On Mar 19, 2013, Chris Mason clma...@fusionio.com wrote:

> My guess is the truncate is creating a orphan item

Would it, even though the truncate is used to grow rather than to shrink
the file?

> that is being processed inside the snapshot.

This doesn't explain why the master database occasionally gets similarly
corrupted, does it?

> Is it possible to create a smaller leveldb unit test that we might use
> to exercise all of this?

I suppose we can even do away with leveldb altogether, using only a
PosixMmapFile object, as created by PosixEnv::NewWritableFile (all of
this is defined in leveldb's util/env_posix.cc), to exercise the
creation and growth of multiple files, one at a time, taking btrfs
snapshots at random in between the writes.  This ought to suffice.

One thing I'm yet to check is whether ceph uses the sync leveldb
WriteOption, to determine whether or not to call the file object's Sync
member function in the test; this would bring fdatasync and msync calls
into the picture, that would otherwise be left entirely out of the test.



Re: corruption of active mmapped files in btrfs snapshots

2013-03-19 Thread Alexandre Oliva
On Mar 19, 2013, Sage Weil s...@inktank.com wrote:

> There is a set of unit tests in the leveldb source tree that ought to do
> the trick:
>
>   git clone https://code.google.com/p/leveldb/

But these don't create btrfs snapshots.



Re: corruption of active mmapped files in btrfs snapshots

2013-03-19 Thread Alexandre Oliva
On Mar 19, 2013, Alexandre Oliva ol...@gnu.org wrote:

>> that is being processed inside the snapshot.
>
> This doesn't explain why the master database occasionally gets similarly
> corrupted, does it?

Actually, scratch this bit for now.  I don't really have proof that the
master database actually gets corrupted while it's in use, rather than
having inherited corruption on a server restart, that rolls back to the
most recent snapshot and replays the osd journal on it.  It could be
that the used snapshot is corrupted in a way that doesn't manifest
itself immediately, or that it gets corrupted afterwards with your
delayed-orphan theory.

I wrote a test that exercises leveldb's PosixMmapFile with highly
compressible appends of varying sizes, as well as syncs and btrfs
snapshots at random, but I haven't been able to trigger the problem with
it (yet?).

I'm now instrumenting the failing code to try to collect more data.  It
looks like, even though ceph does use leveldb's sync option in some
situations, the syncs don't seem to get all to the data files, only to
the leveldb logs.



corruption of active mmapped files in btrfs snapshots

2013-03-18 Thread Alexandre Oliva
For quite a while, I've experienced oddities with snapshotted Firefox
_CACHE_00?_ files, whose checksums (and contents) would change after the
btrfs snapshot was taken, and would even change depending on how the
file was brought to memory (e.g., rsyncing it to backup storage vs
checking its md5sum before or after the rsync).  This only affected
these cache files, so I didn't give it too much attention.

A similar problem seems to affect the leveldb databases maintained by
ceph within the periodic snapshots it takes of its object storage
volumes.  I'm told others using ceph on filesystems other than btrfs are
not observing this problem, which makes me think it's not memory
corruption within ceph itself.  I've looked into this for a bit, and I'm
now inclined to believe it has to do with some bad interaction of mmap
and snapshots; I'm not sure the fact that the filesystem has compression
enabled has any effect, but that's certainly a possibility.

leveldb does not modify file contents once they're initialized, it only
appends to files, ftruncate()ing them to about a MB early on, mmap()ping
that in and memcpy()ing blocks of various sizes to the end of the output
buffer, occasionally msync()ing the maps, or running fdatasync if it
didn't msync a map before munmap()ping it.  If it runs out of space in a
map, it munmap()s the previously mapped range, truncates the file to a
larger size, then maps in the new tail of the file, starting at the page
it should append to next.
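
In rough C terms, the write path amounts to something like the following
(a minimal sketch of the pattern just described, not leveldb's actual
code; the names and the 1MB growth size are only illustrative):

  #include <string.h>
  #include <sys/mman.h>
  #include <sys/types.h>
  #include <unistd.h>

  #define MAP_CHUNK (1024 * 1024)	/* grow and map the file ~1MB at a time */

  struct mmap_appender {
  	int fd;		/* opened with O_CREAT|O_RDWR|O_TRUNC */
  	char *base;	/* current mapping, or NULL before the first map */
  	size_t mapped;	/* size of the current mapping */
  	size_t used;	/* bytes already appended within the mapping */
  	off_t file_off;	/* file offset where the mapping starts */
  };

  /* munmap the previous tail, ftruncate the file to a larger size, and
     map in the new tail, starting at the page we append to next. */
  static int remap_tail(struct mmap_appender *a)
  {
  	long pagesz = sysconf(_SC_PAGESIZE);

  	if (a->base) {
  		msync(a->base, a->used, MS_SYNC);	/* or fdatasync(a->fd) */
  		munmap(a->base, a->mapped);
  		a->file_off += a->used - (a->used % pagesz);
  		a->used %= pagesz;
  	}
  	if (ftruncate(a->fd, a->file_off + MAP_CHUNK))
  		return -1;
  	a->base = mmap(NULL, MAP_CHUNK, PROT_READ | PROT_WRITE,
  		       MAP_SHARED, a->fd, a->file_off);
  	a->mapped = MAP_CHUNK;
  	return a->base == MAP_FAILED ? -1 : 0;
  }

  /* Blocks are only appended (len assumed smaller than MAP_CHUNK);
     existing contents are never modified once written. */
  static int append(struct mmap_appender *a, const void *buf, size_t len)
  {
  	if (a->used + len > a->mapped && remap_tail(a))
  		return -1;
  	memcpy(a->base + a->used, buf, len);
  	a->used += len;
  	return 0;
  }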

What I'm observing is that some btrfs snapshots taken by ceph osds,
containing the leveldb database, are corrupted, causing crashes during
the use of the database.

I've scripted regular checks of osd snapshots, saving the
last-known-good database along with the first one that displays the
corruption.  Studying about two dozen failures over the weekend, that
took place on all of 13 btrfs-based osds on 3 servers running btrfs as
in 3.8.3(-gnu), I noticed that all of the corrupted databases had a
similar pattern: a stream of NULs of varying sizes at the end of a page,
starting at a block boundary (leveldb doesn't do page-sized blocking, so
blocks can start anywhere in a page), and ending close to the beginning
of the next page, although not exactly at the page boundary; 20 bytes
past the page boundary seemed to be the most common size, but the
occasional presence of NULs in the database contents makes it harder to
tell for sure.

The stream of NULs ended in the middle of a database block (meaning it
was not the beginning of a subsequent database block written later; the
beginning of the database block was partially replaced with NULs).
Furthermore, the checksum fails to match on this one partially-NULed
block.  Since the checksum is computed just before the block and the
checksum trailer are memcpy()ed to the mmap()ed area, it is a certainty
that the block was copied entirely to the right place at some point, and
if part of it became zeros, it's either because the modification was
partially lost, or because the mmapped buffer was partially overwritten.
The fact that all instances of corruption I looked at were correct right
to the end of one block boundary, and then all zeros instead of the
beginning of the subsequent block to the end of that page, makes a
failure to write that modified page seem more likely in my mind (more so
given the Firefox _CACHE_ file oddities in snapshots); intense memory
pressure at the time of the corruption also seems to favor this
possibility.

Now, it could be that btrfs requires those who modify SHARED mmap()ed
files to take some precaution, along the lines of msync MS_ASYNC, so as
to make sure the data makes it to a subsequent snapshot, and that
leveldb does not take this sort of precaution.  However, I noticed that
the unexpected stream of zeros
after a prior block and before the rest of the subsequent block
*remains* in subsequent snapshots, which to me indicates the page update
is effectively lost.  This explains why even the running osd, that
operates on the “current” subvolumes from which snapshots for recovery
are taken, occasionally crashes because of database corruption, and will
later fail to restart from an earlier snapshot due to that same
corruption.


Does this problem sound familiar to anyone else?

Should mmaped-file writers in general do more than umount or msync to
ensure changes make it to subsequent snapshots that are supposed to be
consistent?
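
If such a precaution does turn out to be required, I suppose it would
look something like this on the writer's side (a hypothetical sketch for
illustration only; the paths and the use of the btrfs command line are
made up, and maybe MS_ASYNC would suffice):

  #include <stdlib.h>
  #include <sys/mman.h>

  /* Hypothetical precaution: flush the dirty pages of the MAP_SHARED
     mapping before the snapshot is taken, so the snapshot cannot miss
     in-memory modifications.  Paths and system() are illustrative. */
  static int snapshot_after_flush(void *map, size_t len)
  {
  	if (msync(map, len, MS_SYNC))
  		return -1;
  	return system("btrfs subvolume snapshot /osd/current /osd/snap");
  }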

Any tips on where to start looking so as to fix the problem, or even to
confirm that the problem is indeed in btrfs?


TIA,

-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: corruption of active mmapped files in btrfs snapshots

2013-03-18 Thread Alexandre Oliva
While I wrote the previous email, a smoking gun formed in one of my
servers: a snapshot that had passed a database consistency check turned
out to be corrupted when I tried to rollback to it!  Since the snapshot
was not modified in any way between the initial scripted check and the
later manual check, the problem must be in btrfs.

On Mar 18, 2013, Alexandre Oliva ol...@gnu.org wrote:

 I've scripted regular checks of osd snapshots, saving the
 last-known-good database along with the first one that displays the
 corruption.  Studying about two dozen failures over the weekend, that
 took place on all of 13 btrfs-based osds on 3 servers running btrfs as
 in 3.8.3(-gnu), I noticed that all of the corrupted databases had a
 similar pattern: a stream of NULs of varying sizes at the end of a page,
 starting at a block boundary (leveldb doesn't do page-sized blocking, so
 blocks can start anywhere in a page), and ending close to the beginning
 of the next page, although not exactly at the page boundary; 20 bytes
 past the page boundary seemed to be the most common size, but the
 occasional presence of NULs in the database contents makes it harder to
 tell for sure.

Additional corrupted snapshots collected today have confirmed this
pattern, except that today I got several corrupted files with non-NULs
right at the beginning of the page following the one that marked the
beginning of the corrupted database block.

-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: corruption of active mmapped files in btrfs snapshots

2013-03-18 Thread Alexandre Oliva
On Mar 18, 2013, Chris Mason chris.ma...@fusionio.com wrote:

 A few questions.  Does leveldb use O_DIRECT and mmap together?

No, it doesn't use O_DIRECT at all.  Its I/O interface is very
simplified: it just opens each new file (database chunks limited to 2MB)
with O_CREAT|O_RDWR|O_TRUNC, and then uses ftruncate, mmap, msync,
munmap and fdatasync.  It doesn't seem to modify data once it's written;
it only appends.  Reading data back from it uses a completely different
class interface, using separate descriptors and using pread only.

 (the source of a write being pages that are mmap'd from somewhere
 else)

AFAICT the source of the memcpy()s that append to the file are
malloc()ed memory.

 That's the most likely place for this kind of problem.  Also, you
 mention crc errors.  Are those reported by btrfs or are they application
 level crcs.

These are CRCs leveldb computes and writes out after each db block.  No
btrfs CRC errors are reported in this process.

-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: collapse concurrent forced allocations

2013-03-03 Thread Alexandre Oliva
On Feb 23, 2013, Alexandre Oliva ol...@gnu.org wrote:

 On Feb 22, 2013, Josef Bacik jba...@fusionio.com wrote:
 So I understand what you are getting at, but I think you are doing it wrong.  If
 we're calling with CHUNK_ALLOC_FORCE, but somebody has already started to
 allocate with CHUNK_ALLOC_NO_FORCE, we'll reset the space_info->force_alloc to
 our original caller's CHUNK_ALLOC_FORCE.

 But that's ok, do_chunk_alloc will set space_info->force_alloc to
 CHUNK_ALLOC_NO_FORCE at the end, when it succeeds allocating, and then
 anyone else waiting on the mutex to try to allocate will load the
 NO_FORCE from space_info.

 So we only really care about making sure a chunk is actually
 allocated, instead of doing this flag shuffling we should just do

 if (space_info->chunk_alloc) { spin_unlock(&space_info->lock);
 wait_event(!space_info->chunk_alloc); return 0;

I looked a bit further into it.  I think this would work if we had a
wait_queue for space_info->chunk_alloc.  We don't, so the mutex
interface is probably the best we can do.
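
For the record, this is roughly what I had in mind (a purely hypothetical
fragment of do_chunk_alloc: space_info has no chunk_alloc_wait today,
that's the wait queue we'd have to add):

  	/* Hypothetical sketch only: chunk_alloc_wait is a made-up
  	   wait_queue_head_t that space_info would have to grow. */
  	spin_lock(&space_info->lock);
  	if (space_info->chunk_alloc) {
  		spin_unlock(&space_info->lock);
  		/* wait for the thread that is allocating, then bail out */
  		wait_event(space_info->chunk_alloc_wait,
  			   !space_info->chunk_alloc);
  		return 0;
  	}
  	space_info->chunk_alloc = 1;
  	spin_unlock(&space_info->lock);

  	/* ... allocate the chunk ...; then, on completion: */

  	spin_lock(&space_info->lock);
  	space_info->chunk_alloc = 0;
  	spin_unlock(&space_info->lock);
  	wake_up(&space_info->chunk_alloc_wait);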

OTOH, I found out we seem to get into an allocation spree when a large
file is being quickly created, such as when creating a ceph journal or
making a copy of a multi-GB file.  I suppose btrfs is just trying to
allocate contiguous space for the file, but unfortunately there doesn't
seem to be a fallback for allocation failure: as soon as data allocation
fails and space_info is set as full, the large write fails and the
filesystem becomes full, without even trying to use non-contiguous
storage.  Isn't that a bug?


I've also been trying to track down why, on a single-data filesystem,
(compressed?) data reads that fail because of bad blocks also spike the
CPU load and lock the file that failed to map in and the entire
filesystem, so that the only way to recover is to force a reboot.
Does this sound familiar to anyone?

-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: collapse concurrent forced allocations

2013-02-23 Thread Alexandre Oliva
On Feb 22, 2013, Josef Bacik jba...@fusionio.com wrote:

 So I understand what you are getting at, but I think you are doing it wrong.  If
 we're calling with CHUNK_ALLOC_FORCE, but somebody has already started to
 allocate with CHUNK_ALLOC_NO_FORCE, we'll reset the space_info->force_alloc to
 our original caller's CHUNK_ALLOC_FORCE.

But that's ok, do_chunk_alloc will set space_info->force_alloc to
CHUNK_ALLOC_NO_FORCE at the end, when it succeeds allocating, and then
anyone else waiting on the mutex to try to allocate will load the
NO_FORCE from space_info.

 So we only really care about making sure a chunk is actually
 allocated, instead of doing this flag shuffling we should just do

 if (space_info->chunk_alloc) {
   spin_unlock(&space_info->lock);
   wait_event(!space_info->chunk_alloc);
   return 0;
 }

Sorry, I don't follow.

-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


collapse concurrent forced allocations (was: Re: clear chunk_alloc flag on retryable failure)

2013-02-21 Thread Alexandre Oliva
On Feb 21, 2013, Alexandre Oliva ol...@gnu.org wrote:

 What I saw in that function also happens to explain why in some cases I
 see filesystems allocate a huge number of chunks that remain unused
 (leading to the scenario above, of not having more chunks to allocate).
 It happens for data and metadata, but not necessarily both.  I'm
 guessing some thread sets the force_alloc flag on the corresponding
 space_info, and then several threads trying to get disk space end up
 attempting to allocate a new chunk concurrently.  All of them will see
 the force_alloc flag and bump their local copy of force up to the level
 they see first, and they won't clear it even if another thread succeeds
 in allocating a chunk, thus clearing the force flag.  Then each thread
 that observed the force flag will, on its turn, force the allocation of
 a new chunk.  And any threads that come in while it does that will see
 the force flag still set and pick it up, and so on.  This sounds like a
 problem to me, but...  what should the correct behavior be?  Clear
 force_flag once we copy it to a local force?  Reset force to the
 incoming value on every loop?

I think a slight variant of the following makes the most sense, so I
implemented it in the patch below.

 Set the flag to our incoming force if we have it at first, clear our
 local flag, and move it from the space_info when we determined that we
 are the thread that's going to perform the allocation?


From: Alexandre Oliva ol...@gnu.org

btrfs: consume force_alloc in the first thread to chunk_alloc

Even if multiple threads in do_chunk_alloc look at force_alloc and see
a force flag, it suffices that one of them consumes the flag.  Arrange
for an incoming force argument to make it to force_alloc in case of
concurrent calls, so that it is used only by the first thread to get
to allocation after the initial request.

Signed-off-by: Alexandre Oliva ol...@gnu.org
---
 fs/btrfs/extent-tree.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6ee89d5..66283f7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3574,8 +3574,12 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
 
 again:
 	spin_lock(&space_info->lock);
+
+	/* Bring force_alloc to force and tentatively consume it.  */
 	if (force < space_info->force_alloc)
 		force = space_info->force_alloc;
+	space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
+
 	if (space_info->full) {
 		spin_unlock(&space_info->lock);
 		return 0;
@@ -3586,6 +3590,10 @@ again:
 		return 0;
 	} else if (space_info->chunk_alloc) {
 		wait_for_alloc = 1;
+		/* Reset force_alloc so that it's consumed by the
+		   first thread that completes the allocation.  */
+		space_info->force_alloc = force;
+		force = CHUNK_ALLOC_NO_FORCE;
 	} else {
 		space_info->chunk_alloc = 1;
 	}

-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer


ceph-on-btrfs inline-cow regression fix for 3.4.3

2012-06-12 Thread Alexandre Oliva
Hi, Greg,

There's a btrfs regression in 3.4 that's causing a lot of grief to
ceph-on-btrfs users like myself.  This small and nice patch cures it.
It's in Linus' master already.  I've been running it on top of 3.4.2,
and it would be very convenient for me if this could be in 3.4.3.

Although the patch mentions ENOSPC, the fix has nothing to do with disk
full conditions; it's more along the lines of not finding enough room
for inline data contents and/or failing to split the btree nodes to make
room for it.  I don't know that anyone knows for sure, but without this
patch what we get is a horrible error, that can only be fixed with a
reboot.  Yeah, not even umount and mount will make the filesystem writable
again.  The fix makes us return an error condition in this case, that
callers are prepared to deal with.

I know btrfs hasn't had maintenance fixes in stable series, but Chris
Mason tells me the only reason is that nobody stepped up to do so.
Given my interest, I might as well give it a try ;-)

Thanks,

From 2adcac1a7331d93a17285804819caa96070b231f Mon Sep 17 00:00:00 2001
From: Josef Bacik jo...@redhat.com
Date: Wed, 23 May 2012 16:10:14 -0400
Subject: [PATCH] Btrfs: fall back to non-inline if we don't have enough space

If cow_file_range_inline fails with ENOSPC we abort the transaction which
isn't very nice.  This really shouldn't be happening anyways but there's no
sense in making it a horrible error when we can easily just go allocate
normal data space for this stuff.  Thanks,

Signed-off-by: Josef Bacik jo...@redhat.com
---
 fs/btrfs/inode.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0298928..92df0a5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -257,10 +257,13 @@ static noinline int cow_file_range_inline(struct btrfs_trans_handle *trans,
 	ret = insert_inline_extent(trans, root, inode, start,
    inline_len, compressed_size,
    compress_type, compressed_pages);
-	if (ret) {
+	if (ret && ret != -ENOSPC) {
 		btrfs_abort_transaction(trans, root, ret);
 		return ret;
+	} else if (ret == -ENOSPC) {
+		return 1;
 	}
+
 	btrfs_delalloc_release_metadata(inode, end + 1 - start);
 	btrfs_drop_extent_cache(inode, start, aligned_end - 1, 0);
 	return 0;
-- 
1.7.7.6



-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer


avoid redundant block group free-space checks

2011-12-11 Thread Alexandre Oliva
It was pointed out to me that the test for enough free space in a block
group was wrong in that it would skip a block group that had most of its
free space reserved by a cluster.

I offer two mutually exclusive, (so far) very lightly tested patches to
address this problem.

One moves the test to the middle of the clustered allocation logic,
between the release of the cluster and the attempt to create a new
cluster, with some ugliness due to more indentation, locking operations
and testing.

The other, that I like better but haven't given any significant amount
of testing yet, only performs the test when we fall back to unclustered
allocation, relying on btrfs_find_space_cluster to test for enough free
space early (it does); it also arranges for the cluster in the current
block group to be released before we try unclustered allocation.

From f1d4d6212a4cfb2fde6a15780d9b337319d3d1e1 Mon Sep 17 00:00:00 2001
From: Alexandre Oliva lxol...@fsfla.org
Date: Mon, 12 Dec 2011 04:33:33 -0200
Subject: [PATCH] Btrfs: delay block group's free space test within allocator

If a block group has a cluster, we don't want to test its free space
when the cluster has taken an unknown amount of free space.  Delay the
free space test after failing to allocate from the cluster and releasing
it.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/extent-tree.c |   37 -
 1 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 05e1386..1de4c47 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5277,15 +5277,6 @@ alloc:
 		if (unlikely(block_group->ro))
 			goto loop;
 
-		spin_lock(&block_group->free_space_ctl->tree_lock);
-		if (cached &&
-		    block_group->free_space_ctl->free_space <
-		    num_bytes + empty_cluster + empty_size) {
-			spin_unlock(&block_group->free_space_ctl->tree_lock);
-			goto loop;
-		}
-		spin_unlock(&block_group->free_space_ctl->tree_lock);
-
 		/*
 		 * Ok we want to try and use the cluster allocator, so
 		 * lets look there
@@ -5323,6 +5314,7 @@ alloc:
 			}
 refill_cluster:
 			BUG_ON(used_block_group != block_group);
+
 			/* If we are on LOOP_NO_EMPTY_SIZE, we can't
 			 * set up a new clusters, so lets just skip it
 			 * and let the allocator find whatever block
@@ -5332,17 +5324,29 @@ refill_cluster:
 			 * anything, so we are likely way too
 			 * fragmented for the clustering stuff to find
 			 * anything.  */
-			if (loop >= LOOP_NO_EMPTY_SIZE) {
+			if (loop >= LOOP_NO_EMPTY_SIZE)
 				spin_unlock(&last_ptr->refill_lock);
-				goto unclustered_alloc;
+			else {
+				/*
+				 * this cluster didn't work out, free
+				 * it and start over
+				 */
+				btrfs_return_cluster_to_free_space(NULL, last_ptr);
 			}
+		}
 
-			/*
-			 * this cluster didn't work out, free it and
-			 * start over
-			 */
-			btrfs_return_cluster_to_free_space(NULL, last_ptr);
+		spin_lock(&block_group->free_space_ctl->tree_lock);
+		if (cached &&
+		    block_group->free_space_ctl->free_space <
+		    num_bytes + empty_cluster + empty_size) {
+			spin_unlock(&block_group->free_space_ctl->tree_lock);
+			if (last_ptr && loop < LOOP_NO_EMPTY_SIZE)
+				spin_unlock(&last_ptr->refill_lock);
+			goto loop;
+		}
+		spin_unlock(&block_group->free_space_ctl->tree_lock);
 
+		if (last_ptr && loop < LOOP_NO_EMPTY_SIZE) {
 			/* allocate a cluster in this block group */
 			ret = btrfs_find_space_cluster(trans, root,
 	   block_group, last_ptr,
@@ -5382,7 +5386,6 @@ refill_cluster:
 			goto loop;
 		}
 
-unclustered_alloc:
 		offset = btrfs_find_space_for_alloc(block_group, search_start,
 		num_bytes, empty_size);
 		/*
-- 
1.7.4.4

From 72c9239effd15c7c921c5265e860a14084e1f13e Mon Sep 17 00:00:00 2001
From: Alexandre Oliva lxol...@fsfla.org
Date: Mon, 12 Dec 2011 04:48:19 -0200
Subject: [PATCH 1/9] Btrfs: test free space only for unclustered allocation

Since the clustered allocation may be taking extents from a different
block group, there's no point in spin-locking and testing the current
block group free space before attempting to allocate space from a
cluster, even more so when we might refrain from even trying the
cluster in the current block group because, after the cluster was set
up, not enough free space remained.  Furthermore, cluster creation
attempts fail fast when the block group doesn't have enough free
space, so the test was completely superfluous.

I've moved the free space test past the cluster allocation attempt,
where it is more useful, and arranged for a cluster in the current
block group to be released before trying an unclustered allocation,
when we reach the LOOP_NO_EMPTY_SIZE stage, so that the free space in
the cluster stands a chance of being combined with additional free
space in the block group so as to succeed in the allocation attempt.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/extent-tree.c |   34 +++---
 1 files changed, 23 insertions

Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list

2011-11-30 Thread Alexandre Oliva
On Nov 29, 2011, Christian Brunner c...@muc.de wrote:

 When I'm doing heavy reading in our ceph cluster, the load and wait-io
 on the patched servers is higher than on the unpatched ones.

That's unexpected.

 This seems to be coming from btrfs-endio-1, a kernel thread that has
 not caught my attention on unpatched systems, yet.

I suppose I could wave my hands while explaining that you're getting
higher data throughput, so it's natural that it would take up more
resources, but that explanation doesn't satisfy me.  I suppose
allocation might have got slightly more CPU intensive in some cases, as
we now use bitmaps where before we'd only use the cheaper-to-allocate
extents.  But that's unsatisfying as well.

 Do you have any idea what's going on here?

Sorry, not really.

 (Please note that the filesystem is still unmodified - metadata
 overhead is large).

Speaking of metadata overhead, I found out that the bitmap-enabling
patch is not enough for a metadata balance to get rid of excess metadata
block groups.  I had to apply patch #16 to get it again.  It sort of
makes sense: without patch 16, too often will we get to the end of the
list of metadata block groups and advance from LOOP_FIND_IDEAL to
LOOP_CACHING_WAIT (skipping NOWAIT after we've cached free space for all
block groups), and if we get to the end of that loop as well (how?  I
couldn't quite figure out, but it only seems to happen under high
contention) we'll advance to LOOP_ALLOC_CHUNK and end up unnecessarily
allocating a new chunk.

Patch 16 makes sure we don't jump ahead during LOOP_CACHING_WAIT, so we
won't get new chunks unless they can really help us keep the system
going.
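
For context, the allocator loop stages look roughly like this in the
kernels I've been testing (a sketch from memory; double-check the values
against your tree):

  /* Stages of the find_free_extent() retry loop, roughly as in the
   * kernels discussed here; values may differ in other trees. */
  enum btrfs_loop_type {
  	LOOP_FIND_IDEAL = 0,	/* only try the hinted block group */
  	LOOP_CACHING_NOWAIT = 1,	/* use only already-cached block groups */
  	LOOP_CACHING_WAIT = 2,	/* wait for free-space caching to finish */
  	LOOP_ALLOC_CHUNK = 3,	/* allocate a new chunk if allowed */
  	LOOP_NO_EMPTY_SIZE = 4,	/* last resort: drop the empty-size slack */
  };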

-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: initialize new bitmaps' list

2011-11-28 Thread Alexandre Oliva
We're failing to create clusters with bitmaps because
setup_cluster_no_bitmap checks that the list is empty before inserting
the bitmap entry in the list for setup_cluster_bitmap, but the list
field is only initialized when it is restored from the on-disk free
space cache, or when it is written out to disk.

Besides a potential race condition due to the multiple use of the list
field, filesystem performance severely degrades over time: as we use
up all non-bitmap free extents, the try-to-set-up-cluster dance is
done at every metadata block allocation.  For every block group, we
fail to set up a cluster, and after failing on them all up to twice,
we fall back to the much slower unclustered allocation.

To make matters worse, before the unclustered allocation, we try to
create new block groups until we reach the 1% threshold, which
introduces additional bitmaps and thus block groups that we'll iterate
over at each metadata block request.
---
 fs/btrfs/free-space-cache.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 33fa4bb..4642c42 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1470,6 +1470,7 @@ static void add_new_bitmap(struct btrfs_free_space_ctl *ctl,
 {
 	info->offset = offset_to_bitmap(ctl, offset);
 	info->bytes = 0;
+	INIT_LIST_HEAD(&info->list);
 	link_free_space(ctl, info);
 	ctl->total_bitmaps++;
 
-- 
1.7.4.4

-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 03/20] Btrfs: fix comment typo

2011-11-28 Thread Alexandre Oliva
---
 fs/btrfs/extent-tree.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5d86877..bc0f13d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5304,7 +5304,7 @@ alloc:
/*
 * whoops, this cluster doesn't actually point to
 * this block group.  Get a ref on the block
-* group is does point to and try again
+* group it does point to and try again
 */
 		if (!last_ptr_loop && last_ptr->block_group &&
 		    last_ptr->block_group != block_group &&
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 00/20] Here's my current btrfs patchset

2011-11-28 Thread Alexandre Oliva
The first 11 patches are relatively simple fixes or improvements that
I suppose go could make it even in 3.2 (02 is particularly essential
to avoid progressive performance degradation and metadata space waste
in the default clustered allocation strategy).

Patch 12 and its complement 15, and also 19, are debugging aids that
helped me track down the problem fixed in 02.

Patch 13 is a revised version of the larger-clusters patch I posted
before; it adds a micro-optimization of the bitmap computations on top
of the earlier version.

Patches 14 to 20 are probably not suitable for inclusion, and are
provided only for reference, although I'm still undecided on 16: it
seems to me to make sense to stick to the ordered list and index
instead of jumping to the current cluster's block group, but it may
also make sense performance-wise to start at the current cluster and
advance from there.  We still do that, as long as we find a cluster
to begin with, but I'm yet to double check on the race that causes
multiple subsequent releases/creation of clusters under heavy load.
I'm sure I saw it, and I no longer do, but now I'm no longer sure
whether this is the patch that fixed it, or about the details of how
we came about that scenario.

Patches 14, 17, 18 and 20 were posted before, and I'm probably dropping
them from future patchsets unless I find them to be still useful.

Alexandre Oliva (20):
  Btrfs: enable removal of second disk with raid1 metadata
  Btrfs: initialize new bitmaps' list
  Btrfs: fix comment typo
  Btrfs: reset cluster's max_size when creating bitmap cluster
  Btrfs: start search for new cluster at the beginning of the block
group
  Btrfs: skip block groups without enough space for a cluster
  Btrfs: don't set up allocation result twice
  Btrfs: try to allocate from cluster even at LOOP_NO_EMPTY_SIZE
  Btrfs: skip allocation attempt from empty cluster
  Btrfs: report reason for failed relocation
  Btrfs: note when a bitmap is skipped because its list is in use
  Btrfs: introduce verbose debug mode for patched clustered allocation
recovery
  Btrfs: revamp clustered allocation logic
  Btrfs: introduce option to rebalance only metadata
  Btrfs: activate allocation debugging
  Btrfs: try cluster but don't advance in search list
  Btrfs: introduce -o cluster and -o nocluster
  Btrfs: add -o mincluster option
  Btrfs: log when a bitmap is rejected for a cluster
  Btrfs: don't waste metadata block groups for clustered allocation

 fs/btrfs/ctree.h|3 +-
 fs/btrfs/extent-tree.c  |  297 ---
 fs/btrfs/free-space-cache.c |  132 ++-
 fs/btrfs/ioctl.c|2 +
 fs/btrfs/ioctl.h|3 +
 fs/btrfs/relocation.c   |8 +
 fs/btrfs/super.c|   31 -
 fs/btrfs/volumes.c  |   39 +-
 fs/btrfs/volumes.h  |1 +
 9 files changed, 369 insertions(+), 147 deletions(-)

-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 16/20] Btrfs: try cluster but don't advance in search list

2011-11-28 Thread Alexandre Oliva
When we find an existing cluster, we switch to its block group as the
current block group, possibly skipping multiple blocks in the process.
Furthermore, under heavy contention, multiple threads may fail to
allocate from a cluster and then release just-created clusters just to
proceed to create new ones in a different block group.

This patch tries to allocate from an existing cluster regardless of its
block group, and doesn't switch to that group, instead proceeding to
try to allocate a cluster from the group it was iterating before the
attempt.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/extent-tree.c |   76 +---
 1 files changed, 33 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 66edda2..7064979 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5174,11 +5174,11 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_root *root = orig_root->fs_info->extent_root;
 	struct btrfs_free_cluster *last_ptr = NULL;
 	struct btrfs_block_group_cache *block_group = NULL;
+	struct btrfs_block_group_cache *used_block_group;
 	int empty_cluster = 2 * 1024 * 1024;
 	int allowed_chunk_alloc = 0;
 	int done_chunk_alloc = 0;
 	struct btrfs_space_info *space_info;
-	int last_ptr_loop = 0;
 	int loop = 0;
 	int index = 0;
 	int alloc_type = (data & BTRFS_BLOCK_GROUP_DATA) ?
@@ -5245,6 +5245,7 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 ideal_cache:
 	block_group = btrfs_lookup_block_group(root->fs_info,
 					       search_start);
+	used_block_group = block_group;
 	if (debug > 1)
 		printk(KERN_DEBUG "btrfs %x.%i: ideal cache block %llx\n",
 		       debugid, loop,
@@ -5286,6 +5287,7 @@ search:
u64 offset;
int cached;
 
+   used_block_group = block_group;
btrfs_get_block_group(block_group);
 		search_start = block_group->key.objectid;
 
@@ -5380,13 +5382,20 @@ alloc:
 * people trying to start a new cluster
 */
 		spin_lock(&last_ptr->refill_lock);
-		if (!last_ptr->block_group ||
-		    last_ptr->block_group->ro ||
-		    !block_group_bits(last_ptr->block_group, data))
+		used_block_group = last_ptr->block_group;
+		if (used_block_group != block_group &&
+		    (!used_block_group ||
+		     used_block_group->ro ||
+		     !block_group_bits(used_block_group, data))) {
+			used_block_group = block_group;
 			goto refill_cluster;
+		}
+
+		if (used_block_group != block_group)
+			btrfs_get_block_group(used_block_group);
 
-		offset = btrfs_alloc_from_cluster(block_group, last_ptr,
-						  num_bytes, search_start);
+		offset = btrfs_alloc_from_cluster(used_block_group,
+		    last_ptr, num_bytes, used_block_group->key.objectid);
 		if (offset) {
 			/* we have a block, we're done */
 			spin_unlock(&last_ptr->refill_lock);
@@ -5398,36 +5407,15 @@ alloc:
 			printk(KERN_DEBUG "btrfs %x.%i: failed cluster alloc\n",
 			       debugid, loop);
 
-		spin_lock(&last_ptr->lock);
-		/*
-		 * whoops, this cluster doesn't actually point to
-		 * this block group.  Get a ref on the block
-		 * group it does point to and try again
-		 */
-		if (!last_ptr_loop && last_ptr->block_group &&
-		    last_ptr->block_group != block_group &&
-		    index <=
-		    get_block_group_index(last_ptr->block_group)) {
-
-			btrfs_put_block_group(block_group);
-			block_group = last_ptr->block_group;
-			btrfs_get_block_group(block_group);
-			spin_unlock(&last_ptr->lock);
-			spin_unlock(&last_ptr->refill_lock);
-
-			last_ptr_loop = 1;
-			search_start = block_group->key.objectid;
-			/*
-			 * we know this block group is properly
-			 * in the list

[PATCH 07/20] Btrfs: don't set up allocation result twice

2011-11-28 Thread Alexandre Oliva
We store the allocation start and length twice in ins, once right
after the other, but with intervening calls that may prevent the
duplicate from being optimized out by the compiler.  Remove one of the
assignments.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/extent-tree.c |3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 525ff20..24eef3a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5412,9 +5412,6 @@ checks:
goto loop;
}
 
-	ins->objectid = search_start;
-	ins->offset = num_bytes;
-
 	if (offset < search_start)
btrfs_add_free_space(block_group, offset,
 search_start - offset);
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/20] Btrfs: reset cluster's max_size when creating bitmap cluster

2011-11-28 Thread Alexandre Oliva
The field that indicates the size of the largest contiguous chunk of
free space in the cluster is not initialized when setting up bitmaps,
it's only increased when we find a larger contiguous chunk.  We end up
retaining a larger value than appropriate for highly-fragmented
clusters, which may cause pointless searches for large contiguous
groups, and even cause clusters that do not meet the density
requirements to be set up.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/free-space-cache.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index ff179b1..ec23d43 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2320,6 +2320,7 @@ again:
 
if (!found) {
start = i;
+			cluster->max_size = 0;
found = true;
}
 
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 19/20] Btrfs: log when a bitmap is rejected for a cluster

2011-11-28 Thread Alexandre Oliva
---
 fs/btrfs/free-space-cache.c |   10 ++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 953f7dd..0151274 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2316,6 +2316,16 @@ again:
i = next_zero;
}
 
+	if (!found_bits && total_found)
+		printk(KERN_INFO "btrfs: bitmap %llx want:%llx min:%llx "
+		       "cont:%llx start:%llx max:%llx total:%llx\n",
+		       (unsigned long long)entry->offset,
+		       (unsigned long long)bytes,
+		       (unsigned long long)min_bytes,
+		       (unsigned long long)cont1_bytes,
+		       (unsigned long long)(start * block_group->sectorsize),
+		       (unsigned long long)cluster->max_size,
+		       (unsigned long long)(total_found * block_group->sectorsize));
+
if (!found_bits)
return -ENOSPC;
 
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 18/20] Btrfs: add -o mincluster option

2011-11-28 Thread Alexandre Oliva
With -o mincluster, we save the location of the last successful
allocation, so as to emulate some of the cluster allocation logic
(though not non-bitmap preference) without actually going through the
exercise of allocating clusters.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/extent-tree.c  |   16 +---
 fs/btrfs/free-space-cache.c |1 +
 fs/btrfs/super.c|   17 +
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 7ddbf9b..3c649fe 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5172,7 +5172,7 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 {
 	int ret = 0;
 	struct btrfs_root *root = orig_root->fs_info->extent_root;
-	struct btrfs_free_cluster *last_ptr = NULL;
+	struct btrfs_free_cluster *last_ptr = NULL, *save_ptr = NULL;
 	struct btrfs_block_group_cache *block_group = NULL;
 	struct btrfs_block_group_cache *used_block_group;
 	int empty_cluster = 2 * 1024 * 1024;
@@ -5219,8 +5219,16 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 		debug = 1;
 		debugid = atomic_inc_return(&debugcnt);
 		last_ptr = root->fs_info->meta_alloc_cluster;
-		if (!btrfs_test_opt(root, SSD))
-			empty_cluster = 64 * 1024;
+		if (!btrfs_test_opt(root, SSD)) {
+			/* !SSD && SSD_SPREAD == -o mincluster.  */
+			if (btrfs_test_opt(root, SSD_SPREAD)) {
+				save_ptr = last_ptr;
+				hint_byte = save_ptr->window_start;
+				last_ptr = NULL;
+				use_cluster = false;
+			} else
+				empty_cluster = 64 * 1024;
+		}
 	}
 
 	if ((data & BTRFS_BLOCK_GROUP_DATA) && use_cluster &&
@@ -5556,6 +5564,8 @@ checks:
btrfs_add_free_space(used_block_group, offset,
 search_start - offset);
 	BUG_ON(offset > search_start);
+	if (save_ptr)
+		save_ptr->window_start = search_start + num_bytes;
if (used_block_group != block_group)
btrfs_put_block_group(used_block_group);
btrfs_put_block_group(block_group);
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 3aa56e4..953f7dd 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2579,6 +2579,7 @@ void btrfs_init_free_cluster(struct btrfs_free_cluster 
*cluster)
 	cluster->max_size = 0;
 	INIT_LIST_HEAD(&cluster->block_group_list);
 	cluster->block_group = NULL;
+	cluster->window_start = 0;
 }
 
 int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 26b13d7..32fe064 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -165,7 +165,7 @@ enum {
Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
Opt_enospc_debug, Opt_subvolrootid, Opt_defrag,
Opt_inode_cache, Opt_no_space_cache, Opt_recovery,
-   Opt_nocluster, Opt_cluster, Opt_err,
+   Opt_nocluster, Opt_cluster, Opt_mincluster, Opt_err,
 };
 
 static match_table_t tokens = {
@@ -202,6 +202,7 @@ static match_table_t tokens = {
 	{Opt_recovery, "recovery"},
 	{Opt_nocluster, "nocluster"},
 	{Opt_cluster, "cluster"},
+	{Opt_mincluster, "mincluster"},
{Opt_err, NULL},
 };
 
@@ -407,6 +408,11 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 			printk(KERN_INFO "btrfs: enabling alloc clustering\n");
 			btrfs_clear_opt(info->mount_opt, NO_ALLOC_CLUSTER);
 			break;
+		case Opt_mincluster:
+			printk(KERN_INFO "btrfs: enabling minimal alloc clustering\n");
+			btrfs_clear_opt(info->mount_opt, NO_ALLOC_CLUSTER);
+			btrfs_set_opt(info->mount_opt, SSD_SPREAD);
+			break;
 		case Opt_err:
 			printk(KERN_INFO "btrfs: unrecognized mount option "
 			       "'%s'\n", p);
@@ -706,9 +712,12 @@ static int btrfs_show_options(struct seq_file *seq, struct vfsmount *vfs)
 	}
 	if (btrfs_test_opt(root, NOSSD))
 		seq_puts(seq, ",nossd");
-	if (btrfs_test_opt(root, SSD_SPREAD))
-		seq_puts(seq, ",ssd_spread");
-	else if (btrfs_test_opt(root, SSD))
+	if (btrfs_test_opt(root, SSD_SPREAD)) {
+		if (btrfs_test_opt(root, SSD))
+			seq_puts(seq, ",ssd_spread");
+		else
+			seq_puts(seq, ",mincluster");
+   } else if (btrfs_test_opt(root

[PATCH 14/20] Btrfs: introduce option to rebalance only metadata

2011-11-28 Thread Alexandre Oliva
Experimental patch to be able to compact only the metadata after
excessive block groups are created.  I guess it should be implemented
as a balance option rather than a separate ioctl, but this was good
enough for me to try it.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/ioctl.c   |2 ++
 fs/btrfs/ioctl.h   |3 +++
 fs/btrfs/volumes.c |   33 -
 fs/btrfs/volumes.h |1 +
 4 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a90e749..6f53983 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3077,6 +3077,8 @@ long btrfs_ioctl(struct file *file, unsigned int
return btrfs_ioctl_dev_info(root, argp);
case BTRFS_IOC_BALANCE:
 		return btrfs_balance(root->fs_info->dev_root);
+	case BTRFS_IOC_BALANCE_METADATA:
+		return btrfs_balance_metadata(root->fs_info->dev_root);
case BTRFS_IOC_CLONE:
return btrfs_ioctl_clone(file, arg, 0, 0, 0);
case BTRFS_IOC_CLONE_RANGE:
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 252ae99..46bc428 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -277,4 +277,7 @@ struct btrfs_ioctl_logical_ino_args {
 #define BTRFS_IOC_LOGICAL_INO _IOWR(BTRFS_IOCTL_MAGIC, 36, \
struct btrfs_ioctl_ino_path_args)
 
+#define BTRFS_IOC_BALANCE_METADATA _IOW(BTRFS_IOCTL_MAGIC, 37, \
+   struct btrfs_ioctl_vol_args)
+
 #endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7b348c2..db4397d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2084,7 +2084,7 @@ static u64 div_factor(u64 num, int factor)
return num;
 }
 
-int btrfs_balance(struct btrfs_root *dev_root)
+static int btrfs_balance_skip(struct btrfs_root *dev_root, u64 skip_type)
 {
int ret;
 	struct list_head *devices = &dev_root->fs_info->fs_devices->devices;
@@ -2096,6 +2096,9 @@ int btrfs_balance(struct btrfs_root *dev_root)
 	struct btrfs_root *chunk_root = dev_root->fs_info->chunk_root;
 	struct btrfs_trans_handle *trans;
 	struct btrfs_key found_key;
+	struct btrfs_chunk *chunk;
+	u64 chunk_type;
+	bool skip;
 
 	if (dev_root->fs_info->sb->s_flags & MS_RDONLY)
return -EROFS;
@@ -2165,11 +2168,21 @@ int btrfs_balance(struct btrfs_root *dev_root)
if (found_key.offset == 0)
break;
 
+		if (skip_type) {
+			chunk = btrfs_item_ptr(path->nodes[0], path->slots[0],
+					       struct btrfs_chunk);
+			chunk_type = btrfs_chunk_type(path->nodes[0], chunk);
+			skip = (chunk_type & skip_type);
+		} else
+			skip = false;
+
 		btrfs_release_path(path);
-		ret = btrfs_relocate_chunk(chunk_root,
-					   chunk_root->root_key.objectid,
-					   found_key.objectid,
-					   found_key.offset);
+
+		ret = (skip ? 0 :
+		       btrfs_relocate_chunk(chunk_root,
+					    chunk_root->root_key.objectid,
+					    found_key.objectid,
+					    found_key.offset));
 		if (ret && ret != -ENOSPC)
goto error;
key.offset = found_key.offset - 1;
@@ -2181,6 +2194,16 @@ error:
return ret;
 }
 
+int btrfs_balance(struct btrfs_root *dev_root)
+{
+   return btrfs_balance_skip(dev_root, 0);
+}
+
+int btrfs_balance_metadata(struct btrfs_root *dev_root)
+{
+   return btrfs_balance_skip(dev_root, BTRFS_BLOCK_GROUP_DATA);
+}
+
 /*
  * shrinking a device means finding all of the device extents past
  * the new size, and then following the back refs to the chunks.
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 78f2d4d..6844010 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -229,6 +229,7 @@ struct btrfs_device *btrfs_find_device(struct btrfs_root 
*root, u64 devid,
 int btrfs_shrink_device(struct btrfs_device *device, u64 new_size);
 int btrfs_init_new_device(struct btrfs_root *root, char *path);
 int btrfs_balance(struct btrfs_root *dev_root);
+int btrfs_balance_metadata(struct btrfs_root *dev_root);
 int btrfs_chunk_readonly(struct btrfs_root *root, u64 chunk_offset);
 int find_free_dev_extent(struct btrfs_trans_handle *trans,
 struct btrfs_device *device, u64 num_bytes,
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 05/20] Btrfs: start search for new cluster at the beginning of the block group

2011-11-28 Thread Alexandre Oliva
Instead of starting at zero (offset is always zero), request a cluster
starting at search_start, that denotes the beginning of the current
block group.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/extent-tree.c |6 ++
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index bc0f13d..7edb9e6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5287,10 +5287,8 @@ alloc:
 		spin_lock(&last_ptr->refill_lock);
 		if (last_ptr->block_group &&
 		    (last_ptr->block_group->ro ||
-		    !block_group_bits(last_ptr->block_group, data))) {
-			offset = 0;
+		    !block_group_bits(last_ptr->block_group, data)))
goto refill_cluster;
-   }
 
offset = btrfs_alloc_from_cluster(block_group, last_ptr,
 num_bytes, search_start);
@@ -5341,7 +5339,7 @@ refill_cluster:
/* allocate a cluster in this block group */
ret = btrfs_find_space_cluster(trans, root,
   block_group, last_ptr,
-  offset, num_bytes,
+  search_start, num_bytes,
   empty_cluster + empty_size);
if (ret == 0) {
/*
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 20/20] Btrfs: don't waste metadata block groups for clustered allocation

2011-11-28 Thread Alexandre Oliva
We try to maintain about 1% of the filesystem space in free space in
data block groups, but we need not do that for metadata, since we only
allocate one block at a time.

This patch also moves the adjustment of flags to account for mixed
data/metadata block groups into the block protected by spin lock, and
before the point in which we now look at flags to decide whether or
not we should keep the free space buffer.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/extent-tree.c |   24 +---
 1 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3c649fe..cce452d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3228,7 +3228,7 @@ static void force_metadata_allocation(struct btrfs_fs_info *info)
 
 static int should_alloc_chunk(struct btrfs_root *root,
 			      struct btrfs_space_info *sinfo, u64 alloc_bytes,
-			      int force)
+			      u64 flags, int force)
 {
 	struct btrfs_block_rsv *global_rsv = &root->fs_info->global_block_rsv;
 	u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
@@ -3246,10 +3246,10 @@ static int should_alloc_chunk(struct btrfs_root *root,
 	num_allocated += global_rsv->size;
 
/*
-* in limited mode, we want to have some free space up to
+* in limited mode, we want to have some free data space up to
 * about 1% of the FS size.
 */
-	if (force == CHUNK_ALLOC_LIMITED) {
+	if (force == CHUNK_ALLOC_LIMITED && (flags & BTRFS_BLOCK_GROUP_DATA)) {
 		thresh = btrfs_super_total_bytes(root->fs_info->super_copy);
thresh = max_t(u64, 64 * 1024 * 1024,
   div_factor_fine(thresh, 1));
@@ -3310,7 +3310,16 @@ again:
return 0;
}
 
-   if (!should_alloc_chunk(extent_root, space_info, alloc_bytes, force)) {
+   /*
+* If we have mixed data/metadata chunks we want to make sure we keep
+* allocating mixed chunks instead of individual chunks.
+*/
+   if (btrfs_mixed_space_info(space_info))
+   flags |= (BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA);
+
+   if (!should_alloc_chunk(extent_root, space_info, alloc_bytes,
+   flags, force)) {
+		space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
 		spin_unlock(&space_info->lock);
 		return 0;
 	} else if (space_info->chunk_alloc) {
@@ -3336,13 +3345,6 @@ again:
}
 
/*
-* If we have mixed data/metadata chunks we want to make sure we keep
-* allocating mixed chunks instead of individual chunks.
-*/
-   if (btrfs_mixed_space_info(space_info))
-   flags |= (BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA);
-
-   /*
 * if we're doing a data chunk, go ahead and make sure that
 * we keep a reasonable number of metadata chunks allocated in the
 * FS as well.
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 09/20] Btrfs: skip allocation attempt from empty cluster

2011-11-28 Thread Alexandre Oliva
If we don't have a cluster, don't bother trying to allocate from it,
jumping right away to the attempt to allocate a new cluster.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/extent-tree.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9eec362..92e640b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5280,9 +5280,9 @@ alloc:
 * people trying to start a new cluster
 */
 		spin_lock(&last_ptr->refill_lock);
-		if (last_ptr->block_group &&
-		    (last_ptr->block_group->ro ||
-		    !block_group_bits(last_ptr->block_group, data)))
+		if (!last_ptr->block_group ||
+		    last_ptr->block_group->ro ||
+		    !block_group_bits(last_ptr->block_group, data))
goto refill_cluster;
 
offset = btrfs_alloc_from_cluster(block_group, last_ptr,
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 12/20] Btrfs: introduce verbose debug mode for patched clustered allocation recovery

2011-11-28 Thread Alexandre Oliva
This patch adds several debug messages that helped me track down
problems in the cluster allocation logic.  All the messages are
disabled by default, so that they're optimized away, but enabling the
commented-out settings of debug brings some helpful messages.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/extent-tree.c |  148 +++-
 1 files changed, 147 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 92e640b..823ab22 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5073,6 +5073,88 @@ enum btrfs_loop_type {
LOOP_NO_EMPTY_SIZE = 4,
 };
 
+/* ??? Move to free-space-cache.c? */
+static void
+btrfs_dump_free_space_tree (const char *kern, int debugid, int loop,
+			    int detailed, const char *what, const char *what2,
+			    unsigned long long prev, struct rb_node *node) {
+	struct btrfs_free_space *entry;
+	int entries = 0, frags = 0;
+	unsigned long long size = 0;
+	unsigned long bits = 0, i, p, q;
+
+	if (detailed)
+		printk("%sbtrfs %x.%i: %s %s %llx:\n",
+		       kern, debugid, loop, what, what2, prev);
+
+	while (node) {
+		entries++;
+		entry = rb_entry(node, struct btrfs_free_space, offset_index);
+		node = rb_next(&entry->offset_index);
+
+		size += entry->bytes;
+
+		if (detailed)
+			printk("%sbtrfs %x.%i:  +%llx,%llx%s\n",
+			       kern, debugid, loop,
+			       (long long)(entry->offset - prev),
+			       (unsigned long long)entry->bytes,
+			       entry->bitmap ? (detailed > 1 ? ":" : " bitmap") : "");
+
+		if (!entry->bitmap)
+			continue;
+
+		i = 0;
+#define BITS_PER_BITMAP (PAGE_CACHE_SIZE * 8)
+		do {
+			p = i;
+			i = find_next_bit (entry->bitmap, BITS_PER_BITMAP, i);
+			q = i;
+			i = find_next_zero_bit (entry->bitmap, BITS_PER_BITMAP, i);
+
+			if (i != q)
+				frags++;
+			bits += i - q;
+
+			if (detailed > 1)
+				printk("%sbtrfs %x.%i:   b+%lx,%lx\n",
+				       kern, debugid, loop, q - p, i - q);
+		} while (i < BITS_PER_BITMAP);
+#undef BITS_PER_BITMAP
+	}
+
+	if (detailed)
+		printk("%sbtrfs %x.%i:  entries %x size %llx bits %lx frags %x\n",
+		       kern, debugid, loop, entries, size, bits, frags);
+	else
+		printk("%sbtrfs %x.%i: %s %s %llx: e:%x s:%llx b:%lx f:%x\n",
+		       kern, debugid, loop, what, what2,
+		       prev, entries, size, bits, frags);
+}
+
+static void
+btrfs_dump_cluster (const char *kern, int debugid, int loop, int detailed,
+		    const char *what, struct btrfs_free_cluster *cluster) {
+	spin_lock (&cluster->lock);
+
+	btrfs_dump_free_space_tree (kern, debugid, loop,
+				    detailed, what, "cluster",
+				    cluster->window_start,
+				    rb_first(&cluster->root));
+
+	spin_unlock (&cluster->lock);
+}
+
+static void
+btrfs_dump_block_group_free_space (const char *kern, int debugid, int loop,
+				   int detailed, const char *what,
+				   struct btrfs_block_group_cache *block_group) {
+	btrfs_dump_free_space_tree (kern, debugid, loop,
+				    detailed, what, "block group",
+				    block_group->key.objectid,
+				    rb_first(&block_group->free_space_ctl->free_space_offset));
+}
+
 /*
  * walks the btree of allocated extents and find a hole of a given size.
  * The key ins is changed to record the hole:
@@ -5108,6 +5190,9 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
bool have_caching_bg = false;
u64 ideal_cache_percent = 0;
u64 ideal_cache_offset = 0;
+   int debug = 0;
+   int debugid = 0;
+   static atomic_t debugcnt;
 
 	WARN_ON(num_bytes < root->sectorsize);
btrfs_set_key_type(ins, BTRFS_EXTENT_ITEM_KEY);
@@ -5131,6 +5216,8 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 		allowed_chunk_alloc = 1;
 
 	if (data & BTRFS_BLOCK_GROUP_METADATA && use_cluster) {
+		/* debug = 1; */
+		debugid = atomic_inc_return(&debugcnt);
 		last_ptr = root->fs_info->meta_alloc_cluster;
if (!btrfs_test_opt(root, SSD))
empty_cluster = 64 * 1024;
@@ -5158,6 +5245,10 @@ static noinline int

[PATCH 02/20] Btrfs: initialize new bitmaps' list

2011-11-28 Thread Alexandre Oliva
We're failing to create clusters with bitmaps because
setup_cluster_no_bitmap checks that the list is empty before inserting
the bitmap entry in the list for setup_cluster_bitmap, but the list
field is only initialized when it is restored from the on-disk free
space cache, or when it is written out to disk.

Besides a potential race condition due to the multiple use of the list
field, filesystem performance severely degrades over time: as we use
up all non-bitmap free extents, the try-to-set-up-cluster dance is
done at every metadata block allocation.  For every block group, we
fail to set up a cluster, and after failing on them all up to twice,
we fall back to the much slower unclustered allocation.

To make matters worse, before the unclustered allocation, we try to
create new block groups until we reach the 1% threshold, which
introduces additional bitmaps and thus block groups that we'll iterate
over at each metadata block request.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/free-space-cache.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 6e5b7e4..ff179b1 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1470,6 +1470,7 @@ static void add_new_bitmap(struct btrfs_free_space_ctl *ctl,
 {
 	info->offset = offset_to_bitmap(ctl, offset);
 	info->bytes = 0;
+	INIT_LIST_HEAD(&info->list);
 	link_free_space(ctl, info);
 	ctl->total_bitmaps++;
 
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 15/20] Btrfs: activate allocation debugging

2011-11-28 Thread Alexandre Oliva
Activate various messages that help track down clustered allocation
problems, that are disabled and optimized out by default.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/extent-tree.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 823ab22..66edda2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5216,7 +5216,7 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 		allowed_chunk_alloc = 1;
 
 	if (data & BTRFS_BLOCK_GROUP_METADATA && use_cluster) {
-		/* debug = 1; */
+		debug = 1;
 		debugid = atomic_inc_return(&debugcnt);
 		last_ptr = root->fs_info->meta_alloc_cluster;
if (!btrfs_test_opt(root, SSD))
@@ -5393,7 +5393,7 @@ alloc:
goto checks;
}
 
-		/* debug = 2; */
+		debug = 2;
 		if (debug > 1)
 			printk(KERN_DEBUG "btrfs %x.%i: failed cluster alloc\n",
 			       debugid, loop);
@@ -5446,7 +5446,7 @@ refill_cluster:
 * this cluster didn't work out, free it and
 * start over
 */
-			/* debug = 2; */
+			debug = 2;
 			if ((debug > 1 || (debug && last_ptr->block_group)) &&
 			    last_ptr->window_start)
 				btrfs_dump_cluster(KERN_DEBUG, debugid, loop,
 						   0, "drop", last_ptr);
 			btrfs_return_cluster_to_free_space(NULL, last_ptr);
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 08/20] Btrfs: try to allocate from cluster even at LOOP_NO_EMPTY_SIZE

2011-11-28 Thread Alexandre Oliva
If we reach LOOP_NO_EMPTY_SIZE, we won't even try to use a cluster that
others might have set up.  Odds are that there won't be one, but if
someone else succeeded in setting it up, we might as well use it, even
if we don't try to set up a cluster again.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/extent-tree.c |   26 ++
 1 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 24eef3a..9eec362 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5271,15 +5271,10 @@ alloc:
 		spin_unlock(&block_group->free_space_ctl->tree_lock);
 
/*
-* Ok we want to try and use the cluster allocator, so lets look
-* there, unless we are on LOOP_NO_EMPTY_SIZE, since we will
-* have tried the cluster allocator plenty of times at this
-* point and not have found anything, so we are likely way too
-* fragmented for the clustering stuff to find anything, so lets
-* just skip it and let the allocator find whatever block it can
-* find
+* Ok we want to try and use the cluster allocator, so
+* lets look there
 */
-	if (last_ptr && loop < LOOP_NO_EMPTY_SIZE) {
+   if (last_ptr) {
/*
 * the refill lock keeps out other
 * people trying to start a new cluster
@@ -5328,6 +5323,20 @@ alloc:
}
 			spin_unlock(&last_ptr->lock);
 refill_cluster:
+   /* If we are on LOOP_NO_EMPTY_SIZE, we can't
+* set up a new clusters, so lets just skip it
+* and let the allocator find whatever block
+* it can find.  If we reach this point, we
+* will have tried the cluster allocator
+* plenty of times and not have found
+* anything, so we are likely way too
+* fragmented for the clustering stuff to find
+* anything.  */
+			if (loop >= LOOP_NO_EMPTY_SIZE) {
+				spin_unlock(&last_ptr->refill_lock);
+   goto unclustered_alloc;
+   }
+
/*
 * this cluster didn't work out, free it and
 * start over
@@ -5375,6 +5384,7 @@ refill_cluster:
goto loop;
}
 
+unclustered_alloc:
offset = btrfs_find_space_for_alloc(block_group, search_start,
num_bytes, empty_size);
/*
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
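
For context on where that hunk sits: find_free_extent() escalates through numbered LOOP_* stages, and the point of the patch is that an already-populated cluster may still be drained at the last stage even though refilling one is skipped. A standalone sketch of that fallback order, under the assumption that the stage names mirror the kernel's and everything else is simplified:

/* Standalone sketch of the fallback order described above; not btrfs code. */
#include <stdbool.h>
#include <stdio.h>

enum { LOOP_FIND_IDEAL, LOOP_CACHING_NOWAIT, LOOP_CACHING_WAIT,
       LOOP_ALLOC_CHUNK, LOOP_NO_EMPTY_SIZE };

struct cluster { bool set_up; };

static bool alloc_from_cluster(struct cluster *c) { return c->set_up; }
static bool refill_cluster(struct cluster *c)     { c->set_up = true; return true; }
static bool unclustered_alloc(void)               { return true; }

static const char *try_alloc(int loop, struct cluster *last_ptr)
{
	if (last_ptr) {
		/* Before the patch this whole block was skipped once
		 * loop == LOOP_NO_EMPTY_SIZE; now an existing cluster is
		 * still used, we just refuse to set up a new one. */
		if (alloc_from_cluster(last_ptr))
			return "reused existing cluster";
		if (loop < LOOP_NO_EMPTY_SIZE && refill_cluster(last_ptr) &&
		    alloc_from_cluster(last_ptr))
			return "refilled cluster";
	}
	return unclustered_alloc() ? "unclustered allocation" : "ENOSPC";
}

int main(void)
{
	struct cluster shared = { .set_up = true };	/* set up by someone else */
	printf("%s\n", try_alloc(LOOP_NO_EMPTY_SIZE, &shared));
	shared.set_up = false;
	printf("%s\n", try_alloc(LOOP_NO_EMPTY_SIZE, &shared));
	return 0;
}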


[PATCH 01/20] Btrfs: enable removal of second disk with raid1 metadata

2011-11-28 Thread Alexandre Oliva
Enable removal of a second disk even if that requires conversion of
metadata from raid1 to dup, but not when data would lose replication.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/volumes.c |6 +-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c37433d..7b348c2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1290,12 +1290,16 @@ int btrfs_rm_device(struct btrfs_root *root, char 
*device_path)
goto out;
}
 
-	if ((all_avail & BTRFS_BLOCK_GROUP_RAID1) &&
+	if ((root->fs_info->avail_data_alloc_bits & BTRFS_BLOCK_GROUP_RAID1) &&
 	    root->fs_info->fs_devices->num_devices <= 2) {
 		printk(KERN_ERR "btrfs: unable to go below two "
 		       "devices on raid1\n");
 		ret = -EINVAL;
 		goto out;
+	} else if ((all_avail & BTRFS_BLOCK_GROUP_RAID1) &&
+		   root->fs_info->fs_devices->num_devices <= 2) {
+		printk(KERN_ERR "btrfs: going below two devices "
+		       "will switch metadata from raid1 to dup\n");
 	}
 
 	if (strcmp(device_path, "missing") == 0) {
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 10/20] Btrfs: report reason for failed relocation

2011-11-28 Thread Alexandre Oliva
btrfs filesystem balance sometimes fails on corrupted filesystems, but
without any information that explains what the failure was to help
track down the problem.  This patch adds logging for nearly all error
conditions that may cause relocation to fail.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/relocation.c |8 
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index dff29d5..15a2270 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2496,6 +2496,7 @@ static int do_relocation(struct btrfs_trans_handle *trans,
 		if (!upper->eb) {
 			ret = btrfs_search_slot(trans, root, key, path, 0, 1);
 			if (ret < 0) {
+				printk(KERN_INFO "btrfs: searching slot %llu failed: %i\n",
+				       key->objectid, -ret);
err = ret;
break;
}
@@ -2543,6 +2544,7 @@ static int do_relocation(struct btrfs_trans_handle *trans,
btrfs_tree_unlock(eb);
free_extent_buffer(eb);
 			if (ret < 0) {
+				printk(KERN_INFO "btrfs: cow slot failed: %i\n",
+				       -ret);
err = ret;
goto next;
}
@@ -2730,6 +2732,7 @@ static int relocate_tree_block(struct btrfs_trans_handle 
*trans,
 	BUG_ON(node->processed);
 	root = select_one_root(trans, node);
 	if (root == ERR_PTR(-ENOENT)) {
+		printk(KERN_INFO "btrfs: could not find a root to update\n");
update_processed_blocks(rc, node);
goto out;
}
@@ -2756,6 +2759,8 @@ static int relocate_tree_block(struct btrfs_trans_handle 
*trans,
btrfs_release_path(path);
 			if (ret > 0)
 				ret = 0;
+			if (ret < 0)
+				printk(KERN_INFO "btrfs: failed to search slot %llu: %i\n",
+				       key->objectid, -ret);
}
if (!ret)
update_processed_blocks(rc, node);
@@ -2813,12 +2818,14 @@ int relocate_tree_blocks(struct btrfs_trans_handle 
*trans,
 					  block->level, block->bytenr);
 		if (IS_ERR(node)) {
 			err = PTR_ERR(node);
+			printk(KERN_INFO "btrfs: failed to build backref tree for key %llu byte %llu: %i\n",
+			       block->key.objectid, block->bytenr, -err);
goto out;
}
 
ret = relocate_tree_block(trans, rc, node, block-key,
  path);
 		if (ret < 0) {
+			printk(KERN_INFO "btrfs: failed to relocate tree block: %i\n",
+			       -ret);
if (ret != -EAGAIN || rb_node == rb_first(blocks))
err = ret;
goto out;
@@ -3770,6 +3777,7 @@ restart:
 			ret = relocate_tree_blocks(trans, rc, &blocks);
 			if (ret < 0) {
 				if (ret != -EAGAIN) {
+					printk(KERN_INFO "btrfs: failed to relocate blocks for key %llu: %i\n",
+					       key.objectid, -ret);
err = ret;
break;
}
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 06/20] Btrfs: skip block groups without enough space for a cluster

2011-11-28 Thread Alexandre Oliva
We test whether a block group has enough free space to hold the
requested block, but when we're doing clustered allocation, we can
save some cycles by testing whether it has enough room for the cluster
upfront, otherwise we end up attempting to set up a cluster and
failing.  Only in the NO_EMPTY_SIZE loop do we attempt an unclustered
allocation, and by then we'll have zeroed the cluster size, so this
patch won't stop us from using the block group as a last resort.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/extent-tree.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 7edb9e6..525ff20 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5264,7 +5264,7 @@ alloc:
 		spin_lock(&block_group->free_space_ctl->tree_lock);
 		if (cached &&
 		    block_group->free_space_ctl->free_space <
-		    num_bytes + empty_size) {
+		    num_bytes + empty_cluster + empty_size) {
 			spin_unlock(&block_group->free_space_ctl->tree_lock);
 			goto loop;
}
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
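
The cycles saved come from a single comparison made before any cluster work is attempted. A toy version of that pre-check, with made-up sizes (not kernel code), just to show what the extra empty_cluster term changes:

/* Toy version of the block-group pre-check described above. */
#include <stdbool.h>
#include <stdio.h>

static bool worth_trying(unsigned long long free_space,
			 unsigned long long num_bytes,
			 unsigned long long empty_cluster,
			 unsigned long long empty_size)
{
	/* Before: only free_space >= num_bytes + empty_size was required.
	 * After: also require room for the cluster we are about to build. */
	return free_space >= num_bytes + empty_cluster + empty_size;
}

int main(void)
{
	unsigned long long free_space = 96 * 1024;	/* 96 KiB left in group */
	unsigned long long num_bytes = 4 * 1024;	/* one metadata block */
	unsigned long long empty_cluster = 64 * 1024;	/* non-SSD metadata cluster */

	/* With 96 KiB free both tests pass; with only 32 KiB free the old
	 * test would still try (and fail) to build a cluster, the new one
	 * skips the block group up front. */
	printf("try this group: %s\n",
	       worth_trying(free_space, num_bytes, empty_cluster, 0) ? "yes" : "no");
	return 0;
}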


[PATCH 11/20] Btrfs: note when a bitmap is skipped because its list is in use

2011-11-28 Thread Alexandre Oliva
Bitmap lists serve two purposes: recording the order of loading/saving
on-disk free space caches, and setting up a list of bitmaps to try to
set up a cluster.  Complain if a list is unexpectedly busy.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/free-space-cache.c |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index ec23d43..dd7fe43 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -904,6 +904,7 @@ int __btrfs_write_out_cache(struct btrfs_root *root, struct 
inode *inode,
goto out_nospc;
 
 		if (e->bitmap) {
+			BUG_ON(!list_empty(&e->list));
 			list_add_tail(&e->list, &bitmap_list);
bitmaps++;
}
@@ -2380,6 +2381,9 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache 
*block_group,
 	while (entry->bitmap) {
 		if (list_empty(&entry->list))
 			list_add_tail(&entry->list, bitmaps);
+		else if (entry->bitmap)
+			printk(KERN_ERR "btrfs: not using (busy?!?) bitmap %lli\n",
+			       (unsigned long long)entry->offset);
node = rb_next(entry-offset_index);
if (!node)
return -ENOSPC;
@@ -2402,6 +2406,9 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache 
*block_group,
 		if (entry->bitmap) {
 			if (list_empty(&entry->list))
 				list_add_tail(&entry->list, bitmaps);
+			else
+				printk(KERN_ERR "btrfs: not using (busy?!?) bitmap %lli\n",
+				       (unsigned long long)entry->offset);
continue;
}
 
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 17/20] Btrfs: introduce -o cluster and -o nocluster

2011-11-28 Thread Alexandre Oliva
Introduce -o nocluster to disable the use of clusters for extent
allocation, and -o cluster to reverse it.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/ctree.h   |3 ++-
 fs/btrfs/extent-tree.c |2 +-
 fs/btrfs/super.c   |   16 ++--
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 04a5dfc..1deaf2d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -971,7 +971,7 @@ struct btrfs_fs_info {
 * is required instead of the faster short fsync log commits
 */
u64 last_trans_log_full_commit;
-   unsigned long mount_opt:20;
+   unsigned long mount_opt:28;
unsigned long compress_type:4;
u64 max_inline;
u64 alloc_start;
@@ -1413,6 +1413,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_AUTO_DEFRAG		(1 << 16)
 #define BTRFS_MOUNT_INODE_MAP_CACHE	(1 << 17)
 #define BTRFS_MOUNT_RECOVERY		(1 << 18)
+#define BTRFS_MOUNT_NO_ALLOC_CLUSTER	(1 << 19)
 
 #define btrfs_clear_opt(o, opt)	((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)  ((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 7064979..7ddbf9b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5186,7 +5186,7 @@ static noinline int find_free_extent(struct 
btrfs_trans_handle *trans,
bool found_uncached_bg = false;
bool failed_cluster_refill = false;
bool failed_alloc = false;
-   bool use_cluster = true;
+   bool use_cluster = !btrfs_test_opt(root, NO_ALLOC_CLUSTER);
bool have_caching_bg = false;
u64 ideal_cache_percent = 0;
u64 ideal_cache_offset = 0;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8bd9d6d..26b13d7 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -164,7 +164,8 @@ enum {
Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard,
Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
Opt_enospc_debug, Opt_subvolrootid, Opt_defrag,
-   Opt_inode_cache, Opt_no_space_cache, Opt_recovery, Opt_err,
+   Opt_inode_cache, Opt_no_space_cache, Opt_recovery,
+   Opt_nocluster, Opt_cluster, Opt_err,
 };
 
 static match_table_t tokens = {
@@ -199,6 +200,8 @@ static match_table_t tokens = {
 	{Opt_inode_cache, "inode_cache"},
 	{Opt_no_space_cache, "nospace_cache"},
 	{Opt_recovery, "recovery"},
+	{Opt_nocluster, "nocluster"},
+	{Opt_cluster, "cluster"},
{Opt_err, NULL},
 };
 
@@ -390,12 +393,19 @@ int btrfs_parse_options(struct btrfs_root *root, char 
*options)
 			btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
 			break;
 		case Opt_defrag:
-			printk(KERN_INFO "btrfs: enabling auto defrag");
+			printk(KERN_INFO "btrfs: enabling auto defrag\n");
 			btrfs_set_opt(info->mount_opt, AUTO_DEFRAG);
 			break;
 		case Opt_recovery:
 			printk(KERN_INFO "btrfs: enabling auto recovery");
 			btrfs_set_opt(info->mount_opt, RECOVERY);
+		case Opt_nocluster:
+			printk(KERN_INFO "btrfs: disabling alloc clustering\n");
+			btrfs_set_opt(info->mount_opt, NO_ALLOC_CLUSTER);
+			break;
+		case Opt_cluster:
+			printk(KERN_INFO "btrfs: enabling alloc clustering\n");
+			btrfs_clear_opt(info->mount_opt, NO_ALLOC_CLUSTER);
 			break;
case Opt_err:
printk(KERN_INFO btrfs: unrecognized mount option 
@@ -722,6 +732,8 @@ static int btrfs_show_options(struct seq_file *seq, struct vfsmount *vfs)
 		seq_puts(seq, ",autodefrag");
 	if (btrfs_test_opt(root, INODE_MAP_CACHE))
 		seq_puts(seq, ",inode_cache");
+	if (btrfs_test_opt(root, NO_ALLOC_CLUSTER))
+		seq_puts(seq, ",nocluster");
return 0;
 }
 
-- 
1.7.4.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
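
The mount_opt plumbing is just a bit field: each option is one bit, which is why the bitfield has to be widened from :20 to :28 before bit 19 can be used. A userspace sketch of the same set/clear/test pattern, with only the two flag names borrowed from the patch and everything else illustrative:

/* Userspace sketch of the BTRFS_MOUNT_* bit-flag pattern used above. */
#include <stdio.h>

#define MOUNT_RECOVERY		(1 << 18)
#define MOUNT_NO_ALLOC_CLUSTER	(1 << 19)	/* the bit the patch adds */

#define set_opt(o, opt)		((o) |= (opt))
#define clear_opt(o, opt)	((o) &= ~(opt))
#define test_opt(o, opt)	((o) & (opt))

int main(void)
{
	unsigned long mount_opt = 0;	/* must be wide enough for bit 19 */

	set_opt(mount_opt, MOUNT_NO_ALLOC_CLUSTER);	/* -o nocluster */
	printf("nocluster: %d\n", !!test_opt(mount_opt, MOUNT_NO_ALLOC_CLUSTER));

	clear_opt(mount_opt, MOUNT_NO_ALLOC_CLUSTER);	/* -o remount,cluster */
	printf("nocluster: %d\n", !!test_opt(mount_opt, MOUNT_NO_ALLOC_CLUSTER));
	return 0;
}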


[PATCH 13/20] Btrfs: revamp clustered allocation logic

2011-11-28 Thread Alexandre Oliva
Parameterize clusters on minimum total size, minimum chunk size and
minimum contiguous size for at least one chunk, without limits on
cluster, window or gap sizes.  Don't tolerate any fragmentation for
SSD_SPREAD; accept it for metadata, but try to keep data dense.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/free-space-cache.c |  112 +++
 1 files changed, 49 insertions(+), 63 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index dd7fe43..3aa56e4 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2284,23 +2284,23 @@ out:
 static int btrfs_bitmap_cluster(struct btrfs_block_group_cache *block_group,
struct btrfs_free_space *entry,
struct btrfs_free_cluster *cluster,
-   u64 offset, u64 bytes, u64 min_bytes)
+   u64 offset, u64 bytes,
+   u64 cont1_bytes, u64 min_bytes)
 {
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
unsigned long next_zero;
unsigned long i;
-   unsigned long search_bits;
-   unsigned long total_bits;
+   unsigned long want_bits;
+   unsigned long min_bits;
unsigned long found_bits;
unsigned long start = 0;
unsigned long total_found = 0;
int ret;
-   bool found = false;
 
 	i = offset_to_bit(entry->offset, block_group->sectorsize,
 			  max_t(u64, offset, entry->offset));
-	search_bits = bytes_to_bits(bytes, block_group->sectorsize);
-	total_bits = bytes_to_bits(min_bytes, block_group->sectorsize);
+	want_bits = bytes_to_bits(bytes, block_group->sectorsize);
+	min_bits = bytes_to_bits(min_bytes, block_group->sectorsize);
 
 again:
found_bits = 0;
@@ -2309,7 +2309,7 @@ again:
 	     i = find_next_bit(entry->bitmap, BITS_PER_BITMAP, i + 1)) {
 		next_zero = find_next_zero_bit(entry->bitmap,
 					       BITS_PER_BITMAP, i);
-		if (next_zero - i >= search_bits) {
+		if (next_zero - i >= min_bits) {
found_bits = next_zero - i;
break;
}
@@ -2319,10 +2319,9 @@ again:
if (!found_bits)
return -ENOSPC;
 
-   if (!found) {
+   if (!total_found) {
start = i;
cluster-max_size = 0;
-   found = true;
}
 
total_found += found_bits;
@@ -2330,13 +2329,8 @@ again:
 	if (cluster->max_size < found_bits * block_group->sectorsize)
 		cluster->max_size = found_bits * block_group->sectorsize;
 
-	if (total_found < total_bits) {
-		i = find_next_bit(entry->bitmap, BITS_PER_BITMAP, next_zero);
-		if (i - start > total_bits * 2) {
-			total_found = 0;
-			cluster->max_size = 0;
-			found = false;
-		}
+	if (total_found < want_bits || cluster->max_size < cont1_bytes) {
+   i = next_zero + 1;
goto again;
}
 
@@ -2352,23 +2346,23 @@ again:
 
 /*
  * This searches the block group for just extents to fill the cluster with.
+ * Try to find a cluster with at least bytes total bytes, at least one
+ * extent of cont1_bytes, and other clusters of at least min_bytes.
  */
 static noinline int
 setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
struct btrfs_free_cluster *cluster,
struct list_head *bitmaps, u64 offset, u64 bytes,
-   u64 min_bytes)
+   u64 cont1_bytes, u64 min_bytes)
 {
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
struct btrfs_free_space *first = NULL;
struct btrfs_free_space *entry = NULL;
-   struct btrfs_free_space *prev = NULL;
struct btrfs_free_space *last;
struct rb_node *node;
u64 window_start;
u64 window_free;
u64 max_extent;
-   u64 max_gap = 128 * 1024;
 
entry = tree_search_offset(ctl, offset, 0, 1);
if (!entry)
@@ -2378,8 +2372,8 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache 
*block_group,
 * We don't want bitmaps, so just move along until we find a normal
 * extent entry.
 */
-	while (entry->bitmap) {
-		if (list_empty(&entry->list))
+	while (entry->bitmap || entry->bytes < min_bytes) {
+		if (entry->bitmap && list_empty(&entry->list))
 			list_add_tail(&entry->list, bitmaps);
 		else if (entry->bitmap)
 			printk(KERN_ERR "btrfs: not using (busy?!?) bitmap %lli\n",
@@ -2395,12 +2389,9 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache 
*block_group,
max_extent = entry-bytes
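
The heart of the bitmap-side change above is the new success criterion: keep scanning runs of free bits until the total reaches want_bits and the largest single run reaches the cont1 requirement, instead of resetting when the window gets sparse. A compact standalone model of that scan over a plain bool array (parameters expressed in bits, everything else simplified and hypothetical):

/* Standalone model of the "total, chunk, contiguous" cluster criteria; the
 * real code walks a free-space bitmap, this walks a small bool array. */
#include <stdbool.h>
#include <stdio.h>

#define NBITS 32

static bool find_cluster(const bool *freebit, int want_bits, int cont1_bits,
			 int min_bits, int *start_out)
{
	int total = 0, max_run = 0, start = -1;

	for (int i = 0; i < NBITS; ) {
		if (!freebit[i]) { i++; continue; }
		int run = 0;
		while (i + run < NBITS && freebit[i + run])
			run++;
		if (run >= min_bits) {	/* runs smaller than min_bits are ignored */
			if (start < 0)
				start = i;
			total += run;
			if (run > max_run)
				max_run = run;
		}
		i += run;
		if (total >= want_bits && max_run >= cont1_bits) {
			*start_out = start;
			return true;
		}
	}
	return false;
}

int main(void)
{
	/* free space: runs of 3, 6 and 4 bits */
	bool freebit[NBITS] = { false };
	for (int i = 2; i < 5; i++)   freebit[i] = true;
	for (int i = 9; i < 15; i++)  freebit[i] = true;
	for (int i = 20; i < 24; i++) freebit[i] = true;

	int start;
	/* need 10 bits total, one run of at least 5, ignore runs under 2 */
	if (find_cluster(freebit, 10, 5, 2, &start))
		printf("cluster window starts at bit %d\n", start);
	else
		printf("ENOSPC\n");
	return 0;
}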

[PATCH] Btrfs: don't waste metadata block groups for clustered allocation

2011-11-26 Thread Alexandre Oliva
We try to maintain about 1% of the filesystem space in free space in
data block groups, but we need not do that for metadata, since we only
allocate one block at a time.

This patch also moves the adjustment of flags to account for mixed
data/metadata block groups into the block protected by spin lock, and
before the point in which we now look at flags to decide whether or
not we should keep the free space buffer.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/extent-tree.c |   26 ++
 1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 75bafe9..b3ec6c3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3228,7 +3228,7 @@ static void force_metadata_allocation(struct 
btrfs_fs_info *info)
 
 static int should_alloc_chunk(struct btrfs_root *root,
  struct btrfs_space_info *sinfo, u64 alloc_bytes,
- int force)
+ u64 flags, int force)
 {
 	struct btrfs_block_rsv *global_rsv = &root->fs_info->global_block_rsv;
 	u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
@@ -3246,10 +3246,10 @@ static int should_alloc_chunk(struct btrfs_root *root,
 		num_allocated += global_rsv->size;
 
/*
-* in limited mode, we want to have some free space up to
+* in limited mode, we want to have some free data space up to
 * about 1% of the FS size.
 */
-	if (force == CHUNK_ALLOC_LIMITED) {
+	if (force == CHUNK_ALLOC_LIMITED && (flags & BTRFS_BLOCK_GROUP_DATA)) {
 		thresh = btrfs_super_total_bytes(root->fs_info->super_copy);
thresh = max_t(u64, 64 * 1024 * 1024,
   div_factor_fine(thresh, 1));
@@ -3310,7 +3310,16 @@ again:
return 0;
}
 
-   if (!should_alloc_chunk(extent_root, space_info, alloc_bytes, force)) {
+   /*
+* If we have mixed data/metadata chunks we want to make sure we keep
+* allocating mixed chunks instead of individual chunks.
+*/
+   if (btrfs_mixed_space_info(space_info))
+   flags |= (BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA);
+
+   if (!should_alloc_chunk(extent_root, space_info, alloc_bytes,
+   flags, force)) {
+		space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
 		spin_unlock(&space_info->lock);
 		return 0;
 	} else if (space_info->chunk_alloc) {
@@ -3336,13 +3345,6 @@ again:
}
 
/*
-* If we have mixed data/metadata chunks we want to make sure we keep
-* allocating mixed chunks instead of individual chunks.
-*/
-   if (btrfs_mixed_space_info(space_info))
-   flags |= (BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA);
-
-   /*
 * if we're doing a data chunk, go ahead and make sure that
 * we keep a reasonable number of metadata chunks allocated in the
 * FS as well.
@@ -5312,7 +5314,7 @@ alloc:
/*
 * whoops, this cluster doesn't actually point to
 * this block group.  Get a ref on the block
-* group is does point to and try again
+* group it does point to and try again
 */
 		if (!last_ptr_loop && last_ptr->block_group &&
 		    last_ptr->block_group != block_group &&
-- 
1.7.4.4

-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
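
As a quick sanity check of the threshold being restricted to data: the 1% rule is simple arithmetic with a 64 MiB floor. A worked example (pure arithmetic, not kernel code; the helper name mirrors the one in the hunk, the numbers are invented):

/* Worked example of the CHUNK_ALLOC_LIMITED threshold described above. */
#include <stdio.h>

static unsigned long long div_factor_fine(unsigned long long num, int factor)
{
	return num * factor / 100;	/* factor is a percentage */
}

int main(void)
{
	unsigned long long fs_bytes = 500ULL << 30;		/* 500 GiB filesystem */
	unsigned long long thresh = div_factor_fine(fs_bytes, 1);

	if (thresh < (64ULL << 20))
		thresh = 64ULL << 20;	/* never less than 64 MiB */

	printf("1%% threshold: about %llu MiB (64 MiB floor)\n", thresh >> 20);
	/* With the patch, metadata allocations no longer trigger this
	 * limited-mode top-up at all; only data allocations consult it. */
	return 0;
}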


Re: Don't prevent removal of devices that break raid reqs

2011-11-19 Thread Alexandre Oliva
On Nov 11, 2011, Chris Mason chris.ma...@oracle.com wrote:

 On Thu, Nov 10, 2011 at 05:32:48PM -0200, Alexandre Oliva wrote:
 Instead of preventing the removal of devices that would render existing
 raid10 or raid1 impossible, warn but go ahead with it; the rebalancing
 code is smart enough to use different block group types.

 We'll need a --force or some kind.  There are definitely cases users
 have wanted to do this but it is rarely a good idea ;)

Even if it's just metadata that will turn from raid1 to dup, as in the
revised patch below?

From 276b1af70556bf5bdbaa1f81cb630d6c83962323 Mon Sep 17 00:00:00 2001
From: Alexandre Oliva lxol...@fsfla.org
Date: Tue, 8 Nov 2011 12:33:11 -0200
Subject: [PATCH 1/8] Btrfs: enable removal of second disk with raid1 metadata

Enable removal of a second disk even if that requires conversion of
metadata from raid1 to dup, but not when data would lose replication.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/volumes.c |6 +-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c37433d..7b348c2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1290,12 +1290,16 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 		goto out;
 	}
 
-	if ((all_avail & BTRFS_BLOCK_GROUP_RAID1) &&
+	if ((root->fs_info->avail_data_alloc_bits & BTRFS_BLOCK_GROUP_RAID1) &&
 	    root->fs_info->fs_devices->num_devices <= 2) {
 		printk(KERN_ERR "btrfs: unable to go below two "
 		       "devices on raid1\n");
 		ret = -EINVAL;
 		goto out;
+	} else if ((all_avail & BTRFS_BLOCK_GROUP_RAID1) &&
+		   root->fs_info->fs_devices->num_devices <= 2) {
+		printk(KERN_ERR "btrfs: going below two devices "
+		       "will switch metadata from raid1 to dup\n");
 	}
 
 	if (strcmp(device_path, "missing") == 0) {
-- 
1.7.4.4



-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer


Re: Revamp cluster allocation logic

2011-11-19 Thread Alexandre Oliva
On Nov 10, 2011, Alexandre Oliva ol...@lsd.ic.unicamp.br wrote:

 These are patches I posted before, except these are based on cmason's
 for-linus.  Reposting at josef's request.

Reposting again, at josef's request, this time consolidating the 3
patches into one.

From 349a2a26d97c6497f7e4df55b1bdb2f93a673376 Mon Sep 17 00:00:00 2001
From: Alexandre Oliva lxol...@fsfla.org
Date: Fri, 14 Oct 2011 12:10:36 -0300
Subject: [PATCH 4/8] Btrfs: revamp clustered allocation logic

Parameterize clusters on minimum total size, minimum chunk size and
minimum contiguous size for at least one chunk, without limits on
cluster, window or gap sizes.  Don't tolerate any fragmentation for
SSD_SPREAD; accept it for metadata, but try to keep data dense.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/free-space-cache.c |  114 ++
 1 files changed, 49 insertions(+), 65 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 181760f..7fe88b5 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2271,23 +2271,23 @@ out:
 static int btrfs_bitmap_cluster(struct btrfs_block_group_cache *block_group,
 struct btrfs_free_space *entry,
 struct btrfs_free_cluster *cluster,
-u64 offset, u64 bytes, u64 min_bytes)
+u64 offset, u64 bytes,
+u64 cont1_bytes, u64 min_bytes)
 {
 	struct btrfs_free_space_ctl *ctl = block_group-free_space_ctl;
 	unsigned long next_zero;
 	unsigned long i;
-	unsigned long search_bits;
-	unsigned long total_bits;
+	unsigned long want_bits;
+	unsigned long min_bits;
 	unsigned long found_bits;
 	unsigned long start = 0;
 	unsigned long total_found = 0;
 	int ret;
-	bool found = false;
 
 	i = offset_to_bit(entry-offset, block_group-sectorsize,
 			  max_t(u64, offset, entry-offset));
-	search_bits = bytes_to_bits(bytes, block_group-sectorsize);
-	total_bits = bytes_to_bits(min_bytes, block_group-sectorsize);
+	want_bits = bytes_to_bits(bytes, block_group-sectorsize);
+	min_bits = bytes_to_bits(min_bytes, block_group-sectorsize);
 
 again:
 	found_bits = 0;
@@ -2296,7 +2296,7 @@ again:
 	 i = find_next_bit(entry-bitmap, BITS_PER_BITMAP, i + 1)) {
 		next_zero = find_next_zero_bit(entry-bitmap,
 	   BITS_PER_BITMAP, i);
-		if (next_zero - i >= search_bits) {
+		if (next_zero - i >= min_bits) {
 			found_bits = next_zero - i;
 			break;
 		}
@@ -2306,23 +2306,16 @@ again:
 	if (!found_bits)
 		return -ENOSPC;
 
-	if (!found) {
+	if (!total_found)
 		start = i;
-		found = true;
-	}
 
 	total_found += found_bits;
 
 	if (cluster->max_size < found_bits * block_group->sectorsize)
 		cluster->max_size = found_bits * block_group->sectorsize;
 
-	if (total_found < total_bits) {
+	if (total_found < want_bits || cluster->max_size < cont1_bytes) {
 		i = find_next_bit(entry->bitmap, BITS_PER_BITMAP, next_zero);
-		if (i - start  total_bits * 2) {
-			total_found = 0;
-			cluster-max_size = 0;
-			found = false;
-		}
 		goto again;
 	}
 
@@ -2338,23 +2331,23 @@ again:
 
 /*
  * This searches the block group for just extents to fill the cluster with.
+ * Try to find a cluster with at least bytes total bytes, at least one
+ * extent of cont1_bytes, and other clusters of at least min_bytes.
  */
 static noinline int
 setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
 			struct btrfs_free_cluster *cluster,
 			struct list_head *bitmaps, u64 offset, u64 bytes,
-			u64 min_bytes)
+			u64 cont1_bytes, u64 min_bytes)
 {
 	struct btrfs_free_space_ctl *ctl = block_group-free_space_ctl;
 	struct btrfs_free_space *first = NULL;
 	struct btrfs_free_space *entry = NULL;
-	struct btrfs_free_space *prev = NULL;
 	struct btrfs_free_space *last;
 	struct rb_node *node;
 	u64 window_start;
 	u64 window_free;
 	u64 max_extent;
-	u64 max_gap = 128 * 1024;
 
 	entry = tree_search_offset(ctl, offset, 0, 1);
 	if (!entry)
@@ -2364,8 +2357,8 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
 	 * We don't want bitmaps, so just move along until we find a normal
 	 * extent entry.
 	 */
-	while (entry->bitmap) {
-		if (list_empty(&entry->list))
+	while (entry->bitmap || entry->bytes < min_bytes) {
+		if (entry->bitmap && list_empty(&entry->list))
 			list_add_tail(&entry->list, bitmaps);
 		node = rb_next(&entry->offset_index);
 		if (!node)
@@ -2378,12 +2371,9 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
 	max_extent = entry-bytes;
 	first = entry;
 	last = entry;
-	prev = entry;
 
-	while (window_free <= min_bytes) {
-		node = rb_next(&entry->offset_index);
-		if (!node)
-			return -ENOSPC;
+	for (node = rb_next(&entry->offset_index); node;
+	     node = rb_next(&entry->offset_index)) {
 		entry = rb_entry(node, struct btrfs_free_space, offset_index);
 
 		if (entry-bitmap) {
@@ -2392,26 +2382,18 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
 			continue;
 		}
 
-		/*
-		 * we haven't filled the empty size and the window is
-		 * very large

report relocation failures

2011-11-19 Thread Alexandre Oliva
I've had some corrupted filesystems that failed to balance and to remove
devices.  It was slightly annoying that btrfs would exit with a nonzero
status, but no information about the error was logged anywhere.  This
patch introduces some error reporting, catching the one error I was
running into: -ENOENT looking for a backref, presumably because of outdated
metadata that ended up being used as if it were still live.

I ended up losing the filesystem before I could figure out what exactly
the problem was, but with this info it would hopefully not take as long
to track it down.

From 2bbc4ae372f8ca31701db8ed0cf8e15edf76311e Mon Sep 17 00:00:00 2001
From: Alexandre Oliva lxol...@fsfla.org
Date: Wed, 16 Nov 2011 01:25:06 -0200
Subject: [PATCH 6/8] Btrfs: report reason for failed relocation

btrfs filesystem balance sometimes fails on corrupted filesystems, but
without any information that explains what the failure was to help
track down the problem.  This patch adds logging for nearly all error
conditions that may cause relocation to fail.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/relocation.c |8 
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index dff29d5..15a2270 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2496,6 +2496,7 @@ static int do_relocation(struct btrfs_trans_handle *trans,
 		if (!upper-eb) {
 			ret = btrfs_search_slot(trans, root, key, path, 0, 1);
 			if (ret < 0) {
+				printk(KERN_INFO "btrfs: searching slot %llu failed: %i\n", key->objectid, -ret);
 err = ret;
 break;
 			}
@@ -2543,6 +2544,7 @@ static int do_relocation(struct btrfs_trans_handle *trans,
 			btrfs_tree_unlock(eb);
 			free_extent_buffer(eb);
 			if (ret < 0) {
+				printk(KERN_INFO "btrfs: cow slot failed: %i\n", -ret);
 err = ret;
 goto next;
 			}
@@ -2730,6 +2732,7 @@ static int relocate_tree_block(struct btrfs_trans_handle *trans,
 	BUG_ON(node->processed);
 	root = select_one_root(trans, node);
 	if (root == ERR_PTR(-ENOENT)) {
+		printk(KERN_INFO "btrfs: could not find a root to update\n");
 		update_processed_blocks(rc, node);
 		goto out;
 	}
@@ -2756,6 +2759,8 @@ static int relocate_tree_block(struct btrfs_trans_handle *trans,
 			btrfs_release_path(path);
 			if (ret > 0)
 				ret = 0;
+			if (ret < 0)
+				printk(KERN_INFO "btrfs: failed to search slot %llu: %i\n", key->objectid, -ret);
 		}
 		if (!ret)
 			update_processed_blocks(rc, node);
@@ -2813,12 +2818,14 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans,
 					  block->level, block->bytenr);
 		if (IS_ERR(node)) {
 			err = PTR_ERR(node);
+			printk(KERN_INFO "btrfs: failed to build backref tree for key %llu byte %llu: %i\n", block->key.objectid, block->bytenr, -err);
 			goto out;
 		}
 
 		ret = relocate_tree_block(trans, rc, node, block-key,
 	  path);
 		if (ret < 0) {
+			printk(KERN_INFO "btrfs: failed to relocate tree block: %i\n", -ret);
 			if (ret != -EAGAIN || rb_node == rb_first(blocks))
 err = ret;
 			goto out;
@@ -3770,6 +3777,7 @@ restart:
 			ret = relocate_tree_blocks(trans, rc, &blocks);
 			if (ret < 0) {
 				if (ret != -EAGAIN) {
+					printk(KERN_INFO "btrfs: failed to relocate blocks for key %llu: %i\n", key.objectid, -ret);
 	err = ret;
 	break;
 }
-- 
1.7.4.4



-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer


Re: Introduce option to rebalance only metadata

2011-11-15 Thread Alexandre Oliva
On Nov 15, 2011, Ilya Dryomov idryo...@gmail.com wrote:

 And the exact command to mimic your patch is

 btrfs fi restripe start -m <mount point>

Thanks.  I wasn't aware of the restripe patch when I wrote this Quick
Hack (TM).

-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


revised -o nocluster, and -o cluster to reverse on remount

2011-11-10 Thread Alexandre Oliva
Here's a revised version of the -o nocluster patch, updated for cmason's
for-linus branch, plus a separate -o cluster option so that the setting
can be toggled back and forth on remount.

One thing I'm not sure about is whether -o remount,nocluster will release a
cluster that may have been allocated before the remount.  Please keep
that in mind before merging the patch.

From a3323c03f1b3d2cfeb4905268d117426232d4a3b Mon Sep 17 00:00:00 2001
From: Alexandre Oliva lxol...@fsfla.org
Date: Sat, 29 Oct 2011 02:20:55 -0200
Subject: [PATCH 4/8] Disable clustered allocation with -o nocluster

Introduce -o nocluster to disable the use of clusters for extent
allocation.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/ctree.h   |1 +
 fs/btrfs/extent-tree.c |2 +-
 fs/btrfs/super.c   |   11 +--
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b9ba59f..324df91 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1410,6 +1410,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_AUTO_DEFRAG		(1 << 16)
 #define BTRFS_MOUNT_INODE_MAP_CACHE	(1 << 17)
 #define BTRFS_MOUNT_RECOVERY		(1 << 18)
+#define BTRFS_MOUNT_NO_ALLOC_CLUSTER	(1 << 19)
 
 #define btrfs_clear_opt(o, opt)		((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)		((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 18ea90c..767edac 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5051,7 +5051,7 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 	bool found_uncached_bg = false;
 	bool failed_cluster_refill = false;
 	bool failed_alloc = false;
-	bool use_cluster = true;
+	bool use_cluster = !btrfs_test_opt(root, NO_ALLOC_CLUSTER);
 	bool have_caching_bg = false;
 	u64 ideal_cache_percent = 0;
 	u64 ideal_cache_offset = 0;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index dcd5aef..988e697 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -164,7 +164,8 @@ enum {
 	Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard,
 	Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
 	Opt_enospc_debug, Opt_subvolrootid, Opt_defrag,
-	Opt_inode_cache, Opt_no_space_cache, Opt_recovery, Opt_err,
+	Opt_inode_cache, Opt_no_space_cache, Opt_recovery,
+	Opt_nocluster, Opt_err,
 };
 
 static match_table_t tokens = {
@@ -199,6 +200,7 @@ static match_table_t tokens = {
 	{Opt_inode_cache, "inode_cache"},
 	{Opt_no_space_cache, "no_space_cache"},
 	{Opt_recovery, "recovery"},
+	{Opt_nocluster, "nocluster"},
 	{Opt_err, NULL},
 };
 
@@ -390,12 +392,15 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 			btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
 			break;
 		case Opt_defrag:
-			printk(KERN_INFO "btrfs: enabling auto defrag");
+			printk(KERN_INFO "btrfs: enabling auto defrag\n");
 			btrfs_set_opt(info->mount_opt, AUTO_DEFRAG);
 			break;
 		case Opt_recovery:
 			printk(KERN_INFO "btrfs: enabling auto recovery");
 			btrfs_set_opt(info->mount_opt, RECOVERY);
+		case Opt_nocluster:
+			printk(KERN_INFO "btrfs: disabling alloc clustering\n");
+			btrfs_set_opt(info->mount_opt, NO_ALLOC_CLUSTER);
 			break;
 		case Opt_err:
 			printk(KERN_INFO btrfs: unrecognized mount option 
@@ -721,6 +726,8 @@ static int btrfs_show_options(struct seq_file *seq, struct vfsmount *vfs)
 		seq_puts(seq, ",autodefrag");
 	if (btrfs_test_opt(root, INODE_MAP_CACHE))
 		seq_puts(seq, ",inode_cache");
+	if (btrfs_test_opt(root, NO_ALLOC_CLUSTER))
+		seq_puts(seq, ",nocluster");
 	return 0;
 }
 
-- 
1.7.4.4

From 572ec833d94278e7eda7c274962165c70d9154e5 Mon Sep 17 00:00:00 2001
From: Alexandre Oliva lxol...@fsfla.org
Date: Sun, 6 Nov 2011 23:51:08 -0200
Subject: [PATCH 5/8] Add -o cluster, so that nocluster can be disabled with
 remount.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/super.c |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 988e697..2baba99 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -165,7 +165,7 @@ enum {
 	Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
 	Opt_enospc_debug, Opt_subvolrootid, Opt_defrag,
 	Opt_inode_cache, Opt_no_space_cache, Opt_recovery,
-	Opt_nocluster, Opt_err,
+	Opt_nocluster, Opt_cluster, Opt_err,
 };
 
 static match_table_t tokens = {
@@ -201,6 +201,7 @@ static match_table_t tokens = {
 	{Opt_no_space_cache, no_space_cache},
 	{Opt_recovery, recovery},
 	{Opt_nocluster, nocluster},
+	{Opt_cluster, cluster},
 	{Opt_err, NULL},
 };
 
@@ -402,6 +403,10 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 			printk(KERN_INFO btrfs: disabling alloc clustering\n);
 			btrfs_set_opt(info-mount_opt, NO_ALLOC_CLUSTER);
 			break;
+		case Opt_cluster:
+			printk(KERN_INFO btrfs: enabling alloc clustering\n);
+			btrfs_clear_opt(info-mount_opt, NO_ALLOC_CLUSTER);
+			break;
 		case Opt_err:
 			printk(KERN_INFO btrfs: unrecognized mount

Don't prevent removal of devices that break raid reqs

2011-11-10 Thread Alexandre Oliva
Instead of preventing the removal of devices that would render existing
raid10 or raid1 impossible, warn but go ahead with it; the rebalancing
code is smart enough to use different block group types.

Should the refusal remain, so that we'd only proceed with a
newly-introduced --force option or so?

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/volumes.c |   12 
 1 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4d5b29f..507afca 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1281,18 +1281,14 @@ int btrfs_rm_device(struct btrfs_root *root, char 
*device_path)
 
 	if ((all_avail & BTRFS_BLOCK_GROUP_RAID10) &&
 	    root->fs_info->fs_devices->num_devices <= 4) {
-		printk(KERN_ERR "btrfs: unable to go below four devices "
-		       "on raid10\n");
-		ret = -EINVAL;
-		goto out;
+		printk(KERN_ERR "btrfs: going below four devices "
+		       "will turn raid10 into raid1\n");
 	}
 
 	if ((all_avail & BTRFS_BLOCK_GROUP_RAID1) &&
 	    root->fs_info->fs_devices->num_devices <= 2) {
-		printk(KERN_ERR "btrfs: unable to go below two "
-		       "devices on raid1\n");
-		ret = -EINVAL;
-		goto out;
+		printk(KERN_ERR "btrfs: going below two devices "
+		       "will lose raid1 redundancy\n");
}
 
if (strcmp(device_path, missing) == 0) {
-- 
1.7.4.4


-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
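
The checks being relaxed are plain block-group-flag and device-count comparisons, which also makes Chris's --force suggestion from the earlier thread easy to picture. A standalone sketch of the before/after behaviour (the BG_* flag values and the force parameter are illustrative, not the real BTRFS_BLOCK_GROUP_* bits or an existing option):

/* Sketch of the device-removal check discussed above; not kernel code. */
#include <stdbool.h>
#include <stdio.h>

#define BG_RAID1	(1 << 0)	/* illustrative flag values */
#define BG_RAID10	(1 << 1)

static bool may_remove_device(unsigned all_avail, int num_devices, bool force)
{
	if ((all_avail & BG_RAID10) && num_devices <= 4) {
		printf("warning: going below four devices turns raid10 into raid1\n");
		if (!force)
			return false;	/* old behaviour: refuse with -EINVAL */
	}
	if ((all_avail & BG_RAID1) && num_devices <= 2) {
		printf("warning: going below two devices loses raid1 redundancy\n");
		if (!force)
			return false;
	}
	return true;	/* patched behaviour: warn and proceed */
}

int main(void)
{
	/* A two-device raid1 filesystem: refused without --force, allowed
	 * with it, roughly the compromise asked for in the thread. */
	printf("plain:  %s\n", may_remove_device(BG_RAID1, 2, false) ? "ok" : "refused");
	printf("forced: %s\n", may_remove_device(BG_RAID1, 2, true) ? "ok" : "refused");
	return 0;
}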


Introduce option to rebalance only metadata

2011-11-10 Thread Alexandre Oliva
Experimental patch to be able to compact only the metadata after
clustered allocation allocated lots of unnecessary metadata block
groups.  It's also useful to measure performance differences between
-o cluster and -o nocluster.

I guess it should be implemented as a balance option rather than a
separate ioctl, but this was good enough for me to try it.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/ioctl.c   |2 ++
 fs/btrfs/ioctl.h   |3 +++
 fs/btrfs/volumes.c |   33 -
 fs/btrfs/volumes.h |1 +
 4 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4a34c47..69bf6f2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3074,6 +3074,8 @@ long btrfs_ioctl(struct file *file, unsigned int
return btrfs_ioctl_dev_info(root, argp);
case BTRFS_IOC_BALANCE:
return btrfs_balance(root-fs_info-dev_root);
+   case BTRFS_IOC_BALANCE_METADATA:
+   return btrfs_balance_metadata(root-fs_info-dev_root);
case BTRFS_IOC_CLONE:
return btrfs_ioctl_clone(file, arg, 0, 0, 0);
case BTRFS_IOC_CLONE_RANGE:
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 252ae99..46bc428 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -277,4 +277,7 @@ struct btrfs_ioctl_logical_ino_args {
 #define BTRFS_IOC_LOGICAL_INO _IOWR(BTRFS_IOCTL_MAGIC, 36, \
struct btrfs_ioctl_ino_path_args)
 
+#define BTRFS_IOC_BALANCE_METADATA _IOW(BTRFS_IOCTL_MAGIC, 37, \
+   struct btrfs_ioctl_vol_args)
+
 #endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f8e29431..4d5b29f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2077,7 +2077,7 @@ static u64 div_factor(u64 num, int factor)
return num;
 }
 
-int btrfs_balance(struct btrfs_root *dev_root)
+static int btrfs_balance_skip(struct btrfs_root *dev_root, u64 skip_type)
 {
int ret;
 	struct list_head *devices = &dev_root->fs_info->fs_devices->devices;
@@ -2089,6 +2089,9 @@ int btrfs_balance(struct btrfs_root *dev_root)
 	struct btrfs_root *chunk_root = dev_root->fs_info->chunk_root;
struct btrfs_trans_handle *trans;
struct btrfs_key found_key;
+   struct btrfs_chunk *chunk;
+   u64 chunk_type;
+   bool skip;
 
 	if (dev_root->fs_info->sb->s_flags & MS_RDONLY)
return -EROFS;
@@ -2158,11 +2161,21 @@ int btrfs_balance(struct btrfs_root *dev_root)
if (found_key.offset == 0)
break;
 
+   if (skip_type) {
+			chunk = btrfs_item_ptr(path->nodes[0], path->slots[0],
+					       struct btrfs_chunk);
+			chunk_type = btrfs_chunk_type(path->nodes[0], chunk);
+			skip = (chunk_type & skip_type);
+   } else
+   skip = false;
+
btrfs_release_path(path);
-		ret = btrfs_relocate_chunk(chunk_root,
-					   chunk_root->root_key.objectid,
-					   found_key.objectid,
-					   found_key.offset);
+
+		ret = (skip ? 0 :
+		       btrfs_relocate_chunk(chunk_root,
+					    chunk_root->root_key.objectid,
+					    found_key.objectid,
+					    found_key.offset));
 		if (ret && ret != -ENOSPC)
goto error;
key.offset = found_key.offset - 1;
@@ -2174,6 +2187,16 @@ error:
return ret;
 }
 
+int btrfs_balance(struct btrfs_root *dev_root)
+{
+   return btrfs_balance_skip(dev_root, 0);
+}
+
+int btrfs_balance_metadata(struct btrfs_root *dev_root)
+{
+   return btrfs_balance_skip(dev_root, BTRFS_BLOCK_GROUP_DATA);
+}
+
 /*
  * shrinking a device means finding all of the device extents past
  * the new size, and then following the back refs to the chunks.
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index ab5b1c4..c467499 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -223,6 +223,7 @@ struct btrfs_device *btrfs_find_device(struct btrfs_root 
*root, u64 devid,
 int btrfs_shrink_device(struct btrfs_device *device, u64 new_size);
 int btrfs_init_new_device(struct btrfs_root *root, char *path);
 int btrfs_balance(struct btrfs_root *dev_root);
+int btrfs_balance_metadata(struct btrfs_root *dev_root);
 int btrfs_chunk_readonly(struct btrfs_root *root, u64 chunk_offset);
 int find_free_dev_extent(struct btrfs_trans_handle *trans,
 struct btrfs_device *device, u64 num_bytes,
-- 
1.7.4.4


-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http
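
The metadata-only balance above is nothing more than the normal balance loop with a bitmask filter on each chunk's type. A sketch of that filter in isolation (the BLOCK_GROUP_* values here are illustrative, not the real btrfs flag bits):

/* Sketch of the skip_type filter used by the balance-md hack above. */
#include <stdbool.h>
#include <stdio.h>

#define BLOCK_GROUP_DATA	(1 << 0)	/* illustrative values */
#define BLOCK_GROUP_METADATA	(1 << 1)
#define BLOCK_GROUP_SYSTEM	(1 << 2)

/* Relocate the chunk unless any of its type bits is in skip_type. */
static bool should_relocate(unsigned chunk_type, unsigned skip_type)
{
	return !(chunk_type & skip_type);
}

int main(void)
{
	unsigned skip = BLOCK_GROUP_DATA;	/* balance-md: skip data chunks */

	printf("data chunk:     %s\n",
	       should_relocate(BLOCK_GROUP_DATA, skip) ? "relocate" : "skip");
	printf("metadata chunk: %s\n",
	       should_relocate(BLOCK_GROUP_METADATA, skip) ? "relocate" : "skip");
	printf("mixed chunk:    %s\n",
	       should_relocate(BLOCK_GROUP_DATA | BLOCK_GROUP_METADATA, skip)
	       ? "relocate" : "skip");
	return 0;
}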

Re: Introduce option to rebalance only metadata

2011-11-10 Thread Alexandre Oliva
On Nov 10, 2011, Alexandre Oliva ol...@lsd.ic.unicamp.br wrote:

 Experimental patch to be able to compact only the metadata after
 clustered allocation allocated lots of unnecessary metadata block
 groups.  It's also useful to measure performance differences between
 -o cluster and -o nocluster.

 I guess it should be implemented as a balance option rather than a
 separate ioctl, but this was good enough for me to try it.

And here's a corresponding patch for the btrfs program, on a (probably
very old) btrfs-progs tree.

From 8765d64f95966eec28cad83bd870fc2270afaebd Mon Sep 17 00:00:00 2001
From: Alexandre Oliva lxol...@fsfla.org
Date: Thu, 10 Nov 2011 17:35:29 -0200
Subject: [PATCH] Introduce balance-md to balance metadata only.

Patch for btrfs to use a separate experimental IOCTL to rebalance
only metadata block groups.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 btrfs.c  |4 
 btrfs_cmds.c |   25 +
 btrfs_cmds.h |1 +
 ioctl.h  |3 +++
 4 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/btrfs.c b/btrfs.c
index 46314cf..9edaebe 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -95,6 +95,10 @@ static struct Command commands[] = {
 	  "filesystem balance", "<path>\n"
 		"Balance the chunks across the device."
 	},
+	{ do_balance_md, 1,
+	  "filesystem balance-md", "<path>\n"
+		"Balance the chunks across the device."
+	},
 	{ do_scan,
 	  999, device scan, [device [device..]\n
 		Scan all device for or the passed device for a btrfs\n
diff --git a/btrfs_cmds.c b/btrfs_cmds.c
index 8031c58..b8f4c05 100644
--- a/btrfs_cmds.c
+++ b/btrfs_cmds.c
@@ -776,6 +776,31 @@ int do_balance(int argc, char **argv)
 	}
 	return 0;
 }
+
+int do_balance_md(int argc, char **argv)
+{
+
+	int	fdmnt, ret=0;
+	struct btrfs_ioctl_vol_args args;
+	char	*path = argv[1];
+
+	fdmnt = open_file_or_dir(path);
+	if (fdmnt < 0) {
+		fprintf(stderr, "ERROR: can't access to '%s'\n", path);
+		return 12;
+	}
+
+	memset(&args, 0, sizeof(args));
+	ret = ioctl(fdmnt, BTRFS_IOC_BALANCE_METADATA, &args);
+	close(fdmnt);
+	if(ret<0){
+		fprintf(stderr, "ERROR: balancing '%s'\n", path);
+
+		return 19;
+	}
+	return 0;
+}
+
 int do_remove_volume(int nargs, char **args)
 {
 
diff --git a/btrfs_cmds.h b/btrfs_cmds.h
index 7bde191..96cab6d 100644
--- a/btrfs_cmds.h
+++ b/btrfs_cmds.h
@@ -23,6 +23,7 @@ int do_defrag(int argc, char **argv);
 int do_show_filesystem(int nargs, char **argv);
 int do_add_volume(int nargs, char **args);
 int do_balance(int nargs, char **argv);
+int do_balance_md(int nargs, char **argv);
 int do_remove_volume(int nargs, char **args);
 int do_scan(int nargs, char **argv);
 int do_resize(int nargs, char **argv);
diff --git a/ioctl.h b/ioctl.h
index 776d7a9..5210c0b 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -169,4 +169,7 @@ struct btrfs_ioctl_space_args {
 #define BTRFS_IOC_DEFAULT_SUBVOL _IOW(BTRFS_IOCTL_MAGIC, 19, u64)
 #define BTRFS_IOC_SPACE_INFO _IOWR(BTRFS_IOCTL_MAGIC, 20, \
 struct btrfs_ioctl_space_args)
+
+#define BTRFS_IOC_BALANCE_METADATA _IOW(BTRFS_IOCTL_MAGIC, 37, \
+	struct btrfs_ioctl_vol_args)
 #endif
-- 
1.7.4.4



-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer


Revamp cluster allocation logic

2011-11-10 Thread Alexandre Oliva
These are patches I posted before, except these are based on cmason's
for-linus.  Reposting at josef's request.

From c8036334e5a033a6ca0963e8fb716d03b1945158 Mon Sep 17 00:00:00 2001
From: Alexandre Oliva lxol...@fsfla.org
Date: Fri, 14 Oct 2011 12:10:36 -0300
Subject: [PATCH 1/8] Revamp btrfs cluster creation logic.

Parameterized clusters on minimum total size and minimum chunk size,
without an upper bound.  Don't tolerate fragmentation for SSD_SPREAD;
accept some fragmentation for metadata but try to keep data dense.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/free-space-cache.c |   64 +++---
 1 files changed, 35 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 7a15fcf..7572396 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2273,8 +2273,8 @@ static int btrfs_bitmap_cluster(struct btrfs_block_group_cache *block_group,
 	struct btrfs_free_space_ctl *ctl = block_group-free_space_ctl;
 	unsigned long next_zero;
 	unsigned long i;
-	unsigned long search_bits;
-	unsigned long total_bits;
+	unsigned long want_bits;
+	unsigned long min_bits;
 	unsigned long found_bits;
 	unsigned long start = 0;
 	unsigned long total_found = 0;
@@ -2283,8 +2283,8 @@ static int btrfs_bitmap_cluster(struct btrfs_block_group_cache *block_group,
 
 	i = offset_to_bit(entry-offset, block_group-sectorsize,
 			  max_t(u64, offset, entry-offset));
-	search_bits = bytes_to_bits(bytes, block_group-sectorsize);
-	total_bits = bytes_to_bits(min_bytes, block_group-sectorsize);
+	want_bits = bytes_to_bits(bytes, block_group-sectorsize);
+	min_bits = bytes_to_bits(min_bytes, block_group-sectorsize);
 
 again:
 	found_bits = 0;
@@ -2293,7 +2293,7 @@ again:
 	 i = find_next_bit(entry-bitmap, BITS_PER_BITMAP, i + 1)) {
 		next_zero = find_next_zero_bit(entry-bitmap,
 	   BITS_PER_BITMAP, i);
-		if (next_zero - i = search_bits) {
+		if (next_zero - i = min_bits) {
 			found_bits = next_zero - i;
 			break;
 		}
@@ -2313,9 +2313,9 @@ again:
 	if (cluster->max_size < found_bits * block_group->sectorsize)
 		cluster->max_size = found_bits * block_group->sectorsize;
 
-	if (total_found < total_bits) {
+	if (total_found < want_bits) {
 		i = find_next_bit(entry->bitmap, BITS_PER_BITMAP, next_zero);
-		if (i - start > total_bits * 2) {
+		if (i - start > want_bits * 2) {
 			total_found = 0;
 			cluster-max_size = 0;
 			found = false;
@@ -2361,8 +2361,8 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
 	 * We don't want bitmaps, so just move along until we find a normal
 	 * extent entry.
 	 */
-	while (entry->bitmap) {
-		if (list_empty(&entry->list))
+	while (entry->bitmap || entry->bytes < min_bytes) {
+		if (entry->bitmap && list_empty(&entry->list))
 			list_add_tail(entry-list, bitmaps);
 		node = rb_next(entry-offset_index);
 		if (!node)
@@ -2377,10 +2377,8 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
 	last = entry;
 	prev = entry;
 
-	while (window_free <= min_bytes) {
-		node = rb_next(&entry->offset_index);
-		if (!node)
-			return -ENOSPC;
+	for (node = rb_next(&entry->offset_index); node;
+	     node = rb_next(&entry->offset_index)) {
 		entry = rb_entry(node, struct btrfs_free_space, offset_index);
 
 		if (entry-bitmap) {
@@ -2389,12 +2387,19 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
 			continue;
 		}
 
+		if (entry-bytes  min_bytes)
+			continue;
+
 		/*
 		 * we haven't filled the empty size and the window is
 		 * very large.  reset and try again
 		 */
 		if (entry-offset - (prev-offset + prev-bytes)  max_gap ||
-		entry-offset - window_start  (min_bytes * 2)) {
+		entry-offset - window_start  (window_free * 2)) {
+			/* We got a cluster of the requested size,
+			   we're done.  */
+			if (window_free = bytes)
+break;
 			first = entry;
 			window_start = entry-offset;
 			window_free = entry-bytes;
@@ -2409,6 +2414,9 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
 		prev = entry;
 	}
 
+	if (window_free  bytes)
+		return -ENOSPC;
+
 	cluster-window_start = first-offset;
 
 	node = first-offset_index;
@@ -2422,7 +2430,7 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
 
 		entry = rb_entry(node, struct btrfs_free_space, offset_index);
 		node = rb_next(entry-offset_index);
-		if (entry-bitmap)
+		if (entry-bitmap || entry-bytes  min_bytes)
 			continue;
 
 		rb_erase(entry-offset_index, ctl-free_space_offset);
@@ -2504,7 +2512,7 @@ search:
 
 /*
  * here we try to find a cluster of blocks in a block group.  The goal
- * is to find at least bytes free and up to empty_size + bytes free.
+ * is to find at least bytes+empty_size.
  * We might not find them all in one contiguous area.
  *
  * returns zero and sets up cluster if things worked out, otherwise
@@ -2522,19 +2530,16 @@ int btrfs_find_space_cluster(struct btrfs_trans_handle *trans,
 	u64 min_bytes

Record end of metadata allocation

2011-11-10 Thread Alexandre Oliva
So I'm trying to figure out what it is that makes clustered allocation
so much faster than unclustered allocation.  E.g., for a nearly
quiescent filesystem with as little as 90MB of metadata, balance-md
(from another patch I posted today) takes some 4.5 seconds (worst case
6s, best case 4s) with clustered allocation, while with -o nocluster it
takes some 6.5s (best case 6s, worst case 7s).  With -o mincluster,
introduced by the patch below (by no means intended for merging, it's
far too hackish) it's some 0.1s faster than with -o nocluster, but
nothing really significant, and I didn't even take care of locking
last_ptr.  So I conclude it's not remembering the search starting point
that makes -o cluster faster.

Anyhow, since this is slightly faster than unclustered allocation, I
suppose we could introduce something along these lines for the -o
nocluster case, no?

From c16a9e53e41e7616e4498534eea25ca1f396d7b4 Mon Sep 17 00:00:00 2001
From: Alexandre Oliva lxol...@fsfla.org
Date: Thu, 10 Nov 2011 20:55:40 -0200
Subject: [PATCH 9/9] Add -o mincluster option.

If this option is enabled, save the location of the last successful
allocation, so as to emulate some of the cluster allocation logic
(though not non-bitmap preference) without actually going through the
exercise of allocating clusters.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 fs/btrfs/extent-tree.c  |   16 +---
 fs/btrfs/free-space-cache.c |1 +
 fs/btrfs/super.c|   17 +
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4da27be..caa73b2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5053,7 +5053,7 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 {
 	int ret = 0;
 	struct btrfs_root *root = orig_root-fs_info-extent_root;
-	struct btrfs_free_cluster *last_ptr = NULL;
+	struct btrfs_free_cluster *last_ptr = NULL, *save_ptr = NULL;
 	struct btrfs_block_group_cache *block_group = NULL;
 	int empty_cluster = 2 * 1024 * 1024;
 	int allowed_chunk_alloc = 0;
@@ -5095,8 +5095,16 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 
 	if (data & BTRFS_BLOCK_GROUP_METADATA && use_cluster) {
 		last_ptr = &root->fs_info->meta_alloc_cluster;
-		if (!btrfs_test_opt(root, SSD))
-			empty_cluster = 64 * 1024;
+		if (!btrfs_test_opt(root, SSD)) {
+			/* !SSD && SSD_SPREAD == -o mincluster.  */
+			if (btrfs_test_opt(root, SSD_SPREAD)) {
+				save_ptr = last_ptr;
+				hint_byte = save_ptr->window_start;
+				last_ptr = NULL;
+				use_cluster = false;
+			} else
+				empty_cluster = 64 * 1024;
+		}
 	}
 
 	if ((data & BTRFS_BLOCK_GROUP_DATA) && use_cluster &&
@@ -5402,6 +5410,8 @@ checks:
 			btrfs_add_free_space(block_group, offset,
 	 search_start - offset);
 		BUG_ON(offset > search_start);
+		if (save_ptr)
+			save_ptr->window_start = search_start + num_bytes;
 		btrfs_put_block_group(block_group);
 		break;
 loop:
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index afd1129..2706369 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2576,6 +2576,7 @@ void btrfs_init_free_cluster(struct btrfs_free_cluster *cluster)
 	cluster-max_size = 0;
 	INIT_LIST_HEAD(cluster-block_group_list);
 	cluster-block_group = NULL;
+	cluster-window_start = 0;
 }
 
 int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 2baba99..dd76fa4 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -165,7 +165,7 @@ enum {
 	Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
 	Opt_enospc_debug, Opt_subvolrootid, Opt_defrag,
 	Opt_inode_cache, Opt_no_space_cache, Opt_recovery,
-	Opt_nocluster, Opt_cluster, Opt_err,
+	Opt_nocluster, Opt_cluster, Opt_mincluster, Opt_err,
 };
 
 static match_table_t tokens = {
@@ -202,6 +202,7 @@ static match_table_t tokens = {
 	{Opt_recovery, recovery},
 	{Opt_nocluster, nocluster},
 	{Opt_cluster, cluster},
+	{Opt_mincluster, mincluster},
 	{Opt_err, NULL},
 };
 
@@ -407,6 +408,11 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 			printk(KERN_INFO "btrfs: enabling alloc clustering\n");
 			btrfs_clear_opt(info->mount_opt, NO_ALLOC_CLUSTER);
 			break;
+		case Opt_mincluster:
+			printk(KERN_INFO "btrfs: enabling minimal alloc clustering\n");
+			btrfs_clear_opt(info->mount_opt, NO_ALLOC_CLUSTER);
+			btrfs_set_opt(info->mount_opt, SSD_SPREAD);
+			break;
 		case Opt_err:
 			printk(KERN_INFO btrfs: unrecognized mount option 
 			   '%s'\n, p);
@@ -705,9 +711,12 @@ static int btrfs_show_options(struct seq_file *seq, struct vfsmount *vfs)
 	}
 	if (btrfs_test_opt(root, NOSSD))
 		seq_puts(seq, ",nossd");
-	if (btrfs_test_opt(root, SSD_SPREAD))
-		seq_puts(seq, ",ssd_spread");
-	else if (btrfs_test_opt(root, SSD))
+	if (btrfs_test_opt(root, SSD_SPREAD)) {
+		if (btrfs_test_opt(root, SSD))
+			seq_puts(seq, ",ssd_spread");
+		else

Re: corrupted btrfs after suspend2ram uncorrectable with scrub

2011-11-01 Thread Alexandre Oliva
Hi, Gustavo,

On Nov  1, 2011, Gustavo Sverzut Barbieri barbi...@gmail.com wrote:

   btrfs csum failed ino 2957021 extent 85041815552 csum 667310679
 wanted 0 mirror 0

 Is there any way to recover it?  :-S

Did you try mounting without data checksums?
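
Something along these lines might do, in case it helps (device and
mount point are placeholders):

    mount -o ro,nodatasum /dev/sdX /mnt/rescue

and then copy out whatever still reads.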

-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Patches for BTRFS (mail-server slow down in 3.0 and more)

2011-10-31 Thread Alexandre Oliva
On Oct 31, 2011, David Sterba d...@jikos.cz wrote:

 On Mon, Oct 31, 2011 at 02:19:18AM -0200, Alexandre Oliva wrote:
 On Oct 29, 2011, Chris Mason chris.ma...@oracle.com wrote:
 
  The last one isn't a bad idea, but please do make a real mount option
  for it ;)
 
 Like this?

 @@ -195,6 +195,7 @@ static match_table_t tokens = {
 {Opt_subvolrootid, "subvolrootid=%d"},
 {Opt_defrag, "autodefrag"},
 {Opt_inode_cache, "inode_cache"},
 +   {Opt_nocluster, "nocluster"},
 {Opt_err, NULL},

 How about 'no_alloc_cluster' ?

I considered that, too, but choosing the option name was the most
difficult part of the patch :-) I ended up going for the shorter name,
just to get the conversation started ;-) I don't feel strongly about it.

-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Patches for BTRFS (mail-server slow down in 3.0 and more)

2011-10-30 Thread Alexandre Oliva
On Oct 29, 2011, Chris Mason chris.ma...@oracle.com wrote:

 The last one isn't a bad idea, but please do make a real mount option
 for it ;)

Like this?

From af086e7b88637be5c9806181a1d70db9c645cb50 Mon Sep 17 00:00:00 2001
From: Alexandre Oliva <lxol...@fsfla.org>
Date: Sat, 29 Oct 2011 02:20:55 -0200
Subject: [PATCH 4/4] Disable clustered allocation with -o nocluster

Introduce -o nocluster to disable the use of clusters for extent
allocation.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/ctree.h   |1 +
 fs/btrfs/extent-tree.c |2 +-
 fs/btrfs/super.c   |   11 +--
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 03912c5..b1138fb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1363,6 +1363,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_ENOSPC_DEBUG	 (1 << 15)
 #define BTRFS_MOUNT_AUTO_DEFRAG		(1 << 16)
 #define BTRFS_MOUNT_INODE_MAP_CACHE	(1 << 17)
+#define BTRFS_MOUNT_NO_ALLOC_CLUSTER	(1 << 18)
 
 #define btrfs_clear_opt(o, opt)		((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)		((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f5be06a..5d7c9a7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4886,7 +4886,7 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 	bool found_uncached_bg = false;
 	bool failed_cluster_refill = false;
 	bool failed_alloc = false;
-	bool use_cluster = true;
+	bool use_cluster = !btrfs_test_opt(root, NO_ALLOC_CLUSTER);
 	u64 ideal_cache_percent = 0;
 	u64 ideal_cache_offset = 0;
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 15634d4..57c7bb1 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -162,7 +162,7 @@ enum {
 	Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard,
 	Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
 	Opt_enospc_debug, Opt_subvolrootid, Opt_defrag,
-	Opt_inode_cache, Opt_err,
+	Opt_inode_cache, Opt_nocluster, Opt_err,
 };
 
 static match_table_t tokens = {
@@ -195,6 +195,7 @@ static match_table_t tokens = {
 	{Opt_subvolrootid, "subvolrootid=%d"},
 	{Opt_defrag, "autodefrag"},
 	{Opt_inode_cache, "inode_cache"},
+	{Opt_nocluster, "nocluster"},
 	{Opt_err, NULL},
 };
 
@@ -378,9 +379,13 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 			btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
 			break;
 		case Opt_defrag:
-			printk(KERN_INFO "btrfs: enabling auto defrag");
+			printk(KERN_INFO "btrfs: enabling auto defrag\n");
 			btrfs_set_opt(info->mount_opt, AUTO_DEFRAG);
 			break;
+		case Opt_nocluster:
+			printk(KERN_INFO "btrfs: disabling alloc clustering\n");
+			btrfs_set_opt(info->mount_opt, NO_ALLOC_CLUSTER);
+			break;
 		case Opt_err:
 			printk(KERN_INFO "btrfs: unrecognized mount option "
 			       "'%s'\n", p);
@@ -729,6 +734,8 @@ static int btrfs_show_options(struct seq_file *seq, struct vfsmount *vfs)
 		seq_puts(seq, ",autodefrag");
 	if (btrfs_test_opt(root, INODE_MAP_CACHE))
 		seq_puts(seq, ",inode_cache");
+	if (btrfs_test_opt(root, NO_ALLOC_CLUSTER))
+		seq_puts(seq, ",nocluster");
 	return 0;
 }
 
-- 
1.7.4.4



-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer


Re: Patches for BTRFS (mail-server slow down in 3.0 and more)

2011-10-28 Thread Alexandre Oliva
On Oct 28, 2011, Marcel Lohmann mar...@malowa.de wrote:

 I would really appreciate if you could send me the patches.

Here are the patches I mentioned on IRC.  I've sent two of them to Josef
for him to push upstream, but I'm not sure he posted them here for I'm
not on the list (yet?).  The other two are newer, and the last one is
definitely not for inclusion (just for testing or as a temporary
work-around).

I've been using the first 3 with some success on a couple of mail
servers: I no longer hit the ridiculous slowdowns from frequent
unsuccessful calls of setup_cluster_no_bitmap after a while, like I did
with 3.0 (and 3.1).

However, the excess use of metadata that I've experienced on ceph OSDs
isn't fixed by them.  Even after a btrfs balance with the first 3, one
filesystem still has 22GB of metadata block groups with only 4.1GB of
metadata in use, and another has 19GB with only 2GB in use.  With the
4th patch and -o clear_cache, the first rebalancing of the
22GB-metadata filesystem got it down to 8GB; the second fs is still
rebalancing its ~800GB (wishlist mental note: introduce some means to
rebalance only the metadata).
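
For reference, the rebalancing I mean amounts to something like this
(device and mount point are placeholders):

    mount -o clear_cache /dev/sdc1 /srv/osd2
    btrfs filesystem balance /srv/osd2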

Here are the patches, against 3.1-libre (should apply cleanly on 3.1).


---BeginMessage---
Parameterized clusters on minimum total size and minimum chunk size,
without an upper bound.  Don't tolerate fragmentation for SSD_SPREAD;
accept some fragmentation for metadata but try to keep data dense.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 fs/btrfs/free-space-cache.c |   64 +++---
 1 files changed, 35 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 41ac927..4973816 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2092,8 +2092,8 @@ static int btrfs_bitmap_cluster(struct btrfs_block_group_cache *block_group,
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	unsigned long next_zero;
 	unsigned long i;
-	unsigned long search_bits;
-	unsigned long total_bits;
+	unsigned long want_bits;
+	unsigned long min_bits;
 	unsigned long found_bits;
 	unsigned long start = 0;
 	unsigned long total_found = 0;
@@ -2102,8 +2102,8 @@ static int btrfs_bitmap_cluster(struct btrfs_block_group_cache *block_group,
 
 	i = offset_to_bit(entry->offset, block_group->sectorsize,
 			  max_t(u64, offset, entry->offset));
-	search_bits = bytes_to_bits(bytes, block_group->sectorsize);
-	total_bits = bytes_to_bits(min_bytes, block_group->sectorsize);
+	want_bits = bytes_to_bits(bytes, block_group->sectorsize);
+	min_bits = bytes_to_bits(min_bytes, block_group->sectorsize);
 
 again:
 	found_bits = 0;
@@ -2112,7 +2112,7 @@ again:
 	     i = find_next_bit(entry->bitmap, BITS_PER_BITMAP, i + 1)) {
 		next_zero = find_next_zero_bit(entry->bitmap,
 					       BITS_PER_BITMAP, i);
-		if (next_zero - i >= search_bits) {
+		if (next_zero - i >= min_bits) {
 			found_bits = next_zero - i;
 			break;
 		}
@@ -2132,9 +2132,9 @@ again:
 	if (cluster->max_size < found_bits * block_group->sectorsize)
 		cluster->max_size = found_bits * block_group->sectorsize;
 
-	if (total_found < total_bits) {
+	if (total_found < want_bits) {
 		i = find_next_bit(entry->bitmap, BITS_PER_BITMAP, next_zero);
-		if (i - start > total_bits * 2) {
+		if (i - start > want_bits * 2) {
 			total_found = 0;
 			cluster->max_size = 0;
 			found = false;
@@ -2180,8 +2180,8 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
 	 * We don't want bitmaps, so just move along until we find a normal
 	 * extent entry.
 	 */
-	while (entry->bitmap) {
-		if (list_empty(&entry->list))
+	while (entry->bitmap || entry->bytes < min_bytes) {
+		if (entry->bitmap && list_empty(&entry->list))
 			list_add_tail(&entry->list, bitmaps);
 		node = rb_next(&entry->offset_index);
 		if (!node)
@@ -2196,10 +2196,8 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
 	last = entry;
 	prev = entry;
 
-	while (window_free <= min_bytes) {
-		node = rb_next(&entry->offset_index);
-		if (!node)
-			return -ENOSPC;
+	for (node = rb_next(&entry->offset_index); node;
+	     node = rb_next(&entry->offset_index)) {
 		entry = rb_entry(node, struct btrfs_free_space, offset_index);
 
 		if (entry->bitmap) {
@@ -2208,12 +2206,19 @@ setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
			continue

Re: “bio too big” regression and silent data corruption in 3.0

2011-08-16 Thread Alexandre Oliva
Here's some additional information and work-arounds.

On Aug  7, 2011, Alexandre Oliva ol...@lsd.ic.unicamp.br wrote:

 A bit of investigation showed that max_hw_sectors for the USB disk was
 120, much lower than the internal SATA and PATA disks.

FWIW, overriding /sys/class/block/sd*/queue/max_sectors_kb of all disks
used by the filesystem to the lowest max_hw_sectors_kb works around this
problem, at least as long as you don't hit it before you get a chance to
change the setting.
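
Something like this, assuming sda and sdb are the disks backing the
filesystem (the value should be the lowest max_hw_sectors_kb among
them, 120 in the case of my USB disk):

    cat /sys/class/block/sd{a,b}/queue/max_hw_sectors_kb
    echo 120 > /sys/class/block/sda/queue/max_sectors_kb
    echo 120 > /sys/class/block/sdb/queue/max_sectors_kb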

 Raid0 block groups were created to hold data from single block groups
 and, if it couldn't create big-enough raid0 blocks because *any* of
 the other disks was nearly-full, removal would fail.

AFAICT this was my misunderstanding of the situation.  Apparently btrfs
can rebalance the disk space in other partitions so as to create raid0
blocks during removal.  However, in my case it didn't because there was
some metadata inconsistency in the partition I was trying to remove that
led to block tree checksum errors being printed when it hit that part of
the partition, aborting the removal.  The checksum errors were likely
caused by the bio too big problem.

 it appears to be impossible to go back from RAID1 to DUP metadata once
 you temporarily add a second disk, and any metadata block group
 happens to be allocated before you remove it (why couldn't it go back
 to DUP, rather than refusing the removal outright, which prevents even
 single block groups from being moved?)

FWIW, I disabled the test that refuses to shrink a filesystem containing
RAID1 to a single disk and issued such a request while running this
modified kernel, and it completed successfully and perfectly.  Can we
change it from hard error to warning?
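
The request itself was just an ordinary device removal, something like
this (device and mount point are placeholders):

    btrfs device delete /dev/sdd /mnt/pool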

 5. This long message reminded me that another machine that has been
 running 3.0 seems to have got *much* slower recently.  I thought it had
 to do with the 98% full filesystem (though 40GB available for new block
 group allocations would seem to be plenty), and the constant metadata
 activity caused by ceph creating and removing snapshots all the time.

AFAICT it had to do with extended attributes (heavily used by ceph),
that caused a large number of metadata block groups to be allocated,
even though only a tiny fraction of the space in them ended up being
used.  I've observed this in two of the ceph object stores.

I've also noticed that rsyncing the OSDs with all extended attributes
(-A -X) caused the source to use up a *lot* of CPU and take far longer
than without them.  I don't know why that is, but getfattr --dump at the source
and setfattr --restore at the target does pretty much the same, without
incurring such large CPU and time costs, so there's something to be
improved somewhere, in rsync and/or in btrfs.
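
Something along these lines (paths are placeholders; run the restore
from the directory the dump paths are relative to):

    cd /srv && getfattr -R -d -m - osd0 > /tmp/osd0.xattrs   # on the source
    cd /srv && setfattr --restore=/tmp/osd0.xattrs           # on the target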

-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: “bio too big” regression and silent data corruption in 3.0

2011-08-08 Thread Alexandre Oliva
On Aug  7, 2011, Alexandre Oliva ol...@lsd.ic.unicamp.br wrote:

 tl;dr version: 3.0 produces “bio too big” dmesg entries and silently
 corrupts data in “meta-raid1/data-single” configurations on disks with
 different max_hw_sectors, where 2.6.38 worked fine.

FWIW, I just got the same problem with 2.6.38.  No idea how I hadn't hit
it before, but it's not a 3.0 regression, just a regular (but IMHO very
serious) bug.

-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: “bio too big” regression and silent data corruption in 3.0

2011-08-08 Thread Alexandre Oliva
On Aug  7, 2011, Alexandre Oliva ol...@lsd.ic.unicamp.br wrote:

 2. Removing a partition from the filesystem (say, the external disk)
 didn't relocate “single” block groups as such to other disks, as
 expected.

/me reads some code and resets expectations about RAID0 in btrfs ;-)

update_block_group_flags is what does this.  It doesn't care what was
chosen when the filesystem was created; it just forces RAID0 if more
than one disk remains:

/* turn single device chunks into raid0 */
return stripped | BTRFS_BLOCK_GROUP_RAID0;

Is this really intended?  Given my current understanding that RAID0
doesn't mean striping over all disks, but only over two disks, I guess I
might even be interested in it, but...  I still think the user's choice
should be honored, but I don't see where the choice is stored (if it is
at all).


 I wonder, why can't btrfs mark at least mounted partitions as busy, in
 much the same way that swap, md and various filesystems do, to avoid
 such accidental reuses?

Heh.  And *unmark* them when they're removed, too...  As in, it won't
let me create a new filesystem in a partition that was just removed from
a filesystem, if that was the partition listed in /etc/mtab.

-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: “bio too big” regression and silent data corruption in 3.0

2011-08-08 Thread Alexandre Oliva
On Aug  7, 2011, Alexandre Oliva ol...@lsd.ic.unicamp.br wrote:

 in very much the same way that it appears to be impossible to go
 back from RAID1 to DUP metadata once you temporarily add a second disk,
 and any metadata block group happens to be allocated before you remove
 it (why couldn't it go back to DUP, rather than refusing the removal
 outright, which prevents even single block groups from being moved?)

Which also appears to be intentional.  The code to support this is right
there in update_block_group_flags, but btrfs_rm_device refuses to let it
do its job, denying the removal attempt right away, without any means to
bypass the test.  Could at least an option to bypass the test be
introduced, through say a mount option, some /sys setting, whatever?

-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


help recover from unmountable btrfs

2011-08-06 Thread Alexandre Oliva
After running one too many times into “parent transid verify failed”
that prevents a filesystem from being mounted, I found out how to adjust
some system blocks so that the kernel could get past that check and
mount the filesystem.  In one case, I could get all the data I wanted
from the filesystem; in another, many checksums failed and I ended up
throwing it all away, so no guarantees.  mpiechotka's running into the
problem and bringing it up on IRC prompted me to post, for wider
consumption, this patch for btrfsck, which will tell you what to do to
make the filesystem mountable again.
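
The intended use is something like this (device is a placeholder):

    ./btrfsck /dev/sdb1 2>&1 | tee check.log

then review the dd/printf commands it prints (the fd numbers map to
devices per the messages added in the volumes.c hunk) before running
any of them by hand.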

Add verbosity to btrfsck so that we can manually recover from a failure
to update the roots.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 disk-io.c |   41 +
 volumes.c |3 +++
 2 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/disk-io.c b/disk-io.c
index a6e1000..6860c26 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -87,6 +87,22 @@ int csum_tree_block_size(struct extent_buffer *buf, u16 csum_size,
 			printk("checksum verify failed on %llu wanted %X "
 			       "found %X\n", (unsigned long long)buf->start,
 			       *((int *)result), *((char *)buf->data));
+			if (csum_size == 4) {
+				fprintf(stderr, "dd if=(fd %i) bs=1c skip=%llu count=4 | od -t x1:\n%02x %02x %02x %02x\n",
+					buf->fd,
+					(unsigned long long)buf->dev_bytenr,
+					(__u8)buf->data[0],
+					(__u8)buf->data[1],
+					(__u8)buf->data[2],
+					(__u8)buf->data[3]);
+				fprintf(stderr, "printf \"\\x%02x\\x%02x\\x%02x\\x%02x\" | dd of=(fd %i) bs=1c seek=%llu conv=notrunc count=4\n",
+					(__u8)result[0],
+					(__u8)result[1],
+					(__u8)result[2],
+					(__u8)result[3],
+					buf->fd,
+					(unsigned long long)buf->dev_bytenr);
+			}
 			free(result);
 			return 1;
 		}
@@ -165,6 +181,31 @@ static int verify_parent_transid(struct extent_io_tree *io_tree,
 		       (unsigned long long)eb->start,
 		       (unsigned long long)parent_transid,
 		       (unsigned long long)btrfs_header_generation(eb));
+	fprintf(stderr, "dd if=(fd %i) bs=1c skip=%llu count=8 | od -t x1:\n%02x %02x %02x %02x %02x %02x %02x %02x\n",
+		eb->fd,
+		(unsigned long long)eb->dev_bytenr
+		+ offsetof (struct btrfs_header, generation),
+		(__u8)eb->data[offsetof (struct btrfs_header, generation)],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 1],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 2],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 3],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 4],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 5],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 6],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 7]);
+	btrfs_set_header_generation(eb, parent_transid);
+	fprintf(stderr, "printf \"\\x%02x\\x%02x\\x%02x\\x%02x\\x%02x\\x%02x\\x%02x\\x%02x\" | dd of=(fd %i) bs=1c seek=%llu conv=notrunc count=8\n",
+		(__u8)eb->data[offsetof (struct btrfs_header, generation)],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 1],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 2],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 3],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 4],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 5],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 6],
+		(__u8)eb->data[offsetof (struct btrfs_header, generation) + 7],
+		eb->fd,
+		(unsigned long long)eb->dev_bytenr
+		+ offsetof (struct btrfs_header, generation));
 	ret = 1;
 out:
 	clear_extent_buffer_uptodate(io_tree, eb);
diff --git a/volumes.c b/volumes.c
index 7671855..c30a3ba 100644
--- a/volumes.c
+++ b/volumes.c
@@ -188,6 +188,9 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices, int flags)
 		device->fd = fd;
 		if (flags == O_RDWR)
 			device->writeable = 1;
+		fprintf(stderr, "Device %llu (%s) opened in fd %i\n",
+			(unsigned long long)device->devid,
+			device->name, device->fd