Re: [RFC PATCH v2 0/3] Btrfs: apply the Probabilistic Skiplist on btrfs

2012-01-13 Thread Andi Kleen

For the btrfs extent cache it's unclear whether plain RCU conversion is a
good fit anyway: some workloads are very write heavy, and RCU only
performs well if you have a lot more reads than writes.
For write-heavy workloads, RCUification usually slows things down.

> FWIW, I'm mentioning this out of self interest - I need a RCU safe
> tree structure to index extents for lockless lookups in the XFS
> buffer cache, but I've got a long list of things to do before I get
> to it. If someone else implements the tree, that's most of the work
> done for me. :)

FWIW there are fine-grained rbtrees in papers too, but they are too fine
grained IMHO: you may need to take a large number of locks for a single
traversal. While atomics have become a lot cheaper recently, they are still
somewhat expensive and you don't want too many of them in your
fast path. Also, I found that under actual contention, having
too many bouncing locks is quite bad because the latencies of passing
the cache lines around really add up. In these cases using coarser
locks is better.

Mathieu also did RCU rbtrees but they are quite complicated.

IMHO we would like to have something in between a global tree lock and a
fully fine-grained tree, where the lock complexity cannot get out of hand.
This may need a separate data structure for the locks.

I don't have a leading candidate for that currently.

There are some variants of rbtrees that are less strict and have
a simpler rebalance, which are interesting. There are also some
other tree data structures worth a look. Needs more work.
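
One shape the "separate data structure for the locks" idea could take is
lock striping: a small fixed table of locks, indexed by key range, kept
apart from the tree nodes. The sketch below is not something proposed in
this thread. NR_STRIPES, STRIPE_SHIFT and every function name are
hypothetical, it uses userspace pthreads rather than kernel primitives, and
it assumes each stripe guards its own subtree, so rebalancing never crosses
a stripe boundary.

```c
#include <pthread.h>
#include <stdint.h>

#define NR_STRIPES   16u   /* hypothetical; must be a power of two */
#define STRIPE_SHIFT 20    /* hypothetical; 1 MiB of key space per stripe */

/* The locks live apart from the tree nodes, in one fixed-size table. */
struct stripe_locks {
	pthread_mutex_t lock[NR_STRIPES];
};

static void stripe_locks_init(struct stripe_locks *s)
{
	for (unsigned int i = 0; i < NR_STRIPES; i++)
		pthread_mutex_init(&s->lock[i], NULL);
}

/* Map a key (for example an extent start offset) to its stripe index. */
static unsigned int stripe_of(uint64_t key)
{
	return (unsigned int)((key >> STRIPE_SHIFT) & (NR_STRIPES - 1));
}

/* A point operation takes exactly one lock, however deep the tree is. */
static pthread_mutex_t *stripe_lock(struct stripe_locks *s, uint64_t key)
{
	pthread_mutex_t *m = &s->lock[stripe_of(key)];

	pthread_mutex_lock(m);
	return m;
}

static void stripe_unlock(pthread_mutex_t *m)
{
	pthread_mutex_unlock(m);
}
```

With this layout the number of locks a traversal can touch is bounded by
NR_STRIPES regardless of tree depth. An operation spanning several stripes
would need to take its locks in ascending stripe order to avoid deadlock.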

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Bug(?): btrfs carries on working if part of a device disappears

2012-01-13 Thread Liu Bo
On 01/06/2012 02:02 AM, Maik Zumstrull wrote:
> Hello list,
>
> I hit a funny BIOS bug the other day where the BIOS suddenly sets a
> HPA on a random hard disk, leaving only the first 33 MB accessible.
> That disk had one device of a multi-device btrfs on it in my case.
> (With dm-crypt/LUKS in between, no partitioning or LVM.)
>
> The reason I'm writing to you is that btrfs apparently didn't care at
> all. It didn't complain, and it certainly didn't consider "Uhm, maybe
> I should stop writing to a file system that mostly doesn't exist
> anymore." The only errors I saw in dmesg were from the lower block
> device level: someone trying to read or write beyond the end of a
> device. An error btrfs apparently didn't mind. It took me a while to
> figure out what had happened, during which time btrfsck and the btrfs
> kernel part worked together to pretty much totally trash the fs. (I'm
> still trying a few things, but I'm not hopeful. Hold the default
> backup rant, I can in fact recover anything that was on this from
> elsewhere, I think.)
>
> So, I think during mount, btrfs should check the reported size of the
> block device, and if it's significantly smaller than fs metadata
> implies it must be, mount degraded or read-only or not at all. And
> mostly, complain. Loudly.
>

I also noticed this: when we run mkfs.btrfs with -b fssize and fssize is
larger than the device size, it does not complain, and we later get
"beyond the end of device" errors.

So maybe we should limit the mkfs size:

diff --git a/mkfs.c b/mkfs.c
index e3ced19..3ac8525 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1282,6 +1282,8 @@ int main(int ac, char **av)
 		ret = btrfs_prepare_device(fd, file, zero_end, &dev_block_count, &mixed);
 		if (block_count == 0)
 			block_count = dev_block_count;
+		if (block_count > dev_block_count)
+			block_count = dev_block_count;
 	} else {
 		ac = 0;
 		file = av[optind++];

thanks,
liubo

> This was on Debian's linux-image-3.1.0-1-amd64 at version 3.1.6-1.
> Other ways this could happen than HPA are LVM or partitioning.
>
>
> Maik


Re: Bug(?): btrfs carries on working if part of a device disappears

2012-01-13 Thread Ben Klein
On 13 January 2012 23:07, Liu Bo <liubo2...@cn.fujitsu.com> wrote:
> I also noticed this: when we run mkfs.btrfs with -b fssize and fssize is
> larger than the device size, it does not complain, and we later get
> "beyond the end of device" errors.
>
> So maybe we should limit the mkfs size:
>
> diff --git a/mkfs.c b/mkfs.c
> index e3ced19..3ac8525 100644
> --- a/mkfs.c
> +++ b/mkfs.c
> @@ -1282,6 +1282,8 @@ int main(int ac, char **av)
>  		ret = btrfs_prepare_device(fd, file, zero_end, &dev_block_count, &mixed);
>  		if (block_count == 0)
>  			block_count = dev_block_count;
> +		if (block_count > dev_block_count)
> +			block_count = dev_block_count;
>  	} else {
>  		ac = 0;
>  		file = av[optind++];
>
> thanks,
> liubo

It might be a better idea to error out at this point. If the user is
asking for a filesystem larger than what is possible on the device, I
think the mkfs should fail completely.
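
Failing instead of clamping could look roughly like the sketch below.
resolve_block_count() is a hypothetical helper, not a btrfs-progs function;
it only mirrors the block_count / dev_block_count logic from the quoted
hunk and assumes that 0 means "no -b given".

```c
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: pick the block count to format.
 * Returns 0 on success, or -EINVAL if the requested size does not fit
 * on the device, so the caller can abort mkfs instead of clamping.
 */
static int resolve_block_count(uint64_t requested, uint64_t dev_block_count,
			       uint64_t *out)
{
	if (requested == 0) {
		/* No explicit size: use the whole device. */
		*out = dev_block_count;
		return 0;
	}
	if (requested > dev_block_count) {
		fprintf(stderr,
			"requested size %" PRIu64 " exceeds device size %" PRIu64 "\n",
			requested, dev_block_count);
		return -EINVAL;
	}
	*out = requested;
	return 0;
}
```

The caller would check the return value right after the device has been
prepared and exit with an error before anything is written.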


[PATCH] Btrfs: add a delalloc mutex to inodes for delalloc reservations

2012-01-13 Thread Josef Bacik
I was using i_mutex for this, but we're getting bogus lockdep warnings by doing
that and there's no real way to get rid of those, so just stop using i_mutex to
protect delalloc metadata reservations and use a delalloc mutex instead.  This
shouldn't be contended often at all, only if you are writing and mmap writing to
the file at the same time.  Thanks,

Signed-off-by: Josef Bacik <jo...@redhat.com>
---
 fs/btrfs/btrfs_inode.h |    3 +++
 fs/btrfs/extent-tree.c |    5 +++--
 fs/btrfs/inode.c       |   11 +----------
 fs/btrfs/ioctl.c       |    2 --
 fs/btrfs/relocation.c  |    2 --
 5 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 634608d..9b9b15f 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -51,6 +51,9 @@ struct btrfs_inode {
 	/* held while logging the inode in tree-log.c */
 	struct mutex log_mutex;
 
+	/* held while doing delalloc reservations */
+	struct mutex delalloc_mutex;
+
 	/* used to order data wrt metadata */
 	struct btrfs_ordered_inode_tree ordered_tree;
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 673d32e..5519c65 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4245,12 +4245,11 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 	/* Need to be holding the i_mutex here if we aren't free space cache */
 	if (btrfs_is_free_space_inode(root, inode))
 		flush = 0;
-	else
-		WARN_ON(!mutex_is_locked(&inode->i_mutex));
 
 	if (flush && btrfs_transaction_in_commit(root->fs_info))
 		schedule_timeout(1);
 
+	mutex_lock(&BTRFS_I(inode)->delalloc_mutex);
 	num_bytes = ALIGN(num_bytes, root->sectorsize);
 
 	spin_lock(&BTRFS_I(inode)->lock);
@@ -4305,6 +4304,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 						      btrfs_ino(inode),
 						      to_free, 0);
 		}
+		mutex_unlock(&BTRFS_I(inode)->delalloc_mutex);
 		return ret;
 	}
 
@@ -4315,6 +4315,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 	}
 	BTRFS_I(inode)->reserved_extents += nr_extents;
 	spin_unlock(&BTRFS_I(inode)->lock);
+	mutex_unlock(&BTRFS_I(inode)->delalloc_mutex);
 
 	if (to_reserve)
 		trace_btrfs_space_reservation(root->fs_info,"delalloc",
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 85b8b90..f26fa6b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2239,14 +2239,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 				continue;
 			}
 			nr_truncate++;
-			/*
-			 * Need to hold the imutex for reservation purposes, not
-			 * a huge deal here but I have a WARN_ON in
-			 * btrfs_delalloc_reserve_space to catch offenders.
-			 */
-			mutex_lock(&inode->i_mutex);
 			ret = btrfs_truncate(inode);
-			mutex_unlock(&inode->i_mutex);
 		} else {
 			nr_unlink++;
 		}
@@ -6417,10 +6410,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	u64 page_start;
 	u64 page_end;
 
-	/* Need this to keep space reservations serialized */
-	mutex_lock(&inode->i_mutex);
 	ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
-	mutex_unlock(&inode->i_mutex);
 	if (!ret)
 		ret = btrfs_update_time(vma->vm_file);
 	if (ret) {
@@ -6764,6 +6754,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	extent_io_tree_init(&ei->io_tree, &inode->i_data);
 	extent_io_tree_init(&ei->io_failure_tree, &inode->i_data);
 	mutex_init(&ei->log_mutex);
+	mutex_init(&ei->delalloc_mutex);
 	btrfs_ordered_inode_tree_init(&ei->ordered_tree);
 	INIT_LIST_HEAD(&ei->i_orphan);
 	INIT_LIST_HEAD(&ei->delalloc_inodes);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index c04f02c..40eaa9f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -858,10 +858,8 @@ static int cluster_pages_for_defrag(struct inode *inode,
 		return 0;
 	file_end = (isize - 1) >> PAGE_CACHE_SHIFT;
 
-	mutex_lock(&inode->i_mutex);
 	ret = btrfs_delalloc_reserve_space(inode,
 					   num_pages << PAGE_CACHE_SHIFT);
-	mutex_unlock(&inode->i_mutex);
 	if (ret)
 		return ret;
 again:
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index cfb5543..dff29d5 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2947,9 +2947,7 @@ static int relocate_file_extent_cluster(struct inode *inode,
 	index = (cluster->start 

[PATCH] Btrfs: space leak tracepoints

2012-01-13 Thread Josef Bacik
This, in addition to a script in my btrfs-tracing tree, will help track down
space leaks when we're getting space left over in block groups on umount.  Thanks,

Signed-off-by: Josef Bacik <jo...@redhat.com>
---
 fs/btrfs/delayed-inode.c     |   45
 fs/btrfs/extent-tree.c       |   58 --
 fs/btrfs/inode-map.c         |    4 +++
 fs/btrfs/transaction.c       |    2 +
 include/trace/events/btrfs.h |   30 +
 5 files changed, 119 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 9c1eccc..fe4cd0f 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -595,8 +595,12 @@ static int btrfs_delayed_item_reserve_metadata(struct btrfs_trans_handle *trans,
 
 	num_bytes = btrfs_calc_trans_metadata_size(root, 1);
 	ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes);
-	if (!ret)
+	if (!ret) {
+		trace_btrfs_space_reservation(root->fs_info, "delayed_item",
+					      item->key.objectid,
+					      num_bytes, 1);
 		item->bytes_reserved = num_bytes;
+	}
 
 	return ret;
 }
@@ -610,6 +614,9 @@ static void btrfs_delayed_item_release_metadata(struct btrfs_root *root,
 		return;
 
 	rsv = &root->fs_info->delayed_block_rsv;
+	trace_btrfs_space_reservation(root->fs_info, "delayed_item",
+				      item->key.objectid, item->bytes_reserved,
+				      0);
 	btrfs_block_rsv_release(root, rsv,
 				item->bytes_reserved);
 }
@@ -624,7 +631,7 @@ static int btrfs_delayed_inode_reserve_metadata(
 	struct btrfs_block_rsv *dst_rsv;
 	u64 num_bytes;
 	int ret;
-	int release = false;
+	bool release = false;
 
 	src_rsv = trans->block_rsv;
 	dst_rsv = &root->fs_info->delayed_block_rsv;
@@ -651,8 +658,13 @@ static int btrfs_delayed_inode_reserve_metadata(
 		 */
 		if (ret == -EAGAIN)
 			ret = -ENOSPC;
-		if (!ret)
+		if (!ret) {
 			node->bytes_reserved = num_bytes;
+			trace_btrfs_space_reservation(root->fs_info,
+						      "delayed_inode",
+						      btrfs_ino(inode),
+						      num_bytes, 1);
+		}
 		return ret;
 	} else if (src_rsv == &root->fs_info->delalloc_block_rsv) {
 		spin_lock(&BTRFS_I(inode)->lock);
@@ -707,11 +719,17 @@ out:
 	 * reservation here.  I think it may be time for a documentation page on
 	 * how block rsvs. work.
 	 */
-	if (!ret)
+	if (!ret) {
+		trace_btrfs_space_reservation(root->fs_info, "delayed_inode",
+					      btrfs_ino(inode), num_bytes, 1);
 		node->bytes_reserved = num_bytes;
+	}
 
-	if (release)
+	if (release) {
+		trace_btrfs_space_reservation(root->fs_info, "delalloc",
+					      btrfs_ino(inode), num_bytes, 0);
 		btrfs_block_rsv_release(root, src_rsv, num_bytes);
+	}
 
 	return ret;
 }
@@ -725,6 +743,8 @@ static void btrfs_delayed_inode_release_metadata(struct btrfs_root *root,
 		return;
 
 	rsv = &root->fs_info->delayed_block_rsv;
+	trace_btrfs_space_reservation(root->fs_info, "delayed_inode",
+				      node->inode_id, node->bytes_reserved, 0);
 	btrfs_block_rsv_release(root, rsv,
 				node->bytes_reserved);
 	node->bytes_reserved = 0;
@@ -1372,13 +1392,6 @@ int btrfs_insert_delayed_dir_index(struct btrfs_trans_handle *trans,
 		goto release_node;
 	}
 
-	ret = btrfs_delayed_item_reserve_metadata(trans, root, delayed_item);
-	/*
-	 * we have reserved enough space when we start a new transaction,
-	 * so reserving metadata failure is impossible
-	 */
-	BUG_ON(ret);
-
 	delayed_item->key.objectid = btrfs_ino(dir);
 	btrfs_set_key_type(&delayed_item->key, BTRFS_DIR_INDEX_KEY);
 	delayed_item->key.offset = index;
@@ -1391,6 +1404,14 @@ int btrfs_insert_delayed_dir_index(struct btrfs_trans_handle *trans,
 	dir_item->type = type;
 	memcpy((char *)(dir_item + 1), name, name_len);
 
+	ret = btrfs_delayed_item_reserve_metadata(trans, root, delayed_item);
+	/*
+	 * we have reserved enough space when we start a new transaction,
+	 * so reserving metadata failure is impossible
+	 */
+	BUG_ON(ret);
+
+
 	mutex_lock(&delayed_node->mutex);
 	ret = __btrfs_add_delayed_insertion_item(delayed_node, delayed_item);
 	if 

Can't resize second device in RAID1

2012-01-13 Thread Marco L. Crociani
Hi,
the situation:
Label: 'RootFS'  uuid: c87975a0-a575-405e-9890-d3f7f25bbd96
    Total devices 2 FS bytes used 284.98GB
    devid    2 size 311.82GB used 286.51GB path /dev/sdb3
    devid    1 size 897.76GB used 286.51GB path /dev/sda3

RootFS created when sda3 was 897.76GB and sdb3 311.82GB.
I have now freed other space on sdb. So I deleted sdb3 and recreated
it occupying all available space.

Disk /dev/sdb: 2000 GB, 2000396321280 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
/dev/sdb3  54  117249   941368837   83  Linux

same as
/dev/sda3  54  117249   941368837   83  Linux

# ./btrfs filesystem resize max /mnt/RootFS
Resize '/mnt/RootFS' of 'max'

on dmesg I get only:
[  657.438464] btrfs: new size for /dev/sda3 is 963962208256

# ./btrfs fi sh
Label: 'RootFS'  uuid: c87975a0-a575-405e-9890-d3f7f25bbd96
    Total devices 2 FS bytes used 284.98GB
    devid    2 size 311.82GB used 286.51GB path /dev/sdb3
    devid    1 size 897.76GB used 286.51GB path /dev/sda3

/dev/sdb3 is the same.

How can I resize /dev/sdb3?

Regards,

--
Marco Lorenzo Crociani,
marco.croci...@gmail.com


Re: Can't resize second device in RAID1

2012-01-13 Thread Hugo Mills
On Sat, Jan 14, 2012 at 12:12:06AM +0100, Marco L. Crociani wrote:
> Hi,
> the situation:
> Label: 'RootFS'  uuid: c87975a0-a575-405e-9890-d3f7f25bbd96
>     Total devices 2 FS bytes used 284.98GB
>     devid    2 size 311.82GB used 286.51GB path /dev/sdb3
>     devid    1 size 897.76GB used 286.51GB path /dev/sda3
>
> RootFS created when sda3 was 897.76GB and sdb3 311.82GB.
> I have now freed other space on sdb. So I deleted sdb3 and recreated
> it occupying all available space.
>
> Disk /dev/sdb: 2000 GB, 2000396321280 bytes
> 255 heads, 63 sectors/track, 243201 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> /dev/sdb3  54  117249   941368837   83  Linux
>
> same as
> /dev/sda3  54  117249   941368837   83  Linux
>
> # ./btrfs filesystem resize max /mnt/RootFS
> Resize '/mnt/RootFS' of 'max'
>
> on dmesg I get only:
> [  657.438464] btrfs: new size for /dev/sda3 is 963962208256
>
> # ./btrfs fi sh
> Label: 'RootFS'  uuid: c87975a0-a575-405e-9890-d3f7f25bbd96
>     Total devices 2 FS bytes used 284.98GB
>     devid    2 size 311.82GB used 286.51GB path /dev/sdb3
>     devid    1 size 897.76GB used 286.51GB path /dev/sda3
>
> /dev/sdb3 is the same.
>
> How can I resize /dev/sdb3?

   I think the syntax you need is btrfs fi resize max /mnt/RootFS:2

   But I could be wrong. If it works, can you add it to the UseCases
page on the wiki, please?

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Someone's been throwing dead sheep down my Fun Well ---   




Re: Can't resize second device in RAID1

2012-01-13 Thread Marco L. Crociani
On Sat, Jan 14, 2012 at 12:17 AM, Hugo Mills <h...@carfax.org.uk> wrote:
>    I think the syntax you need is btrfs fi resize max /mnt/RootFS:2


It's wrong :(

./btrfs fi resize max /mnt/RootFS:2
ERROR: can't access to '/mnt/RootFS:2'

>    But I could be wrong. If it works, can you add it to the UseCases
> page on the wiki, please?

Sure.



-- 
Marco Lorenzo Crociani,
marco.croci...@gmail.com