Re: [PATCH 2/2 v2] Btrfs: Per file/directory controls for COW and compression

2011-04-04 Thread Konstantinos Skarlatos

Hello,
I would like to ask about the status of this feature/patch: has it been 
accepted into the btrfs code, and how can I use it?


I am interested in enabling compression for a specific 
folder (force-compress would be ideal) of a large btrfs volume, and 
disabling it for the rest.



On 21/3/2011 10:57 AM, liubo wrote:

Data compression and data cow are controlled across the entire FS by mount
options right now.  ioctls are needed to set this on a per file or per
directory basis.  This has been proposed previously, but VFS developers
wanted us to use generic ioctls rather than btrfs-specific ones.

According to Chris's comment, there should be just one true compression
method (probably LZO) stored in the super.  However, before doing that, we
will wait until that one method is stable enough to be adopted into the super.
So I list it as a long-term goal, and just store it in RAM today.

After applying this patch, we can use the generic "FS_IOC_SETFLAGS" ioctl to
control a file's or directory's datacow and compression attributes.

NOTE:
  - The compression type is selected by the following rule:
    If btrfs is mounted with a compress option (i.e. zlib/lzo), that type is used.
    Otherwise, we'll use the default compress type (zlib today).
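
For illustration only (this is not part of the patch): a minimal userspace
sketch of driving the above through FS_IOC_SETFLAGS might look like the
following, assuming the patch is applied and that FS_COMPR_FL from
<linux/fs.h> is the attribute bit the btrfs ioctl handler maps to
BTRFS_INODE_COMPRESS.

/* setcompr.c - request compression on a file or directory (sketch). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	int fd, flags;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file-or-dir>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Read the current attribute flags, OR in the compress bit, write back. */
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
		perror("FS_IOC_GETFLAGS");
		return 1;
	}
	flags |= FS_COMPR_FL;
	if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) {
		perror("FS_IOC_SETFLAGS");
		return 1;
	}
	return 0;
}

Per the patch, flags set on a directory are picked up by new files created
in it (btrfs_inherit_iflags()) and by files moved into it (fixup_inode_flags()).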

v1->v2:
Rebase the patch with the latest btrfs.

Signed-off-by: Liu Bo
---
  fs/btrfs/ctree.h   |1 +
  fs/btrfs/disk-io.c |6 ++
  fs/btrfs/inode.c   |   32 
  fs/btrfs/ioctl.c   |   41 +
  4 files changed, 72 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8b4b9d1..b77d1a5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1283,6 +1283,7 @@ struct btrfs_root {
  #define BTRFS_INODE_NODUMP		(1 << 8)
  #define BTRFS_INODE_NOATIME		(1 << 9)
  #define BTRFS_INODE_DIRSYNC		(1 << 10)
+#define BTRFS_INODE_COMPRESS		(1 << 11)

  /* some macros to generate set/get funcs for the struct fields.  This
   * assumes there is a lefoo_to_cpu for every type, so lets make a simple
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3e1ea3e..a894c12 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1762,6 +1762,12 @@ struct btrfs_root *open_ctree(struct super_block *sb,

btrfs_check_super_valid(fs_info, sb->s_flags & MS_RDONLY);

+   /*
+    * In the long term, we'll store the compression type in the super
+    * block, and it'll be used for per file compression control.
+    */
+   fs_info->compress_type = BTRFS_COMPRESS_ZLIB;
+
ret = btrfs_parse_options(tree_root, options);
if (ret) {
err = ret;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index db67821..e687bb9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -381,7 +381,8 @@ again:
 */
if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS) &&
(btrfs_test_opt(root, COMPRESS) ||
-(BTRFS_I(inode)->force_compress))) {
+(BTRFS_I(inode)->force_compress) ||
+(BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS))) {
WARN_ON(pages);
pages = kzalloc(sizeof(struct page *) * nr_pages, GFP_NOFS);

@@ -1253,7 +1254,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
ret = run_delalloc_nocow(inode, locked_page, start, end,
 page_started, 0, nr_written);
else if (!btrfs_test_opt(root, COMPRESS) &&
-!(BTRFS_I(inode)->force_compress))
+!(BTRFS_I(inode)->force_compress) &&
+!(BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS))
ret = cow_file_range(inode, locked_page, start, end,
  page_started, nr_written, 1);
else
@@ -4581,8 +4583,6 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
location->offset = 0;
btrfs_set_key_type(location, BTRFS_INODE_ITEM_KEY);

-   btrfs_inherit_iflags(inode, dir);
-
if ((mode & S_IFREG)) {
if (btrfs_test_opt(root, NODATASUM))
BTRFS_I(inode)->flags |= BTRFS_INODE_NODATASUM;
@@ -4590,6 +4590,8 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
BTRFS_I(inode)->flags |= BTRFS_INODE_NODATACOW;
}

+   btrfs_inherit_iflags(inode, dir);
+
insert_inode_hash(inode);
inode_tree_add(inode);
return inode;
@@ -6803,6 +6805,26 @@ static int btrfs_getattr(struct vfsmount *mnt,
return 0;
  }

+/*
+ * If a file is moved, it will inherit the cow and compression flags of the new
+ * directory.
+ */
+static void fixup_inode_flags(struct inode *dir, struct inode *inode)
+{
+   struct btrfs_inode *b_dir = BTRFS_I(dir);
+   struct btrfs_inode *b_inode = BTRFS_I(inode);
+
+   if (b_dir->flags & BTRFS_INODE_NODATACOW)
+           b_inode->flags |= BTRFS_INODE_NODATACOW;

Odd rebalancing behavior

2011-04-04 Thread Michel Alexandre Salim
I have an external 4-disk enclosure, connected through USB 2.0 (my
laptop does not have a USB 3.0 connector, and the eSATA connector
somehow does not work); it initially had a 2-disk btrfs soft-RAID1 file
system (both data and metadata are RAID1).

I recently added two more disks and did a rebalance. To my surprise, it
went past the point where all four disks had the same amount of disk
usage, and went all the way to the original disks being empty and the
new disks holding all the data!

Label: 'media.store'  uuid: 4cfd3551-aa85-4399-b872-9238ddb14c97
Total devices 4 FS bytes used 1.22TB
devid3 size 1.82TB used 1.24TB path /dev/sdb
devid4 size 1.82TB used 1.24TB path /dev/sdc
devid2 size 1.82TB used 8.00MB path /dev/sde
devid1 size 1.82TB used 12.00MB path /dev/sdd

Is this to be expected? Would another rebalance fix it, or should I
force-stop it by shutting down when the disk usage is roughly balanced?

This is on Fedora 15 pre-release, x86_64, fully updated, with kernel
2.6.38.2-9 and btrfs-progs 0.19-13.

Thanks,

-- 
Michel Alexandre Salim
GPG key ID: 78884778

()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments


btrfs subvolume snapshot syntax too "smart"

2011-04-04 Thread krz...@gmail.com
I understand the btrfs intent, but the same command run twice should not give
different results. This really makes snapshot automation hard.


root@sv12 [/ssd]# btrfs subvolume snapshot /ssd/sub1 /ssd/5
Create a snapshot of '/ssd/sub1' in '/ssd/5'
root@sv12 [/ssd]# btrfs subvolume snapshot /ssd/sub1 /ssd/5
Create a snapshot of '/ssd/sub1' in '/ssd/5/sub1'
root@sv12 [/ssd]# btrfs subvolume snapshot /ssd/sub1 /ssd/5
Create a snapshot of '/ssd/sub1' in '/ssd/5/sub1'
ERROR: cannot snapshot '/ssd/sub1'


Re: btrfs balancing start - and stop?

2011-04-04 Thread Stephane Chazelas
2011-04-03 21:35:00 +0200, Helmut Hullen:
> Hallo, Stephane,
> 
> Du meintest am 03.04.11:
> 
>  balancing about 2 TByte needed about 20 hours.
> 
> [...]
> 
> >> Hugo has explained the limits of regarding
> >>
> >> dmesg | grep relocating
> >>
> >> or (more simple) the last lines of "dmesg" and looking for the
> >> "relocating" lines. But: what do these lines tell now? What is the
> >> (pessimistic) estimation when you extrapolate the data?
> 
> [...]
> 
> > 4.7 more days to go. And I reckon it will have written about 9
> > TB to disk by that time (which is the total size of the volume,
> > though only 3.8TB are occupied).
> 
> Yes - that's the pessimistic estimation. As Hugo has explained it can  
> finish faster - just look to the data tomorrow again.
[...]

That may be an optimistic estimation actually, as there hasn't
been much progress in the last 34 hours:

# dmesg | awk -F '[][ ]+' '/reloc/ &&n++%5==0 {x=(n-$7)/($2-t)/1048576; printf "%s\t%s\t%.2f\t%*s\n", $2/3600,$7, x, x/3, ""; t=$2; n=$7}' | tr ' ' '*' | tail -40
125.629 4170039951360   11.93   ***
125.641 4166818725888   70.99   ***
125.699 4157155049472   43.87   **
125.753 4144270147584   63.34   *
125.773 4137827696640   84.98   
125.786 4134606471168   64.39   *
125.823 4124942794752   70.09   ***
125.87  4112057892864   71.66   ***
125.887 4105615441920   100.60  *
125.898 4102394216448   81.26   ***
125.935 4092730540032   69.06   ***
126.33  4085751218176   4.69*
131.904 4072597880832   0.63
132.082 4059712978944   19.20   **
132.12  4053270528000   45.52   ***
132.138 4050049302528   45.60   ***
132.225 4040385626112   29.68   *
132.267 4027500724224   81.17   ***
132.283 4021058273280   106.31  ***
132.29  4017837047808   110.42  
132.316 4008173371392   100.54  *
132.358 3995288469504   81.18   ***
132.475 3988846018560   14.62   
132.514 3985624793088   21.55   ***
132.611 3975961116672   26.40   
132.663 3963076214784   65.31   *
132.678 3956633763840   120.11  
132.685 3956365328384   10.26   ***
137.701 3949922877440   0.34
137.709 3946701651968   106.54  ***
137.744 3937037975552   72.10   
137.889 3927105863680   18.18   **
137.901 3926837428224   5.85*
141.555 3926300557312   0.04
141.93  3925226815488   0.76
151.227 3924421509120   0.02
151.491 3924153073664   0.27
151.712 3923616202752   0.64
165.301 3922542460928   0.02
174.346 3921737154560   0.02

At this rate (third field expressed in MiB/s), it could take
months to complete.

iostat still reports writes at about 5MiB/s though. Note that
this system is not doing anything else at all.

There definitely seems to be scope for optimisation in the
"balancing" I'd say.

-- 
Stephane


Re: btrfs subvolume snapshot syntax too "smart"

2011-04-04 Thread Goffredo Baroncelli
On 04/04/2011 09:09 PM, krz...@gmail.com wrote:
> I understand the btrfs intent, but the same command run twice should not give
> different results. This really makes snapshot automation hard.
> 
> 
> root@sv12 [/ssd]# btrfs subvolume snapshot /ssd/sub1 /ssd/5
> Create a snapshot of '/ssd/sub1' in '/ssd/5'
> root@sv12 [/ssd]# btrfs subvolume snapshot /ssd/sub1 /ssd/5
> Create a snapshot of '/ssd/sub1' in '/ssd/5/sub1'
> root@sv12 [/ssd]# btrfs subvolume snapshot /ssd/sub1 /ssd/5
> Create a snapshot of '/ssd/sub1' in '/ssd/5/sub1'
> ERROR: cannot snapshot '/ssd/sub1'

The same is true for cp:

# cp -rf /ssd/sub1 /ssd/5   -> copy "sub1" as "5"
# cp -rf /ssd/sub1 /ssd/5   -> copy "sub1" in "5"

However, you are right. It could easily be fixed by adding a switch like
"--script", which would force the last part of the destination to be
handled as the name of the subvolume, raising an error if it already exists.

Is "subvolume snapshot" the only command which suffers from this kind of
problem?

Regards
G.Baroncelli




Re: btrfs subvolume snapshot syntax too "smart"

2011-04-04 Thread Freddie Cash
On Mon, Apr 4, 2011 at 12:47 PM, Goffredo Baroncelli  wrote:
> On 04/04/2011 09:09 PM, krz...@gmail.com wrote:
>> I understand the btrfs intent, but the same command run twice should not give
>> different results. This really makes snapshot automation hard.
>>
>>
>> root@sv12 [/ssd]# btrfs subvolume snapshot /ssd/sub1 /ssd/5
>> Create a snapshot of '/ssd/sub1' in '/ssd/5'
>> root@sv12 [/ssd]# btrfs subvolume snapshot /ssd/sub1 /ssd/5
>> Create a snapshot of '/ssd/sub1' in '/ssd/5/sub1'
>> root@sv12 [/ssd]# btrfs subvolume snapshot /ssd/sub1 /ssd/5
>> Create a snapshot of '/ssd/sub1' in '/ssd/5/sub1'
>> ERROR: cannot snapshot '/ssd/sub1'
>
> The same is true for cp:
>
> # cp -rf /ssd/sub1 /ssd/5       -> copy "sub1" as "5"
> # cp -rf /ssd/sub1 /ssd/5       -> copy "sub1" in "5"
>
> However, you are right. It could easily be fixed by adding a switch like
> "--script", which would force the last part of the destination to be
> handled as the name of the subvolume, raising an error if it already exists.
>
> Is "subvolume snapshot" the only command which suffers from this kind of
> problem?

Isn't this a situation where supporting a trailing / would help?

For example, a trailing / would mean "put the snapshot into the
folder".  Thus "btrfs subvolume snapshot /ssd/sub1 /ssd/5/" would
create a "sub1" snapshot inside the 5/ folder.  Running it a second
time would error out since /ssd/5/sub1/ already exists.  And if the 5/
folder doesn't exist, it would error out.

And without the / at the end, the destination names the snapshot itself.
Thus "btrfs subvolume snapshot /ssd/sub1 /ssd/5" would create a snapshot
named "/ssd/5": if nothing named 5 exists yet, it is created, and the
command errors out if /ssd/5 already exists (so running it a second time
would fail).

Or, something along those lines.  Similar to how other apps work
with/without a trailing /.
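
As a rough sketch of that rule (not actual btrfs-progs code; the helper
name and buffer handling are just for illustration), the destination
handling could look something like this:

/*
 * Sketch: with a trailing '/', dest is the directory to place the snapshot
 * in and the snapshot keeps the source's basename; without it, dest itself
 * is the snapshot name.  Either way an existing target is an error.
 */
#include <errno.h>
#include <libgen.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

static int resolve_snapshot_path(const char *src, const char *dest,
				 char *out, size_t outlen)
{
	struct stat st;
	size_t len = strlen(dest);
	char srccopy[PATH_MAX];

	if (len && dest[len - 1] == '/') {
		/* "put the snapshot into the folder": dest must already exist */
		if (stat(dest, &st) < 0 || !S_ISDIR(st.st_mode))
			return -ENOENT;
		strncpy(srccopy, src, sizeof(srccopy) - 1);
		srccopy[sizeof(srccopy) - 1] = '\0';
		snprintf(out, outlen, "%s%s", dest, basename(srccopy));
	} else {
		/* "name the snapshot": dest itself is the new snapshot */
		snprintf(out, outlen, "%s", dest);
	}

	/* In both cases, refuse to reuse an existing path. */
	if (stat(out, &st) == 0)
		return -EEXIST;
	return 0;
}

With that rule, "/ssd/5/" as the destination resolves to /ssd/5/sub1, while
"/ssd/5" stays /ssd/5, and a second run fails in either case.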

-- 
Freddie Cash
fjwc...@gmail.com


Re: [PATCH] Btrfs: fix free space cache when there are pinned extents and clusters V2

2011-04-04 Thread Mitch Harder
On Fri, Apr 1, 2011 at 9:55 AM, Josef Bacik  wrote:
> I noticed a huge problem with the free space cache that was presenting as an
> early ENOSPC.  Turns out when writing the free space cache out I forgot to
> take into account pinned extents and more importantly clusters.  This would
> result in us leaking free space everytime we unmounted the filesystem and
> remounted it.  I fix this by making sure to check and see if the current
> block group has a cluster and writing out any entries that are in the cluster
> to the cache, as well as writing any pinned extents we currently have to the
> cache since those will be available for us to use the next time the fs
> mounts.  This patch also adds a check to the end of load_free_space_cache to
> make sure we got the right amount of free space cache, and if not make sure
> to clear the cache and re-cache the old fashioned way.  Thanks,
>
> Signed-off-by: Josef Bacik 
> ---
> V1->V2:
> - use block_group->free_space instead of
>  btrfs_block_group_free_space(block_group)
>
>  fs/btrfs/free-space-cache.c |   82 --
>  1 files changed, 78 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index f03ef97..74bc432 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -24,6 +24,7 @@
>  #include "free-space-cache.h"
>  #include "transaction.h"
>  #include "disk-io.h"
> +#include "extent_io.h"
>
>  #define BITS_PER_BITMAP                (PAGE_CACHE_SIZE * 8)
>  #define MAX_CACHE_BYTES_PER_GIG        (32 * 1024)
> @@ -222,6 +223,7 @@ int load_free_space_cache(struct btrfs_fs_info *fs_info,
>        u64 num_entries;
>        u64 num_bitmaps;
>        u64 generation;
> +       u64 used = btrfs_block_group_used(&block_group->item);
>        u32 cur_crc = ~(u32)0;
>        pgoff_t index = 0;
>        unsigned long first_page_offset;
> @@ -467,6 +469,17 @@ next:
>                index++;
>        }
>
> +       spin_lock(&block_group->tree_lock);
> +       if (block_group->free_space != (block_group->key.offset - used -
> +                                       block_group->bytes_super)) {
> +               spin_unlock(&block_group->tree_lock);
> +               printk(KERN_ERR "block group %llu has an wrong amount of free "
> +                      "space\n", block_group->key.objectid);
> +               ret = 0;
> +               goto free_cache;
> +       }
> +       spin_unlock(&block_group->tree_lock);
> +
>        ret = 1;
>  out:
>        kfree(checksums);
> @@ -495,8 +508,11 @@ int btrfs_write_out_cache(struct btrfs_root *root,
>        struct list_head *pos, *n;
>        struct page *page;
>        struct extent_state *cached_state = NULL;
> +       struct btrfs_free_cluster *cluster = NULL;
> +       struct extent_io_tree *unpin = NULL;
>        struct list_head bitmap_list;
>        struct btrfs_key key;
> +       u64 start, end, len;
>        u64 bytes = 0;
>        u32 *crc, *checksums;
>        pgoff_t index = 0, last_index = 0;
> @@ -505,6 +521,7 @@ int btrfs_write_out_cache(struct btrfs_root *root,
>        int entries = 0;
>        int bitmaps = 0;
>        int ret = 0;
> +       bool next_page = false;
>
>        root = root->fs_info->tree_root;
>
> @@ -551,6 +568,18 @@ int btrfs_write_out_cache(struct btrfs_root *root,
>         */
>        first_page_offset = (sizeof(u32) * num_checksums) + sizeof(u64);
>
> +       /* Get the cluster for this block_group if it exists */
> +       if (!list_empty(&block_group->cluster_list))
> +               cluster = list_entry(block_group->cluster_list.next,
> +                                    struct btrfs_free_cluster,
> +                                    block_group_list);
> +
> +       /*
> +        * We shouldn't have switched the pinned extents yet so this is the
> +        * right one
> +        */
> +       unpin = root->fs_info->pinned_extents;
> +
>        /*
>         * Lock all pages first so we can lock the extent safely.
>         *
> @@ -580,6 +609,12 @@ int btrfs_write_out_cache(struct btrfs_root *root,
>        lock_extent_bits(&BTRFS_I(inode)->io_tree, 0, i_size_read(inode) - 1,
>                         0, &cached_state, GFP_NOFS);
>
> +       /*
> +        * When searching for pinned extents, we need to start at our start
> +        * offset.
> +        */
> +       start = block_group->key.objectid;
> +
>        /* Write out the extent entries */
>        do {
>                struct btrfs_free_space_entry *entry;
> @@ -587,6 +622,8 @@ int btrfs_write_out_cache(struct btrfs_root *root,
>                unsigned long offset = 0;
>                unsigned long start_offset = 0;
>
> +               next_page = false;
> +
>                if (index == 0) {
>                        start_offset = first_page_offset;
>                        offset = start_offset;
> @@ -598,7 +635,7 @@ int btrfs_write_out_cache(struct btrfs_root *root,
>                entry = addr + st

bug report

2011-04-04 Thread Larry D'Anna
So I made a filesystem image:

 $ dd if=/dev/zero of=root_fs bs=1024 count=$(expr 1024 \* 1024)
 $ mkfs.btrfs root_fs 

Then I put some Debian on it (my kernel is 2.6.35-27-generic #48-Ubuntu):

 $ mkdir root 
 $ mount -o loop root_fs root 
 $ debootstrap sid root 
 $ umount root

Then I run UML (2.6.35-1um-0ubuntu1):

 $ linux single eth0=tuntap,tap0,fe:fd:f0:00:00:01

and then try to apt-get some stuff, and the result is this:

btrfs csum failed ino 17498 off 2412544 csum 491052325 private 446722121
btrfs csum failed ino 17498 off 2416640 csum 2077462867 private 906054605
btrfs csum failed ino 17498 off 2420736 csum 263316283 private 2215839539
btrfs csum failed ino 17498 off 2424832 csum 4177088190 private 2414263107
btrfs csum failed ino 17498 off 2428928 csum 4028205539 private 3560605623
btrfs csum failed ino 17498 off 2433024 csum 1724529595 private 200634979
btrfs csum failed ino 17498 off 2437120 csum 4038631380 private 2927872002
btrfs csum failed ino 17498 off 2441216 csum 2616837020 private 729736037
btrfs csum failed ino 17498 off 2498560 csum 2566472073 private 3417075259
btrfs csum failed ino 17498 off 2502656 csum 2566472073 private 1410567947


 $ find / -mount -inum 17498  
 /var/cache/apt/srcpkgcache.bin

I've gone through this twice now, so it's repeatable at least.  I know 2.6.35 is
kinda old but was this kind of thing to be expected back then?  


  --larry


Re: [PATCH] Btrfs: fix subvolume mount by name problem when default mount subvolume is set

2011-04-04 Thread Chris Mason
Excerpts from Zhong, Xin's message of 2011-03-31 03:59:22 -0400:
> We create two subvolumes (meego_root and meego_home) in the
> btrfs root directory, and set meego_root as the default mount
> subvolume. After we remount btrfs, meego_root is mounted
> at the top directory by default. Then when we try to mount
> meego_home (subvol=meego_home) to a subdirectory, it fails.
> The problem is that when the default mount subvolume is set to
> meego_root, we search for meego_home in it but cannot find it.
> So the solution is to search for meego_home in the btrfs root
> directory instead when subvol=meego_home is given.

I think this one is difficult, because if they have set the default
subvolume, they might have done so because the original default contains
the result of a busted upgrade or something like that.

So I think subvol= should be relative to the default.  Would it work
for you to add a new mount option that specifies the subvolume id under
which to search for subvol=?
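
For reference, a minimal sketch of the two mounts being discussed,
expressed through mount(2); the device and mount points are made up for
illustration, and the open question above is which subvolume the
"meego_home" lookup is resolved against:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* Plain mount: gets the default subvolume (meego_root here). */
	if (mount("/dev/sda2", "/mnt/root", "btrfs", 0, NULL) < 0)
		perror("mount default subvolume");

	/* Explicit subvolume by name via the subvol= option. */
	if (mount("/dev/sda2", "/mnt/home", "btrfs", 0, "subvol=meego_home") < 0)
		perror("mount subvol=meego_home");

	return 0;
}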

-chris


Re: bug report

2011-04-04 Thread Helmut Hullen
Hello, Larry,

You wrote on 05.04.11:

[...]

> and then try to apt-get some stuff, and the result is this:

> btrfs csum failed ino 17498 off 2412544 csum 491052325 private
> 446722121
> btrfs csum failed ino 17498 off 2416640 csum 2077462867
> private 906054605

[...]

> I've gone through this twice now, so it's repeatable at least.  I
> know 2.6.35 is kinda old but was this kind of thing to be expected
> back then?

First try a current kernel; I prefer 2.6.38.1.

Best regards!
Helmut