[PATCH V19 00/19] Allow I/O on blocks whose size is less than page size

2016-06-14 Thread Chandan Rajendra
Btrfs assumes the block size to be the same as the machine's page
size. This means that a Btrfs instance created on a 4k page size
machine (e.g. x86) cannot be mounted on machines with larger page
sizes (e.g. PPC64/AARCH64). This patchset aims to resolve that
incompatibility.

This patchset continues with the work posted previously at
http://thread.gmane.org/gmane.comp.file-systems.btrfs/55653.

I have reverted the upstream commit "btrfs: fix lockups from
btrfs_clear_path_blocking" (f82c458a2c3ffb94b431fc6ad791a79df1b3713e)
since it led to soft lockups once the patch "Btrfs:
subpagesize-blocksize: Prevent writes to an extent buffer when
PG_writeback flag is set" is applied. During the 2015 Vault Conference
Btrfs meetup, Chris Mason suggested that he would write a suitable
locking function to be used when writing dirty pages that map
metadata blocks. Until such a locking function is available, this
patchset temporarily reverts commit
f82c458a2c3ffb94b431fc6ad791a79df1b3713e.

The commits for the Btrfs kernel module can be found at
https://github.com/chandanr/linux/tree/btrfs/subpagesize-blocksize.

To create a filesystem with block size < page size, a patched version
of the Btrfs-progs package is required. The corresponding fixes for
Btrfs-progs can be found at
https://github.com/chandanr/btrfs-progs/tree/btrfs/subpagesize-blocksize.

The patchset is based on the kdave/master branch. I cherry-picked the
following self-test fixes (from the kdave/for-next branch) before
applying the subpage-blocksize patchset:

Btrfs: self-tests: Fix extent buffer bitmap test fail on BE system
Btrfs: self-tests: Fix test_bitmaps fail on 64k sectorsize
Btrfs: self-tests: Use macros instead of constants and add missing newline
Btrfs: self-tests: Support testing all possible sectorsizes and nodesizes
Btrfs: self-tests: Execute page straddling test only when nodesize < PAGE_SIZE
Btrfs: self-tests: Support non-4k page size
Btrfs: Fix integer overflow when calculating bytes_per_bitmap
Btrfs: test_check_exists: Fix infinite loop when searching for free space entries

Fstests run status:
1. x86_64
   - With 4k sectorsize, all the tests that succeed with the master
     branch at git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git
     also do so with the patches applied.
   - With 2k sectorsize, generic/027 never seems to complete. In my
     case, the test did not complete even after 45 minutes of run time.
2. ppc64
   - With 4k sectorsize, 16k nodesize and the "nospace_cache" mount
     option, all the tests that succeed with the master branch at
     git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git
     also do so with the patches applied, except for the scrub and
     compression tests.
   - With 64k sectorsize & nodesize, all the tests that succeed with
     the master branch at
     git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git
     also do so with the patches applied.

TODO:
1. On ppc64, btrfsck segfaults when checking a filesystem instance
   having 2k sectorsize.
2. I am planning to fix scrub & compression via a separate patchset.

Changes from V18:
1. The per-page bitmap used to track the block status is now allocated
   from a slab cache.
2. The per-page bitmap is allocated and used only in cases where
   sectorsize < PAGE_SIZE.
3. The new patch "Btrfs: subpage-blocksize: Disable compression"
   disables compression in subpage-blocksize scenario.

Changes from V17:
1. Due to mistakes made during git rebase operations, fixes ended up
   in incorrect patches. This patchset gets the fixes in the right
   patches.

Changes from V16:
1. The V15 patchset consisted of patches obtained from an incorrect
   git branch. Apologies for the mistake. All the entries listed under
   "Changes from V15" hold good for V16.

Changes from V15:
1. The invocation of cleancache_get_page() in __do_readpage() assumed the
   blocksize to be the same as PAGE_SIZE. We now invoke cleancache_get_page()
   only if the blocksize is the same as PAGE_SIZE. Thanks to David Sterba for
   pointing this out.
2. In __extent_writepage_io() we used to accumulate all the contiguous
   dirty blocks within the page before submitting the file offset range
   for I/O. In some cases this caused the bio to span more than one
   stripe. For example, with a 4k block size, a 64K stripe size and a
   64K page size, assume
   - All the blocks mapped by the page are contiguous in the logical
     address space.
   - The first block of the page is mapped to the second block of the
     stripe.
   In such a scenario, we would add all the blocks of the page to the
   bio and hence overflow the stripe by one 4K block (see the sketch
   after this list). This patchset therefore removes the optimization
   and invokes submit_extent_page() for every dirty 4K block.
3. The following patches are newly added:
   - Btrfs: subpage-blocksize: __btrfs_lookup_bio_sums: Set offset when moving to a new bio_vec
   - Btrfs: subpage-blocksize: Ma
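To make point 2 under "Changes from V15" concrete, here is a small,
self-contained arithmetic sketch (not part of the patchset) of how a
whole-page bio crosses a 64K stripe boundary:

#include <stdio.h>

int main(void)
{
	unsigned long long blocksize  = 4096;        /* 4K blocks */
	unsigned long long stripe_len = 64 * 1024;   /* 64K stripes */
	unsigned long long page_size  = 64 * 1024;   /* 64K pages, e.g. ppc64 */

	/* The page's first block maps to the second block of a stripe. */
	unsigned long long logical_start = stripe_len + blocksize;
	unsigned long long logical_end   = logical_start + page_size - 1;

	/* Prints "bio spans stripes 1 to 2": the last 4K block of the page
	 * spills into the next stripe, hence one submit_extent_page() call
	 * per dirty block instead of one bio for the whole page.
	 */
	printf("bio spans stripes %llu to %llu\n",
	       logical_start / stripe_len, logical_end / stripe_len);
	return 0;
}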

[PATCH V19 10/19] Btrfs: subpage-blocksize: btrfs_punch_hole: Fix uptodate blocks check

2016-06-14 Thread Chandan Rajendra
In the case of subpage-blocksize, the file blocks to be punched may map only
part of a page. For file blocks inside such pages, we need to check for
the presence of the BLK_STATE_UPTODATE flag.
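A condensed sketch of that check (the helper below is illustrative and
not part of the patch; test_page_blks_state() and BLK_STATE_UPTODATE are
the names used in the diff):

/* Illustrative helper: decide whether the blocks of 'page' within
 * [start, end] are uptodate. With sectorsize < PAGE_SIZE the page-wide
 * PG_uptodate flag is too coarse, so the per-block bitmap is consulted.
 */
static int hole_range_uptodate(struct btrfs_root *root, struct page *page,
			       u64 start, u64 end)
{
	if (!page || !PagePrivate(page))
		return 0;

	if (root->sectorsize < PAGE_SIZE)
		return test_page_blks_state(page, 1 << BLK_STATE_UPTODATE,
					    start, end, 0);

	return PageUptodate(page);
}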

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/file.c | 89 -
 1 file changed, 88 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 26c93f2..df9a6bc 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2327,6 +2327,8 @@ static int btrfs_punch_hole(struct inode *inode, loff_t 
offset, loff_t len)
struct btrfs_path *path;
struct btrfs_block_rsv *rsv;
struct btrfs_trans_handle *trans;
+   struct address_space *mapping = inode->i_mapping;
+   pgoff_t start_index, end_index;
u64 lockstart;
u64 lockend;
u64 tail_start;
@@ -2339,6 +2341,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t 
offset, loff_t len)
int err = 0;
unsigned int rsv_count;
bool same_block;
+   bool same_page;
bool no_holes = btrfs_fs_incompat(root->fs_info, NO_HOLES);
u64 ino_size;
bool truncated_block = false;
@@ -2435,11 +2438,45 @@ static int btrfs_punch_hole(struct inode *inode, loff_t 
offset, loff_t len)
goto out_only_mutex;
}
 
+   start_index = lockstart >> PAGE_SHIFT;
+   end_index = lockend >> PAGE_SHIFT;
+
+   same_page = lockstart >> PAGE_SHIFT
+   == lockend >> PAGE_SHIFT;
+
while (1) {
struct btrfs_ordered_extent *ordered;
+   struct page *start_page = NULL;
+   struct page *end_page = NULL;
+   u64 nr_pages;
+   int start_page_blks_uptodate;
+   int end_page_blks_uptodate;
 
truncate_pagecache_range(inode, lockstart, lockend);
 
+   if (lockstart & (PAGE_SIZE - 1)) {
+   start_page = find_or_create_page(mapping, start_index,
+   GFP_NOFS);
+   if (!start_page) {
+   inode_unlock(inode);
+   return -ENOMEM;
+   }
+   }
+
+   if (!same_page && ((lockend + 1) & (PAGE_SIZE - 1))) {
+   end_page = find_or_create_page(mapping, end_index,
+   GFP_NOFS);
+   if (!end_page) {
+   if (start_page) {
+   unlock_page(start_page);
+   put_page(start_page);
+   }
+   inode_unlock(inode);
+   return -ENOMEM;
+   }
+   }
+
+
lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend,
 &cached_state);
ordered = btrfs_lookup_first_ordered_extent(inode, lockend);
@@ -2449,18 +2486,68 @@ static int btrfs_punch_hole(struct inode *inode, loff_t 
offset, loff_t len)
 * and nobody raced in and read a page in this range, if we did
 * we need to try again.
 */
+   nr_pages = round_up(lockend, PAGE_SIZE)
+   - round_down(lockstart, PAGE_SIZE);
+   nr_pages >>= PAGE_SHIFT;
+
+   start_page_blks_uptodate = 0;
+   end_page_blks_uptodate = 0;
+   if (root->sectorsize < PAGE_SIZE) {
+   u64 page_end;
+
+   page_end = round_down(lockstart, PAGE_SIZE)
+   + PAGE_SIZE - 1;
+   page_end = min(page_end, lockend);
+   if (start_page
+   && PagePrivate(start_page)
+   && test_page_blks_state(start_page, 1 << BLK_STATE_UPTODATE,
+   lockstart, page_end, 0))
+   start_page_blks_uptodate = 1;
+   if (end_page
+   && PagePrivate(end_page)
+   && test_page_blks_state(end_page, 1 << BLK_STATE_UPTODATE,
+   page_offset(end_page), lockend, 0))
+   end_page_blks_uptodate = 1;
+   } else {
+   if (start_page && PagePrivate(start_page)
+   && PageUptodate(start_page))
+   start_page_blks_uptodate = 1;
+   if (end_page && PagePrivate(end_page)
+   && PageUptodate(end_page))
+   end_page_blks_uptodate = 1;
+   }
+
if ((!ordered ||
(ordered->file_offset + orde

[PATCH V19 09/19] Btrfs: subpage-blocksize: Explicitly track I/O status of blocks of an ordered extent.

2016-06-14 Thread Chandan Rajendra
In the subpage-blocksize scenario a page can have more than one block. So in
addition to the PagePrivate2 flag, we have to track the I/O status of
each block of a page to reliably mark the ordered extent as complete.
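The idea in condensed form (the blocks_done bitmap follows the diff
below; the helper itself is only an illustration, not the patch's code):

/* Illustrative sketch: an ordered extent carries one bit per block in
 * 'blocks_done'. A block contributes to ordered-extent completion only
 * the first time its I/O finishes, so a block that is processed twice
 * is not counted twice.
 */
static void mark_blk_done(struct btrfs_ordered_extent *ordered,
			  struct inode *inode, u64 blk, int uptodate)
{
	u64 offset;

	if (test_and_set_bit(blk, ordered->blocks_done))
		return;	/* already accounted for */

	offset = ordered->file_offset + (blk << inode->i_blkbits);
	btrfs_dec_test_ordered_pending(inode, &ordered, offset,
				       BTRFS_I(inode)->root->sectorsize,
				       uptodate);
}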

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c|  19 +--
 fs/btrfs/extent_io.h|   5 +-
 fs/btrfs/inode.c| 365 ++--
 fs/btrfs/ordered-data.c |  19 +++
 fs/btrfs/ordered-data.h |   4 +
 5 files changed, 296 insertions(+), 116 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 74e27f9..40ed2f0 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4654,11 +4654,10 @@ int extent_invalidatepage(struct extent_io_tree *tree,
  * to drop the page.
  */
 static int try_release_extent_state(struct extent_map_tree *map,
-   struct extent_io_tree *tree,
-   struct page *page, gfp_t mask)
+   struct extent_io_tree *tree,
+   struct page *page, u64 start, u64 end,
+   gfp_t mask)
 {
-   u64 start = page_offset(page);
-   u64 end = start + PAGE_SIZE - 1;
int ret = 1;
 
if (test_range_bit(tree, start, end,
@@ -4692,12 +4691,12 @@ static int try_release_extent_state(struct 
extent_map_tree *map,
  * map records are removed
  */
 int try_release_extent_mapping(struct extent_map_tree *map,
-  struct extent_io_tree *tree, struct page *page,
-  gfp_t mask)
+   struct extent_io_tree *tree, struct page *page,
+   u64 start, u64 end, gfp_t mask)
 {
struct extent_map *em;
-   u64 start = page_offset(page);
-   u64 end = start + PAGE_SIZE - 1;
+   u64 orig_start = start;
+   u64 orig_end = end;
 
if (gfpflags_allow_blocking(mask) &&
page->mapping->host->i_size > SZ_16M) {
@@ -4731,7 +4730,9 @@ int try_release_extent_mapping(struct extent_map_tree 
*map,
free_extent_map(em);
}
}
-   return try_release_extent_state(map, tree, page, mask);
+   return try_release_extent_state(map, tree, page,
+   orig_start, orig_end,
+   mask);
 }
 
 /*
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 9dd84ef..b5304c4 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -279,8 +279,9 @@ typedef struct extent_map *(get_extent_t)(struct inode 
*inode,
 void extent_io_tree_init(struct extent_io_tree *tree,
 struct address_space *mapping);
 int try_release_extent_mapping(struct extent_map_tree *map,
-  struct extent_io_tree *tree, struct page *page,
-  gfp_t mask);
+   struct extent_io_tree *tree, struct page *page,
+   u64 start, u64 end,
+   gfp_t mask);
 int try_release_extent_buffer(struct page *page);
 int lock_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 struct extent_state **cached);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a6bb415..a8d745a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3058,56 +3058,119 @@ static void finish_ordered_fn(struct btrfs_work *work)
btrfs_finish_ordered_io(ordered_extent);
 }
 
-static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
-   struct extent_state *state, int uptodate)
+static void mark_blks_io_complete(struct btrfs_ordered_extent *ordered,
+   u64 blk, u64 nr_blks, int uptodate)
 {
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = ordered->inode;
struct btrfs_root *root = BTRFS_I(inode)->root;
-   struct btrfs_ordered_extent *ordered_extent = NULL;
struct btrfs_workqueue *wq;
btrfs_work_func_t func;
-   u64 ordered_start, ordered_end;
int done;
 
-   trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
+   while (nr_blks--) {
+   if (test_and_set_bit(blk, ordered->blocks_done)) {
+   blk++;
+   continue;
+   }
 
-   ClearPagePrivate2(page);
-loop:
-   ordered_extent = btrfs_lookup_ordered_range(inode, start,
-   end - start + 1);
-   if (!ordered_extent)
-   goto out;
+   done = btrfs_dec_test_ordered_pending(inode, &ordered,
+   ordered->file_offset
+   + (blk << inode->i_blkbits),
+   root->sectorsize,
+   uptodate);
+   if (done) {
+  

[PATCH V19 07/19] Btrfs: subpage-blocksize: Allow mounting filesystems where sectorsize < PAGE_SIZE

2016-06-14 Thread Chandan Rajendra
This commit allows mounting filesystem instances whose sectorsize is smaller
than the PAGE_SIZE.

Since the code assumes that the super block is either equal to or larger
than the sectorsize, this commit brings back the nodesize argument of the
btrfs_find_create_tree_block() function. This change allows us to mount
and use filesystems with 2048 bytes as the sectorsize.
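The relaxed validity check, summarized as a sketch (it mirrors the
btrfs_check_super_valid() hunk below and is not a new helper added by
the patch):

/* Sketch: sectorsize may now be as small as 2048 bytes, provided it is
 * a power of two and does not exceed the metadata blocksize limit;
 * nodesize must still be a power of two, at least as large as
 * sectorsize, and within the same limit.
 */
static bool subpage_sizes_valid(u64 sectorsize, u64 nodesize)
{
	if (!is_power_of_2(sectorsize) || sectorsize < 2048 ||
	    sectorsize > BTRFS_MAX_METADATA_BLOCKSIZE)
		return false;

	if (!is_power_of_2(nodesize) || nodesize < sectorsize ||
	    nodesize > BTRFS_MAX_METADATA_BLOCKSIZE)
		return false;

	return true;
}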

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/disk-io.c | 21 -
 fs/btrfs/disk-io.h |  2 +-
 fs/btrfs/extent-tree.c |  4 ++--
 fs/btrfs/extent_io.c   |  3 +--
 fs/btrfs/extent_io.h   |  4 ++--
 fs/btrfs/tree-log.c|  2 +-
 fs/btrfs/volumes.c |  9 ++---
 7 files changed, 17 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 75129d2..b50f9a3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1099,7 +1099,7 @@ void readahead_tree_block(struct btrfs_root *root, u64 
bytenr)
struct extent_buffer *buf = NULL;
struct inode *btree_inode = root->fs_info->btree_inode;
 
-   buf = btrfs_find_create_tree_block(root, bytenr);
+   buf = btrfs_find_create_tree_block(root, bytenr, root->nodesize);
if (!buf)
return;
read_extent_buffer_pages(&BTRFS_I(btree_inode)->io_tree,
@@ -1115,7 +1115,7 @@ int reada_tree_block_flagged(struct btrfs_root *root, u64 
bytenr,
struct extent_io_tree *io_tree = &BTRFS_I(btree_inode)->io_tree;
int ret;
 
-   buf = btrfs_find_create_tree_block(root, bytenr);
+   buf = btrfs_find_create_tree_block(root, bytenr, root->nodesize);
if (!buf)
return 0;
 
@@ -1146,12 +1146,12 @@ struct extent_buffer *btrfs_find_tree_block(struct 
btrfs_fs_info *fs_info,
 }
 
 struct extent_buffer *btrfs_find_create_tree_block(struct btrfs_root *root,
-u64 bytenr)
+u64 bytenr, u32 blocksize)
 {
if (btrfs_test_is_dummy_root(root))
return alloc_test_extent_buffer(root->fs_info, bytenr,
-   root->nodesize);
-   return alloc_extent_buffer(root->fs_info, bytenr);
+   blocksize);
+   return alloc_extent_buffer(root->fs_info, bytenr, blocksize);
 }
 
 
@@ -1175,7 +1175,7 @@ struct extent_buffer *read_tree_block(struct btrfs_root 
*root, u64 bytenr,
struct extent_buffer *buf = NULL;
int ret;
 
-   buf = btrfs_find_create_tree_block(root, bytenr);
+   buf = btrfs_find_create_tree_block(root, bytenr, root->nodesize);
if (!buf)
return ERR_PTR(-ENOMEM);
 
@@ -4089,17 +4089,12 @@ static int btrfs_check_super_valid(struct btrfs_fs_info 
*fs_info,
 * Check sectorsize and nodesize first, other check will need it.
 * Check all possible sectorsize(4K, 8K, 16K, 32K, 64K) here.
 */
-   if (!is_power_of_2(sectorsize) || sectorsize < 4096 ||
+   if (!is_power_of_2(sectorsize) || sectorsize < 2048 ||
sectorsize > BTRFS_MAX_METADATA_BLOCKSIZE) {
printk(KERN_ERR "BTRFS: invalid sectorsize %llu\n", sectorsize);
ret = -EINVAL;
}
-   /* Only PAGE SIZE is supported yet */
-   if (sectorsize != PAGE_SIZE) {
-   printk(KERN_ERR "BTRFS: sectorsize %llu not supported yet, only support %lu\n",
-   sectorsize, PAGE_SIZE);
-   ret = -EINVAL;
-   }
+
if (!is_power_of_2(nodesize) || nodesize < sectorsize ||
nodesize > BTRFS_MAX_METADATA_BLOCKSIZE) {
printk(KERN_ERR "BTRFS: invalid nodesize %llu\n", nodesize);
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index a81ff8d..aa3fb08 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -50,7 +50,7 @@ void readahead_tree_block(struct btrfs_root *root, u64 
bytenr);
 int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr,
 int mirror_num, struct extent_buffer **eb);
 struct extent_buffer *btrfs_find_create_tree_block(struct btrfs_root *root,
-  u64 bytenr);
+  u64 bytenr, u32 blocksize);
 void clean_tree_block(struct btrfs_trans_handle *trans,
  struct btrfs_fs_info *fs_info, struct extent_buffer *buf);
 int open_ctree(struct super_block *sb,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index eed17ec..d225479 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8015,7 +8015,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, 
struct btrfs_root *root,
 {
struct extent_buffer *buf;
 
-   buf = btrfs_find_create_tree_block(root, bytenr);
+   buf = btrfs_find_create_tree_block(root, bytenr, root->nodesize);
if (!buf)
return ERR_PTR(-ENOMEM);
btrfs_set_header_generation(buf, trans->transid);

[PATCH V19 08/19] Btrfs: subpage-blocksize: Deal with partial ordered extent allocations.

2016-06-14 Thread Chandan Rajendra
In the subpage-blocksize scenario, extent allocations for only some of the
dirty blocks of a page can succeed, while allocation for the rest of the
blocks can fail. This patch allows I/O against such pages to be
submitted.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c | 27 ++-
 fs/btrfs/inode.c | 39 ++-
 2 files changed, 40 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0465311..74e27f9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1863,17 +1863,23 @@ void extent_clear_unlock_delalloc(struct inode *inode, 
u64 start, u64 end,
if (page_ops & PAGE_SET_PRIVATE2)
SetPagePrivate2(pages[i]);
 
+   if (page_ops & PAGE_SET_ERROR)
+   SetPageError(pages[i]);
+
if (pages[i] == locked_page) {
put_page(pages[i]);
continue;
}
-   if (page_ops & PAGE_CLEAR_DIRTY)
+
+   if ((page_ops & PAGE_CLEAR_DIRTY)
+   && !PagePrivate2(pages[i]))
clear_page_dirty_for_io(pages[i]);
-   if (page_ops & PAGE_SET_WRITEBACK)
+   if ((page_ops & PAGE_SET_WRITEBACK)
+   && !PagePrivate2(pages[i]))
set_page_writeback(pages[i]);
-   if (page_ops & PAGE_SET_ERROR)
-   SetPageError(pages[i]);
-   if (page_ops & PAGE_END_WRITEBACK)
+
+   if ((page_ops & PAGE_END_WRITEBACK)
+   && !PagePrivate2(pages[i]))
end_page_writeback(pages[i]);
if (page_ops & PAGE_UNLOCK)
unlock_page(pages[i]);
@@ -2565,7 +2571,7 @@ void end_extent_writepage(struct page *page, int err, u64 
start, u64 end)
uptodate = 0;
}
 
-   if (!uptodate) {
+   if (!uptodate || PageError(page)) {
ClearPageUptodate(page);
SetPageError(page);
ret = ret < 0 ? ret : -EIO;
@@ -3420,7 +3426,6 @@ static noinline_for_stack int writepage_delalloc(struct 
inode *inode,
   nr_written);
/* File system has been set read-only */
if (ret) {
-   SetPageError(page);
/* fill_delalloc should be return < 0 for error
 * but just in case, we use > 0 here meaning the
 * IO is started, so we don't want to return > 0
@@ -3641,7 +3646,6 @@ static int __extent_writepage(struct page *page, struct 
writeback_control *wbc,
struct inode *inode = page->mapping->host;
struct extent_page_data *epd = data;
u64 start = page_offset(page);
-   u64 page_end = start + PAGE_SIZE - 1;
int ret;
int nr = 0;
size_t pg_offset = 0;
@@ -3686,7 +3690,7 @@ static int __extent_writepage(struct page *page, struct 
writeback_control *wbc,
ret = writepage_delalloc(inode, page, wbc, epd, start, &nr_written);
if (ret == 1)
goto done_unlocked;
-   if (ret)
+   if (ret && !PagePrivate2(page))
goto done;
 
ret = __extent_writepage_io(inode, page, wbc, epd,
@@ -3700,10 +3704,7 @@ done:
set_page_writeback(page);
end_page_writeback(page);
}
-   if (PageError(page)) {
-   ret = ret < 0 ? ret : -EIO;
-   end_extent_writepage(page, ret, start, page_end);
-   }
+
unlock_page(page);
return ret;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e85865b..a6bb415 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -943,6 +943,8 @@ static noinline int cow_file_range(struct inode *inode,
struct btrfs_key ins;
struct extent_map *em;
struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+   struct btrfs_ordered_extent *ordered;
+   unsigned long page_ops, extent_ops;
int ret = 0;
 
if (btrfs_is_free_space_inode(inode)) {
@@ -987,8 +989,6 @@ static noinline int cow_file_range(struct inode *inode,
btrfs_drop_extent_cache(inode, start, start + num_bytes - 1, 0);
 
while (disk_num_bytes > 0) {
-   unsigned long op;
-
cur_alloc_size = disk_num_bytes;
ret = btrfs_reserve_extent(root, cur_alloc_size,
   root->sectorsize, 0, alloc_hint,
@@ -1041,7 +1041,7 @@ static noinline int cow_file_range(struct inode *inode,
ret = btrfs_reloc_clone_csums(inode, start,
   

[PATCH V19 06/19] Btrfs: subpage-blocksize: Write only dirty extent buffers belonging to a page

2016-06-14 Thread Chandan Rajendra
For the subpage-blocksize scenario, this patch adds the ability to write
a single extent buffer to the disk.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/disk-io.c   |  32 +++---
 fs/btrfs/extent_io.c | 277 +--
 2 files changed, 242 insertions(+), 67 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index fe89687..75129d2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -504,28 +504,30 @@ static int btree_read_extent_buffer_pages(struct 
btrfs_root *root,
 
 static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
 {
-   u64 start = page_offset(page);
-   u64 found_start;
struct extent_buffer *eb;
+   u64 found_start;
+   int ret;
 
eb = (struct extent_buffer *)page->private;
if (page != eb_head(eb)->pages[0])
return 0;
 
-   found_start = btrfs_header_bytenr(eb);
-   /*
-* Please do not consolidate these warnings into a single if.
-* It is useful to know what went wrong.
-*/
-   if (WARN_ON(found_start != start))
-   return -EUCLEAN;
-   if (WARN_ON(!PageUptodate(page)))
-   return -EUCLEAN;
-
-   ASSERT(memcmp_extent_buffer(eb, fs_info->fsid,
-   btrfs_header_fsid(), BTRFS_FSID_SIZE) == 0);
+   do {
+   if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags))
+   continue;
+   if (WARN_ON(!test_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags)))
+   continue;
+   found_start = btrfs_header_bytenr(eb);
+   if (WARN_ON(found_start != eb->start))
+   return 0;
+   ASSERT(memcmp_extent_buffer(eb, fs_info->fsid,
+   btrfs_header_fsid(), BTRFS_FSID_SIZE) == 0);
+   ret = csum_tree_block(fs_info, eb, 0);
+   if (ret)
+   return ret;
+   } while ((eb = eb->eb_next) != NULL);
 
-   return csum_tree_block(fs_info, eb, 0);
+   return 0;
 }
 
 static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f62a039..b7ad9c1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3717,29 +3717,49 @@ void wait_on_extent_buffer_writeback(struct 
extent_buffer *eb)
TASK_UNINTERRUPTIBLE);
 }
 
-static noinline_for_stack int
-lock_extent_buffer_for_io(struct extent_buffer *eb,
- struct btrfs_fs_info *fs_info,
- struct extent_page_data *epd)
+static void lock_extent_buffer_pages(struct extent_buffer_head *ebh,
+   struct extent_page_data *epd)
 {
+   struct extent_buffer *eb = &ebh->eb;
unsigned long i, num_pages;
-   int flush = 0;
+
+   num_pages = num_extent_pages(eb->start, eb->len);
+   for (i = 0; i < num_pages; i++) {
+   struct page *p = ebh->pages[i];
+   if (!trylock_page(p)) {
+   flush_write_bio(epd);
+   lock_page(p);
+   }
+   }
+
+   return;
+}
+
+static int noinline_for_stack
+lock_extent_buffer_for_io(struct extent_buffer *eb,
+   struct btrfs_fs_info *fs_info,
+   struct extent_page_data *epd)
+{
+   int dirty;
int ret = 0;
 
if (!btrfs_try_tree_write_lock(eb)) {
-   flush = 1;
flush_write_bio(epd);
btrfs_tree_lock(eb);
}
 
if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags)) {
+   dirty = test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
btrfs_tree_unlock(eb);
-   if (!epd->sync_io)
-   return 0;
-   if (!flush) {
-   flush_write_bio(epd);
-   flush = 1;
+   if (!epd->sync_io) {
+   if (!dirty)
+   return 1;
+   else
+   return 2;
}
+
+   flush_write_bio(epd);
+
while (1) {
wait_on_extent_buffer_writeback(eb);
btrfs_tree_lock(eb);
@@ -3762,29 +3782,14 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
__percpu_counter_add(&fs_info->dirty_metadata_bytes,
 -eb->len,
 fs_info->dirty_metadata_batch);
-   ret = 1;
+   ret = 0;
} else {
spin_unlock(&eb_head(eb)->refs_lock);
+   ret = 1;
}
 
btrfs_tree_unlock(eb);
 
-   if (!ret)
-   return ret;
-
-   num_pages = num_extent_pages(eb->start, eb->len);
-   for (i = 0; i < num_pages; i++) {
-   struct page *p = eb_head(eb)->pages[i];
-
- 

[PATCH V19 05/19] Btrfs: subpage-blocksize: Read tree blocks whose size is < PAGE_SIZE

2016-06-14 Thread Chandan Rajendra
In the case of subpage-blocksize, this patch makes it possible to read
only a single metadata block from the disk instead of all the metadata
blocks that are mapped by the page.
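The core lookup, condensed into a standalone helper for illustration
(the read-completion hook in the diff below performs this walk inline):

/* Illustrative helper: find the extent_buffer attached to 'page' that
 * covers logical offset 'start' by walking the per-page chain of
 * buffers introduced by the extent_buffer_head patch.
 */
static struct extent_buffer *eb_for_offset(struct page *page, u64 start)
{
	struct extent_buffer *eb = (struct extent_buffer *)page->private;

	do {
		if (eb->start <= start && start < eb->start + eb->len)
			return eb;
	} while ((eb = eb->eb_next) != NULL);

	return NULL;
}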

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/disk-io.c   |  52 -
 fs/btrfs/disk-io.h   |   3 ++
 fs/btrfs/extent_io.c | 127 +++
 3 files changed, 142 insertions(+), 40 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2d20845..fe89687 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -612,29 +612,36 @@ static noinline int check_leaf(struct btrfs_root *root,
return 0;
 }
 
-static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
- u64 phy_offset, struct page *page,
- u64 start, u64 end, int mirror)
+int verify_extent_buffer_read(struct btrfs_io_bio *io_bio,
+   struct page *page,
+   u64 start, u64 end, int mirror)
 {
-   u64 found_start;
-   int found_level;
+   struct address_space *mapping = (io_bio->bio).bi_io_vec->bv_page->mapping;
+   struct extent_buffer_head *ebh;
struct extent_buffer *eb;
-   struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
+   struct btrfs_root *root = BTRFS_I(mapping->host)->root;
struct btrfs_fs_info *fs_info = root->fs_info;
-   int ret = 0;
+   u64 found_start;
+   int found_level;
int reads_done;
-
-   if (!page->private)
-   goto out;
+   int ret = 0;
 
eb = (struct extent_buffer *)page->private;
+   do {
+   if ((eb->start <= start) && (eb->start + eb->len - 1 > start))
+   break;
+   } while ((eb = eb->eb_next) != NULL);
+
+   ASSERT(eb);
+
+   ebh = eb_head(eb);
 
/* the pending IO might have been the only thing that kept this buffer
 * in memory.  Make sure we have a ref for all this other checks
 */
extent_buffer_get(eb);
 
-   reads_done = atomic_dec_and_test(&eb_head(eb)->io_bvecs);
+   reads_done = atomic_dec_and_test(&ebh->io_bvecs);
if (!reads_done)
goto err;
 
@@ -690,30 +697,13 @@ err:
btree_readahead_hook(fs_info, eb, eb->start, ret);
 
if (ret) {
-   /*
-* our io error hook is going to dec the io pages
-* again, we have to make sure it has something
-* to decrement
-*/
atomic_inc(&eb_head(eb)->io_bvecs);
clear_extent_buffer_uptodate(eb);
}
-   free_extent_buffer(eb);
-out:
-   return ret;
-}
 
-static int btree_io_failed_hook(struct page *page, int failed_mirror)
-{
-   struct extent_buffer *eb;
+   free_extent_buffer(eb);
 
-   eb = (struct extent_buffer *)page->private;
-   set_bit(EXTENT_BUFFER_READ_ERR, &eb->ebflags);
-   eb->read_mirror = failed_mirror;
-   atomic_dec(&eb_head(eb)->io_bvecs);
-   if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->ebflags))
-   btree_readahead_hook(eb_head(eb)->fs_info, eb, eb->start, -EIO);
-   return -EIO;/* we fixed nothing */
+   return ret;
 }
 
 static void end_workqueue_bio(struct bio *bio)
@@ -4518,8 +4508,6 @@ static int btrfs_cleanup_transaction(struct btrfs_root 
*root)
 }
 
 static const struct extent_io_ops btree_extent_io_ops = {
-   .readpage_end_io_hook = btree_readpage_end_io_hook,
-   .readpage_io_failed_hook = btree_io_failed_hook,
.submit_bio_hook = btree_submit_bio_hook,
/* note we're sharing with inode.c for the merge bio hook */
.merge_bio_hook = btrfs_merge_bio_hook,
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index acba821..a81ff8d 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -113,6 +113,9 @@ static inline void btrfs_put_fs_root(struct btrfs_root 
*root)
kfree(root);
 }
 
+int verify_extent_buffer_read(struct btrfs_io_bio *io_bio,
+   struct page *page,
+   u64 start, u64 end, int mirror);
 void btrfs_mark_buffer_dirty(struct extent_buffer *buf);
 int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
  int atomic);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d0a3c5a..f62a039 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -14,6 +14,7 @@
 #include "extent_io.h"
 #include "extent_map.h"
 #include "ctree.h"
+#include "disk-io.h"
 #include "btrfs_inode.h"
 #include "volumes.h"
 #include "check-integrity.h"
@@ -2200,7 +2201,7 @@ int repair_eb_io_failure(struct btrfs_root *root, struct 
extent_buffer *eb,
struct page *p = eb_head(eb)->pages[i];
 
ret = repair_io_failure(root->fs_info->btree_inode, start,
-   PAGE_SIZE, start, p,
+

[PATCH V19 04/19] Btrfs: subpage-blocksize: Define extent_buffer_head

2016-06-14 Thread Chandan Rajendra
In order to handle multiple extent buffers per page, we first need a way to
manage all the extent buffers that are attached to a page.

This patch creates a new data structure, 'struct extent_buffer_head', and
moves the fields that are common to all extent buffers of a page from
'struct extent_buffer' to 'struct extent_buffer_head'.

Also, this patch moves the EXTENT_BUFFER_TREE_REF, EXTENT_BUFFER_DUMMY and
EXTENT_BUFFER_IN_TREE flags from extent_buffer->ebflags to
extent_buffer_head->bflags.
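An abridged view of the resulting layout (field names follow the patch,
but this is a simplified sketch rather than the complete definitions):

/* Simplified sketch of the split: one extent_buffer_head per metadata
 * page; the extent buffers that live in that page hang off it through
 * eb_next, and eb_head(eb) maps a buffer back to its head.
 */
struct extent_buffer {
	u64 start;
	unsigned long len;
	unsigned long ebflags;		/* per-buffer: UPTODATE, DIRTY, ... */
	struct extent_buffer *eb_next;	/* next buffer mapped by the same page */
	/* locking fields, read_mirror, etc. omitted */
};

struct extent_buffer_head {
	unsigned long bflags;		/* per-head: TREE_REF, DUMMY, IN_TREE */
	struct btrfs_fs_info *fs_info;
	atomic_t refs;			/* one refcount for the whole page */
	atomic_t io_bvecs;		/* in-flight I/O segments on the page */
	struct page *pages[1];		/* backing page(s) */
	struct extent_buffer eb;	/* first buffer, embedded in the head */
};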

Reviewed-by: Liu Bo 
Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/ctree.c   |   2 +-
 fs/btrfs/ctree.h   |   6 +-
 fs/btrfs/disk-io.c |  72 ++--
 fs/btrfs/extent-tree.c |   6 +-
 fs/btrfs/extent_io.c   | 602 ++---
 fs/btrfs/extent_io.h   |  63 ++--
 fs/btrfs/root-tree.c   |   2 +-
 fs/btrfs/super.c   |   7 +-
 fs/btrfs/tests/btrfs-tests.c   |  12 +-
 fs/btrfs/tests/extent-io-tests.c   |   5 +-
 fs/btrfs/tests/free-space-tree-tests.c |  79 +++--
 fs/btrfs/volumes.c |   2 +-
 include/trace/events/btrfs.h   |   2 +-
 13 files changed, 555 insertions(+), 305 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 4602568..27d1b1a 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -160,7 +160,7 @@ struct extent_buffer *btrfs_root_node(struct btrfs_root 
*root)
 * the inc_not_zero dance and if it doesn't work then
 * synchronize_rcu and try again.
 */
-   if (atomic_inc_not_zero(&eb->refs)) {
+   if (atomic_inc_not_zero(&eb_head(eb)->refs)) {
rcu_read_unlock();
break;
}
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 101c3cf..6479990 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1473,14 +1473,16 @@ static inline void btrfs_set_token_##name(struct 
extent_buffer *eb, \
 #define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits)\
 static inline u##bits btrfs_##name(struct extent_buffer *eb)   \
 {  \
-   type *p = page_address(eb->pages[0]);   \
+   type *p = page_address(eb_head(eb)->pages[0]) + \
+   (eb->start & (PAGE_SIZE -1));   \
u##bits res = le##bits##_to_cpu(p->member); \
return res; \
 }  \
 static inline void btrfs_set_##name(struct extent_buffer *eb,  \
u##bits val)\
 {  \
-   type *p = page_address(eb->pages[0]);   \
+   type *p = page_address(eb_head(eb)->pages[0]) + \
+   (eb->start & (PAGE_SIZE -1));   \
p->member = cpu_to_le##bits(val);   \
 }
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c3764dd..2d20845 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -375,10 +375,9 @@ static int verify_parent_transid(struct extent_io_tree 
*io_tree,
ret = 0;
goto out;
}
-   btrfs_err_rl(eb->fs_info,
+   btrfs_err_rl(eb_head(eb)->fs_info,
"parent transid verify failed on %llu wanted %llu found %llu",
-   eb->start,
-   parent_transid, btrfs_header_generation(eb));
+   eb->start, parent_transid, btrfs_header_generation(eb));
ret = 1;
 
/*
@@ -452,7 +451,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root 
*root,
int mirror_num = 0;
int failed_mirror = 0;
 
-   clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
+   clear_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags);
io_tree = &BTRFS_I(root->fs_info->btree_inode)->io_tree;
while (1) {
ret = read_extent_buffer_pages(io_tree, eb, start,
@@ -471,7 +470,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root 
*root,
 * there is no reason to read the other copies, they won't be
 * any less wrong.
 */
-   if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags))
+   if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags))
break;
 
num_copies = btrfs_num_copies(root->fs_info,
@@ -510,7 +509,7 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, 
struct page *page)
struct extent_buffer *eb;
 
eb = (struct extent_buffer *)page->private;
-   if (page != eb->pages[0])
+   if (page != eb_head(eb)->pages

[PATCH V19 01/19] Btrfs: subpage-blocksize: Fix whole page read.

2016-06-14 Thread Chandan Rajendra
For the subpage-blocksize scenario, a page can contain multiple
blocks. In such cases, this patch handles reading data from files.

To track the status of individual blocks of a page, this patch makes use
of a bitmap pointed to by the newly introduced per-page 'struct
btrfs_page_private'.

The per-page btrfs_page_private->io_lock plays the same role as
BH_Uptodate_Lock (see end_buffer_async_read()), i.e. without the io_lock
we may end up in the following situation:

NOTE: Assume a 64k page size and a 4k block size. Also assume that the
first 12 blocks of the page are contiguous on disk while the next 4
blocks form a separate contiguous extent. When reading the page we end
up submitting two "logical address space" bios, so the
end_bio_extent_readpage function is invoked twice, once for each bio.

|--------------------------+--------------------------+-------------|
| Task A                   | Task B                   | Task C      |
|--------------------------+--------------------------+-------------|
| end_bio_extent_readpage  |                          |             |
| process block 0          |                          |             |
| - clear BLK_STATE_IO     |                          |             |
| - page_read_complete     |                          |             |
| process block 1          |                          |             |
|                          |                          |             |
|                          |                          |             |
|                          | end_bio_extent_readpage  |             |
|                          | process block 0          |             |
|                          | - clear BLK_STATE_IO     |             |
|                          | - page_read_complete     |             |
|                          | process block 1          |             |
|                          |                          |             |
| process block 11         | process block 3          |             |
| - clear BLK_STATE_IO     | - clear BLK_STATE_IO     |             |
| - page_read_complete     | - page_read_complete     |             |
|   - returns true         |   - returns true         |             |
|   - unlock_page()        |                          |             |
|                          |                          | lock_page() |
|                          |   - unlock_page()        |             |
|--------------------------+--------------------------+-------------|

We end up incorrectly unlocking the page twice, and "Task C" ends up
working on an unlocked page. So private->io_lock makes sure that only
one of the tasks gets "true" as the return value when page_io_complete()
is invoked. As an optimization, the patch takes the io_lock only when the
last block of the bio_vec is being processed.
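Condensed into a helper for clarity (illustrative only; the code in the
diff does this inline in end_bio_extent_readpage()):

/* Illustrative sketch: mark the blocks in [start, end] as no longer
 * under I/O and unlock the page only if no other block of the page
 * still has BLK_STATE_IO set. io_lock serialises the final check so
 * that exactly one completion path sees the page as fully read.
 */
static void finish_block_read(struct page *page, u64 start, u64 end)
{
	struct btrfs_page_private *pg_private;
	unsigned long flags;
	int page_done;

	pg_private = (struct btrfs_page_private *)page->private;

	spin_lock_irqsave(&pg_private->io_lock, flags);
	clear_page_blks_state(page, 1 << BLK_STATE_IO, start, end);
	page_done = page_io_complete(page);
	spin_unlock_irqrestore(&pg_private->io_lock, flags);

	if (page_done)
		unlock_page(page);
}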

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c | 371 ---
 fs/btrfs/extent_io.h |  74 +-
 fs/btrfs/inode.c |  16 +--
 3 files changed, 338 insertions(+), 123 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a3412d6..066764d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -23,6 +23,7 @@
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
+static struct kmem_cache *page_private_cache;
 static struct bio_set *btrfs_bioset;
 
 static inline bool extent_state_in_tree(const struct extent_state *state)
@@ -173,10 +174,16 @@ int __init extent_io_init(void)
if (!extent_buffer_cache)
goto free_state_cache;
 
+   page_private_cache = kmem_cache_create("btrfs_page_private",
+   sizeof(struct btrfs_page_private), 0,
+   SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
+   if (!page_private_cache)
+   goto free_buffer_cache;
+
btrfs_bioset = bioset_create(BIO_POOL_SIZE,
 offsetof(struct btrfs_io_bio, bio));
if (!btrfs_bioset)
-   goto free_buffer_cache;
+   goto free_page_private_cache;
 
if (bioset_integrity_create(btrfs_bioset, BIO_POOL_SIZE))
goto free_bioset;
@@ -187,6 +194,10 @@ free_bioset:
bioset_free(btrfs_bioset);
btrfs_bioset = NULL;
 
+free_page_private_cache:
+   kmem_cache_destroy(page_private_cache);
+   page_private_cache = NULL;
+
 free_buffer_cache:
kmem_cache_destroy(extent_buffer_cache);
extent_buffer_cache = NULL;
@@ -1322,6 +1333,95 @@ int clear_record_extent_bits(struct extent_io_tree 
*tree, u64 start, u64 end,
  changeset);
 }
 
+static int modify_page_blks_state(struct page *page,
+   unsigned long blk_states,
+   u64 start, u64 end, int set)
+{
+   struct inode *inode = page->mapping->host;
+   unsigned long *bitmap;
+   unsigned long first_state;
+   unsigned long state;
+   u64 nr_blks;
+   u64 blk;
+
+   ASSERT(BTRFS_I(ino

[PATCH V19 02/19] Btrfs: subpage-blocksize: Fix whole page write

2016-06-14 Thread Chandan Rajendra
For the subpage-blocksize scenario, a page can contain multiple
blocks. In such cases, this patch handles writing data to files.

Also, when setting EXTENT_DELALLOC, we no longer set the EXTENT_UPTODATE bit
in the extent_io_tree since the uptodate status is now tracked by the bitmap
pointed to by page->private.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c  | 150 --
 fs/btrfs/file.c   |  17 ++
 fs/btrfs/inode.c  |  75 +
 fs/btrfs/relocation.c |   3 +
 4 files changed, 155 insertions(+), 90 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 066764d..969e043 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1493,24 +1493,6 @@ void extent_range_redirty_for_io(struct inode *inode, 
u64 start, u64 end)
}
 }
 
-/*
- * helper function to set both pages and extents in the tree writeback
- */
-static void set_range_writeback(struct extent_io_tree *tree, u64 start, u64 
end)
-{
-   unsigned long index = start >> PAGE_SHIFT;
-   unsigned long end_index = end >> PAGE_SHIFT;
-   struct page *page;
-
-   while (index <= end_index) {
-   page = find_get_page(tree->mapping, index);
-   BUG_ON(!page); /* Pages should be in the extent_io_tree */
-   set_page_writeback(page);
-   put_page(page);
-   index++;
-   }
-}
-
 /* find the first state struct with 'bits' set after 'start', and
  * return it.  tree->lock must be held.  NULL will returned if
  * nothing was found after 'start'
@@ -2578,36 +2560,41 @@ void end_extent_writepage(struct page *page, int err, 
u64 start, u64 end)
  */
 static void end_bio_extent_writepage(struct bio *bio)
 {
+   struct btrfs_page_private *pg_private;
struct bio_vec *bvec;
+   unsigned long flags;
u64 start;
u64 end;
+   int clear_writeback;
int i;
 
bio_for_each_segment_all(bvec, bio, i) {
struct page *page = bvec->bv_page;
+   struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
 
-   /* We always issue full-page reads, but if some block
-* in a page fails to read, blk_update_request() will
-* advance bv_offset and adjust bv_len to compensate.
-* Print a warning for nonzero offsets, and an error
-* if they don't add up to a full page.  */
-   if (bvec->bv_offset || bvec->bv_len != PAGE_SIZE) {
-   if (bvec->bv_offset + bvec->bv_len != PAGE_SIZE)
-   btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info,
-  "partial page write in btrfs with offset %u and length %u",
-   bvec->bv_offset, bvec->bv_len);
-   else
-   btrfs_info(BTRFS_I(page->mapping->host)->root->fs_info,
-  "incomplete page write in btrfs with offset %u and length %u",
-   bvec->bv_offset, bvec->bv_len);
-   }
+   pg_private = NULL;
+   flags = 0;
+   clear_writeback = 1;
 
-   start = page_offset(page);
-   end = start + bvec->bv_offset + bvec->bv_len - 1;
+   start = page_offset(page) + bvec->bv_offset;
+   end = start + bvec->bv_len - 1;
+
+   if (root->sectorsize < PAGE_SIZE) {
+   pg_private = (struct btrfs_page_private *)page->private;
+   spin_lock_irqsave(&pg_private->io_lock, flags);
+   }
 
end_extent_writepage(page, bio->bi_error, start, end);
-   end_page_writeback(page);
+
+   if (root->sectorsize < PAGE_SIZE) {
+   clear_page_blks_state(page, 1 << BLK_STATE_IO, start,
+   end);
+   clear_writeback = page_io_complete(page);
+   spin_unlock_irqrestore(&pg_private->io_lock, flags);
+   }
+
+   if (clear_writeback)
+   end_page_writeback(page);
}
 
bio_put(bio);
@@ -3479,7 +3466,6 @@ static noinline_for_stack int 
__extent_writepage_io(struct inode *inode,
u64 block_start;
u64 iosize;
sector_t sector;
-   struct extent_state *cached_state = NULL;
struct extent_map *em;
struct block_device *bdev;
size_t pg_offset = 0;
@@ -3531,20 +3517,29 @@ static noinline_for_stack int 
__extent_writepage_io(struct inode *inode,
 page_end, NULL, 1);
break;
}
-   em = epd->get_extent(inode, page, pg_offset, cur,
-end - cur + 1, 1);
+

[PATCH V19 03/19] Btrfs: subpage-blocksize: Make sure delalloc range intersects with the locked page's range

2016-06-14 Thread Chandan Rajendra
find_delalloc_range() indirectly depends on EXTENT_UPTODATE to make sure that
the delalloc range returned intersects with the file range mapped by the
page. Since we now track the "uptodate" state in a per-page
bitmap (i.e. in btrfs_page_private->bstate), this commit adds an explicit
check to make sure that the delalloc range starts within the file range
mapped by the page.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 969e043..56d53c3 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1580,6 +1580,7 @@ out:
  * 1 is returned if we find something, 0 if nothing was in the tree
  */
 static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
+   struct page *locked_page,
u64 *start, u64 *end, u64 max_bytes,
struct extent_state **cached_state)
 {
@@ -1588,6 +1589,9 @@ static noinline u64 find_delalloc_range(struct 
extent_io_tree *tree,
u64 cur_start = *start;
u64 found = 0;
u64 total_bytes = 0;
+   u64 page_end;
+
+   page_end = page_offset(locked_page) + PAGE_SIZE - 1;
 
spin_lock(&tree->lock);
 
@@ -1608,7 +1612,8 @@ static noinline u64 find_delalloc_range(struct 
extent_io_tree *tree,
  (state->state & EXTENT_BOUNDARY))) {
goto out;
}
-   if (!(state->state & EXTENT_DELALLOC)) {
+   if (!(state->state & EXTENT_DELALLOC)
+   || (page_end < state->start)) {
if (!found)
*end = state->end;
goto out;
@@ -1746,8 +1751,9 @@ again:
/* step one, find a bunch of delalloc bytes starting at start */
delalloc_start = *start;
delalloc_end = 0;
-   found = find_delalloc_range(tree, &delalloc_start, &delalloc_end,
-   max_bytes, &cached_state);
+   found = find_delalloc_range(tree, locked_page,
+   &delalloc_start, &delalloc_end,
+   max_bytes, &cached_state);
if (!found || delalloc_end <= *start) {
*start = delalloc_start;
*end = delalloc_end;
-- 
2.1.0



[PATCH V19 16/19] Btrfs: subpage-blocksize: btrfs_clone: Flush dirty blocks of a page that do not map the clone range

2016-06-14 Thread Chandan Rajendra
After cloning the required extents, we truncate all the pages that map
the file range being cloned. In subpage-blocksize scenario, we could
have dirty blocks before and/or after the clone range in the
leading/trailing pages. Truncating these pages would lead to data
loss. Hence this commit forces such dirty blocks to be flushed to disk
before performing the clone operation.
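An illustrative sketch of the two flushes described above (the bounds
are written slightly differently from the hunk below for readability;
the helper name is made up):

/* Illustrative sketch: before the page cache for the clone destination
 * is truncated, write out the dirty blocks of the partial leading and
 * trailing pages so they are not lost.
 */
static int flush_partial_clone_pages(struct inode *inode, u64 destoff, u64 len)
{
	struct address_space *mapping = inode->i_mapping;
	u64 dest_end = destoff + len - 1;
	int ret = 0;

	if (!IS_ALIGNED(destoff, PAGE_SIZE) &&
	    round_down(destoff, PAGE_SIZE) < i_size_read(inode))
		ret = filemap_write_and_wait_range(mapping,
					round_down(destoff, PAGE_SIZE),
					destoff - 1);

	if (!ret && !IS_ALIGNED(dest_end + 1, PAGE_SIZE) &&
	    dest_end < i_size_read(inode))
		ret = filemap_write_and_wait_range(mapping, dest_end + 1,
					round_up(dest_end + 1, PAGE_SIZE) - 1);

	return ret;
}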

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/ioctl.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d715f21..77c2aa8 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3917,6 +3917,7 @@ static noinline int btrfs_clone_files(struct file *file, 
struct file *file_src,
int ret;
u64 len = olen;
u64 bs = root->fs_info->sb->s_blocksize;
+   u64 dest_end;
int same_inode = src == inode;
 
/*
@@ -3977,6 +3978,21 @@ static noinline int btrfs_clone_files(struct file *file, 
struct file *file_src,
goto out_unlock;
}
 
+   if ((round_down(destoff, PAGE_SIZE) < inode->i_size) &&
+   !IS_ALIGNED(destoff, PAGE_SIZE)) {
+   ret = filemap_write_and_wait_range(inode->i_mapping,
+   round_down(destoff, PAGE_SIZE),
+   destoff - 1);
+   }
+
+   dest_end = destoff + len - 1;
+   if ((dest_end < inode->i_size) &&
+   !IS_ALIGNED(dest_end + 1, PAGE_SIZE)) {
+   ret = filemap_write_and_wait_range(inode->i_mapping,
+   dest_end + 1,
+   round_up(dest_end, PAGE_SIZE));
+   }
+
if (destoff > inode->i_size) {
ret = btrfs_cont_expand(inode, inode->i_size, destoff);
if (ret)
-- 
2.1.0



[PATCH V19 18/19] Btrfs: subpage-blocksize: __btrfs_lookup_bio_sums: Set offset when moving to a new bio_vec

2016-06-14 Thread Chandan Rajendra
In __btrfs_lookup_bio_sums() we set the file offset value at the
beginning of every iteration of the while loop. This is incorrect since
the blocks mapped by the current bvec->bv_page might not yet have been
completely processed.

This commit fixes the issue by setting the file offset value when we
move to the next bvec of the bio.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/file-item.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 62a81ee..fb6a7e8 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -222,11 +222,11 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root 
*root,
disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
if (dio)
offset = logical_offset;
+   else
+   offset = page_offset(bvec->bv_page) + bvec->bv_offset;
 
page_bytes_left = bvec->bv_len;
while (bio_index < bio->bi_vcnt) {
-   if (!dio)
-   offset = page_offset(bvec->bv_page) + bvec->bv_offset;
count = btrfs_find_ordered_sum(inode, offset, disk_bytenr,
   (u32 *)csum, nblocks);
if (count)
@@ -301,6 +301,9 @@ found:
goto done;
}
bvec++;
+   if (!dio)
+   offset = page_offset(bvec->bv_page)
+   + bvec->bv_offset;
page_bytes_left = bvec->bv_len;
}
 
-- 
2.1.0



[PATCH V19 12/19] Revert "btrfs: fix lockups from btrfs_clear_path_blocking"

2016-06-14 Thread Chandan Rajendra
The patch "Btrfs: subpage-blocksize: Prevent writes to an extent buffer
when PG_writeback flag is set" requires btrfs_try_tree_write_lock() to
be a true try lock w.r.t to both spinning and blocking locks. During
2015's Vault Conference Btrfs meetup, Chris Mason had suggested that he
will write up a suitable locking function to be used when writing dirty
pages that map metadata blocks. Until we have a suitable locking
function available, this patch temporarily disables the commit
f82c458a2c3ffb94b431fc6ad791a79df1b3713e.
---
 fs/btrfs/ctree.c   | 14 --
 fs/btrfs/locking.c | 24 +++-
 fs/btrfs/locking.h |  2 --
 3 files changed, 15 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 4ff045b..02e21f1 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -81,6 +81,13 @@ noinline void btrfs_clear_path_blocking(struct btrfs_path *p,
 {
int i;
 
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+   /* lockdep really cares that we take all of these spinlocks
+* in the right order.  If any of the locks in the path are not
+* currently blocking, it is going to complain.  So, make really
+* really sure by forcing the path to blocking before we clear
+* the path blocking.
+*/
if (held) {
btrfs_set_lock_blocking_rw(held, held_rw);
if (held_rw == BTRFS_WRITE_LOCK)
@@ -89,6 +96,7 @@ noinline void btrfs_clear_path_blocking(struct btrfs_path *p,
held_rw = BTRFS_READ_LOCK_BLOCKING;
}
btrfs_set_path_blocking(p);
+#endif
 
for (i = BTRFS_MAX_LEVEL - 1; i >= 0; i--) {
if (p->nodes[i] && p->locks[i]) {
@@ -100,8 +108,10 @@ noinline void btrfs_clear_path_blocking(struct btrfs_path 
*p,
}
}
 
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
if (held)
btrfs_clear_lock_blocking_rw(held, held_rw);
+#endif
 }
 
 /* this also releases the path */
@@ -2906,7 +2916,7 @@ cow_done:
}
p->locks[level] = BTRFS_WRITE_LOCK;
} else {
-   err = btrfs_tree_read_lock_atomic(b);
+   err = btrfs_try_tree_read_lock(b);
if (!err) {
btrfs_set_path_blocking(p);
btrfs_tree_read_lock(b);
@@ -3038,7 +3048,7 @@ again:
}
 
level = btrfs_header_level(b);
-   err = btrfs_tree_read_lock_atomic(b);
+   err = btrfs_try_tree_read_lock(b);
if (!err) {
btrfs_set_path_blocking(p);
btrfs_tree_read_lock(b);
diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index d13128c..8b50e60 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -132,26 +132,6 @@ again:
 }
 
 /*
- * take a spinning read lock.
- * returns 1 if we get the read lock and 0 if we don't
- * this won't wait for blocking writers
- */
-int btrfs_tree_read_lock_atomic(struct extent_buffer *eb)
-{
-   if (atomic_read(&eb->blocking_writers))
-   return 0;
-
-   read_lock(&eb->lock);
-   if (atomic_read(&eb->blocking_writers)) {
-   read_unlock(&eb->lock);
-   return 0;
-   }
-   atomic_inc(&eb->read_locks);
-   atomic_inc(&eb->spinning_readers);
-   return 1;
-}
-
-/*
  * returns 1 if we get the read lock and 0 if we don't
  * this won't wait for blocking writers
  */
@@ -182,7 +162,9 @@ int btrfs_try_tree_write_lock(struct extent_buffer *eb)
atomic_read(&eb->blocking_readers))
return 0;
 
-   write_lock(&eb->lock);
+   if (!write_trylock(&eb->lock))
+   return 0;
+
if (atomic_read(&eb->blocking_writers) ||
atomic_read(&eb->blocking_readers)) {
write_unlock(&eb->lock);
diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
index c44a9d5..b81e0e9 100644
--- a/fs/btrfs/locking.h
+++ b/fs/btrfs/locking.h
@@ -35,8 +35,6 @@ void btrfs_clear_lock_blocking_rw(struct extent_buffer *eb, 
int rw);
 void btrfs_assert_tree_locked(struct extent_buffer *eb);
 int btrfs_try_tree_read_lock(struct extent_buffer *eb);
 int btrfs_try_tree_write_lock(struct extent_buffer *eb);
-int btrfs_tree_read_lock_atomic(struct extent_buffer *eb);
-
 
 static inline void btrfs_tree_unlock_rw(struct extent_buffer *eb, int rw)
 {
-- 
2.1.0



[PATCH V19 14/19] Btrfs: subpage-blocksize: extent_clear_unlock_delalloc: Prevent page from being unlocked more than once

2016-06-14 Thread Chandan Rajendra
extent_clear_unlock_delalloc() can unlock a page more than once as shown
below (assume 4k as the block size and 64k as the page size).

cow_file_range
  create 4k ordered extent corresponding to page offsets 0 - 4095
  extent_clear_unlock_delalloc corresponding to page offsets 0 - 4095
unlock page
  create 4k ordered extent corresponding to page offsets 4096 - 8191
  extent_clear_unlock_delalloc corresponding to page offsets 4096 - 8191
unlock page

To prevent such a scenario this commit passes "delalloc end" to
extent_clear_unlock_delalloc() to help decide whether the page can be unlocked
or not.

NOTE: Since extent_clear_unlock_delalloc() is used by compression code
as well, the commit passes ordered extent "end" as the value for the
argument corresponding to "delalloc end" for invocations made from
compression code path. This will be fixed by a future commit that gets
compression to work in subpage-blocksize scenario.
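The unlock rule, expressed as a small predicate (illustrative; the diff
below applies the same condition inline):

/* Illustrative sketch: a page may be unlocked either when it lies
 * entirely within the range currently being cleared, or when that range
 * ends exactly at the end of the whole delalloc range, so no later
 * ordered extent still needs the page to stay locked.
 */
static bool may_unlock_page(struct page *page, u64 end, u64 delalloc_end)
{
	u64 page_end = page_offset(page) + PAGE_SIZE - 1;

	return page_end <= end || end == delalloc_end;
}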

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c | 16 ++
 fs/btrfs/extent_io.h |  5 ++--
 fs/btrfs/inode.c | 84 ++--
 3 files changed, 61 insertions(+), 44 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9e28419..73cb21d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1835,9 +1835,8 @@ out_failed:
 }
 
 void extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end,
-struct page *locked_page,
-unsigned clear_bits,
-unsigned long page_ops)
+   u64 delalloc_end, struct page *locked_page,
+   unsigned clear_bits, unsigned long page_ops)
 {
struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
int ret;
@@ -1845,6 +1844,7 @@ void extent_clear_unlock_delalloc(struct inode *inode, 
u64 start, u64 end,
unsigned long index = start >> PAGE_SHIFT;
unsigned long end_index = end >> PAGE_SHIFT;
unsigned long nr_pages = end_index - index + 1;
+   u64 page_end;
int i;
 
clear_extent_bit(tree, start, end, clear_bits, 1, 0, NULL, GFP_NOFS);
@@ -1881,8 +1881,14 @@ void extent_clear_unlock_delalloc(struct inode *inode, 
u64 start, u64 end,
if ((page_ops & PAGE_END_WRITEBACK)
&& !PagePrivate2(pages[i]))
end_page_writeback(pages[i]);
-   if (page_ops & PAGE_UNLOCK)
-   unlock_page(pages[i]);
+
+   if (page_ops & PAGE_UNLOCK) {
+   page_end = page_offset(pages[i]) +
+   PAGE_SIZE - 1;
+   if ((page_end <= end)
+   || (end == delalloc_end))
+   unlock_page(pages[i]);
+   }
put_page(pages[i]);
}
nr_pages -= ret;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index cbc2099..e8e504c 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -505,9 +505,8 @@ int map_private_extent_buffer(struct extent_buffer *eb, 
unsigned long offset,
 void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end);
 void extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end);
 void extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end,
-struct page *locked_page,
-unsigned bits_to_clear,
-unsigned long page_ops);
+   u64 delalloc_end, struct page *locked_page,
+   unsigned bits_to_clear, unsigned long page_ops);
 struct bio *
 btrfs_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs,
gfp_t gfp_flags);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a8d745a..4a4ea7f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -104,9 +104,10 @@ static int btrfs_setsize(struct inode *inode, struct iattr 
*attr);
 static int btrfs_truncate(struct inode *inode);
 static int btrfs_finish_ordered_io(struct btrfs_ordered_extent 
*ordered_extent);
 static noinline int cow_file_range(struct inode *inode,
-  struct page *locked_page,
-  u64 start, u64 end, int *page_started,
-  unsigned long *nr_written, int unlock);
+   struct page *locked_page,
+   u64 start, u64 end, u64 delalloc_end,
+   int *page_started, unsigned long *nr_written,
+   int unlock);
 static struct extent_map *create_pinned_em(struct inode *inode, u64 start,
 

[PATCH V19 15/19] Btrfs: subpage-blocksize: Enable dedupe ioctl

2016-06-14 Thread Chandan Rajendra
The function implementing the dedupe ioctl,
i.e. btrfs_ioctl_file_extent_same(), returns an error in the
subpage-blocksize scenario. This was done because Btrfs did not have
code to deal with block size < page size. This commit removes the
restriction since we now support "block size < page size".

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/ioctl.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d527e34..d715f21 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3321,21 +3321,11 @@ ssize_t btrfs_dedupe_file_range(struct file *src_file, 
u64 loff, u64 olen,
 {
struct inode *src = file_inode(src_file);
struct inode *dst = file_inode(dst_file);
-   u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
ssize_t res;
 
if (olen > BTRFS_MAX_DEDUPE_LEN)
olen = BTRFS_MAX_DEDUPE_LEN;
 
-   if (WARN_ON_ONCE(bs < PAGE_SIZE)) {
-   /*
-* Btrfs does not support blocksize < page_size. As a
-* result, btrfs_cmp_data() won't correctly handle
-* this situation without an update.
-*/
-   return -EINVAL;
-   }
-
res = btrfs_extent_same(src, loff, olen, dst, dst_loff);
if (res)
return res;
-- 
2.1.0



[PATCH V19 11/19] Btrfs: subpage-blocksize: Prevent writes to an extent buffer when PG_writeback flag is set

2016-06-14 Thread Chandan Rajendra
In the non-subpage-blocksize scenario, the BTRFS_HEADER_FLAG_WRITTEN flag
prevents Btrfs code from writing into an extent buffer whose pages are
under writeback. This facility isn't sufficient for achieving the same
in the subpage-blocksize scenario, since we have more than one extent
buffer mapped to a page.

Hence this patch adds a new flag (i.e. EXTENT_BUFFER_HEAD_WRITEBACK) and
the corresponding code to track the writeback status of the page and to
prevent writes to any of the extent buffers mapped to the page while
writeback is going on.
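The wait pattern added at the places which may modify a tree block,
condensed into a single helper (illustrative; the diff below open-codes
it at each call site):

/* Illustrative sketch: before modifying any buffer of a page, wait for
 * the page-level writeback bit on the shared extent_buffer_head to
 * clear, switching the tree lock to blocking mode first since the wait
 * may sleep.
 */
static void wait_for_ebh_writeback(struct extent_buffer *eb)
{
	struct extent_buffer_head *ebh = eb_head(eb);

	if (!test_bit(EXTENT_BUFFER_HEAD_WRITEBACK, &ebh->bflags))
		return;

	btrfs_set_lock_blocking(eb);
	wait_on_bit_io(&ebh->bflags, EXTENT_BUFFER_HEAD_WRITEBACK,
		       TASK_UNINTERRUPTIBLE);
}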

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/ctree.c   |  21 ++-
 fs/btrfs/extent-tree.c |  11 
 fs/btrfs/extent_io.c   | 150 -
 fs/btrfs/extent_io.h   |   1 +
 4 files changed, 155 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 27d1b1a..4ff045b 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1541,6 +1541,7 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle 
*trans,
struct extent_buffer *parent, int parent_slot,
struct extent_buffer **cow_ret)
 {
+   struct extent_buffer_head *ebh = eb_head(buf);
u64 search_start;
int ret;
 
@@ -1554,6 +1555,14 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle 
*trans,
   trans->transid, root->fs_info->generation);
 
if (!should_cow_block(trans, root, buf)) {
+   if (test_bit(EXTENT_BUFFER_HEAD_WRITEBACK, &ebh->bflags)) {
+   if (parent)
+   btrfs_set_lock_blocking(parent);
+   btrfs_set_lock_blocking(buf);
+   wait_on_bit_io(&ebh->bflags,
+   EXTENT_BUFFER_HEAD_WRITEBACK,
+   TASK_UNINTERRUPTIBLE);
+   }
*cow_ret = buf;
return 0;
}
@@ -2673,6 +2682,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, 
struct btrfs_root
  *root, struct btrfs_key *key, struct btrfs_path *p, int
  ins_len, int cow)
 {
+   struct extent_buffer_head *ebh;
struct extent_buffer *b;
int slot;
int ret;
@@ -2775,8 +2785,17 @@ again:
 * then we don't want to set the path blocking,
 * so we test it here
 */
-   if (!should_cow_block(trans, root, b))
+   if (!should_cow_block(trans, root, b)) {
+   ebh = eb_head(b);
+   if (test_bit(EXTENT_BUFFER_HEAD_WRITEBACK,
+   &ebh->bflags)) {
+   btrfs_set_path_blocking(p);
+   wait_on_bit_io(&ebh->bflags,
+   EXTENT_BUFFER_HEAD_WRITEBACK,
+   TASK_UNINTERRUPTIBLE);
+   }
goto cow_done;
+   }
 
/*
 * must have write locks on this node and the
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d225479..9aef0373 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8013,14 +8013,25 @@ static struct extent_buffer *
 btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root 
*root,
  u64 bytenr, int level)
 {
+   struct extent_buffer_head *ebh;
struct extent_buffer *buf;
 
buf = btrfs_find_create_tree_block(root, bytenr, root->nodesize);
if (!buf)
return ERR_PTR(-ENOMEM);
+
+   ebh = eb_head(buf);
btrfs_set_header_generation(buf, trans->transid);
btrfs_set_buffer_lockdep_class(root->root_key.objectid, buf, level);
btrfs_tree_lock(buf);
+
+   if (test_bit(EXTENT_BUFFER_HEAD_WRITEBACK,
+   &ebh->bflags)) {
+   btrfs_set_lock_blocking(buf);
+   wait_on_bit_io(&ebh->bflags, EXTENT_BUFFER_HEAD_WRITEBACK,
+   TASK_UNINTERRUPTIBLE);
+   }
+
clean_tree_block(trans, root->fs_info, buf);
clear_bit(EXTENT_BUFFER_STALE, &buf->ebflags);
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 40ed2f0..9e28419 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3718,6 +3718,52 @@ void wait_on_extent_buffer_writeback(struct 
extent_buffer *eb)
TASK_UNINTERRUPTIBLE);
 }
 
+static void lock_extent_buffers(struct extent_buffer_head *ebh,
+   struct extent_page_data *epd)
+{
+   struct extent_buffer *locked_eb = NULL;
+   struct extent_buffer *eb;
+again:
+   eb = &ebh->eb;
+   do {
+   if (eb == locked_eb)
+   continue;
+
+   if (!

[PATCH V19 13/19] Btrfs: subpage-blocksize: Fix file defragmentation code

2016-06-14 Thread Chandan Rajendra
This commit gets the file defragmentation code to work in the
subpage-blocksize scenario. It does this by keeping track of the page
offsets that mark block boundaries and passing them as arguments to the
functions that implement the defragmentation logic.
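
A minimal illustration of the page arithmetic the reworked
cluster_pages_for_defrag() relies on (the helper name below is made up
for this sketch): with 64K pages and 4K blocks, pg_offset = 60K and
blk_cnt = 4 span two pages.

/*
 * Illustrative only: how many pages must be locked to cover blk_cnt
 * blocks that start at byte offset pg_offset within the first page.
 */
static unsigned long defrag_pages_needed(struct inode *inode,
					 size_t pg_offset, u64 blk_cnt)
{
	/* Bytes spanned, measured from the start of the first page. */
	u64 bytes = pg_offset + (blk_cnt << inode->i_blkbits);

	return DIV_ROUND_UP(bytes, PAGE_SIZE);
}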

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/ioctl.c | 198 ++-
 1 file changed, 136 insertions(+), 62 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0517356..d527e34 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -902,12 +902,13 @@ out_unlock:
 static int check_defrag_in_cache(struct inode *inode, u64 offset, u32 thresh)
 {
struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+   struct btrfs_root *root = BTRFS_I(inode)->root;
struct extent_map *em = NULL;
struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
u64 end;
 
read_lock(&em_tree->lock);
-   em = lookup_extent_mapping(em_tree, offset, PAGE_SIZE);
+   em = lookup_extent_mapping(em_tree, offset, root->sectorsize);
read_unlock(&em_tree->lock);
 
if (em) {
@@ -997,7 +998,7 @@ static struct extent_map *defrag_lookup_extent(struct inode 
*inode, u64 start)
struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
struct extent_map *em;
-   u64 len = PAGE_SIZE;
+   u64 len = BTRFS_I(inode)->root->sectorsize;
 
/*
 * hopefully we have this extent in the tree already, try without
@@ -1116,37 +1117,47 @@ out:
  * before calling this.
  */
 static int cluster_pages_for_defrag(struct inode *inode,
-   struct page **pages,
-   unsigned long start_index,
-   unsigned long num_pages)
+   struct page **pages,
+   unsigned long start_index,
+   size_t pg_offset,
+   unsigned long num_blks)
 {
-   unsigned long file_end;
u64 isize = i_size_read(inode);
+   u64 start_blk;
+   u64 end_blk;
u64 page_start;
u64 page_end;
u64 page_cnt;
+   u64 blk_cnt;
int ret;
int i;
int i_done;
struct btrfs_ordered_extent *ordered;
struct extent_state *cached_state = NULL;
struct extent_io_tree *tree;
+   struct btrfs_root *root;
gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
 
-   file_end = (isize - 1) >> PAGE_SHIFT;
-   if (!isize || start_index > file_end)
+   root = BTRFS_I(inode)->root;
+   start_blk = (start_index << PAGE_SHIFT) + pg_offset;
+   start_blk >>= inode->i_blkbits;
+   end_blk = (isize - 1) >> inode->i_blkbits;
+   if (!isize || start_blk > end_blk)
return 0;
 
-   page_cnt = min_t(u64, (u64)num_pages, (u64)file_end - start_index + 1);
+   blk_cnt = min_t(u64, (u64)num_blks, (u64)end_blk - start_blk + 1);
 
ret = btrfs_delalloc_reserve_space(inode,
-   start_index << PAGE_SHIFT,
-   page_cnt << PAGE_SHIFT);
+   start_blk << inode->i_blkbits,
+   blk_cnt << inode->i_blkbits);
if (ret)
return ret;
i_done = 0;
tree = &BTRFS_I(inode)->io_tree;
 
+   page_cnt = DIV_ROUND_UP(pg_offset + (blk_cnt << inode->i_blkbits),
+   PAGE_SIZE);
+
/* step one, lock all the pages */
for (i = 0; i < page_cnt; i++) {
struct page *page;
@@ -1157,12 +1168,22 @@ again:
break;
 
page_start = page_offset(page);
-   page_end = page_start + PAGE_SIZE - 1;
+
+   if (i == 0)
+   page_start += pg_offset;
+
+   if (i == page_cnt - 1) {
+   page_end = (start_index << PAGE_SHIFT) + pg_offset;
+   page_end += (blk_cnt << inode->i_blkbits) - 1;
+   } else {
+   page_end = page_offset(page) + PAGE_SIZE - 1;
+   }
+
while (1) {
lock_extent_bits(tree, page_start, page_end,
 &cached_state);
-   ordered = btrfs_lookup_ordered_extent(inode,
- page_start);
+   ordered = btrfs_lookup_ordered_range(inode, page_start,
+   page_end - page_start + 
1);
unlock_extent_cached(tree, page_start, page_end,
 &cached_state, GFP_NOFS);
if (!ordered)
@@ -1201,7 +1222,7 @@ again:
}
 
pages[i] =

[PATCH V19 17/19] Btrfs: subpage-blocksize: Make file extent relocate code subpage blocksize aware

2016-06-14 Thread Chandan Rajendra
The file extent relocation code currently assumes the block size to be
the same as PAGE_SIZE. This commit adds code to support the
subpage-blocksize scenario.
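
As an illustration of the per-block handling (a sketch only, reusing
the names from the diff below): on a 64K page machine with 4K blocks,
each block of a page is checked against the cluster boundaries
individually instead of tagging the whole page.

/*
 * Illustrative only: tag cluster boundaries block by block within the
 * [page_start, page_end] range of one page.
 */
static void mark_cluster_boundaries(struct inode *inode,
				    struct file_extent_cluster *cluster,
				    u64 page_start, u64 page_end,
				    u64 offset, int *nr)
{
	u64 blocksize = BTRFS_I(inode)->root->sectorsize;
	u64 block_start = page_start;

	while (block_start <= page_end) {
		if (*nr < cluster->nr &&
		    block_start + offset == cluster->boundary[*nr]) {
			set_extent_bits(&BTRFS_I(inode)->io_tree, block_start,
					block_start + blocksize - 1,
					EXTENT_BOUNDARY);
			(*nr)++;
		}
		block_start += blocksize;
	}
}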

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/relocation.c | 56 ++-
 1 file changed, 38 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 40b4439..cf94efc 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3103,14 +3103,19 @@ static int relocate_file_extent_cluster(struct inode 
*inode,
 {
u64 page_start;
u64 page_end;
+   u64 block_start;
u64 offset = BTRFS_I(inode)->index_cnt;
+   u64 blocksize = BTRFS_I(inode)->root->sectorsize;
+   u64 reserved_space;
unsigned long index;
unsigned long last_index;
struct page *page;
struct file_ra_state *ra;
gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
+   int nr_blocks;
int nr = 0;
int ret = 0;
+   int i;
 
if (!cluster->nr)
return 0;
@@ -3130,10 +3135,16 @@ static int relocate_file_extent_cluster(struct inode 
*inode,
if (ret)
goto out;
 
+   page_start = cluster->start - offset;
+   page_end = min_t(u64, round_down(page_start, PAGE_SIZE) + PAGE_SIZE - 1,
+   cluster->end - offset);
+
index = (cluster->start - offset) >> PAGE_SHIFT;
last_index = (cluster->end - offset) >> PAGE_SHIFT;
while (index <= last_index) {
-   ret = btrfs_delalloc_reserve_metadata(inode, PAGE_SIZE);
+   reserved_space = page_end - page_start + 1;
+
+   ret = btrfs_delalloc_reserve_metadata(inode, reserved_space);
if (ret)
goto out;
 
@@ -3146,7 +3157,7 @@ static int relocate_file_extent_cluster(struct inode 
*inode,
   mask);
if (!page) {
btrfs_delalloc_release_metadata(inode,
-   PAGE_SIZE);
+   reserved_space);
ret = -ENOMEM;
goto out;
}
@@ -3165,41 +3176,50 @@ static int relocate_file_extent_cluster(struct inode 
*inode,
unlock_page(page);
put_page(page);
btrfs_delalloc_release_metadata(inode,
-   PAGE_SIZE);
+   reserved_space);
ret = -EIO;
goto out;
}
}
 
-   page_start = page_offset(page);
-   page_end = page_start + PAGE_SIZE - 1;
-
lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
 
set_page_extent_mapped(page);
 
-   if (nr < cluster->nr &&
-   page_start + offset == cluster->boundary[nr]) {
-   set_extent_bits(&BTRFS_I(inode)->io_tree,
-   page_start, page_end,
-   EXTENT_BOUNDARY);
-   nr++;
+   nr_blocks = (page_end + 1 - page_start) >> inode->i_blkbits;
+
+   block_start = page_start;
+   for (i = 0; i < nr_blocks; i++) {
+   if (nr < cluster->nr &&
+   block_start + offset == cluster->boundary[nr]) {
+   set_extent_bits(&BTRFS_I(inode)->io_tree,
+   block_start, block_start + 
blocksize - 1,
+   EXTENT_BOUNDARY);
+   nr++;
+   }
+
+   block_start += blocksize;
}
 
btrfs_set_extent_delalloc(inode, page_start, page_end, NULL);
-   set_page_blks_state(page,
-   1 << BLK_STATE_DIRTY | 1 << BLK_STATE_UPTODATE,
-   page_start, page_end);
-   set_page_dirty(page);
+   if (blocksize < PAGE_SIZE)
+   set_page_blks_state(page,
+   1 << BLK_STATE_DIRTY | 1 << 
BLK_STATE_UPTODATE,
+   page_start, page_end);
+
+   unlock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
 
-   unlock_extent(&BTRFS_I(inode)->io_tree,
- page_start, page_end);
+   set_page_dirty(page);
unlock_page(page);
put_page(page);
 
index++;
balance_dirty_pages_ratelimited(ino

[PATCH V19 19/19] Btrfs: subpage-blocksize: Disable compression

2016-06-14 Thread Chandan Rajendra
The subpage-blocksize patchset does not yet support compression. Hence,
the kernel might crash when executing compression code in the
subpage-blocksize scenario. This commit disables enabling the
compression feature during 'mount' and also when the user invokes the
'chattr +c ' command.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/ioctl.c |  8 +++-
 fs/btrfs/super.c | 20 
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 77c2aa8..c4fd80e 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -322,6 +322,11 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
} else if (flags & FS_COMPR_FL) {
const char *comp;
 
+   if (root->sectorsize < PAGE_SIZE) {
+   ret = -EINVAL;
+   goto out_drop;
+   }
+
ip->flags |= BTRFS_INODE_COMPRESS;
ip->flags &= ~BTRFS_INODE_NOCOMPRESS;
 
@@ -1342,7 +1347,8 @@ int btrfs_defrag_file(struct inode *inode, struct file 
*file,
return -EINVAL;
 
if (range->flags & BTRFS_DEFRAG_RANGE_COMPRESS) {
-   if (range->compress_type > BTRFS_COMPRESS_TYPES)
+   if ((range->compress_type > BTRFS_COMPRESS_TYPES)
+   || (root->sectorsize < PAGE_SIZE))
return -EINVAL;
if (range->compress_type)
compress_type = range->compress_type;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index ae30f52..70c0ee3 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -368,6 +368,17 @@ static const match_table_t tokens = {
{Opt_err, NULL},
 };
 
+static int can_enable_compression(struct btrfs_fs_info *fs_info)
+{
+   if (btrfs_super_sectorsize(fs_info->super_copy) < PAGE_SIZE) {
+   btrfs_err(fs_info,
+   "Compression is not supported for subpage-blocksize");
+   return 0;
+   }
+
+   return 1;
+}
+
 /*
  * Regular mount options parser.  Everything that is needed only when
  * reading in a new superblock is parsed here.
@@ -477,6 +488,10 @@ int btrfs_parse_options(struct btrfs_root *root, char 
*options,
if (token == Opt_compress ||
token == Opt_compress_force ||
strcmp(args[0].from, "zlib") == 0) {
+   if (!can_enable_compression(info)) {
+   ret = -EINVAL;
+   goto out;
+   }
compress_type = "zlib";
info->compress_type = BTRFS_COMPRESS_ZLIB;
btrfs_set_opt(info->mount_opt, COMPRESS);
@@ -484,6 +499,10 @@ int btrfs_parse_options(struct btrfs_root *root, char 
*options,
btrfs_clear_opt(info->mount_opt, NODATASUM);
no_compress = 0;
} else if (strcmp(args[0].from, "lzo") == 0) {
+   if (!can_enable_compression(info)) {
+   ret = -EINVAL;
+   goto out;
+   }
compress_type = "lzo";
info->compress_type = BTRFS_COMPRESS_LZO;
btrfs_set_opt(info->mount_opt, COMPRESS);
@@ -806,6 +825,7 @@ int btrfs_parse_options(struct btrfs_root *root, char 
*options,
break;
}
}
+
 check:
/*
 * Extra check for current option against current flag
-- 
2.1.0



Re: [PATCH v3] fstests: btrfs: add test for qgroup handle extent de-reference

2016-06-14 Thread Filipe Manana
On Tue, Jun 14, 2016 at 1:58 AM, Qu Wenruo  wrote:
>
>
> At 06/13/2016 05:49 PM, Filipe Manana wrote:
>>
>> On Mon, Jun 13, 2016 at 9:06 AM, Lu Fengqi 
>> wrote:
>>>
>>> At 06/13/2016 03:29 PM, Lu Fengqi wrote:


 At 06/13/2016 11:04 AM, Eryu Guan wrote:
>
>
> On Mon, Jun 13, 2016 at 10:10:50AM +0800, Lu Fengqi wrote:
>>
>>
>> Test if qgroup can handle extent de-reference during reallocation.
>> "extent de-reference" means that reducing an extent's reference count
>> or freeing an extent.
>> Although current qgroup can handle it, we still need to prevent any
>> regression which may break current qgroup.
>>
>> Signed-off-by: Lu Fengqi 
>> ---
>>  common/rc   |  4 +--
>>  tests/btrfs/028 | 98
>> +
>>  tests/btrfs/028.out |  2 ++
>>  tests/btrfs/group   |  1 +
>>  4 files changed, 103 insertions(+), 2 deletions(-)
>>  create mode 100755 tests/btrfs/028
>>  create mode 100644 tests/btrfs/028.out
>>
>> diff --git a/common/rc b/common/rc
>> index 51092a0..650d198 100644
>> --- a/common/rc
>> +++ b/common/rc
>> @@ -3284,9 +3284,9 @@ _btrfs_get_profile_configs()
>>  # stress btrfs by running balance operation in a loop
>>  _btrfs_stress_balance()
>>  {
>> -local btrfs_mnt=$1
>> +local options=$@
>>  while true; do
>> -$BTRFS_UTIL_PROG balance start $btrfs_mnt
>> +$BTRFS_UTIL_PROG balance start $options
>>  done
>>  }
>>
>> diff --git a/tests/btrfs/028 b/tests/btrfs/028
>> new file mode 100755
>> index 000..8cea49a
>> --- /dev/null
>> +++ b/tests/btrfs/028
>> @@ -0,0 +1,98 @@
>> +#! /bin/bash
>> +# FS QA Test 028
>> +#
>> +# Test if qgroup can handle extent de-reference during reallocation.
>> +# "extent de-reference" means that reducing an extent's reference
>> count
>> +# or freeing an extent.
>> +# Although current qgroup can handle it, we still need to prevent any
>> +# regression which may break current qgroup.
>> +#
>>
>>
>> +#---
>>
>> +# Copyright (c) 2016 Fujitsu. All Rights Reserved.
>> +#
>> +# This program is free software; you can redistribute it and/or
>> +# modify it under the terms of the GNU General Public License as
>> +# published by the Free Software Foundation.
>> +#
>> +# This program is distributed in the hope that it would be useful,
>> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> +# GNU General Public License for more details.
>> +#
>> +# You should have received a copy of the GNU General Public License
>> +# along with this program; if not, write the Free Software
>> Foundation,
>> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
>>
>>
>> +#---
>>
>> +#
>> +
>> +seq=`basename $0`
>> +seqres=$RESULT_DIR/$seq
>> +echo "QA output created by $seq"
>> +
>> +here=`pwd`
>> +tmp=/tmp/$$
>> +status=1# failure is the default!
>> +trap "_cleanup; exit \$status" 0 1 2 3 15
>> +
>> +_cleanup()
>> +{
>> +cd /
>> +rm -f $tmp.*
>> +}
>> +
>> +# get standard environment, filters and checks
>> +. ./common/rc
>> +. ./common/filter
>> +
>> +# remove previous $seqres.full before test
>> +rm -f $seqres.full
>> +
>> +# real QA test starts here
>> +_supported_fs btrfs
>> +_supported_os Linux
>> +_require_scratch
>> +
>> +_scratch_mkfs
>> +_scratch_mount
>> +
>> +_run_btrfs_util_prog quota enable $SCRATCH_MNT
>> +_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
>> +
>> +# Increase the probability of generating de-refer extent, and
>> decrease
>> +# other.
>> +args=`_scale_fsstress_args -z \
>> +-f write=10 -f unlink=10 \
>> +-f creat=10 -f fsync=10 \
>> +-f fsync=10 -n 10 -p 2 \
>> +-d $SCRATCH_MNT/stress_dir`
>> +echo "Run fsstress $args" >>$seqres.full
>> +$FSSTRESS_PROG $args >/dev/null 2>&1 &
>> +fsstress_pid=$!
>> +
>> +echo "Start balance" >>$seqres.full
>> +_btrfs_stress_balance -d $SCRATCH_MNT >/dev/null 2>&1 &
>> +balance_pid=$!
>> +
>> +# 30s is enough to trigger bug
>> +sleep $((30*$TIME_FACTOR))
>> +kill $fsstress_pid $balance_pid
>> +wait
>> +
>> +# kill _btrfs_stress_balance can't end balance, so call btrfs
>> balance cancel
>> +# to cancel running or paused balance.
>> +$BTRFS_UTIL_PROG balance cancel $SCRATCH_MNT &> /dev/null
>> +
>> +_run_btrfs_util_prog filesystem syn

Re: [PATCH] btrfs-progs: doc: add missing newline in btrfs-convert

2016-06-14 Thread David Sterba
On Fri, Jun 10, 2016 at 09:57:55AM -0400, Noah Massey wrote:
> Signed-off-by: Noah Massey 

Applied, thanks.


Re: [PATCH] btrfs-progs: doc: correct the destination of btrfs-receive

2016-06-14 Thread David Sterba
On Tue, Jun 14, 2016 at 02:50:19PM +0900, Satoru Takeuchi wrote:
> We can set not only btrfs mount point but also any path belong to
> btrfs mount point as btrfs-receive's destination.
> 
> Signed-off-by: Satoru Takeuchi 

The patches from you have consistent whitespace damage. I've fixed up
the btrfs-crc one, but now that I see it again I suspect some error on
your side.

> @@ -7,14 +7,14 @@ btrfs-receive - receive subvolumes from send stream
> 
>   SYNOPSIS
>   
> -*btrfs receive* [options] 
> +*btrfs receive* [options] 
> 
>   DESCRIPTION
>   ---
> 
>   Receive a stream of changes and replicate one or more subvolumes that were
>   previously used with *btrfs send* The received subvolumes are stored to
> -'mount'.
> +'path'.
> 
>   *btrfs receive* will fail int the following cases:
> 
> @@ -37,7 +37,7 @@ by default, btrfs receive uses standard input to receive 
> the stream,
>   use this option to read from a file instead
> 
>   -C|--chroot::
> -confine the process to 'mount' using `chroot`(1)
> +confine the process to 'path' using `chroot`(1)
> 
>   -e::
>   terminate after receiving an 'end cmd' marker in the stream.

i.e. all the context lines start with two spaces instead of one. I'll
apply this patch manually but please have a look.


Re: [RFC] strace patches for new ioctls

2016-06-14 Thread David Sterba
On Tue, May 31, 2016 at 04:51:03PM -0400, Chris Mason wrote:
> On Tue, May 31, 2016 at 04:41:55PM -0400, Jeff Mahoney wrote:
> > Hi all -
> > 
> > Strace 4.12 was tagged for release today and it supports decoding of
> > btrfs ioctls.  I'd like to propose a requirement that future ioctl
> > additions come with a patch to strace as well so we don't get out of sync.
> > 
> > I wrote the decoding to help with some issues some of our developers
> > were seeing while writing various tools.  Rather than describing their
> > workload or showing us the code, an strace in verbose mode shows us
> > pretty much exactly what they're doing.
> > 
> > Opinions?
> 
> Maybe add a link to the wiki of your last set of patches to use as a
> template?  It's a great idea.

[ I've replied to that but the mail got lost somewhere ]

There's a new wiki page to track the notes, checklists and hints for
such things.

https://btrfs.wiki.kernel.org/index.php/Development_notes


Re: [PATCH v3] fstests: btrfs: add test for qgroup handle extent de-reference

2016-06-14 Thread Qu Wenruo



At 06/14/2016 04:41 PM, Filipe Manana wrote:

On Tue, Jun 14, 2016 at 1:58 AM, Qu Wenruo  wrote:



At 06/13/2016 05:49 PM, Filipe Manana wrote:


On Mon, Jun 13, 2016 at 9:06 AM, Lu Fengqi 
wrote:


At 06/13/2016 03:29 PM, Lu Fengqi wrote:



At 06/13/2016 11:04 AM, Eryu Guan wrote:



On Mon, Jun 13, 2016 at 10:10:50AM +0800, Lu Fengqi wrote:



Test if qgroup can handle extent de-reference during reallocation.
"extent de-reference" means that reducing an extent's reference count
or freeing an extent.
Although current qgroup can handle it, we still need to prevent any
regression which may break current qgroup.

Signed-off-by: Lu Fengqi 
---
 common/rc   |  4 +--
 tests/btrfs/028 | 98
+
 tests/btrfs/028.out |  2 ++
 tests/btrfs/group   |  1 +
 4 files changed, 103 insertions(+), 2 deletions(-)
 create mode 100755 tests/btrfs/028
 create mode 100644 tests/btrfs/028.out

diff --git a/common/rc b/common/rc
index 51092a0..650d198 100644
--- a/common/rc
+++ b/common/rc
@@ -3284,9 +3284,9 @@ _btrfs_get_profile_configs()
 # stress btrfs by running balance operation in a loop
 _btrfs_stress_balance()
 {
-local btrfs_mnt=$1
+local options=$@
 while true; do
-$BTRFS_UTIL_PROG balance start $btrfs_mnt
+$BTRFS_UTIL_PROG balance start $options
 done
 }

diff --git a/tests/btrfs/028 b/tests/btrfs/028
new file mode 100755
index 000..8cea49a
--- /dev/null
+++ b/tests/btrfs/028
@@ -0,0 +1,98 @@
+#! /bin/bash
+# FS QA Test 028
+#
+# Test if qgroup can handle extent de-reference during reallocation.
+# "extent de-reference" means that reducing an extent's reference
count
+# or freeing an extent.
+# Although current qgroup can handle it, we still need to prevent any
+# regression which may break current qgroup.
+#


+#---

+# Copyright (c) 2016 Fujitsu. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software
Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA


+#---

+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+cd /
+rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+
+_scratch_mkfs
+_scratch_mount
+
+_run_btrfs_util_prog quota enable $SCRATCH_MNT
+_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
+
+# Increase the probability of generating de-refer extent, and
decrease
+# other.
+args=`_scale_fsstress_args -z \
+-f write=10 -f unlink=10 \
+-f creat=10 -f fsync=10 \
+-f fsync=10 -n 10 -p 2 \
+-d $SCRATCH_MNT/stress_dir`
+echo "Run fsstress $args" >>$seqres.full
+$FSSTRESS_PROG $args >/dev/null 2>&1 &
+fsstress_pid=$!
+
+echo "Start balance" >>$seqres.full
+_btrfs_stress_balance -d $SCRATCH_MNT >/dev/null 2>&1 &
+balance_pid=$!
+
+# 30s is enough to trigger bug
+sleep $((30*$TIME_FACTOR))
+kill $fsstress_pid $balance_pid
+wait
+
+# kill _btrfs_stress_balance can't end balance, so call btrfs
balance cancel
+# to cancel running or paused balance.
+$BTRFS_UTIL_PROG balance cancel $SCRATCH_MNT &> /dev/null
+
+_run_btrfs_util_prog filesystem sync $SCRATCH_MNT
+
+_scratch_unmount
+
+# generate a qgroup report and look for inconsistent groups
+$BTRFS_UTIL_PROG check --qgroup-report $SCRATCH_DEV 2>&1 | \
+grep -q -E "Counts for qgroup.*are different"
+if [ $? -ne 0 ]; then
+echo "Silence is golden"
+# success, all done
+status=0
+fi




I'm testing with 4.7-rc1 kernel and btrfs-progs v4.4, this test fails,
which means btrfs check finds inconsistent groups. But according to
your
commit log, current kernel should pass the test. So is the failure
expected?

Also, just grep for different qgroup counts and print the message out
if
grep finds the message, so it breaks golden image on error and we know
something really goes wrong. Right now test fails just because of
missing "Silence is golden", which is unclear why it fails:

 @@ -1,2 +1 @@
  QA output created by 028
 -Silence is golden

Do the following instead:

$BTRFS_UTIL_PROG check ... |

Re: [PATCH v3] fstests: btrfs: add test for qgroup handle extent de-reference

2016-06-14 Thread Filipe Manana
On Tue, Jun 14, 2016 at 10:08 AM, Qu Wenruo  wrote:
>
>
> At 06/14/2016 04:41 PM, Filipe Manana wrote:
>>
>> On Tue, Jun 14, 2016 at 1:58 AM, Qu Wenruo 
>> wrote:
>>>
>>>
>>>
>>> At 06/13/2016 05:49 PM, Filipe Manana wrote:


 On Mon, Jun 13, 2016 at 9:06 AM, Lu Fengqi 
 wrote:
>
>
> At 06/13/2016 03:29 PM, Lu Fengqi wrote:
>>
>>
>>
>> At 06/13/2016 11:04 AM, Eryu Guan wrote:
>>>
>>>
>>>
>>> On Mon, Jun 13, 2016 at 10:10:50AM +0800, Lu Fengqi wrote:



 Test if qgroup can handle extent de-reference during reallocation.
 "extent de-reference" means that reducing an extent's reference
 count
 or freeing an extent.
 Although current qgroup can handle it, we still need to prevent any
 regression which may break current qgroup.

 Signed-off-by: Lu Fengqi 
 ---
  common/rc   |  4 +--
  tests/btrfs/028 | 98
 +
  tests/btrfs/028.out |  2 ++
  tests/btrfs/group   |  1 +
  4 files changed, 103 insertions(+), 2 deletions(-)
  create mode 100755 tests/btrfs/028
  create mode 100644 tests/btrfs/028.out

 diff --git a/common/rc b/common/rc
 index 51092a0..650d198 100644
 --- a/common/rc
 +++ b/common/rc
 @@ -3284,9 +3284,9 @@ _btrfs_get_profile_configs()
  # stress btrfs by running balance operation in a loop
  _btrfs_stress_balance()
  {
 -local btrfs_mnt=$1
 +local options=$@
  while true; do
 -$BTRFS_UTIL_PROG balance start $btrfs_mnt
 +$BTRFS_UTIL_PROG balance start $options
  done
  }

 diff --git a/tests/btrfs/028 b/tests/btrfs/028
 new file mode 100755
 index 000..8cea49a
 --- /dev/null
 +++ b/tests/btrfs/028
 @@ -0,0 +1,98 @@
 +#! /bin/bash
 +# FS QA Test 028
 +#
 +# Test if qgroup can handle extent de-reference during
 reallocation.
 +# "extent de-reference" means that reducing an extent's reference
 count
 +# or freeing an extent.
 +# Although current qgroup can handle it, we still need to prevent
 any
 +# regression which may break current qgroup.
 +#



 +#---

 +# Copyright (c) 2016 Fujitsu. All Rights Reserved.
 +#
 +# This program is free software; you can redistribute it and/or
 +# modify it under the terms of the GNU General Public License as
 +# published by the Free Software Foundation.
 +#
 +# This program is distributed in the hope that it would be useful,
 +# but WITHOUT ANY WARRANTY; without even the implied warranty of
 +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 +# GNU General Public License for more details.
 +#
 +# You should have received a copy of the GNU General Public License
 +# along with this program; if not, write the Free Software
 Foundation,
 +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA



 +#---

 +#
 +
 +seq=`basename $0`
 +seqres=$RESULT_DIR/$seq
 +echo "QA output created by $seq"
 +
 +here=`pwd`
 +tmp=/tmp/$$
 +status=1# failure is the default!
 +trap "_cleanup; exit \$status" 0 1 2 3 15
 +
 +_cleanup()
 +{
 +cd /
 +rm -f $tmp.*
 +}
 +
 +# get standard environment, filters and checks
 +. ./common/rc
 +. ./common/filter
 +
 +# remove previous $seqres.full before test
 +rm -f $seqres.full
 +
 +# real QA test starts here
 +_supported_fs btrfs
 +_supported_os Linux
 +_require_scratch
 +
 +_scratch_mkfs
 +_scratch_mount
 +
 +_run_btrfs_util_prog quota enable $SCRATCH_MNT
 +_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
 +
 +# Increase the probability of generating de-refer extent, and
 decrease
 +# other.
 +args=`_scale_fsstress_args -z \
 +-f write=10 -f unlink=10 \
 +-f creat=10 -f fsync=10 \
 +-f fsync=10 -n 10 -p 2 \
 +-d $SCRATCH_MNT/stress_dir`
 +echo "Run fsstress $args" >>$seqres.full
 +$FSSTRESS_PROG $args >/dev/null 2>&1 &
 +fsstress_pid=$!
 +
 +echo "Start balance" >>$seqres.full
 +_btrfs_stre

Re: [PATCH] btrfs-progs: doc: correct the destination of btrfs-receive

2016-06-14 Thread Hugo Mills
On Tue, Jun 14, 2016 at 10:51:33AM +0200, David Sterba wrote:
> On Tue, Jun 14, 2016 at 02:50:19PM +0900, Satoru Takeuchi wrote:
> > We can set not only btrfs mount point but also any path belong to
> > btrfs mount point as btrfs-receive's destination.
> > 
> > Signed-off-by: Satoru Takeuchi 
> 
> The patches from you have a consistent whitespace damage, I've fixed
> the btrfs-crc but now that I see it again I suspect some error on your
> side.
> 
> > @@ -7,14 +7,14 @@ btrfs-receive - receive subvolumes from send stream
> > 
> >   SYNOPSIS
> >   
> > -*btrfs receive* [options] 
> > +*btrfs receive* [options] 
> > 
> >   DESCRIPTION
> >   ---
> > 
> >   Receive a stream of changes and replicate one or more subvolumes that were
> >   previously used with *btrfs send* The received subvolumes are stored to
> > -'mount'.
> > +'path'.
> > 
> >   *btrfs receive* will fail int the following cases:
> > 
> > @@ -37,7 +37,7 @@ by default, btrfs receive uses standard input to receive 
> > the stream,
> >   use this option to read from a file instead
> > 
> >   -C|--chroot::
> > -confine the process to 'mount' using `chroot`(1)
> > +confine the process to 'path' using `chroot`(1)
> > 
> >   -e::
> >   terminate after receiving an 'end cmd' marker in the stream.
> 
> ie. all the context lines start with two spaces instead of one. I'll
> apply this patch manually but please have a look.

   Looking at this, I suspect it's a consequence of sending it as
"Content-Type: format=flowed; delsp=yes". I'm not sure which of those
two options is the culprit. When I look at the message in my client
(mutt), it looks absolutely fine. When I pipe it to hexdump, the
double-spacing is apparent.

   Hugo.

-- 
Hugo Mills | It's against my programming to impersonate a deity!
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |  C3PO, Return of the Jedi




[PATCH] fstests: btrfs: test if qgroup handle data balance correctly

2016-06-14 Thread Qu Wenruo
Btrfs kernels after v4.2 leak qgroup numbers for relocated data
extents, due to the design of the direct tree block swap.

This test case checks whether such a data balance corrupts qgroup
numbers.

Reported-by: Mark Fasheh 
Reported-by: Filipe Manana 
Signed-off-by: Qu Wenruo 
---
 tests/btrfs/028 | 85 +
 tests/btrfs/028.out | 67 +
 tests/btrfs/group   |  1 +
 3 files changed, 153 insertions(+)
 create mode 100755 tests/btrfs/028
 create mode 100644 tests/btrfs/028.out

diff --git a/tests/btrfs/028 b/tests/btrfs/028
new file mode 100755
index 000..16c6935
--- /dev/null
+++ b/tests/btrfs/028
@@ -0,0 +1,85 @@
+#! /bin/bash
+# FS QA Test 028
+#
+# Test if btrfs leaks qgroup numbers for data extents
+#
+# Because the balance code does a tree block swap trick, which performs a
+# non-standard extent reference update, qgroup can't handle it correctly,
+# which leads to corrupted qgroup numbers.
+#
+#---
+# Copyright (c) 2016 Fujitsu. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+
+_scratch_mkfs
+# Need to use inline extents to fill metadata rapidly
+_scratch_mount "-o max_inline=2048"
+
+# create 64K of inlined metadata, which ensures there is a 2-level
+# metadata tree, even for the maximum nodesize (64K)
+for i in $(seq 32); do
+   _pwrite_byte 0xcdcdcdcd 0 2k $SCRATCH_MNT/small_$i | _filter_xfs_io
+done
+
+# then a large data write to make the quota corruption obvious enough
+_pwrite_byte 0xcdcdcdcd 0 32m $SCRATCH_MNT/large | _filter_xfs_io
+sync
+
+# enable quota and rescan to get correct number
+_run_btrfs_util_prog quota enable $SCRATCH_MNT
+_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
+
+# now balance data block groups to corrupt qgroup
+_run_btrfs_util_prog balance start -d $SCRATCH_MNT
+
+_scratch_unmount
+# generate a qgroup report and look for inconsistent groups
+$BTRFS_UTIL_PROG check --qgroup-report $SCRATCH_DEV 2>&1 | \
+   grep -E "Counts for qgroup.*are different"
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/028.out b/tests/btrfs/028.out
new file mode 100644
index 000..7ec6b0d
--- /dev/null
+++ b/tests/btrfs/028.out
@@ -0,0 +1,67 @@
+QA output created by 028
+wrote 2048/2048 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2048/2048 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YY

[PATCH] btrfs: relocation: Fix leaking qgroups numbers on data extents

2016-06-14 Thread Qu Wenruo
When balancing data extents, qgroup leaks all of its numbers for the
balanced data extents.

The root cause is the non-standard extent reference update used in the
balance code.

The problem happens in the following steps:
(Use 4M as original data extent size, and 257 as src root objectid)

1) Balance creates new data extents and increases their refs

Balance allocates a new data extent and creates a new EXTENT_DATA item
in the data reloc tree, while the extent's reference count is increased
by 2 different referencers: 257 and the data reloc tree.

At that time, the file tree is still referring to the old extents.

Extent bytenr | Real referencer | Backrefs         |
--------------|-----------------|------------------|
New           | Data reloc      | Data reloc + 257 | << Mismatch
Old           | 257             | 257              |

Qgroup number: 4M + metadata

2) Commit the transaction before merging the reloc tree

Then we go to prepare_to_merge(), which commits the transaction.

In the qgroup update code inside commit_transaction(), the backref walk
finds that the new data extent has 2 backrefs, but the file tree
backref can't find a referencer (the file tree is still referring to
the old extents), and the data reloc tree doesn't count as a file tree.

Extent bytenr | nr_old_roots | nr_new_roots | qgroup change |
--------------|--------------|--------------|---------------|
New           | 0            | 0            | 0             |
Old           | 1            | 1            | 0             |

Qgroup number: 4M + metadata +/- 0 = 4M + metadata

3) Swap tree blocks and free the old tree blocks

Then we go to merge_reloc_roots(), which swaps the tree blocks directly
and frees the old tree blocks.
Freeing the tree blocks also frees their data extents; this goes
through the normal routine, and qgroup handles it well, decreasing the
numbers.

And since the new data extent is not updated here (it was updated in
step 1), qgroup won't scan the new data extent.

Extent bytenr | nr_old_roots | nr_new_roots | qgroup change |
--------------|--------------|--------------|---------------|
New           | -- No modification, doesn't go through qgroup --        |
Old           | 1            | 0            | -4M           |

Qgroup number: 4M + metadata - 4M = metadata

This patch fixes it by re-dirtying the new extents at step 3), so that
the backref walk and qgroup get the correct result.

And thanks to the new qgroup framework, we don't need to check whether
a particular file extent needs to be dirtied. Even if some unrelated
extents are re-dirtied, qgroup handles it quite well.

So we only need to ensure that we don't miss any extents.

Reported-by: Mark Fasheh 
Reported-by: Filipe Manana 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/relocation.c | 94 +++
 1 file changed, 94 insertions(+)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 0477dca..f1d696d 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -31,6 +31,7 @@
 #include "async-thread.h"
 #include "free-space-cache.h"
 #include "inode-map.h"
+#include "qgroup.h"
 
 /*
  * backref_node, mapping_node and tree_block start with this
@@ -1750,6 +1751,78 @@ int memcmp_node_keys(struct extent_buffer *eb, int slot,
 }
 
 /*
+ * Helper function to fixup screwed qgroups caused by increased extent ref,
+ * which doesn't follow normal extent ref update behavior.
+ * (Correct behavior is, increase extent ref and modify source root in
+ *  one trans)
+ * No better solution as long as we're doing swapping trick to do balance.
+ */
+static int qgroup_redirty_data_extents(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root, u64 bytenr,
+  u64 gen)
+{
+   struct btrfs_fs_info *fs_info = root->fs_info;
+   struct extent_buffer *leaf;
+   struct btrfs_delayed_ref_root *delayed_refs;
+   int slot;
+   int ret = 0;
+
+   if (!fs_info->quota_enabled || !is_fstree(root->objectid))
+   return 0;
+   if (WARN_ON(!trans))
+   return -EINVAL;
+
+   delayed_refs = &trans->transaction->delayed_refs;
+
+   leaf = read_tree_block(root, bytenr, gen);
+   if (IS_ERR(leaf)) {
+   return PTR_ERR(leaf);
+   } else if (!extent_buffer_uptodate(leaf)) {
+   ret = -EIO;
+   goto out;
+   }
+
+   /* We only care leaf, which may contains EXTENT_DATA */
+   if (btrfs_header_level(leaf) != 0)
+   goto out;
+
+   for (slot = 0; slot < btrfs_header_nritems(leaf); slot++) {
+   struct btrfs_key key;
+   struct btrfs_file_extent_item *fi;
+   struct btrfs_qgroup_extent_record *record;
+   struct btrfs_qgroup_extent_record *exist;
+
+   btrfs_item_key_to_cpu(leaf, &key, slot);
+   if (key.type != BTRFS_EXTENT_DATA_KEY)
+   continue;
+   fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);

Re: [BUG] receive not seeing file that exists

2016-06-14 Thread Benedikt Morbach
Hi all,

On Thu, Jun 2, 2016 at 9:26 AM, Benedikt Morbach
 wrote:
> I've encountered a bug in btrfs-receive. When receiving a certain
> incremental send, it will error with:
>
> ERROR: cannot open
> backup/detritus/root/root.20160524T1800/var/log/journal/9cbb44cf160f4c1089f77e32ed376a0b/user-1000.journal:
> No such file or directory
>
> even though that path exists and the parent subvolume is identical on
> both ends (I checked manually).
>
> I've noticed this happen before on the same directory (and google
> confirms it has also happened to others) and /var/log/journal/ and its
> children are the only directories with 'chattr +C' on this system, so
> it might be related to that?
>
> This was reported on IRC a week or so ago and Josef requested a tree
> --inode of the file/the dirs leading to it and the incremental send,
> so here you go:
>
>
> send side:
> /mnt
> [256]  btrfs_pool_ssd
>
> /mnt/btrfs_pool_ssd
> [256]  backup
>
> /mnt/btrfs_pool_ssd/backup
> [256]  root
>
> /mnt/btrfs_pool_ssd/backup/root
> [256]  root.20160524T1800
> [256]  root.20160524T1900
>
> /mnt/btrfs_pool_ssd/backup/root/root.20160524T1800
> [268]  var
>
> /mnt/btrfs_pool_ssd/backup/root/root.20160524T1800/var
> [   9035]  log
>
> /mnt/btrfs_pool_ssd/backup/root/root.20160524T1800/var/log
> [35122105]  journal
>
> /mnt/btrfs_pool_ssd/backup/root/root.20160524T1800/var/log/journal
> [35122136]  9cbb44cf160f4c1089f77e32ed376a0b
>
> 
> /mnt/btrfs_pool_ssd/backup/root/root.20160524T1800/var/log/journal/9cbb44cf160f4c1089f77e32ed376a0b
> [53198460]  user-1000.journal
>
>
> receive side:
> /backup
> [256]  detritus
>
> /backup/detritus
> [256]  root
>
> /backup/detritus/root
> [256]  root.20160524T1800
>
> /backup/detritus/root/root.20160524T1800
> [267]  var
>
> /backup/detritus/root/root.20160524T1800/var
> [856]  log
>
> /backup/detritus/root/root.20160524T1800/var/log
> [ 316157]  journal
>
> /backup/detritus/root/root.20160524T1800/var/log/journal
> [ 316158]  9cbb44cf160f4c1089f77e32ed376a0b
>
> 
> /backup/detritus/root/root.20160524T1800/var/log/journal/9cbb44cf160f4c1089f77e32ed376a0b
> [ 738979]  user-1000.journal
>
> both trimmed down to only the relevant path.
>
> I don't know how the ML handles attachments, so incremental send
> stream (with --no-data) is here:
> http://dev.exherbo.org/~moben/send-receive_incremental.stream
>
> Let me know if you need anything else or if I misunderstood the tree
> thing. (I _think_ I can also provide the with-data send, but I'd like
> to take a look at that first ;) )


I just re-tested this with kernel 4.6.2 and progs 4.6 and the error is the same.

I also ran a scrub on both the send and receive fs and both came out
with 0 errors.

Another thing I noticed: the +C attr doesn't seem to be preserved by
send/receive:

send # lsattr
/var/log/journal/9cbb44cf160f4c1089f77e32ed376a0b/user-1000.journal
C--
/var/log/journal/9cbb44cf160f4c1089f77e32ed376a0b/user-1000.journal

receive # lsattr
/backup/detritus/root/root.20160524T1800/var/log/journal/9cbb44cf160f4c1089f77e32ed376a0b/user-1000.journal
---
/backup/detritus/root/root.20160524T1800/var/log/journal/9cbb44cf160f4c1089f77e32ed376a0b/user-1000.journal

Might be related, given that /var/log/journal (and its children) are
the only dirs/files with +C on this system afaik.

Cheers
Benedikt


Re: [PATCH v4] fstests: btrfs: add test for qgroup handle extent de-reference

2016-06-14 Thread Eryu Guan
On Tue, Jun 14, 2016 at 09:42:39AM +0800, Lu Fengqi wrote:
> Test if qgroup can handle extent de-reference during reallocation.
> "extent de-reference" means that reducing an extent's reference count
> or freeing an extent.
> Although current qgroup can handle it, we still need to prevent any
> regression which may break current qgroup.
> 
> Signed-off-by: Lu Fengqi 

Looks good to me.

Reviewed-by: Eryu Guan 


Re: [PULL] Btrfs for 4.7, part 2

2016-06-14 Thread Anand Jain


Chris,

  Sorry for the delay due to vacation.

more below..

On 05/29/2016 08:21 PM, Chris Mason wrote:

On Sat, May 28, 2016 at 01:14:13PM +0800, Anand Jain wrote:



On 05/27/2016 11:42 PM, Chris Mason wrote:

I'm getting errors from btrfs fi show -d, after the very last round of
device replaces.  A little extra debugging:

bytenr mismatch, want=4332716032, have=0
ERROR: cannot read chunk root
ERROR reading /dev/vdh
failed /dev/vdh

Which is cute because the very next command we run fscks /dev/vdh and
succeeds.


Checked the code paths of both btrfs fi show -d and btrfs check;
both call flush during their respective open_ctree in progs.

However, the flush is called after we have read the superblock. That
means the superblock read during the 'show' cli (only) happens without
a flush, while 'check' is unaffected, because 011 calls 'check' after
'show'. But that still does not explain the above error, which occurs
during open_ctree, not while reading the superblock. It remains a
strange case as of now.


It's because we're just not done writing it out yet when btrfs fi show
is run.
I think replace is special here.



Also. I can't reproduce.



I'm in a relatively new test rig using kvm, which probably explains why
I haven't seen it before.  You can probably make it easier by adding
a sleep inside the actual __free_device() func.


So the page cache is stale and this isn't related to any of our
patches.


close_ctree() calls into btrfs_close_devices(), which calls
btrfs_close_one_device(), which uses:

call_rcu(&device->rcu, free_device);

close_ctree() also does an rcu_barrier() to make sure and wait for
free_device() to finish.

But, free_device() just puts the work into schedule_work(), so we don't
know for sure the blkdev_put is done when we exit.


Right, saw that before. Any idea why it's like that? Or whether it
should be fixed?


It's just trying to limit the work that is done from call_rcu, and it
should
definitely be fixed.  It might cause EBUSY or other problems.  Probably
easiest to add a counter or completion object that gets changed by the
__free_device function.
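
A minimal sketch of that counter idea (hypothetical names, not the fix
posted below, which uses a bdev_closing flag and polling instead):
count the pending deferred puts per fs_devices and let unmount sleep
until the count drops to zero. This assumes fs_devices itself outlives
the deferred work.

/*
 * Hypothetical sketch: pending_bdev_puts and bdev_put_wait are invented
 * for illustration; btrfs_close_one_device() would atomic_inc() the
 * counter before call_rcu().
 */
static void __free_device(struct work_struct *work)
{
	struct btrfs_device *device =
		container_of(work, struct btrfs_device, rcu_work);
	struct btrfs_fs_devices *fs_devices = device->fs_devices;

	if (device->bdev)
		blkdev_put(device->bdev, device->mode);
	rcu_string_free(device->name);
	kfree(device);

	if (atomic_dec_and_test(&fs_devices->pending_bdev_puts))
		wake_up(&fs_devices->bdev_put_wait);
}

/* In btrfs_close_devices(), after __btrfs_close_devices(): */
	wait_event(fs_devices->bdev_put_wait,
		   atomic_read(&fs_devices->pending_bdev_puts) == 0);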



Yes, indeed adding a sleep made the problem reproduce.

Also, it looks like this problem was identified before by the commit
below; however, the fix wasn't correct.
   
 commit bc178622d40d87e75abc131007342429c9b03351
 btrfs: use rcu_barrier() to wait for bdev puts at unmount

 ::
 Adding an rcu_barrier() to btrfs_close_devices() causes unmount
 to wait until all blkdev_put()s are done, and the device is truly
 free once unmount completes.
   

 As free_device() spins off __free_device() to do the actual
 bdev put, we need to wait on __free_device(). But rcu_barrier()
 just waits for free_device() to complete, so at the end of
 rcu_barrier() the blkdev_put() may not be completed.


 Wrote a new fix as in the patches,
  [PATCH 2/2] btrfs: wait for bdev put

 For review comments.


Thanks, -Anand


-chris



[PATCH 1/2] btrfs: reorg btrfs_close_one_device()

2016-06-14 Thread Anand Jain
Move btrfs_close_one_device() closer to its caller and remove the
forward declaration.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 71 +++---
 1 file changed, 35 insertions(+), 36 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a637e99e4c6b..a4e8d48acd4b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -140,7 +140,6 @@ static int btrfs_relocate_sys_chunks(struct btrfs_root 
*root);
 static void __btrfs_reset_dev_stats(struct btrfs_device *dev);
 static void btrfs_dev_stat_print_on_error(struct btrfs_device *dev);
 static void btrfs_dev_stat_print_on_load(struct btrfs_device *device);
-static void btrfs_close_one_device(struct btrfs_device *device);
 
 DEFINE_MUTEX(uuid_mutex);
 static LIST_HEAD(fs_uuids);
@@ -853,6 +852,41 @@ static void free_device(struct rcu_head *head)
schedule_work(&device->rcu_work);
 }
 
+static void btrfs_close_one_device(struct btrfs_device *device)
+{
+   struct btrfs_fs_devices *fs_devices = device->fs_devices;
+   struct btrfs_device *new_device;
+   struct rcu_string *name;
+
+   if (device->bdev)
+   fs_devices->open_devices--;
+
+   if (device->writeable &&
+   device->devid != BTRFS_DEV_REPLACE_DEVID) {
+   list_del_init(&device->dev_alloc_list);
+   fs_devices->rw_devices--;
+   }
+
+   if (device->missing)
+   fs_devices->missing_devices--;
+
+   new_device = btrfs_alloc_device(NULL, &device->devid,
+   device->uuid);
+   BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
+
+   /* Safe because we are under uuid_mutex */
+   if (device->name) {
+   name = rcu_string_strdup(device->name->str, GFP_NOFS);
+   BUG_ON(!name); /* -ENOMEM */
+   rcu_assign_pointer(new_device->name, name);
+   }
+
+   list_replace_rcu(&device->dev_list, &new_device->dev_list);
+   new_device->fs_devices = device->fs_devices;
+
+   call_rcu(&device->rcu, free_device);
+}
+
 static int __btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
 {
struct btrfs_device *device, *tmp;
@@ -7054,38 +7088,3 @@ void btrfs_reset_fs_info_ptr(struct btrfs_fs_info 
*fs_info)
fs_devices = fs_devices->seed;
}
 }
-
-static void btrfs_close_one_device(struct btrfs_device *device)
-{
-   struct btrfs_fs_devices *fs_devices = device->fs_devices;
-   struct btrfs_device *new_device;
-   struct rcu_string *name;
-
-   if (device->bdev)
-   fs_devices->open_devices--;
-
-   if (device->writeable &&
-   device->devid != BTRFS_DEV_REPLACE_DEVID) {
-   list_del_init(&device->dev_alloc_list);
-   fs_devices->rw_devices--;
-   }
-
-   if (device->missing)
-   fs_devices->missing_devices--;
-
-   new_device = btrfs_alloc_device(NULL, &device->devid,
-   device->uuid);
-   BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
-
-   /* Safe because we are under uuid_mutex */
-   if (device->name) {
-   name = rcu_string_strdup(device->name->str, GFP_NOFS);
-   BUG_ON(!name); /* -ENOMEM */
-   rcu_assign_pointer(new_device->name, name);
-   }
-
-   list_replace_rcu(&device->dev_list, &new_device->dev_list);
-   new_device->fs_devices = device->fs_devices;
-
-   call_rcu(&device->rcu, free_device);
-}
-- 
2.7.0



[PATCH 2/2] btrfs: wait for bdev put

2016-06-14 Thread Anand Jain
Further to the previous commit
 bc178622d40d87e75abc131007342429c9b03351
 btrfs: use rcu_barrier() to wait for bdev puts at unmount

Since free_device() spins off __free_device(), the rcu_barrier() for
  call_rcu(&device->rcu, free_device);
doesn't help.

This patch reverts the changes made by
 bc178622d40d87e75abc131007342429c9b03351
and implements a method to wait on __free_device() by using
a new bdev_closing member in struct btrfs_device.

Signed-off-by: Anand Jain 
[rework: bc178622d40d87e75abc131007342429c9b03351]
---
 fs/btrfs/volumes.c | 44 ++--
 fs/btrfs/volumes.h |  1 +
 2 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a4e8d48acd4b..404ce1daebb1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "extent_map.h"
@@ -254,6 +255,17 @@ static struct btrfs_device *__alloc_device(void)
return dev;
 }
 
+static int is_device_closing(struct list_head *head)
+{
+   struct btrfs_device *dev;
+
+   list_for_each_entry(dev, head, dev_list) {
+   if (dev->bdev_closing)
+   return 1;
+   }
+   return 0;
+}
+
 static noinline struct btrfs_device *__find_device(struct list_head *head,
   u64 devid, u8 *uuid)
 {
@@ -832,12 +844,22 @@ again:
 static void __free_device(struct work_struct *work)
 {
struct btrfs_device *device;
+   struct btrfs_device *new_device_addr;
 
device = container_of(work, struct btrfs_device, rcu_work);
 
if (device->bdev)
blkdev_put(device->bdev, device->mode);
 
+   /*
+* If we are coming here from btrfs_close_one_device()
+* then it allocates a new device structure for the same
+* devid, so find device again with the devid
+*/
+   new_device_addr = __find_device(&device->fs_devices->devices,
+   device->devid, NULL);
+
+   new_device_addr->bdev_closing = 0;
rcu_string_free(device->name);
kfree(device);
 }
@@ -884,6 +906,12 @@ static void btrfs_close_one_device(struct btrfs_device 
*device)
list_replace_rcu(&device->dev_list, &new_device->dev_list);
new_device->fs_devices = device->fs_devices;
 
+   /*
+* Mark the device as closing so we can wait for kworkers to finish
+* all blkdev_puts and the device is really free when umount is done.
+*/
+   new_device->bdev_closing = 1;
+
call_rcu(&device->rcu, free_device);
 }
 
@@ -912,6 +940,7 @@ int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
 {
struct btrfs_fs_devices *seed_devices = NULL;
int ret;
+   int retry_cnt = 5;
 
mutex_lock(&uuid_mutex);
ret = __btrfs_close_devices(fs_devices);
@@ -927,12 +956,15 @@ int btrfs_close_devices(struct btrfs_fs_devices 
*fs_devices)
__btrfs_close_devices(fs_devices);
free_fs_devices(fs_devices);
}
-   /*
-* Wait for rcu kworkers under __btrfs_close_devices
-* to finish all blkdev_puts so device is really
-* free when umount is done.
-*/
-   rcu_barrier();
+
+   while (is_device_closing(&fs_devices->devices) &&
+   --retry_cnt) {
+   mdelay(1000); //1 sec
+   }
+
+   if (!(retry_cnt > 0))
+   printk(KERN_WARNING "BTRFS: %pU bdev_put didn't complete, 
giving up\n",
+   fs_devices->fsid);
return ret;
 }
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 0ac90f8d85bd..945e49f5e17d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -150,6 +150,7 @@ struct btrfs_device {
/* Counter to record the change of device stats */
atomic_t dev_stats_ccnt;
atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
+   int bdev_closing;
 };
 
 /*
-- 
2.7.0



BUG: unable to mount btrfs on ppc64 starting from v4.7-rc3 kernel

2016-06-14 Thread Eryu Guan
Hi,

I'm unable to mount btrfs on ppc64 hosts and other hosts with a 64k
page size (like aarch64, ppc64le). It seems that commit 99e3ecfcb9f4
("Btrfs: add more validation checks for superblock") introduced this
failure; btrfs fails the stripesize check.

[root@ibm-p8-kvm-09-guest-06 btrfs-progs]# uname -r
4.7.0-rc3
[root@ibm-p8-kvm-09-guest-06 btrfs-progs]# ./mkfs.btrfs -f /dev/vda3
btrfs-progs v4.4
See http://btrfs.wiki.kernel.org for more information.

Label:  (null)
UUID:   06813ff6-d585-4c54-b4df-b7d6920d27ba
Node size:  65536
Sector size:65536
Filesystem size:15.00GiB
Block group profiles:
  Data: single8.00MiB
  Metadata: DUP   1.01GiB
  System:   DUP  12.00MiB
SSD detected:   no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   IDSIZE  PATH
115.00GiB  /dev/vda3

[root@ibm-p8-kvm-09-guest-06 btrfs-progs]# mount /dev/vda3 /mnt
mount: wrong fs type, bad option, bad superblock on /dev/vda3,
   missing codepage or helper program, or other error

   In some cases useful info is found in syslog - try
   dmesg | tail or so.
[root@ibm-p8-kvm-09-guest-06 btrfs-progs]# dmesg | tail
...
[ 1910.048650] BTRFS: device fsid 06813ff6-d585-4c54-b4df-b7d6920d27ba devid 1 
transid 3 /dev/vda3
[ 1913.152085] BTRFS error (device vda3): invalid stripesize 4096
[ 1913.154349] BTRFS error (device vda3): superblock contains fatal errors
[ 1913.200300] BTRFS: open_ctree failed
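
For what it's worth, the rejection looks like it comes from a
stripesize check of roughly this shape (a reconstruction based on the
error message, not the literal code of commit 99e3ecfcb9f4); mkfs
apparently still writes stripesize 4096 even with a 64k sectorsize:

	/* Reconstruction, for illustration only. */
	u32 sectorsize = btrfs_super_sectorsize(sb);	/* 65536 here */
	u32 stripesize = btrfs_super_stripesize(sb);	/* 4096 from mkfs */

	if (!is_power_of_2(stripesize) || stripesize != sectorsize) {
		btrfs_err(fs_info, "invalid stripesize %u", stripesize);
		ret = -EINVAL;
	}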

Thanks,
Eryu


Re: dd on wrong device, 1.9 GiB from the beginning has been overwritten, how to restore partition?

2016-06-14 Thread Maximilian Böhm
Wow, I'm overwhelmed, thank you very much for your help!

So, firstly I did a
dd if=/dev/zero of=nullz.raw bs=1 count=0 seek=2028060672
and overwrote the ISO on the HDD.

Then I was able to restore the GPT using gdisk with "use backup GPT
header (rebuilding main)". Now I have an intact GPT, a displayed
partition with original name, and an unknown filesystem.

btrfs check --repair /dev/sdc1
and
btrfs-show-super /dev/sdc1 --all

still don't work.

But btrfs-show-super now finds the third superblock (see the bottom of
this mail).

Then I tried btrfs restore with -u (superblock 3, since they are numbered 0, 1, 2):

$ losetup /dev/loop1 -o 1M /dev/sdc
$ btrfs restore /dev/loop1 -l -u 2
checksum verify failed on 21004288 found E4E3BDB6 wanted 
checksum verify failed on 21004288 found E4E3BDB6 wanted 
checksum verify failed on 21004288 found E4E3BDB6 wanted 
checksum verify failed on 21004288 found E4E3BDB6 wanted 
bytenr mismatch, want=21004288, have=0
ERROR: cannot read chunk root
Could not open root, trying backup super

Hm, doesn't "-u 2" refer to backup superblock 3?


$ btrfsck /dev/loop1
No valid Btrfs found on /dev/loop1
Couldn't open file system

BTW, PhotoRec finds lots of files, but e.g. MKV videos are corrupted in
such a way that they won't play, although KDE's Dolphin is able to
generate previews in some cases.
Any idea how I should proceed further?




$ btrfs-show-super /dev/sdc1 --all
superblock: bytenr=65536, device=/dev/sdc1
-
ERROR: bad magic on superblock on /dev/sdc1 at 65536

superblock: bytenr=67108864, device=/dev/sdc1
-
ERROR: bad magic on superblock on /dev/sdc1 at 67108864

superblock: bytenr=274877906944, device=/dev/sdc1
-
csum                    0x615e669a [match]
bytenr                  274877906944
flags                   0x1
                        ( WRITTEN )
magic                   _BHRfS_M [match]
fsid                    446c7a9c-fcad-42f2-b093-ee495ca8f5be
label                   Speicherschatz
generation              5503
root                    511320200
sys_array_size          129
chunk_root_generation   5485
root_level              1
chunk_root              21004288
chunk_root_level        1
log_root                0
log_root_transid        0
log_root_level          0
total_bytes             8001561821184
bytes_used              6272827662336
sectorsize              4096
nodesize                16384
leafsize                16384
stripesize              4096
root_dir                6
num_devices             1
compat_flags            0x0
compat_ro_flags         0x0
incompat_flags          0x169
                        ( MIXED_BACKREF |
                          COMPRESS_LZO |
                          BIG_METADATA |
                          EXTENDED_IREF |
                          SKINNY_METADATA )
csum_type               0
csum_size               4
cache_generation        5503
uuid_tree_generation    5503
dev_item.uuid           0471bf89-89d1-424a-9fc1-d48241ff453b
dev_item.fsid           446c7a9c-fcad-42f2-b093-ee495ca8f5be [match]
dev_item.type           0
dev_item.total_bytes    8001561821184
dev_item.bytes_used     6283562319872
dev_item.io_align       4096
dev_item.io_width       4096
dev_item.sector_size    4096
dev_item.devid          1
dev_item.dev_group      0
dev_item.seek_speed     0
dev_item.bandwidth      0
dev_item.generation     0


Re: How to map extents to files

2016-06-14 Thread Nikolaus Rath
On Jun 10 2016, Qu Wenruo  wrote:
> At 06/02/2016 10:56 PM, Nikolaus Rath wrote:
>> On Jun 02 2016, Qu Wenruo  wrote:
>>> At 06/02/2016 11:06 AM, Nikolaus Rath wrote:
 Hello,

 For one of my btrfs volumes, btrfsck reports a lot of the following
 warnings:

 [...]
 checking extents
 bad extent [138477568, 138510336), type mismatch with chunk
 bad extent [140091392, 140148736), type mismatch with chunk
 bad extent [140148736, 140201984), type mismatch with chunk
 bad extent [140836864, 140865536), type mismatch with chunk
 [...]

 Is there a way to discover which files are affected by this (in
 particular so that I can take a look at them before and after a btrfsck
 --repair)?
>>>
>>> Which version is the progs? If the fs is not converted from ext2/3/4,
>>> it may be a false alert.
>>
>> Version is 4.4.1. The fs may very well have been converted from ext4,
>> but I can't tell for sure.
>
> For such case, btrfsck --repair is unable to fix it, as btrfs-progs is
> not able to balance extents.
>
> Normally, a full balance would fix it.
>
>
> I would try to update btrfs-progs to 4.5 and recheck, to see if it's a
> false alert.
> If not, then remove unused snapshots and then do the full balance.
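
In command form that amounts to roughly the following (a sketch; device
and mount point are placeholders):

$ btrfs check /dev/sdX                      # recheck, filesystem unmounted
$ mount /dev/sdX /mnt
$ btrfs balance start --full-balance /mnt   # plain 'btrfs balance start /mnt' on older progs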

Newest btrfs-progs reported the same error, and a full balance fixed
it. Thank you!

Best,
Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«



Another 'trailing page extent' observation

2016-06-14 Thread Holger Hoffstätte
Hi,

I have made another observation regarding extra extents; it seems I'm
good at finding these things. Sorry. ;-)

This time it's with rsync. I found it when I started to use the --inplace
option, which doesn't do rsync's usual write-to-temporary/rename cycle, but
instead writes to the destination file directly. All of a sudden many newly
backed-up files had a trailing 4k extent, for no good reason.

This has nothing to do with extending overwrites (where new extents would
of course be fine); it happens when the file is new. It is also independent
of the file size or the filesystem state: it does not seem to be caused by
fragmented free space.

Reproducer example (current dir is btrfs):

$ls -al /tmp/data
-rw-r--r-- 1 root root 17569552 Jun 14 16:33 /tmp/data

$rm -f data && rsync /tmp/data . && sync && filefrag -ek data
Filesystem type is: 9123683e
File size of data is 17569552 (17160 blocks of 1024 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   17159:   53918020..  53935179:  17160: last,eof
data: 1 extent found

$rm -f data && rsync --inplace /tmp/data . && sync && filefrag -ek data
Filesystem type is: 9123683e
File size of data is 17569552 (17160 blocks of 1024 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   17155:   48133532..  48150687:  17156:
   1:17156..   17159:   36734592..  36734595:  4:   48150688: last,eof
data: 2 extents found

This is repeatable and independent of the file, so I suspect that after
Liu Bo's fix for the previously reported stray extents in the middle
of the file with slow buffered writes [1] there's an edge case where a page
is still treated differently at the end after close()-ing the file - which
rsync does correctly.

This is on my 4.4.x++ kernel with btrfs ~4.7 and space_cache=v2, but since
it also happens on a fresh volume with v1 it's probably just another
off-by-one somewhere in the writeback/page handling.

thanks,
Holger

[1] commit a91326679f aka "Btrfs: make mapping->writeback_index point to the
last written page"


Re: [PATCH] Btrfs-progs: add check-only option for balance

2016-06-14 Thread Goffredo Baroncelli
On 2016-06-12 20:53, Hans van Kranenburg wrote:
> Hi!
> 
> On 06/12/2016 08:41 PM, Goffredo Baroncelli wrote:
>> Hi All,
>> 
>> On 2016-06-10 22:47, Hans van Kranenburg wrote:
 +if (sk->min_objectid < sk->max_objectid) + 
 sk->min_objectid += 1;
>>> 
>>> ...and now it's (289406977 168 19193856), which means you're 
>>> continuing your search *after* the block group item!
>>> 
>>> (289406976 168 19193856) is actually (289406976 << 72) + (168 << 
>>> 64) + 19193856, which is 1366685806470112827871857008640
>>> 
>>> The search is continued at 136668581119247931074150336,
>>> which skips 4722366482869645213696 possible places where an
>>> object could live in the tree.
>> 
>> I am not sure I follow you. The extent tree (the tree involved in
>> the search) contains only two kinds of objects:
>> 
>> - BLOCK_GROUP_ITEM, where the key means (logical address, 0xc0,
>>   size in bytes)
>> - EXTENT_ITEM, where the key means (logical address, 0xa8,
>>   size in bytes)
>> 
>> So it seems that for each (possible) "logical address", only two
>> items might exist; the two items are completely identified by
>> (objectid, type, ). It should not be possible (for the extent tree)
>> to have two items with the same objectid and type but different
>> offsets. So, for the extent tree, it is safe to advance only the
>> objectid field.
>> 
>> Am I wrong?
> 
> When calling the search ioctl, the caller has to provide a memory
> buffer that the kernel is going to fill with results. For
> BTRFS_IOC_TREE_SEARCH used here, this buffer has a fixed size of 4096
> bytes. Without some headers etc, this leaves a bit less than 4000
> bytes of space for the kernel to write search result objects to.
> 
> If I do a search that will result in far more objects to be returned
> than possible to fit in those <4096 bytes, the kernel will just put a
> few in there until the next one does not fit any more.
> 
> It's the responsibility of the caller to change the start of the
> search to point just after the last received object and do the search
> again, in order to retrieve a few extra results.

You are right. If the last item in the buffer is an EXTENT_ITEM, and the
next item on disk is a BLOCK_GROUP_ITEM with the same object id, the
latter would be skipped.
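
A minimal sketch of a continuation step that avoids exactly that: advance
the full (objectid, type, offset) key with carry instead of bumping only
the objectid. The structs are the ones from linux/btrfs.h; the helper
name is made up.

#include <linux/btrfs.h>

/*
 * Sketch only: continue a BTRFS_IOC_TREE_SEARCH strictly after the last
 * returned item, carrying from offset into type and objectid, so items
 * that share the last objectid are not skipped.
 */
static void advance_search_key(struct btrfs_ioctl_search_key *sk,
                               const struct btrfs_ioctl_search_header *last)
{
        sk->min_objectid = last->objectid;
        sk->min_type = last->type;
        sk->min_offset = last->offset;

        if (sk->min_offset < (__u64)-1) {
                sk->min_offset++;
        } else if (sk->min_type < 255) {        /* key types are u8 on disk */
                sk->min_offset = 0;
                sk->min_type++;
        } else {
                /* objectid == (u64)-1 here would mean the search is done */
                sk->min_offset = 0;
                sk->min_type = 0;
                sk->min_objectid++;
        }
}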

I have always found BTRFS_IOC_TREE_SEARCH terrible; if the min_* fields
were separate from the key, this ioctl would be a lot simpler to use.
Moreover, in most cases (like this one) it would reduce the number of
context switches, because the ioctl would return only valid data.



> 
> So, the important line here was: "...when the extent_item just
> manages to squeeze in as last result into the current result buffer
> from the ioctl..."
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH] Btrfs-progs: add check-only option for balance

2016-06-14 Thread Hugo Mills
On Tue, Jun 14, 2016 at 08:11:59PM +0200, Goffredo Baroncelli wrote:
> On 2016-06-12 20:53, Hans van Kranenburg wrote:
> > Hi!
> > 
> > On 06/12/2016 08:41 PM, Goffredo Baroncelli wrote:
> >> Hi All,
> >> 
> >> On 2016-06-10 22:47, Hans van Kranenburg wrote:
>  +if (sk->min_objectid < sk->max_objectid) + 
>  sk->min_objectid += 1;
> >>> 
> >>> ...and now it's (289406977 168 19193856), which means you're 
> >>> continuing your search *after* the block group item!
> >>> 
> >>> (289406976 168 19193856) is actually (289406976 << 72) + (168 << 
> >>> 64) + 19193856, which is 1366685806470112827871857008640
> >>> 
> >>> The search is continued at 136668581119247931074150336,
> >>> which skips 4722366482869645213696 possible places where an
> >>> object could live in the tree.
> >> 
> >> I am not sure to follow you. The extent tree (the tree involved in 
> >> the search), contains only two kind of object:
> >> 
> >> - BLOCK_GROUP_ITEM  where the key means (logical address, 0xc0,
> >> size in bytes) - EXTENT_ITEM, where the key means (logical address,
> >> 0xa8, size in bytes)
> >> 
> >> So it seems that for each (possible) "logical address", only two 
> >> items might exist; the two item are completely identified by 
> >> (objectid, type, ). It should not possible (for the extent tree)
> >> to have two item with the same objectid,key and different offset.
> >> So, for the extent tree, it is safe to advance only the objectid
> >> field.
> >> 
> >> I am wrong ?
> > 
> > When calling the search ioctl, the caller has to provide a memory
> > buffer that the kernel is going to fill with results. For
> > BTRFS_IOC_TREE_SEARCH used here, this buffer has a fixed size of 4096
> > bytes. Without some headers etc, this leaves a bit less than 4000
> > bytes of space for the kernel to write search result objects to.
> > 
> > If I do a search that will result in far more objects to be returned
> > than possible to fit in those <4096 bytes, the kernel will just put a
> > few in there until the next one does not fit any more.
> > 
> > It's the responsibility of the caller to change the start of the
> > search to point just after the last received object and do the search
> > again, in order to retrieve a few extra results.
> 
> You are right. If the last item in the buffer is a EXTENT_ITEM, and the 
> next item in the disk is a BLOCK_GROUP_ITEM with the same object id,
> the latter would be skipped.
> 
> I was find always terrible the BTRFS_IOC_TREE_SEARCH; if the min_*
> fields was separate from the key, the use of this ioctl would
> be a lot simpler. Moreover in most case (like this one), it would be 
> reduced the context switches, because the ioctl would return
> only valid data.

   There's an argument for implementing it. However, given the way the
indexing works (concatenation of the key elements, resulting in
lexical ordering of keys), you'd still have to do exactly the same
work, only in the kernel instead. The only thing you really win is the
number of context switches.

   It would really have to be a new ioctl, too. You can't change the
behaviour of the existing one.

   Hugo.

> > 
> > So, the important line here was: "...when the extent_item just
> > manages to squeeze in as last result into the current result buffer
> > from the ioctl..."
> > 
> 
> 

-- 
Hugo Mills | "What are we going to do tonight?"
hugo@... carfax.org.uk | "The same thing we do every night, Pinky. Try to
http://carfax.org.uk/  | take over the world!"
PGP: E2AB1DE4  |




Re: [PATCH] Btrfs-progs: add check-only option for balance

2016-06-14 Thread Ashish Samant



On 06/10/2016 01:47 PM, Hans van Kranenburg wrote:

Hi,

Correct me if I'm wrong,

On 06/09/2016 11:46 PM, Ashish Samant wrote:
+/* return 0 if balance can remove a data block group, otherwise 
return 1 */

+static int search_data_bgs(const char *path)
+{
+struct btrfs_ioctl_search_args args;
+struct btrfs_ioctl_search_key *sk;
+struct btrfs_ioctl_search_header *header;
+struct btrfs_block_group_item *bg;
+unsigned long off = 0;
+DIR *dirstream = NULL;
+int e;
+int fd;
+int i;
+u64 total_free = 0;
+u64 min_used = (u64)-1;
+u64 free_of_min_used = 0;
+u64 bg_of_min_used = 0;
+u64 flags;
+u64 used;
+int ret = 0;
+int nr_data_bgs = 0;
+
+fd = btrfs_open_dir(path, &dirstream, 1);
+if (fd < 0)
+return 1;
+
+memset(&args, 0, sizeof(args));
+sk = &args.key;
+
+sk->tree_id = BTRFS_EXTENT_TREE_OBJECTID;
+sk->min_objectid = sk->min_offset = sk->min_transid = 0;
+sk->max_objectid = sk->max_offset = sk->max_transid = (u64)-1;
+sk->max_type = sk->min_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
+sk->nr_items = 65536;


This search returns not only block group information, but also 
everything else. You're first retrieving the complete extent tree to 
userspace, in buffers...



+
+while (1) {
+ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
+e = errno;
+if (ret < 0) {
+fprintf(stderr, "ret %d error '%s'\n", ret,
+strerror(e));
+return ret;
+}
+/*
+ * it should not happen.
+ */
+if (sk->nr_items == 0)
+break;
+
+off = 0;
+for (i = 0; i < sk->nr_items; i++) {
+header = (struct btrfs_ioctl_search_header *)(args.buf
+  + off);
+
+off += sizeof(*header);
+if (header->type == BTRFS_BLOCK_GROUP_ITEM_KEY) {


...and then just throwing 99.99% of the results away again. This 
is going to take a phenomenal amount of effort on a huge filesystem, 
copying unnecessary data around between the kernel and your program.


The first thing I learned myself when starting to play with the search 
ioctl is that the search doesn't happen in some kind of 3 dimensional 
space. You can't just filter on a type of object when walking the tree.


http://logs.tvrrug.org.uk/logs/%23btrfs/2016-02-13.html#2016-02-13T22:32:52 



The sk->max_type = sk->min_type = BTRFS_BLOCK_GROUP_ITEM_KEY only 
makes the search space start somewhere halfway objid 0 and end halfway 
objid max, including all other possible values for the type field for 
all objids in between.



+bg = (struct btrfs_block_group_item *)
+(args.buf + off);
+flags = btrfs_block_group_flags(bg);
+if (flags & BTRFS_BLOCK_GROUP_DATA) {
+nr_data_bgs++;
+used = btrfs_block_group_used(bg);
+printf(
+"block_group %15llu (len %11llu used %11llu)\n",
+header->objectid,
+header->offset, used);
+total_free += header->offset - used;
+if (min_used >= used) {
+min_used = used;
+free_of_min_used =
+header->offset - used;
+bg_of_min_used =
+header->objectid;
+}
+}
+}
+
+off += header->len;
+sk->min_objectid = header->objectid;
+sk->min_type = header->type;
+sk->min_offset = header->offset;


When the following is a part of your extent tree...

key (289406976 EXTENT_ITEM 19193856) itemoff 15718 itemsize 53
extent refs 1 gen 11 flags DATA
extent data backref root 5 objectid 258 offset 0 count 1

key (289406976 BLOCK_GROUP_ITEM 1073741824) itemoff 15694 itemsize 24
block group used 24612864 chunk_objectid 256 flags DATA

...and when the extent_item just manages to squeeze in as last result 
into the current result buffer from the ioctl...


...then your search key looks like (289406976 168 19193856) after 
copying the values from the last seen object...



+}
+sk->nr_items = 65536;
+
+if (sk->min_objectid < sk->max_objectid)
+sk->min_objectid += 1;


...and now it's (289406977 168 19193856), which means you're 
continuing your search *after* the block group item!


(289406976 168 19193856) is actually (289406976 << 72) + (168 << 64) + 
19193856, which is 1366685806470112827871857008640


The search is continued at 136668581119247931074150336, which 
skips 4722366482869645213696 possible places where an object could 
live in the tree.

Ah, yes you are right.



+else
+break;
+}
+
+if (nr_data_bgs <= 1) {
+printf("Data block groups in fs = %d, no need to do 
balance.\n"

Re: BUG: unable to mount btrfs on ppc64 starting from v4.7-rc3 kernel

2016-06-14 Thread Liu Bo
Hi,

On Tue, Jun 14, 2016 at 07:13:22PM +0800, Eryu Guan wrote:
> Hi,
> 
> I'm unable to mount btrfs on ppc64 hosts and other hosts with 64k
> pagesize(like aarch64, ppc64le). It seems that it's commit 99e3ecfcb9f4
> ("Btrfs: add more validation checks for superblock") introduced this
> failure, btrfs fails stripesize check.
> 
> [root@ibm-p8-kvm-09-guest-06 btrfs-progs]# uname -r
> 4.7.0-rc3
> [root@ibm-p8-kvm-09-guest-06 btrfs-progs]# ./mkfs.btrfs -f /dev/vda3
> btrfs-progs v4.4
> See http://btrfs.wiki.kernel.org for more information.
> 
> Label:  (null)
> UUID:   06813ff6-d585-4c54-b4df-b7d6920d27ba
> Node size:  65536
> Sector size:65536
> Filesystem size:15.00GiB
> Block group profiles:
>   Data: single8.00MiB
>   Metadata: DUP   1.01GiB
>   System:   DUP  12.00MiB
> SSD detected:   no
> Incompat features:  extref, skinny-metadata
> Number of devices:  1
> Devices:
>IDSIZE  PATH
> 115.00GiB  /dev/vda3
> 
> [root@ibm-p8-kvm-09-guest-06 btrfs-progs]# mount /dev/vda3 /mnt
> mount: wrong fs type, bad option, bad superblock on /dev/vda3,
>missing codepage or helper program, or other error
> 
>In some cases useful info is found in syslog - try
>dmesg | tail or so.
> [root@ibm-p8-kvm-09-guest-06 btrfs-progs]# dmesg | tail
> ...
> [ 1910.048650] BTRFS: device fsid 06813ff6-d585-4c54-b4df-b7d6920d27ba devid 
> 1 transid 3 /dev/vda3
> [ 1913.152085] BTRFS error (device vda3): invalid stripesize 4096
> [ 1913.154349] BTRFS error (device vda3): superblock contains fatal errors
> [ 1913.200300] BTRFS: open_ctree failed

Ah, that's right, we need to update btrfs-progs to set super_stripesize to
sectorsize.

In mkfs.c we have,

{
u32 sectorsize = 4096;
u32 stripesize = 4096;
...
sectorsize = max(sectorsize, (u32)sysconf(_SC_PAGESIZE));
...
mkfs_cfg.sectorsize = sectorsize;
mkfs_cfg.stripesize = stripesize;

ret = make_btrfs(fd, &mkfs_cfg, NULL);
...
}

Thanks,

-liubo


Re: [PATCH] Btrfs-progs: add check-only option for balance

2016-06-14 Thread Goffredo Baroncelli
On 2016-06-14 20:16, Hugo Mills wrote:
[]
>>
>> You are right. If the last item in the buffer is a EXTENT_ITEM, and the 
>> next item in the disk is a BLOCK_GROUP_ITEM with the same object id,
>> the latter would be skipped.
>>
>> I was find always terrible the BTRFS_IOC_TREE_SEARCH; if the min_*
>> fields was separate from the key, the use of this ioctl would
>> be a lot simpler. Moreover in most case (like this one), it would be 
>> reduced the context switches, because the ioctl would return
>> only valid data.
> 
>There's an argument for implementing it. However, given the way the
> indexing works (concatenation of the key elements, resulting in
> lexical ordering of keys), you'd still have to do exactly the same
> work, only in the kernel instead. The only thing you really win is the
> number of context switches.
> 
>It would really have to be a new ioctl, too. You can't change the
> behaviour of the existing one.
> 
>Hugo.

It was 2010...

http://www.spinics.net/lists/linux-btrfs/msg07636.html


> 
>>>
>>> So, the important line here was: "...when the extent_item just
>>> manages to squeeze in as last result into the current result buffer
>>> from the ioctl..."
>>>
>>
>>
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Replacing drives with larger ones in a 4 drive raid1

2016-06-14 Thread boli
> Replace doesn't need to do a balance, it's largely just a block level copy of 
> the device being replaced, but with some special handling so that the 
> filesystem is consistent throughout the whole operation.  This is most of why 
> it's so much more efficient than add/delete.

Thanks for this correction. In the meantime I have seen for myself that
replace is pretty fast…

Last time I wrote I thought the initial 4 day "remove missing" was 
successful/complete, but as it turned out that device was still missing. Maybe 
that Ctrl+C I tried after a few days did work after all. I only checked/noticed 
this after the 8 TB drive was zeroed and encrypted.

Luckily, most of the "missing" data was already rebuilt onto the remaining 2 
drives, and only 1.27 TiB were still "missing".

In hindsight I should probably have repeated "remove missing" here, but to 
completion. What I did instead was a "replace -r" onto the 8 TB drive. This did 
successfully rebuild the missing 1.27 TiB of data onto the 8 TB drive, at a 
speedy ~144 MiB/s no less!

So I was back to a 4-drive raid1, with 3x 6 TB drives and 1x 8 TB drive (though 
that 8 TB drive had very little data on it). Then I tried to "remove" (without 
"-r" this time) the 6 TB drive with the least amount of data on it (one had 4.0 
TiB, where the other two had 5.45 TiB each). This failed after a few minutes 
because of "no space left on device". 

Austin's mail reminded me to resize due to the larger disk, which I then did, 
but that device still couldn't be removed, same error message.
I then consulted the wiki, which mentions that space for metadata might be 
rather full (11.91 used of 12.66 GiB total here), and to try a "balance" with a 
low "dusage" in such cases.

For now I avoided that by removing one of the other two (rather full) 6 TB 
drives at random, and this has been going on for the last 20 hours or so. 
Thanks to running it in a screen I can check the progress this time around, and 
it's doing its thing at ~41 MiB/s, or ~7 hours per TiB, on average.

Maybe the "no data left on device" will sort itself out during this "remove"'s 
balance, otherwise I'll do it manually later.

> The most efficient way of converting the array online without adding any more 
> disks than you have to begin with is:
> 1. Delete one device from the array with device delete.
> 2. Physically switch the now unused device with one of the new devices.
> 3. Use btrfs replace to replace one of the devices in the array with the 
> newly connected device (and make sure to resize to the full size of the new 
> device).
> 4. Repeat from step 2 until you aren't using any of the old devices in the 
> array.
> 5. You should have one old device left unused, physically switch it for a new 
> device.
> 6. Use btrfs device add to add the new device to the array, then run a full 
> balance.
> 
> This will result in only two balances being needed (one implicit in the 
> device delete, and the explicit final one to restripe across the full array), 
> and will result in the absolute minimum possible data transfer.

Thank you for these very explicit/succinct instructions! Also thanks to Henk 
and Duncan! I will definitely do a full balance when all disks are replaced.
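
Sketched as commands, Austin's sequence above looks roughly like this
(device names, the devid and the mount point are placeholders):

$ btrfs device delete /dev/old1 /mnt            # step 1, does an implicit balance
# physically swap the freed old drive for a new one, then per old drive:
$ btrfs replace start /dev/old2 /dev/new1 /mnt  # step 3
$ btrfs filesystem resize 2:max /mnt            # grow the replaced devid to full size
# ... repeat replace + resize for the remaining old drives ...
$ btrfs device add /dev/new4 /mnt               # step 6
$ btrfs balance start --full-balance /mnt       # final restripe over all devices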



Re: [PATCH] btrfs: chunk_width_limit mount option

2016-06-14 Thread Andrew Armenia
On Sun, Jun 12, 2016 at 10:06 PM, Anand Jain  wrote:
>
>
> On 06/03/2016 09:50 AM, Andrew Armenia wrote:
>>
>> This patch adds mount option 'chunk_width_limit=X', which when set forces
>> the chunk allocator to use only up to X devices when allocating a chunk.
>> This may help reduce the seek penalties seen in filesystems with large
>> numbers of devices.
>
>
> Have you reviewed implementations like device allocation grouping?
> Some info is in the btrfs project ideas..
>
> https://btrfs.wiki.kernel.org/index.php/Project_ideas
>  Chunk allocation groups
>  Limits on number of stripes (stripe width)
>  Linear chunk allocation mode
>
> (Device allocation grouping is important for enterprise storage solutions).
>
> Thanks, Anand

I have looked at those ideas; allocation groups are what I'm ideally
after but I decided to start out small. I just spotted the dev_group
field in the device tree that appears to be currently unused, so
perhaps I will look at developing a group-aware allocator instead of
just limiting the chunk width.

-Andrew


Re: [PATCH] btrfs: chunk_width_limit mount option

2016-06-14 Thread Hugo Mills
On Tue, Jun 14, 2016 at 03:44:47PM -0400, Andrew Armenia wrote:
> On Sun, Jun 12, 2016 at 10:06 PM, Anand Jain  wrote:
> >
> >
> > On 06/03/2016 09:50 AM, Andrew Armenia wrote:
> >>
> >> This patch adds mount option 'chunk_width_limit=X', which when set forces
> >> the chunk allocator to use only up to X devices when allocating a chunk.
> >> This may help reduce the seek penalties seen in filesystems with large
> >> numbers of devices.
> >
> >
> > Have you reviewed implementations like device allocation grouping?
> > Some info is in the btrfs project ideas..
> >
> > https://btrfs.wiki.kernel.org/index.php/Project_ideas
> >  Chunk allocation groups
> >  Limits on number of stripes (stripe width)
> >  Linear chunk allocation mode
> >
> > (Device allocation grouping is important for enterprise storage solutions).
> >
> > Thanks, Anand
> 
> I have looked at those ideas; allocation groups are what I'm ideally
> after but I decided to start out small. I just spotted the dev_group
> field in the device tree that appears to be currently unused, so
> perhaps I will look at developing a group-aware allocator instead of
> just limiting the chunk width.

   I made some design notes on a generalised approach for this a while
ago:

http://www.spinics.net/lists/linux-btrfs/msg33782.html
http://www.spinics.net/lists/linux-btrfs/msg33916.html

   Hugo.

-- 
Hugo Mills | What do we want?
hugo@... carfax.org.uk | Time Travel!
http://carfax.org.uk/  | When do we want it?
PGP: E2AB1DE4  | Irrelevant!   Terminator: Genisys




Re: refcount overflow in 4.4.6-grsec kernel

2016-06-14 Thread Marco Schindler
Tobias Hunger  gmail.com> writes:

> 
> Hi,
> 
> I updated my archlinux to use a grsec kernel (version 4.4.6). Now I
> get lots of errors from PAX and all backtraces show mention btrfs.
> 
> Is this a known problem? Is there anything I can help to debug this?

I'm seeing the same issue with 4.4.8 on hardened gentoo.
This forum post relates, claiming it's a bug within btrfs.
https://forums.grsecurity.net/viewtopic.php?f=3&t=4392



[PATCH] Btrfs: let super_stripesize match with sectorsize

2016-06-14 Thread Liu Bo
Right now stripesize is set to 4096 while sectorsize is set to
max(4096, pagesize).  However, the kernel requires super_stripesize
to match sectorsize.

Reported-by: Eryu Guan 
Signed-off-by: Liu Bo 
---
 mkfs.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mkfs.c b/mkfs.c
index a3a3c14..8d00766 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1482,6 +1482,8 @@ int main(int argc, char **argv)
}
 
sectorsize = max(sectorsize, (u32)sysconf(_SC_PAGESIZE));
+   stripesize = sectorsize;
+
saved_optind = optind;
dev_cnt = argc - optind;
if (dev_cnt == 0)
-- 
2.5.0



[PATCH] receive: strip root subvol path during process_clone

2016-06-14 Thread Benedikt Morbach
otherwise we get

ERROR: cannot open : No such file or directory

because / doesn't exist, so openat() will fail 
below.

Signed-off-by: Benedikt Morbach 
---
 cmds-receive.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/cmds-receive.c b/cmds-receive.c
index f4a3a4f..a975fdd 100644
--- a/cmds-receive.c
+++ b/cmds-receive.c
@@ -753,6 +753,17 @@ static int process_clone(const char *path, u64 offset, u64 
len,
subvol_path = strdup(si->path);
}
 
+   /* strip the subvolume that we are receiving to from the start of 
subvol_path */
+   if (r->full_root_path &&
+   strstr(subvol_path, r->full_root_path) == subvol_path) {
+   size_t root_len = strlen(r->full_root_path);
+   size_t sub_len = strlen(subvol_path);
+
+   memmove(subvol_path,
+   subvol_path + root_len + 1,
+   sub_len - root_len);
+   }
+
ret = path_cat_out(full_clone_path, subvol_path, clone_path);
if (ret < 0) {
error("clone: target path invalid: %s", clone_path);
-- 
2.8.4



Re: [PATCH] receive: strip root subvol path during process_clone

2016-06-14 Thread Benedikt Morbach
Hi all,

this fixes http://thread.gmane.org/gmane.comp.file-systems.btrfs/56902 
for me.
I got to this via gdb + good old debug printf and tbh I'm not entirely
clear about the semantics of process_clone, so some error handling
might be missing here?

Cheers
Benedikt



[PATCH] btrfs-progs: receive: handle root subvol path in clone

2016-06-14 Thread Benedikt Morbach
otherwise we get

ERROR: cannot open : No such file or directory

because / doesn't exist, so openat() will fail 
below.

Signed-off-by: Benedikt Morbach 
---


resend with 'btrfs-progs:' in the subject.
Sorry for the noise

cheers

 cmds-receive.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/cmds-receive.c b/cmds-receive.c
index f4a3a4f..a975fdd 100644
--- a/cmds-receive.c
+++ b/cmds-receive.c
@@ -753,6 +753,17 @@ static int process_clone(const char *path, u64 offset, u64 
len,
subvol_path = strdup(si->path);
}
 
+   /* strip the subvolume that we are receiving to from the start of 
subvol_path */
+   if (r->full_root_path &&
+   strstr(subvol_path, r->full_root_path) == subvol_path) {
+   size_t root_len = strlen(r->full_root_path);
+   size_t sub_len = strlen(subvol_path);
+
+   memmove(subvol_path,
+   subvol_path + root_len + 1,
+   sub_len - root_len);
+   }
+
ret = path_cat_out(full_clone_path, subvol_path, clone_path);
if (ret < 0) {
error("clone: target path invalid: %s", clone_path);
-- 
2.8.4



Re: [PATCH] btrfs-progs: doc: correct the destination of btrfs-receive

2016-06-14 Thread Satoru Takeuchi
On 2016/06/14 18:16, Hugo Mills wrote:
> On Tue, Jun 14, 2016 at 10:51:33AM +0200, David Sterba wrote:
>> On Tue, Jun 14, 2016 at 02:50:19PM +0900, Satoru Takeuchi wrote:
>>> We can set not only btrfs mount point but also any path belong to
>>> btrfs mount point as btrfs-receive's destination.
>>>
>>> Signed-off-by: Satoru Takeuchi 
>>
>> The patches from you have a consistent whitespace damage, I've fixed
>> the btrfs-crc but now that I see it again I suspect some error on your
>> side.

The problem is on my side. I'm sorry.

>>
>>> @@ -7,14 +7,14 @@ btrfs-receive - receive subvolumes from send stream
>>>
>>>   SYNOPSIS
>>>   
>>> -*btrfs receive* [options] 
>>> +*btrfs receive* [options] 
>>>
>>>   DESCRIPTION
>>>   ---
>>>
>>>   Receive a stream of changes and replicate one or more subvolumes that were
>>>   previously used with *btrfs send* The received subvolumes are stored to
>>> -'mount'.
>>> +'path'.
>>>
>>>   *btrfs receive* will fail int the following cases:
>>>
>>> @@ -37,7 +37,7 @@ by default, btrfs receive uses standard input to receive 
>>> the stream,
>>>   use this option to read from a file instead
>>>
>>>   -C|--chroot::
>>> -confine the process to 'mount' using `chroot`(1)
>>> +confine the process to 'path' using `chroot`(1)
>>>
>>>   -e::
>>>   terminate after receiving an 'end cmd' marker in the stream.
>>
>> ie. all the context lines start with two spaces instead of one. I'll
>> apply this patch manually but please have a look.
> 
>Looking at this, I suspect it's a consequence of sending it as
> "Content-Type: format=flowed; delsp=yes". I'm not sure which of those
> two options is the culprit. When I look at the message in my client
> (mutt), it looks absolutely fine. When I pipe it to hexdump, the
> double-spacing is apparent.

You're right. These have been added to charset="iso-2022-jp" plain-text
mail since Thunderbird 45.

I disabled the setting that appends the above-mentioned options, so the
following sample patch probably doesn't have the strange spaces.

===
We can set not only a btrfs mount point but also any path belonging to
a btrfs mount point as btrfs-receive's destination.

Signed-off-by: Satoru Takeuchi 
---
 Documentation/btrfs-receive.asciidoc | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/btrfs-receive.asciidoc 
b/Documentation/btrfs-receive.asciidoc
index fbbded2..e246603 100644
--- a/Documentation/btrfs-receive.asciidoc
+++ b/Documentation/btrfs-receive.asciidoc
@@ -7,14 +7,14 @@ btrfs-receive - receive subvolumes from send stream

 SYNOPSIS
 
-*btrfs receive* [options] 
+*btrfs receive* [options] 

 DESCRIPTION
 ---

 Receive a stream of changes and replicate one or more subvolumes that were
 previously used with *btrfs send* The received subvolumes are stored to
-'mount'.
+'path'.

 *btrfs receive* will fail int the following cases:

@@ -37,7 +37,7 @@ by default, btrfs receive uses standard input to receive the 
stream,
 use this option to read from a file instead

 -C|--chroot::
-confine the process to 'mount' using `chroot`(1)
+confine the process to 'path' using `chroot`(1)

 -e::
 terminate after receiving an 'end cmd' marker in the stream.
-- 
2.5.5
===

Thanks,
Satoru

> 
>Hugo.
> 


[PATCH v11 11/13] btrfs: relocation: Enhance error handling to avoid BUG_ON

2016-06-14 Thread Qu Wenruo
Since the introduction of the btrfs dedupe tree, it's possible for balance
to race with dedupe disabling.

When this happens, dedupe_enabled will make btrfs_get_fs_root() return
PTR_ERR(-ENOENT).
But due to a bug in the error handling branch, when this happens
backref_cache->nr_nodes is increased, yet the node is neither added to
the backref_cache nor is nr_nodes decreased again, causing a BUG_ON()
in backref_cache_cleanup():

[ 2611.668810] [ cut here ]
[ 2611.669946] kernel BUG at
/home/sat/ktest/linux/fs/btrfs/relocation.c:243!
[ 2611.670572] invalid opcode:  [#1] SMP
[ 2611.686797] Call Trace:
[ 2611.687034]  []
btrfs_relocate_block_group+0x1b3/0x290 [btrfs]
[ 2611.687706]  []
btrfs_relocate_chunk.isra.40+0x47/0xd0 [btrfs]
[ 2611.688385]  [] btrfs_balance+0xb22/0x11e0 [btrfs]
[ 2611.688966]  [] btrfs_ioctl_balance+0x391/0x3a0
[btrfs]
[ 2611.689587]  [] btrfs_ioctl+0x1650/0x2290 [btrfs]
[ 2611.690145]  [] ? lru_cache_add+0x3a/0x80
[ 2611.690647]  [] ?
lru_cache_add_active_or_unevictable+0x4c/0xc0
[ 2611.691310]  [] ? handle_mm_fault+0xcd4/0x17f0
[ 2611.691842]  [] ? cp_new_stat+0x153/0x180
[ 2611.692342]  [] ? __vma_link_rb+0xfd/0x110
[ 2611.692842]  [] ? vma_link+0xb9/0xc0
[ 2611.693303]  [] do_vfs_ioctl+0xa1/0x5a0
[ 2611.693781]  [] ? __do_page_fault+0x1b4/0x400
[ 2611.694310]  [] SyS_ioctl+0x41/0x70
[ 2611.694758]  [] entry_SYSCALL_64_fastpath+0x12/0x71
[ 2611.695331] Code: ff 48 8b 45 bf 49 83 af a8 05 00 00 01 49 89 87 a0
05 00 00 e9 2e fd ff ff b8 f4 ff ff ff e9 e4 fb ff ff 0f 0b 0f 0b 0f 0b
0f 0b <0f> 0b 0f 0b 41 89 c6 e9 b8 fb ff ff e8 9e a6 e8 e0 4c 89 e7 44
[ 2611.697870] RIP  []
relocate_block_group+0x741/0x7a0 [btrfs]
[ 2611.698818]  RSP 

This patch calls remove_backref_node() in the error handling branch,
catches the returned -ENOENT in relocate_tree_blocks() and continues
balancing.

Reported-by: Satoru Takeuchi 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/relocation.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index b7de713..32fcd8d 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -887,6 +887,13 @@ again:
root = read_fs_root(rc->extent_root->fs_info, key.offset);
if (IS_ERR(root)) {
err = PTR_ERR(root);
+   /*
+* Don't forget to clean up the current node: it may not have
+* been added to the backref_cache even though nr_nodes was
+* increased.
+* Otherwise this will trigger the BUG_ON() in backref_cache_cleanup().
+*/
+   remove_backref_node(&rc->backref_cache, cur);
goto out;
}
 
@@ -2991,14 +2998,21 @@ int relocate_tree_blocks(struct btrfs_trans_handle 
*trans,
}
 
rb_node = rb_first(blocks);
-   while (rb_node) {
+   for (rb_node = rb_first(blocks); rb_node; rb_node = rb_next(rb_node)) {
block = rb_entry(rb_node, struct tree_block, rb_node);
 
node = build_backref_tree(rc, &block->key,
  block->level, block->bytenr);
if (IS_ERR(node)) {
+   /*
+* The root (so far only the dedupe tree) of the tree block is
+* going to be freed and can't be reached.
+* Just skip it and continue balancing.
+*/
+   if (PTR_ERR(node) == -ENOENT)
+   continue;
err = PTR_ERR(node);
-   goto out;
+   break;
}
 
ret = relocate_tree_block(trans, rc, node, &block->key,
@@ -3006,11 +3020,9 @@ int relocate_tree_blocks(struct btrfs_trans_handle 
*trans,
if (ret < 0) {
if (ret != -EAGAIN || rb_node == rb_first(blocks))
err = ret;
-   goto out;
+   break;
}
-   rb_node = rb_next(rb_node);
}
-out:
err = finish_pending_nodes(trans, rc, path, err);
 
 out_free_path:
-- 
2.8.3





[PATCH v11 09/13] btrfs: dedupe: Inband in-memory only de-duplication implement

2016-06-14 Thread Qu Wenruo
Core implementation of in-band de-duplication.
It reuses the async_cow_start() facility to calculate the dedupe hash,
and uses that hash to do in-band de-duplication at the extent level.

The work flow is as below:
1) Run delalloc range for an inode
2) Calculate hash for the delalloc range at the unit of dedupe_bs
3) For the hash match (duplicated) case, just increase the source extent
   ref and insert the file extent.
   For the hash mismatch case, go through the normal cow_file_range()
   fallback, and add the hash into the dedupe_tree.
   Compression for the hash-miss case is not supported yet.

The current implementation stores all dedupe hashes in an in-memory
rb-tree, with LRU behavior to control the limit.
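
Roughly, the per-range flow can be sketched as follows (simplified
pseudo-kernel code, not the exact functions added by this patch; the
btrfs_dedupe_*() helpers are the ones declared in dedupe.h, while
run_dedupe_range(), reuse_existing_extent() and cow_and_add_hash() are
made-up placeholders for the real glue in inode.c):

static int run_dedupe_range(struct btrfs_fs_info *fs_info,
                            struct inode *inode, u64 start, u64 end,
                            u64 dedupe_bs, u16 hash_type)
{
        u64 cur;
        int ret = 0;

        for (cur = start; cur < end && !ret; cur += dedupe_bs) {
                struct btrfs_dedupe_hash *hash;

                hash = btrfs_dedupe_alloc_hash(hash_type);
                if (!hash)
                        return -ENOMEM;

                /* 2) hash one dedupe_bs sized block */
                ret = btrfs_dedupe_calc_hash(fs_info, inode, cur, hash);
                if (!ret)
                        /* 3) > 0 means hit: the extent ref was increased */
                        ret = btrfs_dedupe_search(fs_info, inode, cur, hash);

                if (ret > 0)
                        ret = reuse_existing_extent(inode, cur, hash);
                else if (ret == 0)
                        ret = cow_and_add_hash(inode, cur, dedupe_bs, hash);

                kfree(hash);
        }
        return ret;
}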

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c |  18 
 fs/btrfs/inode.c   | 257 ++---
 fs/btrfs/relocation.c  |  16 +++
 3 files changed, 256 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 689d25a..e0db77e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -37,6 +37,7 @@
 #include "math.h"
 #include "sysfs.h"
 #include "qgroup.h"
+#include "dedupe.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2405,6 +2406,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle 
*trans,
 
if (btrfs_delayed_ref_is_head(node)) {
struct btrfs_delayed_ref_head *head;
+   struct btrfs_fs_info *fs_info = root->fs_info;
+
/*
 * we've hit the end of the chain and we were supposed
 * to insert this extent into the tree.  But, it got
@@ -2419,6 +2422,15 @@ static int run_one_delayed_ref(struct btrfs_trans_handle 
*trans,
btrfs_pin_extent(root, node->bytenr,
 node->num_bytes, 1);
if (head->is_data) {
+   /*
+* If insert_reserved is given, it means
+* a new extent was reserved, then deleted
+* in one transaction, and the ref inc/dec got merged to 0.
+*
+* In this case, we need to remove its dedupe
+* hash.
+*/
+   btrfs_dedupe_del(trans, fs_info, node->bytenr);
ret = btrfs_del_csums(trans, root,
  node->bytenr,
  node->num_bytes);
@@ -6826,6 +6838,12 @@ static int __btrfs_free_extent(struct btrfs_trans_handle 
*trans,
btrfs_release_path(path);
 
if (is_data) {
+   ret = btrfs_dedupe_del(trans, info, bytenr);
+   if (ret < 0) {
+   btrfs_abort_transaction(trans, extent_root,
+   ret);
+   goto out;
+   }
ret = btrfs_del_csums(trans, root, bytenr, num_bytes);
if (ret) {
btrfs_abort_transaction(trans, extent_root, 
ret);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e5558d9..23a725f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -60,6 +60,7 @@
 #include "hash.h"
 #include "props.h"
 #include "qgroup.h"
+#include "dedupe.h"
 
 struct btrfs_iget_args {
struct btrfs_key *location;
@@ -106,7 +107,8 @@ static int btrfs_finish_ordered_io(struct 
btrfs_ordered_extent *ordered_extent);
 static noinline int cow_file_range(struct inode *inode,
   struct page *locked_page,
   u64 start, u64 end, int *page_started,
-  unsigned long *nr_written, int unlock);
+  unsigned long *nr_written, int unlock,
+  struct btrfs_dedupe_hash *hash);
 static struct extent_map *create_pinned_em(struct inode *inode, u64 start,
   u64 len, u64 orig_start,
   u64 block_start, u64 block_len,
@@ -335,6 +337,7 @@ struct async_extent {
struct page **pages;
unsigned long nr_pages;
int compress_type;
+   struct btrfs_dedupe_hash *hash;
struct list_head list;
 };
 
@@ -353,7 +356,8 @@ static noinline int add_async_extent(struct async_cow *cow,
 u64 compressed_size,
 struct page **pages,
 unsigned long nr_pages,
-int compress_type)
+int compress_type,
+struct btrfs_dedupe_hash *hash)
 {
struct async_extent *async_ex

[PATCH v11 10/13] btrfs: dedupe: Add ioctl for inband deduplication

2016-06-14 Thread Qu Wenruo
From: Wang Xiaoguang 

Add the ioctl interface for inband deduplication, which includes:
1) enable
2) disable
3) status

Also add a pseudo RO compat flag, to imply that btrfs now supports inband
dedupe.
However, we don't add any on-disk format change; it's just a pseudo RO
compat flag.

All these ioctl interfaces are stateless, which means the caller doesn't
need to care about the previous dedupe state before calling them, only
about the final desired state.

For example, if a user wants to enable dedupe with a specified block size
and limit, just fill in the ioctl structure and call the enable ioctl.
There is no need to check whether dedupe is already running.

These ioctls handle things like re-configuring or disabling quite well.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/dedupe.c  | 48 
 fs/btrfs/dedupe.h  | 15 ++
 fs/btrfs/disk-io.c |  3 ++
 fs/btrfs/extent-tree.c |  7 +++--
 fs/btrfs/ioctl.c   | 68 ++
 fs/btrfs/sysfs.c   |  2 ++
 include/uapi/linux/btrfs.h | 23 
 7 files changed, 164 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 4c5b3fc..74e396a 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -41,6 +41,33 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 type)
GFP_NOFS);
 }
 
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+struct btrfs_ioctl_dedupe_args *dargs)
+{
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+   if (!fs_info->dedupe_enabled || !dedupe_info) {
+   dargs->status = 0;
+   dargs->blocksize = 0;
+   dargs->backend = 0;
+   dargs->hash_type = 0;
+   dargs->limit_nr = 0;
+   dargs->current_nr = 0;
+   return;
+   }
+   mutex_lock(&dedupe_info->lock);
+   dargs->status = 1;
+   dargs->blocksize = dedupe_info->blocksize;
+   dargs->backend = dedupe_info->backend;
+   dargs->hash_type = dedupe_info->hash_type;
+   dargs->limit_nr = dedupe_info->limit_nr;
+   dargs->limit_mem = dedupe_info->limit_nr *
+   (sizeof(struct inmem_hash) +
+btrfs_dedupe_sizes[dedupe_info->hash_type]);
+   dargs->current_nr = dedupe_info->current_nr;
+   mutex_unlock(&dedupe_info->lock);
+}
+
 static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
u16 backend, u64 blocksize, u64 limit)
 {
@@ -395,6 +422,27 @@ static void unblock_all_writers(struct btrfs_fs_info 
*fs_info)
percpu_up_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
 }
 
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_dedupe_info *dedupe_info;
+
+   fs_info->dedupe_enabled = 0;
+   /* same as disable */
+   smp_wmb();
+   dedupe_info = fs_info->dedupe_info;
+   fs_info->dedupe_info = NULL;
+
+   if (!dedupe_info)
+   return 0;
+
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+   inmem_destroy(dedupe_info);
+
+   crypto_free_shash(dedupe_info->dedupe_driver);
+   kfree(dedupe_info);
+   return 0;
+}
+
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 {
struct btrfs_dedupe_info *dedupe_info;
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 9162d2c..f605a7f 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -91,6 +91,15 @@ static inline struct btrfs_dedupe_hash 
*btrfs_dedupe_alloc_hash(u16 type)
 int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
u64 blocksize, u64 limit_nr, u64 limit_mem);
 
+
+ /*
+ * Get inband dedupe info
+ * Since it needs to access different backends' hash size, which
+ * is not exported, we need such simple function.
+ */
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+struct btrfs_ioctl_dedupe_args *dargs);
+
 /*
  * Disable dedupe and invalidate all its dedupe data.
  * Called at dedupe disable time.
@@ -102,6 +111,12 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 
type, u16 backend,
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
 
 /*
+ * Cleanup current btrfs_dedupe_info
+ * Called in umount time
+ */
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info);
+
+/*
  * Calculate hash for dedupe.
  * Caller must ensure [start, start + dedupe_bs) has valid data.
  *
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index dccd608..9918e2ff 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -50,6 +50,7 @@
 #include "sysfs.h"
 #include "qgroup.h"
 #include "compression.h"
+#include "dedupe.h"
 
 #ifdef CONFIG_X86
 #include 
@@ -3902,6 +3903,8 @@ void close_ctree(struct btrfs_root *root)
 
btrfs_free_qgroup_config(fs_info);
 
+   btrfs_dedupe_cleanup(fs_info);
+
if (percp

[PATCH v11 01/13] btrfs: dedupe: Introduce dedupe framework and its header

2016-06-14 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce the header for the btrfs online (write-time) de-duplication
framework.

The new de-duplication framework is going to support 2 different dedupe
methods and 1 dedupe hash algorithm.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ctree.h   |   7 +++
 fs/btrfs/dedupe.h  | 149 +
 fs/btrfs/disk-io.c |   1 +
 include/uapi/linux/btrfs.h |  16 +
 4 files changed, 173 insertions(+)
 create mode 100644 fs/btrfs/dedupe.h

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 101c3cf..8f70f53d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1091,6 +1091,13 @@ struct btrfs_fs_info {
struct list_head pinned_chunks;
 
int creating_free_space_tree;
+
+   /*
+* Inband de-duplication related structures
+*/
+   unsigned long dedupe_enabled:1;
+   struct btrfs_dedupe_info *dedupe_info;
+   struct mutex dedupe_ioctl_lock;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
new file mode 100644
index 000..d7b1a77
--- /dev/null
+++ b/fs/btrfs/dedupe.h
@@ -0,0 +1,149 @@
+/*
+ * Copyright (C) 2015 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#ifndef __BTRFS_DEDUPE__
+#define __BTRFS_DEDUPE__
+
+#include 
+#include 
+#include 
+
+static int btrfs_dedupe_sizes[] = { 32 };
+
+/*
+ * For caller outside of dedupe.c
+ *
+ * Different dedupe backends should have their own hash structure
+ */
+struct btrfs_dedupe_hash {
+   u64 bytenr;
+   u32 num_bytes;
+
+   /* last field is a variable length array of dedupe hash */
+   u8 hash[];
+};
+
+struct btrfs_dedupe_info {
+   /* dedupe blocksize */
+   u64 blocksize;
+   u16 backend;
+   u16 hash_type;
+
+   struct crypto_shash *dedupe_driver;
+
+   /*
+* Use a mutex to protect both backends.
+* Even for the in-memory backend, the rb-tree can be quite large,
+* so a mutex is better for such a use case.
+*/
+   struct mutex lock;
+
+   /* following members are only used in in-memory backend */
+   struct rb_root hash_root;
+   struct rb_root bytenr_root;
+   struct list_head lru_list;
+   u64 limit_nr;
+   u64 current_nr;
+};
+
+struct btrfs_trans_handle;
+
+static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
+{
+   return (hash && hash->bytenr);
+}
+
+int btrfs_dedupe_hash_size(u16 type);
+struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 type);
+
+/*
+ * Initialize inband dedupe info.
+ * Called at dedupe enable time.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (from unsupported param to tree creation error for some backends)
+ */
+int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
+   u64 blocksize, u64 limit_nr, u64 limit_mem);
+
+/*
+ * Disable dedupe and invalidate all its dedupe data.
+ * Called at dedupe disable time.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
+
+/*
+ * Calculate hash for dedupe.
+ * Caller must ensure [start, start + dedupe_bs) has valid data.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (error from hash codes)
+ */
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+  struct inode *inode, u64 start,
+  struct btrfs_dedupe_hash *hash);
+
+/*
+ * Search for duplicated extents by calculated hash
+ * Caller must call btrfs_dedupe_calc_hash() first to get the hash.
+ *
+ * @inode: the inode for we are writing
+ * @file_pos: offset inside the inode
+ * As we will increase extent ref immediately after a hash match,
+ * we need @file_pos and @inode in this case.
+ *
+ * Return > 0 for a hash match, and the extent ref will be
+ * *INCREASED*, and hash->bytenr/num_bytes will record the existing
+ * extent data.
+ * Return 0 for a hash miss. Nothing is done
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
+   struct inode *inode, u64 file_pos,
+   struct btrfs_dedupe_hash *hash);
+
+/*
+ * Add a dedupe hash into dedupe

[PATCH v11 02/13] btrfs: dedupe: Introduce function to initialize dedupe info

2016-06-14 Thread Qu Wenruo
From: Wang Xiaoguang 

Add generic function to initialize dedupe info.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/Makefile  |   2 +-
 fs/btrfs/dedupe.c  | 160 +
 fs/btrfs/dedupe.h  |  13 +++-
 include/uapi/linux/btrfs.h |   2 +
 4 files changed, 174 insertions(+), 3 deletions(-)
 create mode 100644 fs/btrfs/dedupe.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 128ce17..1b8c627 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -9,7 +9,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
root-tree.o dir-item.o \
   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
-  uuid-tree.o props.o hash.o free-space-tree.o
+  uuid-tree.o props.o hash.o free-space-tree.o dedupe.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
new file mode 100644
index 000..941ee37
--- /dev/null
+++ b/fs/btrfs/dedupe.c
@@ -0,0 +1,160 @@
+/*
+ * Copyright (C) 2016 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#include "ctree.h"
+#include "dedupe.h"
+#include "btrfs_inode.h"
+#include "transaction.h"
+#include "delayed-ref.h"
+
+struct inmem_hash {
+   struct rb_node hash_node;
+   struct rb_node bytenr_node;
+   struct list_head lru_list;
+
+   u64 bytenr;
+   u32 num_bytes;
+
+   u8 hash[];
+};
+
+static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
+   u16 backend, u64 blocksize, u64 limit)
+{
+   struct btrfs_dedupe_info *dedupe_info;
+
+   dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS);
+   if (!dedupe_info)
+   return -ENOMEM;
+
+   dedupe_info->hash_type = type;
+   dedupe_info->backend = backend;
+   dedupe_info->blocksize = blocksize;
+   dedupe_info->limit_nr = limit;
+
+   /* only support SHA256 yet */
+   dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0);
+   if (IS_ERR(dedupe_info->dedupe_driver)) {
+   int ret;
+
+   ret = PTR_ERR(dedupe_info->dedupe_driver);
+   kfree(dedupe_info);
+   return ret;
+   }
+
+   dedupe_info->hash_root = RB_ROOT;
+   dedupe_info->bytenr_root = RB_ROOT;
+   dedupe_info->current_nr = 0;
+   INIT_LIST_HEAD(&dedupe_info->lru_list);
+   mutex_init(&dedupe_info->lock);
+
+   *ret_info = dedupe_info;
+   return 0;
+}
+
+static int check_dedupe_parameter(struct btrfs_fs_info *fs_info, u16 hash_type,
+ u16 backend, u64 blocksize, u64 limit_nr,
+ u64 limit_mem, u64 *ret_limit)
+{
+   if (blocksize > BTRFS_DEDUPE_BLOCKSIZE_MAX ||
+   blocksize < BTRFS_DEDUPE_BLOCKSIZE_MIN ||
+   blocksize < fs_info->tree_root->sectorsize ||
+   !is_power_of_2(blocksize))
+   return -EINVAL;
+   /*
+* For a new backend or hash type, we return a special return code,
+* as they can easily be extended later.
+*/
+   if (hash_type >= ARRAY_SIZE(btrfs_dedupe_sizes))
+   return -EOPNOTSUPP;
+   if (backend >= BTRFS_DEDUPE_BACKEND_COUNT)
+   return -EOPNOTSUPP;
+
+   /* Backend specific check */
+   if (backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
+   if (!limit_nr && !limit_mem)
+   *ret_limit = BTRFS_DEDUPE_LIMIT_NR_DEFAULT;
+   else {
+   u64 tmp = (u64)-1;
+
+   if (limit_mem) {
+   tmp = limit_mem / (sizeof(struct inmem_hash) +
+   btrfs_dedupe_hash_size(hash_type));
+   /* Too small limit_mem to fill a hash item */
+   if (!tmp)
+   return -EINVAL;
+   }
+   if (!limit_nr)
+   limit_nr = (u64)-1;
+
+   *ret_limit = min(tmp, limit_nr);
+ 

[PATCH v11 13/13] btrfs: dedupe: fix false ENOSPC

2016-06-14 Thread Qu Wenruo
From: Wang Xiaoguang 

When testing in-band dedupe, we sometimes got an ENOSPC error, though the
fs still had plenty of free space. After some debugging, we found that it's
btrfs_delalloc_reserve_metadata() which sometimes tries to reserve
plenty of metadata space, even for a very small data range.

In btrfs_delalloc_reserve_metadata(), the number of metadata bytes we try
to reserve is calculated from the difference between outstanding_extents and
reserved_extents. The case below shows how ENOSPC occurs:

  1. Buffered write of 128MB of data in units of 1MB, so finally we'll have
the inode's outstanding_extents be 1, and reserved_extents be 128.
Note it's btrfs_merge_extent_hook() that merges these 1MB units into
one big outstanding extent, but it does not change reserved_extents.

  2. When writing dirty pages, for in-band dedupe, cow_file_range() will
split the big extent above in units of 16KB (assume our in-band dedupe
blocksize is 16KB). When the first split operation finishes, we'll have 2
outstanding extents and 128 reserved extents. If just then the currently
generated ordered extent is dispatched to run and completes,
btrfs_delalloc_release_metadata() (see btrfs_finish_ordered_io()) will be
called to release metadata, after which we will have 1 outstanding extent
and 1 reserved extent (also see the logic in drop_outstanding_extent()).
Later cow_file_range() continues to handle the remaining data range
[16KB, 128MB), and if no other ordered extent is dispatched to run, there
will be 8191 outstanding extents and 1 reserved extent.

  3. Now if another buffered write for this file enters,
btrfs_delalloc_reserve_metadata() will try to reserve metadata for at
least 8191 outstanding extents; for a 64K node size that is
8191*65536*16, about 8GB of metadata, so obviously it'll return an
ENOSPC error.

Indeed, when a file goes through in-band dedupe, its max extent size is no
longer BTRFS_MAX_EXTENT_SIZE (128MB); it is limited by the in-band dedupe
blocksize. So the current metadata reservation method in btrfs is not
appropriate here. We introduce btrfs_max_extent_size(), which returns the
max extent size for the corresponding file (depending on whether it goes
through in-band dedupe), and we use this value for metadata reservation
and for the extent_io merge, split and clear operations. This ensures the
difference between outstanding_extents and reserved_extents never grows
that large.

Currently only buffered writes go through in-band dedupe, and only if
in-band dedupe is enabled.
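
To make the idea concrete, here is a minimal sketch of what
btrfs_max_extent_size() boils down to (illustrative only, not the patch's
exact code; it assumes the decision depends only on whether in-band dedupe
is enabled for the filesystem):

static u64 btrfs_max_extent_size(struct inode *inode)
{
	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;

	/*
	 * Illustrative assumption: extents of inodes written through
	 * in-band dedupe are capped at the dedupe blocksize; everything
	 * else keeps the old 128MB limit.
	 */
	if (fs_info->dedupe_enabled && dedupe_info)
		return dedupe_info->blocksize;
	return BTRFS_MAX_EXTENT_SIZE;
}

With a 16KB dedupe blocksize, the reservation side and the extent_io
merge/split/clear side then count extents with the same 16KB granularity,
so outstanding_extents and reserved_extents stay close together instead of
drifting apart by thousands as in step 2 above.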

Reported-by: Satoru Takeuchi 
Cc: Josef Bacik 
Cc: Mark Fasheh 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ctree.h|  16 +++--
 fs/btrfs/dedupe.h   |  37 +++
 fs/btrfs/extent-tree.c  |  62 ++
 fs/btrfs/extent_io.c|  63 +-
 fs/btrfs/extent_io.h|  15 -
 fs/btrfs/file.c |  26 +---
 fs/btrfs/free-space-cache.c |   5 +-
 fs/btrfs/inode-map.c|   4 +-
 fs/btrfs/inode.c| 155 ++--
 fs/btrfs/ioctl.c|   6 +-
 fs/btrfs/ordered-data.h |   1 +
 fs/btrfs/relocation.c   |   8 +--
 12 files changed, 290 insertions(+), 108 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 62037e9..21f2689 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2649,10 +2649,14 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root 
*root,
 void btrfs_subvolume_release_metadata(struct btrfs_root *root,
  struct btrfs_block_rsv *rsv,
  u64 qgroup_reserved);
-int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
-void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
-int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len);
-void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len);
+int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes,
+   u32 max_extent_size);
+void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes,
+u32 max_extent_size);
+int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len,
+u32 max_extent_size);
+void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len,
+ u32 max_extent_size);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root,
  unsigned short type);
@@ -3093,7 +3097,7 @@ int btrfs_start_delalloc_inodes(struct btrfs_root *root, 
int delay_iput);
 int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int delay_iput,
   int nr);
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
- struct extent_state **cached_state);
+ s

[PATCH v11 05/13] btrfs: delayed-ref: Add support for increasing data ref under spinlock

2016-06-14 Thread Qu Wenruo
For in-band dedupe, btrfs needs to increase a data ref with the
delayed_refs lock held, so add a new function
btrfs_add_delayed_data_ref_locked() to increase an extent ref with
delayed_refs already locked.

Signed-off-by: Qu Wenruo 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/delayed-ref.c | 30 +++---
 fs/btrfs/delayed-ref.h |  8 
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 430b368..07474e8 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -805,6 +805,26 @@ free_ref:
 }
 
 /*
+ * Do real delayed data ref insert.
+ * Caller must hold delayed_refs->lock and must have allocated memory
+ * for dref, head_ref and record.
+ */
+void btrfs_add_delayed_data_ref_locked(struct btrfs_fs_info *fs_info,
+   struct btrfs_trans_handle *trans,
+   struct btrfs_delayed_data_ref *dref,
+   struct btrfs_delayed_ref_head *head_ref,
+   struct btrfs_qgroup_extent_record *qrecord,
+   u64 bytenr, u64 num_bytes, u64 parent, u64 ref_root,
+   u64 owner, u64 offset, u64 reserved, int action)
+{
+   head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node,
+   qrecord, bytenr, num_bytes, ref_root, reserved,
+   action, 1);
+   add_delayed_data_ref(fs_info, trans, head_ref, &dref->node, bytenr,
+   num_bytes, parent, ref_root, owner, offset, action);
+}
+
+/*
  * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
  */
 int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
@@ -849,13 +869,9 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
*fs_info,
 * insert both the head node and the new ref without dropping
 * the spin lock
 */
-   head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
-   bytenr, num_bytes, ref_root, reserved,
-   action, 1);
-
-   add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
-  num_bytes, parent, ref_root, owner, offset,
-  action);
+   btrfs_add_delayed_data_ref_locked(fs_info, trans, ref, head_ref, record,
+   bytenr, num_bytes, parent, ref_root, owner, offset,
+   reserved, action);
spin_unlock(&delayed_refs->lock);
 
return 0;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 5fca953..5830341 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -239,11 +239,19 @@ static inline void btrfs_put_delayed_ref(struct 
btrfs_delayed_ref_node *ref)
}
 }
 
+struct btrfs_qgroup_extent_record;
 int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
   struct btrfs_trans_handle *trans,
   u64 bytenr, u64 num_bytes, u64 parent,
   u64 ref_root, int level, int action,
   struct btrfs_delayed_extent_op *extent_op);
+void btrfs_add_delayed_data_ref_locked(struct btrfs_fs_info *fs_info,
+   struct btrfs_trans_handle *trans,
+   struct btrfs_delayed_data_ref *dref,
+   struct btrfs_delayed_ref_head *head_ref,
+   struct btrfs_qgroup_extent_record *qrecord,
+   u64 bytenr, u64 num_bytes, u64 parent, u64 ref_root,
+   u64 owner, u64 offset, u64 reserved, int action);
 int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
   struct btrfs_trans_handle *trans,
   u64 bytenr, u64 num_bytes,
-- 
2.8.3





[PATCH v11 07/13] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface

2016-06-14 Thread Qu Wenruo
From: Wang Xiaoguang 

Unlike the dedupe backend (in-memory or on-disk), the hash method currently
has only one option: SHA256. So implement the btrfs_dedupe_calc_hash()
interface using SHA256.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/dedupe.c | 46 ++
 1 file changed, 46 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 867f481..4c5b3fc 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -614,3 +614,49 @@ int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
}
return ret;
 }
+
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+  struct inode *inode, u64 start,
+  struct btrfs_dedupe_hash *hash)
+{
+   int i;
+   int ret;
+   struct page *p;
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+   struct crypto_shash *tfm = dedupe_info->dedupe_driver;
+   SHASH_DESC_ON_STACK(sdesc, tfm);
+   u64 dedupe_bs;
+   u64 sectorsize = BTRFS_I(inode)->root->sectorsize;
+
+   if (!fs_info->dedupe_enabled || !hash)
+   return 0;
+
+   if (WARN_ON(dedupe_info == NULL))
+   return -EINVAL;
+
+   WARN_ON(!IS_ALIGNED(start, sectorsize));
+
+   dedupe_bs = dedupe_info->blocksize;
+
+   sdesc->tfm = tfm;
+   sdesc->flags = 0;
+   ret = crypto_shash_init(sdesc);
+   if (ret)
+   return ret;
+   for (i = 0; sectorsize * i < dedupe_bs; i++) {
+   char *d;
+
+   p = find_get_page(inode->i_mapping,
+ (start >> PAGE_SHIFT) + i);
+   if (WARN_ON(!p))
+   return -ENOENT;
+   d = kmap(p);
+   ret = crypto_shash_update(sdesc, d, sectorsize);
+   kunmap(p);
+   put_page(p);
+   if (ret)
+   return ret;
+   }
+   ret = crypto_shash_final(sdesc, hash->hash);
+   return ret;
+}
-- 
2.8.3





[PATCH v11 00/13] Btrfs dedupe framework

2016-06-14 Thread Qu Wenruo
This patchset can be fetched from github:
https://github.com/adam900710/linux.git wang_dedupe_20160524

In this update, the patchset goes through another re-organization, along
with other fixes to address comments from the community.
1) Move on-disk backend and dedupe props out of the patchset
   Suggested by David.
   There is still some discussion on the on-disk format.
   And dedupe prop is still not 100% determined.

   So it's better to focus on the current in-memory backend only, which
   doesn't bring any on-disk format change.

   Once the framework is done, new backends and props can be added more
   easily.

2) Better enable/disable and buffered write race avoidance
   Inspired by Mark.
   Although we didn't trigger it with our test case in the previous
   version, if we manually add a 5s delay to __btrfs_buffered_write(),
   it's possible to trigger the disable vs. buffered write race.

   The cause is that there is a window between __btrfs_buffered_write()
   and btrfs_dirty_pages().
   In that window, sync_filesystem() can return very quickly since there
   are no dirty pages yet.
   During that window, dedupe disable can start and finish, and the
   buffered writer may then dereference the NULL dedupe info pointer.

   Now we use sb->s_writers.rw_sem to wait for all current writers and
   block further writers, then sync the fs, change the dedupe status and
   finally unblock writers (like freeze does).
   This gives clearer logic and code, and is safer than the previous
   method, because there is no window before we dirty the pages.

3) Fix the ENOSPC problem with a better solution.
   Pointed out by Josef.
   The last 2 patches from Wang fix the ENOSPC problem with a more
   comprehensive method for delalloc metadata reservation,
   along with a small outstanding_extents improvement to cooperate with
   the tunable max extent size.

Now the whole patchset only adds the in-memory backend as a whole;
no other backend nor prop.
So we can focus on the framework itself.

Next version will focus on ioctl interface modification suggested by
David.

Thanks,
Qu

Changelog:
v2:
  Totally reworked to handle multiple backends
v3:
  Fix a stupid but deadly on-disk backend bug
  Add handle for multiple hash on same bytenr corner case to fix abort
  trans error
  Increase dedup rate by enhancing the delayed ref handler for both backends.
  Move dedup_add() to run_delayed_ref() time, to fix abort trans error.
  Increase the dedup block size upper limit to 8M.
v4:
  Add dedup prop for disabling dedup for given files/dirs.
  Merge inmem_search() and ondisk_search() into generic_search() to save
  some code
  Fix another delayed_ref related bug.
  Use the same mutex for both inmem and ondisk backend.
  Move dedup_add() back to btrfs_finish_ordered_io() to increase dedup
  rate.
v5:
  Reuse compress routine for much simpler dedup function.
  Slightly improved performance due to above modification.
  Fix race between dedup enable/disable
  Fix for false ENOSPC report
v6:
  Further enable/disable race window fix.
  Minor format change according to checkpatch.
v7:
  Fix one concurrency bug with balance.
  Slightly modify return value from -EINVAL to -EOPNOTSUPP for
  btrfs_dedup_ioctl() to allow progs to distinguish unsupported commands
  and wrong parameter.
  Rebased to integration-4.6.
v8:
  Rename 'dedup' to 'dedupe'.
  Add support to allow dedupe and compression work at the same time.
  Fix several balance related bugs. Special thanks to Satoru Takeuchi,
  who exposed most of them.
  Small dedupe hit case performance improvement.
v9:
  Re-order the patchset to completely separate pure in-memory and any
  on-disk format change.
  Fold bug fixes into its original patch.
v10:
  Adding back missing bug fix patch.
  Reduce on-disk item size.
  Hide dedupe ioctl under CONFIG_BTRFS_DEBUG.
v11:
  Remove other backend and props support to focus on the framework and
  in-memory backend. Suggested by David.
  Better disable and buffered write race protection.
  Comprehensive fix to dedupe metadata ENOSPC problem.

Qu Wenruo (3):
  btrfs: delayed-ref: Add support for increasing data ref under spinlock
  btrfs: dedupe: Inband in-memory only de-duplication implement
  btrfs: relocation: Enhance error handling to avoid BUG_ON

Wang Xiaoguang (10):
  btrfs: dedupe: Introduce dedupe framework and its header
  btrfs: dedupe: Introduce function to initialize dedupe info
  btrfs: dedupe: Introduce function to add hash into in-memory tree
  btrfs: dedupe: Introduce function to remove hash from in-memory tree
  btrfs: dedupe: Introduce function to search for an existing hash
  btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
  btrfs: ordered-extent: Add support for dedupe
  btrfs: dedupe: Add ioctl for inband dedupelication
  btrfs: improve inode's outstanding_extents computation
  btrfs: dedupe: fix false ENOSPC

 fs/btrfs/Makefile   |   2 +-
 fs/btrfs/ctree.h|  25 +-
 fs/btrfs/dedupe.c   | 710 
 fs/btrfs/dedupe.h 

[PATCH v11 08/13] btrfs: ordered-extent: Add support for dedupe

2016-06-14 Thread Qu Wenruo
From: Wang Xiaoguang 

Add ordered-extent support for dedupe.

Note: the current ordered-extent support only handles non-compressed
source extents.
Support for compressed source extents will be added later.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/ordered-data.c | 46 ++
 fs/btrfs/ordered-data.h | 13 +
 2 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index e96634a..7b1fce4 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -26,6 +26,7 @@
 #include "extent_io.h"
 #include "disk-io.h"
 #include "compression.h"
+#include "dedupe.h"
 
 static struct kmem_cache *btrfs_ordered_extent_cache;
 
@@ -184,7 +185,8 @@ static inline struct rb_node *tree_search(struct 
btrfs_ordered_inode_tree *tree,
  */
 static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
  u64 start, u64 len, u64 disk_len,
- int type, int dio, int compress_type)
+ int type, int dio, int compress_type,
+ struct btrfs_dedupe_hash *hash)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_ordered_inode_tree *tree;
@@ -204,6 +206,33 @@ static int __btrfs_add_ordered_extent(struct inode *inode, 
u64 file_offset,
entry->inode = igrab(inode);
entry->compress_type = compress_type;
entry->truncated_len = (u64)-1;
+   entry->hash = NULL;
+   /*
+* A hash hit means we have already incremented the extent's delayed
+* ref.
+* We must handle this even if another process is trying to 
+* turn off dedupe, otherwise we will leak a reference.
+*/
+   if (hash && (hash->bytenr || root->fs_info->dedupe_enabled)) {
+   struct btrfs_dedupe_info *dedupe_info;
+
+   dedupe_info = root->fs_info->dedupe_info;
+   if (WARN_ON(dedupe_info == NULL)) {
+   kmem_cache_free(btrfs_ordered_extent_cache,
+   entry);
+   return -EINVAL;
+   }
+   entry->hash = btrfs_dedupe_alloc_hash(dedupe_info->hash_type);
+   if (!entry->hash) {
+   kmem_cache_free(btrfs_ordered_extent_cache, entry);
+   return -ENOMEM;
+   }
+   entry->hash->bytenr = hash->bytenr;
+   entry->hash->num_bytes = hash->num_bytes;
+   memcpy(entry->hash->hash, hash->hash,
+  btrfs_dedupe_sizes[dedupe_info->hash_type]);
+   }
+
if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE)
set_bit(type, &entry->flags);
 
@@ -250,15 +279,23 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 
file_offset,
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 0,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, NULL);
 }
 
+int btrfs_add_ordered_extent_dedupe(struct inode *inode, u64 file_offset,
+  u64 start, u64 len, u64 disk_len, int type,
+  struct btrfs_dedupe_hash *hash)
+{
+   return __btrfs_add_ordered_extent(inode, file_offset, start, len,
+ disk_len, type, 0,
+ BTRFS_COMPRESS_NONE, hash);
+}
 int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
 u64 start, u64 len, u64 disk_len, int type)
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 1,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, NULL);
 }
 
 int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
@@ -267,7 +304,7 @@ int btrfs_add_ordered_extent_compress(struct inode *inode, 
u64 file_offset,
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 0,
- compress_type);
+ compress_type, NULL);
 }
 
 /*
@@ -577,6 +614,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent 
*entry)
list_del(&sum->list);
kfree(sum);
}
+   kfree(entry->hash);
kmem_cache_free(btrfs_ordered_extent_cache, entry);
}
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 4515077..8dda4a5 100644
--- a/fs/btrfs/ordered-data.h
+++ b/f

[PATCH v11 12/13] btrfs: improve inode's outstanding_extents computation

2016-06-14 Thread Qu Wenruo
From: Wang Xiaoguang 

This issue was revealed by modifying BTRFS_MAX_EXTENT_SIZE (128MB) to 64KB.
With that change, fsstress tests often trigger these warnings from
btrfs_destroy_inode():
WARN_ON(BTRFS_I(inode)->outstanding_extents);
WARN_ON(BTRFS_I(inode)->reserved_extents);

The simple test program below can reproduce this issue reliably.
Note: you need to modify BTRFS_MAX_EXTENT_SIZE to 64KB for the test,
otherwise there won't be such a WARNING.
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

int main(void)
{
	int fd;
	char buf[68 * 1024];

	memset(buf, 0, 68 * 1024);
	fd = open("testfile", O_CREAT | O_EXCL | O_RDWR, 0644);
	pwrite(fd, buf, 68 * 1024, 64 * 1024);
	return 0;
}

When BTRFS_MAX_EXTENT_SIZE is 64KB, and the buffered data range is:

    64KB                     128KB   132KB
    |------------------------|-------|
              64KB           +  4KB

1) for above data range, btrfs_delalloc_reserve_metadata() will reserve
metadata and set BTRFS_I(inode)->outstanding_extents to 2.
(68KB + 64KB - 1) / 64KB == 2

Outstanding_extents: 2

2) Then btrfs_dirty_pages() will be called to dirty the pages and set the
EXTENT_DELALLOC flag. In this case, btrfs_set_bit_hook() will be called
twice.
The 1st set_bit_hook() call will set the DELALLOC flag for the first 64KB.

    64KB                     128KB
    |------------------------|
         64KB DELALLOC

Outstanding_extents: 2

set_bit_hook() uses the FIRST_DELALLOC flag to avoid re-increasing the
outstanding_extents counter.
So the 1st set_bit_hook() call won't modify outstanding_extents;
it's still 2.

Then the FIRST_DELALLOC flag is *CLEARED*.

3) 2nd btrfs_set_bit_hook() call.
Because FIRST_DELALLOC has been cleared by the previous set_bit_hook(),
btrfs_set_bit_hook() will increase BTRFS_I(inode)->outstanding_extents
by one, so now BTRFS_I(inode)->outstanding_extents is 3.

    64KB                     128KB   132KB
    |------------------------|-------|
      64KB DELALLOC            4KB DELALLOC

Outstanding_extents: 3

But the correct outstanding_extents number should be 2, not 3.
The 2nd btrfs_set_bit_hook() call messes this up and leads to the
WARN_ON().

Normally we could solve this by only increasing outstanding_extents in
set_bit_hook().
But the problem is that delalloc_reserve/release_metadata() only have a
'length' parameter, and so calculate an inaccurate outstanding_extents
count.
If we relied on set_bit_hook() alone, release_metadata() would screw
things up, as it would decrease an inaccurate number.

So the fix we use is (see the worked example after this list):
1) Increase an *INACCURATE* outstanding_extents at delalloc_reserve_metadata()
   Just as a place holder.
2) Increase the *accurate* outstanding_extents at set_bit_hook()
   This is the real increase.
3) Decrease the *INACCURATE* outstanding_extents before returning
   This brings outstanding_extents back to the correct value.
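
To make the three steps concrete, here is an illustrative walk-through of
the 68KB example above (not text from the patch itself):
delalloc_reserve_metadata() adds a place holder of
(68KB + 64KB - 1) / 64KB = 2 to outstanding_extents; the set_bit_hook()
calls made while dirtying the pages then do the accurate accounting, which
for a contiguous 68KB delalloc range with a 64KB max extent size is also 2;
finally the place holder of 2 is dropped again, so outstanding_extents ends
up at 2, matching reserved_extents and avoiding the WARN_ON().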

For the 128MB BTRFS_MAX_EXTENT_SIZE, due to a limitation of
__btrfs_buffered_write(), each iteration only handles about 2MB of data,
so btrfs_dirty_pages() won't need to handle ranges that cross 2 extents.

Cc: Mark Fasheh 
Cc: Josef Bacik 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/inode.c | 68 +++-
 fs/btrfs/ioctl.c |  6 ++---
 3 files changed, 66 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8f70f53d..62037e9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3094,6 +3094,8 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info 
*fs_info, int delay_iput,
   int nr);
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
  struct extent_state **cached_state);
+int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end,
+   struct extent_state **cached_state);
 int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
 struct btrfs_root *new_root,
 struct btrfs_root *parent_root,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 23a725f..4a02383 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1723,11 +1723,15 @@ static void btrfs_split_extent_hook(struct inode *inode,
struct extent_state *orig, u64 split)
 {
u64 size;
+   struct btrfs_root *root = BTRFS_I(inode)->root;
 
/* not delalloc, ignore it */
if (!(orig->state & EXTENT_DELALLOC))
return;
 
+   if (root == root->fs_info->tree_root)
+   return;
+
size = orig->end - orig->start + 1;
if (size > BTRFS_MAX_EXTENT_SIZE) {
u64 num_extents;
@@ -1765,11 

[PATCH v11 06/13] btrfs: dedupe: Introduce function to search for an existing hash

2016-06-14 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce the static function inmem_search() to handle the search in the
in-memory hash tree.

The trick is that we must ensure the delayed ref head is not being run at
the time we search for the hash.

With inmem_search(), we can implement the btrfs_dedupe_search()
interface.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/dedupe.c | 185 ++
 1 file changed, 185 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 960b039..867f481 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -20,6 +20,7 @@
 #include "btrfs_inode.h"
 #include "transaction.h"
 #include "delayed-ref.h"
+#include "qgroup.h"
 
 struct inmem_hash {
struct rb_node hash_node;
@@ -429,3 +430,187 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
kfree(dedupe_info);
return 0;
 }
+
+/*
+ * Caller must ensure the corresponding ref head is not being run.
+ */
+static struct inmem_hash *
+inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash)
+{
+   struct rb_node **p = &dedupe_info->hash_root.rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+   u16 hash_type = dedupe_info->hash_type;
+   int hash_len = btrfs_dedupe_sizes[hash_type];
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, hash_node);
+
+   if (memcmp(hash, entry->hash, hash_len) < 0) {
+   p = &(*p)->rb_left;
+   } else if (memcmp(hash, entry->hash, hash_len) > 0) {
+   p = &(*p)->rb_right;
+   } else {
+   /* Found, need to re-add it to LRU list head */
+   list_del(&entry->lru_list);
+   list_add(&entry->lru_list, &dedupe_info->lru_list);
+   return entry;
+   }
+   }
+   return NULL;
+}
+
+static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
+   struct inode *inode, u64 file_pos,
+   struct btrfs_dedupe_hash *hash)
+{
+   int ret;
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   struct btrfs_trans_handle *trans;
+   struct btrfs_delayed_ref_root *delayed_refs;
+   struct btrfs_delayed_ref_head *head;
+   struct btrfs_delayed_ref_head *insert_head;
+   struct btrfs_delayed_data_ref *insert_dref;
+   struct btrfs_qgroup_extent_record *insert_qrecord = NULL;
+   struct inmem_hash *found_hash;
+   int free_insert = 1;
+   u64 bytenr;
+   u32 num_bytes;
+
+   insert_head = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
+   if (!insert_head)
+   return -ENOMEM;
+   insert_head->extent_op = NULL;
+   insert_dref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
+   if (!insert_dref) {
+   kmem_cache_free(btrfs_delayed_ref_head_cachep, insert_head);
+   return -ENOMEM;
+   }
+   if (root->fs_info->quota_enabled &&
+   is_fstree(root->root_key.objectid)) {
+   insert_qrecord = kmalloc(sizeof(*insert_qrecord), GFP_NOFS);
+   if (!insert_qrecord) {
+   kmem_cache_free(btrfs_delayed_ref_head_cachep,
+   insert_head);
+   kmem_cache_free(btrfs_delayed_data_ref_cachep,
+   insert_dref);
+   return -ENOMEM;
+   }
+   }
+
+   trans = btrfs_join_transaction(root);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   goto free_mem;
+   }
+
+again:
+   mutex_lock(&dedupe_info->lock);
+   found_hash = inmem_search_hash(dedupe_info, hash->hash);
+   /* If we don't find a duplicated extent, just return. */
+   if (!found_hash) {
+   ret = 0;
+   goto out;
+   }
+   bytenr = found_hash->bytenr;
+   num_bytes = found_hash->num_bytes;
+
+   delayed_refs = &trans->transaction->delayed_refs;
+
+   spin_lock(&delayed_refs->lock);
+   head = btrfs_find_delayed_ref_head(trans, bytenr);
+   if (!head) {
+   /*
+* We can safely insert a new delayed_ref as long as we
+* hold delayed_refs->lock.
+* Only need to use atomic inc_extent_ref()
+*/
+   btrfs_add_delayed_data_ref_locked(root->fs_info, trans,
+   insert_dref, insert_head, insert_qrecord,
+   bytenr, num_bytes, 0, root->root_key.objectid,
+   btrfs_ino(inode), file_pos, 0,
+   BTRFS_ADD_DELAYED_REF);
+   spin_unlock(&delayed_refs->lock);
+
+   /* add_delayed_data_ref_locked will free unused memory */
+   

[PATCH v11 03/13] btrfs: dedupe: Introduce function to add hash into in-memory tree

2016-06-14 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce the static function inmem_add() to add a hash into the in-memory
tree. Now we can implement the btrfs_dedupe_add() interface.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/dedupe.c | 151 ++
 1 file changed, 151 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 941ee37..be83aca 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -32,6 +32,14 @@ struct inmem_hash {
u8 hash[];
 };
 
+static inline struct inmem_hash *inmem_alloc_hash(u16 type)
+{
+   if (WARN_ON(type >= ARRAY_SIZE(btrfs_dedupe_sizes)))
+   return NULL;
+   return kzalloc(sizeof(struct inmem_hash) + btrfs_dedupe_sizes[type],
+   GFP_NOFS);
+}
+
 static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
u16 backend, u64 blocksize, u64 limit)
 {
@@ -158,3 +166,146 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
/* Place holder for bisect, will be implemented in later patches */
return 0;
 }
+
+static int inmem_insert_hash(struct rb_root *root,
+struct inmem_hash *hash, int hash_len)
+{
+   struct rb_node **p = &root->rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, hash_node);
+   if (memcmp(hash->hash, entry->hash, hash_len) < 0)
+   p = &(*p)->rb_left;
+   else if (memcmp(hash->hash, entry->hash, hash_len) > 0)
+   p = &(*p)->rb_right;
+   else
+   return 1;
+   }
+   rb_link_node(&hash->hash_node, parent, p);
+   rb_insert_color(&hash->hash_node, root);
+   return 0;
+}
+
+static int inmem_insert_bytenr(struct rb_root *root,
+  struct inmem_hash *hash)
+{
+   struct rb_node **p = &root->rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+   if (hash->bytenr < entry->bytenr)
+   p = &(*p)->rb_left;
+   else if (hash->bytenr > entry->bytenr)
+   p = &(*p)->rb_right;
+   else
+   return 1;
+   }
+   rb_link_node(&hash->bytenr_node, parent, p);
+   rb_insert_color(&hash->bytenr_node, root);
+   return 0;
+}
+
+static void __inmem_del(struct btrfs_dedupe_info *dedupe_info,
+   struct inmem_hash *hash)
+{
+   list_del(&hash->lru_list);
+   rb_erase(&hash->hash_node, &dedupe_info->hash_root);
+   rb_erase(&hash->bytenr_node, &dedupe_info->bytenr_root);
+
+   if (!WARN_ON(dedupe_info->current_nr == 0))
+   dedupe_info->current_nr--;
+
+   kfree(hash);
+}
+
+/*
+ * Insert a hash into the in-memory dedupe tree.
+ * Will remove the least recently used hashes once the limit is exceeded.
+ *
+ * If the hash matches an existing one, we won't insert it, to
+ * save memory.
+ */
+static int inmem_add(struct btrfs_dedupe_info *dedupe_info,
+struct btrfs_dedupe_hash *hash)
+{
+   int ret = 0;
+   u16 type = dedupe_info->hash_type;
+   struct inmem_hash *ihash;
+
+   ihash = inmem_alloc_hash(type);
+
+   if (!ihash)
+   return -ENOMEM;
+
+   /* Copy the data out */
+   ihash->bytenr = hash->bytenr;
+   ihash->num_bytes = hash->num_bytes;
+   memcpy(ihash->hash, hash->hash, btrfs_dedupe_sizes[type]);
+
+   mutex_lock(&dedupe_info->lock);
+
+   ret = inmem_insert_bytenr(&dedupe_info->bytenr_root, ihash);
+   if (ret > 0) {
+   kfree(ihash);
+   ret = 0;
+   goto out;
+   }
+
+   ret = inmem_insert_hash(&dedupe_info->hash_root, ihash,
+   btrfs_dedupe_sizes[type]);
+   if (ret > 0) {
+   /*
+* We only keep one hash in tree to save memory, so if
+* hash conflicts, free the one to insert.
+*/
+   rb_erase(&ihash->bytenr_node, &dedupe_info->bytenr_root);
+   kfree(ihash);
+   ret = 0;
+   goto out;
+   }
+
+   list_add(&ihash->lru_list, &dedupe_info->lru_list);
+   dedupe_info->current_nr++;
+
+   /* Remove the last dedupe hash if we exceed limit */
+   while (dedupe_info->current_nr > dedupe_info->limit_nr) {
+   struct inmem_hash *last;
+
+   last = list_entry(dedupe_info->lru_list.prev,
+ struct inmem_hash, lru_list);
+   __inmem_del(dedupe_info, last);
+   }
+out:
+   mutex_unlock(&dedupe_info->lock);
+   return 0;
+}
+

[PATCH v11 04/13] btrfs: dedupe: Introduce function to remove hash from in-memory tree

2016-06-14 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce the static function inmem_del() to remove a hash from the
in-memory dedupe tree, and implement the btrfs_dedupe_del() and
btrfs_dedupe_disable() interfaces.

Also, for btrfs_dedupe_disable(), add new functions to wait for existing
writers and block incoming writers, to eliminate all possible races.

Cc: Mark Fasheh 
Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/dedupe.c | 132 +++---
 1 file changed, 126 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index be83aca..960b039 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -161,12 +161,6 @@ enable:
return ret;
 }
 
-int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
-{
-   /* Place holder for bisect, will be implemented in later patches */
-   return 0;
-}
-
 static int inmem_insert_hash(struct rb_root *root,
 struct inmem_hash *hash, int hash_len)
 {
@@ -309,3 +303,129 @@ int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
return inmem_add(dedupe_info, hash);
return -EINVAL;
 }
+
+static struct inmem_hash *
+inmem_search_bytenr(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+   struct rb_node **p = &dedupe_info->bytenr_root.rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+
+   if (bytenr < entry->bytenr)
+   p = &(*p)->rb_left;
+   else if (bytenr > entry->bytenr)
+   p = &(*p)->rb_right;
+   else
+   return entry;
+   }
+
+   return NULL;
+}
+
+/* Delete a hash from in-memory dedupe tree */
+static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+   struct inmem_hash *hash;
+
+   mutex_lock(&dedupe_info->lock);
+   hash = inmem_search_bytenr(dedupe_info, bytenr);
+   if (!hash) {
+   mutex_unlock(&dedupe_info->lock);
+   return 0;
+   }
+
+   __inmem_del(dedupe_info, hash);
+   mutex_unlock(&dedupe_info->lock);
+   return 0;
+}
+
+/* Remove a dedupe hash from dedupe tree */
+int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
+struct btrfs_fs_info *fs_info, u64 bytenr)
+{
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+   if (!fs_info->dedupe_enabled)
+   return 0;
+
+   if (WARN_ON(dedupe_info == NULL))
+   return -EINVAL;
+
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+   return inmem_del(dedupe_info, bytenr);
+   return -EINVAL;
+}
+
+static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info)
+{
+   struct inmem_hash *entry, *tmp;
+
+   mutex_lock(&dedupe_info->lock);
+   list_for_each_entry_safe(entry, tmp, &dedupe_info->lru_list, lru_list)
+   __inmem_del(dedupe_info, entry);
+   mutex_unlock(&dedupe_info->lock);
+}
+
+/*
+ * Helper function to wait for and block all incoming writers
+ *
+ * Use the rw_sem introduced for freeze to wait for/block writers.
+ * During the blocked period no new write can happen, so we can
+ * do things quite safely; this is especially helpful for dedupe
+ * disable, as it affects buffered write.
+ */
+static void block_all_writers(struct btrfs_fs_info *fs_info)
+{
+   struct super_block *sb = fs_info->sb;
+
+   percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
+   down_write(&sb->s_umount);
+}
+
+static void unblock_all_writers(struct btrfs_fs_info *fs_info)
+{
+   struct super_block *sb = fs_info->sb;
+
+   up_write(&sb->s_umount);
+   percpu_up_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
+}
+
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_dedupe_info *dedupe_info;
+   int ret;
+
+   dedupe_info = fs_info->dedupe_info;
+
+   if (!dedupe_info)
+   return 0;
+
+   /* Don't allow disable status change in RO mount */
+   if (fs_info->sb->s_flags & MS_RDONLY)
+   return -EROFS;
+
+   /*
+* Wait for all unfinished writers and block further writers.
+* Then sync the whole fs so all current write will go through
+* dedupe, and all later write won't go through dedupe.
+*/
+   block_all_writers(fs_info);
+   ret = sync_filesystem(fs_info->sb);
+   fs_info->dedupe_enabled = 0;
+   fs_info->dedupe_info = NULL;
+   unblock_all_writers(fs_info);
+   if (ret < 0)
+   return ret;
+
+   /* now we are OK to clean up everything */
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+   inmem_destroy(dedupe_info);
+
+   crypto_free_shash(dedupe_info->dedupe_driver);
+   kfree(dedupe_info);
+   return 0;
+}
-- 
2.8.3




Re: [PATCH v11 13/13] btrfs: dedupe: fix false ENOSPC

2016-06-14 Thread kbuild test robot
Hi,

[auto build test ERROR on v4.7-rc3]
[cannot apply to btrfs/next next-20160614]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Qu-Wenruo/Btrfs-dedupe-framework/20160615-101646
config: i386-randconfig-a0-201624 (attached as .config)
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   fs/btrfs/tests/extent-io-tests.c: In function 'test_find_delalloc':
>> fs/btrfs/tests/extent-io-tests.c:117:2: error: too few arguments to function 
>> 'set_extent_delalloc'
 set_extent_delalloc(&tmp, 0, sectorsize - 1, NULL);
 ^~~
   In file included from fs/btrfs/tests/../ctree.h:40:0,
from fs/btrfs/tests/extent-io-tests.c:24:
   fs/btrfs/tests/../extent_io.h:294:19: note: declared here
static inline int set_extent_delalloc(struct extent_io_tree *tree, u64 
start,
  ^~~
   fs/btrfs/tests/extent-io-tests.c:148:2: error: too few arguments to function 
'set_extent_delalloc'
 set_extent_delalloc(&tmp, sectorsize, max_bytes - 1, NULL);
 ^~~
   In file included from fs/btrfs/tests/../ctree.h:40:0,
from fs/btrfs/tests/extent-io-tests.c:24:
   fs/btrfs/tests/../extent_io.h:294:19: note: declared here
static inline int set_extent_delalloc(struct extent_io_tree *tree, u64 
start,
  ^~~
   fs/btrfs/tests/extent-io-tests.c:203:2: error: too few arguments to function 
'set_extent_delalloc'
 set_extent_delalloc(&tmp, max_bytes, total_dirty - 1, NULL);
 ^~~
   In file included from fs/btrfs/tests/../ctree.h:40:0,
from fs/btrfs/tests/extent-io-tests.c:24:
   fs/btrfs/tests/../extent_io.h:294:19: note: declared here
static inline int set_extent_delalloc(struct extent_io_tree *tree, u64 
start,
  ^~~
--
   fs/btrfs/tests/inode-tests.c: In function 'test_extent_accounting':
>> fs/btrfs/tests/inode-tests.c:969:8: error: too few arguments to function 
>> 'btrfs_set_extent_delalloc'
 ret = btrfs_set_extent_delalloc(inode, 0, BTRFS_MAX_EXTENT_SIZE - 1,
   ^
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
^
   fs/btrfs/tests/inode-tests.c:984:8: error: too few arguments to function 
'btrfs_set_extent_delalloc'
 ret = btrfs_set_extent_delalloc(inode, BTRFS_MAX_EXTENT_SIZE,
   ^
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
^
   fs/btrfs/tests/inode-tests.c:1018:8: error: too few arguments to function 
'btrfs_set_extent_delalloc'
 ret = btrfs_set_extent_delalloc(inode, BTRFS_MAX_EXTENT_SIZE >> 1,
   ^
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
^
   fs/btrfs/tests/inode-tests.c:1041:8: error: too few arguments to function 
'btrfs_set_extent_delalloc'
 ret = btrfs_set_extent_delalloc(inode,
   ^
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
^
   fs/btrfs/tests/inode-tests.c:1060:8: error: too few arguments to function 
'btrfs_set_extent_delalloc'
 ret = btrfs_set_extent_delalloc(inode,
   ^
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
^
   fs/btrfs/tests/inode-tests.c:1097:8: error: too few arguments to function 
'btrfs_set_extent_delalloc'
 ret = btrfs_set_extent_delalloc(inode,
   ^
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
^

vim +/set_extent_delalloc +117 fs/btrfs/tests/extent-io-tests.c

294e30fe Josef Bacik 2013-10-09  111}
294e30fe Jose

[PATCH v11.1 13/13] btrfs: dedupe: fix false ENOSPC

2016-06-14 Thread Qu Wenruo
From: Wang Xiaoguang 

When testing in-band dedupe, we sometimes got an ENOSPC error, though the
fs still had plenty of free space. After some debugging, we found that it's
btrfs_delalloc_reserve_metadata() which sometimes tries to reserve
plenty of metadata space, even for a very small data range.

In btrfs_delalloc_reserve_metadata(), the number of metadata bytes we try
to reserve is calculated from the difference between outstanding_extents and
reserved_extents. The case below shows how ENOSPC occurs:

  1. Buffered write of 128MB of data in units of 1MB, so finally we'll have
the inode's outstanding_extents be 1, and reserved_extents be 128.
Note it's btrfs_merge_extent_hook() that merges these 1MB units into
one big outstanding extent, but it does not change reserved_extents.

  2. When writing dirty pages, for in-band dedupe, cow_file_range() will
split the big extent above in units of 16KB (assume our in-band dedupe
blocksize is 16KB). When the first split operation finishes, we'll have 2
outstanding extents and 128 reserved extents. If just then the currently
generated ordered extent is dispatched to run and completes,
btrfs_delalloc_release_metadata() (see btrfs_finish_ordered_io()) will be
called to release metadata, after which we will have 1 outstanding extent
and 1 reserved extent (also see the logic in drop_outstanding_extent()).
Later cow_file_range() continues to handle the remaining data range
[16KB, 128MB), and if no other ordered extent is dispatched to run, there
will be 8191 outstanding extents and 1 reserved extent.

  3. Now if another buffered write for this file enters,
btrfs_delalloc_reserve_metadata() will try to reserve metadata for at
least 8191 outstanding extents; for a 64K node size that is
8191*65536*16, about 8GB of metadata, so obviously it'll return an
ENOSPC error.

Indeed, when a file goes through in-band dedupe, its max extent size is no
longer BTRFS_MAX_EXTENT_SIZE (128MB); it is limited by the in-band dedupe
blocksize. So the current metadata reservation method in btrfs is not
appropriate here. We introduce btrfs_max_extent_size(), which returns the
max extent size for the corresponding file (depending on whether it goes
through in-band dedupe), and we use this value for metadata reservation
and for the extent_io merge, split and clear operations. This ensures the
difference between outstanding_extents and reserved_extents never grows
that large.

Currently only buffered writes go through in-band dedupe, and only if
in-band dedupe is enabled.

Reported-by: Satoru Takeuchi 
Cc: Josef Bacik 
Cc: Mark Fasheh 
Signed-off-by: Wang Xiaoguang 
---
v11.1
  Fix compile error on self test.
---
 fs/btrfs/ctree.h |  16 ++--
 fs/btrfs/dedupe.h|  37 ++
 fs/btrfs/extent-tree.c   |  62 
 fs/btrfs/extent_io.c |  63 +++-
 fs/btrfs/extent_io.h |  15 +++-
 fs/btrfs/file.c  |  26 +--
 fs/btrfs/free-space-cache.c  |   5 +-
 fs/btrfs/inode-map.c |   4 +-
 fs/btrfs/inode.c | 155 +++
 fs/btrfs/ioctl.c |   6 +-
 fs/btrfs/ordered-data.h  |   1 +
 fs/btrfs/relocation.c|   8 +-
 fs/btrfs/tests/extent-io-tests.c |   6 +-
 fs/btrfs/tests/inode-tests.c |  12 +--
 14 files changed, 299 insertions(+), 117 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 62037e9..21f2689 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2649,10 +2649,14 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root 
*root,
 void btrfs_subvolume_release_metadata(struct btrfs_root *root,
  struct btrfs_block_rsv *rsv,
  u64 qgroup_reserved);
-int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
-void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
-int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len);
-void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len);
+int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes,
+   u32 max_extent_size);
+void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes,
+u32 max_extent_size);
+int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len,
+u32 max_extent_size);
+void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len,
+ u32 max_extent_size);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root,
  unsigned short type);
@@ -3093,7 +3097,7 @@ int btrfs_start_delalloc_inodes(struct btrfs_root *root, 
int delay_iput);
 int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int delay_iput,
   in

Re: Replacing drives with larger ones in a 4 drive raid1

2016-06-14 Thread Duncan
boli posted on Tue, 14 Jun 2016 21:28:57 +0200 as excerpted:

> So I was back to a 4-drive raid1, with 3x 6 TB drives and 1x 8 TB drive
> (though that 8 TB drive had very little data on it). Then I tried to
> "remove" (without "-r" this time) the 6 TB drive with the least amount
> of data on it (one had 4.0 TiB, where the other two had 5.45 TiB each).
> This failed after a few minutes because of "no space left on device".
> 
> Austin's mail reminded me to resize due to the larger disk, which I then
> did, but that device still couldn't be removed, same error message.
> I then consulted the wiki, which mentions that space for metadata might
> be rather full (11.91 used of 12.66 GiB total here), and to try a
> "balance" with a low "dusage" in such cases.
> 
> For now I avoided that by removing one of the other two (rather full) 6
> TB drives at random, and this has been going on for the last 20 hours or
> so. Thanks to running it in a screen I can check the progress this time
> around, and it's doing its thing at ~41 MiB/s, or ~7 hours per TiB, on
> average.

The ENOSPC errors are likely due to the fact that the raid1 allocator 
needs _two_ devices with free space.  If your 6T devices get too full, 
even if the 8T device is nearly empty, you'll run into ENOSPC, because 
you have just one device with unallocated space and the raid1 allocator 
needs two.

btrfs device usage should help diagnose this condition, with btrfs 
filesystem show also showing the individual device space allocation but 
not as much other information as usage will.

If you run into this, you may just have to do the hardware yank and 
replace-missing thing again, yanking a 6T and replacing with an 8T.  
Don't forget the resize.  That should leave you with two devices with 
free space and thus hopefully allow normal raid1 reallocation with a 
device remove again.
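
Spelled out as commands, that flow would look roughly like this
(illustrative only: the device names, the devid and the mount point are
placeholders, not taken from this thread):

  # check how much unallocated space each device still has
  btrfs device usage /mnt
  btrfs filesystem show /mnt

  # after physically swapping a full 6T for an 8T, mount degraded and
  # rebuild onto the new disk in place of the missing one
  mount -o degraded /dev/sdX /mnt
  btrfs replace start 3 /dev/sdY /mnt
  btrfs replace status /mnt

  # replace keeps the old device's size, so don't forget to grow it
  btrfs filesystem resize 3:max /mnt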

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH v11 13/13] btrfs: dedupe: fix false ENOSPC

2016-06-14 Thread kbuild test robot
Hi,

[auto build test WARNING on v4.7-rc3]
[cannot apply to btrfs/next next-20160614]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Qu-Wenruo/Btrfs-dedupe-framework/20160615-101646
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   include/linux/compiler.h:232:8: sparse: attribute 'no_sanitize_address': 
unknown attribute
>> fs/btrfs/tests/extent-io-tests.c:117:28: sparse: not enough arguments for 
>> function set_extent_delalloc
   fs/btrfs/tests/extent-io-tests.c:148:28: sparse: not enough arguments for 
function set_extent_delalloc
   fs/btrfs/tests/extent-io-tests.c:203:28: sparse: not enough arguments for 
function set_extent_delalloc
   fs/btrfs/tests/extent-io-tests.c: In function 'test_find_delalloc':
   fs/btrfs/tests/extent-io-tests.c:117:2: error: too few arguments to function 
'set_extent_delalloc'
 set_extent_delalloc(&tmp, 0, sectorsize - 1, NULL);
 ^~~
   In file included from fs/btrfs/tests/../ctree.h:40:0,
from fs/btrfs/tests/extent-io-tests.c:24:
   fs/btrfs/tests/../extent_io.h:294:19: note: declared here
static inline int set_extent_delalloc(struct extent_io_tree *tree, u64 
start,
  ^~~
   fs/btrfs/tests/extent-io-tests.c:148:2: error: too few arguments to function 
'set_extent_delalloc'
 set_extent_delalloc(&tmp, sectorsize, max_bytes - 1, NULL);
 ^~~
   In file included from fs/btrfs/tests/../ctree.h:40:0,
from fs/btrfs/tests/extent-io-tests.c:24:
   fs/btrfs/tests/../extent_io.h:294:19: note: declared here
static inline int set_extent_delalloc(struct extent_io_tree *tree, u64 
start,
  ^~~
   fs/btrfs/tests/extent-io-tests.c:203:2: error: too few arguments to function 
'set_extent_delalloc'
 set_extent_delalloc(&tmp, max_bytes, total_dirty - 1, NULL);
 ^~~
   In file included from fs/btrfs/tests/../ctree.h:40:0,
from fs/btrfs/tests/extent-io-tests.c:24:
   fs/btrfs/tests/../extent_io.h:294:19: note: declared here
static inline int set_extent_delalloc(struct extent_io_tree *tree, u64 
start,
  ^~~
--
   include/linux/compiler.h:232:8: sparse: attribute 'no_sanitize_address': 
unknown attribute
>> fs/btrfs/tests/inode-tests.c:969:40: sparse: not enough arguments for 
>> function btrfs_set_extent_delalloc
   fs/btrfs/tests/inode-tests.c:984:40: sparse: not enough arguments for 
function btrfs_set_extent_delalloc
   fs/btrfs/tests/inode-tests.c:1018:40: sparse: not enough arguments for 
function btrfs_set_extent_delalloc
   fs/btrfs/tests/inode-tests.c:1041:40: sparse: not enough arguments for 
function btrfs_set_extent_delalloc
   fs/btrfs/tests/inode-tests.c:1060:40: sparse: not enough arguments for 
function btrfs_set_extent_delalloc
   fs/btrfs/tests/inode-tests.c:1097:40: sparse: not enough arguments for 
function btrfs_set_extent_delalloc
   fs/btrfs/tests/inode-tests.c: In function 'test_extent_accounting':
   fs/btrfs/tests/inode-tests.c:969:8: error: too few arguments to function 
'btrfs_set_extent_delalloc'
 ret = btrfs_set_extent_delalloc(inode, 0, BTRFS_MAX_EXTENT_SIZE - 1,
   ^
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
^
   fs/btrfs/tests/inode-tests.c:984:8: error: too few arguments to function 
'btrfs_set_extent_delalloc'
 ret = btrfs_set_extent_delalloc(inode, BTRFS_MAX_EXTENT_SIZE,
   ^
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
^
   fs/btrfs/tests/inode-tests.c:1018:8: error: too few arguments to function 
'btrfs_set_extent_delalloc'
 ret = btrfs_set_extent_delalloc(inode, BTRFS_MAX_EXTENT_SIZE >> 1,
   ^
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
^
   fs/btrfs/tests/inode-tests.c:1041:8: error: too few arguments to function 
'btrfs_set_extent_delalloc'
 ret = btrfs_set_extent_delalloc(inode,
   ^
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/t

Re: [PATCH] Btrfs: let super_stripesize match with sectorsize

2016-06-14 Thread Chandan Rajendra
On Tuesday, June 14, 2016 02:33:43 PM Liu Bo wrote:
> Right now stripesize is set to 4096 while sectorsize is set to
> max(4096, pagesize).  However, kernel requires super_stripesize
> to match with sectorsize.
> 
> Reported-by: Eryu Guan 
> Signed-off-by: Liu Bo 
> ---
>  mkfs.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mkfs.c b/mkfs.c
> index a3a3c14..8d00766 100644
> --- a/mkfs.c
> +++ b/mkfs.c
> @@ -1482,6 +1482,8 @@ int main(int argc, char **argv)
>   }
> 
>   sectorsize = max(sectorsize, (u32)sysconf(_SC_PAGESIZE));
> + stripesize = sectorsize;
> +
>   saved_optind = optind;
>   dev_cnt = argc - optind;
>   if (dev_cnt == 0)

Hello Liu Bo,

We have to fix the following check in check_super() as well,

   if (btrfs_super_stripesize(sb) != 4096) {
error("invalid stripesize %u", btrfs_super_stripesize(sb));
goto error_out;
}

i.e. btrfs_super_stripesize(sb) must be equal to
btrfs_super_sectorsize(sb).

However, since btrfs-progs (mkfs.c to be precise) has had stripesize
hardcoded to 4096, setting stripesize to the value of sectorsize in
mkfs.c will cause the following to happen when mkfs.btrfs is invoked on
devices with existing Btrfs filesystem instances:

NOTE: Assume we have changed the stripesize validation in btrfs-progs'
check_super() to,

if (btrfs_super_stripesize(sb) != btrfs_super_sectorsize(sb)) {
error("invalid stripesize %u", btrfs_super_stripesize(sb));
goto error_out;
}


main()
 for each device file passed as an argument,
   test_dev_for_mkfs()
 check_mounted
   check_mounted_where
 btrfs_scan_one_device
   btrfs_read_dev_super
 check_super() call will fail for existing filesystems which have
 stripesize set to 4k. All existing filesystem instances will fall
 into this category.

This error value is pushed up the call stack and this causes the device to not
get added to the fs_devices_mnt list in check_mounted_where(). Hence we would
fail to correctly check the mount status of the multi-device btrfs
filesystems.

I will try to figure out a solution to this problem.
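
One possible direction (an illustrative sketch only, not something settled
in this thread) would be to accept both the legacy hard-coded value and a
stripesize matching sectorsize in check_super(), so superblocks written by
older mkfs.btrfs keep passing the check:

	u32 stripesize = btrfs_super_stripesize(sb);

	/* accept old filesystems (stripesize fixed at 4096) as well as
	 * newer ones where stripesize follows sectorsize */
	if (stripesize != 4096 && stripesize != btrfs_super_sectorsize(sb)) {
		error("invalid stripesize %u", stripesize);
		goto error_out;
	}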

-- 
chandan



Fwd: BTRFS, remarkable problem: filesystem turns to read-only caused by firefox download

2016-06-14 Thread Paul Verreth
Dear all.

When I download a video using the Firefox DownloadHelper addon, the
filesystem suddenly turns read-only. It is not a coincidence: I tried it
several times, and it happened every time.

Info:
Linux wolfgang 4.2.0-35-generic #40-Ubuntu SMP Tue Mar 15 22:15:45 UTC
2016 x86_64 x86_64 x86_64 GNU/Linux

btrfs --version
btrfs-progs v4.0

firefox --version
Mozilla Firefox 46.0

package: btrfs-tools
State: installed
Automatically installed: no
Version: 4.0-2
Priority: optional
Section: admin
Maintainer: Ubuntu Developers 
Architecture: amd64
Uncompressed Size: 3.518 k
Depends: e2fslibs (>= 1.42), libblkid1 (>= 2.17.2), libc6 (>= 2.8), libcomerr2
 (>= 1.01), liblzo2-2, libuuid1 (>= 2.16), zlib1g (>= 1:1.2.0)
Conflicts: btrfs-tools
Description: Checksumming Copy on Write Filesystem utilities

Homepage: http://btrfs.wiki.kernel.org/


extract from dmesg:

[171145.415378] tree block key (18446744073709551611 48 3255000) level 0
[171145.415379] shared block backref parent 203105845248
[171145.415379] item 4 key (75093737472 168 4096) itemoff 3740 itemsize 51
[171145.415380] extent refs 1 gen 1551545 flags 258
[171145.415381] tree block key (3547221 12 3547219) level 0
[171145.415381] shared block backref parent 75092348928
[171145.415382] item 5 key (75093741568 168 4096) itemoff 3689 itemsize 51
[171145.415382] extent refs 1 gen 1452265 flags 258
[171145.415383] tree block key (3280755 12 3280753) level 0
[171145.415383] shared block backref parent 203105845248
[171145.415384] item 6 key (75093745664 168 4096) itemoff 3638 itemsize 51
[171145.415385] extent refs 1 gen 1452265 flags 258
[171145.415385] tree block key (18446744073709551611 48 3255000) level 0
[171145.415386] shared block backref parent 203105845248
[171145.415386] item 7 key (75093749760 168 4096) itemoff 3587 itemsize 51
[171145.415387] extent refs 1 gen 1514341 flags 258
[171145.415387] tree block key (3473870 1 0) level 0
[171145.415388] shared block backref parent 75091329024
[171145.415388] item 8 key (75093753856 168 4096) itemoff 3536 itemsize 51
[171145.415389] extent refs 1 gen 1597177 flags 258
[171145.415390] tree block key (2921841 108 6848512) level 0
[171145.415390] shared block backref parent 75091030016
[171145.415391] item 9 key (75093757952 168 4096) itemoff 3485 itemsize 51
[171145.415391] extent refs 1 gen 1452265 flags 258
[171145.415392] tree block key (18446744073709551611 48 3254998) level 0
[171145.415392] shared block backref parent 203105845248
[171145.415393] item 10 key (75093766144 168 4096) itemoff 3434 itemsize 51
[171145.415394] extent refs 1 gen 1452265 flags 258
[171145.415394] tree block key (3280757 96 12) level 0
[171145.415395] shared block backref parent 203105845248
[171145.415395] item 11 key (75093770240 168 4096) itemoff 3383 itemsize 51
[171145.415396] extent refs 1 gen 1452265 flags 258
[171145.415396] tree block key (18446744073709551611 48 3254998) level 0
[171145.415397] shared block backref parent 203105845248
[171145.415398] item 12 key (75093774336 168 4096) itemoff 3332 itemsize 51
[171145.415398] extent refs 1 gen 1452265 flags 258
[171145.415399] tree block key (3280738 84 4205285998) level 0
[171145.415399] shared block backref parent 203105845248
[171145.415400] item 13 key (75093778432 168 4096) itemoff 3281 itemsize 51
[171145.415400] extent refs 1 gen 1452265 flags 258
[171145.415401] tree block key (18446744073709551611 48 3254998) level 0
[171145.415401] shared block backref parent 203105845248
[171145.415402] item 14 key (75093782528 168 4096) itemoff 3230 itemsize 51
[171145.415403] extent refs 1 gen 1551545 flags 258
[171145.415403] tree block key (3547236 84 3743801254) level 0
[171145.415404] shared block backref parent 75092348928
[171145.415404] item 15 key (75093790720 168 4096) itemoff 3179 itemsize 51
[171145.415405] extent refs 1 gen 305740 flags 258
[171145.415406] tree block key (831798 96 265) level 0
[171145.415406] shared block backref parent 25729994752
[171145.415407] item 16 key (75093794816 168 4096) itemoff 3128 itemsize 51
[171145.415407] extent refs 1 gen 1525268 flags 2
[171145.415408] tree block key (11528453 1 0) level 0
[171145.415408] tree block backref root 281474976710913
[171145.415409] item 17 key (75093798912 168 4096) itemoff 3077 itemsize 51
[171145.415410] extent refs 1 gen 1452265 flags 258
[171145.415410] tree block key (18446744073709551611 48 3254998) level 0
[171145.415411] shared block backref parent 203105845248
[171145.415411] item 18 key (75093803008 168 4096) itemoff 3026 itemsize 51
[171145.415412] extent refs 1 gen 1452265 flags 258
[171145.415413] tree block key (18446744073709551611 48 3255000) level 0
[171145.415413] shared block backref parent 203105845248
[171145.415414] item 19 key (75093807104 168 4096) itemoff 2975 itemsize 51
[171145.415414] extent refs 1 gen 1141992 flags 2
[171145.415415] tree block key (25600704512 168 4096) level 0
[171145.415415] tree block backref root 2
[171145.415416] item 20 key (75093811200 168 4

Re: BTRFS, remarkable problem: filesystem turns to read-only caused by firefox download

2016-06-14 Thread Fajar A. Nugraha
On Wed, Jun 15, 2016 at 1:29 PM, Paul Verreth  wrote:
> Dear all.
>
> When I download a video using the Firefox DownloadHelper addon, the
> filesystem suddenly turns read-only. This is not a coincidence: I tried
> it several times, and it happened every time.
>
> Info:
> Linux wolfgang 4.2.0-35-generic #40-Ubuntu SMP Tue Mar 15 22:15:45 UTC
> 2016 x86_64 x86_64 x86_64 GNU/Linux

> Segmentation fault
>
> Jun  5 15:03:15 ubuntu kernel: [ 2062.544303] BTRFS info (device
> sdb5): relocating block group 383447465984 flags 17


> What can I do to repair this problem?

The usual starting advice would be "try with the latest kernel and see
if you can still reproduce the problem". Is it Ubuntu wily? It goes end
of life in July anyway, so you might want to upgrade to xenial (or at
least just the kernel, for the purpose of troubleshooting your
problem).

Or even try http://kernel.ubuntu.com/~kernel-ppa/mainline/daily/current/
(it should be usable, but might report some errors/warnings due to
missing Ubuntu patches).

-- 
Fajar