Re: [PATCH] btrfs: Fix the wrong condition judgment about subset extent map

2014-09-21 Thread Miao Xie
This patch and the previous one (named below) also fixed an oops, which can be
reproduced by the LTP stress test (ltpstress.sh + fsstress).

[PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map

Thanks
Miao

On Mon, 22 Sep 2014 09:13:03 +0800, Qu Wenruo wrote:
> The previous commit, "btrfs: Fix and enhance merge_extent_mapping() to insert
> best fitted extent map", uses the wrong condition to judge whether the range
> is a subset of an existing extent map.
> 
> This may cause a bug in btrfs no-holes mode.
> 
> This patch corrects the judgment and fixes the bug.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8039021..a99ee9d 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6527,7 +6527,7 @@ insert:
>* extent causing the -EEXIST.
>*/
>   if (start >= extent_map_end(existing) ||
> - start + len <= existing->start) {
> + start <= existing->start) {
>   /*
>* The existing extent map is the one nearest to
>* the [start, start + len) range which overlaps
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: Fix the wrong condition judgment about subset extent map

2014-09-21 Thread Qu Wenruo
The previous commit, "btrfs: Fix and enhance merge_extent_mapping() to insert
best fitted extent map", uses the wrong condition to judge whether the range
is a subset of an existing extent map.

This may cause a bug in btrfs no-holes mode.

This patch corrects the judgment and fixes the bug.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8039021..a99ee9d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6527,7 +6527,7 @@ insert:
 * extent causing the -EEXIST.
 */
if (start >= extent_map_end(existing) ||
-   start + len <= existing->start) {
+   start <= existing->start) {
/*
 * The existing extent map is the one nearest to
 * the [start, start + len) range which overlaps
-- 
2.1.0
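
The one-line change is easier to follow with concrete numbers. The following is a minimal userspace sketch, not kernel code: `struct em_range`, `old_cond()` and `new_cond()` are hypothetical names introduced here, and only the two comparisons are taken from the hunk above. It evaluates the two predicates and nothing else; it does not reproduce what merge_extent_mapping() does in either branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two extent map fields the hunk uses. */
struct em_range {
	uint64_t start;
	uint64_t len;
};

/* Same idea as extent_map_end(): first byte past the existing map. */
static inline uint64_t em_range_end(const struct em_range *em)
{
	return em->start + em->len;
}

/* Condition before the patch. */
static inline bool old_cond(uint64_t start, uint64_t len,
			    const struct em_range *existing)
{
	assert(len > 0);
	return start >= em_range_end(existing) ||
	       start + len <= existing->start;
}

/* Condition after the patch: len no longer takes part in the test. */
static inline bool new_cond(uint64_t start, uint64_t len,
			    const struct em_range *existing)
{
	assert(len > 0);
	return start >= em_range_end(existing) ||
	       start <= existing->start;
}
```

With an existing map covering [4096, 8192), the two predicates agree for a range starting at 6144 (both false) or at 8192 (both true). They differ for a range such as [0, 8192), which starts before the existing map and runs into it: the old condition is false and the new one is true.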



[RFC PATCH V7 10/16] Btrfs: subpagesize-blocksize: fallocate: Work with sectorsized units.

2014-09-21 Thread Chandan Rajendra
While at it, this commit changes btrfs_truncate_page() to truncate sectorsized
blocks instead of pages. Hence the function has been renamed to
btrfs_truncate_block().

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/ctree.h |  2 +-
 fs/btrfs/file.c  | 41 ++---
 fs/btrfs/inode.c | 48 +---
 3 files changed, 48 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5b7b7ca..59779dc 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3815,7 +3815,7 @@ int btrfs_unlink_subvol(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
struct inode *dir, u64 objectid,
const char *name, int name_len);
-int btrfs_truncate_page(struct inode *inode, loff_t from, loff_t len,
+int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
int front);
 int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
   struct btrfs_root *root,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 444819d..b1e0d27 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2200,21 +2200,24 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
u64 tail_len;
u64 orig_start = offset;
u64 cur_offset;
+   unsigned char blocksize_bits;
u64 min_size = btrfs_calc_trunc_metadata_size(root, 1);
u64 drop_end;
int ret = 0;
int err = 0;
int rsv_count;
-   bool same_page;
+   bool same_block;
bool no_holes = btrfs_fs_incompat(root->fs_info, NO_HOLES);
u64 ino_size;
 
+   blocksize_bits = inode->i_sb->s_blocksize_bits;
+
ret = btrfs_wait_ordered_range(inode, offset, len);
if (ret)
return ret;
 
mutex_lock(&inode->i_mutex);
-   ino_size = round_up(inode->i_size, PAGE_CACHE_SIZE);
+   ino_size = round_up(inode->i_size, root->sectorsize);
ret = find_first_non_hole(inode, &offset, &len);
if (ret < 0)
goto out_only_mutex;
@@ -2224,29 +2227,28 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
goto out_only_mutex;
}
 
-   lockstart = round_up(offset , BTRFS_I(inode)->root->sectorsize);
+   lockstart = round_up(offset, BTRFS_I(inode)->root->sectorsize);
lockend = round_down(offset + len,
 BTRFS_I(inode)->root->sectorsize) - 1;
-   same_page = ((offset >> PAGE_CACHE_SHIFT) ==
-   ((offset + len - 1) >> PAGE_CACHE_SHIFT));
-
+   same_block = ((offset >> blocksize_bits)
+   == ((offset + len - 1) >> blocksize_bits));
/*
-* We needn't truncate any page which is beyond the end of the file
+* We needn't truncate any block which is beyond the end of the file
 * because we are sure there is no data there.
 */
/*
-* Only do this if we are in the same page and we aren't doing the
-* entire page.
+* Only do this if we are in the same block and we aren't doing the
+* entire block.
 */
-   if (same_page && len < PAGE_CACHE_SIZE) {
+   if (same_block && len < root->sectorsize) {
if (offset < ino_size)
-   ret = btrfs_truncate_page(inode, offset, len, 0);
+   ret = btrfs_truncate_block(inode, offset, len, 0);
goto out_only_mutex;
}
 
-   /* zero back part of the first page */
+   /* zero back part of the first block */
if (offset < ino_size) {
-   ret = btrfs_truncate_page(inode, offset, 0, 0);
+   ret = btrfs_truncate_block(inode, offset, 0, 0);
if (ret) {
mutex_unlock(&inode->i_mutex);
return ret;
@@ -2281,11 +2283,12 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
if (!ret) {
/* zero the front end of the last page */
if (tail_start + tail_len < ino_size) {
-   ret = btrfs_truncate_page(inode,
-   tail_start + tail_len, 0, 1);
+   ret = btrfs_truncate_block(inode,
+   tail_start + tail_len,
+   0, 1);
if (ret)
goto out_only_mutex;
-   }
+   }
}
}
 
@@ -2506,10 +2509,10 @@ static long btrfs_fallocate(struct file *file, int mode,
} else {
/*
 * If we are fallocating from the end of the file onward we
-* need to zero out the en
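
The switch from PAGE_CACHE_SHIFT to blocksize_bits in the same_block test matters because a range that sits inside one page can still span two blocks. The following is a minimal userspace sketch of that test, assuming power-of-two block sizes; the function names are hypothetical and only the shift comparison and the `len < sectorsize` check come from the hunk above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True when [offset, offset + len) lies within a single block of
 * (1 << blocksize_bits) bytes. len must be non-zero.
 */
static inline bool range_in_same_block(uint64_t offset, uint64_t len,
				       unsigned int blocksize_bits)
{
	assert(len > 0);
	return (offset >> blocksize_bits) ==
	       ((offset + len - 1) >> blocksize_bits);
}

/* Same block, and the range covers less than one whole block. */
static inline bool is_partial_single_block(uint64_t offset, uint64_t len,
					    unsigned int blocksize_bits)
{
	uint64_t blocksize = UINT64_C(1) << blocksize_bits;

	return range_in_same_block(offset, len, blocksize_bits) &&
	       len < blocksize;
}
```

For offset 1024 and len 2048, the range stays inside one 4096-byte unit (shift 12) but crosses a boundary between 2048-byte blocks (shift 11), so the page-based and block-based tests give different answers.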

[RFC PATCH V7 14/16] Btrfs: subpagesize-blocksize: Explicitly Track I/O status of blocks of an ordered extent.

2014-09-21 Thread Chandan Rajendra
In the subpagesize-blocksize scenario, a page can hold more than one block. So
in addition to the PagePrivate2 flag, we have to track the I/O status of
each block of a page to reliably mark the ordered extent as complete.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/inode.c| 327 
 fs/btrfs/ordered-data.c |  17 +++
 fs/btrfs/ordered-data.h |   4 +
 3 files changed, 267 insertions(+), 81 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4ed78dd..d79a543 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2827,51 +2827,115 @@ static void finish_ordered_fn(struct btrfs_work *work)
btrfs_finish_ordered_io(ordered_extent);
 }
 
-static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
+static void mark_blks_io_complete(struct btrfs_ordered_extent *ordered,
+   u64 blk, u64 nr_blks, int uptodate)
+{
+   struct inode *inode = ordered->inode;
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   struct btrfs_workqueue *workers;
+   int done;
+
+   while (nr_blks--) {
+   if (test_and_set_bit(blk, ordered->blocks_done)) {
+   blk++;
+   continue;
+   }
+
+   done = btrfs_dec_test_ordered_pending(inode, &ordered,
+   ordered->file_offset
+   + (blk << inode->i_sb->s_blocksize_bits),
+   root->sectorsize,
+   uptodate);
+   if (done) {
+   btrfs_init_work(&ordered->work, finish_ordered_fn,
+   NULL, NULL);
+
+   ordered->work.func = finish_ordered_fn;
+   ordered->work.flags = 0;
+
+   if (btrfs_is_free_space_inode(inode))
+   workers = root->fs_info->endio_freespace_worker;
+   else
+   workers = root->fs_info->endio_write_workers;
+
+   btrfs_queue_work(workers, &ordered->work);
+   }
+
+   blk++;
+   }
+}
+
+int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
struct extent_state *state, int uptodate)
 {
struct inode *inode = page->mapping->host;
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_ordered_extent *ordered_extent = NULL;
-   struct btrfs_workqueue *workers;
-   u64 ordered_start, ordered_end;
-   int done;
+   u64 blk, nr_blks;
+   int clear;
 
trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
 
-   ClearPagePrivate2(page);
-loop:
-   ordered_extent = btrfs_lookup_ordered_range(inode, start,
-   end - start + 1);
-   if (!ordered_extent)
-   goto out;
+   while (start < end) {
+   ordered_extent = btrfs_lookup_ordered_extent(inode, start);
+   if (!ordered_extent) {
+   start += root->sectorsize;
+   continue;
+   }
 
-   ordered_start = max_t(u64, start, ordered_extent->file_offset);
-   ordered_end = min_t(u64, end,
-   ordered_extent->file_offset + ordered_extent->len - 1);
+   blk = (start - ordered_extent->file_offset)
+   >> inode->i_sb->s_blocksize_bits;
 
-   done = btrfs_dec_test_ordered_pending(inode, &ordered_extent,
-   ordered_start,
-   ordered_end - ordered_start + 1,
-   uptodate);
-   if (done) {
-   btrfs_init_work(&ordered_extent->work, finish_ordered_fn, NULL, NULL);
+   nr_blks = (min(end, ordered_extent->file_offset + ordered_extent->len - 1)
+   + 1 - start) >> inode->i_sb->s_blocksize_bits;
 
-   if (btrfs_is_free_space_inode(inode))
-   workers = root->fs_info->endio_freespace_worker;
-   else
-   workers = root->fs_info->endio_write_workers;
+   BUG_ON(!nr_blks);
 
-   btrfs_queue_work(workers, &ordered_extent->work);
+   mark_blks_io_complete(ordered_extent, blk, nr_blks, uptodate);
+
+   start = ordered_extent->file_offset + ordered_extent->len;
+
+   btrfs_put_ordered_extent(ordered_extent);
}
 
-   btrfs_put_ordered_extent(ordered_extent);
+   start = page_offset(page);
+   end = start + PAGE_CACHE_SIZE - 1;
+   clear = 1;
 
-   start = ordered_end + 1;
+   while (start < end) {
+   ordered_extent = btrfs_lookup_ordered_extent(inode, start);
+   if (!ordered_
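
The idea behind mark_blks_io_complete() is a per-block "done" bitmap: a block only counts towards completion the first time its bit is set. The following is a minimal single-threaded userspace sketch of that idea. `struct ordered_sketch` and its `pending` counter are hypothetical simplifications; in the patch the accounting goes through btrfs_dec_test_ordered_pending() and completion queues finish_ordered_fn, neither of which is modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_BLOCKS 64

/* Hypothetical, simplified stand-in for an ordered extent. */
struct ordered_sketch {
	uint64_t blocks_done;	/* bit n set: block n has finished I/O */
	unsigned int pending;	/* blocks still waiting for I/O */
};

/* Non-atomic analogue of test_and_set_bit(): returns the old bit value. */
static inline bool test_and_set(uint64_t *map, unsigned int bit)
{
	uint64_t mask = UINT64_C(1) << bit;
	bool old = (*map & mask) != 0;

	*map |= mask;
	return old;
}

/*
 * Mark nr_blks blocks starting at blk as complete. Blocks whose bit is
 * already set are skipped. Returns true if this call completed the
 * last pending block.
 */
static inline bool mark_blocks_complete(struct ordered_sketch *o,
					unsigned int blk, unsigned int nr_blks)
{
	bool finished = false;

	assert(blk + nr_blks <= MAX_BLOCKS);
	while (nr_blks--) {
		if (test_and_set(&o->blocks_done, blk)) {
			blk++;
			continue;
		}
		assert(o->pending > 0);
		if (--o->pending == 0)
			finished = true;
		blk++;
	}
	return finished;
}
```

Marking the same block twice leaves the counter untouched the second time, which is the property the bitmap is there to provide.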

[RFC PATCH V7 11/16] Btrfs: subpagesize-blocksize: btrfs_page_mkwrite: Reserve space in sectorsized units.

2014-09-21 Thread Chandan Rajendra
In the subpagesize-blocksize scenario, if i_size falls in a block which is not
the last block in the page, the space to be reserved should cover only the
sectorsize-aligned range from the start of the page up to i_size, not the
whole page.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/inode.c | 33 ++---
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7ad7d0f..23ce9ff 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7812,26 +7812,23 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
loff_t size;
int ret;
int reserved = 0;
+   u64 delalloc_size;
u64 page_start;
u64 page_end;
 
sb_start_pagefault(inode->i_sb);
-   ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
-   if (!ret) {
-   ret = file_update_time(vma->vm_file);
-   reserved = 1;
-   }
+
+   ret = file_update_time(vma->vm_file);
if (ret) {
if (ret == -ENOMEM)
ret = VM_FAULT_OOM;
else /* -ENOSPC, -EIO, etc */
ret = VM_FAULT_SIGBUS;
-   if (reserved)
-   goto out;
-   goto out_noreserve;
+   goto out;
}
 
ret = VM_FAULT_NOPAGE; /* make the VM retry the fault */
+
 again:
lock_page(page);
size = i_size_read(inode);
@@ -7862,6 +7859,19 @@ again:
goto again;
}
 
+   if (page->index == ((size - 1) >> PAGE_CACHE_SHIFT))
+   delalloc_size = round_up(size - page_start, root->sectorsize);
+   else
+   delalloc_size = PAGE_CACHE_SIZE;
+
+   ret = btrfs_delalloc_reserve_space(inode, delalloc_size);
+   if (ret) {
+   /* -ENOSPC */
+   ret = VM_FAULT_SIGBUS;
+   goto out_unlock;
+   }
+   reserved = 1;
+
/*
 * XXX - page_mkwrite gets called every time the page is dirtied, even
 * if it was already dirty, so for space accounting reasons we need to
@@ -7874,7 +7884,8 @@ again:
  EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
  0, 0, &cached_state, GFP_NOFS);
 
-   ret = btrfs_set_extent_delalloc(inode, page_start, page_end,
+   ret = btrfs_set_extent_delalloc(inode, page_start,
+   page_start + delalloc_size - 1,
&cached_state);
if (ret) {
unlock_extent_cached(io_tree, page_start, page_end,
@@ -7913,8 +7924,8 @@ out_unlock:
}
unlock_page(page);
 out:
-   btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
-out_noreserve:
+   if (reserved)
+   btrfs_delalloc_release_space(inode, delalloc_size);
sb_end_pagefault(inode->i_sb);
return ret;
 }
-- 
2.1.0
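
The reservation size chosen in the second hunk can be worked through with numbers. The following is a minimal userspace sketch, assuming a 4096-byte page and a power-of-two sectorsize; `delalloc_size_for_page()` and `round_up_pow2()` are hypothetical names, and only the comparison and the round_up() expression are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Round x up to a multiple of align; align must be a power of two. */
static inline uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Bytes to reserve for the page at page_index: the whole page, unless
 * it is the page holding the last byte of the file, in which case only
 * the blocks up to and including the one containing i_size.
 */
static inline uint64_t delalloc_size_for_page(uint64_t page_index,
					      uint64_t i_size,
					      uint64_t sectorsize)
{
	uint64_t page_start = page_index << SKETCH_PAGE_SHIFT;

	assert(i_size > page_start);	/* the page lies within the file */
	if (page_index == ((i_size - 1) >> SKETCH_PAGE_SHIFT))
		return round_up_pow2(i_size - page_start, sectorsize);
	return SKETCH_PAGE_SIZE;
}
```

With a 2048-byte sectorsize and i_size of 5000, page 0 reserves 4096 bytes and page 1 reserves 2048 bytes, since only the first block of page 1 contains file data.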



[RFC PATCH V7 07/16] Btrfs: subpagesize-blocksize: Allow mounting filesystems where sectorsize != PAGE_SIZE

2014-09-21 Thread Chandan Rajendra
From: Chandra Seetharaman 

This patch allows mounting filesystems with blocksize smaller than the
PAGE_SIZE.

Signed-off-by: Chandra Seetharaman 
Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/disk-io.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6c6e8bb..2f3caaf 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2634,12 +2634,6 @@ int open_ctree(struct super_block *sb,
goto fail_sb_buffer;
}
 
-   if (sectorsize != PAGE_SIZE) {
-   printk(KERN_WARNING "BTRFS: Incompatible sector size(%lu) "
-  "found on %s\n", (unsigned long)sectorsize, sb->s_id);
-   goto fail_sb_buffer;
-   }
-
mutex_lock(&fs_info->chunk_mutex);
ret = btrfs_read_sys_array(tree_root);
mutex_unlock(&fs_info->chunk_mutex);
-- 
2.1.0



[RFC PATCH V7 16/16] Btrfs: subpagesize-blocksize: Track blocks of ordered extent submitted for write I/O.

2014-09-21 Thread Chandan Rajendra
In the subpagesize-blocksize scenario, the following command (with 4k as the
PAGE_SIZE and 2k as the block size) can cause false accounting of blocks of an
ordered extent that is written to disk:

$ xfs_io -f -c "pwrite 0 10240" \
-c "sync_range 0 4096" \
-c "sync_range 8192 2048" \
-c "pwrite 10240 2048" \
-c "sync_range 10240 2048" \
/mnt/btrfs/file.bin

To fix this, we would have to explicitly track the blocks of an ordered extent
that have already been submitted for write I/O.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c| 24 ++--
 fs/btrfs/ordered-data.c |  4 +++-
 fs/btrfs/ordered-data.h |  4 
 3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ccd9e1c..2cf9e59 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3201,6 +3201,8 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
u64 extent_offset;
u64 extent_end;
u64 iosize;
+   u64 blk, nr_blks;
+   u64 blk_submitted;
sector_t sector;
struct extent_state *cached_state = NULL;
struct block_device *bdev;
@@ -3267,11 +3269,26 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
iosize = min(extent_end - cur, end - cur + 1);
iosize = ALIGN(iosize, blocksize);
 
+   blk = extent_offset >> inode->i_sb->s_blocksize_bits;
+   nr_blks = iosize >> inode->i_sb->s_blocksize_bits;
+
+   blk_submitted = find_next_bit(ordered->blocks_submitted,
+   ordered->len >> inode->i_sb->s_blocksize_bits,
+   blk);
+   if (blk_submitted < blk + nr_blks) {
+   if (blk_submitted == blk) {
+   cur += blocksize;
+   btrfs_put_ordered_extent(ordered);
+   continue;
+   }
+   iosize = (blk_submitted - blk)
+   << inode->i_sb->s_blocksize_bits;
+   nr_blks = iosize >> inode->i_sb->s_blocksize_bits;
+   }
+
sector = (ordered->start + extent_offset) >> 9;
bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
compressed = test_bit(BTRFS_ORDERED_COMPRESSED, &ordered->flags);
-   btrfs_put_ordered_extent(ordered);
-   ordered = NULL;
 
/*
 * compressed and inline extents are written through other
@@ -3284,6 +3301,7 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 */
nr++;
cur += iosize;
+   btrfs_put_ordered_extent(ordered);
continue;
}
 
@@ -3298,6 +3316,8 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
} else {
unsigned long max_nr = (i_size >> PAGE_CACHE_SHIFT) + 1;
 
+   bitmap_set(ordered->blocks_submitted, blk, nr_blks);
+   btrfs_put_ordered_extent(ordered);
set_range_writeback(tree, cur, cur + iosize - 1);
if (!PageWriteback(page)) {
btrfs_err(BTRFS_I(inode)->root->fs_info,
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 4d9832f..59b2544 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -199,13 +199,15 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
nr_longs = BITS_TO_LONGS(len >> inode->i_sb->s_blocksize_bits);
if (nr_longs == 1) {
entry->blocks_done = &entry->blocks_bitmap;
+   entry->blocks_submitted = &entry->blocks_submitted_bitmap;
} else {
-   entry->blocks_done = kzalloc(nr_longs * sizeof(unsigned long),
+   entry->blocks_done = kzalloc(2 * nr_longs * sizeof(unsigned long),
GFP_NOFS);
if (!entry->blocks_done) {
kmem_cache_free(btrfs_ordered_extent_cache, entry);
return -ENOMEM;
}
+   entry->blocks_submitted = entry->blocks_done + nr_longs;
}
 
entry->file_offset = file_offset;
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 7de3b1e..851914c 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -139,6 +139,10 @@ struct btrfs_ordered_extent {
/* bitmap to track the blocks that have been written to disk */
unsigned long *blocks_done;
unsigned long blocks_bitmap;
+
+   /* bitmap to track the blocks that have been submitted for write i/o */
+   unsigned long *blocks_submitted;
+   unsigned
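
The trimming logic in the __extent_writepage_io() hunk can be read as: find the first already-submitted block at or after blk, and submit only the blocks in front of it. The following is a minimal userspace sketch of that step, assuming the bitmap fits in one 64-bit word. `next_set_bit()` is a simplified stand-in for the kernel's find_next_bit(), and `blocks_to_submit()` is a hypothetical helper that returns a block count where the patch adjusts iosize.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Single-word analogue of find_next_bit(): index of the first set bit
 * at or after 'from', or 'size' if there is none.
 */
static inline unsigned int next_set_bit(uint64_t map, unsigned int size,
					unsigned int from)
{
	unsigned int i;

	for (i = from; i < size; i++)
		if (map & (UINT64_C(1) << i))
			return i;
	return size;
}

/*
 * Number of blocks, starting at blk, that still need to be submitted:
 * 0 if blk itself was already submitted, otherwise the run of blocks
 * up to the next submitted one, capped at nr_blks.
 */
static inline unsigned int blocks_to_submit(uint64_t submitted,
					    unsigned int total_blks,
					    unsigned int blk,
					    unsigned int nr_blks)
{
	unsigned int next;

	assert(total_blks <= 64 && blk + nr_blks <= total_blks);
	next = next_set_bit(submitted, total_blks, blk);
	if (next < blk + nr_blks)
		return next - blk;
	return nr_blks;
}
```

For a five-block extent in which blocks 0, 1 and 4 are already marked as submitted, a request for blocks 2..4 is trimmed to two blocks, and a request starting at block 0 returns 0, matching the case where the patch advances cur by one block and continues.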

[RFC PATCH V7 05/16] Btrfs: subpagesize-blocksize: Read tree blocks whose size is

2014-09-21 Thread Chandan Rajendra
In the case of subpagesize-blocksize, this patch makes it possible to read
only a single metadata block from the disk instead of all the metadata blocks
that map into a page.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/disk-io.c   |  45 +++--
 fs/btrfs/disk-io.h   |   3 ++
 fs/btrfs/extent_io.c | 138 ++-
 3 files changed, 148 insertions(+), 38 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3a79833..20168e6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -431,7 +431,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
int mirror_num = 0;
int failed_mirror = 0;
 
-   clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
+   clear_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags);
io_tree = &BTRFS_I(root->fs_info->btree_inode)->io_tree;
while (1) {
ret = read_extent_buffer_pages(io_tree, eb, start,
@@ -450,7 +450,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 * there is no reason to read the other copies, they won't be
 * any less wrong.
 */
-   if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags))
+   if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags))
break;
 
num_copies = btrfs_num_copies(root->fs_info,
@@ -582,12 +582,13 @@ static noinline int check_leaf(struct btrfs_root *root,
return 0;
 }
 
-static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
- u64 phy_offset, struct page *page,
- u64 start, u64 end, int mirror)
+int verify_extent_buffer_read(struct btrfs_io_bio *io_bio,
+   struct page *page,
+   u64 start, u64 end, int mirror)
 {
u64 found_start;
int found_level;
+   struct extent_buffer_head *ebh;
struct extent_buffer *eb;
struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
int ret = 0;
@@ -597,18 +598,26 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
goto out;
 
eb = (struct extent_buffer *)page->private;
+   do {
+   if ((eb->start <= start) && (eb->start + eb->len - 1 > start))
+   break;
+   } while ((eb = eb->eb_next) != NULL);
+
+   BUG_ON(!eb);
+
+   ebh = eb_head(eb);
 
/* the pending IO might have been the only thing that kept this buffer
 * in memory.  Make sure we have a ref for all this other checks
 */
extent_buffer_get(eb);
 
-   reads_done = atomic_dec_and_test(&eb->io_pages);
+   reads_done = atomic_dec_and_test(&ebh->io_bvecs);
if (!reads_done)
goto err;
 
eb->read_mirror = mirror;
-   if (test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
+   if (test_bit(EXTENT_BUFFER_IOERR, &eb->ebflags)) {
ret = -EIO;
goto err;
}
@@ -650,7 +659,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 * return -EIO.
 */
if (found_level == 0 && check_leaf(root, eb)) {
-   set_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
+   set_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags);
ret = -EIO;
}
 
@@ -658,7 +667,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
set_extent_buffer_uptodate(eb);
 err:
if (reads_done &&
-   test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
+   test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->ebflags))
btree_readahead_hook(root, eb, eb->start, ret);
 
if (ret) {
@@ -667,7 +676,7 @@ err:
 * again, we have to make sure it has something
 * to decrement
 */
-   atomic_inc(&eb->io_pages);
+   atomic_inc(&eb_head(eb)->io_bvecs);
clear_extent_buffer_uptodate(eb);
}
free_extent_buffer(eb);
@@ -675,20 +684,6 @@ out:
return ret;
 }
 
-static int btree_io_failed_hook(struct page *page, int failed_mirror)
-{
-   struct extent_buffer *eb;
-   struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
-
-   eb = (struct extent_buffer *)page->private;
-   set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
-   eb->read_mirror = failed_mirror;
-   atomic_dec(&eb->io_pages);
-   if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
-   btree_readahead_hook(root, eb, eb->start, -EIO);
-   return -EIO;/* we fixed nothing */
-}
-
 static void end_workqueue_bio(struct bio *bio, int err)
 {
struct end_io_wq *end_io_wq = bio->bi_private;
@@ -4156,8 +4151,6 @@ static int btrfs_cleanup_transaction(struct btrfs_root *root)
 }
 
 static struct extent_io_ops btree_extent_io_ops = {
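
In verify_extent_buffer_read(), the extent buffer for a given offset is found by walking the chain of buffers attached to the page through eb_next. The following is a minimal userspace sketch of that lookup. `struct eb_sketch` is a hypothetical reduction of struct extent_buffer to the three fields the loop touches, and the containment test is written here as a half-open interval, which is this sketch's choice and differs from the comparison in the hunk for the last byte of a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduction of an extent buffer to what the walk needs. */
struct eb_sketch {
	uint64_t start;			/* first byte covered */
	uint64_t len;			/* bytes covered */
	struct eb_sketch *eb_next;	/* next buffer in the same page */
};

/*
 * Return the buffer in the chain whose range [start, start + len)
 * contains 'offset', or NULL if none does.
 */
static inline struct eb_sketch *find_eb(struct eb_sketch *eb, uint64_t offset)
{
	for (; eb != NULL; eb = eb->eb_next)
		if (eb->start <= offset && offset < eb->start + eb->len)
			return eb;
	return NULL;
}
```

The patch treats a missing buffer as a fatal condition with BUG_ON(!eb); the sketch returns NULL and leaves that decision to the caller.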

[RFC PATCH V7 13/16] Btrfs: subpagesize-blocksize: Deal with partial ordered extent allocations.

2014-09-21 Thread Chandan Rajendra
In the subpagesize-blocksize scenario, extent allocation can succeed for only
some of the dirty blocks of a page and fail for the rest of the blocks. This
patch allows I/O against such partially allocated ordered extents to be
submitted.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c | 24 +---
 fs/btrfs/extent_io.h |  1 +
 fs/btrfs/inode.c | 39 +--
 3 files changed, 39 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8ea21c1..ccd9e1c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1774,15 +1774,22 @@ int extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end,
if (page_ops & PAGE_SET_PRIVATE2)
SetPagePrivate2(pages[i]);
 
+   if (page_ops & PAGE_SET_ERROR)
+   SetPageError(pages[i]);
+
if (pages[i] == locked_page) {
page_cache_release(pages[i]);
continue;
}
-   if (page_ops & PAGE_CLEAR_DIRTY)
+
+   if ((page_ops & PAGE_CLEAR_DIRTY)
+   && !PagePrivate2(pages[i]))
clear_page_dirty_for_io(pages[i]);
-   if (page_ops & PAGE_SET_WRITEBACK)
+   if ((page_ops & PAGE_SET_WRITEBACK)
+   && !PagePrivate2(pages[i]))
set_page_writeback(pages[i]);
-   if (page_ops & PAGE_END_WRITEBACK)
+   if ((page_ops & PAGE_END_WRITEBACK)
+   && !PagePrivate2(pages[i]))
end_page_writeback(pages[i]);
if (page_ops & PAGE_UNLOCK)
unlock_page(pages[i]);
@@ -2403,7 +2410,7 @@ int end_extent_writepage(struct page *page, int err, u64 start, u64 end)
uptodate = 0;
}
 
-   if (!uptodate) {
+   if (!uptodate || PageError(page)) {
ClearPageUptodate(page);
SetPageError(page);
ret = ret < 0 ? ret : -EIO;
@@ -3123,7 +3130,6 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode,
   nr_written);
/* File system has been set read-only */
if (ret) {
-   SetPageError(page);
/* fill_delalloc should be return < 0 for error
 * but just in case, we use > 0 here meaning the
 * IO is started, so we don't want to return > 0
@@ -3332,7 +3338,6 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
struct inode *inode = page->mapping->host;
struct extent_page_data *epd = data;
u64 start = page_offset(page);
-   u64 page_end = start + PAGE_CACHE_SIZE - 1;
int ret;
int nr = 0;
size_t pg_offset;
@@ -3375,7 +3380,7 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
ret = writepage_delalloc(inode, page, wbc, epd, start, &nr_written);
if (ret == 1)
goto done_unlocked;
-   if (ret)
+   if (ret && !PagePrivate2(page))
goto done;
 
ret = __extent_writepage_io(inode, page, wbc, epd,
@@ -3389,10 +3394,7 @@ done:
set_page_writeback(page);
end_page_writeback(page);
}
-   if (PageError(page)) {
-   ret = ret < 0 ? ret : -EIO;
-   end_extent_writepage(page, ret, start, page_end);
-   }
+
unlock_page(page);
return ret;
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 840e9a0..04ffd5b 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -51,6 +51,7 @@
 #define PAGE_SET_WRITEBACK (1 << 2)
 #define PAGE_END_WRITEBACK (1 << 3)
 #define PAGE_SET_PRIVATE2  (1 << 4)
+#define PAGE_SET_ERROR (1 << 5)
 
 /*
  * page->private values.  Every page that is controlled by the extent
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 91c5580..4ed78dd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -880,6 +880,8 @@ static noinline int cow_file_range(struct inode *inode,
struct btrfs_key ins;
struct extent_map *em;
struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+   struct btrfs_ordered_extent *ordered;
+   unsigned long page_ops, extent_ops;
int ret = 0;
 
if (btrfs_is_free_space_inode(inode)) {
@@ -924,8 +926,6 @@ static noinline int cow_file_range(struct inode *inode,
btrfs_drop_extent_cache(inode, start, start + num_bytes - 1, 0);
 
while (disk_num_bytes > 0) {
-   unsigned long op;
-
  

[RFC PATCH V7 06/16] Btrfs: subpagesize-blocksize: Write only dirty extent buffers belonging to a page

2014-09-21 Thread Chandan Rajendra
For the subpagesize-blocksize scenario, this patch adds the ability to write a
single extent buffer to the disk.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/disk-io.c   |  20 ++--
 fs/btrfs/extent_io.c | 300 ---
 2 files changed, 250 insertions(+), 70 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 20168e6..6c6e8bb 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -484,17 +484,23 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 
 static int csum_dirty_buffer(struct btrfs_root *root, struct page *page)
 {
-   u64 start = page_offset(page);
-   u64 found_start;
struct extent_buffer *eb;
+   u64 found_start;
 
eb = (struct extent_buffer *)page->private;
-   if (page != eb->pages[0])
+   if (page != eb_head(eb)->pages[0])
return 0;
-   found_start = btrfs_header_bytenr(eb);
-   if (WARN_ON(found_start != start || !PageUptodate(page)))
-   return 0;
-   csum_tree_block(root, eb, 0);
+   do {
+   if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags))
+   continue;
+   if (WARN_ON(!test_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags)))
+   continue;
+   found_start = btrfs_header_bytenr(eb);
+   if (WARN_ON(found_start != eb->start))
+   return 0;
+   csum_tree_block(root, eb, 0);
+   } while ((eb = eb->eb_next) != NULL);
+
return 0;
 }
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bcf6412..f9db1be 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3427,33 +3427,54 @@ void wait_on_extent_buffer_writeback(struct extent_buffer *eb)
TASK_UNINTERRUPTIBLE);
 }
 
-static noinline_for_stack int
-lock_extent_buffer_for_io(struct extent_buffer *eb,
- struct btrfs_fs_info *fs_info,
- struct extent_page_data *epd)
+static void lock_extent_buffer_pages(struct extent_buffer_head *ebh,
+   struct extent_page_data *epd)
 {
+   struct extent_buffer *eb = &ebh->eb;
unsigned long i, num_pages;
-   int flush = 0;
+
+   num_pages = num_extent_pages(eb->start, eb->len);
+   for (i = 0; i < num_pages; i++) {
+   struct page *p = extent_buffer_page(eb, i);
+
+   if (!trylock_page(p)) {
+   flush_write_bio(epd);
+   lock_page(p);
+   }
+   }
+
+   return;
+}
+
+static int noinline_for_stack
+lock_extent_buffer_for_io(struct extent_buffer *eb,
+   struct btrfs_fs_info *fs_info,
+   struct extent_page_data *epd)
+{
+   int dirty;
int ret = 0;
 
if (!btrfs_try_tree_write_lock(eb)) {
-   flush = 1;
flush_write_bio(epd);
btrfs_tree_lock(eb);
}
 
-   if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) {
+   if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags)) {
+   dirty = test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
btrfs_tree_unlock(eb);
-   if (!epd->sync_io)
-   return 0;
-   if (!flush) {
-   flush_write_bio(epd);
-   flush = 1;
+   if (!epd->sync_io) {
+   if (!dirty)
+   return 1;
+   else
+   return 2;
}
+
+   flush_write_bio(epd);
+
while (1) {
wait_on_extent_buffer_writeback(eb);
btrfs_tree_lock(eb);
-   if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags))
+   if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags))
break;
btrfs_tree_unlock(eb);
}
@@ -3464,37 +3485,22 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
 * under IO since we can end up having no IO bits set for a short period
 * of time.
 */
-   spin_lock(&eb->refs_lock);
-   if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
-   set_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
-   spin_unlock(&eb->refs_lock);
+   spin_lock(&eb_head(eb)->refs_lock);
+   if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags)) {
+   set_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags);
+   spin_unlock(&eb_head(eb)->refs_lock);
btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
__percpu_counter_add(&fs_info->dirty_metadata_bytes,
 -eb->len,
 fs_info->dirty_metadata_batch);
-   ret = 1;
+   ret

[RFC PATCH V7 08/16] Btrfs: subpagesize-blocksize: Compute and look up csums based on sectorsized blocks.

2014-09-21 Thread Chandan Rajendra
Checksums are computed per sectorsize unit. The current code uses
bvec->bv_len units to compute and look up checksums. This works on machines
where sectorsize == PAGE_CACHE_SIZE. This patch makes the checksum
computation and lookup code work with sectorsize units.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/file-item.c | 87 
 fs/btrfs/inode.c | 53 +---
 2 files changed, 89 insertions(+), 51 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 54c84da..000418a 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -172,6 +172,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
u64 item_start_offset = 0;
u64 item_last_offset = 0;
u64 disk_bytenr;
+   u64 page_bytes_left;
u32 diff;
int nblocks;
int bio_index = 0;
@@ -220,6 +221,8 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
if (dio)
offset = logical_offset;
+
+   page_bytes_left = bvec->bv_len;
while (bio_index < bio->bi_vcnt) {
if (!dio)
offset = page_offset(bvec->bv_page) + bvec->bv_offset;
@@ -243,7 +246,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
if (BTRFS_I(inode)->root->root_key.objectid ==
BTRFS_DATA_RELOC_TREE_OBJECTID) {
set_extent_bits(io_tree, offset,
-   offset + bvec->bv_len - 1,
+   offset + root->sectorsize - 1,
EXTENT_NODATASUM, GFP_NOFS);
} else {

btrfs_info(BTRFS_I(inode)->root->fs_info,
@@ -281,11 +284,17 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
 found:
csum += count * csum_size;
nblocks -= count;
-   bio_index += count;
+
while (count--) {
-   disk_bytenr += bvec->bv_len;
-   offset += bvec->bv_len;
-   bvec++;
+   disk_bytenr += root->sectorsize;
+   offset += root->sectorsize;
+   page_bytes_left -= root->sectorsize;
+   if (!page_bytes_left) {
+   bio_index++;
+   bvec++;
+   page_bytes_left = bvec->bv_len;
+   }
+
}
}
btrfs_free_path(path);
@@ -442,6 +451,8 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
struct bio_vec *bvec = bio->bi_io_vec;
int bio_index = 0;
int index;
+   int nr_sectors;
+   int i;
unsigned long total_bytes = 0;
unsigned long this_sum_bytes = 0;
u64 offset;
@@ -469,41 +480,51 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
if (!contig)
offset = page_offset(bvec->bv_page) + bvec->bv_offset;
 
-   if (offset >= ordered->file_offset + ordered->len ||
-   offset < ordered->file_offset) {
-   unsigned long bytes_left;
-   sums->len = this_sum_bytes;
-   this_sum_bytes = 0;
-   btrfs_add_ordered_sum(inode, ordered, sums);
-   btrfs_put_ordered_extent(ordered);
+   data = kmap_atomic(bvec->bv_page);
 
-   bytes_left = bio->bi_iter.bi_size - total_bytes;
 
-   sums = kzalloc(btrfs_ordered_sum_size(root, bytes_left),
-  GFP_NOFS);
-   BUG_ON(!sums); /* -ENOMEM */
-   sums->len = bytes_left;
-   ordered = btrfs_lookup_ordered_extent(inode, offset);
-   BUG_ON(!ordered); /* Logic error */
-   sums->bytenr = ((u64)bio->bi_iter.bi_sector << 9) +
-  total_bytes;
-   index = 0;
+   nr_sectors = (bvec->bv_len + root->sectorsize - 1)
+   >> root->fs_info->sb->s_blocksize_bits;
+
+
+   for (i = 0; i < nr_sectors; i++) {
+   if (offset >= ordered->file_offset + ordered->len ||
+   offset < ordered->file_offset) {
+   unsigned long bytes_left;
+   sums->len = this_sum_bytes;
+   this_sum_bytes = 0;
+   btrfs_add_ordered_sum(inode, ordered, sums);
+   btrfs_put_ordered_extent(ordere

[RFC PATCH V7 09/16] Btrfs: subpagesize-blocksize: __extent_writepage: Write only dirty blocks of a page.

2014-09-21 Thread Chandan Rajendra
The code now loops across 'ordered extents' instead of 'extent maps' to figure
out the dirty blocks of the page to be submitted for a write operation.
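
The size computation inside that loop can be sketched as follows. This is a
simplified model under stated assumptions: 'struct toy_ordered' is a
hypothetical stand-in for the ordered extent, and 'end' is the inclusive
last byte to write in the page.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an ordered extent. */
struct toy_ordered {
	uint64_t file_offset;	/* first byte covered */
	uint64_t len;		/* number of bytes covered */
};

/* Round x up to a multiple of a, where a is a power of two. */
uint64_t toy_align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Bytes to submit starting at cur: bounded by the end of the ordered
 * extent and by the last byte of the page range, rounded up to blocksize.
 */
uint64_t toy_write_iosize(const struct toy_ordered *o, uint64_t cur,
			  uint64_t end, uint64_t blocksize)
{
	uint64_t extent_end = o->file_offset + o->len;
	uint64_t to_extent_end, to_range_end;

	assert(cur >= o->file_offset && cur < extent_end && cur <= end);
	to_extent_end = extent_end - cur;
	to_range_end = end - cur + 1;
	return toy_align_up(to_extent_end < to_range_end ?
			    to_extent_end : to_range_end, blocksize);
}
```

Blocks of the page that no ordered extent covers are simply skipped one
blocksize at a time, which is what limits the write to dirty blocks.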

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c | 74 
 1 file changed, 29 insertions(+), 45 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f9db1be..3c33944 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3186,18 +3186,18 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 int write_flags, int *nr_ret)
 {
struct extent_io_tree *tree = epd->tree;
+   struct btrfs_ordered_extent *ordered;
u64 start = page_offset(page);
u64 page_end = start + PAGE_CACHE_SIZE - 1;
u64 end;
u64 cur = start;
u64 extent_offset;
-   u64 block_start;
+   u64 extent_end;
u64 iosize;
sector_t sector;
struct extent_state *cached_state = NULL;
-   struct extent_map *em;
struct block_device *bdev;
-   size_t pg_offset = 0;
+   size_t pg_offset;
size_t blocksize;
int ret = 0;
int nr = 0;
@@ -3237,59 +3237,46 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
blocksize = inode->i_sb->s_blocksize;
 
while (cur <= end) {
-   u64 em_end;
if (cur >= i_size) {
if (tree->ops && tree->ops->writepage_end_io_hook)
tree->ops->writepage_end_io_hook(page, cur,
 page_end, NULL, 1);
break;
}
-   em = epd->get_extent(inode, page, pg_offset, cur,
-end - cur + 1, 1);
-   if (IS_ERR_OR_NULL(em)) {
-   SetPageError(page);
-   ret = PTR_ERR_OR_ZERO(em);
-   break;
-   }
 
-   extent_offset = cur - em->start;
-   em_end = extent_map_end(em);
-   BUG_ON(em_end <= cur);
+   ordered = btrfs_lookup_ordered_extent(inode, cur);
+   if (!ordered) {
+   cur += blocksize;
+   continue;
+   }
+
+   pg_offset = cur & (PAGE_CACHE_SIZE - 1);
+
+   extent_offset = cur - ordered->file_offset;
+   extent_end = ordered->file_offset + ordered->len;
+   extent_end = (extent_end < ordered->file_offset) ? -1 : extent_end;
+   BUG_ON(extent_end <= cur);
BUG_ON(end < cur);
-   iosize = min(em_end - cur, end - cur + 1);
+   iosize = min(extent_end - cur, end - cur + 1);
iosize = ALIGN(iosize, blocksize);
-   sector = (em->block_start + extent_offset) >> 9;
-   bdev = em->bdev;
-   block_start = em->block_start;
-   compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
-   free_extent_map(em);
-   em = NULL;
+
+   sector = (ordered->start + extent_offset) >> 9;
+   bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
+   compressed = test_bit(BTRFS_ORDERED_COMPRESSED, &ordered->flags);
+   btrfs_put_ordered_extent(ordered);
+   ordered = NULL;
 
/*
 * compressed and inline extents are written through other
 * paths in the FS
 */
-   if (compressed || block_start == EXTENT_MAP_HOLE ||
-   block_start == EXTENT_MAP_INLINE) {
-   /*
-* end_io notification does not happen here for
-* compressed extents
-*/
-   if (!compressed && tree->ops &&
-   tree->ops->writepage_end_io_hook)
-   tree->ops->writepage_end_io_hook(page, cur,
-cur + iosize - 1,
-NULL, 1);
-   else if (compressed) {
-   /* we don't want to end_page_writeback on
-* a compressed extent.  this happens
-* elsewhere
-*/
-   nr++;
-   }
-
+   if (compressed) {
+   /* we don't want to end_page_writeback on
+* a compressed extent.  this happens
+* elsewhere
+*/
+   nr++;
cur += iosize;
-   pg_offset += iosize;
continue;
}

[RFC PATCH V7 12/16] Btrfs: subpagesize-blocksize: Search for all ordered extents that could span across a page.

2014-09-21 Thread Chandan Rajendra
In the subpagesize-blocksize scenario, it is not sufficient to search using
the first byte of the page to make sure that there are no ordered extents
present across the page. Search the whole page range instead.
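
A small model of why a point lookup is not enough (a sketch only; the
'toy_' types and functions are hypothetical and merely mirror the shape of a
point lookup versus a range lookup):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an ordered extent. */
struct toy_ordered {
	uint64_t file_offset;
	uint64_t len;
};

/* Point lookup: first entry containing byte pos, or NULL. */
const struct toy_ordered *toy_lookup_point(const struct toy_ordered *r,
					   size_t n, uint64_t pos)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (pos >= r[i].file_offset &&
		    pos - r[i].file_offset < r[i].len)
			return &r[i];
	return NULL;
}

/* Range lookup: first entry overlapping [start, start + len), or NULL. */
const struct toy_ordered *toy_lookup_range(const struct toy_ordered *r,
					   size_t n, uint64_t start,
					   uint64_t len)
{
	size_t i;

	assert(len > 0);
	for (i = 0; i < n; i++)
		if (r[i].file_offset < start + len &&
		    start < r[i].file_offset + r[i].len)
			return &r[i];
	return NULL;
}
```

With a 2k ordered extent covering only the second half of a 4k page, the
point lookup at the first byte of the page finds nothing, while the range
lookup over the whole page finds the extent.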

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c | 3 ++-
 fs/btrfs/inode.c | 6 +++---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3c33944..8ea21c1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3027,7 +3027,8 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
 
while (1) {
lock_extent(tree, start, end);
-   ordered = btrfs_lookup_ordered_extent(inode, start);
+   ordered = btrfs_lookup_ordered_range(inode, start,
+   PAGE_CACHE_SIZE);
if (!ordered)
break;
unlock_extent(tree, start, end);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 23ce9ff..91c5580 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1821,7 +1821,7 @@ again:
if (PagePrivate2(page))
goto out;
 
-   ordered = btrfs_lookup_ordered_extent(inode, page_start);
+   ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_CACHE_SIZE);
if (ordered) {
unlock_extent_cached(&BTRFS_I(inode)->io_tree, page_start,
 page_end, &cached_state, GFP_NOFS);
@@ -7724,7 +7724,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 
if (!inode_evicting)
lock_extent_bits(tree, page_start, page_end, 0, &cached_state);
-   ordered = btrfs_lookup_ordered_extent(inode, page_start);
+   ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_CACHE_SIZE);
if (ordered) {
/*
 * IO on this page will never be started, so we need
@@ -7849,7 +7849,7 @@ again:
 * we can't set the delalloc bits if there are pending ordered
 * extents.  Drop our locks and wait for them to finish
 */
-   ordered = btrfs_lookup_ordered_extent(inode, page_start);
+   ordered = btrfs_lookup_ordered_range(inode, page_start, page_end);
if (ordered) {
unlock_extent_cached(io_tree, page_start, page_end,
 &cached_state, GFP_NOFS);
-- 
2.1.0



[RFC PATCH V7 15/16] Btrfs: subpagesize-blocksize: Revert commit fc4adbff823f76577ece26dcb88bf6f8392dbd43.

2014-09-21 Thread Chandan Rajendra
In subpagesize-blocksize, we have multiple blocks in a page. Checking for
existence of a page in the page cache isn't a sufficient check, since we
could be truncating a subset of the blocks mapped by the page.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/btrfs_inode.h |  2 --
 fs/btrfs/file.c|  4 ++-
 fs/btrfs/inode.c   | 77 +++---
 3 files changed, 7 insertions(+), 76 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 43527fd..50497bf 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -278,6 +278,4 @@ static inline void btrfs_inode_resume_unlocked_dio(struct inode *inode)
  &BTRFS_I(inode)->runtime_flags);
 }
 
-bool btrfs_page_exists_in_range(struct inode *inode, loff_t start, loff_t end);
-
 #endif
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b1e0d27..3707515 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2314,7 +2314,9 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
if ((!ordered ||
(ordered->file_offset + ordered->len <= lockstart ||
 ordered->file_offset > lockend)) &&
-!btrfs_page_exists_in_range(inode, lockstart, lockend)) {
+!test_range_bit(&BTRFS_I(inode)->io_tree, lockstart,
+lockend, EXTENT_UPTODATE, 0,
+cached_state)) {
if (ordered)
btrfs_put_ordered_extent(ordered);
break;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d79a543..4dedab6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6832,76 +6832,6 @@ out:
return ret;
 }
 
-bool btrfs_page_exists_in_range(struct inode *inode, loff_t start, loff_t end)
-{
-   struct radix_tree_root *root = &inode->i_mapping->page_tree;
-   int found = false;
-   void **pagep = NULL;
-   struct page *page = NULL;
-   int start_idx;
-   int end_idx;
-
-   start_idx = start >> PAGE_CACHE_SHIFT;
-
-   /*
-* end is the last byte in the last page.  end == start is legal
-*/
-   end_idx = end >> PAGE_CACHE_SHIFT;
-
-   rcu_read_lock();
-
-   /* Most of the code in this while loop is lifted from
-* find_get_page.  It's been modified to begin searching from a
-* page and return just the first page found in that range.  If the
-* found idx is less than or equal to the end idx then we know that
-* a page exists.  If no pages are found or if those pages are
-* outside of the range then we're fine (yay!) */
-   while (page == NULL &&
-  radix_tree_gang_lookup_slot(root, &pagep, NULL, start_idx, 1)) {
-   page = radix_tree_deref_slot(pagep);
-   if (unlikely(!page))
-   break;
-
-   if (radix_tree_exception(page)) {
-   if (radix_tree_deref_retry(page)) {
-   page = NULL;
-   continue;
-   }
-   /*
-* Otherwise, shmem/tmpfs must be storing a swap entry
-* here as an exceptional entry: so return it without
-* attempting to raise page count.
-*/
-   page = NULL;
-   break; /* TODO: Is this relevant for this use case? */
-   }
-
-   if (!page_cache_get_speculative(page)) {
-   page = NULL;
-   continue;
-   }
-
-   /*
-* Has the page moved?
-* This is part of the lockless pagecache protocol. See
-* include/linux/pagemap.h for details.
-*/
-   if (unlikely(page != *pagep)) {
-   page_cache_release(page);
-   page = NULL;
-   }
-   }
-
-   if (page) {
-   if (page->index <= end_idx)
-   found = true;
-   page_cache_release(page);
-   }
-
-   rcu_read_unlock();
-   return found;
-}
-
 static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
  struct extent_state **cached_state, int writing)
 {
@@ -6926,9 +6856,10 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
 * invalidate needs to happen so that reads after a write do not
 * get stale data.
 */
-   if (!ordered &&
-   (!writing ||
-!btrfs_page_exists_in_range(inode, lockstart, lockend)))
+   if (!ordered && (!writing ||
+   !test_range_bit(&BTRFS_I(inode)->io_tree,
+   

[RFC PATCH V7 00/16] Btrfs: Subpagesize-blocksize: Get rid of whole page I/O.

2014-09-21 Thread Chandan Rajendra
This patchset continues with the work posted earlier at
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg37041.html.

Changes from V1:
1. Remove usage of bio_vec->bv_{len,offset} in end_bio_extent_readpage()
   and end_bio_extent_writepage().

Changes from V2:
1. Get __extent_writepage() to write only the dirty blocks of a page.
2. Fix "page private not zero on page" warning message which is printed
   when running xfstests.

Changes from V3:
1. Get "Hole punching" and "Extent preallocation" to work correctly in
   subpagesize-blocksize scenario.
2. Get btrfs_page_mkwrite() to reserve space in sectorsized units.

Changes from V4:
1. V2's "Btrfs: subpagesize-blocksize: Get rid of whole page reads"
   patch was incorrectly replaced with an older version when working
   on V3 patches. Fix this.
2. Fix btrfs_endio_direct_read() to compute checksums for all possible
   blocks in a page.

Changes from V5:
1. Rebased patchset on top of current btrfs-next tree (i.e. commit
   8d875f95da43c6a8f18f77869f2ef26e9594fecc). This involved using
   "immutable biovecs".
2. Deal with partially allocated ordered extents across a page.
3. Explicitly track I/O status of blocks of an ordered extent.

Changes from V6:
1. Fix softlockup issue that occurred while unmounting a 4k blocksized
   filesystem instance.
2. Track blocks of an ordered extent submitted for write I/O to avoid
   I/O resubmission in certain scenarios.

Xfstests' generic tests were run on an x86_64 machine with the patches
applied. The Btrfs kernel module was compiled without ACL and quotas support
and hence tests related to those were not run.

For 2k blocksize, the following xfstests' generic tests failed:
1. generic/091
2. generic/125
3. generic/127 (softlockup)
4. generic/251 (ENOSPC)

The following xfstests' generic tests failed for both 2k and 4k blocksize:
1. generic/224 (OOM)
   This looks mostly like an issue caused by non-btrfs code, as the test
   failed for the exact same reason when run on an ext4 filesystem instance.
2. generic/263
   FALLOC_FL_ZERO_RANGE isn't supported by Btrfs. Hence the test fails.

The following is a list of known TODO items which will be implemented in
future revisions of this patchset:
1. Get Xfstests' generic tests to successfully run on both 4k and 2k
   blocksizes.
2. Remove PAGE_CACHE_SIZE delalloc reservation in
   btrfs_writepage_fixup_worker().
3. Create separate slab caches for 'extent buffer head' and 'extent buffer'.
4. Add 'leak list' tracking for 'extent buffer' instances.
5. Rename EXTENT_BUFFER_TREE_REF and EXTENT_BUFFER_IN_TREE to
   EXTENT_BUFFER_HEAD_TREE_REF and EXTENT_BUFFER_HEAD_IN_TREE respectively.
   
Chandan Rajendra (14):
  Btrfs: subpagesize-blocksize: Get rid of whole page reads.
  Btrfs: subpagesize-blocksize: Get rid of whole page writes.
  Btrfs: subpagesize-blocksize: __btrfs_buffered_write: Reserve/release
extents aligned to block size.
  Btrfs: subpagesize-blocksize: Read tree blocks whose size is


[RFC PATCH V7 04/16] Btrfs: subpagesize-blocksize: Define extent_buffer_head.

2014-09-21 Thread Chandan Rajendra
From: Chandra Seetharaman 

In order to handle multiple extent buffers per page, first we need to create a
way to handle all the extent buffers that are attached to a page.

This patch creates a new data structure 'struct extent_buffer_head', and moves
fields that are common to all extent buffers in a page from 'struct
extent_buffer' to 'struct extent_buffer_head'.

Also, this patch moves EXTENT_BUFFER_TREE_REF, EXTENT_BUFFER_DUMMY and
EXTENT_BUFFER_IN_TREE flags from extent_buffer->ebflags to
extent_buffer_head->bflags.

Signed-off-by: Chandra Seetharaman 
Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/backref.c   |   2 +-
 fs/btrfs/ctree.c |   2 +-
 fs/btrfs/ctree.h |   6 +-
 fs/btrfs/disk-io.c   |  46 --
 fs/btrfs/extent-tree.c   |   6 +-
 fs/btrfs/extent_io.c | 373 +--
 fs/btrfs/extent_io.h |  47 --
 fs/btrfs/volumes.c   |   2 +-
 include/trace/events/btrfs.h |   2 +-
 9 files changed, 328 insertions(+), 158 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 54a201d..1d3d5d6 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1305,7 +1305,7 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
eb = path->nodes[0];
/* make sure we can use eb after releasing the path */
if (eb != eb_in) {
-   atomic_inc(&eb->refs);
+   atomic_inc(&eb_head(eb)->refs);
btrfs_tree_read_lock(eb);
btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
}
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 44ee5d2..693b541 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -169,7 +169,7 @@ struct extent_buffer *btrfs_root_node(struct btrfs_root *root)
 * the inc_not_zero dance and if it doesn't work then
 * synchronize_rcu and try again.
 */
-   if (atomic_inc_not_zero(&eb->refs)) {
+   if (atomic_inc_not_zero(&eb_head(eb)->refs)) {
rcu_read_unlock();
break;
}
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8e29b61..5b7b7ca 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2215,14 +2215,16 @@ static inline void btrfs_set_token_##name(struct extent_buffer *eb, \
 #define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits)\
 static inline u##bits btrfs_##name(struct extent_buffer *eb)   \
 {  \
-   type *p = page_address(eb->pages[0]);   \
+   type *p = page_address(eb_head(eb)->pages[0]) + \
+   (eb->start & (PAGE_CACHE_SIZE -1)); \
u##bits res = le##bits##_to_cpu(p->member); \
return res; \
 }  \
 static inline void btrfs_set_##name(struct extent_buffer *eb,  \
u##bits val)\
 {  \
-   type *p = page_address(eb->pages[0]);   \
+   type *p = page_address(eb_head(eb)->pages[0]) + \
+   (eb->start & (PAGE_CACHE_SIZE -1)); \
p->member = cpu_to_le##bits(val);   \
 }
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d0ed9e6..3a79833 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1030,13 +1030,21 @@ static int btree_set_page_dirty(struct page *page)
 {
 #ifdef DEBUG
struct extent_buffer *eb;
+   int i, dirty = 0;
 
BUG_ON(!PagePrivate(page));
eb = (struct extent_buffer *)page->private;
BUG_ON(!eb);
-   BUG_ON(!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
-   BUG_ON(!atomic_read(&eb->refs));
-   btrfs_assert_tree_locked(eb);
+
+   do {
+   dirty = test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
+   if (dirty)
+   break;
+   } while ((eb = eb->eb_next) != NULL);
+
+   BUG_ON(!dirty);
+   BUG_ON(!atomic_read(&(eb_head(eb)->refs)));
+   btrfs_assert_tree_locked(&ebh->eb);
 #endif
return __set_page_dirty_nobuffers(page);
 }
@@ -1080,7 +1088,7 @@ int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr, u32 blocksize,
if (!buf)
return 0;
 
-   set_bit(EXTENT_BUFFER_READAHEAD, &buf->bflags);
+   set_bit(EXTENT_BUFFER_READAHEAD, &buf->ebflags);
 
ret = read_extent_buffer_pages(io_tree, buf, 0, WAIT_PAGE_LOCK,
   btree_get_extent, mirror_num);
@@ -1089,7 +1097,7 @@ int reada_tree_bl

[RFC PATCH V7 02/16] Btrfs: subpagesize-blocksize: Get rid of whole page writes.

2014-09-21 Thread Chandan Rajendra
This commit brings back functions that set/clear EXTENT_WRITEBACK bits. These
are required to reliably clear PG_writeback page flag.
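
A toy model of the bookkeeping (a sketch under assumptions: the real state
lives in the extent io tree, while here a hypothetical per-page bitmask
stands in for the EXTENT_WRITEBACK bits):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u

/* Hypothetical per-page state, one bit per block. */
struct toy_page {
	uint64_t offset;	/* file offset of the first byte of the page */
	unsigned int blocksize;	/* power of two, at most 32 blocks per page */
	uint32_t writeback;	/* bit i set: block i is under writeback */
};

/* Bitmask of the blocks touched by the inclusive byte range [start, end]. */
static uint32_t toy_range_mask(const struct toy_page *p, uint64_t start,
			       uint64_t end)
{
	unsigned int first, nr;

	assert(start >= p->offset && start <= end);
	assert(end < p->offset + TOY_PAGE_SIZE);
	first = (unsigned int)((start - p->offset) / p->blocksize);
	nr = (unsigned int)((end - p->offset) / p->blocksize) - first + 1;
	return (nr >= 32 ? UINT32_MAX : ((UINT32_C(1) << nr) - 1)) << first;
}

/* Mark the blocks in [start, end] as being written back. */
void toy_set_writeback(struct toy_page *p, uint64_t start, uint64_t end)
{
	p->writeback |= toy_range_mask(p, start, end);
}

/*
 * Clear the blocks in [start, end]. Returns 1 when no block of the page
 * is left under writeback, i.e. when page writeback may be ended.
 */
int toy_end_writeback(struct toy_page *p, uint64_t start, uint64_t end)
{
	p->writeback &= ~toy_range_mask(p, start, end);
	return p->writeback == 0;
}
```

With two 2k blocks in flight for one page, the completion of the first block
must not end page writeback; only the completion of the second one does.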

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c | 47 +++
 fs/btrfs/inode.c | 40 +++-
 2 files changed, 58 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 5d9cc68..7229c4d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1300,6 +1300,20 @@ int clear_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
cached_state, mask);
 }
 
+static int set_extent_writeback(struct extent_io_tree *tree, u64 start, u64 end,
+   struct extent_state **cached_state, gfp_t mask)
+{
+   return set_extent_bit(tree, start, end, EXTENT_WRITEBACK, NULL,
+   cached_state, mask);
+}
+
+static int clear_extent_writeback(struct extent_io_tree *tree, u64 start, u64 end,
+   struct extent_state **cached_state, gfp_t mask)
+{
+   return clear_extent_bit(tree, start, end, EXTENT_WRITEBACK, 1, 0,
+   cached_state, mask);
+}
+
 /*
  * either insert or lock state struct between start and end use mask to tell
  * us if waiting is desired.
@@ -1406,6 +1420,7 @@ static int set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
page_cache_release(page);
index++;
}
+   set_extent_writeback(tree, start, end, NULL, GFP_NOFS);
return 0;
 }
 
@@ -2408,31 +2423,23 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
 
bio_for_each_segment_all(bvec, bio, i) {
struct page *page = bvec->bv_page;
+   struct inode *inode = page->mapping->host;
+   struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
+   u64 page_start, page_end;
 
-   /* We always issue full-page reads, but if some block
-* in a page fails to read, blk_update_request() will
-* advance bv_offset and adjust bv_len to compensate.
-* Print a warning for nonzero offsets, and an error
-* if they don't add up to a full page.  */
-   if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE) {
-   if (bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE)
-   btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info,
-  "partial page write in btrfs with offset %u and length %u",
-   bvec->bv_offset, bvec->bv_len);
-   else
-   btrfs_info(BTRFS_I(page->mapping->host)->root->fs_info,
-  "incomplete page write in btrfs with offset %u and "
-  "length %u",
-   bvec->bv_offset, bvec->bv_len);
-   }
-
-   start = page_offset(page);
-   end = start + bvec->bv_offset + bvec->bv_len - 1;
+   start = page_offset(page) + bvec->bv_offset;
+   end = start + bvec->bv_len - 1;
 
if (end_extent_writepage(page, err, start, end))
continue;
 
-   end_page_writeback(page);
+   clear_extent_writeback(tree, start, end, NULL, GFP_ATOMIC);
+
+   page_start = page_offset(page);
+   page_end = page_offset(page) + PAGE_CACHE_SIZE - 1;
+   if (!test_range_bit(tree, page_start, page_end,
+   EXTENT_WRITEBACK, 0, NULL))
+   end_page_writeback(page);
}
 
bio_put(bio);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7309832..2ffb4df 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2823,22 +2823,44 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_ordered_extent *ordered_extent = NULL;
struct btrfs_workqueue *workers;
+   u64 ordered_start, ordered_end;
+   int done;
 
trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
 
ClearPagePrivate2(page);
-   if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start,
-   end - start + 1, uptodate))
-   return 0;
+loop:
+   ordered_extent = btrfs_lookup_ordered_range(inode, start,
+   end - start + 1);
+   if (!ordered_extent)
+   goto out;
 
-   btrfs_init_work(&ordered_extent->work, finish_ordered_fn, NULL, NULL);
+   ordered_start = max_t(u64, start, ordered_extent->file_offset);
+   ordered_end = min_t(u64, end,
+   o

[RFC PATCH V7 03/16] Btrfs: subpagesize-blocksize: __btrfs_buffered_write: Reserve/release extents aligned to block size.

2014-09-21 Thread Chandan Rajendra
Currently, the code reserves/releases extents in multiples of PAGE_CACHE_SIZE
units. Reserve and release in multiples of the block size (sectorsize) instead.
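
The difference between the two reservation schemes can be shown with a small
sketch. The arithmetic follows the expressions used in the diff below; the
function names and the fixed 4k page size are assumptions of this example.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u

/* Round x up to a multiple of a, where a is a power of two. */
uint64_t toy_align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Old scheme: reserve every page touched by the write. */
uint64_t toy_reserve_by_page(uint64_t pos, uint64_t write_bytes)
{
	uint64_t offset = pos & (TOY_PAGE_SIZE - 1);
	uint64_t num_pages = (write_bytes + offset + TOY_PAGE_SIZE - 1) /
			     TOY_PAGE_SIZE;

	return num_pages * TOY_PAGE_SIZE;
}

/* New scheme: reserve only the sectorsize blocks touched by the write. */
uint64_t toy_reserve_by_sector(uint64_t pos, uint64_t write_bytes,
			       uint64_t sectorsize)
{
	uint64_t sector_offset;

	assert(sectorsize != 0 && (sectorsize & (sectorsize - 1)) == 0);
	sector_offset = pos & (sectorsize - 1);
	return toy_align_up(write_bytes + sector_offset, sectorsize);
}
```

For a 100-byte write at position 5000, the page-based scheme reserves 4096
bytes, while a 2k sectorsize reserves 2048 bytes. With a 4k sectorsize both
schemes agree.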

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/file.c | 32 
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d3afac2..444819d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1366,18 +1366,21 @@ fail:
 static noinline int
 lock_and_cleanup_extent_if_need(struct inode *inode, struct page **pages,
size_t num_pages, loff_t pos,
+   size_t write_bytes,
u64 *lockstart, u64 *lockend,
struct extent_state **cached_state)
 {
+   struct btrfs_root *root = BTRFS_I(inode)->root;
u64 start_pos;
u64 last_pos;
int i;
int ret = 0;
 
-   start_pos = pos & ~((u64)PAGE_CACHE_SIZE - 1);
-   last_pos = start_pos + ((u64)num_pages << PAGE_CACHE_SHIFT) - 1;
+   start_pos = pos & ~((u64)root->sectorsize - 1);
+   last_pos = start_pos
+   + ALIGN(pos + write_bytes - start_pos, root->sectorsize) - 1;
 
-   if (start_pos < inode->i_size) {
+   if (start_pos < inode->i_size) {
struct btrfs_ordered_extent *ordered;
lock_extent_bits(&BTRFS_I(inode)->io_tree,
 start_pos, last_pos, 0, cached_state);
@@ -1494,6 +1497,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 
while (iov_iter_count(i) > 0) {
size_t offset = pos & (PAGE_CACHE_SIZE - 1);
+   size_t sector_offset;
size_t write_bytes = min(iov_iter_count(i),
 nrptrs * (size_t)PAGE_CACHE_SIZE -
 offset);
@@ -1514,7 +1518,9 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
break;
}
 
-   reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
+   sector_offset = pos & (root->sectorsize - 1);
+   reserve_bytes = ALIGN(write_bytes + sector_offset, root->sectorsize);
+
ret = btrfs_check_data_free_space(inode, reserve_bytes);
if (ret == -ENOSPC &&
(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
@@ -1529,7 +1535,9 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
num_pages = (write_bytes + offset +
 PAGE_CACHE_SIZE - 1) >>
PAGE_CACHE_SHIFT;
-   reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
+
+   reserve_bytes = ALIGN(write_bytes + sector_offset,
+   root->sectorsize);
ret = 0;
} else {
ret = -ENOSPC;
@@ -1564,8 +1572,8 @@ again:
break;
 
ret = lock_and_cleanup_extent_if_need(inode, pages, num_pages,
- pos, &lockstart, &lockend,
- &cached_state);
+   pos, write_bytes, &lockstart, &lockend,
+   &cached_state);
if (ret < 0) {
if (ret == -EAGAIN)
goto again;
@@ -1602,9 +1610,9 @@ again:
 * we still have an outstanding extent for the chunk we actually
 * managed to copy.
 */
-   if (num_pages > dirty_pages) {
-   release_bytes = (num_pages - dirty_pages) <<
-   PAGE_CACHE_SHIFT;
+   if (write_bytes > copied) {
+   release_bytes = (write_bytes - copied)
+   & ~((u64)root->sectorsize - 1);
if (copied > 0) {
spin_lock(&BTRFS_I(inode)->lock);
BTRFS_I(inode)->outstanding_extents++;
@@ -1618,7 +1626,7 @@ again:
 release_bytes);
}
 
-   release_bytes = dirty_pages << PAGE_CACHE_SHIFT;
+   release_bytes = ALIGN(copied + sector_offset, root->sectorsize);
 
if (copied > 0)
ret = btrfs_dirty_pages(root, inode, pages,
@@ -1640,7 +1648,7 @@ again:
if (only_release_metadata && copied > 0) {
u64 lockstart = round_down(pos, root->sectorsize);
u64 lockend = lockstart +
-   (dirty_pages << PAGE_CACHE_SHIFT) - 1;
+   ALIGN(copied, root->sectorsize) - 1;
 
   

[RFC PATCH V7 01/16] Btrfs: subpagesize-blocksize: Get rid of whole page reads.

2014-09-21 Thread Chandan Rajendra
Based on original patch from Aneesh Kumar K.V 

For the subpagesize-blocksize scenario, a page can contain multiple
blocks. This patch handles this case.

This patch also brings back check_page_locked() to reliably unlock pages in
readpage's end bio function.
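
The two page-level helpers in the diff below differ in the 'filled' argument
they pass to test_range_bit(): check_page_uptodate() asks whether the whole
range has the bit set, check_page_locked() whether any part of it does. A toy
model of that distinction, using a hypothetical per-page bitmask in place of
the extent io tree:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of a range test over blocks first..last of a page.
 * filled != 0: true only if every block in the range has its bit set.
 * filled == 0: true if at least one block in the range has its bit set.
 */
int toy_test_range(uint32_t mask, unsigned int first, unsigned int last,
		   int filled)
{
	uint32_t want;

	assert(first <= last && last < 32);
	want = (last - first == 31 ? UINT32_MAX :
		((UINT32_C(1) << (last - first + 1)) - 1)) << first;
	if (filled)
		return (mask & want) == want;
	return (mask & want) != 0;
}

/* The page is up to date only when all of its blocks are. */
int toy_page_uptodate(uint32_t uptodate, unsigned int nblocks)
{
	return toy_test_range(uptodate, 0, nblocks - 1, 1);
}

/* The page may be unlocked only when none of its blocks is still locked. */
int toy_page_can_unlock(uint32_t locked, unsigned int nblocks)
{
	return !toy_test_range(locked, 0, nblocks - 1, 0);
}
```

With two blocks per page, a page whose first block alone has been read is
not yet up to date, and a page whose second block is still locked cannot be
unlocked.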

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c | 152 ---
 1 file changed, 58 insertions(+), 94 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a389820..5d9cc68 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1951,15 +1951,29 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
  * helper function to set a given page up to date if all the
  * extents in the tree for that page are up to date
  */
-static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
+static void check_page_uptodate(struct extent_io_tree *tree, struct page *page,
+   struct extent_state *cached)
 {
u64 start = page_offset(page);
u64 end = start + PAGE_CACHE_SIZE - 1;
-   if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, NULL))
+   if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, cached))
SetPageUptodate(page);
 }
 
 /*
+ * helper function to unlock a page if all the extents in the tree
+ * for that page are unlocked
+ */
+static void check_page_locked(struct extent_io_tree *tree, struct page *page)
+{
+   u64 start = page_offset(page);
+   u64 end = start + PAGE_CACHE_SIZE - 1;
+
+   if (!test_range_bit(tree, start, end, EXTENT_LOCKED, 0, NULL))
+   unlock_page(page);
+}
+
+/*
  * When IO fails, either with EIO or csum verification fails, we
  * try other mirrors that might have a good copy of the data.  This
  * io_failure_record is used to record state as we go through all the
@@ -2275,7 +2289,9 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 *  a) deliver good data to the caller
 *  b) correct the bad sectors on disk
 */
-   if (failed_bio->bi_vcnt > 1) {
+   if ((failed_bio->bi_vcnt > 1)
+   || (failed_bio->bi_io_vec->bv_len
+   > BTRFS_I(inode)->root->sectorsize)) {
/*
 * to fulfill b), we need to know the exact failing sectors, as
 * we don't want to rewrite any more than the failed ones. thus,
@@ -2422,18 +2438,6 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
bio_put(bio);
 }
 
-static void
-endio_readpage_release_extent(struct extent_io_tree *tree, u64 start, u64 len,
- int uptodate)
-{
-   struct extent_state *cached = NULL;
-   u64 end = start + len - 1;
-
-   if (uptodate && tree->track_uptodate)
-   set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC);
-   unlock_extent_cached(tree, start, end, &cached, GFP_ATOMIC);
-}
-
 /*
  * after a readpage IO is done, we need to:
  * clear the uptodate bits on error
@@ -2450,13 +2454,12 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
struct bio_vec *bvec;
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+   struct extent_state *cached = NULL;
struct extent_io_tree *tree;
u64 offset = 0;
u64 start;
u64 end;
-   u64 len;
-   u64 extent_start = 0;
-   u64 extent_len = 0;
+   int nr_sectors;
int mirror;
int ret;
int i;
@@ -2467,54 +2470,31 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
bio_for_each_segment_all(bvec, bio, i) {
struct page *page = bvec->bv_page;
struct inode *inode = page->mapping->host;
+   struct btrfs_root *root = BTRFS_I(inode)->root;
 
pr_debug("end_bio_extent_readpage: bi_sector=%llu, err=%d, "
 "mirror=%lu\n", (u64)bio->bi_iter.bi_sector, err,
 io_bio->mirror_num);
tree = &BTRFS_I(inode)->io_tree;
 
-   /* We always issue full-page reads, but if some block
-* in a page fails to read, blk_update_request() will
-* advance bv_offset and adjust bv_len to compensate.
-* Print a warning for nonzero offsets, and an error
-* if they don't add up to a full page.  */
-   if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE) {
-   if (bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE)
-   btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info,
-  "partial page read in btrfs with offset %u and length %u",
-   bvec->bv_offset, bvec->bv_len);
-   else
-   btrfs_info(BTRFS_I(pa

Re: Help for creating a useful bugreport

2014-09-21 Thread Chris Murphy

On Sep 21, 2014, at 4:57 AM, Jakob Breier  wrote:
> 
> I've tried opening a few files and they appear to be fine. I'll take a more 
> thorough look later.

I'm not sure what to expect with --init-extent-tree after the fact. It might 
have holes in it (?) seeing as I'm not sure how we get something whole from 
something that had holes in it, that otherwise couldn't be fixed any other way. 
So I'd take it with a grain of salt unless someone else says it should be fine.

I'm vaguely remembering something about --init-extent-tree discussed in 
archives, but can't find it, about maybe it should come with a more dire 
warning, and that the user would probably want to create a new filesystem after 
retrieving their data from a fixed file system. I know --init-csum-tree removes 
all csums, so everything produces checksum errors.

You might want to check for hardware-related sources of this corruption, 
though. I don't recall if there was any kind of crash that instigated all of 
this? Run a 'smartctl -t long' test and also badblocks -swv (destructive 
write/read with progress) and see if this turns up any problems with the device. 
Check dmesg after badblocks to see if there are any drive- or controller-related 
messages, especially link resets or read or write errors. Lastly, it might be 
worth running memtest86+ overnight or, even better, over a weekend.

Something caused this corruption; even after crashes I've not had much worse 
than orphans needing to be cleaned up on the next boot.


Chris Murphy


Re: device delete progress

2014-09-21 Thread Chris Murphy

On Sep 20, 2014, at 8:09 PM, Russell Coker  wrote:

> We need to have a way to determine the progress of a device delete operation. 
>  
> Also for a balance of a RAID-1 that has more than 2 devices it would be good 
> to know how much space is used on each device.

btrfs replace does do this; but in the use case where the user wants to 
permanently shrink a filesystem by only removing a device, then it might be 
nice to have a status for that too considering it might be an 8TB helium drive 
and take a while.

Chris Murphy


Re: deleting a dead device

2014-09-21 Thread Chris Murphy

On Sep 20, 2014, at 7:39 PM, Russell Coker  wrote:
> 
> Anyway the new drive turned out to have some errors, writes failed and I've 
> got a heap of errors such as the above.

I'm curious if smartctl -t conveyance reveals any problems, it's not a full 
surface test but is designed to be a test for (typical?) problems drives have 
due to shipment damage, and doesn't take very long.

> The errors started immediately after adding the drive and the system wasn't
> actively writing to the filesystem.  So very few (if any) writes made it to
> the device.
> 
> # btrfs device delete /dev/sdc3 /
> ERROR: error removing the device '/dev/sdc3' - Invalid argument
> 
> It seems that I can't remove the device because removing requires writing.

What kernel message do you get associated with this? Try using the devid 
instead of /dev/.

For future reference, btrfs replace start is better to use than add+delete. 
It's an optimization but it also makes it possible to ignore the device being 
replaced for reads; and you can also get a status on the progress with "btrfs 
replace status". And it looks like it does some additional error checking.

> 
> # btrfs device delete /dev/sdc3 /
> ERROR: error removing the device '/dev/sdc3' - No such file or directory
> # btrfs device stats /
> [/dev/sda3].write_io_errs   0
> [/dev/sda3].read_io_errs    0
> [/dev/sda3].flush_io_errs   0
> [/dev/sda3].corruption_errs 57
> [/dev/sda3].generation_errs 0
> [/dev/sdb3].write_io_errs   0
> [/dev/sdb3].read_io_errs    0
> [/dev/sdb3].flush_io_errs   0
> [/dev/sdb3].corruption_errs 0
> [/dev/sdb3].generation_errs 0
> [/dev/sdc3].write_io_errs   267
> [/dev/sdc3].read_io_errs    0
> [/dev/sdc3].flush_io_errs   0
> [/dev/sdc3].corruption_errs 0
> [/dev/sdc3].generation_errs 0
> 
> The drive is attached by USB so I turned off the USB device and then got the 
> above result.  So it still seems impossible to remove the device even though 
> it's physically not present.  I've connected a new USB disk which is now 
> /dev/sdd, so it seems that BTRFS is keeping the name /dev/sdc locked.

Pretty sure kernel assignment is major:minor, and anything under /dev/ is udev. 
What do you get for
btrfs fi show

Unfortunately this won't show devid for missing devices, so you might have to 
infer this. But you can use btrfs replace start -r  /dev/sddX 


> 
> Also as an aside, while the stats about write errors are useful, in this case 
> it would be really good if there was a count of successful writes, it would 
> be useful to know if the successful write count was close to 0.

I think this is for other tools. Btrfs is a file system; it's responsible for 
the integrity of the data it writes. I don't think it's responsible for 
prequalifying drives.

Even a simple dd if=/dev/zero of=/dev/sdc bs=64k count=1600 will write out 
100MB, and dmesg will show if there are any controller or drive problems on 
writes. You may have to do more than 100MB for problems to show up but you get 
the idea.
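
The size arithmetic behind that dd invocation can be checked with a small helper. This is only a sketch: dd_total_bytes is a hypothetical name, and it assumes dd's "k" suffix means 1024 bytes (as in GNU dd), so the total is 100 MiB rather than 100 decimal megabytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: total bytes written by `dd bs=<bs_kib>k count=<count>`,
 * assuming the `k` suffix means 1024 bytes. */
static uint64_t dd_total_bytes(uint64_t bs_kib, uint64_t count)
{
	uint64_t bs;

	/* Refuse inputs whose block size or product would overflow 64 bits. */
	assert(bs_kib <= UINT64_MAX / 1024);
	bs = bs_kib * 1024;
	assert(count == 0 || bs <= UINT64_MAX / count);
	return bs * count;
}
```

With bs_kib = 64 and count = 1600 this gives 104857600 bytes, which is 100 MiB. Doubling count doubles the amount written if 100 MiB is not enough to provoke errors.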

You can also use badblocks -swv (progress, destructive write/read, verbose) 
which will also show writes the drive says succeeded but are actually corrupt.

Use smartctl -t conveyance/short/long to isolate the drive mechanism itself. 
This obviously doesn't test writes.

Consumer drives should fairly quickly report persistent write failures, which 
libata will report in dmesg. A common problem, though, is that they retry reads 
for much longer than the Linux SCSI layer's default timeout. Either the drive's 
SCT ERC timeout needs to be reduced below 30 seconds, or the Linux SCSI layer 
timeout needs to be raised above the drive's SCT ERC timeout. Otherwise the 
drive keeps retrying the read, the Linux SCSI layer gives up on the 
non-communicating drive (which is busy recovering) and resets the link. The 
read error is then never reported, the offending sector is never identified, and 
Btrfs can't fix the problem.
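
The timeout relationship described above can be written down as a simple comparison. The following is a sketch under stated assumptions, not anything from the kernel or smartmontools: erc_below_scsi_timeout is a hypothetical name, the ERC value is taken in deciseconds (the unit smartctl uses when it reports SCT ERC, where 70 means 7.0 seconds), and the SCSI timeout is taken in seconds (the unit of the per-device sysfs timeout attribute, 30 by default).

```c
#include <stdbool.h>

/* Linux SCSI command timeout default, in seconds. */
#define SCSI_DEFAULT_TIMEOUT_S 30u

/*
 * Hypothetical check: does the drive give up on a bad sector before the
 * kernel gives up on the drive?
 *
 * erc_ds:         SCT ERC read timeout in deciseconds (70 means 7.0 s);
 *                 0 is treated as "ERC disabled".
 * scsi_timeout_s: SCSI layer command timeout in seconds.
 */
static bool erc_below_scsi_timeout(unsigned int erc_ds,
				   unsigned int scsi_timeout_s)
{
	if (erc_ds == 0)
		return false;	/* disabled: recovery time is not bounded here */
	/* Compare in deciseconds so the fractional part is not lost. */
	return erc_ds < scsi_timeout_s * 10u;
}
```

A drive set to 7.0 seconds passes against the 30 second default. A drive that recovers for 120 seconds fails against 30 but passes once the SCSI timeout is raised to, say, 180.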


Chris Murphy



Re: [PATCH v2 1/2] Return a value from printk_ratelimited

2014-09-21 Thread Joe Perches
On Sun, 2014-09-21 at 06:25 -0700, Paul E. McKenney wrote:
> On Fri, Sep 19, 2014 at 11:15:53AM -0700, Joe Perches wrote:
> > On Fri, 2014-09-19 at 13:21 -0400, Steven Rostedt wrote:
> > > On Fri, 19 Sep 2014 02:01:29 -0700
> > > Omar Sandoval  wrote:
> > > 
> > > > printk returns an integer; there's no reason for printk_ratelimited to
> > > > swallow it.
> > 
> > Except for the lack of usefulness of the return value itself.
> > See: https://lkml.org/lkml/2009/10/7/275
> 
> When printk()'s return value is changed to void, then yes, we should
> clearly change this code to match that.
> 
> So, I have to ask...  What happened to the patch later in that series
> that was to remove the uses of the printk() return value?

I don't know.

Last I recall via searching emails, Alan Jenkins was going to do
something with it. (I've added his old email to this reply, but
I doubt it still works.)

I remember checking whether or not removing the return value
reduced the code size on x86 (it did not), and forgot about it.

I don't know if removing the printk return value reduces overall
image size in any arch, so I didn't pursue it.



Re: [PATCH v2 1/2] Return a value from printk_ratelimited

2014-09-21 Thread Paul E. McKenney
On Fri, Sep 19, 2014 at 11:15:53AM -0700, Joe Perches wrote:
> On Fri, 2014-09-19 at 13:21 -0400, Steven Rostedt wrote:
> > On Fri, 19 Sep 2014 02:01:29 -0700
> > Omar Sandoval  wrote:
> > 
> > > printk returns an integer; there's no reason for printk_ratelimited to
> > > swallow it.
> 
> Except for the lack of usefulness of the return value itself.
> See: https://lkml.org/lkml/2009/10/7/275

When printk()'s return value is changed to void, then yes, we should
clearly change this code to match that.

So, I have to ask...  What happened to the patch later in that series
that was to remove the uses of the printk() return value?

Thanx, Paul



Re: [PATCH v2 0/2] Move BTRFS RCU string to common library

2014-09-21 Thread Paul E. McKenney
On Fri, Sep 19, 2014 at 12:22:57PM -0400, Chris Mason wrote:
> On 09/19/2014 12:05 PM, Paul E. McKenney wrote:
> > On Fri, Sep 19, 2014 at 11:47:53AM -0400, Chris Mason wrote:
> >>
> >>
> >> On 09/19/2014 11:45 AM, Paul E. McKenney wrote:
> >>> On Fri, Sep 19, 2014 at 02:01:28AM -0700, Omar Sandoval wrote:
> >>>> This patch series moves the generic RCU string library used internally by
> >>>> BTRFS to be accessible by anyone. It provides printk_in_rcu and
> >>>> printk_ratelimited_in_rcu to print these strings. In order to avoid a weird
> >>>> inconsistency between the two, the first patch fixes printk_ratelimited so
> >>>> it passes on the return value from printk.
> >>>>
> >>>> The second patch actually moves the RCU string library. Version 2 passes on
> >>>> the return values from printk{,_ratelimited} and fixes some style issues.
> >>>>
> >>>> Omar Sandoval (2):
> >>>
> >>> For the series:
> >>>
> >>> Acked-by: Paul E. McKenney 
> >>
> >> Fine by me too, Paul, do you want to merge it in?
> > 
> > I would be happy to.
> > 
> > Are you thinking in terms of 3.18 or 3.19?  These look OK either way, but
> > thought I should check.
> 
> Either way is fine with me.  Actually this will have minor conflicts
> with my current branch headed for-next, so I can resolve and send as a
> stand alone pull.

There are no conflicts with RCU, just adding a file, so I am just as
happy to have you send this via your tree.

Thanx, Paul



Re: [PATCH 4/4] Default to acting like fsck.

2014-09-21 Thread Tobias Geerinckx-Rice
On 21 September 2014 03:01, Dimitri John Ledkov  wrote:
>
> Inspect arguments, if we are not called as btrfs, then assume we are
> called to act like fsck.
[...]
> -   if (!strcmp(bname, "btrfsck")) {
> +   if (strcmp(bname, "btrfs") != 0) {

That's assuming a lot.

Silently (!) breaking people's btrfs-3.15_patched-DontRandomlyPanicV2
is a recipe for needless hair-pulling. Is there a reason for not using
something less drastic, such as strstr(bname, "fsck"), that I am missing?
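
To make the difference concrete, here is a small sketch of the three checks side by side. The function names are hypothetical and chosen only for this illustration; the comparisons themselves are the ones quoted above (the original strcmp against "btrfsck", the patched strcmp against "btrfs", and the strstr alternative).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Original check: only the exact name "btrfsck" selects fsck behaviour. */
static bool fsck_if_named_btrfsck(const char *bname)
{
	assert(bname != NULL);
	return strcmp(bname, "btrfsck") == 0;
}

/* Patched check: any name other than "btrfs" selects fsck behaviour. */
static bool fsck_if_not_btrfs(const char *bname)
{
	assert(bname != NULL);
	return strcmp(bname, "btrfs") != 0;
}

/* Suggested alternative: any name containing "fsck" selects fsck behaviour. */
static bool fsck_if_contains_fsck(const char *bname)
{
	assert(bname != NULL);
	return strstr(bname, "fsck") != NULL;
}
```

A renamed binary such as btrfs-3.15_patched-DontRandomlyPanicV2 is treated as fsck by the patched check but not by the strstr variant, while btrfsck and fsck.btrfs are both still recognised by the strstr variant.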

Regards,

T G-R


Re: Help for creating a useful bugreport

2014-09-21 Thread Jakob Breier

Hi,

Chris Murphy wrote:

> On Sep 19, 2014, at 2:58 AM, Jakob Breier  wrote:
> > Unfortunately I don't have much to work with. Can you help me with extracting
> > enough information to create a useful bugreport?
> What storage device(s)?


The device (/dev/dm-1) is a LUKS device. It is about a terabyte in size:

"""
$ sudo cryptsetup status /dev/dm-1
/dev/dm-1 is active.
  type:    LUKS1
  cipher:  aes-xts-plain64
  keysize: 256 bits
  device:  /dev/sdb1
  offset:  4096 sectors
  size:    1953711741 sectors
  mode:    read/write
$ sudo blockdev --getsize64 /dev/dm-1
1000300411392
$ sudo blockdev --getsize /dev/dm-1
1953711741
"""
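
The two blockdev figures and the cryptsetup size agree with each other, which can be checked with a one-line conversion. This sketch assumes the sector counts shown are in 512-byte units; sectors_to_bytes is a hypothetical helper name.

```c
#include <stdint.h>

/* Size in bytes of a device given its length in 512-byte sectors. */
static uint64_t sectors_to_bytes(uint64_t sectors)
{
	return sectors * 512u;
}
```

1953711741 sectors comes out to 1000300411392 bytes, matching the --getsize64 value, which is roughly 1.0 x 10^12 bytes.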

Syslog:
"""
kernel: BTRFS: device label EncJakobExtern devid 1 transid 4923 /dev/dm-1
udisksd[3948]: Unlocked LUKS device /dev/sdb1 as /dev/dm-1
"""

The underlying hdd is an external USB device:

"""
kernel: scsi 6:0:0:0: Direct-Access Intenso  External USB 3.0 0206 PQ: 0 ANSI: 6
kernel: sd 6:0:0:0: Attached scsi generic sg2 type 0
kernel: sd 6:0:0:0: [sdb] 2930276352 512-byte logical blocks: (1.50 TB/1.36 TiB)

kernel: sd 6:0:0:0: [sdb] Write Protect is off
kernel: sd 6:0:0:0: [sdb] Mode Sense: 27 00 00 00
kernel: sd 6:0:0:0: [sdb] No Caching mode page found
kernel: sd 6:0:0:0: [sdb] Assuming drive cache: write through
"""


> Include results from
> # btrfs check


Output is attached:
$ sudo btrfs check /dev/dm-1 > btrfsCheckOutput 2>&1


> And also a note whether you get different results with -s1, -s2, -s3 (how many
> backups superblocks you have depends on file system size so some of those might
> not work).


The result is identical for no parameter, -s1 and -s2, except for the 
additional "using SB copy 1, bytenr 67108864" and "using SB copy 2, 
bytenr 274877906944", respectively. -s3 yields "ERROR: super mirror should 
be less than: 3".
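
The two bytenr values reported above are consistent with the usual btrfs superblock placement, as far as I understand it: a primary copy at 64 KiB and mirrors at 16 KiB shifted left by 12 bits per mirror index, with at most three copies. The sketch below encodes that understanding; the macro and function names are made up for this illustration and are not the kernel's identifiers.

```c
#include <assert.h>
#include <stdint.h>

#define SB_PRIMARY_OFFSET (64ULL * 1024)	/* primary superblock: 64 KiB */
#define SB_MIRROR_MAX     3			/* copies 0, 1 and 2 */
#define SB_MIRROR_SHIFT   12

/* Byte offset of superblock copy `mirror` (0 is the primary). */
static uint64_t sb_offset(int mirror)
{
	assert(mirror >= 0 && mirror < SB_MIRROR_MAX);
	if (mirror == 0)
		return SB_PRIMARY_OFFSET;
	return (16ULL * 1024) << (SB_MIRROR_SHIFT * mirror);
}
```

Copy 1 lands at 67108864 (64 MiB) and copy 2 at 274877906944 (256 GiB), both of which fit on a device of 1000300411392 bytes. There is no copy 3 under this scheme, which matches the error printed for -s3.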



> Since it won't mount you can't get fi df, but if you can provide that info so
> we know if, e.g. the metadata is single (by default on SSD) or DUP.


I don't recall what it said (if I've run it on this partition before), 
unfortunately. I believe I have created the partition with the standard 
settings, though.



> Was it created with btrfs-progs 3.16, and has it only been written to with
> kernel 3.16 or other kernels also?


Looking through my yum history I would guess it was either created with
  btrfs-progs-0.20.rc1.20131114git9f0c53f-1.fc20.x86_64
or with
  btrfs-progs-3.12-1.fc20.x86_64
but I'm not sure. I think I created it in May or June this year. 
Similarly, the first kernel to access the fs is probably one of these:

  kernel-3.14.3-200.fc20.x86_64
  kernel-3.14.4-200.fc20.x86_64
  kernel-3.14.5-200.fc20.x86_64
  kernel-3.14.6-200.fc20.x86_64
The kernel on which the mount error first occurred and which probably 
caused the error is this one:

  kernel-3.15.7-200.fc20.x86_64


> If you can use btrfs-image per the wiki, and keep the image around, it might
> come in handy for a Btrfs developer.


I got an error running btrfs-image:

"""
$ sudo btrfs-image -c 9 -s /dev/dm-1 /run/media/[...]/btrfsImageWithoutData.bit

parent transid verify failed on 46678016 wanted 4923 found 3306
parent transid verify failed on 46678016 wanted 4923 found 3306
parent transid verify failed on 46678016 wanted 4923 found 3306
parent transid verify failed on 46678016 wanted 4923 found 3306
Ignoring transid failure
parent transid verify failed on 46661632 wanted 4923 found 3306
parent transid verify failed on 46661632 wanted 4923 found 3306
parent transid verify failed on 46661632 wanted 4923 found 3306
parent transid verify failed on 46661632 wanted 4923 found 3306
Ignoring transid failure
parent transid verify failed on 46678016 wanted 4923 found 3306
Ignoring transid failure
Error going to next leaf -5
create failed (Success)
"""

Nothing interesting in the syslog and no output file was written. Adding 
"-w" seemed to fix it and I got an 838MB output file:


"""
$ sudo btrfs-image -c 9 -s -w /dev/dm-1 /run/media/[...]/btrfsImageWithoutData.bit

parent transid verify failed on 46678016 wanted 4923 found 3306
parent transid verify failed on 46678016 wanted 4923 found 3306
parent transid verify failed on 46678016 wanted 4923 found 3306
parent transid verify failed on 46678016 wanted 4923 found 3306
Ignoring transid failure
"""


> > Sep 19 10:16:18 localhost.localdomain kernel: parent transid verify failed on
> > 46678016 wanted 4923 found 3306
> > Sep 19 10:16:18 localhost.localdomain kernel: parent transid verify failed on
> > 46678016 wanted 4923 found 3306
>
> These messages come up often on the list. The notes written in disk-io.c say
> this:
>   * we can't consider a given block up to date unless the transid of the
>   * block matches the transid in the parent node's pointer.  This is how we
>   * detect blocks that either didn't get written at all or got written
>   * in the wrong place.
>
> I don't know whether this definitely means hardware related problems of some
> sort, but it sounds suspiciously like that because blocks should get written in
> the correct place. Right? But they didn't.
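
The rule quoted from disk-io.c boils down to comparing two generation numbers. The following is a minimal sketch of that comparison only; the struct and function names are illustrative stand-ins, not the kernel's data structures or its actual verification routine.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins, not the kernel's structures. */
struct parent_ptr {
	uint64_t bytenr;	/* where the child block lives */
	uint64_t generation;	/* transid the parent expects ("wanted") */
};

struct block_hdr {
	uint64_t bytenr;
	uint64_t generation;	/* transid stored in the block ("found") */
};

/* A block is only trusted if its transid matches what the parent expects. */
static bool transid_ok(const struct parent_ptr *ptr,
		       const struct block_hdr *blk)
{
	return ptr->generation == blk->generation;
}
```

For the block at 46678016 in the log above, the parent wants 4923 and the block carries 3306, so the check fails: the block on disk is an older version than the one the parent points to.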


Wh

Re: [PATCH] btrfs: fix ABBA deadlock in btrfs_dev_replace_finishing()

2014-09-21 Thread Miao Xie
It has been fixed by

https://patchwork.kernel.org/patch/4747961/

Thanks
Miao

On Sun, 21 Sep 2014 12:41:49 +0800, Eryu Guan wrote:
> btrfs_map_bio() first calls btrfs_bio_counter_inc_blocked() which checks
> fs state and increases bio_counter, then calls __btrfs_map_block() which
> will take the dev_replace lock.
> 
> On the other hand, btrfs_dev_replace_finishing() takes dev_replace lock
> first then sets fs state to BTRFS_FS_STATE_DEV_REPLACING and waits for
> bio_counter to be zero.
> 
> The deadlock can be reproduced easily by running replace and fsstress at
> the same time, e.g.
> 
> mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
> mount /dev/sdb1 /mnt/btrfs
> fsstress -d /mnt/btrfs -n 100 -p 2 -l 0 & # fsstress from ltp supports -l option
> i=0
> while btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs && \
>   btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs; do
>   echo "=== loop $i ==="
>   let i=$i+1
> done
> 
> This was introduced by
> 
> c404e0d Btrfs: fix use-after-free in the finishing procedure of the device replace
> 
> Signed-off-by: Eryu Guan 
> ---
> 
> Tested by the reproducer and xfstests, no new failure found.
> 
> But I found kmem_cache leak if I remove btrfs module after my new test case[1],
> which does fsstress & replace & subvolume create/mount/umount/delete at the
> same time.
> 
> BUG btrfs_extent_state (Tainted: GB ): Objects remaining in btrfs_extent_state on kmem_cache_close()
> ..
> kmem_cache_destroy btrfs_extent_state: Slab cache still has objects
> CPU: 3 PID: 9503 Comm: modprobe Tainted: GB  3.17.0-rc5+ #12
> Hardware name: Hewlett-Packard ProLiant DL388eGen8, BIOS P73 06/01/2012
>   8dd09c52 880411c37eb0 81642f7a
>  8800b9a19300 880411c37ed0 8118ce89 
>  a05dcd20 880411c37ee0 a056a80f 880411c37ef0
> Call Trace:
>  [] dump_stack+0x45/0x56
>  [] kmem_cache_destroy+0xf9/0x100
>  [] extent_io_exit+0x1f/0x50 [btrfs]
>  [] exit_btrfs_fs+0x2c/0x549 [btrfs]
>  [] SyS_delete_module+0x162/0x200
>  [] ? do_notify_resume+0x97/0xb0
>  [] system_call_fastpath+0x16/0x1b
> 
> The test would hang before the fix. I'm not sure if it's related to the fix
> (seems not), please help review.
> 
> Thanks,
> Eryu Guan
> 
> [1] http://www.spinics.net/lists/linux-btrfs/msg37625.html
> 
>  fs/btrfs/dev-replace.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index eea26e1..5dfd292 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -510,6 +510,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>   /* keep away write_all_supers() during the finishing procedure */
>   mutex_lock(&root->fs_info->chunk_mutex);
>   mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
> + btrfs_rm_dev_replace_blocked(fs_info);
>   btrfs_dev_replace_lock(dev_replace);
>   dev_replace->replace_state =
>   scrub_ret ? BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED
> @@ -567,12 +568,8 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>   btrfs_kobj_rm_device(fs_info, src_device);
>   btrfs_kobj_add_device(fs_info, tgt_device);
>  
> - btrfs_rm_dev_replace_blocked(fs_info);
> -
>   btrfs_rm_dev_replace_srcdev(fs_info, src_device);
>  
> - btrfs_rm_dev_replace_unblocked(fs_info);
> -
>   /*
>* this is again a consistent state where no dev_replace procedure
>* is running, the target device is part of the filesystem, the
> @@ -581,6 +578,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>* belong to this filesystem.
>*/
>   btrfs_dev_replace_unlock(dev_replace);
> + btrfs_rm_dev_replace_unblocked(fs_info);
>   mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>   mutex_unlock(&root->fs_info->chunk_mutex);
>  
> 



Re: device delete progress

2014-09-21 Thread Duncan
Russell Coker posted on Sun, 21 Sep 2014 12:09:11 +1000 as excerpted:

> We need to have a way to determine the progress of a device delete
> operation.
> Also for a balance of a RAID-1 that has more than 2 devices it would be
> good to know how much space is used on each device.
> 
> Could btrfs fi df be extended to show information separately for each
> device?

btrfs fi show should give you at least some minimal per-device stats, 
today.  Enough to at least have some idea of the progress of a balance 
when adding/removing devices.

Longer term, there has been discussion of extending/changing the fi df 
format and making it far more verbose, including the information found in 
btrfs fi show as well, and making everything potentially per-device.  I 
hadn't paid a whole lot of attention to the details, however.

Alternatively, leave df more or less as it is (perhaps extending it a bit 
but attempting not to kill existing scripts using it) and put the detail 
in a new btrfs filesystem usage.  This sounds rather more reasonable to 
me.

Either way the idea is to give people a single command that combines the 
current output of fi show and fi df, ideally displaying per-device and 
filesystem totals both, in enough verbosity to avoid the unintuitive and 
arcane btrfs specific knowledge required today to interpret it.

I had originally presumed that such a change would happen before the 
experimental labels came off, but it didn't.  I don't know the timetable 
for it now, or even if it's still planned, as IIRC the discussion died 
away back in the btrfs-progs 3.12 era and I expected to see it in 3.14 
and it wasn't there, so I don't know...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: deleting a dead device

2014-09-21 Thread Russell Coker
On Sun, 21 Sep 2014, Duncan <1i5t5.dun...@cox.net> wrote:
> Russell Coker posted on Sun, 21 Sep 2014 11:39:17 +1000 as excerpted:
> > On a system running the Debian 3.14.15-2 kernel I added a new drive to a
> > RAID-1 array.  My aim was to add a device and remove one of the old
> > devices.
> 
> That's an old kernel and presumably an old btrfs-progs.  Quite a number
> of device management fixes have gone in recently, and you'd likely not be
> in quite that predicament were you running a current kernel (make sure
> it's 3.16.2+ or 3.17-rc2+ to get the fix for the workqueues bug that
> affected 3.15 and thru 3.16.1 and 3.17-rc1).

3.16.2 is in Debian, I'm in the process of upgrading to it.

> And the recommended way to handle a device replace now would be btrfs
> replace, doing the add and delete in a single (albeit possibly long) step
> instead of separately.

I'm changing from a 500G single filesystem to a 200G RAID-1 (there's only 150G 
of data).  The change from 500G to 200G can't be done with a replace as a 
replace requires an equal or greater size.

I did a 2 step process, add/delete to go to a 200G USB attached device for 
half the array and then replace to go from 200G on USB to 200G internal.

> > The drive is attached by USB so I turned off the USB device and then got
> > the above result.  So it still seems impossible to remove the device
> > even though it's physically not present.  I've connected a new USB disk
> > which is now /dev/sdd, so it seems that BTRFS is keeping the name
> > /dev/sdc locked.
> > 
> > Should there be a way to fix this without rebooting or anything?
> 
> Did you try btrfs device delete missing?  It's documented on the wiki but
> apparently not yet on the manpage.

I did that after rebooting.  It didn't occur to me to try a "missing" 
operation when the drive really wasn't missing.

> According to the wiki that deletes
> the first device that was in the metadata but not found when booting, so
> you may have to reboot to do it, but it should work.

That would be a bug.  There's no reason a reboot should be required if we can 
remove a drive and add a new one with the kernel recognising it.  Hot-swap 
disks aren't any sort of new feature.

> Tho with the recent
> stale-devices fixes, were that a current kernel you may not actually have
> to reboot to have delete missing work.  But you probably will on 3.14,
> and of course to upgrade kernels you'd have to reboot anyway, so...

Yes a reboot was needed anyway.  But I'd have liked to delay that.

> > Also as an aside, while the stats about write errors are useful, in this
> > case it would be really good if there was a count of successful writes,
> > it would be useful to know if the successful write count was close to 0.
> >  My understanding of the BTRFS design is that there would be no
> > performance penalty for adding counts of the number of successful reads
> > and writes to the superblock.  Could this be done?
> 
> Not necessarily for reads, consider the case when the filesystem is read-
> only as my btrfs root filesystem is by default -- lots of reads but
> likely no writes and no super-block updates for the entire uptime.  But I
> believe you're correct for writes, since they'd ultimately update the
> superblocks anyway.

For the case of a read-only filesystem it's OK to skip read stats.  It would 
also be a bad idea to update read stats without writing data.  But there's no 
reason why read stats couldn't be accumulated in-memory and written out the 
next time something was written to disk.  That would give a slight inaccuracy 
in the case where there was a power failure after some period of reading 
without writing, but that's an unusual corner case.
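
The accumulate-then-flush idea can be sketched in a few lines. This is an illustrative model only: the struct and function names are hypothetical, nothing here corresponds to existing btrfs code, and the "flush" simply stands in for whatever would be written along with the next superblock update.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model only; not a btrfs structure. */
struct read_stats {
	uint64_t on_disk;	/* count last persisted with the superblock */
	uint64_t pending;	/* reads counted in memory since then */
};

/* Count a successful read without touching the disk. */
static void note_read(struct read_stats *s)
{
	s->pending++;
}

/*
 * Called when the filesystem is writing a superblock anyway: fold the
 * in-memory count into the persisted one. Returns true if anything changed.
 */
static bool flush_on_write(struct read_stats *s)
{
	if (s->pending == 0)
		return false;
	s->on_disk += s->pending;
	s->pending = 0;
	return true;
}
```

In this model a read-only mount never calls flush_on_write(), so its reads are simply never persisted, and a power failure loses at most the pending count.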

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: deleting a dead device

2014-09-21 Thread Duncan
Russell Coker posted on Sun, 21 Sep 2014 11:39:17 +1000 as excerpted:

> On a system running the Debian 3.14.15-2 kernel I added a new drive to a
> RAID-1 array.  My aim was to add a device and remove one of the old
> devices.

That's an old kernel and presumably an old btrfs-progs.  Quite a number 
of device management fixes have gone in recently, and you'd likely not be 
in quite that predicament were you running a current kernel (make sure 
it's 3.16.2+ or 3.17-rc2+ to get the fix for the workqueues bug that 
affected 3.15 and thru 3.16.1 and 3.17-rc1).

And the recommended way to handle a device replace now would be btrfs 
replace, doing the add and delete in a single (albeit possibly long) step 
instead of separately.

[snip most of the problem description]

> The drive is attached by USB so I turned off the USB device and then got
> the above result.  So it still seems impossible to remove the device
> even though it's physically not present.  I've connected a new USB disk
> which is now /dev/sdd, so it seems that BTRFS is keeping the name
> /dev/sdc locked.
> 
> Should there be a way to fix this without rebooting or anything?

Did you try btrfs device delete missing?  It's documented on the wiki but 
apparently not yet on the manpage.  According to the wiki that deletes 
the first device that was in the metadata but not found when booting, so 
you may have to reboot to do it, but it should work.  Tho with the recent 
stale-devices fixes, were that a current kernel you may not actually have 
to reboot to have delete missing work.  But you probably will on 3.14, 
and of course to upgrade kernels you'd have to reboot anyway, so...

> Also as an aside, while the stats about write errors are useful, in this
> case it would be really good if there was a count of successful writes,
> it would be useful to know if the successful write count was close to 0.
>  My understanding of the BTRFS design is that there would be no
> performance penalty for adding counts of the number of successful reads
> and writes to the superblock.  Could this be done?

Not necessarily for reads, consider the case when the filesystem is read-
only as my btrfs root filesystem is by default -- lots of reads but 
likely no writes and no super-block updates for the entire uptime.  But I 
believe you're correct for writes, since they'd ultimately update the 
superblocks anyway.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
