RE: [PATCH 8/9] btrfs: wait for delayed iputs on no space

2015-04-13 Thread Zhao Lei
Hi, Chris

> -Original Message-
> From: Chris Mason [mailto:c...@fb.com]
> Sent: Monday, April 13, 2015 10:55 PM
> To: Zhaolei; linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH 8/9] btrfs: wait for delayed iputs on no space
> 
> On 04/09/2015 12:34 AM, Zhaolei wrote:
> > From: Zhao Lei 
> >
> > btrfs will report no_space when we run following write and delete file
> > loop:
> >  # FILE_SIZE_M=[ 75% of fs space ]
> >  # DEV=[ some dev ]
> >  # MNT=[ some dir ]
> >  #
> >  # mkfs.btrfs -f "$DEV"
> >  # mount -o nodatacow "$DEV" "$MNT"
> >  # for ((i = 0; i < 100; i++)); do dd if=/dev/zero of="$MNT"/file0 bs=1M count="$FILE_SIZE_M"; rm -f "$MNT"/file0; done
> >  #
> > Reason:
> >  iput() and evict() run after the pages are written to the block
> > device. If that writeback has not finished before the next write, the
> > "rm"ed space is not yet freed, which causes the bug above.
> >
> > Fix:
> >  We can add the "-o flushoncommit" mount option to avoid the bug, but
> > it costs performance. Instead, we can wait for the on-the-fly writes
> > only when no-space happens, which is what this patch does.
> 
> Can you please change this so we only do this flush if the first commit
> doesn't free up enough space?  I think this is going to have a performance
> impact as the FS fills up.
> 
btrfs_wait_ordered_roots() can only ensure that all bios have finished;
the corresponding iputs are added to delayed_iputs in end_io.
After that we still need two commits to make the freed space accessible:
one to run the delayed iputs, and another to unpin the extents.

That is why I put that line before the first commit: it ensures we have
enough commit operations to make the free space accessible.
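The same two-commit requirement can be seen by hand from userspace; a
sketch (using the same $MNT placeholder as in the reproducer above):

 # rm -f "$MNT"/file0
 # btrfs filesystem sync "$MNT"    # 1st commit: runs the delayed iputs
 # btrfs filesystem sync "$MNT"    # 2nd commit: unpins the freed extents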

It is only called when the disk is almost full, so it has no performance
impact in the common case (disk not full).

Another way is to call btrfs_wait_ordered_roots() after the first
commit() attempt, but then give it an additional commit().

Thanks
Zhaolei


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Big disk space usage difference, even after defrag, on identical data

2015-04-13 Thread Duncan
Gian-Carlo Pascutto posted on Mon, 13 Apr 2015 16:06:39 +0200 as
excerpted:

>> Defrag should force the rewrite of entire files and take care of this,
>> but obviously it's not returning to "clean" state.  I forgot what the
>> default minimum file size is if -t isn't set, maybe 128 MiB?  But a -t1
>> will force it to defrag even small files, and I recall at least one
>> thread here where the poster said it made all the difference for him,
>> so try that.  And the -f should force a filesystem sync afterward, so
>> you know the numbers from any report you run afterward match the final
>> state.
> 
> Reading the corresponding manual, the -t explanation says that "any
> extent bigger than this size will be considered already defragged". So I
> guess setting -t1 might've fixed the problem too...but after checking
> the source, I'm not so sure.

Oops!  You are correct.  There was an on-list discussion of that before 
that I had forgotten.  The "make sure everything gets defragged" magic 
setting is -t 1G or higher, *not* the -t 1 I was trying to tell you 
previously (which will end up skipping everything, instead of defragging 
everything).
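In command form, the working invocation would look like this (a sketch;
the mountpoint is a placeholder, and the options match what was discussed
earlier in the thread):

 # btrfs filesystem defragment -v -r -f -clzo -t 1G /mnt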

Thanks for spotting the inconsistency and calling me on it! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Big disk space usage difference, even after defrag, on identical data

2015-04-13 Thread Zygo Blaxell
On Mon, Apr 13, 2015 at 04:06:39PM +0200, Gian-Carlo Pascutto wrote:
> On 13-04-15 07:06, Duncan wrote:
> 
> >> So what can explain this? Where did the 66G go?
> > 
> > Out of curiosity, does a balance on the actively used btrfs help?
> > 
> > You mentioned defrag -v -r -clzo, but didn't use the -f (flush) or -t 
> > (minimum size file) options.  Does adding -f -t1 help?
> 
> Unfortunately I can no longer try this, see the other reply why. But the
> problem turned out to be some 1G-sized files, written using 3-5 extents,
> that for whatever reason defrag was not touching.

There are several corner cases that defrag won't touch by default.
It's designed to be conservative and favor speed over size.

Also when the kernel decides you're not getting enough compression, it
seems to disable compression on the file _forever_ even if future writes
are compressible again.  mount -o compress-force works around that.
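For example (a sketch; the mountpoint and compression algorithm are
placeholders):

 # mount -o remount,compress-force=lzo /mnt

or add compress-force=lzo to the fstab options for the filesystem.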

> > You aren't doing btrfs snapshots of either subvolume, are you?
> 
> No :-) I should've mentioned that.

read-only snapshots:  yet another thing defrag won't touch.

> > While initial usage will  be higher due to the lack of compression,
> > as you've discovered, over time, on an actively updated database,
> > compression isn't all that effective anyway.
> 
> I don't see why. If you're referring to the additional overhead of
> continuously compressing and decompressing everything - yes, of course.
> But in my case I have a mostly-append workload to a huge amount of
> fairly compressible data that's on magnetic storage, so compression is a
> win in disk space and perhaps even in performance.

Short writes won't compress--not just well, but at all--because btrfs
won't look at adjacent already-written blocks.  If you write a file
at less than 4K/minute, there will be no compression, as each new extent
(or replacement extent for overwritten data) is already minimum-sized.

If you write in bursts of 128K or more, consecutively, then you can
get compression benefit.

There has been talk of teaching autodefrag to roll up the last few dozen
extents of files that grow slowly so they can be compressed.
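As an illustration of the burst-size point above (a sketch; the file path
is a placeholder): appending in 128K bursts leaves extents big enough to
compress, while trickling 4K at a time does not:

 # dd if=/dev/zero of=/mnt/log bs=128K count=8 oflag=append conv=notrunc
 # dd if=/dev/zero of=/mnt/log bs=4K count=1 oflag=append conv=notrunc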

> It turns out that if your dataset isn't update heavy (so it doesn't
> fragment to begin with), or has to be queried via indexed access (i.e.
> mostly via random seeks), the fragmentation doesn't matter much anyway.
> Conversely, btrfs appears to have better sync performance with multiple
> threads, and allows one to disable part of the partial-page-write
> protection logic in the database (full_page_writes=off for PostgreSQL),
> because btrfs is already doing the COW to ensure those can't actually
> happen [1].
> 
> The net result is a *boost* from about 40 tps (ext4) to 55 tps (btrfs),
> which certainly is contrary to popular wisdom. Maybe btrfs would fall
> off eventually as fragmentation does set in gradually, but given that
> there's an offline defragmentation tool that can run in the background,
> I don't care.

I've found the performance of PostgreSQL to be wildly variable on btrfs.
It may be OK at first, but watch it for a week or two to admire the
full four-orders-of-magnitude swing (100 tps to 0.01 tps).  :-O

> [1] I wouldn't be too surprised if database COW, which consists of
> journal-writing a copy of the data out of band, then rewriting it again
> in the original place, is actually functionally equivalent to disabling
> COW in the database and running btrfs + defrag. Obviously you shouldn't
> keep COW enabled in btrfs *AND* the DB, requiring all data to be copied
> around at least 3 times...which I'm afraid almost everyone does because
> it's the default...

Journalling writes all the data twice:  once to the journal, once to
update the origin page after the journal (though PostgreSQL will omit
some of those duplicate writes in cases where there is no origin page
to overwrite).

COW writes all the new and updated data only once.

In the event of a crash, if the log tree is not recoverable (and it's
a rich source of btrfs bugs, so it's often not), you lose everything
that happened to the database in the last 30 seconds.  If you were
already using async commit in PostgreSQL anyway then that's not much
of a concern (and not having to call fsync 100 times a second _really_
helps performance!)  but if you really need sync commit then btrfs is
not the filesystem for you.
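For reference, the PostgreSQL knobs mentioned above are set in
postgresql.conf; a sketch (check the documentation for your PostgreSQL
version before copying):

 full_page_writes = off      # only safe because btrfs COW prevents torn pages
 synchronous_commit = off    # the "async commit" mode referred to above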

> -- 
> GCP




Re: [PATCH RESEND] btrfs: unlock i_mutex after attempting to delete subvolume during send

2015-04-13 Thread Omar Sandoval
On Fri, Apr 10, 2015 at 02:20:40PM -0700, Omar Sandoval wrote:
> Whenever the check for a send in progress introduced in commit
> 521e0546c970 (btrfs: protect snapshots from deleting during send) is
> hit, we return without unlocking inode->i_mutex. This is easy to see
> with lockdep enabled:
> 
> [  +0.59] 
> [  +0.28] [ BUG: lock held when returning to user space! ]
> [  +0.29] 4.0.0-rc5-00096-g3c435c1 #93 Not tainted
> [  +0.26] 
> [  +0.29] btrfs/211 is leaving the kernel with locks still held!
> [  +0.29] 1 lock held by btrfs/211:
> [  +0.23]  #0:  (&type->i_mutex_dir_key){+.+.+.}, at: [] btrfs_ioctl_snap_destroy+0x2df/0x7a0
> 
> Make sure we unlock it in the error path.
> 
> Reviewed-by: Filipe Manana 
> Reviewed-by: David Sterba 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Omar Sandoval 
> ---
> Just resending this with Filipe's and David's Reviewed-bys and Cc-ing
> stable.
> 
>  fs/btrfs/ioctl.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Ping.

-- 
Omar


Re: [PATCH v2 0/3] btrfs: ENOMEM bugfixes

2015-04-13 Thread Omar Sandoval
On Fri, Mar 27, 2015 at 02:06:49PM -0700, Omar Sandoval wrote:
> On Fri, Mar 13, 2015 at 12:43:42PM -0700, Omar Sandoval wrote:
> > On Fri, Mar 13, 2015 at 12:04:30PM +0100, David Sterba wrote:
> > > On Wed, Mar 11, 2015 at 09:40:17PM -0700, Omar Sandoval wrote:
> > > > Ping. For anyone following along, it looks like commit cc87317726f8
> > > > ("mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change")
> > > > reverted the commit that exposed these bugs. Josef said he was okay with
> > > > taking these, will they make it to an upcoming -rc soon?
> > > 
> > > Upcoming yes, but based on my experience with pushing patches that are
> > > not really regressions in late rc's it's unlikely for 4.1.
> > 
> > Ok, seeing as these bugs are going to be really hard to trigger now that
> > the old GFP_FS behavior has been restored, I'm fine with waiting for the
> > next merge window.
> > 
> > Thank you!
> 
> Chris, would you mind taking these for a spin in your integration branch
> for the next merge window?
> 
> Thanks,
> -- 
> Omar

Ping.

-- 
Omar


[PATCH 2/4] Btrfs: two stage dirty block group writeout

2015-04-13 Thread Chris Mason
Block group cache writeout is currently waiting on the pages for each
block group cache before moving on to writing the next one.  This commit
switches things around to send down all the caches and then wait on them
in batches.

The end result is much faster, since we're keeping the disk pipeline
full.

Signed-off-by: Chris Mason 
---
 fs/btrfs/ctree.h|   6 ++
 fs/btrfs/extent-tree.c  |  57 +--
 fs/btrfs/free-space-cache.c | 131 +++-
 fs/btrfs/free-space-cache.h |   8 ++-
 4 files changed, 170 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e305ccd..1df0d9d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1261,9 +1261,12 @@ struct btrfs_io_ctl {
struct page *page;
struct page **pages;
struct btrfs_root *root;
+   struct inode *inode;
unsigned long size;
int index;
int num_pages;
+   int entries;
+   int bitmaps;
unsigned check_crcs:1;
 };
 
@@ -1332,6 +1335,9 @@ struct btrfs_block_group_cache {
 
/* For dirty block groups */
struct list_head dirty_list;
+   struct list_head io_list;
+
+   struct btrfs_io_ctl io_ctl;
 };
 
 /* delayed seq elem */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d4b3d680..40c9513 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3388,7 +3388,11 @@ int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
struct btrfs_block_group_cache *cache;
struct btrfs_transaction *cur_trans = trans->transaction;
int ret = 0;
+   int should_put;
struct btrfs_path *path;
+   LIST_HEAD(io);
+   int num_started = 0;
+   int num_waited = 0;
 
if (list_empty(&cur_trans->dirty_bgs))
return 0;
@@ -3407,16 +3411,60 @@ int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
cache = list_first_entry(&cur_trans->dirty_bgs,
 struct btrfs_block_group_cache,
 dirty_list);
+
+   /*
+* this can happen if cache_save_setup re-dirties a block
+* group that is already under IO.  Just wait for it to
+* finish and then do it all again
+*/
+   if (!list_empty(&cache->io_list)) {
+   list_del_init(&cache->io_list);
+   btrfs_wait_cache_io(root, trans, cache,
+   &cache->io_ctl, path,
+   cache->key.objectid);
+   btrfs_put_block_group(cache);
+   num_waited++;
+   }
+
list_del_init(&cache->dirty_list);
+   should_put = 1;
+
if (cache->disk_cache_state == BTRFS_DC_CLEAR)
cache_save_setup(cache, trans, path);
+
if (!ret)
-   ret = btrfs_run_delayed_refs(trans, root,
-(unsigned long) -1);
-   if (!ret && cache->disk_cache_state == BTRFS_DC_SETUP)
-   btrfs_write_out_cache(root, trans, cache, path);
+   ret = btrfs_run_delayed_refs(trans, root, (unsigned long) -1);
+
+   if (!ret && cache->disk_cache_state == BTRFS_DC_SETUP) {
+   cache->io_ctl.inode = NULL;
+   ret = btrfs_write_out_cache(root, trans, cache, path);
+   if (ret == 0 && cache->io_ctl.inode) {
+   num_started++;
+   should_put = 0;
+   list_add_tail(&cache->io_list, &io);
+   } else {
+   /*
+* if we failed to write the cache, the
+* generation will be bad and life goes on
+*/
+   ret = 0;
+   }
+   }
if (!ret)
ret = write_one_cache_group(trans, root, path, cache);
+
+   /* if its not on the io list, we need to put the block group */
+   if (should_put)
+   btrfs_put_block_group(cache);
+   }
+
+   while (!list_empty(&io)) {
+   cache = list_first_entry(&io, struct btrfs_block_group_cache,
+io_list);
+   list_del_init(&cache->io_list);
+   num_waited++;
+   btrfs_wait_cache_io(root, trans, cache,
+   &cache->io_ctl, path, cache->key.objectid);
btrfs_put_block_group(cache);
}
 
@@ -9013,6 +9061,7 @@ btrfs_create_block_group_cache(struct btrfs_root *root, u64 start, u64 siz

[PATCH 3/4] Btrfs: don't use highmem for free space cache pages

2015-04-13 Thread Chris Mason
In order to create the free space cache concurrently with FS modifications,
we need to take a few block group locks.

The cache code also does kmap, which would schedule with the locks held.
Instead of going through kmap_atomic, lets just use lowmem for the cache
pages.

Signed-off-by: Chris Mason 
---
 fs/btrfs/free-space-cache.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 6886ae0..83532a2 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -85,7 +85,8 @@ static struct inode *__lookup_free_space_inode(struct btrfs_root *root,
}
 
mapping_set_gfp_mask(inode->i_mapping,
-   mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS);
+   mapping_gfp_mask(inode->i_mapping) &
+   ~(GFP_NOFS & ~__GFP_HIGHMEM));
 
return inode;
 }
@@ -310,7 +311,6 @@ static void io_ctl_free(struct btrfs_io_ctl *io_ctl)
 static void io_ctl_unmap_page(struct btrfs_io_ctl *io_ctl)
 {
if (io_ctl->cur) {
-   kunmap(io_ctl->page);
io_ctl->cur = NULL;
io_ctl->orig = NULL;
}
@@ -320,7 +320,7 @@ static void io_ctl_map_page(struct btrfs_io_ctl *io_ctl, int clear)
 {
ASSERT(io_ctl->index < io_ctl->num_pages);
io_ctl->page = io_ctl->pages[io_ctl->index++];
-   io_ctl->cur = kmap(io_ctl->page);
+   io_ctl->cur = page_address(io_ctl->page);
io_ctl->orig = io_ctl->cur;
io_ctl->size = PAGE_CACHE_SIZE;
if (clear)
@@ -446,10 +446,9 @@ static void io_ctl_set_crc(struct btrfs_io_ctl *io_ctl, int index)
  PAGE_CACHE_SIZE - offset);
btrfs_csum_final(crc, (char *)&crc);
io_ctl_unmap_page(io_ctl);
-   tmp = kmap(io_ctl->pages[0]);
+   tmp = page_address(io_ctl->pages[0]);
tmp += index;
*tmp = crc;
-   kunmap(io_ctl->pages[0]);
 }
 
 static int io_ctl_check_crc(struct btrfs_io_ctl *io_ctl, int index)
@@ -466,10 +465,9 @@ static int io_ctl_check_crc(struct btrfs_io_ctl *io_ctl, int index)
if (index == 0)
offset = sizeof(u32) * io_ctl->num_pages;
 
-   tmp = kmap(io_ctl->pages[0]);
+   tmp = page_address(io_ctl->pages[0]);
tmp += index;
val = *tmp;
-   kunmap(io_ctl->pages[0]);
 
io_ctl_map_page(io_ctl, 0);
crc = btrfs_csum_data(io_ctl->orig + offset, crc,
-- 
1.8.1



[PATCH 1/4] btrfs: move struct io_ctl into ctree.h and rename it

2015-04-13 Thread Chris Mason
We'll need to put the io_ctl into the block_group cache struct, so
name it struct btrfs_io_ctl and move it into ctree.h

Signed-off-by: Chris Mason 
---
 fs/btrfs/ctree.h| 11 +
 fs/btrfs/free-space-cache.c | 55 ++---
 2 files changed, 33 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6bf16d5..e305ccd 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1256,6 +1256,17 @@ struct btrfs_caching_control {
atomic_t count;
 };
 
+struct btrfs_io_ctl {
+   void *cur, *orig;
+   struct page *page;
+   struct page **pages;
+   struct btrfs_root *root;
+   unsigned long size;
+   int index;
+   int num_pages;
+   unsigned check_crcs:1;
+};
+
 struct btrfs_block_group_cache {
struct btrfs_key key;
struct btrfs_block_group_item item;
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index c514820..47c2adb 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -271,18 +271,7 @@ static int readahead_cache(struct inode *inode)
return 0;
 }
 
-struct io_ctl {
-   void *cur, *orig;
-   struct page *page;
-   struct page **pages;
-   struct btrfs_root *root;
-   unsigned long size;
-   int index;
-   int num_pages;
-   unsigned check_crcs:1;
-};
-
-static int io_ctl_init(struct io_ctl *io_ctl, struct inode *inode,
+static int io_ctl_init(struct btrfs_io_ctl *io_ctl, struct inode *inode,
   struct btrfs_root *root, int write)
 {
int num_pages;
@@ -298,7 +287,7 @@ static int io_ctl_init(struct io_ctl *io_ctl, struct inode *inode,
(num_pages * sizeof(u32)) >= PAGE_CACHE_SIZE)
return -ENOSPC;
 
-   memset(io_ctl, 0, sizeof(struct io_ctl));
+   memset(io_ctl, 0, sizeof(struct btrfs_io_ctl));
 
io_ctl->pages = kcalloc(num_pages, sizeof(struct page *), GFP_NOFS);
if (!io_ctl->pages)
@@ -311,12 +300,12 @@ static int io_ctl_init(struct io_ctl *io_ctl, struct inode *inode,
return 0;
 }
 
-static void io_ctl_free(struct io_ctl *io_ctl)
+static void io_ctl_free(struct btrfs_io_ctl *io_ctl)
 {
kfree(io_ctl->pages);
 }
 
-static void io_ctl_unmap_page(struct io_ctl *io_ctl)
+static void io_ctl_unmap_page(struct btrfs_io_ctl *io_ctl)
 {
if (io_ctl->cur) {
kunmap(io_ctl->page);
@@ -325,7 +314,7 @@ static void io_ctl_unmap_page(struct io_ctl *io_ctl)
}
 }
 
-static void io_ctl_map_page(struct io_ctl *io_ctl, int clear)
+static void io_ctl_map_page(struct btrfs_io_ctl *io_ctl, int clear)
 {
ASSERT(io_ctl->index < io_ctl->num_pages);
io_ctl->page = io_ctl->pages[io_ctl->index++];
@@ -336,7 +325,7 @@ static void io_ctl_map_page(struct io_ctl *io_ctl, int clear)
memset(io_ctl->cur, 0, PAGE_CACHE_SIZE);
 }
 
-static void io_ctl_drop_pages(struct io_ctl *io_ctl)
+static void io_ctl_drop_pages(struct btrfs_io_ctl *io_ctl)
 {
int i;
 
@@ -351,7 +340,7 @@ static void io_ctl_drop_pages(struct io_ctl *io_ctl)
}
 }
 
-static int io_ctl_prepare_pages(struct io_ctl *io_ctl, struct inode *inode,
+static int io_ctl_prepare_pages(struct btrfs_io_ctl *io_ctl, struct inode *inode,
int uptodate)
 {
struct page *page;
@@ -385,7 +374,7 @@ static int io_ctl_prepare_pages(struct io_ctl *io_ctl, struct inode *inode,
return 0;
 }
 
-static void io_ctl_set_generation(struct io_ctl *io_ctl, u64 generation)
+static void io_ctl_set_generation(struct btrfs_io_ctl *io_ctl, u64 generation)
 {
__le64 *val;
 
@@ -408,7 +397,7 @@ static void io_ctl_set_generation(struct io_ctl *io_ctl, u64 generation)
io_ctl->cur += sizeof(u64);
 }
 
-static int io_ctl_check_generation(struct io_ctl *io_ctl, u64 generation)
+static int io_ctl_check_generation(struct btrfs_io_ctl *io_ctl, u64 generation)
 {
__le64 *gen;
 
@@ -437,7 +426,7 @@ static int io_ctl_check_generation(struct io_ctl *io_ctl, u64 generation)
return 0;
 }
 
-static void io_ctl_set_crc(struct io_ctl *io_ctl, int index)
+static void io_ctl_set_crc(struct btrfs_io_ctl *io_ctl, int index)
 {
u32 *tmp;
u32 crc = ~(u32)0;
@@ -461,7 +450,7 @@ static void io_ctl_set_crc(struct io_ctl *io_ctl, int index)
kunmap(io_ctl->pages[0]);
 }
 
-static int io_ctl_check_crc(struct io_ctl *io_ctl, int index)
+static int io_ctl_check_crc(struct btrfs_io_ctl *io_ctl, int index)
 {
u32 *tmp, val;
u32 crc = ~(u32)0;
@@ -494,7 +483,7 @@ static int io_ctl_check_crc(struct io_ctl *io_ctl, int index)
return 0;
 }
 
-static int io_ctl_add_entry(struct io_ctl *io_ctl, u64 offset, u64 bytes,
+static int io_ctl_add_entry(struct btrfs_io_ctl *io_ctl, u64 offset, u64 bytes,
void *bitmap)
 {
struct btrfs_free_space_entry *entry;
@@ -524,7 +513,7 @@ static int io_ctl_add_entry(

[PATCH 0/4] btrfs: reduce block group cache writeout times during commit

2015-04-13 Thread Chris Mason
Large filesystems with lots of block groups can suffer long stalls during
commit while we create and send down all of the block group caches.  The
more blocks groups dirtied in a transaction, the longer these stalls can be.
Some workloads average 10 seconds per commit, but see peak times much higher.

The first problem is that we write and wait for each block group cache
individually, so we aren't keeping the disk pipeline full.  This patch
set uses the io_ctl struct to start cache IO, and then waits on it in bulk.

The second problem is that we only allow cache writeout while new modifications
are blocked during the final stage of commit.  This adds some locking
so that cache writeout can happen very early in the commit, and any block
groups that are redirtied will be sent down during the final stages.

With both together, average commit stalls are under a second and our overall
performance is much smoother.



[PATCH 4/4] Btrfs: allow block group cache writeout outside critical section in commit

2015-04-13 Thread Chris Mason
We loop through all of the dirty block groups during commit and write
the free space cache.  In order to make sure the cache is correct, we do
this while no other writers are allowed in the commit.

If a large number of block groups are dirty, this can introduce long
stalls during the final stages of the commit, which can block new procs
trying to change the filesystem.

This commit changes the block group cache writeout to take appropriate
locks and allow it to run earlier in the commit.  We'll still have to
redo some of the block groups, but it means we can get most of the work
out of the way without blocking the entire FS.

Signed-off-by: Chris Mason 
---
 fs/btrfs/ctree.h|   8 ++
 fs/btrfs/disk-io.c  |   1 +
 fs/btrfs/extent-tree.c  | 241 +++-
 fs/btrfs/free-space-cache.c |  69 +++--
 fs/btrfs/free-space-cache.h |   1 +
 fs/btrfs/inode-map.c|   2 +-
 fs/btrfs/relocation.c   |   9 +-
 fs/btrfs/transaction.c  |  38 ++-
 fs/btrfs/transaction.h  |   9 ++
 9 files changed, 341 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1df0d9d..83051fa 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1491,6 +1491,12 @@ struct btrfs_fs_info {
struct mutex chunk_mutex;
struct mutex volume_mutex;
 
+   /*
+* this is taken to make sure we don't set block groups ro after
+* the free space cache has been allocated on them
+*/
+   struct mutex ro_block_group_mutex;
+
/* this is used during read/modify/write to make sure
 * no two ios are trying to mod the same stripe at the same
 * time
@@ -3407,6 +3413,8 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 u64 bytenr, u64 num_bytes, u64 parent,
u64 root_objectid, u64 owner, u64 offset, int no_quota);
 
+int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root);
 int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
 int btrfs_setup_space_cache(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 568cc4e..b5e3d5f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2572,6 +2572,7 @@ int open_ctree(struct super_block *sb,
mutex_init(&fs_info->transaction_kthread_mutex);
mutex_init(&fs_info->cleaner_mutex);
mutex_init(&fs_info->volume_mutex);
+   mutex_init(&fs_info->ro_block_group_mutex);
init_rwsem(&fs_info->commit_root_sem);
init_rwsem(&fs_info->cleanup_work_sem);
init_rwsem(&fs_info->subvol_sem);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 40c9513..02c2b29 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3298,7 +3298,7 @@ again:
if (ret)
goto out_put;
 
-   ret = btrfs_truncate_free_space_cache(root, trans, inode);
+   ret = btrfs_truncate_free_space_cache(root, trans, NULL, inode);
if (ret)
goto out_put;
}
@@ -3382,20 +3382,156 @@ int btrfs_setup_space_cache(struct btrfs_trans_handle *trans,
return 0;
 }
 
-int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
+/*
+ * transaction commit does final block group cache writeback during a
+ * critical section where nothing is allowed to change the FS.  This is
+ * required in order for the cache to actually match the block group,
+ * but can introduce a lot of latency into the commit.
+ *
+ * So, btrfs_start_dirty_block_groups is here to kick off block group
+ * cache IO.  There's a chance we'll have to redo some of it if the
+ * block group changes again during the commit, but it greatly reduces
+ * the commit latency by getting rid of the easy block groups while
+ * we're still allowing others to join the commit.
+ */
+int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans,
   struct btrfs_root *root)
 {
struct btrfs_block_group_cache *cache;
struct btrfs_transaction *cur_trans = trans->transaction;
int ret = 0;
int should_put;
-   struct btrfs_path *path;
-   LIST_HEAD(io);
+   struct btrfs_path *path = NULL;
+   LIST_HEAD(dirty);
+   struct list_head *io = &cur_trans->io_bgs;
int num_started = 0;
-   int num_waited = 0;
+   int loops = 0;
+
+   spin_lock(&cur_trans->dirty_bgs_lock);
+   if (!list_empty(&cur_trans->dirty_bgs)) {
+   list_splice_init(&cur_trans->dirty_bgs, &dirty);
+   }
+   spin_unlock(&cur_trans->dirty_bgs_lock);
 
-   if (list_empty(&cur_trans->dirty_bgs))
+again:
+   if (list_empty(&dirty)) {
+   btrfs_free_path(path);
return 0;
+   }
+
+ 

[PATCH 0/5] Btrfs: truncate space reservation fixes

2015-04-13 Thread Chris Mason
One of the production workloads here at FB ends up creating and eventually
deleting very large files.  We were consistently hitting ENOSPC aborts
while trying to delete the files because there wasn't enough metadata
reserved to cover deleting CRCs or actually updating the block group
items on disk.

This patchset addresses these problems by adding crc items into the
math for delayed ref processing, and changing the truncate items loop
to reserve metadata more often.

It also solves a performance problem where we are constantly committing
the transaction in hopes of making enospc progress.



[PATCH 4/5] Btrfs: don't commit the transaction in the async space flushing

2015-04-13 Thread Chris Mason
From: Josef Bacik 

We're triggering a huge number of commits from
btrfs_async_reclaim_metadata_space.  These aren't really required,
because everyone calling the async reclaim code is going to end up
triggering a commit on their own.

Signed-off-by: Chris Mason 
---
 fs/btrfs/extent-tree.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ae8db3ba..3d4b3d680 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4329,8 +4329,13 @@ out:
 static inline int need_do_async_reclaim(struct btrfs_space_info *space_info,
struct btrfs_fs_info *fs_info, u64 used)
 {
-   return (used >= div_factor_fine(space_info->total_bytes, 98) &&
-   !btrfs_fs_closing(fs_info) &&
+   u64 thresh = div_factor_fine(space_info->total_bytes, 98);
+
+   /* If we're just plain full then async reclaim just slows us down. */
+   if (space_info->bytes_used >= thresh)
+   return 0;
+
+   return (used >= thresh && !btrfs_fs_closing(fs_info) &&
!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
 }
 
@@ -4385,10 +4390,7 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
if (!btrfs_need_do_async_reclaim(space_info, fs_info,
 flush_state))
return;
-   } while (flush_state <= COMMIT_TRANS);
-
-   if (btrfs_need_do_async_reclaim(space_info, fs_info, flush_state))
-   queue_work(system_unbound_wq, work);
+   } while (flush_state < COMMIT_TRANS);
 }
 
 void btrfs_init_async_reclaim_work(struct work_struct *work)
-- 
1.8.1



[PATCH 2/5] Btrfs: refill block reserves during truncate

2015-04-13 Thread Chris Mason
When truncate starts, it allocates some space in the block reserves so
that we'll have enough to update metadata along the way.

For very large files, we can easily go through all of that space as we
loop through the extents.  This changes truncate to refill the space
reservation as it progresses through the file.

Signed-off-by: Chris Mason 
---
 fs/btrfs/ctree.h   |  3 +++
 fs/btrfs/extent-tree.c |  9 -
 fs/btrfs/inode.c   | 45 +++--
 3 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 95944b8..6bf16d5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3297,6 +3297,9 @@ static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping)
 }
 
 /* extent-tree.c */
+
+u64 btrfs_csum_bytes_to_leaves(struct btrfs_root *root, u64 csum_bytes);
+
 static inline u64 btrfs_calc_trans_metadata_size(struct btrfs_root *root,
 unsigned num_items)
 {
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a6f88eb..75f4bed 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2636,7 +2636,7 @@ static inline u64 heads_to_leaves(struct btrfs_root *root, u64 heads)
  * Takes the number of bytes to be csumm'ed and figures out how many leaves it
  * would require to store the csums for that many bytes.
  */
-static u64 csum_bytes_to_leaves(struct btrfs_root *root, u64 csum_bytes)
+u64 btrfs_csum_bytes_to_leaves(struct btrfs_root *root, u64 csum_bytes)
 {
u64 csum_size;
u64 num_csums_per_leaf;
@@ -2665,7 +2665,7 @@ int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans,
if (num_heads > 1)
num_bytes += (num_heads - 1) * root->nodesize;
num_bytes <<= 1;
-   num_bytes += csum_bytes_to_leaves(root, csum_bytes) * root->nodesize;
+   num_bytes += btrfs_csum_bytes_to_leaves(root, csum_bytes) * 
root->nodesize;
global_rsv = &root->fs_info->global_block_rsv;
 
/*
@@ -5098,13 +5098,12 @@ static u64 calc_csum_metadata_size(struct inode *inode, 
u64 num_bytes,
BTRFS_I(inode)->csum_bytes == 0)
return 0;
 
-   old_csums = csum_bytes_to_leaves(root, BTRFS_I(inode)->csum_bytes);
-
+   old_csums = btrfs_csum_bytes_to_leaves(root, 
BTRFS_I(inode)->csum_bytes);
if (reserve)
BTRFS_I(inode)->csum_bytes += num_bytes;
else
BTRFS_I(inode)->csum_bytes -= num_bytes;
-   num_csums = csum_bytes_to_leaves(root, BTRFS_I(inode)->csum_bytes);
+   num_csums = btrfs_csum_bytes_to_leaves(root, 
BTRFS_I(inode)->csum_bytes);
 
/* No change, no need to reserve more */
if (old_csums == num_csums)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cec23cf..88537c5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4163,6 +4163,21 @@ out:
return err;
 }
 
+static int truncate_space_check(struct btrfs_trans_handle *trans,
+   struct btrfs_root *root,
+   u64 bytes_deleted)
+{
+   int ret;
+
+   bytes_deleted = btrfs_csum_bytes_to_leaves(root, bytes_deleted);
+   ret = btrfs_block_rsv_add(root, &root->fs_info->trans_block_rsv,
+ bytes_deleted, BTRFS_RESERVE_NO_FLUSH);
+   if (!ret)
+   trans->bytes_reserved += bytes_deleted;
+   return ret;
+
+}
+
 /*
  * this can truncate away extent items, csum items and directory items.
  * It starts at a high offset and removes keys until it can't find
@@ -4201,6 +4216,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
u64 bytes_deleted = 0;
bool be_nice = 0;
bool should_throttle = 0;
+   bool should_end = 0;
 
BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY);
 
@@ -4396,6 +4412,8 @@ delete:
} else {
break;
}
+   should_throttle = 0;
+
if (found_extent &&
(test_bit(BTRFS_ROOT_REF_COWS, &root->state) ||
 root == root->fs_info->tree_root)) {
@@ -4409,17 +4427,24 @@ delete:
if (btrfs_should_throttle_delayed_refs(trans, root))
btrfs_async_run_delayed_refs(root,
trans->delayed_ref_updates * 2, 0);
+   if (be_nice) {
+   if (truncate_space_check(trans, root,
+extent_num_bytes)) {
+   should_end = 1;
+   }
+   if (btrfs_should_throttle_delayed_refs(trans,
+  root)) {
+   should_throttle = 1;
+   }
+ 

[PATCH 1/5] Btrfs: account for crcs in delayed ref processing

2015-04-13 Thread Chris Mason
From: Josef Bacik 

As we delete large extents, we end up doing huge amounts of COW in order
to delete the corresponding crcs.  This adds accounting so that we keep
track of that space and flushing of delayed refs so that we don't build
up too much delayed crc work.

This helps limit the delayed work that must be done at commit time and
tries to avoid ENOSPC aborts because the crcs eat all the global
reserves.
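
The amount of csum metadata behind a large delete is easy to underestimate. The leaf-count math this series leans on (btrfs_csum_bytes_to_leaves(), exported in patch 2/5) boils down to the following sketch, with made-up example geometry (4K sectors, 4-byte crc32c items, ~16K of usable leaf space) — the real code derives these values from the root:

```c
#include <stdint.h>
#include <assert.h>

/*
 * How many leaves are needed to hold csums for 'bytes' of data?
 * One csum item covers sectorsize bytes; a leaf holds a fixed
 * number of items.  Mirrors the shape of csum_bytes_to_leaves(),
 * with hard-coded example geometry instead of per-root values.
 */
uint64_t toy_csum_bytes_to_leaves(uint64_t bytes)
{
    const uint64_t sectorsize = 4096;      /* bytes covered per csum */
    const uint64_t csum_size = 4;          /* crc32c item size */
    const uint64_t leaf_space = 16 * 1024; /* usable bytes per leaf */
    uint64_t num_csums = (bytes + sectorsize - 1) / sectorsize;
    uint64_t csums_per_leaf = leaf_space / csum_size;

    return (num_csums + csums_per_leaf - 1) / csums_per_leaf;
}
```

With these example numbers, deleting 1GiB of data touches 64 csum leaves, each of which may need to be COWed — which is the space the delayed-ref accounting now tracks.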

Signed-off-by: Chris Mason 
---
 fs/btrfs/delayed-ref.c | 22 --
 fs/btrfs/delayed-ref.h | 10 ++
 fs/btrfs/extent-tree.c | 46 +++---
 fs/btrfs/inode.c   | 25 ++---
 fs/btrfs/transaction.c |  4 
 5 files changed, 83 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 6d16bea..8f8ed7d 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -489,11 +489,13 @@ update_existing_ref(struct btrfs_trans_handle *trans,
  * existing and update must have the same bytenr
  */
 static noinline void
-update_existing_head_ref(struct btrfs_delayed_ref_node *existing,
+update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs,
+struct btrfs_delayed_ref_node *existing,
 struct btrfs_delayed_ref_node *update)
 {
struct btrfs_delayed_ref_head *existing_ref;
struct btrfs_delayed_ref_head *ref;
+   int old_ref_mod;
 
existing_ref = btrfs_delayed_node_to_head(existing);
ref = btrfs_delayed_node_to_head(update);
@@ -541,7 +543,20 @@ update_existing_head_ref(struct btrfs_delayed_ref_node 
*existing,
 * only need the lock for this case cause we could be processing it
 * currently, for refs we just added we know we're a-ok.
 */
+   old_ref_mod = existing_ref->total_ref_mod;
existing->ref_mod += update->ref_mod;
+   existing_ref->total_ref_mod += update->ref_mod;
+
+   /*
+* If we are going to from a positive ref mod to a negative or vice
+* versa we need to make sure to adjust pending_csums accordingly.
+*/
+   if (existing_ref->is_data) {
+   if (existing_ref->total_ref_mod >= 0 && old_ref_mod < 0)
+   delayed_refs->pending_csums -= existing->num_bytes;
+   if (existing_ref->total_ref_mod < 0 && old_ref_mod >= 0)
+   delayed_refs->pending_csums += existing->num_bytes;
+   }
spin_unlock(&existing_ref->lock);
 }
 
@@ -605,6 +620,7 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
head_ref->is_data = is_data;
head_ref->ref_root = RB_ROOT;
head_ref->processing = 0;
+   head_ref->total_ref_mod = count_mod;
 
spin_lock_init(&head_ref->lock);
mutex_init(&head_ref->mutex);
@@ -614,7 +630,7 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
existing = htree_insert(&delayed_refs->href_root,
&head_ref->href_node);
if (existing) {
-   update_existing_head_ref(&existing->node, ref);
+   update_existing_head_ref(delayed_refs, &existing->node, ref);
/*
 * we've updated the existing ref, free the newly
 * allocated ref
@@ -622,6 +638,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
kmem_cache_free(btrfs_delayed_ref_head_cachep, head_ref);
head_ref = existing;
} else {
+   if (is_data && count_mod < 0)
+   delayed_refs->pending_csums += num_bytes;
delayed_refs->num_heads++;
delayed_refs->num_heads_ready++;
atomic_inc(&delayed_refs->num_entries);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index a764e23..5eb0892 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -88,6 +88,14 @@ struct btrfs_delayed_ref_head {
struct rb_node href_node;
 
struct btrfs_delayed_extent_op *extent_op;
+
+   /*
+* This is used to track the final ref_mod from all the refs associated
+* with this head ref, this is not adjusted as delayed refs are run,
+* this is meant to track if we need to do the csum accounting or not.
+*/
+   int total_ref_mod;
+
/*
 * when a new extent is allocated, it is just reserved in memory
 * The actual extent isn't inserted into the extent allocation tree
@@ -138,6 +146,8 @@ struct btrfs_delayed_ref_root {
/* total number of head nodes ready for processing */
unsigned long num_heads_ready;
 
+   u64 pending_csums;
+
/*
 * set when the tree is flushing before a transaction commit,
 * used by the throttling code to decide if new updates need
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 41e5812..a6f88eb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2538,6 +2538

[PATCH 5/5] Btrfs: don't steal from the global reserve if we don't have the space

2015-04-13 Thread Chris Mason
From: Josef Bacik 

btrfs_evict_inode() needs to be more careful about stealing from the
global_rsv.  We don't want to end up aborting the commit with ENOSPC just
because the evict_inode code was too greedy.

Signed-off-by: Chris Mason 
---
 fs/btrfs/inode.c | 46 --
 1 file changed, 44 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 88537c5..141df0c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5010,6 +5010,7 @@ void btrfs_evict_inode(struct inode *inode)
struct btrfs_trans_handle *trans;
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_block_rsv *rsv, *global_rsv;
+   int steal_from_global = 0;
u64 min_size = btrfs_calc_trunc_metadata_size(root, 1);
int ret;
 
@@ -5077,9 +5078,20 @@ void btrfs_evict_inode(struct inode *inode)
 * hard as possible to get this to work.
 */
if (ret)
-   ret = btrfs_block_rsv_migrate(global_rsv, rsv, 
min_size);
+   steal_from_global++;
+   else
+   steal_from_global = 0;
+   ret = 0;
 
-   if (ret) {
+   /*
+* steal_from_global == 0: we reserved stuff, hooray!
+* steal_from_global == 1: we didn't reserve stuff, boo!
+* steal_from_global == 2: we've committed, still not a lot of
+* room but maybe we'll have room in the global reserve this
+* time.
+* steal_from_global == 3: abandon all hope!
+*/
+   if (steal_from_global > 2) {
btrfs_warn(root->fs_info,
"Could not get space for a delete, will 
truncate on mount %d",
ret);
@@ -5095,6 +5107,36 @@ void btrfs_evict_inode(struct inode *inode)
goto no_delete;
}
 
+   /*
+* We can't just steal from the global reserve, we need tomake
+* sure there is room to do it, if not we need to commit and try
+* again.
+*/
+   if (steal_from_global) {
+   if (!btrfs_check_space_for_delayed_refs(trans, root))
+   ret = btrfs_block_rsv_migrate(global_rsv, rsv,
+ min_size);
+   else
+   ret = -ENOSPC;
+   }
+
+   /*
+* Couldn't steal from the global reserve, we have too much
+* pending stuff built up, commit the transaction and try it
+* again.
+*/
+   if (ret) {
+   ret = btrfs_commit_transaction(trans, root);
+   if (ret) {
+   btrfs_orphan_del(NULL, inode);
+   btrfs_free_block_rsv(root, rsv);
+   goto no_delete;
+   }
+   continue;
+   } else {
+   steal_from_global = 0;
+   }
+
trans->block_rsv = rsv;
 
ret = btrfs_truncate_inode_items(trans, root, inode, 0, 0);
-- 
1.8.1



[PATCH 3/5] Btrfs: reserve space for block groups

2015-04-13 Thread Chris Mason
From: Josef Bacik 

This changes our delayed refs calculations to include the space needed
to write back dirty block groups.

Signed-off-by: Chris Mason 
---
 fs/btrfs/extent-tree.c | 12 +---
 fs/btrfs/transaction.c |  1 +
 fs/btrfs/transaction.h |  1 +
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 75f4bed..ae8db3ba 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2657,7 +2657,8 @@ int btrfs_check_space_for_delayed_refs(struct 
btrfs_trans_handle *trans,
struct btrfs_block_rsv *global_rsv;
u64 num_heads = trans->transaction->delayed_refs.num_heads_ready;
u64 csum_bytes = trans->transaction->delayed_refs.pending_csums;
-   u64 num_bytes;
+   u64 num_dirty_bgs = trans->transaction->num_dirty_bgs;
+   u64 num_bytes, num_dirty_bgs_bytes;
int ret = 0;
 
num_bytes = btrfs_calc_trans_metadata_size(root, 1);
@@ -2666,17 +2667,21 @@ int btrfs_check_space_for_delayed_refs(struct 
btrfs_trans_handle *trans,
num_bytes += (num_heads - 1) * root->nodesize;
num_bytes <<= 1;
num_bytes += btrfs_csum_bytes_to_leaves(root, csum_bytes) * 
root->nodesize;
+   num_dirty_bgs_bytes = btrfs_calc_trans_metadata_size(root,
+num_dirty_bgs);
global_rsv = &root->fs_info->global_block_rsv;
 
/*
 * If we can't allocate any more chunks lets make sure we have _lots_ of
 * wiggle room since running delayed refs can create more delayed refs.
 */
-   if (global_rsv->space_info->full)
+   if (global_rsv->space_info->full) {
+   num_dirty_bgs_bytes <<= 1;
num_bytes <<= 1;
+   }
 
spin_lock(&global_rsv->lock);
-   if (global_rsv->reserved <= num_bytes)
+   if (global_rsv->reserved <= num_bytes + num_dirty_bgs_bytes)
ret = 1;
spin_unlock(&global_rsv->lock);
return ret;
@@ -5408,6 +5413,7 @@ static int update_block_group(struct btrfs_trans_handle 
*trans,
if (list_empty(&cache->dirty_list)) {
list_add_tail(&cache->dirty_list,
  &trans->transaction->dirty_bgs);
+   trans->transaction->num_dirty_bgs++;
btrfs_get_block_group(cache);
}
spin_unlock(&trans->transaction->dirty_bgs_lock);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 8b9eea8..234d606 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -251,6 +251,7 @@ loop:
INIT_LIST_HEAD(&cur_trans->switch_commits);
INIT_LIST_HEAD(&cur_trans->pending_ordered);
INIT_LIST_HEAD(&cur_trans->dirty_bgs);
+   cur_trans->num_dirty_bgs = 0;
spin_lock_init(&cur_trans->dirty_bgs_lock);
list_add_tail(&cur_trans->list, &fs_info->trans_list);
extent_io_tree_init(&cur_trans->dirty_pages,
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 96b189b..4cb0ae2 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -64,6 +64,7 @@ struct btrfs_transaction {
struct list_head pending_ordered;
struct list_head switch_commits;
struct list_head dirty_bgs;
+   u64 num_dirty_bgs;
spinlock_t dirty_bgs_lock;
struct btrfs_delayed_ref_root delayed_refs;
int aborted;
-- 
1.8.1



Re: [PATCH/RFC] fscache/cachefiles versus btrfs

2015-04-13 Thread Christoph Hellwig
On Fri, Apr 10, 2015 at 02:28:16PM +0100, David Howells wrote:
> Dave Chinner  wrote:
> 
> > SEEK_HOLE/SEEK_DATA is what you want, as they are page cache
> > coherent, not extent based operations. And, really if you need it to
> > really be able to find real holes, then a superblock flag might be a
> > better way of marking filesystems with the required capability.
> 
> Actually, I wonder if what I want is a kernel_read() that returns ENODATA upon
> encountering a hole at the beginning of the area to be read.

NFS READ_PLUS could also make use of this, but someone needs to actually
implement it.

Until we have that lseek SEEK_HOLE/DATA is the way to go, and the
horrible ->bmap hack needs to die ASAP, I can't believe you managed to
sneak that in in the not too distant past.
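
Under the lseek approach endorsed here, a caller probes for a hole at a given offset roughly as follows. This is a userspace sketch (a kernel-side caller would go through vfs_llseek instead); `byte_in_hole` is a made-up helper name. Note that filesystems without hole support fall back to reporting the whole file as one data extent, so a "hole" answer is reliable while a "data" answer may hide an unreported hole:

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <assert.h>
#include <sys/types.h>

/*
 * Return 1 if the byte at 'pos' falls in a hole, 0 if it is data,
 * -1 on error (including ENXIO for pos at or past EOF).
 */
int byte_in_hole(int fd, off_t pos)
{
    /* SEEK_HOLE returns the offset of the first hole >= pos. */
    off_t hole = lseek(fd, pos, SEEK_HOLE);

    if (hole == (off_t)-1)
        return -1;
    /* If the first hole is exactly pos, we are sitting in one. */
    return hole == pos;
}
```

This is essentially the check a kernel_read() variant would perform to return ENODATA when the read starts inside a hole.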


Re: Please add 9c4f61f01d269815bb7c37be3ede59c5587747c6 to stable

2015-04-13 Thread Greg KH
On Mon, Apr 13, 2015 at 07:28:38PM +0500, Roman Mamedov wrote:
> On Thu, 2 Apr 2015 10:17:47 -0400
> Chris Mason  wrote:
> 
> > Hi stable friends,
> > 
> > Can you please backport this one to 3.19.y.  It fixes a bug introduced 
> > by:
> > 
> > 381cf6587f8a8a8e981bc0c18859b51dc756, which was tagged for stable 
> > 3.14+
> > 
> > The symptoms of the bug are deadlocks during log reply after a crash.  
> > The patch wasn't intentionally fixing the deadlock, which is why we 
> > missed it when tagging fixes.
> 
> Unfortunately still not fixed (no btrfs-related changes) in 3.14.38 and
> 3.18.11 released today.

I have a few hundred stable backports left to sort through, don't worry,
this is still in the queue, it's not lost.

greg k-h


Re: [PATCH RFC 1/3] vfs: add copy_file_range syscall and vfs helper

2015-04-13 Thread Zach Brown
> > >> Could we perhaps instead of a length, define a 'pos_in_start' and a
> > >> 'pos_in_end' offset (with the latter being -1 for a full-file copy)
> > >> and then return an 'loff_t' value stating where the copy ended?
> > >
> > > Well, the resulting offset will be set if the caller provided it.  So
> > > they could already be getting the copied length from that.  But they
> > > might not specify the offsets.  Maybe they're just using the results to
> > > total up a completion indicator.
> > >
> > > Maybe we could make the length a pointer like the offsets that's set to
> > > the copied length on return.
> > 
> > That works, but why do we care so much about the difference between a
> > length and an offset as a return value?
> > 
> 
> I think it just comes down to potential confusion for users. What's
> more useful, the number of bytes actually copied, or the offset into the
> file where the copy ended?
> 
> I tend to the think an offset is more useful for someone trying to
> copy a file in chunks, particularly if the file is sparse. That gives
> them a clear place to continue the copy.
> 
> So, I think I agree with Trond that phrasing this interface in terms of
> file offsets seems like it might be more useful. That also neatly
> sidesteps the size_t limitations on 32-bit platforms.

Yeah, fair enough.  I'll rework it.

> > To be fair, the NFS copy offload also allows the copy to proceed out
> > of order, in which case the range of copied data could be
> > non-contiguous in the case of a failure. However neither the length
> > nor the offset case will give you the full story in that case. Any
> > return value can at best be considered to define an offset range whose
> > contents need to be checked for success/failure.
> > 
> 
> Yuck! How the heck do you clean up the mess if that happens? I guess
> you're just stuck redoing the copy with normal READ/WRITE?

I don't think anyone will worry about checking file contents.

Yes, technically you can get fragmented completion past the initial
contiguous region that the interface told you is done.   You can get
that with O_DIRECT today.

But it's a rare case that is not worth worrying about.  You'll retry at
the contiguous offset until it doesn't make progress and then fall back
to read/write.

- z
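
For context, the interface that eventually landed as copy_file_range(2) (syscall in Linux 4.5, glibc wrapper since 2.27) returns the number of bytes copied and advances the caller's offset pointers, so the "retry at the contiguous offset until no progress" loop described above looks roughly like this; `copy_range_loop` is an invented wrapper name:

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <assert.h>
#include <sys/types.h>

/*
 * Copy 'len' bytes from fd_in to fd_out starting at the given
 * offsets, resuming from wherever the previous call stopped.
 * Returns total bytes copied, or -1 if nothing was copied.
 */
ssize_t copy_range_loop(int fd_in, loff_t off_in,
                        int fd_out, loff_t off_out, size_t len)
{
    ssize_t total = 0;

    while (len > 0) {
        /* The kernel updates off_in/off_out past the copied range. */
        ssize_t n = copy_file_range(fd_in, &off_in,
                                    fd_out, &off_out, len, 0);
        if (n < 0)
            return total ? total : -1;
        if (n == 0)          /* hit EOF on the source */
            break;
        total += n;
        len -= (size_t)n;
    }
    return total;
}
```

A caller that stops making progress here falls back to plain read/write, exactly as suggested.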


Re: ERROR: error removing the device '/dev/sdXN' - Inappropriate ioctl for device

2015-04-13 Thread Anand Jain


Thanks Martin for helping with the data.

Unfortunately, no matter what I try, I couldn't reproduce the
"Inappropriate ioctl for device" error.

But I did get EINVAL, which is wrong as well, so I sent a patch for
that. EIO is the appropriate error in this case.


Next, passing a devid to delete a device - yes, that would help;
the patch is ready in my workspace. BUT I just noticed that the
error handling code puts the FS into readonly when there is a
commit failure, so that is something which should be fixed along
with this patch, which I am attempting. As in your case with a
3-disk raid1, the reconstructed raid1 will still be considered
healthy after taking out a disk.

Further, to recover from your situation you could try replace;
that will work. OR, if you are not planning to add another disk,
try this (sorry, it needs down time):

umount
remove failed disk from the system
mount -o degraded
btrfs dev del missing 
mount -o remount
(I hope you would check / manage the space availability part)


(By the way, those "missing" messages are fabricated in userland,
 btrfs-progs: 206efb60cbe3049e0d44c6da3c1909aeee18f813,

so don't depend on them. To know what the kernel knows you
probably need /proc/fs/btrfs/devlist or sysfs, as posted
in the ML before.

btrfs fi show -m

was written to show the actual kernel visibility, but it
was again crippled by the above commit id.)

David, we should back out 206efb60cbe3049e0d44c6da3c1909aeee18f813;
as mentioned before, it will help.


On 04/07/2015 07:41 PM, Martin wrote:

On 06/04/15 14:32, Anand Jain wrote:

btrfs fi show -d



That gives:

# btrfs fi show -d
warning, device 3 is missing
warning devid 3 not found already
warning, device 3 is missing
warning devid 3 not found already


David,

 As commented before - you shouldn't have integrated the patch

  915902c5002485fb13d27c4b699a73fb66cc0f09


Thanks,  Anand





Label: 'btrfs_root'  uuid: 92452e9a-2775-45c4-922c-f01b2afd51c2
 Total devices 3 FS bytes used 30.94GiB
 devid1 size 24.00GiB used 24.00GiB path /dev/sda4
 devid2 size 24.00GiB used 24.00GiB path /dev/sdc4
 devid3 size 24.00GiB used 24.00GiB path /dev/sde4

Label: 'btrfs_data'  uuid: d1b96638-be89-4291-8a40-f2f2e1dc5223
 Total devices 3 FS bytes used 95.74GiB
 devid1 size 87.24GiB used 86.48GiB path /dev/sda5
 devid2 size 87.24GiB used 87.24GiB path /dev/sdc5
 devid3 size 87.24GiB used 87.24GiB path /dev/sde5

Label: 'btrfs_root2'  uuid: 62603ce8-c333-4ca7-92f7-f8bdd712ab37
 Total devices 3 FS bytes used 151.60MiB
 devid1 size 24.00GiB used 24.00GiB path /dev/sdb4
 devid2 size 24.00GiB used 24.00GiB path /dev/sdd4
 *** Some devices missing

Label: 'btrfs_data2'  uuid: 3aaee716-b98b-4c86-ba5a-53456994f152
 Total devices 3 FS bytes used 159.34GiB
 devid1 size 206.47GiB used 206.02GiB path /dev/sdb5
 devid2 size 206.47GiB used 206.47GiB path /dev/sdd5
 *** Some devices missing

btrfs-progs v3.19.1


And without the "-d":

# btrfs fi show
Label: 'btrfs_root'  uuid: 92452e9a-2775-45c4-922c-f01b2afd51c2
 Total devices 3 FS bytes used 30.94GiB
 devid1 size 24.00GiB used 24.00GiB path /dev/sda4
 devid2 size 24.00GiB used 24.00GiB path /dev/sdc4
 devid3 size 24.00GiB used 24.00GiB path /dev/sde4

Label: 'btrfs_data'  uuid: d1b96638-be89-4291-8a40-f2f2e1dc5223
 Total devices 3 FS bytes used 95.74GiB
 devid1 size 87.24GiB used 86.48GiB path /dev/sda5
 devid2 size 87.24GiB used 87.24GiB path /dev/sdc5
 devid3 size 87.24GiB used 87.24GiB path /dev/sde5

Label: 'btrfs_root2'  uuid: 62603ce8-c333-4ca7-92f7-f8bdd712ab37
 Total devices 3 FS bytes used 151.60MiB
 devid1 size 24.00GiB used 24.00GiB path /dev/sdb4
 devid2 size 24.00GiB used 24.00GiB path /dev/sdd4
 devid3 size 24.00GiB used 24.00GiB path /dev/sdf4

Label: 'btrfs_data2'  uuid: 3aaee716-b98b-4c86-ba5a-53456994f152
 Total devices 3 FS bytes used 159.34GiB
 devid1 size 206.47GiB used 206.02GiB path /dev/sdb5
 devid2 size 206.47GiB used 206.47GiB path /dev/sdd5
 devid3 size 206.47GiB used 206.47GiB path /dev/sdf5

btrfs-progs v3.19.1


Interestingly, all the log messages about /dev/sdf are now no longer
being repeated.

(And nope, not had a chance to swap that disk yet!)


Hence, should I do a "btrfs device delete missing /mnt/data2"?

Cheers,
Martin





Re: [PATCH 8/9] btrfs: wait for delayed iputs on no space

2015-04-13 Thread Chris Mason
On 04/09/2015 12:34 AM, Zhaolei wrote:
> From: Zhao Lei 
> 
> btrfs will report no_space when we run following write and delete
> file loop:
>  # FILE_SIZE_M=[ 75% of fs space ]
>  # DEV=[ some dev ]
>  # MNT=[ some dir ]
>  #
>  # mkfs.btrfs -f "$DEV"
>  # mount -o nodatacow "$DEV" "$MNT"
>  # for ((i = 0; i < 100; i++)); do dd if=/dev/zero of="$MNT"/file0 bs=1M 
> count="$FILE_SIZE_M"; rm -f "$MNT"/file0; done
>  #
> 
> Reason:
>  iput() and evict() is run after write pages to block device, if
>  write pages work is not finished before next write, the "rm"ed space
>  is not freed, and caused above bug.
> 
> Fix:
>  We can add "-o flushoncommit" mount option to avoid above bug, but
>  it have performance problem. Actually, we can to wait for on-the-fly
>  writes only when no-space happened, it is which this patch do.

Can you please change this so we only do this flush if the first commit
doesn't free up enough space?  I think this is going to have a
performance impact as the FS fills up.

-chris



Re: [GSoC 2015] Btrfs content based storage

2015-04-13 Thread David Sterba
On Fri, Mar 27, 2015 at 10:58:42AM -0400, harshad shirwadkar wrote:
> I am a CS graduate student from Carnegie Mellon University. I am
> hoping to build the feature - "Content based storage mode" under
> Google Summer of Code 2015. This project has also been listed as an
> idea on BTRFS ideas page. However, I have not found a mentor yet, and
> without a mentor I can not participate in the program. Please let me
> know if anybody is interested in mentoring this project. Here is a
> link to my proposal:
> 
> http://harshadjs.github.io/2015/03/27/Fedora-BTRFS-Content-Storage-Mode/

This probably has a significant overlap with the in-band dedup work from
Liu bo [1]. Your proposal expects an interface to look up the data by
hash which hasn't been implemented afaik.

[1] http://thread.gmane.org/gmane.comp.file-systems.btrfs/34097 (v10)


Re: Degraded volume silently fails to mount

2015-04-13 Thread Michael Tharp

On 4/13/2015 10:07, Hugo Mills wrote:

On Mon, Apr 13, 2015 at 09:51:09AM -0400, Michael Tharp wrote:

Hi list,

I've got a 4 disk raid1 volume that has one failed disk. I have so
far been unable to mount it in degraded mode, but the failure is
that "mount" silently does nothing.


Check to see if systemd is unmounting it immediately after
mount. This seems to be the usual reason for silent failures to mount
an FS these days.


Sigh, that was it. Thanks. Faith in btrfs restored.

I had a custom unit file because the generated ones weren't getting the 
LUKS device dependencies correct. When the drive failed I commented out 
the fstab and crypttab entries but forgot about the custom unit file.



[PATCH 1/1] Btrfs: __btrfs_std_error() logic should be consistent w/out CONFIG_PRINTK defined

2015-04-13 Thread Anand Jain
The error handling logic behaves differently with or without
CONFIG_PRINTK defined, since there are two copies of the same
function which differ slightly in logic.

One, when CONFIG_PRINTK is defined, code is

__btrfs_std_error(..)
{
::
   save_error_info(fs_info);
   if (sb->s_flags & MS_BORN)
   btrfs_handle_error(fs_info);
}

and two when CONFIG_PRINTK is not defined, the code is

__btrfs_std_error(..)
{
::
   if (sb->s_flags & MS_BORN) {
   save_error_info(fs_info);
   btrfs_handle_error(fs_info);
}
}

I doubt this was intentional; it appears to have happened because
we maintain two copies of the same function and they diverged
over time as commits were applied.

To decide which logic is correct, I reviewed the changes below:

 533574c6bc30cf526cc1c41bde050c854a945efb
This commit added the two copies of the function.

 cf79ffb5b79e8a2b587fbf218809e691bb396c98
This commit changed only one copy of the function - the copy
used when CONFIG_PRINTK is defined.

To fix this, instead of maintaining two copies of the same
function, maintain a single function and put the extra
portion of the code under the CONFIG_PRINTK define.

This patch does just that, keeping the code of the variant
with CONFIG_PRINTK defined.
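
The consolidated shape — one function body where only the printing is guarded — can be shown with a toy standalone example. All names here (`toy_std_error`, `TOY_PRINTK`, the stand-in globals) are invented for the sketch and the EROFS value is hard-coded; only the structure matches the patch:

```c
#include <stdio.h>
#include <assert.h>

/* Stand-ins for save_error_info() / btrfs_handle_error(). */
static int last_saved_errno;
static int handled;

/*
 * One copy of the function; only the log message is conditional.
 * Saving and handling happen identically whether or not TOY_PRINTK
 * is defined - the invariant the duplicated-function version broke.
 */
void toy_std_error(int errno_, int born)
{
    if (errno_ == -30 /* EROFS */)
        return;          /* already read-only: nothing to do */

#ifdef TOY_PRINTK
    fprintf(stderr, "toy: error %d\n", errno_);
#endif

    last_saved_errno = errno_;   /* always saved */
    if (born)
        handled = 1;             /* only once the sb is MS_BORN */
}
```

Because the save/handle path sits outside the #ifdef, the two build configurations can no longer diverge the way the duplicated copies did.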

Signed-off-by: Anand Jain 
---
 fs/btrfs/super.c | 27 +--
 1 file changed, 5 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 7533afb..b0a465f 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -130,7 +130,6 @@ static void btrfs_handle_error(struct btrfs_fs_info 
*fs_info)
}
 }
 
-#ifdef CONFIG_PRINTK
 /*
  * __btrfs_std_error decodes expected errors from the caller and
  * invokes the approciate error response.
@@ -139,7 +138,9 @@ void __btrfs_std_error(struct btrfs_fs_info *fs_info, const 
char *function,
   unsigned int line, int errno, const char *fmt, ...)
 {
struct super_block *sb = fs_info->sb;
+#ifdef CONFIG_PRINTK
const char *errstr;
+#endif
 
/*
 * Special case: if the error is EROFS, and we're already
@@ -148,6 +149,7 @@ void __btrfs_std_error(struct btrfs_fs_info *fs_info, const 
char *function,
if (errno == -EROFS && (sb->s_flags & MS_RDONLY))
return;
 
+#ifdef CONFIG_PRINTK
errstr = btrfs_decode_error(errno);
if (fmt) {
struct va_format vaf;
@@ -165,6 +167,7 @@ void __btrfs_std_error(struct btrfs_fs_info *fs_info, const 
char *function,
printk(KERN_CRIT "BTRFS: error (device %s) in %s:%d: errno=%d 
%s\n",
sb->s_id, function, line, errno, errstr);
}
+#endif
 
/* Don't go through full error handling during mount */
save_error_info(fs_info);
@@ -172,6 +175,7 @@ void __btrfs_std_error(struct btrfs_fs_info *fs_info, const 
char *function,
btrfs_handle_error(fs_info);
 }
 
+#ifdef CONFIG_PRINTK
 static const char * const logtypes[] = {
"emergency",
"alert",
@@ -211,27 +215,6 @@ void btrfs_printk(const struct btrfs_fs_info *fs_info, 
const char *fmt, ...)
 
va_end(args);
 }
-
-#else
-
-void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function,
-  unsigned int line, int errno, const char *fmt, ...)
-{
-   struct super_block *sb = fs_info->sb;
-
-   /*
-* Special case: if the error is EROFS, and we're already
-* under MS_RDONLY, then it is safe here.
-*/
-   if (errno == -EROFS && (sb->s_flags & MS_RDONLY))
-   return;
-
-   /* Don't go through full error handling during mount */
-   if (sb->s_flags & MS_BORN) {
-   save_error_info(fs_info);
-   btrfs_handle_error(fs_info);
-   }
-}
 #endif
 
 /*
-- 
2.0.0.153.g79d



[PATCH 1/1] Btrfs: SB read failure should return EIO for __bread failure

2015-04-13 Thread Anand Jain
This will return EIO when __bread() fails to read SB,
instead of EINVAL.

Signed-off-by: Anand Jain 
---
 fs/btrfs/disk-io.c | 18 +++---
 fs/btrfs/volumes.c |  8 
 2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 53c83c9..f47c643 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2410,8 +2410,8 @@ int open_ctree(struct super_block *sb,
 * Read super block and check the signature bytes only
 */
bh = btrfs_read_dev_super(fs_devices->latest_bdev);
-   if (!bh) {
-   err = -EINVAL;
+   if (IS_ERR(bh)) {
+   err = PTR_ERR(bh);
goto fail_alloc;
}
 
@@ -3093,6 +3093,7 @@ struct buffer_head *btrfs_read_dev_super(struct 
block_device *bdev)
int i;
u64 transid = 0;
u64 bytenr;
+   int ret = -EINVAL;
 
/* we would like to check all the supers, but that would make
 * a btrfs mount succeed after a mkfs from a different FS.
@@ -3106,13 +3107,20 @@ struct buffer_head *btrfs_read_dev_super(struct 
block_device *bdev)
break;
bh = __bread(bdev, bytenr / 4096,
BTRFS_SUPER_INFO_SIZE);
-   if (!bh)
+   /*
+* If we fail to read from the underlaying drivers, as of now
+* the best option we have is to mark it EIO.
+*/
+   if (!bh) {
+   ret = -EIO;
continue;
+   }
 
super = (struct btrfs_super_block *)bh->b_data;
if (btrfs_super_bytenr(super) != bytenr ||
btrfs_super_magic(super) != BTRFS_MAGIC) {
brelse(bh);
+   ret = -EINVAL;
continue;
}
 
@@ -3124,6 +3132,10 @@ struct buffer_head *btrfs_read_dev_super(struct 
block_device *bdev)
brelse(bh);
}
}
+
+   if (!latest)
+   return ERR_PTR(ret);
+
return latest;
 }
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0009fde..5536281 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -212,8 +212,8 @@ btrfs_get_bdev_and_sb(const char *device_path, fmode_t 
flags, void *holder,
}
invalidate_bdev(*bdev);
*bh = btrfs_read_dev_super(*bdev);
-   if (!*bh) {
-   ret = -EINVAL;
+   if (IS_ERR(*bh)) {
+   ret = PTR_ERR(*bh);
blkdev_put(*bdev, flags);
goto error;
}
@@ -6770,8 +6770,8 @@ int btrfs_scratch_superblock(struct btrfs_device *device)
struct btrfs_super_block *disk_super;
 
bh = btrfs_read_dev_super(device->bdev);
-   if (!bh)
-   return -EINVAL;
+   if (IS_ERR(bh))
+   return PTR_ERR(bh);
disk_super = (struct btrfs_super_block *)bh->b_data;
 
memset(&disk_super->magic, 0, sizeof(disk_super->magic));
-- 
2.0.0.153.g79d



[PATCH 1/1] Btrfs-progs: fix compile warnings

2015-04-13 Thread Anand Jain
Simple compile-time warning fixes.

cmds-check.c: In function ‘del_file_extent_hole’:
cmds-check.c:289: warning: ‘prev.len’ may be used uninitialized in this function
cmds-check.c:289: warning: ‘prev.start’ may be used uninitialized in this function
cmds-check.c:290: warning: ‘next.len’ may be used uninitialized in this function
cmds-check.c:290: warning: ‘next.start’ may be used uninitialized in this function

btrfs-calc-size.c: In function ‘print_seek_histogram’:
btrfs-calc-size.c:221: warning: ‘group_start’ may be used uninitialized in this function
btrfs-calc-size.c:223: warning: ‘group_end’ may be used uninitialized in this function

Signed-off-by: Anand Jain 
---
 btrfs-calc-size.c | 4 ++--
 cmds-check.c  | 3 +++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/btrfs-calc-size.c b/btrfs-calc-size.c
index 1372084..88f92e1 100644
--- a/btrfs-calc-size.c
+++ b/btrfs-calc-size.c
@@ -218,9 +218,9 @@ static void print_seek_histogram(struct root_stats *stat)
struct rb_node *n = rb_first(&stat->seek_root);
struct seek *seek;
u64 tick_interval;
-   u64 group_start;
+   u64 group_start = 0;
u64 group_count = 0;
-   u64 group_end;
+   u64 group_end = 0;
u64 i;
u64 max_seek = stat->max_seek_len;
int digits = 1;
diff --git a/cmds-check.c b/cmds-check.c
index ed8c698..de22185 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -293,6 +293,9 @@ static int del_file_extent_hole(struct rb_root *holes,
int have_next = 0;
int ret = 0;
 
+   memset(&prev, 0, sizeof(struct file_extent_hole));
+   memset(&next, 0, sizeof(struct file_extent_hole));
+
tmp.start = start;
tmp.len = len;
node = rb_search(holes, &tmp, compare_hole_range, NULL);
-- 
2.0.0.153.g79d



Re: Please add 9c4f61f01d269815bb7c37be3ede59c5587747c6 to stable

2015-04-13 Thread Roman Mamedov
On Thu, 2 Apr 2015 10:17:47 -0400
Chris Mason  wrote:

> Hi stable friends,
> 
> Can you please backport this one to 3.19.y.  It fixes a bug introduced 
> by:
> 
> 381cf6587f8a8a8e981bc0c18859b51dc756, which was tagged for stable 
> 3.14+
> 
> The symptoms of the bug are deadlocks during log replay after a crash.  
> The patch wasn't intentionally fixing the deadlock, which is why we 
> missed it when tagging fixes.

Unfortunately still not fixed (no btrfs-related changes) in 3.14.38 and
3.18.11 released today.

> 
> Please put this commit everywhere you've cherry-picked 
> 381cf6587f8a8a8e981bc0c18859b51dc756
> 
> commit 9c4f61f01d269815bb7c37be3ede59c5587747c6
> Author: David Sterba 
> Date:   Fri Jan 2 19:12:57 2015 +0100
> 
> btrfs: simplify insert_orphan_item
> 
> We can search and add the orphan item in one go,
> btrfs_insert_orphan_item will find out if the item already exists.
> 
> Signed-off-by: David Sterba 
> 
> -chris
> 
> 
> 


-- 
With respect,
Roman




Re: Big disk space usage difference, even after defrag, on identical data

2015-04-13 Thread Gian-Carlo Pascutto
On 13-04-15 07:06, Duncan wrote:

>> So what can explain this? Where did the 66G go?
> 
> Out of curiosity, does a balance on the actively used btrfs help?
> 
> You mentioned defrag -v -r -clzo, but didn't use the -f (flush) or -t 
> (minimum size file) options.  Does adding -f -t1 help?

Unfortunately I can no longer try this; see the other reply for why. But the
problem turned out to be some 1G-sized files, written using 3-5 extents,
that for whatever reason defrag was not touching.

> You aren't doing btrfs snapshots of either subvolume, are you?

No :-) I should've mentioned that.

> Defrag should force the rewrite of entire files and take care of this, 
> but obviously it's not returning to "clean" state.  I forgot what the 
> default minimum file size is if -t isn't set, maybe 128 MiB?  But a -t1 
> will force it to defrag even small files, and I recall at least one 
> thread here where the poster said it made all the difference for him, so 
> try that.  And the -f should force a filesystem sync afterward, so you 
> know the numbers from any report you run afterward match the final state.

Reading the corresponding manual, the -t explanation says that "any
extent bigger than this size will be considered already defragged". So I
guess setting -t1 might've fixed the problem too...but after checking
the source, I'm not so sure.

I didn't find the -t default in the manpages - after browsing through
the source, the default is in the kernel:
https://github.com/torvalds/linux/blob/4f671fe2f9523a1ea206f63fe60a7c7b3a56d5c7/fs/btrfs/ioctl.c#L1268
(Not sure what units those are.)

I wonder if this is relevant:
https://github.com/torvalds/linux/blob/4f671fe2f9523a1ea206f63fe60a7c7b3a56d5c7/fs/btrfs/ioctl.c#L2572

This seems to reset the -t flag if compress (-c) is set? This looks a
bit fishy?

> Meanwhile, you may consider using the nocow attribute on those database 
> files.  It will disable compression on them,

I'm using btrfs specifically to get compression, so this isn't an option.

> While initial usage will  be higher due to the lack of compression,
> as you've discovered, over time, on an actively updated database,
> compression isn't all that effective anyway.

I don't see why. If you're referring to the additional overhead of
continuously compressing and decompressing everything - yes, of course.
But in my case I have a mostly-append workload to a huge amount of
fairly compressible data that's on magnetic storage, so compression is a
win in disk space and perhaps even in performance.

I'm well aware of the many caveats in using btrfs for databases -
they're well documented and although I much appreciate your extended
explanation, it wasn't new to me.

It turns out that if your dataset isn't update heavy (so it doesn't
fragment to begin with), or has to be queried via indexed access (i.e.
mostly via random seeks), the fragmentation doesn't matter much anyway.
Conversely, btrfs appears to have better sync performance with multiple
threads, and allows one to disable part of the partial-page-write
protection logic in the database (full_page_writes=off for PostgreSQL),
because btrfs is already doing the COW to ensure those can't actually
happen [1].

The net result is a *boost* from about 40 tps (ext4) to 55 tps (btrfs),
which certainly is contrary to popular wisdom. Maybe btrfs would fall
off eventually as fragmentation gradually sets in, but given that
there's an offline defragmentation tool that can run in the background,
I don't care.

[1] I wouldn't be too surprised if database COW, which consists of
journal-writing a copy of the data out of band, then rewriting it again
in the original place, is actually functionally equivalent to disabling
COW in the database and running btrfs + defrag. Obviously you shouldn't
keep COW enabled in btrfs *AND* the DB, requiring all data to be copied
around at least 3 times... which I'm afraid almost everyone does because
it's the default...

-- 
GCP


Re: Degraded volume silently fails to mount

2015-04-13 Thread Hugo Mills
On Mon, Apr 13, 2015 at 09:51:09AM -0400, Michael Tharp wrote:
> Hi list,
> 
> I've got a 4 disk raid1 volume that has one failed disk. I have so
> far been unable to mount it in degraded mode, but the failure is
> that "mount" silently does nothing.

   Check to see if systemd is unmounting it immediately after
mount. This seems to be the usual reason for silent failures to mount
an FS these days.

   Hugo.

>  # btrfs fi sh
> warning devid 2 not found already
> Label: 'seneca'  uuid: b9da07f5-c0fd-45ad-861b-d1bcad6cbf4c
>   Total devices 4 FS bytes used 581.71GiB
>   devid    1 size 931.51GiB used 334.02GiB path /dev/mapper/luks-seneca-1
>   devid    3 size 931.51GiB used 334.01GiB path /dev/mapper/luks-seneca-3
>   devid    4 size 931.51GiB used 334.01GiB path /dev/mapper/luks-seneca-4
>   *** Some devices missing
> 
> Btrfs v3.18.1
>  # mount -t btrfs -o degraded /dev/mapper/luks-seneca-1 /seneca
>  # echo $?
> 0
>  # ls /seneca/
>  # grep seneca /proc/mounts
>  # dmesg |tail
> [   84.955467] BTRFS: device label seneca devid 1 transid 1753 /dev/dm-4
> [   87.926347] BTRFS: device label seneca devid 4 transid 1753 /dev/dm-5
> [  107.069109] BTRFS: device label seneca devid 3 transid 1753 /dev/dm-6
> [  195.267046] BTRFS info (device dm-6): allowing degraded mounts
> [  195.267094] BTRFS info (device dm-6): disk space caching is enabled
> [  195.267133] BTRFS: has skinny extents
> [  195.277615] BTRFS warning (device dm-6): devid 2 missing
> [  781.160250] BTRFS info (device dm-6): allowing degraded mounts
> [  781.160270] BTRFS info (device dm-6): disk space caching is enabled
> [  781.160286] BTRFS: has skinny extents
>  # uname -a
> Linux ambrosia.homeslice 3.19.3-200.fc21.x86_64 #1 SMP Thu Mar 26
> 21:39:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>  # btrfs --version
> Btrfs v3.18.1
> 
> 
> Any ideas?

-- 
Hugo Mills | What's a Nazgûl like you doing in a place like this?
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Illiad




Degraded volume silently fails to mount

2015-04-13 Thread Michael Tharp

Hi list,

I've got a 4 disk raid1 volume that has one failed disk. I have so far 
been unable to mount it in degraded mode, but the failure is that 
"mount" silently does nothing.


 # btrfs fi sh
warning devid 2 not found already
Label: 'seneca'  uuid: b9da07f5-c0fd-45ad-861b-d1bcad6cbf4c
Total devices 4 FS bytes used 581.71GiB
devid    1 size 931.51GiB used 334.02GiB path /dev/mapper/luks-seneca-1
devid    3 size 931.51GiB used 334.01GiB path /dev/mapper/luks-seneca-3
devid    4 size 931.51GiB used 334.01GiB path /dev/mapper/luks-seneca-4
*** Some devices missing

Btrfs v3.18.1
 # mount -t btrfs -o degraded /dev/mapper/luks-seneca-1 /seneca
 # echo $?
0
 # ls /seneca/
 # grep seneca /proc/mounts
 # dmesg |tail
[   84.955467] BTRFS: device label seneca devid 1 transid 1753 /dev/dm-4
[   87.926347] BTRFS: device label seneca devid 4 transid 1753 /dev/dm-5
[  107.069109] BTRFS: device label seneca devid 3 transid 1753 /dev/dm-6
[  195.267046] BTRFS info (device dm-6): allowing degraded mounts
[  195.267094] BTRFS info (device dm-6): disk space caching is enabled
[  195.267133] BTRFS: has skinny extents
[  195.277615] BTRFS warning (device dm-6): devid 2 missing
[  781.160250] BTRFS info (device dm-6): allowing degraded mounts
[  781.160270] BTRFS info (device dm-6): disk space caching is enabled
[  781.160286] BTRFS: has skinny extents
 # uname -a
Linux ambrosia.homeslice 3.19.3-200.fc21.x86_64 #1 SMP Thu Mar 26 
21:39:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

 # btrfs --version
Btrfs v3.18.1


Any ideas?


RFC: plan to allow documentation contributions via github

2015-04-13 Thread David Sterba
Hi,

plan: I'd like to allow documentation updates through the github web interface.
"patches via mailing list" will continue to work unchanged.

The current way (clone the git, edit files, send to the mailing list) might
discourage people who are not developers or are not used to working with git
that way.

There are some issues around pull requests that I'm not yet clear how to
resolve. I'd like to keep the git history clean, so the pull requests
will not get merged the usual way. I'll probably merge the
changes/patches manually and then close the request. There shall be a
branch to serve as a starting point for any new edits, but it will be a
moving target after the pending patches get merged. I hope this will
work for the browser-only approach; the merging burden is on my side.

In order to get a working 'Preview' for the changes, we'd have to rename
all .txt files to .asciidoc. Then you get a nice formatting on the
github site for free.

You can see an example here:
https://github.com/kdave/btrfs-progs/blob/test-asciidoc/Documentation/btrfs-balance.asciidoc

The documentation is separated from code so we can afford to relax the
submission rules, though we'll still need the signed-off-by and names
for the final commits.

Thanks for feedback.


Re: [PATCH 1/1] btrfs-progs: improve troubleshooting avoid duplicate error strings

2015-04-13 Thread Marc MERLIN
On Mon, Apr 13, 2015 at 08:37:01PM +0800, Anand Jain wrote:
> My troubleshooting experience says: have a unique error string per module.

+1 to that, thank you.

Marc

> In the example below, it takes one additional step to find the error line:
> 
> cat -n cmds-device.c | egrep "error removing the device"
>185"ERROR: error removing the device '%s' - %s\n",
>190"ERROR: error removing the device '%s' - %s\n",
> 
> which is completely avoidable.
> 
> Signed-off-by: Anand Jain 
> ---
>  cmds-device.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/cmds-device.c b/cmds-device.c
> index 1c72e90..1c32771 100644
> --- a/cmds-device.c
> +++ b/cmds-device.c
> @@ -187,7 +187,7 @@ static int cmd_rm_dev(int argc, char **argv)
>   ret++;
>   } else if (res < 0) {
>   fprintf(stderr,
> - "ERROR: error removing the device '%s' - %s\n",
> + "ERROR: ioctl error removing the device '%s' - %s\n",
>   argv[i], strerror(e));
>   ret++;
>   }
> -- 
> 2.0.0.153.g79d
> 

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


BtrFS and encryption

2015-04-13 Thread M G Berberich
Hello,

it seems that ext4 is getting encryption support

  http://thread.gmane.org/gmane.comp.file-systems.ext4/48206

Rumors say this is because of performance problems with eCryptFS on Android.
f2fs should get a compatible interface too.

I would like to see this in BtrFS as well…

MfG
bmg

-- 
„Des is völlig wurscht, was heut beschlos- | M G Berberich
 sen wird: I bin sowieso dagegn!“  | berbe...@fmi.uni-passau.de
(SPD-Stadtrat Kurt Schindler; Regensburg)  | www.fmi.uni-passau.de/~berberic


[4.0] BTRFS + ecryptfs: Iceweasel cache process hanging on evicting inodes

2015-04-13 Thread Martin Steigerwald
Hi!

This may or may not be related to

Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes 
on random write into big file 
https://bugzilla.kernel.org/show_bug.cgi?id=90401

BTRFS free space handling still needs more work: Hangs again
Martin Steigerwald | 26 Dec 14:37 2014
http://permalink.gmane.org/gmane.comp.file-systems.btrfs/41790


I am not sure, because I didn't check the CPU usage of the processes.

It may be a different issue, thus I'm reporting it here first.

This is 4.0 kernel with just the patch from Lutz included to make trimming
work.

In case you suspect this to be an ecryptfs issue please tell me and I will
forward to ecryptfs mailing list. I really hope that BTRFS will take on the
Ext4 and probably F2FS work to include encryption within the filesystem
directly.


After seeing Iceweasel not responding anymore in several tabs I saw this
in syslog:

Apr 13 12:49:23 merkaba kernel: [ 4080.770733] INFO: task Cache2 I/O:3529 
blocked for more than 120 seconds.
Apr 13 12:49:23 merkaba kernel: [ 4080.770741]   Tainted: G   O
4.0.0-tp520-btrfs-trim+ #25
Apr 13 12:49:23 merkaba kernel: [ 4080.770744] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 13 12:49:23 merkaba kernel: [ 4080.770746] Cache2 I/O  D 
88030d0b7c58 0  3529   3060 0x
Apr 13 12:49:23 merkaba kernel: [ 4080.770752]  88030d0b7c58 
88030d0b7c38 8802dd6d49b0 88030d0b7c38
Apr 13 12:49:23 merkaba kernel: [ 4080.770758]  88030d0b7fd8 
8802999ec488 88031915e2ac 88030d0b7cd8
Apr 13 12:49:23 merkaba kernel: [ 4080.770763]  88031915e290 
88030d0b7c78 814c28e8 
Apr 13 12:49:23 merkaba kernel: [ 4080.770768] Call Trace:
Apr 13 12:49:23 merkaba kernel: [ 4080.770777]  [] 
schedule+0x6f/0x7e
Apr 13 12:49:23 merkaba kernel: [ 4080.770822]  [] 
lock_extent_bits+0x100/0x188 [btrfs]
Apr 13 12:49:23 merkaba kernel: [ 4080.770828]  [] ? 
finish_wait+0x5f/0x5f
Apr 13 12:49:23 merkaba kernel: [ 4080.770855]  [] 
btrfs_evict_inode+0x14a/0x423 [btrfs]
Apr 13 12:49:23 merkaba kernel: [ 4080.770865]  [] 
evict+0xa8/0x150
Apr 13 12:49:23 merkaba kernel: [ 4080.770869]  [] 
iput+0x16f/0x1bb
Apr 13 12:49:23 merkaba kernel: [ 4080.770880]  [] 
ecryptfs_evict_inode+0x29/0x2d [ecryptfs]
Apr 13 12:49:23 merkaba kernel: [ 4080.770888]  [] ? 
ecryptfs_show_options+0x11e/0x11e [ecryptfs]
Apr 13 12:49:23 merkaba kernel: [ 4080.770893]  [] 
evict+0xa8/0x150
Apr 13 12:49:23 merkaba kernel: [ 4080.770896]  [] 
iput+0x16f/0x1bb
Apr 13 12:49:23 merkaba kernel: [ 4080.770901]  [] 
do_unlinkat+0x151/0x1f0
Apr 13 12:49:23 merkaba kernel: [ 4080.770906]  [] ? 
user_exit+0x13/0x15
Apr 13 12:49:23 merkaba kernel: [ 4080.770910]  [] ? 
syscall_trace_enter_phase1+0x57/0x12a
Apr 13 12:49:23 merkaba kernel: [ 4080.770914]  [] ? 
syscall_trace_leave+0xcb/0x108
Apr 13 12:49:23 merkaba kernel: [ 4080.770918]  [] 
SyS_unlink+0x11/0x13
Apr 13 12:49:23 merkaba kernel: [ 4080.770923]  [] 
system_call_fastpath+0x12/0x17


Apr 13 12:51:23 merkaba kernel: [ 4200.790479] INFO: task Cache2 I/O:3529 
blocked for more than 120 seconds.
Apr 13 12:51:23 merkaba kernel: [ 4200.790492]   Tainted: G   O
4.0.0-tp520-btrfs-trim+ #25
Apr 13 12:51:23 merkaba kernel: [ 4200.790496] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 13 12:51:23 merkaba kernel: [ 4200.790500] Cache2 I/O  D 
88030d0b7c58 0  3529   3060 0x
Apr 13 12:51:23 merkaba kernel: [ 4200.790511]  88030d0b7c58 
88030d0b7c38 8802dd6d49b0 88030d0b7c38
Apr 13 12:51:23 merkaba kernel: [ 4200.790520]  88030d0b7fd8 
8802999ec488 88031915e2ac 88030d0b7cd8
Apr 13 12:51:23 merkaba kernel: [ 4200.790527]  88031915e290 
88030d0b7c78 814c28e8 
Apr 13 12:51:23 merkaba kernel: [ 4200.790536] Call Trace:
Apr 13 12:51:23 merkaba kernel: [ 4200.790552]  [] 
schedule+0x6f/0x7e
Apr 13 12:51:23 merkaba kernel: [ 4200.790637]  [] 
lock_extent_bits+0x100/0x188 [btrfs]
Apr 13 12:51:23 merkaba kernel: [ 4200.790645]  [] ? 
finish_wait+0x5f/0x5f
Apr 13 12:51:23 merkaba kernel: [ 4200.790700]  [] 
btrfs_evict_inode+0x14a/0x423 [btrfs]
Apr 13 12:51:23 merkaba kernel: [ 4200.790716]  [] 
evict+0xa8/0x150
Apr 13 12:51:23 merkaba kernel: [ 4200.790721]  [] 
iput+0x16f/0x1bb
Apr 13 12:51:23 merkaba kernel: [ 4200.790742]  [] 
ecryptfs_evict_inode+0x29/0x2d [ecryptfs]
Apr 13 12:51:23 merkaba kernel: [ 4200.790758]  [] ? 
ecryptfs_show_options+0x11e/0x11e [ecryptfs]
Apr 13 12:51:23 merkaba kernel: [ 4200.790765]  [] 
evict+0xa8/0x150
Apr 13 12:51:23 merkaba kernel: [ 4200.790770]  [] 
iput+0x16f/0x1bb
Apr 13 12:51:23 merkaba kernel: [ 4200.790777]  [] 
do_unlinkat+0x151/0x1f0
Apr 13 12:51:23 merkaba kernel: [ 4200.790786]  [] ? 
user_exit+0x13/0x15
Apr 13 12:51:23 merkaba kernel: [ 4200.790793]  [] ? 
syscall_trace_enter_phase1+0x57/0x12a
Apr 13 12:51:23 merkaba kernel: [ 4200.790799]  [] ? 
syscall_trace_le

[PATCH 1/1] btrfs-progs: improve troubleshooting avoid duplicate error strings

2015-04-13 Thread Anand Jain
My troubleshooting experience says: have a unique error string per module.

In the example below, it takes one additional step to find the error line:

cat -n cmds-device.c | egrep "error removing the device"
   185  "ERROR: error removing the device '%s' - %s\n",
   190  "ERROR: error removing the device '%s' - %s\n",

which is completely avoidable.

Signed-off-by: Anand Jain 
---
 cmds-device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-device.c b/cmds-device.c
index 1c72e90..1c32771 100644
--- a/cmds-device.c
+++ b/cmds-device.c
@@ -187,7 +187,7 @@ static int cmd_rm_dev(int argc, char **argv)
ret++;
} else if (res < 0) {
fprintf(stderr,
-   "ERROR: error removing the device '%s' - %s\n",
+   "ERROR: ioctl error removing the device '%s' - %s\n",
argv[i], strerror(e));
ret++;
}
-- 
2.0.0.153.g79d



Re: Big disk space usage difference, even after defrag, on identical data

2015-04-13 Thread Gian-Carlo Pascutto
On 13-04-15 06:04, Zygo Blaxell wrote:

>> I would think that compression differences or things like
>> fragmentation or bookending for modified files shouldn't affect
>> this, because the first filesystem has been
>> defragmented/recompressed and didn't shrink.
>> 
>> So what can explain this? Where did the 66G go?
> 
> There are a few places:  the kernel may have decided your files are
> not compressible and disabled compression on them (some older kernels
> did this with great enthusiasm);

As stated in the previous mail, this is 3.19.1. Moreover, the data is
either uniformly compressible or not at all. Lastly, note that the
*exact same* mount options are being used on *the exact same kernel*
with *the exact same data*. Getting a different compressibility decision
given the same inputs would point to bugs.

> your files might have preallocated space from the fallocate system
> call (which disables compression and allocates contiguous space, so
> defrag will not touch it).

So defrag -clzo or -czlib won't actually re-compress mostly-contiguous
files? That's evil. I have no idea whether PostgreSQL allocates files
that way, though.

> 'filefrag -v' can tell you if this is happening to your files.

Not sure how to interpret that. Without "-v", I see most of the (DB)
data has 2-5 extents per Gigabyte. A few have 8192 extents per Gigabyte.

Comparing to the copy that takes 66G less, there every (compressible)
file has about 8192 extents per Gigabyte, and the others 5 or 6.

So you may be right that some DB files are "wedged" in a format that
btrfs can't compress. I forced the files to be rewritten (VACUUM FULL)
and that "fixed" the problem.

> In practice database files take about double the amount of space
> they appear to because of extent shingling.

This is what I called "bookending" in the original mail (I didn't know
the correct name), and I understand doing updates can result in roughly
N^2/2 disk space usage, however:

> Defragmenting the files helps free space temporarily; however, space
> usage will quickly grow again until it returns to the steady state
> around 2x the file size.

As stated in the original mail, the filesystem was *freshly
defragmented* so that can't have been the cause.

> Until this is fixed, the most space-efficient approach seems to be to
> force compression (so the maximum extent is 128K instead of 1GB)

Would that fix the problem with fallocated() files?

-- 
GCP



Btrfs receive hardening patches

2015-04-13 Thread lauri
Hello,

Due to security concerns in certain use cases, it would be nice to force
btrfs receive to confine itself to the subvolume's directory. I've
attached a patch that issues chroot() before parsing the btrfs stream.
Let me know if this breaks anything; preliminary tests showed it
performed as expected. If necessary I can make this functionality
optional via a command-line flag.

-- 
Lauri Võsandi
tel: +372 53329412
e-mail: lauri.vosa...@gmail.com
blog: http://lauri.vosandi.com/
diff --git a/cmds-receive.c b/cmds-receive.c
index 44ef27e..e03acdd 100644
--- a/cmds-receive.c
+++ b/cmds-receive.c
@@ -867,15 +867,20 @@ static int do_receive(struct btrfs_receive *r, const char *tomnt, int r_fd,
goto out;
}
 
-   /*
-* find_mount_root returns a root_path that is a subpath of
-* dest_dir_full_path. Now get the other part of root_path,
-* which is the destination dir relative to root_path.
+
+   /**
+* Nasty hack to enforce chroot before parsing btrfs stream
 */
-   r->dest_dir_path = dest_dir_full_path + strlen(r->root_path);
-   while (r->dest_dir_path[0] == '/')
-   r->dest_dir_path++;
+   if (chroot(dest_dir_full_path)) {
+   fprintf(stderr,
+   "ERROR: failed to chroot to %s\n",
+   dest_dir_full_path);
+   ret = -EINVAL;
+   goto out;
+   }
 
+   r->root_path = r->dest_dir_path = strdup("/");
+   
ret = subvol_uuid_search_init(r->mnt_fd, &r->sus);
if (ret < 0)
goto out;


Re: Big disk space usage difference, even after defrag, on identical data

2015-04-13 Thread Duncan
Zygo Blaxell posted on Mon, 13 Apr 2015 00:04:36 -0400 as excerpted:

> A database ends up maxing out at about a factor of two space usage
> because it tends to write short uniform-sized bursts of pages randomly,
> so we get a pattern a bit like bricks in a wall:
> 
> 0 MB AA BB CC DD EE FF GG HH II JJ KK 1 MB   half the extents
> 0 MB  LL MM NN OO PP QQ RR SS TT UU V 1 MB   the other half
> 
> 0 MB ALLBMMCNNDOOEPPFQQGRRHSSITTJUUKV 1 MB   what the file looks like
> 
> Fixing this is non-trivial (it may require an incompatible disk format
> change).  Until this is fixed, the most space-efficient approach seems
> to be to force compression (so the maximum extent is 128K instead of
> 1GB) and never defragment database files ever.

... Or set the database file nocow at creation, and don't snapshot it, so 
overwrites are always in-place.  (Btrfs compression and checksumming get 
turned off with nocow, but as we've seen, compression isn't all that 
effective on random-rewrite-pattern files anyway, and databases generally 
have their own data integrity handling, so neither one is a huge loss, 
and the in-place rewrite makes for better performance and a more 
predictable steady-state.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
