RE: [PATCH 8/9] btrfs: wait for delayed iputs on no space
Hi, Chris

> -----Original Message-----
> From: Chris Mason [mailto:c...@fb.com]
> Sent: Monday, April 13, 2015 10:55 PM
> To: Zhaolei; linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH 8/9] btrfs: wait for delayed iputs on no space
>
> On 04/09/2015 12:34 AM, Zhaolei wrote:
> > From: Zhao Lei
> >
> > btrfs will report no_space when we run the following write-and-delete
> > loop:
> > # FILE_SIZE_M=[ 75% of fs space ]
> > # DEV=[ some dev ]
> > # MNT=[ some dir ]
> > #
> > # mkfs.btrfs -f "$DEV"
> > # mount -o nodatacow "$DEV" "$MNT"
> > # for ((i = 0; i < 100; i++)); do dd if=/dev/zero of="$MNT"/file0 bs=1M count="$FILE_SIZE_M"; rm -f "$MNT"/file0; done
> > #
> >
> > Reason:
> > iput() and evict() run after the pages are written to the block device.
> > If the page writeback is not finished before the next write, the space
> > freed by "rm" is not yet reusable, which causes the bug above.
> >
> > Fix:
> > Adding the "-o flushoncommit" mount option avoids the bug, but it has a
> > performance cost. Instead, we can wait for the in-flight writes only
> > when no-space happens, which is what this patch does.
>
> Can you please change this so we only do this flush if the first commit
> doesn't free up enough space? I think this is going to have a performance
> impact as the FS fills up.

btrfs_wait_ordered_roots() can only ensure that all bios are finished; the corresponding iputs are added to delayed_iputs in end_io. We then need two commits to make the freed space usable: one to run delayed_iputs(), and another to unpin the extents.

That is why I put the line above the first commit: it ensures we have enough commit operations to make the freed space usable again.

It is only reached when the disk is almost full, so it has no performance impact in the common case (disk not full).

Another way is to call btrfs_wait_ordered_roots() after the first commit() attempt, but that requires giving it an additional commit().
Thanks Zhaolei -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
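The two-commit requirement Zhaolei describes can be illustrated with a toy model (plain Python with made-up names, not kernel code; the stages are simplifications of the real delayed-iput/unpin machinery): space freed by "rm" only becomes allocatable after one commit has run the delayed iputs and a second commit has unpinned the extents.

```python
class ToyFs:
    """Toy model of btrfs space reclaim after 'rm': freed extents must
    pass through delayed iputs (commit 1) and unpinning (commit 2)."""
    def __init__(self, size):
        self.free = size
        self.delayed_iputs = 0   # space waiting on a delayed iput
        self.pinned = 0          # space pinned until the next commit

    def write_and_rm(self, nbytes):
        # After end_io, the removed file's space sits in delayed_iputs.
        self.free -= nbytes
        self.delayed_iputs += nbytes

    def commit(self):
        # Commit 1 runs delayed iputs, which pins the extents;
        # commit 2 unpins them, making the space allocatable again.
        self.free += self.pinned
        self.pinned = self.delayed_iputs
        self.delayed_iputs = 0

fs = ToyFs(100)
fs.write_and_rm(75)
fs.commit()            # runs delayed iputs -> extents now pinned
after_one = fs.free    # still 25: one commit is not enough
fs.commit()            # unpins -> space is allocatable again
after_two = fs.free    # back to 100
```

In this model a single commit leaves the filesystem looking full, which is why the patch places the wait before the first commit rather than after it.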
Re: Big disk space usage difference, even after defrag, on identical data
Gian-Carlo Pascutto posted on Mon, 13 Apr 2015 16:06:39 +0200 as excerpted:

>> Defrag should force the rewrite of entire files and take care of this,
>> but obviously it's not returning to "clean" state. I forgot what the
>> default minimum file size is if -t isn't set, maybe 128 MiB? But a -t1
>> will force it to defrag even small files, and I recall at least one
>> thread here where the poster said it made all the difference for him,
>> so try that. And the -f should force a filesystem sync afterward, so
>> you know the numbers from any report you run afterward match the final
>> state.
>
> Reading the corresponding manual, the -t explanation says that "any
> extent bigger than this size will be considered already defragged". So I
> guess setting -t1 might've fixed the problem too...but after checking
> the source, I'm not so sure.

Oops! You are correct. There was an on-list discussion of that before that I had forgotten.

The "make sure everything gets defragged" magic setting is -t 1G or higher, *not* the -t 1 I was trying to tell you previously (which will end up skipping everything, instead of defragging everything).

Thanks for spotting the inconsistency and calling me on it! =:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
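The -t semantics under discussion can be sketched in a few lines (a simplification of the real defrag extent-size check; `should_defrag` is a made-up name, not a btrfs-progs function): an extent at or above the threshold is treated as already defragmented and skipped, so a tiny threshold like -t 1 skips everything, while a large one like -t 1G catches everything.

```python
def should_defrag(extent_size, threshold):
    """Toy version of the -t check: extents bigger than the
    threshold are considered already defragged and skipped."""
    return extent_size < threshold

KIB, MIB, GIB = 1024, 1024**2, 1024**3
extents = [4 * KIB, 128 * MIB, 1 * GIB]   # extents of a fragmented file

# -t 1: every extent is "big enough", so nothing gets defragged.
picked_t1 = [e for e in extents if should_defrag(e, 1)]

# -t 1G: everything below 1 GiB gets rewritten.
picked_t1g = [e for e in extents if should_defrag(e, GIB)]
```

This matches the correction above: -t 1 selects no extents at all, while -t 1G selects everything smaller than 1 GiB.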
Re: Big disk space usage difference, even after defrag, on identical data
On Mon, Apr 13, 2015 at 04:06:39PM +0200, Gian-Carlo Pascutto wrote: > On 13-04-15 07:06, Duncan wrote: > > >> So what can explain this? Where did the 66G go? > > > > Out of curiosity, does a balance on the actively used btrfs help? > > > > You mentioned defrag -v -r -clzo, but didn't use the -f (flush) or -t > > (minimum size file) options. Does adding -f -t1 help? > > Unfortunately I can no longer try this, see the other reply why. But the > problem turned out to be some 1G-sized files, written using 3-5 extents, > that for whatever reason defrag was not touching. There are several corner cases that defrag won't touch by default. It's designed to be conservative and favor speed over size. Also when the kernel decides you're not getting enough compression, it seems to disable compression on the file _forever_ even if future writes are compressible again. mount -o compress-force works around that. > > You aren't doing btrfs snapshots of either subvolume, are you? > > No :-) I should've mentioned that. read-only snapshots: yet another thing defrag won't touch. > > While initial usage will be higher due to the lack of compression, > > as you've discovered, over time, on an actively updated database, > > compression isn't all that effective anyway. > > I don't see why. If you're referring to the additional overhead of > continuously compressing and decompressing everything - yes, of course. > But in my case I have a mostly-append workload to a huge amount of > fairly compressible data that's on magnetic storage, so compression is a > win in disk space and perhaps even in performance. Short writes won't compress--not just well, but at all--because btrfs won't look at adjacent already-written blocks. If you write a file at less than 4K/minute, there will be no compression, as each new extent (or replacement extent for overwritten data) is already minimum-sized. If you write in bursts of 128K or more, consecutively, then you can get compression benefit. 
There has been talk of teaching autodefrag to roll up the last few dozen extents of files that grow slowly so they can be compressed.

> It turns out that if your dataset isn't update heavy (so it doesn't
> fragment to begin with), or has to be queried via indexed access (i.e.
> mostly via random seeks), the fragmentation doesn't matter much anyway.
> Conversely, btrfs appears to have better sync performance with multiple
> threads, and allows one to disable part of the partial-page-write
> protection logic in the database (full_page_writes=off for PostgreSQL),
> because btrfs is already doing the COW to ensure those can't actually
> happen [1].
>
> The net result is a *boost* from about 40 tps (ext4) to 55 tps (btrfs),
> which certainly is contrary to popular wisdom. Maybe btrfs would fall
> off eventually as fragmentation does set in gradually, but given that
> there's an offline defragmentation tool that can run in the background,
> I don't care.

I've found the performance of PostgreSQL to be wildly variable on btrfs. It may be OK at first, but watch it for a week or two to admire the full four-orders-of-magnitude swing (100 tps to 0.01 tps). :-O

> [1] I wouldn't be too surprised if database COW, which consists of
> journal-writing a copy of the data out of band, then rewriting it again
> in the original place, is actually functionally equivalent to disabling
> COW in the database and running btrfs + defrag. Obviously you shouldn't
> keep COW enabled in btrfs *AND* the DB, requiring all data to be copied
> around at least 3 times...which I'm afraid almost everyone does because
> it's the default...

Journalling writes all the data twice: once to the journal, once to update the origin page after the journal (though PostgreSQL will omit some of those duplicate writes in cases where there is no origin page to overwrite). COW writes all the new and updated data only once.
In the event of a crash, if the log tree is not recoverable (and it's a rich source of btrfs bugs, so it's often not), you lose everything that happened to the database in the last 30 seconds. If you were already using async commit in PostgreSQL anyway then that's not much of a concern (and not having to call fsync 100 times a second _really_ helps performance!) but if you really need sync commit then btrfs is not the filesystem for you.

> --
> GCP
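The write-amplification argument in this thread can be made concrete with a toy count of how many times each payload byte reaches the disk (a simplification with made-up names; real journals also write metadata and checksums): a journaling DB writes data twice, a COW filesystem writes it once, and a journaling DB on a COW filesystem pays for both.

```python
def bytes_written(payload, db_journal=False, fs_cow=False):
    """Count how many times each payload byte hits disk in this
    toy model (ignores metadata and checksum overhead)."""
    copies = 1                 # the in-place / final copy
    if db_journal:
        copies += 1            # out-of-band journal copy first
    if fs_cow and db_journal:
        copies += 1            # the fs also COWs the in-place rewrite
    return payload * copies

# ext4-style journaling DB: journal + in-place rewrite = 2 copies
journaled = bytes_written(100, db_journal=True)
# btrfs with the DB's own journalling role delegated to COW: 1 copy
cow_only = bytes_written(100, fs_cow=True)
# DB journalling *and* btrfs COW (the default): at least 3 copies
both = bytes_written(100, db_journal=True, fs_cow=True)
```

The "at least 3 times" default case is exactly the combination the footnote above warns against keeping enabled.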
Re: [PATCH RESEND] btrfs: unlock i_mutex after attempting to delete subvolume during send
On Fri, Apr 10, 2015 at 02:20:40PM -0700, Omar Sandoval wrote: > Whenever the check for a send in progress introduced in commit > 521e0546c970 (btrfs: protect snapshots from deleting during send) is > hit, we return without unlocking inode->i_mutex. This is easy to see > with lockdep enabled: > > [ +0.59] > [ +0.28] [ BUG: lock held when returning to user space! ] > [ +0.29] 4.0.0-rc5-00096-g3c435c1 #93 Not tainted > [ +0.26] > [ +0.29] btrfs/211 is leaving the kernel with locks still held! > [ +0.29] 1 lock held by btrfs/211: > [ +0.23] #0: (&type->i_mutex_dir_key){+.+.+.}, at: > [] btrfs_ioctl_snap_destroy+0x2df/0x7a0 > > Make sure we unlock it in the error path. > > Reviewed-by: Filipe Manana > Reviewed-by: David Sterba > Cc: sta...@vger.kernel.org > Signed-off-by: Omar Sandoval > --- > Just resending this with Filipe's and David's Reviewed-bys and Cc-ing > stable. > > fs/btrfs/ioctl.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) Ping. -- Omar
Re: [PATCH v2 0/3] btrfs: ENOMEM bugfixes
On Fri, Mar 27, 2015 at 02:06:49PM -0700, Omar Sandoval wrote: > On Fri, Mar 13, 2015 at 12:43:42PM -0700, Omar Sandoval wrote: > > On Fri, Mar 13, 2015 at 12:04:30PM +0100, David Sterba wrote: > > > On Wed, Mar 11, 2015 at 09:40:17PM -0700, Omar Sandoval wrote: > > > > Ping. For anyone following along, it looks like commit cc87317726f8 > > > > ("mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change") > > > > reverted the commit that exposed these bugs. Josef said he was okay with > > > > taking these, will they make it to an upcoming -rc soon? > > > > > > Upcoming yes, but based on my experience with pushing patches that are > > > not really regressions in late rc's it's unlikely for 4.1. > > > > Ok, seeing as these bugs are going to be really hard to trigger now that > > the old GFP_FS behavior has been restored, I'm fine with waiting for the > > next merge window. > > > > Thank you! > > Chris, would you mind taking these for a spin in your integration branch > for the next merge window? > > Thanks, > -- > Omar Ping. -- Omar
[PATCH 2/4] Btrfs: two stage dirty block group writeout
Block group cache writeout is currently waiting on the pages for each block group cache before moving on to writing the next one. This commit switches things around to send down all the caches and then wait on them in batches. The end result is much faster, since we're keeping the disk pipeline full. Signed-off-by: Chris Mason --- fs/btrfs/ctree.h| 6 ++ fs/btrfs/extent-tree.c | 57 +-- fs/btrfs/free-space-cache.c | 131 +++- fs/btrfs/free-space-cache.h | 8 ++- 4 files changed, 170 insertions(+), 32 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index e305ccd..1df0d9d 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1261,9 +1261,12 @@ struct btrfs_io_ctl { struct page *page; struct page **pages; struct btrfs_root *root; + struct inode *inode; unsigned long size; int index; int num_pages; + int entries; + int bitmaps; unsigned check_crcs:1; }; @@ -1332,6 +1335,9 @@ struct btrfs_block_group_cache { /* For dirty block groups */ struct list_head dirty_list; + struct list_head io_list; + + struct btrfs_io_ctl io_ctl; }; /* delayed seq elem */ diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3d4b3d680..40c9513 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3388,7 +3388,11 @@ int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans, struct btrfs_block_group_cache *cache; struct btrfs_transaction *cur_trans = trans->transaction; int ret = 0; + int should_put; struct btrfs_path *path; + LIST_HEAD(io); + int num_started = 0; + int num_waited = 0; if (list_empty(&cur_trans->dirty_bgs)) return 0; @@ -3407,16 +3411,60 @@ int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans, cache = list_first_entry(&cur_trans->dirty_bgs, struct btrfs_block_group_cache, dirty_list); + + /* +* this can happen if cache_save_setup re-dirties a block +* group that is already under IO. 
Just wait for it to +* finish and then do it all again +*/ + if (!list_empty(&cache->io_list)) { + list_del_init(&cache->io_list); + btrfs_wait_cache_io(root, trans, cache, + &cache->io_ctl, path, + cache->key.objectid); + btrfs_put_block_group(cache); + num_waited++; + } + list_del_init(&cache->dirty_list); + should_put = 1; + if (cache->disk_cache_state == BTRFS_DC_CLEAR) cache_save_setup(cache, trans, path); + if (!ret) - ret = btrfs_run_delayed_refs(trans, root, -(unsigned long) -1); - if (!ret && cache->disk_cache_state == BTRFS_DC_SETUP) - btrfs_write_out_cache(root, trans, cache, path); + ret = btrfs_run_delayed_refs(trans, root, (unsigned long) -1); + + if (!ret && cache->disk_cache_state == BTRFS_DC_SETUP) { + cache->io_ctl.inode = NULL; + ret = btrfs_write_out_cache(root, trans, cache, path); + if (ret == 0 && cache->io_ctl.inode) { + num_started++; + should_put = 0; + list_add_tail(&cache->io_list, &io); + } else { + /* +* if we failed to write the cache, the +* generation will be bad and life goes on +*/ + ret = 0; + } + } if (!ret) ret = write_one_cache_group(trans, root, path, cache); + + /* if its not on the io list, we need to put the block group */ + if (should_put) + btrfs_put_block_group(cache); + } + + while (!list_empty(&io)) { + cache = list_first_entry(&io, struct btrfs_block_group_cache, +io_list); + list_del_init(&cache->io_list); + num_waited++; + btrfs_wait_cache_io(root, trans, cache, + &cache->io_ctl, path, cache->key.objectid); btrfs_put_block_group(cache); } @@ -9013,6 +9061,7 @@ btrfs_create_block_group_cache(struct btrfs_root *root, u64 start, u64 siz
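The restructuring in this patch — start IO for every dirty block group first, then wait on them in a batch — can be sketched as a toy pipeline (plain Python with made-up names, not the kernel code): the point is that no wait happens until every cache write has been submitted, which keeps the disk pipeline full.

```python
def writeout_two_stage(dirty_bgs, log):
    """Toy version of btrfs_write_dirty_block_groups after this patch:
    submit all cache writes, collect them on an io list, then wait."""
    io_list = []
    for bg in dirty_bgs:
        log.append(("start", bg))   # btrfs_write_out_cache: async submit
        io_list.append(bg)
    for bg in io_list:
        log.append(("wait", bg))    # btrfs_wait_cache_io: wait in bulk
    return log

log = writeout_two_stage(["bg0", "bg1", "bg2"], [])

# Every submit precedes every wait, unlike the old write-then-wait loop
# that alternated start/wait per block group.
first_wait = min(i for i, (op, _) in enumerate(log) if op == "wait")
last_start = max(i for i, (op, _) in enumerate(log) if op == "start")
```

In the pre-patch shape the log would read start, wait, start, wait, ...; after the patch all starts come before the first wait, so the device always has queued work.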
[PATCH 3/4] Btrfs: don't use highmem for free space cache pages
In order to create the free space cache concurrently with FS modifications, we need to take a few block group locks. The cache code also does kmap, which would schedule with the locks held. Instead of going through kmap_atomic, lets just use lowmem for the cache pages. Signed-off-by: Chris Mason --- fs/btrfs/free-space-cache.c | 12 +--- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index 6886ae0..83532a2 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -85,7 +85,8 @@ static struct inode *__lookup_free_space_inode(struct btrfs_root *root, } mapping_set_gfp_mask(inode->i_mapping, - mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS); + mapping_gfp_mask(inode->i_mapping) & + ~(GFP_NOFS & ~__GFP_HIGHMEM)); return inode; } @@ -310,7 +311,6 @@ static void io_ctl_free(struct btrfs_io_ctl *io_ctl) static void io_ctl_unmap_page(struct btrfs_io_ctl *io_ctl) { if (io_ctl->cur) { - kunmap(io_ctl->page); io_ctl->cur = NULL; io_ctl->orig = NULL; } @@ -320,7 +320,7 @@ static void io_ctl_map_page(struct btrfs_io_ctl *io_ctl, int clear) { ASSERT(io_ctl->index < io_ctl->num_pages); io_ctl->page = io_ctl->pages[io_ctl->index++]; - io_ctl->cur = kmap(io_ctl->page); + io_ctl->cur = page_address(io_ctl->page); io_ctl->orig = io_ctl->cur; io_ctl->size = PAGE_CACHE_SIZE; if (clear) @@ -446,10 +446,9 @@ static void io_ctl_set_crc(struct btrfs_io_ctl *io_ctl, int index) PAGE_CACHE_SIZE - offset); btrfs_csum_final(crc, (char *)&crc); io_ctl_unmap_page(io_ctl); - tmp = kmap(io_ctl->pages[0]); + tmp = page_address(io_ctl->pages[0]); tmp += index; *tmp = crc; - kunmap(io_ctl->pages[0]); } static int io_ctl_check_crc(struct btrfs_io_ctl *io_ctl, int index) @@ -466,10 +465,9 @@ static int io_ctl_check_crc(struct btrfs_io_ctl *io_ctl, int index) if (index == 0) offset = sizeof(u32) * io_ctl->num_pages; - tmp = kmap(io_ctl->pages[0]); + tmp = page_address(io_ctl->pages[0]); tmp += index; val = *tmp; - 
kunmap(io_ctl->pages[0]); io_ctl_map_page(io_ctl, 0); crc = btrfs_csum_data(io_ctl->orig + offset, crc, -- 1.8.1
[PATCH 1/4] btrfs: move struct io_ctl into ctree.h and rename it
We'll need to put the io_ctl into the block_group cache struct, so name it struct btrfs_io_ctl and move it into ctree.h Signed-off-by: Chris Mason --- fs/btrfs/ctree.h| 11 + fs/btrfs/free-space-cache.c | 55 ++--- 2 files changed, 33 insertions(+), 33 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 6bf16d5..e305ccd 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1256,6 +1256,17 @@ struct btrfs_caching_control { atomic_t count; }; +struct btrfs_io_ctl { + void *cur, *orig; + struct page *page; + struct page **pages; + struct btrfs_root *root; + unsigned long size; + int index; + int num_pages; + unsigned check_crcs:1; +}; + struct btrfs_block_group_cache { struct btrfs_key key; struct btrfs_block_group_item item; diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index c514820..47c2adb 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -271,18 +271,7 @@ static int readahead_cache(struct inode *inode) return 0; } -struct io_ctl { - void *cur, *orig; - struct page *page; - struct page **pages; - struct btrfs_root *root; - unsigned long size; - int index; - int num_pages; - unsigned check_crcs:1; -}; - -static int io_ctl_init(struct io_ctl *io_ctl, struct inode *inode, +static int io_ctl_init(struct btrfs_io_ctl *io_ctl, struct inode *inode, struct btrfs_root *root, int write) { int num_pages; @@ -298,7 +287,7 @@ static int io_ctl_init(struct io_ctl *io_ctl, struct inode *inode, (num_pages * sizeof(u32)) >= PAGE_CACHE_SIZE) return -ENOSPC; - memset(io_ctl, 0, sizeof(struct io_ctl)); + memset(io_ctl, 0, sizeof(struct btrfs_io_ctl)); io_ctl->pages = kcalloc(num_pages, sizeof(struct page *), GFP_NOFS); if (!io_ctl->pages) @@ -311,12 +300,12 @@ static int io_ctl_init(struct io_ctl *io_ctl, struct inode *inode, return 0; } -static void io_ctl_free(struct io_ctl *io_ctl) +static void io_ctl_free(struct btrfs_io_ctl *io_ctl) { kfree(io_ctl->pages); } -static void io_ctl_unmap_page(struct io_ctl 
*io_ctl) +static void io_ctl_unmap_page(struct btrfs_io_ctl *io_ctl) { if (io_ctl->cur) { kunmap(io_ctl->page); @@ -325,7 +314,7 @@ static void io_ctl_unmap_page(struct io_ctl *io_ctl) } } -static void io_ctl_map_page(struct io_ctl *io_ctl, int clear) +static void io_ctl_map_page(struct btrfs_io_ctl *io_ctl, int clear) { ASSERT(io_ctl->index < io_ctl->num_pages); io_ctl->page = io_ctl->pages[io_ctl->index++]; @@ -336,7 +325,7 @@ static void io_ctl_map_page(struct io_ctl *io_ctl, int clear) memset(io_ctl->cur, 0, PAGE_CACHE_SIZE); } -static void io_ctl_drop_pages(struct io_ctl *io_ctl) +static void io_ctl_drop_pages(struct btrfs_io_ctl *io_ctl) { int i; @@ -351,7 +340,7 @@ static void io_ctl_drop_pages(struct io_ctl *io_ctl) } } -static int io_ctl_prepare_pages(struct io_ctl *io_ctl, struct inode *inode, +static int io_ctl_prepare_pages(struct btrfs_io_ctl *io_ctl, struct inode *inode, int uptodate) { struct page *page; @@ -385,7 +374,7 @@ static int io_ctl_prepare_pages(struct io_ctl *io_ctl, struct inode *inode, return 0; } -static void io_ctl_set_generation(struct io_ctl *io_ctl, u64 generation) +static void io_ctl_set_generation(struct btrfs_io_ctl *io_ctl, u64 generation) { __le64 *val; @@ -408,7 +397,7 @@ static void io_ctl_set_generation(struct io_ctl *io_ctl, u64 generation) io_ctl->cur += sizeof(u64); } -static int io_ctl_check_generation(struct io_ctl *io_ctl, u64 generation) +static int io_ctl_check_generation(struct btrfs_io_ctl *io_ctl, u64 generation) { __le64 *gen; @@ -437,7 +426,7 @@ static int io_ctl_check_generation(struct io_ctl *io_ctl, u64 generation) return 0; } -static void io_ctl_set_crc(struct io_ctl *io_ctl, int index) +static void io_ctl_set_crc(struct btrfs_io_ctl *io_ctl, int index) { u32 *tmp; u32 crc = ~(u32)0; @@ -461,7 +450,7 @@ static void io_ctl_set_crc(struct io_ctl *io_ctl, int index) kunmap(io_ctl->pages[0]); } -static int io_ctl_check_crc(struct io_ctl *io_ctl, int index) +static int io_ctl_check_crc(struct btrfs_io_ctl 
*io_ctl, int index) { u32 *tmp, val; u32 crc = ~(u32)0; @@ -494,7 +483,7 @@ static int io_ctl_check_crc(struct io_ctl *io_ctl, int index) return 0; } -static int io_ctl_add_entry(struct io_ctl *io_ctl, u64 offset, u64 bytes, +static int io_ctl_add_entry(struct btrfs_io_ctl *io_ctl, u64 offset, u64 bytes, void *bitmap) { struct btrfs_free_space_entry *entry; @@ -524,7 +513,7 @@ static int io_ctl_add_entry(
[PATCH 0/4] btrfs: reduce block group cache writeout times during commit
Large filesystems with lots of block groups can suffer long stalls during commit while we create and send down all of the block group caches. The more block groups dirtied in a transaction, the longer these stalls can be. Some workloads average 10 seconds per commit, but see peak times much higher. The first problem is that we write and wait for each block group cache individually, so we aren't keeping the disk pipeline full. This patch set uses the io_ctl struct to start cache IO, and then waits on it in bulk. The second problem is that we only allow cache writeout while new modifications are blocked during the final stage of commit. This adds some locking so that cache writeout can happen very early in the commit, and any block groups that are redirtied will be sent down during the final stages. With both together, average commit stalls are under a second and our overall performance is much smoother.
[PATCH 4/4] Btrfs: allow block group cache writeout outside critical section in commit
We loop through all of the dirty block groups during commit and write the free space cache. In order to make sure the cache is correct, we do this while no other writers are allowed in the commit. If a large number of block groups are dirty, this can introduce long stalls during the final stages of the commit, which can block new procs trying to change the filesystem. This commit changes the block group cache writeout to take appropriate locks and allow it to run earlier in the commit. We'll still have to redo some of the block groups, but it means we can get most of the work out of the way without blocking the entire FS. Signed-off-by: Chris Mason --- fs/btrfs/ctree.h| 8 ++ fs/btrfs/disk-io.c | 1 + fs/btrfs/extent-tree.c | 241 +++- fs/btrfs/free-space-cache.c | 69 +++-- fs/btrfs/free-space-cache.h | 1 + fs/btrfs/inode-map.c| 2 +- fs/btrfs/relocation.c | 9 +- fs/btrfs/transaction.c | 38 ++- fs/btrfs/transaction.h | 9 ++ 9 files changed, 341 insertions(+), 37 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 1df0d9d..83051fa 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1491,6 +1491,12 @@ struct btrfs_fs_info { struct mutex chunk_mutex; struct mutex volume_mutex; + /* +* this is taken to make sure we don't set block groups ro after +* the free space cache has been allocated on them +*/ + struct mutex ro_block_group_mutex; + /* this is used during read/modify/write to make sure * no two ios are trying to mod the same stripe at the same * time @@ -3407,6 +3413,8 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans, u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid, u64 owner, u64 offset, int no_quota); +int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans, + struct btrfs_root *root); int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans, struct btrfs_root *root); int btrfs_setup_space_cache(struct btrfs_trans_handle *trans, diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index
568cc4e..b5e3d5f 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2572,6 +2572,7 @@ int open_ctree(struct super_block *sb, mutex_init(&fs_info->transaction_kthread_mutex); mutex_init(&fs_info->cleaner_mutex); mutex_init(&fs_info->volume_mutex); + mutex_init(&fs_info->ro_block_group_mutex); init_rwsem(&fs_info->commit_root_sem); init_rwsem(&fs_info->cleanup_work_sem); init_rwsem(&fs_info->subvol_sem); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 40c9513..02c2b29 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3298,7 +3298,7 @@ again: if (ret) goto out_put; - ret = btrfs_truncate_free_space_cache(root, trans, inode); + ret = btrfs_truncate_free_space_cache(root, trans, NULL, inode); if (ret) goto out_put; } @@ -3382,20 +3382,156 @@ int btrfs_setup_space_cache(struct btrfs_trans_handle *trans, return 0; } -int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans, +/* + * transaction commit does final block group cache writeback during a + * critical section where nothing is allowed to change the FS. This is + * required in order for the cache to actually match the block group, + * but can introduce a lot of latency into the commit. + * + * So, btrfs_start_dirty_block_groups is here to kick off block group + * cache IO. There's a chance we'll have to redo some of it if the + * block group changes again during the commit, but it greatly reduces + * the commit latency by getting rid of the easy block groups while + * we're still allowing others to join the commit. 
+ */ +int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans, struct btrfs_root *root) { struct btrfs_block_group_cache *cache; struct btrfs_transaction *cur_trans = trans->transaction; int ret = 0; int should_put; - struct btrfs_path *path; - LIST_HEAD(io); + struct btrfs_path *path = NULL; + LIST_HEAD(dirty); + struct list_head *io = &cur_trans->io_bgs; int num_started = 0; - int num_waited = 0; + int loops = 0; + + spin_lock(&cur_trans->dirty_bgs_lock); + if (!list_empty(&cur_trans->dirty_bgs)) { + list_splice_init(&cur_trans->dirty_bgs, &dirty); + } + spin_unlock(&cur_trans->dirty_bgs_lock); - if (list_empty(&cur_trans->dirty_bgs)) +again: + if (list_empty(&dirty)) { + btrfs_free_path(path); return 0; + } + +
[PATCH 0/5] Btrfs: truncate space reservation fixes
One of the production workloads here at FB ends up creating and eventually deleting very large files. We were consistently hitting ENOSPC aborts while trying to delete the files because there wasn't enough metadata reserved to cover deleting CRCs or actually updating the block group items on disk. This patchset addresses these problems by adding crc items into the math for delayed ref processing, and changing the truncate items loop to reserve metadata more often. It also solves a performance problem where we are constantly committing the transaction in hopes of making enospc progress.
[PATCH 4/5] Btrfs: don't commit the transaction in the async space flushing
From: Josef Bacik We're triggering a huge number of commits from btrfs_async_reclaim_metadata_space. These aren't really required, because everyone calling the async reclaim code is going to end up triggering a commit on their own. Signed-off-by: Chris Mason --- fs/btrfs/extent-tree.c | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index ae8db3ba..3d4b3d680 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4329,8 +4329,13 @@ out: static inline int need_do_async_reclaim(struct btrfs_space_info *space_info, struct btrfs_fs_info *fs_info, u64 used) { - return (used >= div_factor_fine(space_info->total_bytes, 98) && - !btrfs_fs_closing(fs_info) && + u64 thresh = div_factor_fine(space_info->total_bytes, 98); + + /* If we're just plain full then async reclaim just slows us down. */ + if (space_info->bytes_used >= thresh) + return 0; + + return (used >= thresh && !btrfs_fs_closing(fs_info) && !test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state)); } @@ -4385,10 +4390,7 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work) if (!btrfs_need_do_async_reclaim(space_info, fs_info, flush_state)) return; - } while (flush_state <= COMMIT_TRANS); - - if (btrfs_need_do_async_reclaim(space_info, fs_info, flush_state)) - queue_work(system_unbound_wq, work); + } while (flush_state < COMMIT_TRANS); } void btrfs_init_async_reclaim_work(struct work_struct *work) -- 1.8.1
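The guard added in this patch can be modeled in a few lines (made-up helper name; the real check also consults fs-closing and remounting state): when bytes_used alone already crosses the 98% threshold, the filesystem is just plain full and kicking async reclaim would only slow things down, so reclaim only runs when reservations, not real data, push usage over the line.

```python
def need_async_reclaim(total_bytes, bytes_used, used_plus_reserved):
    """Toy version of need_do_async_reclaim after this patch.
    used_plus_reserved models the kernel's 'used' argument, which
    includes outstanding reservations on top of bytes_used."""
    thresh = total_bytes * 98 // 100
    if bytes_used >= thresh:          # plain full: reclaim can't help
        return False
    return used_plus_reserved >= thresh

full = need_async_reclaim(1000, 985, 990)            # plain full
reserved_heavy = need_async_reclaim(1000, 600, 985)  # reservations push us over
calm = need_async_reclaim(1000, 600, 700)            # plenty of room
```

Only the middle case triggers reclaim: there is real space to win back by flushing reservations, whereas the first case cannot improve no matter how much we flush.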
[PATCH 2/5] Btrfs: refill block reserves during truncate
When truncate starts, it allocates some space in the block reserves so that we'll have enough to update metadata along the way. For very large files, we can easily go through all of that space as we loop through the extents. This changes truncate to refill the space reservation as it progresses through the file. Signed-off-by: Chris Mason --- fs/btrfs/ctree.h | 3 +++ fs/btrfs/extent-tree.c | 9 - fs/btrfs/inode.c | 45 +++-- 3 files changed, 46 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 95944b8..6bf16d5 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3297,6 +3297,9 @@ static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping) } /* extent-tree.c */ + +u64 btrfs_csum_bytes_to_leaves(struct btrfs_root *root, u64 csum_bytes); + static inline u64 btrfs_calc_trans_metadata_size(struct btrfs_root *root, unsigned num_items) { diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index a6f88eb..75f4bed 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2636,7 +2636,7 @@ static inline u64 heads_to_leaves(struct btrfs_root *root, u64 heads) * Takes the number of bytes to be csumm'ed and figures out how many leaves it * would require to store the csums for that many bytes. 
*/ -static u64 csum_bytes_to_leaves(struct btrfs_root *root, u64 csum_bytes) +u64 btrfs_csum_bytes_to_leaves(struct btrfs_root *root, u64 csum_bytes) { u64 csum_size; u64 num_csums_per_leaf; @@ -2665,7 +2665,7 @@ int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans, if (num_heads > 1) num_bytes += (num_heads - 1) * root->nodesize; num_bytes <<= 1; - num_bytes += csum_bytes_to_leaves(root, csum_bytes) * root->nodesize; + num_bytes += btrfs_csum_bytes_to_leaves(root, csum_bytes) * root->nodesize; global_rsv = &root->fs_info->global_block_rsv; /* @@ -5098,13 +5098,12 @@ static u64 calc_csum_metadata_size(struct inode *inode, u64 num_bytes, BTRFS_I(inode)->csum_bytes == 0) return 0; - old_csums = csum_bytes_to_leaves(root, BTRFS_I(inode)->csum_bytes); - + old_csums = btrfs_csum_bytes_to_leaves(root, BTRFS_I(inode)->csum_bytes); if (reserve) BTRFS_I(inode)->csum_bytes += num_bytes; else BTRFS_I(inode)->csum_bytes -= num_bytes; - num_csums = csum_bytes_to_leaves(root, BTRFS_I(inode)->csum_bytes); + num_csums = btrfs_csum_bytes_to_leaves(root, BTRFS_I(inode)->csum_bytes); /* No change, no need to reserve more */ if (old_csums == num_csums) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index cec23cf..88537c5 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4163,6 +4163,21 @@ out: return err; } +static int truncate_space_check(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + u64 bytes_deleted) +{ + int ret; + + bytes_deleted = btrfs_csum_bytes_to_leaves(root, bytes_deleted); + ret = btrfs_block_rsv_add(root, &root->fs_info->trans_block_rsv, + bytes_deleted, BTRFS_RESERVE_NO_FLUSH); + if (!ret) + trans->bytes_reserved += bytes_deleted; + return ret; + +} + /* * this can truncate away extent items, csum items and directory items. 
* It starts at a high offset and removes keys until it can't find @@ -4201,6 +4216,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, u64 bytes_deleted = 0; bool be_nice = 0; bool should_throttle = 0; + bool should_end = 0; BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY); @@ -4396,6 +4412,8 @@ delete: } else { break; } + should_throttle = 0; + if (found_extent && (test_bit(BTRFS_ROOT_REF_COWS, &root->state) || root == root->fs_info->tree_root)) { @@ -4409,17 +4427,24 @@ delete: if (btrfs_should_throttle_delayed_refs(trans, root)) btrfs_async_run_delayed_refs(root, trans->delayed_ref_updates * 2, 0); + if (be_nice) { + if (truncate_space_check(trans, root, +extent_num_bytes)) { + should_end = 1; + } + if (btrfs_should_throttle_delayed_refs(trans, + root)) { + should_throttle = 1; + } +
[PATCH 1/5] Btrfs: account for crcs in delayed ref processing
From: Josef Bacik As we delete large extents, we end up doing huge amounts of COW in order to delete the corresponding crcs. This adds accounting so that we keep track of that space and flushing of delayed refs so that we don't build up too much delayed crc work. This helps limit the delayed work that must be done at commit time and tries to avoid ENOSPC aborts because the crcs eat all the global reserves. Signed-off-by: Chris Mason --- fs/btrfs/delayed-ref.c | 22 -- fs/btrfs/delayed-ref.h | 10 ++ fs/btrfs/extent-tree.c | 46 +++--- fs/btrfs/inode.c | 25 ++--- fs/btrfs/transaction.c | 4 5 files changed, 83 insertions(+), 24 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 6d16bea..8f8ed7d 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -489,11 +489,13 @@ update_existing_ref(struct btrfs_trans_handle *trans, * existing and update must have the same bytenr */ static noinline void -update_existing_head_ref(struct btrfs_delayed_ref_node *existing, +update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs, +struct btrfs_delayed_ref_node *existing, struct btrfs_delayed_ref_node *update) { struct btrfs_delayed_ref_head *existing_ref; struct btrfs_delayed_ref_head *ref; + int old_ref_mod; existing_ref = btrfs_delayed_node_to_head(existing); ref = btrfs_delayed_node_to_head(update); @@ -541,7 +543,20 @@ update_existing_head_ref(struct btrfs_delayed_ref_node *existing, * only need the lock for this case cause we could be processing it * currently, for refs we just added we know we're a-ok. */ + old_ref_mod = existing_ref->total_ref_mod; existing->ref_mod += update->ref_mod; + existing_ref->total_ref_mod += update->ref_mod; + + /* +* If we are going to from a positive ref mod to a negative or vice +* versa we need to make sure to adjust pending_csums accordingly. 
+*/ + if (existing_ref->is_data) { + if (existing_ref->total_ref_mod >= 0 && old_ref_mod < 0) + delayed_refs->pending_csums -= existing->num_bytes; + if (existing_ref->total_ref_mod < 0 && old_ref_mod >= 0) + delayed_refs->pending_csums += existing->num_bytes; + } spin_unlock(&existing_ref->lock); } @@ -605,6 +620,7 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info, head_ref->is_data = is_data; head_ref->ref_root = RB_ROOT; head_ref->processing = 0; + head_ref->total_ref_mod = count_mod; spin_lock_init(&head_ref->lock); mutex_init(&head_ref->mutex); @@ -614,7 +630,7 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info, existing = htree_insert(&delayed_refs->href_root, &head_ref->href_node); if (existing) { - update_existing_head_ref(&existing->node, ref); + update_existing_head_ref(delayed_refs, &existing->node, ref); /* * we've updated the existing ref, free the newly * allocated ref @@ -622,6 +638,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info, kmem_cache_free(btrfs_delayed_ref_head_cachep, head_ref); head_ref = existing; } else { + if (is_data && count_mod < 0) + delayed_refs->pending_csums += num_bytes; delayed_refs->num_heads++; delayed_refs->num_heads_ready++; atomic_inc(&delayed_refs->num_entries); diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h index a764e23..5eb0892 100644 --- a/fs/btrfs/delayed-ref.h +++ b/fs/btrfs/delayed-ref.h @@ -88,6 +88,14 @@ struct btrfs_delayed_ref_head { struct rb_node href_node; struct btrfs_delayed_extent_op *extent_op; + + /* +* This is used to track the final ref_mod from all the refs associated +* with this head ref, this is not adjusted as delayed refs are run, +* this is meant to track if we need to do the csum accounting or not. 
+*/ + int total_ref_mod; + /* * when a new extent is allocated, it is just reserved in memory * The actual extent isn't inserted into the extent allocation tree @@ -138,6 +146,8 @@ struct btrfs_delayed_ref_root { /* total number of head nodes ready for processing */ unsigned long num_heads_ready; + u64 pending_csums; + /* * set when the tree is flushing before a transaction commit, * used by the throttling code to decide if new updates need diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 41e5812..a6f88eb 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2538,6 +2538
[PATCH 5/5] Btrfs: don't steal from the global reserve if we don't have the space
From: Josef Bacik btrfs_evict_inode() needs to be more careful about stealing from the global_rsv. We dont' want to end up aborting commit with ENOSPC just because the evict_inode code was too greedy. Signed-off-by: Chris Mason --- fs/btrfs/inode.c | 46 -- 1 file changed, 44 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 88537c5..141df0c 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5010,6 +5010,7 @@ void btrfs_evict_inode(struct inode *inode) struct btrfs_trans_handle *trans; struct btrfs_root *root = BTRFS_I(inode)->root; struct btrfs_block_rsv *rsv, *global_rsv; + int steal_from_global = 0; u64 min_size = btrfs_calc_trunc_metadata_size(root, 1); int ret; @@ -5077,9 +5078,20 @@ void btrfs_evict_inode(struct inode *inode) * hard as possible to get this to work. */ if (ret) - ret = btrfs_block_rsv_migrate(global_rsv, rsv, min_size); + steal_from_global++; + else + steal_from_global = 0; + ret = 0; - if (ret) { + /* +* steal_from_global == 0: we reserved stuff, hooray! +* steal_from_global == 1: we didn't reserve stuff, boo! +* steal_from_global == 2: we've committed, still not a lot of +* room but maybe we'll have room in the global reserve this +* time. +* steal_from_global == 3: abandon all hope! +*/ + if (steal_from_global > 2) { btrfs_warn(root->fs_info, "Could not get space for a delete, will truncate on mount %d", ret); @@ -5095,6 +5107,36 @@ void btrfs_evict_inode(struct inode *inode) goto no_delete; } + /* +* We can't just steal from the global reserve, we need tomake +* sure there is room to do it, if not we need to commit and try +* again. +*/ + if (steal_from_global) { + if (!btrfs_check_space_for_delayed_refs(trans, root)) + ret = btrfs_block_rsv_migrate(global_rsv, rsv, + min_size); + else + ret = -ENOSPC; + } + + /* +* Couldn't steal from the global reserve, we have too much +* pending stuff built up, commit the transaction and try it +* again. 
+*/ + if (ret) { + ret = btrfs_commit_transaction(trans, root); + if (ret) { + btrfs_orphan_del(NULL, inode); + btrfs_free_block_rsv(root, rsv); + goto no_delete; + } + continue; + } else { + steal_from_global = 0; + } + trans->block_rsv = rsv; ret = btrfs_truncate_inode_items(trans, root, inode, 0, 0); -- 1.8.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/5] Btrfs: reserve space for block groups
From: Josef Bacik This changes our delayed refs calculations to include the space needed to write back dirty block groups. Signed-off-by: Chris Mason --- fs/btrfs/extent-tree.c | 12 +--- fs/btrfs/transaction.c | 1 + fs/btrfs/transaction.h | 1 + 3 files changed, 11 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 75f4bed..ae8db3ba 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2657,7 +2657,8 @@ int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans, struct btrfs_block_rsv *global_rsv; u64 num_heads = trans->transaction->delayed_refs.num_heads_ready; u64 csum_bytes = trans->transaction->delayed_refs.pending_csums; - u64 num_bytes; + u64 num_dirty_bgs = trans->transaction->num_dirty_bgs; + u64 num_bytes, num_dirty_bgs_bytes; int ret = 0; num_bytes = btrfs_calc_trans_metadata_size(root, 1); @@ -2666,17 +2667,21 @@ int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans, num_bytes += (num_heads - 1) * root->nodesize; num_bytes <<= 1; num_bytes += btrfs_csum_bytes_to_leaves(root, csum_bytes) * root->nodesize; + num_dirty_bgs_bytes = btrfs_calc_trans_metadata_size(root, +num_dirty_bgs); global_rsv = &root->fs_info->global_block_rsv; /* * If we can't allocate any more chunks lets make sure we have _lots_ of * wiggle room since running delayed refs can create more delayed refs. 
*/ - if (global_rsv->space_info->full) + if (global_rsv->space_info->full) { + num_dirty_bgs_bytes <<= 1; num_bytes <<= 1; + } spin_lock(&global_rsv->lock); - if (global_rsv->reserved <= num_bytes) + if (global_rsv->reserved <= num_bytes + num_dirty_bgs_bytes) ret = 1; spin_unlock(&global_rsv->lock); return ret; @@ -5408,6 +5413,7 @@ static int update_block_group(struct btrfs_trans_handle *trans, if (list_empty(&cache->dirty_list)) { list_add_tail(&cache->dirty_list, &trans->transaction->dirty_bgs); + trans->transaction->num_dirty_bgs++; btrfs_get_block_group(cache); } spin_unlock(&trans->transaction->dirty_bgs_lock); diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 8b9eea8..234d606 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -251,6 +251,7 @@ loop: INIT_LIST_HEAD(&cur_trans->switch_commits); INIT_LIST_HEAD(&cur_trans->pending_ordered); INIT_LIST_HEAD(&cur_trans->dirty_bgs); + cur_trans->num_dirty_bgs = 0; spin_lock_init(&cur_trans->dirty_bgs_lock); list_add_tail(&cur_trans->list, &fs_info->trans_list); extent_io_tree_init(&cur_trans->dirty_pages, diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index 96b189b..4cb0ae2 100644 --- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -64,6 +64,7 @@ struct btrfs_transaction { struct list_head pending_ordered; struct list_head switch_commits; struct list_head dirty_bgs; + u64 num_dirty_bgs; spinlock_t dirty_bgs_lock; struct btrfs_delayed_ref_root delayed_refs; int aborted; -- 1.8.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC] fscache/cachefiles versus btrfs
On Fri, Apr 10, 2015 at 02:28:16PM +0100, David Howells wrote: > Dave Chinner wrote: > > > SEEK_HOLE/SEEK_DATA is what you want, as they are page cache > > coherent, not extent based operations. And, really if you need it to > > really be able to find real holes, then a superblock flag might be a > > better way of marking filesystems with the required capability. > > Actually, I wonder if what I want is a kernel_read() that returns ENODATA upon > encountering a hole at the beginning of the area to be read. NFS READ_PLUS could also make use of this, but someone needs to actually implement it. Until we have that lseek SEEK_HOLE/DATA is the way to go, and the horrible ->bmap hack needs to die ASAP, I can't believe you managed to sneak that in in the not too distant past.
Re: Please add 9c4f61f01d269815bb7c37be3ede59c5587747c6 to stable
On Mon, Apr 13, 2015 at 07:28:38PM +0500, Roman Mamedov wrote: > On Thu, 2 Apr 2015 10:17:47 -0400 > Chris Mason wrote: > > > Hi stable friends, > > > > Can you please backport this one to 3.19.y. It fixes a bug introduced > > by: > > > > 381cf6587f8a8a8e981bc0c18859b51dc756, which was tagged for stable > > 3.14+ > > > > The symptoms of the bug are deadlocks during log replay after a crash. > > The patch wasn't intentionally fixing the deadlock, which is why we > > missed it when tagging fixes. > > Unfortunately still not fixed (no btrfs-related changes) in 3.14.38 and > 3.18.11 released today. I have a few hundred stable backports left to sort through, don't worry, this is still in the queue, it's not lost. greg k-h
Re: [PATCH RFC 1/3] vfs: add copy_file_range syscall and vfs helper
> > >> Could we perhaps instead of a length, define a 'pos_in_start' and a > > >> 'pos_in_end' offset (with the latter being -1 for a full-file copy) > > >> and then return an 'loff_t' value stating where the copy ended? > > > > > > Well, the resulting offset will be set if the caller provided it. So > > > they could already be getting the copied length from that. But they > > > might not specify the offsets. Maybe they're just using the results to > > > total up a completion indicator. > > > > > > Maybe we could make the length a pointer like the offsets that's set to > > > the copied length on return. > > > > That works, but why do we care so much about the difference between a > > length and an offset as a return value? > > > > I think it just comes down to potential confusion for users. What's > more useful, the number of bytes actually copied, or the offset into the > file where the copy ended? > > I tend to the think an offset is more useful for someone trying to > copy a file in chunks, particularly if the file is sparse. That gives > them a clear place to continue the copy. > > So, I think I agree with Trond that phrasing this interface in terms of > file offsets seems like it might be more useful. That also neatly > sidesteps the size_t limitations on 32-bit platforms. Yeah, fair enough. I'll rework it. > > To be fair, the NFS copy offload also allows the copy to proceed out > > of order, in which case the range of copied data could be > > non-contiguous in the case of a failure. However neither the length > > nor the offset case will give you the full story in that case. Any > > return value can at best be considered to define an offset range whose > > contents need to be checked for success/failure. > > > > Yuck! How the heck do you clean up the mess if that happens? I guess > you're just stuck redoing the copy with normal READ/WRITE? I don't think anyone will worry about checking file contents. 
Yes, technically you can get fragmented completion past the initial contiguous region that the interface told you is done. You can get that with O_DIRECT today. But it's a rare case that is not worth worrying about. You'll retry at the contiguous offset until it doesn't make progress and then fall back to read/write. - z
Re: ERROR: error removing the device '/dev/sdXN' - Inappropriate ioctl for device
Thanks Martin for helping with the data. Unfortunately, no matter what I try I couldn't reproduce the "Inappropriate ioctl for device". But I got EINVAL, which is wrong as well, so I sent a patch for that; EIO is appropriate in this case. Next, passing a devid to delete a device - yes, it would help; the patch is ready in my workspace. BUT - I just noticed that the error handling code puts the FS into readonly when there is a commit failure, so that should be fixed along with this patch, which I am attempting. As in your case with a 3-disk raid1, your reconstructed raid1 will still be considered healthy after taking out a disk. Further, to recover from your situation you could try replace; that will work. OR, if you are not planning to add another disk, try this (sorry, it needs downtime): umount; remove the failed disk from the system; mount -o degraded; btrfs dev del missing; mount -o remount. (I hope you will check/manage the space availability part.) By the way, those "missing" messages are fabricated in user land (btrfs-progs: 206efb60cbe3049e0d44c6da3c1909aeee18f813), so don't depend on them; to know what the kernel knows you probably need /proc/fs/btrfs/devlist or the sysfs interface posted in the ML before. "btrfs fi show -m" was written to show the actual kernel visibility, but it was again crippled by the above commit id. David, we should back out 206efb60cbe3049e0d44c6da3c1909aeee18f813; as mentioned before, it will help.
On 04/07/2015 07:41 PM, Martin wrote: On 06/04/15 14:32, Anand Jain wrote: btrfs fi show -d That gives: # btrfs fi show -d warning, device 3 is missing warning devid 3 not found already warning, device 3 is missing warning devid 3 not found already David, As commented before - you shouldn't have integrated the patch 915902c5002485fb13d27c4b699a73fb66cc0f09 Thanks, Anand Label: 'btrfs_root' uuid: 92452e9a-2775-45c4-922c-f01b2afd51c2 Total devices 3 FS bytes used 30.94GiB devid1 size 24.00GiB used 24.00GiB path /dev/sda4 devid2 size 24.00GiB used 24.00GiB path /dev/sdc4 devid3 size 24.00GiB used 24.00GiB path /dev/sde4 Label: 'btrfs_data' uuid: d1b96638-be89-4291-8a40-f2f2e1dc5223 Total devices 3 FS bytes used 95.74GiB devid1 size 87.24GiB used 86.48GiB path /dev/sda5 devid2 size 87.24GiB used 87.24GiB path /dev/sdc5 devid3 size 87.24GiB used 87.24GiB path /dev/sde5 Label: 'btrfs_root2' uuid: 62603ce8-c333-4ca7-92f7-f8bdd712ab37 Total devices 3 FS bytes used 151.60MiB devid1 size 24.00GiB used 24.00GiB path /dev/sdb4 devid2 size 24.00GiB used 24.00GiB path /dev/sdd4 *** Some devices missing Label: 'btrfs_data2' uuid: 3aaee716-b98b-4c86-ba5a-53456994f152 Total devices 3 FS bytes used 159.34GiB devid1 size 206.47GiB used 206.02GiB path /dev/sdb5 devid2 size 206.47GiB used 206.47GiB path /dev/sdd5 *** Some devices missing btrfs-progs v3.19.1 And without the "-d": # btrfs fi show Label: 'btrfs_root' uuid: 92452e9a-2775-45c4-922c-f01b2afd51c2 Total devices 3 FS bytes used 30.94GiB devid1 size 24.00GiB used 24.00GiB path /dev/sda4 devid2 size 24.00GiB used 24.00GiB path /dev/sdc4 devid3 size 24.00GiB used 24.00GiB path /dev/sde4 Label: 'btrfs_data' uuid: d1b96638-be89-4291-8a40-f2f2e1dc5223 Total devices 3 FS bytes used 95.74GiB devid1 size 87.24GiB used 86.48GiB path /dev/sda5 devid2 size 87.24GiB used 87.24GiB path /dev/sdc5 devid3 size 87.24GiB used 87.24GiB path /dev/sde5 Label: 'btrfs_root2' uuid: 62603ce8-c333-4ca7-92f7-f8bdd712ab37 Total devices 3 FS bytes used 
151.60MiB devid1 size 24.00GiB used 24.00GiB path /dev/sdb4 devid2 size 24.00GiB used 24.00GiB path /dev/sdd4 devid3 size 24.00GiB used 24.00GiB path /dev/sdf4 Label: 'btrfs_data2' uuid: 3aaee716-b98b-4c86-ba5a-53456994f152 Total devices 3 FS bytes used 159.34GiB devid1 size 206.47GiB used 206.02GiB path /dev/sdb5 devid2 size 206.47GiB used 206.47GiB path /dev/sdd5 devid3 size 206.47GiB used 206.47GiB path /dev/sdf5 btrfs-progs v3.19.1 Interestingly, all the log messages about /dev/sdf are now no longer being repeated. (And nope, not had a chance to swap that disk yet!) Hence, should I do a "btrfs device delete missing /mnt/data2"? Cheers, Martin
Re: [PATCH 8/9] btrfs: wait for delayed iputs on no space
On 04/09/2015 12:34 AM, Zhaolei wrote: > From: Zhao Lei > > btrfs will report no_space when we run the following write-and-delete > file loop: > # FILE_SIZE_M=[ 75% of fs space ] > # DEV=[ some dev ] > # MNT=[ some dir ] > # > # mkfs.btrfs -f "$DEV" > # mount -o nodatacow "$DEV" "$MNT" > # for ((i = 0; i < 100; i++)); do dd if=/dev/zero of="$MNT"/file0 bs=1M > count="$FILE_SIZE_M"; rm -f "$MNT"/file0; done > # > > Reason: > iput() and evict() run after the pages are written to the block device; > if that writeback has not finished before the next write, the "rm"ed > space has not been freed yet, which causes the bug above. > > Fix: > Mounting with "-o flushoncommit" avoids the bug, but it hurts > performance. Instead, we can wait for in-flight writes only when > no-space happens, which is what this patch does. Can you please change this so we only do this flush if the first commit doesn't free up enough space? I think this is going to have a performance impact as the FS fills up. -chris
Re: [GSoC 2015] Btrfs content based storage
On Fri, Mar 27, 2015 at 10:58:42AM -0400, harshad shirwadkar wrote: > I am a CS graduate student from Carnegie Mellon University. I am > hoping to build the feature - "Content based storage mode" under > Google Summer of Code 2015. This project has also been listed as an > idea on BTRFS ideas page. However, I have not found a mentor yet, and > without a mentor I can not participate in the program. Please let me > know if anybody is interested in mentoring this project. Here is a > link to my proposal: > > http://harshadjs.github.io/2015/03/27/Fedora-BTRFS-Content-Storage-Mode/ This probably has a significant overlap with the in-band dedup work from Liu bo [1]. Your proposal expects an interface to look up the data by hash which hasn't been implemented afaik. [1] http://thread.gmane.org/gmane.comp.file-systems.btrfs/34097 (v10)
Re: Degraded volume silently fails to mount
On 4/13/2015 10:07, Hugo Mills wrote: On Mon, Apr 13, 2015 at 09:51:09AM -0400, Michael Tharp wrote: Hi list, I've got a 4 disk raid1 volume that has one failed disk. I have so far been unable to mount it in degraded mode, but the failure is that "mount" silently does nothing. Check to see if systemd is unmounting it immediately after mount. This seems to be the usual reason for silent failures to mount an FS these days. Sigh, that was it. Thanks. Faith in btrfs restored. I had a custom unit file because the generated ones weren't getting the LUKS device dependencies correct. When the drive failed I commented out the fstab and crypttab entries but forgot about the custom unit file.
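For anyone hitting the same trap: a hand-written mount unit needs its device dependencies spelled out so systemd doesn't immediately unmount the filesystem. The unit below is purely illustrative — the mapper name, unit names, and mount point are made up:

```ini
# /etc/systemd/system/mnt-data.mount -- illustrative only; the cryptsetup
# instance name, mapper device, and mount point are hypothetical.
[Unit]
Requires=systemd-cryptsetup@data.service
After=systemd-cryptsetup@data.service

[Mount]
What=/dev/mapper/data
Where=/mnt/data
Type=btrfs
Options=degraded

[Install]
WantedBy=multi-user.target
```

When a mount "silently does nothing", `journalctl -b -u mnt-data.mount` (with the unit name matching the mount point) usually shows systemd stopping the unit right after the manual mount.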
[PATCH 1/1] Btrfs: __btrfs_std_error() logic should be consistent w/out CONFIG_PRINTK defined
error handling logic behaves differently with or without CONFIG_PRINTK defined, since there are two copies of the same function which a bit of different logic One, when CONFIG_PRINTK is defined, code is __btrfs_std_error(..) { :: save_error_info(fs_info); if (sb->s_flags & MS_BORN) btrfs_handle_error(fs_info); } and two when CONFIG_PRINTK is not defined, the code is __btrfs_std_error(..) { :: if (sb->s_flags & MS_BORN) { save_error_info(fs_info); btrfs_handle_error(fs_info); } } I doubt if this was intentional ? and appear to have caused since we maintain two copies of the same function and they got diverged with commits. Now to decide which logic is correct reviewed changes as below, 533574c6bc30cf526cc1c41bde050c854a945efb Commit added two copies of this function cf79ffb5b79e8a2b587fbf218809e691bb396c98 Commit made change to only one copy of the function and to the copy when CONFIG_PRINTK is defined. To fix this, instead of maintaining two copies of same function approach, maintain single function, and just put the extra portion of the code under CONFIG_PRINTK define. This patch just does that. And keeps code of with CONFIG_PRINTK defined. Signed-off-by: Anand Jain --- fs/btrfs/super.c | 27 +-- 1 file changed, 5 insertions(+), 22 deletions(-) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 7533afb..b0a465f 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -130,7 +130,6 @@ static void btrfs_handle_error(struct btrfs_fs_info *fs_info) } } -#ifdef CONFIG_PRINTK /* * __btrfs_std_error decodes expected errors from the caller and * invokes the approciate error response. @@ -139,7 +138,9 @@ void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function, unsigned int line, int errno, const char *fmt, ...) 
{ struct super_block *sb = fs_info->sb; +#ifdef CONFIG_PRINTK const char *errstr; +#endif /* * Special case: if the error is EROFS, and we're already @@ -148,6 +149,7 @@ void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function, if (errno == -EROFS && (sb->s_flags & MS_RDONLY)) return; +#ifdef CONFIG_PRINTK errstr = btrfs_decode_error(errno); if (fmt) { struct va_format vaf; @@ -165,6 +167,7 @@ void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function, printk(KERN_CRIT "BTRFS: error (device %s) in %s:%d: errno=%d %s\n", sb->s_id, function, line, errno, errstr); } +#endif /* Don't go through full error handling during mount */ save_error_info(fs_info); @@ -172,6 +175,7 @@ void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function, btrfs_handle_error(fs_info); } +#ifdef CONFIG_PRINTK static const char * const logtypes[] = { "emergency", "alert", @@ -211,27 +215,6 @@ void btrfs_printk(const struct btrfs_fs_info *fs_info, const char *fmt, ...) va_end(args); } - -#else - -void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function, - unsigned int line, int errno, const char *fmt, ...) -{ - struct super_block *sb = fs_info->sb; - - /* -* Special case: if the error is EROFS, and we're already -* under MS_RDONLY, then it is safe here. -*/ - if (errno == -EROFS && (sb->s_flags & MS_RDONLY)) - return; - - /* Don't go through full error handling during mount */ - if (sb->s_flags & MS_BORN) { - save_error_info(fs_info); - btrfs_handle_error(fs_info); - } -} #endif /* -- 2.0.0.153.g79d -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/1] Btrfs: SB read failure should return EIO for __bread failure
This will return EIO when __bread() fails to read SB, instead of EINVAL. Signed-off-by: Anand Jain --- fs/btrfs/disk-io.c | 18 +++--- fs/btrfs/volumes.c | 8 2 files changed, 19 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 53c83c9..f47c643 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2410,8 +2410,8 @@ int open_ctree(struct super_block *sb, * Read super block and check the signature bytes only */ bh = btrfs_read_dev_super(fs_devices->latest_bdev); - if (!bh) { - err = -EINVAL; + if (IS_ERR(bh)) { + err = PTR_ERR(bh); goto fail_alloc; } @@ -3093,6 +3093,7 @@ struct buffer_head *btrfs_read_dev_super(struct block_device *bdev) int i; u64 transid = 0; u64 bytenr; + int ret = -EINVAL; /* we would like to check all the supers, but that would make * a btrfs mount succeed after a mkfs from a different FS. @@ -3106,13 +3107,20 @@ struct buffer_head *btrfs_read_dev_super(struct block_device *bdev) break; bh = __bread(bdev, bytenr / 4096, BTRFS_SUPER_INFO_SIZE); - if (!bh) + /* +* If we fail to read from the underlaying drivers, as of now +* the best option we have is to mark it EIO. 
+*/ + if (!bh) { + ret = -EIO; continue; + } super = (struct btrfs_super_block *)bh->b_data; if (btrfs_super_bytenr(super) != bytenr || btrfs_super_magic(super) != BTRFS_MAGIC) { brelse(bh); + ret = -EINVAL; continue; } @@ -3124,6 +3132,10 @@ struct buffer_head *btrfs_read_dev_super(struct block_device *bdev) brelse(bh); } } + + if (!latest) + return ERR_PTR(ret); + return latest; } diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 0009fde..5536281 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -212,8 +212,8 @@ btrfs_get_bdev_and_sb(const char *device_path, fmode_t flags, void *holder, } invalidate_bdev(*bdev); *bh = btrfs_read_dev_super(*bdev); - if (!*bh) { - ret = -EINVAL; + if (IS_ERR(*bh)) { + ret = PTR_ERR(*bh); blkdev_put(*bdev, flags); goto error; } @@ -6770,8 +6770,8 @@ int btrfs_scratch_superblock(struct btrfs_device *device) struct btrfs_super_block *disk_super; bh = btrfs_read_dev_super(device->bdev); - if (!bh) - return -EINVAL; + if (IS_ERR(bh)) + return PTR_ERR(bh); disk_super = (struct btrfs_super_block *)bh->b_data; memset(&disk_super->magic, 0, sizeof(disk_super->magic)); -- 2.0.0.153.g79d -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/1] Btrfs-progs: fix compile warnings
simple compile time warning fixes. cmds-check.c: In function ‘del_file_extent_hole’: cmds-check.c:289: warning: ‘prev.len’ may be used uninitialized in this function cmds-check.c:289: warning: ‘prev.start’ may be used uninitialized in this function cmds-check.c:290: warning: ‘next.len’ may be used uninitialized in this function cmds-check.c:290: warning: ‘next.start’ may be used uninitialized in this function btrfs-calc-size.c: In function ‘print_seek_histogram’: btrfs-calc-size.c:221: warning: ‘group_start’ may be used uninitialized in this function btrfs-calc-size.c:223: warning: ‘group_end’ may be used uninitialized in this function Signed-off-by: Anand Jain --- btrfs-calc-size.c | 4 ++-- cmds-check.c | 3 +++ 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/btrfs-calc-size.c b/btrfs-calc-size.c index 1372084..88f92e1 100644 --- a/btrfs-calc-size.c +++ b/btrfs-calc-size.c @@ -218,9 +218,9 @@ static void print_seek_histogram(struct root_stats *stat) struct rb_node *n = rb_first(&stat->seek_root); struct seek *seek; u64 tick_interval; - u64 group_start; + u64 group_start = 0; u64 group_count = 0; - u64 group_end; + u64 group_end = 0; u64 i; u64 max_seek = stat->max_seek_len; int digits = 1; diff --git a/cmds-check.c b/cmds-check.c index ed8c698..de22185 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -293,6 +293,9 @@ static int del_file_extent_hole(struct rb_root *holes, int have_next = 0; int ret = 0; + memset(&prev, 0, sizeof(struct file_extent_hole)); + memset(&next, 0, sizeof(struct file_extent_hole)); + tmp.start = start; tmp.len = len; node = rb_search(holes, &tmp, compare_hole_range, NULL); -- 2.0.0.153.g79d -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Please add 9c4f61f01d269815bb7c37be3ede59c5587747c6 to stable
On Thu, 2 Apr 2015 10:17:47 -0400 Chris Mason wrote: > Hi stable friends, > > Can you please backport this one to 3.19.y. It fixes a bug introduced > by: > > 381cf6587f8a8a8e981bc0c18859b51dc756, which was tagged for stable > 3.14+ > > The symptoms of the bug are deadlocks during log replay after a crash. > The patch wasn't intentionally fixing the deadlock, which is why we > missed it when tagging fixes. Unfortunately still not fixed (no btrfs-related changes) in 3.14.38 and 3.18.11 released today. > > Please put this commit everywhere you've cherry-picked > 381cf6587f8a8a8e981bc0c18859b51dc756 > > commit 9c4f61f01d269815bb7c37be3ede59c5587747c6 > Author: David Sterba > Date: Fri Jan 2 19:12:57 2015 +0100 > > btrfs: simplify insert_orphan_item > > We can search and add the orphan item in one go, > btrfs_insert_orphan_item will find out if the item already exists. > > Signed-off-by: David Sterba > > -chris -- With respect, Roman
Re: Big disk space usage difference, even after defrag, on identical data
On 13-04-15 07:06, Duncan wrote:
>> So what can explain this? Where did the 66G go?
>
> Out of curiosity, does a balance on the actively used btrfs help?
>
> You mentioned defrag -v -r -clzo, but didn't use the -f (flush) or -t
> (minimum size file) options. Does adding -f -t1 help?

Unfortunately I can no longer try this, see the other reply why. But the
problem turned out to be some 1G-sized files, written using 3-5 extents,
that for whatever reason defrag was not touching.

> You aren't doing btrfs snapshots of either subvolume, are you?

No :-) I should've mentioned that.

> Defrag should force the rewrite of entire files and take care of this,
> but obviously it's not returning to "clean" state. I forgot what the
> default minimum file size is if -t isn't set, maybe 128 MiB? But a -t1
> will force it to defrag even small files, and I recall at least one
> thread here where the poster said it made all the difference for him,
> so try that. And the -f should force a filesystem sync afterward, so
> you know the numbers from any report you run afterward match the final
> state.

Reading the corresponding manual, the -t explanation says that "any
extent bigger than this size will be considered already defragged". So I
guess setting -t1 might've fixed the problem too... but after checking
the source, I'm not so sure.

I didn't find the -t default in the manpages - after browsing through
the source, the default is in the kernel:
https://github.com/torvalds/linux/blob/4f671fe2f9523a1ea206f63fe60a7c7b3a56d5c7/fs/btrfs/ioctl.c#L1268
(Not sure what units those are.)

I wonder if this is relevant:
https://github.com/torvalds/linux/blob/4f671fe2f9523a1ea206f63fe60a7c7b3a56d5c7/fs/btrfs/ioctl.c#L2572
This seems to reset the -t flag if compress (-c) is set? This looks a
bit fishy?

> Meanwhile, you may consider using the nocow attribute on those database
> files. It will disable compression on them,

I'm using btrfs specifically to get compression, so this isn't an option.
> While initial usage will be higher due to the lack of compression,
> as you've discovered, over time, on an actively updated database,
> compression isn't all that effective anyway.

I don't see why. If you're referring to the additional overhead of
continuously compressing and decompressing everything - yes, of course.
But in my case I have a mostly-append workload to a huge amount of
fairly compressible data that's on magnetic storage, so compression is a
win in disk space and perhaps even in performance.

I'm well aware of the many caveats in using btrfs for databases -
they're well documented and although I much appreciate your extended
explanation, it wasn't new to me. It turns out that if your dataset
isn't update heavy (so it doesn't fragment to begin with), or has to be
queried via indexed access (i.e. mostly via random seeks), the
fragmentation doesn't matter much anyway. Conversely, btrfs appears to
have better sync performance with multiple threads, and allows one to
disable part of the partial-page-write protection logic in the database
(full_page_writes=off for PostgreSQL), because btrfs is already doing
the COW to ensure those can't actually happen [1].

The net result is a *boost* from about 40 tps (ext4) to 55 tps (btrfs),
which certainly is contrary to popular wisdom. Maybe btrfs would fall
off eventually as fragmentation does set in gradually, but given that
there's an offline defragmentation tool that can run in the background,
I don't care.

[1] I wouldn't be too surprised if database COW, which consists of
journal-writing a copy of the data out of band, then rewriting it again
in the original place, is actually functionally equivalent to disabling
COW in the database and running btrfs + defrag. Obviously you shouldn't
keep COW enabled in btrfs *AND* the DB, requiring all data to be copied
around at least 3 times... which I'm afraid almost everyone does because
it's the default...
--
GCP
Re: Degraded volume silently fails to mount
On Mon, Apr 13, 2015 at 09:51:09AM -0400, Michael Tharp wrote:
> Hi list,
>
> I've got a 4 disk raid1 volume that has one failed disk. I have so
> far been unable to mount it in degraded mode, but the failure is
> that "mount" silently does nothing.

Check to see if systemd is unmounting it immediately after mount. This
seems to be the usual reason for silent failures to mount an FS these
days.

Hugo.

> # btrfs fi sh
> warning devid 2 not found already
> Label: 'seneca'  uuid: b9da07f5-c0fd-45ad-861b-d1bcad6cbf4c
> 	Total devices 4 FS bytes used 581.71GiB
> 	devid    1 size 931.51GiB used 334.02GiB path /dev/mapper/luks-seneca-1
> 	devid    3 size 931.51GiB used 334.01GiB path /dev/mapper/luks-seneca-3
> 	devid    4 size 931.51GiB used 334.01GiB path /dev/mapper/luks-seneca-4
> 	*** Some devices missing
>
> Btrfs v3.18.1
> # mount -t btrfs -o degraded /dev/mapper/luks-seneca-1 /seneca
> # echo $?
> 0
> # ls /seneca/
> # grep seneca /proc/mounts
> # dmesg |tail
> [   84.955467] BTRFS: device label seneca devid 1 transid 1753 /dev/dm-4
> [   87.926347] BTRFS: device label seneca devid 4 transid 1753 /dev/dm-5
> [  107.069109] BTRFS: device label seneca devid 3 transid 1753 /dev/dm-6
> [  195.267046] BTRFS info (device dm-6): allowing degraded mounts
> [  195.267094] BTRFS info (device dm-6): disk space caching is enabled
> [  195.267133] BTRFS: has skinny extents
> [  195.277615] BTRFS warning (device dm-6): devid 2 missing
> [  781.160250] BTRFS info (device dm-6): allowing degraded mounts
> [  781.160270] BTRFS info (device dm-6): disk space caching is enabled
> [  781.160286] BTRFS: has skinny extents
> # uname -a
> Linux ambrosia.homeslice 3.19.3-200.fc21.x86_64 #1 SMP Thu Mar 26
> 21:39:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> # btrfs --version
> Btrfs v3.18.1
>
> Any ideas?

--
Hugo Mills             | What's a Nazgûl like you doing in a place like this?
hugo@... carfax.org.uk | http://carfax.org.uk/ | PGP: E2AB1DE4 | Illiad
Degraded volume silently fails to mount
Hi list,

I've got a 4 disk raid1 volume that has one failed disk. I have so far
been unable to mount it in degraded mode, but the failure is that
"mount" silently does nothing.

# btrfs fi sh
warning devid 2 not found already
Label: 'seneca'  uuid: b9da07f5-c0fd-45ad-861b-d1bcad6cbf4c
	Total devices 4 FS bytes used 581.71GiB
	devid    1 size 931.51GiB used 334.02GiB path /dev/mapper/luks-seneca-1
	devid    3 size 931.51GiB used 334.01GiB path /dev/mapper/luks-seneca-3
	devid    4 size 931.51GiB used 334.01GiB path /dev/mapper/luks-seneca-4
	*** Some devices missing

Btrfs v3.18.1
# mount -t btrfs -o degraded /dev/mapper/luks-seneca-1 /seneca
# echo $?
0
# ls /seneca/
# grep seneca /proc/mounts
# dmesg |tail
[   84.955467] BTRFS: device label seneca devid 1 transid 1753 /dev/dm-4
[   87.926347] BTRFS: device label seneca devid 4 transid 1753 /dev/dm-5
[  107.069109] BTRFS: device label seneca devid 3 transid 1753 /dev/dm-6
[  195.267046] BTRFS info (device dm-6): allowing degraded mounts
[  195.267094] BTRFS info (device dm-6): disk space caching is enabled
[  195.267133] BTRFS: has skinny extents
[  195.277615] BTRFS warning (device dm-6): devid 2 missing
[  781.160250] BTRFS info (device dm-6): allowing degraded mounts
[  781.160270] BTRFS info (device dm-6): disk space caching is enabled
[  781.160286] BTRFS: has skinny extents
# uname -a
Linux ambrosia.homeslice 3.19.3-200.fc21.x86_64 #1 SMP Thu Mar 26 21:39:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
# btrfs --version
Btrfs v3.18.1

Any ideas?
RFC: plan to allow documentation contributions via github
Hi,

plan: I'd like to allow documentation updates through the github web
interface. "Patches via mailing list" will continue to work unchanged.

The current way - clone the git repository, edit files, send the patch
to the mailing list - might discourage people who are not developers or
not used to working with git that way.

There are some issues around the pull requests that I'm not yet clear
how to resolve. I'd like to keep the git history clean, so the pull
requests will not get merged the usual way. I'll probably merge the
changes/patches manually and then close the request. There shall be a
branch to serve as a starting point for any new edits, but it will be a
moving target after the pending patches get merged. I hope this will
work for the browser-only approach; the merging burden is on my side.

In order to get a working 'Preview' for the changes, we'd have to rename
all .txt files to .asciidoc. Then you get nice formatting on the github
site for free. You can see an example here:
https://github.com/kdave/btrfs-progs/blob/test-asciidoc/Documentation/btrfs-balance.asciidoc

The documentation is separated from the code, so we can afford to relax
the submission rules, though we'll still need the signed-off-by and
names for the final commits.

Thanks for feedback.
Re: [PATCH 1/1] btrfs-progs: improve troubleshooting avoid duplicate error strings
On Mon, Apr 13, 2015 at 08:37:01PM +0800, Anand Jain wrote:
> my troubleshooting experience says have unique error string per module.

+1 to that, thank you.

Marc

> In the below eg, its one additional step to know error line,
>
> cat -n cmds-device.c | egrep "error removing the device"
>    185	"ERROR: error removing the device '%s' - %s\n",
>    190	"ERROR: error removing the device '%s' - %s\n",
>
> which is completely avoidable.
>
> Signed-off-by: Anand Jain
> ---
>  cmds-device.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/cmds-device.c b/cmds-device.c
> index 1c72e90..1c32771 100644
> --- a/cmds-device.c
> +++ b/cmds-device.c
> @@ -187,7 +187,7 @@ static int cmd_rm_dev(int argc, char **argv)
>  			ret++;
>  		} else if (res < 0) {
>  			fprintf(stderr,
> -				"ERROR: error removing the device '%s' - %s\n",
> +				"ERROR: ioctl error removing the device '%s' - %s\n",
>  				argv[i], strerror(e));
>  			ret++;
>  		}
> --
> 2.0.0.153.g79d

--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
BtrFS and encryption
Hello,

it seems that ext4 is getting encryption support:
http://thread.gmane.org/gmane.comp.file-systems.ext4/48206
rumor says because of performance problems with eCryptFS on Android.
f2fs should get a compatible interface too.

I would like to see this in BtrFS as well…

MfG
	bmg

--
„Des is völlig wurscht, was heut beschlos- |  M G Berberich
 sen wird: I bin sowieso dagegn!“          |  berbe...@fmi.uni-passau.de
(SPD-Stadtrat Kurt Schindler; Regensburg)  |  www.fmi.uni-passau.de/~berberic
[4.0] BTRFS + ecryptfs: Iceweasel cache process hanging on evicting inodes
Hi!

This may or may not be related to:

Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for
minutes on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401

BTRFS free space handling still needs more work: Hangs again
Martin Steigerwald | 26 Dec 14:37 2014
http://permalink.gmane.org/gmane.comp.file-systems.btrfs/41790

I am not sure cause I didn't check CPU usage of processes. It may be a
different issue, thus reporting here first. This is a 4.0 kernel with
just the patch from Lutz included to make trimming work.

In case you suspect this to be an ecryptfs issue please tell me and I
will forward to the ecryptfs mailing list. I really hope that BTRFS will
take on the Ext4 and probably F2FS work to include encryption within the
filesystem directly.

After seeing Iceweasel not responding anymore in several tabs I saw this
in syslog:

Apr 13 12:49:23 merkaba kernel: [ 4080.770733] INFO: task Cache2 I/O:3529 blocked for more than 120 seconds.
Apr 13 12:49:23 merkaba kernel: [ 4080.770741] Tainted: G O 4.0.0-tp520-btrfs-trim+ #25
Apr 13 12:49:23 merkaba kernel: [ 4080.770744] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 13 12:49:23 merkaba kernel: [ 4080.770746] Cache2 I/O D 88030d0b7c58 0 3529 3060 0x
Apr 13 12:49:23 merkaba kernel: [ 4080.770752] 88030d0b7c58 88030d0b7c38 8802dd6d49b0 88030d0b7c38
Apr 13 12:49:23 merkaba kernel: [ 4080.770758] 88030d0b7fd8 8802999ec488 88031915e2ac 88030d0b7cd8
Apr 13 12:49:23 merkaba kernel: [ 4080.770763] 88031915e290 88030d0b7c78 814c28e8
Apr 13 12:49:23 merkaba kernel: [ 4080.770768] Call Trace:
Apr 13 12:49:23 merkaba kernel: [ 4080.770777] [] schedule+0x6f/0x7e
Apr 13 12:49:23 merkaba kernel: [ 4080.770822] [] lock_extent_bits+0x100/0x188 [btrfs]
Apr 13 12:49:23 merkaba kernel: [ 4080.770828] [] ? finish_wait+0x5f/0x5f
Apr 13 12:49:23 merkaba kernel: [ 4080.770855] [] btrfs_evict_inode+0x14a/0x423 [btrfs]
Apr 13 12:49:23 merkaba kernel: [ 4080.770865] [] evict+0xa8/0x150
Apr 13 12:49:23 merkaba kernel: [ 4080.770869] [] iput+0x16f/0x1bb
Apr 13 12:49:23 merkaba kernel: [ 4080.770880] [] ecryptfs_evict_inode+0x29/0x2d [ecryptfs]
Apr 13 12:49:23 merkaba kernel: [ 4080.770888] [] ? ecryptfs_show_options+0x11e/0x11e [ecryptfs]
Apr 13 12:49:23 merkaba kernel: [ 4080.770893] [] evict+0xa8/0x150
Apr 13 12:49:23 merkaba kernel: [ 4080.770896] [] iput+0x16f/0x1bb
Apr 13 12:49:23 merkaba kernel: [ 4080.770901] [] do_unlinkat+0x151/0x1f0
Apr 13 12:49:23 merkaba kernel: [ 4080.770906] [] ? user_exit+0x13/0x15
Apr 13 12:49:23 merkaba kernel: [ 4080.770910] [] ? syscall_trace_enter_phase1+0x57/0x12a
Apr 13 12:49:23 merkaba kernel: [ 4080.770914] [] ? syscall_trace_leave+0xcb/0x108
Apr 13 12:49:23 merkaba kernel: [ 4080.770918] [] SyS_unlink+0x11/0x13
Apr 13 12:49:23 merkaba kernel: [ 4080.770923] [] system_call_fastpath+0x12/0x17
Apr 13 12:51:23 merkaba kernel: [ 4200.790479] INFO: task Cache2 I/O:3529 blocked for more than 120 seconds.
Apr 13 12:51:23 merkaba kernel: [ 4200.790492] Tainted: G O 4.0.0-tp520-btrfs-trim+ #25
Apr 13 12:51:23 merkaba kernel: [ 4200.790496] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 13 12:51:23 merkaba kernel: [ 4200.790500] Cache2 I/O D 88030d0b7c58 0 3529 3060 0x
Apr 13 12:51:23 merkaba kernel: [ 4200.790511] 88030d0b7c58 88030d0b7c38 8802dd6d49b0 88030d0b7c38
Apr 13 12:51:23 merkaba kernel: [ 4200.790520] 88030d0b7fd8 8802999ec488 88031915e2ac 88030d0b7cd8
Apr 13 12:51:23 merkaba kernel: [ 4200.790527] 88031915e290 88030d0b7c78 814c28e8
Apr 13 12:51:23 merkaba kernel: [ 4200.790536] Call Trace:
Apr 13 12:51:23 merkaba kernel: [ 4200.790552] [] schedule+0x6f/0x7e
Apr 13 12:51:23 merkaba kernel: [ 4200.790637] [] lock_extent_bits+0x100/0x188 [btrfs]
Apr 13 12:51:23 merkaba kernel: [ 4200.790645] [] ? finish_wait+0x5f/0x5f
Apr 13 12:51:23 merkaba kernel: [ 4200.790700] [] btrfs_evict_inode+0x14a/0x423 [btrfs]
Apr 13 12:51:23 merkaba kernel: [ 4200.790716] [] evict+0xa8/0x150
Apr 13 12:51:23 merkaba kernel: [ 4200.790721] [] iput+0x16f/0x1bb
Apr 13 12:51:23 merkaba kernel: [ 4200.790742] [] ecryptfs_evict_inode+0x29/0x2d [ecryptfs]
Apr 13 12:51:23 merkaba kernel: [ 4200.790758] [] ? ecryptfs_show_options+0x11e/0x11e [ecryptfs]
Apr 13 12:51:23 merkaba kernel: [ 4200.790765] [] evict+0xa8/0x150
Apr 13 12:51:23 merkaba kernel: [ 4200.790770] [] iput+0x16f/0x1bb
Apr 13 12:51:23 merkaba kernel: [ 4200.790777] [] do_unlinkat+0x151/0x1f0
Apr 13 12:51:23 merkaba kernel: [ 4200.790786] [] ? user_exit+0x13/0x15
Apr 13 12:51:23 merkaba kernel: [ 4200.790793] [] ? syscall_trace_enter_phase1+0x57/0x12a
Apr 13 12:51:23 merkaba kernel: [ 4200.790799] [] ? syscall_trace_le
[PATCH 1/1] btrfs-progs: improve troubleshooting avoid duplicate error strings
my troubleshooting experience says to have a unique error string per
module. In the below eg, it's one additional step to know the error
line,

cat -n cmds-device.c | egrep "error removing the device"
   185	"ERROR: error removing the device '%s' - %s\n",
   190	"ERROR: error removing the device '%s' - %s\n",

which is completely avoidable.

Signed-off-by: Anand Jain
---
 cmds-device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-device.c b/cmds-device.c
index 1c72e90..1c32771 100644
--- a/cmds-device.c
+++ b/cmds-device.c
@@ -187,7 +187,7 @@ static int cmd_rm_dev(int argc, char **argv)
 			ret++;
 		} else if (res < 0) {
 			fprintf(stderr,
-				"ERROR: error removing the device '%s' - %s\n",
+				"ERROR: ioctl error removing the device '%s' - %s\n",
 				argv[i], strerror(e));
 			ret++;
 		}
--
2.0.0.153.g79d
Re: Big disk space usage difference, even after defrag, on identical data
On 13-04-15 06:04, Zygo Blaxell wrote:
>> I would think that compression differences or things like
>> fragmentation or bookending for modified files shouldn't affect
>> this, because the first filesystem has been
>> defragmented/recompressed and didn't shrink.
>>
>> So what can explain this? Where did the 66G go?
>
> There are a few places: the kernel may have decided your files are
> not compressible and disabled compression on them (some older kernels
> did this with great enthusiasm);

As stated in the previous mail, this is 3.19.1. Moreover, the data is
either uniformly compressible or not at all. Lastly, note that the
*exact same* mount options are being used on *the exact same kernel*
with *the exact same data*. Getting a different compressibility decision
given the same inputs would point to bugs.

> your files might have preallocated space from the fallocate system
> call (which disables compression and allocates contiguous space, so
> defrag will not touch it).

So defrag -clzo or -czlib won't actually re-compress mostly-contiguous
files? That's evil. I have no idea whether PostgreSQL allocates files
that way, though.

> 'filefrag -v' can tell you if this is happening to your files.

Not sure how to interpret that. Without "-v", I see most of the (DB)
data has 2-5 extents per Gigabyte. A few have 8192 extents per Gigabyte.
Comparing to the copy that takes 66G less, there every (compressible)
file has about 8192 extents per Gigabyte, and the others 5 or 6. So you
may be right that some DB files are "wedged" in a format that btrfs
can't compress. I forced the files to be rewritten (VACUUM FULL) and
that "fixed" the problem.

> In practice database files take about double the amount of space
> they appear to because of extent shingling.
This is what I called "bookending" in the original mail; I didn't know
the correct name, but I understand doing updates can result in N^2/2 or
thereabouts disk space usage. However:

> Defragmenting the files helps free space temporarily; however, space
> usage will quickly grow again until it returns to the steady state
> around 2x the file size.

As stated in the original mail, the filesystem was *freshly
defragmented*, so that can't have been the cause.

> Until this is fixed, the most space-efficient approach seems to be to
> force compression (so the maximum extent is 128K instead of 1GB)

Would that fix the problem with fallocated() files?

--
GCP
Btrfs receive hardening patches
Hello,

due to security reasons, in certain use cases it would be nice to force
btrfs receive to confine itself to the directory of the subvolume. I've
attached a patch that issues chroot before parsing the btrfs stream. Let
me know if this breaks anything; preliminary tests showed it performed
as expected. If necessary I can make this functionality optional via a
command-line flag.

--
Lauri Võsandi
tel: +372 53329412
e-mail: lauri.vosa...@gmail.com
blog: http://lauri.vosandi.com/

diff --git a/cmds-receive.c b/cmds-receive.c
index 44ef27e..e03acdd 100644
--- a/cmds-receive.c
+++ b/cmds-receive.c
@@ -867,15 +867,20 @@ static int do_receive(struct btrfs_receive *r, const char *tomnt, int r_fd,
 		goto out;
 	}
 
-	/*
-	 * find_mount_root returns a root_path that is a subpath of
-	 * dest_dir_full_path. Now get the other part of root_path,
-	 * which is the destination dir relative to root_path.
+
+	/**
+	 * Nasty hack to enforce chroot before parsing btrfs stream
 	 */
-	r->dest_dir_path = dest_dir_full_path + strlen(r->root_path);
-	while (r->dest_dir_path[0] == '/')
-		r->dest_dir_path++;
+	if (chroot(dest_dir_full_path)) {
+		fprintf(stderr,
+			"ERROR: failed to chroot to %s\n",
+			dest_dir_full_path);
+		ret = -EINVAL;
+		goto out;
+	}
+	r->root_path = r->dest_dir_path = strdup("/");
+
 	ret = subvol_uuid_search_init(r->mnt_fd, &r->sus);
 	if (ret < 0)
 		goto out;
Re: Big disk space usage difference, even after defrag, on identical data
Zygo Blaxell posted on Mon, 13 Apr 2015 00:04:36 -0400 as excerpted:

> A database ends up maxing out at about a factor of two space usage
> because it tends to write short uniform-sized bursts of pages randomly,
> so we get a pattern a bit like bricks in a wall:
>
> 0 MB  AA BB CC DD EE FF GG HH II JJ KK  1 MB    half the extents
> 0 MB   LL MM NN OO PP QQ RR SS TT UU V  1 MB    the other half
>
> 0 MB  ALLBMMCNNDOOEPPFQQGRRHSSITTJUUKV  1 MB    what the file looks like
>
> Fixing this is non-trivial (it may require an incompatible disk format
> change). Until this is fixed, the most space-efficient approach seems
> to be to force compression (so the maximum extent is 128K instead of
> 1GB) and never defragment database files ever.

... Or set the database file nocow at creation, and don't snapshot it,
so overwrites are always in-place.

(Btrfs compression and checksumming get turned off with nocow, but as
we've seen, compression isn't all that effective on
random-rewrite-pattern files anyway, and databases generally have their
own data integrity handling, so neither one is a huge loss, and the
in-place rewrite makes for better performance and a more predictable
steady-state.)

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman